Interview: Pedro Domingos, Winner of KDD 2014 Data Mining/Data Science Innovation Award
Tags: Cost Sensitive, Machine Learning, Pedro Domingos, SIGKDD, Social Network Analysis, Stream Mining
We discuss the differences between Machine Learning, Data Mining, and Statistics; the importance of costsensitive classification; social network mining and identifying influencers; mining data streams, and more.
best paper award for KDD98 in New York City.
When Pedro again won the best paper award next year at KDD99, I halfjokingly asked him to give others a chance.
However, Pedro has won numerous awards since then for research in areas spanning Machine Learning, Data Mining, Artificial Intelligence, Statistics, and Databases, and his outstanding contributions were recognized by ACM SIGKDD 2014 Innovation Award, the highest awards in the field  a Data Mining/Data Science "Nobel Prize".
Many of his award winning research ideas are implemented in software, which Prof. Domingos has made freely available. This includes
To learn more about his research, here are some of his most cited papers via Google Scholar and Citeseerx.
This is the first part of the interview. Here is the second part of the interview: the Master Algorithm, new type of Deep Learning, great advice for young researchers
Gregory Piatetsky: Q1. Congratulations on the KDD 2014 Innovation Award! The award recognizes your work in data mining related areas, but you actually have more publications in Machine Learning and AI journals and conferences. How do you see the relationship between the fields of Machine Learning, Data Mining/Data Science, and Statistics? Is there a significant difference in goals, methodologies, tools ?
Pedro Domingos: My work fits into all three, and I gave up long ago trying to parse the differences between them. But from a practical point of view these are different  if overlapping  communities, and it's good to have a map of the territory. Machine learning, because of its origins in AI, is more concerned with making the computer as autonomous as possible.
There's a lot of emphasis on scaling up, of course, but also on dealing with real data in all its messiness. Statistics provides a lot of the underpinnings of this whole enterprise. As a field, however, it's also more conservative than computer science. So it's good to have both a common core and different emphases in these fields.
GP: Q2. One of your contributions cited in the award is MetaCost algorithm for costsensitive classification, (for which you received a KDD99 bestpaper award. ) Can you summarize the key ideas? What is the current state of the art 15 years later?
PD: The problem MetaCost addresses is that most classifiers assume all errors are equally costly, but in reality this is seldom the case. Not deleting a spam email will cost a fraction of a second of your attention, but deleting an email from your boss could cost you your job. Now you could go and redesign every classifier from scratch to be costsensitive, but that would take a long time. Instead,
What it does is relabel examples so that the frontier between the classes becomes the optimal one according to the cost measure. For example, perhaps an email is labeled as spam in the training set, but it looks enough like a legitimate one that MetaCost relabels it as such, just to be safe. Then it just runs the costfree classifier to learn the frontiers according to the relabeled examples.
There was a tremendous followup to MetaCost, and today costsensitive classification is a large field, with many techniques that are much more sophisticated than MetaCost. The bottom line is, you want to use either a natively costsensitive learner or an algorithm like MetaCost, or your system will be making a lot of costly mistakes.
GP: Q3. You were also a pioneer in social network mining, (see paper: Mining the Network Value of Customers, KDD 2001) and defined the influence maximization problem and proposed the first algorithms for it. Tell us about key ideas in that work. What do you think of current social influence rankers, like Klout, PeerIndex, and Kred?
PD: The motivation for the work was that companies spend a lot of effort trying to predict the value of a customer, because that determines how much they're willing to spend to acquire and keep him. But they ignore (or used to) what is often the single biggest component of that value: the customer's influence on other customers. If I switch to a different cell phone service provider, that makes my friends and family more likely to switch as well, particularly if I'm a persuasive guy and keep telling them how much better my new carrier is.
So the question I asked myself was: can we model the influence among customers, mine these models from data, and then use the models to choose the optimal customers to market to, in the sense that they'll generate the highest expected return for least marketing expenditure? We measured what we called the network value of a customer, which is the expected number of other customers that that customer will directly or indirectly convert, and found that it was often huge. For example, in the "web of trust" of the Epinions review site, which we mined, the network value of the top user was over 20,000. That was pretty exciting! And today's social networks take it to a whole other level.
I think the social influence ranking industry has a great future  the power of traditional mass marketing and direct marketing is waning, and increasingly social influence is what determines the fate of your product. Having said that, the industry is still in its infancy  a lot of the modeling is still quite ad hoc, as far as I can tell, and misses a lot of important effects. But the better it gets, the more widely it will be used, so watch this space.
GP: Q4. Another of your key contributions is in mining data streams, with the most recent result being VFML toolkit  one of the best opensource resources for stream mining. Tell us about key ideas in VFML and examples of applications.
PD: The goal in VFML is take any traditional batch learning algorithm  which has random access to all the data, and typically runs in time that's quadratic or worse in the size of the data  and easily turn it into a stream mining algorithm  which has fixed memory, can only see each sample once, and needs to run in time that's at worst linear in the size of the data, but ideally independent of it. And we want statistical guarantees that the model learned by the stream miner is essentially equivalent to the batch one.
This may seem like a tall order at first, but is in fact quite feasible, for a simple reason.
If you want to predict who will will the next presidential election, you don't need to ask every single voter who they'll vote for; a sample of a few thousand suffices, if you're willing to accept a little bit of uncertainty.
Once I do, I can move on and use the remainder of the stream to pick the tests for that node's children, etc. And I can aggregate all the uncertainties in the individual decisions into the overall uncertainty that I've learned the same model I would if I had used infinite time and data, and use just as much data as needed to keep this uncertainty within limits set by the user.
GP: Q5. You also made key contributions to statistical multirelational learning, and proposed Markov logic networks as a means to unify firstorder logic and probabilistic graphical models. Can you summarize the key ideas and tell us about your opensource system Alchemy: Algorithms for statistical relational AI
PD: Most learning algorithms only look at a single relation  a single table of data  and this is very limiting. Real databases have many relations, and we want to mine straight from them. This requires a pretty substantial rethink of how to do statistical learning, because samples are no longer independent and identically distributed, due to joins, etc. The solution in Markov logic networks is to use firstorder logic to express patterns among multiple relations  which it can easily do  but treat logical formulas as features in a Markov network or loglinear model, which sets up a probability distribution over relational databases that we can then use to answer questions.
Alchemy is an opensource implementation of the learning and inference algorithms we've developed for Markov logic. The goal is to provide for machine learning and data mining something akin to what the relational model and SQL provide for databases.
GP: Q6. You won 2 best papers awards in a row at KDD98 and KDD99 conferences, but have very few publications in KDD or data mining conferences/journals after 2004. What caused the shift? (I was KDD98 general chair and KDD99 Best Paper awards chair, so I actually presented you with the awards, and probably said at KDD99 something like "Pedro, give others a chance", but I was only joking!)
PD: I remember that! Made me feel quite guilty. Choosing where to submit my papers is always tough. I think everything I've published in the last ten years could have creditably appeared in KDD or another data mining venue. The reason I've published mostly in machine learning and AI venues is that Markov logic networks draw heavily on ideas from those fields, and so the audience in those communities is more primed to like and follow the work. Also, because of the difficulty of the issues involved, we had to at first focus on data sets that are fairly small by KDD standards. But we're now rapidly scaling to larger and messier data, so expect to see more KDD papers from my group in the future.
BIO: Pedro Domingos is Professor of Computer Science and Engineering at the University of Washington. His research interests are in machine learning, artificial intelligence and data mining. He received a PhD in Information and Computer Science from the University of California at Irvine, and is the author or coauthor of over 200 technical publications.
He is a member of the editorial board of the Machine Learning journal, cofounder of the International Machine Learning Society, and past associate editor of JAIR. He was program cochair of KDD2003 and SRL2009, and has served on numerous program committees. He is a winner of the SIGKDD Innovation Award, the highest honor in the data mining field. He is a AAAI Fellow, and received a Sloan Fellowship, an NSF CAREER Award, a Fulbright Scholarship, an IBM Faculty Award, and best paper awards at several leading conferences.
Related:
I first met Pedro at KDD95 in Montreal, the first international conference on
Knowledge Discovery and Data Mining. His modest manner did not conceal his brilliance,
and he impressed me with his rare talent of being able to explain difficult technical issues quite clearly.
I quite enjoyed a dinner conversation we had then in a small Montreal restaurant,
but enjoyed our interaction even more when,
in my capacity of KDD Awards chair, I presented him with the
However, Pedro has won numerous awards since then for research in areas spanning Machine Learning, Data Mining, Artificial Intelligence, Statistics, and Databases, and his outstanding contributions were recognized by ACM SIGKDD 2014 Innovation Award, the highest awards in the field  a Data Mining/Data Science "Nobel Prize".
Many of his award winning research ideas are implemented in software, which Prof. Domingos has made freely available. This includes
 Alchemy: Algorithms for statistical relational AI alchemy.cs.washington.edu
 VFML: A toolkit for mining massive data sources www.cs.washington.edu/dm/vfml/
 NBE: A Bayesian learner with very fast inference www.cs.washington.edu/ai/nbe
 BVD: A biasvariance decomposition for zeroone loss www.cs.washington.edu/homes/pedrod/bvd.c
 RISE: A unified rule and instancebased learner www.cs.washington.edu/homes/pedrod/rise.c
To learn more about his research, here are some of his most cited papers via Google Scholar and Citeseerx.
This is the first part of the interview. Here is the second part of the interview: the Master Algorithm, new type of Deep Learning, great advice for young researchers
Gregory Piatetsky: Q1. Congratulations on the KDD 2014 Innovation Award! The award recognizes your work in data mining related areas, but you actually have more publications in Machine Learning and AI journals and conferences. How do you see the relationship between the fields of Machine Learning, Data Mining/Data Science, and Statistics? Is there a significant difference in goals, methodologies, tools ?
Pedro Domingos: My work fits into all three, and I gave up long ago trying to parse the differences between them. But from a practical point of view these are different  if overlapping  communities, and it's good to have a map of the territory. Machine learning, because of its origins in AI, is more concerned with making the computer as autonomous as possible.
The great thing for me about the data mining/data science community is its diversity  there you'll find machine learners, statisticians, database, algorithms, and systems folks, all working for a common goal.
There's a lot of emphasis on scaling up, of course, but also on dealing with real data in all its messiness. Statistics provides a lot of the underpinnings of this whole enterprise. As a field, however, it's also more conservative than computer science. So it's good to have both a common core and different emphases in these fields.
GP: Q2. One of your contributions cited in the award is MetaCost algorithm for costsensitive classification, (for which you received a KDD99 bestpaper award. ) Can you summarize the key ideas? What is the current state of the art 15 years later?
PD: The problem MetaCost addresses is that most classifiers assume all errors are equally costly, but in reality this is seldom the case. Not deleting a spam email will cost a fraction of a second of your attention, but deleting an email from your boss could cost you your job. Now you could go and redesign every classifier from scratch to be costsensitive, but that would take a long time. Instead,
MetaCost is something you can wrap around a standard classifier to make it costsensitive.
What it does is relabel examples so that the frontier between the classes becomes the optimal one according to the cost measure. For example, perhaps an email is labeled as spam in the training set, but it looks enough like a legitimate one that MetaCost relabels it as such, just to be safe. Then it just runs the costfree classifier to learn the frontiers according to the relabeled examples.
There was a tremendous followup to MetaCost, and today costsensitive classification is a large field, with many techniques that are much more sophisticated than MetaCost. The bottom line is, you want to use either a natively costsensitive learner or an algorithm like MetaCost, or your system will be making a lot of costly mistakes.
GP: Q3. You were also a pioneer in social network mining, (see paper: Mining the Network Value of Customers, KDD 2001) and defined the influence maximization problem and proposed the first algorithms for it. Tell us about key ideas in that work. What do you think of current social influence rankers, like Klout, PeerIndex, and Kred?
PD: The motivation for the work was that companies spend a lot of effort trying to predict the value of a customer, because that determines how much they're willing to spend to acquire and keep him. But they ignore (or used to) what is often the single biggest component of that value: the customer's influence on other customers. If I switch to a different cell phone service provider, that makes my friends and family more likely to switch as well, particularly if I'm a persuasive guy and keep telling them how much better my new carrier is.
So the question I asked myself was: can we model the influence among customers, mine these models from data, and then use the models to choose the optimal customers to market to, in the sense that they'll generate the highest expected return for least marketing expenditure? We measured what we called the network value of a customer, which is the expected number of other customers that that customer will directly or indirectly convert, and found that it was often huge. For example, in the "web of trust" of the Epinions review site, which we mined, the network value of the top user was over 20,000. That was pretty exciting! And today's social networks take it to a whole other level.
I think the social influence ranking industry has a great future  the power of traditional mass marketing and direct marketing is waning, and increasingly social influence is what determines the fate of your product. Having said that, the industry is still in its infancy  a lot of the modeling is still quite ad hoc, as far as I can tell, and misses a lot of important effects. But the better it gets, the more widely it will be used, so watch this space.
GP: Q4. Another of your key contributions is in mining data streams, with the most recent result being VFML toolkit  one of the best opensource resources for stream mining. Tell us about key ideas in VFML and examples of applications.
PD: The goal in VFML is take any traditional batch learning algorithm  which has random access to all the data, and typically runs in time that's quadratic or worse in the size of the data  and easily turn it into a stream mining algorithm  which has fixed memory, can only see each sample once, and needs to run in time that's at worst linear in the size of the data, but ideally independent of it. And we want statistical guarantees that the model learned by the stream miner is essentially equivalent to the batch one.
This may seem like a tall order at first, but is in fact quite feasible, for a simple reason.
If you want to predict who will will the next presidential election, you don't need to ask every single voter who they'll vote for; a sample of a few thousand suffices, if you're willing to accept a little bit of uncertainty.
VFML generalizes this idea to models with many parameters and many structural decisions: Which is the best attribute to test at this node in a decision tree? How many examples do I need to be pretty sure I picked the right one?
Once I do, I can move on and use the remainder of the stream to pick the tests for that node's children, etc. And I can aggregate all the uncertainties in the individual decisions into the overall uncertainty that I've learned the same model I would if I had used infinite time and data, and use just as much data as needed to keep this uncertainty within limits set by the user.
GP: Q5. You also made key contributions to statistical multirelational learning, and proposed Markov logic networks as a means to unify firstorder logic and probabilistic graphical models. Can you summarize the key ideas and tell us about your opensource system Alchemy: Algorithms for statistical relational AI
PD: Most learning algorithms only look at a single relation  a single table of data  and this is very limiting. Real databases have many relations, and we want to mine straight from them. This requires a pretty substantial rethink of how to do statistical learning, because samples are no longer independent and identically distributed, due to joins, etc. The solution in Markov logic networks is to use firstorder logic to express patterns among multiple relations  which it can easily do  but treat logical formulas as features in a Markov network or loglinear model, which sets up a probability distribution over relational databases that we can then use to answer questions.
Alchemy is an opensource implementation of the learning and inference algorithms we've developed for Markov logic. The goal is to provide for machine learning and data mining something akin to what the relational model and SQL provide for databases.
GP: Q6. You won 2 best papers awards in a row at KDD98 and KDD99 conferences, but have very few publications in KDD or data mining conferences/journals after 2004. What caused the shift? (I was KDD98 general chair and KDD99 Best Paper awards chair, so I actually presented you with the awards, and probably said at KDD99 something like "Pedro, give others a chance", but I was only joking!)
PD: I remember that! Made me feel quite guilty. Choosing where to submit my papers is always tough. I think everything I've published in the last ten years could have creditably appeared in KDD or another data mining venue. The reason I've published mostly in machine learning and AI venues is that Markov logic networks draw heavily on ideas from those fields, and so the audience in those communities is more primed to like and follow the work. Also, because of the difficulty of the issues involved, we had to at first focus on data sets that are fairly small by KDD standards. But we're now rapidly scaling to larger and messier data, so expect to see more KDD papers from my group in the future.
BIO: Pedro Domingos is Professor of Computer Science and Engineering at the University of Washington. His research interests are in machine learning, artificial intelligence and data mining. He received a PhD in Information and Computer Science from the University of California at Irvine, and is the author or coauthor of over 200 technical publications.
He is a member of the editorial board of the Machine Learning journal, cofounder of the International Machine Learning Society, and past associate editor of JAIR. He was program cochair of KDD2003 and SRL2009, and has served on numerous program committees. He is a winner of the SIGKDD Innovation Award, the highest honor in the data mining field. He is a AAAI Fellow, and received a Sloan Fellowship, an NSF CAREER Award, a Fulbright Scholarship, an IBM Faculty Award, and best paper awards at several leading conferences.
Related:
 Interview: Pedro Domingos: the Master Algorithm, new type of Deep Learning, great advice for young researchers  part 2
 Data Mining/Data Science "Nobel Prize": ACM SIGKDD 2014 Innovation Award to Pedro Domingos
 Pedro Domingos: A few useful things to know about Machine Learning
 KDD99 bestpaper award
 KDD98 best paper awards
Top Stories Past 30 Days

