Interviews with Data Scientists: Claudia Perlich
In this wide-ranging interview, Roberto Zicari talks to leading data scientist Claudia Perlich about what every data scientist must know about machine learning and evaluation, domain knowledge, data blending, and more.
By Roberto Zicari, ODBMS.org.
For the series Q&A with Data Scientists: Claudia Perlich
Q1. What should every data scientist know about machine learning?
I will speak primarily about predictive modeling/supervised learning because this is where my expertise is. Also – I am looking at this question from the perspective of a ‘practical’ data scientist who is looking to solve a specific problem using machine learning, not somebody who is trying to develop new machine learning algorithms – although it would be good to know this too.
In practice, correct evaluation is incredibly difficult – and I am not even talking about in-sample vs. out-of-sample or validation vs. test set. Those are the table stakes, but not what matters most in applications. Almost anybody can build hundreds if not thousands of models on a given dataset, but being sure that the one you picked is indeed most likely the best for the job is an art! The question is typically not even which algorithm (logistic regression, SVM, decision tree, deep learning), but rather the entire pipeline from sampling a training set, preprocessing, feature representation, labeling, etc. None of this has anything to do with just ‘out-of-sample’ evaluation. So here is your compass for doing it right:
Your evaluation setting has to be as close as possible to the intended USE of your model.
You basically want to come as close as you can to simulating having that model in production and track the impact as far to the bottom line as you can. That means in a perfect world you need to simulate the decision that your model is going to influence. This is often not entirely possible.
Here is an example: You want to evaluate a new model to predict the probability of a person clicking on an ad. The first problem you have is that almost surely you have neither adequate training nor evaluation data … because until you actually show the ads you have nothing to learn from. So welcome to the chicken and egg part of the world with a lot of literature on exploration vs exploitation. So already getting a decent data set to use for evaluation is hard. You can of course consider some ideas from transfer learning and build your model on some other ad campaign and hope for the best – which is fine for learning but really adds just one more question to your evaluation: which alternative dataset is best suited? And of course you still have no data for evaluation.
But let’s for the moment assume that you have a somewhat right dataset. Now you can of course calculate all kinds of things. But again, you only added to the many questions – what should you look at: Likelihood, AUC, Lift (at what percentage), Cost per click? And while there are some statistical arguments for one over the other, there is no right answer.
What matters is what you are going to do with the model: Are you using it to select the creative in 100% of all cases? Are you using it to select only the top n percent of most likely opportunities? Do you want to change the bid price in an online auction based on this prediction? Or do you want to understand what makes people click on ads in general? All of those questions can be answered by more or less the same predictive task – predict whether somebody will click. But you need to look at different metrics in each case (in fact there is some correspondence between the above 4 metrics and the 4 questions here) and I would bet that you should select very different models for each of these uses.
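To make the difference between these metrics concrete, here is a minimal sketch in plain Python. The labels and scores are made-up toy values, and the function names (`log_loss`, `auc`, `lift_at`) are illustrative choices, not anything taken from the interview. Cost per click is left out because it depends on pricing data that a toy example cannot reasonably invent.

```python
import math

def log_loss(labels, scores, eps=1e-12):
    """Average negative log-likelihood of the observed clicks."""
    total = 0.0
    for y, p in zip(labels, scores):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def auc(labels, scores):
    """Chance that a random clicker is scored above a random non-clicker.
    Ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def lift_at(labels, scores, fraction):
    """Click rate in the top `fraction` by score, divided by the overall rate."""
    k = max(1, int(round(fraction * len(labels))))
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    top_rate = sum(y for _, y in ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate

# Toy data (hypothetical): 1 = clicked, 0 = did not click.
labels = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.1, 0.7, 0.2, 0.3, 0.1, 0.05, 0.4, 0.2]

print("log loss:", log_loss(labels, scores))
print("AUC:", auc(labels, scores))
print("lift at top 20%:", lift_at(labels, scores, 0.2))
```

AUC looks at the whole ranking, lift looks only at the top slice you would actually act on, and log loss cares about whether the probabilities themselves are well calibrated. Two models can easily swap places depending on which of the three you read.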
Finally – have a baseline! One thing is to know when you are doing better or worse. But there is still the question – is it even worth it or is there a simple solution that gets you close? Having a simple solution to compare to is a fundamental component of good evaluation. At IBM we always used ‘Willie Sutton’. He was a bank robber and when asked why he did it, the answer was because that’s “where the money was”. Any sales model we built was always compared to Willie Sutton – just rank companies by revenue. How much better does your fancy model get than that?
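A baseline comparison of this kind can be sketched in a few lines. Everything below is hypothetical: the accounts, revenues, model scores and the `hits_in_top_k` helper are invented for illustration and are not the IBM setup. The idea is only to rank the same accounts twice, once by revenue and once by model score, and count how many actual buyers land in the top k.

```python
def hits_in_top_k(accounts, key, k):
    """Number of actual buyers among the k accounts ranked highest by `key`."""
    ranked = sorted(accounts, key=lambda a: a[key], reverse=True)
    return sum(a["bought"] for a in ranked[:k])

# Hypothetical accounts: revenue in arbitrary units, model_score from some model.
accounts = [
    {"name": "A", "revenue": 900, "model_score": 0.60, "bought": 1},
    {"name": "B", "revenue": 800, "model_score": 0.20, "bought": 0},
    {"name": "C", "revenue": 500, "model_score": 0.70, "bought": 1},
    {"name": "D", "revenue": 300, "model_score": 0.10, "bought": 0},
    {"name": "E", "revenue": 100, "model_score": 0.65, "bought": 1},
    {"name": "F", "revenue": 50, "model_score": 0.05, "bought": 0},
]

k = 3
baseline_hits = hits_in_top_k(accounts, "revenue", k)    # 'Willie Sutton'
model_hits = hits_in_top_k(accounts, "model_score", k)   # the fancy model

print(f"baseline: {baseline_hits}/{k} buyers, model: {model_hits}/{k} buyers")
```

The number to report is the gap between the two, not the model's score in isolation.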
Q2. Is domain knowledge necessary for a data scientist?
Welcome to a long-standing debate. I was drafted on short notice to a panel back in 2013 at KDD called ‘The evolution of the expert’ on exactly this topic.
There are many different ways I have tried to answer in the past:
“If you are smart enough to be a good data scientist, you can for most cases probably learn whatever domain knowledge you need in a month or two.”
“Kaggle competitions have shown over and over again that good machine learning beats experts.”
“When hiring data scientists I am more interested in somebody having worked in many industries than having experience in mine.”
Or I just let my personal credentials speak for themselves: I have won 5 data mining competitions without being an expert on breast cancer, the yeast genome, CRM in telecom, Netflix movie reviews, or hospital management.
All this might suggest that the answer is no. I in fact would say NO in the usual interpretation of domain knowledge. But here is where things change dramatically:
I do not need to know much about the domain in general, BUT I need to understand EVERYTHING about how the data got created and what it means. Is this domain knowledge? Not really – if you talk to a garden variety oncologist, he or she will be near useless at explaining the details of the fMRI data set you just got. The person you need to talk to is probably the technician who understands the machine and all the data processing that is happening in there including stuff like calibration.
Q3. What is your experience with data blending? (*)
I have to admit that I had never heard of the concept of ‘data blending’ before reading the reference. After some digging, I am contemplating whether it is simply a sales pitch for some ‘new’ capability of a big data solution or a somewhat general attempt to cover a broad class of feature construction and ‘annotation’ that are based on some form of a ‘fuzzy join’ where you do not have the luxury of a shared key. Giving this the benefit of the doubt, I will go with the second. There are a few ways to look at the need for adding (fuzzy or not) data:
1) On the most abstract level it is a form of feature construction. In my experience, features often trump algorithm – so I am a huge fan of feature construction. And if you are doing the predictive modeling right – the model will tell you if your blending worked or not. So you really have little to lose and you can try all kinds of blending even of the same information. This tends to be the most time consuming (and also in my case fun) part of modeling, so having some tools that simplify this and in particular allow for fuzzy matches and automated aggregation would be neat …
2) Let me put on my philosopher’s hat: All you do with blending is navigate the bias-variance tradeoff in the context of the limitations of the expressiveness of your current model. Most often the need to blend arises around identifiers of events/entities. Say you have a field that is ZIP code (or Name). You might want to blend some actual features of ZIP codes at a certain time – so you are really just dealing with some identity of a combination of time and space. You can add in some census data based on ZIP and date and hope that this improves your model. But in some information theoretical sense, you in fact did not add any data. ZIP and date implicitly contain all you need to know (think of it as a factor). In a world of infinite data you do not need to bring in that other stuff because a universal function approximator can learn all you can bring in directly from the ZIP and date combination.
This of course only works in theory. In practice it matters how often that ZIP and date appear in your training set and whether your model can deal naturally with interaction effects, which for instance linear models cannot unless you add them. In order to learn anything from it, it has to appear multiple times. If it does not – blending in information de facto replaces the super high-dimensional identifier space (ZIP and date combination) with a much lower common space of say n features (average income, etc). So in terms of bias variance, you just managed a huge variance reduction but you may also have lost all the relevant information (huge increase of bias): say some hidden feature like the occurrence of a natural catastrophe that was not available in the blend but that ZIP and date as a combination was a good proxy for …
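The ZIP and date example can be sketched as a simple keyed lookup. All records, the census values and the `blend` helper below are hypothetical, made up purely to show the mechanics: count how often each identifier appears in the training data, then replace the identifier with a handful of blended features.

```python
from collections import Counter

# Hypothetical training events keyed by (ZIP, month).
events = [
    {"zip": "10001", "date": "2016-01", "label": 1},
    {"zip": "10001", "date": "2016-01", "label": 0},
    {"zip": "94105", "date": "2016-02", "label": 1},
]

# Hypothetical external table with the same key (values are invented).
census = {
    ("10001", "2016-01"): {"avg_income": 72000.0, "population": 21000},
    ("94105", "2016-02"): {"avg_income": 98000.0, "population": 6000},
}

# How often does each identifier appear? A key seen only once cannot be
# learned from directly, which is where blending starts to pay off.
key_counts = Counter((e["zip"], e["date"]) for e in events)
rare_keys = [key for key, count in key_counts.items() if count < 2]

def blend(event, lookup):
    """Attach the low-dimensional features for this event's (ZIP, date) key.
    Events with no match in `lookup` are returned unchanged."""
    features = lookup.get((event["zip"], event["date"]), {})
    return {**event, **features}

blended = [blend(e, census) for e in events]
print(rare_keys)
print(blended[2])
```

Note what the lookup cannot do: anything specific to a ZIP and date that is not a column in `census`, such as the natural catastrophe mentioned above, is gone once the model sees only the blended features.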
In terms of related experience, I in fact spent a good 3 years (my dissertation) on something very closely related. It was not so much on the ‘fuzzy’ part but rather on the practical question of how to automatically create features in multi-relational databases. I did assume that the link structure (keys) was known to join between tables. We published this work and some conceptual thoughts around the role of identifiers in the Machine Learning Journal: Distribution-based aggregation for relational learning with identifier attributes
And then I spent a good 3 years at IBM wrangling the data annotation problem for company names. We had to build a propensity model for IBM sales accounts. While all kinds of internal info was available for an account, we had no external set of features. Each account was linked somehow to a real company. However, that match was fuzzy at best. What we needed for a model was some information about industry, size, revenue, etc. So in this case, each ‘identifier’ is unique in my dataset and that nice theory gets me nothing. The match between accounts and Dun & Bradstreet entities was something of an n-to-m string matching. For a while we used their matching solution and eventually replaced it with our own (took us a good 2 years).
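To give a feel for what a fuzzy join on company names involves, here is a deliberately naive sketch using Python's standard `difflib` module. It is not the IBM or Dun & Bradstreet method, which the interview does not describe; the company names, the suffix list, the 0.8 cutoff and the `normalize` and `best_match` helpers are all assumptions made for illustration. It also returns at most one match per account, so it sidesteps the n-to-m aspect entirely.

```python
import difflib
import re

# Hypothetical list of legal suffixes to ignore when comparing names.
SUFFIXES = {"inc", "corp", "corporation", "co", "ltd", "llc"}

def normalize(name):
    """Lowercase, drop punctuation and strip common legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)

def best_match(account_name, company_names, cutoff=0.8):
    """Return the most similar company name, or None if nothing clears `cutoff`."""
    normalized = {normalize(c): c for c in company_names}
    hits = difflib.get_close_matches(
        normalize(account_name), list(normalized), n=1, cutoff=cutoff
    )
    return normalized[hits[0]] if hits else None

# Made-up reference entities and internal account names.
companies = ["Acme Corporation", "Globex Inc.", "Initech LLC"]

for account in ["ACME Corp", "Globx, Inc", "Umbrella Holdings"]:
    print(account, "->", best_match(account, companies))
```

Even on this toy scale the choices that matter are visible: which tokens to throw away, where to set the cutoff, and what to do with accounts that match nothing.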
In the end we were wrong about the match for about 15% of the accounts. The project nevertheless won a good number of awards internally and externally (Introduction). We also published a lot of the methodology on the modeling side, but of course, the hard matching part was not scientific enough …