Interviews with Data Scientists: Claudia Perlich

In this wide-ranging interview, Roberto Zicari talks to leading data scientist Claudia Perlich about what data scientists must know about Machine Learning and evaluation, domain knowledge, data blending, and more.

Q4. Predictive Modeling: How can you perform accurate feature engineering/extraction?

I will take a shot at this from my experience winning KDD Cups back in the day (2007–09, and the publications I have on this). But before I go into the tricks of the trade, beware that what one does for competitions is not necessarily the right thing for building a model ‘in the real world’.

First off, in principle, if you have a universal function approximator (neural networks, decision trees; also read about VC dimension if you want to know more) – a model that can express anything – and infinite data, you do not really need to worry about feature engineering. This is basically what we see with the advances of deep learning. Decades of research on feature construction from images have become entirely obsolete….

But in reality you usually do not have the luxury of ‘infinite data’. And all of a sudden having a super powerful model class that can express anything is no longer the obvious best choice (see my answer to my preferred algorithm). The reason is the bias-variance trade-off. With great power comes great responsibility … and, in the world of predictive modeling, overfitting. For instance, you may find yourself in a position where a linear model might be the best choice. In fact, I won most of my competitions with smart feature construction for linear models.

The answer on how to construct features depends VERY MUCH on which type of model you want to use, and the strategies for trees are very different from those for linear models. Essentially you are trying to make it easy for a model type to find a relationship that is there. Linear models are great at taking differences, trees are terrible at it. Say you want a model to predict whether a company is profitable – this is simply a question of whether revenue is greater than cost. If you have both features, this is really easy for a linear model and really hard for a tree to learn. So you can help the tree by adding differences of pairs of numeric features if you suspect they could matter. Linear models on the other hand have a really hard time with nonlinear relationships (I know I am stating the obvious). Say in health you know that both being too heavy and being too light are a problem – it is therefore a good idea to include the square of weight. What about interaction effects (the infamous XOR problem)? Same story: you need to include pairwise products in the model to make them easy for a linear model to find.
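The three constructions mentioned above (pairwise differences, squares, pairwise products) can be sketched in a few lines. This is only an illustration under assumptions: the helper name expand_features, the feature names, and the numbers are hypothetical and not taken from the competitions discussed here.

```python
from itertools import combinations


def expand_features(row, differences=True, squares=True, products=True):
    """Return a copy of `row` (feature name -> number) with constructed features added."""
    out = dict(row)
    names = sorted(row)
    if squares:
        # Squares let a linear model express "too low and too high are both bad".
        for a in names:
            out[f"{a}^2"] = row[a] ** 2
    for a, b in combinations(names, 2):
        if differences:
            # Differences make comparisons such as cost vs. revenue a single split for a tree.
            out[f"{a}-{b}"] = row[a] - row[b]
        if products:
            # Products expose pairwise interaction effects to a linear model.
            out[f"{a}*{b}"] = row[a] * row[b]
    return out


# Hypothetical company record; the values are made up.
company = {"revenue": 120.0, "cost": 95.0}
expanded = expand_features(company)


def xor_from_linear(x1, x2):
    """With the product feature available, XOR is a linear function of the features."""
    f = expand_features({"x1": x1, "x2": x2}, differences=False, squares=False)
    return f["x1"] + f["x2"] - 2 * f["x1*x2"]
```

With the difference feature in place, a tree needs only one split (is cost-revenue below zero?) to answer the profitability question, and with the product feature a linear model can represent XOR exactly on binary inputs.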

That also means that there is a good amount of domain knowledge that comes to bear when deciding what might matter and next thinking about whether your model could easily take advantage of this if given the information. In general – having a good hypothesis of what might influence the outcome is a great place to start thinking about features.

In competitions, something else comes into play – can you find weaknesses in the data to exploit? Almost all datasets carry traces of their creation, and often by exploiting those you can get better performance. This is known as leakage. In reality you do not want to exploit leakage because it does not really improve your generalization performance in the real world, but in competitions all is fair game. But before you get all excited, these days Kaggle is trying very hard to remove all such leaked information.

So now to a few examples from winning competitions based on feature construction in combination with mostly linear models (logistic regression, linear SVM, etc.). We usually tried all other model classes as well but often ended up with linear (and good feature construction) having the best performance.

In a task to predict breast cancer based on 117 features from fMRI data our team was apparently the only one to observe that the patient ID was super predictive. So we added the patient ID range to the model and got a 10% performance increase. Finding accidental information in supposedly random identifiers is a common problem and always suggests that something is wrong in the dataset construction. BTW: we also tried to construct many other features that did not add any value.
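One simple way to check whether a supposedly random identifier carries information is to sort the records by ID, cut them into ranges, and compare the positive rate per range. The sketch below is an assumption about how such a check could look, not the team's actual procedure; the function name and the toy data are made up.

```python
def positive_rate_by_id_range(ids, labels, n_bins=4):
    """Sort records by ID, split them into n_bins equal-count ranges,
    and return the fraction of positive labels in each range.
    Assumes there are at least n_bins records."""
    pairs = sorted(zip(ids, labels))
    size = len(pairs) // n_bins
    rates = []
    for b in range(n_bins):
        start = b * size
        # The last range absorbs any remainder.
        end = len(pairs) if b == n_bins - 1 else start + size
        chunk = pairs[start:end]
        rates.append(sum(y for _, y in chunk) / len(chunk))
    return rates


# Made-up toy data: low IDs are mostly positive, which a random ID should not show.
ids = list(range(1, 21))
labels = [1, 1, 1, 0, 1] + [0] * 15
rates = positive_rate_by_id_range(ids, labels)
```

If the rates differ strongly between ranges, as in this toy case, the identifier is leaking something about how the dataset was assembled.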

On a telecom task with 50K anonymous features, two members of our team independently tried to do a feature ranking, and while we agreed somewhat, there were a number of features that had high mutual information and low AUC as individual contributors. This essentially means that we have a highly non-monotonic relationship. It turned out that in some features somebody had replaced missing values with some method and you could see spikes in the histogram. Those values with spikes were highly predictive but the linear model could obviously not learn from them – so we used decision trees on such single features as a means of discretizing the numeric feature and fed the result into the linear models.
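The gap between mutual information and AUC can be reproduced on a toy feature. The following is a minimal sketch with made-up data; the two helper functions are plain textbook definitions and are not the ranking code used in the competition.

```python
import math
from collections import Counter


def auc(scores, labels):
    """AUC of a single feature used directly as a score (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mutual_information(values, labels):
    """Mutual information in bits between a discrete feature and the label."""
    n = len(values)
    joint = Counter(zip(values, labels))
    px = Counter(values)
    py = Counter(labels)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


# Made-up feature: only the middle value (the "spike") is associated with positives.
values = [1] * 10 + [2] * 10 + [3] * 10
labels = [0] * 10 + [1] * 10 + [0] * 10

feature_auc = auc(values, labels)
feature_mi = mutual_information(values, labels)
```

Here the feature determines the label completely, so the mutual information is high, yet the AUC sits at chance level because the relationship is not monotonic. Turning each value (or each bin found by a single-feature tree) into its own indicator makes the signal usable by a linear model.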

Common generic tricks to help linear models: discretize numeric variables, cap extreme values of numeric variables and add an indicator for the capped cases, include interaction effects, include logs/squares, and replace missing values by zero while including a missing-value indicator as well as its interaction effects.
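A few of these tricks (capping with an indicator, log and square terms, zero-filling with a missing indicator) are sketched below for a single numeric value. The function name, the cap, and the use of log1p (which assumes non-negative values) are illustrative assumptions.

```python
import math


def linear_friendly_features(value, cap):
    """Expand one possibly-missing numeric value into features for a linear model.

    value: a number, or None if missing. cap: upper bound for extreme values.
    """
    is_missing = value is None
    x = 0.0 if is_missing else float(value)   # replace missing by zero
    is_capped = x > cap
    x = min(x, cap)                           # cap extreme values
    return {
        "x": x,
        "x_missing": int(is_missing),         # lets the model treat "missing" separately
        "x_was_capped": int(is_capped),       # lets the model treat extremes separately
        "x_log": math.log1p(max(x, 0.0)),     # assumes non-negative values
        "x_sq": x ** 2,
    }


missing_row = linear_friendly_features(None, cap=100.0)
extreme_row = linear_friendly_features(250, cap=100.0)
```

The indicators matter: without them the model cannot tell a true zero from a filled-in zero, or a value of exactly 100 from one that was capped.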

Another broad class of techniques in feature construction deals with the very common case of a relational model, where a large part of the information is either in a network structure or in 1-n or n-m relationships with the main entity for which you are trying to make a prediction. These cases are notoriously difficult, and I spent my entire PhD trying to create methods for automated feature construction.
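The most basic way to turn a 1-n relationship into features is to aggregate the child rows per parent entity. The sketch below shows only that simple baseline, with hypothetical names and made-up data; it is not one of the automated methods referred to above.

```python
from collections import defaultdict


def aggregate_one_to_many(child_rows, key, value):
    """Summarize child rows per parent entity as count, sum, mean, and max of one field."""
    groups = defaultdict(list)
    for row in child_rows:
        groups[row[key]].append(row[value])
    return {
        parent: {
            "count": len(vals),
            "sum": sum(vals),
            "mean": sum(vals) / len(vals),
            "max": max(vals),
        }
        for parent, vals in groups.items()
    }


# Made-up example: many purchases per customer, one feature row per customer.
purchases = [
    {"customer": "a", "amount": 10.0},
    {"customer": "a", "amount": 30.0},
    {"customer": "b", "amount": 5.0},
]
features = aggregate_one_to_many(purchases, key="customer", value="amount")
```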

To conclude, smart feature construction makes much more of a difference than fancy algorithms (that was until deep learning came along, now I am no longer sure …)

Q5. Can data ingestion be automated?

In the day and age of ‘Big Data’, data ingestion has to be automated on some level – anything else is out of the question.

The more interesting question is how to best automate it, and which parts of the data preparation stages can be done during ingestion. I have a very strong opinion on wanting my data as ‘raw’ as possible. So you should for instance NOT automate how to deal with missing data. I’d much rather know that a value was missing than have it replaced by the system. Likewise I prefer the highest granularity of information to be maintained – consider for instance the full URL address of a webpage that a consumer went to vs. keeping only the hostname (less desirable, but OK) vs. keeping only some content category. From a privacy perspective there are good arguments against the former – but tools like hashing can mitigate some of these concerns.
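As a rough illustration of the hashing idea, a full URL can be replaced at ingestion time by a keyed hash, so that identical URLs still map to the same token without the raw address being stored. The function name and the secret key below are hypothetical, and this is a sketch rather than a complete privacy solution.

```python
import hashlib
import hmac


def hash_url(url, secret_key):
    """Return a keyed SHA-256 hex digest of the full URL.

    The same URL always yields the same token, so full granularity is kept
    for modeling while the raw address is not retained.
    """
    return hmac.new(secret_key.encode("utf-8"), url.encode("utf-8"), hashlib.sha256).hexdigest()


key = "example-secret"  # hypothetical key; in practice it would be kept outside the data store
token_a = hash_url("https://example.com/shoes/running?id=42", key)
token_b = hash_url("https://example.com/shoes/running?id=42", key)
token_c = hash_url("https://example.com/shoes/hiking?id=7", key)
```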

So let’s talk about the how: There are 3 really important parts of the automation process:

  1. Flexibility in sampling if the full data stream is too large: if you are dealing with 50 billion events per day – just stuffing everything into a Hadoop system is nice – but makes later manipulation tedious. Instead, it is great to have in addition a process that ‘fishes out’ events of specific interest. See some of the details in a recent blog we wrote on this.
  2. Annotation of histories on the fly: having event logs of everything is great, but for predictive modeling I usually need to have features that capture the entity’s history. Joining every time over billions of rows to create a history is impossible. So part of the ingestion process is an annotation process that appends vital historical information to each event.
  3. Having statistical tests that evaluate whether the properties of the incoming data flow are changing and send alarms if, for instance, some data sources go temporarily dark. Some of this is covered here.
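Parts 2 and 3 can be sketched in a few lines. Everything here is an assumption made for illustration: the event fields, the function names, and the choice of a simple z-score under a Poisson-style variance as the volume test. A production system would look quite different.

```python
import math
from collections import defaultdict


def annotate_with_history(events):
    """Yield each event with the entity's prior event count and previous timestamp.

    Assumes `events` is time-ordered and each event has 'entity' and 'ts' fields.
    History is kept in memory, so no join over the full log is needed.
    """
    seen_count = defaultdict(int)
    last_ts = {}
    for event in events:
        entity = event["entity"]
        yield {**event, "prior_events": seen_count[entity], "prev_ts": last_ts.get(entity)}
        seen_count[entity] += 1
        last_ts[entity] = event["ts"]


def volume_alarms(expected_counts, observed_counts, z_threshold=4.0):
    """Return sources whose observed volume is far below the expected volume.

    Uses z = (observed - expected) / sqrt(expected), a rough Poisson-style test.
    """
    alarms = []
    for source, expected in expected_counts.items():
        if expected <= 0:
            continue
        observed = observed_counts.get(source, 0)
        z = (observed - expected) / math.sqrt(expected)
        if z < -z_threshold:
            alarms.append(source)
    return sorted(alarms)


# Made-up events and volumes.
events = [
    {"entity": "u1", "ts": 1},
    {"entity": "u2", "ts": 2},
    {"entity": "u1", "ts": 5},
]
annotated = list(annotate_with_history(events))
dark = volume_alarms({"feed_a": 10000, "feed_b": 400}, {"feed_a": 9950})
```

In this toy run the third event carries the information that u1 was seen once before at time 1, and feed_b raises an alarm because it delivered nothing while feed_a is within its normal range.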

Q6. How do you ensure data quality?

The sad truth is – you cannot. Much is written about data quality and it is certainly a useful relative concept, but as an absolute goal it will remain an unachievable ideal (with the irrelevant exception of simulated data …).

First off, data quality has many dimensions.

Secondly – it is inherently relative: the exact same data can be quite good for one purpose and terrible for another.

Third, data quality is a very different concept for ‘raw’ event log data vs. aggregated and processed data.

Finally, and this is by far the hardest part: you almost never know what you don’t know about your data.

In the end, all you can do is your best! Skepticism, experience, and some sense of data intuition are the best sources of guidance you will have.

Q7. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?

First off, one should not even have to ask whether the insight is relevant – one should have designed the analysis that led to the insight based on the relevant practical problem one is trying to solve! The answer might be that there is nothing better you can do than the status quo. That is still a highly relevant insight! It means that you will NOT have to waste a lot of resources. Even counting negative answers as ‘relevant’ – if you are running into this issue of the results of data science not being relevant, you are clearly not managing data science correctly. I have commented on this here: What are the greatest inefficiencies data scientists face today?

Let’s look at ‘correct’ next. What exactly does it mean? To me it somewhat narrowly means that it is ‘true’ given the data: did you do all the due diligence and use the right methodology to derive something from the data you had? Would somebody answering the same question on the same data come to the same conclusion (replicability)? You did not overfit, you did not pick up a spurious result that is statistically not valid, etc. Of course you cannot tell this from looking at the insight itself. You need to evaluate the entire process (or trust the person who did the analysis) to make a judgment on the reliability of the insight.

Now to the ‘good’. To me good captures the leap from a ‘correct’ insight on the analyzed dataset to supporting the action ultimately desired. We do not just find insights in data for the sake of it! (well – many data scientists do, but that is a different conversation). Insights more often than not drive decisions. A good insight indeed generalizes beyond the (historical) data into the future. Lack of generalization is not just a matter of overfitting, it is also a matter of good judgment whether there is enough temporal stability in the process to hope that what I found yesterday is still correct tomorrow and maybe next week. Likewise we often have to make judgment calls when the data we really needed for the insight is simply not available. So we look at a related dataset (this is called transfer learning) and hope that it is similar enough for the generalization to carry over. There is no test for it! Just your gut and experience …

Finally, good also incorporates the notion of correlation vs. causation. Many correlations are ‘correct’ but few of them are good for the action one is able to take. The (correct) fact that a person who is sick has a temperature is ‘good’ for diagnosis, but NOT good for prevention of infection. At which point we are pretty much back to relevant! So think first about the problem and do good work next!


(*) What is data blending By Oleg Roderick, David Sanchez, Geisinger Data Science, November 2015

Claudia Perlich leads the machine learning efforts that power Dstillery’s digital intelligence for marketers and media companies. With more than 50 published scientific articles, she is a widely acclaimed expert on big data and machine learning applications, and an active speaker at data science and marketing conferences around the world.

Claudia is the past winner of the Advertising Research Foundation’s (ARF) Grand Innovation Award and has been selected for Crain’s New York’s 40 Under 40 list, Wired Magazine’s Smart List, and Fast Company’s 100 Most Creative People. Claudia holds multiple patents in machine learning. She has won many data mining competitions and awards at Knowledge Discovery and Data Mining (KDD) conferences, and served as the organization’s General Chair in 2014.

Prior to joining Dstillery in 2010, Claudia worked at IBM’s Watson Research Center, focusing on data analytics and machine learning. She holds a PhD in Information Systems from New York University (where she continues to teach at the Stern School of Business), and an MA in Computer Science from the University of Colorado.

Original. Reposted by permission.