
Data Science Data Logic


Participating in MOOCs and online competitions is a good way to learn data science, but data science is more than algorithms and accuracy scores. To become a true data scientist, understand how to formulate hypotheses, create data, sample, validate, and so on.



By Olav Laudy (Chief Data Scientist, IBM Analytics, Asia-Pacific).

These days there is an incredible amount of attention on the algorithmic side of data science. The term ‘deep learning’ is the hype of the moment, with Random Forest and Gradient Boosting coming second and third as approaches that earned many people fantastic scores on Kaggle.

One thing, however, is almost never talked about: what decisions lead to the training, testing and validation logic? How is the target variable defined? How are the features created? Of course, you hear people speak about feature creation, but that is typically about recombining or transforming the features that were initially chosen to be part of the dataset. How do you get to that initial dataset?

In this article, I’d like to discuss the many decisions that make up the logic behind an analytic dataset and show common approaches. To start off, consider the following: in marketing, one frequently builds propensity (or classification) models. The target is created as ‘customer has purchased in the last three months’, and the predictors are static customer characteristics and rollups of the behavior prior to those three months. Let’s ask some questions about such a dataset:

  • If you predict the customers who are going to buy, doesn’t that mean that they will buy anyway? Wouldn’t you want to predict the customers you need to reach out to in order to make them buy? Those who recognize it will know this as the uplift idea. An uplift model may solve this, but you can also rethink the original question and design a clever experiment whose resulting data answers that exact question, rather than going with the default target ‘who bought in the last three months’.
  • If your historic modeling campaign window is three months, and a customer buys something on the first day of those three months, doesn’t his last observed predictor data carry different information than that of someone who buys at the end of the three months?
  • Why would you choose a three month campaign window in the first place?
  • If you score a new dataset, how do you handle the fact that a customer you want to score was also part of the training dataset (yet potentially with an updated state)?
  • How do you create a model where the predictions change if a customer’s behavior changes, rather than a static model that only differentiates between customers?
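
The default dataset these questions refer to can be sketched as follows. This is a minimal illustration, assuming a hypothetical purchase log of (customer id, date) pairs and a made-up reference date; the function name, the 90-day window and the two rollup features are assumptions chosen for the example, not a prescribed method.

```python
from datetime import date, timedelta

def build_default_dataset(purchases, customers, reference_date, window_days=90):
    """Build one row per customer: target from the window, predictors from before it.

    purchases: list of (customer_id, purchase_date) tuples (hypothetical log).
    customers: iterable of customer ids.
    """
    window_start = reference_date - timedelta(days=window_days)
    rows = []
    for cid in customers:
        dates = [d for c, d in purchases if c == cid]
        # Target: did the customer purchase inside the campaign window?
        in_window = [d for d in dates if window_start <= d < reference_date]
        # Predictors: rollups of behavior strictly before the window starts.
        before = [d for d in dates if d < window_start]
        rows.append({
            "customer_id": cid,
            "target": int(bool(in_window)),
            "n_prior_purchases": len(before),
            "days_since_last_prior": (window_start - max(before)).days if before else None,
        })
    return rows

purchases = [
    ("A", date(2015, 1, 10)), ("A", date(2015, 7, 5)),   # A buys early in the window
    ("B", date(2015, 3, 5)), ("B", date(2015, 9, 28)),   # B buys late in the window
    ("C", date(2015, 2, 20)),                            # C does not buy in the window
]
dataset = build_default_dataset(purchases, ["A", "B", "C"], date(2015, 9, 30))
for row in dataset:
    print(row)
```

Note that customers A and B receive the same target even though they bought at opposite ends of the window, and that both have their predictors frozen at the start of the window. That is exactly the issue raised in the second question.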

Those all seem fairly straightforward questions, yet answers are not found in the literature. In this article, I will discuss the various approaches I have encountered in the wild. In addition, I will demonstrate the (visual) ways I use to make many of those choices open to discussion. Note that many examples revolve around purchasing, but the methods used are applicable in a wide range of domains (replace ‘purchase’ with your event of interest).

Phrasing the right question

A bank once approached me with the following question: we would like to predict the amount of cash taken out from ATMs, 30 days out, on a daily basis. Thinking about this question, the following comes to mind:

  • There will be periodic patterns, as people go to ATMs more when they get paid.
  • There will be weather patterns, as people don’t go to the ATM when it rains.
  • There will be location patterns, as rural locations will show different patterns than urban ones.
  • There will be calendar patterns, as (days prior to) celebration days have their own dynamic.
  • There will be ethnic patterns, as different ethnicities have their own celebrations and ways to spend money.

And so on. As soon as the word ‘time’ is mentioned, there is the immediate knee-jerk response: ‘time series’. So you go ahead and start building 10,000 time series, one for every ATM.

Wait! Full stop here! The question is: “predict the amount of cash taken out from ATMs, 30 days out, on a daily basis.” Why? How will you use the results? Why 30 days? The number 30 simply seems to have come to mind as a convenient number of days to look forward. Yet the bank’s business process showed weekly money provisioning (a money transport driving from ATM to ATM). So why not start by turning the 30-days-out prediction into a 7-days-out prediction? Next question: why do you need a daily prediction? It turns out that, with a daily prediction, the bank imagined it could easily work out when an ATM was going to be empty. The only problem: the person asking the question had not considered that adding predictions together also means adding together the uncertainty around those predictions, so the error band around a multi-day total is typically wider than that around any single day. Here we are: the word ‘predict’ is heavily misused by people who are not hindered by any knowledge of data science. On a side note: treating each ATM as a separate time series does not give you the benefit of sharing predictors across ATMs. Instead, I would consider a model (or dataset) that contains all ATMs (longitudinal data, with time as an additional predictor).
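
To make the point about summing predictions concrete: if each daily forecast carries an error with standard deviation σ, the error of a k-day total grows as σ√k when the daily errors are independent and as kσ when they are perfectly correlated. The sketch below assumes equal error sizes and one common correlation for every pair of days, which is a simplification, and uses a made-up daily error of 2,000 purely for illustration.

```python
import math

def sd_of_summed_forecast(daily_sd, n_days, correlation=0.0):
    """Standard deviation of the sum of n_days forecasts.

    Assumes every day has the same error sd and every pair of days has the
    same error correlation (a simplification for illustration only).
    """
    variance = daily_sd ** 2 * (n_days + n_days * (n_days - 1) * correlation)
    return math.sqrt(variance)

daily_sd = 2000.0  # hypothetical daily forecast error, in currency units
for n_days in (7, 30):
    independent = sd_of_summed_forecast(daily_sd, n_days, correlation=0.0)
    correlated = sd_of_summed_forecast(daily_sd, n_days, correlation=1.0)
    print(f"{n_days:>2} days: sd of total between {independent:,.0f} and {correlated:,.0f}")
```

Under these assumptions, shortening the horizon from 30 to 7 days shrinks the error band of the total considerably, which is one more reason not to predict further out than the business process requires.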

What would be the right way to phrase the question? There are many different options:

  • Daily prediction of total money taken out (regression)
  • Daily total cash left in the ATM (regression)
  • Number of days till empty (survival)
  • Daily probability to hit empty (classification)
  • Summed amount of predicted individual cash take-outs (aggregation model)
  • Predict routes of the money transport (optimization)
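
Several of these phrasings can be derived from the same raw data, which makes it cheap to try more than one. A minimal sketch, assuming a hypothetical ATM with a known starting balance and a made-up list of daily withdrawal totals (the function and field names are illustrative only):

```python
def derive_targets(start_balance, daily_withdrawals):
    """Derive several candidate targets from one ATM's daily withdrawal totals."""
    balance = start_balance
    cash_left = []          # regression target: cash left at the end of each day
    hit_empty = []          # classification target: is the ATM empty that day?
    days_till_empty = None  # survival target: first day (1-based) the ATM is empty
    for day, amount in enumerate(daily_withdrawals, start=1):
        balance = max(balance - amount, 0)
        cash_left.append(balance)
        hit_empty.append(int(balance == 0))
        if balance == 0 and days_till_empty is None:
            days_till_empty = day
    return {
        "daily_total": list(daily_withdrawals),  # regression target
        "cash_left": cash_left,
        "hit_empty": hit_empty,
        "days_till_empty": days_till_empty,      # None means censored: not empty yet
    }

targets = derive_targets(20000, [3000, 4500, 2500, 6000, 4000])
for name, value in targets.items():
    print(name, value)
```

Which of these targets is easiest to model, and which one the replenishment process can actually use, is something only experimentation will tell.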

As the 4th law of data mining says: there’s no free lunch for the data miner. Prior to experimenting, one cannot tell which of those models would result in the best approach for cash replenishment. You may have an intuition, but more importantly, do not fix your modeling approach and think that the solution is to use the caret package in R (i.e., run 150 commonly used machine learning models). All machine learning will be automated in a matter of years, and hence, as a data scientist, you will no longer have value unless you focus on exactly the points I’m making here: it’s about creativity, it’s about understanding how to phrase the right question and how to create your training set accordingly.

These are my tips to give an initial ranking of things to try out:

  • Exotic models do not solve your problem; getting the question right is more important.
  • Try the simplest thing first.
  • Be prepared to rephrase your modeling experiment.
  • In phrasing your question, try to understand in what way, or via what mechanism, your predictors influence your targets.
  • Do not predict more than you need to solve the business challenge (for example: 7 days out vs. the earlier discussed 30 days out).
  • Understand really well how a resulting model can be implemented, and rethink that again.
  • Understand what data will be available for scoring (especially the timing aspect; more on this later).
  • Understand how well you need to predict (i.e. quality of the prediction) in order to improve on the current business process.
  • Classification questions are often easier than regression questions (for example: predict if someone has a job or not vs. how much they earn).

Google recently broke records in face identification using a model they call FaceNet. Their way of looking at the data was truly new and creative: instead of (the traditional way) trying to predict a face, they grouped faces into triplets; two faces belonged to the same person and one to a different person. It was up to the model to tell which was which. These are the innovative ways of looking at data that characterize a good data scientist.
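
The triplet idea can be sketched in a few lines. The loss compares an anchor face with a positive (same person) and a negative (different person), and penalizes the model unless the anchor is closer to the positive than to the negative by some margin. The two-dimensional embeddings and the margin below are made-up toy values for illustration; in a real system the embeddings would come from a deep network.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero when the negative is at least `margin` farther from the anchor than the positive."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)

anchor = [0.1, 0.9]           # toy embedding of person X, photo 1
positive = [0.2, 0.8]         # toy embedding of person X, photo 2
easy_negative = [0.9, 0.1]    # a different person, far away in embedding space
hard_negative = [0.15, 0.85]  # a different person, confusingly close

print(triplet_loss(anchor, positive, easy_negative))  # already well separated
print(triplet_loss(anchor, positive, hard_negative))  # still violates the margin
```

The target here is not a label per face but a relationship between three faces, which is the kind of reframing of the training set that this article argues for.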