Data Science Data Logic
Although participating in MOOCs and online competitions is a good way to learn data science, the field is about more than algorithms and accuracy. Understanding how to formulate hypotheses, create data sets, sample, and validate is what makes a true data scientist.
Finally, we come to the whole purpose of the exercise: the moment of scoring. The predictors are calculated in exactly the same way as in the training set, and the model is applied to that data. Out comes a prediction for the same time window as the model was trained on. Figure 7 shows an overview, and here it also becomes clear why, in the modeling phase, you cannot create predictors from the time frame between 'now' and 't-3': doing so would imply that at scoring time you also have data available from that time frame, which in this case is future data.
An often-heard complaint is that marketing (or any other business department) does not accept black-box models such as neural networks or support vector machines. I have found that using these simple visualizations greatly enhances the understanding and acceptance of how models can be valid. The business is more than willing to accept black-box models, provided it can somehow follow the logic used to validate them. Figure 8 shows the training, testing, and validation logic for the container-arrival example mentioned earlier. The (black-box) bagged neural network outperformed the existing method, and Figure 8 helped to explain that the logic was sound.
Dynamic vs. static models
So far, the discussion has revolved around the data selection criteria, but not around the features. If a model contains only demographic data (which is usually fairly static), a customer will always get the same predicted score. In the early days of marketing this was acceptable; today's models, however, are expected to incorporate more dynamic features. If a customer changes behavior, this should somehow be picked up by the model and may result in a change in the prediction. This implies that one needs to capture behavior, and changes in behavior, as part of the predictors. A common way of doing this is to specify a set of look-back windows ('last week', 'last month', 'last year', 'ever') and, for each window, aggregate behavior using summary functions such as max, mean, sum, and standard deviation. Examples are 'max time spent on website last week', 'mean time between purchases last month', 'total spent on services last year', or 'volatility (standard deviation) of balance last hour'. It is not uncommon to divide or difference those features with each other to measure change in behavior, for example 'the ratio of max time spent on website last week to max time spent on website last month'.
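As a sketch of this windowed-aggregation idea (the session log, window names, and column names are made up for illustration), such features can be built with pandas roughly like this:

```python
import numpy as np
import pandas as pd

# Hypothetical web-session log: one row per visit, 'days_ago' counts
# back from 'now' (the scoring moment).
rng = np.random.default_rng(0)
sessions = pd.DataFrame({
    "customer_id": rng.integers(0, 3, size=200),
    "days_ago": rng.integers(0, 365, size=200),
    "time_on_site": rng.exponential(10, size=200),  # minutes
})

WINDOWS = {"last_week": 7, "last_month": 30, "last_year": 365}

def window_aggregates(df):
    """Per customer, summarize behavior over several look-back windows."""
    parts = []
    for name, days in WINDOWS.items():
        agg = (df[df["days_ago"] < days]
               .groupby("customer_id")["time_on_site"]
               .agg(["max", "mean", "sum", "std"])
               .add_prefix(f"{name}_"))
        parts.append(agg)
    feats = pd.concat(parts, axis=1).fillna(0)
    # Change-of-behavior feature: ratio of weekly to monthly max.
    feats["ratio_max_week_month"] = (
        feats["last_week_max"] / feats["last_month_max"].replace(0, np.nan)
    ).fillna(0)
    return feats

features = window_aggregates(sessions)
```

Because 'last week' is a subset of 'last month', the ratio feature stays between 0 and 1; values well below 1 flag customers whose recent activity dropped relative to their monthly pattern.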
Alternatively, one can create 'histogram predictors' by binning a continuous variable (from a particular time frame) and counting, or taking the percentage of, observations per bin. Examples are 'percentage of expenses < 10', 'percentage of expenses 10-25', 'percentage of expenses 25-50', and so on. Subsequently, one can derive features that indicate how those bins change over time.
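A minimal sketch of such histogram predictors, assuming a hypothetical expense table with `customer_id` and `amount` columns:

```python
import numpy as np
import pandas as pd

# Hypothetical expense transactions for a handful of customers.
rng = np.random.default_rng(1)
expenses = pd.DataFrame({
    "customer_id": rng.integers(0, 4, size=300),
    "amount": rng.exponential(30, size=300),
})

# Bin the continuous amount, then compute per-customer bin percentages.
bins = [0, 10, 25, 50, np.inf]
labels = ["lt_10", "10_25", "25_50", "gte_50"]
expenses["bin"] = pd.cut(expenses["amount"], bins=bins, labels=labels)

# One row per customer, one column per bin, values summing to 1.
hist_feats = (pd.crosstab(expenses["customer_id"], expenses["bin"],
                          normalize="index")
              .add_prefix("pct_expenses_"))
```

Computing the same table for two time frames and differencing them yields the change-over-time features mentioned above.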
A model using these types of predictors (in combination with static predictors) allows the prediction to change once a certain event happens. Note that in order to allow combinations of demographics to have their 'own' behavioral-change effect, one needs a model that incorporates interactions between features (i.e., not a main-effects-only logistic regression).
It provides a lot of insight to score the model on a set of prototypical records and to visualize how the propensity score changes when one or more dynamic features change.
Purchased in the last three months vs. ever purchased
Note that so far, the discussion has dealt with making a selection based on a time window. I have encountered cases where no time information is available, and one simply knows whether the customer has bought the product or not. Such a model is often referred to as a 'look-alike' model (although all models can be explained in terms of finding look-alikes). Modeling is still possible, but be aware that the predictors may originate later than the purchase event, which can result in predictor leakage. Also be aware that one may end up modeling very old behavior. For example, take a dataset where the target is 'customer has mobile data versus not', without a time period. A model will relate mobile data uptake to a set of background characteristics; however, since there is no timing in the data, the model cannot distinguish early adopters of mobile data from late adopters. The result is that at scoring time, the model will (partially) point at customers who look like early adopters, while by now mobile data is far beyond the early-adoption stage. Such issues can often be detected if one really thinks through the implications of the approach, rather than blindly modeling the first data set that comes to mind.
A transaction based approach
An entirely different approach is displayed in Figure 9. Rather than making a time selection, one recognizes that customers follow certain patterns in their successive transactions. The data is very close to an original transaction table. The table is sorted by customer ID and then ascending by purchase date (again, the example given is a purchase, but the same logic is widely applicable). At the moment a customer makes a transaction, that transaction represents the latest data available. In the historic data, however, one can look beyond that 'current' transaction and observe the next one. In order to predict the next transaction, one can bring that next transaction one line up and call it the target. As predictors for that target, the details of the current transaction can be used, and details of previous transactions can be placed in the same row as the current transaction.
The target can contain various characteristics: it can be the next product, the next price, the time to the next purchase (inter-purchase time), the next channel, the next payment method, and so on. The predictor set can also be extended: rather than the previous product, one can take the previous price, the previous channel, or a lagged variable such as average spend over the last n purchases. Even Markov-like predictors can be derived here: given the current purchase, what is, historically, the transition probability to each next product? One element that should be considered is that many machine learning models implicitly assume independent data records. If this assumption is violated, one underestimates standard errors, with the error proportional to the dependence between observations. One option is to use models that handle this dependency correctly, such as linear mixed models (random-effects models). Alternatively, one could argue that if the right predictors are used, the observations are independent conditional on the model. The interesting aspect of this approach is that no selection logic is needed, and scoring is done the moment a new transaction comes in.
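The bring-the-next-transaction-one-line-up logic, together with a simple Markov-style transition probability, can be sketched in pandas. The toy table and column names below are illustrative, not taken from the article's example:

```python
import pandas as pd

# Toy transaction table, sorted by customer and then purchase date.
tx = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "purchase_date": pd.to_datetime(["2021-01-05", "2021-02-10",
                                     "2021-03-01", "2021-01-20", "2021-04-02"]),
    "product": ["A", "B", "A", "C", "A"],
    "price": [10.0, 25.0, 12.0, 40.0, 11.0],
}).sort_values(["customer_id", "purchase_date"])

g = tx.groupby("customer_id")
# Target: bring the *next* transaction one line up.
tx["next_product"] = g["product"].shift(-1)
tx["days_to_next"] = (g["purchase_date"].shift(-1) - tx["purchase_date"]).dt.days
# Predictors: details of the *previous* transaction on the same row.
tx["prev_product"] = g["product"].shift(1)
tx["prev_price"] = g["price"].shift(1)

# Rows with an observed 'next' transaction form the training set; the
# last transaction per customer is exactly what gets scored in production.
train = tx.dropna(subset=["next_product"])

# Markov-style predictor: empirical transition probability from the
# current product to each next product.
trans = (train.groupby("product")["next_product"]
         .value_counts(normalize=True)
         .rename("p_next")
         .reset_index())
```

Note how no time-window selection is needed: every historic row except the last one per customer carries a known target, and a fresh transaction is scored as it arrives.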
Concentration models vs. mechanism models
Say you like fishing, and after fishing in a variety of ponds, you establish that in some ponds you are more likely to catch fish. You remember those ponds by remembering the route to them; the route itself, however, bears no explanatory power as to why you are more successful in those ponds. That is the principle behind a concentration model. Many (current) marketing models can be referred to as concentration models: within certain combinations of (simple) demographic variables, there appears to be a higher uptake. This has some stability over time, and hence one can make use of it when scoring. The lift of those models is high enough to make the marketing business case (say, a lift of 3-5 for the first 30%); however, with the typically low base rate of campaigns, the false positive rate among the selected customers can be as high as 90%.
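To make that last figure concrete, a small back-of-the-envelope calculation (the numbers are illustrative, not from a specific campaign):

```python
# Illustrative numbers: a campaign with a 3% base response rate and a
# model achieving a lift of 4 within the selected top of the list.
base_rate = 0.03
lift = 4

precision = base_rate * lift          # response rate within the selection
false_positive_share = 1 - precision  # selected customers who do not respond

print(f"{false_positive_share:.0%} of the selected customers are false positives")
# → 88%
```

A fourfold lift sounds (and is) commercially valuable, yet nearly nine out of ten contacted customers still do not respond, simply because the base rate is so low.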
In contrast, there is a type of model that can be referred to as a mechanism model. In a common predictive-maintenance use case, one tries to predict the breakdown of equipment. Here, the breakdown of a machine is typically preceded by increased vibrations (caused by loose parts) or a higher oil temperature (caused by soot deposits). A sensor picking up those measurements will yield high-quality predictors, with very accurate models as a result. The connections (vibrations - loose parts, higher oil temperature - soot deposits) can be established by speaking with engineers. One way to look at this is to say that the engineers who developed the equipment knew that vibrations and oil temperature are important aspects to log, but lacked the means to formally 'convert' the measurements into a propensity score.
In a way, this is the correlation-versus-causation discussion applied to modeling. It shows that correlation can be extremely useful (and profitable); however, when high model quality is required (or expected), one needs to look for data that may contain traces of mechanisms. As an example, consider a bank that tries to predict mortgage uptake from a set of financial summaries. In order to improve prediction quality, it started monitoring transaction details, and more specifically whether customers started or extended annual contracts. Clearly, a mechanism comes into sight.
Businesses that are looking to advance in data science invariably expect models of 'mechanism model' quality, while their available data only gives rise to 'concentration models'. Although the difference between a concentration model and a mechanism model is purely interpretational, using these terms helps when talking about models, as it puts model performance in realistic relation to the available data.
It is difficult to give a comprehensive overview of all types of data set creation. For example, I recognize not having discussed the case where a product hierarchy is available and product-category-level predictors can be used to stabilize predictions at the product level; there are many other such cases. Those use cases may be too specific and require a deeper understanding of the challenge at hand.
Overall, I find that the topics discussed in this article are always assumed to be known and well understood; however, when probing deeper, the methods followed turn out to be based more on habit and imitation than on arguments or data.
Apart from the selection logic, I hope to have shown interesting visualizations that help guide discussions around these topics. Although the algorithmic side of machine learning keeps expanding into ever more complex approaches, the daily practice of a data scientist building and tuning those models is based on a surprisingly small number of (easy) principles. Those principles can and will be automated (once more: look at the caret package in R and the like). The data scientist who does not want to become obsolete will have to develop the creative skills to frame a business challenge in such a way that the resulting dataset is able to give new and valuable answers.