Data Science Data Logic

Even though participating in MOOCs and online competitions are good exercises to learn data science, but it is more than algorithms and accuracies. Understand how to formulate hypothesis, data creation, sampling, validation, etc. to become true data scientist.

Basic dataset creation

Let’s return to the most basic of data science application in marketing: propensity modeling. This will allow to setup of a data selection approach and a way to communicate what is being done.

In Figure 1A, the most basic selection is displayed on a timeline. Between the moment ‘now’ and ‘t-3’, one watches for the event that needs to be modeled. In this case, the green dots indicate ‘buying’ and at the end of the three months, once determines if a customer has bought or not. This will become the target. Note that the moment ‘now’ refers to the most recent historic data that one can get their hands at. Prior to ‘t-3’ is predictor space. The small ‘+’ signs indicate aggregation (roll-ups) of customer behavior, up to ‘t-3’ (actually, one needs to account for the data delay as well; later more). It is crucially important to not take any predictors beyond the ‘t-3’, as this will cause model leakage: your predictors will contain information about the target in a way that you won’t have available at the time of scoring. This type of visualization makes it very clear how your selections take place and it makes the often implicit choices explicit, and thus up for debate.

Note that ‘buying’ here refers to the interesting event one wants to model. It can be the purchase of a particular article, it can be a category of articles, or even it can be a characteristic of an article (such as color, brand, etc). Multiple models, modeling different characteristics, can then be scored to finally map back to the ‘ideal’ product recommendation for a customer. Alternative to purchase, the target can refer to a customer showing a particular behavior that seems interesting (upgrading, downgrading, browsing, requesting etc). Again, the selection logic is widely applicable, and only the imagination of the data scientist is the limit.

On particular thing that needs discussion in Figure 1A is the distance between the last available predictor moment and the moment of buying. For some customers this may be as little as one day (you buy in the first day of the three month window). Likely, the closer your predictor data is to the target, the easier it is to predict. A customer buying a product online today may have looked at an item yesterday. The model then will find this relationship (and hence this case contributes positively to the model performance), yet, how deployable is this? Scoring the model on new customers will point to campaign to customer who went to the website, yet, if they went to the website, and they didn’t buy next day, would any campaign helping a customer to purchase? One particular way of solving this is displayed in Figure 1B. Here, the last predictor moment is always three months prior to purchase (for those who purchased) or three month prior to the end of the observation windows (for those who did not purchase). In this way, you prevent the model looking performant while it’s non-deployable due to the timing issue. Also here, it gives rise to experimenting with the time prior to purchase: although the observation window can still be, say, three months (i.e. ‘now’ – ‘t-3’), the last predictor date can be experimented with. It will lead to a building a series of models, from say, six month prior to purchase, down by steps of one, to 1 month prior to purchase. You expect the model performance to go up as the time difference between the event and last predictor time decreases. This method also allows you to test for model stability and shows how much time prior to an event, one really start seeing clear signal.

So far, we spoke about a three month campaign window. Why three months? There are the following criteria to base this decision on: first of all, the event window is also what you predict forward if you score the model on new data. Using three months, marketing has enough time to roll out campaigns. Imagine using a one day campaign window. That would mean, at the moment you score, you predict customer who will buy the product tomorrow, and thus, there will be no time to send out a campaign. There’s another issue with taking a too short campaign window: the number of purchases will be very small. Balancing your data, working with cost structure or adding priors to the model are ways to deal with unbalanced data, however, general practice shows modeling becomes harder the more unbalanced the sample is. In many marketing applications, a three month window results in a 2%-5% up take rate, which seems to be a fair level of unbalance to still build valid models (with or without the balancing options; some models need it, some don’t). Widening the event window leads to other issues: although the percentage of purchasing customers increases, many of them will have predictor data far in the past. Given a uniform uptake, if your event window is one year, half of your customers will have the last predictor data that is half a year ago or longer (in case of the method from Figure 1A). I’ve modeled slow moving automotive parts with a one year window: here it seemed reasonable, since a particular car part could be sold once in two years.

Summarizing, the event window depends on the expected number of positive examples in your training set: it should be reasonable to model, which is >1% (please see this as a rough guideline and not as a hard border), it depends on how the resulting model will be used, and finally, it depends on the industry dynamics.

When the take rate is low, rather than working with a wider window, another approach is to use a sliding window. The principle is the same as explained above, however, this is done for a number of consecutive months and those are stack to form one training set. This is illustrated in Figure 2.

Figure 3 shows the data layout of a prepaid churn model. Churn in a prepaid scenario has a particular difficulty: you do not know when the churn took place (such as an end-date of a contract, or a phone call of the customer to quit the service). Typically, churn is inferred from the customer not showing on the network for a number of days (say, 60 days). This is the event window. In this example, the event window is separated from the campaign window in order to explicitly make room to conduct a campaign. Note the dot indicating ‘last seen active’. This shows the separation of the campaign window and the event window: this particular customer was still active on the last day of the campaign window, but not in the event window, and hence was classified as churn. Another new element here is the definition of the active customer: a customer is part of the training set if they are on the network prior to the start of the data window. The ‘last day in data window’ here also plays an important role: customers need to be active at least once in the week ending at ‘the last day in data window’, in order to make sure they have not already churned. Including those customers who are already not active gets a very performant model which says: if you have not been active in the last week, you will likely churn. At scoring time, your model will point to all customers who have not been active for a week, which, very likely, you can’t reach using campaigns because they already switched sim cards. Lastly, in this figure, the data delay is made explicit. This can be an important point when the campaign window is small. If it takes a full week to get the data, at the time of scoring, the predictor data is one week old, and hence, the campaign window is shortened with one week.

Figure 4 shows the data delay situation in a model that was tasked to predict the number of containers arriving in a port in the next three eight-hour shifts (shift A, B and C). Building a model to predict the number of containers at time t, based on the activity on t-1 leads to a good model, however, at scoring time, the only data available during ‘t-1’ was the data at ‘t-2’. Luckily, by pointing out the importance of understanding data delay in an early phase, we never went into building a non-functioning model based on ‘t-1’ data. The visualizations using to communicate those points are displayed in Figure 4.

Training a model on samples

The method outlined above will result in a training set. To be clear in terms of terminology: a training set is used to train and tune the model. Once the final model is ready, I consider it good practice to conduct an out of time validation. This out of time validation is discussed in the next paragraph. In this paragraph I would like to discuss how to efficiently train a model. Frequently I see people argue that models need to be trained on as much data as possible (and here we are in the midst of the Big Data hype). In some complex cases this is true, however, in the majority of industry models, I see no point in this. Best practice is to test (using data, data scientists!) how large your training set should be. In most cases, there is more than enough data available (say, your training data has >1M cases) to do something smarter with your data than just throw it all in one model.

Figure 5 outlines the procedure. The data is sorted in random order and there’s a column available to quickly select percentiles of customers (an easy, repeatable way to achieve this is taking the last two digits of customer ID, if you are working with customers). The first model is build using selection ‘Training 1’ and tested on ‘Validation 1’, next, the training set is increased (Training 2) and again validated on ‘Validation 1’. This process is continued until the evaluation measure on ‘Validation 1’ doesn’t show an increase when increasing the training data (say, this is ‘Training 3’). Now, since you have not used the whole dataset, you can take another partition of size ‘Training 3’ and build another model to test it against ‘Validation 1’. Does this give the same performance? Do the same predictors come up? This tells a lot about the stability of the model. When done with the training, the model can be now validated on larger set ‘Validation 2’. Does the model still hold? And still, we might not have used the full dataset. The partition called ‘Other use’ can now be used to build an ensemble by mixing the models built so far, and determining the combination weights (or non-linear combinations thereof) on yet the last unseen data. This approach gives rise to much smarter use of your data; rather than waiting till the estimation of your 1M row model completes, now you get the chance to quickly test and re-test and test again.

Out of time validation

Once you have a trained and tested model, you would like to bring it in production. However, the model was built in one time period, and will be scored on a different time period. Due to changing circumstances, the relation between target and predictors may change over time (this is called this drift), and this is something you would like to find out prior to bringing the model in production. In Figure 6, the out of time validation scheme is displayed. Given the three month campaign window, the exact same selection is made, but now ranging from ‘t-3’ to ‘t-6’. The model is scored on this data set, and compared with the known results. The reason that the out of time validation is backward, rather than forward is the following: if the model is trained on ‘t-3’-‘t-6’, it is validated one period further, and scored two periods further. Two periods is twice the drift, and likely the model performs more poorly than in the out of time validation. Assuming a constant drift, scoring one period backwards shows the same degradation in model performance as scoring one period forwards. Moreover, I feel it is a good idea to train the final scoring model on the most recent data available.