Business intuition in data science
Data science projects involve more than applying algorithms and building models; the other steps of a project are equally important. Here we explain them in detail.
3. Modeling and Evaluation:
This is the step where we select the “right algorithm” to get the “right set of solutions” for our business problem. It is an extremely important step, and the key is to find the algorithm best suited to the given business objective. In the case above, without going into a lot of detail, we have two objectives: (1) find the most responsive set of customers out of the 100MM, say x of them, and (2) for each of these x customers, show the offers most relevant to his or her preferences. For the first objective, we need a response prediction algorithm (e.g. regression techniques) that gives a response likelihood score or probability for each customer, which can then be used to rank-order the customers and select the most responsive ones for the campaign. For objective (2), finding each customer’s offer preferences, we need algorithms that help select the product offers a customer is most likely to prefer (e.g. recommender algorithms or classification techniques).
Once we have built the algorithms, they are evaluated on how well they meet the objectives at hand. Let’s understand this using the case study above. Assume we have built a response prediction algorithm that rank-orders the 100MM customers by their probability of buying a product after seeing the email offers:
Customer ID | Probability of response |
1 | 98.0% |
2 | 95.0% |
3 | 90.0% |
4 | 89.0% |
5 | 88.0% |
. | . |
. | . |
. | . |
100MM | 0.1% |
Now, we split these 100MM customers into 10 equal buckets, rank-ordered from the highest probability of response to the lowest. For each bucket, we look at the actual response rate to a previous email offer campaign, which was sent to all 100MM customers:
Customer bucket | (A) Average predicted probability of response to an email campaign | (B) # customers who were sent emails in last campaign | (C) # responders in last email campaign | (C/B) Actual response rate in the bucket for the last email campaign |
Bucket 1 | 98.0% | 10,000,000 | 10,000,000 | 100.0% |
Bucket 2 | 97.0% | 10,000,000 | 9,200,000 | 92.0% |
Bucket 3 | 78.0% | 10,000,000 | 6,800,000 | 68.0% |
Bucket 4 | 65.0% | 10,000,000 | 5,000,000 | 50.0% |
Bucket 5 | 45.0% | 10,000,000 | 4,250,000 | 42.5% |
Bucket 6 | 23.0% | 10,000,000 | 2,500,000 | 25.0% |
Bucket 7 | 15.0% | 10,000,000 | 1,300,000 | 13.0% |
Bucket 8 | 4.0% | 10,000,000 | 1,480,000 | 14.8% |
Bucket 9 | 1.0% | 10,000,000 | 500,000 | 5.0% |
Bucket 10 | 0.1% | 10,000,000 | 30,000 | 0.3% |
Total | | 100,000,000 | 41,060,000 | 41.1% |
Please note: response here means a product purchase after seeing the email offer.
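As a rough illustration of the bucketing step, here is a minimal Python sketch. It assumes each customer record is a tuple of (customer ID, predicted probability, whether the customer responded to the last campaign); the function name `bucket_table` and the toy data are hypothetical and only there to make the example runnable.

```python
from statistics import mean

def bucket_table(records, n_buckets=10):
    """records: list of (customer_id, predicted_probability, responded) tuples."""
    # Rank-order customers from highest to lowest predicted probability.
    ranked = sorted(records, key=lambda r: r[1], reverse=True)
    # Assumes the number of customers is divisible by n_buckets.
    size = len(ranked) // n_buckets
    table = []
    for b in range(n_buckets):
        chunk = ranked[b * size:(b + 1) * size]
        responders = sum(1 for _, _, responded in chunk if responded)
        table.append({
            "bucket": b + 1,
            "avg_predicted": mean(p for _, p, _ in chunk),
            "customers": len(chunk),
            "responders": responders,
            "actual_rate": responders / len(chunk),
        })
    return table

# Hypothetical toy data: 20 customers, so each bucket holds two of them.
toy = [(i, 1 - i / 20, i % 3 == 0) for i in range(20)]
table = bucket_table(toy)
for row in table:
    print(row)
```

With real data, the same function would be run on the full scored customer list joined to the responses from the previous campaign.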
So, to meet objective 1, we just have to decide up to which bucket we want to send the email offer.
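One way to make that decision concrete is to look at the cumulative share of past responders captured by emailing only the top k buckets. The sketch below uses the responder counts from the table above; the function names and the target shares (75% and 90%) are illustrative assumptions, not part of the case study.

```python
# Responders per bucket in the last campaign, from the table above (buckets 1..10).
responders = [10_000_000, 9_200_000, 6_800_000, 5_000_000, 4_250_000,
              2_500_000, 1_300_000, 1_480_000, 500_000, 30_000]
bucket_size = 10_000_000

def cumulative_capture(responders, bucket_size):
    """For each cutoff k, return (k, emails sent, responders reached, share of all responders)."""
    total = sum(responders)
    rows, running = [], 0
    for k, r in enumerate(responders, start=1):
        running += r
        rows.append((k, k * bucket_size, running, running / total))
    return rows

def smallest_cutoff(responders, bucket_size, target_share):
    """Smallest number of top buckets that reaches the target share of past responders."""
    for k, _, _, share in cumulative_capture(responders, bucket_size):
        if share >= target_share:
            return k
    return len(responders)

for k, emails, reached, share in cumulative_capture(responders, bucket_size):
    print(f"top {k} buckets: {emails:,} emails, {reached:,} responders ({share:.1%})")

cutoff_75 = smallest_cutoff(responders, bucket_size, 0.75)  # hypothetical target
cutoff_90 = smallest_cutoff(responders, bucket_size, 0.90)  # hypothetical target
```

In practice the target would come from the business side, for example from the campaign budget or the cost per email.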
Now, in the above table, you can see that there are discrepancies between the “average probability of response” and the “actual response rate” for some of the buckets, e.g. buckets 3 and 4. So, the predictions are not very “accurate” when compared to the actual values. However, since the objective here is to select a set of customers with a high likelihood of response, we are more concerned with how well the model rank-orders the customers in terms of response. Looking at the actual rates, it seems to be doing a pretty good job: the actual response rates from the past campaign are also ordered in almost perfectly descending order, with only buckets 7 and 8 out of sequence.
So here, evaluating the model is more about how well it rank-orders the customers by their response probability than about the accuracy of the predictions.
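To see the difference between the two views, the sketch below computes one rank-ordering measure and one accuracy measure on the per-bucket values from the table above. Spearman rank correlation and mean absolute error are chosen here purely for illustration; the article does not prescribe particular metrics, and the helper names are hypothetical.

```python
# Per-bucket values from the table above, as fractions (buckets 1..10).
predicted = [0.98, 0.97, 0.78, 0.65, 0.45, 0.23, 0.15, 0.04, 0.01, 0.001]
actual = [1.00, 0.92, 0.68, 0.50, 0.425, 0.25, 0.13, 0.148, 0.05, 0.003]

def ranks_desc(values):
    """Rank 1 = largest value. Assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

def spearman(x, y):
    """Spearman rank correlation for tie-free data: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_desc(x), ranks_desc(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def mean_absolute_error(x, y):
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

rho = spearman(predicted, actual)             # how well the ordering is preserved
mae = mean_absolute_error(predicted, actual)  # how far the predicted values are off
# Adjacent buckets whose actual response rates are in the "wrong" order.
inversions = sum(1 for a, b in zip(actual, actual[1:]) if a < b)

print(f"Spearman rank correlation: {rho:.3f}")
print(f"Mean absolute error: {mae:.2%}")
print(f"Adjacent buckets out of order: {inversions}")
```

A rank correlation close to 1 alongside a noticeable absolute error is the pattern described above: the ordering is good even though the predicted values are off.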
However, when we evaluate the results of the second model, which gives a preference score for every product offer for every customer, prediction accuracy can be more important. Let’s say, in the above case, there are 10 product offers. So, we build a model that gives a preference score for each customer for each of the 10 product offers (the first five offers are shown here):
Customer ID | Product offer 1 | Product offer 2 | Product offer 3 | Product offer 4 | Product offer 5 |
1 | 0.7 | 0.13 | 0.01 | 0.15 | 0.01 |
2 | 0.8 | 0.5 | 0.02 | 0.005 | 0.4 |
3 | 0.01 | 0.02 | 0.002 | 0.03 | 0.04 |
Here, customer 1 prefers offers 1, 4 and 2, in that order (scores of 0.7, 0.15 and 0.13). For offers 3 and 5, the preference score is very low, so we can assume that the customer has no preference for these products. Similarly, we can say that customer 3, whose scores are all very low, is not showing a preference for any particular product. We can set a threshold score: if a customer’s score for an offer is higher than the threshold, we consider it a preference; otherwise we do not.
So, you can see that we make these assessments based on the value of the score itself, and therefore it is important that the scores accurately reflect the true preferences of the customer. Hence, in evaluating this model, prediction accuracy is very important.
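The threshold rule can be sketched in a few lines of Python, using the scores from the table above. The threshold value of 0.1 and the function name `preferred_offers` are assumptions made for illustration; the article does not specify a threshold.

```python
# Preference scores from the table above: customer ID -> scores for offers 1..5.
scores = {
    1: [0.7, 0.13, 0.01, 0.15, 0.01],
    2: [0.8, 0.5, 0.02, 0.005, 0.4],
    3: [0.01, 0.02, 0.002, 0.03, 0.04],
}

THRESHOLD = 0.1  # hypothetical cutoff, to be tuned for the business problem

def preferred_offers(offer_scores, threshold):
    """Return offer numbers scoring above the threshold, strongest preference first."""
    above = [(s, n) for n, s in enumerate(offer_scores, start=1) if s > threshold]
    above.sort(key=lambda pair: pair[0], reverse=True)
    return [n for _, n in above]

for customer_id, offer_scores in scores.items():
    print(customer_id, preferred_offers(offer_scores, THRESHOLD))
```

Because the rule compares raw scores against a fixed cutoff, a model whose scores are systematically too high or too low would change which offers pass, which is why accuracy matters for this model.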
4. Prototype:
By building a data prototype, we mean creating the infrastructure needed to implement the solution in a production environment. Since implementation is a time- and resource-intensive process, it needs appropriate consideration. In the above case, some of the considerations could be:
- Is this email campaign a one-off marketing initiative or a more regular one? If regular, then it makes sense to create a production platform to execute such campaigns.
- For such a platform, how will all the data feeds from different sources be put together? Assessments need to be made of the effort and cost involved in cleaning the source data, its update frequency, internal data hygiene checks and balances, etc.
- How will all this data be stored and processed? This involves decisions like the need for parallel processing (if data volume is huge) or real-time processing as well as storage infrastructure.
- How will the emails be delivered? Here again, the decisions required include the need for a third-party email delivery vendor, customer data privacy checks and balances, and speed to market, including the need for real-time processing.
These are some of the considerations, but depending on the scale and complexity of the assignment, there can be many other things that need to be assessed and evaluated.
So, as you can see, a data science assignment is the sum of many stages, which require domain expertise and a detailed understanding of the business objectives along with technical expertise. One cannot do without the other!
Want to learn more? Join the 8-week, case-study-based data science course offered by www.deeplearningtrack.com
Bio: Jahnavi Mahanta is co-founder of Deeplearningtrack.