How to Ace A/B Testing Data Science Interviews
Understanding the process of A/B testing and knowing how to discuss this approach during data science job interviews can give you a leg up over other candidates. This mock interview provides a step-by-step guide to demonstrating your mastery of the key concepts and logical considerations.
By Preeti Semwal, Data Science & Analytics Leader.
A/B testing is an in-demand skill that is often tested in data science interviews. At the same time, there are very few resources out there to help you prepare for A/B testing interviews. In my 15-year career and as a hiring manager in data science, I have found that most candidates perform poorly in these interviews. Moreover, the field of experimentation keeps evolving, with new concepts and approaches becoming relevant each year. This means that even seasoned data scientists who did A/B testing some years back often find themselves stumped in interviews.
In this post, we will go through a mock interview that will help you understand what the interviewer is looking for and how to approach these interviews. Why a mock interview, you might ask? Well, as data scientists, we sometimes struggle with communication, and having a template in mind helps tremendously. Personally, I have also found that when I can visualize a high-stakes situation and how it may play out, it helps me be better prepared mentally, handle pressure well, and perform better overall.
We will use an example from Doordash, a food delivery company with a mobile app that currently ranks #1 on the iPhone App Store. They are constantly improving their app through experimentation and look for strong experimentation skills during their data science interviews — especially for product data science or product analytics roles.
INTERVIEWER — Doordash is expanding into other categories such as convenience store delivery. Their notifications have had good success in the past, and they are considering sending an in-app notification to promote this newly launched category.
How would you design and analyze an experiment to decide if they should roll out the notification?
Part 1 — Ask clarifying questions to understand business goals and product feature details well
What the interviewer is looking for -
Did you begin by stating the product/business goal before diving into the experiment details? Talking about the experiment without knowing the product goal is a red flag.
INTERVIEWEE — Before we begin with the experiment details, I would like to make sure my understanding of the background is clear. There could be multiple goals with a feature like this one — such as increasing new user acquisition, increasing conversion for this category, increasing the number of orders in the category, or increasing total order value. Can you help me understand what the goal is here?
INTERVIEWER — That’s a fair question. With the in-app notification, we are primarily trying to increase the conversion rate for the new category — i.e., the percent of users that place an order in the new category out of all users that log in to the app.
INTERVIEWEE — OK, that’s helpful. Now I would like to also understand more about the notification — what is the messaging, and who is the intended audience?
INTERVIEWER — We are not offering any discount at this point. The messaging is simply going to be to let them know we have a new category that they can start ordering from. If the experiment is successful, we intend to roll out the notification to all users.
INTERVIEWEE — OK. Thanks for that background. I am now ready to dive into the experiment details.
Part 2 — State Business Hypothesis, Null Hypothesis, & define metrics to be evaluated
What the interviewer is looking for -
That you think through secondary metrics and guardrail metrics in addition to the primary metrics
INTERVIEWEE — So, to state the business hypothesis: we expect that if we send the in-app notification, then the conversion rate for the new category will increase. That means our null hypothesis (H0) is that there is no change in the conversion rate due to the notification.
Now let me state the different metrics that we will want to include in the experiment. Since the goal of the notification is to increase the conversion rate in the new category, that will be our primary metric. In terms of secondary metrics, we should also watch the average order value to see what the impact is. It is possible that the conversion rate increases but the average order value decreases, such that the net impact is lower overall revenue. That is something we may want to watch out for.
We should also consider guardrail metrics — these are metrics that are critical to the business that we do not want to impact through the experiment, such as time spent on app or app uninstalls, for example. Are there any such metrics that we should include in this case?
INTERVIEWER — I agree with your choice of primary metric, but you can ignore the secondary metrics for this exercise. And you are spot on in terms of guardrail metrics. Doordash wants to be judicious about any features or releases when it comes to their app because we know that the LTV of a customer who has installed the app is much higher. We want to be careful so as not to drive users to uninstall the app.
INTERVIEWEE — OK, that’s good to know. So we will include the percent of uninstalls as our guardrail metric.
Part 3 — Choose significance level, power, MDE, and calculate the required sample size and duration for the test
What the interviewer is looking for -
Your knowledge of the statistical concepts and the calculation for sample size and duration
Whether you consider factors such as network effect (common in two-sided marketplaces such as Doordash, Uber, Lyft, Airbnb or social networks such as Facebook and LinkedIn), day of the week effect, seasonality, or novelty effect that may affect the validity of the test and need to be considered while arriving at the experiment design
INTERVIEWEE — Now, I would like to get into the design of the experiment.
Let’s first see if we need to consider network effects — these occur when the behavior of the control group is influenced by the treatment given to the test group. Since Doordash is a two-sided marketplace, it is more prone to network effects. In this specific case, if the treatment increases demand from the test group, that may create a deficit of supply (i.e., dashers) that could, in turn, affect the performance of the control group.
To account for network effects, we will need to choose the randomization unit differently than we would typically do. There are many ways to do this — we could do geo-based randomization, or time-based randomization, or network-cluster randomization, or network ego-centric randomization. Would you like me to go into the details for these?
INTERVIEWER — I am glad you brought up network effects as it is, in fact, something we carefully look for in our experiments in Doordash. In the interest of time, let’s assume there are no network effects in play here and move on.
INTERVIEWEE — So if we are assuming there are no network effects to be accounted for, the randomization unit for the experiment is simply the user — i.e., we will randomly select users and assign them to treatment and control. Treatment will receive notifications, while control will not receive any notifications. Next, I would like to calculate the sample size and duration. For this, I need a few inputs.
- Baseline conversion — the existing conversion rate of the control before any changes are made.
- Minimum detectable effect (MDE) — the smallest change in conversion rate we are interested in detecting. A smaller change would not be practically significant to the business; the MDE is typically chosen so that the improvement in the desired outcome justifies the cost of implementing and maintaining the feature.
- Statistical power — the probability that the test correctly rejects the null hypothesis when a real effect exists.
- Significance level — the probability of rejecting the null hypothesis when it is in fact true.
A 5% significance level and 80% power are usually chosen, and I will assume these unless you say otherwise. I will also assume a 50-50 split between control and treatment. Once these inputs are finalized, I will use power analysis to calculate the sample size. I would use a programming language for this; in R, for example, the ‘pwr’ package can be used.
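To make that power analysis concrete, here is a minimal stdlib-only Python sketch using the standard normal-approximation formula for two proportions. The baseline conversion (10%) and MDE (1 percentage point) are hypothetical numbers for illustration; R's ‘pwr’ package or Python's statsmodels would give comparable results.

```python
from math import ceil
from statistics import NormalDist  # stdlib standard-normal quantiles

def sample_size_two_proportions(baseline, mde, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided, two-proportion z-test
    (normal approximation, equal 50-50 split)."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for power = 0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# Hypothetical inputs: 10% baseline conversion, 1-point absolute MDE
n_per_group = sample_size_two_proportions(baseline=0.10, mde=0.01)
print(n_per_group)  # 14749 users per variation
```

Note how sensitive the result is to the MDE: halving it roughly quadruples the required sample size, which is why the MDE should be negotiated with the business before the test starts.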
INTERVIEWER — Yes, let’s say, based on the analysis, we get a sample size of 10,000 users per variation needed. How will you calculate the duration of the test?
INTERVIEWEE — Sure, for this, we will need the daily number of users that log in to the app.
INTERVIEWER — Assume we have 10,000 users that log in to the app daily.
INTERVIEWEE — OK, in that case, we would need a minimum of 2 days to run the experiment. I arrived at this by taking the total sample size across control and treatment (20,000 users) and dividing by the daily user count (10,000). However, there are other factors we should consider when finalizing the duration.
- Day of the week effect — You may have a different population of users on weekends than weekdays — hence it is important to run long enough to capture weekly cycles.
- Seasonality — There can be times when users behave differently that are important to consider, such as holidays.
- Novelty effect — When you introduce a new feature, especially one that’s easily noticed, initially it attracts users to try it. A test group may appear to perform well at first, but the effect will quickly decline over time.
- External effects — for example, a concurrent marketing campaign or a competitor’s promotion could change user behavior during the test window, leading us to draw spurious conclusions from the experiment.
Due to the above, I would recommend running the experiment for at least one week.
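A minimal Python sketch of this duration logic, assuming each day's logins are distinct users (repeat logins would stretch the enrollment period):

```python
from math import ceil

def experiment_duration_days(n_per_group, num_groups=2, daily_users=10_000,
                             weekly_cycle_floor=7):
    """Days needed to enroll the full sample, with a one-week floor
    so the test captures day-of-week effects."""
    days_to_enroll = ceil(n_per_group * num_groups / daily_users)
    return max(days_to_enroll, weekly_cycle_floor)

# 10,000 users per variation and 10,000 daily logins: 2 days to enroll,
# padded to a full week to cover the weekly cycle
print(experiment_duration_days(10_000))  # 7
```

The one-week floor is a judgment call, not a statistical requirement; for features with a suspected novelty effect, you would extend it further and compare the early and late portions of the test.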
INTERVIEWER — OK, that’s fair. How would you analyze the test results?
Part 4 — Analyze the results and draw valid conclusions
What the interviewer is looking for -
Your knowledge of the appropriate statistical tests to be used in different scenarios (e.g., t-test for the sample mean and z-test for sample proportions)
You check for randomization — this will get you some brownie points
You provide a final recommendation (or a framework to get there)
INTERVIEWEE — Sure. There are two key parts to the analysis -
- Check for randomization — As a best practice, we should verify that randomization was done correctly when assigning users to treatment and control. For this, we can look at some baseline metrics that we do not expect to be influenced by the test and compare them across the two groups, for example by comparing histograms or density curves of these metrics. If there is no systematic difference, we can conclude that randomization was done correctly.
- Significance test for all metrics (including primary and guardrail metrics) — Both our primary metric (conversion rate) and our guardrail metric (uninstall rate) are proportions, so we can use a two-proportion z-test for statistical significance. We can do this using a programming language such as R or Python.
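As a sketch, here is what that z-test could look like in stdlib-only Python. The order counts are hypothetical; libraries such as statsmodels provide an equivalent `proportions_ztest`.

```python
from math import sqrt
from statistics import NormalDist  # stdlib standard-normal CDF

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for a difference in proportions, using the
    pooled standard error under the null of no difference."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 1,000/10,000 control vs. 1,100/10,000 treatment
z, p = two_proportion_ztest(x1=1_000, n1=10_000, x2=1_100, n2=10_000)
print(round(z, 2), round(p, 3))  # 2.31 0.021 -> significant at the 5% level
```

The same function would be run once for the conversion rate and once for the uninstall rate, since both are proportions.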
If there is a statistically significant increase in conversion rate and the uninstall rate is not negatively impacted, I would recommend rolling out the notification.
If there is a statistically significant increase in conversion rate but the uninstall rate is negatively impacted, I would recommend against the rollout.
And lastly, if there is no statistically significant increase in conversion rate, I would also recommend against the rollout.
INTERVIEWER — That all sounds good. Thanks for your response.
Doing well in A/B testing or experimentation interviews will give you an edge in the hiring process and set you apart from other candidates. I would highly recommend spending focused time learning the key concepts of A/B testing and preparing well for these interviews.
A couple of good resources that I would recommend -
- Complete Course on Product A/B Testing with Interview Guide (Disclaimer — I am the course instructor)
- If you have advanced knowledge of experimentation, this book is great to recap key concepts — ‘Trustworthy Online Controlled Experiments (A Practical Guide to A/B Testing)’ by Ron Kohavi, Diane Tang, and Ya Xu
Original. Reposted with permission.
Bio: Preeti Semwal has 15 years of experience helping organizations bring the power of data science and analytics into business strategies. With an exceptional ability to narrate stories through data and solid experience presenting to the C-suite, Preeti is a leader who truly believes in empowering, nurturing, and advocating for her team.