5 Tricks When A/B Testing Is Off The Table
Sometimes you cannot do A/B testing, but that does not mean you have to fly blind: there is a range of econometric methods that can illuminate the causal relationships at play.
The key assumption required for internal validity of the DD estimate is parallel trends: absent the treatment itself, the treatment and control markets would have followed the same trends. That is, any omitted variables affect treatment and control in the same way.
How can we validate the parallel trends assumption? There are a few ways to make progress, both before and after rolling out the test.
Before rolling out the test, we can do two things:
- Make the treatment and control groups as similar as possible. In the experimental set-up, consider implementing stratified randomization. Although generally unnecessary when samples are large (e.g., in user-level randomization), stratified randomization can be valuable when the number of units (here geos) is relatively small. Where feasible, we might even generate “matched pairs” — in this case markets that historically have followed similar trends and/or that we intuitively expect to respond similarly to any internal product changes and to external shocks. In their 1994 paper estimating the effect of minimum wage increases on employment, David Card and Alan Krueger matched restaurants in New Jersey with comparable restaurants in Pennsylvania just across the border; the Pennsylvania restaurants provided a baseline for determining what would have happened in New Jersey if the minimum wage had remained constant.
- After the stratified randomization (or matched pairing), check graphically and statistically that the pre-treatment trends are approximately parallel between the two groups (see the sketch below). If they aren’t, we should redefine the treatment and control groups; if they are, we should be good to go.
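To make that pre-treatment check concrete, here is a minimal sketch in R, assuming a long-format data frame pre_df of pre-rollout market-weeks with hypothetical columns revenue (the metric), week (numeric), and treated (a 0/1 indicator for treatment markets):

# Visual check: mean revenue by week for treatment vs. control markets.
pre_means <- aggregate(revenue ~ week + treated, data = pre_df, FUN = mean)
pre_means <- pre_means[order(pre_means$week), ]
plot(revenue ~ week, data = subset(pre_means, treated == 1), type = "b", ylab = "mean revenue")
lines(revenue ~ week, data = subset(pre_means, treated == 0), type = "b", lty = 2)

# Statistical check: a small, statistically insignificant week:treated
# interaction is consistent with approximately parallel pre-treatment trends.
summary(lm(revenue ~ week * treated, data = pre_df))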
Ok, so we’ve designed and rolled out a good experiment, but with everyone moving fast, stuff inevitably happens. Common problems with DD often come in two forms:
Problem 1: Confounders pop up in particular treatment or control markets. Maybe mid-experiment our BD team launches a new partnership in some market. And then a different product team rolls out a localized payment processor in some other market. We expect both of these to affect our revenue metric of interest.
Solution: Assuming we have a bunch of treatment and control markets, we can simply exclude those markets — and their matches if it’s a matched design — from the analysis.
Problem 2: Confounders pop up across some subset of treatment and control markets. Here, there’s some change — internal or external — that we’re worried might impact a bunch of our markets, including some treatment and some control markets. For example, the Euro is taking a plunge and we think the fluctuating exchange rate in those markets might bias our results.
Solution: We can add additional differencing by that confounder as a robustness check in what’s called a difference-in-difference-in-differences estimation (DDD). DDD will generally be less precise than DD (i.e., the point estimates will have larger standard errors), but if the two point estimates themselves are similar, we can be relatively confident that that confounder is not meaningfully biasing our estimated effect.
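One common way to implement this kind of check is a fully interacted regression. Here is a minimal sketch in R, assuming hypothetical 0/1 columns treated, post, and euro_exposed in a market-period data frame df:

# DDD as a robustness check on the DD estimate.
ddd_fit <- lm(Y ~ treated * post * euro_exposed, data = df)
summary(ddd_fit)
# In this saturated specification, the coefficient on treated:post is the DD
# estimate among markets not exposed to the confounder, and the triple
# interaction treated:post:euro_exposed measures how much that estimate shifts
# in the exposed markets. Similar estimates across the two groups suggest the
# exchange-rate shock is not meaningfully biasing the result.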
Pricing is an important and complicated beast, probably worthy of additional discussion. For example, the estimate above may not be the general equilibrium effect we should expect: in the short run, users may be responding to the change in price, not just to the new price itself; but in the long run, users will likely respond only to the price itself (unless prices continue to change). There are several ways to make progress here. For example, we can estimate the effect only on new users who had not previously been served a price and for whom the change would therefore not be salient. But we’ll leave a more extended discussion of pricing to a subsequent post.
Method 4: Fixed Effects Regression
Fixed effects is a particular type of controlled regression, and is perhaps best illustrated by example.
A large body of academic research studies how individual investors respond (irrationally) to market fluctuations. One metric a fin-tech firm might care about is the extent to which it is able to convince users to (rationally) stay the course — and not panic — during market downturns.
Understanding what helps users stay the course is challenging. It requires separating what we cannot control — general market fluctuations and learning about those from friends or news sources — from what we can control — the way market movements are communicated in a user’s investment returns. To disentangle, we once again go hunting for a source of randomness that affects the input we control, but not the confounding external factors.
Our approach is to run a fixed effects regression of percent portfolio sold on portfolio return controlling (with fixed effects) for the week of account opening. Since the fixed effects capture the week a user opened their account, the coefficient on portfolio return is the effect of having a higher return relative to the average return of other users funding accounts in that same week. Assuming users who opened accounts the same week acquire similar tidbits from the news and friends, this allows us to isolate the way we display movements in the user’s actual portfolio from general market trends and coverage.
Sound familiar? Fixed effects regression is similar to RDD in that both take advantage of the fact that users are distributed quasi-randomly around some point. In RDD, there is a single point; with fixed effects regression, there are multiple points — in this case, one for each week of account opening.
In R:
fit <- lm(Y ~ X + factor(F), data = df)  # F indexes the fixed-effect groups (here, week of account opening)
summary(fit)
The two assumptions required for internal validity in RDD apply here as well. First, after conditioning on the fixed effects, users are as good as randomly assigned to their X values — in this case, their portfolio returns. Second, there can be no confounding discontinuities, i.e., conditional on the fixed effects, users cannot otherwise be treated differently based on their X.
For the fixed effects method to be informative, we of course also need variation in the X of interest after controlling for the fixed effects. Here, we’re ok: users who opened accounts the same week do not necessarily have the same portfolio return; markets can rise or fall 1% or more in a single day. More generally, if there’s not adequate variation in X after controlling for fixed effects, we’ll know because the standard errors of the estimated coefficient on X will be untenably large.
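One rough way to gauge that residual variation, using the same generic X and F as above:

# How much of the variation in X do the fixed effects absorb?
# An R-squared near 1 means little residual variation in X is left to
# identify its effect, and the standard error on X will balloon.
first_pass <- lm(X ~ factor(F), data = df)
summary(first_pass)$r.squared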
Method 5: Instrumental Variables
Instrumental variable (IV) methods are perhaps our favorite method for causal inference. Recall our earlier notation: we are trying to estimate the causal effect of variable X on outcome Y, but cannot take the raw correlation as causal because there exists some omitted variable(s), C. An instrumental variable, or instrument for short, is a feature or set of features, Z, such that both of the following are true:
- Strong First Stage: Z meaningfully affects X.
- Exclusion restriction: Z affects Y only through its effect on X.
Who doesn’t love a good picture?
If these conditions are satisfied, we can proceed in two steps:
- First stage: Instrument for X with Z
- Second stage: Estimate the effect of the (instrumented) X on Y
In R:
library(AER)
fit <- ivreg(Y ~ X | Z, data = df)
summary(fit, vcov = sandwich, df = Inf, diagnostics = TRUE)
Ok, so where do we find these magical instruments?
Economists often find instruments in policies. Josh Angrist and Alan Krueger instrument for years of schooling with the Vietnam Draft lottery; Steve Levitt instruments for prison populations with prison overcrowding litigation. Although good instruments in the real world can generate incredible insights, they are notoriously hard to come by.
The good news is that instruments are everywhere in tech. As long as your company has an active A/B testing culture, you almost certainly have a plethora of instruments at your fingertips. In fact, any A/B test that drives a specific behavior is a contender for instrumenting ex post for the effect of that behavior on an outcome you care about.
Suppose we are interested in learning the causal effect of referring a friend on churn. We see that users who refer friends are less likely to churn, and hypothesize that getting users to refer more friends will increase their likelihood of sticking around. (One reason we might think this is true is what psychologists call the Ikea Effect: users care more about products that they have invested time contributing to.)
Looking at the correlation of churn with referrals will of course not give us the causal effect. Users who refer their friends are de facto more committed to our product.
But if our company has a strong referral program, it’s likely been running lots of A/B tests pushing users to refer more — email tests, onsite banner ad tests, incentives tests, you name it. The IV strategy is to focus on a successful A/B test — one that increased referrals — and use that experiment’s bucketing as an instrument for referring. (If IV sounds a little like RDD, that’s because it is! In fact, IV is sometimes referred to as “Fuzzy RDD”.)
IV results are internally valid provided the strong first stage and exclusion restriction assumptions (above) are satisfied:
- We’ll likely have a strong first stage as long as the experiment we chose was “successful” at driving referrals. (This is important because if Z is not a strong predictor of X, the resulting second-stage estimate will be biased.) The R code above reports the F-statistic, so we can check the strength of our first stage directly. A good rule of thumb is that the F-statistic from the first stage should be at least 11.
- What about the exclusion restriction? It’s important that the instrument, Z, affect the outcome, Y, only through its effect on the endogenous regressor, X. Suppose we are instrumenting with an AB email test pushing referrals. If the control group received no email, this assumption isn’t valid: the act of getting an email could in and of itself drive retention. But if the control group received an otherwise-similar email, just without any mention of the referral program, then the exclusion restriction likely holds.
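For concreteness, the referral example might look like the following sketch, where churned (0/1), referred (number of referrals), and referral_test_bucket (1 if the user was in the treatment arm of the earlier referral-push experiment) are hypothetical column names:

library(AER)
# Instrument for referring with the old experiment's bucketing.
iv_fit <- ivreg(churned ~ referred | referral_test_bucket, data = df)
summary(iv_fit, vcov = sandwich, df = Inf, diagnostics = TRUE)
# With diagnostics = TRUE, the "Weak instruments" row reports the
# first-stage F-statistic, so the rule of thumb above can be checked directly.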
A quick note on statistical significance in IV. You may have noticed that the R code to print the IV results isn’t just a simple call to summary(fit). That’s because we have to be careful about how we compute standard errors in IV models. In particular, the standard errors have to be corrected to account for the two-stage design, which generally makes them slightly larger.
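To see why, consider a quick sketch of running the two stages by hand with lm(), using the same generic Y, X, and Z as before:

# The two stages done manually:
stage1 <- lm(X ~ Z, data = df)       # first stage
df$X_hat <- fitted(stage1)           # fitted (instrumented) values of X
stage2 <- lm(Y ~ X_hat, data = df)   # second stage
summary(stage2)
# The point estimate on X_hat matches ivreg's, but the standard errors printed
# here are not valid because they ignore that X_hat was itself estimated;
# ivreg's summary accounts for the two-stage design.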
Want to estimate the causal effects of X on Y but don’t have any good historical A/B tests on hand to leverage? A/B tests can also be implemented specifically to facilitate IV estimation! These A/B tests even come with a sexy name — Randomized Encouragement Trials.
It’s no surprise that we invest in building predictive models to understand who will do what when. It’s always fun to predict the future.
But it’s even more fun to improve that future. Where, when, how, and with whom can we intervene for better outcomes? By shedding light on the mechanisms driving the outcomes we care about, causal inference gives us the insights to focus our efforts on investments that better serve our users and our business.
Today, we briefly covered a range of methods for causal inference when A/B testing is off the table. We hope these methods will help you uncover some of the actionable insights that can move your company’s mission forward.
Comments? Suggestions? Reach out at emily@coursera.org and duncang@uber.com. Just want to do fun stuff with us? We’d love to hear from you! Plus, we’re always hiring.
Original. Reposted with permission.