The Most Common Statistical Traps in FAANG Interviews
Five key statistical traps in FAANG interviews that test your ability to question data, spot bias, and think critically

# Introduction
When applying for a job at Meta (formerly Facebook), Apple, Amazon, Netflix, or Alphabet (Google) — collectively known as FAANG — interviews rarely test whether you can recite textbook definitions. Instead, interviewers want to see whether you analyze data critically and whether you would identify a bad analysis before it ships to production. Statistical traps are one of the most reliable ways to test that.

These pitfalls replicate the kinds of decisions that analysts face on a daily basis: a dashboard number that looks fine but is actually misleading, or an experiment result that seems actionable but contains a structural flaw. The interviewer already knows the answer. What they are watching is your thought process, including whether you ask the right questions, notice missing information, and push back on a number that looks good at first sight. Candidates stumble over these traps repeatedly, even those with strong mathematical backgrounds.
We will examine five of the most common traps.
# Understanding Simpson's Paradox
This trap aims to catch people who unquestioningly trust aggregated numbers.
Simpson's paradox happens when a trend appears in different groups of data but vanishes or reverses when combining those groups. The classic example is UC Berkeley's 1973 admissions data: overall admission rates favored men, but when broken down by department, women had equal or better admission rates. The aggregate number was misleading because women applied to more competitive departments.
The paradox can arise whenever the subgroups have different sizes and different base rates. Recognizing that mechanism is what separates a surface-level answer from a deep one.
In interviews, a question might look like this: "We ran an A/B test. Overall, variant B had a higher conversion rate. However, when we break it down by device type, variant A performed better on both mobile and desktop. What is happening?" A strong candidate refers to Simpson's paradox, clarifies its cause (group proportions differ between the two variants), and asks to see the breakdown rather than trust the aggregate figure.
Interviewers use this to check whether you instinctively ask about subgroup distributions. If you just report the overall number, you have lost points.
## Demonstrating With A/B Test Data
The following pandas snippet shows how the aggregate rate can mislead.

```python
import pandas as pd

# A wins on both devices individually, but B wins in aggregate
# because B gets most of its traffic from higher-converting mobile.
data = pd.DataFrame({
    'device': ['mobile', 'mobile', 'desktop', 'desktop'],
    'variant': ['A', 'B', 'A', 'B'],
    'converts': [90, 720, 90, 5],
    'visitors': [100, 900, 900, 100],
})
data['rate'] = data['converts'] / data['visitors']
print('Per device:')
print(data[['device', 'variant', 'rate']].to_string(index=False))
print('\nAggregate (misleading):')
agg = data.groupby('variant')[['converts', 'visitors']].sum()
agg['rate'] = agg['converts'] / agg['visitors']
print(agg['rate'])
```

Output:

```
Per device:
 device variant  rate
 mobile       A  0.90
 mobile       B  0.80
desktop       A  0.10
desktop       B  0.05

Aggregate (misleading):
variant
A    0.180
B    0.725
Name: rate, dtype: float64
```

Variant A converts better on every device, yet the pooled numbers show B far ahead, purely because B's traffic is concentrated on the high-converting mobile segment.
# Identifying Selection Bias
This test lets interviewers assess whether you think about where data comes from before analyzing it.
Selection bias arises when the data you have is not representative of the population you are trying to understand. Because the bias sits in the data collection process rather than in the analysis, it is easy to overlook.
Consider these possible interview framings:
- "We analyzed a survey of our users and found that 80% are satisfied with the product. Does that tell us our product is good?" A solid candidate points out that satisfied users are more likely to respond to surveys, so the 80% figure probably overstates satisfaction: unhappy users are the ones most likely to have skipped it.
- "We examined customers who churned last quarter and found they mostly had poor engagement scores. Should we focus on engagement to reduce churn?" The trap is that looking only at churned users tells you nothing by itself. Without the engagement scores of users who stayed, you cannot tell whether low engagement actually predicts churn or is simply common across the whole user base.
A related variant worth knowing is survivorship bias: you only observe the outcomes that made it through some filter. If you analyze only successful products to learn why they succeeded, you ignore the failed products that shared the very traits you are crediting as strengths.
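To make survivorship bias concrete, here is a small NumPy sketch. The "bold strategy" trait and the survival threshold are invented for illustration: the trait raises the variance of outcomes but not the average, yet it dominates among the survivors.

```python
import numpy as np

np.random.seed(7)
n = 100_000
# Hypothetical trait: a "bold" strategy that raises the variance
# of outcomes but NOT the mean (expected outcome is zero either way)
bold = np.random.binomial(1, 0.5, n)
outcome = np.random.normal(0, 1 + 2 * bold)
# We only observe the big successes, i.e. the survivors
survived = outcome > 2.0

print(f"Bold share in the full population: {bold.mean():.0%}")
print(f"Bold share among survivors:        {bold[survived].mean():.0%}")
```

Both groups have the same expected outcome, yet roughly nine in ten survivors are "bold", so an analysis restricted to survivors would wrongly credit boldness for success.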
## Simulating Survey Non-Response
We can simulate how non-response bias skews results using NumPy.

```python
import numpy as np

np.random.seed(42)
# Simulate 1,000 users; half are satisfied, half are not
satisfaction = np.random.choice([0, 1], size=1000, p=[0.5, 0.5])
# Response probability: 80% for satisfied users, 20% for unsatisfied
response_prob = np.where(satisfaction == 1, 0.8, 0.2)
responded = np.random.rand(1000) < response_prob
print(f"True satisfaction rate: {satisfaction.mean():.2%}")
print(f"Survey satisfaction rate: {satisfaction[responded].mean():.2%}")
```

The true rate lands near 50%, but the surveyed rate lands near 80%: among respondents, satisfied users outnumber unsatisfied ones four to one (0.8 × 0.5 versus 0.2 × 0.5), so the survey inflates satisfaction by roughly 30 percentage points.
Interviewers use selection bias questions to see if you separate "what the data shows" from "what is true about users."
# Preventing p-Hacking
p-hacking (also called data dredging) happens when you run many tests and only report the ones with \( p < 0.05 \).
The issue is that a \( p \)-value controls the false positive rate for a single test. Run 20 tests on pure noise at a 5% significance level and you expect one false positive; the chance of at least one is \( 1 - 0.95^{20} \approx 64\% \). Fishing through many tests for a significant result inflates the false discovery rate.
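This arithmetic is worth having ready: with \( m \) independent tests at significance level \( \alpha \), the probability of at least one false positive is \( 1 - (1 - \alpha)^m \). A quick sketch:

```python
# P(at least one false positive) across m independent tests at alpha = 0.05
alpha = 0.05
for m in (1, 5, 15, 20, 100):
    fwer = 1 - (1 - alpha) ** m
    print(f"{m:>3} tests -> {fwer:.0%} chance of at least one false positive")
```

Fifteen experiments already carry about a 54% chance of at least one spurious "win", and at 100 tests a false positive is all but guaranteed.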
An interviewer might ask: "Last quarter we ran fifteen feature experiments. Three came out significant at \( p < 0.05 \). Should we ship all three?" A weak answer says yes.
A strong answer first asks what the hypotheses were before the tests ran, whether the significance threshold was fixed in advance, and whether the team corrected for multiple comparisons.
The follow-up often involves how you would design experiments to avoid this. Pre-registering hypotheses before data collection is the most direct fix, as it removes the option to decide after the fact which tests were "real."
## Watching False Positives Accumulate
We can observe how false positives occur by chance using SciPy.

```python
import numpy as np
from scipy import stats

np.random.seed(0)
# 20 A/B tests where the null hypothesis is TRUE (no real effect)
n_tests, alpha = 20, 0.05
false_positives = 0
for _ in range(n_tests):
    a = np.random.normal(0, 1, 1000)
    b = np.random.normal(0, 1, 1000)  # identical distribution!
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1
print(f'Tests run: {n_tests}')
print(f'False positives (p<0.05): {false_positives}')
print(f'Expected by chance alone: {n_tests * alpha:.0f}')
```

The exact count varies with the seed, but across many runs it averages one false positive per twenty tests, exactly what \( \alpha = 0.05 \) predicts.
Even with zero real effect, roughly 1 in 20 tests clears \( p < 0.05 \) by chance. If a team runs 15 experiments and reports only the significant ones, some of those "wins" may be nothing but noise. It is equally important to treat exploratory analysis as hypothesis generation rather than confirmation: before anyone acts on an exploratory result, a confirmatory experiment is required.
# Managing Multiple Testing
This trap is closely related to p-hacking, but it is worth understanding on its own.
The multiple testing problem is the formal statistical issue: when you run many hypothesis tests simultaneously, the probability of at least one false positive grows quickly. Even if the treatment has no effect, you should anticipate roughly five false positives if you test 100 metrics in an A/B test and declare anything with \( p < 0.05 \) as significant. The corrections for this are well known: Bonferroni correction (divide alpha by the number of tests) and Benjamini-Hochberg (controls the false discovery rate rather than the family-wise error rate).
Bonferroni is a conservative approach: for example, if you test 50 metrics, your per-test threshold drops to 0.001, making it harder to detect real effects. Benjamini-Hochberg is more appropriate when you are willing to accept some false discoveries in exchange for more statistical power.
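Both corrections are easy to sketch with NumPy. The p-values below are simulated for illustration (45 null metrics plus five planted real effects): Bonferroni shrinks the per-test threshold, while Benjamini-Hochberg compares the sorted p-values against a rising threshold.

```python
import numpy as np

np.random.seed(1)
alpha, m = 0.05, 50
# Simulated p-values: 45 null metrics (uniform) + 5 real effects (tiny p)
pvals = np.concatenate([np.random.uniform(0, 1, 45),
                        np.random.uniform(0, 0.0001, 5)])

# Bonferroni: compare every p-value to alpha / m
bonf_reject = pvals < alpha / m

# Benjamini-Hochberg: sort p-values, find the largest rank k with
# p_(k) <= (k / m) * alpha, and reject everything up to that rank
order = np.argsort(pvals)
ranked = pvals[order]
thresholds = alpha * np.arange(1, m + 1) / m
below = ranked <= thresholds
k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
bh_reject = np.zeros(m, dtype=bool)
bh_reject[order[:k]] = True

print(f"Naive (p < 0.05):   {np.sum(pvals < alpha)} rejections")
print(f"Bonferroni:         {bonf_reject.sum()} rejections")
print(f"Benjamini-Hochberg: {bh_reject.sum()} rejections")
```

In practice you would reach for a library implementation (e.g. statsmodels), but being able to describe the two procedures at this level of detail is exactly what the interviewer is probing for.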
In interviews, this comes up when discussing how a company tracks experiment metrics. A question might be: "We monitor 50 metrics per experiment. How do you decide which ones matter?" A solid response discusses pre-specifying primary metrics prior to the experiment's execution and treating secondary metrics as exploratory while acknowledging the issue of multiple testing.
Interviewers are trying to find out whether you understand that running more tests produces more noise, not more information.
# Addressing Confounding Variables
This trap catches candidates who treat correlation as causation without asking what else might explain the relationship.
A confounding variable is one that influences both the independent and dependent variables, creating the illusion of a direct relationship where none exists.
The classic example: ice cream sales and drowning rates are correlated, but the confounder is summer heat; both go up in warm months. Acting on that correlation without accounting for the confounder leads to bad decisions.
Confounding is particularly dangerous in observational data. Unlike a randomized experiment, observational data does not distribute potential confounders evenly between groups, so differences you see might not be caused by the variable you are studying at all.
A common interview framing is: "We noticed that users who use our mobile app more tend to have significantly higher revenue. Should we push notifications to increase app opens?" A weak candidate says yes. A strong one asks what kind of user opens the app frequently to begin with: likely the most engaged, highest-value users.
Engagement drives both app opens and spending. The app opens are not causing revenue; they are a symptom of the same underlying user quality.
Interviewers use confounding to test whether you distinguish correlation from causation before drawing conclusions, and whether you would push for randomized experimentation or propensity score matching before recommending action.
## Simulating A Confounded Relationship
```python
import numpy as np
import pandas as pd

np.random.seed(42)
n = 1000
# Confounder: user quality (0 = low, 1 = high)
user_quality = np.random.binomial(1, 0.5, n)
# App opens driven by user quality, not independent
app_opens = user_quality * 5 + np.random.normal(0, 1, n)
# Revenue also driven by user quality, not by app opens
revenue = user_quality * 100 + np.random.normal(0, 10, n)
df = pd.DataFrame({
    'user_quality': user_quality,
    'app_opens': app_opens,
    'revenue': revenue,
})

# Naive correlation looks strong but is misleading
naive_corr = df['app_opens'].corr(df['revenue'])

# Within-group correlation (controlling for the confounder) is near zero
low, high = df[df['user_quality'] == 0], df[df['user_quality'] == 1]
corr_low = low['app_opens'].corr(low['revenue'])
corr_high = high['app_opens'].corr(high['revenue'])

print(f"Naive correlation (app opens vs revenue): {naive_corr:.2f}")
print("Correlation controlling for user quality:")
print(f"  Low-quality users: {corr_low:.2f}")
print(f"  High-quality users: {corr_high:.2f}")
```

Output:

```
Naive correlation (app opens vs revenue): 0.91
Correlation controlling for user quality:
  Low-quality users: 0.03
  High-quality users: -0.07
```
The naive number looks like a strong signal. Once you control for the confounder, it disappears entirely. Interviewers who see a candidate run this kind of stratified check (rather than accepting the aggregate correlation) know they are talking to someone who will not ship a broken recommendation.
# Wrapping Up
All five of these traps have something in common: they require you to slow down and question the data before accepting what the numbers seem to show at first glance. Interviewers use these scenarios specifically because your first instinct is often wrong, and the depth of your answer after that first instinct is what separates a candidate who can work independently from one who needs direction on every analysis.

None of these ideas are difficult to understand, and interviewers inquire about them because they are typical failure modes in real data work. The candidate who recognizes Simpson's paradox in a product metric, catches a selection bias in a survey, or questions whether an experiment result survived multiple comparisons is the one who will ship fewer bad decisions.
If you go into FAANG interviews with a reflex to ask the following questions, you are already ahead of most candidates:
- How was this data collected?
- Are there subgroups that tell a different story?
- How many tests contributed to this result?
Beyond helping in interviews, these habits can also prevent bad decisions from reaching production.
Nate Rosidi is a data scientist and product strategist. He is also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.