Demystifying Statistical Significance
With more professionals from a wide range of less technical fields diving into statistical analysis and data modeling, these experimental techniques can seem daunting. To help with these hurdles, this article clarifies some misconceptions around p-values, hypothesis testing, and statistical significance.
By Omar Martinez, Arcalea.
As statistical analysis and data modeling have become a necessity in almost every field of study, professionals from non-quantitative areas are increasingly using statistical tools to better understand specific events and their outcomes.
As a marketer formally trained in data analytics, it is certainly satisfying to see more and more professionals now implementing processes analogous to a scientific approach. This is happening in many areas that previously showed a serious bias towards dated assumptions and misleading advice. To contribute to that paradigm shift, today we’ll attempt to clarify some misconceptions around p-values, hypothesis testing, and statistical significance in general.
Let’s start with the fundamentals. Statistical inference is the process of estimating information about a population, such as its parameters, from statistics computed on a random sample of that same population.
For example, imagine that you work for a big online fashion retailer and that you’ve recently launched a campaign that gives a $10 coupon to every user that added a product to the cart but didn’t complete a purchase. Now suppose that after a few weeks, the campaign has some success, and you see an uplift in conversion rate. After seeing this, one of your coworkers suggests increasing the coupon value to $20 because he thinks that people generally spend more when they get a bigger discount. However, as a seasoned marketer, you know that a good dose of healthy skepticism has never hurt anyone.
For that reason, it would be wise to first test this assumption on a subset of your users, namely a random sample. This is exactly what statistical inference is for: you are trying to estimate a population parameter (the population conversion rate) with a statistic from a random sample (the sample conversion rate).
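To make the distinction between a parameter and a statistic concrete, here is a minimal simulation sketch. The true conversion rate, the sample size, and the helper name `sample_conversion_rate` are all assumptions chosen for illustration; none of them come from the campaign data.

```python
import random

rng = random.Random(42)   # fixed seed so the run is repeatable
TRUE_RATE = 0.05          # assumed population conversion rate (illustrative)
SAMPLE_SIZE = 2000        # assumed number of users in each random sample

def sample_conversion_rate(rate, n):
    """Simulate n users who each convert independently with probability `rate`."""
    conversions = sum(rng.random() < rate for _ in range(n))
    return conversions / n

# Every random sample yields a slightly different estimate of the same parameter.
estimates = [sample_conversion_rate(TRUE_RATE, SAMPLE_SIZE) for _ in range(5)]
for i, p_hat in enumerate(estimates, start=1):
    print(f"sample {i}: p-hat = {p_hat:.2%}")
```

The spread between the printed estimates is the sampling variability that a hypothesis test has to account for.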
Since we have determined that we need to use statistical inference to answer our original question, now it’s necessary to delve into the concept of hypothesis testing before we actually implement and evaluate our experiment (increasing the coupon value). In order to do this, first, we need to formally establish our research question. Following from the case we presented above, the question we’d like to answer would be along the lines of:
Will the “$20 off” campaign conversion rate be different from the “$10 off” campaign conversion rate?
Put differently, we'd like to know if users that get a $10 coupon and users that get a $20 coupon are equally likely to make a purchase. This means that our two competing claims (hypotheses) are:
H0: p$10 = p$20
H1: p$10 ≠ p$20
Notice that we use p and not p̂ to define our hypotheses because we are not interested in determining if there’s a difference between the conversion rates of the samples. Rather, we are interested in inferring population parameters. In other words, we’d like to know if there is a difference in the population conversion rates.
After defining the hypotheses, suppose that you’ve run the experiment and collected the following data:
Data from the two campaigns.
| |“$10 Off” Campaign|“$20 Off” Campaign|
|Users that converted = x|220|280|
|Users that didn’t convert|4,310|4,954|
|Total users = n|4,530|5,234|
|p̂ (Conversion Rate)|~ 4.86%|~ 5.35%|
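The conversion rates in the table follow directly from the counts. A minimal sketch of that arithmetic, where the dictionary layout and variable names are illustrative only:

```python
# Counts taken from the table; the dictionary keys are illustrative labels.
campaigns = {
    "$10 off": {"converted": 220, "total": 4530},
    "$20 off": {"converted": 280, "total": 5234},
}

# Sample conversion rate: p-hat = x / n for each campaign.
rates = {name: c["converted"] / c["total"] for name, c in campaigns.items()}

for name, rate in rates.items():
    print(f"{name}: p-hat = {rate:.2%}")

# Observed difference in sample proportions (p-hat_$10 minus p-hat_$20).
observed_diff = rates["$10 off"] - rates["$20 off"]
print(f"difference: {observed_diff:.2%}")
```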
At first glance, we can see that the conversion rates for our samples are different. However, this difference in conversion rate could be happening due to random chance/sampling variability, and our population’s conversion rates could actually tell a different story. To see if that’s the case, we need to set a significance level and calculate a p-value that will help us decide if we can reject or fail to reject the null hypothesis.
Evaluating the results
The next steps after collecting the data are to check the conditions for inference and set a significance level. For this experiment, we’ll set the significance level to 0.1, which means that we accept a 10% risk of concluding that there’s a difference between our population conversion rates when in fact there is none. This kind of mistake is known as a Type I error.
To process the results of the experiment, you can make a copy of the following Google Sheet template, or you can use the Python script below. In Python, this task is relatively straightforward if we use the Statsmodels library.
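A minimal sketch of what such a script could look like, assuming a two-sided, pooled two-proportion z-test. The helper name `two_proportion_ztest` and the printed messages are hypothetical. The sketch uses only the standard library so that it runs without extra dependencies; with Statsmodels installed, `proportions_ztest` from `statsmodels.stats.proportion` should return a matching statistic and p-value for the same counts.

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided pooled z-test for H0: p1 = p2 (hypothetical helper)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)   # common conversion rate assumed under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail area of the standard normal: P(|Z| >= |z|)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

alpha = 0.1  # significance level chosen for this experiment
z, p_value = two_proportion_ztest(220, 4530, 280, 5234)

print(f"z = {z:.3f}, p-value = {p_value:.3f}")
if p_value < alpha:
    print("The result is statistically significant: reject H0.")
else:
    print("The result is not statistically significant: fail to reject H0.")
```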
In the output of the code above, we get a p-value of ~0.27, or ~27%, along with a message indicating that our result is not statistically significant. Let’s try to understand why by clarifying what the p-value is actually telling us.
First, let’s look at what the p-value is not telling us:
- The p-value is not the probability of the H0 (null hypothesis) being true.
- The p-value is not the probability of the H1 (alternative hypothesis) being true.
- The p-value is not the probability of observing a difference in proportions (or means, depending on the parameter you’ve selected for your test).
The p-value is a conditional probability, which means that it is a probability computed under the assumption that a specific condition is true. Formally, the p-value is the conditional probability of observing an outcome at least as extreme as the one we got, given that the null hypothesis is true.
p-value = P(observed or more extreme outcome | H0 true)
What does this mean in our example? Suppose that H0 is true, which would mean that our population conversion rates are equal (p$10 = p$20). Put differently, both campaigns convert users at the same rate. The p-value then tells us the probability of seeing a difference in sample proportions at least as extreme as the one we observed (p̂$10 - p̂$20) if the two population proportions were actually the same.
You can avoid misinterpreting the p-value by remembering that it is not the probability of one of the hypotheses being true. Rather, it is the probability of obtaining a sample result at least as extreme as the one observed if the null hypothesis were true.
If our p-value is 0.27, this means that there’s a 27% chance of seeing a difference in sample proportions at least as large as the one we observed (-0.49 percentage points, in either direction) if users from our population were equally likely to convert under both campaigns. Roughly 1 out of 4 times, we would get that difference or a greater one simply by random chance.
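One way to check the “1 out of 4 times” intuition is to simulate the experiment many times in a world where H0 holds. The sketch below is only an approximation under stated assumptions: it treats the pooled sample rate as the true shared conversion rate, and the number of simulations, the seed, and the helper name `simulated_rate` are arbitrary choices.

```python
import random

rng = random.Random(0)                       # fixed seed for repeatability
n_10, n_20 = 4530, 5234                      # sample sizes from the experiment
pooled_rate = (220 + 280) / (n_10 + n_20)    # shared rate assumed under H0
observed_gap = abs(220 / n_10 - 280 / n_20)  # |p-hat_$10 - p-hat_$20|

def simulated_rate(n, rate):
    """Conversion rate of one simulated sample of n users."""
    return sum(rng.random() < rate for _ in range(n)) / n

N_SIMULATIONS = 500                          # arbitrary; more runs, less noise
extreme = 0
for _ in range(N_SIMULATIONS):
    gap = abs(simulated_rate(n_10, pooled_rate) - simulated_rate(n_20, pooled_rate))
    if gap >= observed_gap:                  # at least as extreme as observed
        extreme += 1

simulated_p = extreme / N_SIMULATIONS
print(f"share of simulations at least as extreme: {simulated_p:.2f}")
```

The printed share is a noisy estimate of the p-value, so it should land in the neighborhood of 0.27 rather than match it exactly.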
Finally, we compare our p-value (0.27) to our significance level (0.1) to decide whether we reject or fail to reject the null hypothesis. Because our p-value is greater than our significance level (0.27 > 0.1), we fail to reject the null hypothesis and conclude that the data do not provide convincing evidence of a difference between the conversion rates of our two campaigns. Without evidence that the larger coupon converts better, we would be better off carrying on with the “$10 off” campaign.
Something important to notice here is that the significance level we choose will have an effect on whether we conclude that there’s enough evidence to reject the null hypothesis. For that reason, I’d encourage you to play out all of the different scenarios before selecting a significance level.
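To play out a few of those scenarios, the sketch below compares the same p-value against several candidate significance levels. Only 0.1 is the level used in this experiment; the other values are illustrative.

```python
p_value = 0.27  # approximate p-value obtained for the experiment

# Candidate significance levels: 0.1 is the one used here, the rest are illustrative.
candidate_alphas = (0.01, 0.05, 0.1, 0.3)

decisions = {
    alpha: "reject H0" if p_value < alpha else "fail to reject H0"
    for alpha in candidate_alphas
}

for alpha, decision in decisions.items():
    print(f"alpha = {alpha}: {decision}")
```

With this p-value, only a very permissive level such as 0.3 would lead us to reject the null hypothesis.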
Having gone through the whole process, I believe it is important to take some time to discuss what the term “statistical significance” actually means. As we saw earlier, what we are trying to do is gather evidence and evaluate how much this evidence agrees or disagrees with a null hypothesis. How much the data has to disagree with the null before we reject it depends entirely on us as subject-matter experts (SMEs). Therefore, it is you, the SME, who should be making the final decisions and selecting the appropriate significance level for the experiment.
Finally, keep in mind that you should always include a p-value when presenting the results of a hypothesis test, regardless of the outcome. Without an accompanying p-value, the term “statistically significant” carries no meaning whatsoever.
Hopefully, by now, you’ve gained some powerful knowledge that will make the output of your experiments much more valuable in practice.