The Berkson-Jekel Paradox and its Importance to Data Science

Berkson-Jekel: A Statistical Paradox in Data Science that you should know about.



If you are a Data Scientist or an aspiring one, you will know how important statistics is to the field. Statistics helps Data Scientists collect, analyze, and interpret data by identifying patterns and trends, which can then be used to make predictions.


What is a Statistical Paradox?


A statistical paradox occurs when a statistical result contradicts expectations. Pinpointing the exact cause can be difficult, because the data often cannot be understood without further analysis. However, these paradoxes are important to Data Scientists, as they give a lead on what could be causing misleading results.

Here is a list of statistical paradoxes relevant to data science:

  • Simpson's Paradox
  • Berkson's Paradox
  • The False Positive Paradox
  • The Accuracy Paradox
  • The Learnability-Godel Paradox

In this article, we will be focusing on the Berkson-Jekel paradox and its relevance to Data Science. 


What is the Berkson-Jekel Paradox?


The Berkson-Jekel paradox occurs when two variables are correlated in the data as a whole, but the correlation weakens or disappears once the data is grouped or subsetted. In layman's terms, the correlation differs between the full dataset and its subgroups, or between a selected sample and the population it came from.

The Berkson-Jekel paradox is named after the statisticians who first described it, Joseph Berkson and John Jekel. The two were studying the correlation between smoking and lung cancer when they found that, compared with the general population, people who had been hospitalized for pneumonia were more likely to have lung cancer. Further research showed that the correlation arose because smokers were hospitalized for pneumonia more often than non-smokers.
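To make the pattern concrete, here is a minimal simulation sketch of the smoking example. The probabilities, sample size, and function names are illustrative assumptions, not figures from any real study. Smoking raises the chance of both a pneumonia hospitalization and lung cancer, while the two outcomes are otherwise unrelated.

```python
import numpy as np


def simulate_smoking_example(n=200_000, seed=0):
    """Simulate a population where smoking drives both outcomes (made-up rates)."""
    rng = np.random.default_rng(seed)
    smoker = rng.random(n) < 0.30
    # Given smoking status, the two outcomes are generated independently.
    pneumonia = rng.random(n) < np.where(smoker, 0.30, 0.05)
    cancer = rng.random(n) < np.where(smoker, 0.15, 0.01)
    return smoker, pneumonia, cancer


def corr(a, b):
    """Pearson correlation between two boolean arrays."""
    return float(np.corrcoef(a.astype(float), b.astype(float))[0, 1])


smoker, pneumonia, cancer = simulate_smoking_example()

overall = corr(pneumonia, cancer)
among_smokers = corr(pneumonia[smoker], cancer[smoker])
among_nonsmokers = corr(pneumonia[~smoker], cancer[~smoker])

print(f"overall:      {overall:.3f}")
print(f"smokers:      {among_smokers:.3f}")
print(f"non-smokers:  {among_nonsmokers:.3f}")
```

In this sketch the overall correlation is clearly positive, while the correlation within smokers and within non-smokers is close to zero. The apparent link between pneumonia and lung cancer comes entirely from the smoking variable.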


Why Does This Happen?


Based on the statisticians' initial research on the Berkson-Jekel paradox, you may say that more research was needed to find the exact reason behind the correlation. However, there are also other reasons why the Berkson-Jekel paradox occurs.

  • Hidden Variables: Datasets can contain hidden variables that affect the results. When studying the correlation between two variables, data scientists and researchers may not have considered all the potential factors.
  • Sample Bias: The sample may not be representative of the population, which can lead to misleading correlations.
  • Correlation vs Causality: An important thing to remember in data science is that correlation does not imply causality. Two variables may be correlated without one causing the other.
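Sample bias in particular is easy to reproduce. The sketch below assumes two hypothetical, independent severity scores and an invented admission rule (a patient is admitted when the combined score exceeds 1.5). None of these numbers come from the studies described above.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Two conditions that are independent in the general population.
condition_a = rng.normal(size=n)
condition_b = rng.normal(size=n)

# Hypothetical selection rule: only people with a high combined
# severity end up in the hospital sample.
admitted = (condition_a + condition_b) > 1.5

corr_population = float(np.corrcoef(condition_a, condition_b)[0, 1])
corr_admitted = float(
    np.corrcoef(condition_a[admitted], condition_b[admitted])[0, 1]
)

print(f"population correlation: {corr_population:.3f}")
print(f"admitted-only correlation: {corr_admitted:.3f}")
```

The population correlation is close to zero, but the admitted-only sample shows a strong negative correlation. The selection rule alone creates it: among admitted patients, a low score on one condition must be offset by a high score on the other.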


Why is the Berkson-Jekel Paradox Important in Data Science?


Statistical reasoning is very important in Data Science, and one of its main challenges is dealing with misleading results. As a data scientist, you want to ensure that you are producing accurate results that can be used in decision-making and for future predictions. Incorrect predictions and misleading results are the last thing you want.


How to Avoid the Berkson-Jekel Paradox


There are a few methods that you can use to avoid the Berkson-Jekel Paradox:


Use Statistical Methods to Control Hidden Variables


  • Statistical modeling: You can use statistical modeling to better understand the relationship between two or more variables. This way, you can identify hidden variables that may be affecting the result.
  • Randomized controlled trials: Participants are randomly assigned to a treatment group or a control group. Random assignment helps data scientists control for hidden variables that may be affecting the results of their study.
  • Combining results: You can combine the results of multiple studies to get a fuller picture of the relationship. A hidden variable that distorts one study is less likely to distort all of them in the same way.
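As one possible illustration of the statistical modeling approach, the sketch below fits an ordinary least squares regression with and without a hidden variable. The data is synthetic, the helper `ols` is a hypothetical name, and the setup assumes the hidden variable has actually been measured, which is often not the case in practice.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000

# Synthetic data: `hidden` drives both x and y; x has no direct effect on y.
hidden = rng.normal(size=n)
x = hidden + rng.normal(size=n)
y = 2.0 * hidden + rng.normal(size=n)


def ols(columns, target):
    """Least-squares fit with an intercept; returns [intercept, *slopes]."""
    design = np.column_stack([np.ones(len(target)), *columns])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


naive = ols([x], y)             # ignores the hidden variable
adjusted = ols([x, hidden], y)  # includes it as a second predictor

print(f"slope of x, naive model:    {naive[1]:.3f}")
print(f"slope of x, adjusted model: {adjusted[1]:.3f}")
print(f"slope of hidden variable:   {adjusted[2]:.3f}")
```

In the naive model, x appears to have a sizeable effect on y. Once the hidden variable is included, the slope of x shrinks to roughly zero and the effect is attributed to the hidden variable instead.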


Variety of Data Sources


If you are dealing with misleading results because the sample is not representative of the population, one solution is to use data from a variety of sources. This will help you obtain a more representative sample of the population, investigate the variables further, and better understand the relationship between them.


Wrapping Up


Misleading outputs can hold a company back. Therefore, data professionals need to understand the limitations of the data they’re working with, the different variables and the relationships between them, and how to reduce the risk of misleading results.

If you would like to know more about Simpson’s Paradox, have a read of this: Simpson’s Paradox and its Implications in Data Science

If you would like to know more about the other statistical paradoxes, have a read of this: 5 Statistical Paradoxes Data Scientists Should Know
Nisha Arya is a Data Scientist, Freelance Technical Writer and Community Manager at KDnuggets. She is particularly interested in providing Data Science career advice, tutorials, and theory-based knowledge around Data Science. She also wishes to explore the different ways Artificial Intelligence benefits or can benefit the longevity of human life. She is a keen learner, seeking to broaden her tech knowledge and writing skills whilst helping guide others.