Hypothesis Testing in Data Science
Defining a hypothesis allows you to collect data effectively and determine whether it provides enough evidence to support your hypothesis.
The word ‘Hypothesis’ originates from the Greek words ‘hupo’, which means under, and ‘thesis’, which means placing. It refers to an idea inferred from limited evidence that can be used as a starting point for further investigation.
So you can say that a ‘Hypothesis’ is an informed guess, but that doesn’t mean the evidence can’t end up supporting it.
What is Hypothesis Testing?
Hypothesis Testing is a systematic procedure for deciding whether the data from a research study support a particular theory about a population.
We do this by stating two mutually exclusive hypotheses about a population and evaluating which of them the sample data support.
When to Use Hypothesis Testing in Data Science?
Use hypothesis testing when you want to check your results against what you predicted. It also allows you to compare results before and after a change.
It is generally used when we want to compare:
- A single group with an external standard
- Two or more groups with each other
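The second case, comparing two groups with each other, can be sketched with a permutation test. This is only one of several possible approaches, and it is shown here because it needs nothing beyond the Python standard library. The function name `permutation_test`, the page-load scenario, and all of the numbers are made up for illustration.

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, n_permutations=5_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference (mean of A minus mean of B)
    and an approximate p-value.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable,
        # so shuffle the pooled values and split them again.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed) - 1e-12:
            at_least_as_extreme += 1

    # The add-one correction keeps the estimated p-value above zero.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value

# Hypothetical page-load times (seconds) for two versions of a web page.
version_a = [12.1, 11.8, 12.6, 13.0, 12.4, 11.9, 12.8, 12.2]
version_b = [11.2, 11.5, 10.9, 11.8, 11.1, 11.6, 11.0, 11.4]

observed_diff, p_value = permutation_test(version_a, version_b)
print(f"observed difference: {observed_diff:.3f}, p-value: {p_value:.4f}")
```

A small p-value here means that a difference as large as the observed one rarely appears when the group labels are shuffled at random.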
Hypothesis Testing vs Hypothesis Generation
In the world of Data Science, there are two parts to consider when putting together a hypothesis.
Hypothesis Testing is when the team builds a strong hypothesis based on the available dataset. This will help direct the team and plan accordingly throughout the data science project. The hypothesis is then tested against the complete dataset to determine which of the following the data support:
- Null hypothesis - There’s no effect on the population
- Alternative hypothesis - There’s an effect on the population
Hypothesis Generation is an educated guess based on various factors that can be used to resolve the problem at hand. It is the process of combining our problem-solving skills with our business intuition. You will focus on how specific factors impact the target variable and then move on to conclude the relationship between the variables using hypothesis testing.
Different Types of Hypothesis Testing
A null hypothesis states that there is no relationship between statistical variables, and testing it is referred to as null hypothesis testing. A null hypothesis is represented as H0. There are four types of null hypotheses:
- Simple Hypothesis
- Composite Hypothesis
- Exact Hypothesis
- Inexact Hypothesis
An alternative hypothesis states that there is a relationship between two variables, meaning that they have a statistical bond. An alternative hypothesis is represented as H1 or HA. The alternative hypothesis can be split into:
- One-tailed. This is when you are testing in one direction and disregarding the possibility of an effect in the other direction. The hypothesis is that the sample mean is higher than the population mean, or that it is lower, but not both.
- Two-tailed. This is when you are testing in both directions, checking whether the sample mean is either higher or lower than the mean of a population.
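The difference between the two kinds of test shows up in how the p-value is computed from the same test statistic. The sketch below assumes the statistic is a z score that follows a standard normal distribution under the null hypothesis. The helper name `p_values_from_z` and the value 1.8 are illustrative only.

```python
from statistics import NormalDist

def p_values_from_z(z):
    """Return one-tailed (upper, lower) and two-tailed p-values for a z statistic."""
    std_normal = NormalDist()           # mean 0, standard deviation 1
    upper = 1 - std_normal.cdf(z)       # H1: the sample mean is higher
    lower = std_normal.cdf(z)           # H1: the sample mean is lower
    two_tailed = 2 * min(upper, lower)  # H1: the sample mean differs in either direction
    return {"upper": upper, "lower": lower, "two_tailed": two_tailed}

result = p_values_from_z(1.8)  # hypothetical z statistic
for name, p in result.items():
    print(f"{name}: {p:.4f}")
```

With z = 1.8 the upper-tail p-value is about 0.036 and the two-tailed p-value is about 0.072, so the same data would be significant at the 0.05 level under a one-tailed test but not under a two-tailed one. This is why the direction needs to be chosen before looking at the data.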
A non-directional hypothesis does not state a direction but states that one factor affects another, or that there is a correlation between two variables. The main point is that no direction is specified for the relationship between the 2 variables.
A directional hypothesis is built on a specific directional relationship between two variables and is based upon existing theory.
What’s its use in Data Science?
When working with data, you need to ask questions before looking at it, manipulating it, or performing any form of analysis. Asking questions will help you in the preparation stage, making your analysis easier.
Data Scientists will generate different questions that need to be answered to enhance the performance of a business. These questions will help direct the data science project, making it more effective towards the decision-making process.
For example, when asking questions and coming together to form a hypothesis, data scientists can carefully consider which variables will impact their project and which do not need to be taken into consideration.
Forming a hypothesis helps data scientists to:
- Get a better understanding of the business problem at hand and dig deeper into the variables in the dataset.
- Conclude which factors are essential to solving the problem, and avoid spending time on factors that are not.
- Prepare for the analysis by collecting data from various sources that are fundamental to the business problem.
Being able to cross out possibilities by using hypothesis testing helps data scientists draw better conclusions. They will be able to spend more time on the problem at hand and arrive at well-supported findings to present to executives.
Other Terminology for Hypothesis Testing
Parameter is a summary description of the target population. For example, if you were given the task of finding the average height of your classmates, you would ask everyone in your class (population) about their height. Because everyone in the population was asked, the average you calculate is a true description of it, which makes it a parameter.
Statistic is a description of a small portion of a population (sample). Using the same example, suppose you are now given the task of finding the average height of your age group (population). You can use the information that you gathered from your class (sample) to estimate it. This type of information is known as a statistic.
Sampling Distribution is the probability distribution of a statistic obtained from a large number of samples drawn from a specific population. For example, suppose you draw a random sample of 10 coffee shops in your borough from a population of 200 coffee shops. The random sample could be coffee shop numbers 4, 7, 13, 76, 94, 145, 11, 189, 52, 165, or any other combination. Calculating the same statistic for each of these possible samples gives its sampling distribution.
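The coffee shop example can be simulated to show what a sampling distribution looks like in practice. The sketch below invents a population of 200 shops with made-up daily sales figures, draws many random samples of 10, and records the mean of each one. The data, the seed, and the helper name `sample_means` are all assumptions for illustration.

```python
import random
from statistics import mean, stdev

rng = random.Random(42)

# Hypothetical population: daily cups sold at each of 200 coffee shops.
population = [rng.randint(100, 500) for _ in range(200)]

def sample_means(population, sample_size, n_samples, rng):
    """Draw repeated random samples (without replacement) and record each mean."""
    return [mean(rng.sample(population, sample_size)) for _ in range(n_samples)]

# The collected means approximate the sampling distribution of the mean.
means = sample_means(population, sample_size=10, n_samples=2_000, rng=rng)

print(f"population mean:            {mean(population):.1f}")
print(f"mean of the sample means:   {mean(means):.1f}")
print(f"spread of the sample means: {stdev(means):.1f}")
```

The sample means cluster around the population mean, and their spread is much narrower than the spread of the individual shops.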
Standard Error is similar to standard deviation in that both measure spread. The higher the value, the greater the spread. The difference is that standard deviation describes how much individual observations vary, whereas standard error describes how much a sample statistic, such as the mean, varies from sample to sample. The standard error tells you how far your sample statistic is likely to be from the actual population value.
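For the sample mean, the standard error is commonly estimated as the sample standard deviation divided by the square root of the sample size. A minimal sketch, using invented height data for illustration:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical heights (cm) of 12 classmates, treated as a sample.
heights = [158, 162, 165, 167, 168, 170, 171, 173, 175, 176, 180, 183]

sample_mean = mean(heights)
sample_sd = stdev(heights)                       # spread of individual heights
standard_error = sample_sd / sqrt(len(heights))  # estimated spread of the sample mean

print(f"sample mean:        {sample_mean:.2f}")
print(f"standard deviation: {sample_sd:.2f}")
print(f"standard error:     {standard_error:.2f}")
```

Because of the division by the square root of the sample size, the standard error shrinks as the sample grows, while the standard deviation does not shrink in the same way.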
Type-I error, also known as a false positive, happens when the team incorrectly rejects a true null hypothesis. This means that the report states that your findings are significant when they have in fact occurred by chance.
Type-II error, also known as a false negative, happens when the team fails to reject a null hypothesis that is in fact false. This means that the report states that your findings are not significant when they actually are.
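Both kinds of error can be seen by simulating many experiments in which the truth is known. The sketch below assumes a two-tailed one-sample z-test with a known population standard deviation. The function names, the means of 100 and 105, the standard deviation of 15, the sample size of 30, and the 0.05 threshold are all illustrative choices.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def rejects_null(sample, null_mean, sigma, alpha):
    """Two-tailed one-sample z-test with known sigma; True if H0 is rejected."""
    z = (fmean(sample) - null_mean) / (sigma / sqrt(len(sample)))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha

def rejection_rate(true_mean, null_mean=100.0, sigma=15.0, n=30,
                   alpha=0.05, n_trials=4_000, seed=1):
    """Fraction of simulated experiments in which H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        rejections += rejects_null(sample, null_mean, sigma, alpha)
    return rejections / n_trials

# H0 is true (the true mean equals the null mean): every rejection is a Type I error.
type_1_rate = rejection_rate(true_mean=100.0)

# H0 is false (the true mean is 105): every failure to reject is a Type II error.
type_2_rate = 1 - rejection_rate(true_mean=105.0)

print(f"Type I error rate:  {type_1_rate:.3f}")
print(f"Type II error rate: {type_2_rate:.3f}")
```

In this setup the Type I error rate stays close to the chosen threshold, while the Type II error rate depends on how far the true mean is from the null value and on the sample size.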
The level of significance
The level of significance is the maximum probability of making a false positive conclusion (Type I error) that you are willing to accept. Data scientists, researchers, and other analysts set this in advance and use it as a threshold for statistical significance.
P-value means probability value. It is the probability of observing results at least as extreme as those in your sample, assuming the null hypothesis is true, and it is compared to the significance level to decide whether to reject the null hypothesis. If the p-value is higher than the significance level, you fail to reject the null hypothesis and the results are not statistically significant. This does not prove that the null hypothesis is true. However, if the p-value is lower than the significance level, you reject the null hypothesis in favour of the alternative, and the results are seen as statistically significant.
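The decision rule can be sketched as a comparison between a computed p-value and a threshold chosen in advance. The example below uses a large-sample z-test of one group against an external standard, which is an assumption made for simplicity rather than the only suitable test. The delivery-time data are simulated, and the names `one_sample_z_test` and `decide` are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def one_sample_z_test(sample, null_mean):
    """Two-tailed z-test using the sample standard deviation
    (a large-sample approximation)."""
    standard_error = stdev(sample) / sqrt(len(sample))
    z = (mean(sample) - null_mean) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

def decide(p_value, alpha=0.05):
    """Compare the p-value with the significance level set in advance."""
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"

# Simulated delivery times (minutes) for 40 orders; the external standard is 30.
rng = random.Random(7)
delivery_times = [rng.gauss(31.0, 4.0) for _ in range(40)]

z, p_value = one_sample_z_test(delivery_times, null_mean=30.0)
decision = decide(p_value, alpha=0.05)
print(f"z = {z:.2f}, p-value = {p_value:.4f}, decision: {decision}")
```

Note that "fail to reject H0" is worded deliberately: a high p-value means the data do not provide enough evidence against the null hypothesis, not that the null hypothesis has been shown to be true.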
This article is an introduction to hypothesis testing and why data scientists use it. Hypothesis testing is an important element of a data scientist's workflow. It provides them with more confidence in their hypothesis and allows them to present their work to executives without hesitation.
If you want to know more about hypothesis testing, a good read is Hypothesis Testing: An Intuitive Guide for Making Data-Driven Decisions.
Nisha Arya is a Data Scientist and Freelance Technical Writer. She is particularly interested in providing Data Science career advice or tutorials and theory-based knowledge around Data Science. She also wishes to explore the different ways Artificial Intelligence is/can benefit the longevity of human life. A keen learner, she seeks to broaden her tech knowledge and writing skills whilst helping guide others.