
KDnuggets Home » News » 2016 » Jun » Tutorials, Overviews » How to Compare Apples and Oranges, Part 2 – Categorical Variables ( 16:n22 )

# How to Compare Apples and Oranges, Part 2 – Categorical Variables

In the previous article, we looked at some of the ways to compare different numerical variables. In this article, we shall look at techniques to compare categorical variables with the help of an example.

### i) What is a p-value?

Suppose you have a null hypothesis (in the above case, that Gender and OS are independent of each other). The p-value helps you evaluate how well your data agree with it, and statistical tests use the p-value to decide whether or not to reject the null hypothesis. More precisely, the p-value is the probability of observing an effect at least as large as the one in your data purely by chance, assuming the null hypothesis is true. So, the lower the p-value, the less plausible it is that the effect arose by chance or at random, and the stronger the evidence against your default or null hypothesis.

The p-value is compared with a cut-off chosen in advance, which is the chance you are willing to take of being wrong when you reject the null hypothesis. For example, cut-offs of 0.05 and 0.1 mean you are willing to accept a 5% and 10% chance, respectively, of rejecting a null hypothesis that is actually true. In practice, depending on your area of study, cut-off levels of 1% and 5% are common: when the p-value falls below the cut-off, you conclude that the effect is unlikely to be random or due to chance, and the null hypothesis can be rejected.

### ii) What is the Chi-square statistic?

Before understanding the chi-square statistic, let’s understand the concept of observed and expected frequencies. Observed frequencies are the actual frequencies as seen in the data and shown in the contingency table above as ‘Observed’. This is the same contingency table we introduced earlier. Expected frequencies are the frequencies we would expect if there were absolutely no association between the variables; they are shown in the contingency table above as ‘Expected’.

How did we calculate the expected frequency? It is calculated from the observed or actual frequencies:

$\text{Expected cell frequency} = \frac{\text{row total containing the cell} \times \text{column total containing the cell}}{\text{total number of observations}}$

$\text{Expected cell frequency for Male Android users} = \frac{5653 \times 3456}{10000} \approx 1954$

Let’s analyze the expected frequencies further. In the above table, the row and column percentages are quite similar, and there doesn’t seem to be a difference in percentages due to the influence of Gender on OS or vice-versa.

The chi-square test is a statistical test commonly used to compare observed data with the data we would expect to obtain according to a specific hypothesis. In our example, we would have expected 1954 of the 5653 Android users to be Male, but the actual or observed number was 1385. So is this deviation of 569 users statistically significant? Were the deviations (differences between observed and expected) the result of chance, or were they due to other factors? The chi-square test helps us answer this by calculating the chi-square statistic:

$\chi^{2} = \sum \frac{(\text{Observed} - \text{Expected})^{2}}{\text{Expected}}$

That is, the chi-square statistic is the sum, over all cells of the table, of the squared difference between the observed and expected frequencies divided by the expected frequency.
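The two formulas above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the article: the function names are made up for this example, and because the full contingency table is not reproduced in the text, the 2 x 2 table at the end is a hypothetical toy table. Only the Male Android figures (5653, 3456, 10000 and 1385) come from the article.

```python
def expected_frequency(row_total, col_total, grand_total):
    """Expected cell count if the two variables were independent."""
    return row_total * col_total / grand_total


def chi_square_statistic(observed):
    """Sum of (observed - expected)^2 / expected over every cell.

    `observed` is a list of rows, each row being a list of counts.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = expected_frequency(row_totals[i], col_totals[j], grand_total)
            statistic += (obs - exp) ** 2 / exp
    return statistic


# Figures quoted in the article for Male Android users.
male_android_expected = expected_frequency(5653, 3456, 10000)
male_android_deviation = male_android_expected - 1385  # expected minus observed

# Hypothetical toy table (NOT the article's data), only to exercise the function.
toy_table = [[10, 20],
             [30, 40]]
toy_statistic = chi_square_statistic(toy_table)

print(round(male_android_expected))   # expected count, rounded to whole users
print(round(male_android_deviation))  # deviation, rounded to whole users
print(round(toy_statistic, 4))
```

Feeding the article's full Observed table to `chi_square_statistic` would be the way to reproduce the statistic for OS and Gender.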

### iii) What are Degrees of Freedom?

The degrees of freedom is the number of values in a calculation that are free to vary. Let’s understand degrees of freedom with the help of two examples.

Example 1: Suppose you know that the mean of a data set with 10 observations is 25, and that the variable has many such sets of 10 observations. For a new set of 10 observations, you are free to choose any 9 of the values. But you are not free to choose the 10th, because the mean of the data has to equal 25. So the 10th observation has to equal (25 * 10 – sum of the values of the other 9 observations). Hence, the degrees of freedom in this case is 9.

Example 2: When running a chi-square test on a contingency table, the row totals and column totals play the role of the mean, and the cells of the table play the role of the observations in Example 1. In the above contingency table, we can freely choose only 2 cell values if the row and column totals are to stay unchanged. Hence, the degrees of freedom is 2. The formula for a contingency table with 2 categorical variables is (r – 1) * (c – 1), where r and c are the numbers of rows and columns; in our case that is (3 – 1) * (2 – 1) = 2.
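Both examples can be checked with a short Python sketch. The function names and the nine sample values are hypothetical, chosen only for illustration; the mean of 25 and the 3 x 2 table shape come from the text.

```python
def forced_last_value(free_values, target_mean):
    """Given n-1 freely chosen values, return the one value that fixes the mean."""
    n = len(free_values) + 1
    return target_mean * n - sum(free_values)


def contingency_degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for a chi-square test on an r x c contingency table."""
    return (n_rows - 1) * (n_cols - 1)


# Example 1: nine arbitrary (hypothetical) values; the 10th is then determined.
nine_values = [20, 22, 31, 25, 18, 27, 30, 24, 26]
tenth = forced_last_value(nine_values, target_mean=25)
full_set = nine_values + [tenth]

# Example 2: the 3 x 2 contingency table discussed in the article.
dof = contingency_degrees_of_freedom(3, 2)

print(tenth, sum(full_set) / len(full_set), dof)
```

Changing any of the nine values changes `tenth`, but never the mean, which is what "9 degrees of freedom" expresses.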

### Steps to calculate p-value

To decide whether to reject the Null Hypothesis, we need to calculate the p-value. The p-value is calculated in the following 3 steps:

Step 1) Calculate the chi-square statistic.

Step 2) Calculate the degrees of freedom.

Step 3) Find the p-value corresponding to the chi-square statistic, at those degrees of freedom, in the chi-square distribution table.

The above table is an excerpt of a chi-square distribution table. The first column contains the degrees of freedom. The cells of each row give the critical value of chi-square for a given p-value (column heading) and a given number of degrees of freedom. For a given number of degrees of freedom, the higher the chi-square statistic (cell value), the lower the p-value.
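Step 3 can also be done without a printed table. The sketch below is one possible way to compute the upper-tail probability of the chi-square distribution using only Python's standard library; it is not the method used in the article, and the function name `chi_square_p_value` is an assumption of this example. It relies on the recurrence Q(a + 1, y) = Q(a, y) + y^a e^(-y) / Γ(a + 1) for the regularized upper incomplete gamma function, starting from the closed forms for 1 and 2 degrees of freedom. In day-to-day work a statistics library, for example `scipy.stats.chi2.sf`, would normally be used instead.

```python
import math


def chi_square_p_value(statistic, df):
    """P(chi-square with `df` degrees of freedom > statistic), for integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if statistic <= 0:
        return 1.0

    y = statistic / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)                # closed form for 2 degrees of freedom
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))     # closed form for 1 degree of freedom

    # Step up two degrees of freedom at a time until a == df / 2.
    while a < df / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q


# Critical value quoted in the article: 10.597 at 2 degrees of freedom.
p_at_critical = chi_square_p_value(10.597, 2)
print(round(p_at_critical, 4))
```

The printed value should agree with the 0.005 column heading of the distribution table for the 2 degrees of freedom row.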

### OS & Gender

In our example, the chi-square statistic (χ²) for OS and Gender, computed with the chi-square statistic formula, is 675.86. In the chi-square distribution table, the critical value of χ² for a p-value of 0.005 at 2 degrees of freedom is 10.597. Since 675.86 is far greater than 10.597, the p-value has to be less than 0.005. The exact value is easier to obtain with a computer than by hand: the p-value, or P(χ² > 675.86) at 2 degrees of freedom, is < 2.2e-16, or almost zero.

We have to compare this p-value with an assumed cut-off level of 5% or 1%, known as alpha or the significance level. The assumed alpha helps us decide whether the statistic was observed by chance or because of some other factor. The calculated p-value is less than the assumed alpha. Hence, based on the evidence, we reject the Null Hypothesis and conclude that Gender and OS are not independent.

### Gender & Transact

χ² = 0.11647

P(χ² > 0.11647) at 1 degree of freedom = 0.7329

Since the p-value is greater than the alpha value of 0.05, we fail to reject the Null Hypothesis: the data give no evidence against Gender and Transact being independent.

### OS & Transact

χ² = 24.581

P(χ² > 24.581) at 2 degrees of freedom = 4.595e-06, or almost zero.

Since the p-value is less than the alpha value of 0.05, we reject the Null Hypothesis and conclude that OS and Transact are not independent.
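The decision rule applied to the three pairs can be reproduced with a short Python sketch. It takes the chi-square statistics and degrees of freedom reported in the article as given and uses the closed-form tail probabilities that exist for 1 and 2 degrees of freedom; the names `ALPHA`, `results`, `upper_tail` and `decisions` are assumptions of this example.

```python
import math

ALPHA = 0.05  # significance level used in the article

# (pair, chi-square statistic, degrees of freedom) as reported in the article
results = [
    ("OS & Gender", 675.86, 2),
    ("Gender & Transact", 0.11647, 1),
    ("OS & Transact", 24.581, 2),
]


def upper_tail(statistic, df):
    """P(chi-square > statistic); closed forms for 1 and 2 degrees of freedom only."""
    if df == 1:
        return math.erfc(math.sqrt(statistic / 2))
    if df == 2:
        return math.exp(-statistic / 2)
    raise ValueError("this sketch only covers 1 or 2 degrees of freedom")


decisions = {}
for pair, statistic, df in results:
    p_value = upper_tail(statistic, df)
    reject = p_value < ALPHA
    decisions[pair] = (p_value, reject)
    verdict = "reject" if reject else "fail to reject"
    print(f"{pair}: p-value = {p_value:.4g} -> {verdict} the null hypothesis")
```

For OS and Gender the computed p-value is many orders of magnitude below 2.2e-16; statistical software typically reports such values only as "< 2.2e-16".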

### Closing Thoughts

To sum up, we have been able to compare 2 categorical variables with the help of a contingency table and the chi-square test. The same concept can be extended to compare more than 2 categorical variables together. The next article will deal with ways to compare mixed types of variables, i.e. when we have to deal with numerical and categorical variables together.
