How to Compare Apples and Oranges, Part 2 – Categorical Variables
In the previous article, we looked at some of the ways to compare different numerical variables. In this article, we shall look at techniques to compare categorical variables with the help of an example.
i) What is a p-value?
Assuming you have a null hypothesis (in the above case, that Gender and OS are independent of each other), the p-value helps you evaluate how well your data agree with it. Statistical tests use the p-value to decide whether to reject the null hypothesis or fail to reject it. The p-value is the probability of observing an effect at least as large as the one in your data purely by chance, assuming the null hypothesis is true. It is not the probability that the null hypothesis itself is true. So, the lower the p-value, the less plausible it is that the effect arose by chance alone, and the stronger the evidence against your default or null hypothesis. The chance you are willing to take of being wrong is a separate quantity, the significance level (alpha), which you choose before running the test. For example, an alpha of 0.05 or 0.1 means you are willing to reject a true null hypothesis 5% or 10% of the time, respectively. In practice, depending on your area of study, common cutoff levels are 1% and 5%: when the p-value falls below the chosen cutoff, you conclude that the effect is unlikely to be due to chance and reject the null hypothesis.
ii) What is the chi-square statistic?
Before looking at the chi-square statistic, let’s understand the concept of observed and expected frequencies. Observed frequencies are the actual frequencies seen in the data, shown in the contingency table above as ‘Observed’. This is the same contingency table we introduced earlier. Expected frequencies are the frequencies we would expect if there were absolutely no association between the two variables, shown in the contingency table above as ‘Expected’. How did we calculate the expected frequencies? They are derived from the observed frequencies: the expected count for a cell is its row total multiplied by its column total, divided by the grand total. Let’s analyze the expected frequencies further: in the above table, the row and column percentages are quite similar, and there doesn’t seem to be a difference in percentages due to the influence of Gender on OS or vice versa.
The chi-square test is a statistical test commonly used to compare observed data with the data we would expect to obtain under a specific hypothesis. In our example, we would have expected 1954 of the 5653 Android users to be Male, but the observed number was 1385. So is this deviation of 569 users statistically significant? Were the deviations (differences between observed and expected) the result of chance, or were they due to other factors? The chi-square test helps us answer this by calculating the chi-square statistic: the sum, over all cells of the table, of the squared difference between the observed and expected frequencies divided by the expected frequency, i.e. χ^{2} = Σ (O - E)^{2} / E.
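As a minimal sketch of the calculation described above, the following Python snippet derives expected frequencies from a table of observed counts and sums the per-cell contributions to the chi-square statistic. The function names and the small 2 x 2 table are made up for illustration and are not the article's data; only the last line uses numbers from the text (1385 observed against a rounded expected count of 1954).

```python
def expected_frequencies(observed):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_statistic(observed):
    """Sum of (observed - expected)^2 / expected over every cell."""
    expected = expected_frequencies(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical counts, NOT the article's contingency table.
toy_table = [[10, 20], [30, 40]]
toy_expected = expected_frequencies(toy_table)   # [[12.0, 18.0], [28.0, 42.0]]
toy_statistic = chi_square_statistic(toy_table)  # 50/63, about 0.794

# Contribution of the single cell quoted in the text (Male Android users),
# using the rounded expected count of 1954.
male_android_contribution = (1385 - 1954) ** 2 / 1954  # about 165.7
```

That one cell alone contributes roughly 165.7 to the statistic, which hints at why the total for OS and Gender turns out so large.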
iii) What are degrees of freedom?
The degrees of freedom is the number of values in a calculation that are free to vary. Let’s understand degrees of freedom with the help of two examples. Example 1: Suppose a variable has many sets of 10 observations each, and you know that the mean of every such set is 25. For a new set of 10 observations, you are free to choose any 9 of the values, but not the 10th. This is because the mean of the data has to equal 25, so the 10th value has to equal (25 * 10 - sum of the other 9 values). Hence, the degrees of freedom in this case is 9.
Example 2: When we run a chi-square test on the contingency table, the row totals and column totals play the role of the mean, and the cells of the contingency table play the role of the observations in Example 1. In the above contingency table, we can freely choose only 2 values if the row and column totals are to stay unchanged. Hence, the degrees of freedom is 2. The formula for a contingency table with 2 categorical variables is (r - 1) * (c - 1), where r and c are the numbers of rows and columns, which in our case is (3 - 1) * (2 - 1) = 2.
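Both examples can be checked with a few lines of Python. This is only an illustrative sketch: the function names and the nine freely chosen values are assumptions made up for the demonstration, not data from the article.

```python
def last_value_given_mean(mean, n, free_values):
    """Value forced on the n-th observation once n - 1 values and the mean are fixed."""
    if len(free_values) != n - 1:
        raise ValueError("exactly n - 1 values can be chosen freely")
    return mean * n - sum(free_values)


def contingency_degrees_of_freedom(rows, cols):
    """Degrees of freedom of an r x c contingency table: (r - 1) * (c - 1)."""
    return (rows - 1) * (cols - 1)


# Example 1: nine hypothetical values chosen freely; the tenth is then determined.
nine_values = [20, 22, 24, 26, 28, 30, 21, 23, 27]
tenth_value = last_value_given_mean(25, 10, nine_values)  # 250 - 221 = 29

# Example 2: the 3 x 2 table from the text.
dof = contingency_degrees_of_freedom(3, 2)  # 2
```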
Steps to calculate the p-value
In order to decide whether to reject the Null Hypothesis, we need to calculate the p-value. The p-value is calculated in the following 3 steps:
Step 1) Calculate the chi-square statistic
Step 2) Calculate the degrees of freedom
Step 3) Find the p-value corresponding to the chi-square statistic, at those degrees of freedom, in the chi-square distribution table.
The above table is an excerpt of a chi-square distribution table. The first column contains the degrees of freedom. The cells of each row give the critical value of chi-square for a given p-value (column heading) and a given number of degrees of freedom. For a given number of degrees of freedom, the higher the chi-square statistic (cell value), the lower the p-value.
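Step 3 does not have to be a table lookup. For the two cases that occur in this article, 1 and 2 degrees of freedom, the upper-tail probability of the chi-square distribution has an exact closed form, so a small sketch using only Python's standard library is enough. The function name is an assumption for illustration, and the sketch deliberately refuses other degrees of freedom rather than approximate them.

```python
import math


def chi_square_p_value(statistic, dof):
    """Upper-tail probability P(X > statistic) for a chi-square variable.

    Only dof = 1 and dof = 2 are handled, where exact closed forms exist.
    """
    if statistic < 0:
        raise ValueError("the chi-square statistic cannot be negative")
    if dof == 1:
        return math.erfc(math.sqrt(statistic / 2))
    if dof == 2:
        return math.exp(-statistic / 2)
    raise ValueError("this sketch only supports 1 or 2 degrees of freedom")


# The table's critical value 10.597 at 2 degrees of freedom maps back to about 0.005.
p_at_critical_value = chi_square_p_value(10.597, 2)
```

For other degrees of freedom, a statistics library is the practical choice; for example, SciPy's `scipy.stats.chi2.sf(statistic, dof)` computes the same upper-tail probability.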
OS & Gender
In our example, the chi-square statistic (χ^{2}) for OS and Gender, computed with the chi-square statistic formula, is 675.86. In the chi-square distribution table, the χ^{2}_{0.005} critical value is 10.597 at 2 degrees of freedom. Since 675.86 is far larger, the p-value has to be less than 0.005. The exact value is more easily computed by software than by hand: the p-value, P(χ^{2} > 675.86) at 2 degrees of freedom, is < 2.2e-16, or almost zero. We compare this p-value with an assumed cutoff level of 5% or 1%, known as alpha or the significance level. The assumed alpha helps us decide whether the statistic could plausibly have been observed by chance. The calculated p-value is less than the assumed alpha. Hence, based on the evidence, we reject the Null Hypothesis and conclude that Gender and OS are not independent.
Gender & Transact
χ^{2} = 0.11647
P(χ^{2} > 0.11647) at 1 degree of freedom = 0.7329
Since the p-value is greater than the alpha value of 0.05, we fail to reject the Null Hypothesis: the data give no evidence against Gender and Transact being independent.
OS & Transact
χ^{2} = 24.581
P(χ^{2} > 24.581) at 2 degrees of freedom = 4.595e-06, or almost zero.
Since the p-value is less than the alpha value of 0.05, we reject the Null Hypothesis and conclude that OS and Transact are not independent.
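The decision rule applied in the three comparisons above can be written down in a few lines. This is a hedged sketch: the function name is hypothetical, the p-values are the ones reported in the text, and for OS & Gender only the reported upper bound is available.

```python
def decide(p_value, alpha=0.05):
    """Return the test decision for a p-value at significance level alpha."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


# p-values as reported in the text; for OS & Gender only the upper bound is given.
reported_p_values = {
    "OS & Gender": 2.2e-16,
    "Gender & Transact": 0.7329,
    "OS & Transact": 4.595e-06,
}
decisions = {pair: decide(p) for pair, p in reported_p_values.items()}
```

With the stricter cutoff of 1% (`alpha=0.01`), the three decisions stay the same.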
Closing Thoughts
To sum up, we have been able to compare 2 categorical variables with the help of a contingency table and the chi-square test. The same concept can be extended to compare more than 2 categorical variables together. The next article will deal with ways to compare variables of mixed types, i.e. when we have to deal with numerical and categorical variables together.