How to Compare Apples and Oranges, Part 2 – Categorical Variables
In the previous article, we looked at some of the ways to compare different numerical variables. In this article, we shall look at techniques to compare categorical variables with the help of an example.
By Jacob Joseph, CleverTap.
Assume you have been given a dataset totaling 10,000 rows containing user information on Operating System, Gender and whether the user has transacted over a particular period.
All the variables mentioned above are categorical variables. About 65% of the users are Female and 35% are Male. Among Android users, however, Females constitute about 75% and Males about 25%. If there is no association between Gender and OS, you would expect the Female/Male split among Android users (75% and 25%) to be similar to the Female/Male split among all users (65% and 35%). The same holds true for Windows and iOS users. But is there a way to decide whether the observed difference is big enough to conclude that the two splits are genuinely different? In short, we are asking the question ‘Is there an association between the categorical variables Gender and OS?’.
In order to compare categorical variables, we have to work with the frequencies of the levels (attributes) of such variables. From the above table, we know that the frequency of ‘Android’ in OS is 5653 users, of which 1385 are Male and 4268 are Female, as can be seen in the first row of the table. We need to use these frequencies to compare the categorical variables. The above table is a contingency table in which we are analyzing 2 categorical variables. A contingency table is essentially a display format used to record and analyze the relationship between two or more categorical variables. Let’s further analyze the contingency table:
From the above table, it seems that the break-up of Gender differs across Operating Systems. For example:
- Android users constitute 56.53% of the total users. But, if we segregate the users based on Gender, we get different percentages for Males and Females on Android.
- Male Android users constitute 40.08% of the Male users whereas Female Android users constitute 65.22% of the Female users.
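The percentages in the two bullets above can be reproduced with a few lines of Python. This is only a sketch: the Android counts are the ones quoted in the article, but the Male and Female totals (3456 and 6544) are not stated in the text and are an assumption, back-calculated from the quoted 40.08% and 65.22% figures.

```python
# Counts quoted in the article for the Android row of the table.
android_male = 1385
android_female = 4268
total_users = 10_000

# ASSUMPTION: the gender totals are not given in the text. They are
# back-calculated from the quoted 40.08% and 65.22% figures.
male_total = 3456
female_total = 6544


def pct(part, whole):
    """Return `part` as a percentage of `whole`, rounded to 2 decimals."""
    return round(100 * part / whole, 2)


android_total = android_male + android_female

# Share of all users who are on Android.
android_share = pct(android_total, total_users)
# Share of Male users, and of Female users, who are on Android.
male_android_share = pct(android_male, male_total)
female_android_share = pct(android_female, female_total)

print(android_share, male_android_share, female_android_share)
```

The same helper can be applied to the Windows and iOS rows once their counts are read off the table.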
The question that may arise is why the frequency percentages differ when we look at the levels of a single categorical variable compared with the combination of levels of more than one categorical variable. Is there an association between Gender and OS that results in the difference? Is the deviation in percentages large enough to be statistically significant, so that we can conclude some association is present?
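One way to get a feel for the size of the deviation is to compare the observed Android counts with the counts we would expect if Gender and OS had no association at all. The sketch below does this for the Android row only. As before, the gender totals (3456 Male, 6544 Female) are an assumption derived from the percentages quoted above, not figures stated in the article, and the function name is purely illustrative.

```python
# Observed counts quoted in the article for the Android row.
observed = {"Male": 1385, "Female": 4268}
total_users = 10_000

# ASSUMPTION: gender totals back-calculated from the quoted
# percentages; they are not given explicitly in the text.
gender_totals = {"Male": 3456, "Female": 6544}


def expected_if_no_association(row_total, column_total, grand_total):
    """Expected cell count when the row and column variables are unrelated:
    (row total * column total) / grand total."""
    return row_total * column_total / grand_total


android_total = sum(observed.values())

expected = {
    gender: expected_if_no_association(android_total, total, total_users)
    for gender, total in gender_totals.items()
}

# Positive deviation: more users observed than "no association" predicts.
deviation = {gender: observed[gender] - expected[gender] for gender in observed}

for gender in observed:
    print(gender, observed[gender], round(expected[gender], 1),
          round(deviation[gender], 1))
```

Under these assumptions, the observed number of Male Android users falls well below the expected count and the observed number of Female Android users falls above it by the same amount. Whether a gap of that size could plausibly arise by chance is what the rest of the article sets out to test.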
We often come across the terms ‘statistical significance’ and ‘random chance’. But what do they mean intuitively? Imagine you toss 2 coins, A and B, 10 times each. Coin ‘A’ lands heads 3 times whereas Coin ‘B’ lands heads 5 times. Does it mean that Coin ‘A’ is an unfair coin that is more likely to land tails than heads? You know intuitively that the difference could have occurred simply due to luck, that is, by chance. But what if you toss each coin 1000 times and Coin ‘A’ lands heads 100 times whereas Coin ‘B’ lands heads 550 times? Would you still attribute this difference to chance, or to some other underlying factor such as the shape of the coins? We can answer this question with the help of statistical tests. Coming back to our example of user data, we will attempt to explain the difference seen in the contingency table with the help of hypothesis testing.
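To put rough numbers on the coin intuition above, the following sketch computes the exact probability that a fair coin gives a result at least as lopsided towards tails as Coin ‘A’ did. It assumes a fair coin (probability of heads 0.5) and looks only at the lower tail. It is meant as an illustration of ‘chance’ and is not the test applied to the user data; the function name is hypothetical.

```python
from math import comb


def prob_at_most(heads, tosses):
    """Exact probability of seeing `heads` or fewer heads in `tosses`
    tosses of a fair coin (binomial distribution with p = 0.5)."""
    favourable = sum(comb(tosses, k) for k in range(heads + 1))
    return favourable / 2 ** tosses


# Coin 'A' in the small experiment: 3 heads out of 10 tosses.
p_small = prob_at_most(3, 10)

# Coin 'A' in the large experiment: 100 heads out of 1000 tosses.
p_large = prob_at_most(100, 1000)

print(p_small)  # 0.171875, i.e. about a 17% chance for a fair coin
print(p_large)  # vanishingly small for a fair coin
```

A fair coin produces 3 or fewer heads in 10 tosses about 17% of the time, so that result is unremarkable. Producing 100 or fewer heads in 1000 tosses is so improbable for a fair coin that chance alone is no longer a credible explanation.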
Claim 1: Gender is independent of Operating System (No Association)
Claim 2: Gender is not independent of Operating System (Association)
The above 2 claims are essentially what we test in hypothesis testing. We deal with hypotheses on a daily basis. We might have hypotheses on political issues, social issues, financial issues, etc. For example, we might have a hypothesis on whether it will rain today. In any hypothesis test, you will have a default or null hypothesis, referred to as H0 (Claim 1), which is your default belief, and an alternate hypothesis, referred to as H1 (Claim 2), which goes against your default belief. The null hypothesis is the statement being tested.
Usually the null hypothesis is a statement of “no effect” or “no difference”. So, in our example, the null hypothesis is that the percentage composition of Gender is the same for Android, Windows and iOS users. The alternate hypothesis is that it is not the same. Here, the word ‘same’ does not imply that the percentage compositions have to be exactly equal; it means that there is no statistically significant difference between them. We run an appropriate statistical test to decide whether or not to reject the null hypothesis. But, prior to that, we need to understand 3 statistical concepts: (i) p-value, (ii) chi-square statistic, and (iii) degrees of freedom.