How to Compare Apples and Oranges – Part 1
We are always told that apples and oranges can’t be compared; they are completely different things. Learn how, as an analyst, you can deal with such differences and make sense of them on a daily basis.
iii) Can I use these variables in my predictive model directly?
When you compare variables, you look at relative metrics rather than absolute ones. You wouldn’t look at the coefficient of variation of Salary in isolation; you would compare it with that of the other 2 variables. The first 2 questions attempted to compare individual metrics. But what if the goal is to compare the observations of the entire dataset rather than an individual metric of the variables? Building predictive models requires the entire dataset as the input.
Can you use the variables in the dataset as the input directly, especially if you use machine learning algorithms? Do you want your algorithms to give importance to a variable just because its values are higher than those of the other variables? In our example, Salary is on the largest scale. So, if we give all the variables in our dataset as an input to K-means, a popular machine learning algorithm, it will tend to give more importance to ‘Salary’, and the resulting clusters will probably be segmented on Salary alone rather than on all the variables in conjunction. This is because the algorithm sees only the values of the variables: since K-means clusters observations based on the numeric distance between them, it is logical that Salary gets a higher weight in determining how the observations are clustered, as its values are much larger than those of the other 2 variables.
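To make this scale effect concrete, here is a minimal sketch in Python. The column names and the randomly generated numbers are hypothetical stand-ins (only Salary is named in the article’s example), and scikit-learn’s KMeans plays the role of the algorithm:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in data: three variables on very different scales.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Salary": rng.normal(60_000, 15_000, 200),  # tens of thousands
    "Age": rng.normal(35, 8, 200),              # tens
    "Experience": rng.normal(10, 5, 200),       # ones to tens
})

# K-means on the raw values: Euclidean distance is dominated by Salary,
# so the clusters split almost entirely along the Salary axis.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(df)
print(df.groupby(labels).mean())  # cluster means differ mainly in Salary
```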
What’s the way out? So far, we have standardized individual metrics, not the data as a whole. Let’s look at a method that standardizes an entire variable: the z-score.
In the above formula, we subtract the variable’s mean from each observation and divide the result by the standard deviation, i.e. z = (x − mean) / standard deviation. If you look at the formula closely, the resultant z-score is unitless, as the units in the numerator and denominator cancel out in the division.
The table on the left shows the z-scores for the first 10 rows of the variables, and the table on the right shows the data summary of the variables post-transformation.
As seen in the table above, standardization has transformed each variable to have a mean of zero and a standard deviation of one.
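As a rough sketch of this computation in Python, continuing with the hypothetical df from the previous snippet, z-scores can be computed directly with pandas (scikit-learn’s StandardScaler is the library equivalent):

```python
# z-score each column: subtract the column mean and divide by the column
# standard deviation (pandas uses the sample standard deviation, ddof=1).
z = (df - df.mean()) / df.std()

# Post-transformation summary: every column now has mean ~0 and std ~1.
print(z.mean().round(6))
print(z.std().round(6))
```

Running K-means on z instead of df lets all three variables contribute comparably to the distance calculation.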
There are situations where you require standardization, especially for machine learning techniques like PCA and K-means. But at the same time, you might need to transform the data into a particular range, as in image processing, where pixel intensities have to fit within a fixed range (0 to 255 for each RGB channel). Neural network algorithms, too, often use inputs on a 0-1 scale to avoid bias, which can arise from observations at the extreme ends of the range or from outliers. To avoid such issues, you need a transformation technique that bounds the data within a range. This can be achieved with Normalization.
Just like standardization, normalization produces unitless values: each observation is rescaled as (x − min) / (max − min), so the variable’s units cancel out. Let’s see what our original dataset looks like after normalization.
The table on the left shows the normalized scores for the first 10 rows of the variables, and the table on the right shows the data summary of the variables post-transformation. All the variables are now bounded between 0 and 1.
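A matching sketch for normalization, again on the hypothetical df from the earlier snippets (scikit-learn’s MinMaxScaler performs the same rescaling):

```python
# Min-max normalization: rescale each column so its minimum maps to 0
# and its maximum maps to 1.
normalized = (df - df.min()) / (df.max() - df.min())

# Every column is now bounded within [0, 1].
print(normalized.describe().loc[["min", "max"]])
```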
So how would you select between Standardization and Normalization? The method has to be selected depending on the objective of the technique it feeds. For techniques like PCA or K-means, you would like to retain the unbounded nature and the variation of the data while making it unitless and putting the variables on relatively the same scale (range of transformed values); in such cases, Standardization is your best bet. At other times, you don’t want your data to be unbounded, as it is with Standardization, so that your technique is not biased towards observations at the higher or lower end of the range (potential outliers). Normalization reduces that impact by strictly bounding the data between 0 and 1.
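To see the bounded-versus-unbounded distinction on a toy series (the numbers below are made up purely for illustration):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 500])   # 500 is an extreme value

z = (s - s.mean()) / s.std()               # standardized: no fixed bounds
n = (s - s.min()) / (s.max() - s.min())    # normalized: always within [0, 1]

print(z.round(2).tolist())  # the outlier gets the largest-magnitude z-score
print(n.round(2).tolist())  # the outlier is pinned at exactly 1
```

One caveat worth noting: since the minimum and maximum define the normalized scale, a single extreme value also compresses all the other normalized values towards 0, so the choice still depends on how you want outliers to behave.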
Closing Thoughts
Comparing different variables should begin with identifying the purpose of the comparison; that purpose determines the appropriate technique for transforming the variables, or the key metric to be selected. With the help of an example, we looked at the coefficient of variation, correlation, standardization and normalization as some of the ways to compare different numerical variables, use them in analysis, and build predictive models on them.
In the ensuing part, we will discuss how to compare categorical variables.