Normalization vs Standardization — Quantitative analysis
Stop using StandardScaler from Sklearn as a default feature scaling method can get you a boost of 7% in accuracy, even when you hyperparameters are tuned!
By Shay Geller, NLP & AI Researcher
Every ML practitioner knows that feature scaling is an important issue (read more here).
The two most discussed scaling methods are Normalization and Standardization. Normalization typically means rescales the values into a range of [0,1]. Standardization typically means rescales data to have a mean of 0 and a standard deviation of 1 (unit variance).
In this blog, I conducted a few experiments and hope to answer questions like:
- Should we always scale our features?
- Is there a single best scaling technique?
- How different scaling techniques affect different classifiers?
- Should we consider scaling technique as an important hyperparameter of our model?
I’ll analyze the empirical results of applying different scaling methods on features in multiple experiments settings.
Table of Contests
- 0. Why are we here?
- 1. Out-of-the-box classifiers
- 2. Classifier + Scaling
- 3. Classifier + Scaling + PCA
- 4. Classifier + Scaling + PCA + Hyperparameter Tuning
- 5. All again on more datasets:
- — 5.1 Rain in Australia dataset
- — 5.2 Bank Marketing dataset
- — 5.3 Income classification dataset
- — 5.4 Income classification dataset
0. Why are we here?
First, I was trying to understand what is the difference between Normalization and Standardization.
So, I encountered this excellent blog by Sebastian Raschka that supplies a mathematical background that satisfied my curiosity. Please take 5 minutes to read this blog if you are not familiar with Normalization or Standardization concepts.
There is also a great explanation of the need for scaling features when dealing with classifiers that trained using gradient descendent methods( like neural networks) by famous Hinton here.
Ok, we grabbed some math, that’s it? Not quite.
When I checked the popular ML library Sklearn, I saw that there are lots of different scaling methods. There is a great visualization of the effect of different scalers on data with outliers. But they didn’t show how it affects classification tasks with different classifiers.
I saw a lot of ML pipelines tutorials that use StandardScaler (usually called Z-score Standardization) or MinMaxScaler (usually called min-max Normalization) to scale features. Why does no one use other scaling techniques for classification? Is it possible that StandardScaler or MinMaxScaler are the best scaling methods?
I didn’t see any explanation in the tutorials about why or when to use each one of them, so I thought I’d investigate the performance of these techniques by running some experiments. This is what this notebook is all about
Like many Data Science projects, lets read some data and experiment with several out-of-the-box classifiers.
Sonar dataset. It contains 208 rows and 60 feature columns. It’s a classification task to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock.
It’s a balanced dataset:
All the features in this dataset are between 0 to 1, but it’s not ensured that 1 is the max value or 0 is the min value in each feature.
I chose this dataset because, from one hand, it is small, so I can experiment pretty fast. On the other hand, it’s a hard problem and none of the classifiers achieve anything close to 100% accuracy, so we can compare meaningful results.
We will experiment with more datasets in the last section.
As a preprocessing step, I already calculated all the results (it takes some time). So we only load the results file and work with it.
The code that produces the results can be found in my GitHub:
I pick some of the most popular classification models from Sklearn, denoted as:
(MLP is Multi-Layer Perceptron, a neural network)
The scalers I used are denoted as:
*Do not confuse Normalizer, the last scaler in the list above with the min-max normalization technique I discussed before. The min-max normalization is the second in the list and named MinMaxScaler. The Normalizer class from Sklearn normalizes samples individually to unit norm. It is not column based but a row based normalization technique.
- The same seed was used when needed for reproducibility.
- I randomly split the data to train-test sets of 80%-20% respectively.
- All results are accuracy scores on 10-fold random cross-validation splits from the train set.
- I do not discuss the results on the test set here. Usually, the test set should be kept hidden, and all of our conclusions about our classifiers should be taken only from the cross-validation scores.
- In part 4, I performed nested cross-validation. One inner cross-validation with 5 random splits for hyperparameter tuning, and another outer CV with 10 random splits to get the model’s score using the best parameters. Also in this part, all data taken only from the train set. A picture is worth a thousand words:
Let’s read the results file
1. Out-of-the-box classifiers
Nice results. By looking at the CV_mean column, we can see that at the moment, MLP is leading. SVM has the worst performance.
Standard deviation is pretty much the same, so we can judge mainly by the mean score. All the results below will be the mean score of 10-fold cross-validation random splits.
Now, let’s see how different scaling methods change the scores for each classifier
The first row, the one without index name, is the algorithm without applying any scaling method.
Best classifier from each model:
- SVM + StandardScaler : 0.849
- MLP + PowerTransformer-Yeo-Johnson : 0.839
- KNN + MinMaxScaler : 0.813
- LR + QuantileTransformer-Uniform : 0.808
- NB + PowerTransformer-Yeo-Johnson : 0.752
- LDA + PowerTransformer-Yeo-Johnson : 0.747
- CART + QuantileTransformer-Uniform : 0.74
- RF + Normalizer : 0.723