Normalization vs Standardization — Quantitative analysis

Dropping StandardScaler from sklearn as your default feature scaling method can get you a boost of 7% in accuracy, even when your hyperparameters are tuned!

Let’s analyze the results

1. There is no single scaling method to rule them all.
2. We can see that scaling improved the results. SVM, MLP, KNN, and NB got a significant boost from different scaling methods.
3. Notice that NB, RF, LDA, and CART are unaffected by some of the scaling methods. This is, of course, related to how each of the classifiers works. Trees are not affected by scaling because the splitting criterion first orders the values of each feature and then calculates the Gini/entropy of the split. Some scaling methods keep this order, so the accuracy score doesn’t change.
NB is not affected because the model’s priors are determined by the count in each class and not by the actual values. Linear Discriminant Analysis (LDA) finds its coefficients using the variation between the classes (check this), so the scaling doesn’t matter either.
4. Some of the scaling methods, like QuantileTransformer-Uniform, don’t preserve the exact order of the values in each feature, hence the change in score even in the above classifiers that were agnostic to other scaling methods.
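
To make point 3 concrete, here is a minimal sketch of the tree argument. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the dataset analyzed in this post) and fits the same decision tree on raw and on standardized features. Since StandardScaler only shifts and rescales each feature, the ordering of values is kept, and the two trees should make the same predictions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the dataset used in this post).
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the same tree on raw and on standardized features.
scaler = StandardScaler().fit(X_train)
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(
    scaler.transform(X_train), y_train)

pred_raw = tree_raw.predict(X_test)
pred_scaled = tree_scaled.predict(scaler.transform(X_test))

# Fraction of test points on which the two trees agree.
agreement = (pred_raw == pred_scaled).mean()
print(f"agreement between raw and scaled trees: {agreement:.3f}")
```

The same check can be repeated with any other scaler to see which ones leave a tree-based model unchanged.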

3. Classifier+Scaling+PCA

We know that some well-known ML methods like PCA can benefit from scaling (blog). Let’s try adding PCA(n_components=4) to the pipeline and analyze the results.
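
As a rough sketch of what such a pipeline can look like, the snippet below chains a scaler, PCA(n_components=4), and a classifier, and collects the mean cross-validation score per combination. The dataset (`load_wine`), the short lists of scalers and models, and the 5-fold setup are assumptions chosen for illustration, not the exact experiment from this post. The column names `Classifier_Name` and `CV_mean` and the `scaler_model` naming follow what the post's pivot-table code expects.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (MinMaxScaler, QuantileTransformer,
                                   RobustScaler, StandardScaler)

# Stand-in dataset (assumption): any numeric classification data works here.
X, y = load_wine(return_X_y=True)

scalers = {
    "NoScaler": "passthrough",  # Pipeline skips a step set to "passthrough"
    "StandardScaler": StandardScaler(),
    "MinMaxScaler": MinMaxScaler(),
    "RobustScaler": RobustScaler(),
    "QuantileTransformer-Uniform": QuantileTransformer(
        n_quantiles=100, output_distribution="uniform", random_state=0),
}
models = {
    "LDA-PCA": LinearDiscriminantAnalysis(),
    "RF-PCA": RandomForestClassifier(n_estimators=50, random_state=0),
}

rows = []
for scaler_name, scaler in scalers.items():
    for model_name, model in models.items():
        # Scaling happens inside the pipeline, so it is refit on each CV fold.
        pipe = Pipeline([("scaler", scaler),
                         ("pca", PCA(n_components=4)),
                         ("clf", model)])
        scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
        rows.append({"Classifier_Name": f"{scaler_name}_{model_name}",
                     "CV_mean": scores.mean()})

results_df = pd.DataFrame(rows)
print(results_df)
```

Keeping the scaler inside the pipeline means it only ever sees the training folds, which avoids leaking information from the validation fold into the scaling statistics.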

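```
import numpy as np
import pandas as pd

# results_df comes from the previous step: one row per "<scaler>_<model>" pipeline
temp = results_df.copy()
temp["model"] = results_df["Classifier_Name"].apply(lambda sen: sen.split("_")[1])
temp["scaler"] = results_df["Classifier_Name"].apply(lambda sen: sen.split("_")[0])

def df_style(val):
    # Render the highlighted cell in bold
    return 'font-weight: 800'

# Rows are scalers, columns are models, cells hold the mean CV score
pivot_t = pd.pivot_table(temp, values='CV_mean', index=["scaler"], columns=['model'], aggfunc=np.sum)

# Bold the best score in each model column
# (recent pandas versions rename Styler.applymap to Styler.map)
pivot_t_bold = pivot_t.style.applymap(df_style,
                                      subset=pd.IndexSlice[pivot_t["CART"].idxmax(), "CART"])
for col in list(pivot_t):
    pivot_t_bold = pivot_t_bold.applymap(df_style,
                                         subset=pd.IndexSlice[pivot_t[col].idxmax(), col])
pivot_t_bold
```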

Let’s analyze the results

1. Most of the time, scaling methods improve models with PCA, but no single scaling method dominates.
Let’s look at “QuantileTransformer-Uniform”, the method with most of the high scores.
In LDA-PCA it improved the results from 0.704 to 0.783 (an 8% jump in accuracy!), but in RF-PCA it made things worse, from 0.711 to 0.668 (a 4.35% drop in accuracy!).
On the other hand, using RF-PCA with “QuantileTransformer-Normal” improved the accuracy to 0.766 (a 5% jump in accuracy!).
2. We can see that PCA only improves LDA and RF, so PCA is not a magic solution.
That’s fine. We didn’t tune the n_components parameter, and even if we had, PCA isn’t guaranteed to improve predictions.
3. We can see that StandardScaler and MinMaxScaler achieve the best scores in only 4 out of 16 cases. So we should think carefully about which scaling method to choose, even as a default one.

We can conclude that even though PCA is a known component that benefits from scaling, no single scaling method always improved our results, and some of them even caused harm (RF-PCA with StandardScaler).

The dataset is also a major factor here. To better understand the consequences of scaling methods on PCA, we should experiment with more diverse datasets (class-imbalanced data, features on different scales, and datasets with both numerical and categorical features). I do this analysis in section 5.

4. Classifiers+Scaling+PCA+Hyperparameter tuning

There are big differences in the accuracy score between different scaling methods for a given classifier. One might assume that when the hyperparameters are tuned, the difference between the scaling techniques will be minor, and we can use StandardScaler or MinMaxScaler as is done in many classification pipeline tutorials on the web.
Let’s check that!

First, NB is not here because NB has no parameters to tune.

We can see that almost all the algorithms benefit from hyperparameter tuning compared to the results from the previous step. An interesting exception is MLP, which got worse results. This is probably because neural networks can easily overfit the data (especially when the number of parameters is much bigger than the number of training samples), and we neither performed careful early stopping to avoid it nor applied any regularization.
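
For reference, here is a hedged sketch of what early stopping and regularization could look like with sklearn's MLPClassifier. The dataset (`load_wine`), the layer size, and the `alpha` value are illustrative assumptions; this is not the configuration used in the experiment above, and it is not guaranteed to fix the drop in score.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)  # stand-in data (assumption)

mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(
        hidden_layer_sizes=(32,),
        alpha=1e-2,               # L2 penalty strength (illustrative value)
        early_stopping=True,      # hold out part of the training data...
        validation_fraction=0.2,  # ...and stop when its score stalls
        n_iter_no_change=10,
        max_iter=500,
        random_state=0,
    ),
)
scores = cross_val_score(mlp, X, y, cv=5)
mean_accuracy = scores.mean()
print(f"MLP with early stopping, mean CV accuracy: {mean_accuracy:.3f}")
```

Both `alpha` and the early-stopping settings are themselves hyperparameters, so in a tuned setup they would go into the search grid alongside the scaler.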

Yet, even when the hyperparameters are tuned, there are still big differences between the results using different scaling methods. If we compare different scaling techniques to the broadly used StandardScaler, we can gain up to a 7% improvement in accuracy (KNN column) by experimenting with other techniques.

The main conclusion from this step is that even though the hyperparameters are tuned, changing the scaling method can dramatically affect the results. So, we should consider the scaling method as a crucial hyperparameter of our model.
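
One way to treat the scaler as a hyperparameter, sketched here under assumptions, is to put the whole scaling step into the search grid of a pipeline. The dataset (`load_wine`), the KNN classifier, and the candidate values are illustrative choices; only the idea of searching over the scaler comes from this post.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (MinMaxScaler, QuantileTransformer,
                                   RobustScaler, StandardScaler)

X, y = load_wine(return_X_y=True)  # stand-in data (assumption)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", KNeighborsClassifier())])

# The "scaler" step itself is searched over, next to the usual KNN setting.
param_grid = {
    "scaler": [
        "passthrough",  # no scaling at all
        StandardScaler(),
        MinMaxScaler(),
        RobustScaler(),
        QuantileTransformer(n_quantiles=100, output_distribution="normal"),
    ],
    "clf__n_neighbors": [3, 5, 7, 9],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("best scaler:", search.best_params_["scaler"])
print("best n_neighbors:", search.best_params_["clf__n_neighbors"])
print(f"best CV accuracy: {search.best_score_:.3f}")
```

Because the scaler and the classifier settings are searched jointly, the comparison between scalers is made at each scaler's own best classifier configuration rather than at a shared default.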

Part 5 contains a more in-depth analysis of more diverse datasets. If you don’t want to deep dive into it, feel free to jump to the conclusion section.

5. All again on more datasets

To get a better understanding and to derive more generalized conclusions, we should experiment with more datasets.

We will apply Classifier+Scaling+PCA as in section 3 to several datasets with different characteristics and analyze the results. All datasets were taken from Kaggle.

• For the sake of convenience, I selected only the numerical columns from each dataset. For mixed datasets (numeric and categorical features), there is an ongoing debate about how to scale the features.
• I didn’t tune any hyperparameters of the classifiers.
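
Selecting only the numerical columns can be done in pandas with `select_dtypes`. The small frame below uses hypothetical column names and values purely for illustration; it is not one of the datasets from this post.

```python
import pandas as pd

# Toy frame with hypothetical column names, for illustration only.
dataset = pd.DataFrame({
    "age": [39, 50, 38],
    "hours_per_week": [40.0, 13.0, 40.0],
    "occupation": ["Clerical", "Managerial", "Cleaner"],
    "income": [">50K", "<=50K", "<=50K"],
})

target = dataset["income"]
# Keep only int/float columns; the text columns are dropped.
numeric_features = dataset.select_dtypes(include="number")
print(list(numeric_features.columns))
```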

5.1 Rain in Australia dataset

Classification task: Predict whether it is going to rain.
Metric: Accuracy
Dataset shape: (56420, 18)
Counts for each class:
No 43993
Yes 12427

Here is a sample of 5 rows; we can’t show all the columns in one picture.

```
dataset.describe()
```

We would expect scaling to improve the classification results because the features are on different scales (check the min and max values in the table above; it gets even worse for some of the remaining features).

Results Analysis

• We can see that neither StandardScaler nor MinMaxScaler ever got the highest score.
• We can see differences of up to 20% between StandardScaler and other methods (CART-PCA column).
• We can see that scaling usually improved the results. Take, for example, SVM, which jumped from 78% to 99%.

5.2 Bank Marketing dataset

Classification task: Predict whether the client has subscribed to a term deposit.
Metric: AUC (the data is imbalanced)
Dataset shape: (41188, 11)
Counts for each class:
no 36548
yes 4640

Here is a sample of 5 rows; we can’t show all the columns in one picture.

```
dataset.describe()
```

Again, the features are on different scales.

Results Analysis

• We can see that in this dataset, even though the features are on different scales, scaling when using PCA doesn’t always improve the results. However, the second-best score in each PCA column is pretty close to the best score. This might indicate that tuning the number of PCA components and using scaling would improve the results over not scaling at all.
• Again, no single scaling method stood out.
• Another interesting result is that in most models, the scaling methods didn’t have much effect (usually a 1%–3% improvement). Let’s remember that this is an imbalanced dataset and we didn’t tune the hyperparameters. Another reason is that the AUC score is already high (~90%), so it’s harder to see major improvements.
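
To illustrate why AUC is the metric here instead of accuracy, the sketch below rebuilds a label vector from the class counts listed above and scores a trivial model that always answers "no". The trivial model is an illustrative assumption, not one of the classifiers evaluated in this post.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Class counts reported above for the Bank Marketing dataset.
n_no, n_yes = 36548, 4640
y_true = np.array([0] * n_no + [1] * n_yes)

# A trivial model that gives every client the same score and always says "no".
constant_scores = np.zeros(len(y_true))
constant_labels = np.zeros(len(y_true), dtype=int)

baseline_accuracy = accuracy_score(y_true, constant_labels)
baseline_auc = roc_auc_score(y_true, constant_scores)

print(f"accuracy of always predicting 'no': {baseline_accuracy:.3f}")
print(f"AUC of a constant score: {baseline_auc:.3f}")
```

With these counts, always predicting "no" already reaches an accuracy of about 0.887 while its AUC stays at 0.5, so accuracy alone would make a useless model look good on this dataset.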

5.3 Sloan Digital Sky Survey DR14 dataset

Classification task: Predict whether an object is a galaxy, star, or quasar.
Metric: Accuracy (multiclass)
Dataset shape: (10000, 18)
Counts for each class:
GALAXY 4998
STAR 4152
QSO 850

Here is a sample of 5 rows; we can’t show all the columns in one picture.

```
dataset.describe()
```

Again, the features are on different scales.

Results Analysis

• We can see that scaling greatly improved the results. We could expect this because the dataset contains features on different scales.
• We can see that RobustScaler almost always wins when we use PCA. This might be due to the many outliers in this dataset that shift the PCA eigenvectors. On the other hand, those outliers don’t have such an effect when we don’t use PCA. We should do some data exploration to check that.
• There is up to a 5% difference in accuracy when we compare StandardScaler to the other scaling methods. So it’s another indicator of the need to experiment with multiple scaling techniques.
• PCA almost always benefits from scaling.
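
The outlier hypothesis above can be probed on a toy example. The sketch below builds one synthetic feature with a handful of extreme values (made-up numbers, not the SDSS data) and compares how much spread the regular points keep after StandardScaler versus RobustScaler.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

rng = np.random.default_rng(0)

# One synthetic feature: 990 well-behaved values plus 10 extreme outliers.
bulk = rng.normal(loc=0.0, scale=1.0, size=990)
outliers = np.full(10, 1000.0)
x = np.concatenate([bulk, outliers]).reshape(-1, 1)

standard = StandardScaler().fit_transform(x)
robust = RobustScaler().fit_transform(x)

# Spread of the 990 regular points after each scaling.
bulk_std_standard = standard[:990].std()
bulk_std_robust = robust[:990].std()

print(f"bulk std after StandardScaler: {bulk_std_standard:.4f}")
print(f"bulk std after RobustScaler:   {bulk_std_robust:.4f}")
```

StandardScaler divides by a standard deviation that the outliers inflate, so the regular points get squeezed into a narrow band. RobustScaler centers on the median and divides by the interquartile range, which a few outliers barely move. Since PCA looks for directions of maximal variance, this difference is one plausible reason for RobustScaler doing well here, though it would need to be checked on the actual data.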

5.4 Income classification dataset

Classification task: Predict whether income is >50K or <=50K.
Metric: AUC (imbalanced dataset)
Dataset shape: (32561, 7)
Counts for each class:
<=50K 24720
>50K 7841

Here is a sample of 5 rows; we can’t show all the columns in one picture.

```
dataset.describe()
```

Again, the features are on different scales.

Results Analysis

• Here again, we have an imbalanced dataset, but we can see that scaling does a good job of improving the results (up to 20%!). This is probably because the AUC score is lower (~80%) compared to the Bank Marketing dataset, so it’s easier to see major improvements.
• Even though StandardScaler is not highlighted (I highlighted only the first best score in each column), in many columns it achieves the same results as the best, but not always. From the running time results (not shown here), I can tell you that running StandardScaler is much faster than many of the other scalers. So if you are in a rush to get some results, it can be a good starting point. But if you want to squeeze every percent from your model, you might want to experiment with multiple scaling methods.
• Again, there is no single best scaling method.
• PCA almost always benefited from scaling.

Conclusions

• Experimenting with multiple scaling methods can dramatically increase your score on classification tasks, even when your hyperparameters are tuned. So, you should consider the scaling method as an important hyperparameter of your model.
• Scaling methods affect different classifiers differently. Distance-based classifiers like SVM, KNN, and MLP (neural network) dramatically benefit from scaling. But even trees (CART, RF), which are agnostic to some of the scaling methods, can benefit from other methods.
• Knowing the underlying math behind models/preprocessing methods is the best way to understand the results (for example, how trees work and why some of the scaling methods didn’t affect them). It can also save you a lot of time if you know not to apply StandardScaler when your model is Random Forest.
• Preprocessing methods like PCA that are known to benefit from scaling do benefit from scaling. When they don’t, it might be due to a bad choice of the number of components for PCA, outliers in the data, or a bad choice of scaling method.

If you find some mistakes or have proposals to improve the coverage or the validity of the experiments, please notify me.

Thanks to Dvir Cohen and Nicholas Hoernle.

Bio: Shay Geller is a full-time AI and NLP researcher and Master’s student of AI and Data Science at Ben Gurion University.

Original. Reposted with permission.
