Machine Learning 201: Does Balancing Classes Improve Classifier Performance?

The author investigates if balancing classes improves performance for logistic regression, SVM, and Random Forests, and finds where it helps the performance and where it does not.



How Enriching the Training Data Changes Model Performance

First, let’s look at accuracy.

NewImage

The x-axis is the prevalence of the target in the training data; the y-axis gives the accuracy of the model on the test set (with the target at its native prevalence), averaged over ten draws of the training set. The error bars are the bootstrap estimate of the 98% confidence interval around the mean, and the values for the individual runs appear as transparent dots at each value. The dashed horizontal represents the accuracy of a model trained at the target class’s true prevalence, which we’ll call the model’s baseline performance. Logistic regression degraded the most dramatically of the three models as target prevalence increased. SVM degraded only slightly. Random forest improved, although its best performance (when training at about 19% prevalence, or five times native prevalence) is only slightly better than SVM’s baseline performance, and its performance at 50% prevalence is worse than the baseline performance of the other two classifiers.

Logistic regression’s degradation should be no surprise. Logistic regression optimizes deviance, which is strongly distributional; in fact, logistic regression (without regularization) preserves the marginal probabilities of the training data. Since logistic regression is so well calibrated to the training distribution, changes in the distribution will naturally affect model performance.

The observation that SVM’s accuracy stayed very stable is consistent with my surmise that SVM’s training procedure is not strongly dependent on the class distributions.

Now let’s look at precision:

NewImage

All of the models degraded on precision, random forest the most dramatically (since it started at a higher baseline), SVM the least. SVM and logistic regression were comparable at baseline.

Let’s look at recall:

NewImage

Enrichment improved the recall of all the classifiers, random forest most dramatically, although its best performance, at 50% enrichment, is not really any better than SVM’s baseline recall. Again, SVM’s recall moved the least.

Finally, let’s look at specificity: