Machine Learning 201: Does Balancing Classes Improve Classifier Performance?
The author investigates if balancing classes improves performance for logistic regression, SVM, and Random Forests, and finds where it helps the performance and where it does not.
Enrichment degraded all models’ specificity (i.e. they all make more false positives), logistic regression’s the most dramatically, SVM’s the least.
Based on this experiment, I would say that balancing the classes, or enrichment in general, is of limited value if your goal is to apply class labels. It did improve the performance of random forest, but mostly because random forest was a rather poor choice for this problem in the first place (It would be interesting to do a more comprehensive study of the effect of target prevalence on random forest. Does it often perform poorly with rare classes?).
Enrichment is not a good idea for logistic regression models. If you must do some enrichment, then these results suggest that SVM is the safest classifier to use, and even then you probably want to limit the amount of enrichment to less than five times the target class’s native prevalence — certainly a far cry from balancing the classes, if the target class is very rare.
The Inevitable Caveats
The first caveat is that we only looked at one data set, only three modeling algorithms, and only one specific implementation of each of these algorithms. A more thorough study of this question would consider far more datasets, and more modeling algorithms and implementations thereof.
The second caveat is that we were specifically supplying class labels, using a threshold. I didn’t show it here, but one of the notable issues with the random forest model when it was applied to hold-out was that it no longer scored the datums along the full range of 0-1 (which it did, on the training data); it generally maxed out at around 0.6 or 0.7. This possibly makes using 0.5 as the threshold suboptimal. The following graph was produced with a model trained with the target class at native prevalence, and evaluated on our test set.
The x-axis corresponds to different thresholds for setting class labels, ranging between 0.25 (more permissive about marking datums as positive) and 0.75 (less permissive about marking datums as classifiers). You can see that the random forest model (which didn’t score anything in the test set higher than 0.65) would have better accuracy with a lower threshold (about 0.3). The other two models have fairly close to optimal accuracy at the default threshold of 0.5. So perhaps it’s not fair to look at the classifier performance without tuning the thresholds. However, if you’re tuning a model that was trained on enriched data, you still have to calibrate the threshold on un-enriched data — in which case, you might as well train on un-enriched data, too. In the case of this random forest model, its best accuracy (at threshold=0.3) is about as good as random forest’s accuracy when trained on a balanced data set, again suggesting that balancing the training set doesn’t contribute much. Tuning the threshold may be enough.
However, suppose we don’t need to assign class labels? Suppose we only need the score to sort the datums, hoping to sort most of the items of interest to the top? This could be the case when prioritizing transactions to be investigated as fraudulent. The exact fraud score of a questionable transaction might not matter — only that it’s higher than the score of non-fraudulent events. In this case, would enrichment or class balancing improve? I didn’t try it (mostly because I didn’t think of it until halfway through writing this), but I suspect not.
- Balancing class prevalence before training a classifier does not across-the-board improve classifier performance.
- In fact, it is contraindicated for logistic regression models.
- Balancing classes or enriching target class prevalence may improve random forest classifiers.
- But random forest models may not be the best choice for very unbalanced classes.
- If target class enrichment is necessary (perhaps because of data scarcity issues), SVM may be the safest choice for modeling.
A knitr document of our experiment, along with the accompanying R markdown file, can be downloaded here, along with a copy of the ISOLET data.
Bio: Nina Zumel is a Principal Consultant with Win-Vector LLC, a statistical research and consulting firm based in San Francisco. Her technical interests include statistical learning, data visualization and knowledge representation.
Reposted with permission from here.