Dealing with Unbalanced Classes, SVMs, Random Forests®, and Decision Trees in Python
An overview of dealing with unbalanced classes, and implementing SVMs, Random Forests, and Decision Trees in Python.
Evaluating Unbalanced Classes
In binary classification problems, accuracy can be misleading if one class (say, bad wine) is much more common than the other (say, good wine); this is when the classes are unbalanced.
I print the percentage of wines that are labeled as “bad” in the dataset and plot a boxplot, but this time I draw a line across the plot denoting the accuracy of always guessing zero (“bad wine”).
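Here is a minimal sketch of that step, assuming y is a 0/1 NumPy label array (0 = "bad" wine) and cv_scores holds the cross-validation accuracies from the earlier experiments:

```python
import numpy as np
import matplotlib.pyplot as plt

# A sketch: y is assumed to be a 0/1 label array (0 = "bad" wine) and
# cv_scores an array of cross-validation accuracies from the earlier runs.
pct_bad = 100 * np.mean(y == 0)
print("Percentage of bad wines: %.1f%%" % pct_bad)

plt.boxplot(cv_scores)
# Baseline: the accuracy of always guessing 0 ("bad wine").
plt.axhline(np.mean(y == 0), color="red", linestyle="--")
plt.ylabel("accuracy")
plt.show()
```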
When there are unbalanced classes in a dataset, guessing the more common class will often yield very high accuracy. For this reason, you usually want to use different metrics that are less sensitive to imbalance when evaluating the predictive performance of classifiers.
The goal is to successfully identify the members of the positive class (the rare class), whether that is the good wines or patients presenting with a rare disease. For this you need precision and recall.
I am not going to discuss precision and recall here, so please watch Professor Andrew Ng's week 6 system design videos. Because precision and recall both provide valuable information about the quality of a classifier, you often want to combine them into a single general-purpose score. The F1 score is defined as the harmonic mean of recall and precision:
F1 = (2 x recall x precision) / (recall + precision)
The F1 score thus tends to favor classifiers that are strong in both precision and recall, rather than classifiers that emphasize one at the cost of the other.
This may all seem complicated, but computing F1 scores with scikit-learn is easy: you just change the scoring parameter of the cross_val_score function.
You can see that the scores are clustered around the 40% mark, and there is very little gain from increasing the number of trees further.
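For example (a sketch, assuming X and y are the wine features and 0/1 labels used above), switching from accuracy to the F1 score is just a change of the scoring argument:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# A sketch: X and y are assumed to be the wine features and 0/1 labels.
clf = RandomForestClassifier(n_estimators=15)

# Same call as before, but scored with F1 instead of accuracy.
f1_scores = cross_val_score(clf, X, y, cv=10, scoring="f1")
print(f1_scores.mean())
```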
Setting the cutoff value for prediction
Many classifiers (including random forests) can return prediction probabilities: for example, given a point X there is a 70% probability that it belongs to class 1 and a 30% probability that it belongs to class 0. However, when the classes in the training data are unbalanced, these probability estimates can be inaccurate, because many classifiers do not know how to adjust for the imbalance. This problem can be addressed using calibration.
If a classifier’s prediction probabilities are accurate, the appropriate way to convert its probabilities into predictions is to simply choose the class with probability > 0.5. This is the default behavior of classifiers when you call their predict method. When the probabilities are inaccurate, this does not work well, but you can still get good predictions by choosing a more appropriate cutoff. In this section, I will choose a cutoff by cross validation.
First, you have to understand how the predict_proba method works in scikit-learn. I will illustrate this with an example.
I am going to fit a random forest classifier with 15 trees to the wine data. Then I compute the predicted probabilities that the classifier assigns to each of the training examples, which can be done with the predict_proba method. As a test case, I construct predictions from these probabilities, labeling every wine whose predicted probability of being in class 1 exceeds 0.5 with a 1, and 0 otherwise. For example, if the probabilities are [0.1, 0.4, 0.5, 0.6, 0.7], the predictions should be [0, 0, 0, 1, 1].
predict_proba returns two columns: the first column holds the probability of class 0 and the second the probability of class 1.
| Class 0 | Class 1 |
|---------|---------|
| 0.3     | 0.7     |
| 0.89    | 0.11    |
| 0.77    | 0.23    |
| 0.82    | 0.18    |
This was just an example to show you how things work.
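Here is a sketch of that test case, assuming X and y are the wine features and 0/1 labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# A sketch: X and y are assumed to be the wine features and 0/1 labels.
clf = RandomForestClassifier(n_estimators=15)
clf.fit(X, y)

# Column 0 holds P(class 0), column 1 holds P(class 1).
probabilities = clf.predict_proba(X)[:, 1]

# Label a wine 1 when its predicted probability of class 1 exceeds 0.5.
predictions = (probabilities > 0.5).astype(int)

# With a 0.5 cutoff this should agree with the classifier's own predict().
print(np.mean(predictions == clf.predict(X)))
```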
Using 10-fold cross-validation, I am going to find the cutoff in np.arange(0.1, 0.9, 0.1) that gives the best average F1 score when converting the prediction probabilities from a 15-tree random forest classifier into predictions.
The custom_f1(cutoff) function takes a cutoff value from the range above and returns the corresponding F1 score. sklearn.metrics.f1_score takes the true labels and the predicted labels as parameters and returns the F1 score.
Then I use a box plot to show the scores.
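The following is a minimal sketch of that search, assuming X and y are NumPy arrays of wine features and 0/1 labels; the fold loop and plotting details are illustrative rather than the exact original code:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# A sketch: X and y are assumed to be NumPy arrays of features and 0/1 labels.
def custom_f1(cutoff):
    """Return the 10 fold-wise F1 scores obtained with the given cutoff."""
    scores = []
    for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X, y):
        clf = RandomForestClassifier(n_estimators=15)
        clf.fit(X[train_idx], y[train_idx])
        probs = clf.predict_proba(X[test_idx])[:, 1]
        preds = (probs > cutoff).astype(int)
        scores.append(f1_score(y[test_idx], preds))
    return scores

cutoffs = np.arange(0.1, 0.9, 0.1)
all_scores = [custom_f1(c) for c in cutoffs]

# One box of ten F1 scores per candidate cutoff.
plt.boxplot(all_scores)
plt.xticks(range(1, len(cutoffs) + 1), ["%.1f" % c for c in cutoffs])
plt.xlabel("cutoff")
plt.ylabel("F1 score")
plt.show()
```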
A cutoff of about 0.3-0.5 appears to give the best predictive performance. It is intuitive that the cutoff is less than 0.5: because the training data contains many fewer examples of "good" wines, you need to adjust the classifier's cutoff to reflect the fact that good wines are, in general, rarer.