Dealing with Unbalanced Classes, SVMs, Random Forests®, and Decision Trees in Python

An overview of dealing with unbalanced classes, and implementing SVMs, Random Forests, and Decision Trees in Python.

Visualizing The Decision Boundary

A trained classifier takes in X and tries to predict the target variable Y. You can visualize how the classifier translates different inputs X into a guess for Y by plotting the classifier’s prediction probability (that is, for a given class c, the assigned probability that Y=c) as a function of the features X. Most classifiers in scikit-learn have a method called predict_proba that computes this quantity for new examples after the classifier has been trained. Plotting it over the feature space gives the classifier’s decision surface, and the decision boundary between the predicted classes is one common visual summary of a classifier.
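As a minimal sketch of what predict_proba returns, the snippet below trains a small random forest and asks for class probabilities on a few rows. The dataset is an assumption: it uses scikit-learn's bundled load_wine data as a stand-in, since the wine data used in this post is not shown.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

# Stand-in data (assumption): scikit-learn's bundled wine dataset,
# 178 samples, 13 features, 3 classes.
X, y = load_wine(return_X_y=True)

clf = RandomForestClassifier(n_estimators=15, random_state=0)
clf.fit(X, y)

# predict_proba returns one row per example and one column per class;
# the columns follow the order of clf.classes_ and each row sums to 1.
proba = clf.predict_proba(X[:5])
print(clf.classes_)
print(proba.round(2))
```

Each column of proba is the assigned probability that Y equals the corresponding entry of clf.classes_, which is the quantity plotted in a decision surface.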

Decision surface visualizations are really only meaningful if they are plotted against inputs X that are one- or two-dimensional. So before I plot these surfaces, I will first find two “important” dimensions of X to focus on. In my previous blog posts I discussed Truncated SVD and PCA for dimensionality reduction. Here, I will use a different dimension reduction method based on random forests.

Random forests allow you to compute a heuristic for determining how “important” a feature is in predicting a target. This heuristic measures the change in prediction accuracy if you take a given feature and permute (scramble) it across the data points in the training set. The more the accuracy drops when the feature is permuted, the more “important” we can conclude the feature is. Importance can be a useful way to select a small number of features for visualization. This heuristic is called variable importance; I talked about it in my previous post.
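The permutation heuristic described above can be sketched with scikit-learn's permutation_importance helper from sklearn.inspection. This is an illustration under assumptions, not the code behind this post: it uses the bundled load_wine dataset as a stand-in, and it permutes features on the training data as described here, although a held-out set is often preferred in practice.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Stand-in data (assumption): scikit-learn's bundled wine dataset.
data = load_wine()
X, y = data.data, data.target

forest = RandomForestClassifier(n_estimators=15, random_state=0).fit(X, y)

# Shuffle each feature column n_repeats times and record how much the
# classifier's accuracy drops relative to the unshuffled data.
result = permutation_importance(forest, X, y, n_repeats=10, random_state=0)

# Rank features by their mean accuracy drop, largest first.
ranked = sorted(
    zip(data.feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, drop in ranked[:5]:
    print(f"{name:30s} {drop:.3f}")
```

Features near the top of the ranking are the ones whose scrambling hurts accuracy the most, so they are natural candidates for a two-dimensional plot.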

Now, I am going to train a random forest classifier on the wine data using 15 trees. I am going to use the feature_importances_ attribute of the classifier to obtain the relative importance of the features, which are the columns of the dataframe. (In scikit-learn this attribute reports impurity-based importances, a heuristic related to, but not the same as, the permutation measure described above.) Then I draw a simple bar plot to show the relative importance of the named features.
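A rough sketch of this step follows. It assumes scikit-learn's bundled load_wine data in place of the wine dataframe used here, and plot_importances is a hypothetical helper name introduced only for illustration; matplotlib is imported inside it so the ranking itself runs without a plotting backend.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

# Stand-in data (assumption): scikit-learn's bundled wine dataset.
wine = load_wine()
feature_names = np.array(wine.feature_names)

# Random forest with 15 trees, as in the text.
forest = RandomForestClassifier(n_estimators=15, random_state=0)
forest.fit(wine.data, wine.target)

# Relative importances, normalized so that they sum to 1.
importances = forest.feature_importances_
order = np.argsort(importances)  # ascending, so the top feature is last

for name, value in zip(feature_names[order][::-1], importances[order][::-1]):
    print(f"{name:30s} {value:.3f}")


def plot_importances(names, values):
    """Hypothetical helper: horizontal bar chart of feature importances."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, ax = plt.subplots(figsize=(6, 5))
    positions = np.arange(len(values))
    ax.barh(positions, values, align="center")
    ax.set_yticks(positions)
    ax.set_yticklabels(names)
    ax.set_xlabel("Relative importance")
    fig.tight_layout()
    return fig
```

Calling plot_importances(feature_names[order], importances[order]) and then showing or saving the returned figure draws the bars with the most important feature at the top.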

[Figure: Feature importance in a random forest, plotted as a horizontal bar graph with matplotlib]

It is always nice to visualize the features, and the matplotlib documentation has very good plotting recipes. I used one such recipe to construct the horizontal bar chart above.

Then I plot the decision surfaces of three classifiers: a decision tree, a random forest with the number of trees set to 15, and a support vector machine with C set to 100 and gamma set to 1.0.
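One way to compute such surfaces is sketched below. Several choices here are assumptions rather than the setup behind the plots in this post: the bundled load_wine data with a binary target (class 0 versus the rest), the two features ranked highest by a random forest, standardized inputs, and probability=True on the SVM so that it exposes predict_proba. plot_surfaces is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): bundled wine data, binary target "class 0 or not".
wine = load_wine()
y = (wine.target == 0).astype(int)

# Pick the two features a random forest ranks as most important.
ranker = RandomForestClassifier(n_estimators=15, random_state=0).fit(wine.data, y)
top2 = np.argsort(ranker.feature_importances_)[-2:]
X2 = StandardScaler().fit_transform(wine.data[:, top2])  # scaling is my choice

models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest (15 trees)": RandomForestClassifier(n_estimators=15, random_state=0),
    "SVM (C=100, gamma=1.0)": SVC(C=100, gamma=1.0, probability=True, random_state=0),
}

# Regular grid covering the two selected features.
xx, yy = np.meshgrid(
    np.linspace(X2[:, 0].min() - 0.5, X2[:, 0].max() + 0.5, 100),
    np.linspace(X2[:, 1].min() - 0.5, X2[:, 1].max() + 0.5, 100),
)
grid = np.c_[xx.ravel(), yy.ravel()]

# Probability of the positive class at every grid point, per model.
surfaces = {}
for name, model in models.items():
    model.fit(X2, y)
    surfaces[name] = model.predict_proba(grid)[:, 1].reshape(xx.shape)


def plot_surfaces(surfaces, xx, yy, X2, y):
    """Hypothetical helper: one filled contour plot per classifier."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, axes = plt.subplots(1, len(surfaces), figsize=(5 * len(surfaces), 4))
    for ax, (name, zz) in zip(axes, surfaces.items()):
        ax.contourf(xx, yy, zz, levels=np.linspace(0, 1, 21), cmap="RdBu")
        ax.scatter(X2[:, 0], X2[:, 1], c=y, cmap="RdBu", edgecolor="k", s=15)
        ax.set_title(name)
    return fig
```

Calling plot_surfaces(surfaces, xx, yy, X2, y) draws the three panels side by side, with the training points overlaid on each probability surface.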

[Figure: Decision boundaries of the decision tree, random forest, and SVM classifiers in Python scikit-learn]

The decision surfaces for the decision tree and random forest are very complex. The decision tree is by far the most sensitive, showing only extreme classification probabilities that are heavily influenced by single points. The random forest shows lower sensitivity, with isolated points having much less extreme classification probabilities. The SVM is the least sensitive, since it has a very smooth decision boundary.

The SVM implementation in sklearn has an optional parameter class_weight. This parameter is set to None by default, but it also provides an ‘auto’ mode (renamed ‘balanced’ in later scikit-learn releases), which uses the values of the labels Y to automatically adjust weights inversely proportional to class frequencies. I am going to draw the decision boundaries for two SVM classifiers. I am going to use C=1.0 and gamma=1.0 for both models, but for the first SVM I set class_weight to None, and for the second SVM I set class_weight to ‘auto’.
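The two models can be set up roughly as follows. This sketch assumes a recent scikit-learn, where the automatic mode is spelled "balanced", and it uses a synthetic unbalanced dataset from make_classification as a stand-in for the data in this post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Synthetic stand-in data (assumption): about 90% negatives, 10% positives.
X, y = make_classification(
    n_samples=500, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)

# Same C and gamma for both models; only class_weight differs.
unweighted = SVC(C=1.0, gamma=1.0, class_weight=None).fit(X, y)
weighted = SVC(C=1.0, gamma=1.0, class_weight="balanced").fit(X, y)

# The "balanced" weights are n_samples / (n_classes * count_of_class),
# so the rare class receives the larger weight.
weights = compute_class_weight("balanced", classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), weights.round(2))))

print("positives predicted, equal weights:   ", int(unweighted.predict(X).sum()))
print("positives predicted, balanced weights:", int(weighted.predict(X).sum()))
```

The printed weights show how strongly errors on the minority class are penalized relative to errors on the majority class.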

[Figure: SVM decision boundary with class_weight in auto mode]

[Figure: SVM decision boundary without class weights]


The first SVM, with equal class weights, classifies only a small subset of the positive training points correctly, but it produces very few false positive predictions on the training set. Thus, it has higher precision but lower recall than the second SVM with the auto weighting option. The overall performance of the SVMs seems to be quite poor, with a lot of misclassified data points for both models. To improve the performance you would have to tune the parameters (such as C, gamma, and class_weight).
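To make the precision and recall comparison concrete, and to show one possible way of tuning the parameters, here is a hedged sketch. The synthetic data, the parameter grid, and the choice of F1 as the tuning score are all assumptions made for illustration; "balanced" is the spelling of the automatic weighting mode in recent scikit-learn releases.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): about 90% negatives, 10% positives.
X, y = make_classification(
    n_samples=500, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)

# Training-set precision and recall for the two weighting schemes.
scores = {}
for label, class_weight in [("equal", None), ("balanced", "balanced")]:
    pred = SVC(C=1.0, gamma=1.0, class_weight=class_weight).fit(X, y).predict(X)
    scores[label] = (
        precision_score(y, pred, zero_division=0),
        recall_score(y, pred),
    )
    print(f"{label:9s} precision={scores[label][0]:.2f} recall={scores[label][1]:.2f}")

# Cross-validated search over an illustrative grid, scored by F1 so that
# both precision and recall on the rare class matter.
search = GridSearchCV(
    SVC(),
    param_grid={
        "C": [0.1, 1, 10, 100],
        "gamma": [0.01, 0.1, 1.0],
        "class_weight": [None, "balanced"],
    },
    scoring="f1",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Unlike the training-set numbers in the loop, the grid search score is computed on held-out folds, so it gives a less optimistic picture of each parameter combination.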

What other things can you do to improve the performance of the classifiers?

Bio: Manu Jeevan writes about Data Science, Digital analytics and Growth hacking at Big Data Examiner. He believes that analytics is not about fancy mathematics or algorithms; it is about understanding your relationship with the customers who are most important to you and being aware of the potential in that relationship.

Original. Reposted with permission.


RANDOM FORESTS and RANDOMFORESTS are registered marks of Minitab, LLC.