Dealing with Unbalanced Classes, SVMs, Random Forests®, and Decision Trees in Python
An overview of dealing with unbalanced classes, and implementing SVMs, Random Forests, and Decision Trees in Python.
Visualizing The Decision Boundary
A trained classifier takes in X and tries to predict the target variable Y. You can visualize how the classifier translates different inputs X into a guess for Y by plotting the classifier’s prediction probability (that is, for a given class c, the assigned probability that Y=c) as a function of the features X. Most classifiers in scikit-learn have a method called predict_proba that computes this quantity for new examples after the classifier has been trained. One common visual summary of a classifier is its decision boundary, the set of inputs where the predicted class changes.
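As a rough illustration of the idea above, the sketch below evaluates predict_proba on a grid of points. The synthetic data from make_classification, the 15-tree forest, and the grid size are assumptions chosen only for the example; they are not the data or settings used in this post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-feature data, used here only as a stand-in for real inputs X.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

clf = RandomForestClassifier(n_estimators=15, random_state=0).fit(X, y)

# Build a regular grid that covers the range of both features.
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 100),
    np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 100),
)
grid = np.c_[xx.ravel(), yy.ravel()]

# predict_proba returns one column per class; column 1 is P(Y=1 | X).
proba = clf.predict_proba(grid)[:, 1].reshape(xx.shape)
print(proba.shape, proba.min(), proba.max())
```

The resulting array can be passed to a contour plot, for example plt.contourf(xx, yy, proba), to draw the probability surface.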
Decision surface visualizations are really only meaningful if they are plotted against inputs X that are one- or two-dimensional. So before I plot these surfaces, I will first find two “important” dimensions of X to focus on. In my previous blog posts I discussed Truncated SVD and PCA for dimensionality reduction. Here, I will use a different dimension reduction method based on random forests.
Random forests allow you to compute a heuristic for determining how “important” a feature is in predicting a target. This heuristic measures the change in prediction accuracy if you take a given feature and permute (scramble) it across the data points in the training set. The more the accuracy drops when the feature is permuted, the more “important” we can conclude the feature is. Importance can be a useful way to select a small number of features for visualization. This heuristic is called variable importance; I covered it in my previous post.
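The permutation heuristic described above can be sketched with scikit-learn's permutation_importance helper. This is only an illustration under assumptions: it uses the small wine dataset bundled with scikit-learn (load_wine) as a stand-in, which may differ from the wine data used in this post, and the choice of 10 repeats is arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Stand-in dataset (assumption): scikit-learn's bundled wine data.
data = load_wine()
X, y = data.data, data.target

forest = RandomForestClassifier(n_estimators=15, random_state=0).fit(X, y)

# Shuffle each feature in turn and record the average drop in accuracy.
result = permutation_importance(forest, X, y, n_repeats=10, random_state=0)

# Rank features from largest to smallest mean accuracy drop.
ranked = sorted(zip(data.feature_names, result.importances_mean),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name:30s} {score:.4f}")
```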
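Now, I am going to train a random forest classifier on the wine data using 15 trees. I am going to use the feature_importances_ attribute of the classifier to obtain the relative importance of the features. Note that in scikit-learn this attribute is impurity-based (mean decrease in impurity) rather than permutation-based, so it only approximates the heuristic described above. These features are the columns of the dataframe. Then I plot a simple bar plot to show the relative importance of the named features.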
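A minimal sketch of this step might look like the following. It assumes scikit-learn's bundled load_wine data as a stand-in for the post's wine dataframe, so the feature names and rankings will not necessarily match the original chart.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

# Stand-in dataset (assumption): in the post the features are dataframe columns.
data = load_wine()
X, y = data.data, data.target
names = np.array(data.feature_names)

# Random forest with 15 trees, as described in the text.
forest = RandomForestClassifier(n_estimators=15, random_state=0).fit(X, y)
importances = forest.feature_importances_

# Sort ascending so the most important feature ends up at the top of the chart.
order = np.argsort(importances)

fig, ax = plt.subplots(figsize=(6, 4))
ax.barh(list(names[order]), importances[order])
ax.set_xlabel("Relative importance")
ax.set_title("Random forest feature importances")
fig.tight_layout()
# plt.show()  # uncomment to display the chart interactively

# The two highest-ranked features are candidates for the 2-D visualization.
top_two = names[order][-2:][::-1]
print(top_two)
```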
It is always nice to visualize the features, and the matplotlib documentation has very good plotting recipes. I used one such recipe to construct the horizontal bar chart above.
The decision surfaces for the decision tree and random forest are very complex. The decision tree is by far the most sensitive, showing only extreme classification probabilities that are heavily influenced by single points. The random forest shows lower sensitivity, with isolated points having much less extreme classification probabilities. The SVM is the least sensitive, since it has a very smooth decision boundary.
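One way to put a number on this comparison is to measure how much of a grid receives a near-certain probability from each model. The sketch below does that on synthetic data. The dataset, the 0.01/0.99 thresholds, and the use of SVC(probability=True) to obtain probabilities are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-feature data (assumption), standing in for the two selected features.
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=15, random_state=0),
    # probability=True enables predict_proba for the SVM.
    "SVM": SVC(C=1.0, gamma=1.0, probability=True, random_state=0),
}

xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 50),
    np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 50),
)
grid = np.c_[xx.ravel(), yy.ravel()]

# Share of grid points where the predicted probability is nearly 0 or 1.
extreme_share = {}
for name, model in models.items():
    model.fit(X, y)
    p = model.predict_proba(grid)[:, 1]
    extreme_share[name] = float(np.mean((p < 0.01) | (p > 0.99)))
    print(f"{name:15s} share of near-certain predictions: {extreme_share[name]:.2f}")
```

A fully grown decision tree has pure leaves, so every grid point gets a probability of exactly 0 or 1, which matches the "only extreme probabilities" behaviour described above.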
The SVM implementation in sklearn has an optional parameter class_weight. This parameter is set to None by default, but it also provides an automatic mode, which uses the values of the labels Y to adjust weights inversely proportional to class frequencies. In current scikit-learn releases this mode is called ‘balanced’; older releases called it ‘auto’. I am going to draw the decision boundaries for two SVM classifiers. I am going to use C=1.0 and gamma=1.0 for both models, but for the first SVM I set class_weight to None, and for the second SVM I set class_weight to the automatic mode.
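To make the weighting concrete, the sketch below computes the inverse-frequency weights and fits the two SVMs. It assumes a recent scikit-learn (hence class_weight="balanced") and uses synthetic imbalanced data as a stand-in for the post's dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced data (assumption): roughly 90% negatives, 10% positives.
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.9, 0.1], random_state=0)

# "balanced" weights are n_samples / (n_classes * count of each class).
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes.tolist(), weights.round(2).tolist())))

# Same C and gamma for both models; only class_weight differs.
svm_equal = SVC(C=1.0, gamma=1.0, class_weight=None).fit(X, y)
svm_balanced = SVC(C=1.0, gamma=1.0, class_weight="balanced").fit(X, y)

print("predicted positives, equal weights:   ", int(svm_equal.predict(X).sum()))
print("predicted positives, balanced weights:", int(svm_balanced.predict(X).sum()))
```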
The first SVM, with equal class weights, classifies only a small subset of the positive training points correctly, but it produces very few false positive predictions on the training set. Thus, it has higher precision but lower recall than the second SVM with the automatic weighting option. The overall performance of the SVMs seems to be quite poor, with a lot of misclassified data points for both models. To improve the performance, you would have to tune the parameters (C and class_weight).
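The comparison and the tuning step could be sketched roughly as follows. The synthetic data, the parameter grid, and the choice of F1 as the tuning score are assumptions for illustration, not the settings behind the results described above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic imbalanced data (assumption), standing in for the post's dataset.
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.9, 0.1], random_state=0)

# Precision and recall on the training set for both weighting options.
scores = {}
for label, cw in [("equal", None), ("balanced", "balanced")]:
    pred = SVC(C=1.0, gamma=1.0, class_weight=cw).fit(X, y).predict(X)
    scores[label] = (
        precision_score(y, pred, zero_division=0),
        recall_score(y, pred, zero_division=0),
    )
    print(f"{label:9s} precision={scores[label][0]:.2f} recall={scores[label][1]:.2f}")

# Tune C and class_weight with 5-fold cross-validation, scoring by F1.
param_grid = {"C": [0.1, 1.0, 10.0], "class_weight": [None, "balanced"]}
grid = GridSearchCV(SVC(gamma=1.0), param_grid, scoring="f1", cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Scoring on held-out folds rather than on the training set gives a less optimistic picture of each setting, which is why the search uses cross-validation.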
What other things can you do to improve the performance of the classifiers?
Bio: Manu Jeevan writes about Data Science, Digital Analytics, and Growth Hacking at Big Data Examiner. He believes that analytics is not about fancy mathematics or algorithms; it is about understanding your relationship with the customers who are most important to you and being aware of the potential in that relationship.
Original. Reposted with permission.
RANDOM FORESTS and RANDOMFORESTS are registered marks of Minitab, LLC.