Dealing with Unbalanced Classes, SVMs, Random Forests®, and Decision Trees in Python
An overview of dealing with unbalanced classes, and implementing SVMs, Random Forests, and Decision Trees in Python.
By Manu Jeevan, Big Data Examiner.
So far I have talked about decision trees and ensembles, and I hope I have conveyed the logic behind these concepts without getting too deep into the mathematical details. In this post, let's get into action: I will implement the concepts covered in those two blog posts. The only concept I haven't discussed is SVMs; I suggest you watch Professor Andrew Ng's week 7 videos on Coursera.
Can a winemaker predict how a wine will be received based on the chemical properties of the wine? Are there chemical indicators that correlate more strongly with the perceived “quality” of a wine? In this problem we’ll examine the wine quality dataset hosted on the UCI website. This data records 11 chemical properties (such as the concentrations of sugar, citric acid, and alcohol, and the pH) of thousands of red and white wines from northern Portugal, as well as the quality of the wines, recorded on a scale from 1 to 10. In this problem, we will only look at the data for red wine.
Let me first import the libraries.
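Something along these lines will do (pandas and NumPy for the data, matplotlib and seaborn for the plots; the scikit-learn imports come later):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns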
Collecting And Transforming Data
I import only the data for red wine, then I build a pandas dataframe and print the head.
wine_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')
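Printing the head is then just:

wine_df.head()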
I have the feature data, usually labeled as X, and the target data, labeled Y. Every row in the matrix X is a data point (i.e. a wine) and every column in X is a feature of the data (e.g. pH). For a classification problem, Y is a column vector containing the class of every data point.
I will use the quality column as my target variable. I am going to save the quality column as a separate numpy array (labeled Y) and remove (drop) the quality column from the dataframe.
Also, I will simplify the problem as a binary one in which wines are either “bad” (score < 7) or “good” (score ≥ 7). This means that I am going to change the Y array accordingly such that it only contains zeros (“bad” wines) and ones (“good” wines). For example, if originally Y = [1,3,8,4,7], the new Y should be [0,0,1,0,1].
I then use the as_matrix function in Pandas to save the feature information in my data frame as a numpy array. This is my X matrix.
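A sketch of these steps (I use .values here, since as_matrix has been removed from recent pandas releases, but it plays the same role):

# save the quality column as the target array Y
Y = wine_df['quality'].values
# binarize: wines scoring 7 or above are "good" (1), the rest are "bad" (0)
Y = np.where(Y >= 7, 1, 0)
# drop the quality column and keep the 11 chemical features as the matrix X
X = wine_df.drop('quality', axis=1).values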
Visualizing The Classification Scores
My goal is to predict the target Y (the quality of the wine) as a function of the features X. In the previous section I defined Y as a binary variable (bad as 0 and good as 1), so this is a classification problem. First I will use random forests to classify the quality of the wine; later on I will implement SVMs and decision trees on this data set.
As you know, a random forest aggregates a group of decision trees. It adds randomness in two ways: first, it samples with replacement (bootstrap sampling) from the training data and fits a tree to each of these samples; second, when splitting on a feature in a tree, it considers only a random subset of the variables.
There are many ways to construct a random forest — these differences in the method of construction are described as tuning parameters. One of the most important tuning parameters in building a random forest is the number of trees to construct.
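In scikit-learn these choices map directly onto parameters of RandomForestClassifier; a rough illustration (the particular values here are just examples):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=10,      # number of trees in the forest
    bootstrap=True,       # sample the training data with replacement for each tree
    max_features='sqrt',  # consider a random subset of features at each split
)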
Here, I am going to apply the random forest classifier to the wine data and use cross-validation to explore how the score of the classifier changes when varying the number of trees in the forest. I am going to use the random forest classifier function in the scikit-learn library and the cross_val_score function (using the default scoring method) to plot the scores of the random forests as a function of the number of trees in the random forest, ranging from 1 (simple decision tree) to 40. I am going to use 10-fold cross-validation.
If you don’t know what is meant by parameter selection and cross-validation, please watch week 6 videos of Coursera’s machine learning course.
First I import RandomForestClassifier and cross_val_score from the scikit-learn library. n_estimators is the parameter specifying the number of trees in RandomForestClassifier. Then I create a list of the classifier scores for forests ranging from 1 to 40 trees.
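A sketch of that loop, assuming the X and Y arrays built above (in older scikit-learn releases cross_val_score lives in sklearn.cross_validation instead of sklearn.model_selection):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

n_trees = range(1, 41)
scores = []
for n in n_trees:
    clf = RandomForestClassifier(n_estimators=n)
    # mean accuracy over 10 cross-validation folds
    scores.append(cross_val_score(clf, X, Y, cv=10).mean())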
Let me show you what is going on with a random forest classifier that has 2 trees.
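Something like:

clf = RandomForestClassifier(n_estimators=2)
# one accuracy score per cross-validation fold
print(cross_val_score(clf, X, Y, cv=10))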
I got classification scores for each cross-validation set. Here I have fixed the number of trees at 2 and the number of folds at 10.
I use seaborn to plot the scores.
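A sketch of that plot, assuming the n_trees and scores lists from the loop above:

sns.set_style('whitegrid')
plt.plot(n_trees, scores)
plt.xlabel('Number of trees')
plt.ylabel('Cross-validation score')
plt.show()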
You can see that accuracy seems to improve with additional trees, but you should also weigh the computational cost of fitting more trees against the small accuracy benefit.