Doing Data Science: A Kaggle Walkthrough Part 6 – Creating a Model
In the final part of this 6 part series on the process of data science, and applying it to a Kaggle competition, building the predictive models is covered, and multiple algorithms are discussed.
By Brett Romero, Open Data Kosovo.
This article is Part VI in a series looking at data science and machine learning by walking through a Kaggle competition. If you have not done so already, you are strongly encouraged to go back and read the earlier parts – (Part I, Part II, Part III, Part IV and Part V).
Continuing on the walkthrough, in this part we build the model that will predict the first booking destination country for each user based on the dataset created in the earlier parts.
Choosing an Algorithm
The first step to building a model is to decide what type of algorithm to use. Below we look at some of the options.
Arguably the most well known algorithm, and one of the simplest conceptually. The decision tree works in a similar manner to the decision tree that you might create when trying to understand which decision to make based on a range of variables.
The goal of the decision tree algorithm used for classification problems (like the one we are looking at) is to create one of these decision trees to classify records into a set number of categories. To do this, it starts with all the records in the training dataset and looks through all the features until it finds the one that allows it to most ‘cleanly’ split the records according to their categories. For example, if you are using daily weather data to try and determine whether it will rain the following day (i.e. there are two categories, ‘it does rain’ and ‘it does not rain’), the algorithm will look for a feature that best splits the records (in this case representing days) into those two categories. When it finds that feature, and the value to split on, it creates one point (‘decision node’) on the decision tree. It then takes each subpopulation and does the same thing again, building up a tree until either all the records are correctly classified, or the number in each subpopulation becomes too small to split. Below is an example decision tree using the described weather data to predict if it will rain tomorrow or not (thanks to Graham Williams’ excellent Rattle package for R):
The way to interpret the above tree is to start at the top. The first criteria the algorithm splits on is the humidity at 3pm. Starting with 100% of the records, if the the humidity at 3pm is less than 71, as it is the case for 93% of the records, we move to the left and find the next decision node. If the humidity at 3pm is greater than or equal to 71, we move to the right, which takes us to a leaf node where the model predicts that there will be rain tomorrow (‘yes’). We can see from the numbers in the node that this represents 7% of all records, and that 74% of the records that reach this node are correctly classified.
The first thing to note is that the model does not accurately predict whether it will rain tomorrow for all records, and in some leaf nodes, it is only slightly better than a coin toss. This is not necessarily a bad thing. The biggest problem that data scientists have with decision trees is the classic problem of overfitting. In the example above, parameters have been set to stop model splitting once the population of records at a given node gets too small (minimum split) and when a certain number of splits have occurred (‘maximum depth’). These values have been set at values to prevent the tree from growing to large. The reason for this is that if the tree gets too large, it will start modelling random noise and hence will not work for data not in the training dataset (it will not ‘generalize’ well).
To picture what this means, imagine extending the example decision tree above further until the model starts splitting out single records using criteria like ‘Humidty3pm = 54’ and ‘Humidty3pm = 31’. That type of decision node may work for this particular training data because there is a specific record that meet that criteria, but it is highly unlikely that it represents any predictive ability and so is unlikely to be accurate if applied to other data.
All this discussion of overfitting with decision trees does however raise an important problem. That problem is how do you know how large you should grow the tree. How do you set the parameters to avoid overfitting but still have an accurate model? The truth is that is is extremely difficult to know how to set the parameters. Set them too conservatively and the model will lose too much predictive power. Set them too aggressively and the model will start overfitting the data.
Seeing the Forest for the Trees
Given the limitations of decisions trees and the risk of overfitting, it may be tempting to think “why bother?” Fortunately, methods have been found to reduce the risk of overfitting and increase predictive power of decisions trees and the two most popular methods both have the same basic premise – to train multiple trees.
One of the most well known algorithms that utilizes decision trees is the ‘random forest’ algorithm. As the name suggests, the algorithm constructs a large number of different trees (as defined by the user) by randomly selecting the features that can be used to build each tree (as opposed to using all the features for each tree). Typically, the trees in a random forest also have the parameters set to ensure each tree will also be relatively shallow, meaning that the algorithm creates a large number of shallow decision trees (decision bonsai?). Once the trees are constructed, each tree is used to predict the outcome for a new record, with these multiple predictions then serving as votes, with a majority rules approach applied.
Another algorithm which has become almost the default algorithm of choice for Kagglers, and is the type of the model we will use, uses a method called ‘boosting’, which means it builds trees iteratively such that each tree ‘learns’ from earlier trees. To do this the algorithm builds a first tree – again typically a shallower tree than if you were going to use a one tree approach – and makes predictions using that tree. Then the algorithm finds the records that are misclassified by that tree, and assigns a higher weight of importance to those records than the records that were correctly classified. The algorithm then builds a new tree with these new weightings. This whole process is repeated as many times as specified by the user. Once the specified number of trees have been built, all the trees built during this process are used to classify the records, with a majority rules approach used to determine the final prediction.
It should be noted that this methodology (‘boosting’) can actually be applied to many classification algorithms, but has really grown popular with the decision tree based implementation. It should also be noted there are different implementations of this algorithm even just using trees. In this case, we will be using the very popular XGBoost algorithm.
So far we have only covered decision trees and decision tree-based algorithms. However, there are a range of different algorithms that can be used for classification problems. Given this is supposed to be a short blog series, I will not go into too much detail on each algorithm here. But if you want more information on these algorithms, or other algorithms that I haven’t covered here, there is a growing amount of information online. I also strongly recommend the Data Science specialization offered by John Hopkins University, for free, on Coursera.
The K-nearest neighbor algorithms are arguably one of the simplest algorithms in concept. The algorithm classifies a given object by looking at the classification of the k most similar records  and seeing how those records are classified. This type of algorithm is called a lazy learner because during the training phase, it essentially just stores the data provided. Only when a new object needs to be classified does the algorithm start looking through the data to try to find the closest matches.
As the name suggests, these algorithms simulate biological networks by creating a series of nodes and connecting them together. A neural network typically consists of three layers; an input layer, a hidden layer (although there can be multiple hidden layers) and an output layer.
A model is trained by passing records through the network and weights adjusted at each node continually adjusted to ensure that the record ends up at the right ‘output node’.
Support Vector Machines
This type of algorithm, commonly used for text classification problems, is arguably the most difficult to visualize. At the simplest level, the algorithm tries to draw straight lines (or planes for classifications with more than 2 features) that best separate the classes provided. Although this sounds like a fairly simplistic approach to classifying objects, it becomes far more powerful due to the transformations (sometimes called a ‘kernel trick’) the algorithm can apply to the data before drawing these lines/planes. The mathematics behind this are far too complex to go into here, but the Wikipedia page has some nice visuals to help picture how this is working. In addition, this video provides a nice example of how a Support Vector Machine can separate classes using this kernel trick: