A Simple XGBoost Tutorial Using the Iris Dataset
This is an overview of XGBoost, a gradient boosting library known for fast training and strong predictive accuracy. The example performs multiclass classification on the Iris dataset from scikit-learn.
By Ieva Zarina, Software Developer, Nordigen.
The XGBoost algorithm (source).
Installing Anaconda and xgboost
In order to work with the data, I need to install various scientific libraries for Python. The best way I have found is to use Anaconda, which installs the common libraries in one step and makes it easy to add new ones. You can download the installer for Windows, but if you want to install it on a Linux server, you can just copy-paste this into the terminal:
After this, use conda to install pip, which you will need for installing xgboost. It is important to install it using Anaconda (in Anaconda’s directory), so that pip installs other libraries there as well:
Now, a very important step: install the xgboost Python package's dependencies beforehand. From experience, I install these:
I also upgrade my Python virtual environment to avoid trouble with Python versions:
And finally I can install xgboost with pip (keep fingers crossed):
This command installs the latest xgboost version, but if you want to use a previous one, just specify it with:
Now test whether everything has gone well – type python in the terminal and try to import xgboost:
If you see no errors – perfect.
XGBoost Demo with the Iris Dataset
Here I will use the Iris dataset to show a simple example of how to use XGBoost.
First, load the dataset from sklearn, where X is the data and y holds the class labels:
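A minimal sketch of this step, using scikit-learn's bundled copy of the dataset (the variable name `iris` is my own choice):

```python
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data    # feature matrix: 150 samples, 4 features
y = iris.target  # integer class labels: 0, 1, 2

print(X.shape, y.shape)
```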
Then split the data into training and test sets with an 80/20 split:
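One way to do the split is scikit-learn's `train_test_split`. The `random_state=42` seed is an assumption on my part, added only so the split is reproducible:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# test_size=0.2 keeps 20% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```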
Next you need to create the XGBoost-specific DMatrix data format from the numpy array. XGBoost can work with numpy arrays directly, load data from svmlight files, and read other formats. Here is how to work with numpy arrays:
If you want to use svmlight for lower memory consumption, first dump the numpy array into svmlight format and then just pass the filename to DMatrix:
Now, for XGBoost to work, you need to set the parameters:
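A parameter set along these lines might look as follows. The tree depth of 3 and the three classes come from this tutorial. The `eta` value, the `multi:softprob` objective (chosen because the predictions are per-class probabilities), and `num_round = 20` are my assumptions for illustration:

```python
param = {
    "max_depth": 3,                 # maximum depth of each tree
    "eta": 0.3,                     # learning rate (step size shrinkage)
    "objective": "multi:softprob",  # output one probability per class
    "num_class": 3,                 # Iris has three classes
}
num_round = 20  # number of boosting iterations (assumed value)

print(param, num_round)
```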
Different datasets perform better with different parameters. Accuracy can be very poor with one set of parameters and very good with another. You can look at this Kaggle script to see how to search for the best ones. Generally, try eta values of 0.1, 0.2, and 0.3, max_depth in the range of 2 to 10, and num_round of around a few hundred.
Finally the training can begin. You just type:
To see how the model looks, you can also dump it in human-readable form:
And it looks something like this (f0, f1, f2 are features):
You can see that each tree is no deeper than 3 levels as set in the params.
Use the model to predict classes for the test set:
But the predictions look something like this:
Here each column holds the predicted probability for class 0, 1, or 2. For each row, you need to select the column where the probability is highest:
Now you get a nice list with predicted classes:
Determine the precision of this prediction:
Perfect! Now save the model for later use:
Now you have a working model saved for later use and ready to make more predictions.
See the full code on github or below:
Bio: Ieva Zarina is a Software Developer at Nordigen.
Original. Reposted with permission.