Introduction to Python Ensembles
In this post, we'll take you through the basics of ensembles — what they are and why they work so well — and provide a hands-on tutorial for building basic ensembles.
Ensembles as averaged predictions
Our foray into ensembles so far has shown us two important aspects of ensembles:
- The less correlation in prediction errors, the better
- The more models, the better
For this reason, it's a good idea to use as different models as possible (as long as they perform decently). So far, we have relied on simple averaging, but later we will see to how use more complex combinations. To keep track of our progress, it is helpful to formalize our ensemble as n models fi averaged into an ensemble e:
There's no limitation on what models to include: decision trees, linear models, kernel-based models, non-parametric models, neural networks or even other ensembles! Keep in mind though that the more models we include, the slower the ensemble becomes.
To build an ensemble of various models, we begin by benchmarking a set of Scikit-learn classifiers on the dataset. To avoid repeating code, we use the below helper functions:
We're now ready to create a prediction matrix P, where each feature corresponds to the predictions made by a given model, and score each model against the test set:
That's our baseline. The Gradient Boosting Machine (GBM) does best, followed by a simple logistic regression. For our ensemble strategy to work, prediction errors must be relatively uncorrelated. Checking that this holds is our first order of business:
Errors are significantly correlated, which is to be expected for models that perform well, since it's typically the outliers that are hard to get right. Yet most correlations are in the 50-80% span, so there is decent room for improvement. In fact, if we look at error correlations on a class prediction basis things look a bit more promising:
To create an ensemble, we proceed as before and average predictions, and as we might expect the ensemble outperforms the baseline. Averaging is a simple process, and if we store model predictions, we can start with a simple ensemble and increase its size on the fly as we train new models.
Ensemble ROC-AUC score: 0.884
Visualizing how ensembles work
We've understood the power of ensembles as an error correction mechanism. This means that ensembles smooth out decision boundaries by averaging out irregularities. A decision boundary shows us how an estimator carves up feature space into neighborhood within which all observations are predicted to have the same class label. By averaging out base learner decision boundaries, the ensemble is endowed with a smoother boundary that generalize more naturally.
The figure below shows this in action. Here, the example is the iris data set, where the estimators try to classify three types of flowers. The base learners all have some undesirable properties in their boundaries, but the ensemble has a relatively smooth decision boundary that aligns with observations. Amazingly, ensembles both increase model complexity and acts as a regularizer!
Example decision boundaries for three models and an ensemble of the three. Source
Another way to understand what is going on in an ensemble when the task is classification, is to inspect the Receiver Operator Curve (ROC). This curve shows us how an estimator trades off precision and recall. Typically, different base learners make different trade offs: some have higher precision by sacrificing recall, and other have higher recall by sacrificing precision.
A non-linear meta learner, on the other hand, is able to, for each training point, adjust which models it relies on. This means that it can significantly reduce necessary sacrifices and retain high precision while increasing recall (or vice versa). In the figure below, the ensemble is making a much smaller sacrifice in precision to increase recall (the ROC is further in the "northeast" corner).
Beyond ensembles as a simple average
But wouldn't you expect more of a boost given the variation in prediction errors? Well, one thing is a bit nagging. Some of the models perform considerably worse than others, yet their influence is just large as better performing models. This can be quite devastating with unbalanced data sets: recall that with soft voting, if a model makes an extreme prediction (i.e close to 0 or 1), that prediction has a strong pull on the prediction average.
An important factor for us is whether models are able to capture the full share of Republican denotations. A simple check shows that all models underrepresent Republican donations, but some are considerably worse than others.
We can try to improve the ensemble by removing the worst offender, say the Multi-Layer Perceptron (MLP):
Truncated ensemble ROC-AUC score: 0.883
Not really an improvement: we need a smarter way of prioritizing between models. Clearly, removing models from an ensemble is rather drastic as there may be instances where the removed model carried important information. What we really want is to learn a sensible set of weights to use when averaging predictions. This turns the ensemble into a parametric model that needs to be trained.
Learning to combine predictions
Learning a weighted average means that for each model fi, we have a weight parameter ωi ∈ (0,1) that assigns our weight to that model's predictions. Weighted averaging requires all weights to sum to 1. The ensemble is now defined as
This is a minor change from our previous definition, but is interesting since, once the models have generated predictions pi = fi(x), learning the weights is the same as fitting a linear regression on those predictions:
with some constraints on the weights. Then again, there is no reason to restrict ourself to fitting just a linear model. Suppose instead that we fit a nearest neighbor model. The ensemble would then take local averages based on the nearest neighbors of a given observation, empowering the ensemble to adapt to changes in model performance as the input varies.
Implementing an ensemble
To build this type of ensemble, we need three things:
- a library of base learners that generate predictions
- a meta learner that learns how to best combine these predictions
- a method for splitting the training data between the base learners and the meta learner.
Base learners are the ingoing models that take the original input and generate a set of predictions. If we have an original data set ordered as a matrix X of shape
(n_samples, n_features), the library of base learners output a new prediction matrix Pbase of size
(n_samples, n_base_learners), where each column represent the predictions made by one of the base learners. The meta learner is trained on Pbase.
This means that it is absolutely crucial to handle the training set X in an appropriate way. In particular, if we both train the base learners on X and have them predict X, the meta learner will be training on the base learner's training error, but at test time it will face their test errors.
We need a strategy for generating a prediction matrix P that reflects test errors. The simplest strategy is to split the full data set X in two: train the base learners on one half and have them predict the other half, which then becomes the input to the meta learner. While simple and relatively fast, we loose quite a bit of data. For small and medium sized data sets, the loss of information can be severe, causing the base learners and the meta learner to perform poorly.
To ensure the full data set is covered, we can use cross-validation, a method initially developed for validating test-set performance during model selection. There are many ways to perform cross-validation, and before we delve into that, let's get a feel for this type of ensemble by implementing one ourselves, step by step.
Step 1: define a library of base learners
These are the models that take the raw input data and generates predictions, and can be anything from linear regression to a neural network to another ensemble. As always, there's strength in diversity! The only thing to consider is that the more models we add, the slower the ensemble will be. Here, we'll use our set of models from before:
Step 2: define a meta learner
Which meta learner to use is not obvious, but popular choices are linear models, kernel-based models (SVMs and KNNS) and decision tree based models. But you could also use another ensemble as "meta learner": in this special case, you end up with a two-layer ensemble, akin to a feed-forward neural network.
Here, we'll use a Gradient Boosting Machine. To ensure the GBM explores local patterns, we restricting each of 1000 decision trees to train on a random subset of 4 base learners and 50% of input data. This way, the GBM will be exposed to each base learner's strength in different neighborhoods of the input space.
Step 3: define a procedure for generating train and test sets
To keep things simple, we split the full training set into a training and prediction set of the base learners. This method is sometimes referred to as Blending. Unfortunately, the terminology differs between communities, so it's not always easy to know what type of cross-validation the ensemble is using.
We now have one training set of the base learners (Xtrain_base,ytrain_base) and one prediction set (Xpred_base,ypred_base) and are ready to generate the prediction matrix for the meta learner.
Step 4: train the base learners on a training set
To train the library of base learners on the base-learner training data, we proceed as usual:
To train the base learners, execute
Step 5: generate base learner predictions
With the base learners fitted, we can now generate a set of predictions for the meta learner to train on. Note that we generate predictions for observations not used to train the base learners. For each observation x(i)pred ∈ Xpred_base in the base learner prediction set, we generate a set of base learner predictions:
If you implement your own ensemble, pay special attention to how you index the rows and columns of the prediction matrix. When we split the data in two, this is not so hard, but with cross-validation things are more challenging.
To generate predictions, execute
6. Train the meta learner
The prediction matrix Pbase reflects test-time performance and can be used to train the meta learner:
That's it! We now have a fully trained ensemble that can be used to predict new data. To generate a prediction for some observation x(j), we first feed it to the base learners. These output a set of predictions
that we feed to the meta learner. The meta learner then gives us the ensemble's final prediction
Now that we have a firm understanding of ensemble learning, it's time to see what it can do to improve our prediction performance on the political contributions data set:
To generate predictions, execute
Ensemble ROC-AUC score: 0.881
As expected, the ensemble beats the best estimator from our previous benchmark, but it doesn't beat the simple average ensemble. That's because we trained the base learners and the meta learner on only half the data, so a lot of information is lost. To prevent this, we need to use a cross-validation strategy.
Training with cross-validation
During cross-validated training of the base learners, a copy of each base learner is fitted on K - 1 folds, and predict the left-out fold. This process is iterated until every fold has been predicted. The more folds we specify, the less data is being left out in each training pass. This makes cross-validated predictions less noisy and a better reflection of performance during test time. The cost is significantly increased training time. Fitting an ensemble with cross-validation is often referred to as stacking, while the ensemble itself is known as the Super Learner.
To understand how cross-validation works, we can think of it as an outer loop over our previous ensemble. The outer loop iterates over K distinct test folds, with the remaining data used for training. The inner loop trains the base learners and generate predictions for the held-out data. Here's a simple stacking implementation:
Let's go over the steps involved here. First, we fit our final base learners on all data: in contrast with our previous blend ensemble, base learners used at test time are trained on all available data. We then loop over all folds, then loop over all base learners to generate cross-validated predictions. These predictions are stacked to build the training set for the meta learner, which too sees all data.
The basic difference between blending and stacking is therefore that stacking allows both base learners and the meta learner to train on the full data set. Using 2-fold cross-validation, we can measure the difference this makes in our case:
Ensemble ROC-AUC score: 0.889
Stacking yields a sizeable increase in performance: in fact, it gives us our best score so far. This outcome is typical for small and medium-sized data sets, where the effect of blending can be severe. As the data set size increases, blending and stacking performs similarly.
Stacking comes with its own set of shortcomings, particularly speed. In general, we need to be aware of there important issues when it comes to implementing ensembles with cross-validation:
- Computational complexity
- Structural complexity (risk of information leakage)
- Memory consumption
It's important to understand these in order to work with ensembles efficiently, so let's go through each in turn.
1. Computational complexity
Suppose we want to use 10 folds for stacking. This would require training all base learners 10 times on 90% of the data, and once on all data. With 4 base learners, the ensemble would roughly be 40 times slower than using the best base learner.
But each cv-fit is independent, so we don't need to fit models sequentially. If we could fit all folds in parallel, the ensemble would only be roughly 4 times slower than the best base learner, a dramatic improvement. Ensembles are prime candidates for parallelization, and it is critical to leverage this capability to the greatest extent possible. Fitting all folds for all models in parallel, the time penalty for the ensemble would be negligible. To hone this point in, below is a benchmark from ML-Ensemble that shows the time it takes to fit an ensemble via stacking or blending either sequentially or in parallel on 4 threads.
Even with this moderate degree of parallelism, we can realize a sizeable reduction in computation time. But parallelization is associated with a whole host of potentially thorny issues such as race conditions, deadlocks and memory explosion.
2. Structural complexity
Once we decide to use the entire training set to meta learner, we must worry about information leakage. This phenomena arises when we mistakenly predict samples that were used during training, for instance by mixing up our folds or using a model trained on the wrong subset. When there's information leakage in the training set of the meta learner, it will not learn to properly correct for base learner predictions errors: garbage in, garbage out. Spotting such bugs though is extremely difficult.
3. Memory consumption
The final issue arises with parallelization, especially by multi-processing as is often the case in Python. In this case, each sub-process has its own memory and therefore needs to copy all data from the parent process. A naive implementation will therefore copy all data to all processes, eating up memory and wasting time on data serialization. Preventing this requires sharing data memory, which in turns easily cause data corruption.
Upshot: use packages
The upshot is that you should use a unit-tested package and focus on building your machine learning pipeline. In fact, once you've settled on a ensemble package, building ensembles becomes really easy: all you need to do is specify the base learners, the meta learner, and a method for training the ensemble.
Fortuntately, there are many packages available in all popular programming languages, though they come in different flavors. At the end of this post, we list some as reference. For now, let's pick one and see how a stacked ensemble does on our political contributions data set. Here, we use ML-Ensemble and build our previous generalized ensemble, but now using 10-fold cross-validation:
It's as simple as that!
Inspecting the ROC-curve of the super learner against the simple average ensemble reveals how leveraging the full data enables the super learner to sacrifice less recall for a given level of precision.
Where to go from here
There are many other types of ensembles than those presented here. However the basic ingredients are always the same: a library of base learners, a meta learner, and a training procedure. By playing around with these components, various specialized forms of ensembles can be created. A good starting point for more advanced material on ensemble learning is this excellent post by mlware.
When it comes to software, it's a matter of taste. As the popularity of ensembles have risen, so has the number of packages available. Ensembles were traditionally developed in the statistics community, so R has had a lead on purpose-built libraries. Several packages have recently been developed in Python and other languages, with more on the way. Each package caters to different needs and are at different stages of maturity, so we recommend shopping around until you find what you are looking for.
Here are a few packages to get you started:
|Python||ML-Ensemble||General ensemble learning|
|Python||Scikit-learn||Bagging, majority voting classifiers. API for stacking in development|
|Python||mlxtend||Regression and Classification ensembles|
|R||SuperLearner||Super Learner ensembles|
|R||caretEnsemble||Ensembles of Caret estimators|
|Mutliple||H20||Distributed stacked ensemble learning. Limited to estimators in the H20 library|
|Java||StackNet||Empowered by H20|
|Web-based||xcessiv||Web-based ensemble learning|
Bio: Sebastian Flennerhag is a machine learning researcher and predictive modeller. He writes about exciting open source projects, predictive modeling and the latest in deep learning.
Original. Reposted with permission.
- Ensemble Learning to Improve Machine Learning Results
- Random Forests(r), Explained
- What is the difference between Bagging and Boosting?