Using Ensembles in Kaggle Data Science Competitions – Part 1
How do you win machine learning competitions? Gain an edge over the competition by learning model ensembling. Take a look at Henk van Veen's insights on how to get better results!
Use for Kaggle: Forest Cover Type prediction [Try with code]
- The forest cover type prediction challenge uses the Forest CoverType dataset. The dataset has 54 attributes and 7 classes.
- We create a simple starter model with a 500-tree Random Forest. We then create a few more models and pick the best performing one. For this task and our model selection an ExtraTreesClassifier works best.
- We then use a weighted majority vote. Usually we want to give a better model more weight in the vote, so in our case the vote of the best model counts 3 times, while the other 4 models each count for one vote.
- The reasoning is as follows: the only way for the inferior models to overrule the best model (the expert) is for them to collectively (and confidently) agree on an alternative.
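A minimal sketch of such a weighted majority vote, assuming we already have class predictions from the five models (the predictions, labels, and weights below are illustrative, not from the original post):

```python
from collections import Counter

# Hypothetical class predictions for one test sample from 5 models;
# the first entry is the best model (e.g. the ExtraTreesClassifier).
predictions = ["class_2", "class_3", "class_3", "class_2", "class_1"]
weights = [3, 1, 1, 1, 1]  # the best model's vote counts 3 times

def weighted_majority_vote(predictions, weights):
    votes = Counter()
    for label, weight in zip(predictions, weights):
        votes[label] += weight
    # Return the label with the highest weighted vote count.
    return votes.most_common(1)[0][0]

print(weighted_majority_vote(predictions, weights))  # -> "class_2" (4 votes vs 2 and 1)
```

With these weights the four weaker models can only overrule the expert when they all agree on the same alternative; three of them agreeing only produces a tie.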
Use for Kaggle: CIFAR-10 Object Recognition in Images [Try with code]
- CIFAR-10 is another multi-class classification challenge where accuracy matters.
- A voting ensemble of around 30 convnet submissions (all scoring above 90% accuracy) was used by the team leader. The best single model of the ensemble scored 0.93170.
- The voting ensemble of 30 models scored 0.94120, a ~0.01 reduction in error rate, pushing the resulting score beyond the estimated human classification accuracy.
[Read Winners Interview]
Averaging
Averaging, which is taking the mean of individual model predictions, works well for a wide range of problems (both classification and regression) and metrics (AUC, squared error, or logarithmic loss). An often-heard shorthand for this on Kaggle is “bagging submissions”. Averaging predictions often reduces overfit.
Kaggle use: Bag of Words Meets Bags of Popcorn [Get the code]
- This is a movie sentiment analysis contest.
- The perceptron is a decent linear classifier which is guaranteed to find a separation if the data is linearly separable.
- This is a welcome property to have, but you have to realize a perceptron stops learning once this separation is reached. It does not necessarily find the best separation for new data.
- As an example, if we initialize 5 perceptrons with random weights and combine their predictions through an average, we get an improvement on the test set (see the sketch after this list).
- Bagging a single poorly cross-validated and overfitted submission may even bring you some gain through adding diversity (thus less correlation).
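A minimal sketch of averaging submissions, assuming several submission files that share an id column and a single prediction column (the file and column names here are assumptions for illustration):

```python
import pandas as pd

# Hypothetical submission files, e.g. from 5 randomly initialized perceptrons.
files = ["perceptron_1.csv", "perceptron_2.csv", "perceptron_3.csv",
         "perceptron_4.csv", "perceptron_5.csv"]
submissions = [pd.read_csv(f) for f in files]

# Keep the id column from the first submission and average the predictions.
ensemble = submissions[0].copy()
ensemble["sentiment"] = sum(s["sentiment"] for s in submissions) / len(submissions)

ensemble.to_csv("average_ensemble.csv", index=False)
```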
Rank averaging
When averaging the outputs from multiple different models, some problems can pop up. Not all predictors are perfectly calibrated: they may be over- or under-confident when predicting a low or high probability. Or the predictions cluster around a certain range.
In the extreme case you may have a submission which looks like this:

Id,Prediction(1)
1,0.35000056
2,0.35000002
3,0.35000098
4,0.35000111

Such a prediction may do well on the leaderboard when the evaluation metric is ranking or threshold based, like AUC. But when averaged with another model like:

Id,Prediction(2)
1,0.57
2,0.04
3,0.96
4,0.99

it will not change the ensemble much at all.

Our solution is to first turn the predictions into ranks, then average these ranks:

Id,Rank,Prediction(1)
1,1,0.35000056
2,0,0.35000002
3,2,0.35000098
4,3,0.35000111

After normalizing the averaged ranks between 0 and 1 you are sure to get an even distribution in your predictions. The resulting rank-averaged ensemble:

Id,Prediction
1,0.33
2,0.0
3,0.66
4,1.0
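A minimal sketch of rank averaging two such submission files, using scipy's rankdata to rank each model's predictions before averaging and normalizing (the file names and the use of rankdata are my assumptions, not prescribed by the original post):

```python
import pandas as pd
from scipy.stats import rankdata

# Hypothetical submission files with columns "Id" and "Prediction".
files = ["submission_1.csv", "submission_2.csv"]
submissions = [pd.read_csv(f) for f in files]

# Turn each model's predictions into ranks, then average the ranks.
ranks = [rankdata(s["Prediction"]) for s in submissions]
mean_rank = sum(ranks) / len(ranks)

# Normalize the averaged ranks to [0, 1] for the final ensemble submission.
ensemble = submissions[0][["Id"]].copy()
ensemble["Prediction"] = (mean_rank - mean_rank.min()) / (mean_rank.max() - mean_rank.min())

ensemble.to_csv("rank_average_ensemble.csv", index=False)
```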
Historical ranks
So what do you do when you want predictions for a single new sample? It's simple!
Store the old test set predictions together with their rank. Now when you predict a new test sample like 0.35000110, you find the closest old prediction and take its historical rank (in this case rank 3, for 0.35000111).
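A minimal sketch of such a historical-rank lookup, reusing the toy predictions above (the helper function and its name are hypothetical):

```python
# Old test set predictions stored together with their rank (toy numbers from above).
historical = {0.35000056: 1, 0.35000002: 0, 0.35000098: 2, 0.35000111: 3}

def historical_rank(new_prediction, historical):
    # Find the closest old prediction and reuse its rank.
    closest = min(historical, key=lambda old: abs(old - new_prediction))
    return historical[closest]

print(historical_rank(0.35000110, historical))  # -> 3, the rank of 0.35000111
```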
Kaggle use case: Acquire Valued Shoppers Challenge
- The goal of the shopper challenge was to rank the chance that a shopper would become a repeat customer.
- The average of multiple Vowpal Wabbit models was taken together with an R GLMNet model. Then a ranking average was used to improve the exact same ensemble.
Using Ensembles in Kaggle Data Science Competitions – Part 2
How are you planning to implement this? Share your thoughts!
Original post: Kaggle Ensembling Guide by Henk van Veen
Related:
- 10 Steps to Success in Kaggle Data Science Competitions
- Top 10 R Packages to be a Kaggle Champion
- Should Data Science Really Do That?