Using Ensembles in Kaggle Data Science Competitions – Part 1

How to win Machine Learning Competitions? Gain an edge over the competition by learning Model Ensembling. Take a look at Henk van Veen's insights about how to get improved results!

By Henk van Veen

Model ensembling is a very powerful technique to increase accuracy on a variety of ML tasks. In this article the author shares his ensembling approaches for Kaggle Competitions.

We explore model ensembling in three parts:
• In this first part, we create ensembles from submission files.
• In the second part, we look at creating ensembles through stacked generalization/blending, and the author explains why ensembling reduces the generalization error.
• In the final part, we cover different methods of ensembling, together with their results and code, so you can try them out for yourself.

Creating ensembles from submission files

The most basic and convenient way to ensemble is by ensembling Kaggle submission CSV files. You only need the predictions on the test set for these methods — no need to retrain a model. This makes it a quick way to ensemble already existing model predictions, ideal when teaming up.
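As a minimal sketch of what ensembling submission files can look like, the snippet below takes a majority vote over several files that share the same row ids. The column names `id` and `prediction` and the helper names are assumptions made for illustration; real competitions define their own submission layout.

```python
import csv
from collections import Counter


def majority_vote_rows(submissions):
    """Combine several {id: label} dicts into one by majority vote per id."""
    ensemble = {}
    for row_id in submissions[0]:
        votes = Counter(sub[row_id] for sub in submissions)
        # most_common(1) returns the label with the highest vote count
        ensemble[row_id] = votes.most_common(1)[0][0]
    return ensemble


def read_submission(path, id_col="id", label_col="prediction"):
    """Read one submission CSV into an {id: label} dict (column names are assumed)."""
    with open(path, newline="") as handle:
        return {row[id_col]: row[label_col] for row in csv.DictReader(handle)}


def write_submission(path, ensemble, id_col="id", label_col="prediction"):
    """Write the voted labels back out in the same two-column layout."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([id_col, label_col])
        for row_id, label in ensemble.items():
            writer.writerow([row_id, label])
```

Using an odd number of files avoids ties for binary labels. No model is retrained here: only the predictions already on disk are read and combined.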

Voting ensembles

Consider a simple majority vote ensemble. Let’s see why model ensembling reduces error rate and why it works better to ensemble low-correlated model predictions.

Error correcting codes
• Consider a signal in the form of a binary string like:

1110110011101111011111011011

which gets corrupted with just one bit flipped:

1010110011101111011111011011

During space missions it is very important that all signals are correctly relayed. Such an error could have a devastating effect on lives!
• A coding solution was found in error correcting codes. The simplest error correcting code is a repetition code: relay the signal multiple times in equally sized chunks and take a majority vote, as shown in the table below.
• Signal corruption is a very rare occurrence and often occurs in small bursts, so it is even rarer for the majority vote itself to be corrupted.

Original signal:
1110110011

Encoded (relay multiple times):
10,3 101011001111101100111110110011

Decoding:
1010110011
1110110011
1110110011

Majority vote:
1110110011
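The repetition code above can be sketched in a few lines. This is an illustrative toy, not a production error correcting code; the function names `encode` and `decode` are chosen here for the example.

```python
from collections import Counter


def encode(signal, repeats=3):
    """Repetition code: relay the whole signal `repeats` times."""
    return signal * repeats


def decode(received, chunk_size):
    """Split into equally sized chunks and majority-vote each bit position."""
    chunks = [received[i:i + chunk_size] for i in range(0, len(received), chunk_size)]
    return "".join(Counter(bits).most_common(1)[0][0] for bits in zip(*chunks))


original = "1110110011"
sent = encode(original)
# Flip the second bit of the first copy to simulate corruption in transit.
received = sent[:1] + "0" + sent[2:]
print(decode(received, chunk_size=len(original)))  # 1110110011
```

The single flipped bit is outvoted by the two intact copies, so the decoded signal matches the original.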

Let's understand this with a simple machine learning example

Suppose we have a test set of 10 samples.

The ground truth is all positive ("1"): 1111111111
We furthermore have 3 binary classifiers (A, B, C), each with 70% accuracy.

For now, you can view these classifiers as pseudo-random number generators which output a "1" 70% of the time and a "0" 30% of the time.
We will now show how these pseudo-classifiers are able to obtain 78% accuracy through a voting ensemble.

All three are correct: 0.7 * 0.7 * 0.7 = 0.343
Two are correct: 0.7 * 0.7 * 0.3 + 0.7 * 0.3 * 0.7 + 0.3 * 0.7 * 0.7 = 0.441
Two are wrong: 0.3 * 0.3 * 0.7 + 0.3 * 0.7 * 0.3 + 0.7 * 0.3 * 0.3 = 0.189
All three are wrong: 0.3 * 0.3 * 0.3 = 0.027

The majority vote is correct whenever at least two of the three classifiers are correct, so this ensemble will be correct ~78% of the time on average (0.343 + 0.441 = 0.784).
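The arithmetic can be checked by enumerating every right/wrong pattern of the three classifiers. The sketch assumes, as the pseudo-random generator analogy does, that the classifiers err independently of each other.

```python
from itertools import product

p_correct = 0.7  # accuracy of each classifier, taken from the example
n_models = 3

ensemble_accuracy = 0.0
for outcome in product([True, False], repeat=n_models):
    # Probability of this exact pattern of right/wrong classifiers
    prob = 1.0
    for is_correct in outcome:
        prob *= p_correct if is_correct else 1 - p_correct
    # The vote is right when more than half of the classifiers are right
    if sum(outcome) > n_models / 2:
        ensemble_accuracy += prob

print(round(ensemble_accuracy, 3))  # 0.784
```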

Number of voters

Just as repetition codes gain error-correcting capability when the signal is repeated more times, ensembles usually improve when more members are added.
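To see the effect of the number of voters, the three-classifier calculation can be generalized with the binomial distribution. This is a sketch under the same idealized assumption of independent voters with equal accuracy; real models are correlated, so actual gains are typically smaller. The function name is hypothetical.

```python
from math import comb


def majority_vote_accuracy(p, n):
    """Chance that more than half of n independent voters, each right with probability p, are right."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Odd ensemble sizes avoid ties in a binary vote.
for n in (1, 3, 5, 7, 9):
    print(n, round(majority_vote_accuracy(0.7, n), 4))
```

With p = 0.7, three voters give 0.784 as computed above, and the value keeps rising as voters are added.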

Correlation

Uncorrelated submissions clearly do better when ensembled than correlated submissions.

1111111100 = 80% accuracy
1111111100 = 80% accuracy
1011111100 = 70% accuracy

These models are highly correlated in their predictions. When we take a majority vote we see no improvement:
1111111100 = 80% accuracy

Now we compare to 3 less-performing, but highly uncorrelated models:
1111111100 = 80% accuracy
0111011101 = 70% accuracy
1000101111 = 60% accuracy

When we ensemble these with a majority vote we get:
1111111101 = 90% accuracy
This is an improvement: a lower correlation between ensemble members seems to result in an increase in error-correcting capability.
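Both comparisons can be reproduced with a small script. The prediction strings are the ones from the text; the helper names `majority_vote` and `accuracy` are chosen for this sketch.

```python
def majority_vote(predictions):
    """Majority vote per position over equal-length strings of 0/1 predictions."""
    return "".join(
        "1" if column.count("1") > len(column) / 2 else "0"
        for column in zip(*predictions)
    )


def accuracy(prediction, truth):
    """Fraction of positions where the prediction matches the ground truth."""
    return sum(p == t for p, t in zip(prediction, truth)) / len(truth)


truth = "1111111111"
correlated = ["1111111100", "1111111100", "1011111100"]
uncorrelated = ["1111111100", "0111011101", "1000101111"]

for name, models in (("correlated", correlated), ("uncorrelated", uncorrelated)):
    vote = majority_vote(models)
    print(name, vote, accuracy(vote, truth))
```

The correlated models all miss the last two samples, so the vote cannot recover them. The uncorrelated models make their mistakes in different places, which lets the vote fix most of them.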