How to Rank 10% in Your First Kaggle Competition

This post presents a path to success in Kaggle competitions for beginners. The path generalizes beyond competitions, however: read on for insight into approaching any data science project.



Model Training

We can improve a model’s performance by tuning its parameters. A model usually has many parameters, but only a few of them have a significant effect on its performance. For example, the most important parameters for a random forest are the number of trees in the forest and the maximum number of features used in growing each tree. We need to understand how models work and what impact each parameter has on the model’s performance, be it accuracy, robustness or speed.

Normally we find the best set of parameters through a process called grid search, which simply iterates through all the possible combinations and keeps the best one.

By the way, random forests usually perform near their best when max_features is set to the square root of the total number of features.
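For illustration, here is a minimal sketch of such a grid search over a random forest using scikit-learn’s GridSearchCV; the toy data and the parameter values are placeholders, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data standing in for a competition's training set.
X_train, y_train = make_classification(n_samples=1000, n_features=20, random_state=0)

# Example grid; the values here are placeholders, not recommendations.
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_features': ['sqrt', 'log2', 0.5],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring='roc_auc',  # use whatever metric the competition uses
    cv=5,
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```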

Here I’d like to stress some points about tuning XGBoost. The following parameters are generally considered to have a real impact on its performance:

  • eta: Step size used in updating weights. Lower eta means slower training but better convergence.
  • num_round: Total number of iterations.
  • subsample: The ratio of training data used in each iteration. This is to combat overfitting.
  • colsample_bytree: The ratio of features used in each iteration. This is like max_features in RandomForestClassifier.
  • max_depth: The maximum depth of each tree. Unlike random forest, gradient boosting would eventually overfit if we do not limit its depth.
  • early_stopping_rounds: If the validation score doesn’t improve for a given number of iterations, training stops early. This also helps combat overfitting.

Usual tuning steps:

  1. Reserve a portion of training set as the validation set.
  2. Set eta to a relatively high value (e.g. 0.05 ~ 0.1) and num_round to 300 ~ 500.
  3. Use grid search to find the best combination of other parameters.
  4. Gradually lower eta until we reach the optimum.
  5. Re-train the model with the best parameters, using the validation set as the watchlist. Observe how the score on the validation set changes in each iteration and find the optimal value for early_stopping_rounds (a sketch of this setup follows the list).
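Here is a rough sketch of steps 1, 2 and 5 using XGBoost’s native API; the toy data and the parameter values are placeholders, not recommendations:

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy data; replace with the competition's training set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Step 1: reserve a portion of the training set as the validation set.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

dtrain = xgb.DMatrix(X_tr, label=y_tr)
dval = xgb.DMatrix(X_val, label=y_val)

params = {
    'eta': 0.05,              # step 2: start with a relatively high eta
    'max_depth': 6,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'seed': 0,                # record the seed so the run is reproducible
}

# Step 5: watch the validation score each round and stop early if it stalls.
model = xgb.train(
    params,
    dtrain,
    num_boost_round=500,
    evals=[(dtrain, 'train'), (dval, 'val')],
    early_stopping_rounds=50,
)
print(model.best_iteration, model.best_score)
```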

Finally, note that models with randomness all have a parameter like seed or random_state to control the random seed. You must record it together with all the other parameters when you get a good model. Otherwise you won’t be able to reproduce it.

Cross Validation

Cross validation is an essential step in model training. It tells us whether our model is at high risk of overfitting. In many competitions, public LB scores are not very reliable. Often, when we improve the model and get a better local CV score, the LB score becomes worse. It is widely believed that we should trust our CV scores in such situations. Ideally we would want CV scores obtained by different approaches to improve in sync with each other and with the LB score, but this is not always possible.

Usually 5-fold CV is good enough. If we use more folds, the CV score would become more reliable, but the training takes longer to finish as well. However, we shouldn’t use too many folds if our training data is limited. Otherwise we would have too few samples in each fold to guarantee statistical significance.
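A minimal sketch of 5-fold CV with scikit-learn, using toy data and an arbitrary model as placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Toy data standing in for a competition's training set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         scoring='roc_auc', cv=cv)

# Report both the mean and the spread; a large spread across folds
# is a warning sign of instability or overfitting.
print(scores.mean(), scores.std())
```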

Doing CV properly is not a trivial problem. It requires constant experimentation and case-by-case judgment. Many Kagglers share their CV approaches (like this one) after competitions in which they found it hard to build a reliable CV.

Ensemble Generation

Ensemble Learning refers to the technique of combining different models. It reduces both bias and variance of the final model (you can find a proof here), thus increasing the score and reducing the risk of overfitting. These days it is virtually impossible to win a prize in a Kaggle competition without using an ensemble.

Common approaches of ensemble learning are:

  • Bagging: Use different random subsets of training data to train each base model. Then all the base models vote to generate the final predictions. This is how random forest works.
  • Boosting: Train base models iteratively, modifying the weights of training samples according to the results of the last iteration. This is how gradient boosted trees work. (Actually that’s not the whole story: apart from boosting, GBTs also learn the residuals of earlier iterations.) Boosting performs better than bagging but is more prone to overfitting.
  • Blending: Use non-overlapping data to train different base models and take a weighted average of their predictions to obtain the final predictions. This is easy to implement but uses less data.
  • Stacking: To be discussed next.

In theory, for the ensemble to perform well, two factors matter:

  • Base models should be as unrelated as possible. This is why we tend to include non-tree-based models in the ensemble even though they don’t perform as well. The math says that the greater the diversity, the lower the bias of the final ensemble.
  • The performance of the base models shouldn’t differ too much.

Actually there is a trade-off here. In practice we may end up with highly related models of comparable performance, yet we ensemble them anyway because doing so usually still improves the overall score.
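One common heuristic (not from the original post) for judging how related two base models are is to look at the correlation between their predictions on the same validation or out-of-fold samples; a hypothetical sketch:

```python
import numpy as np

# val_preds_a / val_preds_b would be two base models' predictions on the
# same validation (or out-of-fold) samples; random vectors stand in here.
rng = np.random.default_rng(0)
val_preds_a = rng.random(1000)
val_preds_b = 0.7 * val_preds_a + 0.3 * rng.random(1000)

# Pearson correlation of the two prediction vectors; values close to 1
# suggest the models make very similar mistakes and add little diversity.
print(np.corrcoef(val_preds_a, val_preds_b)[0, 1])
```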

Stacking

Compared with blending, stacking makes better use of training data. Here’s a diagram of how it works:

[Figure: diagram of 5-fold stacking]

(Taken from Faron. Many thanks!)

It’s much like cross validation. Take 5-fold stacking as an example. First we split the training data into 5 folds. Next we do 5 iterations. In each iteration, every base model is trained on 4 folds and makes predictions on the hold-out fold; we also keep its predictions on the testing data. This way, in each iteration every base model makes predictions on 1 fold of the training data and on all of the testing data. After 5 iterations we obtain a matrix of shape #(samples in training data) X #(base models). This matrix is then fed to the stacker (which is just another model) in the second level. After the stacker is fitted, we use the base models’ predictions on the testing data (each base model is trained 5 times, so we take the average to obtain a matrix of the same shape) as the input for the stacker and obtain our final predictions.

Maybe it’s better to just show the code:
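Below is a minimal sketch of the procedure described above, with toy data and arbitrary base models standing in for the real ones; it is an illustration, not the exact implementation from the original post:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Toy data standing in for the real training and testing sets.
X_train, y_train = make_classification(n_samples=1000, n_features=20, random_state=0)
X_test, _ = make_classification(n_samples=500, n_features=20, random_state=1)

base_models = [RandomForestClassifier(random_state=0),
               ExtraTreesClassifier(random_state=0)]
stacker = LogisticRegression()

n_folds = 5
kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)

# Out-of-fold predictions on the training set: one column per base model.
S_train = np.zeros((X_train.shape[0], len(base_models)))
# Test-set predictions, averaged over the 5 copies of each base model.
S_test = np.zeros((X_test.shape[0], len(base_models)))

for i, model in enumerate(base_models):
    S_test_i = np.zeros((X_test.shape[0], n_folds))
    for j, (train_idx, holdout_idx) in enumerate(kf.split(X_train)):
        # Train on 4 folds, predict on the hold-out fold and on the test set.
        model.fit(X_train[train_idx], y_train[train_idx])
        S_train[holdout_idx, i] = model.predict_proba(X_train[holdout_idx])[:, 1]
        S_test_i[:, j] = model.predict_proba(X_test)[:, 1]
    # Average the 5 sets of test predictions for this base model.
    S_test[:, i] = S_test_i.mean(axis=1)

# Second level: fit the stacker on the out-of-fold matrix,
# then predict on the averaged test matrix.
stacker.fit(S_train, y_train)
final_predictions = stacker.predict_proba(S_test)[:, 1]
```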

Prize winners usually build larger and much more complicated ensembles. For beginners, implementing a correct 5-fold stacking is good enough.

Pipeline

We can see that the workflow for a Kaggle competition is quite complex, especially for model selection and ensemble. Ideally, we need a highly automated pipeline capable of:

  • Modularized feature transformations. We only need to write a few lines of code (or better, rules / DSLs) and the new feature is added to the training set.
  • Automated grid search. We only need to set up the models and the parameter grids; the search is run and the best parameters are recorded automatically.
  • Automated ensemble selection. Whenever a new base model is added to the pool, the K best models are used to train the ensemble.

For beginners, the first one is not very important because the number of features is quite manageable; the third one is not important either because typically we only build a few ensembles at the end of the competition. But the second one is good to have, because manually recording the performance and parameters of each model is time-consuming and error-prone.
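As a small illustration of the second point, here is a sketch of a grid search whose best score and parameters are appended to a log file automatically; the file name and log format are arbitrary choices, not part of any particular pipeline:

```python
import json

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data standing in for the real training set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {'n_estimators': [100, 300], 'max_features': ['sqrt', 0.5]},
    scoring='roc_auc',
    cv=5,
)
grid.fit(X, y)

# Append the result to a JSON-lines log so every run is recorded
# without any manual bookkeeping.
with open('gridsearch_log.jsonl', 'a') as f:
    f.write(json.dumps({'model': 'RandomForestClassifier',
                        'best_score': float(grid.best_score_),
                        'best_params': grid.best_params_}) + '\n')
```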

Chenglong Chen, the winner of Crowdflower Search Results Relevance, once released his pipeline on GitHub. It’s very complete and efficient. Yet it’s very hard to understand and extract all his logic to build a general framework. This is something you might want to do when you have plenty of time.