By Yanir Seroussi.
Kaggle is the leading platform for data science competitions, building on a long history that has its roots in the KDD Cup and the Netflix Prize, among others. If you’re a data scientist (or want to become one), participating in Kaggle competitions is a great way of honing your skills, building reputation, and potentially winning some cash. This post outlines ten steps to Kaggle success, drawing on my personal experience and the experience of other competitors.
While the focus of this post is on Kaggle competitions, it’s worth noting that most of the steps below apply to any well-defined predictive modelling problem with a closed dataset. However, in “real life”, data scientists spend much of their time defining the problem together with stakeholders and chasing down the data required for its solution. Working on a Kaggle-like problem is often the more fun part of a data scientist’s job.
Step 1: Read the manual
It’s surprising how many people overlook important details, such as the deadline for making a first submission. It’s important to understand the competition timeline, be able to reproduce the benchmarks, generate the correct submission format, and so on. Just like in real life, you should understand what’s going on before jumping into coding and model building.
Step 2: Understand the performance measure
A key part of doing well in a competition is understanding how the performance measure works. It’s often easy to significantly improve your score by using an optimisation approach that is suited to the measure. A classic example is optimising the mean absolute error (MAE) versus the mean squared error (MSE). It’s easy to show that, absent any other information about a set of numbers, the constant predictor that minimises the MAE is the median, while the one that minimises the MSE is the mean. Indeed, in the EMC Data Science Hackathon we fell back to the median rather than the mean when there wasn’t enough data, and that ended up working quite well.
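As a quick illustration of the median/mean distinction (the numbers below are made up, not hackathon data), a single outlier pulls the mean away from the bulk of the values, so the median scores better on MAE while the mean still wins on MSE:

```python
import numpy as np

# Toy sample with one large outlier, so the mean and median differ noticeably.
values = np.array([1.0, 2.0, 2.0, 3.0, 50.0])

def mae(pred, y):
    """Mean absolute error of a constant prediction."""
    return np.mean(np.abs(y - pred))

def mse(pred, y):
    """Mean squared error of a constant prediction."""
    return np.mean((y - pred) ** 2)

median, mean = np.median(values), np.mean(values)

# The median gives a lower (or equal) MAE than the mean...
assert mae(median, values) <= mae(mean, values)
# ...while the mean gives a lower (or equal) MSE than the median.
assert mse(mean, values) <= mse(median, values)
```

The same logic generalises: if the measure is asymmetric or nonlinear, it’s worth checking what the optimal constant prediction looks like before building anything fancier.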
Step 3: Know your data
In Kaggle competitions, over-specialisation (without overfitting) is a good thing. This is unlike academic machine learning papers, where researchers often test their proposed method on many different datasets. This is also unlike more applied work, where you may care about data drifting and whether what you predict actually makes sense.
When competing, exploiting anomalies in the data can work in your favour. For example, in the aforementioned hackathon, we noticed that even though we had to produce hourly predictions for air pollutant levels, the measured levels didn’t change every hour (probably due to limitations in the measuring equipment). This led us to try a simple “model” for the first few hours, where we predicted exactly the last measured value. This proved to be one of our most valuable insights. Obviously, this means that we were predicting what the measurement equipment would say rather than actual pollutant levels – something you’d definitely want to avoid in a real-life situation!
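That first-few-hours “model” was just a persistence forecast: predict that the next few readings equal the last observed one. A minimal sketch follows; the readings, the sensor update interval, and the three-hour horizon are all invented for illustration:

```python
import numpy as np

# Hypothetical hourly pollutant readings: the sensor only updates every few
# hours, so consecutive measured values repeat.
observed = np.array([31.0, 31.0, 31.0, 28.5, 28.5, 28.5, 33.2, 33.2])

def persistence_forecast(history, horizon):
    """Predict that the next `horizon` hours equal the last observed value."""
    return np.full(horizon, history[-1])

forecast = persistence_forecast(observed, horizon=3)
```

Baselines like this are trivial to implement and surprisingly hard to beat for short horizons, which is exactly why noticing the repetition in the data was so valuable.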