Step 4: Understand what you want to achieve before worrying about the how
Many beginners worry too much about which tool to use (Python or R? Random forests or deep learning?), when they should be focusing on understanding the data and deciding which useful patterns they want to model. For example, when we worked on the Yandex search personalisation competition, we spent a lot of time looking at the data and thinking about what users would plausibly be doing. In that case it was easy to come up with ideas, because we all use search engines. In the general case, you have to deeply familiarise yourself with the dataset and problem domain to be effective: become one with the data.
Step 5: Set up a local validation environment
Having a local validation environment allows you to move faster and produce more reliable results than relying on leaderboard scores. The main scenarios where you should skip local validation are when the dataset is too small, or when you run out of time (towards the end of the competition). If your local validation is set up well and is consistent with the leaderboard (which you need to test by making one or two submissions), there’s no need to make many submissions. Making only a few submissions reduces the likelihood of overfitting the leaderboard, which can lead to disastrous results.
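To make the idea of local validation more concrete, here is a minimal sketch in Python using only the standard library. It assumes plain k-fold cross-validation, which the text above does not prescribe; for time-ordered data, a time-based split is likely a better fit. All names (k_fold_indices, cross_validate, fit_mean, predict_mean, mean_absolute_error) are hypothetical and exist only for illustration, and the mean-predicting baseline stands in for whatever model you are actually testing.

```python
import random
from statistics import mean


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices once and deal them into k nearly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(rows, targets, fit, predict, score, k=5, seed=0):
    """Train on k-1 folds, score on the held-out fold, and repeat k times."""
    scores = []
    for held_out in k_fold_indices(len(rows), k, seed):
        held_out_set = set(held_out)
        train = [i for i in range(len(rows)) if i not in held_out_set]
        model = fit([rows[i] for i in train], [targets[i] for i in train])
        predictions = [predict(model, rows[i]) for i in held_out]
        scores.append(score([targets[i] for i in held_out], predictions))
    return scores


# Hypothetical baseline model: always predict the mean training target.
def fit_mean(train_rows, train_targets):
    return mean(train_targets)


def predict_mean(model, row):
    return model


def mean_absolute_error(actual, predicted):
    return mean(abs(a - p) for a, p in zip(actual, predicted))


# Toy data (made up for illustration): one feature, target is twice the feature.
rows = [[x] for x in range(20)]
targets = [2.0 * x for x in range(20)]

fold_scores = cross_validate(rows, targets, fit_mean, predict_mean,
                             mean_absolute_error, k=5)
print("per-fold MAE:", fold_scores)
print("mean MAE:", mean(fold_scores))
```

The per-fold scores matter as much as their mean: if they vary widely between folds, small differences between models are probably noise. Comparing the local mean against the score from one or two real submissions is a reasonable way to check that the local setup tracks the leaderboard.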
Step 6: Monitor the forum
It’s very important to subscribe to the forum to receive notifications on issues with the data or the competition. In addition, it’s worth trying to figure out what your competitors are doing. An extreme example is the recent trend of code sharing during the competition – while it’s not a good idea to rely on such code, it’s important to be aware of its existence. Finally, reading the post-competition summaries on the forum is a valuable way of learning from the winners and improving over time.