The Present and the Future of the KDD Cup Competition

The KDD Cup is the first and among the most prestigious competitions in data science. Key takeaways from KDD Cup 2015: XGBoost, a Gradient Boosted Decision Trees package, works wonders in data classification; feature engineering is king; and teamwork is crucial.

By Ron Bekkerman (CandorMap).

As a rule, I don’t compete. Just not a competitive type of guy. Whenever I’m forced into a competition, I lose. Whenever I lose, I feel sorry and decide never to compete again.

This is why I deeply respect people who compete and especially those who win. This year I was fortunate to co-chair (with Prof. Jie Tang) the KDD Cup – the most prestigious Data Science competition that’s been run by SIGKDD for almost two decades. The competition usually lasts for a few months and ends with a workshop preceding the KDD Conference.

KDD Cup 2015 Workshop

As I was running the KDD Cup Workshop on August 10 in Sydney, I found myself among the most competitive and successful Data Scientists in the world – the KDD Cup winners. Nine winning teams presented their solutions, and the workshop concluded with the Serial Winner Panel. Below is my take on the workshop, but first a few words (and some data) about the KDD Cup 2015 competition itself.

The objective of KDD Cup 2015 was to predict student dropouts from online courses. The importance of this task cannot be overstated, as online courses are often the only affordable method of education for many people – and the courses are most effective if taken in full. The KDD Cup competition attracted 821 teams, who made over 11,000 submissions. The quality of submitted predictions was measured by AUC score, and the highest score achieved was 0.9074.
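For readers less familiar with the metric: AUC (area under the ROC curve) can be read as the probability that a randomly chosen positive example (here, a student who drops out) gets a higher predicted score than a randomly chosen negative one, with ties counted as half. Below is a minimal Python sketch of that definition. The function name `auc_score` and the toy data are my own inventions for illustration; this is not the competition's evaluation code, and a real evaluation would more likely use a library routine.

```python
def auc_score(labels, scores):
    """Area under the ROC curve, computed by pairwise comparison.

    labels: sequence of 0/1 values (1 = dropout, in this illustration)
    scores: predicted scores; higher means "more likely to drop out"
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    # Count positive/negative pairs ranked correctly; a tie counts as half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only (not from the competition).
toy_labels = [1, 0, 1, 1, 0, 0]
toy_scores = [0.9, 0.3, 0.6, 0.4, 0.5, 0.1]
toy_auc = auc_score(toy_labels, toy_scores)
print(toy_auc)
```

The pairwise loop is quadratic in the number of examples, which is fine for a toy example but too slow for a large dataset, where a sort-based computation would be the usual choice. A score of 0.5 corresponds to random ranking and 1.0 to a perfect one.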

At first, I suspected that 821 participating teams was an exaggerated number. How many of them were seriously competing and how many just threw a dart or two and never showed up again? Below is the plot of resulting AUC scores over the team ranks.

The scores decrease very gradually until they reach the 0.84 mark, and then they fall off a cliff. My conclusion is that the 0.84 score was fairly easy to achieve and whoever didn’t achieve it didn’t really try. The good news is that over 550 teams achieved the 0.84 score. Let’s now zoom into the area above 0.88 AUC, achieved by almost 200 teams:

At this resolution we can see that the scores take a steeper dip over the top ranks and get flatter later on. Apparently, this steep dip marks the area of the real battle for the top prize, in which 31 teams participated. Zooming further into that area, another interesting effect can be observed.