The Present and the Future of the KDD Cup Competition
The KDD Cup is the first and among the most prestigious competitions in data science. Key takeaways from KDD Cup 2015: XGBoost, a Gradient Boosted Decision Trees package, works wonders in data classification; feature engineering is king; and teamwork is crucial.
The top 13 teams look truly outstanding: their scores are very close to each other, while a significant drop can be seen afterwards. The majority of those 13 teams took part in the KDD Cup workshop.
Remarkably, the AUC scores of the top 13 teams span a tiny interval of 0.003 – which puts adjacent teams at an average distance of roughly 0.0002 from each other. This raises a question about the utility of winning the KDD Cup – what is the reason for competing over the fourth digit after the dot, besides the apparent thrill of victory? (Did I already mention that I never compete? ☺) We’ll come back to this question later.
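To put those numbers in perspective, here is a minimal Python sketch. It assumes the 13 scores are spread across the 0.003 interval, so the average gap is the span divided by the 12 gaps between adjacent teams (about 0.00025, consistent with the rounded figure quoted above). It also computes AUC the textbook way, as the share of positive/negative pairs that a model ranks correctly. The function names and the toy labels and scores are illustrative assumptions, not competition data.

```python
from itertools import product


def average_gap(score_span, n_teams):
    """Average distance between adjacent scores when n_teams scores cover score_span."""
    if n_teams < 2:
        raise ValueError("need at least two teams")
    return score_span / (n_teams - 1)


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Toy example, invented for illustration only.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.4, 0.1]

print("average gap between adjacent teams:", average_gap(0.003, 13))
print("toy AUC:", auc(toy_labels, toy_scores))
```

Read this way, a gap of that size means the two models disagree on the ordering of only a tiny fraction of all positive/negative pairs in the test set.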
Three main take-aways from the KDD Cup workshop presentations:
- Something dramatic happened in Machine Learning over the past couple of years. It is called XGBoost – a package implementing Gradient Boosted Decision Trees that works wonders in data classification. I have tried it on my benchmark dataset – it shows a 35% improvement over an SVM. Apparently, every winning team used XGBoost, mostly in ensembles with other classifiers. Sometimes an ensemble of (differently parametrized) XGBoosts was used – without other classifiers ☺. Most surprisingly, the winning teams report that ensembles bring only very minor improvements over a single well-configured XGBoost. So, contrary to common belief, one might not need to build an ensemble to win a Data Science competition.
- Feature Engineering is king. This fact is as well-known as the superiority of ensembles – but in contrast to ensembles, feature engineering is not going anywhere. Teams are very creative about features, and this is what wins the KDD Cup in the end. It is worth mentioning that one winning team reported achieving 0.9 AUC with “basic” features and a single XGBoost (they then concentrated on adding less-basic features). Funnily enough, 800 competing teams never achieved 0.9 AUC. You can draw your own conclusion about the “basicness” of the features this team came up with.
- Teamwork is crucial. You can win the KDD Cup only as a team – the top four teams had between 4 and 21 members. The best solo contestant came in 5th place. Within a team, the most important role of every member is to bring features. Often, team members start out working independently so as not to be exposed to each other’s features, and merge their results later. Quite often, the teams do not even plan to work together early in the competition – they decide to join forces once they have established their benchmarks and have reason to believe that collaboration will further improve the results.
The Serial Winner Panel
We wanted to conclude the workshop with a take-home message, so we organized a panel named “The Serial Winner Panel”. We had three panelists on the podium, but the atmosphere was informal enough that all workshop participants could express their opinions. In the end, it was an open forum that analyzed the present and the future of KDD Cups – their pros and cons.
Our panelists were:
- Prof. Shou-De Lin from the National Taiwan University. The NTU team led by Prof. Lin has won numerous prizes at KDD Cups and other Data Science competitions, including the fourth prize of this year’s KDD Cup.
- Xavier Conort from DataRobot.com. Xavier and his team won the third prize at KDD Cup 2015 and the second prize at KDD Cup 2014. Xavier was the top-ranked data scientist on Kaggle in 2013.
- Hang Zhang from Microsoft. Hang was an invited speaker at our workshop and grabbed prizes at many competitions including KDD Cup 2011 and 2012.
My first question was: “I believe that it is clear to everyone that KDD Cups are not cost effective: you spend lots of time competing while the reward is not really encouraging. So why do you guys compete?”
To my surprise, I received three different, but all very sensible, responses.
Prof. Shou-De Lin: “I use KDD Cups to teach my students practical Data Science. Every year I teach a course based on the KDD Cup.”
Me: “Is the course requirement to participate or to win?”
Prof. Shou-De Lin: “To win, of course ☺”
Xavier Conort: “I learn by competing. Before I started competing at KDD Cups and on Kaggle, I had little Data Science experience – now I feel capable enough. Besides that, winning KDD Cups is good for your CV ☺”
Hang Zhang: “Actually, I was forced to compete ☺. When my previous employer Opera Solutions asked me to participate in a Data Science competition, not being among the top ten was just out of the question. By winning competitions, Opera Solutions assured its leading role in the field and promoted its brand.”
So, people compete at KDD Cups to teach or learn new skills, as well as promote their personal or corporate brands. As a side note, there was a recent piece of news about a big company trying to win a Data Science competition no matter what – this didn’t end well for the company. Needless to say, it wasn’t Opera Solutions ☺
My second question was: “There are plenty of Data Science competitions, all in different domains. But the winners are often the same. Why?”
Hang Zhang: “Serious competitive Data Scientists have the computational infrastructure and tools in place – ready for future competitions. Because of that, they get better results earlier over the course of the competition and have more time to concentrate on domain-specific feature engineering.”
Prof. Shou-De Lin: “Serial winning teams have extensive experience in defining roles of team members and are effective in collaboration and communication between the team members.”
Xavier Conort: “Winners of Data Science competitions are a social network: many people know each other, and have had a chance to collaborate in the past. This allows them to team up efficiently in a way that complements each other’s skills and leads to another victory.”
My third question was: “It is no secret that KDD Cup winning models rarely (if ever?) go to production. This is probably because the models are fitted so closely to the competition dataset. Does it feel right to compete over the fourth digit after the dot? Is there anything that we need to change in future KDD Cups to adapt them to real-world settings?”
This question initiated a wide discussion involving the panelists and the audience, which I can summarize as follows:
- Competing over the fourth digit after the dot is not all that unnatural: you don’t have to win with a big margin, a small margin will do as well. And if you’re not yet winning, you need that much to overcome the current winner.
- The structural drawback of KDD Cups is that the test set is prebuilt and fixed. During the competition, contestants evaluate their models against a publicly available test set – and at the end, the models are evaluated against a private test set. The problem is that the public and the private test sets cannot differ substantially in their characteristics, because otherwise the evaluation against the public test set wouldn’t make much sense. This causes a fair amount of overfitting and limits the applicability of the resulting models. A solution might be to test on future data: models could be trained on historical data and evaluated on data that didn’t yet exist when the model was trained. This narrows the domain options (not every domain generates data on the fly). It also adds a lot of uncertainty: the private test set can now, by design, have different characteristics than the historical data. It would be much closer to the real world, though.
- Testing on future data might lead to some frustration: you might be at the top of the ranked list yesterday, and at the bottom today. The competing teams might lose interest in the competition. This is yet to be checked, but what’s clear is that the new approach will fundamentally change the competition, as new skills will have to be developed. The big question is whether KDD Cups have to be dramatically shaken up – after all, they bring many benefits in their current form.
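The future-data evaluation discussed above can be sketched as follows. This is a minimal illustration, assuming every record carries a timestamp: the model sees only records from before a cutoff date and is scored on records from the cutoff onward. The record layout, field names, and cutoff date are hypothetical, not part of any KDD Cup setup.

```python
from datetime import date


def temporal_split(records, cutoff):
    """Split records into (history, future) around a cutoff date.

    Each record is a (timestamp, features, label) tuple; this layout is
    an assumption made for the sketch.
    """
    history = [r for r in records if r[0] < cutoff]
    future = [r for r in records if r[0] >= cutoff]
    return history, future


# Toy data, invented for illustration only.
records = [
    (date(2015, 5, 1), {"logins": 4}, 0),
    (date(2015, 6, 15), {"logins": 1}, 1),
    (date(2015, 7, 20), {"logins": 0}, 1),
    (date(2015, 8, 2), {"logins": 7}, 0),
]

history, future = temporal_split(records, cutoff=date(2015, 7, 1))
print(len(history), "training records,", len(future), "evaluation records")
```

Unlike a random split of one fixed dataset, nothing from after the cutoff can leak into training, so the evaluation data is free to drift away from the historical data, much as it would in production.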
Me: “Do KDD Cups actually bring any benefit?”
Prof. Shou-De Lin: “They do, especially to academics. KDD Cup competitions allow us to publish practical Machine Learning papers which would otherwise have been rejected for not proposing any methodological breakthrough. Nevertheless, those papers are very useful as they provide practical guidance for solving hard Machine Learning problems.”
Hang Zhang: “They do, because winning a KDD Cup can bring a lot of good publicity to the company that employs the winning team, which in turn can be used in marketing, can improve sales, lead to a successful partnership, close an investment deal etc.”
Xavier Conort: “Data Science competitions help researchers get their work exposed to a larger community. XGBoost, LibFM, LibFFM, and Vowpal Wabbit got very popular thanks to their outstanding performance in such competitions.”
The future will tell us whether KDD Cups are going to change dramatically, but I can tell you one thing right now: organizing the KDD Cup made me hungry for a competition ☺.
Bio: Ron Bekkerman is an Assistant Professor of Data Science at U. of Haifa and CTO of CandorMap. He worked at LinkedIn and has a PhD in CS (Machine Learning) with extensive experience in commercial software engineering.