
Running Kaggle Chess Ratings Contest

The major challenge is that chess ratings are longitudinal (results from earlier tournaments can be used to predict later tournaments).


My experience running the contest, and lessons learned for next time

Kaggle Blog, 24 November 2010, by Jeff Sonas

Anthony Goldbloom and I started discussing the contest to predict chess results back in July. He had read some of my writings about chess ratings and liked my Chessmetrics website. It was funny: he initially contacted me with the idea of discussing a rating system for Kaggle's users (i.e. how well they performed in their contests), and I had misunderstood his email and thought he wanted me to run a contest about chess ratings. By the time we sorted out that miscommunication, he was more excited about the chess rating contest anyway, so we went that direction.

I originally envisioned a contest where participants submit code/executables to calculate ratings, or submit an algorithm that I would implement, but after I learned more from Anthony about how Kaggle worked, we settled on a design where participants would implement their own algorithms, make their own predictions, and have the website score submissions automatically. The major challenge there came from the fact that chess results/ratings are longitudinal: if a player competes in Tournament A, then Tournament B, then Tournament C, we can use the results from A to predict their performance in B, and the results from A and B to predict their performance in C. But if we want to keep the results of B hidden (so you can't cheat in your predictions of B), how can participants use the results of B to help predict performance in C?
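To make the hidden-test-period problem concrete, here is a minimal sketch using a toy Elo-style update. The update rule, K-factor, player names, and tournaments are all illustrative assumptions, not the contest's actual data or method.

```python
def expected_score(r_a, r_b):
    """Standard Elo expectation for the first player against the second."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings, games, k=20.0):
    """Update ratings in place from (white, black, white_score) results."""
    for white, black, score in games:
        e = expected_score(ratings[white], ratings[black])
        ratings[white] += k * (score - e)
        ratings[black] += k * ((1.0 - score) - (1.0 - e))

# Toy data: consecutive tournaments between two hypothetical players.
ratings = {"Alice": 1500.0, "Bob": 1500.0}
tournament_a = [("Alice", "Bob", 1.0)]   # training period: results visible
tournament_b = [("Alice", "Bob", 0.5)]   # test period: results hidden
# Tournament C must also be predicted, without ever seeing B's outcomes.

# Tournament A's results are public, so we may update on them.
update(ratings, tournament_a)

# For the test period we cannot update on B's true results, so predictions
# for both B and C come from the same frozen ratings (or from feeding back
# our own predicted scores instead of the hidden actual ones).
pred_b = expected_score(ratings["Alice"], ratings["Bob"])
pred_c = expected_score(ratings["Alice"], ratings["Bob"])
print(f"Predicted score for Alice: B={pred_b:.3f}, C={pred_c:.3f}")
```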

We had another very important decision to make: should we spend a lot of time upfront preparing an ideal contest with lots of data, or should we take the data I already had handy and zoom ahead with the contest? I was concerned that I wouldn't have much time available to prepare additional data - unfortunately it is a somewhat manual process - so we decided to zoom ahead. In retrospect I think the single biggest flaw in the contest was not having enough data points, both in the training dataset (to allow proper optimization of systems) and in the test dataset (to allow accurate scoring). So if I had it to do over, I would probably wait longer to start, have larger datasets, and then run a shorter contest. I would also put in more time beforehand making sure about my choice of evaluation function, although I am still not convinced there is a better one than what we used.
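The post doesn't spell out the evaluation function here; purely as an illustration (an assumption, not necessarily the metric the contest used), one common choice for scoring such predictions is a root-mean-square error between predicted and actual game scores:

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between predicted and actual game scores
    (1 = win, 0.5 = draw, 0 = loss). Illustrative only; not necessarily
    the scoring function the contest actually used."""
    assert len(predicted) == len(actual) and predicted
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted))

# Example: three test games, predictions vs. the hidden actual results.
print(rmse([0.62, 0.50, 0.31], [1.0, 0.5, 0.0]))
```

With only a small number of test games, any metric like this is noisy, which is one reason a larger test dataset would have allowed more accurate scoring.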

Read more.

