KDnuggets Home » News » 2011 » Jul » Competitions » Wikipedia for Kaggle Participants  ( < Prev | 11:n17 | Next > )

Wikipedia for Kaggle Participants

Tips from a Wikipedia editor and administrator on how best to analyze Wikipedia data for the ICDM 2011 Kaggle data-mining challenge: use data from 10 years of Wikipedia edits to predict future edit rates.


Backsidesmack, Adam Hyland, July 1, 2011

Kaggle has released a new data-mining challenge: use data from 10 years of Wikipedia edits to predict future edit rates. The dataset has been anonymized to obscure editor identity and article identity, which both sharpens the focus of the challenge and robs the dataset of considerable richness.

I have some experience with Wikipedia, both from a data science standpoint and as a participant: I am an editor and an administrator on the English Wikipedia with about 20,000 edits under my belt. Some of the information and experience I have will be less helpful for data scientists on this particular challenge, but the beauty of Wikipedia is that all of the data is available. Everything, should you want it.

Many of these suggestions will be remedial or duplicative for veteran data miners. However, you can always benefit from local knowledge. Leave a comment here or on my Wikipedia talk page if you need more information.

But enough of that, on to my suggestions for folks looking to win the challenge!

General model:

  • Zeros are very important. Depending on the time in the dataset and the tenure of the account, zeros may comprise 30-50% of the editors after a certain number of months. Modeling zeros differently from a small number of edits will be important. ...
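
One common way to treat zeros separately is a two-part (hurdle-style) model: first estimate the chance that an editor makes any edit at all, then estimate the edit count only among editors who do. The sketch below is a minimal baseline under that assumption, not the challenge's reference method. The function names and the toy counts are hypothetical and made up for illustration.

```python
from statistics import mean


def fit_two_part(future_counts):
    """Summarize future edit counts in two parts.

    Returns (p_active, mean_if_active): the share of editors with at least
    one edit, and the average count among those active editors only.
    """
    positives = [c for c in future_counts if c > 0]
    p_active = len(positives) / len(future_counts)
    mean_if_active = mean(positives) if positives else 0.0
    return p_active, mean_if_active


def expected_edits(p_active, mean_if_active):
    """Combine both parts into a single expected edit count."""
    return p_active * mean_if_active


# Toy data (made up): future edit counts for ten editors, half of them zero.
counts = [0, 0, 0, 0, 0, 1, 2, 3, 4, 10]
p_active, mean_if_active = fit_two_part(counts)
print(p_active, mean_if_active, expected_edits(p_active, mean_if_active))
```

In a real entry each part would be a fitted model (for example a classifier for "any edits" and a regression on the positive counts), conditioned on account tenure and time in the dataset rather than pooled as here.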

Editors:

  • Look at edit type. Does the edit remove content (measured simply in bytes) or add it? Is the edit an exact reversion to a previous state? Reversion is a variable in the dataset (probably determined by comparing hashes).
  • Look at edit persistence. Do the edits of a given editor tend to stick around or be reverted/trimmed over time? There are multiple implications to persistence, so don't assume "good" edits are always kept and "bad" edits reverted.
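
The two editor-level ideas above can be sketched as simple per-editor features. This is a rough illustration under stated assumptions: each article's history is a chronological list of (editor, content hash, size in bytes) tuples, a revision counts as an exact reversion when its hash matches an earlier revision of the same article, and every revision between the restored one and the revert counts as undone. The tuple layout and the feature names are hypothetical and are not the challenge dataset's actual columns.

```python
from collections import defaultdict


def editor_features(revisions):
    """Per-editor features for one article.

    revisions: chronological list of (editor, content_hash, size_bytes).
    """
    last_seen = {}   # content_hash -> index of the latest revision with that hash
    reverts = set()  # indices of revisions that restore an earlier state
    undone = set()   # indices of revisions wiped out by a later revert
    for i, (_editor, content_hash, _size) in enumerate(revisions):
        j = last_seen.get(content_hash)
        if j is not None:
            reverts.add(i)
            undone.update(range(j + 1, i))
        last_seen[content_hash] = i

    feats = defaultdict(lambda: {"edits": 0, "bytes_added": 0,
                                 "bytes_removed": 0, "reverts": 0, "undone": 0})
    prev_size = 0
    for i, (editor, _hash, size) in enumerate(revisions):
        f = feats[editor]
        delta = size - prev_size  # positive: content added, negative: removed
        prev_size = size
        f["edits"] += 1
        if delta >= 0:
            f["bytes_added"] += delta
        else:
            f["bytes_removed"] += -delta
        f["reverts"] += int(i in reverts)
        f["undone"] += int(i in undone)

    for f in feats.values():
        # Share of this editor's edits that were not later undone.
        f["persistence"] = 1 - f["undone"] / f["edits"]
    return dict(feats)


# Toy history (made up): editor "a" reverts editor "c" by restoring hash "h1".
history = [
    ("a", "h0", 100),
    ("b", "h1", 150),
    ("c", "h2", 90),
    ("a", "h1", 150),
    ("b", "h3", 170),
]
features = editor_features(history)
print(features["c"])
```

A low persistence score here only says that an editor's changes were undone, not why. As noted above, it should be treated as a signal to model rather than a verdict on edit quality.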

 