Large Scale Hierarchical Text Classification

Second Pascal Challenge on
Large Scale Hierarchical Text Classification
lshtc.iit.demokritos.gr/
Email: lshtc_info@iit.demokritos.gr

Following a successful first edition, we are pleased to announce the 2nd edition of the Large Scale Hierarchical Text Classification (LSHTC) Pascal Challenge. The LSHTC Challenge is a hierarchical text classification competition, using large datasets. This year's challenge will increase the scale and the difficulty of the task, using data from Wikipedia (www.wikipedia.org), in addition to the ODP Web directory data (www.dmoz.org).

Hierarchies are becoming ever more popular for the organization of text documents, particularly on the Web. Web directories and Wikipedia are two examples of such hierarchies. Along with their widespread use comes the need for automated classification of new documents into the categories of the hierarchy. As the hierarchy grows and the number of documents to be classified increases, a number of interesting machine learning problems arise. In particular, this is one of the rare situations where data sparsity remains an issue despite the vastness of available data: as more documents become available, more classes are also added to the hierarchy, and there is a very high imbalance between the classes at different levels of the hierarchy. Additionally, the statistical dependence of the classes poses both challenges and opportunities for learning methods.

The challenge consists of three categorization tasks, involving different documents and category systems. The largest category system, based on Wikipedia, contains more than 300,000 categories and 2 million training documents. To the best of our knowledge, the largest category system previously used for evaluation purposes was based on the Yahoo! Directory and contained 130,000 categories and 500,000 training documents. In addition to the largest task, the challenge includes two smaller ones, based on Wikipedia and DMOZ respectively. Their scale is on the order of that of the first edition of the challenge. All of the datasets in this edition are multi-label. In the two datasets that are based on Wikipedia, each document is assigned on average to 3.2 and 4.6 categories, respectively. Furthermore, the hierarchies are no longer simple tree structures, as both documents and subcategories are allowed to belong to more than one category. More information regarding the tasks and the challenge rules can be found at the challenge's Web site; follow the "Tasks, Rules and Guidelines" link.
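To make the multi-label, non-tree setting described above concrete, the following minimal Python sketch models a toy hierarchy in which a category may have several parents and a document may carry several labels. All category names, document identifiers and function names are hypothetical and chosen only for illustration; they do not reflect the actual challenge data or its file format.

```python
# Hypothetical toy hierarchy: category -> list of parent categories.
# A category may have more than one parent, so this is a DAG, not a tree.
parents = {
    "Science": [],
    "Computing": [],
    "Machine learning": ["Science", "Computing"],
    "Text classification": ["Machine learning"],
    "Linguistics": ["Science"],
}

# Hypothetical multi-label assignments: document id -> assigned categories.
doc_labels = {
    "doc1": ["Text classification", "Linguistics"],
    "doc2": ["Machine learning"],
    "doc3": ["Computing", "Linguistics", "Text classification"],
}


def ancestors(category, parents):
    """Return every category reachable by following parent links upward."""
    seen = set()
    stack = list(parents[category])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


def mean_labels_per_document(doc_labels):
    """Average number of categories assigned to a document."""
    total = sum(len(labels) for labels in doc_labels.values())
    return total / len(doc_labels)


# Categories with more than one parent are what make the hierarchy a DAG.
multi_parent = [c for c, ps in parents.items() if len(ps) > 1]

if __name__ == "__main__":
    print(sorted(ancestors("Text classification", parents)))
    print(mean_labels_per_document(doc_labels))
    print(multi_parent)
```

The same average-labels statistic, computed over the real training sets, is the quantity reported above as 3.2 and 4.6 for the two Wikipedia-based datasets.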

As in the first edition, participants will be able to submit runs continuously in order to improve their systems. This year we also plan a two-stage evaluation of the participating methods: one stage measuring classification performance and one measuring computational performance. It is important to measure both, as the two are interdependent. The results will be included in a final report on the challenge, and we also aim to organize a special ECML'11 workshop.

In order to register for the challenge and gain access to the datasets you must have an account at the challenge Web site.

Key dates:

Start of testing: January 15, 2011
End of testing: March 31, 2011
Submission of executables and short papers to challenge organizers: April 30, 2011
Submission of workshop papers: May 31, 2011
ECML'11 workshop (subject to approval): September 5, 2011

Organisers:

George Paliouras, NCSR "Demokritos", Athens, Greece
Eric Gaussier, LIG, Grenoble, France
Aris Kosmopoulos, NCSR "Demokritos" & AUEB, Athens, Greece
Ion Androutsopoulos, AUEB, Athens, Greece
Thierry Artières, LIP6, Paris, France
Patrick Gallinari, LIP6, Paris, France