CFP (KDnuggets News 09:13, item 36, CFP)

From: Richi Nayak
Date: 26 Jun 2009
Subject: CFP: XML Clustering Task in INEX 2009

This is a call for participation in the XML Clustering Task at INEX 2009. The clustering task is an evaluation forum that provides a platform for measuring the performance of clustering methods on a large-scale test collection, consisting of a set of documents, their labels, a set of information needs (queries), and the answers to those information needs.

In the last decade, we have observed a proliferation of approaches for clustering XML documents based on their structure and content. These approaches have been developed for diverse application domains, many of which require data objects to be grouped by similarity of content, tags, paths, structure and semantics.

The clustering task in INEX 2009 evaluates unsupervised machine learning in the context of XML information retrieval. This year we are running a novel evaluation task using manual query assessments from the INEX Ad Hoc track. The clustering track will explicitly test the Jardine and van Rijsbergen cluster hypothesis (1971), which states that documents that cluster together have a similar relevance to a given query. The task is to split the English Wikipedia collection (60 gigabytes in size, with around 2.7 million documents in XML format) into disjoint clusters for collection selection. If the cluster hypothesis holds true, and if a suitable clustering can be achieved, then a clustering solution will minimise the number of clusters that need to be searched to satisfy any given query. There are important practical reasons for performing collection selection on a very large corpus: if only a small fraction of clusters (and hence documents) needs to be searched, then the throughput of an information retrieval system will be greatly improved.
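As a rough illustration of the collection selection idea, the sketch below counts how few clusters must be searched to reach a given share of the relevant documents for one query. It is not the track's official measure; the function name `clusters_to_search`, the `recall_target` parameter and the toy document identifiers are all assumptions made for this example.

```python
import math
from collections import Counter


def clusters_to_search(assignment, relevant, recall_target=1.0):
    """Smallest number of clusters covering `recall_target` of the relevant docs.

    assignment: dict mapping document id -> cluster id (disjoint clusters).
    relevant:   set of document ids judged relevant to one query.
    Hypothetical helper for illustration, not the official INEX measure.
    """
    # Count the relevant documents that fall into each cluster.
    counts = Counter(assignment[d] for d in relevant if d in assignment)
    total = sum(counts.values())
    if total == 0:
        return 0
    needed = math.ceil(recall_target * total)
    found = 0
    # Clusters are disjoint, so visiting the richest clusters first
    # gives the minimum number of clusters for the target.
    for searched, (_, n) in enumerate(counts.most_common(), start=1):
        found += n
        if found >= needed:
            return searched
    return len(counts)


# Toy data (invented): five documents in three clusters, three of them relevant.
assignment = {"d1": 0, "d2": 0, "d3": 1, "d4": 2, "d5": 2}
relevant = {"d1", "d2", "d4"}
print(clusters_to_search(assignment, relevant))
print(clusters_to_search(assignment, relevant, recall_target=0.5))
```

Under the cluster hypothesis, a good clustering concentrates the relevant documents in few clusters, so this count stays small across queries.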

The INEX XML Wikipedia collection is a marked-up version of the Wikipedia documents. The mark-up includes, for instance, explicit tagging of named entities. To enable participation with minimal overhead in data preparation, the collection has been pre-processed to provide various representations of the documents: for instance, a bag-of-words representation of terms and frequent phrases in a document, and frequencies of various XML structures in the form of trees, links, named entities, etc. These collection representations will be released by the end of this month. The entire document collection is also available in XML format and in text-only format if you wish to try different representation approaches. For teams that are unable to process such a large data collection, a subset of the INEX 2009 corpus containing about 50,000 documents will also be provided for clustering.

The clustering solutions will be evaluated by two means. Firstly, each clustering solution will be evaluated using standard criteria such as purity, entropy and F-score to determine the quality of the clusters. These evaluation results will be provided online on an ongoing basis, along the same lines as Netflix, starting from mid-September. Secondly, the clustering solutions will be evaluated to determine the quality of the clusters relative to the optimal collection selection goal, given a set of queries. Better clustering solutions in this context will tend to (on average) group together relevant results for (previously unseen) ad hoc queries. Real ad hoc retrieval queries and their manual assessment results will be utilised in this evaluation. This novel approach evaluates the clustering solutions relative to a very specific objective - clustering a large document collection in an optimal manner in order to satisfy queries while minimising the search space. Results of the second evaluation will be released at the INEX workshop in December.
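For readers unfamiliar with the first set of criteria, here is a minimal sketch of purity and entropy computed from cluster assignments and document labels. It uses the common textbook definitions (size-weighted averages over clusters, base-2 logarithm); the exact variants and averaging used by the track are not stated here, so treat the function names and the toy data as assumptions. F-score is omitted for brevity.

```python
import math
from collections import Counter, defaultdict


def contingency(clusters, labels):
    """Count, for each cluster id, how many documents carry each label."""
    table = defaultdict(Counter)
    for cluster_id, label in zip(clusters, labels):
        table[cluster_id][label] += 1
    return table


def purity(clusters, labels):
    """Fraction of documents that carry the majority label of their cluster."""
    table = contingency(clusters, labels)
    return sum(max(row.values()) for row in table.values()) / len(labels)


def entropy(clusters, labels):
    """Size-weighted average of the label entropy inside each cluster."""
    table = contingency(clusters, labels)
    n = len(labels)
    total = 0.0
    for row in table.values():
        size = sum(row.values())
        h = -sum((k / size) * math.log2(k / size) for k in row.values())
        total += (size / n) * h
    return total


# Toy data (invented): six documents, two clusters, two labels.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "b"]
print(purity(clusters, labels))
print(entropy(clusters, labels))
```

Higher purity and lower entropy both indicate clusters that agree more closely with the labels; a clustering that matches the labels exactly has purity 1 and entropy 0.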

The clustering task in INEX 2009 brings together researchers from the Data Mining, Machine Learning, Information Retrieval and XML fields. It allows participants to evaluate clustering methods against a real use case and with significant volumes of data. The task is designed to facilitate participation with minimal effort by providing not only raw data, but also pre-processed data that can be easily used by existing clustering software.

Dr Richi Nayak, School of Information Technology,

Queensland University of Technology, Brisbane, QLD 4001

Email: r.nayak@qut.edu.au

http://sky.scitech.qut.edu.au/~nayak/
