Highlights of IEEE ICDM 2013 International Conference on Data Mining, Dallas

Highlights of the IEEE ICDM 2013 Conference on Data Mining: Good organization in icy conditions, How to do clustering in high dimensions, Discovering unexpected sequential patterns, and perspectives on #BigData.

Guest blog by Francois Petitjean, Dec 15, 2013.

As many of you already know, the 2013 edition of the IEEE Int. Conf. on Data Mining (ICDM 2013) was held last week, Dec 7-10, in Dallas, TX.

I was there to present Chordalysis (see description on KDnuggets), and was asked to share a few highlights.

The conference started amidst extremely icy conditions: -8°C and inches of ice covering the cities and airports of Northern Texas. Everybody has a story to tell about how they reached the conference: some were stranded at Dallas airport because the ice made it impossible for vehicles to get to the airport; others were diverted hundreds of miles away. However, the organisers did a great job smoothing everything out, and I didn't experience the cancellation of a single talk.

The papers

More than 150 papers were presented at the conference. It's not possible to describe them all, but here are a couple that stood out for me:

1. MMSC: Clustering in high dimension

Clustering is hard when the dataset has many attributes/variables, mainly because distances lose their discriminative power in high dimensions (see curse of dimensionality). S. Gunnemann and C. Faloutsos address this issue with "Mixed Membership Subspace Clustering", a method that makes the most of two observations: (1) objects can have different degrees of membership in the different groups, and (2) not all the attributes are relevant for all the groups.
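To make the "distances lose their discriminative power" point concrete, here is a minimal sketch (not taken from the MMSC paper). It assumes points drawn uniformly at random from the unit cube, and the function name `relative_contrast` is mine. It measures how much farther the farthest point is from a query than the nearest one, relative to the nearest distance.

```python
import math
import random

def relative_contrast(n_points, n_dims, seed=0):
    """Return (max - min) / min over distances from one random query
    to n_points random points, all drawn uniformly from the unit cube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)

# The contrast shrinks as the number of dimensions grows.
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(relative_contrast(500, n_dims), 3))
```

With only a couple of dimensions the nearest neighbour is much closer than the farthest one; with a thousand dimensions all points end up at roughly the same distance, which is why clustering on full-space distances struggles.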

2. Discovering unexpected sequential patterns

Discovering patterns in data, not only because they're frequent, but because they might be unexpected by (and of interest to) the user, has been of major interest to the pattern mining community. The discovery of such patterns in temporal datasets is, however, still a work in progress. With SigSpan, C. Lowkam, C. Raissi, M. Kaytoue and J. Pei address several of the theoretical and computational challenges.

The panel

The conference was closed with a panel on Big Data. The panel had some interesting insights about the nature and future of big data.

One insight was about the term 'big data'. We have been using many terms to designate the field that targets the understanding of data. Whether we call it statistics, KDD, data mining, data science or big data, the question has always been the same: how do we understand data? Big data represents what we, in 2013, find difficult to analyse. This doesn't mean that big data has no distinctive features (the three Vs, for example), but we will certainly see another term replace 'big data' in a few years. Even when that happens, our job will remain the same: understanding data.

Another interesting discussion was about how to learn from a large quantity of data, and more precisely whether learning for big data is about scaling up current algorithms or requires a new generation of techniques. We all know that we can't load our datasets in main memory anymore, so what do we do? One solution is to analyse a subset of the data (sampling). We can also split the data into sub-datasets that we can more easily handle, analyse each sub-dataset, and recombine the results (which is often done with MapReduce frameworks).
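As an illustration of the split, analyse and recombine idea, here is a small sketch of my own, not something shown at the panel. It assumes the "analysis" is just the mean and variance of a list of numbers; the helper names `summarize`, `merge` and `chunked` are hypothetical. Each chunk is summarised independently, and the summaries are merged without going back to the raw data.

```python
from functools import reduce

def summarize(chunk):
    """Per-chunk summary: count, mean, and sum of squared deviations (M2)."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return n, mean, m2

def merge(a, b):
    """Combine two chunk summaries without revisiting the raw values."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2

def chunked(values, size):
    """Yield consecutive slices of at most `size` values."""
    for start in range(0, len(values), size):
        yield values[start:start + size]

# Toy data standing in for a dataset too large to process in one go.
data = [float(i % 17) for i in range(1000)]
n, mean, m2 = reduce(merge, (summarize(c) for c in chunked(data, 64)))
variance = m2 / n  # population variance
print(n, mean, variance)
```

For simple statistics like these the recombination is exact. For most learning algorithms it is only an approximation of what a single learner would have found on the full dataset.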

The problem is that this recombined model might not be perfect, because none of the "learners" has had a look at the entire dataset. Making the most of the fine-detail information that big data holds requires looking at the whole dataset, and to do that, the learning method has to be out-of-core. That's not all: predicting better from big data also requires learning methods with less bias. When the quantity of data is small, learning has to be biased if we want to correctly estimate the parameters of the model. However, with more data, not only can we afford to be less biased, but this is also how we will gain access to the fine details that 'big data' holds.
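To show what "out-of-core" means in practice, here is a minimal sketch under my own assumptions; it is not a method from the panel. The generator `stream_rows` is a hypothetical stand-in for reading rows one at a time from disk, here producing synthetic data from y = 2x + 1. The learner is a plain linear model fitted by stochastic gradient descent, which only ever holds one row in memory.

```python
import random

def stream_rows(n_rows, seed=0):
    """Hypothetical stand-in for reading (x, y) rows one by one from disk."""
    rng = random.Random(seed)
    for _ in range(n_rows):
        x = rng.uniform(-1.0, 1.0)
        yield x, 2.0 * x + 1.0  # synthetic target: y = 2x + 1

def fit_linear_sgd(rows, learning_rate=0.1):
    """Fit y = w * x + b in a single pass, one row at a time."""
    w, b = 0.0, 0.0
    for x, y in rows:
        error = (w * x + b) - y
        w -= learning_rate * error * x  # gradient step for the slope
        b -= learning_rate * error      # gradient step for the intercept
    return w, b

w, b = fit_linear_sgd(stream_rows(5000))
print(round(w, 3), round(b, 3))
```

The model sees every row of the dataset, yet memory use does not depend on the number of rows. A linear model is of course a strongly biased learner, so this only illustrates the out-of-core part of the argument, not the low-bias part.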

Dr Francois Petitjean, @LeDataMiner, is a data mining researcher whose focus is on finding useful solutions for big data. He obtained his PhD with the French Space Agency and is now with the Centre for Research in Intelligent Systems at Monash University in Melbourne.