Chordalysis: a new method to discover the structure of data
This new method helps you answer "why": it helps you understand the reasons behind a prediction. It uses chordal graphs to scale the classical method of log-linear analysis to much larger datasets.
Guest blog by Francois Petitjean, Nov 11, 2013.
We need to know why!
We've become really good at making predictions from data, but our models for understanding why those predictions will eventuate are less developed. ("You're going to get cancer! But we don't know why..."). Yes, sometimes the prediction itself is all that's necessary, but other times the "why" is vital. It's useful to know that the earth's temperature is going to increase by 3°C, but to do something about it we need to know why that's going to happen. If we want to develop informed responses to our increasingly complex domains and problems, we have to better address the "why".
Log-linear analysis: the solution for small datasets
With the "why" question in mind, I have been investigating (with Professors GI Webb and AE Nicholson) the discovery of complex relationships between variables in high-dimensional data. We started from the reference method in statistics: log-linear analysis.
This might ring a bell if you're using SPSS, SAS or R; all the main statistical toolboxes have it, and there is a reason for that: this is how you answer questions about statistical dependencies, such as "Is having a heart attack independent of a patient's cholesterol level, given their use of anti-cholesterol drugs?"
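To make that kind of question concrete, here is a minimal sketch of a conditional independence test on a three-way contingency table, using the likelihood-ratio statistic (G²) that log-linear analysis relies on. This is not the Chordalysis code: the function names, the variable coding and the counts are all made up for illustration.

```python
from collections import Counter
from math import log


def g2_conditional_independence(rows):
    """Return (G2, degrees of freedom) for the test "X independent of Y given Z".

    rows is a list of (x, y, z) tuples, one per observation.
    """
    rows = list(rows)
    n_xyz = Counter(rows)
    n_xz = Counter((x, z) for x, _, z in rows)
    n_yz = Counter((y, z) for _, y, z in rows)
    n_z = Counter(z for _, _, z in rows)

    g2 = 0.0
    for (x, y, z), observed in n_xyz.items():
        # Expected count if X and Y were independent within stratum z.
        expected = n_xz[(x, z)] * n_yz[(y, z)] / n_z[z]
        g2 += 2.0 * observed * log(observed / expected)

    n_x_values = len({x for x, _, _ in rows})
    n_y_values = len({y for _, y, _ in rows})
    df = (n_x_values - 1) * (n_y_values - 1) * len(n_z)
    return g2, df


def expand(counts):
    """Turn {(x, y, z): count} into a list of individual observations."""
    return [cell for cell, n in counts.items() for _ in range(n)]


# Hypothetical counts: (heart_attack, high_cholesterol, takes_drug) -> patients.
independent_given_drug = expand({
    (0, 0, 0): 5, (0, 1, 0): 5, (1, 0, 0): 15, (1, 1, 0): 15,
    (0, 0, 1): 5, (0, 1, 1): 15, (1, 0, 1): 5, (1, 1, 1): 15,
})
dependent_given_drug = expand({
    (0, 0, 0): 15, (0, 1, 0): 5, (1, 0, 0): 5, (1, 1, 0): 15,
    (0, 0, 1): 15, (0, 1, 1): 5, (1, 0, 1): 5, (1, 1, 1): 15,
})

g2_indep, df_indep = g2_conditional_independence(independent_given_drug)
g2_dep, df_dep = g2_conditional_independence(dependent_given_drug)
print(g2_indep, df_indep)
print(g2_dep, df_dep)
```

The statistic is compared with a chi-squared distribution with the returned degrees of freedom. With 2 degrees of freedom the 5% critical value is about 5.99, so a G² near zero gives no reason to reject conditional independence, while a G² well above that threshold does.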
Chordal+analysis = Chordalysis
But here's the problem: you can't use log-linear analysis if your dataset has more than, say, 10 variables! This is because the cost of the analysis grows exponentially with the number of variables. That is where our new work makes a difference. The question was: how can we keep the rigorous statistical foundations of classical log-linear analysis but make it work for datasets with hundreds of variables?
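One way to get a feel for this blow-up is to count things. The sketch below assumes binary variables and counts (a) the cells of the full contingency table and (b) the undirected graphs that could be drawn over the variables. These counts are only an illustration of exponential growth, not a description of what any particular software computes, and the function names are made up.

```python
from math import comb


def joint_table_cells(n_variables, values_per_variable=2):
    """Number of cells in the full contingency table over all variables."""
    return values_per_variable ** n_variables


def candidate_graphs(n_variables):
    """Number of undirected graphs on n labelled nodes (one bit per possible edge)."""
    return 2 ** comb(n_variables, 2)


for n in (5, 10, 20, 100):
    cells = joint_table_cells(n)
    graphs = candidate_graphs(n)
    # Report the number of decimal digits rather than the huge values themselves.
    print(n, "variables:", len(str(cells)), "digits of cells,",
          len(str(graphs)), "digits of graphs")
```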
The main part of the answer is "chordal graphs": graphs in which every cycle of four or more nodes has a chord (a shortcut edge between two nodes of the cycle), so that they are built from triangular structures. We showed that for this class of models, the theory is scalable for high-dimensional datasets. The rest of the solution involved melding the classical statistical machinery with advanced data mining techniques from association discovery and graphical modelling.
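If chordal graphs are new to you, the following sketch may help. It tests whether a small undirected graph is chordal by repeatedly removing a "simplicial" node, that is, a node whose neighbours are all connected to each other; a graph is chordal exactly when every node can be removed this way. This is a simple textbook check written for readability, not the algorithm used inside Chordalysis, and the function names and the example graphs are assumptions made for this illustration.

```python
from itertools import combinations


def graph_from_edges(edges):
    """Build a symmetric adjacency map {node: set of neighbours} from edge pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def is_chordal(adjacency):
    """Return True if the undirected graph (no self-loops) is chordal."""
    graph = {node: set(neigh) for node, neigh in adjacency.items()}
    while graph:
        # A simplicial node is one whose neighbours form a clique.
        simplicial = next(
            (node for node, neigh in graph.items()
             if all(b in graph[a] for a, b in combinations(neigh, 2))),
            None,
        )
        if simplicial is None:
            return False  # a chordless cycle blocks any further elimination
        for neigh in graph[simplicial]:
            graph[neigh].discard(simplicial)
        del graph[simplicial]
    return True


# A 4-cycle A-B-C-D-A has no chord, so it is not chordal.
square = graph_from_edges([("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")])
# Adding the chord A-C splits the cycle into two triangles.
square_with_chord = graph_from_edges(
    [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A"), ("A", "C")]
)

print(is_chordal(square))
print(is_chordal(square_with_chord))
```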
The result is Chordalysis: a log-linear analysis method for high-dimensional data. Chordalysis makes it possible to discover the structure of datasets with hundreds of variables on a standard computer. So far we've applied it successfully to datasets with up to 750 variables.
A model obtained with Chordalysis from the dataset on the left.
The software (with source code and examples) is released under GPL and can be found at https://sourceforge.net/projects/chordalysis/
This method has been peer-reviewed and accepted for publication at the 2013 IEEE International Conference on Data Mining - see
Scaling log-linear analysis to high-dimensional data (PDF), by Francois Petitjean, Geoffrey I. Webb and Ann E. Nicholson.
Dr Francois Petitjean is a data miner whose focus is on finding useful solutions for big data. After obtaining his PhD with the French Space Agency, he joined the Centre for Research in Intelligent Systems at Monash University in Melbourne.
You can contact him at petitjean [at] tiny-clues [dot] eu for questions/feedback, etc.