Chordalysis: a new method to discover the structure of data
This new method helps you answer "why" and understand the reasons behind a prediction. It uses chordal graphs to scale the classical method of loglinear analysis to much larger datasets.
Guest blog by Francois Petitjean, Nov 11, 2013.
(see also Chordalysis: Free software for Loglinear analysis of Big Data)
We need to know why!
We've become really good at making predictions from data, but our models for understanding why those predictions will eventuate are less developed. ("You're going to get cancer! But we don't know why..."). Yes, sometimes the prediction itself is all that's necessary, but other times the "why" is vital. It's useful to know that the earth's temperature is going to increase by 3°C, but to do something about it we need to know why that's going to happen. If we want to develop informed responses to our increasingly complex domains and problems, we have to better address the "why".
Loglinear analysis: the solution for small datasets
With the "why" question in mind, I have been investigating (with Professors GI Webb and AE Nicholson) the discovery of complex relationships between variables in high-dimensional data. We started from the reference method in statistics: loglinear analysis.
This might ring a bell if you're using SPSS, SAS or R; all the main statistical toolboxes have it, and there is a reason for that: this is how you answer questions about statistical dependencies such as "Is having a heart attack independent of a patient's cholesterol level, given their use of an anti-cholesterol drug?".
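To make that kind of question concrete, here is a minimal sketch of the textbook likelihood-ratio (G²) test for conditional independence on a three-way table of counts. It is not the Chordalysis code; the function name and the toy counts are made up for illustration, with x standing for "heart attack", y for "high cholesterol" and z for "takes the drug".

```python
import math
from collections import defaultdict


def g2_conditional_independence(counts):
    """Return (G2, degrees_of_freedom) for the hypothesis "X independent of Y given Z".

    counts maps (x, y, z) tuples of categorical values to observed counts.
    """
    n_xz = defaultdict(int)
    n_yz = defaultdict(int)
    n_z = defaultdict(int)
    for (x, y, z), n in counts.items():
        n_xz[(x, z)] += n
        n_yz[(y, z)] += n
        n_z[z] += n

    g2 = 0.0
    for (x, y, z), observed in counts.items():
        if observed == 0:
            continue  # empty cells contribute nothing to G2
        # Expected count if X and Y were independent within this level of Z.
        expected = n_xz[(x, z)] * n_yz[(y, z)] / n_z[z]
        g2 += 2.0 * observed * math.log(observed / expected)

    x_levels = {x for x, _, _ in counts}
    y_levels = {y for _, y, _ in counts}
    z_levels = {z for _, _, z in counts}
    df = (len(x_levels) - 1) * (len(y_levels) - 1) * len(z_levels)
    return g2, df


# Made-up toy counts: within each level of z, x and y are exactly independent.
independent_counts = {
    (0, 0, 0): 4, (0, 1, 0): 2, (1, 0, 0): 2, (1, 1, 0): 1,
    (0, 0, 1): 1, (0, 1, 1): 1, (1, 0, 1): 1, (1, 1, 1): 1,
}
# Made-up toy counts: x and y always agree, whatever the value of z.
dependent_counts = {
    (0, 0, 0): 5, (1, 1, 0): 5,
    (0, 0, 1): 5, (1, 1, 1): 5,
}

print(g2_conditional_independence(independent_counts))
print(g2_conditional_independence(dependent_counts))
```

In practice the G² value is compared with a chi-squared distribution with the returned degrees of freedom: a large value is evidence against conditional independence.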
Chordal+analysis = Chordalysis
But here's the problem: you can't use loglinear analysis if your dataset has more than, say, 10 variables! This is because the process is exponential in the number of variables. That is where our new work makes a difference. The question was: how can we keep the rigorous statistical foundations of classical loglinear analysis but make it work for datasets with hundreds of variables?
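A back-of-the-envelope sketch gives a feel for this blow-up. It assumes every variable is binary, and it counts two things: the cells of the full joint table, and the undirected graphs that could be drawn over the variables. Both function names are hypothetical, and this is only one way to see the growth, not a description of any particular toolbox.

```python
def joint_table_cells(num_variables, levels_per_variable=2):
    """Number of cells in the full contingency table over all variables."""
    return levels_per_variable ** num_variables


def candidate_graphs(num_variables):
    """Number of undirected graphs over the variables (each possible edge is in or out)."""
    possible_edges = num_variables * (num_variables - 1) // 2
    return 2 ** possible_edges


for n in (5, 10, 20):
    print(f"{n} variables: {joint_table_cells(n)} cells, "
          f"about {candidate_graphs(n):.2e} candidate graphs")
```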
The main part of the answer is "chordal graphs": graphs in which every cycle of four or more nodes has a shortcut edge (a chord), so that the graph is made of triangular structures. We showed that for this class of models, the theory scales to high-dimensional datasets. The rest of the solution involved melding the classical statistical machinery with advanced data mining techniques from association discovery and graphical modelling.
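For readers who have not met chordal graphs before, the following sketch checks whether a small graph is chordal. It uses a standard characterisation (a graph is chordal exactly when you can repeatedly remove a vertex whose neighbours are all connected to each other), written for clarity rather than speed. It is an illustration only and is not taken from the Chordalysis source; the function name and the example graphs are hypothetical.

```python
from itertools import combinations


def is_chordal(adjacency):
    """Return True if the undirected graph is chordal.

    adjacency maps each vertex to the set of its neighbours.
    """
    graph = {v: set(nbrs) for v, nbrs in adjacency.items()}  # work on a copy
    while graph:
        # A simplicial vertex is one whose neighbours form a clique.
        simplicial = next(
            (v for v, nbrs in graph.items()
             if all(b in graph[a] for a, b in combinations(nbrs, 2))),
            None,
        )
        if simplicial is None:
            return False  # no simplicial vertex left: a chordless cycle remains
        for nbr in graph[simplicial]:
            graph[nbr].discard(simplicial)
        del graph[simplicial]
    return True


# A four-node cycle with no shortcut edge is not chordal...
square = {
    "a": {"b", "d"}, "b": {"a", "c"},
    "c": {"b", "d"}, "d": {"c", "a"},
}
# ...but adding the chord a-c splits it into two triangles.
square_with_chord = {
    "a": {"b", "c", "d"}, "b": {"a", "c"},
    "c": {"a", "b", "d"}, "d": {"a", "c"},
}

print(is_chordal(square))
print(is_chordal(square_with_chord))
```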
The result is Chordalysis: a loglinear analysis method for high-dimensional data. Chordalysis makes it possible to discover the structure of datasets with hundreds of variables on a standard computer. So far we've applied it successfully to datasets with up to 750 variables.
A model obtained with Chordalysis from the dataset on the left.
Software
The software (with source code and examples) is released under GPL and can be found at https://sourceforge.net/projects/chordalysis/
Reference
This method has been peer-reviewed and accepted for publication at the 2013 IEEE International Conference on Data Mining; see
Scaling loglinear analysis to highdimensional data (PDF), by Francois Petitjean, Geoffrey I. Webb and Ann E. Nicholson.
Bio
Dr Francois Petitjean is a data miner whose focus is on finding useful solutions for big data. After obtaining his PhD with the French Space Agency, he joined the Centre for Research in Intelligent Systems at Monash University in Melbourne.
You can contact him at petitjean [at] tinyclues [dot] eu for questions/feedback, etc.