By John P. Mello Jr., TechNewsWorld Dec 21, 2011
Massive data sets -- a season's worth of baseball statistics, for example, or health data from around the world -- can contain some very revealing knowledge. The problem confronting researchers, though, is finding it.
That may be a little easier with some tools developed by scientists at Harvard University and the Broad Institute.
The suite of tools called "MINE" -- Maximal Information-based Nonparametric Exploration -- was revealed this week in a paper (subscription required) in the Dec. 16 issue of Science.
The tools allow researchers to find patterns and relationships in massive data sets that would otherwise be difficult or impossible to find.
"What makes MINE unique is its ability to find a very broad range of different types of patterns in data and to do that equally well," one of the authors of the article, David Reshef, who is in a dual degree program at Harvard and MIT, told TechNewsWorld.
Coping With 'Noise'
Another distinctive characteristic of MINE is its ability to balance generality and equitability in its results.
A statistic has generality when it captures a wide range of associations in a large data sample without being limited to linear, exponential, periodic or other specific types of functions.
It has equitability when it assigns similar scores to relationships that carry similar amounts of "noise," whatever form those relationships take. Noise is what number crunchers call the amount of unexplained variation in a data sample.
"The reason that's important is that if you have a method that gives patterns that look different from each other but have the same amount of noise different scores, then you can't compare scores across different types of patterns," Reshef explained.
That balancing of generality and equitability by MINE distinguishes it from other tools used for similar purposes, observed another of the article's authors, Harvard Computer Science Professor Michael Mitzenmacher.
"Other similar data-mining techniques that we know of may have one or the other, but don't appear to have both," he told TechNewsWorld.
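The idea of generality can be made concrete with a small sketch. The code below is not the MINE software, and the article does not describe how MINE computes its scores. It is a deliberately crude illustration that assumes an information-based score: lay a grid over the scatter plot, measure how much knowing a point's column tells you about its row (mutual information), normalize, and keep the best grid. The function names, the equal-frequency binning, and the `max_bins` limit are all hypothetical choices made for this example.

```python
import math
from collections import Counter

def equal_frequency_bins(values, n_bins):
    """Give each value a bin index 0..n_bins-1 so bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

def normalized_grid_information(xs, ys, nx, ny):
    """Mutual information of the binned data, divided by log(min(nx, ny))."""
    n = len(xs)
    bx = equal_frequency_bins(xs, nx)
    by = equal_frequency_bins(ys, ny)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    mi = sum((c / n) * math.log(c * n / (px[i] * py[j]))
             for (i, j), c in joint.items())
    return mi / math.log(min(nx, ny))

def best_grid_score(xs, ys, max_bins=5):
    """Best normalized score over all grids up to max_bins x max_bins (assumed limit)."""
    return max(normalized_grid_information(xs, ys, nx, ny)
               for nx in range(2, max_bins + 1)
               for ny in range(2, max_bins + 1))

# Two noiseless relationships of different shape over the same x values.
xs = [(i - 100) / 100 for i in range(201)]
linear_score = best_grid_score(xs, [2 * x + 1 for x in xs])
parabola_score = best_grid_score(xs, [x * x for x in xs])

# Both shapes are noise-free, and both score close to 1 under this sketch.
print(f"linear: {linear_score:.3f}  parabola: {parabola_score:.3f}")
```

A score that rates both shapes highly is showing generality in the sense described above. Whether it also rates equally noisy versions of the two shapes similarly, which is the equitability property, is a separate question that this sketch does not test.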
Paper: Reshef, D.N. et al. Detecting novel associations in large data sets. Science. DOI: 10.1126/science.1205438.
See also the Broad Institute press release.
From Science Daily
Tool Detects Patterns Hidden in Vast Data Sets
MINE is especially powerful in exploring data sets with relationships that may harbor more than one important pattern. As a proof of concept, the researchers applied MINE to social, economic, health, and political data from the World Health Organization (WHO) and its partners. When they compared the relationship between household income and female obesity, they found two contrasting trends in the data. Many countries follow a parabolic rate, with obesity rates rising with income but peaking and tapering off after income reaches a certain level. But in the Pacific Islands, where female obesity is a sign of status, countries follow a steep trend, with the rate of obesity climbing as income increases.
From Science Perspective
A Correlation for the 21st Century, by Terry Speed (Berkeley)
Most scientists will be familiar with the use of Pearson's correlation coefficient r to measure the strength of association between a pair of variables: for example, between the height of a child and the average height of their parents (r ≈ 0.5; see the figure, panel A), or between wheat yield and annual rainfall (r ≈ 0.75, panel B). However, Pearson's r captures only linear association, and its usefulness is greatly reduced when associations are nonlinear. What has long been needed is a measure that quantifies associations between variables generally, one that reduces to Pearson's in the linear case, but that behaves as we'd like in the nonlinear case. On page 1518 of this issue, Reshef et al. (1) introduce the maximal information coefficient, or MIC, that can be used to determine nonlinear correlations in data sets equitably.
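The limitation of Pearson's r on nonlinear data is easy to check by hand. The following minimal sketch computes r from its textbook definition for two made-up, noise-free data sets: a straight line and a parabola centered on zero. The data and the helper name `pearson_r` are illustrative assumptions, not taken from the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient from its textbook definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# 201 evenly spaced x values from -1 to 1 (symmetric around zero).
xs = [(i - 100) / 100 for i in range(201)]

r_linear = pearson_r(xs, [2 * x + 1 for x in xs])   # perfect straight line
r_parabola = pearson_r(xs, [x * x for x in xs])     # perfect parabola

# The line gives r of essentially 1; the parabola gives r of essentially 0,
# even though y is completely determined by x in both cases.
print(f"linear: {r_linear:.6f}  parabola: {r_parabola:.6f}")
```

The parabola's left half has a negative slope and its right half a positive one, so the two halves cancel in the covariance sum. That cancellation is why a perfectly deterministic relationship can still look like no relationship at all to r.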