Sizing samples - how much data is enough?

MIT researchers have shown that graphs shaped like stars and chains establish, respectively, the worst- and best-case scenarios for computers doing pattern recognition.

MITnews, Larry Hardesty, Aug 24, 2010

Many scientific disciplines use computers to infer patterns in data. But how much data is enough to ensure that the inferences are right?

Across science and engineering, computers are often enlisted to find patterns in data. The data might be genetic information about a population, and the pattern could be which gene variants predispose people to asthma. Or the data might be frames of video, and the patterns could be objects that move or stand still from frame to frame, which data-compression or image-sharpening algorithms might want to locate.

Image Credit: Christine Daniloff

In most cases, more data means more reliable inference of patterns. But how much data is enough? Vincent Tan, a graduate student in the Department of Electrical Engineering and Computer Science, and his colleagues in Professor Alan Willsky's Stochastic Systems Group have taken the first steps toward answering that question.

Tan, Willsky and Animashree Anandkumar, a postdoc in Willsky's group, envision data sets as what mathematicians call graphs. A graph is anything with nodes and edges: Nodes are generally depicted as circles and edges as lines connecting them. A typical diagram of a communications network, where the nodes represent electronic devices and the edges represent communications links, is a graph.

In the MIT researchers' work, however, the nodes represent data and the edges correlations between them. For instance, one node might represent asthma, and the others could be a host of environmental, physiological and genetic factors. Some of the factors might be correlated with asthma, others not; other factors might be correlated with each other but not with asthma. Moreover, the edges can have different weights: The strength of the correlations can vary. From this perspective, a computer charged with pattern recognition is given a bunch of nodes and asked to infer the weights of the edges between them.
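To make the idea of edge weights concrete, here is a minimal Python sketch that estimates a weight for every pair of nodes as the sample Pearson correlation between them. It is an illustration only, not the researchers' method: the variable names (`pollen`, `shoe_size`), the toy data, and the helper functions `pearson` and `edge_weights` are all assumptions made for this example.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def edge_weights(samples):
    """Map each unordered pair of node names to its estimated correlation."""
    names = sorted(samples)
    return {
        (a, b): pearson(samples[a], samples[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical toy data: "asthma" depends on "pollen"; "shoe_size" is unrelated.
rng = random.Random(0)
pollen = [rng.gauss(0, 1) for _ in range(2000)]
asthma = [p + rng.gauss(0, 0.5) for p in pollen]
shoe_size = [rng.gauss(0, 1) for _ in range(2000)]

weights = edge_weights(
    {"asthma": asthma, "pollen": pollen, "shoe_size": shoe_size}
)
for pair, w in weights.items():
    print(pair, round(w, 2))
```

With this many samples, the asthma-pollen weight should come out strong and the two `shoe_size` weights close to zero; with only a handful of samples, the estimates would be far noisier, which is the sample-size question the article is about.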

...

In an article published this spring in IEEE Transactions on Signal Processing, the researchers demonstrated that trees with a "star" pattern - in which one central node is connected to all the others - are the hardest to recognize; their shape can't be inferred without lots of data. Suppose, for instance, that the central node represents asthma, and 100 other nodes represent all the factors that can contribute to it. If the computer system looks at 100 data samples, each one could imply a different predictor of asthma. It might require tens of thousands of samples before the system could reliably conclude which factors have stronger correlations than others.

Trees that form "chains," on the other hand - where each node is linked to at most two others - are the easiest to recognize. Suppose that a computer system were analyzing concentrations of chemicals in biological cells. If the data reflected different stages of a single, complicated biochemical process, then the chemicals present at each stage might determine the chemicals present at the next. In that case, it would be fairly easy to infer, from few data samples, the correlations between successive stages of the process.
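The two extreme shapes are easy to write down. The following Python sketch builds a star and a chain on 101 nodes (one central node plus 100 others, matching the asthma example) and compares how many edges touch each node. The function names `star_edges`, `chain_edges` and `degrees` are hypothetical helpers for this illustration; the sketch only describes the shapes and does not reproduce the researchers' analysis.

```python
from collections import Counter


def star_edges(n):
    """Edges of a star on nodes 0..n-1: node 0 is linked to every other node."""
    return [(0, i) for i in range(1, n)]


def chain_edges(n):
    """Edges of a chain on nodes 0..n-1: each node is linked to the next one."""
    return [(i, i + 1) for i in range(n - 1)]


def degrees(edges):
    """Count how many edges touch each node."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


star = degrees(star_edges(101))
chain = degrees(chain_edges(101))

# Both trees have 100 edges, but they are distributed very differently.
print(max(star.values()), max(chain.values()))  # prints: 100 2
```

In the star, a single node carries all 100 edges; in the chain, no node carries more than two.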
