Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

Channels ▼


Sizing Samples: How Much Data Is Enough?

Across science and engineering, computers are often enlisted to find patterns in data. The data might be genetic information about a population, and the pattern could be which gene variants predispose people to asthma. Or the data might be frames of video, and the patterns could be objects that move or stand still from frame to frame, which data-compression or image-sharpening algorithms might want to locate.

In most cases, more data means more reliable inference of patterns. But how much data is enough? Vincent Tan, a graduate student in MIT's Department of Electrical Engineering and Computer Science, and his colleagues in Alan Willsky's Stochastic Systems Group have taken steps toward answering that question.

Tan, Willsky and Animashree Anandkumar, a postdoc in Willsky's group, envision data sets as what mathematicians call graphs. A graph is anything with nodes and edges: Nodes are generally depicted as circles and edges as lines connecting them. A typical diagram of a communications network, where the nodes represent electronic devices and the edges represent communications links, is a graph.

In their work, however, the nodes represent data and the edges correlations between them. For instance, one node might represent asthma, and the others could be a host of environmental, physiological and genetic factors. Some of the factors might be correlated with asthma, others not; other factors might be correlated with each other but not with asthma. Moreover, the edges can have different weights: The strength of the correlations can vary. From this perspective, a computer charged with pattern recognition is given a bunch of nodes and asked to infer the weights of the edges between them.

The researchers are concentrating on graphs known as "trees". Technically, a tree is a graph without any closed circuits: for any node, there's no sequence of edges that will lead you back to it without backtracking. But intuitively, a tree is a graph that looks like a family tree, with a single node at the top connected to one or more nodes in the layer below, which are connected to nodes in the layer below them, and so on. The patterns formed by data of interest won't always be trees, but when they aren't, there's generally a tree that closely approximates them. And trees are much easier to work with mathematically than graphs with random shapes.

In a paper entitledLearning Gaussian Tree Models: Analysis of Error Exponents and Extremal Structures, the researchers demonstrated that trees with a "star" pattern -- in which one central node is connected to all the others -- are the hardest to recognize; their shape can't be inferred without lots of data. Suppose, for instance, that the central node represents asthma, and 100 other nodes represent all the factors that can contribute to it. If the computer system looks at 100 data samples, each one could imply a different predictor of asthma. It might require tens of thousands of samples before the system could reliably conclude which factors have stronger correlations than others.

Trees that form "chains," on the other hand -- where each node is linked to at most two others -- are the easiest to recognize. Suppose that a computer system was analyzing concentrations of chemicals in biological cells. If the data reflected different stages of a single, complicated biochemical process, then the chemicals present at each stage might determine the chemicals present at the next. In that case, it would be fairly easy to conclude, with few data samples, the correlations between successive stages of the process.

The results for stars and chains led the researchers to hope that the difficulty of recognizing a tree pattern is purely a function of its diameter -- the distance between the two farthest-apart nodes. (In a star, the diameter is two: No two nodes are more than two hops apart. In a chain, the diameter is equal to the number of nodes.) Unfortunately, that turns out not to be the case. For trees with shapes other than stars or chains, "Strength of the connectivity between the variables matters," Tan says. "We cannot discount it." Nonetheless, the tools that the researchers used to evaluate the best- and worst-case scenarios could shed further light on the intermediary cases.

"When you have large and high-dimensional data sets, it becomes very difficult to compute with more-general models," says John Lafferty, a professor of computer science at Carnegie-Mellon University. "So having this tree-structured approximation is much more computationally efficient." Lafferty points out, however, that the MIT researchers' work assumes that the data being analyzed have a "normal distribution" -- that a graph of them would have the familiar bell-curve shape. "You'd like the distribution to have any shape," says Lafferty, an extension of the MIT researchers' result that his own group his been working on. Lafferty also says that pattern-recognition techniques that "relax this tree assumption" -- that can tolerate a few closed loops in the graph -- would be more generally useful. But that's something that Willsky's group is already working to do, he says.

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.