Machine-Learning Bubble in Computational Medicine?

It isn't clear whether biological data have underlying low-dimensional structure. Is machine learning's possible role in biology and medicine overhyped?

Ben Lorica reports on the joint meetings of the AMS/MAA/SIAM, which took place last week in San Francisco.

The Machine-Learning Bubble in Computational Medicine (Challenges in Computational Medicine and Biology)
Donald Geman gave a nice survey of the problems and mathematical techniques frequently used in computational biology. He also raised a point that struck a chord with me. While computational biology shares features with other fields (the "small n, large d" problem: few samples relative to the number of dimensions), techniques that work in fields like computer vision don't automatically translate to biology. First, sample sizes in biology and medicine are orders of magnitude smaller than in other fields. Second, while black boxes (think SVMs or neural nets) are acceptable in other fields, biologists want accurate predictions and explanations of why and how the algorithms work. Finally, it isn't clear whether biological data have underlying low-dimensional structure. Taking these points together, Geman wonders whether machine learning's possible role in biology and medicine has been overhyped.
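To make the "small n, large d" point concrete, here is a minimal Python sketch (not from the talk). It assumes NumPy, and the function name `fit_random_labels` and the sizes n=20, d=500 are illustrative choices. It fits a linear rule to pure noise with random labels: with far more dimensions than samples, the rule reproduces the training labels exactly, yet does no better than a coin flip on fresh data.

```python
import numpy as np

def fit_random_labels(n=20, d=500, seed=0):
    """Fit a linear rule to noise when n << d (illustrative sizes, not from the talk)."""
    rng = np.random.default_rng(seed)

    # Features and labels are independent: there is nothing real to learn.
    X = rng.standard_normal((n, d))
    y = rng.choice([-1.0, 1.0], size=n)

    # Minimum-norm least-squares solution; with n < d it interpolates the labels.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    train_acc = float(np.mean(np.sign(X @ w) == y))

    # Fresh noise with fresh random labels: accuracy should hover near chance.
    X_new = rng.standard_normal((1000, d))
    y_new = rng.choice([-1.0, 1.0], size=1000)
    test_acc = float(np.mean(np.sign(X_new @ w) == y_new))
    return train_acc, test_acc

if __name__ == "__main__":
    train_acc, test_acc = fit_random_labels()
    print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

A perfect training fit on data with no signal is one reason small samples make high-dimensional results hard to trust without held-out validation.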


Abstract
Donald Geman* (geman at jhu.edu), 302A Clark Hall, 3400 N. Charles St., Baltimore, MD 21218.

Challenges in Computational Medicine and Biology.

Modern biology involves analyzing very large networks of interacting molecular parts. In contrast to the "gene-centric" era, this seems to call for large-scale mathematical modeling. Even assuming biological systems exhibit general properties and are amenable to modeling, the challenges are still overwhelming, at least for having a major impact in medicine and biology. One high barrier is technical: measured against the complexity of the processes (e.g., gene regulation), and the dimension (d) of the data (e.g., DNA microarrays), the number of available samples (n) is minuscule; indeed, this "small n, large d" dilemma reaches extremes in computational biology. Another barrier is cultural: the "black box" decision rules generated by computational learning inhibit biological understanding and clinical applications. I will talk about several case studies in attempted generalization. One is estimating the topology and statistics of signaling networks, where grand goals have overrun good sense. Another is an approach to cancer biomarker discovery based solely on orderings of mRNA concentrations and sufficiently accurate and transparent for practical diagnosis and prognosis, and for modeling pathway deregulation. (Received September 19, 2009)
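The abstract does not spell out the ordering-based method, so the following Python sketch is only a hypothetical illustration of what a decision rule "based solely on orderings" could look like. It assumes NumPy, and the names `fit_top_pair` and `predict_pair` are invented here. The rule searches for the pair of genes whose relative order (is gene i expressed below gene j?) best separates two classes. It is transparent because it reduces to a single comparison, and its predictions are unchanged by any monotone rescaling of a sample's measurements.

```python
from itertools import combinations

import numpy as np

def fit_top_pair(X, y):
    """Pick the gene pair (i, j) whose ordering best separates the classes.

    X: (n_samples, n_genes) expression values; y: 0/1 labels, both classes present.
    Returns (i, j, score, less_means_1). Hypothetical helper, not the talk's method.
    """
    X, y = np.asarray(X), np.asarray(y)
    best = (0, 1, -1.0, True)
    for i, j in combinations(range(X.shape[1]), 2):
        less = X[:, i] < X[:, j]
        p1 = less[y == 1].mean()  # how often gene i < gene j in class 1
        p0 = less[y == 0].mean()  # how often gene i < gene j in class 0
        score = abs(p1 - p0)
        if score > best[2]:
            best = (i, j, score, bool(p1 > p0))
    return best

def predict_pair(X, rule):
    """Predict class 1 when the observed ordering matches the class-1 ordering."""
    i, j, _, less_means_1 = rule
    less = np.asarray(X)[:, i] < np.asarray(X)[:, j]
    return np.where(less == less_means_1, 1, 0)

if __name__ == "__main__":
    # Toy data: four samples, three genes (made-up numbers for illustration).
    X = np.array([[1, 5, 3], [2, 6, 9], [7, 3, 4], [8, 2, 1]], dtype=float)
    y = np.array([1, 1, 0, 0])
    rule = fit_top_pair(X, y)
    print(rule[:2], predict_pair(X, rule).tolist())
```

Because only within-sample comparisons are used, applying a monotone transform such as a logarithm to each sample leaves the predictions unchanged, which is one reason rank-based rules are attractive for expression data measured on different platforms.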