KDnuggets : News : 2005 : n13 : item1

Features

From: Gregory Piatetsky-Shapiro
Date: 28 Jun 2005
Subject: Winner of SIGKDD Data Mining and Knowledge Discovery Innovation Award is ...

The SIGKDD Innovations Award is the highest technical award in the field of data mining and knowledge discovery. It is given to one individual or one group of collaborators who has made significant technical innovations in the field of Data Mining and Knowledge Discovery that have been transferred to practice in significant ways, or that have significantly influenced the direction of research and development in the field.

ACM SIGKDD is pleased to announce the winner of the SIGKDD 2005 Innovations Award:

Prof. Leo Breiman, Professor Emeritus, University of California, Berkeley.

It is with great sadness that we have learned that Prof. Breiman died on July 5, 2005, after a long illness.

Dr. Leo Breiman is widely considered one of the founding fathers of modern machine learning and data mining. He has been actively contributing to these fields, as well as to statistics, for more than 30 years. His best-known contribution is his landmark work on decision trees (Classification and Regression Trees, 1984, known as CART(R)), written with Jerome Friedman, Richard Olshen, and Charles Stone. Since 1984 he has become further associated with tree ensembles in the form of Bootstrap Aggregation (Bagging Predictors, 1996) and Random Forests (also trademarked), and has become one of the most widely cited authors in the field. These works alone would be sufficient to merit Leo substantial honors, and indeed he has been elected a member of the National Academy of Sciences. The citation of the National Academy reads, in part:

"Breiman has done fundamental work in stochastic processes, information theory, and mathematical statistics. He is a seminal thinker who has developed modern methods of classification and pattern recognition. He has made significant contributions to the practice of statistics bridging the gaps between that field, signal processing, and computer science."

Leo was born in New York City in 1928 and laid the foundations for his professional career as a mathematician with a degree in physics from Caltech in 1949 and a PhD in mathematics from the University of California, Berkeley, in 1954. His first well-known paper, which proved what is now called the Shannon-McMillan-Breiman theorem of information theory (1957), was followed by another body of work related to optimal gambling systems (1960). After spending seven years at UCLA as a mathematics professor teaching probability theory, he wrote the celebrated graduate textbook Probability (1968), which has been republished in the SIAM classics series. Leo reports that at this stage in his career he was tired of theoretical pursuits and wanted to get his hands dirty with real-world data analysis.

He resigned his tenured full professorship and began a 13-year stint as a statistical consultant working on challenging problems in defense, traffic analysis, toxic substance detection, and air pollution. Soon realizing that classical statistical techniques were not adequate to the challenges, he began crafting the idea of classification via a series of yes/no questions, which put him on the road to CART. The ensuing CART monograph is a landmark for the wealth of insight and subtlety with which the subject is treated and is the source of many modern decision tree concepts:

  • cost-complexity pruning,
  • growing trees with asymmetric costs of misclassification,
  • surrogate splitters to handle missing values,
  • cross-validation for trees,
  • linear combination and Boolean splitting rules,
  • the distinction between classification trees and probability trees,
  • regression trees via least squares and least absolute deviations,
  • and a proof that, in infinite samples, CART trees converge toward the minimal obtainable error rate as they are grown larger.
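Several of these ideas survive directly in modern libraries. As a hedged illustration only (scikit-learn post-dates the monograph by decades, and the dataset here is arbitrary), cost-complexity pruning can be exercised like this: grow a full tree, compute the pruning path, and refit at a larger complexity penalty to get a smaller tree.

```python
# Illustrative sketch of CART-style cost-complexity pruning via scikit-learn.
# Nothing here is Breiman's own code; ccp_alpha implements the same trade-off
# between training error and tree size described in the CART monograph.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Grow a full, unpruned tree, then compute its cost-complexity pruning path:
# each alpha on the path corresponds to one subtree of the full tree.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full_tree.cost_complexity_pruning_path(X, y)

# A larger alpha penalizes leaves more heavily, yielding a smaller tree.
pruned = DecisionTreeClassifier(random_state=0,
                                ccp_alpha=path.ccp_alphas[-2]).fit(X, y)
assert pruned.tree_.node_count < full_tree.tree_.node_count
```

In practice one would choose alpha by cross-validation over `path.ccp_alphas`, exactly the tree-selection strategy the monograph advocates.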
"Estimating optimal transformations in multiple regression" (1985) is now standard in several major statistics packages. This work initiated some ideas carried further in Hastie and Tibshirani's GAM (Generalized Additive Models, 1990). Over the next several years Leo took on a number of projects, not all of which have been published.

In the late 1980s he began tinkering with CART parallelization on a network of Sun computers and reported his somewhat disappointing results in a 1995 discussion paper. In 1992 he developed the first implementation of CART for a vector of target variables.

He also wrote a definitive analysis of certain problems in US Census estimates ("The 1990 Census adjustment: Undercount or bad data?", 1994).

In 1995 he introduced a major improvement on stepwise regression, "Better Subset Regression Using the Nonnegative Garrote," which led to Tibshirani's now-celebrated lasso. The method shrinks regression coefficients toward zero non-uniformly, allowing some coefficients to reach exactly zero and thus accomplishing variable selection.
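The shrink-to-zero behavior is easiest to see in the lasso, the garrote's best-known descendant. The sketch below is illustrative, not Breiman's method: it uses scikit-learn's Lasso on synthetic data in which only the first two of ten predictors matter, and shows the irrelevant coefficients being driven exactly to zero.

```python
# Hedged sketch of shrinkage-plus-selection (lasso, inspired by the garrote).
# Synthetic data: y depends only on columns 0 and 1 of X.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# The L1 penalty shrinks all coefficients toward zero, but non-uniformly:
# small (noise-driven) coefficients hit zero exactly, performing selection.
fit = Lasso(alpha=0.5).fit(X, y)
selected = np.flatnonzero(fit.coef_)   # indices of surviving predictors
```

Here `selected` recovers the truly relevant predictors while the rest are zeroed out, which is precisely the "variable selection via shrinkage" idea described above.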

From this point on, Leo was involved in a variety of attempts to improve the performance of CART trees. A series of studies investigating the implications of instability in function approximation began with his award-winning paper "The Π-method for estimating multivariate functions from noisy data" (1991), followed by "The heuristics of instability in model selection" in 1996.

By this time he had hit on the idea of creating ensembles of trees via bootstrap re-sampling, and he perfected the method (no pruning, atom size of 1) for his paper on Bagging (1996). This was followed by an extension of the bagger to mimic boosting (Adaptive Resampling and Combining, or Arcing Classifiers, 1998) and an extension to online learning.
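The bagging procedure itself is short enough to sketch. The code below is a minimal illustration under stated assumptions (scikit-learn trees, an arbitrary benchmark dataset, 25 rounds): draw a bootstrap sample, grow an unpruned tree on it, and aggregate the trees' predictions by plurality vote.

```python
# Minimal sketch of bagging: bootstrap resample -> unpruned tree -> vote.
# Dataset and round count are illustrative choices, not from the 1996 paper.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
votes = np.zeros((len(X_te), 2))
for b in range(25):
    # Bootstrap: sample n training rows with replacement.
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(random_state=b).fit(X_tr[idx], y_tr[idx])
    # Each unpruned tree casts one vote per test record.
    votes[np.arange(len(X_te)), tree.predict(X_te)] += 1

bagged = votes.argmax(axis=1)          # aggregate by plurality vote
accuracy = (bagged == y_te).mean()
```

The key observation of the bagging paper is that averaging many such high-variance, unpruned trees stabilizes the prediction, typically beating any single pruned tree.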

Since 1999 Breiman's name has become tightly linked with Random Forests, which further push the ideas developed for the bagger. Whereas in the bagger randomness is induced into a tree by the data sampling mechanism, in Random Forests the randomness is injected into the split selection itself. The resulting classifications and regressions are often considerably more accurate than the bagger and are competitive with the best methods now extant. Breiman also introduced novel post-processing in which he uses the Random Forest trees to generate a non-metric distance between any two records in a data set, thus supporting new ways to cluster data and identify anomalies and outliers.
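The proximity idea admits a compact sketch as well. Hedged assumptions as before: this uses scikit-learn's `RandomForestClassifier` and its `apply()` method (which returns each record's terminal-node index in every tree), not Breiman's own Fortran code. Proximity between two records is the fraction of trees in which they land in the same leaf, and one minus that proximity serves as the non-metric distance mentioned above.

```python
# Sketch of Random Forest proximities: records sharing leaves are "close".
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

leaves = forest.apply(X)               # shape (n_samples, n_trees): leaf ids
# proximity[i, j] = fraction of trees in which records i and j share a leaf.
proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
distance = 1.0 - proximity             # non-metric dissimilarity
```

Feeding `distance` into any standard clustering or outlier-detection routine gives the post-processing uses described above: records with uniformly low proximity to everything else are anomaly candidates.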

Leo is more than just an academic researcher, and his non-scientific activities have also been noteworthy. While a professor at UCLA in the 1960s he took a year's sabbatical to work in Liberia for UNESCO, trekking through rain forests to help count the number of schools and school children in that country. In 1976 he served on the Santa Monica school board and developed ways to improve mathematics instruction. Finally, over the years he has hosted a total of 21 rural Mexican school children as "exchange students", providing them the opportunity to learn English in one-year stays in his home.

In summary, Leo Breiman has contributed some of the key ideas at the heart of today's data mining and has wielded immense influence in the field, doing this while also contributing to the welfare of many of the people in academia, industry, and everyday life with whom he has come in contact.

SIGKDD 2005 Awards Committee

  • Gregory Piatetsky-Shapiro (KDnuggets), Chair
  • Rakesh Agrawal (IBM)
  • Christos Faloutsos (CMU)
  • Jerome Friedman (Stanford)
  • Robert Grossman (U. of Illinois, Chicago)
  • Jiawei Han (U.Illinois, Urbana-Champaign)
  • Foster Provost (NYU)
  • Xindong Wu (U. Vermont)
  • Sam Uthurusamy (GM)

References

Richard Olshen (2001) "A Conversation with Leo Breiman" Statistical Science Vol. 16, No. 2, 184-198.

Breiman, L. (1957). The individual ergodic theorem of information theory. Ann. Math. Statist. 28 809-811. [Correction (1960). Ann. Math. Statist. 31 809-810.]

Breiman, L. (1960). Optimal gambling systems for favorable games. Proc. Fourth Berkeley Symp. Math. Statist. Probab. 1 60-77. Univ. California Press.

Breiman, L. (1968). Probability. Addison-Wesley, Reading, MA. [Republished (1991) in Classics in Applied Mathematics. SIAM, Philadelphia.]

Breiman, L. (1991). The Π-method for estimating multivariate functions from noisy data (with discussion). Technometrics 33 125-160. (Awarded the Youden Prize as the best expository paper of the year in Technometrics.)

Breiman, L. (1994). The 1990 Census adjustment-Undercount or bad data? (with discussion). Statist. Sci. 9 458-475.

Breiman, L. (1996a). Bagging predictors. Machine Learning 26 123-140.

Breiman, L. (1996b). The heuristics of instability in model selection. Ann. Statist. 24 2350-2383.

Breiman, L. (1998). Arcing classifiers (with discussion). Ann. Statist. 26 801-849.

Breiman, L. and Friedman, J.H. (1985). Estimating optimal transformations in multiple regression and correction (with discussion). J. Amer. Statist. Assoc. 80 580-619. (Theory and Methods Paper of the Year.)

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA. (Since 1993 this book has been published by Chapman and Hall, New York.)



Copyright © 2005 KDnuggets.