Exclusive: Interview with Dan Steinberg, President of Salford Systems, Data Mining Pioneer
My exclusive interview with Dan Steinberg, on CART, MARS, RandomForests, working with Leo Breiman, the origin of Salford Systems, CART vs C4.5, winning competitions, and more. If you missed one revolution, you only needed to wait a bit to catch the next one!
By Gregory Piatetsky, Jun 15, 2013.
I have known Dan Steinberg for many years. His company, Salford Systems, has been a pioneer in providing cutting-edge analytics and data mining software such as CART, MARS, and Random Forests. I am also grateful for their sponsorship of KDnuggets for over 10 years. Finally, after many years of informal conversations, I am delighted to have the opportunity to interview him for KDnuggets readers.
Dan Steinberg, President and Founder of Salford Systems, is a well-respected member of the statistics and econometrics communities. In 1992, he developed the first PC-based implementation of the original CART procedure, working with Professors Leo Breiman, Richard Olshen, Charles Stone and Jerome Friedman.
Dr. Steinberg received his Ph.D. in Economics from Harvard. He then worked as a Member of the Technical Staff at Bell Labs, and then as Assistant Professor of Economics at the U. of California, San Diego. A book he co-authored on Classification and Regression Trees was awarded the 1999 Nikkei Quality Control Literature Prize in Japan. Dr. Steinberg has led consulting and modeling projects for major banks worldwide. He also led the teams that won first place awards in the KDD Cup 2000 and the 2002 Duke/TeraData Churn modeling competition, and awards in the PAKDD competitions of 2006 and 2007. He has given talks at many leading conferences, published papers in economics, econometrics, and computer science journals, and contributes actively to the ongoing research and development at Salford.
Dan also writes a blog on analytics and data mining topics.
1. GP: Why Salford? I can understand if your company were named Steinberg Systems, or given your long association with Stanford statisticians, it could be Stanford Systems, but why Salford?
Dan Steinberg: When I was ready to give the company a name in 1983, I needed something not in use by anybody else, of course, and at that time it was expensive to hire Thompson and Thompson to do their formal search. I had once lived in Salford, England, and I hoped that no one in the USA would be using the name of a not-famous British town. Fortunately, I was right, and the $400 search turned up no one in the software industry.
2. GP: How did you start Salford Systems?
DS: I had actually developed two extensions to mainframe SAS to help me with the models I was estimating as part of my PhD thesis work. I continued working on those "PROCS" after graduation and they eventually became our first products. My "office" was a Kinko's-style shop for sending and receiving faxes, and I bought a Compaq I portable computer for development of PC-based statistical software.
3. GP: Leo Breiman was one of the leading statisticians you worked with - winner of many honors, including the KDD Innovation Award in 2005. What are your favorite stories of Leo Breiman?
DS: I visited Leo at his home many times during the 1990s and exchanged possibly thousands of emails with him. We worked closely together for about 10 years as I commercialized CART as well as Leo's Random Forests. To work with Leo, one of the true greats of data mining, has been an enormous privilege. Hearing Leo is better than any story that I could offer as Leo was a tremendous story-teller.
I recommend viewing A Tribute to Leo Breiman, where you will hear perhaps the last recording of Leo prior to his death. He describes his 'aha' moment when he received 'a bolt from the blue' that led to the famous CART (Classification and Regression Trees). What I have always found impressive is that as soon as Leo built his first tree, it was as if no time elapsed. He began growing trees all over the place: for air pollution studies, for classification of toxic compounds, for the military, and so on.
4. GP: CART and Ross Quinlan's C4.5 were started at about the same time. Can you compare these two systems? What are the main differences besides the use of Gini in CART and Entropy in C4.5?
DS: The long answer to the question can be found in my chapter on CART in "The Top Ten Algorithms in Data Mining", edited by Xindong Wu and Vipin Kumar (CRC Press, 2009), and a much shorter version in the paper "Top 10 algorithms in data mining", Knowledge and Information Systems, 2008.
There are many differences, starting with the fact that the original CART offered six splitting rules covering both classification and regression. CART does not use a stopping rule: it grows trees out very large and then applies a pruning algorithm. The reason is simply that for any stopping rule it is possible to construct data that will cause the rule to stop too soon.
The CART authors therefore decided not to stop. At its heart, CART is about growing not one tree but a sequence of nested trees defined by the backwards pruning algorithm and the best tree is found by test sample performance, possibly adjusted by human judgment. Also, CART uses binary splits for all predictors including categoricals.
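(Editor's illustration, not part of the interview.) The remark that any stopping rule can be made to stop too soon is easy to see on XOR-style data. The sketch below is a minimal pure-Python illustration, not CART's actual code: the helper names `gini` and `split_gain` and the four-row dataset are assumptions made up for this example. No single split at the root reduces Gini impurity, so a rule such as "stop when the best split improves nothing" would halt immediately, yet one more level of splitting separates the classes perfectly.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gain(rows, feature):
    """Decrease in Gini impurity from a binary split on a 0/1 feature.

    `rows` is a list of (features, label) pairs; this layout is hypothetical.
    """
    parent = [label for _, label in rows]
    left = [label for x, label in rows if x[feature] == 0]
    right = [label for x, label in rows if x[feature] == 1]
    n = len(rows)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


# XOR-style data: the label is 1 exactly when the two features differ.
xor_rows = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

# At the root, neither feature improves impurity at all.
root_gains = [split_gain(xor_rows, f) for f in (0, 1)]

# After splitting on feature 0 anyway, feature 1 separates the left child perfectly.
left_child = [row for row in xor_rows if row[0][0] == 0]
child_gain = split_gain(left_child, 1)

print("gain at root for each feature:", root_gains)
print("gain in left child on feature 1:", child_gain)
```

A stopping rule based on a minimum impurity decrease would never get past the root here, which is the kind of case that motivates growing the tree large first and pruning it back afterwards.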
5. GP: The main products of Salford Systems are software systems: CART®, MARS®, TreeNet®, RandomForests®, and the combined system, SPM 7. For what type of problems would one apply each of them?
(DS Note: CART®, MARS®, TreeNet®, RandomForests®, are all trademarked names owned by the researchers who invented the methods.)
CART (Classification And Regression Trees):
We always start our analytical work with CART because it is supremely robust in the face of flawed and dirty data and no matter how foolishly the model has been set up or how poor the data quality is, you can actually extract considerable insight from a CART model. (This is certainly not true for classical statistical models). The single CART tree often yields insights ranging from highlighting clear data errors to revealing important data segments and offering a floor to predictability of the data.
Today's CART includes a large number of extensions including "hotspot" detection, missing value imputation, nonlinear "correlation" maps among predictors, and more than 30 prepackaged experiments designed to help the user move rapidly to a high quality model. CART is best for classification problems but can provide good initial insight into regression problems.
MARS is a specialist tool for those who are looking for a model that looks like a regression (or logistic regression) but is not confined to linearity, automatically adapts to missing values in predictors, and provides clear graphical displays of the discovered nonlinearity in predictors.
TreeNet is Jerome Friedman's stochastic gradient boosting (from the person who invented the technology) and is our preferred tool for achieving maximum possible accuracy. TreeNet yields exceptionally high performance for both classification and regression and offers methods for adapting well to outliers, anomalies, and outright coding errors even in the dependent variable. TreeNet also includes new tools for the effective discovery of the most important interactions, which can be thought of as sub-segment discovery.
RandomForests is based on Leo Breiman's original code but includes extensions and modifications coming from both Breiman and co-author Adele Cutler which are unique to SPM. Later this year we expect to publish independent university-based research showing the clear superiority of this "true RF" over several non-Salford implementations.
SPM 7 also includes Friedman's proprietary regularized regression (and logistic regression) GPS (Generalized PathSeeker). GPS includes capabilities not found in any other regularized regression package and we leverage it not just for direct modeling, but also in hybrid models combining CART, MARS, TreeNet, RandomForests, and GPS.
Here is Part 2 of the Interview with Dan Steinberg, President of Salford Systems.