Exclusive: Part 2 of my Interview with Dan Steinberg, President of Salford Systems, Data Mining pioneer
Part 2 of my exclusive interview with Dan Steinberg, on CART, MARS, RandomForests, working with Leo Breiman, winning competitions, Big Data, and advice to aspiring data scientists. If you missed one revolution, wait a bit to catch the next one!
By Gregory Piatetsky, Jun 15, 2013.
Here is Part 1 of the Interview with Dan Steinberg, CEO of Salford Systems (including Dan's bio).
6. Gregory Piatetsky: Salford Systems has been around for 30 years. How did the industry and the competitive landscape change?
Dan Steinberg: I will need to write another book chapter to answer that one! One motivation for starting Salford was the advent of the personal computer. I realized that we were going through an extraordinary change and I wanted to be part of it.
I told people at the time I was concerned that a revolution of that scale might only happen once in a lifetime and I did not want to be just a bystander. Since then we have experienced the rise of the internet, Google search, mobile computing, and now Big Data technology. So, in retrospect, it appears that if I missed one revolution I only needed to wait a bit to catch the next one!
The biggest changes have been discussed by many, but the one that truly stands out for us is the attention now given to analytics. In past years I discovered that telling people what I did for a living in a social situation could be a conversation stopper. It might still be, but at least people are now prepared to consider data science "cool".
7. GP: Salford Systems has won many prizes at different competitions. Tell us about your approach to them.
DS: The one we are most proud of was the KDD Cup 2000 organized by Ronny Kohavi which involved the analysis of e-commerce weblogs to predict online behavior. It required heavy duty data preparation, creative feature construction, and imaginative modeling. The competition had 5 "challenges" and we won first place in two of them.
We also won another competition organized by Teradata predicting churn in telecommunications. There we came in first in all four of the competition challenges. After that, the majority of the wins we cite were won by our clients using our software as we found the effort required to seriously go for the win to be more than we could spare.
8. GP: What do you think of Big Data, both the hype and the trend?
DS: The software and tech industries are in a state of perpetual hype so it is important to be able to separate the story from the spin. It is very tempting for people who have devoted their entire careers to small to moderate sized data analysis to view the Big Data movement with scepticism. Sampling is often sufficient to support superb analysis and we plan to publish some results showing how using less than literally "all" the data actually yields superior future predictive performance.
Further, today (June 10, 2013) I configured a Dell Rack mounted server with four 6-core CPUs and 512GB of RAM for less than $18,000. You can crunch considerable data on this machine and in some cases literally run 48 analytical threads in parallel. The price of RAM goes up steeply beyond 512GB and going to 1TB would double the price. But if you must manage about 1TB of data for analysis you can get the needed hardware for $40,000.
But having made the point (that we can still go a long way with a single well equipped modern server) the Big Data revolution has ushered in truly remarkable new technology for processing data at the scales that large e-commerce and social networking sites must operate. The fact that we can fairly easily work with data spread out over 1,000 well equipped servers (say 16TB disk storage each) is remarkable. From the point of view of analytics it is worth observing that the vast majority of Big Data analytics to date is confined to descriptive statistics. At Salford, we are hard at work to deliver effective advanced data mining to this new world of distributed computing.
9. GP: I noticed you have a blog post "Sanitizing Data: Keep the Details of Your Data Mining Project Private".
Is it possible to effectively anonymize data in today's era of online sharing?
Note: here are some relevant papers:
- Netflix prize de-anonymized
- Latanya Sweeney showed that Zip + birthday uniquely identify about 75% of US residents.
DS: Actually, what we mean by sanitizing is obfuscating the column names of a data set. We introduced the feature into our software to assist users wanting advice or tech support via providing us with data while not revealing too much about it.
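Editor's note: the kind of column-name obfuscation described above can be illustrated with a minimal Python sketch. This is not Salford's implementation; the function name `sanitize_csv` and the `VAR1`, `VAR2`, ... naming scheme are assumptions made purely for illustration. Note that only the column names are hidden, while the values themselves pass through unchanged.

```python
import csv
import io


def sanitize_csv(text):
    """Replace the header of CSV text with generic names (hypothetical helper).

    Returns the sanitized CSV text plus a key mapping each generic
    name back to the original column name, which the data owner keeps private.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]

    # Assumed naming scheme: VAR1, VAR2, ... assigned by column position,
    # so duplicate original names still get distinct placeholders.
    new_names = [f"VAR{i}" for i in range(1, len(header) + 1)]
    key = dict(zip(new_names, header))

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(new_names)
    writer.writerows(data)  # data values are copied as-is, not anonymized
    return out.getvalue(), key


if __name__ == "__main__":
    sanitized, key = sanitize_csv("income,churned\n52000,1\n")
    print(sanitized)
    print(key)
```

The sanitized text can be shared for advice or tech support, and the private key lets the owner translate any feedback back to the real column names.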
10. GP: Advice to people considering a career in Analytics and Data Science?
DS: After completing my graduate courses in econometrics at Harvard I got a part-time job at a Boston-area consulting company to support myself while I worked on my thesis. I quickly discovered that I was ill equipped to do the work necessary, which included working with a relational database and extensive data preparation, followed by regression modelling.
Fortunately, after 5pm each day the time sharing computers became very responsive and the more technically oriented people at the company finally got to work. With experienced colleagues around and fast (for that era) computation I managed to learn everything they hadn't taught me in school!
While we live in a different world today, it is still the case that a budding data analyst or data scientist needs considerable real world experience to learn. Beyond this, there is a wealth of good on-line learning material as well.
11. GP: What recent books have you read and liked?
DS: I rarely read one book at a time as I am often looking for some very specific information. Books that I have been reading parts of recently include "An Introduction to the Bootstrap" by Efron and Tibshirani (1993), which is full of great data analytics insights, "Programming Pig" by Alan Gates, and Alan Blinder's superb "After the Music Stopped" (2013) about the recent financial and economic crisis.
See also an interview of Dan Steinberg by Ajay Ohri.