Stephen McDaniel on Data Science vs Statistics

Stephen McDaniel, a noted expert in data science and visualization and founder of Freakalytics, provides his perspective on the Data Science vs Statistics debate.

This comment by Stephen McDaniel on the KDnuggets item

Is Data Science The End of Statistics? A Discussion

was very interesting, and I wanted to share it with a wider audience.
Gregory Piatetsky, Editor.

Stephen McDaniel, May 5, 2013

As a speaker at several big data conferences, my opinion leans toward a data scientist as someone who is naturally curious (a dreamer as a child?), creative, interested in helping people solve business problems, literate in programming, experienced in statistics, willing to 'get their hands dirty' with intense data management and equipped with a variety of other skills (like operations research, systems admin, systems integration and more). You can throw a wide range of variability into the latter skills, but a lack of the former is a deal breaker IMO.

I am not certain how it happened, but there is definite confusion that a data scientist is simply someone who uses Hadoop or MapReduce, or is a Python hacker, with which I strongly disagree. That is like saying a statistician is someone who knows SAS or SPSS, which is most definitely not the case. It is similar to the fallacy of causality based on correlation.

As a former "classic" statistician and biostatistician dealing with "small data" back in the '90s, I agree with many of the points made here. I do think that the core statistical profession made statistics too rigorous, often scoffing at longitudinal analyses and insisting on a level of rigor in regular business statistics that frankly made them appear out of touch. I have seen quite a few statisticians within industry realize the folly of this rigor, which was perfectly reasonable for testing drugs or deciding probabilities of guilt in a criminal trial (both life and death situations, where rigor makes a lot of sense).

For me, the turning point was when I joined Netflix after writing SAS for Dummies. At Netflix, there were many great analysts in the business, but advanced analytics usage was surprisingly light. I spent my time at Netflix in two roles:

1) developing a cohesive view of the customer base with data, constructing segments, lifetime value estimates and integrating great work from many analysts across the business into a single "customer view" and

2) attacking problems on which other analysts in the business had reached dead ends, but which were of substantive business value.

What I found fascinating is how the creative application of statistical techniques would often add insight into business problems with just a few days of work. My classic stats training inclined me towards even more work and rigor, but business execs were thrilled to have a new approach or variable to experiment with and often asked me to stop until they could go and work on the business problem based on my 'initial' findings.

Even more surprising was how much my managers wanted me to stay when I chose to leave for family reasons. It all came down to my willingness to study the business, work aggressively to enrich the data warehouse and listen to subject matter experts. Most important was providing actionable recommendations (not all the time; I often hit dead ends quickly and attacked from a different angle).

I never considered myself a data scientist or even a statistician; I thought I was just a creative problem solver with a vast array of tools that I was willing to use in non-traditional ways. In business, it is often the ability to explore observed outcomes that aren’t part of an experiment that can actually lead to changes that become the experiment (often performed with many other uncontrolled variables, but still an experiment of sorts).

I was also fortunate to have attended a very applied, large program in statistics at North Carolina State that allowed me to survey almost every area of statistics from theory to non-parametrics to QC to Six Sigma to biostats (at UNC-CH) to classic experimental design, all wrapped in a heavy orientation towards using SAS for solving real-world case studies back in the early '90s. I had also been programming since I was 12, and I worked in data warehouse design and even BI through the years.