# Statistical Community and Big Data disconnect: Discussion Highlights

Highlights from a vigorous discussion on the statistical community and Big Data, including: Are data scientists reinventing statistics? Did statisticians miss the boat in the 1990s? Is more data always better? Statistics 2.0?

By Gregory Piatetsky, Dec 5, 2013.

My post Why statistical community is disconnected from Big Data and how to fix it, which presented opinions from leaders of the ASA (American Statistical Association), generated a vigorous discussion with over 50 comments on LinkedIn, mainly in the Advanced Business Analytics, Data Mining and Predictive Modeling group. Here are some of the most interesting comments.

Key themes raised included:

- statisticians are used to getting bashed, because society has low statistical literacy
- statisticians fumbled government funding and have not progressed much since the 1990s
- many problems (like health care) do not have huge datasets
- data scientists are reinventing statistics
- data science has a major computation/data-processing component, unlike statistics
- the data science role is filled not by a single person but by a team (which includes a statistician)
- more data is NOT always better
- some statisticians envision the jump start of Statistics 2.0

**Kevin Gray**, International Marketing Science and Analytics Consulting

Statisticians are accustomed to being ignored, not challenged. :-)

**Randy Bartlett**, Ph.D. PSTAT® (Analytics LION)

Kevin, now we are accustomed to getting bashed as well. Our society has low statistical literacy. We are not disconnected from Big Data, no matter what anyone says.

magazine.amstat.org/blog/2013/10/01/we-are-data-science/

**Thomas Speidel**, Statistician & Data Scientist

"Statistics has been the most successful information science. Those who ignore statistics are condemned to re-invent it." (B. Efron)

Isn't it funny that this is exactly what we are starting to see?

**Thomas Ball**, Research Consultant

I was lucky enough to attend the joint ASA/NSF Massive Data Conference in 1996 in Washington. Massive data then was measured in terabytes, and we were all in awe of the NSA guy who was receiving 10 TB of satellite data per day. There was something of a consensus on a few topics: 1) Few tools were capable of analyzing massive amounts of information; those that existed were mostly limited to machine learning approaches (e.g., trees) or leveraged simple heuristics over highly aggregated data. 2) Sampling was a solution if and only if it did not destroy data structure and variance. 3) 20th-century statistical solutions had been developed on, and for, small amounts of information and did not scale. 4) New statistical models and solutions that did scale were badly needed if the profession was to remain relevant.

Massive data has become big data, but most, if not all, of the issues identified in 1996 remain relevant today. In the years since, it is not clear that the statistical profession has made much progress in advancing solutions that scale: judging by any number of LinkedIn discussion threads, sampling remains the most frequently recommended best practice and prescription. By stark comparison, fast and efficient machine learning algorithms are being developed that can, if they work as claimed, successfully analyze data that is massive even by today's standards. Based on reports from the frontiers of current ML practice, these solutions are rooted in computer science, information theory and Kolmogorov complexity, and for the most part bypass statistics.

The recent Future of Statistics conference, attended by about 200 people, almost all statisticians, reads in large part like a review of the ground covered in 1996. As a practitioner, I cannot speak knowledgeably to why things have unfolded in statistics as they have, but my best guess is that statisticians fumbled the government funding ball, failing to obtain the necessary grants (the lifeblood of innovation and one true metric of current relevance for any academic profession), and/or lost that battle to apparatchiks in other disciplines (ML?) who were smarter, better connected, or had deeper pockets to begin with. Hal Varian's quote notwithstanding, this history should not be allowed to repeat itself if statistics is not to be relegated to the intellectual backwaters of the 21st century.

**Thomas Speidel**

Thomas B.: not every problem revolves around terabytes of data. Lots of statisticians work in health research, where sample sizes are often limited (which is good), so the computational limitations you referred to are not an issue. Statisticians are more publication oriented, whereas ML and CS are more conference/proceedings/white-paper oriented. There are benefits and disadvantages to either approach: the former is more rigorous, but the latter leads to faster innovation.

**Kevin Gray**

No small part of Data Science is really data processing, sometimes very sophisticated data processing, but DP nonetheless. If my characterization is correct, it would not be surprising that computer scientists have the lead in Data Science. On the other hand, Breiman, Tibshirani, Friedman and some of the others who developed many of the algorithms used in Data Science have actually been statisticians.

**Louis Giokas**, Technology Consultant

I cannot agree with the characterization on the web site mentioned. I have done lots of statistical work over the years and am now in an MS program in Applied Statistics. We study both top-down and bottom-up methodologies. We search for meaning in the data sets we are given.

I do not believe there will be many people who really work on both sides of the "Big Data" equation. I may be one, since I also have many years in the database area, working for IBM and Oracle. This is not, however, something taught at schools. Any data-handling methodologies taught at schools will generally be obsolete soon after they are taught. Data access methods, such as SQL, are one thing; data infrastructure is another thing entirely. The current method many talk about is Hadoop. Well, I just saw an article on the Big Data Republic site pointing out that Hadoop is yesterday's news; that is why Google made it available. They are already on to the next thing. I also see many PhD statisticians who have relabeled themselves data scientists (I guess I should as well) and then turn around and say dumb things about the data-handling part of the process.

What we are doing at DePaul is adding courses in Data Mining, for example. There is also a new program where alums can take courses, at a discount, in new areas of statistics as they come up. These will be like certificate programs. With technology moving as it is, there will always be a need to update one's knowledge base.

**Phil Scinto**, Senior Fellow at The Lubrizol Corporation

Excited to see movement in this area. I applaud our ASA leadership. I do believe that statisticians are currently working in analytics and big data projects, but we need to promote what we bring to the table. Our biggest strengths are that we understand there is variability in every process and that correlation is not causation. Organizations need to understand that big data is not magic. Root causes still need to be hypothesized and investigated. Sensitivity and utility should also be explored.

**Wayne Applebaum**, Vice President-Analytics and Data Science at Avalon Consulting, LLC

The ASA and Big Data article that Gregory presents contains an interesting description of the role of the data scientist and the many facets it contains. What businesses are beginning to realize is that this is not a description of one person (as much as they would like it to be); it is a description of a team. Try as you may, you won't find many of these people. Data Science is a team, not a person.

Getting back to Randy's comment: one elusive trait is a good deal of statistical literacy combined with the ability to communicate with the business side. Key to the success of any Big Data project is producing reliable and valid results that are applicable to the business. Defining the appropriate "question space" for investigation, and the boundaries of generalization, is fundamental to both Statistics and Data Science. The hype says that Big Data will provide answers to questions you didn't know you had. While it might (and probably will) uncover relationships you didn't know existed, it is equally important to understand their magnitude, impact and, in some cases, causation in order to integrate them into a business process.

**Bill Luker Jr**, Deploying Advanced Expertise in Statistical and Econometric Analysis

...All of this raises the question, or the assertion, unspoken by the computer scientists and engineers who are driving the bus on this whole thing, that always and everywhere more data is better, and the most data is the best. Somebody needs to convince me of the correctness of this assertion. It reminds me of the debates carried out at the turn of the 20th century, when epistemologists, Bertrand Russell among them, argued about the nature of "facts." Were there such things as basic, incontrovertible and irreducible facts?

That turned out to be an empty argument, a philosophical dead end. But it has arisen again in the quest for more, and more, and more data. It is essentially a Platonic endeavor, in which the pursuers of absolute, incontrovertible truth as irreducible "facts" are at it once again, thinking that if they compile enough data there won't ever be any uncertainty or error, and that there will be complete information (forgetting for the moment that data is not information) about any question. This is a chimera, and we will look back on those who pursue this desideratum the way we look back on Rudolf Carnap and the rest of the logical positivists of the 1920s: smart, but terribly misguided.
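The statistical core of this objection can be made concrete: when data is collected with selection bias, adding more of it shrinks the random error but leaves the systematic error untouched. A toy sketch (hypothetical numbers, not from the discussion) in Python:

```python
import random

random.seed(0)

# A toy population: values 0..99, so the true mean is 49.5.
population = list(range(100))
true_mean = sum(population) / len(population)

# Hypothetical selection bias: only values below 80 are ever observable
# (think of a sensor that saturates, or a platform that logs only some users).
observable = [x for x in population if x < 80]  # biased mean: 39.5

def biased_sample_mean(n):
    """Mean of n draws from the biased (truncated) population."""
    sample = [random.choice(observable) for _ in range(n)]
    return sum(sample) / len(sample)

for n in (100, 10_000, 1_000_000):
    est = biased_sample_mean(n)
    print(f"n={n:>9,}  estimate={est:.2f}  error={est - true_mean:+.2f}")
```

As n grows, the estimate converges toward the biased mean (39.5), not the true mean (49.5): the sampling noise vanishes, but the bias of roughly 10 stays no matter how much data is piled on.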

**Gregory Piatetsky-Shapiro**, Analytics/Data Mining Expert, KDnuggets Editor

@carey - why should statisticians "be the leaders of the Big Data and data science movement"? Except for a few statisticians like Breiman and Tibshirani, most statisticians missed the boat on Data Science and Big Data, and statistics deals with neither the computational aspect, which is critical for Big Data, nor the business aspect, which is critical for getting results.

**Wayne Applebaum**, Vice President-Analytics and Data Science at Avalon Consulting, LLC

Bill makes an excellent point: whatever the size or structure of the data set we are talking about, certain fundamental rules apply. One of these fundamentals is the need to know what the data represents, whether it comes from a structured or unstructured source. If the definition is fuzzy (which sounds like an errors-in-measurement issue), then that needs to be taken into account in the analysis. Randy Bartlett's earlier comment on low statistical literacy is particularly appropriate here.

It is also interesting that the term "statistics" seems to be being replaced by "analytics." Although there is considerable debate about what analytics is, it does have as its foundation the principles of sound statistical and modeling practice. Ignoring these is very dangerous. It also implies, as Cary suggests, perhaps expanding the definition of what "Statisticians" do.

**Vincent Granville**, Co-Founder, Big Data Scientist, DSC Network

I think there are two very different statistical communities:

(1) Those involved with big data; they don't call themselves statisticians anymore.

(2) Those attached to statistical associations (AMSTAT, etc.). They are survey or government statisticians, or work on clinical trials. All the official statistical societies, worldwide, focus almost exclusively on these narrow areas of statistics, with an emphasis on theoretical rather than applied stats. Just look at the job ads and articles published in their journals, websites and newsletters.

Read more on the subject at www.datasciencecentral.com/profiles/blogs/data-science-the-end-of-statistics

Also, I think you can be both a data scientist and a statistician at the same time, just as you can be a data scientist and an entrepreneur at the same time; I would even go so far as to say that is a requirement. The roles are certainly not incompatible; we just have to be aware that the official image of statisticians, as pictured in AMSTAT publications or on job boards, does not represent (for now, at least) the reality of what many statisticians do.

Still, data science and statistics are different, in my opinion. Many of the data science books I've read can give you the impression that they are one and the same, but that's because the author just re-used old material (not even part of data science), added a bit of R or Python, and put a new name on it. I call this fake data science. Likewise, data science without statistics (or with reckless application of statistical principles) is not real data science either.

**Jeremy Wu**, Adjunct Faculty, George Washington University

Statistics is well established globally as an applied science and has contributed much to the advancement of society over the past century. While a few in the American Statistical Association may not have demonstrated understanding of or enthusiasm for Big Data, that is not totally surprising: random sampling, the core of modern statistics, was debated for 40 years before it was accepted. Call it arrogance, complacency or fear if you will, but there are also many statisticians who envision the jump start of Statistics 2.0. There are statisticians who have been working on Big Data and consider themselves statisticians. Unfortunately, their thoughts cannot be shared in this group because their postings seem to be censored. Attacks on statistics do not advance data science. If there is merit, both statistics and data science will flourish and thrive for the next hundred years.

I will resubmit the Statistics 2.0 article for posting in your group. Current Statistics 1.0 is lacking in three important areas: (a) inadequate emphasis on and use of descriptive statistics and exploratory data analysis with modern data visualization, (b) innovative development and use of dynamic frames to conduct real-time analysis of the population, and (c) development of theories and methods for statistical inference from non-randomly collected data. Dr. Xiao-Li Meng, Dean of the Graduate School and Chair of the Department of Statistics at Harvard University, has prepared an article titled "A trio of inference problems that could win you a Nobel Prize in statistics (if you help fund it)," scheduled to appear in the Chapman and Hall/CRC Press book "Past, Present, and Future of Statistical Science" on April 4, 2014 (see www.crcpress.com/product/isbn/9781482204964). Dr. Meng has agreed to share this article with those who are interested; a copy will be supplied upon an email request to Jeremy.S.Wu@gmail.com.

**Bill Luker Jr**, Deploying Advanced Expertise in Statistical and Econometric Analysis

I am trying to argue two points (and then, back to work):

(1) Please let a thousand flowers bloom, particularly in the statistically intensive, empirical, data-driven fields that people call sciences (the physical, natural, social/organizational/behavioral/management/historical/evolutionary/systems sciences), because fantastic new approaches are being developed every day for a collectively huge problem set, built on older techniques but incorporating much that is new and more efficient. These are 5th- and 6th-generation analytical tools (going back to the 1950s), and indeed they need, and are able to handle, much bigger data sets than ever before, and routinely do. But to us it is almost always not the volume that matters, but the reliability and validity. What this rambling amounts to is that we want to say to the big data evangelists and the people Vince calls "fake" data scientists (he got that right) the same thing I would say to a Jehovah's Witness at my door: let us do our thing, and we'll tell you when to come around and get us some servers, or whatever, when we run into problem sets that require exabytes, capisce? Or come back and show us some analytical techniques for those exabytes of data that will really blow our socks off and meet a need we have decided we have, not a need you have made up for us.

And (2), pursuant to that: all of us who know and use statistical-analytic approaches as an indispensable part of the toolkit of the empirical discipline in which we theorize and practice want a seat at the table whenever people start talking universally about big, paradigm-shifting things with respect to data of any size, kind, or type. We have not been granted that, because the conversation about big data has occurred within an echo chamber occupied almost exclusively by computer scientists and engineers. And we do not want to talk only about technical things like how to get more data, store it, and retrieve it, but about other very important things that have been skipped in the rush to claim a piece of new-science glory: what data we should get that will actually help us understand some big problem X, Y, or Z. So we want the SME analysts, the disciplinarians with strong industry and academic analytical creds (cf. point 1 above), among the firsts at the table when we start talking about data. Our views are important and will leaven a broadening of understanding and mutual respect for us all.

Oh yes, and if we hear one more time that the world is adding such-and-such more exabytes of data every day, we will collectively retch. It was a meaningless statistic even before it was overused to the point of complete insufferability. I think everybody can agree on that.