KDnuggets News, July 2015 (15:n23)

Statistics Denial Myth: Repackaging Statistics With Straddling Terms


Data science is old wine in new bottles: statistics repackaged with other fields. Here we bust the myth that the data scientist is something new and different from the traditional statistician.



By Randy Bartlett, Blue Sigma Analytics.

"I almost feel that folks in data science [excluding statisticians] are suddenly realizing that this kind of work is not new and are desperately looking for ways to justify a distinction."

—Thomas Speidel

[Machine Learning is simply a] “loose confederation of themes in statistical inference (and decision-making)”

— Michael Jordan

"Without a grounding in statistics, a Data Scientist is a Data Lab Assistant."

—Martyn Jones

Myth #3: Data mining, machine learning, Big data analysis, business analytics, and data science are distinct from statistics

Repackaging statistics with complementary fields has the potential to create new synergies. For example, econometrics is the marriage of economics and statistics. This repackaging has been extremely successful, surpassing Six Sigma's mixed results. Econometrics has embraced statistics. Applied econometricians have helped develop best practice, and some identify as applied statisticians. They are with the science.

Straddling Distinct Applications:

The interests of the 'promotional industrial complex' are to sell things, things like books, magazines, conferences, workshops, new degree programs, software, advertisement space, and newly anointed 'experts.' New things sell better. Promotional interests are not married to protecting the integrity of statistics or best practice.

The IT part of the promotional industrial complex has started using the terms Data Mining, Machine Learning, and Data Science to include both data analysis AND data management. In the field, we are problem-oriented, not tool-oriented. This makes repackaging data analysis with data management comparable to packaging addition problems with sorting problems and calling it 'Add-Sort Science.' In some circles, the point of this repackaging is less about finding synergies and more about expanding IT by giving it more missions. The next unbelievably bad idea being shopped around is that data analysis is somehow a data management problem?! If it were, then we should be better at applying statistics to reporting and data collection.

Data analysis and data management are distinct applications, separated by differences in culture, software, objectives, and thinking. Data management emphasizes efficiency in storing and accessing data, and statistics is about extracting information in the presence of uncertainty.
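The contrast can be made concrete with a minimal Python sketch. The order records and variable names below are hypothetical and exist only for illustration. The first half is a data management task: fetch a stored record quickly and exactly. The second half is a data analysis task: treat the amounts as a sample and say how uncertain we are about the mean.

```python
import statistics

# Hypothetical order records; the values are made up for illustration.
orders = [
    {"order_id": 101, "amount": 20.0},
    {"order_id": 102, "amount": 35.0},
    {"order_id": 103, "amount": 25.0},
    {"order_id": 104, "amount": 40.0},
    {"order_id": 105, "amount": 30.0},
]

# Data management: index the records so one can be retrieved efficiently.
# The answer is exact; there is no uncertainty to quantify.
index_by_id = {row["order_id"]: row for row in orders}
fetched = index_by_id[103]

# Data analysis: treat the amounts as a sample and estimate the mean,
# together with a measure of how uncertain that estimate is.
amounts = [row["amount"] for row in orders]
mean_amount = statistics.mean(amounts)
std_error = statistics.stdev(amounts) / len(amounts) ** 0.5

# Rough 95% interval using the normal approximation (1.96).
# A sample this small would really call for a t quantile instead.
interval = (mean_amount - 1.96 * std_error, mean_amount + 1.96 * std_error)
```

The lookup is judged by speed and correctness of retrieval. The interval is judged by whether its assumptions (here, a random sample and an approximately normal mean) are defensible, which is a statistical question.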

Any repackaging of data analysis with data management that removes statistics expertise from the data analysis is a bad idea. However, bad ideas can happen, even linger. A popular trend in the 1960s was for corporations to merge into conglomerates, which provided no synergies and made no economic sense. Without the mergers, shareholders could invest in each company separately and realize the same return. Even so, these conglomerates continued for a decade, and these types of mergers still happen. The merger of data analysis with data management does not have to make sense, and it can last a long time without making sense.

In the case of Machine Learning, academia emphasizes how this set of tools works: their facility for iterative learning. In the field, we have a problem-based view. Any Machine Learning tool that solves statistics problems will necessarily make statistics assumptions and require statistical thinking. These tools are part of statistics or Statistical Machine Learning. We provide a clarifying problem-based view of statistics in the May/June 2015 issue of Analytics Magazine, http://goo.gl/Wod3gk.
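As one illustration of this point (a sketch with made-up data and hypothetical function names, not an example from the Analytics Magazine article), consider fitting a straight line. An iterative "learning" procedure, gradient descent on squared error, arrives at the same slope and intercept as the classical least-squares formulas, because both are solving the same statistical estimation problem under the same assumptions.

```python
def ols_closed_form(xs, ys):
    """Least-squares slope and intercept from the textbook formulas."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def ols_gradient_descent(xs, ys, learning_rate=0.01, steps=20000):
    """Minimize the same mean squared error, but iteratively."""
    n = len(xs)
    slope, intercept = 0.0, 0.0
    for _ in range(steps):
        residuals = [slope * x + intercept - y for x, y in zip(xs, ys)]
        grad_slope = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_intercept = 2.0 / n * sum(residuals)
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Made-up data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

closed = ols_closed_form(xs, ys)
iterative = ols_gradient_descent(xs, ys)
```

The learning rate and step count are arbitrary choices that happen to converge for this small data set. Whichever route is taken, interpreting the fitted line still depends on the usual regression assumptions about the errors.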

The Venn Diagram in Figure 1 illustrates two areas of application for applying Machine Learning: data analysis and data management. From an applied perspective, there is no overlap.

Fig 1: Statistical Machine Learning and IT Machine Learning have no overlap

The same relationship holds for Data Mining and Data Science. In the field we should be problem-based and that means splitting these problems:

  1. Machine Learning = Statistical ML + IT ML
  2. Data Mining = Statistical DM + IT DM
  3. Data Science = Statistical DS + IT DS

What Is Wanted:

We want to keep the statistics expertise in the data analysis. This means embracing specialization, even if this does not help the promotional industrial complex to sell things. Large corporations need separate teams for data analysis and data management. Business managers should be playing chess, not checkers.

Consumers of data analysis should look to statistical certifications, like the PSTAT®, to ensure that statistical qualifications are brought to bear.

Close:

There is a flood of statistical malfeasance on its way. Wise consumers of data analysis want to avoid removing statistics expertise from their data analysis.

Repackaging data analysis/statistics with data management/IT will not provide further synergies in the field. It will just sell things.

In the field, we are problem-based. We want to split Data Science, Data Mining, and Machine Learning to match our business problems: Statistical DS, DM, & ML and IT DS, DM, & ML.

We sure could use Deming right now. Many of us who consume or produce data analysis hang out in the new LinkedIn group About Data Analysis. Come see us.

Bio: Randy Bartlett, PhD, CAP®, PSTAT® is a Statistician/Statistical Data Scientist; Author of ‘A Practitioner’s Guide To Business Analytics’.


