What is Wrong with the Definition of Data Science

A veteran statistician argues that 3 different areas usually included in "Data Science" require dramatically different skills, education, and training, with very little overlap.

Guest blog by Michael Mout (CStat, CSci, Ret.), Dec 23, 2013

"Jack of all trades, Master of none" is usually not a compliment.

The newly invented field of Data Science (DS) seems to want to be all things to all people. (en.wikipedia.org/wiki/Data_science)

...[DS includes] mathematics, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products.

These 9 areas of study can be classified into 3 areas:

  • Advanced Analysis - Math, Stats, Pattern Recognition/Learning, Uncertainty, Data Mining, Visualization (?)
  • Computer Systems - Advanced Computing, High Performance Computing, Data Mining, Visualization (?)
  • Data Bases - Data Engineering, Data Warehousing.

3 Areas of Data Science: Statistics, Computing, and Database

These three areas require dramatically different skills, education, and training, with very little overlap, except perhaps in data mining and visualization.

Suppose one wants to find outliers in millions of insurance claims in order to detect fraud. One would want someone experienced in sampling and multivariate statistical analysis (if one needs to ask why, it proves the point). One might then want the computer guy to design and implement the solution.
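To make the "why" a little more concrete, here is a minimal sketch of one common approach, not necessarily the one any given statistician would choose: estimate the mean and covariance from a random sample of claims, then score every claim by its squared Mahalanobis distance. Everything in it is an assumption made for illustration: the two features, the synthetic data, the hypothetical function names, and the chi-square cutoff, which presumes the features are roughly bivariate normal.

```python
import math
import random
import statistics


def fit_2d(sample):
    """Estimate the mean and sample covariance of (x, y) pairs."""
    xs = [p[0] for p in sample]
    ys = [p[1] for p in sample]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    n = len(sample)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in sample) / (n - 1)
    return mx, my, sxx, sxy, syy


def mahalanobis_sq(point, params):
    """Squared Mahalanobis distance, with the 2x2 inverse written out."""
    mx, my, sxx, sxy, syy = params
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mx, point[1] - my
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


def flag_outliers(claims, sample_size=500, tail_prob=0.001, seed=0):
    """Fit on a random sample, then return indices of claims beyond the cutoff."""
    rng = random.Random(seed)
    sample = rng.sample(claims, min(sample_size, len(claims)))
    params = fit_2d(sample)
    # Assumption: for two roughly normal features, d^2 is approximately
    # chi-square with 2 degrees of freedom, whose upper-tail quantile
    # has the closed form -2 * ln(tail_prob).
    cutoff = -2.0 * math.log(tail_prob)
    return [i for i, c in enumerate(claims) if mahalanobis_sq(c, params) > cutoff]


# Synthetic data only: (claim amount, a second feature correlated with it).
data_rng = random.Random(42)
claims = []
for _ in range(5000):
    amount = data_rng.gauss(1000.0, 200.0)
    claims.append((amount, 0.01 * amount + data_rng.gauss(0.0, 1.0)))
# Planted claim: neither value is extreme alone, but the pair is unusual.
claims.append((1400.0, 8.0))

flagged = flag_outliers(claims)
```

The planted claim sits within about two standard deviations on each feature taken separately, so a column-by-column check would likely pass it; it stands out only when the two features are considered together. Choosing the sample, the features, and the cutoff is where the statistical judgment comes in.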

Defining DS to include all these specialties makes life confusing for employers and applicants.

If a company advertises for DS, do they expect someone who can do sophisticated data analysis AND manipulate large data AND design/build sophisticated systems? That would be like advertising for a Scientist when they really want a Chemist or Physicist or Biologist.

One solution is to educate HR and employers that they need to break DS into specialties, such as:

  • Statistical/Analytical DS
  • Computer Systems DS
  • Database DS

The following is not meant to be a precise description of the overlap, but only a graphic guideline.

On the other hand, we could just throw away the Data Scientist category and go back to Statistician, Data Base Analyst, and Computer Scientist.

Bio: Michael Mout has over 40 years of experience in statistical analysis: teaching, statistical models, basic data analysis, predictive models for consumer behavior, fraud modeling, and many custom applications.

Gregory Piatetsky, Editor: I invited Michael Mout to share his opinion with KDnuggets readers, since he was an active participant in the intense debate (over 250 comments) on LinkedIn Group Advanced BA, DM and PM, prompted by my post Why statistical community is disconnected from Big Data and how to fix it.

Indeed, it is hard to find people who share all the desired data science skills (see another well-known Data Science Venn Diagram from Drew Conway), so there is a growing recognition that companies need a team of people to cover this set of skills, rather than search for a "unicorn" data scientist with all of them.