Will the Real Data Scientists Please Stand Up?

Job postings for data scientists are everywhere. But what is a data scientist? I present a few archetypes.

According to the Harvard Business Review, data scientist now reigns as the sexiest job of the 21st century. According to Forbes columnist Bob Violino, data science skills are among the hottest, with demand far outstripping supply. And yet, as I compose this article for KDnuggets, arguably among the most prominent sites serving the data science community, I realize that I seldom know what anyone truly means by "data scientist". Following on a previous post on Data Science's most Used, Confused, and Abused Jargon, I'll provide some background on this vague term's use and offer some archetypes of distinct categories of data scientists.

Recent KDnuggets posts enumerate the most demanded skills and 9 must-have skills of data scientists. Perusing one list of skills, it seems that a data scientist must possess scripting and query language chops as well as business acumen in order to find actionable insights in datasets. These data scientists would appear to be the end users of machine learning software tools. And yet dozens of other posts and job listings seek candidates to carry out fundamental research into machine learning algorithms and applications. These data scientists appear to be researchers with backgrounds in mathematics and machine learning.

The range of candidates for so-called data science positions has grown to include computer scientists, mathematicians, and physicists as well as business school graduates, economists, and other social scientists. Some positions seem to require mathematical maturity, others superior coding skills, and yet more are clearly looking for SQL jockeys who can generate visualizations and insert them into PowerPoint presentations.


A Horribly Vague Definition

According to Wikipedia, "Data Science is the extraction of knowledge from data". So is a data scientist one who extracts this knowledge or one who studies methods of extracting it? Further, is machine learning a superset, subset, overlapping set or disjoint set from data science? Is Geoff Hinton a data scientist? While his body of research spans the fields of machine learning, cognitive science, computational neuroscience and even psychology, he is a leading researcher who has invented many of the tools that now make it possible to recognize objects in images, phonemes in audio, and patterns in text.

Or does data science more properly connote a data miner, one for whom an interesting dataset is a first-class priority, on par with novel algorithms? This person would be an engineer for whom insights are more important than capabilities. Still, I would expect someone in this category to possess strong programming skills.

IBM's website contains a post titled "What is a data scientist?", offering yet a third, less technical, notion of such a person. According to their definition, "A data scientist represents an evolution from the business or data analyst role". This clearly describes neither a theorist nor a data miner as laid out above. Perhaps IBM is self-interested in putting forth this definition, offering a glorified title for a role that might heavily rely upon its proprietary software.

Now, I'll present five archetypes, each representing a sense in which the term data scientist is frequently used.


Theorists

What They Do: The word "theory" itself can mean many things to many people. Generally, in the machine learning community, theorists are computer scientists, mathematicians, and statisticians who primarily study algorithms that are provably efficient and provably correct, even if they must rely on unrealistically strong assumptions. Theory papers contain proofs of correctness, proofs of convergence, and guarantees on performance.

Tools of the Trade: Theorists rely mostly upon paper, writing utensils, and occasionally email and MATLAB to be productive.

Where they Work: Most pure theorists aim for academic jobs. However, large private research institutions like Microsoft Research and IBM Research employ a large number of top researchers. Some large companies, such as Google, do have problems so novel as to erode the disconnect between theory and practice.


Machine Learning Scientists

What They Do: Machine learning scientists sit somewhere between theorists and data miners. For these scientists, a single method whose behavior is understood is preferable to a system that wins a Kaggle competition by cobbling together a gaggle of algorithms into an ensemble. Generally, these scientists develop new algorithms which may be heuristic but are theoretically motivated. They also care about empirical performance on real-world tasks.

Tools of the Trade: Implementation is a significant part of machine learning work, and machine learning scientists should have strong coding skills in both high- and low-level languages, as well as the ability to rapidly prototype with existing machine learning frameworks like scikit-learn.

Where they Work: While academia is a siren that calls many in machine learning, university jobs are hard to score. Fortunately, a healthy Silicon Valley job market has gobbled up the vast majority of machine learning scientists in recent years. Major employers include Google, Microsoft, Amazon, Facebook, and more. Finance companies also employ a substantial number of these research engineers.

Data Miners

What They Do: Unlike machine learning researchers, who consider many abstract tasks, such as the active learning paradigm, and are often content to show state-of-the-art performance on widely studied datasets, data miners work on two types of problems. The first: "here is a dataset, produce insights". The second: "here is a dataset and a task, win". Generally, these are the folks who win competitions.

Tools of the Trade: These engineers are often strong programmers and combine domain-specific intuition with a knowledge of algorithms to generate valuable insights. They have a strong knowledge of available libraries and implement quickly.

Where they Work: Data miners work at a broader range of companies than pure machine learning workers. They can be found at the traditional Silicon Valley powerhouses, but also in the health space, or mining data for companies that may not be primarily in the business of building high-tech solutions.

Script Kiddies

What They Do: Like their security hacker counterparts, script kiddies are the end users of data science products. They may know roughly what a support vector machine does, but wouldn't code one from scratch.

Tools of the Trade: Azure ML, IBM Watson, KNIME.

Where they Work: Everywhere.

PowerPoint Jockeys

What They Do: As with all trends, everyone wants in. Just as everyone wants to work at a startup, traditional business analysts want technical-sounding job titles. These individuals may have no coding skills or mathematical background, but why should qualifications stand in the way of ambition?

Tools of the Trade: PowerPoint, Excel, laser pointers, buzzwords.

Where they Work: Everywhere, but they are most celebrated at management consultancies.

Zachary Chase Lipton is a PhD student in the Computer Science and Engineering department at the University of California, San Diego. Funded by the Division of Biomedical Informatics, he is interested in both theoretical foundations and applications of machine learning. In addition to his work at UCSD, he has interned at Microsoft Research Labs. He will be working for Amazon this summer as a machine learning scientist.