Data Scientist: Owning Up to the Title

Regardless of how Data Science or Data Scientist is defined, if you are going to use the word "scientist" in your title, you are going to be held accountable for it.

By Sean McClure (ThoughtWorks), Dec 2014.

Turning something raw into something industrially valuable has always required two things: science and engineering. The science is our attempt to explain and predict the behavior exhibited by some complex system and to capture those explanations in the form of testable models. The engineering looks to mechanize those modeled concepts into usable tools that make a direct impact on society.

The science piece requires more than simply scaling our existing machines into this new world of rich and varied data. This new world represents entirely novel complex systems that we need to understand. In order to produce value we need models of this complex behavior so we can capture those new-found concepts in our machines.

Enter the Data Scientist: a new kind of scientist charged with understanding these new complex systems being generated at scale and translating that understanding into usable tools.

Unfortunately, along with this increased demand for 'science on data' comes an ambiguity with regard to what it means to be a data scientist. We need to settle the term once and for all and focus on developing the right skills and attracting the right talent. We need a self-consistent 'guiding light' that gets us past the hype and the demand, and distills for us what it really means to be a data scientist. It turns out the answer is right in the title.

What it means to be a Scientist

Regardless of what industries are currently fueling the demand, or what skill sets happen to be sexy today, there is one thing that cannot be argued: if you are going to use the word scientist in your title, you are going to be held accountable for it.

This has been the case throughout all of scientific history. Before our traditional fields of science laid out their foundations in self-consistent theories they were not considered what we now call science. What makes something scientific is attaching the phenomena you are studying to some self-consistent model that is independent of opinions or subjective interests.

In all cases of science, we build testable, mathematically grounded models to explain and predict the behavior of some complex system, and it is this activity that gives us our definition of doing modern science.

What it means to be a Data Scientist

In order to attach the word science to data we must show that data can represent a complex system that exhibits behavior, and that we can explain and predict that behavior using our instruments of choice: namely, computers.

Data Representing a Complex System

We can enter into the debate about what exactly we mean by "complex" but for the sake of any practical argument we can say that any system producing unobvious behavior is complex. In other words, if some phenomenon produces behavior in a way that is not immediately obvious, it requires a simplified approximation, a model, to explain and predict how it achieves that behavior.

The data we collect from sensors, websites, detectors, and any other device are generated by phenomena. We are organisms on a planet interacting in complex ways, producing behavior that is anything but obvious. That 'unobviousness' is why we look to scientists to figure out how the behavior came about, because that discovery is what makes it possible to build new technology that acts on data. Therefore, data does indeed represent a complex system worthy of modeling.

Explaining and Predicting Behavior

Can we really build models from all these data we are collecting? Well, of course. After all, this is no different from any other science. All sciences collect data, whether from butterfly collections, particle accelerators, chemical analyses, MRI scans, or records of disease propagation. Data are simply recorded activity that was generated from some underlying complexity. To build a model means turning those 'recordings' into a consistent collection of testable concepts that explain and predict the activity we are observing.
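As a rough illustration of what "testable" can mean in practice, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the recordings are synthetic, the underlying process is assumed to be linear with noise, and the helper names (fit_line, mean_squared_error) are hypothetical. The model is fit on half of the recordings and then judged on recordings it has never seen, against a naive baseline that always predicts the average.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(predicted, observed):
    """Average squared gap between predictions and observations."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


# Hypothetical "recordings": synthetic data from an assumed linear process plus noise.
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]

# Build the model on one half of the recordings, hold out the other half.
train_x, train_y = xs[::2], ys[::2]
test_x, test_y = xs[1::2], ys[1::2]

slope, intercept = fit_line(train_x, train_y)

# The test: does the model predict unseen recordings better than a naive guess?
model_error = mean_squared_error(
    [slope * x + intercept for x in test_x], test_y
)
baseline_value = sum(train_y) / len(train_y)
baseline_error = mean_squared_error([baseline_value] * len(test_y), test_y)

print(f"model error:    {model_error:.3f}")
print(f"baseline error: {baseline_error:.3f}")
```

If the model did not beat the baseline on the held-out recordings, that would be evidence against it, which is the sense in which it is testable.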

Why Scientists are attracted to Data Science

The majority of individuals who have entered into data science are indeed full-blown scientists looking to apply their skills outside the ivory towers. If data science were not a real science, that would mean a lot of scientists had all of a sudden reconsidered what they find interesting and important.

So what's the problem? Why is it becoming difficult to identify what it means to be a data scientist? One word. Demand.

The Diluting Power of Demand

Science has become the "cool" kid in the real world, and it should not be surprising why. If what we value in the information age is the ability to convert our new "oil" into something different, something intangible, it will require, like every industrial innovation before it, a deep understanding of the mechanisms underlying that oil's behavior.

But with the increase in demand comes an uncertainty as to whether or not science is actually taking place. Are testable models being built using real research on the underlying complexities that an organization is attempting to understand and anticipate? Are the models employed actually mapping an algorithmic approach to the pain points of the organization, or are they simply the ones that came with a scaled recommendation engine?

And so we are in need of that 'guiding light' to move us past the hype of vendors, and the diluting power of demand. We need to educate organizations that are looking for talent on what a data science resume should look like. We need to understand data science for what it is, not what we want it to be.

Owning Up to the Title

As a data scientist your allegiance is to science, not to machine learning, statistics, database technology, or business practices. All of these are critically important, but they are not the 'guiding light' that leads us to objective discovery. We cannot equate data science with a particular discipline or tool, because it is the investigation and the development of models that make us scientists. Tools, practices and languages can be learned, but having a passion and mind for discovery and experimentation is how science has always moved forward.

As we move forward and lay the foundations of data science we must ensure that we own up to the title of scientist. We must frame the building of new products and new technologies as something that requires real research. We must instill in organizations and practitioners alike that the only way to produce value, compete effectively, and invest in long-term solutions is to understand the complexity of the markets in which we compete. The ambiguity of the term data scientist is only superficial, merely a byproduct of hype and demand. If we stick to the science, the ambiguity gives way to a solid approach to understanding behavior and building tomorrow's exciting products.

Sean McClure, PhD, is a data scientist at ThoughtWorks, where he assists organizations looking to compete analytically by applying advanced statistical and mathematical modeling.