KDnuggets Home » News » 2016 » Nov » Opinions, Interviews » Predictive Science vs Data Science ( 16:n42 )

Predictive Science vs Data Science

Is Predictive Science accurately represented by the term Data Science? As a matter of fact, are any of Data Science's constituent sciences well-represented by the umbrella term? This post discusses a few of these points at a high level.

We can talk about nutrition science, which focuses on what we eat. We can talk about exercise science, focusing on what we do with those calories. Or we can talk holistically about the outcome -- which I would argue deals with an issue of greater importance about which most people ultimately care -- health science. Given the argument that the outcome, health science, is generally more interesting than the raw material, nutrition, or the process for turning raw materials into something more useful, exercise, why do we tend to speak about data science as opposed to predictive science?

To be clear, data science is much more than prediction or classification. It includes other machine learning techniques, such as clustering and frequent itemset mining. It also includes data visualization and data storytelling. It can also encompass the various aspects of traditional data mining frameworks, like The KDD Process, including data selection, preprocessing, and transformation. Data science can also include other algorithms and approaches to data-related tasks beyond what I have mentioned here.

I have previously and holistically defined data science as follows:

Data science is a multifaceted discipline, which encompasses machine learning and other analytic processes, statistics and related branches of mathematics, increasingly borrows from high performance scientific computing, all in order to ultimately extract insight from data and use this new-found information to tell stories.

When considering "predictive science" vs. data science, it is only the narrow, prediction-related slice of data science that I am measuring it against. In fact, disassembling data science into constituent "sciences" (clustering science, for example) would certainly help express what exactly it is we do, at the obvious expense of a sexy umbrella buzzword.

But taking a step back, it is inarguable that data is input, a raw material. In this sense, data science places the emphasis on the "what" in predictive processes. While data is a prime ingredient in the predictive puzzle, and possibly the most difficult to procure or otherwise come across, "data science" seems to neglect the other major component, the algorithms, as well as the interesting insights that come out the other end.

Algorithms are transformative processes. So what about algorithmic science? This focuses on the tools, the "how," and is firmly rooted in computer science. Again, this falls short of accurately describing the holistic predictive process; data is abandoned in favor of the processes which transform it into prediction. Any successful description would likely focus on the end result.

The outcome of the holistic predictive process is the prediction. Or is it the hypothesis? I don't mean this in a general "hypothesis vs. prediction" sort of way, but in an "is the prediction or the hypothesis the more valuable output from a particular classifier/model?" sort of way.
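
To make the distinction concrete, here is a minimal sketch in plain Python. The toy data (hours studied vs. exam score) and the names `fit_line` and `hypothesis` are invented purely for illustration, and the simple least-squares line stands in for whatever classifier or model you might actually use. The point is only that the fitted function is one output (the hypothesis), while its value on a new input is another (the prediction).

```python
# Hypothetical toy data, invented for illustration: hours studied -> exam score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 56.0, 58.0, 60.0]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# The "hypothesis": the learned function itself, summarized by its parameters.
slope, intercept = fit_line(hours, scores)


def hypothesis(x):
    """Apply the learned line to an input."""
    return slope * x + intercept


# A "prediction": the hypothesis applied to one previously unseen input.
prediction = hypothesis(6.0)

print(f"hypothesis: score = {slope:.1f} * hours + {intercept:.1f}")
print(f"prediction for 6 hours: {prediction:.1f}")
```

The prediction answers a single question about a single input. The hypothesis is a reusable claim about how the inputs relate to the output, which is why it is fair to ask which of the two is the more valuable product.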

Whether prediction or hypothesis, one of these two will be the most interesting piece of the holistic predictive science puzzle. Predictive science - or prediction science, if that strikes you better - sounds pretty good. But really, isn't that just "science?" That seems very non-specific.

What about statistics? Are we applied statisticians? Sourced from Wikipedia:

"Applied statistics" comprises descriptive statistics and the application of inferential statistics.

Add in prescriptive statistics, and this seems like a step in the right direction. However, the emphasis in this case is on the application of statistical processes at the expense of... well, not much, really. Yet I would argue that this does not place the proper emphasis on inferential and prescriptive statistics, and perhaps implies too much reliance on descriptive statistics, and thus also falls short in describing the science of prediction.

Predictive analytics? Maybe the closest fit, but this term seems closer to the business world at this point than to the world of science. I don't see this term brought up in research at all, and it generally seems to be the sole domain of big business. And that's fine for what it is, but what it is does not seem to put science at its forefront (though, clearly, science underlies its usage).

I don't know that there is a solution. To be fair, I don't even know that this is a problem that exists outside of my own head. But I think everything above boils down to the following, and can be generalized beyond the prediction aspect of data science: Does the term data science actually represent anything of value to us, the data scientists, or to everyone else?

I don't purport to have a recommendation, and even if I did I'm sure it would be passed over. Which is fine. But as someone who is not terribly excited by, or comfortable with, the term "data science," I think it's worth being introspective about what it is we do, and how we categorize those tasks. Sure, there is a convenience in being able to put a name to a broad profession of somewhat related tasks, but do we lose sight of the trees for this forest?

And when it comes to the very complex science of prediction, data may be the new oil, and algorithms the special sauce, but their paired predictive power is where the actual money is, both figuratively and literally.