Vasant Dhar on “Data Science and Prediction”

What do "Data Science" and #BigData mean? Is there something unique about them? What skills do "data scientists" need to be productive in a world deluged by data? What are the implications for scientific inquiry?

By Gregory Piatetsky, Dec 21, 2013.

Vasant Dhar was a data scientist before the term "Data Scientist" appeared. He is the Head of the Information Systems Group and Director of the Center for Business Analytics at the Stern School of Business at NYU, and has been on the NYU Stern faculty since 1983.

He published an excellent article in Communications of the ACM (Dec 2013), in which he examines Data Science and Prediction.

An earlier version of "Data Science and Prediction - What does Data Science Mean" appeared in KDnuggets in 2012.

Data Science and Prediction - Key Insights

Here is an excerpt from the updated version:

The term "science" implies knowledge gained through systematic study. In one definition, it is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions. Data science might therefore imply a focus involving data and, by extension, statistics, or the systematic study of the organization, properties, and analysis of data and its role in inference, including our confidence in the inference. Why then do we need a new term like data science when we have had statistics for centuries? The fact that we now have huge amounts of data should not in and of itself justify the need for a new term.

The short answer is data science is different from statistics and other existing disciplines in several important ways. To start, the raw material, the "data" part of data science, is increasingly heterogeneous and unstructured (text, images, video), often emanating from networks with complex relationships between their entities.

Figure 1 outlines the relative expected volumes of unstructured and structured data from 2008 to 2015 worldwide, projecting a difference of almost 200 petabytes (PB) in 2015 compared to a difference of 50PB in 2012. Analysis, including the combination of the two types of data, requires integration, interpretation, and sense making that is increasingly derived through tools from computer science, linguistics, econometrics, sociology, and other disciplines.

The proliferation of markup languages and tags is designed to let computers interpret data automatically, making them active agents in the process of decision making. Unlike early markup languages (such as HTML) that emphasized the display of information for human consumption, most data generated by humans and computers today is for consumption by computers; that is, computers increasingly do background work for each other and make decisions automatically. This scalability in decision making has become possible because of big data that serves as the raw material for the creation of new knowledge; Watson, IBM's "Jeopardy!" champion, is a prime illustration of an emerging machine intelligence fueled by data and state-of-the-art analytics.

From an engineering perspective, scale matters in that it renders the traditional database models somewhat inadequate for knowledge discovery. Traditional database methods are not suited for knowledge discovery because they are optimized for fast access and summarization of data, given what the user wants to ask, or a query, not discovery of patterns in massive swaths of data when users lack a well-formulated query. Unlike database querying, which asks "What data satisfies this pattern (query)?" discovery asks "What patterns satisfy this data?" Specifically, our concern is finding interesting and robust patterns that satisfy the data, where "interesting" is usually something unexpected and actionable and "robust" is a pattern expected to occur in the future.
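The contrast between querying and discovery in the excerpt can be made concrete with a small Python sketch. It is only an illustration, not a method from the article: the toy purchase data, the function names `query` and `discover_pairs`, and the `min_count` threshold are all assumptions, and simple co-occurrence counting stands in for a real pattern-mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each record is the set of items in one purchase.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"beer", "chips"},
    {"bread", "butter", "chips"},
    {"beer", "chips", "milk"},
]

def query(records, pattern):
    """Query: the pattern is known up front; return the records that satisfy it."""
    return [r for r in records if pattern <= r]

def discover_pairs(records, min_count):
    """Discovery: no pattern is given; return every item pair that
    co-occurs in at least min_count records, with its count."""
    counts = Counter()
    for r in records:
        # Sort so each pair has one canonical (alphabetical) ordering.
        for pair in combinations(sorted(r), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_count}

# "What data satisfies this pattern?"
matches = query(transactions, {"bread", "butter"})

# "What patterns satisfy this data?"
patterns = discover_pairs(transactions, min_count=2)
```

In this sketch the query needs the user to already suspect that bread and butter go together, while the discovery step surfaces that pair, and the beer and chips pair, without being asked. Frequency alone only approximates "interesting"; checking that a pattern is "robust" in the excerpt's sense would additionally require testing it on data not used to find it.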

Organizations and managers face significant challenges in adapting to the new world of data. It is suddenly possible to test many of their established intuitions, experiment cheaply and accurately, and base decisions on data. This opportunity requires a fundamental shift in organizational culture, one seen in organizations that have embraced the emerging world of data for decision making.

Here is the full article (ACM subscription may be required)