Big Data does not mean we have ALL the data

Does Big Data imply that "You have collected all there is - all the data there is about a phenomenon"? I strongly disagree with this quote from Viktor Mayer-Schonberger and Kenneth Cukier's book on Big Data. Here is my letter to the editor.



By Gregory Piatetsky, Oct 7, 2013.

Here is my letter to Edd Dumbill, Editor-in-Chief of Big Data journal.
(Note: I am also on the editorial board of this journal).

Edd,

I enjoyed your interview (online.liebertpub.com/doi/full/10.1089/big.2013.0016) with Viktor Mayer-Schonberger and Kenneth Cukier, the authors of the recently published book Big Data: A Revolution That Will Transform How We Live, Work, and Think.

While I am also optimistic about the prospects of Big Data, I was struck by this part of Kenneth Cukier's answer:

You have collected all there is - all the data there is about a phenomenon.

This is part of the following paragraph (emphasis mine):

The point is not that the data is necessarily big, even though there are now some gargantuan data sets that did not exist in the past. Instead, it is big in a relative sense, not in an absolute sense - it is often big in relation to the phenomenon that we are trying to record and understand. So, if we are only looking at 64,000 data points, but that represents the totality or the universe of observations, where before we might have used a sampling technique, now we do not have to sample; we can use all the observations. That is what qualifies as big data.

You do not have to have a hypothesis in advance before you collect your data. You have collected all there is - all the data there is about a phenomenon.

I want to strongly disagree with this idea. It may be possible to collect all the data in an artificial domain such as checkers or chess, but in almost every real domain we never have ALL the data.

Even if we collect all the data about a shopper in the supermarket or online, we usually don't have data about their behavior outside the supermarket or offline. We don't know their thoughts, their emotions, what mood they are in, etc. Even if we knew everything about a person, we could not perfectly predict what they will do, since human behavior has a strong random component.

A good example of that is the Netflix Prize, which, despite a million-dollar prize and several years of effort by tens of thousands of data scientists, succeeded only in reducing the error in predicted movie ratings from 0.95 stars to 0.86 stars, on a scale of 1 to 5 stars. The best prediction is still about one star off.

Having all the sensor data about a machine may create an illusion that all relevant information has been collected, but each sensor has certain limits and may fail, and some relevant aspects may not be measured at all.

Thinking that you have ALL the information may bring a false sense of confidence and unrealistically high expectations.

While Big Data increases the amount of information collected about a phenomenon, it rarely enables perfect prediction and modeling of that phenomenon. Like a function that goes to (but never reaches) infinity, Big Data gets closer to, but never reaches, the limit of ALL information.