3 Stages of Big Data

The confusion around Big Data is partly the result of the term covering different aspects, which have very different meanings and produce very different results. We propose a 3-stage classification.

By Gregory Piatetsky, Dec 8, 2013.

My recent post "Why statistical community is disconnected from Big Data and how to fix it", which presented opinions from the leaders of the American Statistical Association (ASA), has generated a vigorous discussion on LinkedIn and brought into sharp relief the different approaches of the statistical and data science communities. Some in the statistical community argue that Big Data is not always needed:

If you have 500,000 customer responses to a 5 question survey with 4 answer choices each - then you need a data reduction strategy.

This comment is true, but it focuses on only one aspect of Big Data - structured, customer-oriented data. Big Data also includes many other types, including text and unstructured data, social networks, images, video, mobility data, and more.
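To make the survey example concrete: 5 questions with 4 answer choices each allow at most 4^5 = 1024 distinct answer patterns, so 500,000 responses can be reduced to a table of pattern counts. Below is a minimal sketch of one such reduction, assuming uniformly random answers as a stand-in for real survey data. The variable names are illustrative and not taken from the original comment.

```python
import random
from collections import Counter

NUM_QUESTIONS = 5
NUM_CHOICES = 4
NUM_RESPONSES = 500_000

# Hypothetical data: uniformly random answers stand in for real survey responses.
rng = random.Random(0)
responses = [
    tuple(rng.randrange(NUM_CHOICES) for _ in range(NUM_QUESTIONS))
    for _ in range(NUM_RESPONSES)
]

# Data reduction: replace the raw rows with one count per distinct answer pattern.
pattern_counts = Counter(responses)

# Upper bound on the number of distinct patterns: 4^5 = 1024.
max_patterns = NUM_CHOICES ** NUM_QUESTIONS

print(f"{len(responses)} rows reduced to {len(pattern_counts)} patterns "
      f"(at most {max_patterns})")
```

The reduced table loses nothing for this kind of data, because every row is fully described by its answer pattern. That is not the case for text, images, or network data.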

In many new applications - face recognition, speech understanding, recommendations, or fraud detection - bigger data does produce better results - just look at the amazing recent successes of deep learning or read "The Unreasonable Effectiveness of Data", by Google researchers.

To help clarify the different meanings of "Big Data", I propose to consider 3 stages of Big Data:

Big Data 1.0: Transactional

Transactional data analysis has been done since the beginning of databases. The difference is that now transactional data is much bigger and requires a whole set of new technologies. Financial Services, IT, Retail, Telecom and other companies are using these much larger datasets in order to get a competitive advantage.

A recent McKinsey study shows that Big Data leaders in these industries beat the competition, but the results they get - 5% higher productivity, 6% higher profits - while significant over the long term, are hardly revolutionary.

[Figure: Big Data leaders beat the competition]

This is the most common type of Big Data analysis, but it is being augmented by the next stage - networked data.

Big Data 2.0: Networked

Networked data appeared with the beginning of the web, and much of Google's early success was due to its ability to leverage web links with its PageRank algorithm.
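To illustrate how link structure alone can rank pages, here is a minimal sketch of the textbook power-iteration form of PageRank. It is not Google's production system. The four-page web, the function name `pagerank`, and its parameters are assumptions made for illustration, and 0.85 is the commonly cited damping factor.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of outgoing links."""
    pages = list(links)
    n = len(pages)
    # Start from a uniform distribution over all pages.
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump" term).
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its damped rank evenly among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: every other page links to "home", nothing links to "contact".
toy_web = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "contact": ["home"],
}

ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph "home" ends up with the highest score and "contact" with the lowest, purely because of who links to whom. No page content is inspected at all.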

Probably almost all large data sets have some network aspects - people networks, as in the case of Facebook or LinkedIn, networks of documents, such as the Web, biological networks for DNA analysis, etc.

Effective use of network data - leveraging the connections between people, places, documents, businesses, etc. - is what enabled the success of information age companies like Google, Facebook, LinkedIn, Netflix, Twitter, and many others.

Such companies create new platforms, and when successful, create entirely new industries.

The social/network layer can be added to almost any data. Not only does it usually improve the results from transactional data, but it also comes with a whole new set of questions, problems, measures, and a new way of thinking.

Classical statistics are harder to apply to this type of data, because we usually do not have independent, identically distributed variables when analyzing networks.

Big Data 3.0: Intelligent

This would be a combination of data with huge knowledge bases and a very large collection of algorithms, perhaps reaching the level of true Artificial Intelligence (Singularity?).

We can see early examples of what is coming in:

  • Google Now, which knows your activities and brings you personalized information before you ask for it,
  • IBM Watson technology, which defeated human champions in Jeopardy and is now being applied to other industries, such as health care, retail, and call centers,
  • The forthcoming Wolfram Language, described as insanely more ambitious than Google Knowledge Graph, with a goal of using Wolfram Language as the platform and epicenter of an entirely computable world.


Note: The term Analytics 3.0, introduced by Tom Davenport and the International Institute for Analytics, sounds similar, but they define it very differently:

Analytics 3.0 marks the stage of maturity where leading organizations realize measurable business impact from the combination of traditional analytics and big data