The Next Big Inflection in Big Data: Automated Insights





To keep up with big data and improve our use of information, we need insightful applications that will quickly and inexpensively extract correlations while associating insights with actions.

By Evangelos Simoudis, @esimoudis, Corporate Innovation Ventures.

In previous posts, I wrote about the need for insight generation and provided an example of an insightful application. I maintain that insightful applications are the key to businesses effectively exploiting big data in order to improve decision-making and address important problems. To better understand and appreciate the need for developing such applications, it is important to consider what is happening more broadly in big data and evaluate how our experiences with business intelligence systems should be driving our thinking about insightful applications.

Because I consider insightful applications the next inflection in big data (see recent examples of such applications built using IBM’s Watson platform), I would like to explore this topic further in a series of blog posts. In this first post, I will provide my observations on how data analysis has evolved over the past 25 years, particularly as we moved to big data, and on why that evolution necessitates the development of insightful applications. In the second post, I will describe such applications in more detail and provide early examples. In the third and final post, I will discuss investor interest in insightful applications and describe my recent investments in startups in this space. In these posts, I will draw upon my 30 years of experience as an entrepreneur and founder of two analytic applications startups, and also as a venture capitalist who has been investing in this area for the past 15 years.

Data analytics over the past 25 years

As the volume of data has grown over the past 25 years, data comprehension for decision-making has consisted of the same two steps: creating the data warehouse and understanding the contents within the data warehouse.

The data warehouse and all its incarnations (enterprise data warehouse, data mart, and so on) is essentially an infrastructure of curated data. This data may come from a single data source (e.g., the database of a CRM application) or from the integration of several data sources (e.g., combining the database of a CRM application with a database containing the social media interactions of each customer in the CRM database). This data may be structured (e.g., currency data describing the amount paid by each customer), unstructured (e.g., free-text notes about each interaction between a customer and a service employee), or semi-structured (e.g., log data generated by a network router). Curated data is data that, once captured, is cleaned, tagged, and profiled both automatically and, more often than people would like to think, manually.
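To make the idea of curation concrete, here is a minimal Python sketch of the CRM-plus-social-media example described above. Everything in it is an illustrative assumption rather than a description of any real warehouse product: the field names, the sample records, and the `curate` function are all hypothetical, and real curation pipelines involve far more cleaning and profiling than this.

```python
from collections import defaultdict

# Hypothetical structured CRM rows and unstructured social media posts.
crm_rows = [
    {"customer_id": 1, "name": " Alice ", "amount_paid": "120.50"},
    {"customer_id": 2, "name": "Bob", "amount_paid": "80"},
]
social_posts = [
    {"customer_id": 1, "text": "Love the new release!"},
    {"customer_id": 1, "text": "Support was slow today."},
]


def curate(crm_rows, social_posts):
    """Clean each CRM row, then attach that customer's social posts."""
    # Group the free-text posts by the customer they belong to.
    posts_by_customer = defaultdict(list)
    for post in social_posts:
        posts_by_customer[post["customer_id"]].append(post["text"].strip())

    curated = []
    for row in crm_rows:
        customer_id = row["customer_id"]
        curated.append({
            "customer_id": customer_id,
            "name": row["name"].strip(),               # clean stray whitespace
            "amount_paid": float(row["amount_paid"]),  # normalize to a number
            "posts": posts_by_customer.get(customer_id, []),
            # A simple tag recording whether social data was found.
            "has_social_data": customer_id in posts_by_customer,
        })
    return curated


warehouse = curate(crm_rows, social_posts)
```

Even in this toy version, the cleaning rules (trimming names, converting amounts) are decisions someone has to make by hand, which hints at why so much curation remains manual.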

Over the years, we have reduced data warehousing costs through the growing use of open source software, cloud computing, and commodity hardware, while improving our ability to manage more data of greater variety that is created at a higher velocity. We’ve moved from data warehouses that cost tens of millions of dollars and were affordable only to the largest corporations, such as financial services institutions like Citibank and retailers like Walmart, to warehouses that small and mid-sized corporations can afford. More recently, low-cost offerings such as Amazon Redshift, Google BigQuery, and even Microsoft Azure have moved data warehousing to the cloud. At last, data warehousing is accessible to the corporate masses.

With the rise of data warehouses, delivery of data analysis reports has shifted from print to digital.

The second step in data comprehension involved understanding the data warehouse’s contents through data analysis. In business settings, this was often done through reports and associated visualizations, while occasionally using more bespoke visualizations and machine learning algorithms such as neural networks. (Machine learning is not new, as some believe, but rather has been used almost since the time data warehouses appeared as data storage and management tools.)

As data warehouses were adopted by a broader set of corporations from a variety of industries, we saw a shift in the form of the reports that could be created, in the medium through which reports were presented to analysts and decision-makers, and in the personnel who prepared these reports. In the early days (late 1980s and early 1990s), business intelligence reports were created by specialized IT personnel, who also formulated the queries these reports required and issued them to the data warehouse. These reports were canned (i.e., they could be modified, but only with great difficulty and only by the same specialized IT personnel who created them) and presented on computer paper. Later on, while still canned, these reports were presented on PCs through specialized reporting programs, and later still in Web browsers running on a variety of devices, including (most recently) smartphones and tablets. Over the years, the task of query creation and report writing migrated away from IT personnel to business users. However, while queries and associated reports were becoming faster, more flexible, and more widely used, the primary users of these reports, business analysts, continued to struggle to discern even the simplest patterns in the breadth of information such reports contained. Most importantly, these users struggled to determine what actions to take based on that information (see examples in Figure 1).

Figure 1. Some common examples of intricate data patterns and visualizations. Image courtesy of Evangelos Simoudis.

As more data has been generated, we have become better at managing it cost-effectively, but we still struggle to analyze it efficiently.

Driven by the broadening global use of the Internet, the connectivity the Internet affords, new areas like the Internet of Things that yield data in volumes we’ve never seen before, and the applications that are being created to capitalize on this use and connectivity, we find ourselves awash with data. Fast data and slow data, simple data and complex data, and all of it in unprecedented volumes. How much bigger has the data become? We have grown from generating approximately 5 zettabytes of unstructured data in 2014 to a projected 40 zettabytes or so in 2020 (see Figure 2).

Figure 2. Graphic showing real and predicted growth of unstructured data generated between 2005 and 2020. Image courtesy of IDC, used with permission.
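To put the jump from roughly 5 to 40 zettabytes in perspective, the short Python sketch below converts it into an overall multiple and an implied yearly growth rate. It uses only the two approximate figures quoted in the text, and it assumes smooth year-over-year compounding between 2014 and 2020, which is a simplification on my part and not something the IDC projection states.

```python
def compound_annual_growth_rate(start_value, end_value, years):
    """Return the constant yearly growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1 / years) - 1


# Approximate figures quoted in the text, in zettabytes.
START_ZB = 5.0   # unstructured data generated in 2014
END_ZB = 40.0    # projected unstructured data generated in 2020
YEARS = 2020 - 2014

growth_multiple = END_ZB / START_ZB
cagr = compound_annual_growth_rate(START_ZB, END_ZB, YEARS)

print(f"{growth_multiple:.0f}x overall, about {cagr:.1%} per year")
# prints "8x overall, about 41.4% per year"
```

Under that assumption, an eightfold increase over six years means the yearly volume grows by roughly two-fifths each year, or doubles about every two years.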

During the last 10 years in particular, while the data became bigger, the core of corporate IT strategy became “do more with less.” Corporations started to face two problems with their data warehousing systems. First, some of these systems could not effectively manage the big data that was being captured, so applications could not use it effectively. Second, costs were becoming prohibitively high, even for the systems that could rise to the data management challenge.

Around this time, a partial solution started to emerge when a new generation of data management software, such as Hadoop, was developed by heavyweight tech companies such as Google, Yahoo, and others. From the beginning, this software ran on commodity hardware and was quickly open-sourced, thus enabling corporations to address some of their big data issues at a lower cost. Companies like Cloudera, Hortonworks, and a few others that offer services around open source software have since become important players in the big data infrastructure space. I call the solution “partial” because, while these systems could manage the data, they did not have all the features of the sophisticated, proprietary data warehouse management systems used by corporations. But these new systems were good at building data lakes, which suit the diverse big data environment, and at replacing or augmenting certain types of data warehouses with lower-cost alternatives.

While our ability to manage big data cost-effectively may have been improving, our ability to analyze the data, at any cost, was not. While the popular press declared that insights from data would be the new oil (or gold, pick your metaphor), the market research firm IDC predicted that by 2020, only a fraction of the data collected would be analyzed. We needed to analyze more of the data we captured and extract more of the information it contained.