Dear CIO, what you have is NOT a Data Lake

Data Lakes are often held up as the ideal architecture for a company's Big Data, but in reality the data frequently ends up split into data puddles. Xurmo seeks to eliminate this by integrating Data Virtualization into the Data Lake.

By Sridhar Krishnan, Xurmo Technologies, July 2014.

The Data Lake is supposed to be the architecture of choice for Big Data. Ideally, the Data Lake has the following features:

  • It is the single, silo-less repository of all types of data
  • It is the go-to place for seamless analysis across all data

In short, the Data Lake is supposed to be the grown-up version of traditional DW - minus its intransigence plus a plethora of added benefits. It should eliminate the need for upfront schema design (a major component of traditional DW costs) and yet, give a single, comprehensive view of all data for analysis (thus reducing the cost and effort in analysis). CIOs would have reason to rejoice with the Data Lake, because it would be the most cost-effective way to bring all their data to play.


Fig 1: The Ideal Data Lake

Unfortunately, this scenario is far removed from reality. Enterprises with Big Data ambitions are faced with a profusion of specialized storage/processing options catering to specific data types and use cases, perpetuating the very siloed-data problem that the Data Lake is supposed to solve. CIOs who want to bring different types of data to analysis end up procuring different sets of tools and hiring (or training) different types of expertise to work on them. The Data Lake is in fact reduced to a bunch of Data Puddles that analysts have to jump across – at great cost to the enterprise in terms of tools, talent and time.

For example, one study projects an additional cost of more than $500Mn over 5 years to analyse 500TB of data spread across a bunch of data puddles in Hadoop, compared with a monolithic relational DB.


Fig 2: Reality: The bunch of Data Puddles

Enter Data Virtualization (DV) - a great data abstraction concept that gives a single-format view to different types of data and is intended to reduce the cost of leveraging multiple data sources and formats. It works by connecting to various sources of data and exposing them in a virtual, tabular view in near real-time. Analysts can define specific operations on this virtual view quickly without upfront modeling or moving data to the processing layer. It should be the final piece that brings all the puddles together and makes a lake.
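In outline, the DV pattern looks something like the sketch below: several sources with different native shapes, one virtual tabular view, and rows pulled lazily only at query time. This is an illustrative toy in Python, not the API of any particular DV product; all names (`VirtualView`, `register`, `query`) are invented for the example.

```python
# Minimal sketch of the Data Virtualization idea: heterogeneous sources
# behind one virtual tabular view, with no upfront data movement.
# All class and method names here are illustrative assumptions.

class VirtualView:
    """Unifies heterogeneous sources behind one tabular interface."""

    def __init__(self):
        self._sources = {}  # name -> zero-arg callable yielding row dicts

    def register(self, name, fetch_rows):
        self._sources[name] = fetch_rows

    def query(self, predicate=lambda row: True):
        # Rows are pulled lazily at query time (the "near real-time" view);
        # nothing is copied into a warehouse beforehand.
        for name, fetch in self._sources.items():
            for row in fetch():
                if predicate(row):
                    yield {"_source": name, **row}

# Two toy "sources" with different native shapes, normalized to dicts.
def crm_rows():
    for cid, city in [(1, "London"), (2, "New York")]:
        yield {"customer_id": cid, "city": city}

def log_rows():
    for line in ["3,Paris"]:
        cid, city = line.split(",")
        yield {"customer_id": int(cid), "city": city}

view = VirtualView()
view.register("crm", crm_rows)
view.register("logs", log_rows)
rows = list(view.query(lambda r: r["city"] != "Paris"))
```

Note that the analyst defines the predicate against the unified view without knowing, or caring, which physical source each row came from.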

Except that it isn’t.

For one, DV works on small amounts of data brought into a cache and hence cannot handle large data sets. It is good for near real-time transformations but cannot work on historical data, on complex hierarchical or multi-structured data, or in cases where changes in the data need to be carried through to analysis. To me, DV is a quick-fix solution to a small set of Big Data problems – a mousetrap in the elephant cage that ignores the elephant.

The concept, however, is sound.

What if we could bring the concept of Data Virtualization right down into the Data Lake, making it an integral part rather than a superficial appendage? What if all data (not just a cache) were stored in a single format, but without the kind of upfront schema modeling that traditional DWs are infamous for?

At Xurmo, we have developed a NoSQL store on Hadoop that accomplishes the following automatically:

  • Captures the natural schema of input data;
  • Materializes data in a single analysable format; and
  • Determines the syntactic relationships* between data elements.

* Syntactic relationships are those pertaining to the structure and relative position of data. For example, the syntactic relationship between New York and London is that they are both found under the category of City.


Fig 3: The Xurmo Data Lake
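The three steps listed above can be sketched in a few lines. This is a toy illustration of the concept only – capture the natural schema as data arrives, flatten everything into one analysable format, and record which values co-occur under the same field (the syntactic relationship from the footnote) – and not Xurmo's actual store; the function and variable names are assumptions made for the example.

```python
# Toy illustration of automatic schema capture, single-format
# materialization and syntactic-relationship discovery.
# Not Xurmo's implementation; names are illustrative.

def ingest(records):
    schema = {}        # field -> set of observed value types
    materialized = []  # every record flattened to (field, value) pairs
    by_field = {}      # field -> values seen under it; co-occurrence
                       # under one field is a "syntactic relationship"
    for rec in records:
        for field, value in rec.items():
            schema.setdefault(field, set()).add(type(value).__name__)
            materialized.append((field, value))
            by_field.setdefault(field, []).append(value)
    return schema, materialized, by_field

# Structured and semi-structured inputs land in the same single format;
# a record with a missing field needs no upfront model change.
schema, rows, rel = ingest([
    {"city": "London", "pop": 8900000},
    {"city": "New York"},
])
# rel["city"] now links "London" and "New York" syntactically:
# both occur under the category "city".
```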

This means that Xurmo can store any kind of input data (structured, unstructured, static, streaming and across formats) in a single materialized form without schema design. Since the natural schema of the data is stored, it is possible to handle multi-structured, complex data as well as changes in it. Semantic models can be quickly layered on the auto-identified syntactic relationships. (For example, a semantic model explicitly defines that ‘London is a City’, so that queries like ‘Show me all Cities’ are easy.)
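To make the layering concrete, here is a minimal sketch, under the assumption that the store has already grouped values syntactically by field: the semantic model does nothing more than give a business name to an auto-identified syntactic category. The data and names are invented for the example.

```python
# Sketch of layering a semantic model over syntactic relationships.
# The store has already grouped "London" and "New York" under the
# field "city"; the semantic layer just names that group.

syntactic = {"city": ["London", "New York"], "cust_id": [1, 2]}

# Semantic model: declare that the syntactic category "city" means City.
semantics = {"City": "city"}

def show_all(concept):
    """Answer queries like 'Show me all Cities'."""
    field = semantics[concept]
    return syntactic[field]

cities = show_all("City")
```

Because the syntactic grouping was discovered automatically at ingest, the analyst's only job is the one-line semantic declaration.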

Xurmo has a single abstraction layer to search, query or perform operations on this data, which means that an analyst can create analytics workflows seamlessly. Data processing is not divorced from storage and is not limited to a sample set.

The abstraction layer solves another problem too. It enables new operations to be created in any JVM language and ingested into the workflow without MapReduce expertise. Third-party analytics tools can be easily integrated with native libraries.
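The pluggable-operation idea can be sketched as follows: the analyst writes an operation against a tiny interface, and the platform – not the analyst – decides how it is executed and parallelized. Xurmo targets JVM languages for this; Python is used below only to keep the illustration short, and the registry and function names are assumptions.

```python
# Sketch of pluggable operations: user code is registered under a
# workflow name; the MapReduce machinery stays hidden behind
# run_workflow. Illustrative only; Xurmo's real API is JVM-based.

OPERATIONS = {}

def operation(name):
    """Decorator that registers a user-defined operation."""
    def wrap(fn):
        OPERATIONS[name] = fn
        return fn
    return wrap

@operation("count_by_city")
def count_by_city(rows):
    # Plain single-node logic; no MapReduce expertise required.
    counts = {}
    for row in rows:
        counts[row["city"]] = counts.get(row["city"], 0) + 1
    return counts

def run_workflow(name, rows):
    # In a real system this dispatch point is where the platform
    # would distribute the work across the cluster.
    return OPERATIONS[name](rows)

result = run_workflow("count_by_city",
                      [{"city": "London"}, {"city": "London"},
                       {"city": "New York"}])
```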

To sum up, Xurmo realizes the vision of an Enterprise Data Lake by creating a single repository of all data and enabling seamless processing across it. The benefits of the Xurmo Data Lake are immediate and obvious: it is simpler, quicker and cheaper than deploying a portfolio of storage and processing tools and finding a way to make them all work together.

Sridhar Krishnan is the founder and CEO of Xurmo Technologies, a middleware company that simplifies Big Data and Analytics workflows by providing seamless integration between data storage and applications. Xurmo holds a number of patents in the area of unified analytics and self-learning. For more information about the post or Xurmo, email Sridhar at