KDnuggets Home » News » 2014 » Jun » Opinions, Interviews » Data Lakes vs Data Warehouses ( 14:n14 )

Data Lakes vs Data Warehouses


Data Warehouses, traditionally popular for business intelligence tasks, are being replaced by less-structured Data Lakes which allow more flexibility.



By Sundeep Sanghavi (Co-founder & CEO, DataRPM), June 2014.

There is a phenomenal shift that is happening now in the enterprise data world with data warehouses, which have so far been the foundation for business intelligence and data discovery for several decades, getting obsoleted by the emergence of data lakes.

The limitation of data warehouses is that they store data from various sources in some specific static structures and categories that dictate the kind of analysis that is possible on that data, at the very point of entry. While this was sufficient during the early stages of evolution of business intelligence where analysis was primarily done on proprietary databases and the scope was restricted to the canned reports, dashboards with limited and pre-defined interaction paths.

This approach has started to fall apart in the world of big data discovery where it is very difficult to ascertain upfront all the intelligence and insights one would be able to derive from the variety of different sources, including proprietary databases, files, 3rd party tools to social media and web, that keep cropping up on a regular basis. While one may have some initial list of questions that they want answers to during the setup phase but the real questions only emerge when one starts analyzing the data. Ability to navigate from a starting question or data point to different directions, slicing and dicing the data in any ad-hoc way that the train-of-thought of analysis demands is essential for real data discovery.

Example someone may start with a question like “What was the total revenue last year from North America” on a Transaction Database and then may want to slice the result by “North American states” and further by the “Demographics of the buyer” from the CRM Database and then proceed to correlate with the “Ads Campaigns”, from the Ad Platforms, to analyze the effectiveness of marketing spends and then navigate from there to evaluate the impact of efficiency and timelines of their “Delivery Logistics” on repeat sales by using the GPS data of vehicles. All of these analyses are happening on ad-hoc basis with new data sources added on the fly as per what the user thought process requires for decision-making.

DataRPM Dashboard

Therefore the traditional approach of manually curated data warehouses, which provide limited window view of data and are designed to answer only specific questions identified at the design time, doesn’t make sense any more for data discovery in today’s big data world.

This is where data lakes excel and why the world is now shifting away from data warehouses to data lakes. A data lake is a hub or repository of all data that any organization has access to, where the data is ingested and stored in as close to the raw form as possible without enforcing any restrictive schema. This provides an unlimited window of view of data for anyone to run ad-hoc queries and perform cross-source navigation and analysis on the fly. Successful data lake implementations respond to queries in real-time and provide users an easy and uniform access interface to the disparate sources of data.

DataRPM offers an integrated data lake and data discovery platform for the modern business. We are to the world of big data discovery for enterprises what Google is to the world of information discovery for the web. We have pioneered a Smart Machine that delivers the following:

  1. Automatically creates data lakes using proprietary algorithms and machine learning techniques which identify and index entities and relations from across disparate data sources like Hadoop, relational databases, CSV files, 3rd party systems like Salesforce and many other sources of data.
  2. A natural language question answering interface to analyze & visualize data from these data lakes.
  3. A distributed computation search engine that uses an optimized Just-In-Time (JIT) memory approach to provide real-time computation of big data stored in the data lakes and scale-out seamlessly in cheap commodity hardware.
  4. Enhanced data security in the data lakes stored in distributed binary index files, which are only machine-readable.



With DataRPM the machines do all the heavy lifting that the implementation of data lakes and big data discovery platform entail. All that a business needs to is to specify the connection parameters to the different data sources and then start asking questions in natural language to get answers that they can interact with and collaborate in-place with stake holders, anytime and anywhere. DataRPM empowers any user to become a data scientist and leverage the true power of data intelligence in the fastest, easiest, affordable, scalable and most natural way, thereby delivering data democracy in any organization.

Related Posts: