Follow Gregory Piatetsky, No. 1 on LinkedIn Top Voices in Data Science & Analytics

KDnuggets Home » News » 2017 » Oct » Opinions, Interviews » Actionable Insights: Obliterating BI, Data Warehousing as We Know It ( 17:n42 )

Actionable Insights: Obliterating BI, Data Warehousing as We Know It


There is a big demand of quick insights or real time analytics from business side. But traditional BI or data warehouse architectures lack this realtime functionality. Here we discuss realtime analytics architecture in details.



By Yaron Haviv, Founder and CTO,iguazio

I spent much of my time recently in conferences talking to customers and analysts and realized they all were saying many of the same things about the challenges of productized, modern analytics solutions.

Digital businesses, mobility, and IoT depend on real-time actionable insights and machine learning. At the same time traditional big data, data warehousing (DW) and business intelligence (BI) solutions have mostly worked on batch and interactive data queries. Streaming solutions and machine learning logic have been added on top of legacy architecture (and are not well integrated), leading to complexity and sub-optimal performance.

The time is ripe for re-architecting analytics to maximize the value of machine learning and real-time streaming, drive actionable insights, and enable continuous operations.

Current approaches fail to deliver actionable  insights

Many data scientists and engineers start by collecting data from various sources, query it to aggregate and denormalize data (group by, joins, etc.), and try to find common patterns or anomalies using various visualization tools. One you have denormalized and aggregate result set you can also use machine learning to identify patterns for you (assuming you selected the right features). This is why most big data solutions are designed for the data science exploration or reporting phase. They mostly expose SQL semantics (Hive, Spark, Impala, RedShift, and others) and their key focus is on maximizing SQL query performance and achieving interactive results by compressing data to immutable columns. Many queries still take minutes or hours to run.

The phenomenal growth in machine learning, deep learning and quickly growing demand for real-time actions rather than batch means the data collection and query part take a smaller role in the overall solution, and there needs to be more focus on continuous and event driven processing, anytime an event arrives to the system it need to be evaluated against a fresh data model in real-time and drive an action or serve dashboards. In many production applications data is constantly modified and searched, immutable data warehousing and BI techniques (where data is updated once and queried multiple times) are not practical.

We need solutions with fast update, ingest, enrichment, and indexing support.

The common short-term solution involves creating complex big data “stacks” or “pipelines.”But those are slow and impossible to operate or govern. In many cases, projects are further complicated and slowed down due to different stacks for data research and production environments.

Fig. 1: Common analytics architecture

Prevalent architecture also has data copied to a data lake periodically, using batch ETL tasks.With this process, data is compressed to immutable columnar formats such as Parquet to accelerate query performance. Events and logs are pushed through streaming storage and processed immediately with a copy stored in the data lake – they are loosely coupled with the data lake to avoid a speed mismatch (Lambda or Kappa architectures).

Data in the lake is usually unstructured, forcing long data “cleaning” and wrangling. This all leads to inaccurate results and consumes many computational resources every time data is processed, not to mention that without proper metadata it is impossible to find the relevant datasets or secure the data.

While there’s a great deal of chatter about real-time streaming and machine learning, the dominant method for data exploration is querying data through batch or interactive queries and applying statistical modeling to the results. Stream processing is still mostly used to address faster data ingestion and transformation, or for simple tasks of evaluating events against an (old) data model to drive some immediate actions.

Machine learning usually is applied on old data sets extracted from the data warehouse or in data lakes – it is simply too difficult to learn from fresh data. AI decisions and prediction need to work at the speed of the event, limiting the decision and feature vectors to smaller datasets or short history that fit in-memory databases. This is part of the reason many recommendation engines provide irrelevant results, mistargeted ads or false detections of fraud.

What I noticed is that once data scientists finish the exploration and modeling phase, different teams come in to redesign for production, starting the project from scratch and using new dev tools and languages that address better error handling or higher performance.

With so many of us accustomed to running interactive queries on old data, the key challenge now is in initiating sub-second actions driven from fresh data models.

Designing continuous analytics from scratch

With continuous analytics, actions and insights are delivered from fresh data in real-time, for production use, as opposed to just work on interactive queries for data exploration and reports. This means eliminating complex and slow data wrangling and parsing as soon as possible, so data entering the system is machine optimized, clean and curated. No more schema on read, dubious data and data swamps.

Instead of periodic ETLs from external databases, this method streams data using continuous data capture (CDC) tools. That means no more time gaps and unknowns, because data is synchronized between operational databases and the analytics system in real-time.

Fig. 2: Continuous Analytics Architecture

When all of this happens, the focus can shift from traditional BI and DW queries to parallelism and continuous operations that address both real-time and interactive responses while streamlining operations.

With continuous analytics solution, tasks are broken into independent micro-services or serverless functions which run concurrently, such as:

Enrichment and Denormalization: Data entering the system often requires additional context. The system may accept sensor information keyed by the sensor ID but also wants the manufacturer, model or other related historical information. The same goes for correlating a user profile with his or her ID to serve custom content. Real-time analytics uses in-memory caching and fast random access to a real-time database to address real-time requirements. This is a huge improvement over current enrichment processes and SQL JOIN queries which consume much more time and resources.

Merge and Aggregate: Many analytics processes refer to a historical state, such as how many times a user performed a certain operation, or the average temperature of a sensor, or the number of cars in a specific geo-location. With continuous analytics, some statistics can now get aggregated and stored with the existing data, ready for immediate use or validation against pre-defined thresholds instead of running an expensive “GROUP BY” database query.

Prediction (inferencing): The event data along with previous state, aggregate results and enriched data together form a wide feature vector. This vector is input to an AI prediction logic using algorithms for regression, classification, etc. And the prediction output results can drive an action or are updated in a real-time database which serves dashboards.

Real-Time Event Processing: Once detected, a threshold, anomaly, or an event can trigger micro-services that address it by alerting a user, conducting repair operations, modifying a data model or other action.

Analytics and Machine Learning: Multiple analytics and ML processes such as data exploration, model training and reinforced learning can be fed with fresh data in the form of streams, images or structured tables and generate new AI models frequently or continuously, those new models are immediately shared with the real-time inferencing services. Detected features can update/tag individual data records and images immediately. They can also trigger an event. Real-time analytics tasks consume the enriched and aggregated results directly and don’t need to wait for query processing.

Real-Time Dashboards: With continuous analytics, since data and its derivatives are always up to date dashboards only need to read and present relevant datasets. API services map complex dashboard elements to several real-time data requests and serve them to clients through ready-to-use JSON responses.

When application microservices and analytics engines are stateless and containerized, they allow for greater elasticity, simpler deployment and continuous development and operation. Tools such as Spark handle many of those tasks, but it is also possible to leverage emerging stream processing tools, new ML and AI libraries (such as TensorFlow), or implement event-driven tasks using “serverless” functions such as Amazon Lambda or a real-time open-source serverless platform called nuclio.

A key attribute of the new architecture is that it uses a real-time unified data store to update, manipulate and query a variety of data objects simultaneously. Rather than working with multiple store classes and complicating application logic and operations, a unified store provides tiering across memory, flash and cloud storage, balancing cost vs performance requirements.

A preferred approach is to use structured/semi-structured records or streaming where possible and avoid unstructured data. This approach eliminates errors, provides better performance and enables more granular security. Using transactional or atomic data update semantics allows for a consistent state across failures and software version upgrades.

Using modern orchestration platforms such as Mesos or Kubernetes, coupled with a CI/CD pipeline or an alternative public cloud offering, allows for continuous development and operations, drives agility and minimizes gaps between research, development and production.

Summary

What’s needed are continuous applications which respond in real-time, using modern agile and cloud methodologies. Modern solutions are the ones that move away from the notion of long-running batch and interactive queries. They’re also restructuring the organizational separation between research, dev and ops.

Bio: Yaron Haviv is a serial entrepreneur with deep technological knowledge in the fields of big data, cloud, storage, networking and high-performance.

Related:


Sign Up