Interview: Ted Dunning, MapR on The Real Meaning of Real-Time in Big Data

We discuss major Big Data developments in 2014, real-time processing, interactive queries, streaming systems, batch systems, MapR partnerships, and the challenges of scaling recommendation engines.

Ted Dunning is Chief Applications Architect at MapR Technologies, a committer and PMC member of the Apache Mahout, Apache ZooKeeper, and Apache Drill projects, and a mentor for the Apache Storm, DataFu, Flink, and Optiq projects. Ted was the chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems. He built fraud detection systems for ID Analytics (LifeLock), and he has 24 patents issued to date and a dozen pending. Ted has a PhD in computing science from the University of Sheffield. When he's not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.

Here is my interview with him:

Anmol Rajpurohit: Q1. Big Data witnessed a lot of interest and action in 2014. In terms of technology advancement, what do you believe to be the most significant achievement in 2014?

Ted Dunning: I think that one of the biggest things that has happened in 2014 is that people have started viewing big data technologies, especially those often referred to as the Hadoop eco-system, with a critical and pragmatic eye.

This practical focus is really good for the field since it gets people to focus on what really can be done to add value using big data techniques. There has always been substantial potential value, but until people started taking the field seriously that value was largely unrecognized.

AR: Q2. Recently, a lot of Big Data discussion has been around "real-time". What are your thoughts on the maturity of current tools with regard to the "real-time" requirement? What would be a good benchmark to classify real-time and non-real-time?

TD: I think that there are several important ways to describe different requirements and I really don't like that some people have been trying to co-opt words like real-time to mean other things.

Real-time is about processing incoming data within a time limit and about making guarantees about the response time. The critical criterion is that guarantees are made and met. It doesn't actually matter whether the guarantee is that processing of each record will complete in one microsecond or in 100 seconds. The key is that a guarantee is made.
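This distinction between a hard per-record deadline and merely good average latency can be sketched in a few lines. This is a minimal illustration, not anyone's production code; the 100 ms deadline and the function names are hypothetical.

```python
import time

DEADLINE_S = 0.100  # hypothetical guarantee: every record finishes within 100 ms

def process_with_guarantee(records, handler, deadline_s=DEADLINE_S):
    """Process records one by one, checking each against a hard per-record deadline.

    A real-time system is judged by the violations list being empty --
    not by the average elapsed time being small.
    """
    violations = []
    for i, record in enumerate(records):
        start = time.monotonic()
        handler(record)
        elapsed = time.monotonic() - start
        if elapsed > deadline_s:
            violations.append((i, elapsed))  # the guarantee was missed for this record
    return violations

# A trivially fast handler should produce no violations under the 100 ms guarantee.
assert process_with_guarantee(range(1000), lambda r: None) == []
```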

There are distinct differences in the technologies required as the guaranteed processing time moves from microseconds to milliseconds to seconds and even to minutes. Much of the recent maturation in these systems is in the large-scale systems that process data in the 1 millisecond to 5 second range.

Interactive query systems are not real-time systems. The term interactive is an excellent one for these systems in that they respond to human requests within humanly tolerable limits. How quickly they incorporate incoming data is unspecified, which generally makes interactive systems not real-time.

Streaming systems are like real-time systems without the guarantees. The idea is that incoming data is processed as it arrives, but without strict guarantees. Streaming is an important option to have since streaming systems can fall back to batch processing if they fall behind. Also, with streaming systems, you don't have to over-provision in order to meet stringent guarantees. This allows a flexibility of resource allocation that is often lacking from true real-time systems. Most of the Hadoop eco-system components are actually streaming systems rather than real-time systems, although a few like MapR do stand out from the crowd in that you can build true real-time systems with them.
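The fall-back behavior described above can be sketched as a consumer that processes items one at a time until its backlog grows past a threshold, then hands whole chunks to a batch path. This is a hedged illustration; the threshold value and the `drain` function are invented for this sketch.

```python
from collections import deque

BATCH_THRESHOLD = 100  # hypothetical backlog size that triggers the batch fallback

def drain(queue, handle_one, handle_batch, threshold=BATCH_THRESHOLD):
    """Process items as they arrive, but fall back to batch processing
    when the backlog grows past the threshold. No hard guarantee is made --
    this is what distinguishes streaming from true real-time."""
    modes = []
    while queue:
        if len(queue) > threshold:
            # Fallen behind: hand a whole chunk to the (more efficient) batch path.
            chunk = [queue.popleft() for _ in range(min(threshold, len(queue)))]
            handle_batch(chunk)
            modes.append("batch")
        else:
            handle_one(queue.popleft())
            modes.append("stream")
    return modes
```

With 150 queued items and a threshold of 100, the first 100 go through the batch path in one chunk and the remaining 50 are streamed individually.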

Batch systems collect inputs over a period of time and process them together, often much more efficiently than they could be processed one at a time. Surprisingly, there has been recent progress in batch systems as well as in real-time, interactive, and streaming systems. Spark has brought micro-batching into the mix, even for some streaming applications, and other systems like Tez, Flink, and Drill (all from Apache) have provided real advances in the batch processing models available in the Hadoop eco-system.
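The micro-batching idea mentioned above is simply grouping an unbounded stream into small batches, trading a little latency for the efficiency of batch-style processing. A minimal sketch (the generator name and batch size are illustrative, not taken from any particular system):

```python
def micro_batches(stream, max_size=3):
    """Group an unbounded stream into small fixed-size batches (micro-batching).

    Real systems like Spark also flush on a time interval; this sketch
    flushes on size alone for clarity."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) >= max_size:
            yield batch
            batch = []
    if batch:  # flush the partial tail batch
        yield batch

assert list(micro_batches(range(7), max_size=3)) == [[0, 1, 2], [3, 4, 5], [6]]
```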

Each of these kinds of systems is well defined, and the definitions should not be muddied by careless usage.

AR: Q3. Based on your experience as Application Architect, how do you see the changes in application development priorities over the past few years? What will be the key priorities in next 2-3 years?

TD: I think that streaming systems are going to be hugely important over the next few years. This is a huge change from the ad hoc workflow scheduling that was required with pure batch systems.

AR: Q4. What does your typical day at MapR look like? Which activities interest you the most?

TD: My days are very full and quite varied. I don't know that there is a typical day. Things that I do include:

- Public speaking. This involves lots of travel and lots of preparation of talks. I try to find topics that are interesting to people with varied backgrounds and useful to hard-core implementors as well as others.

- Code and algorithm development. I stay hands-on because otherwise I would lose touch with what is important. My hands-on work includes machine learning, both applied and theoretical, applied math, and systems design and implementation. Some recent projects have included log-synth, for building realistic emulated data, and t-digest, which is a new way to approximate quantiles very efficiently and accurately.
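To make the quantile problem above concrete: the actual t-digest adaptively compresses the stream into centroids so that accuracy is best near the distribution tails. A full t-digest is too long to sketch here, so the following stand-in uses plain reservoir sampling instead, which illustrates approximate streaming quantiles but, unlike t-digest, has uniform (not tail-weighted) accuracy. The class name and parameters are invented for this sketch.

```python
import random

class ReservoirQuantiles:
    """Approximate quantiles over a stream using a fixed-size reservoir sample.

    A much simpler stand-in for t-digest: it bounds memory like t-digest does,
    but its error is uniform across quantiles rather than tighter at the tails.
    """
    def __init__(self, capacity=1000, seed=42):
        self.capacity = capacity
        self.seen = 0
        self.sample = []
        self.rng = random.Random(seed)

    def add(self, x):
        """Standard reservoir sampling: keep each seen value with equal probability."""
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(x)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = x

    def quantile(self, q):
        """Estimate the q-th quantile from the current sample."""
        s = sorted(self.sample)
        return s[min(int(q * len(s)), len(s) - 1)]
```

Feeding 10,000 uniform values through a 1,000-element reservoir gives a median estimate close to the true median, at a small fixed memory cost.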

AR: Q5. What has been your experience of MapR's partnerships with other companies to deliver greater value to customers? Which partnerships have been the most remarkable?

TD: We partner with literally hundreds and hundreds of companies. Some of the large ones, like Cisco, stand out because of how much difference a partner like that can make, but we also have a number of large (and quiet) OEM partners who are doing amazing things with our software as a foundation. Some of our smaller partners, like Skytree, are very exciting because they are pushing the limits of technology.

AR: Q6. What are the most underrated challenges in scaling recommendation engines, while maintaining speed and accuracy?

TD: I think that people under-rate how important it is to deal with abnormal situations well. This can include hardware failure, required maintenance windows for things like OS upgrades, or software abnormalities. If you don't take these things into account at the platform level, they really can't be handled well. This applies to all big data systems, not just recommendation engines.

For recommendation engines specifically, I think that people take the research literature a bit too literally. That literature has unfortunately been heavily distorted by the data that is available to researchers and by the field's original focus on ratings prediction. As such, it is pretty unrealistic and fails to account for important aspects of industrial use of recommenders.

Topics such as scalability, easy deployment, multi-modal inputs, result dithering, anti-flood, and cross-pollination are essentially untouched in the research literature, and yet they have a far larger impact on result quality than ratings prediction. I really do think that indicator-based approaches, such as what we covered in "the pony book", are the way to go for most users of recommendation systems.
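Of the topics listed, result dithering is the easiest to show in code. One common way to implement it is to re-sort a ranked result list on log(rank) plus Gaussian noise, so top items mostly stay on top while deeper items occasionally surface and repeat visitors see fresh recommendations. This is a hedged sketch of that general technique; the function name and the noise scale are illustrative choices.

```python
import math
import random

def dither(ranked_items, epsilon=0.7, rng=None):
    """Reorder a ranked result list by sorting on log(rank) + Gaussian noise.

    Because log(rank) grows slowly, nearby ranks are easily swapped by the
    noise while the overall best results still tend to appear first.
    Larger epsilon means more shuffling; epsilon=0 preserves the input order.
    """
    rng = rng or random.Random(0)
    noisy = sorted(
        enumerate(ranked_items, start=1),  # pair each item with its 1-based rank
        key=lambda pair: math.log(pair[0]) + rng.gauss(0, epsilon),
    )
    return [item for _, item in noisy]
```

With epsilon set to zero the output equals the input ranking; with a positive epsilon the output is a perturbed permutation of the same items.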

Second part of the interview