Interview: Ted Dunning, MapR on The Real Meaning of Real-Time in Big Data
We discuss major Big Data developments in 2014, real-time processing, interactive queries, streaming systems, batch systems, MapR partnerships and challenges in scaling recommendation engines.

Here is my interview with him:
Anmol Rajpurohit: Q1. Big Data witnessed a lot of interest and action in 2014. In terms of technology advancement, what do you believe to be the most significant achievement in 2014?

TD: This practical focus is really good for the field since it gets people to focus on what really can be done to add value using big data techniques. There has always been substantial potential value, but until people started taking the field seriously that value was largely unrecognized.
AR: Q2. Recently, a lot of Big Data discussion has been around "real-time". What are your thoughts on the maturity assessment of current tools with regards to the "real-time" requirement? What would be a good benchmark to classify real-time and non real-time?
TD: I think that there are several important ways to describe different requirements, and I really don't like that some people have been trying to co-opt words like "real-time" to mean other things.
Real-time is about processing incoming data within a time limit and about making guarantees about the response time. The critical criterion is that guarantees are being made and met. It doesn't actually matter whether the guarantee is that processing of each record will complete in 1 microsecond or 100 seconds. The key is that a guarantee is made.
There are distinct differences in the technologies required as the guaranteed processing time moves from microseconds to milliseconds to seconds and even to minutes. Much of the recent maturation in these systems has been in large-scale systems that process data in the 1 millisecond to 5 second range.
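The "guarantee made and met" criterion can be illustrated with a toy sketch: measure each record's processing latency against a stated budget (names and the 100 ms budget here are illustrative assumptions, not any particular product; a true real-time system must also bound worst-case behavior, not merely observe it).

```python
import time

DEADLINE_S = 0.100  # hypothetical guarantee: each record handled within 100 ms

def process_with_deadline(record, handler, deadline_s=DEADLINE_S):
    """Process one record and report whether the latency budget was met."""
    start = time.monotonic()
    result = handler(record)
    elapsed = time.monotonic() - start
    return result, elapsed <= deadline_s

# Usage: a handler that is well within budget meets the guarantee.
result, met = process_with_deadline({"id": 1}, lambda r: r["id"] * 2)
```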

Streaming systems are like real-time systems without the guarantees. The idea is that incoming data is processed as it arrives, but without strict guarantees. Streaming is an important option to have since streaming systems can fall back under heavy load rather than failing to meet a hard guarantee.

Batch systems collect inputs over a period of time and process them together, often much more efficiently than they could be processed one at a time. Surprisingly, there has been recent progress in batch systems as well as real-time, interactive and streaming systems. Spark has brought micro-batching into the mix, even for some streaming applications and other systems like Tez, Flink and Drill (all from Apache) have provided real advances in the batch processing models available in the Hadoop eco-system.
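The micro-batching idea mentioned in connection with Spark can be sketched without any framework: group arriving records into small bounded batches so each batch can be processed together with efficient bulk operations (a minimal illustration; real systems typically bound batches by time as well as size).

```python
def micro_batches(stream, batch_size=3):
    """Group an incoming stream into fixed-size micro-batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Each yielded batch can then be processed in one efficient pass.
batches = list(micro_batches(range(7), batch_size=3))
# → [[0, 1, 2], [3, 4, 5], [6]]
```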
Each of these kinds of systems is well defined, and the definitions should not be muddied by careless usage.
AR: Q3. Based on your experience as an application architect, how do you see the changes in application development priorities over the past few years? What will be the key priorities in the next 2-3 years?
TD: I think that streaming systems are going to be hugely important over the next few years. This is a huge change from the ad hoc workflow scheduling that was required with pure batch systems.
AR: Q4. What does your typical day at MapR look like? Which activities interest you the most?
TD: My days are very full and quite varied. I don't know that there is a typical day. Things that I do include:

- Code and algorithm development. I stay hands-on because otherwise I would lose touch with what is important. My hands-on work includes machine learning, both applied and theoretical, applied math, and systems design and implementation. Some recent projects have included log-synth, for building realistic emulated data, and t-digest, which is a new way to approximate quantiles very efficiently and accurately.
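The t-digest itself is too involved for a few lines, but the problem it solves, estimating quantiles from a stream without storing all the data, can be shown with a much cruder baseline: a fixed-size reservoir sample (this illustrates the problem only; it is not Dunning's t-digest algorithm, which achieves far better accuracy in the tails for similar memory).

```python
import random

def reservoir_quantile(stream, q, k=1000, seed=42):
    """Crude streaming quantile estimate via reservoir sampling:
    keep a uniform random sample of at most k elements, then take
    the nearest-rank quantile of the retained sample."""
    rng = random.Random(seed)
    sample = []
    for i, x in enumerate(stream):
        if len(sample) < k:
            sample.append(x)
        else:
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = x  # replace with decreasing probability
    sample.sort()
    idx = min(int(q * len(sample)), len(sample) - 1)
    return sample[idx]

# Median of 0..99999, estimated from a 1000-element sample.
est = reservoir_quantile(range(100_000), 0.5)
```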
AR: Q5. What has been your experience of MapR's partnerships with other companies to deliver greater value to customers? Which partnerships have been the most remarkable?

AR: Q6. What are the most underrated challenges in scaling recommendation engines, while maintaining speed and accuracy?
TD: I think that people underrate how important it is to deal with abnormal situations well. These can include hardware failures, required maintenance windows for things like OS upgrades, or software abnormalities. If you don't take these things into account at the platform level, they really can't be handled well. This applies to all big data systems, not just recommendation engines.

Topics such as scalability, easy deployment, multi-modal inputs, result dithering, anti-flood and cross-pollination are essentially untouched in the research literature, and yet they have a far larger impact on result quality than ratings prediction. I really do think that indicator-based approaches such as what we covered in "the pony book" (https://www.mapr.com/practical-machine-learning) are the way to go for most users of recommendation systems.
Second part of the interview