KDnuggets Exclusive: Interview with Paco Nathan, Chief Scientist at Mesosphere

KDnuggets talks with Paco Nathan, computer scientist, OSS developer, author, and advisor about Apache Mesos, Cascading, his books and Big Data trends.

Paco Nathan is a "player/coach" in the field of Big Data, having led innovative Data teams on large-scale apps for 10+ years. An expert in distributed systems, machine learning, and Enterprise data workflows, Paco is an O'Reilly author and an advisor for several firms including The Data Guild, Mesosphere, Marinexplore, Agromeda, and TagThisCar. Paco received his BS Math Sci and MS Comp Sci degrees from Stanford University, and has 25+ years technology industry experience ranging from Bell Labs to early-stage start-ups.

Website: http://liber118.com/pxn
Twitter: @pacoid

Paco recently delivered a presentation at Strata 2014 on “Apache Mesos as an SDK for Building Distributed Frameworks”. He walked through the unique benefits of Apache Mesos, an open source cluster manager, and shared case studies of Mesos in production at scale (at Twitter, Airbnb, etc.). He also gave a tutorial on “Big Data Workflows on Mesos Clusters”.

Here is my interview with him:

Anmol Rajpurohit: 1. During your presentation at Strata 2014, you envisioned "Cluster Democratization". Can you please explain the term? How would it advance the high-priority goal of interdisciplinary data analysis?

Paco Nathan: The notion of “Data Democratization” involves making data available throughout an organization. That has pros and cons in practice, but the overall idea is sound.

By analogy, the notion of “Cluster Democratization” involves making data+resources available throughout more of the organization. To paraphrase Chris Fry, SVP Engineering at Twitter: they relied on Apache Mesos during their ramp up to IPO to get rid of the “Fail Whale” conditions, and now any new service launched at Twitter is launched on Mesos. Period. Full stop. Developers don’t have to perform mental backflips about all the steps required to roll their code out at scale: they work on laptops, then deploy on what appears to be a very large laptop – the several thousand multicore servers in a cluster.

Moreover, the notion of Datacenter Computing focuses on mixed workloads — for several reasons. Two of the most compelling outcomes center on “democratizing” resources. On the one hand, rather than having separate clusters for different frameworks (static partitioning), mixed workloads imply better data locality: producers and consumers of data products reside on the same nodes or racks and share the same data partitions. On the other hand, production apps can reside next to dev/test apps, while isolation in lieu of virtualization guarantees that a dev app gone rogue won't be able to step all over the production apps. That allows development managers to have their teams work directly within production environments. Both of those points address enormous current problems in industry — problems which were not well articulated prior to the introduction of Mesos.
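The mixed-workload model rests on Mesos's two-level scheduling: the master offers each node's spare resources to frameworks, which accept what they need or decline. A minimal toy sketch of that idea, assuming nothing about the real Mesos API (all names here are illustrative):

```python
# Toy model of Mesos-style two-level scheduling (illustrative only;
# the real Mesos API differs). The master offers each node's spare
# resources to frameworks, which accept what they need or decline.

class Node:
    def __init__(self, name, cpus, mem):
        self.name, self.cpus, self.mem = name, cpus, mem

class Framework:
    def __init__(self, name, need_cpus, need_mem):
        self.name, self.need_cpus, self.need_mem = name, need_cpus, need_mem
        self.placements = []

    def consider(self, node):
        """Accept an offer if it covers this framework's needs."""
        if node.cpus >= self.need_cpus and node.mem >= self.need_mem:
            node.cpus -= self.need_cpus
            node.mem -= self.need_mem
            self.placements.append(node.name)
            return True
        return False  # decline; the master re-offers elsewhere

def offer_round(nodes, frameworks):
    # Master loop: offer each node to each framework in turn.
    for fw in frameworks:
        for node in nodes:
            if fw.consider(node):
                break

nodes = [Node("n1", cpus=8, mem=32), Node("n2", cpus=8, mem=32)]
prod = Framework("prod-api", need_cpus=4, need_mem=16)
dev = Framework("dev-test", need_cpus=2, need_mem=8)
offer_round(nodes, [prod, dev])
print(prod.placements, dev.placements)  # → ['n1'] ['n1']
```

Note that the production and dev/test frameworks end up colocated on the same node — the "mixed workloads" outcome described above — while each framework's accounting of its own resources is what real Mesos enforces via isolation.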

So the notion of “Cluster Democratization” is to go back to some of the basic tenets of distributed computing, leverage what we know about SOA best practices, leverage what's emerging with SDNs, and so on, in order to rebuild teams and processes, making large-scale apps much more apropos to what customers/consumers need.

For further details and online resources regarding Mesos, refer to the following slides presented by Paco at Strata 2014: http://www.slideshare.net/pacoid/strata-sc-2014-apache-mesos-as-an-sdk-for-building-distributed-frameworks

AR: 2. Apache Mesos is widely being adopted across academia as well as industry. What do you think are the key reasons for its success? What new features would you like to have in the next version of Mesos?

PN: I believe that adoption of Mesos closely tracks the adoption of cluster computing at scale on commodity hardware. Google has set the pace for this, with Borg and Omega. Twitter and other firms pushed the envelope for OSS adaptations of Google's approach. For proprietary hardware, there are other commercial alternatives: IBM Symphony comes to mind, and perhaps Microsoft Autopilot is also in that category. Meanwhile, the notion of commodity hardware in clusters is drastically changing: multi-core processors, enormous memory spaces, affordable SSD arrays with high write throughput, software defined networks, etc.

Apache Mesos has had large-scale deployments now for three years. I find it odd that so many people ask to compare Mesos with YARN: the latter is not even at the same layer of the technology stack, nor does it have quite the pedigree of deployment at scale (outside of Hortonworks and Yahoo?). Moreover, Mesos addresses needs for mixed workloads in the context of highly available request/response services — in other words, low-latency use cases. In many verticals, you can measure revenue loss in terms of service latency in milliseconds. Expect that trend to increase across industry. Nearly all of the Mesos deployments that I've tracked are maniacally focused on low-latency use cases. Period. Full stop.

Keep in mind that current data center utilization rates in the industry hover around single-digit percentage points. That's terrible. Think of how much energy consumption and carbon footprint that implies. Google beats that via Borg/Omega, and Mesos is the open source version of this approach. Also keep in mind that Hadoop follows Google's work circa 2002, and so was designed for hardware from more than a decade ago. Meanwhile, industry use cases for data at scale are becoming more about real-time and streaming and less about batch. Google defined Datacenter Computing in terms of mixed workloads, high utilization, and low latency. It's time we update our overall notions of Big Data and real requirements, because Hadoop only has a couple more years of life.

Also, let’s keep in mind that Mesos is a distributed kernel. Much as we rarely interact with the Linux kernel directly, but rather with the GNU/Linux apps built around it, expect to see the evolution of the distributed apps built around Mesos: Aurora, Marathon, Chronos, etc. In terms of other features for Mesos, from a high level I would suggest emphasis on:
  • Better/more seamless integration with Docker (or something much like it) and Ansible, and perhaps defining the relationship with OpenStack — much credit goes to Red Hat for their ardent support of both Mesos and Docker
  • Another layer (above) for orchestration: auto-scaling is the trivial case, but workflow orchestration is a complex thing on a cluster, especially in the context of SOA, and greatly needed — that’s why the flexibility of Mesos “reshaping” a cluster at millisecond rates is so vital
  • Surfacing fine-grained metrics for resource utilization, via closer integration with Ganglia, collectd, Nagios, etc. — in other words, the kernel knows what resources were used by a process, as opposed to, say, Hadoop resource counters, which are just silly fictions and not particularly usable

For further details on Apache Mesos and learning resources, please refer: http://strata.oreilly.com/pnathan

For Datacenter Computing, please refer to Google Senior Fellow Jeff Dean’s talk on Taming Latency Variability and Scaling Deep Learning: https://plus.google.com/u/0/+ResearchatGoogle/posts/C1dPhQhcDRv

AR: 3. Given the increasing magnitude of unstructured data and complexity of analytic tasks, what are the most important benefits of the abstraction provided by Cascading? How does Cascading differentiate from other workflow systems (for example, KNIME)?

PN: Great question.

Please see a recent talk I gave at Twitter Seattle: http://www.slideshare.net/pacoid/data-workflows-for-machine-learning

In particular, see the “scorecard” of comparisons on slide #57. This turned out to be one of my most popular talks ever. The approach was to compare and contrast several popular open source frameworks for building data workflows, developing a “scorecard” for understanding how your use cases map to some frameworks more than others.

In terms of KNIME, I’m a huge fan. For many organizations it is an ideal tool to adopt. It’s great for code reuse, visualizing the data flows, etc. We call it “future-proofing your system integration”. Cascading fits more on the “conservative” part of the programming spectrum, as one would hope to see: operations for your bank, your airline, your hospital, etc. In terms of abstraction layers, Cascading initiated some of the use of “monoids” and abstract algebra for Big Data circa 2007, by introducing aspects of Functional Programming (FP) into a Java API. Weird, and quite successful! Unfortunately, the important abstract algebra uses weren’t identified explicitly, nor is Java much of an FP language, but it fit for Enterprise IT at the time. Cascalog, Scalding, Summingbird, etc., have picked up the torch and defined this approach even better than their foundation technology, Cascading.
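The “monoid” connection is about associative operations with an identity element: any such aggregation can be split across partitions and recombined in any grouping, which is exactly what a cluster needs. A small, hedged illustration in Python (using word counts, a classic example, rather than anything from Cascading's own API):

```python
# A monoid is a set with an associative operation and an identity element.
# Word counts form a monoid under dict-merge-with-addition, which is why
# they can be reduced per-partition and then combined in any order.

from collections import Counter
from functools import reduce

def combine(a: Counter, b: Counter) -> Counter:
    return a + b  # associative; the empty Counter() is the identity

partitions = [
    ["mesos", "cascading", "mesos"],
    ["scalding", "cascading"],
    ["mesos"],
]

# Each partition is reduced locally (a "mapper-side combine")...
partials = [Counter(words) for words in partitions]

# ...then the partials merge in any grouping, thanks to associativity.
left = reduce(combine, partials, Counter())
right = combine(partials[0], combine(partials[1], partials[2]))
assert left == right
print(left)  # → Counter({'mesos': 3, 'cascading': 2, 'scalding': 1})
```

Libraries in the Scalding/Summingbird lineage made this algebraic structure explicit, which is what lets the same aggregation logic run in batch, streaming, or incrementally.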

Meanwhile, the industry was focused on using Pig, Hive, etc., which represent enormous steps backwards in terms of formalizing data workflow abstractions. Seriously, I’d pay large sums of money to hear Edgar Codd (inventor of the relational model) expound about Pig and Hive. I’m about 99.9999999% certain that his responses would be NSFW!

In short, the need within Enterprise to leverage large-scale data generally involves cross-departmental contributions: ETL, data prep, predictive modeling, system integration, etc. Cascading and its DSLs (Domain Specific Languages) provide an abstraction to leverage several excellent points from more theoretical computer science:
  • Functional programming for Big Data, including TDD (Test Driven Development) practices at scale
  • Pattern language (convey expertise to non-experts in safe ways)
  • Literate programming (across departments)
  • Separation of concerns (focus on business logic, let the compiler handle the low-level details)
  • Mitigating the costs of operationalizing large-scale apps on clusters

Take lots of those moving parts and compile them into one JAR file: one point for troubleshooting, debugging, exception handling, notifications, etc. Particularly in the case of Cascalog, comparisons have been drawn between the Cascading abstraction and what Edgar Codd originally proposed for the “relational model”... which was very different from SQL!
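The "pipe assembly" idea behind that single-JAR compilation can be sketched in plain Python: business logic is declared as a chain of composable steps, and a separate "planner" executes the whole flow as one unit. This is a toy model under that assumption, not Cascading's actual API:

```python
# Illustrative sketch of a pipe-assembly abstraction in the spirit of
# Cascading (not its actual API): business logic is declared as a chain
# of named operations, kept separate from how the flow gets executed.

class Pipe:
    def __init__(self, steps=None):
        self.steps = steps or []

    def each(self, fn):       # transform every tuple in the stream
        return Pipe(self.steps + [("each", fn)])

    def filter(self, pred):   # keep only the tuples matching a predicate
        return Pipe(self.steps + [("filter", pred)])

def run_flow(pipe, source):
    """A trivial local 'planner'; a real one compiles to a cluster job."""
    data = list(source)
    for kind, fn in pipe.steps:
        if kind == "each":
            data = [fn(x) for x in data]
        elif kind == "filter":
            data = [x for x in data if fn(x)]
    return data

# Business logic only: no I/O, scheduling, or cluster details here,
# so it can be unit-tested in isolation (TDD at scale, in miniature).
assembly = Pipe().each(str.lower).filter(lambda w: w.startswith("m"))
print(run_flow(assembly, ["Mesos", "Cascading", "Marathon"]))
# → ['mesos', 'marathon']
```

The separation between declaring the assembly and planning its execution is what gives one point for troubleshooting, debugging, and exception handling: every step in the flow is visible to the planner before anything runs.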

Here is part 2 of the interview.