Spark and the Remorseless Recrystallization of the Open Source Analytics Ecosystem

Apache Spark has brought robust machine learning, graph, streaming, and in-memory capability to the Hadoop-centric ecosystem. In 2016, we expect its adoption to continue in diverse big data, advanced analytics, data science, Internet of Things, and other application domains.

Open source is a disruptor that never quits. It seems to be penetrating every pore of established technology ecosystems and recrystallizing everything into newer, more malleable configurations.

Even established open source technologies aren’t immune from this disruption. Less than two years ago, I commented on how Hadoop was beginning to confront newer open source rivals—most notably, Spark. It was clear even then that Spark had legs as a convergence platform that brought a more robust machine learning, graph, streaming, and in-memory capability to the Hadoop-centric ecosystem.

Less than a year ago, I commented at length on Spark’s growing pains, but it’s clear that it’s been maturing with impressive speed. In 2016, we can expect to see Spark continue its trajectory toward mainstream adoption in diverse big data, advanced analytics, data science, Internet of Things, and other application domains.

It’s probably a bit premature to call Spark a mature segment, considering that it remains tiny compared to Hadoop and other better-established segments of the analytics market. But I agree with Peter Schlampp’s recent declaration in IBM Big Data & Analytics Hub that Spark has crossed the adoption chasm from what Geoffrey Moore has called Early Adopters (technology enthusiasts) to deepening embrace by the so-called Early Majority (champions implementing disruptive new applications).

If you review my March 2015 checklist of the areas in which Spark was still immature, and consider the progress made by IBM and other solution providers in the intervening months, it’s clear that we’ve come a long way as a community:

  • The Spark ecosystem has greatly expanded the range of training, education, consulting, and technical support capabilities available to developers, data scientists, and users everywhere.
  • There has been considerable growth in the range of commercial and open-source tools for managing, monitoring, securing, tuning, optimizing, and recovering Spark jobs and clusters.
  • More solution providers, including IBM, are integrating Spark with their middleware, development tools, data platforms, and applications.
  • IT and big-data professionals have gained considerable experience with Spark technology in support of high-profile development projects.

For progress reports on Spark’s maturation, you can refer to Schlampp’s article (cited above), Kimberly Madia’s recent post on Spark community innovations, Courtney Pallotta’s post on Spark developer events, my post on the global Spark industry, and my post on IBM’s deep and wide-ranging investment in integrating Spark into its solution portfolio.

I also recommend that you read this recent article by Databricks reviewing the Apache Spark project’s evolution over the past year and prospects for more of the same in 2016. They note that Apache Spark went through 4 releases in 2015, with each release adding hundreds of improvements. Chief Apache Spark enhancements in 2015 included new APIs (DataFrames, Machine Learning Pipelines, R support, platform APIs for data-source integration), performance optimizations (Project Tungsten), and a new streaming capability. The number of Apache Spark code contributors doubled in 2015, ending the year at over 1,000. And new free Spark training resources came online, which are already instructing over 125,000 students in the technology.

Where innovations are concerned, momentum in the open-source market has clearly passed to Spark. I wouldn’t go as far as Databricks in placing Spark at the center of the evolving open-source ecosystems. The graphic in their blog makes it clear that they consider open-source analytics now a Spark-centric ecosystem, with everything else, including Hadoop, orbiting around Spark. Contrast that with the Hadoop-centric ecosystem featured on the Apache Hadoop site. In the eyes of the Hadoop community, Spark is simply a component of their project, not the other way around.

Let’s call them a binary star system that orbits around a common center of open-source gravity. By that, I’m referring to the gravitational pull of their respective codebases, which are now widely adopted in many big-data analytics initiatives. What joins Spark to Hadoop is the fact that they both include HDFS, HBase, Hive, Ambari, Mahout, Pig, and Cassandra as key components of their respective ecosystems.

How do they differ? Unlike Spark, Hadoop also includes MapReduce, YARN, Avro, Chukwa, and Tez in its ecosystem. Spark differs from Hadoop in also including Kubernetes, Docker, Spring, Mesos, OpenStack, MongoDB, Parquet, Elasticsearch, Tachyon, MySQL, PostgreSQL, Kafka, SequoiaDB, Sparkling, H2O, IP[y], Thunder, and Sqoop.
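For readers who want to see the overlap at a glance, the comparison can be sketched with plain Python sets. This is only an illustration built from the component names listed in the two paragraphs above; the variable names are my own, and the overlap ratio is a rough indicator I am adding, not a figure from Databricks or the Apache Hadoop site.

```python
# Component names are taken from the two ecosystem lists described above.
# All variable names here are illustrative assumptions, not any project's API.
shared = {"HDFS", "HBase", "Hive", "Ambari", "Mahout", "Pig", "Cassandra"}

hadoop_extras = {"MapReduce", "YARN", "Avro", "Chukwa", "Tez"}

spark_extras = {
    "Kubernetes", "Docker", "Spring", "Mesos", "OpenStack", "MongoDB",
    "Parquet", "Elasticsearch", "Tachyon", "MySQL", "PostgreSQL", "Kafka",
    "SequoiaDB", "Sparkling", "H2O", "IP[y]", "Thunder", "Sqoop",
}

# Each community's view of its own ecosystem.
hadoop_view = shared | hadoop_extras
spark_view = shared | spark_extras

# What the two views have in common, and what is unique to each.
common = hadoop_view & spark_view
only_hadoop = hadoop_view - spark_view
only_spark = spark_view - hadoop_view

# Jaccard similarity: size of the overlap divided by size of the union.
overlap_ratio = len(common) / len(hadoop_view | spark_view)

print(f"shared: {len(common)}, Hadoop-only: {len(only_hadoop)}, "
      f"Spark-only: {len(only_spark)}, overlap: {overlap_ratio:.2f}")
```

By this crude measure, fewer than a quarter of the named components appear in both pictures, which is one way of seeing that the two communities draw the same landscape from quite different vantage points.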

Over the coming year, I expect to see the industry focus far less on Spark and Hadoop as self-contained (dare I say “rival”?) communities and codebases. The true disruptor going forward will be the open-source analytic stack as a whole, composed of the most widely adopted subprojects within the scopes of Spark, Hadoop, and many of the others listed above.

Through a process of “natural selection,” this menagerie of squabbling open-source initiatives is fostering an ecosystem within which the best-fit components will be embraced and the rest de-emphasized and ignored. Through that same process, the surviving codebases will be shifted back and forth to whatever roles in the open-source analytics ecosystem best suit them.

We already saw that “natural selection” at work in the Hadoop community. Spark took flight as a higher-performance, lower-latency alternative runtime to MapReduce and HDFS. As Spark picked up the in-memory, streaming, graph, and iterative modeling use cases, Hadoop retrenched to its sweet spots of unstructured data storage, information refinement, data governance, and queryable archiving.

Spark has also experienced some of that mix-and-match change-out disruption. That’s how you might characterize the community’s decision two years ago to embrace Spark SQL and abandon efforts to use Shark or Hive as its primary query language. And the Spark community might conceivably someday abandon substantial pieces of its codebase, such as its streaming and graph analytics runtimes and libraries, if better-performing open-source alternatives prove more advantageous for developers.

None of these open-source projects is carved in stone. And even if any of them were, stone is not immune to erosion.

Fast streams can etch deep channels in seemingly solid substrates.