Why Spark Reached the Tipping Point in 2015

A quantitative look at Spark's breakthrough year in 2015, from 3 different points of view. Will 2016 be an even bigger year for the open source project?

There is no doubt that Apache Spark is wildly popular as a Big Data processing framework. Since its inception barely more than 3 years ago, Spark has been an industry buzzword, and a project prophesied from very early on to be the eventual replacement for Apache Hadoop. It seemingly continues to gain popularity daily.

2015 was the tipping point for Spark. During that year, it reached maturity in a number of regards and gained parity with Hadoop, the other Big Data framework, on numerous measures of popularity. Yet both defining and measuring "popularity" are difficult. In an attempt to do so, let's look at 3 recent metrics that can help gauge Spark's current popularity, show why 2015 was such a big year for the open source project, and suggest why 2016 should bring even greater demand.

1. Community Activity

In its A Year in Review for Apache Spark report, published in November of 2015, Cloudera states that Apache Spark has more than 50% more development activity than does Hadoop core, with over 750 contributors across hundreds of companies.

Last September, Databricks, Spark's development steward and backer company, released its own survey results indicating that Spark was the most active open source Big Data project as well. It noted more than 600 contributing developers within the past 12 months, nearly double the contributors from 2014.

A look at Apache Spark's GitHub repository shows nearly 15,000 commits from 825 contributors at the time of writing. This doesn't paint a full picture of the rate or frequency of contributions; however, it does support the claims about the number of contributors. By contrast, Apache Hadoop's repository lists 61 contributors and almost 13,000 commits.

Perhaps as a precursor to the engaged Spark community of 2015, this graph from a Redmonk article on The Emergence of Spark tells a story of Spark's meteoric rise in activity on Stack Overflow during 2014, and on into 2015. It's clear that the buzz was building for a long time before Spark finally tipped.

Stack Overflow Big Data activity

2. Quantifying Interest

Last month, Big Data software company Syncsort released the results of its annual Hadoop survey, which noted as one of its 3 key trends for 2016 that Apache Spark will move from being a talking point to being a doing point. The report goes on to say:

Nearly 70 percent of respondents are most interested in Apache Spark, surpassing interest in all other compute frameworks, including the recognized incumbent, MapReduce (55 percent). While Syncsort expects MapReduce will still be the prevalent compute framework in production, the high level of interest should translate into more Spark deployments, mostly running on Hadoop.

Nearly 70% of respondents are most interested in Spark. A massive number to be sure, and another indicator of the actual interest in Apache Spark.

A look at Indeed's Hadoop and Spark job trends indicates that the 2 projects reached parity in 2015, after Spark briefly bounced ahead of Hadoop in late 2014. That earlier spike may have been evidence of a peak of inflated expectations, though there is no indication of a trough of Spark disillusionment coming.

Indeed Hadoop, Spark job trends

Job trends can give us insight into both interest in Spark and the next topic, adoption.

3. Adoption

In the same A Year in Review for Apache Spark report, Cloudera has also stated that it has more clients running Spark now than all distributions of Hadoop combined. Impressive, given that Cloudera is a Hadoop-centric company. Or, at least, it was. Cloudera also plans to officially replace MapReduce with Apache Spark as the default processing engine for Hadoop.

Cloudera Spark adoption

KDnuggets' most recent annual survey of analytics software (2015) shows that Spark was used by 11.3% of all respondents, second only to Hadoop's 18.4% share in the Big Data Tools category. Given the age difference between the 2 products, and noting that Spark had only a 2.6% usage rate in the previous year's survey, this supports the view that Spark's rise has been fast and strong.

A Bright Future

This all puts Spark in an unusual position: Big Data is still becoming mainstream, yet Big Data's initial Big Application, Hadoop, may already be getting eclipsed by its inevitable successor in both interest and community participation, if not yet in implementation. While these frameworks are not interchangeable, they are able to offset one another in some ways, and live together in a swampy ecosystem which includes tools for each, and for both.

So why did Apache Spark's popularity surpass Hadoop's in 2015, only a few years after it was released? Well, Big Data is now upon us. When Hadoop was born, and was gradually being adopted, Big Data was coming. Now, not only are current Hadoop users looking at Spark, but those who have not yet gotten on the Big Data Train and are now making the jump are often going directly to Spark. There aren't a lot of organizations still waiting to see if this Big Data thing takes off; it's no longer a matter of if, but rather when.

Now that Spark has passed its tipping point, we will have to see if 2016 provides the set of circumstances it needs to move from talking point to doing point. Will its wild popularity and community engagement transfer into continued adoption? We will also keep an eye on KDnuggets' next software survey to see if it reflects the general attitude being reported.