Ranking Popular Distributed Computing Packages for Data Science

We examined 140 frameworks and distributed programming packages and came up with a list of the top 20 distributed computing packages useful for Data Science, based on a combination of GitHub, Stack Overflow, and Google results.



By Rachel Allen and Michael Li

At The Data Incubator, we strive to provide the most up-to-date data science curriculum available. Using feedback from our corporate and government partners, we deliver training on the most sought-after data science tools and techniques in industry. We wanted to take a more data-driven approach to developing the curriculum for our corporate data science training and our free Data Science Fellowship program for PhD and master's graduates looking to get hired as professional Data Scientists. To achieve this goal, we started by ranking popular deep learning libraries for data science. Next, we wanted to analyze the popularity of distributed computing packages for data science. Here are the results.

The Rankings
Below is a ranking of the top 20 of 140 distributed computing packages that are useful for Data Science, based on GitHub and Stack Overflow activity, as well as Google search results. The table shows standardized scores, where a value of 1 means one standard deviation above average (average = score of 0). For example, Apache Hadoop is 6.6 standard deviations above average in Stack Overflow activity, while Apache Flink is close to average. See below for methods.
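As a rough illustration of what a standardized score means, the Python sketch below converts raw counts for one metric into standard deviations from the mean. The counts are invented for illustration, the helper name `standardize` is hypothetical, and the choice of the population standard deviation is an assumption; this is not the code used for the ranking.

```python
from statistics import mean, pstdev

def standardize(counts):
    """Return z-scores: (value - mean) / standard deviation."""
    mu = mean(counts)
    sigma = pstdev(counts)  # assumption: population standard deviation
    if sigma == 0:
        # All packages have the same count, so none stands out.
        return [0.0 for _ in counts]
    return [(c - mu) / sigma for c in counts]

# Invented question counts for five hypothetical packages.
counts = [10, 20, 30, 40, 100]
scores = standardize(counts)
for count, score in zip(counts, scores):
    print(f"{count:>4} -> {score:+.2f}")
```

A package whose count equals the mean scores 0, and a package far above the rest scores well above 1.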

Results and Discussion
[Table: ranking of popular distributed computing packages for data science]

The package ranking is based on equally weighting three components: GitHub (stars and forks), Stack Overflow (tags and questions), and the number of Google search results. These were obtained using available APIs. Coming up with a comprehensive list of distributed computing packages was tricky; in the end, we scraped three different lists that we thought were representative. We chose to focus on 140 frameworks and distributed programming packages (see methods below for details). Computing standardized scores for each metric allows us to see which packages stand out in each category. The full ranking is here, while the raw data is here.
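For readers who want to collect similar numbers, here is a minimal sketch of reading star and fork counts from the public GitHub REST API, which exposes them as `stargazers_count` and `forks_count` on the repository endpoint. The function names, the sample payload, and the unauthenticated request are assumptions made for illustration; this is not the collection script used for this analysis.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com/repos/{owner}/{repo}"

def extract_counts(payload):
    """Pull star and fork counts out of a GitHub repository JSON payload."""
    return {
        "stars": int(payload.get("stargazers_count", 0)),
        "forks": int(payload.get("forks_count", 0)),
    }

def fetch_repo_counts(owner, repo, timeout=10):
    """Fetch stars and forks for one repository (unauthenticated, rate-limited)."""
    url = GITHUB_API.format(owner=owner, repo=repo)
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_counts(json.load(response))

# Offline example with a trimmed, invented payload in the API's shape.
sample = {"full_name": "example/package", "stargazers_count": 1200, "forks_count": 340}
print(extract_counts(sample))
```

Calling `fetch_repo_counts("apache", "spark")` would query the live API. Unauthenticated requests are rate-limited, so a run over 140 packages would likely need an access token.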

Apache Spark and Apache Hadoop are in a class of their own
Apache Spark (1) is an incredibly popular open-source distributed computing framework. Apache Spark dominated the GitHub activity metric, with its numbers of forks and stars more than eight standard deviations above the mean. Apache Spark utilizes in-memory data processing, which makes it faster than its predecessors and well suited to iterative workloads such as machine learning. It also offers an interactive console in either Scala or, more popular among data scientists, Python. Although Apache Spark was initially designed for the Hadoop ecosystem, it can run on its own using one of many different file management systems. Apache Hadoop (2) outperformed Apache Spark in Stack Overflow activity. The disconnect between Hadoop's Stack Overflow activity and the other two metrics is likely due to the fact that the meaning of Apache Hadoop has evolved over time. Rather than referring to just the framework, the term "Hadoop" can also mean all Hadoop-related projects that make up the ecosystem. This results in a somewhat inflated Stack Overflow score. Nevertheless, most of the frameworks and engines on our list have Apache Hadoop integrations, and Hadoop measured at least two standard deviations above the mean on all our metrics, solidifying its number two spot.

Apache Storm and Apache Flink are popular alternative frameworks, especially for streaming
Apache Storm (4), initially touted as the Apache Hadoop of real-time, is a stream-only framework best for near real-time distributed computing. It performed above average on all of our metrics. While Apache Storm processes stream data at scale, it is frequently used with Apache Kafka (3), a platform that processes the raw messages from real-time data feeds at scale. Like Apache Spark, Apache Flink (8) is a framework capable of both batch and stream processing. However, Apache Spark bills itself as a batch processor that can handle streaming, while Apache Flink is suited for heavy stream processing with some batch tasks.

Stratio Crossdata is the highest-ranked data hub and fastest-growing package
Stratio Crossdata (6) extends the capabilities of Apache Spark by providing a unified way to access multiple datastores. It uses a SQL-like language and a single API to access datastores of different natures, like Apache Cassandra, Elasticsearch, Avro, or MongoDB. The number of Google search results for Stratio Crossdata has increased by 400% from the last quarter, which is the largest growth rate out of all 140 packages on our list.

Two of the top 10 were developed by Twitter
The more popular of the two Twitter projects on our list, Apache Storm (4), was open-sourced by Twitter in 2011 and later donated to the Apache Software Foundation. Twitter Heron (9) is a direct successor to Apache Storm released in June 2016. Twitter Heron offers improved real-time, fault-tolerant stream processing with higher throughput than Storm. Twitter Heron had the fifth-largest quarterly growth rate with an increase of 180%. It will be interesting to see if Twitter Heron can climb further up the ranks with time.

The Hadoop Ecosystem dominates
The Hadoop Ecosystem projects are the most prevalent and widely adopted distributed computing frameworks and interfaces. Seventeen of the top 20 packages we ranked are part of the Hadoop Ecosystem or designed to integrate with Apache Spark or Apache Hadoop (including HDFS). Outside of the Hadoop Ecosystem, three packages performed well on our metrics: Hazelcast (10), an in-memory data grid; Google BigQuery (12), a cloud-based big data analytics web service using a SQL-like syntax; and Metamarkets Druid (15), a framework for real-time analysis of large datasets.

Limitations
As with any analysis, decisions were made along the way. All source code and data are on our GitHub page. The full list of distributed computing packages came from a few sources.

Naturally, libraries that have been around longer will tend to have higher metrics, and therefore a higher ranking. The only metric that takes this into account is the Google search quarterly growth rate.

The data presented a few difficulties:

  • Several of the library names are common words (onyx, drools, disco). For this reason, the search terms used to determine the number of Google search results included an additional descriptive term ("onyx platform", "kiegroup drools") or an alias ("discoproject"). All search terms can be found here.
  • Manual checks were done to confirm Stack Overflow tags and GitHub repository locations.
  • Stack Overflow tags can be found here.
  • GitHub repository names can be found here.
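The disambiguation described in the first bullet can be sketched as a simple lookup table. The three overrides below are the examples given above. The names `SEARCH_TERM_OVERRIDES` and `search_term` are hypothetical, and falling back to the unmodified package name for everything else is an assumption.

```python
# Search-term overrides for package names that are also common words.
# The three entries are the examples mentioned in the text.
SEARCH_TERM_OVERRIDES = {
    "onyx": "onyx platform",
    "drools": "kiegroup drools",
    "disco": "discoproject",
}

def search_term(package_name):
    """Return the query string to use when counting search results."""
    name = package_name.strip()
    # Assumption: packages without an override are searched by name as-is.
    return SEARCH_TERM_OVERRIDES.get(name.lower(), name)

for name in ["Onyx", "Drools", "Disco", "Apache Flink"]:
    print(name, "->", search_term(name))
```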

Methods
All source code and data are on our GitHub page.

We first generated a list of 140 distributed computing packages from these four sources, and then collected metrics for all of them to come up with an index value that was used for the ranking. "GitHub" index scores are based on both stars and forks, "Stack Overflow" index scores are based on tags and questions containing the package name, and "Search Results" are based on the total number of Google search results over the last five years and the quarterly growth rate of results, calculated over the past three months as compared to the prior three months. We chose the number of Google search results over Google Trends data because the number of websites indexed for a keyword is a more reliable indicator of how widely a package is used than the number of people searching for that keyword. The calculations for this index ranking can be found in the source code here.
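The quarterly growth rate described above compares the result count for the most recent three months with the count for the three months before that. A minimal sketch follows, with a hypothetical function name and an assumed convention of returning `None` when the prior quarter has no results:

```python
def quarterly_growth(recent_count, prior_count):
    """Percentage change in search results from the prior quarter to the recent one."""
    if prior_count == 0:
        # Assumption: growth is undefined when there is no baseline.
        return None
    return 100.0 * (recent_count - prior_count) / prior_count

# Invented counts: 1,000 results in the prior quarter, 5,000 in the recent one.
print(quarterly_growth(5000, 1000))
```

Under this definition, a package that goes from 1,000 to 5,000 results has grown by 400%, and one that goes from 100 to 280 has grown by 180%.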

A few other notes:

  • Any unavailable Stack Overflow counts were set to zero.
  • If no GitHub repository existed, forks and stars were recorded as zero.
  • Counts were standardized to mean 0 and standard deviation 1, then averaged to get the "GitHub" and "Stack Overflow" index scores, which were combined with "Search Results" to get the Overall index score.
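The notes above can be put together in a short sketch: fill missing counts with zero, standardize each metric, average within each component, and weight the three components equally. All field names and the sample data are invented for illustration. Averaging the two search metrics into one "Search Results" score and using the population standard deviation are assumptions; the source code linked above remains the reference for the exact calculation.

```python
from statistics import mean, pstdev

METRICS = ["stars", "forks", "so_tag_count", "so_question_count",
           "search_results", "search_growth"]  # hypothetical field names

def zscores(values):
    """Standardize a list of counts to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [0.0 if sigma == 0 else (v - mu) / sigma for v in values]

def overall_index(packages):
    """Rank packages by the equally weighted average of three component scores."""
    # Missing or None counts are treated as zero before standardizing.
    z = {m: zscores([p.get(m) or 0 for p in packages]) for m in METRICS}
    rows = []
    for i, p in enumerate(packages):
        github = mean([z["stars"][i], z["forks"][i]])
        stack_overflow = mean([z["so_tag_count"][i], z["so_question_count"][i]])
        # Assumption: the two search metrics are averaged like the others.
        search = mean([z["search_results"][i], z["search_growth"][i]])
        rows.append({
            "name": p["name"],
            "github": github,
            "stack_overflow": stack_overflow,
            "search": search,
            "overall": mean([github, stack_overflow, search]),
        })
    return sorted(rows, key=lambda r: r["overall"], reverse=True)

# Invented data for three hypothetical packages; "gamma" has no repository.
packages = [
    {"name": "alpha", "stars": 900, "forks": 300, "so_tag_count": 50,
     "so_question_count": 400, "search_results": 10000, "search_growth": 10},
    {"name": "beta", "stars": 100, "forks": 30, "so_tag_count": 5,
     "so_question_count": 40, "search_results": 2000, "search_growth": 5},
    {"name": "gamma", "stars": None, "so_tag_count": 1,
     "so_question_count": 2, "search_results": 500, "search_growth": 0},
]
ranking = overall_index(packages)
for row in ranking:
    print(f'{row["name"]:<6} {row["overall"]:+.2f}')
```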

All data was downloaded on September 19, 2017.

Resources
Source code is available on The Data Incubator's GitHub.

Visit our website to learn more about our offerings:

  1. Data Science Fellowship - a free, full-time, eight-week bootcamp program for PhD and master's graduates looking to get hired as professional Data Scientists in New York City, Washington DC, San Francisco, and Boston.
  2. Hiring Data Scientists
  3. Corporate data science training
  4. Online data science courses: introductory part-time bootcamps - taught by our expert Data Scientists in residence, and based on our Fellowship curriculum - for busy professionals to boost their data science skills in their spare time.

Bio: Michael Li is the founder and CEO of The Data Incubator. He worked as a data scientist (Foursquare), quant (D.E. Shaw, J.P. Morgan), and a rocket scientist (NASA). He did his PhD at Princeton as a Hertz fellow and read Part III Maths at Cambridge as a Marshall scholar. At Foursquare, Michael discovered that his favorite part of the job was teaching and mentoring smart people about data science. He decided to build a startup that lets him focus on what he really loves.

Rachel Kay Allen is a Lead Scientist at Booz Allen Hamilton and was an Instructor at The Data Incubator.
