Gold Blog, Jun 2017

Emerging Ecosystem: Data Science and Machine Learning Software, Analyzed

We examine which top tools are "friends", their Python vs R bias, and which work well with Spark/Hadoop and Deep Learning, and identify an emerging Big Data Deep Learning ecosystem.

Big Data and Deep Learning

Spark/Hadoop tools were used by 33% of respondents in the KDnuggets 2017 Software Poll, and Deep Learning tools were used by 32%. See the full list of tools in that post.

For each tool X, we compute how frequently it is used with Spark/Hadoop tools (vertical axis), and how frequently it is used with Deep Learning tools (horizontal axis).
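As a rough illustration of this calculation, here is a minimal sketch in Python. The records below are made up for the example (they are not poll data), and the two tool groups are reduced to a single tool each, which is an assumption; the poll's actual groups contain more tools. The function name `affinity` is also just illustrative.

```python
from collections import Counter

# Hypothetical respondents: each entry is the set of tools one person voted for.
records = [
    {"Python", "Spark", "Tensorflow"},
    {"Python", "scikit-learn"},
    {"R Language", "Spark"},
    {"R Language"},
    {"Python", "Tensorflow"},
]

# Assumed group membership, simplified for the example.
SPARK_HADOOP = {"Spark"}
DEEP_LEARNING = {"Tensorflow"}


def affinity(records, group):
    """For each tool X, return the share of X's users who also use any tool in `group`."""
    users = Counter()
    with_group = Counter()
    for tools in records:
        uses_group = bool(tools & group)
        for tool in tools:
            users[tool] += 1
            if uses_group:
                with_group[tool] += 1
    return {tool: with_group[tool] / users[tool] for tool in users}


spark_aff = affinity(records, SPARK_HADOOP)  # vertical axis
deep_aff = affinity(records, DEEP_LEARNING)  # horizontal axis
for tool in sorted(spark_aff):
    print(f"{tool}: {spark_aff[tool]:.1%} with Spark/Hadoop, "
          f"{deep_aff[tool]:.1%} with Deep Learning")
```

Note that a tool belonging to a group always scores 100% against that group, which is why the Hadoop tools show 100.0% in the Spark/Hadoop column of Table 1.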

Fig. 3: Deep Learning vs Spark/Hadoop affinity for top Data Science, Machine Learning Tools, 2017
Circle size corresponds to the tool's share of use, and color to Python (blue) vs R (orange) bias.

We note a cluster of blue (Python-leaning) circles in the upper-right part of the chart, including scikit-learn, PyCharm, Anaconda, Java, and Unix tools, which are used more frequently with both Spark/Hadoop and Deep Learning tools.

This suggests the emergence of a Python-friendly Big Data / Deep Learning ecosystem.

We note that Scala is the tool most strongly associated with "Big Data".

To keep Fig. 3 legible, we include only tools with at least 200 votes and exclude the Deep Learning and Spark/Hadoop tools themselves. Table 1 below gives more detailed information for tools with at least 100 votes.

Table 1: Deep Learning vs Spark/Hadoop affinity for top Data Science, Machine Learning Tools, 2017

Tool                                              Votes   % with Spark/Hadoop   % with Deep Learning
R language                                        1502    40.3%                 36.6%
SQL Language                                      1006    44.5%                 34.5%
Open Source Hadoop Tools                          431     100.0%                59.2%
Microsoft SQL Server                              334     39.5%                 32.9%
SQL on Hadoop tools                               298     100.0%                56.0%
Microsoft Power BI                                295     37.6%                 35.9%
Unix tools                                        278     55.0%                 48.2%
Commercial Hadoop Tools                           218     100.0%                53.7%
SAS Base                                          204     27.0%                 19.6%
IBM SPSS Statistics                               196     32.1%                 21.9%
Other programming and data languages              196     41.3%                 38.3%
Microsoft Azure Machine Learning                  184     57.1%                 54.9%
IBM SPSS Modeler                                  182     44.5%                 24.2%
Languages: C/C++                                  181     45.9%                 53.0%
SAS Enterprise Miner                              162     29.0%                 25.9%
Other free analytics/data mining tools            139     38.8%                 51.1%
Other Deep Learning Tools                         138     68.1%                 100.0%
IBM Watson / Watson Analytics                     125     52.8%                 40.0%
Microsoft R Server (former Revolution Analytics)  125     63.2%                 55.2%

The anonymized poll data is available in CSV format, with the following columns:
  • N: record number (randomized, records not in order of voting)
  • region: usca (US/Canada), euro (Europe), asia, ltam (Latin America), afme (Africa/Middle East), aunz (Australia/New Zealand)
  • Python: 1 if Votes (last column) includes Python, 0 otherwise
  • R language: 1 if Votes includes "R Language", 0 otherwise. We used "R Language" instead of "R" for ease of regex matching.
  • SQL Language: 1 if Votes includes "SQL Language", 0 otherwise.
  • RapidMiner: 1 if Votes includes RapidMiner, 0 otherwise.
  • Excel: 1 if Votes includes Excel, 0 otherwise.
  • Spark: 1 if Votes includes Spark, 0 otherwise.
  • Anaconda: 1 if Votes includes Anaconda, 0 otherwise.
  • Tensorflow: 1 if Votes includes Tensorflow, 0 otherwise.
  • scikit-learn: 1 if Votes includes scikit-learn, 0 otherwise.
  • Tableau: 1 if Votes includes Tableau, 0 otherwise.
  • KNIME: 1 if Votes includes KNIME, 0 otherwise.
  • Deep: 1 if Votes includes Deep, 0 otherwise.
  • Spark/Hadoop: 1 if Votes includes Spark/Hadoop, 0 otherwise.
  • ntools: number of tools
  • Votes: list of votes, separated by a semicolon ";"
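
If you want to explore the file yourself, the layout above can be read with a few lines of Python. This is only a sketch under some assumptions: the inline sample stands in for the real file and shows just a few of the flag columns, the header names are taken from the list above, and the check that ntools equals the number of entries in Votes is my guess at how ntools was derived. The `load_poll` helper is a hypothetical name.

```python
import csv
import io

# Hypothetical sample in the described layout (not real poll records).
SAMPLE = """N,region,Python,R language,Deep,Spark/Hadoop,ntools,Votes
1,usca,1,0,1,1,3,Python;Spark;Tensorflow
2,euro,0,1,0,0,2,R Language;Excel
3,asia,1,1,0,0,3,Python;R Language;Excel
"""


def load_poll(handle):
    """Parse poll rows, splitting the semicolon-separated Votes column into a list."""
    rows = []
    for row in csv.DictReader(handle):
        votes = [v.strip() for v in row["Votes"].split(";") if v.strip()]
        rows.append({
            "region": row["region"],
            "deep": row["Deep"] == "1",
            "spark_hadoop": row["Spark/Hadoop"] == "1",
            "ntools": int(row["ntools"]),
            "votes": votes,
        })
    return rows


rows = load_poll(io.StringIO(SAMPLE))

# Sanity check (assumption): ntools should match the number of listed votes.
mismatched = [r for r in rows if r["ntools"] != len(r["votes"])]

# Share of Python voters who also voted for a Spark/Hadoop tool.
python_users = [r for r in rows if "Python" in r["votes"]]
share = sum(r["spark_hadoop"] for r in python_users) / len(python_users)

print(f"{len(rows)} records, {len(mismatched)} with an ntools mismatch")
print(f"Python users also using Spark/Hadoop: {share:.1%}")
```

To run this on the downloaded file, replace `io.StringIO(SAMPLE)` with an open file handle, for example `open(path, newline="")`, where `path` is wherever you saved the CSV.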
Let me know what you think!