Emerging Ecosystem: Data Science and Machine Learning Software, Analyzed
We examine which top tools are "friends", their Python vs R bias, and which work well with Spark/Hadoop and Deep Learning, and identify an emerging Big Data Deep Learning ecosystem.
Big Data and Deep Learning
Spark / Hadoop tools were used by 33% of respondents in the KDnuggets 2017 Software Poll, and Deep Learning tools by 32%. See the full list of tools in that post.
For each tool X, we compute how frequently it is used with Spark/Hadoop tools (vertical axis), and how frequently it is used with Deep Learning tools (horizontal axis).
Fig. 3: Deep Learning vs Spark/Hadoop affinity for top Data Science, Machine Learning Tools, 2017
Circle size corresponds to each tool's share of use, and color to Python (blue) vs R (orange) bias.
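The two affinities plotted for each tool can be read as conditional shares: among respondents who voted for tool X, the percentage who also voted for at least one Spark/Hadoop tool, and the percentage who also voted for at least one Deep Learning tool. The sketch below is one way to compute them, not the exact procedure used for the chart. The group memberships in `SPARK_HADOOP` and `DEEP_LEARNING`, the function name `affinity`, and the sample ballots are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical group membership: the poll's actual tool-to-group mapping
# is not given here, so these sets are illustrative assumptions.
SPARK_HADOOP = {"Spark", "Open Source Hadoop Tools"}
DEEP_LEARNING = {"Tensorflow", "Other Deep Learning Tools"}


def affinity(ballots, min_votes=1):
    """Return {tool: (votes, pct_with_spark_hadoop, pct_with_deep_learning)}."""
    votes = Counter()
    with_big_data = Counter()
    with_deep = Counter()
    for ballot in ballots:
        tools = set(ballot)
        uses_big_data = bool(tools & SPARK_HADOOP)
        uses_deep = bool(tools & DEEP_LEARNING)
        for tool in tools:
            votes[tool] += 1
            with_big_data[tool] += uses_big_data  # True counts as 1
            with_deep[tool] += uses_deep
    return {
        tool: (n, 100.0 * with_big_data[tool] / n, 100.0 * with_deep[tool] / n)
        for tool, n in votes.items()
        if n >= min_votes
    }


# Made-up ballots, for illustration only (not poll data).
ballots = [
    ["Python", "Spark", "Tensorflow"],
    ["Python", "scikit-learn"],
    ["R Language", "Excel"],
    ["Python", "Tensorflow"],
]
result = affinity(ballots)
```

Under this definition a tool that belongs to a group scores 100% on that group's axis. Raising `min_votes` mirrors the vote thresholds used to keep the chart and table readable.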
We note a cluster of Python-related blue-colored circles in the upper-right part of the chart, including scikit-learn, PyCharm, Anaconda, Java, and Unix tools, which are more frequently used both with Spark/Hadoop and with Deep Learning tools.
This suggests the emergence of a Python-friendly Big Data / Deep Learning ecosystem.
We also note that Scala is the tool most strongly associated with "Big Data".
To keep Fig. 3 legible, it includes only tools with at least 200 votes and excludes Deep Learning and Spark/Hadoop tools. Table 1 below gives more detail for tools with at least 100 votes.
Table 1: Deep Learning vs Spark/Hadoop affinity for top Data Science, Machine Learning Tools, 2017
|Tool||Votes||% with Spark/Hadoop||% with Deep Learning|
|Open Source Hadoop Tools||431||100.0%||59.2%|
|Microsoft SQL Server||334||39.5%||32.9%|
|SQL on Hadoop tools||298||100.0%||56.0%|
|Microsoft Power BI||295||37.6%||35.9%|
|Commercial Hadoop Tools||218||100.0%||53.7%|
|IBM SPSS Statistics||196||32.1%||21.9%|
|Other programming and data languages||196||41.3%||38.3%|
|Microsoft Azure Machine Learning||184||57.1%||54.9%|
|IBM SPSS Modeler||182||44.5%||24.2%|
|SAS Enterprise Miner||162||29.0%||25.9%|
|Other free analytics/data mining tools||139||38.8%||51.1%|
|Other Deep Learning Tools||138||68.1%||100.0%|
|IBM Watson / Watson Analytics||125||52.8%||40.0%|
|Microsoft R Server (former Revolution Analytics)||125||63.2%||55.2%|
Here is the link to the anonymized poll data in CSV format, with the following columns:
- N: record number (randomized, records not in order of voting)
- region: usca: US/Canada, euro: Europe, asia, ltam: Latin America, afme: Africa/Middle East, aunz: Australia/New Zealand
- Python: 1 if Votes (last column) includes Python, 0 otherwise
- R language: 1 if Votes includes "R Language", 0 otherwise. We used "R Language" instead of "R" for ease of regex matching.
- SQL Language: 1 if Votes includes "SQL Language", 0 otherwise.
- RapidMiner: 1 if Votes includes RapidMiner, 0 otherwise.
- Excel: 1 if Votes includes Excel, 0 otherwise.
- Spark: 1 if Votes includes Spark, 0 otherwise.
- Anaconda: 1 if Votes includes Anaconda, 0 otherwise.
- Tensorflow: 1 if Votes includes Tensorflow, 0 otherwise.
- scikit-learn: 1 if Votes includes scikit-learn, 0 otherwise.
- Tableau: 1 if Votes includes Tableau, 0 otherwise.
- KNIME: 1 if Votes includes KNIME, 0 otherwise.
- Deep: 1 if Votes includes Deep, 0 otherwise.
- Spark/Hadoop: 1 if Votes includes Spark/Hadoop, 0 otherwise.
- ntools: number of tools
- Votes: list of votes, separated by a semicolon ";"
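As a rough illustration of working with this layout, the sketch below parses a few made-up rows using Python's standard `csv` module. It assumes the CSV header names match the column names listed above, that the file is comma-delimited, and that `ntools` equals the number of entries in `Votes`. None of this has been checked against the actual file. The helper names `load_records` and `pct_with` are hypothetical, and only a subset of the flag columns appears in the sample.

```python
import csv
import io

# Made-up rows in the column layout described above (subset of columns).
# Header names are assumed to match the listed column names.
SAMPLE = """N,region,Python,R language,Deep,Spark/Hadoop,ntools,Votes
1,usca,1,0,1,1,3,Python;Spark;Tensorflow
2,euro,0,1,0,0,2,R Language;Excel
3,asia,1,0,1,0,2,Python;Tensorflow
"""


def load_records(text):
    """Parse poll rows into dicts with a tool list and boolean group flags."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        # Votes is a semicolon-separated list of tool names.
        tools = [t.strip() for t in row["Votes"].split(";") if t.strip()]
        records.append({
            "region": row["region"],
            "tools": tools,
            "deep": row["Deep"] == "1",
            "spark_hadoop": row["Spark/Hadoop"] == "1",
            # Sanity check, assuming ntools counts the entries in Votes.
            "ntools_ok": int(row["ntools"]) == len(tools),
        })
    return records


def pct_with(records, tool, flag):
    """Percent of respondents who voted for `tool` and have `flag` set."""
    users = [r for r in records if tool in r["tools"]]
    if not users:
        return None
    return 100.0 * sum(r[flag] for r in users) / len(users)


records = load_records(SAMPLE)
python_deep = pct_with(records, "Python", "deep")
python_big_data = pct_with(records, "Python", "spark_hadoop")
```

To run this on the real file, replace `io.StringIO(text)` with an open file handle, after confirming the header names and delimiter.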