Emerging Ecosystem: Data Science and Machine Learning Software, Analyzed
We examine which top tools are "friends", their Python vs R bias, and which work well with Spark/Hadoop and Deep Learning, and identify an emerging Big Data Deep Learning ecosystem.
Big Data and Deep Learning
Spark/Hadoop tools were used by 33% of respondents in the KDnuggets 2017 Software Poll, and Deep Learning tools were used by 32%. See the full list of tools in that post.
For each tool X, we compute how frequently it is used with Spark/Hadoop tools (vertical axis), and how frequently it is used with Deep Learning tools (horizontal axis).
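As a rough illustration of this calculation, the sketch below counts, for each tool, the share of its users who also picked at least one Spark/Hadoop tool or at least one Deep Learning tool. The group definitions and the four toy records are hypothetical and chosen only for the example; they are not the poll's actual groups or data.

```python
from collections import Counter

# Hypothetical group definitions; the poll's real groups contain more tools.
SPARK_HADOOP = {"Spark", "Open Source Hadoop Tools"}
DEEP_LEARNING = {"Tensorflow", "Keras"}


def affinity(records):
    """Return {tool: (votes, pct_with_spark_hadoop, pct_with_deep_learning)}.

    Each record is the set of tools selected by one respondent.
    """
    votes, with_sh, with_dl = Counter(), Counter(), Counter()
    for tools in records:
        uses_sh = bool(tools & SPARK_HADOOP)   # respondent uses any Spark/Hadoop tool
        uses_dl = bool(tools & DEEP_LEARNING)  # respondent uses any Deep Learning tool
        for tool in tools:
            votes[tool] += 1
            with_sh[tool] += uses_sh
            with_dl[tool] += uses_dl
    return {
        tool: (n, 100.0 * with_sh[tool] / n, 100.0 * with_dl[tool] / n)
        for tool, n in votes.items()
    }


# Toy records, invented for illustration only.
records = [
    {"Python", "Spark", "Tensorflow"},
    {"Python", "scikit-learn", "Keras"},
    {"R language", "Excel"},
    {"Python", "R language", "Spark"},
]
result = affinity(records)
for tool, (n, pct_sh, pct_dl) in sorted(result.items()):
    print(f"{tool}: votes={n}, Spark/Hadoop={pct_sh:.1f}%, Deep Learning={pct_dl:.1f}%")
```

With this definition, a tool that itself belongs to a group always scores 100% for that group, which matches the 100.0% entries for Spark and Tensorflow in Table 1.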

Fig. 3: Deep Learning vs Spark/Hadoop affinity for top Data Science, Machine Learning Tools, 2017
Circle size corresponds to each tool's share of use, and color to Python (blue) vs R (orange) bias.
We note a cluster of Python-related blue-colored circles in the upper-right part of the chart, including scikit-learn, PyCharm, Anaconda, Java, and Unix tools, which are more frequently used both with Spark/Hadoop and with Deep Learning tools.
This suggests the emergence of a Python-friendly Big Data / Deep Learning ecosystem.
We note that Scala is the tool most strongly associated with "Big Data": 89.3% of Scala users also use Spark/Hadoop tools.
To keep Fig. 3 legible, we include only tools with at least 200 votes and exclude the Deep Learning and Spark/Hadoop tools themselves. Table 1 below gives more detailed information for tools with at least 100 votes.
Table 1: Deep Learning vs Spark/Hadoop affinity for top Data Science, Machine Learning Tools, 2017
Tool | Votes | % with Spark/Hadoop | % with Deep Learning |
---|---|---|---|
ALL | 2881 | 33.4% | 31.9% |
Python | 1516 | 45.1% | 47.8% |
R language | 1502 | 40.3% | 36.6% |
SQL Language | 1006 | 44.5% | 34.5% |
RapidMiner | 946 | 28.2% | 26.0% |
Excel | 810 | 28.6% | 26.8% |
Spark | 654 | 100.0% | 61.8% |
Anaconda | 629 | 43.7% | 55.3% |
Tensorflow | 581 | 57.0% | 100.0% |
scikit-learn | 561 | 49.6% | 64.3% |
Tableau | 560 | 36.8% | 29.5% |
KNIME | 551 | 29.6% | 31.8% |
Open Source Hadoop Tools | 431 | 100.0% | 59.2% |
Java | 399 | 50.4% | 47.1% |
Microsoft SQL Server | 334 | 39.5% | 32.9% |
SQL on Hadoop tools | 298 | 100.0% | 56.0% |
Microsoft Power BI | 295 | 37.6% | 35.9% |
Weka | 281 | 37.4% | 43.1% |
Unix tools | 278 | 55.0% | 48.2% |
Keras | 274 | 54.0% | 100.0% |
PyCharm | 260 | 45.8% | 58.1% |
Dataiku | 235 | 53.6% | 43.4% |
Commercial Hadoop Tools | 218 | 100.0% | 53.7% |
MATLAB | 214 | 34.6% | 48.1% |
Scala | 214 | 89.3% | 62.6% |
SAS Base | 204 | 27.0% | 19.6% |
IBM SPSS Statistics | 196 | 32.1% | 21.9% |
Other programming and data languages | 196 | 41.3% | 38.3% |
Microsoft Azure Machine Learning | 184 | 57.1% | 54.9% |
IBM SPSS Modeler | 182 | 44.5% | 24.2% |
Languages: C/C++ | 181 | 45.9% | 53.0% |
H2O.ai | 179 | 58.7% | 63.1% |
Theano | 167 | 54.5% | 100.0% |
SAS Enterprise Miner | 162 | 29.0% | 25.9% |
Alteryx | 152 | 35.5% | 23.7% |
Other free analytics/data mining tools | 139 | 38.8% | 51.1% |
Other Deep Learning Tools | 138 | 68.1% | 100.0% |
MLlib | 130 | 90.8% | 66.2% |
IBM Watson / Watson Analytics | 125 | 52.8% | 40.0% |
Microsoft R Server (former Revolution Analytics) | 125 | 63.2% | 55.2% |
QlikView | 121 | 34.7% | 28.1% |
Orange | 115 | 24.3% | 39.1% |
Here is the link to the anonymized poll data in CSV format, with the following columns:
- N: record number (randomized, records not in order of voting)
- region: usca (US/Canada), euro (Europe), asia (Asia), ltam (Latin America), afme (Africa/Middle East), aunz (Australia/New Zealand)
- Python: 1 if Votes (last column) includes Python, 0 otherwise
- R language: 1 if Votes includes "R Language", 0 otherwise. We used "R Language" instead of R for ease of regex matching.
- SQL Language: 1 if Votes includes "SQL Language", 0 otherwise.
- RapidMiner: 1 if Votes includes RapidMiner, 0 otherwise.
- Excel: 1 if Votes includes Excel, 0 otherwise.
- Spark: 1 if Votes includes Spark, 0 otherwise.
- Anaconda: 1 if Votes includes Anaconda, 0 otherwise.
- Tensorflow: 1 if Votes includes Tensorflow, 0 otherwise.
- scikit-learn: 1 if Votes includes scikit-learn, 0 otherwise.
- Tableau: 1 if Votes includes Tableau, 0 otherwise.
- KNIME: 1 if Votes includes KNIME, 0 otherwise.
- Deep: 1 if Votes includes Deep, 0 otherwise.
- Spark/Hadoop: 1 if Votes includes Spark/Hadoop, 0 otherwise.
- ntools: number of tools
- Votes: list of votes, separated by a semicolon ";"
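A minimal sketch of how a file in this layout might be read and summarized with Python's standard `csv` module follows. The sample rows are invented, only a subset of the columns is shown, and the comma delimiter is an assumption about the real file. The helper `pct_with` is a hypothetical name, and the sketch assumes the `Deep` and `Spark/Hadoop` columns flag respondents who use any tool in the corresponding group.

```python
import csv
import io

# Hypothetical sample in the described layout (subset of columns, invented rows).
# Assumes a comma-delimited file; the real file may differ.
SAMPLE = """N,region,Python,R language,Deep,Spark/Hadoop,ntools,Votes
1,usca,1,0,1,1,3,Python;Spark;Tensorflow
2,euro,1,1,0,0,2,Python;R language
3,asia,0,1,0,1,2,R language;Spark
4,usca,1,0,1,0,2,Python;Keras
"""


def pct_with(rows, tool_col, group_col):
    """Percent of respondents flagged in tool_col who are also flagged in group_col."""
    users = [row for row in rows if row[tool_col] == "1"]
    if not users:
        return None  # no respondent selected this tool
    hits = sum(row[group_col] == "1" for row in users)
    return 100.0 * hits / len(users)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))

python_sh = pct_with(rows, "Python", "Spark/Hadoop")
python_dl = pct_with(rows, "Python", "Deep")
r_sh = pct_with(rows, "R language", "Spark/Hadoop")

# The Votes column holds the full tool list, separated by semicolons.
first_votes = rows[0]["Votes"].split(";")

print(f"Python with Spark/Hadoop: {python_sh:.1f}%")
print(f"Python with Deep Learning: {python_dl:.1f}%")
print(f"R language with Spark/Hadoop: {r_sh:.1f}%")
```

To run this against the downloaded file, replace `io.StringIO(SAMPLE)` with an opened file handle. Tools that lack their own 0/1 column can be counted by splitting `Votes` on the semicolon instead.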