What you need to know: The Modern Open-Source Data Science/Machine Learning Ecosystem

We identify the 6 tools in the modern open-source Data Science ecosystem, examine the Python vs R question, and determine which tools are used the most with Deep Learning and Big Data.

Recently we reported the results of the 20th annual KDnuggets Software Poll:
Python leads the 11 top Data Science, Machine Learning platforms: Trends and Analysis.

As we have done before (see 2017 data science ecosystem, 2018 data science ecosystem), we examine which tools were part of the same answer - the skillset of the user. We note that this does not necessarily mean that all tools were used together on each project, but having the knowledge and skills to use both tools X and Y makes it more likely that both X and Y were used together on some projects. The results we see are consistent with this assumption.

The top tools show surprising stability - we see essentially the same pattern as last year.

First, we selected the tools with at least 20% of the vote. There were 11 such tools - exactly the same list of 11 tools as last year, although the order has changed a little. Keras moved up from n. 10 to n. 8, and Anaconda moved up from n. 6 to n. 5. Tableau and SQL moved down a little.

The cutoff for this group of 11 is a natural one, since there is a big gap between n. 11 (Apache Spark, with 21%) and n. 12 (Microsoft Power BI, 13%).

We used the same Lift measure as in our 2017 analysis and 2018 analysis.

We then grouped together the tools with the strongest association, starting with Tensorflow and Keras, until we arrived at Fig. 1 below. We made the patterns easier to see by showing only associations with abs(Lift1) > 15%.

Poll Data Science 2019 Top11 Ecosystem
Fig. 1: Data Science, Machine Learning Top Tools Associations, 2019
The bar length corresponds to the absolute value of Lift1, and the color shows its sign (green - positive association, red - negative association).

We note a group of 6 primary tools that make the modern open source data science ecosystem: Python, Anaconda, scikit-learn, Tensorflow, Keras, and Apache Spark. This is exactly the same group as last year - see below.

Poll Data Science 2018 Top11 Ecosystem
Fig. 1b: Data Science, Machine Learning Top Tools Associations in 2018

Rapidminer has a small negative association with all of the tools above and does not go strongly with any other tools.

R has small positive associations with Keras, Apache Spark, SQL, and Tableau.

The second group includes the 3 supporting tools for Data Science and Machine Learning, which are frequently used together: SQL, Excel, and Tableau.

Note that this chart is symmetric about the diagonal (the top right triangle is equal to the bottom left), but we included both triangles because the patterns are easier to see in the full chart.

Lift Definition:
Lift (X & Y) = pct (X & Y) / ( pct (X) * pct (Y) )

where pct(X) is the percent of users who selected X.

Lift (X&Y) > 1 indicates that X&Y appear together more than expected if they were independent,
Lift=1 if X & Y appear with frequency expected if they are independent, and
Lift < 1 if X & Y appear together less than expected (negatively correlated)

To make the differences from one easier to see we define
Lift1 (X & Y) = Lift (X & Y) - 1
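
As a rough illustration of the definitions above, here is a minimal Python sketch that computes Lift and Lift1 from raw answers and keeps only pairs with abs(Lift1) > 15%. The `votes` data and the helper names (`pct`, `lift`, `lift1`) are made up for this example and are not the actual poll data or the code used for the analysis.

```python
from itertools import combinations

# Hypothetical poll answers: each voter is the set of tools they selected.
votes = [
    {"Python", "scikit-learn", "Keras"},
    {"Python", "scikit-learn"},
    {"R", "SQL"},
    {"Python", "SQL"},
    {"R", "Excel", "SQL"},
]


def pct(votes, *tools):
    """Fraction of voters who selected every tool in `tools`."""
    wanted = set(tools)
    return sum(wanted <= v for v in votes) / len(votes)


def lift(votes, x, y):
    """Lift(X & Y) = pct(X & Y) / (pct(X) * pct(Y))."""
    return pct(votes, x, y) / (pct(votes, x) * pct(votes, y))


def lift1(votes, x, y):
    """Lift1(X & Y) = Lift(X & Y) - 1, so 0 means independence."""
    return lift(votes, x, y) - 1


python_sklearn = lift1(votes, "Python", "scikit-learn")
python_r = lift1(votes, "Python", "R")

# Keep only the stronger associations, mirroring the abs(Lift1) > 15% filter.
tools = sorted(set().union(*votes))
strong = {
    (x, y): lift1(votes, x, y)
    for x, y in combinations(tools, 2)
    if abs(lift1(votes, x, y)) > 0.15
}
```

On this toy data, Lift1 for Python and scikit-learn is about 0.67 (they appear together more often than independence would predict), while Lift1 for Python and R is -1 because no voter selected both. The sketch assumes every tool has at least one vote; otherwise the division in `lift` would fail.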

Python vs R

Next we examine Python vs R.

Let with_Py(X) = % of tool X usage with Python, and with_R(X) = % of tool X usage with R. To visualize how close each tool is to Python or R, we used a very simple measure, Bias_Py_R(X) = with_Py(X) - with_R(X), which is positive if the tool is used more with Python and negative if it is used more with R.
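
The bias measure can be sketched in a few lines of Python. This assumes that with_Py(X) means the share of voters who selected X and also selected Python (and likewise for R); the `votes` data and the function names `share_with` and `bias_py_r` are hypothetical, chosen only for this illustration.

```python
# Hypothetical poll answers: each voter is the set of tools they selected.
votes = [
    {"Python", "scikit-learn", "Keras"},
    {"Python", "scikit-learn"},
    {"R", "SQL"},
    {"Python", "SQL"},
    {"R", "Excel", "SQL"},
]


def share_with(votes, tool, partner):
    """Among voters who selected `tool`, the fraction who also selected `partner`."""
    users = [v for v in votes if tool in v]
    return sum(partner in v for v in users) / len(users)


def bias_py_r(votes, tool):
    """Bias_Py_R(X) = with_Py(X) - with_R(X); positive means closer to Python."""
    return share_with(votes, tool, "Python") - share_with(votes, tool, "R")


sklearn_bias = bias_py_r(votes, "scikit-learn")
sql_bias = bias_py_r(votes, "SQL")
```

In this made-up sample, scikit-learn has a bias of 1.0 (always chosen with Python, never with R), while SQL has a bias of about -0.33, which would place it on the R side of the chart.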

In Fig. 2, we charted the bias of the most popular tools with at least 90 votes, and as we can see, almost every tool is biased towards Python. The only 3 exceptions are R (obviously), Microsoft SQL Server, and SAS Base (which has exactly zero bias). For comparison, there were 10 tools biased towards R in a similar 2017 analysis and 3 R-biased tools in 2018.

R is an excellent platform with tremendous depth and breadth, which is widely used for data analysis and visualization, and it still has about 50% share. However, going forward, we expect more development and energy around the Python ecosystem.

Python Vs R 2019 Poll
Fig. 2: Data Science, Machine Learning Platforms 2019: Python vs R bias

I don't think that the relative stability of the share and associations of the top 11 platforms suggests the end of innovation, but perhaps only a pause before another major system, perhaps something related to AutoML, disrupts the current ecosystem.

Big Data and Deep Learning

Finally, we look at the relationship of DS/ML platforms and languages to Big Data (Hadoop and Spark tools) and Deep Learning.

Big Data tools were used by 37.4% of voters, up from 33% in both the 2018 and 2017 polls. Despite this increase, most Data Scientists still work with medium / small data that does not require Hadoop / Spark.

The share of voters using Deep Learning tools grew to 50%, vs 43% in the 2018 poll and 32% in 2017.

For each tool X, we compute how frequently it was included by the same voter with Big Data (Spark/Hadoop tools) - vertical axis, and with Deep Learning tools (horizontal axis).

Here is a chart with top tools (with at least 50 votes), excluding Deep Learning and Big Data tools themselves.
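
The two affinities plotted for each tool can be approximated with a sketch like the one below. The group memberships (`BIG_DATA`, `DEEP_LEARNING`), the `votes` data, and the `affinity` helper are assumptions made for this illustration; the actual poll grouped more tools under each heading.

```python
# Assumed, simplified tool groups; the real poll categories contain more tools.
BIG_DATA = {"Apache Spark", "Hadoop"}
DEEP_LEARNING = {"Tensorflow", "Keras"}

# Hypothetical poll answers: each voter is the set of tools they selected.
votes = [
    {"SQL", "Apache Spark", "Keras"},
    {"SQL", "Tensorflow"},
    {"SQL"},
    {"Excel", "Hadoop"},
    {"Excel"},
]


def affinity(votes, tool, group):
    """Among voters who selected `tool`, the fraction who also selected
    at least one tool from `group`."""
    users = [v for v in votes if tool in v]
    return sum(bool(v & group) for v in users) / len(users)


# One (Deep Learning, Big Data) point per tool: horizontal and vertical axis.
points = {
    tool: (affinity(votes, tool, DEEP_LEARNING), affinity(votes, tool, BIG_DATA))
    for tool in ("SQL", "Excel")
}
```

With this toy data, SQL lands at roughly (0.67, 0.33), that is, in the lower right half of the chart where a tool is used more with Deep Learning than with Big Data. A minimum-votes filter, like the 50-vote cutoff used for the chart, would be applied before computing the points.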

Poll 2019 Big Data vs Deep Learning Affinity
Fig. 3: KDnuggets 2019 Data Science, Machine Learning Poll: Deep Learning vs Big Data affinity

We note that Scala is the language most used with both Deep Learning and Big Data. The chart is heavy on the lower right side, with almost every tool being used more with Deep Learning than with Big Data tools.

Interestingly, the tools most associated with Deep Learning are XGBoost and LightGBM.

Here is a table which shows the affinity of different platforms to Big Data and Deep Learning, sorted by affinity with Deep Learning tools.

Table 1: Top Data Science/ML Software and its affinity to Big Data and Deep Learning

Platform              Votes   % with Big Data   % with Deep Learning
Azure ML                 77   47%               68%
Other free DS tools     145   32%               60%
Unix shell/awk          130   51%               60%
Other prog lang          94   40%               60%
Orange DM                51   33%               55%
SQL Server              184   34%               49%
MS Power BI             217   38%               46%
SAS EM                   55   42%               42%
SAS Base                 93   28%               34%
IBM SPSS Statistics      87   20%               25%