What you need to know: The Modern OpenSource Data Science/Machine Learning Ecosystem
Tags: Anaconda, Apache Spark, Big Data Software, Deep Learning, Excel, Keras, Poll, Python, R, RapidMiner, scikitlearn, Software, SQL, Tableau, TensorFlow
We identify the 6 tools in the modern opensource Data Science ecosystem, examine the Python vs R question, and determine which tools are used the most with Deep Learning and Big Data.
Recently we reported the results of 20th annual KDnuggets Software Poll:
As we have done before (see 2017 data science ecosystem, 2018 data science ecosystem), we examine which tools were part of the same answer  the skillset of the user. We note that this does not necessarily mean that all tools were used together on each project, but having knowledge and skills to used both tools X and Y makes it more likely that both X and Y were used together on some projects. The results we see are consistent with this assumption.
The top tools show surprising stability  we see essentially the same pattern as last year.
First, we selected the tools with at least 20% of the vote. There were 11 such tools  exactly the same list of 11 tools as last year, although the order has changed a little. Keras moved up from n. 10 to n. 8, and Anaconda moved up from n. 6 to n. 5. Tableau and SQL moved down a little.
The cutoff for this group of 11 is a natural one, since there is a big gap between n. 11 (Apache Spark, with 21%) and n. 12 (Microsoft Power BI, 13%).
We used the same Lift measure as in our 2017 analysis and 2018 analysis.
We then grouped together the tools with the strongest association, starting with Tensorflow and Keras, until we arrived to the figure 1 below. We made the patterns easier to see by showing only associations with abs(Lift1) > 15%.
Fig. 1: Data Science, Machine Learning Top Tools Associations, 2019
The bar length corresponds to absolute value of lift1, and the color is the value of lift (green  positive association, red  negative association).
We note a group of 6 primary tools that make the modern open source data science ecosystem: Python, Anaconda, scikitlearn, Tensorflow, Keras, and Apache Spark. This is exactly the same group as last year  see below.
Fig. 1b: Data Science, Machine Learning Top Tools Associations in 2018
Rapidminer has a small negative association with all of the tools above and does not go strongly with any other tools.
R has small positive associations with Keras, Apache Spark, SQL, and Tableau.
The second group includes the 3 supporting tools for Data Science and Machine Learning, which are frequently used together: SQL, Excel, and Tableau.
Note that this chart is symmetrical relative to diagonal (top right triangle is equal to bottom left), but we included both triangles because the patterns are easier to see in the full chart.
Lift Definition:
Next we examine Python vs R.
Let with_Py(X)= % of tool X usage with Python, and with_R(X) % of tool X usage with R. To visualize how close is each tool to Python or R, we used a very simple measure Bias_Py_R(X) = with_Py(X)  with_R(X), which is positive if tool is more used with Python and negative if it is more used with R.
In Fig. 2, we charted the bias of most popular tools with at least 90 votes, and as we can see, almost every tool is biased towards Python. The only 3 exceptions are R (obviously), Microsoft SQL Server, and SAS Base (which is exactly zero bias). For comparison, in a similar 2017 analysis there were 10 tools biased towards R and 3 Rbiased tools in 2018.
R is an excellent platform with tremendous depth and width, which is widely used for data analysis and visualization, and it still has about 50% share. However, going forward, we expect more development and energy around Python ecosystem.
Fig. 2: Data Science, Machine Learning Platforms 2019: Python vs R bias
I don't think that the relative stability of the share and associations of top 11 platforms suggest the end of innovation, but perhaps only a pause before another major system  perhaps something related to AutoML, will disrupt the current ecosystem.
Finally, we look DS/ML platforms and languages relationship to Big Data (Hadoop and Spark tools) and Deep Learning.
Big Data tools were used by 37.4% up from 33% in both 2018 and 2017 polls. Despite this increase, most Data Scientists still work with medium / small data that does not require Hadoop / Spark.
The fraction of Deep Learning tools grew to 50% vs 43% in 2018 poll and 32% in 2017.
For each tool X, we compute how frequently it was included by the same voter with Big Data (Spark/Hadoop tools)  vertical axis, and with Deep Learning tools (horizontal axis).
Here is a chart with top tools (with at least 50 votes), excluding Deep Learning and Big Data tools themselves.
Fig. 3: KDnuggets 2019 Data Science, Machine Learning Poll: Deep Learning vs Big Data affinity
We note that Scala is the most used language with both Deep Learning and Big Data. The chart is heavy on the lower right side, with almost every tool being used more with Deep Learning than with Big Data tool.
Interestingly, the tools most associated with Deep Learning are XGBoost and LightGBM.
Here is a table which shows the affinity of different platforms to Big Data and Deep Learning, sorted by affinity with Deep Learning tools.
Table 1: Top Data Science/ML Software and its affinity to Big Data and Deep Learning
Related:
Python leads the 11 top Data Science, Machine Learning platforms: Trends and Analysis.
As we have done before (see 2017 data science ecosystem, 2018 data science ecosystem), we examine which tools were part of the same answer  the skillset of the user. We note that this does not necessarily mean that all tools were used together on each project, but having knowledge and skills to used both tools X and Y makes it more likely that both X and Y were used together on some projects. The results we see are consistent with this assumption.
The top tools show surprising stability  we see essentially the same pattern as last year.
First, we selected the tools with at least 20% of the vote. There were 11 such tools  exactly the same list of 11 tools as last year, although the order has changed a little. Keras moved up from n. 10 to n. 8, and Anaconda moved up from n. 6 to n. 5. Tableau and SQL moved down a little.
The cutoff for this group of 11 is a natural one, since there is a big gap between n. 11 (Apache Spark, with 21%) and n. 12 (Microsoft Power BI, 13%).
We used the same Lift measure as in our 2017 analysis and 2018 analysis.
We then grouped together the tools with the strongest association, starting with Tensorflow and Keras, until we arrived to the figure 1 below. We made the patterns easier to see by showing only associations with abs(Lift1) > 15%.
Fig. 1: Data Science, Machine Learning Top Tools Associations, 2019
The bar length corresponds to absolute value of lift1, and the color is the value of lift (green  positive association, red  negative association).
We note a group of 6 primary tools that make the modern open source data science ecosystem: Python, Anaconda, scikitlearn, Tensorflow, Keras, and Apache Spark. This is exactly the same group as last year  see below.
Fig. 1b: Data Science, Machine Learning Top Tools Associations in 2018
Rapidminer has a small negative association with all of the tools above and does not go strongly with any other tools.
R has small positive associations with Keras, Apache Spark, SQL, and Tableau.
The second group includes the 3 supporting tools for Data Science and Machine Learning, which are frequently used together: SQL, Excel, and Tableau.
Note that this chart is symmetrical relative to diagonal (top right triangle is equal to bottom left), but we included both triangles because the patterns are easier to see in the full chart.
Lift Definition:
Lift (X & Y) = pct (X & Y) / ( pct (X) * pct (Y) )
where pct(X) is the percent of users who selected X.
Lift (X&Y) > 1 indicates that X&Y appear together more than expected if they were independent,
Lift=1 if X & Y appear with frequency expected if they are independent, and
Lift < 1 if X & Y appear together less than expected (negatively correlated)
To make the differences from one easier to see we define
Lift1 (X & Y) = Lift (X & Y)  1
Python vs R
Next we examine Python vs R.
Let with_Py(X)= % of tool X usage with Python, and with_R(X) % of tool X usage with R. To visualize how close is each tool to Python or R, we used a very simple measure Bias_Py_R(X) = with_Py(X)  with_R(X), which is positive if tool is more used with Python and negative if it is more used with R.
In Fig. 2, we charted the bias of most popular tools with at least 90 votes, and as we can see, almost every tool is biased towards Python. The only 3 exceptions are R (obviously), Microsoft SQL Server, and SAS Base (which is exactly zero bias). For comparison, in a similar 2017 analysis there were 10 tools biased towards R and 3 Rbiased tools in 2018.
R is an excellent platform with tremendous depth and width, which is widely used for data analysis and visualization, and it still has about 50% share. However, going forward, we expect more development and energy around Python ecosystem.
Fig. 2: Data Science, Machine Learning Platforms 2019: Python vs R bias
I don't think that the relative stability of the share and associations of top 11 platforms suggest the end of innovation, but perhaps only a pause before another major system  perhaps something related to AutoML, will disrupt the current ecosystem.
Big Data and Deep Learning
Finally, we look DS/ML platforms and languages relationship to Big Data (Hadoop and Spark tools) and Deep Learning.
Big Data tools were used by 37.4% up from 33% in both 2018 and 2017 polls. Despite this increase, most Data Scientists still work with medium / small data that does not require Hadoop / Spark.
The fraction of Deep Learning tools grew to 50% vs 43% in 2018 poll and 32% in 2017.
For each tool X, we compute how frequently it was included by the same voter with Big Data (Spark/Hadoop tools)  vertical axis, and with Deep Learning tools (horizontal axis).
Here is a chart with top tools (with at least 50 votes), excluding Deep Learning and Big Data tools themselves.
Fig. 3: KDnuggets 2019 Data Science, Machine Learning Poll: Deep Learning vs Big Data affinity
We note that Scala is the most used language with both Deep Learning and Big Data. The chart is heavy on the lower right side, with almost every tool being used more with Deep Learning than with Big Data tool.
Interestingly, the tools most associated with Deep Learning are XGBoost and LightGBM.
Here is a table which shows the affinity of different platforms to Big Data and Deep Learning, sorted by affinity with Deep Learning tools.
Table 1: Top Data Science/ML Software and its affinity to Big Data and Deep Learning
Software  2019 Count  % with Big Data  % with Deep Learning 

LightGBM  50  34%  90% 
XGBoost  208  50%  84% 
H2O.ai  117  58%  82% 
Scala  57  89%  77% 
scikitlearn  418  47%  73% 
Azure ML  77  47%  68% 
Anaconda  556  43%  64% 
Weka  109  55%  63% 
C/C++  116  39%  63% 
Javascript  112  48%  63% 
Other free DS tools  145  32%  60% 
Unix shell/awk  130  51%  60% 
Other prog lang  94  40%  60% 
Python  1078  44%  59% 
MATLAB  100  33%  59% 
Orange DM  51  33%  55% 
Java  203  48%  55% 
KNIME  175  39%  54% 
R  764  41%  53% 
SQL Server  184  34%  49% 
SQL  538  46%  46% 
MS Power BI  217  38%  46% 
Tableau  362  40%  44% 
Alteryx  66  47%  44% 
QlikView  60  40%  43% 
RapidMiner  839  34%  42% 
SAS EM  55  42%  42% 
Excel  571  29%  41% 
SAS Base  93  28%  34% 
IBM SPSS Statistics  87  20%  25% 
Related:
 Python leads the 11 top Data Science, Machine Learning platforms: Trends and Analysis.
 Emerging Ecosystem: Data Science and Machine Learning Software, Analyzed
 The 6 components of OpenSource Data Science/ Machine Learning Ecosystem; Did Python declare victory over R?
Top Stories Past 30 Days

