What you need to know: The Modern Open-Source Data Science/Machine Learning Ecosystem
Tags: Anaconda, Apache Spark, Big Data Software, Deep Learning, Excel, Keras, Poll, Python, R, RapidMiner, scikit-learn, Software, SQL, Tableau, TensorFlow
We identify the 6 tools in the modern open-source Data Science ecosystem, examine the Python vs R question, and determine which tools are used the most with Deep Learning and Big Data.
As we have done before (see 2017 data science ecosystem, 2018 data science ecosystem), we examine which tools were part of the same answer - the skillset of the user. We note that this does not necessarily mean that all tools were used together on each project, but having knowledge and skills to used both tools X and Y makes it more likely that both X and Y were used together on some projects. The results we see are consistent with this assumption.
The top tools show surprising stability - we see essentially the same pattern as last year.
First, we selected the tools with at least 20% of the vote. There were 11 such tools - exactly the same list of 11 tools as last year, although the order has changed a little. Keras moved up from n. 10 to n. 8, and Anaconda moved up from n. 6 to n. 5. Tableau and SQL moved down a little.
The cutoff for this group of 11 is a natural one, since there is a big gap between n. 11 (Apache Spark, with 21%) and n. 12 (Microsoft Power BI, 13%).
We used the same Lift measure as in our 2017 analysis and 2018 analysis.
We then grouped together the tools with the strongest association, starting with Tensorflow and Keras, until we arrived to the figure 1 below. We made the patterns easier to see by showing only associations with abs(Lift1) > 15%.
Fig. 1: Data Science, Machine Learning Top Tools Associations, 2019
The bar length corresponds to absolute value of lift1, and the color is the value of lift (green - positive association, red - negative association).
We note a group of 6 primary tools that make the modern open source data science ecosystem: Python, Anaconda, scikit-learn, Tensorflow, Keras, and Apache Spark. This is exactly the same group as last year - see below.
Rapidminer has a small negative association with all of the tools above and does not go strongly with any other tools.
R has small positive associations with Keras, Apache Spark, SQL, and Tableau.
The second group includes the 3 supporting tools for Data Science and Machine Learning, which are frequently used together: SQL, Excel, and Tableau.
Note that this chart is symmetrical relative to diagonal (top right triangle is equal to bottom left), but we included both triangles because the patterns are easier to see in the full chart.
Lift (X & Y) = pct (X & Y) / ( pct (X) * pct (Y) )
where pct(X) is the percent of users who selected X.
Lift (X&Y) > 1 indicates that X&Y appear together more than expected if they were independent,
Lift=1 if X & Y appear with frequency expected if they are independent, and
Lift < 1 if X & Y appear together less than expected (negatively correlated)
To make the differences from one easier to see we define
Lift1 (X & Y) = Lift (X & Y) - 1
Python vs R
Next we examine Python vs R.
Let with_Py(X)= % of tool X usage with Python, and with_R(X) % of tool X usage with R. To visualize how close is each tool to Python or R, we used a very simple measure Bias_Py_R(X) = with_Py(X) - with_R(X), which is positive if tool is more used with Python and negative if it is more used with R.
In Fig. 2, we charted the bias of most popular tools with at least 90 votes, and as we can see, almost every tool is biased towards Python. The only 3 exceptions are R (obviously), Microsoft SQL Server, and SAS Base (which is exactly zero bias). For comparison, in a similar 2017 analysis there were 10 tools biased towards R and 3 R-biased tools in 2018.
R is an excellent platform with tremendous depth and width, which is widely used for data analysis and visualization, and it still has about 50% share. However, going forward, we expect more development and energy around Python ecosystem.
Fig. 2: Data Science, Machine Learning Platforms 2019: Python vs R bias
I don't think that the relative stability of the share and associations of top 11 platforms suggest the end of innovation, but perhaps only a pause before another major system - perhaps something related to AutoML, will disrupt the current ecosystem.
Big Data and Deep Learning
Finally, we look DS/ML platforms and languages relationship to Big Data (Hadoop and Spark tools) and Deep Learning.
Big Data tools were used by 37.4% up from 33% in both 2018 and 2017 polls. Despite this increase, most Data Scientists still work with medium / small data that does not require Hadoop / Spark.
The fraction of Deep Learning tools grew to 50% vs 43% in 2018 poll and 32% in 2017.
For each tool X, we compute how frequently it was included by the same voter with Big Data (Spark/Hadoop tools) - vertical axis, and with Deep Learning tools (horizontal axis).
Here is a chart with top tools (with at least 50 votes), excluding Deep Learning and Big Data tools themselves.
Fig. 3: KDnuggets 2019 Data Science, Machine Learning Poll: Deep Learning vs Big Data affinity
We note that Scala is the most used language with both Deep Learning and Big Data. The chart is heavy on the lower right side, with almost every tool being used more with Deep Learning than with Big Data tool.
Interestingly, the tools most associated with Deep Learning are XGBoost and LightGBM.
Here is a table which shows the affinity of different platforms to Big Data and Deep Learning, sorted by affinity with Deep Learning tools.
Table 1: Top Data Science/ML Software and its affinity to Big Data and Deep Learning
|Other free DS tools||145||32%||60%|
|Other prog lang||94||40%||60%|
|MS Power BI||217||38%||46%|
|IBM SPSS Statistics||87||20%||25%|
- Python leads the 11 top Data Science, Machine Learning platforms: Trends and Analysis.
- Emerging Ecosystem: Data Science and Machine Learning Software, Analyzed
- The 6 components of Open-Source Data Science/ Machine Learning Ecosystem; Did Python declare victory over R?