The 6 components of Open-Source Data Science / Machine Learning Ecosystem; Did Python declare victory over R?

We find that 6 tools form the modern open-source Data Science / Machine Learning ecosystem, examine whether Python declared victory over R, and review which tools are most associated with Deep Learning and Big Data.

In May, we reported initial results of the 19th annual KDnuggets Software Poll in "Python eats away at R: Top Software for Analytics, Data Science, Machine Learning in 2018: Trends and Analysis".

Here we take a more detailed look at which tools go well together. The emerging ecosystem of open-source, Python-friendly Data Science tools we identified last year has received a new entry - see below.

We provide a link to the anonymized dataset at the end of the post - let me know what else you find in the data, and please publish or email me the results.

First, we look at which tools go together. To keep the charts readable, we selected the tools with at least 400 votes. There were 11 such tools, and the cutoff is a natural one because there was a big gap between no. 11 (Apache Spark, with 442 votes) and no. 12 (Java, 309 votes).

There are many ways to measure the significance of associations between two binary features, such as the chi-square test or the t-test, but we used the same Lift measure as in our 2016 analysis and 2017 analysis.

We then grouped together the tools with the strongest associations, starting with Tensorflow and Keras, until we arrived at Fig. 1 below. To reduce clutter, we also filtered it to show only associations with abs(Lift1) > 15%.

Fig. 1: Data Science, Machine Learning Top Tools Associations, 2018
The bar length corresponds to the absolute value of Lift1, and the color shows the value of Lift (green: stronger association, red: weaker one). The number before each tool is its popularity rank in the KDnuggets 2018 Software Poll, e.g. Python was no. 1, RapidMiner no. 2, etc.

We note a group of 6 primary tools that together make up the modern open-source data science ecosystem: Python, Anaconda, scikit-learn, Tensorflow, Keras, and Apache Spark.

RapidMiner has a small negative association with all of the tools above and does not go strongly with any other tool.

R has small positive associations with Apache Spark, SQL, and Tableau.

The second group that emerges consists of 3 supporting tools for Data Science and Machine Learning that are frequently used together: SQL, Excel, and Tableau.

We note that although the chart is symmetric about the diagonal (the top right triangle mirrors the bottom left), the patterns are easier to see in the full chart than in half of it.

Lift Definition:
Lift (X & Y) = pct (X & Y) / ( pct (X) * pct (Y) )

where pct(X) is the percent of users who selected X.

Lift (X&Y) > 1 indicates that X&Y appear together more than expected if they were independent,
Lift=1 if X & Y appear with frequency expected if they are independent, and
Lift < 1 if X & Y appear together less than expected (negatively correlated)

To make the differences from 1 easier to see, we define
Lift1 (X & Y) = Lift (X & Y) - 1
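
As a minimal sketch of the definition above, the Python below computes Lift and Lift1 from a list of per-respondent tool sets. The helper names (pct, lift, lift1) and the toy votes are illustrative assumptions, not the poll data or the code behind Fig. 1. Note that pct is expressed here as a fraction between 0 and 1, so that independent tools give Lift = 1.

```python
import math

def pct(votes, tool):
    """Fraction (0..1) of respondents whose vote set contains `tool`."""
    return sum(1 for v in votes if tool in v) / len(votes)

def lift(votes, x, y):
    """Lift(X & Y) = pct(X & Y) / (pct(X) * pct(Y))."""
    p_xy = sum(1 for v in votes if x in v and y in v) / len(votes)
    p_x, p_y = pct(votes, x), pct(votes, y)
    if p_x == 0 or p_y == 0:
        return math.nan  # undefined when either tool has no votes
    return p_xy / (p_x * p_y)

def lift1(votes, x, y):
    """Lift1(X & Y) = Lift(X & Y) - 1, so 0 means 'as expected if independent'."""
    return lift(votes, x, y) - 1

# Hypothetical toy votes (one set of tools per respondent), for illustration only.
toy_votes = [
    {"Python", "Keras"},
    {"Python", "Keras"},
    {"Python"},
    {"R Language"},
]

keras_python = lift1(toy_votes, "Python", "Keras")
# Same threshold as the chart filter: keep only abs(Lift1) > 15%.
shown_in_chart = abs(keras_python) > 0.15
```

With these toy votes, Python and Keras co-occur more often than independence would predict, so Lift1 is positive and the pair passes the 15% filter. Python and "R Language" never co-occur, so their Lift is 0 and their Lift1 is -1.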

Python vs R

Next we examine Python vs R.

Let with_Py(X) = % of tool X usage with Python, and with_R(X) = % of tool X usage with R. To visualize how close each tool is to Python or R, we used a very simple measure, Bias_Py_R(X) = with_Py(X) - with_R(X), which is positive if the tool is used more with Python and negative if it is used more with R.
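
The bias measure can be sketched in a few lines of Python. This sketch assumes that "% of tool X usage with Python" means the share of respondents who selected X and also selected Python. The function names and the toy votes are hypothetical, and shares are fractions between 0 and 1 rather than percentages.

```python
import math

def used_with(votes, tool, partner):
    """Share (0..1) of respondents who selected `tool` and also selected `partner`."""
    users = [v for v in votes if tool in v]
    if not users:
        return math.nan  # no respondent selected `tool`
    return sum(1 for v in users if partner in v) / len(users)

def bias_py_r(votes, tool, py="Python", r="R Language"):
    """Bias_Py_R(X) = with_Py(X) - with_R(X); positive means closer to Python."""
    return used_with(votes, tool, py) - used_with(votes, tool, r)

# Hypothetical toy votes, for illustration only.
toy_votes = [
    {"Keras", "Python"},
    {"Keras", "Python", "R Language"},
    {"SAS Base", "R Language"},
    {"SAS Base", "Python", "R Language"},
]

keras_bias = bias_py_r(toy_votes, "Keras")     # 1.0 - 0.5
sas_bias = bias_py_r(toy_votes, "SAS Base")    # 0.5 - 1.0
```

In this toy example Keras comes out biased towards Python and SAS Base towards R, which is the kind of split a bias chart would show as bars on opposite sides of zero.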

In Fig. 2, we charted the bias of the most popular tools (those with at least 100 votes), and as we can see, almost every tool is biased towards Python. The only 2 exceptions are IBM SPSS Statistics and SAS Base. For comparison, in a similar 2017 analysis there were 10 such tools, including SAS Base, Microsoft tools, Weka, RapidMiner, Tableau, and KNIME, and almost all of them have since become more used along with Python.

Fig. 2: KDnuggets 2018 Data Science, Machine Learning Poll: Python vs R bias

Did Python declare victory over R?

I don't think so, because R is an excellent platform with tremendous depth and breadth, which is widely used for data analysis and visualization, and it still has about 50% share. I expect R to be used by many data scientists for a long time, but going forward, I expect more development and energy around the Python ecosystem.

Big Data and Deep Learning

Big Data tools (Spark / Hadoop) were used by 33% of respondents in the KDnuggets 2018 Software Poll, exactly the same fraction as in 2017. This suggests that most Data Scientists work with medium or small data that does not require Hadoop / Spark, or that they use other data-in-the-cloud solutions.

However, the share of respondents using Deep Learning tools grew to 43% from 32%.

For each tool X, we compute how frequently it is used with Spark/Hadoop tools (vertical axis), and how frequently it is used with Deep Learning tools (horizontal axis).
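
A rough sketch of this computation follows, assuming each record carries the "Votes", "With DL", and "With BD" fields described in the column list at the end of the post. The affinity function name and the toy records are hypothetical.

```python
def affinity(records, tool):
    """Return (share used with Deep Learning, share used with Big Data) for `tool`.

    Each share is the fraction (0..1) of respondents who selected `tool`
    and whose "With DL" / "With BD" flag is 1. Returns None if nobody
    selected the tool.
    """
    users = [
        rec for rec in records
        if tool in [v.strip() for v in rec["Votes"].split(";")]
    ]
    if not users:
        return None
    n = len(users)
    with_dl = sum(rec["With DL"] for rec in users) / n
    with_bd = sum(rec["With BD"] for rec in users) / n
    return with_dl, with_bd

# Hypothetical toy records, for illustration only.
toy_records = [
    {"Votes": "Scala;Apache Spark;Tensorflow", "With DL": 1, "With BD": 1},
    {"Votes": "Scala;Apache Spark", "With DL": 0, "With BD": 1},
    {"Votes": "Excel;Tensorflow", "With DL": 1, "With BD": 0},
    {"Votes": "Excel", "With DL": 0, "With BD": 0},
]

scala_point = affinity(toy_records, "Scala")   # (horizontal, vertical) position
excel_point = affinity(toy_records, "Excel")
```

Each returned pair can be read as a point on the chart: the first value is the horizontal (Deep Learning) coordinate and the second is the vertical (Spark/Hadoop) coordinate.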

Here is a chart of the top tools (those with over 100 votes), excluding the Deep Learning and Big Data tools themselves.

Fig. 3: KDnuggets 2018 Data Science, Machine Learning Poll: Deep Learning vs Spark/Hadoop affinity

We note that Scala is the language most used with both Deep Learning and Big Data. The chart is heavy below the diagonal, with almost every tool being used more with Deep Learning than with Big Data tools.

Here is the link to the anonymized poll data in CSV format, with columns:
  • Nrand: record id (randomized, records not in order of voting)
  • region: usca: US/Canada, euro: Europe, asia, ltam: Latin America, afme: Africa/Middle East, aunz: Australia/New Zealand
  • Python: 1 if Votes (last column) includes Python, 0 otherwise
  • RapidMiner: 1 if Votes includes RapidMiner, 0 otherwise.
  • R language: 1 if Votes includes "R Language", 0 otherwise. We used "R Language" instead of R for ease of regex matching
  • SQL Language: 1 if Votes includes "SQL Language", 0 otherwise.
  • Excel: 1 if Votes includes Excel, 0 otherwise.
  • Anaconda: 1 if Votes includes Anaconda, 0 otherwise.
  • Tensorflow: 1 if Votes includes Tensorflow, 0 otherwise.
  • Tableau: 1 if Votes includes Tableau, 0 otherwise.
  • scikit-learn: 1 if Votes includes scikit-learn, 0 otherwise.
  • Keras: 1 if Votes includes Keras, 0 otherwise.
  • Apache Spark: 1 if Votes includes Apache Spark, 0 otherwise.
  • With DL: 1 if Votes includes Deep Learning tools, 0 otherwise.
  • With BD: 1 if Votes includes Big Data tools, 0 otherwise.
  • ntools: number of tools in Votes
  • Votes: list of votes, separated by a semicolon ";"
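
As a starting point for exploring the data, here is a minimal sketch that parses a file with this layout using only the Python standard library. The inline sample rows are made up, the header spelling is assumed to match the column names listed above, and only a few of the columns are included to keep the example short.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the described layout (subset of columns, made-up rows).
SAMPLE = """Nrand,region,Python,ntools,Votes
101,usca,1,3,Python;Anaconda;scikit-learn
102,euro,0,2,R Language;SQL Language
103,asia,1,2,Python;Tensorflow
"""

def load_poll(f):
    """Read poll records from an open CSV file and split Votes on ';'."""
    rows = []
    for rec in csv.DictReader(f):
        votes = [v.strip() for v in rec["Votes"].split(";") if v.strip()]
        rows.append({
            "region": rec["region"],
            "python": int(rec["Python"]),
            "ntools": int(rec["ntools"]),
            "votes": votes,
        })
    return rows

def tool_counts(rows):
    """Count how many respondents voted for each tool."""
    return Counter(tool for row in rows for tool in row["votes"])

def popular_tools(rows, min_votes):
    """Tools with at least `min_votes` votes, most popular first."""
    return [t for t, n in tool_counts(rows).most_common() if n >= min_votes]

rows = load_poll(io.StringIO(SAMPLE))
counts = tool_counts(rows)
top = popular_tools(rows, min_votes=2)

# For a real file, something like (the filename is an assumption):
# with open("poll.csv", newline="") as f:
#     rows = load_poll(f)
```

On the full dataset, calling popular_tools with min_votes=400 would correspond to the cutoff used for the top-tools chart, and the per-tool flag columns can be cross-checked against the parsed Votes list.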
Let me know what you find!