Gold Blog, Jun 2017
Emerging Ecosystem: Data Science and Machine Learning Software, Analyzed

We examine which top tools are "friends", their Python vs R bias, and which work well with Spark/Hadoop and Deep Learning, and identify an emerging Big Data Deep Learning ecosystem.

Last month we reported on the results of the 18th annual KDnuggets Software Poll: New Leader, Trends, and Surprises in Analytics, Data Science, Machine Learning.

Here is a more detailed look at which tools go well with each other, and which don't. We also find an emerging Python-friendly ecosystem of tools that are commonly used with the two leading edges of data science: Big Data (Spark/Hadoop) and Deep Learning.

A link to the anonymized dataset is at the end of this post - analyze the data yourself and publish or send me the results.

First, we look at the associations between top tools.

We have selected the tools with at least 500 votes - this year there are 11 such entries.

There are many ways to measure how significant the association between two nominal or binary features is, such as the chi-square test or the t-test, but we used the same simple measure as in our 2015 analysis and 2016 analysis. We define "Lift" as

Lift (X & Y) = pct (X & Y) / ( pct (X) * pct (Y) )

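where pct(X) is the share of users who selected X, expressed as a fraction (so that Lift = 1 under independence), and pct(X & Y) is the share of users who selected both X and Y.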

Lift (X & Y) > 1 indicates that X & Y appear together more often than expected if they were independent (positively correlated),
Lift (X & Y) = 1 if X & Y appear together with the frequency expected if they were independent, and
Lift (X & Y) < 1 if X & Y appear together less often than expected (negatively correlated).

To make the differences from 1 easier to see we define
Lift1 (X & Y) = Lift (X & Y) - 1
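As a rough illustration of these definitions, here is a minimal Python sketch. It assumes each poll response is available as a set of tool names; the toy `votes` list and the function names (`pct`, `lift`, `lift1`, `strong_pairs`) are made up for this example and are not the actual poll data or the code used for the analysis.

```python
from itertools import combinations


def pct(votes, *tools):
    """Fraction of respondents who selected every tool in `tools`."""
    return sum(all(t in v for t in tools) for v in votes) / len(votes)


def lift(votes, x, y):
    """Lift(X & Y) = pct(X & Y) / (pct(X) * pct(Y)).

    Assumes both tools received at least one vote, so the
    denominator is never zero.
    """
    return pct(votes, x, y) / (pct(votes, x) * pct(votes, y))


def lift1(votes, x, y):
    """Lift1(X & Y) = Lift(X & Y) - 1, so independence maps to 0."""
    return lift(votes, x, y) - 1


def strong_pairs(votes, tools, threshold=0.15):
    """Keep only the pairs whose abs(Lift1) exceeds the threshold."""
    result = {}
    for x, y in combinations(tools, 2):
        value = lift1(votes, x, y)
        if abs(value) > threshold:
            result[(x, y)] = value
    return result


# Hypothetical toy responses, NOT the real poll data.
votes = [
    {"Python", "Spark"},
    {"Python", "Spark"},
    {"Python"},
    {"R"},
    {"R", "Excel"},
    {"R", "Excel"},
    {"Python", "R"},
    {"Excel"},
]

if __name__ == "__main__":
    for pair, value in strong_pairs(votes, ["Python", "R", "Spark", "Excel"]).items():
        print(pair, round(value, 2))
```

In this toy data every Spark user also uses Python, so the Python/Spark pair gets a positive Lift1, while Python and R overlap less than independence would predict and get a negative one.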

Fig. 1 below shows the pairwise lift1 between the top 11 tools, filtered to show only associations with abs(Lift1) > 15%.

Fig. 1: Data Science, Machine Learning Top Tools Associations, 2017
Color is green for positive association, red for negative
Label is Lift1 as explained above; bar width is proportional to Lift1.

We note that Python has a significant positive association not only with Anaconda, Tensorflow, and scikit-learn (as expected), but also with Spark.

R has weaker associations than Python with the more popular tools.

RapidMiner has mostly negative associations with other top tools, except Tableau. Excel users also like Tableau. Spark's best friends are Tensorflow and scikit-learn.

Python, Spark, Anaconda, Tensorflow, and scikit-learn are frequently used together, and this cluster appears to form the core of the emerging Python-based Big Data and Deep Learning ecosystem.

Python vs R

Next we examine the affinity of the top 30 tools with Python vs R.

Let with_Py(X) be the % of tool X usage with Python, and with_R(X) the % of tool X usage with R. To visualize the affinities, we used a very simple measure, Bias_Py_R(X) = log2(with_Py(X)/with_R(X)), which is positive if the tool is used more with Python and negative if it is used more with R. One could correct for the relative frequencies of Python and R, but since they were almost equal in 2017, the correction would be insignificant.
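To make the measure concrete, here is a small Python sketch of Bias_Py_R. It assumes each response is a set of tool names; the toy `votes` list and the helper name `bias_py_r` are invented for illustration and do not reproduce the poll data or the original analysis code. It also assumes tool X is used with both Python and R at least once, since the log ratio is undefined otherwise.

```python
import math


def bias_py_r(votes, tool):
    """Bias_Py_R(X) = log2(with_Py(X) / with_R(X)).

    with_Py(X) and with_R(X) are the shares of tool X's users who
    also selected Python or R, respectively.
    """
    users = [v for v in votes if tool in v]
    if not users:
        raise ValueError(f"no respondent selected {tool!r}")
    with_py = sum("Python" in v for v in users) / len(users)
    with_r = sum("R" in v for v in users) / len(users)
    if with_py == 0 or with_r == 0:
        # The log ratio is undefined when one of the shares is zero.
        raise ValueError(f"{tool!r} is never used with Python or never with R")
    return math.log2(with_py / with_r)


# Hypothetical toy responses, NOT the real poll data.
votes = [
    {"Spark", "Python"},
    {"Spark", "Python"},
    {"Spark", "Python"},
    {"Spark", "Python", "R"},
    {"Weka", "R"},
    {"Weka", "R"},
    {"Weka", "R", "Python"},
    {"Weka", "R"},
]

if __name__ == "__main__":
    for tool in ("Spark", "Weka"):
        print(tool, bias_py_r(votes, tool))
```

Because the measure is a base-2 logarithm, a value of +1 means the tool is used with Python twice as often as with R, and -1 means the reverse.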

Fig. 2: Python vs R associations for the top 30 Data Science, Machine Learning tools, 2017
Bar length is Bias_Py_R as defined above; bar height is the popularity of the tool.

We note that Python's friends include not only scikit-learn, PyCharm, and Anaconda, which is expected, but also the Deep Learning tools Keras and Tensorflow, and notably Spark and Scala.

R's best friends include SAS Base, Microsoft tools (expected since Microsoft bought Revolution Analytics), Weka, and Tableau.

Next, we examine how well different tools play with Big Data and Deep Learning - see the next page.