Emerging Ecosystem: Data Science and Machine Learning Software, Analyzed
We examine which top tools are "friends", their Python vs R bias, and which work well with Spark/Hadoop and Deep Learning, and identify an emerging Big Data Deep Learning ecosystem.
Last month we reported on the results of the 18th annual KDnuggets Software Poll:
New Leader, Trends, and Surprises in Analytics, Data Science, Machine Learning.
Here is a more detailed look at which tools go well with each other, and which don't. We also find an emerging Python-friendly ecosystem of tools that are commonly used with the two leading edges of data science: Big Data (Spark/Hadoop) and Deep Learning.
A link to the anonymized dataset is at the end of this post; analyze the data yourself and publish or send me the results.
First, we look at the associations between top tools.
We have selected the tools with at least 500 votes; this year there are 11 such entries.
There are many ways to measure how significant the association is between two nominal or binary features, such as the chi-square test or the t-test, but we used the same simple measure as in our 2015 analysis and 2016 analysis. We define "Lift" as
Lift (X & Y) = pct (X & Y) / ( pct (X) * pct (Y) )
where pct(X) is the percent of users who selected X.
Lift (X & Y) > 1 indicates that X and Y appear together more often than expected if they were independent,
Lift (X & Y) = 1 if X and Y appear together as often as expected if they were independent, and
Lift (X & Y) < 1 if X and Y appear together less often than expected (negatively associated).
To make the differences from 1 easier to see we define
Lift1 (X & Y) = Lift (X & Y) - 1
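The two definitions above can be sketched in a few lines of Python. This is only an illustration, not the code used for the poll analysis: the ballots are made up, the function names are hypothetical, and pct is treated as a fraction between 0 and 1 (if pct were expressed in percentage points, the ratio would need to be rescaled by 100).

```python
def pct(votes, tool):
    """Fraction (0..1) of respondents who selected `tool`."""
    return sum(tool in ballot for ballot in votes) / len(votes)


def lift(votes, x, y):
    """Lift(X & Y) = pct(X & Y) / (pct(X) * pct(Y))."""
    both = sum(x in ballot and y in ballot for ballot in votes) / len(votes)
    return both / (pct(votes, x) * pct(votes, y))


def lift1(votes, x, y):
    """Lift1(X & Y) = Lift(X & Y) - 1, so that 0 means independence."""
    return lift(votes, x, y) - 1


# Made-up toy ballots (NOT the poll data): each set is one respondent's tools.
votes = [
    {"Python", "Anaconda"},
    {"Python", "Anaconda"},
    {"Python", "R"},
    {"R", "Tableau"},
    {"R"},
    {"Excel", "Tableau"},
    {"Python", "Spark"},
    {"Excel"},
]

print(lift(votes, "Python", "Anaconda"))   # 2.0: together twice as often as expected
print(lift1(votes, "Python", "R"))         # negative: together less often than expected
```

In this toy data half of the respondents use Python and a quarter use Anaconda, so independence would predict 12.5% using both; the observed 25% gives a Lift of 2 and a Lift1 of +100%.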
Fig. 1 below shows the pairwise Lift1 between the top 11 tools, filtered to show only associations with abs(Lift1) > 15%.
Fig. 1: Data Science, Machine Learning Top Tools Associations, 2017
Color is green for positive association, red for negative
Label is Lift1 as explained above; bar width is proportional to Lift1.
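For readers who want to reproduce this kind of filtered pairwise view from the anonymized dataset, here is a minimal sketch. It assumes each response has already been parsed into a set of tool names; the function name, the toy ballots, and the tool list are hypothetical, and only the 15% threshold comes from the text above.

```python
from itertools import combinations


def pairwise_lift1(votes, tools, threshold=0.15):
    """Return {(x, y): Lift1} for tool pairs with abs(Lift1) > threshold."""
    n = len(votes)
    share = {t: sum(t in ballot for ballot in votes) / n for t in tools}
    strong = {}
    for x, y in combinations(tools, 2):
        if share[x] == 0 or share[y] == 0:
            continue  # Lift is undefined for a tool nobody selected
        both = sum(x in ballot and y in ballot for ballot in votes) / n
        value = both / (share[x] * share[y]) - 1
        if abs(value) > threshold:
            strong[(x, y)] = value
    return strong


# Made-up toy ballots (NOT the poll data).
votes = [
    {"Python", "Anaconda", "Tableau"},
    {"Python", "Anaconda"},
    {"Python", "Tableau"},
    {"Python"},
    {"R", "Tableau"},
    {"R", "Tableau"},
    {"R"},
    {"R"},
]
tools = ["Python", "R", "Anaconda", "Tableau"]

strong = pairwise_lift1(votes, tools)
for (x, y), value in sorted(strong.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{x} & {y}: {value:+.0%}")
```

In the toy data Tableau is used equally often with and without each of the other tools, so all of its pairs have Lift1 = 0 and are filtered out, while the remaining three pairs pass the threshold.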
We note that Python has a significant positive association not only with Anaconda, Tensorflow, and scikit-learn (as expected), but also with Spark.
R has weaker associations than Python among the more popular tools.
RapidMiner has mostly negative associations with other top tools, except Tableau. Excel users also like Tableau. Spark's best friends are Tensorflow and scikit-learn.
Python, Spark, Anaconda, Tensorflow, and scikit-learn are frequently used together and appear to form the core of the emerging Python-based Big Data and Deep Learning ecosystem.
Python vs R
Next we examine the affinity of the top 30 tools with Python vs R.
Let with_Py(X) = % of tool X usage with Python, and with_R(X) = % of tool X usage with R. To visualize the affinities, we used a very simple measure, Bias_Py_R(X) = log2(with_Py(X) / with_R(X)), which is positive if the tool is used more with Python and negative if it is used more with R. One can correct for the relative frequencies of Python and R, but since they were almost equal in 2017, the correction would be insignificant.
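As a rough illustration of Bias_Py_R, the sketch below reads with_Py(X) as the share of tool X's users who also selected Python, and with_R(X) likewise for R; that reading is an assumption. The ballots are made up and the function name is hypothetical.

```python
import math


def bias_py_r(votes, tool, py="Python", r="R"):
    """log2(with_Py / with_R) for `tool`; None when the ratio is undefined."""
    users = [ballot for ballot in votes if tool in ballot]
    if not users:
        return None  # nobody selected the tool
    with_py = sum(py in ballot for ballot in users) / len(users)
    with_r = sum(r in ballot for ballot in users) / len(users)
    if with_py == 0 or with_r == 0:
        return None  # log2 of 0 or division by 0: bias is undefined
    return math.log2(with_py / with_r)


# Made-up toy ballots (NOT the poll data).
votes = [
    {"Python", "Keras"},
    {"Python", "Keras"},
    {"Python", "Keras"},
    {"Python", "R", "Keras"},
    {"R", "Weka"},
    {"R", "Python", "Weka"},
]

print(bias_py_r(votes, "Keras"))  # 2.0: used 4x as often with Python as with R
print(bias_py_r(votes, "Weka"))   # -1.0: used half as often with Python as with R
```

Because the measure is a base-2 logarithm, each unit corresponds to a factor of two: +1 means twice as much usage with Python as with R, and -1 means the reverse.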
Fig. 2: Python vs R Associations for Top 30 Data Science, Machine Learning Tools, 2017
Bar length is Bias_Py_R as defined above; bar height is the popularity of the tool.
We note that Python's friends include not only scikit-learn, PyCharm, and Anaconda, which is expected, but also the Deep Learning tools Keras and Tensorflow, and notably Spark and Scala.
R's best friends include SAS Base, Microsoft tools (expected since Microsoft bought Revolution Analytics), Weka, and Tableau.
Next, we examine how well different tools play with Big Data and Deep Learning; see the next page.


