What Big Data, Data Science, Deep Learning software goes together?
We analyze the associations between top Data Science tools, compare Commercial vs Free/Open Source tools, rank tools on R vs Python bias, find the tools more associated with Big Data and those more associated with Deep Learning, and uncover strong regional differences.
R_Py_bias(X) = (R%(X) / Py%(X)) / (R%(all) / Py%(all)).
Fig. 3 shows R/Python bias for top tools (with >150 votes). To make the patterns easier to see, we chart log2(R_Py_bias), so that tools with R bias get positive values and tools with Python bias get negative values.
Fig. 3: KDnuggets Data Science Software Poll
R/Python bias for top tools
We note that the tools with R bias are mainly blue (commercial), except for KNIME and Weka, while tools with Python bias are Free/Open Source (orange), Languages (green) and Big Data (purple).
Anaconda and scikit-learn, which are based on Python, have the largest Python bias, while SAS, SQL Server, IBM SPSS Statistics, KNIME, and RapidMiner have R bias.
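As a rough illustration of the R_Py_bias formula and its log2 transform, here is a minimal Python sketch. It assumes that R%(X) and Py%(X) are the shares of tool X's users who also used R and Python, and that R%(all) and Py%(all) are the same shares among all voters. The `toy_votes` data and the helper names `share` and `r_py_bias` are hypothetical and not taken from the poll.

```python
import math

# Hypothetical poll data: each set lists the tools one voter reported using.
toy_votes = [
    {"R", "SAS"},
    {"R", "Python", "SAS"},
    {"Python", "scikit-learn"},
    {"R", "Python", "scikit-learn"},
    {"Python"},
    {"R"},
]


def share(votes, tool):
    """Fraction of the given voters who used `tool`."""
    return sum(tool in v for v in votes) / len(votes)


def r_py_bias(votes, tool):
    """R/Python ratio among users of `tool`, relative to the ratio among all voters."""
    users = [v for v in votes if tool in v]
    # Raises ZeroDivisionError if nobody used `tool` or none of its users used Python.
    ratio_tool = share(users, "R") / share(users, "Python")
    ratio_all = share(votes, "R") / share(votes, "Python")
    return ratio_tool / ratio_all


for tool in ("SAS", "scikit-learn"):
    # Positive log2 values lean toward R, negative values lean toward Python.
    print(tool, math.log2(r_py_bias(toy_votes, tool)))
```

In this toy data, R and Python are equally common overall, so SAS (whose users are twice as likely to use R as Python) scores +1 and scikit-learn (the reverse) scores -1.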
Next, we compute Big Data affinity for top tools (with >150 votes). We define it as the number of voters who used both tool X and a Big Data tool, divided by the number of voters who used tool X. We exclude Big Data tools from this list since they all have an affinity of 100%. Overall, 39% of voters have used Big Data tools.
Fig. 4: KDnuggets Data Science Software Poll
Big Data affinity for top tools
Bar height corresponds to tool global share, bar length is the % of tool users that also use Big Data tools, and bar color is the tool type: Commercial, Deep Learning, Free/Open Source or Language.
We note that Scala has an amazing 95% affinity with Big Data. Most of the tools with high Big Data affinity are languages or free tools. It is expected that TensorFlow is frequently used with Big Data (it has 77% affinity), but we also note that 23% of TensorFlow users are still in the learning phase and have not used any Big Data tools.
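The Big Data affinity defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `sample_votes` data, the `BIG_DATA_TOOLS` set, and the function name `big_data_affinity` are hypothetical, and the poll's actual list of Big Data tools is not reproduced here.

```python
# Hypothetical set of tools treated as "Big Data" tools for this sketch.
BIG_DATA_TOOLS = {"Spark", "Hadoop"}

# Hypothetical poll data: each set lists the tools one voter reported using.
sample_votes = [
    {"Scala", "Spark"},
    {"Scala", "Spark", "Python"},
    {"Python", "Hadoop"},
    {"Python"},
    {"Excel"},
]


def big_data_affinity(votes, tool, big_data_tools=BIG_DATA_TOOLS):
    """Share of `tool` users who also used at least one Big Data tool."""
    users = [v for v in votes if tool in v]
    if not users:
        raise ValueError(f"no voters used {tool!r}")
    # A non-empty set intersection means the voter used some Big Data tool.
    with_big_data = sum(1 for v in users if v & big_data_tools)
    return with_big_data / len(users)


# Overall share of voters who used any Big Data tool (39% in the actual poll).
overall = sum(1 for v in sample_votes if v & BIG_DATA_TOOLS) / len(sample_votes)

for tool in ("Scala", "Python", "Excel"):
    print(tool, big_data_affinity(sample_votes, tool))
```

By construction, any tool that is itself in `BIG_DATA_TOOLS` gets an affinity of 1.0, which is why such tools are left out of the chart.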
Next, we compute Deep Learning affinity among top tools using a similar approach.