Gold Blog, May 2017
New Leader, Trends, and Surprises in Analytics, Data Science, Machine Learning Software Poll

Python caught up with R and (barely) overtook it; Deep Learning usage surges to 32%; RapidMiner remains top general Data Science platform; Five languages of Data Science.
 
 



The 18th annual KDnuggets Software Poll again got huge participation from the analytics and data science community and vendors, attracting about 2,900 voters, almost exactly the same as last year. This post has the initial analysis; a more detailed look is in Emerging Ecosystem: Data Science and Machine Learning Software, Analyzed.

Python, whose usage has been growing faster than R's for the last several years, has finally caught up with R and (barely) overtaken it, with 52.6% of respondents using it vs 52.1% for R.

The biggest surprise is probably the phenomenal growth in the share of Deep Learning tools, now used by 32% of all respondents, up from 18% in 2016 and 9% in 2015. Google Tensorflow rapidly became the leading Deep Learning platform with 20.2% usage, up from only 6.8% in the 2016 poll, and entered the top 10 tools.

While in 2014 I wrote about Four main languages for Analytics, Data Mining, Data Science being R, Python, SQL, and SAS, the 5 main languages of Data Science in 2017 appear to be Python, R, SQL, Spark, and Tensorflow.

RapidMiner remains the most popular general platform for data mining/data science, with about 33% usage, almost exactly the same as in 2016.

We note that many vendors have encouraged their users to vote, but all vendors had equal chances, so this does not violate KDnuggets guidelines. We have not seen any bot voting or direct links to vote for only one tool this year.

Spark grew to about 23% and kept its place in the top 10, ahead of Hadoop.

Besides TensorFlow, another new tool in the top tier is Anaconda, with 22% usage.

Top Analytics/Data Science Tools


Fig 1: KDnuggets Analytics/Data Science 2017 Software Poll: top tools in 2017, and their usage in the 2015-2016 polls

Here are the top 11 tools, which all passed the threshold of 500 votes.

Table 1: Top Analytics/Data Science Tools in 2017 KDnuggets Poll

Tool          2017 % usage   % change 2017 vs 2016   % alone
Python        52.6%          15%                     0.2%
R language    52.1%          6.4%                    3.3%
SQL language  34.9%          -1.8%                   0%
RapidMiner    32.8%          0.7%                    13.6%
Excel         28.1%          -16%                    0.1%
Spark         22.7%          5.3%                    0.2%
Anaconda      21.8%          37%                     0.8%
Tensorflow    20.2%          195%                    0%
scikit-learn  19.5%          13%                     0%
Tableau       19.4%          5.0%                    0.4%
KNIME         19.1%          6.3%                    2.4%

In this table, 2017 % usage is the percent of voters who used the tool; % change is the relative change in usage vs the 2016 Software Poll, with green and red highlighting changes up and down of 5% or more; and % alone is the percent of voters who used only the reported tool among all voters who used that tool. For example, 3.3% of R voters reported using only R and nothing else. This year there were 13 tools with 5% or more lone votes.
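The % change column can be approximated from the usage shares quoted in this post. The snippet below is a minimal sketch, not the poll's actual calculation: the function name `relative_change` is made up for illustration, and it works from the rounded percentages printed here rather than the raw vote counts.

```python
def relative_change(new_share: float, old_share: float) -> float:
    """Relative change in usage, in percent, between two usage shares."""
    if old_share <= 0:
        raise ValueError("old_share must be positive")
    return (new_share - old_share) / old_share * 100.0


# Usage shares in percent of voters, as quoted in this post: (2016, 2017).
shares = {
    "Python": (45.8, 52.6),
    "R language": (49.0, 52.1),
    "Tensorflow": (6.8, 20.2),
}

for tool, (old, new) in shares.items():
    print(f"{tool}: {relative_change(new, old):+.1f}%")
```

Because the inputs are already rounded, the recomputed values can differ by a point or two from the published column (for example, Tensorflow comes out slightly above the reported 195%). The published figures were presumably derived from unrounded counts, though that is an assumption.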

Average number of tools per respondent was 6.1, almost unchanged from 6.0 in 2016.
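To make the three statistics concrete (% usage, % alone, and average tools per respondent), here is a small sketch over made-up ballots. The `ballots` data and the helper names are hypothetical and only illustrate the definitions; they are not poll data.

```python
from collections import Counter

# Hypothetical ballots: each set holds the tools one respondent reported using.
ballots = [
    {"Python", "R", "SQL"},
    {"R"},
    {"Python", "SQL"},
    {"RapidMiner"},
    {"Python", "R", "Spark", "SQL"},
]

n_voters = len(ballots)
# Number of voters who selected each tool.
usage = Counter(tool for ballot in ballots for tool in ballot)
# Number of voters who selected that tool and nothing else.
alone = Counter(next(iter(ballot)) for ballot in ballots if len(ballot) == 1)


def pct_usage(tool: str) -> float:
    """Percent of all voters who used the tool."""
    return 100.0 * usage[tool] / n_voters


def pct_alone(tool: str) -> float:
    """Percent of the tool's voters who used only that tool."""
    return 100.0 * alone[tool] / usage[tool]


avg_tools = sum(len(ballot) for ballot in ballots) / n_voters

print(f"R: {pct_usage('R'):.1f}% usage, {pct_alone('R'):.1f}% alone")
print(f"Average tools per respondent: {avg_tools:.1f}")
```

Note that % alone is taken over the tool's own voters, not over all voters, which is why a tool with modest overall usage can still show a high lone-vote share.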

Compared to the 2016 KDnuggets Analytics/Data Science Poll results, the 2 newcomers in the top 11 are Anaconda and Tensorflow.

The participation by region was:
  • US/Canada (41.5%),
  • Europe (35.5%),
  • Asia (10.1%),
  • Latin America (6.5%),
  • Africa/MidEast (3.8%),
  • Australia/NZ (2.7%).
Compared to 2016, we note slightly less participation from Europe, and slightly more from all other regions.

Trends

Notable new tools in the poll with over 2% usage are Keras (9.5%), PyCharm (9%), Microsoft R Server (4.3%), IBM DSX (3.0%), PyTorch (3.0%), and Teradata (2.4%).

The table below lists the tools that have grown 20% or more in usage and reached at least 2% usage in 2017. Note this includes 5 Deep Learning tools and 4 Microsoft tools.

Table 2: Major Analytics/Data Science Tools with the largest increase in usage

Tool                                    % change   2017 % usage   2016 % usage
Microsoft CNTK                          294%       3.4%           0.9%
Tensorflow                              195%       20.2%          6.8%
Microsoft Power BI                      84%        10.2%          5.6%
Alteryx                                 76%        5.3%           3.0%
SQL on Hadoop tools                     42%        10.3%          7.3%
Microsoft other ML/Data Science tools   40%        2.2%           1.6%
Anaconda                                37%        21.8%          16.0%
Caffe                                   32%        3.1%           2.3%
Orange                                  30%        4.0%           3.1%
DL4J                                    30%        2.2%           1.7%
Other Deep Learning Tools               30%        4.8%           3.7%
Microsoft Azure Machine Learning        26%        6.4%           5.1%


DataRobot just missed the 2% usage, but grew from 0.5% in 2016 to 1.9% in 2017.
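The selection rule for the fastest-growing tools (growth of 20% or more and at least 2% usage in 2017) can be sketched as a simple filter. This is an illustrative approximation: the function name `fastest_growing` and its parameters are assumptions, and growth is recomputed from the rounded shares quoted in this post, so it will not exactly match the published % change figures.

```python
# Usage shares in percent of voters, copied from this post: (2016, 2017).
shares = {
    "Microsoft CNTK": (0.9, 3.4),
    "Tensorflow": (6.8, 20.2),
    "Anaconda": (16.0, 21.8),
    "DataRobot": (0.5, 1.9),
    "Python": (45.8, 52.6),
}


def fastest_growing(shares, min_growth=20.0, min_usage=2.0):
    """Return (tool, growth %) pairs passing both thresholds, largest growth first."""
    rows = []
    for tool, (old, new) in shares.items():
        growth = (new - old) / old * 100.0
        if growth >= min_growth and new >= min_usage:
            rows.append((tool, growth))
    return sorted(rows, key=lambda row: row[1], reverse=True)


fast_growers = fastest_growing(shares)
for tool, growth in fast_growers:
    print(f"{tool}: +{growth:.0f}%")
```

DataRobot shows why both thresholds matter: its relative growth is very large, but it is excluded because its 2017 usage (1.9%) falls just under the 2% floor. Python passes the usage floor but grew less than 20%.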

We note that among tools with 2% or higher usage in 2016, 22 have increased in usage, while 27 have dropped. This suggests that the Data Science platform market is still innovating and there is no consolidation yet.

Tools that had at least 2% usage in 2016 and declined 20% or more in usage in 2017 are listed in the next table. Turi and Salford have been bought recently, Perl and Octave are losing to Python and R, RapidInsight perhaps did not remind its users to vote, QlikView is probably losing to Tableau, and C4.5 is old technology. What is interesting is the decline in usage for Hadoop Open Source Tools, MLlib, and Other free analytics/data mining tools.

Table 3: Major Analytics/Data Science Tools with the largest decline in usage

Tool                                     % change   2017 % usage   2016 % usage
Turi (former Dato/GraphLab)              -93%       0.2%           2.4%
RapidInsight/Veera                       -92%       0.2%           3.0%
Salford SPM/CART/RF/MARS/TreeNet         -89%       0.4%           3.5%
MLlib                                    -61%       4.5%           11.6%
C4.5/C5.0/See5                           -38%       1.2%           2.0%
Hadoop: Open Source Tools                -32%       15.0%          22.1%
Other free analytics/data mining tools   -29%       4.8%           6.8%
Rattle                                   -28%       2.6%           3.6%
Perl                                     -27%       1.7%           2.3%
Pentaho                                  -23%       1.8%           2.3%
Gnu Octave                               -22%       2.4%           3.1%
QlikView                                 -21%       4.2%           5.3%


Deep Learning Tools

The usage of Deep Learning tools jumped to 32% of all respondents, vs only 18% in 2016 and 9% in 2015.

Google Tensorflow is the dominant platform, displacing last year's leader, Theano/Pylearn2.

Top tools are:
  • Tensorflow, 20.2% usage
  • Keras, 9.5%
  • Theano, 5.8%
  • Other Deep Learning Tools, 4.8%
  • Microsoft CNTK, 3.4%
  • Caffe, 3.1%
  • PyTorch, 3.0%
  • DL4J, 2.2%
  • mxnet, 1.8%
  • Torch, 1.2%
  • Lasagne, 0.9%

Hadoop/Big Data Tools

We have simplified the choices on Hadoop/Spark tools to Hadoop: Commercial/Open Source Tools, SQL on Hadoop, and Spark. Together they were used by 33% of all respondents. This is lower than the 39% in 2016, but more tools were counted as Big Data in 2016. In 2015, 29% used Spark/Hadoop tools.

In 2017 the Big Data tools usage was:
  • Spark, 22.7%
  • Hadoop: Open Source Tools, 15.0%
  • SQL on Hadoop tools, 10.3%
  • Hadoop: Commercial Tools, 7.6%

Programming Languages

Among the main languages listed below, Python and R grew in usage, while SQL, Java, Unix shell tools, C/C++, and Perl declined and Julia was unchanged.

Here are the main programming languages sorted by popularity.
  • Python, 52.6% usage (was 45.8% in 2016), 15% up
  • R language, 52.1% (was 49.0%), 6% up
  • SQL, 34.9% (was 35.5%), 2% down
  • Java, 13.8% (was 16.8%), 18% down
  • Unix shell/awk/gawk, 9.6% (was 10.4%), 7% down
  • C/C++, 6.3% (was 7.3%), 13% down
  • Perl, 1.7% (was 2.3%), 27% down
  • Julia, 1.1% (was 1.1%), no change
Python keeps growing and sucking oxygen from competitors like Julia, which surprisingly did not grow its usage.

The next page shows the full poll results and 3-year trends.