Python eats away at R: Top Software for Analytics, Data Science, Machine Learning in 2018: Trends and Analysis

Python continues to eat away at R, RapidMiner gains, SQL is steady, Tensorflow advances pulling along Keras, Hadoop drops, Data Science platforms consolidate, and more.



The 19th annual KDnuggets Software Poll had over 2,300 voters, somewhat fewer than in 2017, perhaps because only one vendor, RapidMiner, ran a very active campaign to vote in the KDnuggets poll. On average, a participant selected about 7 different tools, so votes with just one tool selected stood out. We removed about 260 such "lone" votes (mainly from RapidMiner), because even if they represented legitimate users of that tool, their experience was very atypical and would skew the results. To compare "apples" to "apples", I also removed such lone votes from the 2016 and 2017 data (about 11% in 2017 and 12% in 2016), so the 2017 percentages in this blog will be slightly higher for most tools than what was reported in the 2017 post.

Here is my initial analysis, based on 2052 participants, after "lone" voters were removed. More detailed association analysis and anonymized data will be published in about 2 weeks.
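
The "lone" vote filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual poll-processing code: the `responses` list is made-up sample data and `drop_lone_votes` is a hypothetical helper name.

```python
# Hypothetical sample data: each response is the set of tools one voter selected.
responses = [
    {"Python", "SQL", "Tensorflow", "Keras"},
    {"RapidMiner"},                      # "lone" vote: only one tool selected
    {"R", "SQL", "Excel"},
    {"RapidMiner", "Python", "Excel"},
    {"KNIME"},                           # another "lone" vote
]


def drop_lone_votes(votes):
    """Keep only responses that selected more than one tool."""
    return [tools for tools in votes if len(tools) > 1]


kept = drop_lone_votes(responses)

# Fraction of responses removed, and average number of tools per remaining voter.
removed_fraction = 1 - len(kept) / len(responses)
avg_tools = sum(len(tools) for tools in kept) / len(kept)

print(f"kept {len(kept)} of {len(responses)} responses")
print(f"removed {removed_fraction:.0%}, average tools per voter {avg_tools:.2f}")
```

Applying the same filter to every year's data is what keeps the year-over-year comparison consistent.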

Top Analytics, Data Science, Machine Learning Tools


Fig 1: KDnuggets Analytics/Data Science 2018 Software Poll: top tools in 2018, and their share in the 2016 and 2017 polls
(* for a more valid comparison, we recomputed the results of the 2016 and 2017 polls to exclude "lone" votes)

Here are the top 11 tools, which all had at least 20% share.

Table 1: Top Analytics/Data Science/ML Software in 2018 KDnuggets Poll

Software        2018 % share    % change 2018 vs 2017
Python          65.6%           11%
RapidMiner      52.7%           65%
R               48.5%           -14%
SQL             39.6%           1%
Excel           39.1%           24%
Anaconda        33.4%           37%
Tensorflow      29.9%           32%
Tableau         26.4%           21%
scikit-learn    24.4%           11%
Keras           22.2%           108%
Apache Spark    21.5%           -15%



Here, 2018 % share is the percentage of voters who used the tool, and % change is the relative change in that share vs the 2017 Software Poll (for example, Python went from 59.0% to 65.6%, an 11% increase), with green and red highlighting changes up and down of 10% or more.
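
To make the % change column concrete, here is a small Python sketch that recomputes it for three tools. The shares are the 2018 and 2017 values given in the Programming Languages list of this post; the function name `relative_change` is just an illustrative choice.

```python
def relative_change(share_now, share_before):
    """Relative change in share, as a percentage of the earlier share."""
    return (share_now - share_before) / share_before * 100


# (2018 share, 2017 share) in percent, taken from this post.
shares = {
    "Python": (65.6, 59.0),
    "R": (48.5, 56.6),
    "SQL": (39.6, 39.2),
}

changes = {
    tool: relative_change(now, before) for tool, (now, before) in shares.items()
}

for tool, change in changes.items():
    print(f"{tool}: {change:+.0f}%")
# Python: +11%
# R: -14%
# SQL: +1%
```

Note that this is a relative change, not a difference in percentage points: Python gained 6.6 points of share, which is 11% of its 2017 share.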

The average number of tools per respondent was 7.0, slightly higher than 6.75 in the 2017 Poll (also excluding one-tool responses).

Compared to the 2017 Software Poll, the one new entry is Keras. KNIME dropped out of the top 11, perhaps because this year they did not run a campaign among their users to vote.

Here are some observations.

Python eats away at R

Python already had over 50% share in 2017 and increased its share to 66% in 2018, while R's share decreased for the first time since we started this poll, dropping below 50%.

RapidMiner surges

RapidMiner, which was the top Data Science platform in the past several polls, dramatically increased its share to about 53%, up from 32% in 2017.

What part of this is due to user growth, and what part to vendor promotion?

I asked RapidMiner what they did to encourage their users, and here is a response from Ingo Mierswa, RapidMiner founder and president.
"Like many vendors, RapidMiner promotes the KDnuggets survey to users through a number of channels, including sending a few emails to people who have used our product in the past 12 months. We've done the same promotion before, but two different things happened this year. First we received a much better response. Over 400 users personally replied to my email expressing how happy they were to help us out. But more importantly, we've seen a 300% increase in monthly active RapidMiner users over the past year, so we emailed more people than in prior years. We're humbled to have such an engaged and loyal user community."


For the record I note that RapidMiner is not a current advertiser on KDnuggets.

SQL is steady

SQL, including Spark SQL and SQL-on-Hadoop tools, continues to have a share of about 40% in each of the last 3 polls. So, if you are an aspiring Data Scientist, learn SQL - it will likely be useful for a long while!

Trends

The only new entry in the poll with over 2% share of usage was Spark SQL, with 11.7% share.

The table below lists the tools that have grown 20% or more in share and reached at least 3% share in 2018.

Table 2: Major Analytics/Data Science/ML Tools with the largest increase in usage

Tool                                      % change    2018 % share    2017 % share
Keras                                     108%        22.2%           10.7%
PyTorch                                   92%         6.4%            3.4%
Amazon Machine Learning                   74%         3.3%            1.9%
RapidMiner                                65%         52.7%           31.9%
Other free analytics/data mining tools    53%         8.3%            5.4%
DeepLearning4J                            39%         3.4%            2.4%
Anaconda                                  37%         33.4%           24.3%
PyCharm                                   33%         13.5%           10.1%
Tensorflow                                32%         29.9%           22.7%
Excel                                     24%         39.1%           31.5%
Tableau                                   21%         26.4%           21.8%


Consolidation

We note that among 56 tools with 2% or higher share in 2017, only 19 (about one third) increased their share in 2018, while 37 dropped. This, along with recent acquisitions (Datawatch buying Angoss, Minitab buying Salford), suggests that consolidation of Data Science platforms is under way.

Tools that had at least 3% share in 2017 and declined 25% or more in their share in 2018 are in the next table.

Table 3: Major Analytics/Data Science Tools with the largest decline in usage

Tool                                                   % change    2018 % share    2017 % share
Caffe                                                  -58%        1.5%            3.5%
Microsoft Machine Learning Server (former R Server)    -57%        2.1%            4.9%
IBM Data Science Experience                            -55%        1.4%            3.2%
KNIME                                                  -41%        12.3%           21.0%
IBM Watson / Watson Analytics                          -35%        3.1%            4.8%
Hadoop: Open Source Tools                              -35%        11.0%           16.8%
Hadoop: Commercial Tools                               -33%        5.7%            8.5%
SAS Enterprise Miner                                   -30%        4.3%            6.2%
IBM SPSS Modeler                                       -29%        4.9%            6.9%
Scala                                                  -29%        5.9%            8.3%
SAS Base                                               -29%        5.5%            7.7%
Alteryx                                                -28%        4.0%            5.7%
MLlib                                                  -26%        3.8%            5.1%
Theano                                                 -25%        4.9%            6.5%
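
The selection rules for Tables 2 and 3 can be expressed as two simple filters. The Python sketch below is an illustration only: the names `shares`, `pct_change`, `gainers`, and `decliners` are chosen here, and the inputs are the rounded shares printed in this post, so recomputed changes can differ by a few points from the published % change column (which was presumably computed from unrounded data).

```python
# (2018 share, 2017 share) in percent, copied from the tables and lists in this post.
shares = {
    "Keras": (22.2, 10.7),
    "PyTorch": (6.4, 3.4),
    "Tableau": (26.4, 21.8),
    "Python": (65.6, 59.0),
    "Java": (15.1, 15.5),
    "KNIME": (12.3, 21.0),
    "Caffe": (1.5, 3.5),
    "Scala": (5.9, 8.3),
}


def pct_change(now, before):
    """Relative change in share, in percent of the 2017 share."""
    return (now - before) / before * 100


changes = {tool: pct_change(now, before) for tool, (now, before) in shares.items()}

# Table 2 rule: grew 20% or more AND reached at least 3% share in 2018.
gainers = sorted(
    (t for t, (now, _) in shares.items() if changes[t] >= 20 and now >= 3),
    key=lambda t: changes[t],
    reverse=True,
)

# Table 3 rule: at least 3% share in 2017 AND declined 25% or more.
decliners = sorted(
    (t for t, (_, before) in shares.items() if changes[t] <= -25 and before >= 3),
    key=lambda t: changes[t],
)

print("gainers:", gainers)      # ['Keras', 'PyTorch', 'Tableau']
print("decliners:", decliners)  # ['Caffe', 'KNIME', 'Scala']
```

Python and Java fall into neither list: both changed by less than the thresholds, even though Python's absolute gain in share is large.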


Deep Learning Tools

The share of voters who used Deep Learning tools remained stable, at 33% of voters, vs 32% in 2017 and 18% in 2016.

Google Tensorflow is by far the dominant platform, but Keras emerged as a very popular wrapper on top of Tensorflow.

Top Deep Learning tools were:
  • Tensorflow, 29.9%
  • Keras, 22.2%
  • PyTorch, 6.4%
  • Theano, 4.9%
  • Other Deep Learning Tools, 4.9%
  • DeepLearning4J, 3.4%
  • Microsoft Cognitive Toolkit (Prev. CNTK), 3.0%
  • Apache MXnet, 1.5%
  • Caffe, 1.5%
  • Caffe2, 1.2%
  • TFLearn, 1.1%
  • Torch, 1.0%
  • Lasagne, 0.3%


Big Data Tools: Hadoop Drops

In 2018, about 33% of voters used Big Data tools, either Hadoop or Spark, about the same as in 2017, but Hadoop usage declined markedly, by about 30%.

Here are the details:
Tool                         % change    2018 % share    2017 % share
Apache Spark                 -15%        21.5%           25.5%
Spark SQL                    new         11.7%           n/a
Hadoop: Open Source Tools    -35%        11.0%           16.8%
SQL on Hadoop tools          -12%        10.2%           11.6%
Hadoop: Commercial Tools     -33%        5.7%            8.5%


Programming Languages

Python seems to be swallowing not only R, but also most other languages, except for SQL, Java, and C/C++, which remained at about the same level. R has declined for the first time since we started running this survey, and most other languages have also declined.

Here are the main programming languages sorted by popularity.
  • Python, 65.6% (was 59.0% in 2017), 11% up
  • R, 48.5% (was 56.6%), 14% down
  • SQL, 39.6% (was 39.2%), 1% up
  • Java, 15.1% (was 15.5%), 3% down
  • Unix, shell/awk/gawk, 9.2% (was 10.8%), 15% down
  • Other programming and data languages, 6.9% (was 7.6%), 9% down
  • C/C++, 6.8% (was 7.1%), 3% down
  • Scala, 5.9% (was 8.3%), 29% down
  • Perl, 1.0% (was 1.9%), 46% down
  • Julia, 0.7% (was 1.2%), 45% down
  • Lisp, 0.3% (was 0.4%), 25% down
  • Clojure, 0.2% (was 0.3%), 38% down
  • F#, 0.1% (was 0.5%), 73% down
Next page shows full 3-year results, regional participation, and links to past polls.