KDnuggets Annual Software Poll: RapidMiner and R vie for first place

The 2013 KDnuggets Software Poll was marked by a battle between RapidMiner and R for first place. Surprisingly, commercial and free software maintained parity, with about 30% of voters using each exclusively and 40% using both. Only 10% used their own code - is analytics software maturing? Real Big Data is still done by a minority - only 1 in 7 used Hadoop or similar tools, the same as last year.



By Gregory Piatetsky, Jun 3, 2013.

The 14th annual KDnuggets Software Poll attracted record participation of 1880 voters, more than doubling the 2012 numbers.

This year's poll was notable for the battle between RapidMiner and R for first place. RapidMiner was very successful in motivating its users, and got the most votes.

The distinction between commercial and open-source is becoming less clear, since open-source software providers like RapidMiner, KNIME, and Revolution Analytics increasingly also have commercial/enterprise versions. Many RapidMiner users were apparently confused by this distinction, since there were more votes (almost 500) for the commercial version of RapidMiner than there were actual licenses (according to Rapid-I CEO Ingo Mierswa). We dealt with this by treating ballots that included both the commercial and the free version of RapidMiner as votes for the free version. This still left 225 RapidMiner users who used only the commercial version.
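The adjustment described above can be sketched in a few lines. This is only an illustration of the stated rule, not the code actually used for the poll; the ballot data and the names `FREE`, `COMMERCIAL`, and `normalize_rapidminer` are hypothetical.

```python
# Hypothetical labels for the two RapidMiner poll options.
FREE = "RapidMiner free edition"
COMMERCIAL = "RapidMiner Commercial Edition"


def normalize_rapidminer(ballot):
    """Return a copy of the ballot in which a vote for both RapidMiner
    editions is counted as a vote for the free edition only."""
    tools = set(ballot)
    if FREE in tools and COMMERCIAL in tools:
        tools.discard(COMMERCIAL)
    return tools


# Made-up ballots: each ballot is the set of tools one voter selected.
ballots = [
    {FREE, COMMERCIAL, "R"},  # both editions -> counted as free only
    {COMMERCIAL},             # commercial only -> unchanged
    {FREE},
    {"Excel", "R"},
]

normalized = [normalize_rapidminer(b) for b in ballots]
commercial_votes = sum(COMMERCIAL in b for b in normalized)
print(commercial_votes)
```

After normalization, only ballots that named the commercial edition without the free one still count toward the commercial total.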

We found an interesting and stable balance between commercial and free software: 29% of voters used commercial software but not free software (vs 28% in 2012), a very similar number - 30% - used free software but not commercial (same as in 2012), and 41% used both (same as in 2012).
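As a rough sketch of how such a three-way split can be computed, each ballot can be classified by the types of tools it contains. The tool-to-type mapping and the ballots below are made up for illustration, and the "neither" bucket (for example, voters who listed only Big Data tools) is an assumption about how remaining ballots would be handled.

```python
from collections import Counter

# Illustrative subset only; the real poll listed many more tools.
TOOL_TYPE = {
    "R": "free",
    "Weka": "free",
    "SAS": "commercial",
    "Excel": "commercial",
}


def classify(ballot):
    """Label a ballot by the kinds of analytics tools it contains."""
    types = {TOOL_TYPE[t] for t in ballot if t in TOOL_TYPE}
    if types == {"commercial"}:
        return "commercial only"
    if types == {"free"}:
        return "free only"
    if types == {"commercial", "free"}:
        return "both"
    return "neither"


# Made-up ballots for demonstration.
ballots = [
    {"R", "SAS"},
    {"R"},
    {"SAS", "Excel"},
    {"R", "Weka"},
    {"Excel", "Weka"},
]

counts = Counter(classify(b) for b in ballots)
shares = {label: round(100 * n / len(ballots)) for label, n in counts.items()}
print(shares)
```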

Only 10% indicated that they used their own code, down from 47% in 2012. Is it an indication of growing maturity of tools?

The average number of tools used was 3.0.

Only 14% of voters reported using big data tools, compared with 15% in 2012 (and 3% in 2011).

This suggests that Real Big Data remains isolated among a select group of web giants, government agencies, and similar very large enterprises, and that most data analysis is done on "medium" and small data. The recent KDnuggets Poll on Largest DB analyzed supports this conclusion.

The following table shows results of the poll.

"% alone" is the percentage of a tool's voters who used only that tool. For example, only 5.6% of Weka voters voted only for Weka, while 43% of Predixion Software voters used only that tool.
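To make the definition concrete, here is a minimal sketch of the calculation, assuming each ballot is available as the set of tools one voter selected. The ballots and the function name `percent_alone` are hypothetical.

```python
def percent_alone(ballots, tool):
    """Percentage of the tool's voters whose ballot contains only that tool."""
    voters = [b for b in ballots if tool in b]
    if not voters:
        return 0.0
    alone = sum(1 for b in voters if len(b) == 1)
    return 100.0 * alone / len(voters)


# Made-up ballots for demonstration.
ballots = [
    {"Weka"},
    {"Weka", "R"},
    {"Weka", "Excel", "R"},
    {"Weka", "Excel"},
    {"R"},
]

print(percent_alone(ballots, "Weka"))   # one of four Weka voters used Weka alone
print(percent_alone(ballots, "Excel"))  # no Excel voter used Excel alone
```

A high value means that most of a tool's voters selected nothing else.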

What Analytics, Big Data, Data mining, Data Science software you used in the past 12 months for a real project? [1880 voters]

Tool (votes), % alone, % users in 2013, % users in 2012 (omitted where the source gives none)
Rapid-I RapidMiner/RapidAnalytics free edition (737), 30.9% alone, 39.2%, 26.7%
R (704), 6.5% alone, 37.4%, 30.7%
Excel (527), 0.9% alone, 28.0%, 29.8%
Weka / Pentaho (269), 5.6% alone, 14.3%, 14.8%
Python with any of numpy/scipy/pandas/iPython... packages (250), 0% alone, 13.3%, 14.9%
Rapid-I RapidAnalytics/RapidMiner Commercial Edition (225), 52.4% alone, 12.0%
SAS (202), 2.0% alone, 10.7%, 12.7%
MATLAB (186), 1.6% alone, 9.9%, 10.0%
StatSoft Statistica (170), 45.9% alone, 9.0%, 14.0%
IBM SPSS Statistics (164), 1.8% alone, 8.7%, 7.8%
Microsoft SQL Server (131), 1.5% alone, 7.0%, 5.0%
Tableau (118), 0% alone, 6.3%, 4.4%
IBM SPSS Modeler (114), 6.1% alone, 6.1%, 6.8%
KNIME free edition (110), 1.8% alone, 5.9%, 21.8%
SAS Enterprise Miner (110), 0% alone, 5.9%, 5.8%
Rattle (84), 0% alone, 4.5%
JMP (77), 7.8% alone, 4.1%, 4.0%
Orange (67), 13.4% alone, 3.6%, 5.3%
Other free analytics/data mining software (64), 3.1% alone, 3.4%, 4.9%
Gnu Octave (54), 0% alone, 2.9%
Revolution Analytics R Enterprise (53), 1.9% alone, 2.8%, 1.4%
Predixion Software (51), 43.1% alone, 2.7%, 0.4%
KNIME Professional (46), 4.3% alone, 2.4%
Revolution Analytics R free edition (46), 2.2% alone, 2.4%
IBM Cognos (45), 2.2% alone, 2.4%, 2.0%
Other commercial analytics/data mining/data science software (45), 0% alone, 2.4%, 4.0%
QlikView (45), 2.2% alone, 2.4%
Salford SPM/CART/MARS/TreeNet/RF (42), 26.2% alone, 2.2%, 1.1%
Mathematica (39), 0% alone, 2.1%, 2.9%
Stata (39), 2.6% alone, 2.1%, 1.9%
KXEN (35), 54.3% alone, 1.9%, 1.8%
Miner3D (34), 41.2% alone, 1.8%, 2.4%
SAP (including BusinessObjects/Sybase/Hana) (27), 3.7% alone, 1.4%, 0.9%
TIBCO Spotfire / S+ / Miner (26), 3.8% alone, 1.4%, 4.6%
C4.5/C5.0/See5 (21), 0% alone, 1.1%, 1.6%
Bayesia (19), 15.8% alone, 1.0%, 1.8%
Oracle Data Miner (19), 5.3% alone, 1.0%, 4.4%
Zementis (17), 41.2% alone, 0.9%, 1.8%
XLSTAT (16), 0% alone, 0.9%, 0.9%
F# (14), 14.3% alone, 0.7%, 0.6%
RapidInsight/Veera (9), 0% alone, 0.5%, 0.6%
Teradata Miner (9), 0% alone, 0.5%, 0.5%
Lavastorm (8), 25.0% alone, 0.4%
WordStat (7), 0% alone, 0.4%, 0.4%
Angoss (6), 16.7% alone, 0.3%, 0.9%
11 Ants Analytics (5), 0% alone, 0.3%, 0.5%
Alteryx (5), 0% alone, 0.3%
Megaputer Polyanalyst/TextAnalyst (2), 0% alone, 0.1%
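The 2013 usage percentages in the table are consistent with dividing each tool's vote count by the 1880 total voters. The short check below, a sketch using a few rows copied from the table, recomputes the share to one decimal place.

```python
TOTAL_VOTERS = 1880

# Vote counts copied from the table above.
votes = {
    "R": 704,
    "Excel": 527,
    "Weka / Pentaho": 269,
    "SAS": 202,
}

# Share of all voters who selected each tool, in percent.
shares = {tool: round(100 * n / TOTAL_VOTERS, 1) for tool, n in votes.items()}

for tool, share in shares.items():
    print(f"{tool}: {share}%")
```

Because voters could select several tools, these shares are not expected to sum to 100%.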

Among tools with at least 1% share, the largest increases in 2013 were for:

  • Predixion Software, up 622%, to 2.7% share, from 0.4% in 2012 (was 0.5% in 2011)
  • Revolution Analytics R Enterprise, up 105%, to 2.8% share, from 1.4% in 2012 (was 1.4% in 2011)
  • Salford SPM/CART/MARS/TreeNet/RF, up 98%, to 2.2% share, from 1.1% in 2012 (was 10.6% in 2011)
  • SAP (including BusinessObjects/Sybase/Hana), up 64%, to 1.4% share, from 0.9% in 2012 (not asked in 2011)
  • Rapid-I RapidMiner/RapidAnalytics free edition, up 47%, to 39.2% share, from 26.7% in 2012 (was 27.7% in 2011)
  • Tableau, up 43%, to 6.3% share, from 4.4% in 2012 (was 2.6% in 2011)
  • Microsoft SQL Server, up 39%, to 7.0% share, from 5.0% in 2012 (was 4.9% in 2011)
  • R, up 22%, to 37.4% share, from 30.7% in 2012 (was 23.3% in 2011).

Note that only Tableau and R showed strong growth in both 2013 and 2012.

Some of the increase was probably due to successful efforts of vendors to get their users to vote. Likewise, some of the decreases in the numbers below were probably due to lack of such efforts, compared to 2012.

The tools with the largest decline in share of usage were:

  • Orange, down 32.3%, to 3.6% in 2013, from 5.3% (was 1.3% in 2011)
  • StatSoft Statistica, down 35.6%, to 9.0% in 2013, from 14.0% (was 8.5% in 2011)
  • Bayesia, down 42.4%, to 1.0% in 2013, from 1.8% (was 0.8% in 2011)
  • TIBCO Spotfire / S+ / Miner, down 70.2%, to 1.4% in 2013, from 4.6% (was 1.7% in 2011)
  • KNIME free edition, down 73.2%, to 5.9% in 2013, from 21.8% (was 12.1% in 2011)
  • Oracle Data Miner, down 77.0%, to 1.0% in 2013, from 4.4% (was 0.7% in 2011)
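The increases and declines listed above are relative changes in share, not percentage-point differences. A minimal sketch of that calculation follows, using the rounded shares quoted in the lists; the function name `relative_change` is hypothetical.

```python
def relative_change(new_share, old_share):
    """Relative change from old_share to new_share, in percent."""
    return 100.0 * (new_share - old_share) / old_share


# (2013 share, 2012 share) in percent, as quoted in the lists above.
shares = {
    "Tableau": (6.3, 4.4),
    "R": (37.4, 30.7),
    "Orange": (3.6, 5.3),
    "KNIME free edition": (5.9, 21.8),
}

changes = {
    tool: round(relative_change(new, old))
    for tool, (new, old) in shares.items()
}
print(changes)
```

Recomputing from the rounded shares can differ slightly from the quoted figures (for Predixion Software, 2.7% versus 0.4% gives roughly 575% rather than 622%), presumably because the quoted changes were derived from unrounded shares.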

New tools that appeared this year and got more than 1% were

  • Rattle, 4.5%
  • Gnu Octave, 2.9%
  • QlikView, 2.4%
  • Revolution Analytics R free edition, 2.4%

Surprisingly, the share of Big Data tools remained relatively stable, at about 14%, vs 15% in 2012, but only 3% in 2011.

The most popular Big Data tools were

  • Big Data Software: Hadoop/HBase/Pig/Hive, 9.3%
  • MongoDB, 4.3%
  • Other Big Data/Cloud analytics software, 3.2%
  • Other NoSQL Databases, 2.0%

Only one vote was received for an interesting new tool, HPCC from LexisNexis.

By region, this poll drew more participants from Europe, where RapidMiner was especially successful in getting its users to vote.

The following table shows the breakdown by region and tool type: commercial/free/both. Less than 1% used only Big Data tools (not shown). US/Canada, E. Europe, and Africa/Middle East had the highest percentage of analysts who ONLY used commercial tools (34%), while Latin America and W. Europe had the highest percentage who ONLY used free tools.

Region, % users who only use commercial tools, % users who only use free tools, % users who use both
US/Canada (33%), 34%, 17%, 48%
W. Europe (28%), 24%, 35%, 40%
E. Europe (13%), 34%, 29%, 36%
Asia (11%), 25%, 34%, 41%
Latin America (8.8%), 24%, 45%, 31%
Africa/MidEast (4.0%), 34%, 29%, 36%
Australia/NZ (2.3%), 25%, 32%, 43%

We also did a comparable regional breakdown in KDnuggets 2011 Poll: Data Mining/Analytic Tools Used, and the results were quite different then.

Next, we looked at the regional breakdown for Big Data tools, and found that Asian and US/Canada users used more Big Data tools than the rest. US/Canada, Asian, and AU/NZ users also used more tools than the average.

Region Avg. Number of Tools % using Big Data Software
US/Canada 3.4 17%
W. Europe 2.8 14%
E. Europe 2.7 9%
Asia 3.3 19%
Latin America 2.5 8%
Africa/MidEast 2.4 11%
Australia/NZ 3.3 9%
All 3.0 14%
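As a consistency check, the "All" row can be approximated as a weighted average of the regional rows. The sketch below assumes that the percentage shown in parentheses next to each region in the earlier regional table is that region's share of voters; that reading is an interpretation, not something the poll states explicitly.

```python
# region: (assumed share of voters in %, avg. number of tools, % using Big Data)
regions = {
    "US/Canada": (33.0, 3.4, 17),
    "W. Europe": (28.0, 2.8, 14),
    "E. Europe": (13.0, 2.7, 9),
    "Asia": (11.0, 3.3, 19),
    "Latin America": (8.8, 2.5, 8),
    "Africa/MidEast": (4.0, 2.4, 11),
    "Australia/NZ": (2.3, 3.3, 9),
}

# The rounded regional shares do not sum to exactly 100, so normalize by their total.
total_weight = sum(w for w, _, _ in regions.values())

avg_tools = sum(w * tools for w, tools, _ in regions.values()) / total_weight
big_data = sum(w * pct for w, _, pct in regions.values()) / total_weight

print(round(avg_tools, 1), round(big_data))
```

Under that assumption, the weighted averages round to the 3.0 tools and 14% Big Data usage reported in the "All" row.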

Comments

Additional software mentioned in response to this poll:

  • FICO Model Builder
  • Infocentricity Xeno

From ACM SIGKDD LinkedIn Group:

  • Xingkai Li: ELKI for clustering. ELKI is an Environment for Developing KDD-Applications Supported by Index-Structures.
  • Greg Makowski: SAS Enterprise Miner, R, Angoss ... deploy with PMML and the Zementis ADAPA server
  • Phil Nguyen: Mathematica/Matlab for math or matrix intensive projects, C++, CUDA for the extra boost.
  • Will (Willy) Martin: Storm, Kafka, R/Math3 (Apache)

On KDnuggets Poll Results:

  • Daniel: There's one tool missing for data mining: TIMi. I use it for all classification and regression problems when working with actual population data (i.e. any data mining application).
  • Frank: You missed an important analytic ETL tool for Big Data: Anatella. We used this tool to process big datasets for predictive modelling of churn for different big telecoms (Millicom, MTN, VOO, etc.). For example: we started from 35TB of raw text files, and generated an 85e6-row, 2e3-column dataset for churn modeling. Just to give an idea: on this job, Anatella is roughly between 30 and 200 times faster than SAS or IBM-SPSS (about 7 days of computation time instead of 7 months).