KDnuggets Home » News » 2012 » May » Poll Results: Top Analytics, Data Mining, Big Data software used  (  12:n13 | Next > )

Poll Results: Top Analytics, Data Mining, Big Data software used

For the first time, the number of users of free/open source software exceeded the number of users of commercial software. The usage of big data software grew five-fold. R, Excel, and RapidMiner were the most popular tools, with StatSoft Statistica getting the top commercial tool spot.

The 13th annual KDnuggets Software Poll asked:

What analytics/data mining software you used in the past 12 months for a real project (not just evaluation)

About 28% used commercial software but not free software, 30% used free software but not commercial, and 41% used both.

The usage of big data tools grew five-fold: 15% used them in 2012, vs about 3% in 2011.

R, Excel, and RapidMiner are the most popular tools, with StatSoft Statistica becoming the most popular commercial tool, getting more votes than SAS (in part due to a more active campaign from StatSoft users and the lack of such a campaign from SAS).

Among those who wrote analytics code in lower-level languages, R, SQL, Java, and Python were most popular.

This poll also had a very large number of participants and used email verification and other measures to remove unnatural votes (*see note below).

What Analytics, Data mining, Big Data software you used in the past 12 months for a real project (not just evaluation) [798 voters]
[Chart legend: free/open source tools vs. commercial tools; % of users in 2012 and in 2011. Listed below: tool (votes), % of users in 2012.]
R (245) 30.7%
Excel (238) 29.8%
Rapid-I RapidMiner (213) 26.7%
KNIME (174) 21.8%
Weka / Pentaho (118) 14.8%
StatSoft Statistica (112) 14.0%
SAS (101) 12.7%
Rapid-I RapidAnalytics (83) 10.4% (not asked in 2011)
MATLAB (80) 10.0%
IBM SPSS Statistics (62) 7.8%
IBM SPSS Modeler (54) 6.8%
SAS Enterprise Miner (46) 5.8%
Orange (42) 5.3%
Microsoft SQL Server (40) 5.0%
Other free analytics/data mining software (39) 4.9%
TIBCO Spotfire / S+ / Miner (37) 4.6%
Oracle Data Miner (35) 4.4%
Tableau (35) 4.4%
JMP (32) 4.0%
Other commercial analytics/data mining software (32) 4.0%
Mathematica (23) 2.9%
Miner3D (19) 2.4%
IBM Cognos (16) 2.0% (not asked in 2011)
Stata (15) 1.9%
Bayesia (14) 1.8%
KXEN (14) 1.8%
Zementis (14) 1.8%
C4.5/C5.0/See5 (13) 1.6%
Revolution Computing (11) 1.4%
Salford SPM/CART/MARS/TreeNet/RF (9) 1.1%
Angoss (7) 0.9%
SAP (including BusinessObjects/Sybase/Hana) (7) 0.9% (not asked in 2011)
XLSTAT (7) 0.9%
RapidInsight/Veera (5) 0.6% (not asked in 2011)
11 Ants Analytics (4) 0.5%
Teradata Miner (4) 0.5% (not asked in 2011)
Predixion Software (3) 0.4%
WordStat (3) 0.4%
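
As a cross-check, each percentage in the table above appears to be the tool's vote count divided by the 798 retained voters. The short Python sketch below reproduces a few of them; the function name `usage_percent` and the dictionary are illustrative and not part of the poll's own tooling. Because respondents could select several tools, the shares add up to more than 100%.

```python
# Vote counts copied from the table above; 798 is the number of retained voters.
TOTAL_VOTERS = 798
votes = {"R": 245, "Excel": 238, "Rapid-I RapidMiner": 213, "KNIME": 174}


def usage_percent(count, total=TOTAL_VOTERS):
    """Share of voters who selected a tool, in percent, rounded to one decimal."""
    return round(100 * count / total, 1)


shares = {tool: usage_percent(n) for tool, n in votes.items()}
for tool, share in shares.items():
    print(f"{tool}: {share}%")
```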

Among tools with at least 10 users, the tools with the highest increase in "usage percent" were:

  • Oracle Data Miner, 4.4% in 2012, up from 0.7% in 2011, 505% increase
  • Orange, 5.3% from 1.3%, 315% increase
  • TIBCO Spotfire / S+ / Miner, 4.6% from 1.7%, 169% increase
  • Stata, 1.9% from 0.8%, 130% increase
  • Bayesia, 1.8% from 0.8%, 115% increase

The three tools with the largest decrease in usage percent were 11 Ants Analytics, Salford SPM/CART/MARS/TreeNet/RF, and Zementis. Their dramatic decrease is probably due to vendors doing much less (or nothing) to encourage their users to vote in 2012 as compared to 2011.

Note: Three tools received fewer than 3 votes and were not included in the above table: Clarabridge, Megaputer Polyanalyst/TextAnalyst, and Grapheur/LIONsolver.

Big Data

Use of big data tools grew five-fold, from about 3% to about 15% of respondents.
Big Data software you used in the past 12 months
Apache Hadoop/Hbase/Pig/Hive (67) 8.4%
Amazon Web Services (AWS) (36) 4.5%
NoSQL databases (33) 4.1%
Other Big Data/Cloud analytics software (21) 2.6%
Other Hadoop-based tools (10) 1.3%

We also asked about the popularity of individual languages for data mining. Note that R is included in this table as well as among the higher-level tools.

Your own code you used for analytics/data mining in the past 12 months in:
R (245) 30.7%
SQL (185) 23.2%
Java (138) 17.3%
Python (119) 14.9%
C/C++ (66) 8.3%
Other languages (57) 7.1%
Perl (37) 4.6%
Awk/Gawk/Shell (31) 3.9%
F# (5) 0.6%


*Note on vote cleaning: To reduce multiple voting, this poll used email verification, which reduced the total number of votes compared to 2011 but made the results more representative.
Furthermore, some vendors were much more active than others in recruiting their users. To give a more objective picture of tool popularity, a large number (over 100) of the "unnatural" votes were removed, leaving 798 votes.

I work in a commercial bank in Africa and we have been using R since we started predictive modeling. R is becoming more common in the industry. Some of the banks even in the US ask for datamining skills in R.
A point to note, though: why isn't Octave rated? Isn't it popular?

Gregory PS
Correction: number of voters has grown from about 300 in year 2000 to about 1,100 in 2011 but 2011 poll did not use email verification.

Gregory PS
BR, alas not every data miner has voted in KDnuggets software poll. KDnuggets has an average of 50,000 unique monthly visitors, according to Google Analytics, and majority of visitors do not vote in polls. The number of voters in this poll has stayed around 700-1000 since 2001.

BR Deshpande
What is interesting to note is that there seem to be less than 800 people who could be considered as serious data miners. Now, if one were to accept the McKinsey study, this would mean a shortage of 199,000+ people for the industry at large by 2018...!?

I wonder, Gregory if this number has remained constant over the last 13 years.

Gregory PS
Walter, it is fine that vendors ask their users to vote, but some votes look "unnatural". This year only Statsoft and KNIME had a significant number of such votes, which I removed. I have used the same objective method for removing such votes for the last several years, but I don't want to publish this method because I would not be able to use it next year. However, I am working on a separate survey of leading open-source tools using different objective measures.

Walter Clark
I am also a bit surprised by the "clean up" that obviously hit some tools a lot harder than others. We were curious and wanted to monitor trends, so we took snapshots of the intermediate poll numbers over time - the numbers stated here show a huge discrepancy to the numbers that were reported a few hours before the poll closed. What happened? Could someone shed some light on what an "unnatural" vote is?

Apart from that I agree with Jan and some of the others - this type of poll is not really representative. Even weirder that the numbers don't match what users seem to have voted for. Why would anyone bother altering them?


PS: here are the voted tools reported on 5/30/2012 at 0:13 PST:
Excel (252)
SAS (105)
Java (146)
Python (121)
R (257)
SQL (195)
KNIME (271)
Rapid-I RapidMiner (232)
Weka / Pentaho (128)

Jan Galkowski
Echoing "Guest", it is surprising for a community to put any faith in a self-selected survey like this, especially one that claims to offer quantitative and statistical skills for hire. No doubt these are popular on the Web. But this community should know better.

Gregory PS
Eugene, good request - I will see what I can do.

Eugene Dubossarsky
Gregory, is it possible for you to publish a chart showing trends in the polls over the last few years, or at least the numbers in a format that would allow readers to do their own analysis ?
The alternative is a whole lot of data entry....

Gregory PS, I liked the email verification too

Guest, what do you mean by syndicated research studies? Are you talking about Gartner papers?

It is difficult to get a good "big picture" of predictive analytics in the business environment. Some companies see predictive analytics as their "secret ingredient" and won't talk about how they are using it. Some companies are trying to figure out how to use data mining and don't have anything to talk about.

Some industries have started using R more than others. Conservative companies (insurance) are more comfortable with purchasing software and more importantly services/support.

Gregory PS
Most KDnuggets readers and visitors are in industry - only about 20% or so are in the academic community. This is not a scientific survey, but KDnuggets has been doing such software polls every year since 2000, and they serve as a useful measure of the community.

As for RapidMiner, you cannot add its numbers with RapidAnalytics, since these two sets overlap a lot.

James T.
Statistica is gaining its own position more and more in analytics and data mining. It is among the most convenient of the tools I have used.

I don't see the science in this survey, as most of your readership happens to be in the academic community, which uses R and other open source tools for its research efforts. Syndicated research studies have a much different picture of the top data mining vendors than what is shown here. They can't both be correct if the survey results are so strikingly different.

Gurupad Hegde
It's pretty sad that RapidMiner lost its first position. But Rapid-I still stands first, as
RapidMiner (213) and RapidAnalytics (83) sum to 296, which is still > R :)

As far as I know, this is the first KDnuggets poll that used email verification. Nice one!

Looking forward to more polls in future!
