KDnuggets Home » News » 2015 » May » News, Features » R leads RapidMiner, Python catches up, Big Data tools grow, Spark ignites ( 15:n17 )

R leads RapidMiner, Python catches up, Big Data tools grow, Spark ignites

          





R is the most popular overall tool among data miners, although Python usage is growing faster. RapidMiner continues to be the most popular suite for data mining/data science. Hadoop/Big Data tool usage grew to 29%, propelled by 3x growth in Spark. Other tools with strong growth include H2O (0xdata), Actian, MLlib, and Alteryx.



The 16th annual KDnuggets Software Poll again drew strong attention from the analytics and data mining community and vendors, attracting about 2,800 voters, who chose from a record 93 different tools.

R is the most popular overall tool among data miners and data scientists, but Python usage is growing faster and is likely to catch up in 2-3 years. RapidMiner remains the most popular suite for data mining/data science, but it got fewer votes than last year. There was a notable increase in Hadoop/Big Data tool usage (29%, up from 17% in 2014), mainly driven by a jump in Spark, whose usage share grew more than 3-fold (see the KDnuggets exclusive interview with Spark creator Matei Zaharia). Other tools with strong growth include H2O (0xdata), Actian, MLlib, and Alteryx.

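This report has 5 sections: Top Analytics Tools and Trends; Hadoop/Big Data Tools; Deep Learning Tools; Programming Languages; and Full Results and 3-year trends.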
 
The participation by region was: US/Canada (41.5%), Europe (38.4%), Asia (8.2%), Latin America (6.3%), Australia/NZ (3.1%), Africa/MidEast (2.5%).
 

Top Analytics Tools and Trends

Here are the top 10 tools by share of usage:

Figure: Top 10 Analytics/Data Mining Software, 2015

The top 10 tools by share of users were
  1. R, 46.9% share (38.5% in 2014)
  2. RapidMiner, 31.5% (44.2% in 2014)
  3. SQL, 30.9% (25.3% in 2014)
  4. Python, 30.3% (19.5% in 2014)
  5. Excel, 22.9% (25.8% in 2014)
  6. KNIME, 20.0% (15.0% in 2014)
  7. Hadoop, 18.4% (12.7% in 2014)
  8. Tableau, 12.4% (9.1% in 2014)
  9. SAS, 11.3% (10.9% in 2014)
  10. Spark, 11.3% (2.6% in 2014)

 
Compared to the 2014 Analytics/Data Mining Software Poll, Tableau and Spark were newcomers to the top 10, displacing Weka and Microsoft SQL Server.

The average number of tools used per voter jumped to 4.8, up from 3.7 in 2014 and 3.0 in 2013.

The distinction between commercial and free software is becoming harder to make, with many tools having both a free/community version and a commercial/enterprise version. We classified each tool according to the primary type of its latest version, so we put RapidMiner in the commercial category and KNIME in the free software category.

Many vendors asked their users to vote in the poll and even tweet their vote, but we did not find any bot or illegal voting and did not have to remove any votes.

This year, 91% of voters used commercial software and 73% used free software. About 27% used only commercial software, and only 9% used only free software. For the first time, a majority (64%) used both free and commercial software, up from 49% in 2014.

Figure: Venn diagram of commercial vs. free analytics/data mining software usage
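The commercial/free percentages above fit together by simple set arithmetic: voters who used only one kind of software are the total for that kind minus those who used both. A minimal Python sketch of that check follows; the function name `venn_split` is a hypothetical helper introduced here for illustration, and the inputs are the rounded percentages quoted in the text.

```python
def venn_split(pct_commercial, pct_free, pct_both):
    """Split two overlapping percentages into 'only' groups.

    Returns (only_commercial, only_free, total), where total is the
    share of voters covered by the three disjoint groups.
    """
    only_commercial = pct_commercial - pct_both
    only_free = pct_free - pct_both
    total = only_commercial + only_free + pct_both
    return only_commercial, only_free, total


# Rounded figures reported for the 2015 poll.
only_commercial, only_free, total = venn_split(91, 73, 64)
print(only_commercial, only_free, total)  # 27 9 100
```

The three disjoint groups (27% commercial only, 9% free only, 64% both) add up to 100%, which is consistent with every voter having used at least one tool.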

Among tools with at least 10 votes, the largest increases in share from 2014 to 2015 were for
  1. H2O (0xdata), 1210% up, to 2.0% share (55 votes) from 0.2% in 2014
  2. Actian, 345% up, to 2.0% (56 votes), from 0.5% in 2014
  3. Spark, 326% up, to 11.3% (311), from 2.6% in 2014
  4. MLlib, 228% up, to 3.3% (91), from 1.0% in 2014
  5. Alteryx, 79% up, to 5.6% (155), from 3.1% in 2014
  6. Python, 56% up, to 30.3% (837), from 19.5% in 2014
  7. TIBCO Spotfire, 56% up, to 4.3% (119), from 2.8% in 2014
  8. Pig, 54% up, to 5.4% (150), from 3.5% in 2014
  9. SAS Enterprise Miner, 53% up, to 10.9% (302), from 7.2% in 2014
  10. Splunk/Hunk, 49% up, to 1.1% (30), from 0.7% in 2014
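The "% up" figures in this list are relative changes in usage share. A minimal Python sketch of that calculation is below; the function name `relative_growth` is a hypothetical helper, and the inputs are the rounded shares printed above. The published percentages were presumably computed from unrounded shares (an assumption), so results from the rounded values land close to, but not exactly on, the published numbers, and can differ a lot for tools with very small shares such as H2O.

```python
def relative_growth(share_now, share_before):
    """Percent change in usage share relative to the earlier share."""
    if share_before <= 0:
        raise ValueError("share_before must be positive")
    return (share_now - share_before) / share_before * 100.0


# (2015 share, 2014 share) in percent, as rounded in the list above.
rounded_shares = {
    "Spark": (11.3, 2.6),
    "Python": (30.3, 19.5),
    "Alteryx": (5.6, 3.1),
}

for tool, (now, before) in rounded_shares.items():
    print(f"{tool}: {relative_growth(now, before):.0f}% up")
```

Note that a relative increase of about 300% or more corresponds to roughly a 4-fold share, since the growth is measured on top of the original share.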


Tools that showed at least a 20% increase in share for 2 years in a row are Alteryx, Hadoop, KNIME, Python, QlikView, SAS Enterprise Miner, Tableau, and TIBCO Spotfire.

New analytics tools that received at least 20 votes in 2015 were
  • scikit-learn, 8.3% (229)
  • Microsoft Azure ML, 3.7% (102)
  • Microsoft Power BI, 3.6% (98)
  • IBM Watson Analytics, 2.1% (57)
  • Ayasdi, 2.0% (56)
  • Dataiku, 2.0% (56)
  • Lexalytics, 1.3% (35)
  • Vowpal Wabbit, 1.3% (35)
  • MicroStrategy, 0.9% (24)
  • Amazon Machine Learning, 0.7% (20)

 


Among tools with at least 20 votes in 2014, the largest declines in 2015 were for the tools below. The declines probably reflect a combination of falling popularity for free tools like Orange and the lack of a voter drive for some commercial tools this year.
  • Predixion Software, 90% down, to 0.4%, from 3.7% in 2014
  • BayesiaLab, 86% down, to 0.6%, from 4.1%
  • Alpine Data Labs, 82% down, to 0.5%, from 2.7%
  • Oracle Data Miner, 64% down, to 0.8%, from 2.2%
  • RapidInsight/Veera, 60% down, to 0.2%, from 0.5%
  • Revolution Analytics (now part of Microsoft), 57% down, to 4.0%, from 9.1%
  • SAP (including former KXEN), 57% down, to 3.0%, from 6.8%
  • Orange, 44% down, to 1.9%, from 3.4%
  • Gnu Octave, 41% down, to 2.3%, from 3.9%

 


Hadoop/Big Data Tools

Hadoop/Big Data tool usage jumped to 29% among voters, up from 17% in 2014, and 14% in 2013.

This is probably due to the availability and low cost of many cloud-based Big Data tools. Very notable is the jump in Spark's share to 11.3%.

However, most data analysis is still done on "medium" and small data.

Top Hadoop/Big Data tools were
  • Hadoop, 18.4% share (507 votes)
  • Spark, 11.3% (311)
  • Hive, 10.2% (282)
  • SQL on Hadoop tools, 7.2% (198)
  • Pig, 5.4% (150)
  • HBase, 4.6% (127)
  • Other Hadoop/HDFS-based tools, 4.5% (125)
  • MLlib, 3.3% (91)
  • Mahout, 2.8% (76)
  • Datameer, 0.8% (23)

 

Deep Learning Tools

New this year was a category of Deep Learning Tools, with the most popular tools being:
  • Pylearn2 (55 users)
  • Theano (50)
  • Caffe (29)
  • Cuda-convnet (17)
  • Deeplearning4j (12)
  • Torch (27)

 

However, this category is growing rapidly and the above list is incomplete, since the largest count in this category was for other Deep Learning tools (106).

 

Programming Languages

Python increased significantly in popularity. Java is the second most commonly used language for analytics/data mining tasks. Here are the results:
  • Python, 30.3% share (837 votes), up from 19.5%
  • Java, 14.2% (392), na in 2014
  • C/C++, 9.4% (260), na in 2014
  • Unix shell/awk/gawk, 8.0% (221), up from 5.8%
  • Other programming languages, 5.1% (140)
  • Scala, 3.5% (96), na in 2014
  • Perl, 2.9% (79), down from 3.0%
  • Ruby, 1.2% (33), na in 2014
  • Julia, 1.1% (31), up from 0.8%
  • F#, 0.7% (18), up from 0.5%
  • Clojure, 0.5% (13), unchanged from 0.5%
  • Lisp, 0.4% (10), up from 0.3%

 

Full Results and 3-year trends

The following table shows the poll results in detail.
"% alone" is the percentage of a tool's voters who used only that tool. For example, only 3.6% of R users used R alone, while 13.7% of RapidMiner users indicated they used that tool alone.
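To make the two kinds of percentages concrete, here is a minimal Python sketch that recomputes a tool's usage share from its vote count and estimates how many voters used it alone. The helper names `usage_share` and `approx_alone_voters` are hypothetical, the total of 2759 voters is the figure reported for this poll, and the "alone" counts are only estimates because the published "% alone" values are rounded.

```python
TOTAL_VOTERS = 2759  # total voters reported for the 2015 poll


def usage_share(votes, total=TOTAL_VOTERS):
    """Percent of all voters who reported using the tool."""
    return 100.0 * votes / total


def approx_alone_voters(votes, pct_alone):
    """Estimated number of a tool's voters who used only that tool."""
    return round(votes * pct_alone / 100.0)


# (votes, % alone) for the two examples mentioned in the text.
examples = {"R": (1293, 3.6), "RapidMiner": (870, 13.7)}

for tool, (votes, pct_alone) in examples.items():
    share = usage_share(votes)
    alone = approx_alone_voters(votes, pct_alone)
    print(f"{tool}: {share:.1f}% share, about {alone} voters used it alone")
```

Note the different denominators: the usage share is relative to all voters, while "% alone" is relative to the voters for that tool only.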

What Analytics, Big Data, Data mining, Data Science software you used in the past 12 months for a real project? [2759 voters]
Legend (colors in the original table): Red: Free/Open Source tools; Green: Commercial tools; Fuchsia: Hadoop/Big Data tools
Tool (votes) | % alone | % users in 2015 | % users in 2014 | % users in 2013
R (1293) | 3.6% | 46.9% | 38.5% | 37.4%
RapidMiner (870) | 13.7% | 31.5% | 44.2% | 39.2%
SQL (853) | 0% | 30.9% | 25.3% | na
Python (837) | 0% | 30.3% | 19.5% | 13.3%
Excel (631) | 0% | 22.9% | 25.8% | 28.0%
KNIME (553) | 6.7% | 20.0% | 15.0% | 5.9%
Hadoop (507) | 0% | 18.4% | 12.7% | 9.3%
Tableau (341) | 0% | 12.4% | 9.1% | 6.3%
SAS base (313) | 0.6% | 11.3% | 10.9% | 10.7%
Spark (311) | 0% | 11.3% | 2.6% | na
Weka (310) | 0% | 11.2% | 17.0% | 14.3%
SAS Enterprise Miner (302) | 3.6% | 10.9% | 7.2% | 5.9%
Microsoft SQL Server (268) | 0% | 9.7% | 10.5% | 7.0%
MATLAB (243) | 0% | 8.8% | 8.4% | 9.9%
scikit-learn (229) | 0% | 8.3% | na | na
Unix shell/awk/gawk (221) | 0% | 8.0% | 5.8% | na
IBM SPSS Statistics (213) | 0% | 7.7% | 7.7% | 8.7%
IBM SPSS Modeler (197) | 7.1% | 7.1% | 5.7% | 6.1%
Alteryx (155) | 39.4% | 5.6% | 3.1% | 0.3%
Pig (150) | 0% | 5.4% | 3.5% | na
Other programming languages (140) | 0% | 5.1% | 3.0% | na
Other free analytics/data mining tools (138) | 0% | 5.0% | 5.1% | 3.4%
Other Hadoop/HDFS-based tools (125) | 0% | 4.5% | 3.9% | na
TIBCO Spotfire (119) | 11.8% | 4.3% | 2.8% | 1.4%
Rattle (117) | 0.9% | 4.2% | 4.9% | 4.5%
QlikView (116) | 0% | 4.2% | 3.0% | 2.4%
Revolution Analytics (now part of Microsoft) (109) | 0% | 4.0% | 9.1% | 4.5%
Microsoft Azure ML (102) | 1.0% | 3.7% | na | na
Microsoft Power BI (98) | 0% | 3.6% | na | na
MLlib (91) | 0% | 3.3% | 1.0% | na
JMP (86) | 0% | 3.1% | 3.8% | 4.1%
SAP (including former KXEN) (82) | 26.8% | 3.0% | 6.8% | 1.4%
Perl (79) | 0% | 2.9% | 3.0% | na
Mahout (76) | 0% | 2.8% | 2.5% | na
Pentaho (74) | 0% | 2.7% | 2.6% | na
Other paid analytics/data mining/data science software (66) | 6.1% | 2.4% | 1.9% | 2.4%
Salford SPM/CART/Random Forests/MARS/TreeNet (64) | 43.8% | 2.3% | 3.6% | 2.2%
Gnu Octave (64) | 0% | 2.3% | 3.9% | 2.9%
IBM Watson Analytics (57) | 0% | 2.1% | na | na
Ayasdi (56) | 10.7% | 2.0% | na | na
Dataiku (56) | 7.1% | 2.0% | na | na
Actian (56) | 7.1% | 2.0% | 0.5% | na
H2O (0xdata) (55) | 0% | 2.0% | 0.2% | na
Orange (53) | 0% | 1.9% | 3.4% | 3.6%
Mathematica (52) | 0% | 1.9% | 2.3% | 2.1%
IBM Cognos (51) | 0% | 1.8% | 1.8% | 2.4%
Dell (including StatSoft) (47) | 19.1% | 1.7% | 1.7% | 9.0%
XLSTAT for Excel (42) | 0% | 1.5% | 1.2% | 0.9%
Stata (36) | 2.8% | 1.3% | 1.4% | 2.1%
Lexalytics (35) | 28.6% | 1.3% | na | na
Vowpal Wabbit (35) | 0% | 1.3% | na | na
C4.5/C5.0/See5 (35) | 0% | 1.3% | 1.5% | 1.1%
Julia (31) | 3.2% | 1.1% | 0.8% | na
Splunk/Hunk (30) | 0% | 1.1% | 0.7% | na
Datameer (26) | 0% | 0.9% | 0.8% | na
MicroStrategy (24) | 0% | 0.9% | na | na
BigML (23) | 0% | 0.8% | 0.9% | na
Zementis (22) | 31.8% | 0.8% | 0.8% | 0.9%
Miner3D (22) | 9.1% | 0.8% | 0.9% | 1.8%
Oracle Data Miner (22) | 0% | 0.8% | 2.2% | 1.0%
Amazon Machine Learning (20) | 5.0% | 0.7% | na | na
F# (18) | 0% | 0.7% | 0.5% | 0.7%
BayesiaLab (16) | 12.5% | 0.6% | 4.1% | 1.0%
Dato (former Graphlab) (15) | 6.7% | 0.5% | 0.9% | na
Clojure (13) | 0% | 0.5% | 0.5% | na
Alpine Data Labs (13) | 0% | 0.5% | 2.7% | na
Angoss (11) | 18.2% | 0.4% | 0.4% | 0.3%
Lavastorm (10) | 0% | 0.4% | 0.3% | 0.4%
Lisp (10) | 0% | 0.4% | 0.3% | na
Predixion Software (10) | 0% | 0.4% | 3.7% | 2.7%
WordStat (9) | 0% | 0.3% | 0.2% | 0.4%
Megaputer Polyanalyst/TextAnalyst (8) | 0% | 0.3% | 0.1% | 0.1%
WPS: World Programming System (7) | 0% | 0.3% | 0.2% | na
GoodData (6) | 0% | 0.2% | 0.1% | na
MetaMind (5) | 0% | 0.2% | na | na
SiSense (5) | 0% | 0.2% | 0.1% | na
RapidInsight/Veera (5) | 0% | 0.2% | 0.5% | 0.5%
Skytree (3) | 0% | 0.1% | na | na
Birst (2) | 0% | 0.1% | na | na
Ontotext (1) | 0% | 0% | na | na
FICO Model Builder (1) | 0% | 0% | 0.2% | na


Additional tools not included in the poll but mentioned in the comments include:
  • Daniel Soto: ETL: Anatella; predictive analytics: TIMI modeler.
  • Henrique Pinto: proposed separating SAP technologies into the modeling tool (SAP Predictive Analytics, which merges SAP PA and KXEN) and SAP HANA as the underlying platform, in the same sense that you have SAS Miner and SAS Base. HANA has its own programming logic (based on SQL, called SQLScript), which can be used for native development of predictive models, or you could use the high-level modeling capabilities of SAP Predictive Analytics on top of HANA for users with less development experience.
  • Another tool suggestion: Domino (DominoLabs), analytical hub for sophisticated enterprises: helps organizations develop, track and deploy their analytical models faster, while facilitating best practices by keeping work centralized, sharable, and auditable.
  • Roberto Lopez: Neural Designer, a predictive analytics tool with high performance.
  • Julian GV: Experian Strategy Management, that includes Assisted Design, the analytics module integrated with the software. That's the solution I used in the past 12 months.
  • Universal Platform, UP

 