KDnuggets Analytics, Data Mining, Data Science Software Poll – Analyzed
We analyze the results of the KDnuggets Software Poll, including correlations between tools, and relationships between commercial, free, and Hadoop/Big Data tools. We identify a potential capability gap. Download the anonymized data and analyze it yourself.
The KDnuggets 15th Annual Software Poll asked:
What Analytics, Big Data, Data mining, Data Science software you used in the past 12 months for a real project?
It received a lot of attention, both from vendors and from the analytics community. However, regardless of who was in what place, the votes from over 3,000 participants give a lot of interesting data to examine - see our initial analysis below.
We prepared an anonymized version of the poll votes - scroll down to the end of the post for instructions on how to download the data. Let me know what else you find!
One interesting question is which tools are used together by the same user.
The poll question asked about usage over a period of 12 months, so these tools were not necessarily used on the same project, but data scientists, like all craftsmen, tend to learn a particular set of tools and use them repeatedly. Indeed, the correlations we find suggest that co-occurrence of tools in the same vote is meaningful.
One way to observe tool co-occurrence is with a heat map - see below. This heat map was prepared by Ran Bi from NYU.
Darker colors mean more usage. We note that R is used a lot with other tools, that SAS is used frequently with SAS Enterprise Miner (not a surprise), and that SAS is used less frequently with R.
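The counts behind a co-occurrence heat map like this can be sketched in a few lines. This is only an illustration of the idea, not the method used to prepare the heat map above; the function name `cooccurrence` and the sample votes are made up for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(votes):
    """Count how often each unordered pair of tools appears in the same vote."""
    pair_counts = Counter()
    for tools in votes:
        # set() drops duplicates; sorted() gives each pair one canonical order
        for a, b in combinations(sorted(set(tools)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

# Made-up votes for illustration only (not actual poll data)
sample_votes = [
    ["R", "SAS base", "SAS Enterprise Miner"],
    ["SAS base", "SAS Enterprise Miner"],
    ["R", "Python"],
    ["R"],
]
pairs = cooccurrence(sample_votes)
for pair, count in pairs.most_common():
    print(pair, count)
```

Each pair count would become one cell of the heat map, with larger counts drawn in darker colors.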
Commercial, Free, and Hadoop tools
Next, we examine, for each tool, what other types of tools were also used.
We note that the most "lonely" tools are:
- Alpine Data Labs, 52.3% of voters used it alone, 3.2 tools avg
- Alteryx, 50.5%, 3.7
- Predixion Software, 47.5%, 3.2
- RapidMiner, 35.1%, 3.9
- Salford SPM/CART/Random Forests/MARS/TreeNet, 31.4%, 4.3
The overall average was 29.6% alone, Ntools=3.7.
The least "lonely" tools are below (0% lonely - never used alone): Hadoop, Microsoft SQL Server, MATLAB, Unix shell/awk/gawk, Rattle, Other Hadoop/HDFS-based tools, Gnu Octave, KXEN, Pig, Orange, Spark, Pentaho, Mahout, Mathematica, and IBM Cognos.
The second group also correlated with tools that were part of larger platforms. Users of Pig, Mathematica, Mahout, Perl, Other Hadoop/HDFS-based tools, and Orange used at least 8 tools on average (vs. an overall average of 3.7).
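As a rough sketch of how the two "loneliness" measures can be computed per tool (the share of a tool's voters who used nothing else, and the average number of tools per voter), assuming each vote is available as a list of tool names. The function name `loneliness` and the sample votes are hypothetical.

```python
from collections import defaultdict

def loneliness(votes):
    """Per tool: (share of its voters who used only that tool,
    average number of tools used by its voters)."""
    n_votes = defaultdict(int)      # votes that include the tool
    n_alone = defaultdict(int)      # votes where it is the only tool
    n_tools_sum = defaultdict(int)  # total tools across those votes
    for tools in votes:
        unique = set(tools)
        for tool in unique:
            n_votes[tool] += 1
            n_tools_sum[tool] += len(unique)
            if len(unique) == 1:
                n_alone[tool] += 1
    return {
        tool: (n_alone[tool] / n_votes[tool], n_tools_sum[tool] / n_votes[tool])
        for tool in n_votes
    }

# Made-up votes for illustration only (not actual poll data)
sample_votes = [["A"], ["A", "B"], ["A", "B", "C"], ["B", "C"]]
stats = loneliness(sample_votes)
for tool, (alone, avg_n) in sorted(stats.items()):
    print(f"{tool}: {alone:.1%} alone, {avg_n:.1f} tools avg")
```

Note that the average counts the tool itself, so a tool that is always used alone would have an average of exactly 1.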
The largest average number of commercial tools was among users of these free tools:
- Orange, 5.2
- Perl, 5
- Rattle, 4.9
- Unix shell/awk/gawk, 4.8
- Gnu Octave, 4.7
while the smallest (just slightly above 1) was among users of
- Salford, BayesiaLab, Alteryx, Alpine Data Labs, and Predixion Software.
Very surprisingly, the largest number of different free tools was among users of these commercial tools:
- StatSoft Statistica
- IBM Cognos
- IBM SPSS Modeler
- SAS Enterprise Miner
- IBM SPSS Statistics
with the average number of free tools ranging from 4.4 down to 3.6. Does this suggest a gap in the capabilities of those commercial tools that users want to fill with free tools?
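The per-category averages discussed above can be sketched as follows. The sketch assumes each tool name carries a category prefix (C:, F:, L:, or H:) as in the anonymized data file, and it groups Free and Language tools together. The function name `category_averages` and the sample votes are made up for illustration.

```python
from collections import defaultdict

# Category prefixes; Free (F) and Language (L) tools are grouped together
GROUP = {"C": "commercial", "F": "free", "L": "free", "H": "hadoop"}
GROUPS = ("commercial", "free", "hadoop")

def category_averages(votes):
    """For each tool, average number of commercial, free/language,
    and Hadoop tools used by its voters (the tool itself included)."""
    totals = defaultdict(lambda: defaultdict(int))
    counts = defaultdict(int)
    for tools in votes:
        unique = set(tools)
        per_group = defaultdict(int)
        for tagged in unique:
            prefix = tagged.partition(":")[0]
            per_group[GROUP[prefix]] += 1
        for tagged in unique:
            counts[tagged] += 1
            for group, n in per_group.items():
                totals[tagged][group] += n
    return {
        tool: {g: totals[tool][g] / counts[tool] for g in GROUPS}
        for tool in counts
    }

# Made-up votes for illustration only (not actual poll data)
sample_votes = [
    ["C:SAS base", "L:R", "F:Weka"],
    ["C:SAS base", "C:IBM SPSS Statistics"],
    ["L:R", "H:Hadoop"],
]
averages = category_averages(sample_votes)
print(averages["C:SAS base"])
```

Comparing the "free" average for commercial tools against the "commercial" average for free tools is one way to look at the capability-gap question raised above.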
The table below has entries for all tools with at least 50 votes. For each tool, we show
- % alone: % of voters who only used this tool
- Avg N: average number of all tools used
- Avg N of Commercial, Free/Language, and Hadoop tools, both as colored bars and as numbers. This includes the tool itself.
[Bar chart columns not shown: Avg Number of Commercial | Free/Language | Hadoop tools]
|Tool (cnt)||% alone||Avg N|
|Microsoft Excel (847)||0.1%||5.8|
|Hadoop _Apache or other_ (416)||0.0%||7.2|
|SAS base (357)||1.4%||5.7|
|Microsoft SQL Server (344)||0.0%||6.7|
|Revolution Analytics R (300)||13.3%||6.2|
|IBM SPSS Statistics (253)||0.4%||6.5|
|SAS Enterprise Miner (235)||1.3%||6.2|
|SAP _including BusinessObjects/Sybase/Hana_ (225)||10.2%||4.7|
|Unix shell/awk/gawk (190)||0.0%||7.9|
|IBM SPSS Modeler (187)||3.2%||6.6|
|Other free analytics/data mining tools (168)||1.8%||7.0|
|Other Hadoop/HDFS-based tools (129)||0.0%||8.0|
|Gnu Octave (128)||0.0%||7.3|
|KXEN _now part of SAP_ (125)||0.0%||4.7|
|Predixion Software (122)||47.5%||3.2|
|Salford SPM/CART/Random Forests/MARS/TreeNet (118)||31.4%||4.3|
|Other languages for analytics (98)||0.0%||7.0|
|TIBCO Spotfire (91)||25.3%||4.7|
|Alpine Data Labs (88)||52.3%||3.2|
|Oracle Data Miner (72)||5.6%||6.9|
|Other paid analytics/data mining/data science software (62)||0.0%||6.7|
|IBM Cognos (60)||0.0%||7.3|
|StatSoft Statistica _now part of Dell_ (56)||14.3%||7.6|
Download Anonymized Poll Data.
The anonymized data (3285 records) is in the file www dot domain dot com/dir/kdnuggets-2014-software-poll-anon.tab (tab-delimited), where domain is the name of this site, and dir=tmp.
The fields are:
- Rand: random number between 1 and 1,000,000 (votes re-ordered by it)
- CC: 2-letter country code
- reg: region - usca (US/Canada), europe, asia, lam (Latin America), aunz, afme (Africa/Middle East)
- Ntools: number of tools
- Commercial: 1 if any commercial tool was used
- Free/Lang: 1 if any free or "language" tool was used
- Hadoop: 1 if Hadoop-related tool was used
- Tools: list of tools, separated by ";". Tool names are preceded by C:, F:, L:, or H: for Commercial, Free, Language, or Hadoop categories.
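A minimal sketch of reading the file with Python's standard library follows. It assumes the first row is a header using the field names listed above, which the description does not state explicitly, so adjust if the file differs. The function name `read_votes` and the two sample rows are made up for illustration.

```python
import csv
import io

def read_votes(handle):
    """Yield (row, tools) for each vote, where row is a dict of the fields
    and tools is a list of (category, name) pairs from the Tools field."""
    # Assumption: the file has a header row with the field names listed above
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        tools = []
        for item in row["Tools"].split(";"):
            item = item.strip()
            if not item:
                continue  # skip empty entries, e.g. from a trailing ";"
            category, _, name = item.partition(":")
            tools.append((category, name.strip()))
        yield row, tools

# Made-up rows for illustration only (not actual poll data)
sample = io.StringIO(
    "Rand\tCC\treg\tNtools\tCommercial\tFree/Lang\tHadoop\tTools\n"
    "417\tUS\tusca\t2\t1\t1\t0\tC:SAS base;L:R\n"
    "902\tDE\teurope\t1\t0\t1\t0\tF:Weka\n"
)
parsed = list(read_votes(sample))
for row, tools in parsed:
    print(row["reg"], row["Ntools"], tools)
```

To run this on the downloaded data, replace the sample with a handle on the local copy, for example `open("kdnuggets-2014-software-poll-anon.tab", newline="")`.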
Related:
- KDnuggets 15th Annual Analytics, Data Mining, Data Science Software Poll: RapidMiner Continues To Lead
- KDnuggets 2013 Software Poll: RapidMiner and R vie for first place.
- KDnuggets 2012 Poll: Analytics, Data mining, Big Data software used
- KDnuggets 2011 Poll: Data Mining/Analytic Tools Used