Top Languages for analytics, data mining, data science

The most popular languages continue to be R (used by 61% of KDnuggets readers), Python (39%), and SQL (37%). SAS is stable at around 20%. The highest growth was for Pig/Hive/Hadoop-based languages, R, and SQL, while Perl, C/C++, and Unix tools declined. We also find a small affinity between R and Python users.



By Gregory Piatetsky, Aug 27, 2013.

Previous KDnuggets polls looked at high-level Analytics and Data mining software, but sometimes a full-power programming language is needed. That was the focus of the latest KDnuggets Poll, which asked:

What programming/statistics languages you used for an analytics / data mining / data science work in 2013?

Based on a very high response of over 700 voters, the most popular languages continue to be R (now used by 61% of respondents), Python (39%), and SQL (37%). On average, respondents used 2.3 languages.

For trends, we compared the 2013 results with those of similar polls from 2012 and 2011.

The language with the highest relative growth (2013 vs 2012) was Julia, which doubled in popularity but was still used by only 0.7% of respondents in 2013.

Among more common languages, the largest relative increases in share of usage from 2012 to 2013 were for

  • Pig Latin/Hive/other Hadoop-based languages, 19% growth, from 6.7% in 2012 to 8.0% in 2013
  • R, 16% growth
  • SQL, 14% growth (perhaps the result of increasing number of SQL interfaces to Hadoop and other Big Data systems?)

The languages with the largest declines in share of usage were

  • Lisp/Clojure, 77% down
  • Perl, 50% down
  • Ruby, 41% down
  • C/C++, 35% down
  • Unix shell/awk/sed, 25% down
  • Java, 22% down
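
The relative changes in the two lists above can be reproduced, approximately, from the rounded 2012 and 2013 shares in the full results table further down. The sketch below is an illustration rather than the poll's own calculation: the helper name `relative_change` is made up here, and because it starts from rounded percentages, two results land one point away from the published figures (Ruby comes out at 42% down and Unix shell/awk/sed at 24% down).

```python
# Shares of usage (% of voters) from the full results table: (2012, 2013).
shares = {
    "Pig Latin/Hive/other Hadoop-based languages": (6.7, 8.0),
    "R": (52.5, 60.9),
    "SQL": (32.1, 36.6),
    "Lisp/Clojure": (4.3, 1.0),
    "Perl": (9.0, 4.5),
    "Ruby": (3.8, 2.2),
    "C/C++": (14.3, 9.3),
    "Unix shell/awk/sed": (14.7, 11.1),
    "Java": (21.2, 16.5),
}


def relative_change(old, new):
    """Relative change in share, in percent (hypothetical helper)."""
    return (new - old) / old * 100.0


changes = {name: relative_change(old, new) for name, (old, new) in shares.items()}

for name, change in changes.items():
    direction = "growth" if change > 0 else "down"
    print(f"{name}: {abs(change):.0f}% {direction}")
```

Note that these are relative changes in share, not percentage-point differences: R gained 8.4 points (52.5% to 60.9%), which is 16% of its 2012 share.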

Is there an affinity between R and Python? Yes, people who use R are about 13% more likely to use Python than the overall population. Here are the languages more likely to be used with R:

  • Julia, 64% more
  • Lisp/Clojure, 41% more
  • GNU Octave, 27% more
  • Pig Latin/Hive/other Hadoop-based languages, 27% more
  • Unix shell/awk/sed, 23% more
  • Python, 13% more
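
The "more likely" figures above read like a lift measure: the share of R users who also use a language, divided by that language's share among all voters, minus one. The per-respondent answers are not given in this article, so the sketch below runs on ten made-up respondents. The data, the `lift` name, and the formula itself are assumptions for illustration, not the author's confirmed method.

```python
# Ten invented respondents, each represented by the set of languages they use.
# This is toy data, NOT the poll's raw responses.
respondents = [
    {"R", "Python"},
    {"R", "Python"},
    {"R", "Python"},
    {"R", "SQL"},
    {"R"},
    {"Python", "SQL"},
    {"SQL"},
    {"SAS"},
    {"R", "SQL"},
    {"SAS", "SQL"},
]


def lift(data, base, other):
    """How much more likely users of `base` are to use `other`
    than respondents overall, as a fraction (0.25 means 25% more).
    Assumes both languages appear at least once in `data`."""
    n_total = len(data)
    n_base = sum(1 for langs in data if base in langs)
    n_other = sum(1 for langs in data if other in langs)
    n_both = sum(1 for langs in data if base in langs and other in langs)
    # P(other | base) / P(other) - 1, rearranged into a single division.
    return (n_both * n_total) / (n_base * n_other) - 1


r_python = lift(respondents, "R", "Python")
print(f"R users are {r_python:.0%} more likely to use Python")
```

In the toy data, 3 of the 6 R users also use Python (50%), against 4 of 10 respondents overall (40%), which gives a lift of 25%.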

Here are the full results:

What programming/statistics languages you used for an analytics / data mining / data science work in 2013? [713 votes total]

| Language (voters in 2013) | % users in 2013 | % users in 2012 | % users in 2011 |
|---|---|---|---|
| R (434) | 60.9% | 52.5% | 45.1% |
| Python (277) | 38.8% | 36.1% | 24.6% |
| SQL (261) | 36.6% | 32.1% | 32.3% |
| SAS (148) | 20.8% | 19.7% | 21.2% |
| Java (118) | 16.5% | 21.2% | 24.4% |
| MATLAB (89) | 12.5% | 13.1% | 14.6% |
| High-level data mining suite (80) | 11.2% | not asked in 2012 | |
| Unix shell/awk/sed (79) | 11.1% | 14.7% | |
| C/C++ (66) | 9.3% | 14.3% | |
| Pig Latin/Hive/other Hadoop-based languages (57) | 8.0% | 6.7% | |
| Other low-level language (42) | 5.9% | 11.4% | |
| GNU Octave (40) | 5.6% | 5.9% | |
| Perl (32) | 4.5% | 9.0% | |
| Ruby (16) | 2.2% | 3.8% | |
| Scala (16) | 2.2% | 2.4% | |
| F# (12) | 1.7% | not asked in 2012 | |
| Lisp/Clojure (7) | 1.0% | 4.3% | |
| Julia (5) | 0.7% | 0.3% | |
| None (2) | 0.3% | 0.7% | |
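
Each 2013 percentage above is a language's vote count divided by the 713 total voters. Because voters could pick several languages, the shares add up to well over 100%. A small check on a few of the rows, with `share` being a helper name chosen here for illustration:

```python
TOTAL_VOTERS = 713  # total votes reported for the 2013 poll

# Vote counts for a few rows of the 2013 column.
votes_2013 = {"R": 434, "Python": 277, "SQL": 261, "SAS": 148, "Julia": 5}


def share(votes, total=TOTAL_VOTERS):
    """Percentage of voters who selected a language, rounded to one decimal."""
    return round(votes / total * 100, 1)


shares_2013 = {name: share(count) for name, count in votes_2013.items()}

for name, pct in shares_2013.items():
    print(f"{name}: {pct}%")
```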

Comments
A number of comments, such as the one below, pointed out that SPSS also has its own language, similar to SAS; we will include it in the next poll.

Ralph Winters, SPSS Language
It seems odd to exclude SPSS based upon a definition of what is or what is not a language. Especially for a language which has such legacy roots, and is backed by IBM. I could argue that both Matlab and R are not true programming languages, and SAS, as flexible as it is, I would not consider a standardized programming language either.

Regional participation was

  • US/Canada: 50.8%
  • Europe: 25.7%
  • Asia: 11.8%
  • Latin America: 6.7%
  • AU/NZ: 3.2%
  • Africa/Middle East: 1.5%