
Four main languages for Analytics, Data Mining, Data Science

          

A new KDnuggets Poll shows the growing dominance of four main languages for Analytics, Data Mining, and Data Science: R, SAS, Python, and SQL - at least one of which was used by 91% of poll respondents - and a decline in the popularity of other languages, except for Julia and Scala.

By Gregory Piatetsky, @kdnuggets, Aug 18, 2014.

Sometimes a high-level data science platform is not enough for a particular analytics task, and data scientists need to drop down to a lower-level statistics / programming language.

The latest KDnuggets poll asked:
What programming/statistics languages you used for an analytics / data mining / data science work in 2014?

The results show that the main 4 languages - R, Python, SAS, and SQL - hold a commanding lead: 91% of all respondents used at least one of them.

Comparing with similar KDnuggets Polls
in 2013: What programming/statistics languages you used for analytics / data mining, and in 2012, we note several changes and trends.

1. A big increase in SAS user participation in 2014, perhaps partly driven by growth and change in the composition of KDnuggets readers, and likely also by increased visibility of this poll among SAS users. SAS users had a high percentage of "lone" votes: in 2014, 58% of them said they used only SAS, compared to 26% in 2013. The fraction of "lone" votes in 2014 was 20.5% for R, 14% for Python, and only 4.5% for SQL.

2. Consolidation among top 4 languages - R, SAS, Python, and SQL. 91% of all voters have used at least one of them. Almost all other languages declined in their popularity for data mining tasks, including Java, Unix shell, MATLAB, C/C++, Perl, Octave, Ruby, Lisp, and F#.

Here is a Venn diagram that shows significant overlap between R, Python, and SQL. The percentages indicate how many voters chose that option, e.g. 20% of all voters have used both R and Python, while 10% have used R, Python, and SQL. The areas of the circles and intersections approximately correspond to the fraction of voters.

KDnuggets 2014 Poll - Overlap between languages for Analytics/Data Mining: R, Python, and SQL

Here is a similar Venn diagram showing overlap between R, Python, and SAS. We see that SAS is much more independent from R and Python, with about 2/3 of SAS users not using R or Python.

KDnuggets 2014 Poll - Overlap between languages for Analytics/Data Mining: R, Python, and SAS
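The overlap figures can be combined with the individual shares by inclusion-exclusion. The sketch below is only an illustration: it uses the rounded shares reported in this article (49.0% for R, 35.0% for Python) and assumes that the 20% figure counts everyone who used both R and Python, including those who also used SQL. The function name `union_share` is a hypothetical helper, not part of the poll analysis.

```python
# Shares of all voters in percent, as reported in the article.
share_r = 49.0       # used R
share_python = 35.0  # used Python
# Assumption: this counts every voter who used both R and Python,
# including the 10% who also used SQL.
share_both = 20.0


def union_share(a, b, both):
    """Inclusion-exclusion for two sets: share in A or B = A + B - (A and B)."""
    return a + b - both


r_or_python = union_share(share_r, share_python, share_both)
r_not_python = share_r - share_both       # used R but not Python
python_not_r = share_python - share_both  # used Python but not R

print(f"Used R or Python (or both): {r_or_python:.1f}%")
print(f"Used R but not Python:      {r_not_python:.1f}%")
print(f"Used Python but not R:      {python_not_r:.1f}%")
```

Under that assumption, roughly 64% of voters used R or Python or both.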

3. Languages with the highest growth in 2014 were
  • Julia, 316% growth, from 0.7% share in 2013 to 2.9% in 2014
  • SAS, 76% growth, from 20.8% in 2013 to 36.4% in 2014
  • Scala, 74% growth, from 2.2% in 2013 to 3.9% in 2014

 
4. The languages with the largest decline in share of usage were
  • F#, 100% decline, from 1.7% share in 2013 to zero in 2014
  • C++/C, 60% decline, from 9.3% in 2013 to 3.6% in 2014
  • GNU Octave, 57% decline, from 5.6% in 2013 to 2.4% in 2014
  • MATLAB, 50% decline, from 12.5% in 2013 to 6.3% in 2014
  • Ruby, 44% decline, from 2.2% in 2013 to 1.3% in 2014
  • Perl, 41% decline, from 4.5% in 2013 to 2.6% in 2014
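The growth and decline figures above are relative changes in share from 2013 to 2014. A minimal sketch of that calculation follows, using the rounded shares listed in this article; the function name `relative_change` is a hypothetical helper. Because the published shares are rounded to one decimal, the recomputed values can differ by a few points from the percentages quoted above (Julia, for example, comes out near 314% rather than 316%), presumably because the article worked from unrounded values.

```python
def relative_change(old_share, new_share):
    """Percent change from old_share to new_share, relative to old_share."""
    return (new_share - old_share) / old_share * 100.0


# (2013 share, 2014 share) in percent of voters, rounded as published.
shares = {
    "Julia": (0.7, 2.9),
    "SAS": (20.8, 36.4),
    "Scala": (2.2, 3.9),
    "F#": (1.7, 0.0),
    "C/C++": (9.3, 3.6),
    "GNU Octave": (5.6, 2.4),
    "MATLAB": (12.5, 6.3),
    "Ruby": (2.2, 1.3),
    "Perl": (4.5, 2.6),
}

changes = {lang: relative_change(old, new) for lang, (old, new) in shares.items()}

# Print from largest growth to largest decline.
for lang, change in sorted(changes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:12s} {change:+7.1f}%")
```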

 
Here is the table with more details:

What programming/statistics languages you used for an analytics / data mining / data science work in 2014?

| Language used (voters in 2014) | % voters in 2014 (719 total) | % voters in 2013 (713 total) | % voters in 2012 (579 total) |
|---|---|---|---|
| R (352) | 49.0% | 60.9% | 52.5% |
| SAS (262) | 36.4% | 20.8% | 19.7% |
| Python (252) | 35.0% | 38.8% | 36.1% |
| SQL (220) | 30.6% | 36.6% | 32.1% |
| Java (89) | 12.4% | 16.5% | 21.2% |
| Unix shell/awk/sed (63) | 8.8% | 11.1% | 14.7% |
| Pig Latin/ Hive/ other Hadoop-based languages (61) | 8.5% | 8.0% | 6.7% |
| SPSS (58) | 8.1% | not asked | not asked |
| MATLAB (45) | 6.3% | 12.5% | 13.1% |
| Scala (28) | 3.9% | 2.2% | 2.4% |
| C/C++ (26) | 3.6% | 9.3% | 14.3% |
| Julia (21) | 2.9% | 0.7% | 0.3% |
| Other low-level languages (20) | 2.8% | 5.9% | 11.4% |
| Perl (19) | 2.6% | 4.5% | 9.0% |
| GNU Octave (17) | 2.4% | 5.6% | 5.9% |
| Ruby (9) | 1.3% | 2.2% | 3.8% |
| Lisp/Clojure (5) | 0.7% | 1.0% | 4.3% |
| F# (0) | 0% | 1.7% | not asked |
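Each 2014 percentage in the table is the number of voters for that language divided by the 719 total voters. The sketch below reproduces this for the top four languages using the counts given in the table; the names `votes_2014` and `share` are hypothetical and chosen only for illustration.

```python
TOTAL_VOTERS_2014 = 719

# Voter counts for the top four languages, taken from the table.
votes_2014 = {"R": 352, "SAS": 262, "Python": 252, "SQL": 220}


def share(count, total=TOTAL_VOTERS_2014):
    """Share of all voters in percent, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


for lang, count in votes_2014.items():
    print(f"{lang:8s} {count:4d} voters  {share(count):5.1f}%")

# Respondents could select several languages, so the counts add up to
# more than the number of voters and the shares add up to more than 100%.
print(sum(votes_2014.values()), "selections from", TOTAL_VOTERS_2014, "voters")
```

The counts for these four languages alone sum to more than 719, which is consistent with the overlap between languages discussed earlier.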


Among other programming languages, William Dwinnell mentioned Compiled BASIC (PowerBASIC).

Regional participation was
  • US/Canada: 51.6%
  • Europe: 26.7%
  • Asia: 13.3%
  • Latin America: 3.7%
  • Africa/Middle East: 3.5%
  • AU/NZ: 2.0%

 


This is similar to 2013, but with more participation from Asia and Africa/Middle East (led by Israel and Turkey), and less from Latin America (main decline from Brazil, perhaps still depressed from the World Cup loss).