Poll Results: Data Types/Sources Analyzed

Trends in data sources for data mining: table data dominates, followed by time series and text; audio and JSON are growing in popularity, while itemsets decline; 70% of respondents access DB engines, but only 20% access NoSQL stores; Hadoop and MongoDB are used more for text; Europe is lagging in NoSQL usage.



By Gregory Piatetsky, May 17, 2014.

A recent KDnuggets Poll asked:

What data types/sources you analyzed in the past 12 months?

Multiple choices were allowed.

The most popular data types, not surprisingly, were
- table data (fixed n. columns), 77%
- time series, 48%
- text, 41%
- itemsets / transactions, 27%
- location/geo/mobile, 20%
 
Comparing with a similar 2012 KDnuggets Poll, "Data types analyzed/mined", we see that the data types/sources with the highest increase in usage were
  • music / audio: 143% up, from 1.1% rate in 2012 to 2.7% in 2014
  • JSON: 95% up, from 8.7% in 2012 to 17%
  • XML: 51% up, from 9.3% in 2012 to 14%.

 
The largest declines in usage were for
  • anonymized data: 42% down, from 24% in 2012 to 14% in 2014
  • itemsets / transactions: 19% down, from 33% to 27%
  • images / video: 18% down, from 6.0% to 4.9%
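
The "up" and "down" figures above are relative changes between the 2012 and 2014 usage rates. A minimal Python sketch of that arithmetic follows, using only the rounded percentages quoted in this article; the function name `relative_change` and the dictionary layout are illustrative choices. Because the inputs are already rounded, a result can differ by a point or two from the figures quoted above, which were presumably computed from unrounded rates.

```python
def relative_change(old_pct, new_pct):
    """Percent change from old_pct to new_pct, relative to old_pct."""
    return (new_pct - old_pct) / old_pct * 100.0

# (2012 %, 2014 %) as published in the article, already rounded
rates = {
    "music / audio": (1.1, 2.7),
    "JSON": (8.7, 17.0),
    "XML": (9.3, 14.0),
    "anonymized data": (24.0, 14.0),
    "itemsets / transactions": (33.0, 27.0),
    "images / video": (6.0, 4.9),
}

changes = {name: relative_change(old, new) for name, (old, new) in rates.items()}

# Largest increase first, largest decline last
for name, change in sorted(changes.items(), key=lambda kv: kv[1], reverse=True):
    direction = "up" if change > 0 else "down"
    print(f"{name}: {abs(change):.0f}% {direction}")
```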

 
We also added new options to the 2014 poll for accessing data from a database engine, taking the top 7 database engines from db-engines.com/en/ranking.

Overall, 70% of all respondents have accessed data from some database, but only about 20% accessed NoSQL databases (Hadoop, MongoDB, or another DB engine).

Here is a table with full results.
What data types/sources you analyzed in the past 12 months? [264 votes total]

| Data type/source (votes) | % users in 2014 | % users in 2012 |
|---|---|---|
| table data (fixed n. columns) (203) | 77% | 73% |
| time series (126) | 48% | 44% |
| text (108) | 41% | 39% |
| itemsets / transactions (70) | 27% | 33% |
| location/geo/mobile (52) | 20% | 19% |
| Twitter (47) | 18% | NA (not asked in 2012) |
| JSON (45) | 17% | 8.7% |
| web content (42) | 16% | 13% |
| social network (41) | 16% | 18% |
| anonymized data (37) | 14% | 24% |
| XML (37) | 14% | 9.3% |
| web clickstream/web log (33) | 12.5% | 9.3% |
| email (26) | 10% | 11% |
| images / video (13) | 4.9% | 6.0% |
| music / audio (7) | 2.7% | 1.1% |
| Other (19) | 7.2% | 8.2% |
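
The 2014 percentages in this table are each option's vote count divided by the 264 respondents. A short Python sketch of that calculation, for a few rows, is given here as a sanity check; the names `TOTAL_VOTES`, `votes`, and `share` are illustrative. Since multiple choices were allowed, the shares are not expected to sum to 100%.

```python
TOTAL_VOTES = 264  # respondents in the 2014 poll, as stated in the article

# Vote counts copied from the results table (a subset of rows)
votes = {
    "table data (fixed n. columns)": 203,
    "time series": 126,
    "text": 108,
    "JSON": 45,
    "music / audio": 7,
}

def share(count, total=TOTAL_VOTES):
    """Percentage of respondents who selected an option."""
    return 100.0 * count / total

shares = {name: share(n) for name, n in votes.items()}

for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```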


Here is a breakdown of database engine source by popularity, with the rank and popularity score from db-engines.com site for May 2014, rounded.

| Data Source (votes) | % Used | db-engines Rank (score) |
|---|---|---|
| Microsoft SQL Server (84) | 31.8% | 3 (1208) |
| Oracle (65) | 24.6% | 1 (1503) |
| MySQL (60) | 22.7% | 2 (1309) |
| another database engine (41) | 15.5% | na |
| Hadoop/HDFS (34) | 12.9% | na |
| Microsoft Access (31) | 11.7% | 7 (145) |
| PostgreSQL (25) | 9.5% | 4 (241) |
| DB2 (19) | 7.2% | 6 (186) |
| MongoDB (13) | 4.9% | 5 (225) |


We note that the popularity ranking of database engines for data mining, as shown in this poll, is NOT the same as the db-engines ranking, with SQL Server being an especially popular source for data analysis.
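
One way to make this difference concrete is to rank the seven engines that have a db-engines rank by their poll usage and compare the two orderings. The Python sketch below uses only the numbers from the breakdown above; the names `engines`, `poll_rank`, and `rank_shift` are illustrative, and a positive shift means the engine ranks higher in this poll than on db-engines.

```python
# (engine, % of poll respondents, db-engines rank), copied from the article.
# Hadoop/HDFS and "another database engine" are left out: no db-engines rank.
engines = [
    ("Microsoft SQL Server", 31.8, 3),
    ("Oracle", 24.6, 1),
    ("MySQL", 22.7, 2),
    ("Microsoft Access", 11.7, 7),
    ("PostgreSQL", 9.5, 4),
    ("DB2", 7.2, 6),
    ("MongoDB", 4.9, 5),
]

# Rank engines by poll usage, most used first (rank 1)
by_poll = sorted(engines, key=lambda e: e[1], reverse=True)
poll_rank = {name: i + 1 for i, (name, _, _) in enumerate(by_poll)}

# db-engines rank minus poll rank: positive means more popular in the poll
rank_shift = {name: dbe_rank - poll_rank[name] for name, _, dbe_rank in engines}

for name, shift in sorted(rank_shift.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: poll rank {poll_rank[name]}, shift {shift:+d}")
```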

We also analyzed the co-occurrence of popular data types with different types of databases, and measured the "affinity" of a database engine to a data type as the ratio of how frequently this data type was used in conjunction with that database to the average % usage of that data type. An affinity above 1 means that users of that database analyzed the data type more often than respondents overall. The usage of a particular data type and a database engine by the same respondent within a year does not mean that the DB was used for analyzing this data type, but we found some interesting and strong correlations.
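
The affinity measure can be sketched in Python as follows. This is one reading of the definition above, and it is an assumption: the share of an engine's users who also used a data type, divided by the share of all respondents who used that data type. The respondents listed are made up for illustration and are not the poll data; the function name `affinity` and the record layout are likewise hypothetical.

```python
def affinity(respondents, engine, data_type):
    """Share of `engine` users who also used `data_type`, divided by
    the share of all respondents who used `data_type`."""
    engine_users = [r for r in respondents if engine in r["engines"]]
    if not engine_users:
        raise ValueError(f"no respondents used {engine}")
    overall = sum(data_type in r["types"] for r in respondents) / len(respondents)
    if overall == 0:
        raise ValueError(f"no respondents used {data_type}")
    among_engine = sum(data_type in r["types"] for r in engine_users) / len(engine_users)
    return among_engine / overall

# Hypothetical respondents, for illustration only (NOT the poll data)
respondents = [
    {"engines": {"MongoDB"}, "types": {"text", "JSON"}},
    {"engines": {"MongoDB", "MySQL"}, "types": {"text"}},
    {"engines": {"Oracle"}, "types": {"table data"}},
    {"engines": {"MySQL"}, "types": {"table data", "time series"}},
    {"engines": set(), "types": {"table data"}},
]

mongo_text = affinity(respondents, "MongoDB", "text")
mysql_text = affinity(respondents, "MySQL", "text")
print(f"MongoDB/text affinity: {mongo_text:.2f}")
print(f"MySQL/text affinity: {mysql_text:.2f}")
```

In this toy data, every MongoDB user also used text while only 2 of 5 respondents overall did, so the affinity is well above 1.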

The database engines with the most affinity for text data were MongoDB (1.88) and Hadoop (1.65), while for time series the engine with the most affinity was PostgreSQL (1.65).

We also analyzed the regional breakdown, including total number of data types used, and database sources.

The table below shows the breakdown by region, with columns:
  • % Participants: % of participants from that region
  • Ntypes: N. of different data sources used
  • %from DB: % who used data from a database engine or Hadoop
  • %from NoSQL: % who used data from a NoSQL engine
  • %text: % who used text data

 

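| Region | % Participants | Ntypes | %from DB engine | %from NoSQL | %text |
|---|---|---|---|---|---|
| US/Canada | 47% | 5.4 | 78% | 20% | 44% |
| Europe | 26% | 4.5 | 66% | 10% | 38% |
| Asia | 13% | 3.8 | 41% | 15% | 41% |
| Latin America | 8.7% | 4.4 | 74% | 17% | 35% |
| Africa/Middle East | 3.8% | 4.5 | 70% | 20% | 30% |
| Australia/NZ | 2.3% | 4.2 | 83% | 0% | 50% |
| ALL | 100% | 4.8 | 70% | 16% | 41% |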


We note that the US/Canada region is leading in the number of different data types used, in usage of database engines overall, and in usage of NoSQL engines. Europe is lagging in using NoSQL engines, while Asia is lagging in usage of data from database engines in general (but not so much with NoSQL engines). Participation from Australia/NZ and Africa/Middle East is too low in this poll to draw inferences.
