Poll Results: Data Types/Sources Analyzed

Trends in data sources for data mining include: table data dominates, followed by time series and text; audio and JSON grow in popularity, while itemsets decline; 70% access DB engines, but only 20% access NoSQL stores; Hadoop and MongoDB are used more for text; Europe is lagging in NoSQL usage.


By Gregory Piatetsky, May 17, 2014.

A recent KDnuggets Poll asked:

What data types/sources you analyzed in the past 12 months?

Multiple choices were allowed.

The most popular data types, not surprisingly, were
- table data (fixed n. columns), 77%
- time series, 48%
- text, 41%
- itemsets / transactions, 27%
- location/geo/mobile, 20%

Compared with a similar 2012 KDnuggets Poll: Data types analyzed/mined, the data types/sources with the highest increase in usage were

- music / audio: 143% up, from 1.1% in 2012 to 2.7% in 2014
- JSON: 95% up, from 8.7% in 2012 to 17% in 2014
- XML: 51% up, from 9.3% in 2012 to 14% in 2014

The largest declines in usage were for

- anonymized data: 42% down, from 24% in 2012 to 14% in 2014
- itemsets / transactions: 19% down, from 33% to 27%
- images / video: 18% down, from 6.0% to 4.9%
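The "up" and "down" percentages here are relative changes between the 2012 and 2014 usage rates. A minimal sketch of that arithmetic, using the rounded rates quoted in this article (the function name `relative_change` is only illustrative). Because the inputs are rounded, a result can differ by a point or two from the quoted figures, which were presumably computed from unrounded rates.

```python
def relative_change(rate_2012, rate_2014):
    """Relative change, in percent, from the 2012 rate to the 2014 rate."""
    return (rate_2014 - rate_2012) / rate_2012 * 100.0


# Rounded usage rates (% of respondents) from the article: (2012, 2014).
rates = {
    "JSON": (8.7, 17.0),
    "XML": (9.3, 14.0),
    "anonymized data": (24.0, 14.0),
    "images / video": (6.0, 4.9),
}

for name, (r2012, r2014) in rates.items():
    # A positive value means usage went up, a negative value means it went down.
    print(f"{name}: {relative_change(r2012, r2014):+.0f}%")
```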

In the 2014 poll we also added new options for accessing data from a database engine, taking the top 7 database engines from db-engines.com/en/ranking.

Overall, 70% of all respondents accessed data from some database, but only about 20% accessed NoSQL databases (Hadoop, MongoDB or another DB engine).

Here is a table with full results.

What data types/sources you analyzed in the past 12 months? [264 votes total]	% users in 2014	% users in 2012
table data (fixed n. columns) (203)	77%	73%
time series (126)	48%	44%
text (108)	41%	39%
itemsets / transactions (70)	27%	33%
location/geo/mobile (52)	20%	19%
Twitter (47)	18%	NA (not asked in 2012)
JSON (45)	17%	8.7%
web content (42)	16%	13%
social network (41)	16%	18%
anonymized data (37)	14%	24%
XML (37)	14%	9.3%
web clickstream/web log (33)	12.5%	9.3%
email (26)	10%	11%
images / video (13)	4.9%	6.0%
music / audio (7)	2.7%	1.1%
Other (19)	7.2%	8.2%

Here is a breakdown of database engine sources by popularity, with the rank and popularity score (rounded) from the db-engines.com site for May 2014.

Data Source	% Used	db-engines Rank (score)
Microsoft SQL Server (84)	31.8%	3 (1208)
Oracle (65)	24.6%	1 (1503)
MySQL (60)	22.7%	2 (1309)
another database engine (41)	15.5%	na
Hadoop/HDFS (34)	12.9%	na
Microsoft Access (31)	11.7%	7 (145)
PostgreSQL (25)	9.5%	4 (241)
DB2 (19)	7.2%	6 (186)
MongoDB (13)	4.9%	5 (225)

We note that the popularity ranking of database engines for data mining, as shown in this poll, is NOT the same as the db-engines ranking, with SQL Server being an especially popular source for data analysis.
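To make the difference between the two rankings concrete, here is a small sketch that ranks the seven db-engines-listed engines by poll votes and compares that with their db-engines rank. The vote counts and ranks are copied from the table above; the variable names and the "shift" measure are illustrative choices, not something the poll itself reports.

```python
# (engine, poll votes, db-engines rank) from the table above. Entries with
# no db-engines rank in the table ("na") are left out of the comparison.
engines = [
    ("Microsoft SQL Server", 84, 3),
    ("Oracle", 65, 1),
    ("MySQL", 60, 2),
    ("Microsoft Access", 31, 7),
    ("PostgreSQL", 25, 4),
    ("DB2", 19, 6),
    ("MongoDB", 13, 5),
]

# Rank engines by poll votes, 1 = most used by poll respondents.
by_votes = sorted(engines, key=lambda e: e[1], reverse=True)
poll_rank = {name: i + 1 for i, (name, _, _) in enumerate(by_votes)}

# Positive shift: the engine ranks higher in the poll than on db-engines.
rank_shift = {name: db_rank - poll_rank[name] for name, _, db_rank in engines}

for name, _, db_rank in by_votes:
    print(f"{name}: poll rank {poll_rank[name]}, "
          f"db-engines rank {db_rank}, shift {rank_shift[name]:+d}")
```

With these figures, SQL Server and Microsoft Access move up relative to their db-engines rank, while MongoDB moves down.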

We also analyzed co-occurrence of popular data types with different types of databases, and measured "affinity" of a database engine to data type as ratio of how frequently this data type was used in conjunction with that database, divided by average % usage of that data type. The usage of a particular data type and a database engine by the same respondent within a year does not mean that that DB was used for analyzing this data type, but we found some interesting and strong correlations.

The database engines with the most affinity for text data were MongoDB (1.88) and Hadoop (1.65), while for time series, the database engine with the most affinity was PostgreSQL (1.65).
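One way to read the affinity measure is as the share of an engine's users who also used a data type, divided by the share of all respondents who used that data type, so that values above 1 mean the data type is over-represented among that engine's users. The sketch below follows that reading. It is an interpretation of the definition above, not the exact computation behind the reported numbers, and the `responses` list is made-up illustration data, not the poll responses.

```python
def affinity(responses, engine, data_type):
    """Share of `engine` users who also used `data_type`, divided by the
    share of all respondents who used `data_type`."""
    engine_users = [r for r in responses if engine in r]
    if not engine_users:
        raise ValueError(f"no respondent used {engine!r}")
    overall = sum(data_type in r for r in responses) / len(responses)
    if overall == 0:
        raise ValueError(f"no respondent used {data_type!r}")
    among_engine = sum(data_type in r for r in engine_users) / len(engine_users)
    return among_engine / overall


# Hypothetical responses (NOT the poll data): each set holds the data types
# and database engines one respondent selected.
responses = [
    {"text", "MongoDB"},
    {"text", "table data", "MongoDB"},
    {"table data", "Oracle"},
    {"time series", "table data", "Oracle"},
    {"text", "Oracle"},
    {"table data"},
    {"time series", "PostgreSQL"},
    {"table data", "MySQL"},
]

print(f"MongoDB / text: {affinity(responses, 'MongoDB', 'text'):.2f}")
print(f"Oracle / text:  {affinity(responses, 'Oracle', 'text'):.2f}")
```

As the text notes, a high value only says that the same respondents reported both the engine and the data type; it does not show that the engine was used to analyze that data.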

We also analyzed the regional breakdown, including total number of data types used, and database sources.

The table below shows the breakdown by region, with columns:

% Participants: % of participants from that region
Ntypes: average number of different data types/sources used
%from DB engine: % who used data from a database engine or Hadoop
%from NoSQL: % who used data from a NoSQL engine
%text: % who used text data

Region	% Participants	Ntypes	%from DB engine	%from NoSQL	%text
US/Canada	47%	5.4	78%	20%	44%
Europe	26%	4.5	66%	10%	38%
Asia	13%	3.8	41%	15%	41%
Latin America	8.7%	4.4	74%	17%	35%
Africa/Middle East	3.8%	4.5	70%	20%	30%
Australia/NZ	2.3%	4.2	83%	0%	50%
ALL	100%	4.8	70%	16%	41%

We note that the US/Canada region is leading in the number of different types used, in usage of all database engines, and in usage of NoSQL engines. Europe is lagging in using NoSQL engines, while Asia is lagging in usage of data from database engines in general (but not so much with NoSQL engines). Participation from Australia/NZ and Africa/Middle East is too low in this poll to draw inferences.

Related:

2014 Poll: What data types/sources you analyzed in the past 12 months?
2014 Poll: Salary/Income for Analytics, Data Mining, Data Science professionals Poll
2014 Poll: How much did you use text analytics / text mining in the past 12 months?
2013 Poll: Largest Dataset Analyzed / Data Mined
2012 Poll: Data types analyzed/mined
