Poll Results: Data Types/Sources Analyzed
Trends in data sources for data mining include: table data dominates, followed by time series and text; audio and JSON grow in popularity, while itemsets decline; 70% access DB engines, but only 20% access NoSQL stores; Hadoop and MongoDB are used more for text; Europe is lagging in NoSQL usage.
By Gregory Piatetsky, May 17, 2014.
A recent KDnuggets Poll asked:
What data types/sources you analyzed in the past 12 months?
Multiple choices were allowed.
The most popular data types, not surprisingly, were
- table data (fixed n. columns), 77%
- time series, 48%
- text, 41%
- itemsets / transactions, 27%
- location/geo/mobile, 20%
Comparing with a similar 2012 KDnuggets Poll: Data types analyzed/mined, we see that the data types/sources with the highest increase in usage were
- music / audio: 143% up, from 1.1% rate in 2012 to 2.7% in 2014
- JSON: 95% up, from 8.7% in 2012 to 17%
- XML: 51% up, from 9.3% in 2012 to 14%.
The largest declines in usage were for
- anonymized data: 42% down, from 24% in 2012 to 14% in 2014
- itemsets / transactions: 19% down, from 33% to 27%
- images / video: 18% down, from 6.0% to 4.9%
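The "up" and "down" figures above are relative changes in the share of respondents. As a rough sketch, they can be recomputed from the rounded shares quoted in the lists; the function name `relative_change` is only illustrative, and because the inputs are rounded, the results can differ by a point or two from the quoted figures (which were presumably computed from unrounded shares).

```python
# Relative change between two poll shares, in percent.
def relative_change(old_pct: float, new_pct: float) -> float:
    return (new_pct - old_pct) / old_pct * 100.0

# (2012 share %, 2014 share %) as quoted in the lists above (rounded values).
shares = {
    "music / audio": (1.1, 2.7),
    "JSON": (8.7, 17.0),
    "XML": (9.3, 14.0),
    "anonymized data": (24.0, 14.0),
    "itemsets / transactions": (33.0, 27.0),
    "images / video": (6.0, 4.9),
}

changes = {name: relative_change(old, new) for name, (old, new) in shares.items()}

# Positive values are increases, negative values are declines.
for name, change in changes.items():
    print(f"{name}: {change:+.0f}%")
```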
We also added new options in the 2014 poll for accessing data from a database engine, and took the top 7 database engines from db-engines.com/en/ranking.
Overall, 70% of all respondents have accessed data from some database, but only about 20% accessed NoSQL databases (Hadoop, MongoDB or another DB engine).
Here is a table with full results.
What data types/sources you analyzed in the past 12 months? [264 votes total]
Data type/source | Votes |
---|---|
table data (fixed n. columns) | 203 |
time series | 126 |
text | 108 |
itemsets / transactions | 70 |
location/geo/mobile | 52 |
Twitter | 47 |
JSON | 45 |
web content | 42 |
social network | 41 |
anonymized data | 37 |
XML | 37 |
web clickstream/web log | 33 |
email | 26 |
images / video | 13 |
music / audio | 7 |
Other | 19 |
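The percentages quoted earlier follow from these vote counts divided by the 264 respondents. A minimal sketch of that conversion, using only the counts from the full results (the variable names are illustrative):

```python
TOTAL_VOTERS = 264  # "[264 votes total]" in the full results

# Vote counts per data type/source, copied from the full results.
votes = {
    "table data (fixed n. columns)": 203,
    "time series": 126,
    "text": 108,
    "itemsets / transactions": 70,
    "location/geo/mobile": 52,
    "Twitter": 47,
    "JSON": 45,
    "web content": 42,
    "social network": 41,
    "anonymized data": 37,
    "XML": 37,
    "web clickstream/web log": 33,
    "email": 26,
    "images / video": 13,
    "music / audio": 7,
    "Other": 19,
}

# Share of respondents who selected each option (multiple choices were
# allowed, so the shares add up to more than 100%).
shares = {name: 100.0 * count / TOTAL_VOTERS for name, count in votes.items()}

for name, pct in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {pct:.1f}%")
```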
Here is a breakdown of database engine source by popularity, with the rank and popularity score from db-engines.com site for May 2014, rounded.
Data Source | Votes | db-engines Rank (score) |
---|---|---|
Microsoft SQL Server | 84 | 3 (1208) |
Oracle | 65 | 1 (1503) |
MySQL | 60 | 2 (1309) |
another database engine | 41 | na |
Hadoop/HDFS | 34 | na |
Microsoft Access | 31 | 7 (145) |
PostgreSQL | 25 | 4 (241) |
DB2 | 19 | 6 (186) |
MongoDB | 13 | 5 (225) |
We note that the popularity ranking of database engines for data mining, as shown in this poll, is NOT the same as the db-engines ranking, with SQL Server being an especially popular source for data analysis.
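One way to make the difference between the two rankings concrete is to rank the seven db-engines entries by their poll numbers and compare positions. This is only a sketch: it assumes the numbers in parentheses after each engine name are vote counts, and the names `poll_rank` and `shift` are illustrative.

```python
# engine -> (poll number from the breakdown above, db-engines rank, May 2014)
# Assumption: the poll number is the count of respondents selecting the engine.
engines = {
    "Microsoft SQL Server": (84, 3),
    "Oracle": (65, 1),
    "MySQL": (60, 2),
    "Microsoft Access": (31, 7),
    "PostgreSQL": (25, 4),
    "DB2": (19, 6),
    "MongoDB": (13, 5),
}

# Rank engines by poll number, highest first (1 = most used in the poll).
by_poll = sorted(engines, key=lambda name: engines[name][0], reverse=True)
poll_rank = {name: position for position, name in enumerate(by_poll, start=1)}

# Positive shift: the engine ranks higher in the poll than on db-engines.
shift = {name: engines[name][1] - poll_rank[name] for name in engines}

for name in by_poll:
    print(f"{name}: poll #{poll_rank[name]}, db-engines #{engines[name][1]}, "
          f"shift {shift[name]:+d}")
```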
We also analyzed the co-occurrence of popular data types with different types of databases, and measured the "affinity" of a database engine to a data type as the ratio of how frequently that data type was used in conjunction with that database to the average % usage of that data type. An affinity above 1 therefore means the data type was reported more often by users of that engine than by respondents overall. The usage of a particular data type and a database engine by the same respondent within a year does not mean that the DB was used for analyzing this data type, but we found some interesting and strong correlations.
The database engines with the most affinity for text data were MongoDB (1.88) and Hadoop (1.65), while for time series the engine with the most affinity was Postgres (1.65).
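The affinity measure described above can be sketched as follows. This is one reading of the definition, not the code used for the poll: each respondent is modeled as a set of selected options, and the `responses` list is made-up toy data, not poll data.

```python
def affinity(responses, engine, data_type):
    """Rate of `data_type` among users of `engine`, divided by its overall rate.

    `responses` is a list of sets, one per respondent, holding the options
    that respondent selected. Returns NaN when the ratio is undefined.
    """
    engine_users = [r for r in responses if engine in r]
    if not responses or not engine_users:
        return float("nan")
    overall_rate = sum(data_type in r for r in responses) / len(responses)
    if overall_rate == 0:
        return float("nan")
    engine_rate = sum(data_type in r for r in engine_users) / len(engine_users)
    return engine_rate / overall_rate


# Hypothetical respondents, for illustration only.
responses = [
    {"text", "MongoDB"},
    {"text", "table data", "MongoDB"},
    {"table data", "Oracle"},
    {"time series", "table data", "Oracle"},
    {"text", "time series"},
    {"table data"},
]

mongo_text = affinity(responses, "MongoDB", "text")
oracle_text = affinity(responses, "Oracle", "text")
print(mongo_text, oracle_text)
```

In the toy data, half of all respondents use text but every MongoDB user does, so the affinity is above 1; no Oracle user reports text, so that affinity is 0.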
We also analyzed the regional breakdown, including total number of data types used, and database sources.
The table below shows the breakdown by region, with columns:
- % Participants: % of participants from that region
- Ntypes: number of different data sources used
- %from DB engine: % who used data from a database engine or Hadoop
- %from NoSQL: % who used data from a NoSQL engine
- %text: % who used text data
Region | % Participants | Ntypes | %from DB engine | %from NoSQL | %text |
---|---|---|---|---|---|
US/Canada | 47% | 5.4 | 78% | 20% | 44% |
Europe | 26% | 4.5 | 66% | 10% | 38% |
Asia | 13% | 3.8 | 41% | 15% | 41% |
Latin America | 8.7% | 4.4 | 74% | 17% | 35% |
Africa/Middle East | 3.8% | 4.5 | 70% | 20% | 30% |
Australia/NZ | 2.3% | 4.2 | 83% | 0% | 50% |
ALL | 100% | 4.8 | 70% | 16% | 41% |
We note that the US/Canada region is leading in the number of different data types used, in usage of all database engines, and in NoSQL engines. Europe is lagging in using NoSQL engines, while Asia is lagging in usage of data from database engines in general (but not so much with NoSQL engines). Participation from Australia and the Middle East is too low in this poll to draw inferences.
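As a quick consistency check on the regional table, the ALL row should be roughly the participant-weighted average of the regional rows. The sketch below assumes that relationship and uses the rounded table values, so the regional shares add up to slightly more than 100% and are normalized by their sum.

```python
# region -> (% participants, Ntypes, %from DB engine, %from NoSQL, %text)
regions = {
    "US/Canada": (47.0, 5.4, 78, 20, 44),
    "Europe": (26.0, 4.5, 66, 10, 38),
    "Asia": (13.0, 3.8, 41, 15, 41),
    "Latin America": (8.7, 4.4, 74, 17, 35),
    "Africa/Middle East": (3.8, 4.5, 70, 20, 30),
    "Australia/NZ": (2.3, 4.2, 83, 0, 50),
}


def weighted_mean(column: int) -> float:
    """Average of one table column, weighted by each region's participant share."""
    total_weight = sum(row[0] for row in regions.values())
    return sum(row[0] * row[column] for row in regions.values()) / total_weight


ntypes_all = weighted_mean(1)
db_all = weighted_mean(2)
nosql_all = weighted_mean(3)
text_all = weighted_mean(4)

print(f"Ntypes {ntypes_all:.1f}, DB {db_all:.0f}%, "
      f"NoSQL {nosql_all:.0f}%, text {text_all:.0f}%")
```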
Related:
- 2014 Poll: What data types/sources you analyzed in the past 12 months?
- 2014 Poll: Salary/Income for Analytics, Data Mining, Data Science professionals Poll
- 2014 Poll: How much did you use text analytics / text mining in the past 12 months?
- 2013 Poll: Largest Dataset Analyzed / Data Mined
- 2012 Poll: Data types analyzed/mined