Poll Results: Where is Big Data? For most, Largest Dataset Analyzed is in laptop-size GB range

A majority of data scientists (56%) work in the Gigabyte dataset range. We note a small increase in Petabyte (web-scale) data miners and a decline in Megabyte data miners. US/Canada, Australia/NZ, and Asia lead in the percentage of Terabyte and Petabyte analysts.

The latest KDnuggets Poll asked: What was the largest dataset you analyzed / data mined?

The 2015 results, based on 459 votes, show a pattern that has remained surprisingly stable since 2012 and suggests that the majority of data scientists and analysts do not work with really big data:
  • The majority of answers (52.8% in 2013, 54.3% in 2014, 55.6% in 2015) are in the Gigabyte range. The median response was between 11 and 100 GB (which comfortably fits on one laptop) in each year from 2012 to 2015.
  • Slight growth in responses from web-scale "peta-data-miners", who have analyzed petabyte-scale databases (from 2.5% in 2013 to 4.6% in 2015).
  • A small but significant gap, with almost no answers in the 1-10 PB range, which separates analysts who work with Terabyte-size commercial data warehouses from those who work with multi-petabyte Internet-scale data stores.

Largest Dataset Analyzed 2013-2015

To see the trends better, we grouped the answers into ranges for Megabytes (< 1 GB), Gigabytes (1-999 GB), Terabytes (1-999 TB), and Petabytes (1 PB or more). We will call data scientists whose largest dataset analyzed falls in each range Mega-analysts, Giga-analysts, etc.
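As a rough illustration of this grouping, the sketch below maps a dataset size to one of the four ranges. The function name `size_range`, the use of decimal units (1 TB = 1,000 GB), and the treatment of exactly 1 PB as Petabyte range are assumptions made for the example, not details taken from the poll.

```python
def size_range(size_gb: float) -> str:
    """Map a dataset size in GB to the Mega/Giga/Tera/Peta range.

    Assumes decimal units: 1 TB = 1,000 GB and 1 PB = 1,000,000 GB.
    """
    if size_gb < 0:
        raise ValueError("size must be non-negative")
    if size_gb < 1:
        return "Mega"   # under 1 GB
    if size_gb < 1_000:
        return "Giga"   # 1 GB up to (but not including) 1 TB
    if size_gb < 1_000_000:
        return "Tera"   # 1 TB up to (but not including) 1 PB
    return "Peta"       # 1 PB or more


# Example: label a few hypothetical dataset sizes (in GB)
for size in (0.2, 50, 3_000, 2_000_000):
    print(size, "GB ->", size_range(size) + "-analyst")
```
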

The global percentage of Giga-analysts continued to increase slightly: 52.8% in 2013, 54.3% in 2014, 55.6% in 2015. The percentage of Mega-analysts has steadily declined (from 26.1% in 2013 to 21.6% in 2015), as can be expected. The share of Tera-analysts has remained steady at 18.3-18.6% over the three years. We do see slight growth at the upper end with Peta-analysts, from 2.5% in 2013 to 4.6% in 2015.
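The arithmetic behind these trends can be checked with a short sketch. It uses only the shares quoted above. The helper `median_range` is a hypothetical name introduced here, and it locates the median at the level of the grouped ranges, which is coarser than the 11-100 GB answer bin reported by the poll.

```python
# Shares (% of votes) quoted in the text for the first and last year
shares_2013 = {"Mega": 26.1, "Giga": 52.8, "Peta": 2.5}
shares_2015 = {"Mega": 21.6, "Giga": 55.6, "Peta": 4.6}

# Change in percentage points from 2013 to 2015, rounded to one decimal
changes = {
    name: round(shares_2015[name] - shares_2013[name], 1)
    for name in shares_2013
}


def median_range(ordered_shares):
    """Return the first range at which the cumulative share reaches 50%.

    `ordered_shares` is a list of (range name, share in %) pairs,
    ordered from the smallest range to the largest.
    """
    cumulative = 0.0
    for name, share in ordered_shares:
        cumulative += share
        if cumulative >= 50.0:
            return name
    raise ValueError("shares do not reach 50%")


# Mega alone is below 50%, Mega + Giga is above it, so the larger
# ranges are not needed to locate the median.
median_2015 = median_range([("Mega", 21.6), ("Giga", 55.6)])

print(changes)
print("Median range in 2015:", median_2015)
```

Because fewer than half of respondents fall in the Megabyte range and more than three quarters fall in the Megabyte or Gigabyte ranges combined, the median respondent is a Giga-analyst.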

KDnuggets Poll: Largest Dataset Analyzed, 2014-2015, ranges

Here is a similar chart just for US/Canada, which shows growth in Giga- and Peta-analysts and a corresponding decline in Mega- and Tera-analysts.

KDnuggets Poll: Largest Dataset Analyzed, 2012-2014, for US/Canada

Regional participation was:
  • 42%, US/Canada
  • 29%, Europe
  • 18%, Asia
  • 4.1%, Latin America
  • 3.9%, AU/NZ
  • 2.4%, Africa/MidEast

The chart below shows the distribution of largest dataset ranges by region, sorted by % of TB+ answers. In US/Canada, 26.4% of analysts worked with TB+ datasets. Next is AU/NZ, where 22.2% worked on TB+ data, followed by Asia (21.7%) and Europe (20.7%).
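To put the regional percentages in context, the sketch below ranks the four regions by TB+ share and estimates how many respondents stand behind each figure. The counts are approximations derived from the rounded percentages and the 459 total votes, not numbers reported by the poll, and `estimate_counts` is a hypothetical helper. Latin America and Africa/MidEast are omitted because their TB+ shares are not given in the text.

```python
TOTAL_VOTES = 459  # total votes reported for the 2015 poll

# region -> (participation share in %, TB+ share in %), as quoted in the text
regions = {
    "US/Canada": (42.0, 26.4),
    "Europe": (29.0, 20.7),
    "Asia": (18.0, 21.7),
    "AU/NZ": (3.9, 22.2),
}


def estimate_counts(total, participation_pct, tb_plus_pct):
    """Estimate (respondents, TB+ respondents) from rounded percentages."""
    respondents = round(total * participation_pct / 100)
    tb_plus = round(respondents * tb_plus_pct / 100)
    return respondents, tb_plus


# Sort regions by TB+ share, highest first, as in the chart
ranked = sorted(regions.items(), key=lambda item: item[1][1], reverse=True)

for name, (participation, tb_share) in ranked:
    n, k = estimate_counts(TOTAL_VOTES, participation, tb_share)
    print(f"{name}: about {n} respondents, about {k} with TB+ data ({tb_share}%)")
```

The estimate for AU/NZ rests on roughly 18 respondents, so its ranking is more sensitive to a handful of votes than the US/Canada figure.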

KDnuggets 2015 Poll: Largest Dataset Analyzed, by region

