Amazing consistency: Largest Dataset Analyzed / Data Mined – Poll Results and Trends

The poll results show amazing consistency with past years, with median answers still in the 10-100 gigabyte range. Really Big Data Scientists (100 Petabytes and more) continue to stand apart, but remain a small segment, one where Asian data scientists lead for the first time in this poll.



The latest KDnuggets Poll asked:

What was the largest dataset you analyzed / data mined?

This poll received 1108 votes, about 10% fewer than in 2016, but still a large enough sample. The results again show surprising stability, fitting a pattern that had already emerged in 2012: a majority of data scientists and analysts work with data in the Gigabyte range, while a small but notable segment works with web-scale data of over 100 Petabytes.

Note that the poll asks about the largest dataset ever analyzed, so a typical dataset is expected to be significantly smaller.

Highlights:
  • Gigabytes still rule: The majority of answers (56% in 2018, 57% in 2016, 56% in 2015, 54% in 2014, 53% in 2013) are in the Gigabyte range. The overall median response has been between 11 and 100 GB (which comfortably fits on one laptop) in every year since 2012.
  • Consistency: The shape of the curve is almost the same each year. In 2018 there were fewer responses in the under-10 MB range and more in the 1-10 GB range, but not significantly so.
  • Petabyte Big Data Scientists still stand apart: There is a small but significant gap, with almost no answers in the 1-10 PB range, which separates analysts who work with Terabyte-size commercial data warehouses from those who work with 100+ Petabyte web-scale data stores. See, for example, a recent story on Uber's current 100 PB data warehouse.
  • Academic researchers on par with Government, Industry: The estimated median for academic researchers is 90 GB, on par with Government (60 GB) and Industry analysts (50 GB). The estimated median answer has increased a little for all segments in 2018.
Fig. 1: KDnuggets Poll: Largest Dataset Analyzed, 2014-2018
2018 data is shown as a column, to stand apart from lines for previous years.


This poll also asked about employment type. The breakdown was:
  • Company or Self-Employed, 62% (was also 62% in 2016)
  • Student, 17% (was 20% in 2016)
  • Academia/University, 13% (was 10% in 2016)
  • Government/non-profit, 4.8% (was 5.1% in 2016)
  • Other, 3.2% (was 2.4% in 2016)
Fig. 2: KDnuggets Poll: Largest Dataset 2018, by Employment. Red line shows the estimated median
Circle size corresponds to the number of responses.
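
The article does not say how the estimated median is derived from binned answers. One common approach, shown here only as a minimal sketch, is to find the bin that contains the middle response and interpolate inside it on a logarithmic scale, since the size ranges grow by powers of ten. The function name, the bin edges, and the counts below are all hypothetical placeholders chosen for illustration; they are not the poll's actual data or method.

```python
import math

def interpolated_median(bins):
    """Estimate the median of binned data.

    `bins` is a list of (low_bytes, high_bytes, count) tuples sorted by size.
    Interpolation inside the median bin is done on a log10 scale, which is an
    assumption suited to bins that span orders of magnitude.
    """
    total = sum(count for _, _, count in bins)
    half = total / 2
    cumulative = 0
    for low, high, count in bins:
        if count > 0 and cumulative + count >= half:
            # Fraction of the way through this bin where the middle response falls.
            frac = (half - cumulative) / count
            log_low, log_high = math.log10(low), math.log10(high)
            return 10 ** (log_low + frac * (log_high - log_low))
        cumulative += count
    raise ValueError("no responses to compute a median from")

if __name__ == "__main__":
    # Placeholder counts for illustration only, NOT the real poll results.
    example_bins = [
        (1e6, 1e9, 30),    # under 1 GB
        (1e9, 1e10, 25),   # 1-10 GB
        (1e10, 1e11, 25),  # 10-100 GB
        (1e11, 1e12, 20),  # 100 GB - 1 TB
    ]
    median_bytes = interpolated_median(example_bins)
    print(f"Estimated median: {median_bytes / 1e9:.1f} GB")
```

Interpolating on a log scale keeps the estimate from being pulled toward the upper edge of a wide bin, which a linear interpolation would tend to do when one bin covers a tenfold range.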

Regional trends show slightly more voters from Latin America, the Middle East, and Australia, and slightly fewer from the US. The numbers were:
  • Europe, 34.9% (was 35.1% in 2016)
  • US/Canada, 34.4% (was 36.9% in 2016)
  • Asia, 15.6% (was 17% in 2016)
  • Latin America, 6.9% (was 5.6% in 2016)
  • Africa/Middle East, 4.9% (was 3.2% in 2016)
  • Australia/NZ, 3.2% (was 2.3% in 2016)
Finally, we examine the largest dataset analyzed by both employment and region for the 3 largest regions.

Fig. 3: Largest Dataset Analyzed, by Employment for US/Canada, Europe, and Asia. Circle size corresponds to the number of responses

We got more responses reporting 100 PB data from Asian "Company" Data Scientists than from their US/Canada or Europe counterparts. We see a similar situation with Asian students.

Here are the results of past polls: