Largest Dataset Analyzed – Poll Results and Trends

The results show that despite the deluge of Big Data, a large majority still works with gigabyte- or megabyte-size datasets. Data Scientists work with the largest datasets, followed by Data Engineers, Data Analysts, and Business Analysts. Read more for details.



The latest KDnuggets Poll asked:
What was the largest dataset you analyzed / data mined?


Despite the deluge of big data, the results of the 2020 poll on the largest dataset analyzed show that most respondents still work in the gigabyte range, with essentially the same curve as in previous polls, which have asked this question since 2012. The data also shows a small but notable segment working with web-scale data of over 100 petabytes.

Fig. 1: KDnuggets Poll: Largest Dataset Analyzed, 2020, 2018, 2016
2020 data is shown as a column, to stand apart from lines for previous years.
The results are based on 562 participants.

Note that the poll asks about the largest dataset analyzed, so a typical dataset analyzed is expected to be significantly smaller.

Highlights:
  • Most data people still work in the gigabyte range: the majority of answers (78% in 2020, 80% in 2018, 83% in 2016) are in the gigabyte or megabyte range. The overall median response was yet again between 11 and 100 GB (which comfortably fits on one laptop), as in each previous poll since 2012.
  • Consistency: the shape of the curve is almost the same each year. The 2020 curve shows some change, with more respondents at the lower end, perhaps reflecting the entrance of many junior people into the field, but the overall shape is still the same.
  • Petabyte Big Data Scientists still stand apart: there is a small but significant gap, with almost no answers in the 1-100 PB range, separating analysts who work with terabyte-size commercial data warehouses from those who work with web-scale data stores of 100+ petabytes.
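
The median range quoted above comes from binned answers, so it is found by walking the cumulative counts until half of the responses are covered. Here is a minimal sketch of that arithmetic in Python. The bin labels are an assumed set of decade-wide size ranges, and the counts are hypothetical numbers made up for illustration; neither is the poll's actual data.

```python
from itertools import accumulate

# Assumed bin labels (decade-wide size ranges). The poll's exact answer
# options are not listed in this article, so these are illustrative.
BINS = [
    "< 1 MB", "1-10 MB", "11-100 MB", "101 MB-1 GB",
    "1-10 GB", "11-100 GB", "101 GB-1 TB", "1-10 TB",
    "11-100 TB", "101 TB-1 PB", "1-10 PB", "11-100 PB", "> 100 PB",
]

# Hypothetical counts for illustration only; NOT the poll's actual data.
HYPOTHETICAL_COUNTS = [10, 20, 40, 60, 80, 90, 70, 50, 30, 15, 2, 1, 12]


def median_bin(labels, counts):
    """Return the label of the first bin whose cumulative count
    reaches at least half of all responses."""
    if len(labels) != len(counts):
        raise ValueError("labels and counts must have the same length")
    total = sum(counts)
    if total == 0:
        raise ValueError("no responses")
    half = total / 2
    for label, cumulative in zip(labels, accumulate(counts)):
        if cumulative >= half:
            return label


if __name__ == "__main__":
    print(median_bin(BINS, HYPOTHETICAL_COUNTS))
```

Because the answers are ranges rather than exact sizes, this identifies only the bin that contains the median, which is why the result is reported as a range such as 11 to 100 GB.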

This poll also asked about employment type, and the breakdown was:
  • Company or Self-Employed, 62% (amazingly, this was also 62% in both 2018 and 2016)
  • Student, 20% (was 17% in 2018, 20% in 2016)
  • Academia/University, 8% (was 13% in 2018, 10% in 2016)
  • Government/non-profit, 4.4% (was 4.8% in 2018, 5.1% in 2016)
  • Unemployed or Other, 5% (was 3.2% in 2018, 2.4% in 2016)


We also asked a new question: what was your main "data role"? The responses were:

Fig. 2: KDnuggets Poll: Respondents by Data Role.

The next chart shows the distribution of responses by role for the most common roles, including the estimated median value.
Fig. 3: KDnuggets Poll: Largest Dataset Analyzed as of 2020, by Data Role.
Circle size corresponds to the number of responses.
Red line corresponds to estimated median value.

The distribution of responses by region was:
  • Europe, 34.1% (was 34.9% in 2018, 35.1% in 2016)
  • US/Canada, 31% (was 34.4% in 2018, 36.9% in 2016)
  • Asia, 21.2% (was 15.6% in 2018, 17% in 2016)
  • Latin America, 7.1% (was 6.9% in 2018, 5.6% in 2016)
  • Africa/Middle East, 4.8% (was 4.9% in 2018, 3.2% in 2016)
  • Australia/NZ, 1.9% (was 3.2% in 2018, 2.3% in 2016)


Finally, the last chart shows the largest dataset analyzed, by both employment and region, for the three largest regions.

Fig. 4: Largest Dataset Analyzed, by Employment for US/Canada, Europe, and Asia.
Circle size corresponds to the number of responses.

Here are the results of past polls: