Amazing consistency: Largest Dataset Analyzed / Data Mined – Poll Results and Trends
The poll results show amazing consistency with past years, with median answers still in the 10-100 gigabyte range. Really Big Data Scientists (100 Petabytes and more) continue to stand apart, but remain a small segment, one in which Asian data scientists lead for the first time in this poll.
What was the largest dataset you analyzed / data mined?
This poll received 1108 votes, about 10% fewer than in 2016, but still a large enough sample. The results again show surprising stability, fitting a pattern that had already emerged in 2012, with a majority of data scientists and analysts working with data in the Gigabyte range, and a small but notable segment working with web-scale data of over 100 Petabytes.
Note that the poll asks about the largest ever dataset, so a typical dataset analyzed is expected to be significantly smaller.
- Gigabytes still rule: The majority of answers (56% in 2018, 57% in 2016, 56% in 2015, 54% in 2014, 53% in 2013) are in the Gigabyte range. The overall median response has been between 11 and 100 GB (which comfortably fits on one laptop) in every poll since 2012.
- Consistency: The shape of the curve is almost the same each year. In 2018 there were fewer responses in the under-10 MB range and more in the 1-10 GB range, but not significantly so.
- Petabyte Big Data Scientists still stand apart: There is a small but significant gap, with almost no answers in the 1-10 PB range, which separates analysts who work with Terabyte-size commercial data warehouses from those who work with 100+ Petabyte web-scale data stores. See, for example, a recent story on Uber's current 100 PB data warehouse.
- Academic researchers on par with Government, Industry: The estimated median for academic researchers is 90 GB, on par with Government (60 GB) and Industry analysts (50 GB). The estimated median answer has increased a little for all segments in 2018.
Fig. 1: KDnuggets Poll: Largest Dataset Analyzed, 2014-2018
2018 data is shown as a column, to stand apart from lines for previous years.
This poll also asked about employment type. The breakdown was:
- Company or Self-Employed, 62% (was also 62% in 2016)
- Student, 17% (was 20% in 2016)
- Academia/University, 13% (was 10% in 2016)
- Government/non-profit, 4.8% (was 5.1% in 2016)
- Other, 3.2% (was 2.4% in 2016)
Fig. 2: KDnuggets Poll: Largest Dataset 2018, by Employment. Red line shows the estimated median
Circle size corresponds to the number of responses.
Regional trends show slightly more voters from Latin America, the Middle East, and Australia, and slightly fewer from the US. The numbers, compared with 2016, were:
- Europe, 34.9% (was 35.1%)
- US/Canada, 34.4% (was 36.9% in 2016)
- Asia, 15.6% (was 17%)
- Latin America, 6.9% (was 5.6%)
- Africa/Middle East, 4.9% (was 3.2%)
- Australia/NZ, 3.2% (was 2.3%)
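The shifts in the list above are easier to compare as changes in percentage points. The short Python sketch below computes them from the shares quoted in the list. It assumes that every "was" value refers to the 2016 poll, which the list states explicitly only for US/Canada. The helper name `share_changes` is illustrative.

```python
# Regional shares of poll responses, in percent, as quoted in the list above.
shares_2018 = {
    "Europe": 34.9,
    "US/Canada": 34.4,
    "Asia": 15.6,
    "Latin America": 6.9,
    "Africa/Middle East": 4.9,
    "Australia/NZ": 3.2,
}
# Assumption: all "was" values are from the 2016 poll.
shares_2016 = {
    "Europe": 35.1,
    "US/Canada": 36.9,
    "Asia": 17.0,
    "Latin America": 5.6,
    "Africa/Middle East": 3.2,
    "Australia/NZ": 2.3,
}


def share_changes(current, previous):
    """Return the change in percentage points per region, rounded to 0.1."""
    return {
        region: round(current[region] - previous[region], 1)
        for region in current
    }


changes = share_changes(shares_2018, shares_2016)

# Print regions from the largest gain to the largest loss.
for region, delta in sorted(changes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{region:20s} {delta:+.1f} pp")
```

On these figures the largest gain is Africa/Middle East and the largest drop is US/Canada, which matches the trend described in the text.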
Fig. 3: Largest Dataset Analyzed, by Employment for US/Canada, Europe, and Asia. Circle size corresponds to the number of responses
We got more responses from Asian "Company" Data Scientists for 100 PB data than from their counterparts in US/Canada or Europe. We see a similar situation with Asian students.
Here are the results of past polls:
- Largest Dataset Analyzed Poll shows surprising stability, more junior Data Scientists, 2016
- Poll Results: Where is Big Data? For most, Largest Dataset Analyzed is in laptop-size GB range, 2015
- 2014 KDnuggets Poll Results: Largest Dataset Analyzed surprisingly stable
- 2013 KDnuggets Poll Results: largest dataset analyzed / data mined
- 2012 KDnuggets Poll: largest dataset you analyzed / data mined?