Largest Dataset Analyzed Poll shows surprising stability, more junior Data Scientists

The majority (57%) of respondents have worked only with Gigabyte-range data. More junior Data Scientists are entering the market, but Petabyte Big Data Scientists still stand apart.

The latest KDnuggets Poll asked:

What was the largest dataset you analyzed / data mined?

This poll received 1240 votes, almost 3 times as many as in 2015, but the results show surprising stability and fit a pattern that had already emerged in 2012, which suggests that the majority of data scientists and analysts do not work with really big data.
  • Gigabytes still rule: The majority of answers (57% in 2016, 56% in 2015, 54% in 2014, 53% in 2013) are in the Gigabyte range. The overall median response has been between 11 and 100 GB (which comfortably fits on one laptop) in each year since 2012.
  • More Junior Data Scientists: Compared to 2015, we see a higher percentage of responses in ALL ranges under 100 GB and fewer in ranges over 100 GB, which suggests more junior Data Scientists are coming into the industry (and taking part in this poll).
  • Petabyte Big Data Scientists stand apart: There is a small but significant gap, with almost no answers in the 1-10 PB range, which separates analysts who work with Terabyte-size commercial data warehouses from those who work with multi-petabyte Internet-scale data stores.
  • Government, Industry lead: The median Government and Industry analysts work with datasets an order of magnitude larger than other respondents do.
  • US and Europe have the largest share of Terabyte-level analysts
Note that the poll asks about the largest ever dataset, so a typical dataset analyzed is expected to be significantly smaller.

Fig. 1: KDnuggets Poll: Largest Dataset Analyzed, 2012-2016

This poll also asked about employment type. The breakdown was:
  • Company or Self-Employed, 62%
  • Student, 20.2%
  • Academia/University, 10.2%
  • Government/non-profit, 5.1%
  • Other, 2.4%
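
As a rough cross-check, the shares above can be converted into approximate respondent counts using the 1240 total votes reported earlier. The sketch below is only illustrative: the variable and function names are ours, it assumes every voter answered the employment question, and because the published percentages are rounded (they sum to 99.9%), the counts are estimates rather than exact tallies.

```python
# Approximate respondent counts per employment type.
# Assumption (not stated in the poll): all 1240 voters answered this question.
TOTAL_VOTES = 1240

# Percentages as published in the breakdown above.
employment_share_pct = {
    "Company or Self-Employed": 62.0,
    "Student": 20.2,
    "Academia/University": 10.2,
    "Government/non-profit": 5.1,
    "Other": 2.4,
}


def approx_counts(share_pct, total):
    """Convert percentage shares into rounded respondent counts."""
    return {name: round(total * pct / 100) for name, pct in share_pct.items()}


employment_counts = approx_counts(employment_share_pct, TOTAL_VOTES)

for name, count in employment_counts.items():
    print(f"{name}: ~{count} respondents")
```

Under these assumptions the Government/non-profit group comes to only about 63 respondents, so estimates for that group rest on a much smaller sample than those for Company or Self-Employed.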

Fig. 2: KDnuggets Poll: Largest Dataset 2016, by Employment. Red line shows the estimated median.

Figure 3 below shows the distribution of largest dataset ranges by region, sorted by % of TB+ answers.

In US/Canada, 22% of analysts worked with TB+ datasets, followed by Europe (15%), AU/NZ (14%), Africa/Mideast (13%), and Latin America (10%). These numbers are consistently lower than in 2015, which, combined with the nearly 3x increase in poll participation, suggests strong growth in the number of new Data Scientists entering the field and working on regular-size data.

Fig. 3: KDnuggets Poll: Largest Dataset 2016, by Region, ordered by share of TB+ entries.

Regional participation was:
  • US/Canada, 37%
  • Europe, 35%
  • Asia, 17%
  • Latin America, 5.6%
  • Africa/Middle East, 3.2%
  • Australia/NZ, 2.3%
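
To put the regional TB+ shares in perspective, they can be combined with the participation figures to estimate how many respondents in each region reported TB+ datasets. This is a back-of-the-envelope sketch, not part of the poll analysis: it assumes the TB+ percentages quoted above are shares within each region, that all 1240 voters gave a region, and it uses rounded inputs. The names are ours, and Asia's TB+ share is not quoted in the text, so it is left unset rather than guessed.

```python
# Estimated number of TB+ respondents per region.
# Assumption (not stated in the poll): all 1240 voters reported a region.
TOTAL_VOTES = 1240

# region: (participation % of all respondents, TB+ % within that region)
# None marks a value that the article does not quote.
regions = {
    "US/Canada": (37.0, 22),
    "Europe": (35.0, 15),
    "Asia": (17.0, None),
    "Latin America": (5.6, 10),
    "Africa/Middle East": (3.2, 13),
    "Australia/NZ": (2.3, 14),
}


def approx_tb_plus(participation_pct, tb_plus_pct, total=TOTAL_VOTES):
    """Estimate TB+ respondents in a region; None if the share is unknown."""
    if tb_plus_pct is None:
        return None
    respondents = total * participation_pct / 100
    return round(respondents * tb_plus_pct / 100)


tb_plus_counts = {
    name: approx_tb_plus(participation, tb_plus)
    for name, (participation, tb_plus) in regions.items()
}

for name, count in tb_plus_counts.items():
    label = "not quoted" if count is None else f"~{count}"
    print(f"{name}: {label} TB+ respondents")
```

Under these assumptions, the TB+ percentages for the smaller regions correspond to only a handful of respondents each, so small differences between those regions should be read with caution.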
Finally, we examine the largest dataset analyzed by both employment and region for the 3 largest regions.

Fig. 4: KDnuggets Poll: Largest Dataset 2016, by Employment for US/Canada, Europe, and Asia. Circle size corresponds to the number of responses.

We note that the regions are mostly similar, but more data scientists in Asia than in Europe get to work with web-scale data. Some lucky students in every region also get to work with web-scale data.

Here are the results of past polls: