Poll Results: Largest Dataset Analyzed surprisingly stable
The results of the annual KDnuggets poll on Largest Dataset Analyzed show surprising stability over the last 3 years, with about 54% of answers in the GB range, and confirm the gap between internet-scale data miners and the rest.
By Gregory Piatetsky,
@kdnuggets, Jul 17, 2014.
The latest KDnuggets Poll asked: What was the largest dataset you analyzed / data mined?
The results, based on 392 votes, show a pattern that has remained surprisingly stable over the last 3 years:
- over 50% of answers are in the Gigabyte range (median answer between 11 and 100 GB for each year 2012-14)
- a small number (2-3%) of Big Data miners are working with internet-scale data sets (over 100 PB), at companies like Google and Facebook.
- a small but significant gap, with almost no answers in the 1-100 PB range, which separates analysts who work with Terabyte-size commercial data warehouses from those who work with 100 PB+ internet-scale data stores.

We can see the trends more clearly by grouping the answers into ranges for Megabytes (< 1 GB), Gigabytes (1-999 GB), Terabytes (1-999 TB), and Petabytes (> 1 PB). We will call data scientists whose largest dataset analyzed falls in each range Megabyte analysts, Gigabyte analysts, etc.
The global percentage of Gigabyte analysts slightly increased from 53% in 2012 to 54% in 2014. The percentage of Megabyte analysts has steadily declined, as expected, while the percentage of Terabyte analysts has grown slightly, from 16% to 18%.
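
The grouping of answers into ranges can be sketched in a few lines of Python. This is only an illustration, not the code used to produce the poll results: the function names `size_range` and `range_shares` are hypothetical, and the use of decimal units (1 GB = 10**9 bytes) is an assumption, since the poll does not say which convention applies.

```python
from collections import Counter

# Assumption: decimal units. The poll does not state the convention used.
GB = 10**9
TB = 10**12
PB = 10**15


def size_range(num_bytes):
    """Map a dataset size in bytes to the range label used in the article."""
    if num_bytes < 0:
        raise ValueError("dataset size must be non-negative")
    if num_bytes < GB:
        return "Megabyte"
    if num_bytes < TB:
        return "Gigabyte"
    if num_bytes < PB:
        return "Terabyte"
    return "Petabyte"


def range_shares(sizes):
    """Return the percentage of answers that fall in each range."""
    counts = Counter(size_range(s) for s in sizes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}
```

Calling `range_shares` on a list of reported sizes gives the share of Megabyte, Gigabyte, Terabyte, and Petabyte analysts, which is the quantity compared across years and regions here.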

Here is a similar chart just for the US, which shows a decline in the percentage of Gigabyte analysts and corresponding growth in Terabyte and Petabyte analysts.

Regional participation was:
- 38%, US/Canada
- 31%, Europe
- 18%, Asia
- 6.9%, Latin America
- 3.3%, Africa/MidEast
- 2.6%, AU/NZ
The chart below shows the distribution of Largest Dataset Ranges by Region, sorted by % of TB+ answers. We see that US/Canada and AU/NZ lead, with about 30% of their data miners having worked on TB-size databases. Next is Europe (19%), Latin America (15%), Asia (10%), and Africa/MidEast (7.7%).

Here are the results of past polls:
- 2013 KDnuggets Poll Results: largest dataset analyzed / data mined.
- 2012 KDnuggets Poll: largest dataset you analyzed / data mined?