KDnuggets Home » Polls » Largest Database or Dataset You Data-Mined (June 2008)

What was the largest database or dataset you data-mined: [127 votes total]

less than 1 MB (4) 3%
1.1 to 10 MB (4) 3%
11 to 100 MB (10) 8%
101 MB to 1 GB (13) 10%
1.1 to 10 GB (35) 28%
11 to 100 GB (21) 17%
101 GB to 1 Terabyte (20) 16%
1.1 to 10 Terabytes (8) 6%
over 10 Terabytes (12) 9%


For comparison, see the results of the 2007 KDnuggets Poll: Largest Data Size Data-Mined.

The median DB size in this poll was around 10 GB, smaller than the 30-60 GB range in the 2007 poll. One explanation is that fewer people participated in the 2008 poll (perhaps Yahoo data miners, who mine some of the largest data anywhere, were distracted by the Microsoft takeover bid and did not take part).
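As a rough check on the "around 10 GB" figure, the median bucket can be located from the vote counts listed above. The sketch below is only one way to do it: the function name `median_bucket` and the "position within the bucket" measure are illustrative assumptions, not the method used to produce the poll summary.

```python
# Poll buckets (label, votes), copied from the 2008 results above.
buckets = [
    ("less than 1 MB", 4),
    ("1.1 to 10 MB", 4),
    ("11 to 100 MB", 10),
    ("101 MB to 1 GB", 13),
    ("1.1 to 10 GB", 35),
    ("11 to 100 GB", 21),
    ("101 GB to 1 Terabyte", 20),
    ("1.1 to 10 Terabytes", 8),
    ("over 10 Terabytes", 12),
]


def median_bucket(buckets):
    """Find the bucket that contains the median respondent.

    Returns (label, votes_before_bucket, votes_in_bucket, total_votes).
    """
    total = sum(votes for _, votes in buckets)
    median_rank = (total + 1) / 2  # rank of the middle respondent
    cumulative = 0
    for label, votes in buckets:
        if cumulative + votes >= median_rank:
            return label, cumulative, votes, total
        cumulative += votes
    raise ValueError("no votes in poll")


label, before, votes, total = median_bucket(buckets)

# Hypothetical measure: how far into the bucket the median respondent
# falls, from 0 (bottom of the bucket) to 1 (top of the bucket).
position = ((total + 1) / 2 - before) / votes

print(f"median bucket: {label}")
print(f"position within bucket: {position:.2f}")
```

The median respondent falls in the 1.1 to 10 GB bucket and sits close to its upper edge, which is consistent with an estimate of roughly 10 GB.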

The following chart shows the breakdown of the largest DB size (estimated median value) by region. The values for the US/CA/Au, W. Europe, and Asia regions are close to what we expected. We note the unusually large DB sizes reported in two regions, Latin America and Africa and Middle East, but these are probably not representative of those regions because of the small number of responses.

Largest DB size by region. Yellow bar shows responses over 1 TB

Region                Count   Largest DB size       Pct with largest
                              (est. median value)   DB size > 1 TB
US, Canada,           51      30 GB                 29%
W. Europe             37      4 GB                  8.1%
E. Europe             12      10 GB                 0%
Asia                  12      1 GB                  8.3%
Latin America         9       100 GB                11.1%
Africa, Middle East   5       50 GB                 0%

TimManns, Size of data in bytes or rows?
I am usually fairly ignorant of the size in bytes of the data. I know how many rows and columns I process (and that the size of our entire Teradata data warehouse exceeds many terabytes).

I'd guess that most analysts can report rows and columns more easily than bytes. Here are mine:

My largest queries access data from multiple tables, with more than 2 billion rows in the largest single table (and I create approximately 30 byteint columns from that table). This transaction-level data is summarised to a single row per customer, so the end result set is much smaller.
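To relate row and column counts to the byte ranges used in the poll, a back-of-the-envelope conversion is enough. The sketch below assumes each byteint value occupies 1 byte (as in Teradata's BYTEINT type) and ignores row headers, indexes, and compression; the function name `estimate_bytes` is hypothetical. It covers only the roughly 30 derived columns, not the full width of the source table, and since the comment says "more than" 2 billion rows the result is a lower bound.

```python
BYTEINT_BYTES = 1  # assumption: one byte per byteint value, no overhead


def estimate_bytes(rows, columns, bytes_per_value=BYTEINT_BYTES):
    """Raw size of a rows x columns block of fixed-width values."""
    return rows * columns * bytes_per_value


rows = 2_000_000_000  # "more than 2 billion rows"
columns = 30          # "approximately 30 byteint columns"

size_bytes = estimate_bytes(rows, columns)
size_gb = size_bytes / 10**9  # decimal gigabytes

print(f"derived columns: about {size_gb:.0f} GB before summarisation")
```

Under these assumptions the derived columns alone come to about 60 GB, which would fall in the poll's "11 to 100 GB" range.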

To be practical, processing time for any analysis can never exceed a few hours.

Rafal Latkowski, What was the largest dataset?
I'm assuming that we are talking about size without compression. Some database/data-mining suites have very efficient compression, especially for de-normalized data (in the range of about 1:5 to 1:20), but all algorithms process uncompressed data.
