Poll Results: Largest Dataset Analyzed/Data Mined
The largest dataset analyzed kept growing, with the median now in the 40-50 GB range, about twice the 2012 value. US data miners lead other regions in Big Data: about 28% of them worked with terabyte-size databases. We again observed the 11-100 Petabyte gap.
The latest KDnuggets Poll asked: What was the largest dataset you analyzed / data mined?
Based on about 320 answers, the median can be estimated in the 40-50 GB range, about double the median answer (20-40 GB) in the 2012 KDnuggets Poll: largest dataset analyzed / data mined.
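For readers curious how a median is estimated from bucketed poll answers like these, the usual approach is linear interpolation within the bin that contains the 50th percentile. A minimal sketch follows; the vote counts are illustrative placeholders, not the actual 2013 poll tallies:

```python
# Estimate the median from binned survey answers by linear interpolation
# within the bin containing the 50th percentile.
# NOTE: these counts are illustrative placeholders, not the real poll data.
bins = [  # (lower bound in GB, upper bound in GB, votes)
    (0, 10, 80),
    (10, 100, 90),
    (100, 1000, 100),   # 100 GB - 1 TB
    (1000, 10000, 50),  # 1 - 10 TB
]

def interpolated_median(bins):
    total = sum(votes for _, _, votes in bins)
    target = total / 2.0
    cumulative = 0
    for lo, hi, votes in bins:
        if cumulative + votes >= target:
            # Linearly interpolate inside the bin holding the 50th percentile
            fraction = (target - cumulative) / votes
            return lo + fraction * (hi - lo)
        cumulative += votes
    raise ValueError("empty bins")

print(interpolated_median(bins))
```

With wide, roughly logarithmic bins like these, the interpolated value is only a rough estimate, which is why poll medians are reported as a range rather than a point value.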
We note that the largest increase was in the 100 GB to 1 TB range, while fewer people selected 11 to 100 GB as the largest database they analyzed.
The proportion of data miners who worked with terabyte-range databases has stayed around 20% globally, but increased from 24% to 28% among US/Canada analysts. We also note a gap in the 11-100 PB range: no answers in the 2013 poll and very few in the 2012 poll. Perhaps this gap separates regular data from web-scale data, such as what the "Big Data Miners" at Google, Facebook, and other web-scale companies work with.
Here is a comparison of 2013 and 2012 poll results.
Here is a regional breakdown which shows that US/Canada data miners lead other regions in analyzing large databases. The numbers for Latin America and AU/NZ are not large enough to be statistically significant, but are reported for completeness.
| Region (voters) | Largest Dataset Analyzed (median) | % who analyzed TB+ data |
|---|---|---|
| US/Canada (156) | 30-50 GB | 28% |
| Europe (92) | 5-10 GB | 12% |
| Asia (47) | 1-5 GB | 11% |
| Latin America (12) | 11-100 GB | 17% |
| AU/New Zealand (7) | 5-10 GB | 14% |
Comments from around the web
In Advanced Business Analytics, Data Mining and Predictive Modeling LinkedIn group
Bill Winkler - We have 10-20 files with 100 million to 300 million records. We need to do modeling/edit/imputation to fill in for missing data so that records satisfy joint distributions in a principled manner and so that individual records satisfy edit (business) rules. We need to unduplicate the individual files. We need to merge files using quasi-identifiers such as name, address, date-of-birth (when available) and other fields. We need to adjust the statistical analyses on merged files for linkage error. Several files are on the order of 6+ TB.
Some of the methods are discussed in www.ine.es/e/essnetdi_ws2011/ppts/Winkler.pdf .
There is a group looking at methods for cleaning up and analyzing groups of national files. www.newton.ac.uk/programmes/INI/iniw91.html .
With the fastest methods, a group of 10+ individuals could clean up and analyze a group of national files in 1-3 months. With many other methods that are 10 or 100 times slower, the clean-up and analysis could take far longer. At present, Hadoop and similar methods are not able to deal with the very large-scale optimization needed for many statistical and machine learning algorithms. Hadoop-type methods and most others are not able to look effectively for duplicates when there are moderate amounts of typographical error in most quasi-identifying fields (the so-called approximate string search problem from Knuth's monograph).
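For readers unfamiliar with the approximate string search problem Winkler mentions, a minimal sketch of the core idea is the classic Levenshtein edit distance; production record-linkage systems use far more sophisticated and scalable techniques (e.g. Jaro-Winkler similarity with blocking), and nothing below reflects Winkler's proprietary algorithms:

```python
# Levenshtein edit distance: the minimum number of single-character
# insertions, deletions, and substitutions to turn one string into another.
# Record linkage uses distances like this to match quasi-identifiers
# such as names despite typographical errors.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]

# Two records likely referring to the same person despite typos:
print(edit_distance("Jonathan Smith", "Jonathon Smiht"))
```

This dynamic program is O(|a|·|b|) per pair, which is exactly why naive all-pairs comparison breaks down at hundreds of millions of records and why the large-scale problem remains hard.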
Gregory Piatetsky-Shapiro - Very impressive - thanks for sharing the details! At BigDataTechCon, many people criticized Hadoop as hitting the wall for complex analysis. Have you tried array databases like SciDB?
Bill Winkler - I have developed proprietary algorithms and data structures (as ruled by our lawyers after some very large companies tried to get our source code). Our three largest files (800 million to 2 billion records - 20 TB) are too large for some of our methods.
Most of the software is designed for the clean-up of the data, although some of the algorithms could be used for certain types of data mining. Fayyad and Uthurusamy have stated that 90+% of the work in creating a data warehouse is in the clean-up of the data.
Fayyad, U., & Uthurusamy, R. (2002). Evolving data mining into solutions for insights. Communications of the ACM, 45(8), 28-31.
If you understand the models (analyses) that you are performing, then you will often know the sufficient statistics you need (which can be handled with approximations in large situations). The intent of the work below was to produce aggregates that could be used in existing statistical modeling software because no statistical modeling software could handle even moderate-size situations 15 years ago.
- W. DuMouchel, C. Volinsky, T. Johnson, C. Cortes, D. Pregibon, "Squashing flat files flatter," Data Mining and Knowledge Discovery (1999)
- A. Moore, M. S. Lee, "Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets," Journal of Artificial Intelligence Research (1998)
You still need basic understanding of data and of modeling to apply the two methods.
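The squashing and cached-statistics papers above share one idea: many models need only small aggregates of the data, not the data itself. As a minimal sketch of that idea (not the method from either paper), here is ordinary least squares fitted from the sufficient statistics X'X and X'y, accumulated in a single streaming pass so the full dataset never has to fit in the modeling software:

```python
# For ordinary least squares, X'X and X'y are sufficient statistics:
# accumulate them in one streaming pass over an arbitrarily large file,
# then solve the model from these small aggregates alone.
import numpy as np

def accumulate(chunks):
    """Accumulate X'X and X'y over an iterable of (X, y) chunks."""
    xtx, xty = None, None
    for X, y in chunks:
        X = np.column_stack([np.ones(len(X)), X])  # add intercept column
        if xtx is None:
            xtx = X.T @ X
            xty = X.T @ y
        else:
            xtx += X.T @ X
            xty += X.T @ y
    return xtx, xty

rng = np.random.default_rng(0)

def stream():
    # Simulated stream: y = 2 + 3x plus noise, delivered in chunks
    for _ in range(10):
        x = rng.normal(size=(1000, 1))
        y = 2 + 3 * x[:, 0] + rng.normal(scale=0.1, size=1000)
        yield x, y

xtx, xty = accumulate(stream())
beta = np.linalg.solve(xtx, xty)
print(beta)  # coefficients close to [2, 3]
```

Only the p x p matrix X'X and the p-vector X'y are ever held in memory, which is why this pattern scales to files far larger than RAM, provided you know in advance which model (and hence which aggregates) you need.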
Kirk Borne - I know that you didn't ask about future activities, but we are working on plans to mine, analyze, and generate discoveries from a truly huge data set:
Julien Sauvage - Everything depends so much on how you define a "large data set"...
I agree that the number of files or tables is important, as well as their size (in GB, TB...), but the question is: how large is the USEFUL information that needs to be analyzed? Silly example: if you have only one customer attribute and 1 billion duplicates, the corresponding file will be big, but the size of the valuable information remains pretty small... And of course it depends on the type of analysis performed - only cleaning data, or building a predictive model?
For predictive modeling, an analytical data set will be considered LARGE when the quantity of valuable information to be processed is large! This usually corresponds to datasets that are very long (a large number of records) and, more importantly, very WIDE. Such data sets contain a lot of attributes for each record, e.g. a lot of customer variables.
KXEN customers analyze such large datasets. A good example is Mobilink, Pakistan's first mobile communications service, which analyzes 900 million distinct monthly communications between 70 million phone numbers, stored in 4.3 billion raw monthly call detail records (900 TB). Using KXEN, Mobilink builds predictive models for customer retention based on analytical data sets made of 1000+ customer attributes ("variables").
Andrew Pandre, Ph.D. - @Gregory: thanks for doing this poll for many years! I think your poll shows that "Big Data" is science fiction: only 10% of datasets are above 10 TB, and almost 79% of datasets are below 1 TB in size...
Gregory Piatetsky-Shapiro - Not every data miner voted in this poll, but the results are consistent with past polls and other surveys - most data scientists and data miners deal with smaller databases (less than TB size), and this is reasonable - not all data mining problems are web-scale. The "Big Data Miners" group that works with web-scale databases is a minority, but it is growing.