New Poll: Largest Dataset Analyzed / Data Mined?

The new KDnuggets Poll asks: What was the largest dataset you analyzed / data mined? Please vote at www.kdnuggets.com.




This poll is now closed. Here are the results of the KDnuggets Poll:

What was the largest dataset you analyzed / data mined?

For comparison, here are the results of earlier polls:

In the 2012 KDnuggets Poll: largest dataset analyzed / data mined, the median answer was in the 20-40 GB range.

In a similar 2011 KDnuggets Poll: largest dataset analyzed / data mined, the median answer was in the 10-20 GB range.

Bill Winkler posted on LinkedIn:

We have 10-20 files with 100 million to 300 million records. We need to do modeling/edit/imputation to fill in for missing data so that records satisfy joint distributions in a principled manner and so that individual records satisfy edit (business) rules. We need to unduplicate the individual files. We need to merge files using quasi-identifiers such as name, address, date-of-birth (when available) and other fields. We need to adjust the statistical analyses on merged files for linkage error. Several files are on the order of 6+ TB.

Some of the methods are discussed in www.ine.es/e/essnetdi_ws2011/ppts/Winkler.pdf .

There is a group looking at methods for cleaning up and analyzing groups of national files. www.newton.ac.uk/programmes/INI/iniw91.html .
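
To make the unduplication and merging steps described in the comment above more concrete, here is a minimal Python sketch. It is not the method used by Bill Winkler or discussed in the linked slides. It assumes records are plain dictionaries with hypothetical "name", "address", and "dob" fields, uses a made-up blocking rule (first letter of the name plus birth year), and scores candidate pairs with a simple string-similarity average and an arbitrary 0.85 threshold. Real linkage at the 100-300 million record scale would need probabilistic matching and far more careful blocking.

```python
import difflib
from collections import defaultdict


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def blocking_key(record):
    # Hypothetical blocking rule: first letter of the name plus birth year.
    # Only records sharing a key are ever compared, which limits pairwise work.
    name = normalize(record["name"])
    dob = record.get("dob") or ""
    return (name[:1], dob[:4])


def similarity(a, b):
    # Illustrative score: mean of name and address string similarity (0.0 to 1.0).
    name_score = difflib.SequenceMatcher(
        None, normalize(a["name"]), normalize(b["name"])).ratio()
    addr_score = difflib.SequenceMatcher(
        None, normalize(a["address"]), normalize(b["address"])).ratio()
    return (name_score + addr_score) / 2


def deduplicate(records, threshold=0.85):
    """Keep the first record of each group of near-identical records in one file."""
    kept = []
    blocks = defaultdict(list)
    for rec in records:
        block = blocks[blocking_key(rec)]
        if not any(similarity(rec, other) >= threshold for other in block):
            block.append(rec)
            kept.append(rec)
    return kept


def link(file_a, file_b, threshold=0.85):
    """Pair each record in file_a with its best-scoring candidate in file_b, if any."""
    index = defaultdict(list)
    for rec in file_b:
        index[blocking_key(rec)].append(rec)
    pairs = []
    for rec in file_a:
        candidates = index.get(blocking_key(rec), [])
        best = max(candidates, key=lambda c: similarity(rec, c), default=None)
        if best is not None and similarity(rec, best) >= threshold:
            pairs.append((rec, best))
    return pairs
```

A typical call would run deduplicate on each file first and then pass the cleaned lists to link. The sketch does nothing about the imputation, edit-rule, or linkage-error adjustment steps mentioned in the comment.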