New Poll: Largest Dataset Analyzed / Data Mined?
The new KDnuggets Poll asked: what was the largest dataset you analyzed / data mined?

This poll is now closed - here are the results.

For comparison, here are the results of earlier polls:
In the 2012 KDnuggets Poll: largest dataset analyzed / data mined, the median answer was in the 20-40 GB range.
In a similar 2011 KDnuggets Poll: largest dataset analyzed / data mined, the median answer was in the 10-20 GB range.
Bill Winkler posted on LinkedIn:
We have 10-20 files with 100 million to 300 million records. We need to do modeling/edit/imputation to fill in for missing data so that records satisfy joint distributions in a principled manner and so that individual records satisfy edit (business) rules. We need to unduplicate the individual files. We need to merge files using quasi-identifiers such as name, address, date-of-birth (when available) and other fields. We need to adjust the statistical analyses on merged files for linkage error. Several files are on the order of 6+ TB.
Some of the methods are discussed in www.ine.es/e/essnetdi_ws2011/ppts/Winkler.pdf .
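One step in the workflow Winkler describes is unduplicating individual files by quasi-identifiers such as name and date of birth. A minimal sketch of that idea, using a normalized key to absorb trivial formatting differences - the record fields and data here are illustrative assumptions, not taken from the files or methods Winkler describes:

```python
import re

def normalize(s):
    """Lowercase and strip non-alphanumerics, so formatting
    differences ('John A. Smith' vs 'john a smith') do not
    hide duplicates."""
    return re.sub(r"[^a-z0-9]", "", s.lower())

def dedup_key(record):
    # Quasi-identifier key: normalized name plus date of birth.
    # Real record linkage would compare more fields, tolerate
    # typos, and model match probabilities.
    return (normalize(record["name"]), record["dob"])

def unduplicate(records):
    """Keep the first record seen for each quasi-identifier key."""
    seen = {}
    for rec in records:
        key = dedup_key(rec)
        if key not in seen:
            seen[key] = rec
    return list(seen.values())

records = [
    {"name": "John A. Smith", "dob": "1970-01-02", "addr": "12 Oak St"},
    {"name": "john a smith",  "dob": "1970-01-02", "addr": "12 Oak Street"},
    {"name": "Mary Jones",    "dob": "1985-06-30", "addr": "4 Elm Rd"},
]

print(len(unduplicate(records)))  # 2: the two Smith records collapse
```

At the multi-terabyte scale mentioned in the quote, exact-key deduplication like this would only serve as a blocking step; matching within blocks then has to handle typos, missing fields, and linkage error, as discussed in the paper linked above.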
There is also a group looking at methods for cleaning up and analyzing groups of national files; see www.newton.ac.uk/programmes/INI/iniw91.html .