This poll is closed - here are the results of the KDnuggets Poll:
What was the largest dataset you analyzed / data mined?
For comparison, here are the results of the 2012 KDnuggets Poll: Largest dataset analyzed / data mined, where the median answer was in the 20-40 GB range.
In the similar 2011 KDnuggets Poll: Largest dataset analyzed / data mined, the median answer was in the 10-20 GB range.
Bill Winkler posted on LinkedIn:
We have 10-20 files with 100 million to 300 million records. We need to do modeling/edit/imputation to fill in for missing data so that records satisfy joint distributions in a principled manner and so that individual records satisfy edit (business) rules. We need to unduplicate the individual files. We need to merge files using quasi-identifiers such as name, address, date-of-birth (when available) and other fields. We need to adjust the statistical analyses on merged files for linkage error. Several files are on the order of 6+ TB.
Some of the methods are discussed in www.ine.es/e/essnetdi_ws2011/ppts/Winkler.pdf.
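As a rough illustration of the quasi-identifier matching Winkler describes, here is a minimal Python sketch of blocking-based record linkage. The field names, weights, and threshold are hypothetical, and this is not Winkler's actual method; the methods used in practice are probabilistic (e.g., Fellegi-Sunter), as discussed in the slides above.

# Minimal sketch of blocking-based record linkage on quasi-identifiers.
# All field names, weights, and the threshold are hypothetical examples.
from collections import defaultdict
from difflib import SequenceMatcher

def sim(a, b):
    # Approximate string similarity in [0, 1]
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def block_key(rec):
    # Blocking: compare only records that share a cheap key (first letter
    # of surname + birth year), avoiding an O(n^2) all-pairs comparison
    return (rec["surname"][:1] + rec.get("dob", "????")[:4]).upper()

def link(records_a, records_b, threshold=0.85):
    # Yield candidate matches between two files of dict records
    blocks = defaultdict(list)
    for r in records_b:
        blocks[block_key(r)].append(r)
    for a in records_a:
        for b in blocks[block_key(a)]:
            # Weighted agreement across the quasi-identifiers
            score = (0.5 * sim(a["surname"], b["surname"])
                     + 0.3 * sim(a["address"], b["address"])
                     + 0.2 * (a.get("dob") == b.get("dob")))
            if score >= threshold:
                yield a, b, round(score, 3)

file_a = [{"surname": "Winkler", "address": "1 Census Rd", "dob": "1950-01-01"}]
file_b = [{"surname": "Winkler", "address": "1 Census Road", "dob": "1950-01-01"}]
print(list(link(file_a, file_b)))

On files in the multi-terabyte range described above, the blocking step is what keeps the number of pairwise comparisons tractable.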
There is also a group looking at methods for cleaning up and analyzing groups of national files: www.newton.ac.uk/programmes/INI/iniw91.html