
KDnuggets Home » News :: 2013 :: Apr :: New Poll: Largest Dataset Analyzed / Data Mined? ( 13:n09 )

New Poll: Largest Dataset Analyzed / Data Mined?


The new KDnuggets Poll is asking: what was the largest dataset you analyzed / data mined? Please vote on www.kdnuggets.com.



This poll is closed. Here are the results of the KDnuggets Poll:

What was the largest dataset you analyzed / data mined?

For comparison, here are the results of earlier polls:

In the 2012 KDnuggets Poll: largest dataset analyzed / data mined, the median answer was in the 20-40 GB range.

In a similar 2011 KDnuggets Poll: largest dataset analyzed / data mined, the median answer was in the 10-20 GB range.

Bill Winkler posted on LinkedIn:

We have 10-20 files with 100 million to 300 million records. We need to do modeling/edit/imputation to fill in for missing data so that records satisfy joint distributions in a principled manner and so that individual records satisfy edit (business) rules. We need to unduplicate the individual files. We need to merge files using quasi-identifiers such as name, address, date-of-birth (when available) and other fields. We need to adjust the statistical analyses on merged files for linkage error. Several files are on the order of 6+ TB.

Some of the methods are discussed in www.ine.es/e/essnetdi_ws2011/ppts/Winkler.pdf .

There is a group looking at methods for cleaning up and analyzing groups of national files. www.newton.ac.uk/programmes/INI/iniw91.html .
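To make the unduplication and merging steps in the comment above more concrete, here is a minimal Python sketch of matching records on quasi-identifiers. It is an illustration only, not the method described in the linked slides: the record layout, the `find_duplicates` helper, the date-of-birth blocking key, and the 0.8 threshold are all assumptions, and `difflib` stands in for the specialized string comparators and probabilistic models that production record linkage typically relies on.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """String similarity in [0, 1] from difflib (a stand-in for a proper name comparator)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_duplicates(records, threshold=0.8):
    """Return pairs of record ids that look like the same person.

    Hypothetical helper: records are blocked on date of birth, so only
    records sharing a birth date are compared. This avoids scoring every
    possible pair, which is infeasible for files with millions of records.
    """
    blocks = {}
    for rec in records:
        blocks.setdefault(rec["dob"], []).append(rec)

    pairs = []
    for block in blocks.values():
        for left, right in combinations(block, 2):
            # Average the name and address similarity (an assumed, unweighted score).
            score = (similarity(left["name"], right["name"])
                     + similarity(left["address"], right["address"])) / 2
            if score >= threshold:
                pairs.append((left["id"], right["id"]))
    return pairs


# Made-up toy records, for illustration only.
records = [
    {"id": 1, "name": "John A. Smith", "address": "12 Oak Street", "dob": "1970-01-02"},
    {"id": 2, "name": "Jon A Smith", "address": "12 Oak St", "dob": "1970-01-02"},
    {"id": 3, "name": "Mary Jones", "address": "9 Elm Road", "dob": "1970-01-02"},
    {"id": 4, "name": "John A. Smith", "address": "12 Oak Street", "dob": "1985-06-30"},
]
duplicate_pairs = find_duplicates(records)
print(duplicate_pairs)
```

On this toy data only records 1 and 2 are paired. Record 4 has an identical name and address but a different birth date, so blocking keeps it from ever being compared with record 1. That is the trade-off of blocking: far fewer comparisons, at the cost of missing matches whose blocking key is wrong or missing, which is one source of the linkage error the comment mentions.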

