KDnuggets Home » News » 2011 » May » Publications » Why you can't really anonymize your data  ( < Prev | 11:n13 | Next > )

Why you can't really anonymize your data


 
  
The anonymization process is an illusion. There are now so many different public datasets to cross-reference that any set of records with a non-trivial amount of information on someone's actions has a good chance of matching identifiable public records.


It's time to accept and work within the limits of data anonymization.

O'Reilly, By Pete Warden, May 17, 2011

One of the joys of the last few years has been the flood of real-world datasets being released by all sorts of organizations. These usually involve some record of individuals' activities, so to assuage privacy fears, the distributors will claim that any personally identifying information (PII) has been stripped. The idea is that this makes it impossible to match any record with the person it describes.

Something that my friend Arvind Narayanan has taught me, both with theoretical papers and repeated practical demonstrations, is that this anonymization process is an illusion. Precisely because there are now so many different public datasets to cross-reference, any set of records with a non-trivial amount of information on someone's actions has a good chance of matching identifiable public records. Arvind first demonstrated this when he and a fellow researcher took the "anonymous" dataset released as part of the first Netflix prize and showed how they could correlate the movie rentals listed with public IMDB reviews. That let them identify some named individuals, and then gave them access to those individuals' complete rental histories. More recently, he and his collaborators used the same approach to win a Kaggle contest by matching the topology of an anonymized version and a publicly crawled version of the social connections on Flickr. They were able to take two partial social graphs and, like piecing together a jigsaw puzzle, figure out which fragments matched and represented the same users in both.
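To make the cross-referencing idea concrete, here is a minimal sketch of a linkage attack on toy data. It is not the method used in the Netflix or Flickr work; the records, the names, the `link_records` helper, and the `min_overlap` threshold are all hypothetical and chosen only for illustration. Each public profile is compared against every "anonymous" record, and a match is claimed only when one record shares enough items and clearly beats the runner-up.

```python
# Hypothetical toy data: an "anonymized" release keyed by opaque IDs,
# and public reviews keyed by real names. Items are (title, date) pairs.
anonymized = {
    "user_001": {("Heat", "2005-03-01"), ("Alien", "2005-03-04"), ("Brazil", "2005-04-10")},
    "user_002": {("Heat", "2005-03-02"), ("Fargo", "2005-05-20")},
}
public_reviews = {
    "alice": {("Alien", "2005-03-04"), ("Brazil", "2005-04-10")},
    "bob": {("Fargo", "2005-05-20")},
}


def link_records(anonymized, public, min_overlap=2):
    """Map each public name to the anonymized ID it most plausibly matches.

    A match is reported only if the best candidate shares at least
    `min_overlap` items and strictly beats the second-best candidate.
    """
    matches = {}
    for name, reviews in public.items():
        # Score every anonymized record by how many items it shares.
        scores = sorted(
            ((len(reviews & items), anon_id) for anon_id, items in anonymized.items()),
            reverse=True,
        )
        if not scores:
            continue
        best_score, best_id = scores[0]
        runner_up = scores[1][0] if len(scores) > 1 else 0
        if best_score >= min_overlap and best_score > runner_up:
            matches[name] = best_id
    return matches


print(link_records(anonymized, public_reviews))  # {'alice': 'user_001'}
```

In this toy example only two shared (title, date) pairs are enough to tie "alice" to `user_001`, which then exposes the third item in that record. "bob" is left unmatched because a single shared item falls below the assumed threshold.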

... So, what should we do? Accepting that anonymization is not a complete solution doesn't mean giving up; it just means we have to be smarter about our data releases. Below I outline four suggestions. ... Read more.

