Differential Privacy: How to Make Privacy and Data Mining Compatible

Can privacy coexist with machine learning and data mining? Differential privacy allows the learning of general characteristics of populations while guaranteeing the privacy of individual records.

User data is commercially valuable. The burgeoning data science industry is predicated on the value of insights extracted from databases. At the same time, many users and politicians are concerned about Internet privacy. Intuitively, it might seem that data mining and privacy protection are mutually incompatible goals. Differential privacy, a mathematical definition of privacy introduced in 2006 by Cynthia Dwork of Microsoft Research and her collaborators, offers the possibility of reconciling these competing interests. With differential privacy, general characteristics of populations can be learned while guaranteeing the privacy of any individual's records.

This past week, Zhanglong Ji, Professor Charles Elkan, and I at UCSD published an arXiv preprint of a review paper describing differentially private machine learning. We describe differentially private versions of commonly used machine learning algorithms as well as differentially private methods for data release. As the theory is still relatively unknown outside of academia, an informal treatment here seems appropriate.

Mobile devices are generating tremendous amounts of big data

Big data abounds. No precise definition of "big data" exists, but a good rule of thumb is that the term refers to datasets too large to fit in main memory on a single machine. While the buzzword may be overused, the trend is real. Cheap memory, fast Internet connections, and obsessively used, sensor-laden smartphones have combined to generate massive datasets as well as the means to transmit and store them.

While companies could amass datasets about anything measurable, generally the most coveted datasets in Silicon Valley contain individuals' personal information. The economic significance of such data is obvious. If a company can predict a user's purchasing decisions, it can advertise optimally. Google and Facebook rely upon well-chosen ads to monetize their otherwise free web services.

The potential value of mining human-generated data goes beyond advertising. The collective health data generated by a large population may contain insights that could bring about better health outcomes for everyone. Medical institutions are eager to mine patient records for longitudinal observations in the hope of generating the knowledge necessary for personalized medicine.

What About Privacy?

While the potential benefits of large-scale data mining are obvious, so too are the pitfalls. As user information is collected, mined, and sometimes published for profit, concerns about privacy have grown. Recently, the European Union issued a "Right to be Forgotten" ruling, reflecting the desires of many individuals to restrict the use of their data. Other recent well-publicized cases contested the unauthorized inclusion of user data in advertisements. Several well-documented privacy mishaps, including the re-identification of users from the Netflix Prize dataset, have attracted national attention. In another famous case, the private medical records of Governor William Weld of Massachusetts were identified in supposedly anonymous records released by the Group Insurance Commission.

Differential privacy offers one way forward, allowing data scientists to extract insights from a database while guaranteeing that no individual can be identified. The definition is actually more general. It guarantees that the answer one gets from any query on a database is not perceptibly different if any one individual is excluded from the database. Concretely, one could guarantee an individual that no additional harm would come to them should they choose to participate in a study. Of course, it's possible that even if they don't participate in the study, they may feel that some harm has come to them. For example, a study may reveal some unfavorable health characteristic of a specific subpopulation. Differential privacy ensures the privacy of people but not of populations.
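
For readers who want the formal version, the "not perceptibly different" guarantee is usually stated as follows. The symbols are not used elsewhere in this article and are introduced here only for illustration: M is a randomized mechanism that answers queries, D and D' are any two databases that differ in one individual's record, S is any set of possible answers, and ε is a small positive privacy parameter.

```latex
\Pr\bigl[M(D) \in S\bigr] \;\le\; e^{\varepsilon} \, \Pr\bigl[M(D') \in S\bigr]
```

A smaller ε forces the two probabilities closer together, which means stronger privacy. When ε is close to zero, an observer looking at the answer can barely tell whether any one individual's record was present.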

Differentially private mechanisms typically accomplish this guarantee by adding noise to any answer returned by the database. This may seem counterintuitive. We typically mine data in the hope of seeing through noise. Why would we deliberately add it? The hope with differential privacy is that the amount of noise added should be large enough to conceal the effects of individuals, but small enough that it does not seriously impact the usefulness of the answer.
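
As a rough illustration of adding noise to an answer, here is a minimal sketch of one common choice, the Laplace mechanism, applied to a counting query. The function names, the toy `ages` list, and the value of `epsilon` are all made up for this example; this is a sketch under those assumptions, not a production implementation.

```python
import random


def sample_laplace(scale, rng):
    """Draw one sample from a zero-mean Laplace distribution.

    The difference of two independent exponential draws, each with
    mean `scale`, follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, rng):
    """Return the number of matching records plus Laplace noise."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    # Assumption: including or excluding one person changes a count
    # by at most 1, so the noise scale is 1 / epsilon.
    return true_count + sample_laplace(1.0 / epsilon, rng)


rng = random.Random(0)  # seeded only so the example is repeatable
ages = [23, 35, 41, 29, 62, 57, 33, 48]  # hypothetical toy data
print(noisy_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng))
```

Each call returns a different noisy answer centered on the true count. A smaller `epsilon` gives stronger privacy but a noisier answer, which is the trade-off described above.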

The magnitude of noise added depends upon the maximum possible effect that one individual's inclusion might have on the true answer. Intuitively, population-wide statistical estimators for common traits, e.g. the average height in a population, would typically require small amounts of noise, while the values of very rare features might need to be distorted more substantially to obscure the contribution of an individual.
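
To make this intuition concrete, the sketch below compares the noise scale for an average over a large population with the noise scale for a count of a rare trait, assuming Laplace noise with scale equal to sensitivity divided by epsilon. All numbers, bounds, and function names are hypothetical. The sketch also assumes that the population size is public, that heights are clamped to a known range, and that two databases are neighbors when they differ by replacing one record.

```python
def laplace_noise_scale(sensitivity, epsilon):
    """Scale of Laplace noise for a query with the given sensitivity."""
    return sensitivity / epsilon


def bounded_mean_sensitivity(lower, upper, n):
    """Largest change in a mean when one of n clamped records is replaced."""
    # Assumption: n is public and fixed, and every value lies in [lower, upper].
    return (upper - lower) / n


epsilon = 0.1  # hypothetical privacy parameter

# Average height in cm over 10,000 people, heights clamped to [100, 220].
mean_scale = laplace_noise_scale(
    bounded_mean_sensitivity(100, 220, 10_000), epsilon
)

# Count of people with a rare trait: one person changes it by at most 1.
count_scale = laplace_noise_scale(1, epsilon)

print(f"mean height: noise scale {mean_scale:.2f} cm on an answer near 170 cm")
print(f"rare trait:  noise scale {count_scale:.1f} on a true count of, say, 3")
```

Under these assumptions the noise on the average is a tiny fraction of the answer, while the noise on the rare count is larger than the count itself.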

Limits of Differential Privacy

Differential privacy addresses a very specific notion of privacy. It is suited to the situation in which one is deciding whether to allow one's data to be included in a database, say for a research study. If every mechanism that accesses the data is proven to be differentially private, then no individual's data perceptibly affects the results of the study, which makes a guarantee of differential privacy a compelling argument for participation.

However, this is not the only notion of privacy. An individual might consider any piece of information that is not obviously observable to be private. For example, one could prefer to keep their sexual orientation private. As another example, one could reasonably want to keep secret any family predisposition for certain medical conditions. In a world in which these things are unknowable, that privacy is attainable.

But as machine learning advances, and tools for prediction become increasingly powerful, algorithms could conceivably make high-confidence predictions, even without spying directly on your private information. Imagine a system which could infer your sexual orientation from subtle clues in your speaking voice and video of you walking. If such a system reached 99.9% accuracy, no individual could claim this information as private any longer.

The interplay between knowledge discovery and privacy is complicated. Differential privacy offers a powerful theoretical framework for considering concrete ways in which the two may be reconcilable. But many open questions remain regarding the future of privacy in a world awash with large-scale data mining, even if all data access were differentially private.

Zachary Chase Lipton is a PhD student in the Computer Science and Engineering department at the University of California, San Diego. Funded by the Division of Biomedical Informatics, he is interested in both theoretical foundations and applications of machine learning. In addition to his work at UCSD, he has interned at Microsoft Research Labs.