Privacy-preserving AI – Why do we need it?
The usual process of building data- and AI-based systems can create a range of data privacy threats. State-of-the-art technologies in the domain of privacy-preserving AI can help avoid them.
In today's digitized world, data privacy is an important concern for both private and public organizations. This has led to an interesting challenge in the analytics industry, where data accessibility is key to the development of high-quality machine learning models.
When working on artificial intelligence projects in the real world, you will find that most datasets are siloed within large enterprises for two reasons:
First, organizations have legal requirements that preclude them from sharing datasets outside the organization, in order to keep the data safe from both accidental and intentional leakage.
Second, retaining large datasets collected from or about their customers also confers a competitive advantage.
This data can help organizations improve and personalize their products and services. And it makes a lot of sense if you think about it. It's better to measure what your users like than to guess and build products that no one wants to use. But this can also be dangerous. It undermines the privacy of customers because the collected data can be sensitive, causing harm if leaked.
So while companies love to use data to improve their products, we, as users, want to protect our privacy.
These conflicting needs can be met with a technique called differential privacy, or Privacy-Preserving AI (PPAI). PPAI allows organizations to collect information about their users without compromising the privacy of any individual.
PPAI is a relatively new field, beginning with statistical database queries around 2003. Its application in contexts such as Machine Learning is more recent. The general goal of PPAI is to ensure that different kinds of statistical analyses don't compromise data privacy. We mean statistical analysis in the most general sense: given some training data, a database, or any other dataset about individuals, we want to make sure that our analysis of that dataset does not compromise the privacy of any particular individual contained within it.
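To make the idea of a privacy-preserving statistical query concrete, here is a minimal sketch of one standard technique, the Laplace mechanism, applied to a counting query. The article does not prescribe a specific mechanism, so treat this as an illustration under assumptions: the toy dataset, the function names, and the privacy parameter `epsilon` (smaller means more noise and more privacy) are all hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution with that scale.
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def noisy_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Answer "how many records satisfy predicate?" with Laplace noise added.

    Adding or removing one person changes a count by at most 1, so the
    noise scale used here is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy dataset: one dict per individual.
people = [
    {"age": 34, "has_condition": True},
    {"age": 51, "has_condition": False},
    {"age": 29, "has_condition": True},
    {"age": 62, "has_condition": True},
]

rng = random.Random(0)  # seeded only so the sketch is repeatable
answer = noisy_count(people, lambda p: p["has_condition"], epsilon=0.5, rng=rng)
print(f"noisy count: {answer:.2f}")
```

The analyst sees only the noisy answer. Averaged over many such queries on large datasets the statistics remain useful, while the presence or absence of any single person is masked by the noise.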
Now, in order to be able to accomplish this goal, we first need to propose a robust definition of privacy. There are many definitions available:
Definition 1: Privacy is preserved if, after the analysis, the analyzer doesn't know anything about the people in the dataset. They remain 'unobserved.'
But this definition may not be sufficient in all contexts. For example, if we apply this definition of privacy to our homes, using door curtains and window blinds, we are able to ensure information about individuals in the house does not leak, and they remain "unobserved." However, in the context of statistics and Machine Learning, this is insufficient. This is because we're trying to learn something about a dataset without learning specific things about individuals, which might be sensitive or might harm them in some way.
In truth, we're trying to protect not only the information but also the people, right?
So this leads us to the modern, intuitive definition by Cynthia Dwork, widely recognized as a pioneering figure in the field of PPAI. Dwork offers the following definition of privacy in the book ‘The Algorithmic Foundations of Differential Privacy,’ co-authored with Aaron Roth.
Definition 2: "Differential privacy" describes a promise, made by a data holder, or curator, to a data subject:
"You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources are available."
(Note that the terms PPAI and Differential Privacy are often used interchangeably in academia and industry.)
As you can see, this definition of differential privacy is broader than the previous one and quite challenging to fulfill. But the true goal of the field of differential privacy is to provide tools and techniques that allow a data holder to make this promise to the individuals being studied.
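This promise is usually made precise with a privacy parameter. As a sketch in notation that does not appear in this article but follows the standard formulation in Dwork and Roth's book: a randomized mechanism M is called ε-differentially private if, for every pair of datasets D and D' that differ in one individual's record, and for every set S of possible outputs,

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S]
```

In words, whether or not your record is included, every outcome of the analysis is almost equally likely. The smaller ε is, the closer the two probabilities must be, and the stronger the promise.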
But why go to all this trouble? Couldn't companies just collect our data with consent and remove our names and other personally identifiable content? Anonymize the data and remove the sensitive stuff?
Easy, right? Not quite!
There are two problems with this approach.
First, the anonymization usually happens on the servers of the companies that collect your data. So you have to trust them to really remove these identifiable records. How effective this will be boils down to their data quality and security controls, and whether the process is auditable (for example, under GDPR).
Second, how anonymous is 'anonymized' data, really?
In 2006, Netflix launched a contest called the Netflix Prize. Teams had to create an algorithm that could predict how someone would rate a movie. To help with this challenge, Netflix provided a dataset that contained over 100 million ratings submitted by 480,000 users across 17,000 movies. Of course, Netflix anonymized this dataset by removing the names of users, and by replacing some ratings with fake and random data. Specifically, the names of the movies and the usernames of the individuals had been replaced with unique integer ids.
On its face, the dataset did not disclose any private information about the individuals doing the rating. Despite this, two researchers at the University of Texas, Arvind Narayanan and Vitaly Shmatikov, were able to de-anonymize both the names of the movies and the names of the individuals using a clever statistical technique (described in their paper, ‘Robust De-anonymization of Large Sparse Datasets’). They scraped the IMDb movie review site and applied statistical analysis to find individuals who were rating movies on both IMDb and Netflix. This allowed them to de-anonymize a large percentage of the users on Netflix as well as the movies they were watching.
Now consider a hypothetical scenario in which this anonymized data on Netflix users was leaked in a hacking attack instead of being shared for a contest. In either scenario, the result remains the same. Therefore, it is incorrect to assume that leaked data is safe simply because it has been anonymized.
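The core idea of such a linkage attack can be sketched in a few lines. This is a deliberately simplified illustration, not the researchers' actual algorithm (which used a more careful statistical scoring): all names, ids, and ratings below are hypothetical, and the matching rule is just a count of identical ratings.

```python
# Hypothetical "anonymized" release: integer user id -> {movie: rating}.
anonymized = {
    101: {"Movie A": 5, "Movie B": 2, "Movie C": 4, "Movie D": 1},
    102: {"Movie A": 3, "Movie E": 5, "Movie F": 4},
    103: {"Movie B": 4, "Movie C": 4, "Movie G": 2},
}

# Hypothetical public reviews scraped from another site, posted under real names.
public = {
    "alice": {"Movie A": 5, "Movie C": 4, "Movie D": 1},
    "bob": {"Movie E": 5, "Movie F": 4},
}


def overlap_score(a: dict, b: dict) -> int:
    """Count the movies that both profiles rated with the same rating."""
    return sum(1 for movie, rating in a.items() if b.get(movie) == rating)


def link(public_profiles: dict, released: dict, min_matches: int = 2) -> dict:
    """Map each public name to the released id that matches it best.

    A name is linked only if the best candidate agrees on at least
    `min_matches` ratings and strictly beats the runner-up.
    """
    links = {}
    for name, profile in public_profiles.items():
        scored = sorted(
            ((overlap_score(profile, ratings), uid) for uid, ratings in released.items()),
            reverse=True,
        )
        if not scored:
            continue
        best_score, best_uid = scored[0]
        runner_up = scored[1][0] if len(scored) > 1 else 0
        if best_score >= min_matches and best_score > runner_up:
            links[name] = best_uid
    return links


print(link(public, anonymized))  # {'alice': 101, 'bob': 102}
```

Even with names stripped, a handful of ratings acts like a fingerprint: once a public profile lines up with an "anonymous" id, everything else that id rated is exposed too.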
Perhaps even more extraordinary, in 1997, a similar technique was used to de-anonymize health records, by looking at multiple, separate anonymized data sets as well as online voter registration records. This ultimately led to the re-identification of then-Massachusetts Governor William Weld's medical records from 'anonymous' data.
Again, in 2013, a Harvard professor was able to re-identify the names of over 40 percent of a sample of anonymous participants in a high-profile DNA study, by cross-referencing them with other publicly available datasets (as reported in Nature).
These examples illustrate the significance of Cynthia Dwork's definition of privacy, in which the notion of “no matter what other studies, data sets, or information sources are available” becomes paramount. Furthermore, old-fashioned data anonymization, which many organizations continue to follow, is simply not robust enough given recent advances in statistical techniques. Privacy-preserving AI can help plug many of these gaps and allow organizations to develop cutting-edge machine learning models for customer insights without compromising on data privacy.
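Cross-referencing of this kind often comes down to a join on fields that look harmless on their own. The sketch below assumes both datasets share ZIP code, birth date, and sex; these columns, and every record shown, are hypothetical and chosen only for illustration, since the article does not say which fields were used in either study.

```python
from collections import defaultdict

# Hypothetical "anonymous" medical records: names removed, demographics kept.
medical = [
    {"zip": "12345", "birth_date": "1950-01-15", "sex": "M", "diagnosis": "condition X"},
    {"zip": "12346", "birth_date": "1982-03-02", "sex": "F", "diagnosis": "condition Y"},
    {"zip": "12346", "birth_date": "1982-03-02", "sex": "F", "diagnosis": "condition Z"},
]

# Hypothetical public roll that lists names alongside the same demographics.
voters = [
    {"name": "Person One", "zip": "12345", "birth_date": "1950-01-15", "sex": "M"},
    {"name": "Person Two", "zip": "12346", "birth_date": "1982-03-02", "sex": "F"},
]

# Assumed quasi-identifiers: fields that are not names but can single people out.
QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def key(record: dict) -> tuple:
    """Build the join key from the quasi-identifier fields."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def reidentify(anonymous: list, named: list) -> dict:
    """Link a name to a diagnosis when exactly one anonymous record shares its key."""
    groups = defaultdict(list)
    for row in anonymous:
        groups[key(row)].append(row)

    matches = {}
    for person in named:
        candidates = groups.get(key(person), [])
        if len(candidates) == 1:  # a unique combination gives a confident link
            matches[person["name"]] = candidates[0]["diagnosis"]
    return matches


print(reidentify(medical, voters))  # {'Person One': 'condition X'}
```

Person Two is not linked because two anonymous records share the same combination. The rarer a combination of quasi-identifiers is, the easier the re-identification becomes, which is why removing names alone offers little protection.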
Bio: Upendra Singh is a seasoned full-stack data scientist with 12+ years of experience in data science, machine learning, and big data engineering. Areas of interest include Privacy Preservation, Deep Learning, and Data Deduplication. An avid reader and dog lover, he has spoken at big data conferences like “The Fifth Elephant.” He is currently a Senior Principal Architect – Data Sciences, DX, at Epsilon.