A Simpler Explanation of Differential Privacy

Privacy concerns in data mining have been raised from time to time, could differential privacy be a solution? Differential privacy was devised to facilitate secure analysis over sensitive data, learn how it can be used to improve the model fitting process.

By Nina Zumel (Win-Vector).

Differential privacy was originally developed to facilitate secure analysis over sensitive data, with mixed success. It’s back in the news again now, with exciting results from Cynthia Dwork, et. al. (see references at the end of the article) that apply results from differential privacy to machine learning.

In this article we’ll work through the definition of differential privacy and demonstrate how Dwork et.al.’s recent results can be used to improve the model fitting process.

The Voight-Kampff Test: Looking for a difference. Scene from Blade Runner

The Definition of Differential Privacy

We can define epsilon-differential privacy through the following game:

• A learner implements a summary statistic called A().
• A (notional) adversary proposes two data sets S and S’ that differ by only one row or example, and a test set Q.
• A() is called epsilon-differentially private iff:

| log( Prob[A(S) in Q] / Prob[A(S’) in Q] ) | ≤ epsilon

for all of the adversary’s choices of S, S’ and Q. The probabilities are defined over coin flips in the implementation of A(), not over the data or the adversary’s choices.

The adversary’s goal is to use A() to tell between S and S’, representing a failure of privacy. The learner wants to extract useful statistics from S and S’ without violating privacy. Identifying a unique row (the one which changed markings) violates privacy. If the adversary can tell which set (S or S’) the learner is working on by the value of A(), then privacy is violated.

Notice S and S’ are “data sets” in the machine learning sense (collections of rows carrying information). Q is a set in the mathematic sense: a subset of the possible values that A() can return. To simplify the demonstration, we will take Q to be an interval.

In the following examples, A() returns the (approximate) expected value of a set s. The adversary has chosen two sets S and S’ of size n = 100:

• S = {0,0,0,…,0} (100 zeros)
• S’ = {1,0,0,…,0} (1 one and 99 zeroes)

The set Q will be an interval [T, 1], where the adversary picks the threshold T.

The adversary’s goal is to pick T such that when he sees that A(s) ≥ T, he knows that A() has just evaluated S’. The learner has two (competing) goals:

• To pick an algorithm A() such that A(S) and A(S’) are so “close” that the adversary can’t pick a reliable T, to preserve differential privacy. “Close” is defined by epsilon.
• To have A() be a good estimate of the expectation, for performing useful analysis.

The Deterministic Case

Here, A(s) simply returns the expected value `mean(s)`, so it always returns A(S) = 0 when evaluating S, and A(S’) = 0.01 when evaluating S’. This is clearly not differentially private for any value of epsilon. If the adversary picks T = 1/200 = 0.005, then he can reliably identify when A() has evaluated the set S’, every time they play a round of the game.

In the figures, set1 in green represents the distribution of values returned by calls to A(S), and set2 in orange returns the distribution of values returned by calls to A(S’). I’ve plotted set2 upside down for clarity.

One way for the learner to obscure which set she is evaluating is to add a little random noise to the estimate, to “blur” it. Following Dwork et.al.’s methods, we’ll add noise from a Laplace distribution, which is symmetric and falls off slower than gaussian noise would. Here we show adding just a little bit of noise (of “width” sigma = 1/3n):

The shaded green region represents the chance that A(S) will return a value greater than the adversary’s threshold T — in other words, the probability that the adversary will mistake S for S’ in a round of the game. We’ve now made that probability non-zero, but it’s still much more likely that if A(s) > T, then s = S’. We need more noise.

In particular, we need the noise to be bigger than the gap A(S’)-A(S) = 1/n, or 0.01. Let’s pick sigma = 3/n = 0.03:

Now we see that the shaded green region has almost the same area as the shaded orange region — you can think of epsilon as expressing the difference between the shaded green region and the shaded orange region. In fact, the absolute value of the logratio of the two areas is epsilon. In other words, observing that A(s) > T is no longer strong evidence that s = S’, and we have achieved differential privacy, to the degree epsilon.

Of course, A(s) is also no longer a precise estimate of `mean(s)`. We’ll return to that point in a bit.

Simulating the Game

We can simulate the game I described above in R. I’ve put the code on github, here.

The code sweeps through a range of values of epsilon, and for each value selects a suitable noise width and runs 1000 rounds of the differential privacy game, returning a value for A(S) and A(S’). Each round we record whether A(S) and A(S’) are greater than T = 1/200.