A Simpler Explanation of Differential Privacy
Privacy concerns in data mining have been raised from time to time, could differential privacy be a solution? Differential privacy was devised to facilitate secure analysis over sensitive data, learn how it can be used to improve the model fitting process.
From Differential Privacy to Machine Learning
Differential privacy aims to make the answers to “snooping queries” too vague to distinguish closely related sets (in this case, it makes the probability that A(S) ≥ T about the same as the probability that A(S’) ≥ T). But for machine learning, we are also interested in the output of A(). We can use the simulation above to estimate what happens to A(S) and A(S’) for different values of epsilon. Here we plot the (normalized) gap between the expected values of A(S) and A(S’) as a function of epsilon, from the simulation:
As epsilon gets smaller (implying stricter privacy), the relative gap between the expected values of A(S) and A(S’) gets smaller. The discrepancies in the plot are probably due to poor choices of the noise parameter (I picked them heuristically), but the trend is clear. This makes sense, since privacy implies that the A(S) should look a lot like A(S’).
However, as epsilon gets stricter, the estimates of
mean(S) = 0 and
mean(S') = 0.01 — which is what A() is supposed to estimate — also become poorer.
This is the trade-off: how can we preserve privacy and still perform useful analysis?
Differential Privacy and Adaptive Data Analysis
The recent papers by Dwork, et.al. apply differential privacy to the problem of adaptive data analysis, or reusable test sets. In standard machine learning practice, we use two data sets to model: a training set to fit the model, and a test set to evaluate the model. Ideally we only use the test set once; but in practice we go back to it again and again, as we try to improve our model. We already know that performance estimates of a model over its training set are upwardly biased (the model looks more accurate on training than it really is, because it “knows” the training set); if we go back to the test set too many times, then performance estimates of the model on the test set are also upwardly biased, because we also “know” the test set. This problem is exacerbated when we also use the test set to tune the model: for example to pick the variables we use, or tune modeling parameters. In other words, even if we observe the best practice of using a training set to fit models, a calibration set to tune models, and a holdout set to evaluate the model, we can also contaminate the calibration set if we look at it too many times.
Dwork, et.al., describe how (and how many times) we can go back to a test set without biasing its evaluations of model performance. To do this, we make sure that the modeling procedure only interacts with the test set in a differentially private manner. Because it has limited access to the test set, the modeling procedure can’t learn the test set, so evaluations of a model over the test set still accurately reflect a model’s performance on unseen data.
Describing the proofs and techniques in these papers is outside the scope of this article, but we can demonstrate the results in a simple example, similar to the example application shown in Dwork, et.al.’s Science paper.
Using Differential Privacy for Stepwise Regression
For our example, suppose we have 2000 rows of data with a binary outcome y (50% prevalence of the positive class), and 110 possible input variables to choose from. In our tests we used synthetic data with ten variables (
x1...x10) that carry signal and one hundred noise variables (
We decide to use forward-stepwise logistic regression to choose a suitable model for predicting y. We split the sample data into 1000 rows for the training set, and 1000 rows for the test set. Given that we have selected k-1 variables and have fit a model, M(k-1), that uses those variables, we pick the kth variable by fitting a model Mnew on the training set using the (k-1) previously selected variables plus one new variable from the remaining candidates. We then evaluate the accuracy of M(k-1) and Mnew on the test set, and pick as the kth variable the one for which Mnew showed the most improvement. In other words, we fit candidate models on the training set, and evaluate them for improvement on the test set. This is a fairly common way of tuning models. Standard implementations of stepwise regression often use training set AIC as the criterion for picking the next model; we use accuracy improvement to keep the code straightforward, and closer to the assumptions in the paper.
Normally, the stepwise procedure would terminate when the model stops improving, but for this example we let it run to pick fifty variables. At each step we recorded the accuracy of the selected model on the training set (green circles), the test set (orange triangles) and on a fresh holdout set of 10,000 rows (purple squares). We picked this additional large holdout set (called “fresh”) to get a good estimate of the true accuracy of the model. Here’s the performance of the naive procedure: