# A Simpler Explanation of Differential Privacy

Privacy concerns in data mining have been raised from time to time, could differential privacy be a solution? Differential privacy was devised to facilitate secure analysis over sensitive data, learn how it can be used to improve the model fitting process.

The test set performance estimates were are more upwardly biased than the training set estimates! This is not too surprising, since we are using the test set to select variables. Despite the optimistic curve of the test scores, we can see from the true out-of-sample performance (`freshScore`

) that the model performance is not much better than random; in fact, the algorithm only picked one of the ten signal carrying variables (the first one selected). The test set (and to a lesser extent, the training set) believe that performance continues to improve, when in fact performance degrades as more noise variables swamp out the single signal variable.

We want to use the test set to pick variables, while also using it to accurately estimate out-of-sample error. In their recent *Science* article, Dwork, et.al. propose an algorithm called *Thresholdout* to let us safely reuse the test set. We implemented our `diffPrivMethod`

based on *Thresholdout*. `diffPrivMethod`

evaluates proposed models on both the training and test sets. If the model’s delta accuracies on the test and the training sets are within a certain tolerance, then `diffPrivMethod`

returns the training set delta accuracy with a bit of Laplacian noise as the model score. If the two delta accuracies are far apart, then `diffPrivMethod`

adds a bit of Laplacian noise to the test set delta accuracy, and returns that as the model score. In R, the code would look something like this:

# Return an estimate of model performance in a # differentially private way # v is the new candidate variable # varSet is the set of already selected variables diffPrivMethod = function(v) { # Original model was fit on # training data using varSet; # Fit a new model using varSet+v # using the training data # and evaluate the difference # between the new models on the test set testScore <- modelScoreImprovement(varSet,v, trainData, testData) # Do the same thing, but evaluate # the improvement on the training set trainScore <- modelScoreImprovement(varSet,v, trainData, trainData) if(abs(testScore-trainScore)<=eps/2) { # if the scores are close, # return trainScore with a little noise sc <- trainScore + rlaplace(1,eps/2) } else { # otherwise, return testScore # with a little more noise sc <- testScore + rlaplace(1,eps) } # make sure the score is in range [0,1] pmax(0,pmin(1,sc)) }

We picked the tolerance and the widths of the Laplace noise somewhat arbitrarily.

The beauty of this scheme is that we never look directly at the test set. If the test and training set performances are similar, we see the (noisy) training performance; when we do look at the test set, we only see a noisy view of it, so it leaks information more slowly.

The original *Thresholdout* algorithm does not add noise to the training set performance, probably under the reasoning that we know it already. We decided to add noise anyway, on the principle that even knowing *when* you are using the test set performance leaks a little information, and we wanted to reduce the leakage a little more.

Here are the results of the stepwise regression using `diffPrivMethod`

(we set the tolerance/noise parameter eps=0.04 for this run):

Now the test set estimates the true out-of-sample accuracies quite well. The chosen models achieve a peak accuracy of about 61%, at around 36 variables. This method did a better job of picking variables — it found all ten signal variables — though it did start picking noise variables fairly early on.

However, there is still money left on the table. What if we used a really large test set? Here, we show the results of running the naive stepwise regression (no differential privacy), with the same training set of 1000 rows and a test set of 10,000 rows.

Now the learning algorithm picks nine of the signal variables immediately, and achieves an accuracy of 62%. We can see that the model’s performance on the test set is still upwardly biased, but much less so than with the naive method. Because the test set is larger, we don’t contaminate it as much, even when looking at it directly.

This suggests that we can think of differential privacy as letting us simulate having a larger test set, though obviously we still need our actual test set to be a reasonable size, especially since stepwise regression requires a lot of queries to the test set. Our results also show that while these new differential-privacy based schemes let us do a good job of estimating out-of-sample performance, that alone is not enough to insure best possible model performance.

All code, data, and examples for this experiment can be found here.

### References

Dwork, Cynthia, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth. “Preserving Statistical Validity in Adaptive Data Analysis”, arXiv:1411.2664, April 2015.

Dwork, Cynthia, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth. “The reusable holdout: Preserving validity in adaptive data analysis”, *Science*, vol 349, no 6248 pp 636-638, August 2015. [link to abstract]

Blum, Avrim and Moritz Hardt. “The Ladder: A Reliable Leaderboard for Machine Learning Competitions”, arXiv:1502.04585, February 2015.

**Bio:** Nina Zumel is a Principal Consultant with Win-Vector LLC, a data science consulting firm based in San Francisco. Previously, she was the co-founder and owner of Quimba Software, which conducted research and specialty software development for defense and emergency response applications.

**Related:**