KDnuggets News 02:11, item 17, Briefs


Briefs

IBM works on Privacy-Preserving Data Mining

Rakesh Agrawal and Ramakrishnan Srikant, researchers at IBM's Almaden Research Center in California, are developing "Privacy-Preserving Data Mining," a technique that hinges on the idea that a customer's personal data can be scrambled before it is relayed to a Web server or database.

With this technique, a retailer could generate accurate data models -- so valuable in helping e-businesses customize services according to demographics or tastes -- without ever seeing personal information.

For example, a Web site asking for salary can set a randomization range of -$15,000 to +$15,000. If John Doe wants to enter his salary on a Web merchant's site, he may do so without fear of the actual figure being revealed. Suppose he earns $90,000. He enters it, but the software scrambles it and records it as, say, $75,000 or $105,000. Hence, John Doe's true salary is masked. The data mining software has access only to the randomized values and the parameters of randomization, and nothing else.
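The salary example can be sketched in a few lines of Python. This is only an illustration, assuming the noise is drawn uniformly from the stated range of -$15,000 to +$15,000; the function name `randomize_salary` and its `spread` parameter are hypothetical and are not taken from IBM's software.

```python
import random


def randomize_salary(true_salary, spread=15_000, rng=random):
    """Return the salary plus noise drawn uniformly from [-spread, +spread].

    Assumption: uniform noise. The brief only states the allowed range,
    not the distribution used inside that range.
    """
    return true_salary + rng.uniform(-spread, spread)


# A seeded generator keeps the illustration repeatable.
rng = random.Random(0)

# John Doe earns $90,000; only the scrambled value leaves his browser.
reported = randomize_salary(90_000, rng=rng)
print(f"Reported salary: ${reported:,.0f}")
```

Whatever value is printed, it lies between $75,000 and $105,000, so the recipient cannot tell where in that band the true salary sits.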

What remains constant is the allowed range of the randomization, which is linked to the desired level of privacy. A larger range increases uncertainty and therefore the users' privacy, but it naturally reduces the accuracy of the results produced by a data mining algorithm that takes the randomized data as input.

Agrawal maintains that this is a trade-off. Experiments indicate only a 5-10 percent loss in accuracy, even at 100 percent randomization, once the data mining algorithm has applied corrections to the randomized distributions.
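To see why aggregate accuracy can survive the scrambling, consider a deliberately simple correction. The sketch below is not the reconstruction procedure used by the IBM researchers; it is a basic moment-based correction, assuming uniform noise with a known range. Because the noise averages to zero, the mean of the randomized values approximates the true mean, and because uniform noise on [-a, +a] has variance a^2/3, that amount can be subtracted from the observed variance. The salary population, the sample size, and all function names are made up for illustration.

```python
import random
import statistics


def randomize(values, spread, rng):
    """Add independent uniform noise from [-spread, +spread] to each value."""
    return [v + rng.uniform(-spread, spread) for v in values]


def corrected_variance(randomized, spread):
    """Estimate the variance of the original data from randomized values.

    Uniform noise on [-spread, +spread] has variance spread**2 / 3, and
    independent noise adds its variance to that of the data.
    """
    noise_variance = spread ** 2 / 3
    return statistics.pvariance(randomized) - noise_variance


rng = random.Random(42)
spread = 15_000

# Hypothetical population of true salaries (never seen by the miner).
true_salaries = [rng.gauss(60_000, 12_000) for _ in range(20_000)]

# What the miner actually receives.
randomized = randomize(true_salaries, spread, rng)

true_mean = statistics.fmean(true_salaries)
est_mean = statistics.fmean(randomized)
true_var = statistics.pvariance(true_salaries)
est_var = corrected_variance(randomized, spread)

print(f"mean:     true {true_mean:,.0f}  estimated {est_mean:,.0f}")
print(f"variance: true {true_var:,.0f}  estimated {est_var:,.0f}")
```

Increasing `spread` hides individual values better but makes the corrected estimates noisier for a given number of respondents, which is the trade-off the brief describes.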

"Our research institutionalizes the notion of fibbing on the Internet, and does so to preserve the overall reality behind the data," said Agrawal.

See http://www.internetnews.com/infra/article.php/1150901
