New York Times, By STEVE LOHR, May 21, 2011
Earlier this month, Overstock.com, an online retailer, announced it would sponsor a $1 million contest, with the prize going to the person or team that could most improve its product recommendations. And a few weeks earlier, the Heritage Provider Network, a medical group in California, released the data and the details for its $3 million contest. Its prize will go to the team that comes up with a technique for most accurately predicting which patients will be admitted to hospitals in the next year.
Both contests, like the Netflix competition, require contestants to come up with predictive algorithms, using anonymized personal data as the test bed.
So how to avoid a Netflix-style privacy blowup in the new contests?
Darren Vengroff, chief scientist at Rich Relevance, a start-up that develops recommendation technology for online retailers, has a plan. Rich Relevance is running the RecLab Prize, with Overstock putting up the $1 million.
Mr. Vengroff's strategy involves limiting the number of contestants who receive real customer data, with names and other identifying information stripped off. In the early round of competition, teams will instead get a hypothetical data set.
Then, in the semifinals (10 contestants) and finals (down to three), he explains, the competing algorithms will be running on real customer data. But that customer data will reside on Rich Relevance's computers, in a private "cloud" environment. That approach differs from the model used by Netflix, which released the anonymized data to contestants.
The organizers of the $3 million Heritage Health Prize, it seems, are counting on Arvind Narayanan. He was one of the two researchers who took the Netflix data and showed it could be mined and massaged to identify customers.
Heritage has put Mr. Narayanan on its advisory board. Today, he is a postdoctoral researcher at Stanford University and a scholar at Stanford's Center for Internet and Society.