Microsoft researchers say anonymized data isn't so anonymous

Web log data can identify individual machines. HTTP user-agent information alone can identify the host 62% of the time, and adding the IP address raises accuracy to 81%.

By Tim Greene, Network World, February 02, 2012

Data routinely gathered in Web logs - IP address, cookie ID, operating system, browser type, user-agent strings - can threaten online privacy because they can be used to identify the activity of individual machines, Microsoft researchers say.

At the same time, analysis of such data when anonymized can help detect malicious activity and so improve overall Internet security, they add.

The researchers found that HTTP user-agent information alone can accurately tag a host 62% of the time. Combine that same information with the IP address, and the accuracy jumps to 80.6%. If the user-agent information is combined with just the IP prefix, the accuracy is still 79.3%, they say.
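
As a rough illustration of how combining fields sharpens identification, the sketch below scores a tiny, made-up log by the fraction of records whose key points to exactly one host. The record layout, the sample data, the `unique_fraction` helper, and the scoring rule are all assumptions for illustration. They are not the researchers' methodology or data.

```python
from collections import defaultdict

# Hypothetical toy log: (host_id, ip, user_agent).
# host_id stands in for ground truth that a real log would not contain.
LOG = [
    ("h1", "10.0.0.1", "UA-A"),
    ("h2", "10.0.0.2", "UA-A"),
    ("h3", "10.0.0.3", "UA-B"),
    ("h4", "10.0.1.4", "UA-C"),
]

def unique_fraction(records, key_fn):
    """Return the fraction of records whose key is seen on exactly one host."""
    hosts_per_key = defaultdict(set)
    for host, ip, ua in records:
        hosts_per_key[key_fn(ip, ua)].add(host)
    unique = sum(
        1 for host, ip, ua in records
        if len(hosts_per_key[key_fn(ip, ua)]) == 1
    )
    return unique / len(records)

# User-agent alone: "UA-A" is shared by two hosts, so it is ambiguous.
ua_only = unique_fraction(LOG, lambda ip, ua: ua)

# User-agent plus IP address: every key now maps to a single host.
ua_plus_ip = unique_fraction(LOG, lambda ip, ua: (ua, ip))

print(ua_only, ua_plus_ip)
```

In this toy log the shared user-agent string is what drags down the user-agent-only score, and adding the IP address separates the two hosts that share it.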

The highest accuracy, 92.8%, came when more than one user ID was linked to a single host, as would be the case in a family that shares a single computer. In such cases, the multiple IDs taken together would accurately represent that one host computer.

The analysis of this seemingly benign information was based on a month - August 2010 - of anonymized Hotmail and Bing data on hundreds of millions of users. The researchers say they tried to find out whether a single piece of log data can uniquely reveal a particular host.

They found that even anonymized data can leak information. For example, replacing an IP address with its IP prefix still leaves enough information to be revealing when combined with other commonly logged factors. "Coarse grained IP prefixes achieve similar host-tracking accuracy to that of precise IP address information when they are combined with hashed [user-agent] strings," the researchers say.
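
A minimal sketch of this kind of anonymization, assuming a /24 prefix and SHA-256 hashing (the article specifies neither), shows why the combined key can remain distinctive. The `anonymize` function, the example addresses, and the user-agent strings are hypothetical.

```python
import hashlib
import ipaddress

def anonymize(ip, user_agent, prefix_len=24):
    """Coarsen the IP to a network prefix and hash the user-agent string.

    prefix_len=24 and SHA-256 are assumptions made for this sketch.
    """
    network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
    ua_hash = hashlib.sha256(user_agent.encode("utf-8")).hexdigest()
    return str(network), ua_hash

UA_1 = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"
UA_2 = "Mozilla/5.0 (Windows NT 6.1) ExampleBrowser/2.0"

# Same machine seen at two addresses inside one /24: the key is unchanged.
a = anonymize("192.0.2.17", UA_1)
b = anonymize("192.0.2.99", UA_1)

# A neighbor in the same /24 with a different browser: the key differs.
c = anonymize("192.0.2.17", UA_2)

print(a == b, a == c)
```

Hashing hides the text of the user-agent string but not its distinctiveness: equal strings still produce equal hashes, so the pair of prefix and hash can keep following a host even as its exact address changes within the prefix.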


The researchers did offer some tips for maintaining anonymity:

  • Use a browser whose default user agent string is popular, making that string less useful for identifying your machine in particular.
  • Even when using anonymous routing like Tor, use tools such as Torbutton to manage identity information.
  • Consider using proxies.
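
The first tip can be read in terms of how many hosts share a given user-agent string: the more hosts that present the same string, the less it singles out any one of them. The sketch below counts that on made-up data. The `OBSERVED` sample and the `sharing_hosts` helper are hypothetical.

```python
from collections import Counter

# Hypothetical sample: one user-agent string per observed host.
OBSERVED = ["UA-popular"] * 6 + ["UA-rare-1", "UA-rare-2"]

def sharing_hosts(user_agents):
    """Count how many observed hosts present each user-agent string."""
    return Counter(user_agents)

sizes = sharing_hosts(OBSERVED)

for ua, count in sizes.most_common():
    # A count of 1 means the string alone singles out one host in this sample.
    print(ua, count)
```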
