Top Data Scientist Daniel Tunkelang on Data Recycling

Respected Data Scientist Daniel Tunkelang shares some insight into data recycling, using data from other contexts to bootstrap your initial statistical models until you can collect live data.

By Daniel Tunkelang, Data Science Consultant.


One technique I’ve seen succeed again and again is data recycling. You may be doing it already. If not, you should add it to your toolbox.

As Monica Rogati explained in a Strata presentation, data recycling is using data from other contexts to bootstrap your initial statistical models until you can collect live data. She went on to provide illustrative examples from her work at LinkedIn, such as bootstrapping LinkedIn’s recommender system for profile similarity using the contents of saved folders.

In particular, there are often opportunities to recycle data about users in order to make inferences about content. Here are a couple of examples:

  • Topic mining through user interests. LinkedIn did this with LinkedIn Today, the precursor of Pulse. Rather than categorizing news articles through a document understanding pipeline, LinkedIn focused on who was reading and sharing those articles. For example, an article that’s disproportionately shared by real estate agents is likely to be about real estate. The approach wasn’t perfect, and LinkedIn eventually made significant investments in document understanding. But it was impressive how much LinkedIn could do using its robust knowledge of users, rather than relying on the noisier process of document understanding.
  • Locating content based on who accesses it. A fair amount of online content is local, e.g., information about local businesses. It’s important for a search engine to identify such content and locate it, since the relevance of local content is highly correlated with the searcher’s location. Extracting location information from documents is difficult, and it’s easy to get false positives on local content, e.g., a piece of national or international news may still be set in a particular city. It’s easier to locate searchers using their IP addresses or other signals. Content with local interest typically yields a tight geographical clustering of searchers, and their centroid is a good proxy for the content’s location. And yes, I might have learned this while working on local search at Google.
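
Both examples can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not LinkedIn's or Google's actual implementation: the function names, the `min_lift` threshold, and the `baseline_share` table (the fraction of all users in each profession) are hypothetical. The first function picks the profession that is most over-represented among an article's sharers; the second averages searcher coordinates on the unit sphere and reports how tightly they cluster.

```python
from collections import Counter
from math import atan2, cos, degrees, hypot, radians, sin, sqrt


def topic_by_audience(sharer_professions, baseline_share, min_lift=2.0):
    """Guess an article's topic from who shares it (hypothetical helper).

    sharer_professions: one profession label per user who shared the article.
    baseline_share: assumed-known fraction of all users in each profession.
    Returns the most over-represented profession, or None if no profession
    reaches min_lift times its baseline share.
    """
    counts = Counter(sharer_professions)
    total = sum(counts.values())
    best, best_lift = None, 0.0
    for profession, n in counts.items():
        base = baseline_share.get(profession)
        if not base:
            continue  # no baseline for this profession, so lift is undefined
        lift = (n / total) / base
        if lift > best_lift:
            best, best_lift = profession, lift
    return best if best_lift >= min_lift else None


def geographic_centroid(points):
    """Centroid of (lat, lon) points given in degrees (hypothetical helper).

    Averages the points as 3D unit vectors, which avoids wrap-around
    problems near longitude +/-180. Returns (lat, lon, concentration),
    where concentration is the length of the mean vector: close to 1.0
    for a tight cluster, smaller as the points spread out.
    """
    if not points:
        raise ValueError("need at least one point")
    x = y = z = 0.0
    for lat, lon in points:
        la, lo = radians(lat), radians(lon)
        x += cos(la) * cos(lo)
        y += cos(la) * sin(lo)
        z += sin(la)
    n = len(points)
    x, y, z = x / n, y / n, z / n
    concentration = sqrt(x * x + y * y + z * z)
    return degrees(atan2(z, hypot(x, y))), degrees(atan2(y, x)), concentration
```

In practice you would only trust the centroid as the content's location when the concentration is high and enough searchers have been observed; where to set those cutoffs depends on your data.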

There’s a more general data science principle at play here: don’t try to solve a hard problem when you can get there by solving an easier one. If you have good data, recycle it!

Bio: Daniel Tunkelang is a data science and engineering executive who has built and led some of the strongest teams in the software industry.

Original. Reposted with permission.