Google Wikilinks Big Data: 40 Million Entities in Context

Google Research has released the Wikilinks Corpus: 40 million disambiguated mentions across more than 10 million web pages, over 100 times bigger than any previous corpus of its kind. The data can be used to study many interesting text problems.



When someone mentions Mercury, are they talking about the planet, the god, the car, the element, Freddie, or one of some 89 other possibilities? To help computers with such disambiguation, Google Research has released the Wikilinks Corpus, which contains 40 million disambiguated mentions across more than 10 million web pages.

Some ideas to investigate with this corpus:

  • Coreference: deciding when different mentions refer to the same entity
  • Entity resolution: matching a mention to the underlying entity
  • The bigger problem of cross-document coreference: deciding whether different web pages are talking about the same person or other entity
  • Learn things about entities by aggregating information across all the documents they're mentioned in
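
To make the last two ideas concrete, here is a minimal Python sketch of aggregating mentions by entity. It does not read the real corpus: the records in MENTIONS, the example.com URLs, and the function names are hypothetical, and the (page, mention text, entity) layout is an assumption made for illustration, not the corpus's actual file format.

```python
from collections import defaultdict

# Hypothetical sample records: (page_url, mention_text, entity).
# These are illustrative only and are NOT taken from the Wikilinks Corpus.
MENTIONS = [
    ("http://example.com/a", "Mercury", "Mercury_(planet)"),
    ("http://example.com/b", "the planet Mercury", "Mercury_(planet)"),
    ("http://example.com/b", "Freddie", "Freddie_Mercury"),
    ("http://example.com/c", "Mercury", "Freddie_Mercury"),
    ("http://example.com/c", "Mercury", "Mercury_(element)"),
]


def group_by_entity(mentions):
    """Map each entity to the pages that mention it and the wordings used."""
    pages = defaultdict(set)
    forms = defaultdict(set)
    for page_url, text, entity in mentions:
        pages[entity].add(page_url)
        forms[entity].add(text)
    return pages, forms


def senses_by_mention(mentions):
    """Map each mention text to the set of entities it has been linked to."""
    senses = defaultdict(set)
    for _page_url, text, entity in mentions:
        senses[text].add(entity)
    return senses


pages, forms = group_by_entity(MENTIONS)
senses = senses_by_mention(MENTIONS)

# Pages that talk about the same entity (cross-document coreference).
print(sorted(pages["Mercury_(planet)"]))
# All entities the ambiguous string "Mercury" can refer to in this sample.
print(sorted(senses["Mercury"]))
```

Grouping by entity gives the set of pages that discuss the same thing, while grouping by mention text shows how ambiguous a given string is. With real data, the same two dictionaries would be built from parsed corpus records instead of a hard-coded list.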

Here is Google's Wikilinks Corpus.

Tools and data with extra context are at UMass Wiki-links.

Read more to learn about this dataset and how to use it.