WDC Huge Web Graph – 128 billion hyperlinks – publicly available

A huge web graph, with 3.5 billion pages and 128 billion hyperlinks, is now publicly available for web and network research. It is probably the largest publicly available graph.

Web Data Commons - Hyperlink Graph

A huge web graph has been made publicly available by researchers from the University of Mannheim. The graph was extracted from the Common Crawl 2012 web corpus and covers 3.5 billion web pages and the 128 billion hyperlinks between them. It is probably the largest hyperlink graph that is currently publicly available.

The graph should be useful to researchers who analyze the web and work on:

  • search algorithms that rank results based on the hyperlinks between pages.
  • spam detection methods that identify networks of web pages published in order to trick search engines.
  • graph analysis algorithms, where the hyperlink graph can be used to test the scalability and performance of tools.
  • Web Science, studying linking patterns within specific topical domains in order to identify the social mechanisms that govern these domains.

The hyperlink graph is provided on 4 different levels of aggregation:

  • Page-Level Graph - the full detail: each node represents a single web page and each arc a hyperlink between two pages.
  • Subdomain-Level Graph - aggregates the page graph by subdomain. Each node represents a specific subdomain (like research.dws.uni-mannheim.de), and an arc exists if at least one hyperlink was found between pages that belong to a pair of subdomains.
  • First-Level-Subdomain Graph - each node represents a first-level subdomain (like dws.uni-mannheim.de), with all of its lower-level subdomains aggregated into it.
  • Pay-Level-Domain Graph - each node represents a pay-level domain (like uni-mannheim.de). An arc exists if at least one hyperlink was found between pages contained in a pair of pay-level domains.
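To make the aggregation levels concrete, here is a minimal Python sketch that collapses a few page-level arcs into coarser graphs. It is an illustration only, not the extraction code used by the Mannheim researchers. The helper names (host_of, last_labels, aggregate) and the sample URLs are hypothetical. Taking the last two or three host labels is a naive stand-in for real pay-level-domain detection, which needs a public suffix list (the shortcut fails for suffixes such as co.uk). Dropping arcs that start and end in the same aggregated node is also an assumption of this sketch.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercase host name of a URL (the subdomain-level node)."""
    return (urlsplit(url).hostname or "").lower()


def last_labels(host, n):
    """Keep only the last n dot-separated labels of a host name.

    Naive assumption: a real pay-level-domain split needs a public suffix list.
    """
    return ".".join(host.split(".")[-n:])


def aggregate(page_arcs, key):
    """Collapse page-level arcs into a set of arcs between aggregated nodes.

    An aggregated arc exists if at least one page-level hyperlink connects
    the two groups. Self-arcs are dropped (an assumption of this sketch).
    """
    arcs = set()
    for src, dst in page_arcs:
        a, b = key(src), key(dst)
        if a != b:
            arcs.add((a, b))
    return arcs


# Hypothetical page-level arcs (source page, target page).
page_arcs = [
    ("http://research.dws.uni-mannheim.de/a", "http://www.uni-mannheim.de/b"),
    ("http://research.dws.uni-mannheim.de/c", "http://www.uni-mannheim.de/d"),
    ("http://research.dws.uni-mannheim.de/a", "http://example.org/"),
]

subdomain_arcs = aggregate(page_arcs, host_of)
first_level_arcs = aggregate(page_arcs, lambda u: last_labels(host_of(u), 3))
pld_arcs = aggregate(page_arcs, lambda u: last_labels(host_of(u), 2))

print(len(page_arcs), len(subdomain_arcs), len(first_level_arcs), len(pld_arcs))
```

Each coarser level can only merge nodes, so the number of arcs never grows as the aggregation gets coarser, which matches the shrinking sizes of the published graphs.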

The table below gives an overview of the size of the different graphs:

Graph                       #Nodes          #Arcs
Page Graph                  3,563 million   128,736 million
Subdomain Graph             101 million     2,043 million
1st Level Subdomain Graph   95 million      1,937 million
PLD Graph                   43 million      623 million
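As a quick reading aid for these sizes, the following Python sketch computes the average number of arcs per node for each graph. The figures are copied from the table (in millions). The variable names are illustrative, and the ratio is a rough density indicator rather than a statistic reported by the authors.

```python
# (nodes, arcs) in millions, as listed in the size table.
sizes = {
    "Page Graph": (3_563, 128_736),
    "Subdomain Graph": (101, 2_043),
    "1st Level Subdomain Graph": (95, 1_937),
    "PLD Graph": (43, 623),
}

# Average arcs per node; the "million" units cancel out in the ratio.
avg_arcs = {name: arcs / nodes for name, (nodes, arcs) in sizes.items()}

for name, value in avg_arcs.items():
    print(f"{name}: {value:.1f} arcs per node")
```

Under these numbers, the page graph averages roughly 36 arcs per node, while the aggregated graphs are sparser because many page-level links collapse into a single arc.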

For more information, and to download the data, visit