Subject: Jon Kleinberg on Links, Authorities, and Hubs

GPS, Q2: Around 1996 you developed the HITS algorithm (Hypertext Induced Topic Selection), a link analysis algorithm that rates Web pages by their authority and hub values. A page's authority value estimates the value of its content; its hub value estimates the value of its links to other pages. Using links rather than keywords alone was a major advance for web search engines. How did you come to the idea of using links, and of authorities and hubs?

Kleinberg: Improving the quality of Web search was a research problem on the minds of many people in 1996, and Prabhakar Raghavan, who was in the process of recruiting me to spend a year post-PhD at IBM Almaden, had very interesting ideas on this issue, and he encouraged me to join his group in thinking about it. From my perspective, there was considerable motivation to explore Web search as something more than a pure text retrieval problem: I was coming from a theoretical computer science background, focusing on graph algorithms, and it was hard to avoid noticing that in the Web we had a system where enormous numbers of people were, through their browsing, busily engaged in the traversal of an enormous graph, harvesting information as they went.

A few researchers, including Ellen Spertus as well as Peter Pirolli, Jim Pitkow, and Ramana Rao, had begun considering the link structure of the Web in their work on search. Understanding the network structure of the Web seemed both scientifically compelling and a potential route toward improving Web search.

The idea that there were two kinds of interesting pages -- essentially, resource lists and the authorities themselves -- was an intuitively fundamental part of one's Web browsing experience in the mid-1990s. It was a common experience that you would try to learn about a topic, gradually find people who had assembled hub-like resource pages with extensive information and links, and eventually discover a kind of consensus in where these links pointed -- in other words, you would find the authorities that these hubs consistently endorsed, and from this you'd find both the good authorities and also the good hubs that had led you there. This intuition from everyday experience suggested that there was a more general rule here that could be formalized.

What's nice is that a direct way of formalizing the idea works well: you simply start from the premise that you want to assign scores to pages as authorities and hubs, in such a way that the good authorities tend to receive links from good hubs, and the good hubs tend to point to good authorities. The equilibrium inherent in this notion -- that the quality of a page depends on the quality of other pages -- is related to ideas in the area of spectral graph theory, and from this body of work one knows that such equilibria can be made precise using eigenvectors. Brin and Page's definition of PageRank has an analogous kind of eigenvector-based equilibrium in it, in which you roll the notions of quality into a single "goodness" measure: roughly, good pages are those that receive links from other good pages.
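To make the mutually reinforcing definition concrete, here is a minimal sketch of the power iteration it leads to, in Python with numpy. The toy graph, score names, and iteration count are illustrative assumptions, not details from the interview:

    import numpy as np

    # Toy directed link graph (hypothetical): A[i, j] = 1 means page i
    # links to page j.  Pages 0-2 mostly act as hubs, pages 3-4 as
    # the authorities they point to.
    A = np.array([[0, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1],
                  [0, 0, 0, 1, 0],
                  [0, 0, 0, 0, 0],
                  [1, 0, 0, 0, 0]], dtype=float)

    hub = np.ones(A.shape[0])    # initial hub scores
    auth = np.ones(A.shape[0])   # initial authority scores

    for _ in range(50):          # power iteration
        auth = A.T @ hub         # good authorities receive links from good hubs
        hub = A @ auth           # good hubs point to good authorities
        auth /= np.linalg.norm(auth)   # normalize so the scores converge
        hub /= np.linalg.norm(hub)

    # At the fixed point, auth and hub are the principal eigenvectors
    # of A^T A and A A^T respectively -- the eigenvector equilibrium
    # described above.
    print("authority:", np.round(auth, 3))
    print("hub:      ", np.round(hub, 3))

On this toy graph the iteration concentrates authority scores on pages 3 and 4 and hub scores on pages 0-2, matching the intuition that the two kinds of quality reinforce each other. PageRank's analogous equilibrium instead iterates a single score vector against one matrix, rolling both notions of quality into one measure.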

From the perspective of designing search tools, there is clearly a huge amount of information contained in both the text content and the links -- as there also is in a searcher's pattern of clicks, and a number of other sources of data. Over time, commercial search engines have built on the different threads of research in this area to take advantage of numerous features of all these types.
