KDnuggets : News : 2008 : n24 : item17

Briefs

CiteSeerX and SeerSuite - Adding to the Semantic Web

by Avi Rappoport, Posted On December 11, 2008

CiteSeer (http://citeseerx.ist.psu.edu) could be called a vertical research portal, a niche search engine, or a specialized digital library. It uses a specialized crawler (robot) to find scholarly papers, extracts the text from PDF and PostScript files, and builds a searchable full-text index. CiteSeer enriches access to these materials by extracting metadata such as author names and publication information. Its pioneering Autonomous Citation Indexing tool follows citations and acknowledgments from one paper to another, supporting science mapping and data mining as it goes. Digital libraries of scholarly works need this kind of structure and context, and the newly announced SeerSuite open source code base offers excellent tools for providing it, from automated citation indexing to web crawling to Boolean queries.
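To make the idea of a searchable full-text index with Boolean queries concrete, here is a minimal sketch in Python. It is not CiteSeer or SeerSuite code: the function names (`tokenize`, `build_index`, `search_all`, `search_any`) and the three sample titles are hypothetical, and a production system would add ranking, stemming, and persistent storage.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search_all(index, query):
    """Boolean AND: ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


def search_any(index, query):
    """Boolean OR: ids of documents that contain at least one query term."""
    result = set()
    for term in tokenize(query):
        result |= index.get(term, set())
    return result


# Hypothetical sample data standing in for text extracted from papers.
docs = {
    "paper-1": "Autonomous citation indexing for scientific literature",
    "paper-2": "Focused crawling for digital libraries",
    "paper-3": "Citation analysis in digital libraries",
}
index = build_index(docs)

print(sorted(search_all(index, "digital libraries")))
print(sorted(search_any(index, "crawling citation")))
```

The index stores sets of document ids per term, so AND and OR reduce to set intersection and union.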

CiteSeer, in one version or another, has been running for more than 10 years, indexing hundreds of thousands of documents and millions of citations from computer and information science papers posted to the web. It was first created as Research Index in 1998 by Steve Lawrence (now at Google) and Professor C. Lee Giles of The Pennsylvania State University. Over the years, it has provided a rigorous testbed for both software development and content research. Its emphasis on automating both the web crawling that finds academic documents and the extraction of citations has proven very successful.

...

Building on that experience, CiteSeerX is a completely new system, re-architected for scaling and modularity, to handle increasing demands from both researchers and digital library programmatic interfaces. The system uses artificial intelligence, machine learning, support vector machines, and other techniques to recognize and extract metadata for the articles found. It now uses the Lucene search engine and supports standards such as the Open Archives Initiative (OAI), including metadata browsing, and Z39.50. CiteSeerX has a simple but powerful internal structure for documents and citations. If it cannot access a cited document, it creates a virtual document as a placeholder, which can then be filled in when the document becomes available.
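The virtual document idea can be illustrated with a small Python sketch. This is an assumption about how such a structure might look, not the actual CiteSeerX data model: the `Document` and `Library` classes, their fields, and the keys `paper-a` and `paper-b` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Document:
    key: str
    title: str
    text: Optional[str] = None  # None marks a virtual (placeholder) document
    cites: set = field(default_factory=set)     # keys of documents this one cites
    cited_by: set = field(default_factory=set)  # keys of documents citing this one

    @property
    def is_virtual(self):
        return self.text is None


class Library:
    def __init__(self):
        self.docs = {}

    def _get_or_create(self, key, title):
        # Create a placeholder when the document has not been seen before.
        if key not in self.docs:
            self.docs[key] = Document(key=key, title=title)
        return self.docs[key]

    def add_document(self, key, title, text, references):
        """Store a real document; `references` is a list of (key, title) pairs."""
        doc = self._get_or_create(key, title)
        doc.title = title
        doc.text = text  # fills the placeholder if one already existed
        for ref_key, ref_title in references:
            target = self._get_or_create(ref_key, ref_title)
            doc.cites.add(ref_key)
            target.cited_by.add(key)
        return doc


lib = Library()
# paper-a cites paper-b, which has not been crawled yet.
lib.add_document("paper-a", "Paper A", "full text of A", [("paper-b", "Paper B")])
print(lib.docs["paper-b"].is_virtual)  # placeholder exists, but has no text

# Later the crawler finds paper-b; the placeholder is filled in place.
lib.add_document("paper-b", "Paper B", "full text of B", [])
print(lib.docs["paper-b"].is_virtual, lib.docs["paper-b"].cited_by)
```

Because the placeholder and the eventual real document share one record, citation links gathered before the document was found are kept when its text arrives.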

SeerSuite beta 0.1 (http://sourceforge.net/projects/citeseerx) is the open source Java code base of CiteSeerX, distributed under the Apache license. It includes the citation indexing and search features, as well as a scalable modular framework that can handle thousands of simultaneous queries, distribute indexes, and balance their demands across many servers. Documentation, currently sparse, will be added within the next six months. Although SeerSuite is at an early stage, Giles says that it will bring this form of digital library structure to many different researchers and fields, requiring IT support mainly to install and configure the system.





Copyright © 2008 KDnuggets.