WebBase: A repository of web pages
- In Proceedings of the Ninth International World Wide Web Conference, 1999
Cited by 85 (7 self)
In this paper, we study the problem of constructing and maintaining a large shared repository of web pages. We discuss the unique characteristics of such a repository, propose an architecture, and identify its functional modules. We focus on the storage manager module, and illustrate how traditional techniques for storage and indexing can be tailored to meet the requirements of a web repository. To evaluate design alternatives, we also present experimental results from a prototype repository called WebBase, which is currently being developed at Stanford University.
Trovatore: Towards a Highly Scalable Distributed Web Crawler
- Winner of the Best Poster Award, 2001
Cited by 6 (1 self)
Trovatore is an ongoing project aimed at realizing an efficient distributed and highly scalable web crawler. This poster illustrates the main ideas behind its design.
A Novel Architecture of a Parallel Web Crawler
Due to the explosion in the size of the WWW [1,4,5], it becomes essential to make the crawling process parallel. In this paper we present an architecture for a parallel crawler that consists of multiple crawling processes, called C-procs, which can run on a network of workstations. The proposed crawler is scalable and resilient against system crashes and other events. The aim of this architecture is to efficiently and effectively crawl the current set of publicly indexable web pages so that we can maximize the download rate while minimizing the overhead from parallelization.

