The WebGraph Framework I: Compression Techniques
- In Proc. of the Thirteenth International World Wide Web Conference, 2003
Cited by 114 (23 self)
Studying web graphs is often difficult due to their large size. Recently, several proposals have been published about various techniques that make it possible to store a web graph in memory in limited space by exploiting the inner redundancies of the web. The WebGraph framework is a suite of codes, algorithms and tools that aims at making it easy to manipulate large web graphs. This paper presents the compression techniques used in WebGraph, which are centred around referentiation and intervalisation (which in turn are dual to each other).
The WebGraph Framework II: Codes for the World-Wide Web
- In DCC, 2003
Cited by 20 (2 self)
A fundamental observation about compression of the web graph was made in the construction of the LINK database [11]: if we order URLs lexicographically, ordered successor lists tend to have small gaps, which can be coded using standard methods from full-text index construction. In this paper, we propose ζ codes, a family of simple flat codes that are targeted at those gaps, and we give the first thorough mathematical comparative analysis of several codes against power-law distributions with small exponent, which are common for such gaps in web graphs.
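The gap idea in this abstract can be illustrated with a minimal sketch. It assumes node identifiers are positive integers assigned in lexicographic URL order and that each successor list is strictly increasing. The sketch uses Elias gamma coding, a standard instantaneous code, as a stand-in; it is not the ζ code proposed in the paper, and the function names are hypothetical.

```python
def gaps(successors):
    """Return the first successor followed by the differences between neighbours."""
    out = []
    prev = None
    for s in successors:
        out.append(s if prev is None else s - prev)
        prev = s
    return out


def elias_gamma(n):
    """Elias gamma code of a positive integer, returned as a string of bits."""
    if n < 1:
        raise ValueError("Elias gamma codes only positive integers")
    binary = bin(n)[2:]
    # (len - 1) zeros announce the length, then the binary digits follow.
    return "0" * (len(binary) - 1) + binary


def encode(successors):
    """Gap-encode a strictly increasing list of positive node identifiers."""
    return "".join(elias_gamma(g) for g in gaps(successors))


def decode(bits):
    """Invert encode(): read the gamma codes, then take prefix sums of the gaps."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    successors = []
    total = 0
    for v in values:
        total += v
        successors.append(total)
    return successors
```

For the list [13, 15, 16, 17, 20] the gaps are [13, 2, 1, 1, 3], so only the first value needs a long codeword; the small gaps each take one to three bits under gamma coding.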
Apoidea: A decentralized peer-to-peer architecture for crawling the world wide web
- In Proceedings of the SIGIR 2003 Workshop on Distributed Information Retrieval, 2003
Cited by 12 (3 self)
This paper describes a decentralized peer-to-peer model for building a Web crawler. Most current systems use a centralized client-server model, in which the crawl is done by one or more tightly coupled machines, but the distribution of crawling jobs and the collection of crawled results are managed by a centralized system using a centralized URL repository. Centralized solutions are known to suffer from link congestion, a single point of failure, and expensive administration. They require both horizontal and vertical scalability solutions to manage Network File Systems (NFS) and to load-balance DNS and HTTP requests. In this paper, we present the architecture of a completely distributed and decentralized Peer-to-Peer (P2P) crawler called Apoidea, which is self-managing and uses the geographical proximity of web resources to the peers for a better and faster crawl. We use Distributed Hash Table (DHT) based protocols to perform the critical URL-duplicate and content-duplicate tests.
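As a rough illustration of how a DHT can support a URL-duplicate test, the sketch below hashes each URL onto a ring and lets the peer that owns that position remember which URLs it has seen. The abstract does not specify Apoidea's actual protocol, so everything here is an assumption: the `ToyRing` class, the peer names, and the choice of SHA-1 are hypothetical, and all peers live in one process rather than on a network.

```python
import bisect
import hashlib


def ring_position(key):
    """Map a string to a point on the hash ring (SHA-1 is an assumed choice)."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")


class ToyRing:
    """In-process stand-in for a DHT: each peer owns an arc of the ring."""

    def __init__(self, peer_names):
        # Sort peers by their ring position so lookups can use binary search.
        self._points = sorted((ring_position(name), name) for name in peer_names)
        self._keys = [position for position, _ in self._points]
        self._seen = {name: set() for name in peer_names}

    def owner(self, url):
        """Return the first peer at or after the URL's position, wrapping around."""
        i = bisect.bisect_left(self._keys, ring_position(url)) % len(self._keys)
        return self._points[i][1]

    def is_new(self, url):
        """URL-duplicate test: only the owning peer's set is consulted."""
        seen = self._seen[self.owner(url)]
        if url in seen:
            return False
        seen.add(url)
        return True
```

Because every peer computes the same owner for a given URL, no central URL repository is needed. A content-duplicate test could plausibly reuse the same lookup with a hash of the page body as the key.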

