Results 1 - 10
of
25
The WebGraph Framework I: Compression Techniques
- In Proc. of the Thirteenth International World Wide Web Conference
, 2003
"... Studying web graphs is often dicult due to their large size. Recently, several proposals have been published about various techniques that allow to store a web graph in memory in a limited space, exploiting the inner redundancies of the web. The WebGraph framework is a suite of codes, algorithms ..."
Abstract
-
Cited by 114 (23 self)
- Add to MetaCart
Studying web graphs is often dicult due to their large size. Recently, several proposals have been published about various techniques that allow to store a web graph in memory in a limited space, exploiting the inner redundancies of the web. The WebGraph framework is a suite of codes, algorithms and tools that aims at making it easy to manipulate large web graphs. This papers presents the compression techniques used in WebGraph, which are centred around referentiation and intervalisation (which in turn are dual to each other).
A survey on pagerank computing
- Internet Mathematics
, 2005
"... Abstract. This survey reviews the research related to PageRank computing. Components of a PageRank vector serve as authority weights for web pages independent of their textual content, solely based on the hyperlink structure of the web. PageRank is typically used as a web search ranking component. T ..."
Abstract
-
Cited by 42 (0 self)
- Add to MetaCart
Abstract. This survey reviews the research related to PageRank computing. Components of a PageRank vector serve as authority weights for web pages independent of their textual content, solely based on the hyperlink structure of the web. PageRank is typically used as a web search ranking component. This defines the importance of the model and the data structures that underly PageRank processing. Computing even a single PageRank is a difficult computational task. Computing many PageRanks is a much more complex challenge. Recently, significant effort has been invested in building sets of personalized PageRank vectors. PageRank is also used in many diverse applications other than ranking. We are interested in the theoretical foundations of the PageRank formulation, in the acceleration of PageRank computing, in the effects of particular aspects of web graph structure on the optimal organization of computations, and in PageRank stability. We also review alternative models that lead to authority indices similar to PageRank and the role of such indices in applications other than web search. We also discuss linkbased search personalization and outline some aspects of PageRank infrastructure from associated measures of convergence to link preprocessing. 1.
I/O-Efficient Techniques for Computing Pagerank
"... Over the last few years, most major search engines have integrated link-based ranking techniques in order to provide more accurate search results. One widely known approach is the Pagerank technique, which forms the basis of the Google ranking scheme, and which assigns a global importance measure to ..."
Abstract
-
Cited by 23 (3 self)
- Add to MetaCart
Over the last few years, most major search engines have integrated link-based ranking techniques in order to provide more accurate search results. One widely known approach is the Pagerank technique, which forms the basis of the Google ranking scheme, and which assigns a global importance measure to each page based on the importance of other pages pointing to it. The main advantage of the Pagerank measure is that it is independent of the query posed by a user
The WebGraph Framework II: Codes for the World-Wide Web
- In DCC
, 2003
"... A fundamental observation about compression of the web graph was made in the construction of the LINK database [11]: if we order URLs lexicographically, ordered successor lists tend to have small gaps, which can be coded using standard methods from full-text index construction. In this paper, we ..."
Abstract
-
Cited by 20 (2 self)
- Add to MetaCart
A fundamental observation about compression of the web graph was made in the construction of the LINK database [11]: if we order URLs lexicographically, ordered successor lists tend to have small gaps, which can be coded using standard methods from full-text index construction. In this paper, we propose codes, a family of simple at codes that are targeted at those gaps, and we give the rst thorough mathematical comparative analysis of several codes against power-law distributions with small exponent, which are common for such gaps in web graphs.
On Compressing Social Networks
"... Motivated by structural properties of the Web graph that support efficient data structures for in memory adjacency queries, we study the extent to which a large network can be compressed. Boldi and Vigna (WWW 2004), showed that Web graphs can be compressed down to three bits of storage per edge; we ..."
Abstract
-
Cited by 20 (1 self)
- Add to MetaCart
Motivated by structural properties of the Web graph that support efficient data structures for in memory adjacency queries, we study the extent to which a large network can be compressed. Boldi and Vigna (WWW 2004), showed that Web graphs can be compressed down to three bits of storage per edge; we study the compressibility of social networks where again adjacency queries are a fundamental primitive. To this end, we propose simple combinatorial formulations that encapsulate efficient compressibility of graphs. We show that some of the problems are NP-hard yet admit effective heuristics, some of which can exploit properties of social networks such as link reciprocity. Our extensive experiments show that social networks and the Web graph exhibit vastly different compressibility characteristics.
Towards Exploiting Link Evolution
, 2001
"... Exploiting hyperlink information has revolutionized search algorithms for the Web [8, 2]. As the Web grows, its link structure, along with content, evolves at a rapid rate. Consequently, large-scale static hyperlink-based ranking computations become too expensive to be performed frequently. ..."
Abstract
-
Cited by 18 (0 self)
- Add to MetaCart
Exploiting hyperlink information has revolutionized search algorithms for the Web [8, 2]. As the Web grows, its link structure, along with content, evolves at a rapid rate. Consequently, large-scale static hyperlink-based ranking computations become too expensive to be performed frequently.
Sorting out the document identifier assignment problem
- In Proc. of 29th European Conference on IR Research (ECIR
, 2007
"... Abstract. The compression of Inverted File indexes in Web Search Engines has received a lot of attention in these last years. Compressing the index not only reduces space occupancy but also improves the overall retrieval performance since it allows a better exploitation of the memory hierarchy. In t ..."
Abstract
-
Cited by 18 (1 self)
- Add to MetaCart
Abstract. The compression of Inverted File indexes in Web Search Engines has received a lot of attention in these last years. Compressing the index not only reduces space occupancy but also improves the overall retrieval performance since it allows a better exploitation of the memory hierarchy. In this paper we are going to empirically show that in the case of collections of Web Documents we can enhance the performance of compression algorithms by simply assigning identifiers to documents according to the lexicographical ordering of the URLs. We will validate this assumption by comparing several assignment techniques and several compression algorithms on a quite large document collection composed by about six million documents. The results are very encouraging since we can improve the compression ratio up to 40 % using an algorithm that takes about ninety seconds to finish using only 100 MB of main memory. 1
Efficient URL Caching for World Wide Web Crawling
- In Proceedings of the twelfth international conference on World Wide Web (WWW2003
, 2003
"... Crawling the web is deceptively simple: the basic algorithm is (a) Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)--(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this ..."
Abstract
-
Cited by 14 (1 self)
- Add to MetaCart
Crawling the web is deceptively simple: the basic algorithm is (a) Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)--(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.
A Fast and Compact Web Graph Representation
"... Compressed graphs representation has become an attractive research topic because of its applications in the manipulation of huge Web graphs in main memory. By far the best current result is the technique by Boldi and Vigna, which takes advantage of several particular properties of Web graphs. In t ..."
Abstract
-
Cited by 11 (10 self)
- Add to MetaCart
Compressed graphs representation has become an attractive research topic because of its applications in the manipulation of huge Web graphs in main memory. By far the best current result is the technique by Boldi and Vigna, which takes advantage of several particular properties of Web graphs. In this paper we show that the same properties can be exploited with a different and elegant technique, built on Re-Pair compression, which achieves about the same space but much faster navigation of the graph. Moreover, the technique has the potential of adapting well to secondary memory. In addition, we introduce an approximate Re-Pair version that works efficiently with limited main memory.
The Best Trail Algorithm for Assisted Navigation of Web Sites
- In Proc. LA-WEB Conference on Latin American Web Congress
, 2003
"... We present an algorithm called the Best Trail Algorithm, which helps solve the hypertext navigation problem by automating the construction of memex-like trails through the corpus. The algorithm performs a probabilistic best-first expansion of a set of navigation trees to find relevant and compact tr ..."
Abstract
-
Cited by 8 (1 self)
- Add to MetaCart
We present an algorithm called the Best Trail Algorithm, which helps solve the hypertext navigation problem by automating the construction of memex-like trails through the corpus. The algorithm performs a probabilistic best-first expansion of a set of navigation trees to find relevant and compact trails. We describe the implementation of the algorithm, scoring methods for trails, filtering algorithms and a new metric called potential gain which measures the potential of a page for future navigation opportunities.

