## A Fast and Compact Web Graph Representation

Citations: | 17 - 12 self |

### BibTeX

@MISC{Claude_afast,
    author = {Francisco Claude and Gonzalo Navarro},
    title = {A Fast and Compact Web Graph Representation},
    year = {}
}

### Abstract

Compressed graph representation has become an attractive research topic because of its applications in the manipulation of huge Web graphs in main memory. By far the best current result is the technique by Boldi and Vigna, which takes advantage of several particular properties of Web graphs. In this paper we show that the same properties can be exploited with a different and elegant technique, built on Re-Pair compression, which achieves about the same space but much faster navigation of the graph. Moreover, the technique has the potential to adapt well to secondary memory. In addition, we introduce an approximate Re-Pair version that works efficiently with limited main memory.
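To make the idea in the abstract more concrete, here is a minimal sketch of Re-Pair applied to adjacency lists. It is not the authors' implementation: it uses a naive quadratic loop rather than the linear-time algorithm or the approximate variant, and the function names (`flatten_adjacency`, `repair_compress`, `decompress`) as well as the choice of a unique negative marker per node are assumptions made for illustration.

```python
from collections import Counter


def flatten_adjacency(adj):
    """Concatenate adjacency lists into one integer sequence.

    Assumption of this sketch: node ids are >= 1 and the list of node v is
    preceded by the marker -v. Each marker occurs once, so no repeated pair
    can span two different lists.
    """
    seq = []
    for node in sorted(adj):
        seq.append(-node)
        seq.extend(adj[node])
    return seq


def repair_compress(seq, first_new_symbol):
    """Naive Re-Pair: repeatedly replace the most frequent adjacent pair.

    Returns the compressed sequence and a dictionary of rules s -> (a, b).
    first_new_symbol must be larger than every symbol already in seq.
    """
    seq = list(seq)
    rules = {}
    next_symbol = first_new_symbol
    while True:
        # Counts include overlapping occurrences; good enough for a sketch.
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:
            break  # no pair repeats, so no replacement is worthwhile
        rules[next_symbol] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_symbol += 1
    return seq, rules


def expand(symbol, rules):
    """Recursively expand one symbol back into original symbols."""
    if symbol not in rules:
        return [symbol]
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


def decompress(seq, rules):
    """Expand every symbol of a compressed sequence."""
    out = []
    for symbol in seq:
        out.extend(expand(symbol, rules))
    return out


# Hypothetical toy graph: three pages with overlapping out-links.
adj = {1: [2, 3, 4], 2: [2, 3, 4, 5], 3: [3, 4]}
original = flatten_adjacency(adj)
compressed, rules = repair_compress(original, first_new_symbol=6)
```

Because each rule expands independently, the neighbors of a single node can be recovered by expanding only the symbols between two consecutive markers, which is the property that makes this style of compression attractive for navigation.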

### Citations

9056 | Introduction to Algorithms - Cormen, Leiserson, et al. - 2002 |

2997 | Authoritative Sources in a Hyperlinked Environment
- Kleinberg
- 1999
Citation Context ...re essentially basic algorithms applied over the Web graph. One of the classical references on this topic [Kleinberg et al. 1999] shows how the HITS algorithm to find hubs and authorities on the Web [Kleinberg 1999] starts by selecting random pages and finding the induced subgraphs, which are the pages that point to or are pointed from the selected pages. Donato et al. [2006] show that several common Web mining... |

1220 | A Universal Algorithm for Sequential Data Compression
- Ziv, Lempel
- 1977
Citation Context ...s. When n is significant compared to c, space reduction is achieved at the expense of slower access to the adjacency lists. Lempel-Ziv Compression of Web Graphs. The Lempel-Ziv compression family [40, 41] achieves compression by replacing repeated sequences found in the text by a pointer to a previous occurrence thereof. In particular, the LZ78 variant [41] stands as a plausible alternative candidate ... |

779 | Compression of individual sequences via variable rate coding
- Ziv, Lempel
- 1978
Citation Context ...an be used instead of Re-Pair, as long as they are capable of efficiently extracting snippets from a sequence and of handling large alphabets. In particular, we modify the Ziv-Lempel variant called LZ78 [41] in order to achieve random access. LZ78 does not compress as much as our Re-Pair variants, yet it is slightly faster to extract snippets. Our experimental results over different Web crawls show that ... |

398 | Introduction to Algorithms, 2nd ed
- Cormen, Leiserson, et al.
- 2001
Citation Context ...Intel(R) Xeon(R) CPU running at 2 GHz, with 8 cores and 16 GB of RAM, running Ubuntu GNU/Linux (Server) with kernel 2.6.24-27 in 64-bit mode. For both traversals we implemented a similar queue/stack [Cormen et al. 2001] using arrays, to make sure that the STL and the Java API were not altering the performance results. Table IX shows the time for the two traversals, including the space required by each representatio... |

349 | A random graph model for massive graphs
- Aiello, Chung, et al.
- 2000
Citation Context ...that is, the probability that a page has i links is 1/i^θ for some parameter θ > 0. Several experiments give reasonably consistent values of θ = 2.1 for incoming links and θ = 2.72 for outgoing links [2, 7]. Locality of reference: Most of the links from a site point within the site. This motivates in [3] the use of lexicographical URL order to list the pages, so that outgoing links go to nodes whose pos... |

312 | The web as a graph: Measurements, models, and methods
- Kleinberg, Kumar, et al.
- 1999
Citation Context ... communities, etc. Many techniques of interest to obtain information from the Web structure are essentially basic algorithms applied over the Web graph. One of the classical references on this topic [Kleinberg et al. 1999] shows how the HITS algorithm to find hubs and authorities on the Web [Kleinberg 1999] starts by selecting random pages and finding the induced subgraphs, which are the pages that point to or are poi... |

277 | Edgebreaker: Connectivity compression for triangle meshes
- Rossignac
- 1999
Citation Context ...e represented as a balanced sequence of parentheses. Some classes of planar graphs have also received special attention, for example trees, triangulated meshes, triconnected planar graphs, and others [15, 17, 13, 28]. For dense graphs, it is shown that little can be done to improve the space required by the adjacency matrix [23]. The above techniques consider just the compression of the graph, not its access in c... |

245 | Graph structure in the web
- Broder, Kumar, et al.
Citation Context ...le to larger graphs, as much of their improvement relies on smart caching, and this effect should vanish with real Web graphs. There is also some work specifically aimed at compression of Web graphs [Broder et al. 2000; Adler and Mitzenmacher 2001; Suel and Yuan 2001; Boldi and Vigna 2004a; Boldi et al. 2009]. Several properties of Web graphs have been identified and exploited to achieve compression: Skewed distrib... |

200 | Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets
- Raman, Raman, et al.
Citation Context ...e are many constant-time solutions for the rank/select problem on bitmaps B[1,n]. One of them requires n + o(n) space (that is, o(n) bits on top of B itself) [12, 28]. An improvement to this solution [34] retains constant-time queries while using nH0(B)+o(n) bits of space to represent B and the extra data structures. H0(B) corresponds to the zero-order entropy of bitmap B: The zero-order entropy for a... |

199 | High-order entropy-compressed text indexes
- Grossi, Gupta, et al.
Citation Context ... to sequences as follows: $H_0(S) = \sum_{a \in \Sigma} \frac{n_a}{n} \log \frac{n}{n_a}$, where $n_a$ is the number of occurrences of symbol a in S. The solution by Ferragina et al. builds over an elegant structure called the wavelet tree [17, 32]. This is a perfect binary tree where the root stores a bitmap formed by the n highest bits of each symbol in the sequence. Those symbols with highest bit 0 are then sent to the left subtree, and thos... |

192 | Representing Web graphs
- Raghavan, Garcia-Molina
- 2003
Citation Context ...compression. In [29] they partition the adjacency lists considering popularity of the nodes, and use different coding methods for each partition. A more hierarchical view of the nodes is exploited in [26]. In [1, 27] they take explicit advantage of the similarity property. A page with similar outgoing links is identified with some heuristic, and then the current page is expressed as a reference to the... |

181 | Space-efficient static trees and graphs
- Jacobson
- 1989
Citation Context ...ce required by the adjacency matrix [23]. The above techniques consider just the compression of the graph, not its access in compressed form. The first compressed data structure for graphs we know of [16] requires O(gn) bits of space for a g-page graph. The neighbors of a node can be retrieved in O(log n) time each (plus an extra O(g) complexity for the whole query). The main idea is again to represen... |

180 | Compressed full-text indexes
- Navarro, Mäkinen
Citation Context ...ailable. (Footnote 3: www-diglib.stanford.edu/~testbed/doc2/WebBase/) A recent proposal [24] advocates regarding the adjacency list representation as a text sequence and using compressed text indexing techniques [25], so that neighbors can be obtained via text decompression and reverse neighbors via text searching. The concept and the results are interesting but not yet sufficiently competitive with those of [6].... |

174 | The WebGraph framework I: Compression techniques
- Boldi, Vigna
Citation Context ...ir compressed form, say for navigation purposes. As far as we know, the best results in practice to compress Web graphs such that they can be navigated in compressed form are those of Boldi and Vigna [6]. They exploit several well-known regularities of Web graphs, such as their skewed in- and out-degree distributions, repetitiveness in the sets of outgoing links, and locality in the references. For t... |

165 | The indexable web is more than 11.5 billion pages
- Gulli, Signorini
- 2005
Citation Context ...ly available, (2) the ever-growing speed gaps in the memory hierarchy. As an example of the former, the graph of the static indexable Web was estimated in 2005 to contain more than 11.5 billion nodes [18] and more than 150 billion links. A plain adjacency list representation of this graph would need around 600 GB. As an example of (2), access time to main memory is about one million times faster than ... |

144 | Succinct representation of balanced parentheses, static trees and planar graphs
- Munro, Raman
- 1997
Citation Context ...are supported using succinct data structures that permit navigating a sequence of balanced parentheses. The retrieval was later improved to constant time by using improved parentheses representations [22], and also the constant term of the space complexity was improved [9]. The representation also permits finding the degree (number of neighbors) of a node, as well as testing whether two nodes are conn... |

120 | The Connectivity Server: fast access to linkage information on the Web
- Bharat, Broder, et al.
- 1998
Citation Context ...s give reasonably consistent values of θ = 2.1 for incoming links and θ = 2.72 for outgoing links [2, 7]. Locality of reference: Most of the links from a site point within the site. This motivates in [3] the use of lexicographical URL order to list the pages, so that outgoing links go to nodes whose position is close to that of the current node. Gap encoding techniques are then used to encode the dif... |

114 | Compressed Representations of Sequences and Full-Text Indexes
- Ferragina
- 2006
Citation Context ... these three operations in a sequence S[1,n] using n log σ + o(n log σ) bits and O(log log σ) time. Note that n log σ is the space required by a plain representation of the sequence. Ferragina et al. [14] achieve zero-order compression, that is, nH0(S) + o(n log σ) bits of space, and O(1 + log σ / log log n) time per operation (this is a constant if σ = O(polylog(n))). The zero-order entropy formula ge... |

97 | Compact Pat Trees
- Clark
- 1996
Citation Context ...i-th occurrence of a 1 in the bitmap. There are many constant-time solutions for the rank/select problem on bitmaps B[1,n]. One of them requires n + o(n) space (that is, o(n) bits on top of B itself) [12, 28]. An improvement to this solution [34] retains constant-time queries while using nH0(B)+o(n) bits of space to represent B and the extra data structures. H0(B) corresponds to the zero-order entropy of ... |

83 | Towards compressing web graphs
- Adler, Mitzenmacher
- 2001
Citation Context ...ale to larger graphs, as much of their improvement relies on smart caching, and this effect should vanish with real Web graphs. There is also some work specifically aimed at compression of Web graphs [7, 1, 29, 6]. In this graph the (labeled) nodes are Web pages and the (directed) edges are the hyperlinks. Several properties of Web graphs have been identified and exploited to achieve compression: Skewed distri... |

80 | Succinct representations of graphs
- Turan
- 1984
Citation Context ...log n, respectively. We call the neighbors of a node v ∈ V those u ∈ V such that (v,u) ∈ E. The oldest work on graph compression focuses on undirected unlabeled graphs. The first result we know of [30] shows that planar graphs can be compressed into O(n) bits. The constant factor was later improved [17], and finally a technique yielding the optimal constant factor was devised [14]. Results on plana... |

78 | Offline dictionary-based compression
- Larsson, Moffat
- 1999
Citation Context ...lists. In this paper we present a new way to take advantage of the regularities that appear in Web graphs. Instead of different ad-hoc techniques, we use a uniform and elegant technique called Re-Pair [19] to compress the adjacency lists. (Footnote ⋆: Partially funded by a grant from Yahoo! Research Latin America.) As the original linear-time Re-Pair compression requires much main memory, we develop an approximate... |

72 | Fully automatic cross-associations
- Chakrabarti, Papadimitriou, et al.
- 2014
Citation Context ...l to more general graphs, in particular to Web graphs. A more powerful concept that applies to this type of graph is that of graph separators. Although the separator concept has been used a few times [10, 14, 8] (yet not supporting access to the compressed graph), the most striking results are achieved in recent work [5, 4]. Their idea is to find graph components that can be disconnected from the rest by remo... |

65 | Performance of inverted indices in shared-nothing distributed text document information retrieval systems
- Tomasic, Garcia-Molina
Citation Context ...emory, even if I/O-optimal. Yet their advantage is that they can manage huge graphs at low cost, since external memory is much cheaper than main memory. —Using distributed systems [Badue et al. 2001; Tomasic and Garcia-Molina 1993]: Distributing the information among many computers is a good solution to manage huge amounts of data, in the aggregated main memory of all the machines. Still, depending on the problem, the communic... |

62 | Rank/select operations on large alphabets: a tool for text indexing
- Golynski, Munro, et al.
Citation Context ...r of occurrences of a until position i; and select(a,i) returns the position where the i-th occurrence of the character a appears. (Footnote 3: www-diglib.stanford.edu/~testbed/doc2/WebBase/) Golynski et al. [15] presented a data structure capable of performing these three operations in a sequence S[1,n] using n log σ + o(n log σ) bits and O(log log σ) time. Note that n log σ is the space required by a plain ... |

62 | Succinct Static Data Structure
- Jacobson
- 1988
Citation Context ...ce required by the adjacency matrix [30]. The above techniques consider just the compression of the graph, not its access in compressed form. The first compressed data structure for graphs we know of [23] requires O(gn) bits of space for a g-page graph. The neighbors of a node can be retrieved in O(log n) time each (plus an extra O(g) complexity for the whole query). The main idea is again to represen... |

62 | Algorithms and Data Structures for External Memory - Vitter - 2008 |

52 | Compressing the graph structure of the web
- Suel, Yuan
- 2001
Citation Context ...ale to larger graphs, as much of their improvement relies on smart caching, and this effect should vanish with real Web graphs. There is also some work specifically aimed at compression of Web graphs [7, 1, 29, 6]. In this graph the (labeled) nodes are Web pages and the (directed) edges are the hyperlinks. Several properties of Web graphs have been identified and exploited to achieve compression: Skewed distri... |

48 | Distributed query processing using partitioned inverted files
- Badue, Baeza-Yates, et al.
- 2001
Citation Context ...e version in main memory, even if I/O-optimal. Yet their advantage is that they can manage huge graphs at low cost, since external memory is much cheaper than main memory. —Using distributed systems [Badue et al. 2001; Tomasic and Garcia-Molina 1993]: Distributing the information among many computers is a good solution to manage huge amounts of data, in the aggregated main memory of all the machines. Still, depend... |

42 | Short Encodings of Planar Graphs and Maps
- Keeler, Westbrook
- 1995
Citation Context ...st work on graph compression focuses on undirected unlabeled graphs. The first result we know of [30] shows that planar graphs can be compressed into O(n) bits. The constant factor was later improved [17], and finally a technique yielding the optimal constant factor was devised [14]. Results on planar graphs can be generalized to graphs with constant genus [20]. More generally, a graph with genus g ca... |

42 | Application of Lempel-Ziv factorization to the approximation of grammar-based compression - Rytter |

41 | Succinct indexes for strings, binary relations and multi-labeled trees
- Barbay, He, et al.
Citation Context ...V , and then use the techniques of Barbay et al. [3] for binary relations, where forward and reverse traversal operations can be solved in time O(log log n) per node delivered. A more recent followup [4] retains those times and reduces the space to the zero-order entropy of the binary relation, that is, log ( m) n . This compression result is still poor for Web graphs, see Table 8 where we give lower... |

37 | The link database: Fast access to graphs of the web
- Randall, Stata, et al.
- 2001
Citation Context ...on. In [29] they partition the adjacency lists considering popularity of the nodes, and use different coding methods for each partition. A more hierarchical view of the nodes is exploited in [26]. In [1, 27] they take explicit advantage of the similarity property. A page with similar outgoing links is identified with some heuristic, and then the current page is expressed as a reference to the similar pag... |

36 | Compact Representations of Separable Graphs
- Blandford, Blelloch, et al.
- 2003
Citation Context ...that of graph separators. Although the separator concept has been used a few times [10,14,8] (yet not supporting access to the compressed graph), the most striking results are achieved in recent work [5,4]. Their idea is to find graph components that can be disconnected from the rest by removing a small number of edges. Then, the nodes within each component can be renumbered to achieve smaller node ide... |

31 | Practical rank/select queries over arbitrary sequences - Claude, Navarro |

30 | Succinct representation of general unlabeled graphs
- Naor
- 1990
Citation Context ...xample trees, triangulated meshes, triconnected planar graphs, and others [15, 17, 13, 28]. For dense graphs, it is shown that little can be done to improve the space required by the adjacency matrix [23]. The above techniques consider just the compression of the graph, not its access in compressed form. The first compressed data structure for graphs we know of [16] requires O(gn) bits of space for a ... |

29 | The WebGraph framework II: Codes for the World-Wide Web - Boldi, Vigna - 2004 |

27 | Adaptive searching in succinctly encoded binary relations and tree-structured documents
- Barbay, Golynski, et al.
Citation Context ...e nodes that point to vi), which permits backward traversal in the graph. One way to address this is to consider the graph as a binary relation on V × V , and then use the techniques of Barbay et al. [3] for binary relations, where forward and reverse traversal operations can be solved in time O(log log n) per node delivered. A more recent followup [4] retains those times and reduces the space to the... |

26 | Extracting Large Scale Knowledge bases from the Web
- Kumar, Raghavan, et al.
- 1999
Citation Context ... encoding techniques are then used to encode the differences among consecutive target node positions. Similarity of adjacency lists: Nodes close in URL lexicographical order share many outgoing links [18, 6]. This permits compressing them by a reference to the similar list plus a list of edits. Moreover, this translates into source nodes pointing to a given target node forming long intervals of consecuti... |

26 | The smallest grammar problem - Charikar, Lehman, et al. |

25 | Compressed text indexes with fast locate
- González, Navarro
Citation Context ...ts of repeatedly finding the most frequent pair of symbols in a sequence of integers and replacing it with a new symbol, until no more replacements are convenient. This technique was recently used in [11] for compressing suffix arrays. More precisely, Re-Pair over a sequence T works as follows: 1. It identifies the most frequent pair ab in T 2. It adds the rule s → ab to a dictionary R, where s is a n... |

25 | Inverted file compression through document identifier reassignment
- Shieh, Chen, et al.
- 2003
Citation Context ...ic, and one could aim at finding the optimal ordering. However, similar problems have been studied for differential encoding of inverted lists, and they have been found to be hard [Fink and Voß 1999; Shieh et al. 2003]. Indeed, the whole point of the work of Buehrer and Chellapilla [2008] is to develop heuristics to find good subsequences efficiently. Removing pointers. It might be advantageous, for relatively spa... |

24 | A scalable pattern mining approach to web graph compression with communities
- Buehrer, Chellapilla
- 2008
Citation Context ...ression Our experiments indicate that our technique offers a good space/time tradeoff, yet it is unable to achieve the best compression ratios reached by alternative methods [Boldi et al. 2008; 2009; Buehrer and Chellapilla 2008]. We now explore how far we can reach in terms of compression ratio, even if sacrificing access time. As explained, compressed sequence C is stored with fixed-length integers, and possibly amenable o... |

23 | A fast general methodology for information-theoretically optimal encodings of graphs
- He, Kao, et al.
Citation Context ...result we know of [30] shows that planar graphs can be compressed into O(n) bits. The constant factor was later improved [17], and finally a technique yielding the optimal constant factor was devised [14]. Results on planar graphs can be generalized to graphs with constant genus [20]. More generally, a graph with genus g can be compressed into O(g + n) bits [10]. The same holds for a graph with g page... |

23 | Applications of modern heuristic search methods to pattern sequencing problems
- Fink, Voß
- 1999
Citation Context ...g is just a heuristic, and one could aim at finding the optimal ordering. However, similar problems have been studied for differential encoding of inverted lists, and they have been found to be hard [Fink and Voß 1999; Shieh et al. 2003]. Indeed, the whole point of the work of Buehrer and Chellapilla [2008] is to develop heuristics to find good subsequences efficiently. Removing pointers. It might be advantageous,... |

22 | Representation of Graphs
- Itai, Rodeh
- 1982
Citation Context ...e represented as a balanced sequence of parentheses. Some classes of planar graphs have also received special attention, for example trees, triangulated meshes, triconnected planar graphs, and others [15, 17, 13, 28]. For dense graphs, it is shown that little can be done to improve the space required by the adjacency matrix [23]. The above techniques consider just the compression of the graph, not its access in c... |

21 | Linear-time succinct encodings of planar graphs via canonical orderings
- He, Kao, et al.
- 1999
Citation Context ...e represented as a balanced sequence of parentheses. Some classes of planar graphs have also received special attention, for example trees, triangulated meshes, triconnected planar graphs, and others [15, 17, 13, 28]. For dense graphs, it is shown that little can be done to improve the space required by the adjacency matrix [23]. The above techniques consider just the compression of the graph, not its access in c... |

19 | Algorithms and Experiments for the Web Graph - Laura, Leonardi, et al. - 2003 |

15 | A large-scale study of link spam detection by graph algorithms - Saito, Toyoda, et al. - 2007 |
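
Several citation contexts above describe listing pages in lexicographical URL order so that out-links point to nearby positions, and then gap-encoding the differences between consecutive targets. The following is a minimal sketch of that idea, not code from any of the cited papers. The function names are hypothetical, the first gap is taken relative to 0 rather than to the source node, and the variable-byte code used to show the size difference is an assumption chosen for simplicity.

```python
def gap_encode(neighbors):
    """Replace a sorted list of node positions by successive differences.

    The first value is kept relative to 0 (an assumption of this sketch).
    """
    gaps = []
    previous = 0
    for position in neighbors:
        gaps.append(position - previous)
        previous = position
    return gaps


def gap_decode(gaps):
    """Invert gap_encode by accumulating the differences."""
    positions = []
    total = 0
    for gap in gaps:
        total += gap
        positions.append(total)
    return positions


def vbyte_encode(numbers):
    """Variable-byte code for non-negative integers.

    Each byte carries 7 payload bits; the high bit marks the last byte
    of a number, so small values take a single byte.
    """
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)


# Hypothetical adjacency list whose targets are close together in URL order.
neighbors = [100000, 100003, 100004, 100010]
gaps = gap_encode(neighbors)
plain_size = len(vbyte_encode(neighbors))
gapped_size = len(vbyte_encode(gaps))
```

When targets cluster, only the first value is large and the remaining gaps fit in one byte each, which is why locality of reference translates directly into space savings.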