## Compressing the graph structure of the web (2001)

### Download Links

- [cis.poly.edu]
- DBLP

Venue: IEEE Data Compression Conference (DCC)

Citations: 52 (2 self)

### BibTeX

    @INPROCEEDINGS{Suel01compressingthe,
      author    = {Torsten Suel},
      title     = {Compressing the graph structure of the web},
      booktitle = {IEEE Data Compression Conference (DCC)},
      year      = {2001}
    }

### OpenURL
### Abstract

A large amount of research has recently focused on the graph structure (or link structure) of the World Wide Web. This structure has proven to be extremely useful for improving the performance of search engines and other tools for navigating the web. However, since the graphs in these scenarios involve hundreds of millions of nodes and even more edges, highly space-efficient data structures are needed to fit the data in memory. A first step in this direction was taken by the DEC Connectivity Server, which stores the graph in compressed form. In this paper, we describe techniques for compressing the graph structure of the web, and give experimental results of a prototype implementation. We attempt to exploit a variety of different sources of compressibility of these graphs and of the associated set of URLs in order to obtain good compression performance on a large web graph.
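One standard source of compressibility in adjacency lists, which the abstract alludes to, is that link targets from the same page tend to be close together in node-ID space, so storing gaps between sorted IDs rather than the IDs themselves yields many small, highly compressible values. A minimal illustrative sketch (the paper's actual encoding combines several further techniques and is not reproduced here):

```python
def encode_gaps(neighbors):
    """Delta-encode a sorted adjacency list: keep the first target
    node ID, then store the gap to each subsequent ID. Gaps are
    small when links cluster, so they suit variable-length codes."""
    sorted_ids = sorted(neighbors)
    gaps = [sorted_ids[0]]
    for prev, cur in zip(sorted_ids, sorted_ids[1:]):
        gaps.append(cur - prev)
    return gaps

def decode_gaps(gaps):
    """Invert encode_gaps by taking a running sum of the gaps."""
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids

links = [1044, 1047, 1051, 1213, 8100]
print(encode_gaps(links))  # [1044, 3, 4, 162, 6887]
assert decode_gaps(encode_gaps(links)) == links
```

The node IDs and gap values here are invented for illustration; the point is only that the gap sequence contains mostly small integers.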

### Citations

3648 | The Anatomy of a Large-Scale Hypertextual Web Search Engine
- Brin, Page
- 1998
Citation Context: ...s. A number of recent papers have studied this graph, and have looked for ways to exploit its properties to improve the quality of search results. For example, the Pagerank algorithm of Brin and Page [4], used in the Google search engine, and the HITS algorithm of Kleinberg [16] both rank pages according to the number and importance of other pages that link to them. Currently, almost all of the major...
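The Pagerank algorithm this context refers to can be sketched as a simple power iteration over the out-link structure. This is a toy version with uniform teleportation and plain dictionaries, not Google's production implementation:

```python
def pagerank(out_links, d=0.85, iters=50):
    """Power-iteration PageRank over a dict mapping each page to its
    list of out-links. Dangling pages spread their mass uniformly."""
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - d) / n for p in pages}
        for p, targets in out_links.items():
            if targets:
                share = d * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:  # dangling node: distribute its rank everywhere
                for t in pages:
                    new[t] += d * rank[p] / n
        rank = new
    return rank

# Hypothetical three-page graph: "c" is linked from both "a" and "b",
# so it collects the highest rank.
r = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

The graph here is a made-up example; real rankings depend on the full web graph and on the damping factor `d`.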

2999 | Authoritative sources in a hyperlinked environment
- Kleinberg
- 1999
Citation Context: ...ays to exploit its properties to improve the quality of search results. For example, the Pagerank algorithm of Brin and Page [4], used in the Google search engine, and the HITS algorithm of Kleinberg [16] both rank pages according to the number and importance of other pages that link to them. Currently, almost all of the major search engines use information about the link structure in their decision o...

862 | Managing Gigabytes: Compressing and Indexing Documents and Images
- Witten, Moffat, et al.
- 1994
Citation Context: ...Related Work We now discuss related work in the data compression and web search areas. For a recent overview of compression techniques, we refer the reader to the textbook by Witten, Moffat, and Bell [20], which contains excellent descriptions of most of the coding techniques that we exploit, including canonical Huffman coding, and techniques for encoding gap sizes. An idea that we apply repeatedly in...
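The canonical Huffman coding mentioned in this context has the useful property that the code is fully determined by the per-symbol code lengths, so only the length table needs to be stored. A small sketch of the codeword-assignment step, assuming the lengths already come from a valid Huffman tree:

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codewords from a {symbol: code_length}
    table: symbols are sorted by (length, symbol), and consecutive
    integers become the codewords, left-shifted whenever the length
    grows. Shorter codes are thus numerically smaller prefixes."""
    syms = sorted(lengths, key=lambda s: (lengths[s], s))
    codes, code, prev_len = {}, 0, 0
    for s in syms:
        code <<= (lengths[s] - prev_len)   # pad when length increases
        codes[s] = format(code, "0{}b".format(lengths[s]))
        prev_len = lengths[s]
        code += 1
    return codes

print(canonical_codes({"a": 1, "b": 2, "c": 3, "d": 3}))
# {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```

The symbols and lengths above are a hypothetical example; in a graph or URL encoder the symbols would be gap sizes or character classes with lengths derived from their frequencies.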

534 | Focused Crawling: A New Approach to Topic-specific Web Resource Discovery
- Chakrabarti, Berg, et al.
- 1999
Citation Context: ... on how to rank search results. Link structure has also been exploited for a variety of other tasks, e.g., finding related pages [3], classifying pages [7], crawling for pages on a particular subject [10, 9], and many other examples. Recent large-scale studies of the web [6, 18, 17] have looked at basic graph-theoretic properties, such as connectivity, diameter, or the existence of bipartite cliques, on ...

432 | Improved algorithms for topic distillation in a hyperlinked environment
- Bharat, Henzinger
- 1998
Citation Context: ...arch engines use information about the link structure in their decision on how to rank search results. Link structure has also been exploited for a variety of other tasks, e.g., finding related pages [3], classifying pages [7], crawling for pages on a particular subject [10, 9], and many other examples. Recent large-scale studies of the web [6, 18, 17] have looked at basic graph-theoretic properties,...

406 | Enhanced hypertext categorization using hyperlinks
- Chakrabarti, Dom, et al.
- 1998

333 | External memory algorithms and data structures: Dealing with
- Vitter
Citation Context: ...ciently compute with graphs whose size is significantly larger than the memory of most current workstations. One possibility is to employ I/O-efficient techniques for computing with graphs (e.g., see [12, 19]), but even with these advanced techniques, computation times on large graphs can be prohibitive. The alternative is to use machines with massive amounts of main memory, but this is expensive and not ...

313 | Trawling the Web for emerging cyber-communities
- Kumar, Raghavan, et al.
- 1999
Citation Context: ...or a variety of other tasks, e.g., finding related pages [3], classifying pages [7], crawling for pages on a particular subject [10, 9], and many other examples. Recent large-scale studies of the web [6, 18, 17] have looked at basic graph-theoretic properties, such as connectivity, diameter, or the existence of bipartite cliques, on subsets of the web consisting of several hundred million nodes and over a bi...

180 | Mining Web’s Link Structure
- Chakrabarti, Dom, et al.
- 1999
Citation Context: ...raphs, and we thus rely on standard coding and text compression techniques in our construction. For recent surveys of information retrieval on the Web with emphasis on link-based methods, we refer to [5, 8]. Examples of ranking techniques based on link structure are the Pagerank algorithm of Brin and Page [4] and the HITS algorithm of Kleinberg [16]. Recent large-scale studies of the graph structure of ...

178 | External-memory graph algorithms
- Chiang, Goodrich, et al.
- 1995
Citation Context: ...ciently compute with graphs whose size is significantly larger than the memory of most current workstations. One possibility is to employ I/O-efficient techniques for computing with graphs (e.g., see [12, 19]), but even with these advanced techniques, computation times on large graphs can be prohibitive. The alternative is to use machines with massive amounts of main memory, but this is expensive and not ...

120 | The Connectivity Server: fast access to linkage information on the Web
- Bharat, Broder, et al.
- 1998

110 | Extracting large-scale knowledge bases from the web
- Kumar, Raghavan, et al.
- 1999
Citation Context: ...or a variety of other tasks, e.g., finding related pages [3], classifying pages [7], crawling for pages on a particular subject [10, 9], and many other examples. Recent large-scale studies of the web [6, 18, 17] have looked at basic graph-theoretic properties, such as connectivity, diameter, or the existence of bipartite cliques, on subsets of the web consisting of several hundred million nodes and over a bi...

106 | Graph structure in the web: experiments and models
- Broder, Kumar, et al.
- 2000
Citation Context: ...or a variety of other tasks, e.g., finding related pages [3], classifying pages [7], crawling for pages on a particular subject [10, 9], and many other examples. Recent large-scale studies of the web [6, 18, 17] have looked at basic graph-theoretic properties, such as connectivity, diameter, or the existence of bipartite cliques, on subsets of the web consisting of several hundred million nodes and over a bi...

83 | Towards compressing web graphs
- Adler, Mitzenmacher
- 2001
Citation Context: ... main objectives is to build a similar system that obtains better compression and that is fully accessible to the academic community. Very recently and independent of our work, Adler and Mitzenmacher [1] have proposed a new technique that exploits the special global structure of the web for compression, and that is based on recent attempts to model the web graph using a new type of random graph model...

42 | Short Encodings of Planar Graphs and Maps
- Keeler, Westbrook
- 1995
Citation Context: ...ucture does not seem to fit well into any of the families of graphs, such as trees, planar graphs, or graphs of bounded genus or arboricity, that have been studied in the graph compression literature [11, 13, 15]. We can identify a number of possible “sources of compressibility” in the link structure. First, some very popular pages (e.g., www.yahoo.com) have much higher in-degree than others, which suggests u...

35 | Distributed Hypertext Resource Discovery through Examples
- Chakrabarti, Berg, et al.
- 1999
Citation Context: ... on how to rank search results. Link structure has also been exploited for a variety of other tasks, e.g., finding related pages [3], classifying pages [7], crawling for pages on a particular subject [10, 9], and many other examples. Recent large-scale studies of the web [6, 18, 17] have looked at basic graph-theoretic properties, such as connectivity, diameter, or the existence of bipartite cliques, on ...

14 | A structural approach to graph compression
- Deo, Litow
- 1998
Citation Context: ...ucture does not seem to fit well into any of the families of graphs, such as trees, planar graphs, or graphs of bounded genus or arboricity, that have been studied in the graph compression literature [11, 13, 15]. We can identify a number of possible “sources of compressibility” in the link structure. First, some very popular pages (e.g., www.yahoo.com) have much higher in-degree than others, which suggests u...

6 | Efficient lossless compression of trees and graphs
- Chen, Reif
- 1996
Citation Context: ...ucture does not seem to fit well into any of the families of graphs, such as trees, planar graphs, or graphs of bounded genus or arboricity, that have been studied in the graph compression literature [11, 13, 15]. We can identify a number of possible “sources of compressibility” in the link structure. First, some very popular pages (e.g., www.yahoo.com) have much higher in-degree than others, which suggests u...

3 | Information retrieval on the web
- Broder, Henzinger
- 1998
Citation Context: ...raphs, and we thus rely on standard coding and text compression techniques in our construction. For recent surveys of information retrieval on the Web with emphasis on link-based methods, we refer to [5, 8]. Examples of ranking techniques based on link structure are the Pagerank algorithm of Brin and Page [4] and the HITS algorithm of Kleinberg [16]. Recent large-scale studies of the graph structure of ...

1 | gzip compression utility. Available at http://www.gzip.org
- Gailly
Citation Context: ...e coding techniques that we exploit, including canonical Huffman coding, and techniques for encoding gap sizes. An idea that we apply repeatedly involves the use of extra bits, as used, e.g., in gzip [14], to reduce the number of codewords needed. As mentioned, there are a number of known techniques for compressing special families of graphs [11, 13, 15]. However, these techniques do not seem to be ap...
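The "extra bits" idea from gzip that this last context refers to can be illustrated by bucketing values by bit length: only the bucket number gets an entropy-coded codeword, and the extra bits select the exact value inside the bucket, keeping the codeword alphabet small. A hypothetical sketch, not gzip's actual length/distance tables:

```python
def bucket_encode(gap):
    """Encode a positive integer as (bucket, extra): bucket k covers
    values in [2**k, 2**(k+1)), and k extra bits pick the value
    within the bucket. Only k needs a Huffman codeword."""
    k = gap.bit_length() - 1
    return k, gap - (1 << k)

def bucket_decode(k, extra):
    """Recover the value from its bucket number and extra bits."""
    return (1 << k) + extra

assert bucket_decode(*bucket_encode(1044)) == 1044  # bucket 10, extra 20
```

With this scheme a gap of any size maps to one of only about 32 buckets on 32-bit IDs, which is what makes it practical to Huffman-code the bucket numbers.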