Towards Web-Scale Web Archeology (2001) [6 citations — 0 self]

by Shun-Tak A. Leung, Sharon E. Perl, Raymie Stata, Janet L. Wiener

Abstract:

Web-scale Web research is difficult. Information on the Web is vast in quantity, unorganized and uncatalogued, and available only over a network with varying reliability. Thus, Web data is difficult to collect, to store, and to manipulate efficiently. Despite these difficulties, we believe performing Web research at Web-scale is important. We have built a suite of tools that allow us to experiment on collections that are an order of magnitude or more larger than are typically cited in the literature. Two key components of our current tool suite are a fast, extensible Web crawler and a highly tuned, in-memory database of connectivity information. A Web page repository that supports easy access to and storage for billions of documents would allow us to study larger data sets and to study how the Web evolves over time.

Citations

1839 The Anatomy of a Large-Scale Hypertextual Web Search Engine – Brin, Page - 1998
349 Improved algorithms for topic distillation in hyperlinked environments – Bharat, Henzinger - 1998
263 Syntactic clustering of the Web – Broder, Glassman, et al.
200 Efficient crawling through URL ordering – Cho, Garcia-Molina, et al. - 1998
136 Graph structure in the web – Broder, Kumar, et al. - 2000
128 A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines – Bharat, Broder - 1998
124 Finding related pages in the World Wide Web – Dean, Henzinger - 1999
102 Mercator: A Scalable, Extensible Web Crawler – Heydon, Najork - 1999
91 The Connectivity Server: Fast access to linkage information on the Web – Bharat, Broder, et al. - 1998
78 WebL - A Programming Language for the Web – Kistler, Marais - 1998
71 WebBase: A Repository of Web Pages – Hirai, Raghavan, et al. - 2000
64 On near-uniform URL sampling – Henzinger, Heydon, et al. - 2000
60 Breadth-First Search Crawling Yields High-Quality Pages – Najork, Wiener - 2001
53 Measuring Index Quality using Random Walks on the Web – Henzinger, Heydon, et al. - 1999
48 A comparison of techniques to find mirrored hosts on the WWW – Bharat, Broder, et al. - 2000
39 Mirror, mirror on the web: a study of host pairs with replicated content – Bharat, Broder - 1999
20 SpeechBot: a Speech Recognition based Audio Indexing System for the Web – Thong, Litvinova, et al. - 2000
18 The term vector database: Fast access to indexing terms for web pages – Stata, Bharat, et al. - 2000
4 The AltaVista Search Revolution. Osborne McGraw-Hill – Ray, Ray, et al. - 1998
3 personal communication – Manasse
2 High-Performance Web Crawling. Chapter 2 – Najork, Heydon - 2001
2 The Link Database: Fast access to very large Web Graphs – Randall, Stata, et al. - 2001