by Shun-Tak A. Leung, Sharon E. Perl, Raymie Stata, Janet L. Wiener
Abstract:
Web-scale Web research is difficult. Information on the Web is vast in quantity, unorganized and uncatalogued, and available only over a network with varying reliability. Thus, Web data is difficult to collect, to store, and to manipulate efficiently. Despite these difficulties, we believe performing Web research at Web scale is important. We have built a suite of tools that allow us to experiment on collections that are an order of magnitude or more larger than those typically cited in the literature. Two key components of our current tool suite are a fast, extensible Web crawler and a highly tuned, in-memory database of connectivity information. A Web page repository that supports easy access to and storage for billions of documents would allow us to study larger data sets and to study how the Web evolves over time.