Rapid growth in data volume, user base, and data diversity renders Internet-accessible information increasingly difficult to use effectively. In this paper we introduce Harvest, a system that provides a set of customizable tools for gathering information from diverse repositories, building topic-specific content indexes, flexibly searching the indexes, widely replicating them, and caching objects as they are retrieved across the Internet. The system interoperates with Mosaic and with HTTP, FTP, and Gopher information resources. We discuss the design and implementation of each subsystem, and provide measurements indicating that, compared with previous systems, Harvest can significantly reduce server load, network traffic, and space requirements when building indexes. We also discuss a half dozen indexes we have built using Harvest, underscoring both the customizability and scalability of the system.

1 Introduction

Over the past few years a progression of Internet publishing tools have ...