Harvest: A Scalable, Customizable Discovery and Access System
1995
Cited by 159 (7 self)
Rapid growth in data volume, user base, and data diversity render Internet-accessible information increasingly difficult to use effectively. In this paper we introduce Harvest, a system that provides an integrated set of customizable tools for gathering information from diverse repositories, building topic-specific content indexes, flexibly searching the indexes, widely replicating them, and caching objects as they are retrieved across the Internet. The system interoperates with WWW clients and with HTTP, FTP, Gopher, and NetNews information resources. We discuss the design and implementation of Harvest and its subsystems, give examples of its uses, and provide measurements indicating that Harvest can significantly reduce server load, network traffic, and space requirements when building indexes, compared with previous systems. We also discuss several popular indexes we have built using Harvest, underscoring the customizability and scalability of the system.
Scalable Internet Resource Discovery: Research Problems and Approaches
1994
Cited by 121 (3 self)
Over the past several years, a number of information discovery and access tools have been introduced in the Internet, including Archie, Gopher, Netfind, and WAIS. These tools have become quite popular, and are helping to redefine how people think about wide-area network applications. Yet, they are not well suited to supporting the future information infrastructure, which will be characterized by enormous data volume, rapid growth in the user base, and burgeoning data diversity. In this paper we indicate trends in these three dimensions and survey problems these trends will create for current approaches. We then suggest several promising directions of future resource discovery research, along with some initial results from projects carried out by members of the Internet Research Task Force Research Group on Resource Discovery and Directory Service.
Integrating content-based access mechanisms with hierarchical file systems
1999
Cited by 62 (0 self)
We present a new file system that combines name-based and content-based access to files at the same time. Our design allows both methods to be used at any time, thus preserving the benefits of both. Users can create their own name spaces based on queries, on explicit path names, or on any combination interleaved arbitrarily. All regular file operations -- such as adding, deleting, or moving files -- are supported in the same way, and in addition, query consistency is maintained and adapted to what the user is manually doing. One can add, remove, or move results of queries, and in general handle them as if they were regular files. This creates interesting new consistency problems, for which we suggest and implement solutions. Remote file systems or remote query systems (e.g., web search) can be integrated by users into their own coherent name spaces in a clean way. We believe that our design can serve as the basis for future information-rich file systems, giving users a better handle on their information.
Finding near-replicas of documents on the Web
In International Workshop on the World Wide Web and Databases (WebDB’98), 1998
Cited by 54 (0 self)
We consider how to efficiently compute the overlap between all pairs of web documents. This information can be used to improve web crawlers, web archivers and in the presentation of search results, among others. We report statistics on how common replication is on the web, and on the cost of computing the above information for a relatively large subset of the web -- about 24 million web pages, which corresponds to about 150 Gigabytes of textual information.
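The abstract does not say how overlap is measured, so the following is only a minimal sketch of the general idea of pairwise document overlap, not the paper's method. It assumes word-level shingles and Jaccard similarity; the names `shingles`, `overlap`, and `near_replicas`, the shingle size `k`, and the `threshold` value are all hypothetical choices made for illustration.

```python
from itertools import combinations


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short documents contribute a single shingle made of all their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def overlap(a: set, b: set) -> float:
    """Jaccard overlap: shared shingles divided by all distinct shingles."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_replicas(docs: dict[str, str], threshold: float = 0.5, k: int = 3):
    """Yield (id1, id2, score) for every pair whose overlap meets threshold.

    Assumption: a brute-force comparison of all pairs, which is quadratic
    in the number of documents and only suitable for small collections.
    """
    sets = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    for (id1, s1), (id2, s2) in combinations(sets.items(), 2):
        score = overlap(s1, s2)
        if score >= threshold:
            yield id1, id2, score
```

Calling `list(near_replicas({"a": "...", "b": "..."}))` returns the pairs that share at least half of their shingles. Because every pair is compared, this sketch would not scale to a collection of millions of pages; it only illustrates what "overlap between pairs of documents" can mean.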
The use of categories and clusters for organizing retrieval results
Natural Language Information Retrieval, 1999
Cited by 14 (1 self)
An important problem for information access systems is that of organizing large sets of documents that have been retrieved in response to a query. Text categorization and text clustering are two natural language processing tasks whose results can be applied to document organization. This chapter describes user interfaces that use categories and clusters to organize retrieval results, and examines the relationship between the two.
The Lifestreams Software Architecture
1997
Cited by 10 (0 self)
"Typical" computer users struggle to organize and find their own electronic documents, manage their schedules and correspondence, and filter an ever increasing deluge of information. The process is made worse as users are forced to combine the disparate features of many applications to achieve these tasks. These problems suggest that our current software systems are ill-equipped to handle the demands of the typical computer user. Research has shown that common desktop environments (such as the Macintosh "desktop") are often badly fitted to users' needs. In an attempt to do better we have reduced "information management" to a few simple and unifying concepts and created "Lifestreams." Lifestreams is a software architecture based on a simple data structure, a time-ordered stream of documents, that can be manipulated with a small number of powerful operators to locate, organize, summarize and monitor information. In this dissertation we first provide motivation for Lifestreams. We then present the model and discuss the development of our research prototype. Our prototype realizes many of the system's defining features and has allowed us to experiment with the model's key ideas with actual users (of differing levels of computer experience) over the course of its development. Results from its use suggest that Lifestreams is an effective software architecture for managing common computer tasks; its simple organizational storage system (the stream) combined with a small number of powerful operators provides a unified framework that subsumes many separate desktop applications to accomplish and handle the most common personal communication, reminding, and storage and retrieval tasks. In addition, Lifestreams suggests valuable new capabilities for electronic systems.
Lightweight Document Matching for Help Desk Applications
1999
Cited by 5 (3 self)
We describe a fast document matcher that matches new documents to those stored in a database. The matcher lists in order those stored documents that are most similar to the new document. The new documents are typically detailed problem descriptions or free form textual queries of unlimited length, and the stored documents are potential answers such as frequently asked questions or service tips. The method uses minimal data structures and lightweight scoring algorithms to compute efficiently even in restricted environments, such as mobile or small desktop computers. Evaluations on benchmark document collections demonstrate that predictive performance for multiple document matches is competitive with more computationally expensive procedures.
Keywords: Text mining, Information Retrieval, Text Categorization, Case-Based Reasoning
1 Introduction. In an age of distributed and pervasive computing, many future programs will run on smaller capacity machines requiring "lightweight" ...
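The abstract names neither the data structures nor the scoring function, so the sketch below is an assumption-laden illustration of the general idea, not the authors' algorithm. It assumes an inverted index from words to stored-document ids and scores each stored document by the number of distinct words it shares with the new document. The names `tokenize`, `build_index`, and `match` and the parameter `top_n` are hypothetical.

```python
from collections import Counter, defaultdict


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its distinct words, punctuation stripped."""
    return {w.strip('.,;:!?"()').lower() for w in text.split()} - {""}


def build_index(stored: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the ids of the stored documents that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in stored.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def match(query: str, index: dict[str, set[str]], top_n: int = 3) -> list[tuple[str, int]]:
    """Rank stored documents by how many distinct query words they contain.

    Assumption: a plain shared-word count stands in for the paper's
    unspecified "lightweight scoring algorithms".
    """
    scores: Counter = Counter()
    for word in tokenize(query):
        for doc_id in index.get(word, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
```

The index is built once over the stored answers, and each new problem description then touches only the posting sets of its own words, which keeps both memory use and per-query work small.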

