Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results
1996
Cited by 331 (5 self)
We present Scatter/Gather, a cluster-based document browsing method, as an alternative to ranked titles for the organization and viewing of retrieval results. We systematically evaluate Scatter/Gather in this context and find significant improvements over similarity search ranking alone. This result provides evidence validating the cluster hypothesis, which states that relevant documents tend to be more similar to each other than to non-relevant documents. We describe a system employing Scatter/Gather and demonstrate that users are able to use this system close to its full potential.
1 Introduction
An important service offered by an information access system is the organization of retrieval results. Conventional systems rank results based on an automatic assessment of relevance to the query [20]. Alternatives include graphical displays of interdocument similarity (e.g., [1, 22, 7]), relationship to fixed attributes (e.g., [21, 14]), and query term distribution patterns (e.g., [12]). I...
Pharos: A Scalable Distributed Architecture for Locating Heterogeneous Information Sources
In Proceedings of the 6th International Conference on Information and Knowledge Management, 1996
Cited by 32 (7 self)
This paper presents the design of Pharos: a scalable distributed architecture for locating heterogeneous information sources. The system incorporates a hierarchical metadata structure into a multi-level retrieval system. Queries are resolved through an iterative decision-making process. The first step retrieves coarse-grain metadata, about all sources, stored on local, massively replicated, high-level servers. Further steps retrieve more detailed metadata, about a greatly reduced set of sources, stored on remote, sparsely replicated, topic-based mid-level servers. We describe the structure, distribution, and retrieval of the metadata in Pharos to enable users to locate desirable information sources over the Internet.
Contents
1 Introduction
2 Design Overview
2.1 Motivation
2.2 Example Query
2.3 Multi-Level Approach
...
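The iterative query resolution described in the Pharos abstract can be illustrated with a minimal Python sketch. It is not the Pharos implementation: the source names, topic labels, term counts, and function names below are all hypothetical, and the scoring (summed term frequency) is a stand-in chosen only to show the coarse-then-detailed narrowing.

```python
from collections import Counter

# Hypothetical coarse-grain metadata about ALL sources (one topic per source),
# standing in for what a local, heavily replicated high-level server would hold.
COARSE = {
    "src-a": "medicine",
    "src-b": "medicine",
    "src-c": "law",
}

# Hypothetical detailed metadata, grouped by topic as a mid-level server might
# hold it: topic -> source -> term frequencies.
DETAILED = {
    "medicine": {
        "src-a": Counter({"cardiology": 5, "surgery": 1}),
        "src-b": Counter({"surgery": 4}),
    },
    "law": {
        "src-c": Counter({"contract": 3}),
    },
}


def coarse_step(topic):
    """Step 1: use coarse metadata to reduce all sources to a candidate set."""
    return sorted(s for s, t in COARSE.items() if t == topic)


def detailed_step(topic, candidates, terms):
    """Step 2: rank only the reduced set using the detailed metadata."""
    scores = {s: sum(DETAILED[topic][s][t] for t in terms) for s in candidates}
    matching = [s for s in candidates if scores[s] > 0]
    # Highest score first; ties broken by source name for determinism.
    return sorted(matching, key=lambda s: (-scores[s], s))


def locate_sources(topic, terms):
    """Resolve a query in two steps: coarse filtering, then detailed ranking."""
    return detailed_step(topic, coarse_step(topic), terms)
```

Calling `locate_sources("medicine", ["surgery"])` consults detailed metadata only for the two medicine sources, which is the point of the multi-level design: the expensive lookup runs over a greatly reduced set.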
Automatic Subject Indexing Using An Associative Neural Network
In Proceedings of the 3rd ACM International Conference on Digital Libraries (DL'98), 1998
Cited by 20 (6 self)
The global growth in popularity of the World Wide Web has been enabled in part by the availability of browser based search tools which in turn have led to an increased demand for indexing techniques and technologies. As the amount of globally accessible information in community repositories grows, it is no longer cost-effective for such repositories to be indexed by professional indexers who have been trained to be consistent in subject assignment from controlled vocabulary lists. The era of amateur indexers is thus upon us, and the information infrastructure needs to provide support for such indexing if search of the Net is to produce useful results. In this paper
The use of categories and clusters for organizing retrieval results
Natural Language Information Retrieval, 1999
Cited by 14 (1 self)
An important problem for information access systems is that of organizing large sets of documents that have been retrieved in response to a query. Text categorization and text clustering are two natural language processing tasks whose results can be applied to document organization. This chapter describes user interfaces that use categories and clusters to organize retrieval results, and examines the relationship between the two.
Scalable Collection Summarization and Selection
In Proc. of ACM Conference on Digital Libraries, 1999
Cited by 13 (0 self)
Information retrieval over the Internet increasingly requires the filtering of thousands of information sources. As the number and variety of sources increases, new ways of automatically summarizing, discovering, and selecting sources relevant to a user's query are needed. Pharos is a highly scalable distributed architecture for locating heterogeneous information sources. Its design is hierarchical, thus allowing it to scale well as the number of information sources increases. We demonstrate the feasibility of the Pharos architecture using 2500 Usenet newsgroups as separate collections. Each newsgroup is summarized via automated Library of Congress classification. We show that using Pharos as an intermediate retrieval mechanism provides acceptable accuracy of source selection compared to selecting sources using complete classification information, while maintaining good scalability. This implies that hierarchical distributed metadata and automated classification are potentially useful ...
Predicting Library of Congress Classifications from Library of Congress Subject Headings
2004
Cited by 10 (1 self)
This paper addresses the problem of automatically assigning a Library of Congress Classification (LCC) to a work given its set of Library of Congress Subject Headings (LCSH). LCC are organized in a tree: the root node of this hierarchy comprises all possible topics, and leaf nodes correspond to the most specialized topic areas defined. We describe a procedure that, given a resource identified by its LCSH, automatically places that resource in the LCC hierarchy. The procedure uses machine learning techniques and training data from a large library catalog to learn a classification model mapping from sets of LCSH to nodes in the LCC tree. We present empirical results for our technique showing its accuracy on an independent collection of 50,000 LCSH/LCC pairs.
Practical Evaluation of IR within Automated Classification Systems
Eighth International Conference on Information and Knowledge Management, 1999
Cited by 7 (3 self)
This paper describes some of the work we have done to evaluate and compare the use of three IR systems (Verity, LSI, and SMART) as black boxes within an automated classification environment. We use automated classification to make a quantitative comparison of the effectiveness of the systems within this context. In so doing, we also develop criteria for the construction of a useful training set. These results lead to metrics useful in the integration of IR systems into larger applications. We conclude with an initial API for an IR component within an automated classification architecture.
KEYWORDS: IR evaluation, automated classification, training sets.
1 Introduction
A library environment includes such tasks as metadata generation, finding relationships between various thesauri and/or classification schemes, and document classification. There are many commonly used controlled vocabularies and classification schemes in industry, such as ICD9 for medicine, SIC for market anal...
Filtering medical documents using automated and human classification methods
Journal of the American Society for Information Science, 1998
Cited by 7 (1 self)
The goal of this research is to clarify the role of document classification in information filtering. An important function of classification, in managing computational complexity, is described and illustrated in the context of an existing filtering system. A parameter called classification homogeneity is presented for analyzing unsupervised automated classification by employing human classification as a control. Two significant components of the automated classification approach, vocabulary discovery and classification scheme generation, are described in detail. Results of classification performance revealed considerable variability in the homogeneity of automatically produced classes. Based on the classification performance, different types of interest profiles were created. Subsequently, these profiles were used to perform filtering sessions. The filtering results showed that with increasing homogeneity, filtering performance improves, and, conversely, with decreasing homogeneity, filtering performance degrades.
Harvesting Translingual Vocabulary Mappings for Multilingual Digital Libraries
Proc. of 25th ACM SIGIR Conf., 2002
Cited by 6 (0 self)
This paper presents a method of information harvesting and consolidation to support the multilingual information requirements for cross-language information retrieval within digital library systems. We describe a way to create both customized bilingual dictionaries and multilingual query mappings from a source language to many target languages. We describe a multilingual conceptual mapping resource with broad coverage (over 100 written languages can be supported) that is truly multilingual, as opposed to the bilingual pairings usually derived from machine translation. This resource is derived from the 10+ million title online library catalog of the University of California. It is created statistically via maximum likelihood associations from words and phrases in book titles of many languages to human-assigned subject headings in English. The 150,000 subject headings can form interlingua mappings between pairs of languages or from one language to several languages. While our current demonstration prototype maps between ten languages (English, Arabic, Chinese, French, German, Italian, Japanese, Portuguese, Russian, Spanish), extensions to additional languages are straightforward. We also describe how this resource is being expanded for languages where linguistic coverage is limited in our initial database, by automatically harvesting new information from international online library catalogs using the Z39.50 networked library search protocol.
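As a rough illustration of how title words in any language could be associated with English subject headings, the Python sketch below estimates P(heading | word) by relative frequency over catalog records. This is only one simple reading of "maximum likelihood associations"; the abstract does not give the statistic the authors actually use, and the function names, the whitespace tokenizer, and the sample records in the usage note are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train(records):
    """Count word/heading co-occurrences.

    records: iterable of (title, list_of_subject_headings) pairs.
    Returns (word_count, pair_count), where word_count[w] is the number of
    titles containing w and pair_count[w][h] is the number of those titles
    that carry heading h.
    """
    word_count = Counter()
    pair_count = defaultdict(Counter)
    for title, headings in records:
        # Assumed tokenizer: lowercase and split on whitespace; count each
        # word at most once per title.
        for word in set(title.lower().split()):
            word_count[word] += 1
            for heading in headings:
                pair_count[word][heading] += 1
    return word_count, pair_count


def heading_probs(word, word_count, pair_count):
    """Relative-frequency estimate of P(heading | word); empty if unseen."""
    word = word.lower()
    n = word_count[word]
    if n == 0:
        return {}
    return {h: c / n for h, c in pair_count[word].items()}
```

Trained on records such as `("Histoire de France", ["France--History"])` and `("History of France", ["France--History"])`, the French word "histoire" and the English word "history" both point to the same heading, which is how a shared English heading can act as an interlingua between title vocabularies.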

