Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections
, 1992
Cited by 519 (12 self)
Document clustering has not been well received as an information retrieval tool. Objections to its use fall into two main categories: first, that clustering is too slow for large corpora (with running time often quadratic in the number of documents); and second, that clustering does not appreciably improve retrieval. We argue that these problems arise only when clustering is used in an attempt to improve conventional search techniques. However, looking at clustering as an information access tool in its own right obviates these objections, and provides a powerful new access paradigm. We present a document browsing technique that employs document clustering as its primary operation. We also present fast (linear time) clustering algorithms which support this interactive browsing paradigm.
1 Introduction
Document clustering has been extensively investigated as a methodology for improving document search and retrieval (see [15] for an excellent review). The general assumption is that mutua...
The limitations of term co-occurrence data for query expansion in document retrieval systems
- Journal of the American Society for Information Science
, 1991
Cited by 82 (0 self)
Term cooccurrence data has been extensively used in document retrieval systems for the identification of indexing terms that are similar to those that have been specified in a user query: these similar terms can then be used to augment the original query statement. Despite the plausibility of this approach to query expansion, the retrieval effectiveness of the expanded queries is often no greater than, or even less than, the effectiveness of the unexpanded queries. This article demonstrates that the similar terms identified by cooccurrence data in a query expansion system tend to occur very frequently in the database that is being searched. Unfortunately, frequent terms tend to discriminate poorly between relevant and nonrelevant documents, and the general effect of query expansion is thus to add terms that do little or nothing to improve the discriminatory power of the original query.
Term Clustering of Syntactic Phrases
- Proceedings of ACM SIGIR-90
, 1990
Cited by 56 (5 self)
Term clustering and syntactic phrase formation are methods for transforming natural language text. Both have had only mixed success as strategies for improving the quality of text representations for document retrieval. Since the strengths of these methods are complementary, we have explored combining them to produce superior representations. In this paper we discuss our implementation of a syntactic phrase generator, as well as our preliminary experiments with producing phrase clusters. These experiments show small improvements in retrieval effectiveness resulting from the use of phrase clusters, but it is clear that corpora much larger than standard information retrieval test collections will be required to thoroughly evaluate the use of this technique.
Order-Theoretical Ranking
- Journal of the American Society for Information Science (JASIS)
, 2000
Cited by 15 (3 self)
Current best-match ranking (BMR) systems perform well but cannot handle word mismatch between a query and a document. The best known alternative ranking method, hierarchical clustering-based ranking (HCR), seems to be more robust than BMR with respect to this problem, but it is hampered by theoretical and practical limitations. We present an approach to document ranking that explicitly addresses the word mismatch problem by exploiting interdocument similarity information in a novel way. Document ranking is seen as a query-document transformation driven by a conceptual representation of the whole document collection, into which the query is merged. Our approach is based on the theory of concept (or Galois) lattices, which, we argue, provides a powerful, well-founded, and computationally tractable framework to model the space in which documents and query are represented and to compute such a transformation. We compared information retrieval using concept lattice-based ranking (CLR) to BMR and HCR. The results showed that HCR was outperformed by CLR as well as by BMR, and suggested that, of the two best methods, BMR achieved better performance than CLR on the whole document set while CLR compared more favorably when only the first retrieved documents were used for evaluation. We also evaluated the three methods' specific ability to rank documents that did not match the query, in which case the superiority of CLR over BMR and HCR (and that of HCR over BMR) was apparent.
Improving the Retrieval Effectiveness by a Similarity Thesaurus
- ETH Zürich, Department of Computer Science, Zürich, Switzerland
, 1994
Cited by 7 (1 self)
A novel information structure and its use for query expansion are presented. The information structure, called a similarity thesaurus, consists of term-term similarities that are based on how the terms of a collection "are indexed" by the documents. In this way, the similarity thesaurus reflects domain knowledge about the collection from which it is constructed. It is used to select and weight additional query terms when expanding an existing query. This is in contrast to conventional query expansion methods as the similarity between candidate terms and the concept of the entire query is taken into account. Experiments on test collections show that the retrieval effectiveness is considerably higher when this method is applied. That this concept-based query expansion model can also be used to produce better results in large-scale operational IR environments is the final aspiration.

