Projections for Efficient Document Clustering
, 1997
Cited by 86 (0 self)
Clustering is increasing in importance, but linear- and even constant-time clustering algorithms are often too slow for real-time applications. A simple way to speed up clustering is to speed up the distance calculations at the heart of clustering routines. We study two techniques for reducing the cost of distance calculations, LSI and truncation, and determine both how much these techniques speed up clustering and how much they affect the quality of the resulting clusters. We find that the speed increase is significant while, surprisingly, the quality of clustering is not adversely affected. We conclude that truncation yields clusters as good as those produced by full-profile clustering while offering a significant speed advantage.
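The truncation idea in this abstract can be illustrated with a minimal sketch. It assumes documents and centroids are sparse term-weight vectors stored as dictionaries and that distance is measured by cosine similarity; the function names, the cutoff `k`, and the toy weights are illustrative assumptions, not details taken from the paper.

```python
import math

def truncate(vec, k):
    """Keep only the k highest-weighted terms of a sparse term-weight vector."""
    top = sorted(vec.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return dict(top)

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical toy data: one document and one cluster centroid.
doc = {"cluster": 3.0, "speed": 2.0, "the": 0.1, "of": 0.1}
centroid = {"cluster": 2.5, "speed": 1.0, "quality": 1.5, "the": 0.2}

full = cosine(doc, centroid)                            # full-profile similarity
fast = cosine(truncate(doc, 2), truncate(centroid, 2))  # truncated similarity
```

The cost of each similarity computation grows with the number of non-zero terms, so shorter truncated vectors make every comparison cheaper. Whether cluster quality survives the truncation is the empirical question the paper addresses.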
Query Expansion and Classification of Retrieved Documents
- Proceedings of the 7th Text Retrieval Conference (TREC-7)
, 1998
Cited by 10 (3 self)
This paper presents different methods tested by the University of Avignon and Bertin at the TREC-7 evaluation. A first section describes several methodologies used for query expansion: synonymy and stemming. Relevance feedback is applied both to the TIPSTER corpora and Internet documents. In a second section, we describe a classification algorithm based on hierarchical and clustering methods. This algorithm improves results given by any Information Retrieval system (that retrieves a list of documents from a query) and helps the users by automatically providing a structured document map from the set of retrieved documents. Lastly, we present the first results obtained with TREC-6 and TREC-7 corpora and queries by using this algorithm.
Document Clustering in Reduced Dimension Vector Space. unpublished. [Littman et
- Automatic CrossLanguage Information Retrieval using Latent Semantic Indexing. Proceedings of SIGIR96
, 1999
Cited by 5 (0 self)
Document clustering is a popular tool for automatically organizing a large collection of texts. Clustering algorithms are usually applied to documents represented as vectors in a high-dimensional term space. We investigate the use of Latent Semantic Analysis to create a new vector space that is the optimal representation of the document collection. Documents are projected onto a small subspace of this vector space and clustered. We compare the performance of clustering algorithms when applied to documents represented in the full term space and in a reduced-dimension subspace of the LSA-generated vector space. We report significant improvements in cluster quality for LSA subspaces with optimal dimensionality. We discuss the procedure for determining the right number of dimensions for the subspace. Moreover, when this number is small, the total running time of the clustering algorithm is comparable to that of the same algorithm run on the full term space.
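The projection step described in this abstract can be sketched with a truncated singular value decomposition. This is an illustration only: it assumes a dense term-by-document matrix and NumPy, and the function name `lsa_project`, the toy matrix, and the choice `k = 2` are assumptions. The paper's own procedure for choosing the number of dimensions is not reproduced here.

```python
import numpy as np

def lsa_project(term_doc, k):
    """Project documents (columns of term_doc) onto the top-k singular directions."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Scale the top-k right singular vectors by their singular values;
    # each row of the result is one document in the k-dimensional subspace.
    return (np.diag(s[:k]) @ vt[:k]).T

# Hypothetical toy matrix: rows are terms, columns are documents.
# Documents 0 and 1 share one vocabulary, documents 2 and 3 another.
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 3.0],
])

docs_k = lsa_project(A, 2)  # shape: (4 documents, 2 dimensions)
```

Any standard clustering routine can then be run on the rows of `docs_k` instead of the full term vectors. In this toy example the two vocabulary groups land on orthogonal axes, which is the kind of separation a reduced subspace is meant to expose.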
A Clustering Method for Information Retrieval
- of the 39th Hawaii International Conference on System Sciences - 2006
, 1999
Cited by 2 (0 self)
A classical information retrieval system retrieves and ranks documents according to distances between texts and a user query. The answer list is often so long that users cannot examine all the documents retrieved, whereas some relevant ones are badly ranked and thus never examined. To solve this problem, retrieved documents are automatically clustered. We describe a classification algorithm based on hierarchical and clustering methods. It classifies the set of documents retrieved by any IR system. This method is then evaluated over the TREC-7 corpora and queries. We show that it improves the results of the retrieval by providing users with at least one high-precision cluster. The impact of the number of clusters and the way to browse them are examined. The best results are obtained by choosing the number of clusters according to query sizes.
1. Introduction
A classical information retrieval system retrieves and ranks documents extracted from a corpus according to a similarity functi...
Information Forage through Adaptive Visualization
- In Proc of the ACM Int'l Conf on Digital Libraries
, 1998
Automatically created maps of concepts improve navigation in a collection of text documents. We report our research on improving navigation by giving users the ability to modify the maps themselves interactively. We believe that this functionality leads to better responsiveness to the user and a more effective search. For this purpose we have created and tested a prototype system that builds and refines in real time a map of concepts found in Web documents returned by a commercial search engine.
KEYWORDS: Intelligent searching, interactive data exploration, information representation, WWW, search engines, information retrieval.
INTRODUCTION
Summarization and visualization tools are believed to be helpful in navigating through large volumes of data, since a visual representation may solicit more deliberate query reformulation and better feedback to the retrieval system [3], [4]. Existing commercial systems ("Hyperbolic Tree" by Inxight Software or "SemioMap" by Semio Corporation) allow n...
Lecture Notes in Computer Science (LNCS) 1749
A classical information retrieval system ranks documents according to distances between texts and a user query. The answer list is often so long that users cannot examine all the documents retrieved, whereas some relevant ones are badly ranked and thus never retrieved. To solve this problem, retrieved documents are automatically clustered. We describe an algorithm based on hierarchical and clustering methods. It classifies the set of documents retrieved by any IR system. This method is evaluated over the TREC-7 corpora and queries. We show that it improves the results of the retrieval by providing users with at least one high-precision cluster. The impact of the number of clusters and the way to browse them to build a reordered list are examined. Over TREC corpora and queries, we show that choosing the number of clusters according to the length of queries improves results compared with a number fixed in advance.

