## Improving text retrieval for the routing problem using latent semantic indexing (1994)

Venue: In Proc. of the 17th ACM-SIGIR Conference

Citations: 92 - 2 self

### BibTeX

@INPROCEEDINGS{Hull94improvingtext,

author = {David Hull},

title = {Improving text retrieval for the routing problem using latent semantic indexing},

booktitle = {In Proc. of the 17th ACM-SIGIR Conference},

year = {1994},

pages = {282--291}

}

### Abstract

Latent Semantic Indexing (LSI) is a novel approach to information retrieval that attempts to model the underlying structure of term associations by transforming the traditional representation of documents as vectors of weighted term frequencies to a new coordinate space where both documents and terms are represented as linear combinations of underlying semantic factors. In previous research, LSI has produced a small improvement in retrieval performance. In this paper, we apply LSI to the routing task, which operates under the assumption that a sample of relevant and non-relevant documents is available to use in constructing the query. Once again, LSI slightly improves performance. However, when LSI is used in conjunction with statistical classification, there is a dramatic improvement in performance.
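The transformation described in the abstract can be illustrated with a minimal sketch. It assumes NumPy's dense SVD and a toy 4-term by 4-document matrix; the paper itself reports using SVDPACK on a sparse matrix. The function names `lsi_project` and `routing_scores`, the toy data, and the choice of cosine similarity are illustrative assumptions, not the author's implementation. The routing query is taken as the mean of the known relevant documents in LSI space, as the citation context for SVDPACK below describes.

```python
import numpy as np


def lsi_project(term_doc, k):
    """Map documents into a k-dimensional LSI factor space (hypothetical helper)."""
    # Thin SVD: term_doc = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep the k largest singular values and their right singular vectors.
    s_k, Vt_k = s[:k], Vt[:k, :]
    # Row j is document j expressed in the k retained factors.
    return (s_k[:, None] * Vt_k).T


def routing_scores(doc_vectors, relevant_idx):
    """Score all documents against the mean of the known relevant ones."""
    query = doc_vectors[relevant_idx].mean(axis=0)
    # Cosine similarity is an assumption made for this sketch.
    norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query)
    return doc_vectors @ query / np.where(norms == 0, 1.0, norms)


# Toy term-document matrix (rows are terms, columns are documents).
# Documents 0 and 1 share one vocabulary, documents 2 and 3 another.
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
], dtype=float)

doc_vectors = lsi_project(A, k=2)
scores = routing_scores(doc_vectors, relevant_idx=[0])  # document 0 is the relevant sample
ranking = np.argsort(-scores)                           # best match first
```

In this toy setup, documents that share vocabulary with the relevant sample should rank ahead of those that do not, even where their raw term weights differ.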

### Citations

2735 | Indexing by latent semantic analysis
- Deerwester, Dumais, et al.
- 1990
Citation Context ...e it ignores both the order and association between terms, but it is hard to find a better method with an equivalent computational complexity. The recent development of Latent Semantic Indexing (LSI) [2] opens a promising avenue of research into methods for using term associations. LSI is a method designed to refine and improve vector space retrieval by transforming the search space to a new coordina...

1538 | Term-weighting approaches in automatic text retrieval
- Salton, Buckley
- 1988
Citation Context ...onal inverse document frequency (idf) weighting. This weight function strongly resembles the tf-idf weighting structure and its variants which have been used frequently in other retrieval experiments [12]. The similarity between the routing queries and the relevant documents is measured by the inner product of the corresponding weighted query and document vectors. In order to provide the best match be...

852 | Relevance feedback in information retrieval
- Rocchio
- 1971
Citation Context ...djust the routing query to evaluate each document. Given a set of relevant and non-relevant documents, one must determine the best way to incorporate this information into the routing query. Rocchio [11] suggests using the difference between the mean of the relevant and the mean of the non-relevant documents. where N is the size of the collection, n is the number of relevant documents, and rel and no...

601 | Improving Retrieval Performance by Relevance Feedback
- Salton, Buckley
- 1990
Citation Context ... for system evaluation at the TREC retrieval conference [3]. One can also imagine this task as the second stage in a retrieval algorithm in place of the traditional strategy of relevance feedback [4]. Since this approach starts with an information-rich environment, we can concentrate on overcoming the problems associated with using term frequencies as the underlying variables in the retrieval mod...

588 | An algorithm for finding best matches in logarithmic expected time
- Friedman, Bentley, et al.
- 1977
Citation Context ...number of terms in the query is likely to grow to be many times the number of LSI vectors, leading to a corresponding increase in search time. In addition, using a data structure such as the k-d tree [8] in conjunction with LSI would greatly speed the search for nearest neighbors, provided only a partial ordering of the documents is required. Most of the additional costs come in the pre-processing sta...

385 | Discriminant Analysis and Statistical Pattern Recognition
- McLachlan
- 1992
Citation Context ...plies the compute time and the storage requirement by the number of local LSI factors. We now describe how statistical classification is used to identify the relevant documents. Discriminant analysis [16] is a commonly used statistical classification technique. Essentially, it characterizes each population by its estimated mean vector and covariance matrix measured over the known observations. Each un...

181 | Concept based query expansion
- Qiu, Frei
- 1993
Citation Context ...r weighted combination of indexing variables. In essence, LSI can be described as a method for automatic query expansion. It makes use of similar information to the technique proposed by Qiu and Frei [5], which performs query expansion using a term-term similarity matrix. Polysemy describes words that have more than one meaning, which is a common property of language. Large numbers of polysemous words ...

175 | Relevance Feedback Revisited
- Harman
- 1992
Citation Context ...he most significant contribution of the LSI model. For the routing problem, we already use a significant number of relevant documents in the query, so query expansion should not be so helpful. Harman [15] suggests that relevance feedback can be improved by selectively choosing the most important terms to add to the query. Perhaps in the routing problem LSI performs this term selection process in rever...

160 | A Vector Space Model for Information Retrieval
- Salton, Wong, et al.
- 1975
Citation Context ...ed in the reduced space in a way that reflects the correlations in their use across documents. Some alternative methods for incorporating term associations into a retrieval model are given by Wong in [7, 18]. It is very difficult to take advantage of term associations without dramatically increasing the computational requirements of the retrieval problem. While the LSI solution is difficult to compute fo...

101 | Information retrieval using a singular value decomposition model of latent semantic structure
- Furnas, Deerwester, et al.
- 1988
Citation Context ...f LSI does require an additional investment of storage and computing time. The disadvantages of LSI in terms of retrieval performance are difficult to quantify. Other useful references to LSI include [9, 10]. 2.3 Comparing LSI to the vector space model Can LSI provide better performance than the vector space model? Deerwester et al. obtain experimental results for two collections, MED and CISI, comparing...

95 | Large scale singular value computations
- Berry
- 1992
Citation Context ..., we insure that the LSI and term-matching solutions are directly comparable. The SVD is then computed using SVDPACK, a sparse matrix SVD program written by Michael Berry and available through NETLIB [13]. The routing query for the LSI solution is chosen to be the average of the relevant documents as represented in LSI space. The similarities are then computed directly from the LSI representation, usi...

66 | Overview of the first TREC conference
- Harman
- 1993
Citation Context ...lection or the remaining relevant documents in the collection that the sample is drawn from. This task is equivalent to the routing problem used for system evaluation at the TREC retrieval conference [3]. One can also imagine this task as the second stage in a retrieval algorithm in place of the traditional strategy of relevance feedback [4]. Since this approach starts with an information-rich en...

48 | Latent semantic indexing is an optimal special case of multidimensional scaling. SIGIR
- Bartell, Cottrell, et al.
- 1992
Citation Context ...f LSI does require an additional investment of storage and computing time. The disadvantages of LSI in terms of retrieval performance are difficult to quantify. Other useful references to LSI include [9, 10]. 2.3 Comparing LSI to the vector space model Can LSI provide better performance than the vector space model? Deerwester et al. obtain experimental results for two collections, MED and CISI, comparing...

44 | Using the cosine measure in a neural network for document
- Wilkinson, Hingston
- 1991
Citation Context ...tor space model. Even so, TDA is still much faster than many of the alternative strategies for incorporating term associations into the retrieval model such as current applications of neural networks [7, 17] or the generalized vector space model [18], since it is applied to such a low dimensional subspace. In addition, time is a much less important issue for the routing problem. Of course, the SVD of the...

17 | Computation of term association by a Neural Network
- Wong, Cai, et al.
- 1993
Citation Context ...ed in the reduced space in a way that reflects the correlations in their use across documents. Some alternative methods for incorporating term associations into a retrieval model are given by Wong in [7, 18]. It is very difficult to take advantage of term associations without dramatically increasing the computational requirements of the retrieval problem. While the LSI solution is difficult to compute fo...

13 | Using statistical testing in the evaluation of retrieval performance
- Hull
- 1993
Citation Context ...) precision-recall curve (avg. pr.curve), the average of precision evaluated at 120 documents retrieved (avg. precision) and the average of recall evaluated at 21-50 documents retrieved (avg. recall) [14]. These measures represent a number of different retrieval strategies that could be adopted by the user. The first scoring method is the traditional measure of evaluation in the IR community. The seco...

6 | Dimensions of meaning
- Schütze
- 1992
Citation Context ...reduce the quality of the search. An SVD of the term similarity matrix can be used in conjunction with cluster analysis to directly determine the sense of a particular word as demonstrated by Schutze [6]. The traditional vector space model assumes term independence and terms serve as the orthogonal basis vectors of the vector space. Since there are strong associations between terms in language, this ...

1 | The SMART retrieval system: Experiments in Automatic Document Processing
- Salton, editor
- 1971
Citation Context ...LSI slightly improves performance. However, when LSI is used in conjunction with statistical classification, there is a dramatic improvement in performance. 1 Introduction The vector space model (VSM) [1], which measures the similarity between the query and each document by the weighted inner product of overlapping terms, has long been a standard in information retrieval. The VSM has its flaws, since ...