A comparison of classifiers and document representations for the routing problem (1995) [134 citations — 2 self]

by Hinrich Schütze, David A. Hull, Jan O. Pedersen
ANNUAL ACM CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL - ACM SIGIR

Abstract:

In this paper, we compare learning techniques based on statistical classification to traditional methods of relevance feedback for the document routing problem. We consider three classification techniques which have decision rules that are derived via explicit error minimization: linear discriminant analysis, logistic regression, and neural networks. We demonstrate that the classifiers perform 10-15% better than relevance feedback via Rocchio expansion for the TREC-2 and TREC-3 routing tasks. Error minimization is difficult in high-dimensional feature spaces because the convergence process is slow and the models are prone to overfitting. We use two different strategies, latent semantic indexing and optimal term selection, to reduce the number of features. Our results indicate that features based on latent semantic indexing are more effective for techniques such as linear discriminant analysis and logistic regression, which have no way to protect against overfitting. Neural networks perform equally well with either set of features and can take advantage of the additional information available when both feature sets are used as input.
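
To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It scores documents with a Rocchio-style profile, then with logistic regression trained on latent semantic indexing (LSI) features. The toy matrix `X`, the labels `y`, the helper names (`rocchio_profile`, `lsi_features`, `fit_logistic`), and the parameters `beta`, `gamma`, `k`, `lr`, and `steps` are all assumptions made for illustration. The paper's experiments use TREC data and also cover linear discriminant analysis, neural networks, and optimal term selection, none of which are shown here.

```python
import numpy as np

# Toy document-term counts (6 documents x 5 terms), invented for illustration.
X = np.array([
    [2, 1, 1, 0, 0],
    [1, 2, 0, 0, 0],
    [2, 2, 1, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 0, 1, 2],
    [0, 0, 1, 2, 2],
], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0], dtype=float)  # 1 = relevant, 0 = non-relevant


def rocchio_profile(X, y, beta=1.0, gamma=1.0):
    """Rocchio-style profile: relevant centroid minus non-relevant centroid.

    No initial query vector is used here (an assumption of this sketch).
    """
    return beta * X[y == 1].mean(axis=0) - gamma * X[y == 0].mean(axis=0)


def lsi_features(X, k):
    """Project documents onto the top-k right singular vectors of X."""
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:k].T


def fit_logistic(Z, y, lr=0.1, steps=2000):
    """Fit logistic regression by batch gradient descent on the mean log-loss."""
    w = np.zeros(Z.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))  # predicted relevance probability
        grad = p - y                             # gradient of log-loss w.r.t. the logit
        w -= lr * (Z.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


# Relevance feedback baseline: rank documents by similarity to the profile.
rocchio_scores = X @ rocchio_profile(X, y)

# Classifier route: reduce to k LSI dimensions, then minimize error explicitly.
Z = lsi_features(X, k=2)
w, b = fit_logistic(Z, y)
logistic_scores = 1.0 / (1.0 + np.exp(-(Z @ w + b)))
```

Reducing the feature space before fitting keeps the number of parameters small, which is the role the abstract assigns to LSI for classifiers that have no built-in protection against overfitting. On real routing data, the number of retained dimensions would need tuning.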

Citations

1636 Indexing by latent semantic analysis – Deerwester, Dumais, et al. - 1990
260 Okapi at TREC-3 – Robertson, Walker, et al. - 1992
219 Multi-paragraph segmentation of expository text – Hearst - 1994
213 A comparison of two learning algorithms for text categorization – Lewis, Ringuette - 1994
172 An evaluation of phrasal and clustered representations on a text categorization task – Lewis - 1992
125 The Effect of Adding Relevance Information in a Relevance Feedback Environment – Buckley, Salton, et al. - 1994
79 Latent semantic indexing (LSI) and TREC-2 – Dumais - 1994
79 Improving text retrieval for the routing problem using latent semantic indexing – Hull - 1994
74 Towards language independent automated learning of text categorization models – Apté, Damerau, et al. - 1994
70 Classifying news stories using memory based reasoning – Masand, Linoff, et al. - 1992
63 Methods for Statistical Data Analysis of Multivariate Observations – Gnanadesikan - 1977
62 Adaptive information retrieval: Using a connectionist representation to retrieve and learn about documents – Belew - 1989
56 Overview of the First TREC Conference – Harman - 1993
50 Full text retrieval based on probabilistic equations with coefficients fitted by logistic regression – Cooper, Chen, et al. - 1994
44 Large-scale sparse singular value computations – Berry - 1992
35 An Object-Oriented Architecture for Text Retrieval – Cutting, Pedersen, et al. - 1991
30 Foundations of Modern Analysis – Friedman - 1982
25 Experiments with a component theory of probabilistic information retrieval based on single terms as document components – Kwok - 1990
23 Probabilistic information retrieval as a combination of abstraction, inductive learning, and probabilistic assumptions – Fuhr, Pfeifer - 1994
15 Routing and Ad-Hoc Retrieval Evaluation using the INQUERY System – Croft, Callan, et al. - 1994
12 Using statistical testing in the evaluation of retrieval performance – Hull - 1993
6 Information Retrieval Using Statistical Classification – Hull - 1994
4 Optimum polynomial retrieval functions based on the probability ranking principle – Fuhr - 1989
2 Generalized Linear Models, chapter 4 – McCullagh, Nelder - 1989
1 Special issue on text categorization: Guest editorial – Lewis, Hayes - 1994