Abstract:
In this paper, we compare learning techniques based on statistical classification to traditional methods of relevance feedback for the document routing problem. We consider three classification techniques which have decision rules that are derived via explicit error minimization: linear discriminant analysis, logistic regression, and neural networks. We demonstrate that the classifiers perform 10-15% better than relevance feedback via Rocchio expansion for the TREC-2 and TREC-3 routing tasks.
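For readers unfamiliar with the baseline, the following is a minimal sketch of relevance feedback via Rocchio expansion, not the configuration used in the paper. The function names, the sparse dict representation, and the weights `alpha`, `beta`, and `gamma` are illustrative assumptions; the abstract does not give the paper's settings.

```python
from collections import defaultdict


def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Expand a query with judged documents (textbook Rocchio form).

    query and each document are sparse vectors: dict of term -> weight.
    alpha, beta, gamma are assumed example values, not taken from the paper.
    """
    expanded = defaultdict(float)
    for term, weight in query.items():
        expanded[term] += alpha * weight
    # Add the centroid of relevant documents, subtract that of nonrelevant ones.
    for docs, coeff in ((relevant, beta), (nonrelevant, -gamma)):
        if not docs:
            continue
        for doc in docs:
            for term, weight in doc.items():
                expanded[term] += coeff * weight / len(docs)
    # Terms whose weight ends up non-positive are dropped (a common convention).
    return {t: w for t, w in expanded.items() if w > 0}


def score(query, doc):
    """Inner-product score of a document against the (expanded) query."""
    return sum(w * doc.get(t, 0.0) for t, w in query.items())
```

Unlike the classifiers compared in the paper, this update sets the term weights by a fixed formula rather than by explicitly minimizing classification error on the training documents.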
Error minimization is difficult in high-dimensional feature spaces because the convergence process is slow and the models are prone to overfitting. We use two different strategies, latent semantic indexing and optimal term selection, to reduce the number of features. Our results indicate that features based on latent semantic indexing are more effective for techniques such as linear discriminant analysis and logistic regression, which have no way to protect against overfitting. Neural networks perform equally well with either set of features and can take advantage of the additional information available when both feature sets are used as input.
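As an illustration of feature reduction by term selection, the sketch below ranks terms by a chi-square test of association between term occurrence and relevance and keeps the top `k`. Treating chi-square as the selection criterion is an assumption made for this example, since the abstract does not say how the "optimal" terms are chosen; the function names and data layout are likewise hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 table.

    a: relevant docs containing the term      b: nonrelevant docs containing it
    c: relevant docs without the term         d: nonrelevant docs without it
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_terms(docs, labels, k):
    """Return the k terms most associated with relevance (assumed criterion).

    docs: list of sets of terms; labels: list of bools (True = relevant).
    """
    n_rel = sum(labels)
    n_non = len(labels) - n_rel
    vocab = set().union(*docs)
    scores = {}
    for term in vocab:
        a = sum(1 for doc, y in zip(docs, labels) if y and term in doc)
        b = sum(1 for doc, y in zip(docs, labels) if not y and term in doc)
        scores[term] = chi_square(a, b, n_rel - a, n_non - b)
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Note that the statistic is symmetric: a term that occurs mostly in nonrelevant documents scores as highly as one that occurs mostly in relevant documents, and a classifier can use both kinds of evidence. A term that appears in every document carries no information and scores zero.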