We discuss two learning algorithms for text filtering: modified Rocchio and a boosting algorithm called AdaBoost. We show how both algorithms can be adapted to maximize any general utility matrix that associates a cost (or gain) with each pair of machine prediction and correct label. We first show that AdaBoost significantly outperforms another highly effective text filtering algorithm. We then compare AdaBoost and Rocchio on three large text filtering tasks. Overall, both algorithms are comparable and quite effective. AdaBoost produces better classifiers than Rocchio when the training collection contains a very large number of relevant documents. However, on these tasks, Rocchio runs much faster than AdaBoost.

1 Introduction

It is becoming increasingly hard to cope with the explosion of electronic information that is now available. Information filtering systems that automatically send articles of potential interest are becoming a necessity in this information age. Typically users i...
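To make the idea of a utility matrix from the abstract concrete, the following minimal Python sketch scores a set of filtering decisions against such a matrix and picks the score threshold with the highest total utility. It is an illustration only, not the paper's adaptation of AdaBoost or Rocchio: the matrix values, the function names `total_utility` and `best_threshold`, and the use of a single score threshold are all assumptions made for this example.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical utility matrix keyed by (prediction, label), where 1 means
# "relevant / send" and 0 means "non-relevant / withhold".
# The numeric values are illustrative assumptions, not taken from the paper.
UTILITY: Dict[Tuple[int, int], float] = {
    (1, 1): 3.0,   # sent a relevant document: gain
    (1, 0): -2.0,  # sent a non-relevant document: cost
    (0, 1): 0.0,   # withheld a relevant document
    (0, 0): 0.0,   # withheld a non-relevant document
}


def total_utility(
    predictions: Sequence[int],
    labels: Sequence[int],
    utility: Dict[Tuple[int, int], float] = UTILITY,
) -> float:
    """Sum the utility of each (prediction, label) pair."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    return sum(utility[(p, y)] for p, y in zip(predictions, labels))


def best_threshold(
    scores: Sequence[float],
    labels: Sequence[int],
    utility: Dict[Tuple[int, int], float] = UTILITY,
) -> Tuple[float, float]:
    """Return (threshold, utility) maximizing total utility on the given data.

    A document is predicted relevant when its score is >= threshold.
    Candidate thresholds are the observed scores plus +infinity
    (which corresponds to sending nothing).
    """
    candidates: List[float] = sorted(set(scores)) + [float("inf")]
    best: Tuple[float, float] = (candidates[0], float("-inf"))
    for t in candidates:
        preds = [1 if s >= t else 0 for s in scores]
        u = total_utility(preds, labels, utility)
        if u > best[1]:  # strict comparison keeps the lowest best threshold
            best = (t, u)
    return best


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.4, 0.2]
    labels = [1, 0, 1, 0]
    print(best_threshold(scores, labels))
```

Under these assumed values, sending a non-relevant document is penalized, so the chosen threshold balances the gain from relevant documents against that cost rather than simply minimizing classification error.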