Abstract:
This paper is a comparative study of text categorization methods. Fourteen methods are investigated, based on previously published results and newly obtained results from additional experiments. Corpus biases in commonly used document collections are examined using the performance of three classifiers. Problems in previously published experiments are analyzed, and the results of flawed experiments are excluded from the cross-method evaluation. As a result, eleven of the fourteen methods remain. A k-nearest neighbor (kNN) classifier was chosen as the performance baseline on several collections; on each collection, the performance scores of the other methods were normalized using the score of kNN. This provides a common basis for a global comparison of methods whose results are available only on individual collections. Widrow-Hoff, k-nearest neighbor, neural networks and the Linear Least Squares Fit mapping are the top-performing classifiers, while the Rocchio approaches had rela...
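The baseline normalization described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact normalization formula, so this assumes a simple ratio (each method's score divided by the kNN score on the same collection). The function name, method names, collection names, and all numbers below are hypothetical placeholders, not results from the paper.

```python
def normalize_to_baseline(scores, baseline="kNN"):
    """Divide every score on one collection by the baseline's score.

    Assumption: normalization is a plain ratio; the abstract only says
    scores were "normalized using the score of kNN".
    """
    base = scores[baseline]
    if base <= 0:
        raise ValueError("baseline score must be positive")
    return {method: score / base for method, score in scores.items()}


# Hypothetical scores, for illustration only (not taken from the paper).
# Each method was evaluated on a different collection, so the raw
# numbers are not directly comparable.
collection_a = {"kNN": 0.80, "method_X": 0.60}
collection_b = {"kNN": 0.50, "method_Y": 0.55}

# After normalization, each method is expressed relative to kNN on its
# own collection, which gives a shared scale across collections.
relative = {}
for scores in (collection_a, collection_b):
    relative.update(normalize_to_baseline(scores))

for method, value in sorted(relative.items()):
    print(f"{method}: {value:.2f} x kNN")
```

Under this assumed ratio, a value above 1 means the method outperformed kNN on its collection and a value below 1 means it underperformed, which is what allows methods tested on different collections to be placed on one scale.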