An Evaluation of Statistical Approaches to Text Categorization (1997) [346 citations — 14 self]

Abstract:

This paper is a comparative study of text categorization methods. Fourteen methods are investigated, based on previously published results and on new results from additional experiments. Corpus biases in commonly used document collections are examined using the performance of three classifiers. Problems in previously published experiments are analyzed, and the results of flawed experiments are excluded from the cross-method evaluation. As a result, eleven of the fourteen methods remain. A k-nearest neighbor (kNN) classifier was chosen as the performance baseline on several collections; on each collection, the performance scores of the other methods were normalized by the score of kNN. This provides a common basis for comparing methods whose results are available only on individual collections. Widrow-Hoff, k-nearest neighbor, neural networks and the Linear Least Squares Fit mapping are the top-performing classifiers, while the Rocchio approaches had rela...
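
The baseline normalization described in the abstract can be illustrated with a minimal sketch. It assumes that "normalized using the score of kNN" means dividing each method's score by the kNN score on the same collection; the abstract does not spell out the exact formula. The function name, collection names, method names and all numbers below are hypothetical placeholders, not results from the paper.

```python
def normalize_by_baseline(scores, baseline="kNN"):
    """Divide each method's score by the baseline's score on the same collection.

    `scores` maps collection name -> {method name -> raw score}.
    Collections without a usable baseline score are skipped, because their
    methods cannot be placed on the common scale.
    """
    normalized = {}
    for collection, by_method in scores.items():
        base = by_method.get(baseline)
        if not base:  # baseline missing or zero: nothing to normalize against
            continue
        normalized[collection] = {
            method: score / base for method, score in by_method.items()
        }
    return normalized


# Hypothetical placeholder scores, NOT taken from the paper.
example_scores = {
    "collection_A": {"kNN": 0.80, "method_X": 0.60},
    "collection_B": {"kNN": 0.50, "method_Y": 0.55},
}

normalized = normalize_by_baseline(example_scores)
for collection, by_method in normalized.items():
    print(collection, by_method)
```

Under this assumption the baseline always maps to 1.0 on every collection, so method_X and method_Y can be compared through their ratios to kNN even though they were never evaluated on the same collection.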

Citations

2526 Induction of decision trees – Quinlan - 1986
988 Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley – Salton - 1989
213 A comparison of two learning algorithms for text categorization – Lewis, Ringuette - 1994
209 Training algorithms for linear text classifiers – Lewis, Schapire, et al. - 1996
194 Context-sensitive learning methods for text categorization – Cohen, Singer - 1996
157 OHSUMED: An interactive retrieval evaluation and new large test collection for research – Hersh, Buckley, et al. - 1994
128 Introduction to Modern Information Retrieval. McGraw-Hill Computer Science Series – Salton, McGill - 1983
127 Expert network: Effective and efficient learning from human decisions in text categorization and retrieval – Yang - 1994
116 A neural network approach to topic spotting – Wiener, Pedersen, et al. - 1995
89 Feature selection, perceptron learning, and a usability case study for text categorization – Ng, Goh, et al. - 1997
83 An example-based mapping method for text categorization and retrieval – Yang, Chute - 1994
74 Towards language independent automated learning of text categorization models – Apté, Damerau, et al. - 1994
57 Construe/TIS: a system for content-based indexing of a database of news stories – Hayes, Weinstein - 1990
55 Feature selection in statistical learning of text categorization – Yang, Pedersen - 1997
44 AIR/X - a rule-based multistage indexing system for large subject fields – Fuhr, Hartmann, et al. - 1991
44 Cluster-Based Text Categorization: A Comparison of Category Search Strategies – Iwayama, Tokunaga - 1995
44 Noise reduction in a statistical approach to text categorization – Yang - 1995
43 Automatic indexing based on Bayesian inference networks – Tzeras, Hartmann - 1993
37 Text categorization: a symbolic approach – Moulinier, Raškinis, et al. - 1996
31 Document filtering for fast ranking – Persin - 1994
18 A linear least squares fit mapping method for information retrieval from natural language texts – Yang, Chute - 1992
14 Trading MIPS and memory for knowledge engineering: classifying census returns on the Connection Machine – Creecy, Masand, et al. - 1992
11 Is learning bias an issue on the text categorization problem? – Moulinier - 1997
10 The design of a high performance information filtering system – Bell, Moffat - 1996