An evaluation of statistical approaches to text categorization (1999) [358 citations — 14 self]

Abstract:

This paper focuses on a comparative evaluation of a wide range of text categorization methods, including previously published results on the Reuters corpus and new results from additional experiments. A controlled study using three classifiers, kNN, LLSF and WORD, was conducted to examine the impact of configuration variations in five versions of Reuters on the observed performance of classifiers. Analysis and empirical evidence suggest that the evaluation results on some versions of Reuters were significantly affected by the inclusion of a large portion of unlabelled documents, making those results difficult to interpret and leading to considerable confusion in the literature. Using the results evaluated on the other versions of Reuters, which exclude the unlabelled documents, the performance of twelve methods is compared directly or indirectly. For indirect comparisons, kNN, LLSF and WORD were used as baselines, since they were evaluated on all versions of Reuters that exclude the unlabelled documents. As a global observation, kNN, LLSF and a neural network method had the best performance; except for a Naive Bayes approach, the other learning algorithms also performed relatively well.

Citations

2566 Induction of decision trees – Quinlan - 1986
992 Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer – Salton - 1989
213 A comparison of two learning algorithms for text categorization – Lewis, Ringuette - 1994
211 Training Algorithms for Linear Text Classifiers – Lewis - 1996
194 Context sensitive learning methods for text categorization – Cohen - 1999
158 Ohsumed: an interactive retrieval evaluation and new large test collection for research – Hersh, Buckley, et al. - 1994
130 Introduction to Modern Information Retrieval, McGraw-Hill – Salton, McGill - 1983
129 Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval – Yang - 1994
122 A neural network approach to topic spotting – Wiener, Pedersen, et al. - 1995
89 Feature selection, perceptron learning, and a usability case study for text categorization – Ng, Goh, et al. - 1997
85 An example-based mapping method for text categorization and retrieval – Yang, Chute - 1994
74 Towards language independent automated learning of text categorization models – Apte, Damerau, et al. - 1994
57 CONSTRUE/TIS: a system for content-based indexing of a database of news stories – Hayes, Weinstein - 1990
55 Feature selection in statistical learning of text categorization – Yang, Pedersen - 1997
45 Automatic indexing based on Bayesian inference networks – Tzeras, Hartmann - 1993
44 AIR/X - a rule-based multistage indexing system for large subject fields – Fuhr, Hartmann, et al. - 1991
44 Cluster-Based Text Categorization: A Comparison of Category Search Strategies – Iwayama, Tokunaga - 1995
44 Noise reduction in a statistical approach to text categorization – Yang - 1995
39 Text categorization: a symbolic approach – Moulinier, Raskinis, et al. - 1996
33 Document filtering for fast ranking – Persin - 1994
19 A Linear Least Squares Fit Mapping Method for Information Retrieval from Natural Language Texts – Yang, Chute - 1992
14 Trading MIPS and memory for knowledge engineering: classifying census returns on the Connection Machine – Creecy, Masand, et al. - 1992
11 Is learning bias an issue on the text categorization problem – Moulinier - 1997
10 The design of a high performance information filtering system – Bell, Moffat - 1996