A Comparative Study on Feature Selection in Text Categorization (1997) [565 citations — 8 self]

by Yiming Yang ,  Jan O. Pedersen

Abstract:

This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggressive dimensionality reduction. Five methods were evaluated, including term selection based on document frequency (DF), information gain (IG), mutual information (MI), a χ² test (CHI), and term strength (TS). We found IG and CHI most effective in our experiments. Using IG thresholding with a k-nearest neighbor classifier on the Reuters corpus, removal of up to 98% of unique terms actually yielded an improved classification accuracy (measured by average precision). DF thresholding performed similarly. Indeed we found strong correlations between the DF, IG and CHI values of a term. This suggests that DF thresholding, the simplest method with the lowest cost in computation, can be reliably used instead of IG or CHI when the computation of these measures is too expensive. TS compares favorably with the other methods with up to 50% vocabulary redu...
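
As a rough illustration of three of the scores named in the abstract (DF, CHI, and IG), the following Python sketch computes them from a 2x2 term/category contingency table. It uses the common textbook forms of the χ² statistic and of information gain for a single binary category; the abstract does not give the paper's exact definitions, so the formulas, the function names, and the toy corpus here are all assumptions made for illustration.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): count each term once per document
    return df


def contingency(docs, labels, term, category):
    """Return (a, b, c, d) for one term and one category.

    a: in category, contains term      b: outside category, contains term
    c: in category, lacks term         d: outside category, lacks term
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1
        elif has_term:
            b += 1
        elif in_cat:
            c += 1
        else:
            d += 1
    return a, b, c, d


def chi_square(a, b, c, d):
    """Textbook chi-square statistic for a 2x2 table (assumed form)."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def entropy(counts):
    """Shannon entropy in bits of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(-(k / total) * math.log2(k / total) for k in counts if k > 0)


def information_gain(a, b, c, d):
    """Entropy reduction for one binary category given term presence."""
    n = a + b + c + d
    prior = entropy([a + c, b + d])
    return (prior
            - ((a + b) / n) * entropy([a, b])
            - ((c + d) / n) * entropy([c, d]))


# Hypothetical four-document corpus, not data from the paper.
docs = [["wheat", "price", "rise"], ["wheat", "export"],
        ["oil", "price"], ["oil", "export", "rise"]]
labels = ["grain", "grain", "energy", "energy"]

df = document_frequency(docs)
wheat = contingency(docs, labels, "wheat", "grain")
price = contingency(docs, labels, "price", "grain")
print(df["wheat"], chi_square(*wheat), information_gain(*wheat))
print(df["price"], chi_square(*price), information_gain(*price))
```

In this toy corpus every term has the same document frequency, yet "wheat" separates the two categories perfectly while "price" carries no category information, so the CHI and IG scores differ sharply between them. Thresholding on any of these scores then amounts to keeping only the terms whose value exceeds a chosen cutoff.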

Citations

2526 Induction of decision trees – Quinlan - 1986
1636 Indexing by latent semantic analysis – Deerwester, Dumais, et al. - 1990
988 Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley – Salton - 1989
496 Accurate methods for the statistics of surprise and coincidence – Dunning - 1993
464 Word association norms, mutual information, and lexicography – Church, Hanks - 1989
346 An evaluation of statistical approaches to text categorization – Yang - 1999
299 Learning to filter netnews – Lang - 1995
259 Toward optimal feature selection – Koller, Sahami - 1996
213 A comparison of two learning algorithms for text categorization – Lewis, Ringuette - 1994
209 Training algorithms for linear text classifiers – Lewis, Schapire, et al. - 1996
157 OHSUMED: An interactive retrieval evaluation and new large test collection for research – Hersh, Buckley, et al. - 1994
134 A comparison of classifiers and document representations for the routing problem – Schütze, Hull, et al. - 1995
127 Expert network: Effective and efficient learning from human decisions in text categorization and retrieval – Yang - 1994
116 A neural network approach to topic spotting – Wiener, Pedersen, et al. - 1995
83 An example-based mapping method for text categorization and retrieval – Yang, Chute - 1994
77 The Transmission of Information – Fano - 1961
74 Towards language independent automated learning of text categorization models – Apté, Damerau, et al. - 1994
44 AIR/X - a rule-based multistage indexing system for large subject fields – Fuhr, Hartmann, et al. - 1991
44 Noise reduction in a statistical approach to text categorization – Yang - 1995
43 Automatic indexing based on Bayesian inference networks – Tzeras, Hartmann - 1993
37 Text categorization: a symbolic approach – Moulinier, Raškinis, et al. - 1996
18 Using corpus statistics to remove redundant words in text categorization – Yang, Wilbur - 1996
14 Trading MIPS and memory for knowledge engineering: classifying census returns on the Connection Machine – Creecy, Masand, et al. - 1992
14 The automatic identification of stop words – Wilbur, Sirotkin - 1992
11 Is learning bias an issue on the text categorization problem – Moulinier - 1997
8 Sampling strategies and learning efficiency in text categorization – Yang - 1996
3 Context-sensitive learning methods for text categorization – Cohen, Singer - 1996