Abstract:
This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggressive dimensionality reduction. Five methods were evaluated: term selection based on document frequency (DF), information gain (IG), mutual information (MI), a χ² test (CHI), and term strength (TS). We found IG and CHI most effective in our experiments. Using IG thresholding with a k-nearest-neighbor classifier on the Reuters corpus, removal of up to 98% of unique terms actually yielded an improved classification accuracy (measured by average precision). DF thresholding performed similarly. Indeed, we found strong correlations between the DF, IG and CHI values of a term. This suggests that DF thresholding, the simplest method with the lowest cost in computation, can be reliably used instead of IG or CHI when the computation of these measures is too expensive. TS compares favorably with the other methods with up to 50% vocabulary redu...
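To make three of the scores named in the abstract concrete, the following is a minimal sketch of how DF, CHI, and IG could be computed for one term. It is not the authors' implementation. It assumes the standard textbook definitions (a χ² statistic over a 2×2 term/category contingency table, and information gain as the reduction in category entropy), restricted here to a single two-way category split. The toy corpus, the function names, and the category labels are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (tokens, category label) pairs.
docs = [
    (["wheat", "price", "export"], "grain"),
    (["wheat", "harvest"], "grain"),
    (["oil", "price"], "crude"),
    (["oil", "export", "barrel"], "crude"),
]

def document_frequency(docs):
    """DF: for each term, the number of documents that contain it."""
    df = Counter()
    for tokens, _label in docs:
        df.update(set(tokens))  # count each term once per document
    return df

def contingency(docs, term, category):
    """2x2 counts: a = term and category, b = term only,
    c = category only, d = neither."""
    a = b = c = d = 0
    for tokens, label in docs:
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1
        elif has_term:
            b += 1
        elif in_cat:
            c += 1
        else:
            d += 1
    return a, b, c, d

def chi_square(a, b, c, d):
    """CHI: N (ad - cb)^2 / ((a+c)(b+d)(a+b)(c+d)); 0 if a margin is empty."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom

def entropy(counts):
    """Shannon entropy in bits of a count vector."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((k / total) * math.log2(k / total) for k in counts if k > 0)

def information_gain(a, b, c, d):
    """IG (two-class case): category entropy minus the expected
    entropy after observing whether the term is present."""
    n = a + b + c + d
    return (entropy([a + c, b + d])
            - (a + b) / n * entropy([a, b])
            - (c + d) / n * entropy([c, d]))

df = document_frequency(docs)
for term in ("wheat", "price"):
    table = contingency(docs, term, "grain")
    print(term, df[term], chi_square(*table), information_gain(*table))
```

Thresholding then amounts to ranking terms by one of these scores and keeping only those above a cutoff. DF needs a single counting pass and no category labels, which is why it is the cheapest of the three. A four-document toy corpus says nothing about the correlations reported in the abstract.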