This paper reports a controlled study with statistical significance tests on five text categorization methods: Support Vector Machines (SVM), a k-Nearest Neighbor (kNN) classifier, a neural network (NNet) approach, the Linear Least-squares Fit (LLSF) mapping and a Naive Bayes (NB) classifier. We focus on the robustness of these methods in dealing with a skewed category distribution, and their performance as a function of the training-set category frequency. Our results show that SVM, kNN and LLSF significantly outperform NNet and NB when the number of positive training instances per category is small (fewer than ten), and that all the methods perform comparably when the categories are sufficiently common (over 300 instances).