## An evaluation of statistical approaches to text categorization (1999)

Venue: Journal of Information Retrieval

Citations: 495 (21 self)

### BibTeX

@ARTICLE{Yang99anevaluation,
    author = {Yiming Yang},
    title = {An evaluation of statistical approaches to text categorization},
    journal = {Journal of Information Retrieval},
    year = {1999},
    volume = {1},
    pages = {67--88}
}

### Abstract

This paper focuses on a comparative evaluation of a wide range of text categorization methods, including previously published results on the Reuters corpus and new results of additional experiments. A controlled study using three classifiers, kNN, LLSF and WORD, was conducted to examine the impact of configuration variations in five versions of Reuters on the observed performance of classifiers. Analysis and empirical evidence suggest that the evaluation results on some versions of Reuters were significantly affected by the inclusion of a large portion of unlabelled documents, making those results difficult to interpret and leading to considerable confusion in the literature. Using the results evaluated on the other versions of Reuters, which exclude the unlabelled documents, the performance of twelve methods is compared directly or indirectly. For indirect comparisons, kNN, LLSF and WORD were used as baselines, since they were evaluated on all versions of Reuters that exclude the unlabelled documents. As a global observation, kNN, LLSF and a neural network method had the best performance; except for a Naive Bayes approach, the other learning algorithms also performed relatively well.

### Citations

3376 | Induction of decision trees
- Quinlan
- 1986
Citation Context ...E to other application domains would be costly and labor-intensive. Decision Tree (DTree) is a well-known machine learning approach to automatic induction of classification trees based on training data [16, 11]. Applied to text categorization, DTree algorithms are used to select informative words based on an information gain criterion, and predict categories of each document according to the occurrence of word... |

1213 | Automatic text processing: the transformation, analysis, and retrieval of information by computer
- Salton
- 1989
Citation Context ... learning compared to a non-learning approach. The conventional vector space model is used for representing documents and category names (each name is treated as a bag of words), and the SMART system [17] is used as the search engine. These classifiers can be divided into two types: independent binary classifiers or m-ary (m > 2) classifiers. Given a document, an independent binary classifier makes a YES/N... |

267 | A comparison of two learning algorithms for text categorization
- Lewis, Ringuette
- 1994
Citation Context ...ing number of statistical learning methods have been applied to this problem in recent years, including regression models [5, 26], nearest neighbor classifiers [4, 22], Bayesian probabilistic classifiers [19, 10, 12], decision trees [5, 10, 12], inductive rule learning algorithms [1, 3, 13], neural networks [21, 14] and on-line learning approaches [3, 9]. With more and more methods available, cross-method evaluation ... |

255 | OHSUMED: An interactive retrieval evaluation and new large test collection for research
- Hersh, Buckley, et al.
- 1994
Citation Context ...s of a classifier in text categorization when the number of categories is large and the average number of categories per document is small. This problem is further illustrated by the OHSUMED collection [7], which is another corpus commonly used in text categorization research. OHSUMED contains 233,445 documents indexed using 14,321 unique categories; there are about 13 categories per document on average... |

247 | Context-sensitive learning methods for text categorization
- Cohen, Singer
- 1999
Citation Context ...m in recent years, including regression models [5, 26], nearest neighbor classifiers [4, 22], Bayesian probabilistic classifiers [19, 10, 12], decision trees [5, 10, 12], inductive rule learning algorithms [1, 3, 13], neural networks [21, 14] and on-line learning approaches [3, 9]. With more and more methods available, cross-method evaluation becomes increasingly important to identify the state-of-the-art in text c... |

244 | Training algorithms for linear text classifiers
- Lewis, Schapire, et al.
- 1996
Citation Context ...hbor classifiers [4, 22], Bayesian probabilistic classifiers [19, 10, 12], decision trees [5, 10, 12], inductive rule learning algorithms [1, 3, 13], neural networks [21, 14] and on-line learning approaches [3, 9]. With more and more methods available, cross-method evaluation becomes increasingly important to identify the state-of-the-art in text categorization. However, without a unified methodology in empiric... |

194 | Overview of the third Text REtrieval Conference (TREC-3)
- Harman
- 1995
Citation Context ...ared by all the text categorization researchers, or if a controlled evaluation of a wide range of categorization methods were conducted, similar to the Text Retrieval Conference for document retrieval [4]. The reality, however, is still far from the ideal. Cross-method comparisons have often been attempted but only for two or three methods. The small scale of these experiments could lead to overly gen... |

185 | Introduction to Modern Information Retrieval
- Salton, McGill
- 1986
Citation Context ...categories above the decision threshold. For the global evaluation of a classifier on a collection of test documents, we adapt the procedure for the conventional interpolated 11-point average precision [18], as described below: For each document, compute the recall and precision at each position in the ranked list where a correct category is found. For each interval between recall thresholds of 0%, ... |

153 | Expert network: effective and efficient learning from human decisions in text categorisation and retrieval - Yang - 1994 |

152 | A neural network approach to topic spotting
- Wiener, Pedersen, et al.
- 1995
Citation Context ...g regression models [5, 26], nearest neighbor classifiers [4, 22], Bayesian probabilistic classifiers [19, 10, 12], decision trees [5, 10, 12], inductive rule learning algorithms [1, 3, 13], neural networks [21, 14] and on-line learning approaches [3, 9]. With more and more methods available, cross-method evaluation becomes increasingly important to identify the state-of-the-art in text categorization. However, w... |

116 | Feature selection, perceptron learning, and a usability case study for text categorisation
- Ng, Goh, et al.
- 1997
Citation Context ...g regression models [5, 26], nearest neighbor classifiers [4, 22], Bayesian probabilistic classifiers [19, 10, 12], decision trees [5, 10, 12], inductive rule learning algorithms [1, 3, 13], neural networks [21, 14] and on-line learning approaches [3, 9]. With more and more methods available, cross-method evaluation becomes increasingly important to identify the state-of-the-art in text categorization. However, w... |

114 | An example-based mapping method for text categorization and retrieval
- Yang, Chute
- 1994
Citation Context ...is the problem of assigning predefined categories to free text documents. A growing number of statistical learning methods have been applied to this problem in recent years, including regression models [5, 26], nearest neighbor classifiers [4, 22], Bayesian probabilistic classifiers [19, 10, 12], decision trees [5, 10, 12], inductive rule learning algorithms [1, 3, 13], neural networks [21, 14] and on-line learn... |

84 | Towards language independent automated learning of text categorization models
- Apte, Damerau, et al.
- 1994
Citation Context ...m in recent years, including regression models [5, 26], nearest neighbor classifiers [4, 22], Bayesian probabilistic classifiers [19, 10, 12], decision trees [5, 10, 12], inductive rule learning algorithms [1, 3, 13], neural networks [21, 14] and on-line learning approaches [3, 9]. With more and more methods available, cross-method evaluation becomes increasingly important to identify the state-of-the-art in text c... |

75 | CONSTRUE/TIS: A System for Content-Based Indexing of a Database of News Stories
- Hayes, Weinstein
- 1991
Citation Context ...1998 3 2. Classifiers Evaluated on Reuters 2.1. Classifiers We consider the text categorization systems whose results on the various versions of the Reuters corpus have been published in the literature [6, 10, 1, 21, 13, 3, 27, 14]. In addition to these results, we present new results of three systems. These systems are briefly described below, grouped roughly according to their theoretical foundations or technical characteri... |

60 | Feature selection in statistical learning of text categorization
- Yang, Pedersen
- 1997
Citation Context ...1998 3 2. Classifiers Evaluated on Reuters 2.1. Classifiers We consider the text categorization systems whose results on the various versions of the Reuters corpus have been published in the literature [6, 10, 1, 21, 13, 3, 27, 14]. In addition to these results, we present new results of three systems. These systems are briefly described below, grouped roughly according to their theoretical foundations or technical characteri... |

58 | Noise reduction in a statistical approach to text categorization
- Yang
- 1995
Citation Context ... where the main observations were that the performance of kNN is relatively stable for a large range of k values [22], and that satisfactory performance of LLSF depends on whether p is sufficiently large [23]. Given the large number of possible combinations of parameter values, exhaustive testing of all the combinations is neither practical nor necessary. We take a greedy-search strategy for parameter tuning... |

54 | Automatic indexing based on Bayesian inference networks
- Tzeras, Hartmann
- 1993
Citation Context ...ing number of statistical learning methods have been applied to this problem in recent years, including regression models [5, 26], nearest neighbor classifiers [4, 22], Bayesian probabilistic classifiers [19, 10, 12], decision trees [5, 10, 12], inductive rule learning algorithms [1, 3, 13], neural networks [21, 14] and on-line learning approaches [3, 9]. With more and more methods available, cross-method evaluation ... |

51 | Cluster-based text categorization: a comparison of category search strategies
- Iwayama, Tokunaga
- 1995
Citation Context ...n be reduced to the scaling problem in on-line document ranking, for which a number of techniques have been studied in the literature, including partial indexing and ranking [15, 2], document clustering [8], dimensionality reduction [27] and parallel computing [4]. 6. Conclusions The following conclusions are reached from this study: Comparative evaluation across methods and experiments is important for u... |

50 | AIR/X - a rule-based multistage indexing system for large subject fields
- Fuhr, Hartmann, et al.
- 1991
Citation Context ...is the problem of assigning predefined categories to free text documents. A growing number of statistical learning methods have been applied to this problem in recent years, including regression models [5, 26], nearest neighbor classifiers [4, 22], Bayesian probabilistic classifiers [19, 10, 12], decision trees [5, 10, 12], inductive rule learning algorithms [1, 3, 13], neural networks [21, 14] and on-line learn... |

44 | Text categorization: a symbolic approach
- Moulinier, Raškinis, et al.
- 1996
Citation Context ...m in recent years, including regression models [5, 26], nearest neighbor classifiers [4, 22], Bayesian probabilistic classifiers [19, 10, 12], decision trees [5, 10, 12], inductive rule learning algorithms [1, 3, 13], neural networks [21, 14] and on-line learning approaches [3, 9]. With more and more methods available, cross-method evaluation becomes increasingly important to identify the state-of-the-art in text c... |

44 | Document filtering for fast ranking - Persin - 1994 |

24 | A Linear Least Squares Fit Mapping Method for Information Retrieval from Natural Language Texts - Yang, Chute - 1992 |

20 | An evaluation of statistical approaches to MEDLINE indexing
- Yang
- 1996
Citation Context ...out 18,000 categories defined) in the National Library of Medicine. The OHSUMED collection has been used with the full range of categories (14,321 MeSH categories actually occurred) in some experiments [17], or with a subset of categories in the heart disease sub-domain (HD, 119 categories) in other experiments [8]. 3.2 Different versions Table 1 lists the different versions or subsets of Reuters and OHSUM... |

17 | Trading MIPS and memory for knowledge engineering: classifying census returns on the Connection Machine
- Creecy, Masand, et al.
- 1992
Citation Context ...ranking, for which a number of techniques have been studied in the literature, including partial indexing and ranking [15, 2], document clustering [8], dimensionality reduction [27] and parallel computing [4]. 6. Conclusions The following conclusions are reached from this study: Comparative evaluation across methods and experiments is important for understanding the state-of-the-art in text categorization.... |

14 | The design of a high performance information filtering system - Bell, Moffat - 1996 |

11 | Is learning bias an issue on the text categorization problem
- Moulinier
- 1997
Citation Context ...ing number of statistical learning methods have been applied to this problem in recent years, including regression models [5, 26], nearest neighbor classifiers [4, 22], Bayesian probabilistic classifiers [19, 10, 12], decision trees [5, 10, 12], inductive rule learning algorithms [1, 3, 13], neural networks [21, 14] and on-line learning approaches [3, 9]. With more and more methods available, cross-method evaluation ... |

7 | Expert network: Effective and efficient learning from human decisions in text categorization and retrieval
- Yang
- 1994
Citation Context ...tigations on suitable choices of these parameter values were reported in previous papers where the main observations were that the performance of kNN is relatively stable for a large range of k values [22], and that satisfactory performance of LLSF depends on whether p is sufficiently large [23]. Given the large number of possible combinations of parameter values, exhaustive testing of all the combination... |

4 | Document filtering for fast ranking
- Persin
- 1994
Citation Context ...he scaling problem in kNN can be reduced to the scaling problem in on-line document ranking, for which a number of techniques have been studied in the literature, including partial indexing and ranking [15, 2], document clustering [8], dimensionality reduction [27] and parallel computing [4]. 6. Conclusions The following conclusions are reached from this study: Comparative evaluation across methods and experi... |

4 | Context-sensitive learning methods for text categorization
- Cohen, Singer
- 1996
Citation Context ...ledge or training. The Reuters Apte set has the densest column where the results of eight systems are available. Although the document counts reported by different researchers are somewhat inconsistent [1, 2], the differences are relatively small compared to the size of the corpus (i.e., at most 21 miscounted out of over ten thousand training documents, and at most 7 miscounted out of over three thousa... |

2 | A linear least squares fit mapping method for information retrieval from natural language texts
- Yang, Chute
- 1992
Citation Context ... consequently, Rocchio does not perform well when the documents belonging to a category naturally form separate clusters. LLSF stands for Linear Least Squares Fit, a mapping approach developed by Yang [25]. A multivariate regression model is automatically learned from a training set of documents and their categories. The training data are represented in the form of input/output vector pairs where the i... |

2 | Une approche de la catégorisation de textes par l'apprentissage symbolique
- Moulinier
- 1996
Citation Context ...or Reuters news stories [6]. 2. Decision tree (DTree) algorithms for classification [9, 11]. 3. A naive Bayes model (NaiveBayes) for classification where word independence is assumed in category prediction [9, 10]. Table 1. Data collections examination using WORD, kNN and LLSF in category ranking Corpus Set UniqCate TrainDoc TestDos (labelled) WORD kNN LLSF CONSTRUE* 182 21,450 723 (80%) .28 .80 - CONSTRUE.2... |

1 | The design of a high performance information filtering system
- Bell, Moffat
- 1996
Citation Context ...he scaling problem in kNN can be reduced to the scaling problem in on-line document ranking, for which a number of techniques have been studied in the literature, including partial indexing and ranking [15, 2], document clustering [8], dimensionality reduction [27] and parallel computing [4]. 6. Conclusions The following conclusions are reached from this study: Comparative evaluation across methods and experi... |