The automated categorization (or classification) of texts into predefined categories has attracted rapidly growing interest over the last ten years, owing to the increased availability of documents in digital form and the ensuing need to organize them. In the research community, the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning the characteristics of the categories from a set of preclassified documents. The advantages of this approach over the knowledge engineering approach (in which domain experts define a classifier manually) are very good effectiveness, considerable savings in expert labor, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We discuss in detail issues pertaining to three problems: document representation, classifier construction, and classifier evaluation.
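
To make the three problems named in the abstract concrete, the following is a minimal sketch, not the survey's own method. It assumes a tiny invented corpus, a smoothed tf-idf weighting for document representation, a simplified centroid (Rocchio-style) classifier for classifier construction, and per-category precision and recall for classifier evaluation. All function names and data here are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: whitespace tokenization on lowercased text.
    return text.lower().split()


def fit_idf(docs):
    # Document representation, step 1: smoothed inverse document frequency.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}


def vectorize(doc, idf):
    # Document representation, step 2: length-normalized tf-idf vector.
    # Terms unseen during training are ignored.
    tf = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def train_centroids(docs, labels, idf):
    # Classifier construction: one averaged vector per category,
    # learned from preclassified documents.
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for doc, label in zip(docs, labels):
        for t, w in vectorize(doc, idf).items():
            sums[label][t] += w
    return {c: {t: w / counts[c] for t, w in vec.items()}
            for c, vec in sums.items()}


def classify(doc, centroids, idf):
    # Assign the category whose centroid has the largest dot product
    # with the document vector; ties break alphabetically.
    vec = vectorize(doc, idf)

    def score(category):
        return sum(w * centroids[category].get(t, 0.0) for t, w in vec.items())

    return max(sorted(centroids), key=score)


def precision_recall(predicted, actual, category):
    # Classifier evaluation: precision and recall for one category.
    pairs = list(zip(predicted, actual))
    tp = sum(p == category and a == category for p, a in pairs)
    fp = sum(p == category and a != category for p, a in pairs)
    fn = sum(p != category and a == category for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data, invented for this sketch.
train_docs = [
    "wheat corn harvest prices",
    "corn wheat export grain",
    "bank interest rates rise",
    "central bank cuts interest rates",
]
train_labels = ["grain", "grain", "finance", "finance"]
test_docs = ["grain harvest export", "bank rates"]
test_labels = ["grain", "finance"]

idf = fit_idf(train_docs)
centroids = train_centroids(train_docs, train_labels, idf)
predictions = [classify(d, centroids, idf) for d in test_docs]
grain_precision, grain_recall = precision_recall(predictions, test_labels, "grain")
```

The centroid classifier is used here only because it fits in a few lines; any other learner could be substituted at the classifier construction step without changing the representation or evaluation code.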