Machine Learning in Automated Text Categorization (2002) [592 citations — 13 self]

Abstract:

The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting of the manual definition of a classifier by domain experts) are very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.
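As a rough illustration of the three problems named in the abstract, the following minimal sketch represents documents as bags of words, builds a classifier from preclassified documents, and evaluates it with precision and recall. It is not taken from the survey: the choice of multinomial naive Bayes with Laplace smoothing, the function names, and the toy training and test documents are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Document representation: a lowercase bag of whitespace-separated terms."""
    return text.lower().split()


def train(labeled_docs):
    """Classifier construction: collect per-category document and term counts."""
    doc_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        for term in tokenize(text):
            term_counts[category][term] += 1
            vocabulary.add(term)
    return doc_counts, term_counts, vocabulary


def classify(model, text):
    """Return the category with the highest naive Bayes log-score."""
    doc_counts, term_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in sorted(doc_counts):
        total_terms = sum(term_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)
        for term in tokenize(text):
            if term not in vocabulary:
                continue  # terms never seen in training carry no evidence
            # Laplace (add-one) smoothing avoids log(0) for unseen pairs.
            smoothed = (term_counts[category][term] + 1) / (
                total_terms + len(vocabulary)
            )
            score += math.log(smoothed)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


def precision_recall(model, test_docs, category):
    """Classifier evaluation: precision and recall for one category."""
    tp = fp = fn = 0
    for text, truth in test_docs:
        predicted = classify(model, text)
        if predicted == category and truth == category:
            tp += 1
        elif predicted == category:
            fp += 1
        elif truth == category:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical preclassified documents (invented for this sketch).
TRAINING = [
    ("wheat harvest grain export", "grain"),
    ("corn grain prices rise", "grain"),
    ("bank raises interest rates", "finance"),
    ("interest rates and bank loans", "finance"),
]
TEST = [
    ("grain export prices", "grain"),
    ("bank interest loans", "finance"),
]

model = train(TRAINING)
predictions = [classify(model, text) for text, _ in TEST]
```

Here `predictions` holds one predicted category per test document, and `precision_recall(model, TEST, "grain")` scores the classifier on a single category. A realistic setup would use a held-out test collection and a richer representation such as weighted terms.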

Citations

1055 Text categorization with support vector machines: Learning with many relevant features – Joachims - 1998
917 Term-weighting approaches in automatic text retrieval – Salton, Buckley - 1988
567 A comparative study on feature selection in text categorization – Yang, Pedersen - 1997
496 Text Classification from Labeled and Unlabelled Documents using EM – Nigam, McCallum, et al. - 2000
477 Irrelevant features and the subset selection problem – John, Kohavi, et al. - 1994
450 A re-examination of text categorization methods – Yang, Liu - 1999
413 Relevance weighting of search terms – Robertson, Jones - 1976
387 A vector space model for automatic indexing – Salton, Wong, et al. - 1975
366 Transductive inference for text classification using support vector machines – Joachims - 1999
364 On the optimality of the simple Bayesian classifier under zero-one loss – Domingos, Pazzani - 1997
350 Inductive learning algorithms and representations for text categorization – Dumais, Platt, et al. - 1998
348 An evaluation of statistical approaches to text categorization – Yang - 1999
303 Hierarchically classifying documents using very few words – Koller, Sahami - 1997
299 Learning to filter netnews – Lang - 1995
284 A sequential algorithm for training text classifiers – Lewis, Gale - 1994
277 Information filtering and information retrieval: Two sides of the same coin – Belkin, Croft - 1992
271 BoosTexter: A boosting-based system for text categorization – Schapire, Singer - 2000
266 The discipline of machine learning – Mitchell - 2006
263 Document length normalization – Singhal, Salton, et al. - 1996
255 Enhanced hypertext categorization using hyperlinks – Chakrabarti, Dom, et al. - 1998
250 A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization – Joachims - 1997
223 Naive (Bayes) at forty: The independence assumption in information retrieval – Lewis - 1998
213 A comparison of two learning algorithms for text categorization – Lewis, Ringuette - 1994
209 Training algorithms for linear text classifiers – Lewis, Schapire, et al. - 1996
194 Context sensitive learning methods for text categorization – Cohen - 1999
186 Automated Learning of Decision Rules for Text Categorization – Apte, Damerau - 1994
179 Automatic word sense discrimination – Schütze - 1998
175 A method for disambiguating word senses in a large corpus – Gale, Church, et al. - 1993
172 An evaluation of phrasal and clustered representations on a text categorization task – Lewis - 1992
169 Improving text classification by shrinkage in a hierarchy of classes – McCallum, Rosenfeld, et al. - 1998
163 Hierarchical classification of web content – Dumais, Chen - 2000
162 Distributional clustering of words for text classification – Baker, McCallum - 1998
157 OHSUMED: An interactive retrieval evaluation and new large test collection for research – Hersh, Buckley, et al. - 1994
149 Heterogeneous uncertainty sampling for supervised learning – Lewis, Catlett - 1994
147 Information storage and retrieval – Korfhage - 1997
147 Employing EM in pool-based active learning for text classification – McCallum, Nigam - 1998
143 N-gram-based text categorization – Cavnar, Trenkle - 1994
135 Representation and learning in information retrieval – Lewis - 1992
135 A comparison of classifiers and document representations for the routing problem – Schütze, Hull, et al. - 1995
130 Learning to resolve natural language ambiguities: a unified approach – Roth - 1998
127 Expert network: Effective and efficient learning from human decisions in text categorization and retrieval – Yang - 1994
123 Learning to classify text from labeled and unlabeled documents – Nigam, McCallum, et al. - 1998
116 A neural network approach to topic spotting – Wiener, Pedersen, et al. - 1995
113 Error correlation and error reduction in ensemble classifiers – Tumer, Ghosh - 1996
110 Relevance: A review of and a framework for the thinking on the notion in information science – Saracevic - 1975
100 Information extraction as a basis for high-precision text classification – Riloff, Lehnert - 1994
96 Combining Classifiers in Text Categorization – Croft - 1996
90 A theoretical basis for the use of co-occurrence data in information retrieval – Rijsbergen - 1977
89 Feature selection, perceptron learning, and a usability case study for text categorization – Ng, Goh, et al. - 1997
85 Automatic detection of text genre – Kessler, Nunberg, et al. - 1997