Results 1 - 10
of
11
Machine Learning in Automated Text Categorization
- ACM Computing Surveys
, 2002
"... The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this p ..."
Abstract
-
Cited by 839 (13 self)
- Add to MetaCart
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.
A Comparative Study on Feature Selection in Text Categorization
, 1997
"... This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggressive dimensionality reduction. Five methods were evaluated, including term selection based on document frequency (DF), information gain (IG), mutual information (MI), ..."
Abstract
-
Cited by 739 (11 self)
- Add to MetaCart
This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggressive dimensionality reduction. Five methods were evaluated, including term selection based on document frequency (DF), information gain (IG), mutual information (MI), a Ø 2 -test (CHI), and term strength (TS). We found IG and CHI most effective in our experiments. Using IG thresholding with a knearest neighbor classifier on the Reuters corpus, removal of up to 98% removal of unique terms actually yielded an improved classification accuracy (measured by average precision) . DF thresholding performed similarly. Indeed we found strong correlations between the DF, IG and CHI values of a term. This suggests that DF thresholding, the simplest method with the lowest cost in computation, can be reliably used instead of IG or CHI when the computation of these measures are too expensive. TS compares favorably with the other methods with up to 50% vocabulary redu...
An evaluation of statistical approaches to text categorization
- Journal of Information Retrieval
, 1999
"... Abstract. This paper focuses on a comparative evaluation of a wide-range of text categorization methods, including previously published results on the Reuters corpus and new results of additional experiments. A controlled study using three classifiers, kNN, LLSF and WORD, was conducted to examine th ..."
Abstract
-
Cited by 413 (16 self)
- Add to MetaCart
Abstract. This paper focuses on a comparative evaluation of a wide-range of text categorization methods, including previously published results on the Reuters corpus and new results of additional experiments. A controlled study using three classifiers, kNN, LLSF and WORD, was conducted to examine the impact of configuration variations in five versions of Reuters on the observed performance of classifiers. Analysis and empirical evidence suggest that the evaluation results on some versions of Reuters were significantly affected by the inclusion of a large portion of unlabelled documents, mading those results difficult to interpret and leading to considerable confusions in the literature. Using the results evaluated on the other versions of Reuters which exclude the unlabelled documents, the performance of twelve methods are compared directly or indirectly. For indirect compararions, kNN, LLSF and WORD were used as baselines, since they were evaluated on all versions of Reuters that exclude the unlabelled documents. As a global observation, kNN, LLSF and a neural network method had the best performance; except for a Naive Bayes approach, the other learning algorithms also performed relatively well.
Learning approaches for Detecting and Tracking News Events
- IEEE Intelligent Systems
, 1999
"... This paper studies the effective use of information retrieval and machine learning techniques in a new task, event detection and tracking. The objective is to automatically detect novel events from chronologically-ordered streams of news stories, and track events of interest over time. We extended e ..."
Abstract
-
Cited by 58 (6 self)
- Add to MetaCart
This paper studies the effective use of information retrieval and machine learning techniques in a new task, event detection and tracking. The objective is to automatically detect novel events from chronologically-ordered streams of news stories, and track events of interest over time. We extended existing supervised learning and unsupervised clustering algorithms to allow document classification based on both information content and temporal aspects of events. A task-oriented evaluation was conducted using Reuters and CNN news stories. We found agglomerative document clustering highly effective (82% in the F 1 measure) for retrospective event detection, and single-pass clustering with time windowing a better choice for on-line alerting of novel events. We also observed robust learning behavior for k-nearest neighbor (kNN) classification and a decision-tree approach in event tracking, under the difficult condition when the number of positive training examples is extremely small. 1 Intr...
A tutorial on automated text categorisation
- Proceedings of ASAI-99, 1st Argentinian Symposium on Artificial Intelligence, pages 7--35, Buenos Aires, AR
, 1999
"... The automated categorisation (or classification) of texts into topical categories has a long history, dating back at least to 1960. Until the late ’80s, the dominant approach to the problem involved knowledge-engineering automatic categorisers, i.e. manually building a set of rules encoding expert k ..."
Abstract
-
Cited by 27 (0 self)
- Add to MetaCart
The automated categorisation (or classification) of texts into topical categories has a long history, dating back at least to 1960. Until the late ’80s, the dominant approach to the problem involved knowledge-engineering automatic categorisers, i.e. manually building a set of rules encoding expert knowledge on how to classify documents. In the ’90s, with the booming production and availability of on-line documents, automated text categorisation has witnessed an increased and renewed interest. A newer paradigm based on machine learning has superseded the previous approach. Within this paradigm, a general inductive process automatically builds a classifier by “learning”, from a set of previously classified documents, the characteristics of one or more categories; the advantages are a very good effectiveness, a considerable savings in terms of expert manpower, and domain independence. In this tutorial we look at the main approaches that have been taken towards automatic text categorisation within the general machine learning paradigm. Issues of document indexing, classifier construction, and classifier evaluation, will be touched upon. 1 A definition of the text categorisation task
Machine Learning in Automated Text Categorisation
, 1999
"... The automated categorisation (or classification) of texts into topical categories has a long history, dating back at least to the early ’60s. Until the late ’80s, the most effective approach to the problem seemed to be that of manually building automatic classifiers by means of knowledgeengineering ..."
Abstract
-
Cited by 16 (1 self)
- Add to MetaCart
The automated categorisation (or classification) of texts into topical categories has a long history, dating back at least to the early ’60s. Until the late ’80s, the most effective approach to the problem seemed to be that of manually building automatic classifiers by means of knowledgeengineering techniques, i.e. manually defining a set of rules encoding expert knowledge on how to classify documents under a given set of categories. In the ’90s, with the booming production and availability of on-line documents, automated text categorisation has witnessed an increased and renewed interest, prompted by which the machine learning paradigm to automatic classifier construction has emerged and definitely superseded the knowledge-engineering approach. Within the machine learning paradigm, a general inductive process (called the learner) automatically builds a classifier (also called the rule, or the hypothesis) by “learning”, from a set of previously classified documents, the characteristics of one or more categories. The advantages of this approach are a very good effectiveness, a considerable savings in terms of expert manpower, and domain independence. In this survey we look at the main approaches that have been taken towards automatic text categorisation within the general machine learning paradigm. Issues pertaining to document indexing, classifier construction, and classifier evaluation, will be discussed in detail. A final section will be devoted to the techniques that have specifically been devised for an emerging application such as the automatic classification of Web pages into “Yahoo!-like ” hierarchically structured sets of categories.
Sampling Strategies and Learning Efficiency in Text Categorization
- Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access
, 1996
"... This paper studies training set sampling strategies in the context of statistical learning for text categorization. It is argued sampling strategies favoring common categories is superior to uniform coverage or mistake-driven approaches, if performance is measured by globally assessed precision and ..."
Abstract
-
Cited by 8 (2 self)
- Add to MetaCart
This paper studies training set sampling strategies in the context of statistical learning for text categorization. It is argued sampling strategies favoring common categories is superior to uniform coverage or mistake-driven approaches, if performance is measured by globally assessed precision and recall. The hypothesis is empirically validated by examining the performance of a nearest neighbor classifier on training samples drawn from a pool of 235,401 training texts with 29,741 distinct categories. The learning curves of the classifier are analyzed with respect to the choice of training resources, the sampling methods, the size, vocabulary and category coverage of a sample, and the category distribution over the texts in the sample. A nearly-optimal categorization performance of the classifier is achieved using a relatively small training sample, showing that statistical learning can be successfully applied to very large text categorization problems with affordable computation.
Combining Machine Learning and Hierarchical Structures for Text Categorization
, 2001
"... Of a thesis submitted in partial fulfillment of the requirements for the Interdisciplinary Studies-Ph.D. ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
Of a thesis submitted in partial fulfillment of the requirements for the Interdisciplinary Studies-Ph.D.
An Evaluation of Statistical Approaches to Text Categorization
- Journal of Information Retrieval
, 1999
"... This paper is a comparative study of text categorization methods. Fourteen methods are investigated, based on previously published results and newly obtained results from additional experiments. Corpus biases in commonly used document collections are examined using the performance of three classifie ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
This paper is a comparative study of text categorization methods. Fourteen methods are investigated, based on previously published results and newly obtained results from additional experiments. Corpus biases in commonly used document collections are examined using the performance of three classifiers. Problems in previously published experiments are analyzed, and the results of flawed experiments are excluded from the cross-method evaluation. As a result, eleven out of the fourteen methods are remained. A k-nearest neighbor (kNN) classifier was chosen for the performance baseline on several collections; on each collection, the performance scores of other methods were normalized using the score of kNN. This provides a common basis for a global observation on methods whose results are only available on individual collections. Widrow-Hoff, k-nearest neighbor, neural networks and the Linear Least Squares Fit mapping are the top-performing classifiers, while the Rocchio approaches had rela...
Virtual Attributes from Imperfect Domain Theories
"... This paper proposes to enhance similarity-based classification by virtual attributes from imperfect domain theories. We analyze how properties of the domain theory, such as partialness and vagueness, influence classification accuracy. Experiments in a simple domain suggest that partial knowledge ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
This paper proposes to enhance similarity-based classification by virtual attributes from imperfect domain theories. We analyze how properties of the domain theory, such as partialness and vagueness, influence classification accuracy. Experiments in a simple domain suggest that partial knowledge is more useful than vague knowledge. However, for data sets from the UCI Machine Learning repository, we show that vague domain knowledge which in isolation performs at chance level can substantially increase classification accuracy when being incorporated into similarity-based classification.

