Machine Learning in Automated Text Categorization
ACM Computing Surveys, 2002
Cited by 838 (13 self)
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting of the manual definition of a classifier by domain experts) are very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.
Dependency tree kernels for relation extraction
In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), 2004
Cited by 132 (2 self)
We extend previous work on tree kernels to estimate the similarity between the dependency trees of sentences. Using this kernel within a Support Vector Machine, we detect and classify relations between entities in the Automatic Content Extraction (ACE) corpus of news articles. We examine the utility of different features such as Wordnet hypernyms, parts of speech, and entity types, and find that the dependency tree kernel achieves a 20% F1 improvement over a “bag-of-words” kernel.
Text Categorization with Many Redundant Features: Using Aggressive Feature Selection to Make SVMs Competitive with C4.5
In ICML’04, 2004
Cited by 43 (4 self)
Text categorization algorithms usually represent documents as bags of words and consequently have to deal with huge numbers of features. Most previous studies found that the majority of these features are relevant for classification, and that the performance of text categorization with support vector machines peaks when no feature selection is performed.
Augmenting Naive Bayes Classifiers with Statistical Language Models
2003
Cited by 38 (0 self)
We augment naive Bayes models with statistical n-gram language models to address shortcomings of the standard naive Bayes text classifier. The result is a generalized naive Bayes classifier
Efficient phrase-based document indexing for Web document clustering
IEEE Transactions on Knowledge and Data Engineering, 2004
Cited by 31 (1 self)
Document clustering techniques mostly rely on single term analysis of the document data set, such as the Vector Space Model. To achieve more accurate document clustering, more informative features including phrases and their weights are particularly important in such scenarios. Document clustering is particularly useful in many applications such as automatic categorization of documents, grouping search engine results, building a taxonomy of documents, and others. This paper presents two key parts of successful document clustering. The first part is a novel phrase-based document index model, the Document Index Graph, which allows for incremental construction of a phrase-based index of the document set with an emphasis on efficiency, rather than relying on single-term indexes only. It provides efficient phrase matching that is used to judge the similarity between documents. The model is flexible in that it could revert to a compact representation of the vector space model if we choose not to index phrases. The second part is an incremental document clustering algorithm based on maximizing the tightness of clusters by carefully watching the pair-wise document similarity distribution inside clusters. The combination of these two components creates an underlying model for robust and accurate document similarity calculation that leads to much improved results in Web document clustering over traditional methods.
Combining Naive Bayes and n-Gram Language Models for Text Classification
In 25th European Conference on Information Retrieval Research (ECIR), 2003
Cited by 26 (2 self)
We augment the naive Bayes model with an n-gram language model to address two shortcomings of naive Bayes text classifiers.
Complex linguistic features for text classification: a comprehensive study
Proceedings of the 26th European Conference on Information Retrieval (ECIR), 2004
Cited by 22 (2 self)
Previous research on advanced representations for document retrieval has shown that statistical state-of-the-art models are not improved by a variety of different linguistic representations. Phrases, word senses and syntactic relations derived by Natural Language Processing (NLP) techniques were found ineffective at increasing retrieval accuracy. For Text Categorization (TC), fewer and less definitive studies on the use of advanced document representations are available, as it is a relatively new research area (compared to document retrieval). In this paper, advanced document representations have been investigated. Extensive experimentation on representative classifiers, Rocchio and SVM, as well as a careful analysis of the literature have been carried out to study how some NLP techniques used for indexing impact TC. Cross validation over 4 different corpora in two languages allowed us to gather overwhelming evidence that complex nominals, proper nouns and word senses are not adequate to improve TC accuracy.
Using Bigrams in Text Categorization
2003
Cited by 20 (2 self)
In the past decade a sufficient effort has been expended on attempting to come up with a document representation which is richer than the simple Bag-Of-Words (BOW). One of the widely explored approaches to enrich the BOW representation is in using n-grams (usually bigrams) of words in addition to (or in place of) single words (unigrams). After more than ten years of unsuccessful attempts to improve the text categorization results by applying bigrams, many researchers agree that there might be a certain limitation in usability of bigrams for text categorization. We analyze the related works and discuss possible reasons for this limitation. In addition, we demonstrate our own attempt to incorporate bigrams in a document representation based on distributional clusters of unigrams, and report (statistically insignificant) improvement to our baseline results on the 20 Newsgroups (20NG) dataset. Nevertheless, the reported result is (to our knowledge) the best categorization result ever achieved on this highly popular dataset.
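As a rough illustration of what enriching a bag-of-words representation with bigrams can look like, here is a minimal sketch. It is not the method of the paper above (which builds on distributional clusters of unigrams); the function name `bow_with_bigrams`, the whitespace tokenization, and the underscore-joined bigram keys are all assumptions made for this example.

```python
from collections import Counter


def bow_with_bigrams(text):
    """Count unigrams and adjacent-word bigrams in one feature bag.

    Assumption: lowercase + whitespace tokenization stands in for a real
    tokenizer; bigram features are keyed as "first_second".
    """
    tokens = text.lower().split()
    features = Counter(tokens)  # unigram counts
    # Each pair of adjacent tokens becomes one additional feature.
    features.update(f"{a}_{b}" for a, b in zip(tokens, tokens[1:]))
    return features


if __name__ == "__main__":
    print(bow_with_bigrams("New York is new"))
```

A document with k tokens contributes at most k - 1 bigram features, so the feature space grows quickly with vocabulary size; that growth is one reason bigram representations are usually paired with some form of feature selection.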
Language and Task Independent Text Categorization with Simple Language Models
In Proc. of HLT-NAACL ’03, 2003
Cited by 19 (0 self)
We present a simple method for language independent and task independent text categorization learning, based on character-level n-gram language models. Our approach uses simple information theoretic principles and achieves effective performance across a variety of languages and tasks without requiring feature selection or extensive pre-processing. To demonstrate the language and task independence of the proposed technique, we present experimental results on several languages (Greek, English, Chinese and Japanese) in several text categorization problems (language identification, authorship attribution, text genre classification, and topic detection). Our experimental results show that the simple approach achieves state-of-the-art performance in each case.
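The general idea of categorizing text with character-level n-gram language models can be sketched as follows: train one model per category and assign a new text to the category whose model gives it the highest likelihood. This is only a sketch under stated assumptions, not the authors' implementation. The class name `CharNgramClassifier`, the add-one smoothing, the space padding, and the omission of category priors are all choices made for this example; the abstract does not specify them.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return one n-gram per character, left-padded with spaces (assumption)."""
    padded = " " * (n - 1) + text
    return [padded[i:i + n] for i in range(len(text))]


class CharNgramClassifier:
    """One character n-gram language model per category (illustrative only)."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = {}    # label -> Counter of n-grams
        self.context_counts = {}  # label -> Counter of (n-1)-char contexts
        self.alphabet = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            ngrams = self.ngram_counts.setdefault(label, Counter())
            contexts = self.context_counts.setdefault(label, Counter())
            for gram in char_ngrams(text, self.n):
                ngrams[gram] += 1
                contexts[gram[:-1]] += 1
            self.alphabet.update(text)
        return self

    def log_likelihood(self, text, label):
        # Add-one smoothing (an assumption); +1 reserves mass for unseen characters.
        v = len(self.alphabet) + 1
        ngrams = self.ngram_counts[label]
        contexts = self.context_counts[label]
        return sum(
            math.log((ngrams[g] + 1) / (contexts[g[:-1]] + v))
            for g in char_ngrams(text, self.n)
        )

    def predict(self, text):
        # Pick the category whose model scores the text highest (no priors).
        return max(
            self.ngram_counts,
            key=lambda label: self.log_likelihood(text, label),
        )


if __name__ == "__main__":
    clf = CharNgramClassifier(n=3).fit(
        ["the cat sat on the mat", "der hund ist im haus"],
        ["en", "de"],
    )
    print(clf.predict("the house"))
```

Because the model works on raw characters, it needs no tokenizer or feature selection step, which fits the abstract's point about avoiding extensive pre-processing.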

