Augmenting Naive Bayes Classifiers with Statistical Language Models
, 2003
Cited by 38 (0 self)
We augment naive Bayes models with statistical n-gram language models to address shortcomings of the standard naive Bayes text classifier. The result is a generalized naive Bayes classifier
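The abstract above is cut off before it describes the model, so the following is only a minimal sketch of the general idea: one n-gram language model per class, with a document assigned to the class whose model (plus class prior) gives it the highest log-probability. The class name `NGramClassifier`, the use of character n-grams, and the add-one smoothing are assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter, defaultdict


class NGramClassifier:
    """One add-one-smoothed character n-gram model per class (illustrative only)."""

    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = defaultdict(Counter)    # class -> n-gram -> count
        self.context_counts = defaultdict(Counter)  # class -> (n-1)-gram context -> count
        self.doc_counts = Counter()                 # class -> number of training documents
        self.vocab = set()                          # characters seen in training

    def _ngrams(self, text):
        # Pad on the left so every character has a full-length context.
        padded = "\x02" * (self.n - 1) + text
        return [padded[i:i + self.n] for i in range(len(text))]

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            self.vocab.update(text)
            for gram in self._ngrams(text):
                self.ngram_counts[label][gram] += 1
                self.context_counts[label][gram[:-1]] += 1
        return self

    def log_score(self, text, label):
        # log P(class) + sum of log P(char | previous n-1 chars, class)
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen characters
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for gram in self._ngrams(text):
            num = self.ngram_counts[label][gram] + 1
            den = self.context_counts[label][gram[:-1]] + v
            score += math.log(num / den)
        return score

    def predict(self, text):
        return max(self.doc_counts, key=lambda c: self.log_score(text, c))


clf = NGramClassifier(n=2).fit(["abababab", "xyxyxyxy"], ["A", "B"])
print(clf.predict("abab"))
```

With `n=1` the sketch reduces to a character-level multinomial naive Bayes model, which is one way to see the n-gram version as a generalization.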
N-gram-based author profiles for authorship attribution
, 2003
Cited by 34 (3 self)
We present a novel method for computer-assisted authorship attribution based on character-level n-gram author profiles, which is motivated by an almost-forgotten, pioneering method from 1976. Existing approaches to automated authorship attribution implicitly build author profiles as vectors of feature weights, as language models, or similar. Our approach is based on byte-level n-grams; it is language independent, and the generated author profiles are limited in size. The effectiveness of the approach and its language independence are demonstrated in experiments performed on English, Greek, and Chinese data. The accuracy of the results is at the level of current state-of-the-art approaches, or higher in some cases.
Key words: Authorship attribution, character n-grams, text categorization
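To make the idea of a size-limited, byte-level n-gram profile concrete, here is a minimal sketch. A profile is taken to be the `size` most frequent byte n-grams with their normalized frequencies, and a text is attributed to the author whose profile is least dissimilar. The function names, the default values of `n` and `size`, and the relative-difference dissimilarity are assumptions for illustration; the abstract itself does not specify them.

```python
from collections import Counter


def build_profile(data: bytes, n: int = 3, size: int = 100) -> dict:
    """Return the `size` most frequent byte n-grams with normalized frequencies."""
    grams = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(grams.values())
    if total == 0:
        return {}
    return {g: c / total for g, c in grams.most_common(size)}


def dissimilarity(p: dict, q: dict) -> float:
    """Sum of squared relative frequency differences over the union of n-grams.

    Assumed measure: an n-gram missing from one profile counts as frequency 0.
    """
    total = 0.0
    for g in set(p) | set(q):
        f1, f2 = p.get(g, 0.0), q.get(g, 0.0)
        total += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return total


def attribute(text: bytes, author_profiles: dict, n: int = 3, size: int = 100):
    """Pick the author whose profile is closest to the profile of `text`."""
    profile = build_profile(text, n, size)
    return min(author_profiles, key=lambda a: dissimilarity(profile, author_profiles[a]))


profiles = {
    "a": build_profile(b"the quick brown fox jumps over the lazy dog " * 3),
    "b": build_profile(b"lorem ipsum dolor sit amet consectetur " * 3),
}
print(attribute(b"the lazy dog jumps", profiles))
```

Because the profile works on raw bytes, the same code applies unchanged to UTF-8 encoded Greek or Chinese text, which is the sense in which such a method is language independent.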
Language Independent Authorship Attribution using Character Level Language Models
, 2003
Cited by 15 (0 self)
We present a method for computer-assisted authorship attribution based on character-level n-gram language models.
Extracting key-substring-group features for text classification
- In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD ’06)
, 2006
Cited by 6 (0 self)
In many text classification applications, it is appealing to take every document as a string of characters rather than a bag of words. Previous research studies in this area mostly focused on different variants of generative Markov chain models. Although discriminative machine learning methods like Support Vector Machine (SVM) have been quite successful in text classification with word features, it is neither effective nor efficient to apply them straightforwardly taking all substrings in the corpus as features. In this paper, we propose to partition all substrings into statistical equivalence groups, and then pick those groups which are important (in the statistical sense) as features (named key-substring-group features) for text classification. In particular, we propose a suffix-tree-based algorithm that can extract such features in linear time (with respect to the total number of characters in the corpus). Our experiments on English, Chinese and Greek datasets show that SVM with key-substring-group features can achieve outstanding performance for various text classification tasks.
Title Similarity-Based Feature Weighting for Text Categorization
- Department of Computer Science, University of Alberta
Cited by 3 (0 self)
In automated text categorization, a system analyzes a natural-language document to decide whether it belongs in one or more of a group of pre-defined categories. The typical approach is to represent the documents using feature vectors, and inductively generate a classifier based on a training set of documents and their manually-assigned categories. Such a process ignores information on word order, syntax, and other heuristics that might aid in identifying good features for categorization. Recently, more attention has been paid to using deeper natural language processing techniques to improve the performance of the standard classifiers. One such approach, which takes advantage of a previously-generated thesaurus of lexical similarities, is studied in this project. This system identifies keywords in the text by looking for terms with high similarity to the terms in the title field. A database of automatically-clustered dependency-based word similarities is used to identify the similar words. Experiments show that increased weighting of key terms improves the effectiveness of text categorization for a number of topics in the standard Reuters newswire corpus.
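The weighting step described in this abstract can be sketched as follows: body terms that appear in the title, or that a similarity thesaurus rates as close to a title term, get their term frequency multiplied by a boost factor. The function name, the nested-dict thesaurus format, and the `boost` and `threshold` values are hypothetical; the abstract does not give the actual weighting scheme.

```python
from collections import Counter


def weighted_term_frequencies(title, body, similar, boost=2.0, threshold=0.1):
    """Boost the term frequency of body terms that are similar to title terms.

    `similar` is a hypothetical thesaurus: similar[term][other_term] -> score in [0, 1].
    """
    title_terms = set(title.lower().split())
    counts = Counter(body.lower().split())
    weights = {}
    for term, tf in counts.items():
        if term in title_terms:
            best = 1.0  # a title term is maximally similar to itself
        else:
            best = max(
                (similar.get(term, {}).get(t, 0.0) for t in title_terms),
                default=0.0,
            )
        weights[term] = tf * (boost if best >= threshold else 1.0)
    return weights


thesaurus = {"petroleum": {"oil": 0.3}}  # toy similarity entries, not real data
weights = weighted_term_frequencies(
    "Oil prices rise",
    "petroleum exports fell as petroleum demand dropped",
    thesaurus,
)
print(weights["petroleum"], weights["exports"])
```

The resulting weights would replace raw term frequencies in the feature vector passed to whichever standard classifier is used.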
Text Classification in Asian Languages without Word Segmentation
, 2003
Cited by 1 (0 self)
We present a simple approach for Asian language text classification without word segmentation, based on statistical n-gram language modeling. In particular, we examine Chinese and Japanese text classification.
Part-Of-Speech Enhanced Context
- In The 17th international conference on pattern recognition
, 2004
Language independent 'bag-of-words' representations are surprisingly effective for text classification. In this communication our aim is to elucidate the synergy between language independent features and simple language model features. We consider term tag features estimated by a so-called part-of-speech tagger. The feature sets are combined in an early binding design with an optimized binding coefficient that allows weighting of the relative variance contributions of the participating feature sets. With the combined features documents are classified using a latent semantic indexing representation and a probabilistic neural network classifier. Three medium size data-sets are analyzed and we find consistent synergy between the term and natural language features in all three sets for a range of training set sizes. The most significant enhancement is found for small text databases where high recognition rates are possible.
Ontology Driven Text Mining for Cost Management
In this article a semantic-based software system for the management and monitoring of enterprise purchase processes is described, and a paradigmatic case study (Creactive Consulting S.p.A., the company that developed the system) is presented. The system enables purchasing officers to search for products through a semantic-based engine and to navigate a semantic-based catalogue in order to electronically buy the most suitable (least expensive) products. This system is based on a domain-specific ontological model, developed according to a structured representation of purchasable items. In the following paragraphs some of the difficulties that were overcome will be described. In particular, the pre-analysis, through text-mining techniques, of a system of documents written in natural language (which is used to unveil concepts) and the definition of the notion of "functional equivalence" between items (which is used to effectively compare products) will be analyzed in depth.
Danish Natural Language Processing in Automated Categorization
In this work, we investigate possible benefits of natural language processing tools as a means to support automated text categorization. Our corpus consists of a small collection of categorized Danish web pages in the fields of art, architecture, and design. The natural language processing techniques we examine are stop word removal, removal of function words, and lemmatization. The tools are based on a stop word list, a part-of-speech tagger, and a dictionary. We evaluate the effects on a string matching classifier and a support vector machine. The classification accuracy increases when using the lemmata, either in addition to or replacing the original inflected words in the documents. Positive effects are seen on both precision and recall. In the absence of lemmatization, the removal of stop words increases classifier performance, although not as much. The results hold for both support vector machine and string matching categorization.
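The two preprocessing variants mentioned in this abstract, lemmata replacing the inflected words versus lemmata added alongside them, can be sketched as below. The function name, the `mode` parameter, and the tiny Danish stop word list and lemma dictionary are illustrative assumptions, not the resources used in the work.

```python
def preprocess(tokens, stop_words, lemmas, mode="replace"):
    """Drop stop words, then either replace tokens by lemmata or add lemmata.

    mode="replace": each token is replaced by its lemma.
    mode="add":     the inflected token is kept and its lemma is appended after it.
    """
    if mode not in ("replace", "add"):
        raise ValueError(f"unknown mode: {mode!r}")
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in stop_words:
            continue
        lemma = lemmas.get(tok, tok)  # unknown words are their own lemma
        if mode == "replace":
            out.append(lemma)
        else:
            out.append(tok)
            if lemma != tok:
                out.append(lemma)
    return out


# Toy resources for illustration only.
stop_words = {"og", "er"}
lemmas = {"husene": "hus", "haverne": "have", "store": "stor"}
tokens = ["Husene", "og", "haverne", "er", "store"]

print(preprocess(tokens, stop_words, lemmas, mode="replace"))
print(preprocess(tokens, stop_words, lemmas, mode="add"))
```

The output token list would then feed whichever classifier is being evaluated, for instance as bag-of-words counts for a support vector machine.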
A New Text Mining Approach Based on HMM-SVM for Web News Classification
Since the emergence of the WWW, it has become essential to handle very large amounts of electronic data, the majority of which is in the form of text. This scenario can be effectively handled by various data mining techniques. This paper proposes an intelligent system for online news classification based on a Hidden Markov Model (HMM) and a Support Vector Machine (SVM). The system is designed to extract keywords from online newspaper content and classify the content according to predefined categories. Classification proceeds in three stages: (1) text pre-processing, (2) HMM-based feature extraction, and (3) classification using an SVM. Data have been collected for experimentation from The Hindu, The New Indian Express, Times of India, Business Line, and The Economic Times. The experimental results are based on the news categories sports, finance, and politics, with accuracies of 92.45%, 96.34%, and 90.76%, respectively. These results are very good compared to those of other text classification methods.

