Augmenting Naive Bayes Classifiers with Statistical Language Models (2003)

by Fuchun Peng, Dale Schuurmans, Shaojun Wang

Results 1 - 10 of 28

Building Bridges for Web Query Classification

by Dou Shen, Jian-Tao Sun, Qiang Yang, Zheng Chen, 2006
Cited by 39 (8 self)
Web query classification (QC) aims to classify Web users’ queries, which are often short and ambiguous, into a set of target categories. QC has many applications including page ranking in Web search, targeted advertisement in response to queries, and personalization. In this paper, we present a novel approach for QC that outperforms the winning solution of the ACM KDDCUP 2005 competition, whose objective is to classify 800,000 real user queries. In our approach, we first build a bridging classifier on an intermediate taxonomy in an offline mode. This classifier is then used in an online mode to map user queries to the target categories via the above intermediate taxonomy. A major innovation is that by leveraging the similarity distribution over the intermediate taxonomy, we do not need to retrain a new classifier for each new set of target categories, and therefore the bridging classifier needs to be trained only once. In addition, we introduce category selection as a new method for narrowing down the scope of the intermediate taxonomy based on which we classify the queries. Category selection can improve both efficiency and effectiveness of the online classification. By combining our algorithm with the winning solution of KDDCUP 2005, we made an improvement by 9.7% and 3.8% in terms of precision and F1 respectively compared with the best results of KDDCUP 2005.

Spam filtering using statistical data compression models

by Andrej Bratko, Gordon V. Cormack, David R, Bogdan Filipič, Philip Chan, Thomas R. Lynam - Journal of Machine Learning Research, 2006
Cited by 33 (12 self)
Spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. In this paper, we investigate a novel approach to spam filtering based on adaptive statistical data compression models. The nature of these models allows them to be employed as probabilistic text classifiers based on character-level or binary sequences. By modeling messages as sequences, tokenization and other error-prone preprocessing steps are omitted altogether, resulting in a method that is very robust. The models are also fast to construct and incrementally updateable. We evaluate the filtering performance of two different compression algorithms: dynamic Markov compression and prediction by partial matching. The results of our empirical evaluation indicate that compression models outperform currently established spam filters, as well as a number of methods proposed in previous studies.
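
The idea of using a compressor as a character-level classifier can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps in Python's general-purpose zlib (DEFLATE) compressor in place of dynamic Markov compression or prediction by partial matching, and the function names and the tiny training corpora are hypothetical. A message is assigned to the class whose training text lets it be compressed with the fewest additional bytes.

```python
import zlib
from typing import Dict


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify(message: str, corpora: Dict[str, str]) -> str:
    """Return the label whose training text best 'explains' the message.

    Assumption: zlib stands in for the adaptive compression models
    named in the abstract; this is an illustration only.
    """
    def extra_bytes(label: str) -> int:
        base = corpora[label]
        # Extra bytes needed to encode the message after the class corpus.
        return compressed_size(base + " " + message) - compressed_size(base)

    return min(corpora, key=extra_bytes)


# Hypothetical toy training data, one concatenated string per class.
corpora = {
    "spam": "win free money now claim your free prize click here free offer " * 5,
    "ham": "meeting agenda attached please review the project report before friday " * 5,
}

print(classify("claim your free prize click here", corpora))
print(classify("please review the project report before friday", corpora))
```

No tokenization is involved: the compressor works directly on the byte sequence, which is the property the abstract highlights.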

A Survey of Modern Authorship Attribution Methods

by Efstathios Stamatatos - Journal of the American Society for Information Science and Technology
Cited by 18 (0 self)
Authorship attribution supported by statistical or computational methods has a long history starting from the 19th century and marked by the seminal study of Mosteller and Wallace (1964) on the authorship of the disputed Federalist Papers. During the last decade, this scientific field has been developed substantially taking advantage of research advances in areas such as machine learning, information retrieval, and natural language processing. The plethora of available electronic texts (e.g., e-mail messages, online forum messages, blogs, source code, etc.) indicates a wide variety of applications of this technology provided it is able to handle short and noisy text from multiple candidate authors. In this paper, a survey of recent advances of the automated approaches to attributing authorship is presented examining their characteristics for both text representation and text classification. The focus of this survey is on computational requirements and settings rather than linguistic or literary issues. We also discuss evaluation methodologies and criteria for authorship attribution studies and list open questions that will attract future work in this area.

Spam filtering using compression models

by Andrej Bratko, Bogdan Filipič, 2005
Cited by 13 (2 self)
Spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. This paper summarizes our experiments for the TREC 2005 spam track, in which we consider the use of adaptive statistical data compression models for the spam filtering task. The nature of these models allows them to be employed as Bayesian text classifiers based on character sequences. Since messages are modeled as sequences of characters, tokenization and other error-prone preprocessing steps are omitted altogether, resulting in a method that is very robust. The models are also fast to construct and incrementally updateable. We present experimental results indicating that compression models perform well in comparison to established spam filters. We also show that the method is extremely robust to noise, which should make such filters difficult to defeat.

Wikipedia-based semantic interpretation for natural language processing

by Evgeniy Gabrilovich, Shaul Markovitch - J. Artif. Int. Res
Cited by 13 (3 self)
Adequate representation of natural language semantics requires access to vast amounts of common sense and domain-specific world knowledge. Prior work in the field was based on purely statistical techniques that did not make use of background knowledge, on limited lexicographic knowledge bases such as WordNet, or on huge manual efforts such as the CYC project. Here we propose a novel method, called Explicit Semantic Analysis (ESA), for fine-grained semantic interpretation of unrestricted natural language texts. Our method represents meaning in a high-dimensional space of concepts derived from Wikipedia, the largest encyclopedia in existence. We explicitly represent the meaning of any text in terms of Wikipedia-based concepts. We evaluate the effectiveness of our method on text categorization and on computing the degree of semantic relatedness between fragments of natural language text. Using ESA results in significant improvements over the previous state of the art in both tasks. Importantly, due to the use of natural concepts, the ESA model is easy to explain to human users.

Exploiting structural information for semi-structured document categorization

by Andrej Bratko, Bogdan Filipič - Information Processing & Management, 2004
Cited by 10 (1 self)
This paper examines several different approaches to exploiting structural information in semi-structured document categorization. The methods under consideration are designed for categorization of documents consisting of a collection of fields, or arbitrary tree-structured documents that can be adequately modeled with such a flat structure. The approaches range from trivial modifications of text modeling to more elaborate schemes, specifically tailored to structured documents. We combine these methods with three different text classification algorithms and evaluate their performance on four standard datasets containing different types of semi-structured documents. The best results were obtained with stacking, an approach in which predictions based on different structural components are combined by a meta classifier. A further improvement of this method is achieved by including the flat text model in the final prediction.

Using Query Contexts in Information Retrieval

by Jing Bai, Jian-yun Nie, Hugues Bouchard, Guihong Cao
Cited by 10 (2 self)
User query is an element that specifies an information need, but it is not the only one. Studies in literature have found many contextual factors that strongly influence the interpretation of a query. Recent studies have tried to consider the user’s interests by creating a user profile. However, a single profile for a user may not be sufficient for a variety of queries of the user. In this study, we propose to use query-specific contexts instead of user-centric ones, including context around query and context within query. The former specifies the environment of a query such as the domain of interest, while the latter refers to context words within the query, which is particularly useful for the selection of relevant term relations. In this paper, both types of context are integrated in an IR model based on language modeling. Our experiments on several TREC collections show that each of the context factors brings significant improvements in retrieval effectiveness.

Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization

by Evgeniy Gabrilovich, Shaul Markovitch
Cited by 9 (2 self)
Most existing methods for text categorization employ induction algorithms that use the words appearing in the training documents as features. While they perform well in many categorization tasks, these methods are inherently limited when faced with more complicated tasks where external knowledge is essential. Recently, there have been efforts to augment these basic features with external knowledge, including semi-supervised learning and transfer learning. In this work, we present a new framework for automatic acquisition of world knowledge and methods for incorporating it into the text categorization process. Our approach enhances machine learning algorithms with features generated from domain-specific and common-sense knowledge. This knowledge is represented by ontologies that contain hundreds of thousands of concepts, further enriched through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts that augment the bag of words used in simple supervised learning. Feature generation is accomplished through contextual analysis of document text, thus implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses two significant problems in natural language processing—synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the training documents alone. We applied our methodology using the Open Directory Project, the largest existing Web directory built by over 70,000 human editors. Experimental results over a range of datasets confirm improved performance compared to the bag of words document representation.

On compression-based text classification

by Yuval Marton, Ning Wu, Lisa Hellerstein - In Proc. ECIR-05, 300–314, 2005
Cited by 7 (0 self)
Compression-based text classification methods are easy to apply, requiring virtually no preprocessing of the data. Most such methods are character-based, and thus have the potential to automatically capture non-word features of a document, such as punctuation, word stems, and features spanning more than one word. However, compression-based classification methods have drawbacks (such as slow running time), and not all such methods are equally effective. We present the results of a number of experiments designed to evaluate the effectiveness and behavior of different compression-based text classification methods on English text. Among our experiments are some specifically designed to test whether the ability to capture non-word (including super-word) features causes character-based text compression methods to achieve more accurate classification.

Ensemble-based Author Identification Using Character N-grams

by Efstathios Stamatatos - In Proc. of the 3rd Int. Workshop on Text-based Information Retrieval
Cited by 5 (3 self)
This paper deals with the problem of identifying the most likely author of a text. Several thousands of character n-grams, rather than lexical or syntactic information, are used to represent the style of a text. Thus, the author identification task can be viewed as a single-label multiclass classification problem of high dimensional feature space and sparse data. In order to cope with such properties, we propose a suitable learning ensemble based on feature set subspacing. Performance results on two well-tested benchmark text corpora for author identification show that this classification scheme is quite effective, significantly improving the best reported results so far. Additionally, this approach is proved to be quite stable in comparison with support vector machines when using a limited number of training texts, a condition usually met in this kind of problem.
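
The character n-gram representation mentioned in this abstract can be sketched as follows. This covers only the feature extraction step, not the ensemble or the feature set subspacing, and the function names, the choice of n = 3, and the vocabulary size are assumptions made for illustration rather than details from the paper.

```python
from collections import Counter
from typing import List


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """All overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_vocabulary(texts: List[str], n: int = 3, size: int = 1000) -> List[str]:
    """The `size` most frequent n-grams across the training texts."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(char_ngrams(text, n))
    return [gram for gram, _ in counts.most_common(size)]


def to_vector(text: str, vocabulary: List[str], n: int = 3) -> List[float]:
    """Relative frequency of each vocabulary n-gram in text."""
    counts = Counter(char_ngrams(text, n))
    total = max(sum(counts.values()), 1)  # avoid division by zero on short texts
    return [counts[gram] / total for gram in vocabulary]


# Hypothetical usage: two short training texts and one text to represent.
vocabulary = build_vocabulary(["the cat sat on the mat", "the dog sat on the log"])
vector = to_vector("the cat sat on the log", vocabulary)
print(len(vocabulary), len(vector))
```

With thousands of n-grams in the vocabulary, most entries of such a vector are zero for any single text, which matches the high-dimensional, sparse setting the abstract describes.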