Results 1 - 10 of 19
Wikipedia-based semantic interpretation for natural language processing
- J. Artif. Int. Res
Cited by 13 (3 self)
Adequate representation of natural language semantics requires access to vast amounts of common sense and domain-specific world knowledge. Prior work in the field was based on purely statistical techniques that did not make use of background knowledge, on limited lexicographic knowledge bases such as WordNet, or on huge manual efforts such as the CYC project. Here we propose a novel method, called Explicit Semantic Analysis (ESA), for fine-grained semantic interpretation of unrestricted natural language texts. Our method represents meaning in a high-dimensional space of concepts derived from Wikipedia, the largest encyclopedia in existence. We explicitly represent the meaning of any text in terms of Wikipedia-based concepts. We evaluate the effectiveness of our method on text categorization and on computing the degree of semantic relatedness between fragments of natural language text. Using ESA results in significant improvements over the previous state of the art in both tasks. Importantly, due to the use of natural concepts, the ESA model is easy to explain to human users.
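The idea of representing a text as a weighted vector over concepts, and comparing two texts by the similarity of those vectors, can be illustrated with a toy sketch. Everything specific here is an assumption for illustration, not the authors' implementation: the three one-line "articles" stand in for Wikipedia concepts, the weighting is a plain TF-IDF, and the function names `build_inverted_index`, `interpret`, and `relatedness` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stand-ins for concept articles (the abstract uses Wikipedia).
CONCEPTS = {
    "Banking": "bank money loan interest account",
    "River": "river water bank flow stream",
    "Football": "football goal team match player",
}

def build_inverted_index(concepts):
    """Map each word to {concept name: TF-IDF weight} (assumed weighting)."""
    term_counts = {name: Counter(text.lower().split())
                   for name, text in concepts.items()}
    doc_freq = Counter(word for counts in term_counts.values() for word in counts)
    n_concepts = len(concepts)
    index = defaultdict(dict)
    for name, counts in term_counts.items():
        for word, tf in counts.items():
            index[word][name] = tf * math.log(n_concepts / doc_freq[word])
    return index

def interpret(text, index):
    """Represent a text as the sum of the concept vectors of its words."""
    vector = Counter()
    for word in text.lower().split():
        for concept, weight in index.get(word, {}).items():
            vector[concept] += weight
    return vector

def relatedness(text_a, text_b, index):
    """Cosine similarity between the concept vectors of two texts."""
    va, vb = interpret(text_a, index), interpret(text_b, index)
    dot = sum(va[c] * vb[c] for c in va if c in vb)
    norm = (math.sqrt(sum(w * w for w in va.values()))
            * math.sqrt(sum(w * w for w in vb.values())))
    return dot / norm if norm else 0.0

index = build_inverted_index(CONCEPTS)
same_topic = relatedness("loan interest", "money account", index)
other_topic = relatedness("loan interest", "goal team", index)
```

In this toy setup the two texts about banking share no words yet map to the same concept, so they score as more related than the banking and football pair.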
Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization
Cited by 9 (2 self)
Most existing methods for text categorization employ induction algorithms that use the words appearing in the training documents as features. While they perform well in many categorization tasks, these methods are inherently limited when faced with more complicated tasks where external knowledge is essential. Recently, there have been efforts to augment these basic features with external knowledge, including semi-supervised learning and transfer learning. In this work, we present a new framework for automatic acquisition of world knowledge and methods for incorporating it into the text categorization process. Our approach enhances machine learning algorithms with features generated from domain-specific and common-sense knowledge. This knowledge is represented by ontologies that contain hundreds of thousands of concepts, further enriched through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts that augment the bag of words used in simple supervised learning. Feature generation is accomplished through contextual analysis of document text, thus implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses two significant problems in natural language processing—synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the training documents alone. We applied our methodology using the Open Directory Project, the largest existing Web directory built by over 70,000 human editors. Experimental results over a range of datasets confirm improved performance compared to the bag of words document representation.
Finding Deceptive Opinion Spam by Any Stretch of the Imagination
Cited by 9 (0 self)
Consumers increasingly rate, review and research products online (Jansen, 2010; Litvin et al., 2008). Consequently, websites containing consumer reviews are becoming targets of opinion spam. While recent work has focused primarily on manually identifiable instances of opinion spam, in this work we study deceptive opinion spam: fictitious opinions that have been deliberately written to sound authentic. Integrating work from psychology and computational linguistics, we develop and compare three approaches to detecting deceptive opinion spam, and ultimately develop a classifier that is nearly 90% accurate on our gold-standard opinion spam dataset. Based on feature analysis of our learned models, we additionally make several theoretical contributions, including revealing a relationship between deceptive opinions and imaginative writing.
Combining email models for false positive reduction
- In KDD ’05: Proceedings of the 11th ACM SIGKDD international conference on Knowledge discovery in data mining, 2005
Cited by 8 (2 self)
Machine learning and data mining can be effectively used to model, classify and discover interesting information for a wide variety of data including email. The Email Mining Toolkit, EMT, has been designed to provide a wide range of analyses for arbitrary email sources. Depending upon the task, one can usually achieve very high accuracy, but with some amount of false positive tradeoff. Generally false positives are prohibitively expensive in the real world. In the case of spam detection, for example, even if one email is misclassified, this may be unacceptable if it is a very important email. Much work has been done to improve specific algorithms for the task of detecting unwanted messages, but less work has been reported on leveraging multiple algorithms and correlating models in this particular domain of email analysis. EMT has been updated with new correlation functions allowing the analyst to integrate a number of EMT’s user behavior models available in the core technology. We present results of combining classifier outputs for both improving accuracy and reducing false positives for the problem of spam detection. We apply these methods to a very large email data set and show results of different combination methods on these corpora. We introduce a new method to compare multiple and combined classifiers, and show how it differs from past work. The method analyzes the relative gain and maximum possible accuracy that can be achieved for certain combinations of classifiers to automatically choose the best combination. Categories & Subject Descriptors: H.3.3 [Information Search and Retrieval]: Retrieval models,
On compression-based text classification
- In Proc. ECIR-05, 300–314, 2005
Cited by 7 (0 self)
Compression-based text classification methods are easy to apply, requiring virtually no preprocessing of the data. Most such methods are character-based, and thus have the potential to automatically capture non-word features of a document, such as punctuation, word stems, and features spanning more than one word. However, compression-based classification methods have drawbacks (such as slow running time), and not all such methods are equally effective. We present the results of a number of experiments designed to evaluate the effectiveness and behavior of different compression-based text classification methods on English text. Among our experiments are some specifically designed to test whether the ability to capture non-word (including super-word) features causes character-based text compression methods to achieve more accurate classification.
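As a rough illustration of what a compression-based classifier can look like, the sketch below assigns a document to the class whose training text lets an off-the-shelf compressor encode the document in the fewest additional bytes. This is one common variant, not necessarily any of the methods the paper evaluates; the use of `zlib`, the function names, and the tiny training set are all assumptions made for the example.

```python
import zlib

def compressed_size(text):
    """Number of bytes zlib needs for the UTF-8 encoding of the text."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def classify(document, training):
    """Pick the class whose corpus makes the document cheapest to append.

    `training` maps a class label to a list of training documents.
    No tokenization is done: the compressor works on raw characters, so
    punctuation and multi-word patterns can contribute to the decision.
    """
    best_label, best_cost = None, None
    for label, docs in training.items():
        corpus = " ".join(docs)
        # Extra bytes needed to encode the document given the class corpus.
        cost = compressed_size(corpus + " " + document) - compressed_size(corpus)
        if best_cost is None or cost < best_cost:
            best_label, best_cost = label, cost
    return best_label

# Hypothetical toy training data.
training = {
    "sports": ["the team won the match with a late goal",
               "the striker scored twice in the final game"],
    "finance": ["the bank raised interest rates again this quarter",
                "the stock market fell sharply after the earnings report"],
}
label = classify("the striker scored a late goal in the final match", training)
```

Recompressing the whole class corpus for every document is simple but slow, which is in line with the running-time drawback the abstract mentions.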
Noisy sequence classification with smoothed Markov chains
- In Conférence francophone sur l’apprentissage automatique 2006 (CAp 2006), 2006
Cited by 3 (2 self)
This paper is concerned with sequence classification using Markov chains when classification noise is included in the learning data. These models offer a direct generalization of a Multinomial Naive Bayes classifier by taking into account dependences between successive events up to a certain history length. Our study shows that smoothed Markov chains are very robust to classification noise. The relation between classification accuracy and test set perplexity, often used to measure prediction quality, is discussed. The influence of varying the model order is also studied from an experimental viewpoint. Experiments are conducted on both a gender classification task based on the spelling of first names and a splicing region classification task in DNA sequences. The first set of experiments also illustrates the superiority of smoothed Markov chains over an automaton learning technique using boosting for classifying noisy sequences.
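A minimal sketch of classifying sequences with one smoothed Markov chain per class follows. It assumes additive (Laplace) smoothing and a uniform class prior, which may differ from the smoothing scheme studied in the paper; the class name `SmoothedMarkovChain`, the `^` start symbol, and the toy sequences are hypothetical. With `order=0` the model reduces to per-symbol frequencies, which is the Multinomial Naive Bayes case mentioned in the abstract.

```python
import math
from collections import defaultdict

class SmoothedMarkovChain:
    """Order-k Markov chain over symbols with additive smoothing (assumed)."""

    def __init__(self, order, alphabet, alpha=1.0):
        self.order = order
        self.alphabet = sorted(set(alphabet))
        self.alpha = alpha
        self.counts = defaultdict(lambda: defaultdict(int))

    def _padded(self, seq):
        # "^" is a start symbol so the first events also have a history.
        return "^" * self.order + seq

    def fit(self, sequences):
        for seq in sequences:
            padded = self._padded(seq)
            for i in range(self.order, len(padded)):
                history = padded[i - self.order:i]
                self.counts[history][padded[i]] += 1
        return self

    def log_likelihood(self, seq):
        padded = self._padded(seq)
        total = 0.0
        for i in range(self.order, len(padded)):
            history = padded[i - self.order:i]
            seen = self.counts.get(history, {})
            numerator = seen.get(padded[i], 0) + self.alpha
            denominator = sum(seen.values()) + self.alpha * len(self.alphabet)
            total += math.log(numerator / denominator)
        return total

def classify(seq, models):
    """Return the label whose chain gives the sequence the highest likelihood."""
    return max(models, key=lambda label: models[label].log_likelihood(seq))

# Hypothetical toy data: alternating versus repeating symbol patterns.
models = {
    "alternating": SmoothedMarkovChain(1, "ab").fit(["abababab", "babababa"]),
    "repeating": SmoothedMarkovChain(1, "ab").fit(["aaaabbbb", "bbbbaaaa"]),
}
prediction = classify("abab", models)
```

Smoothing keeps unseen transitions from receiving zero probability, so a few mislabeled training sequences shift the estimates only slightly instead of ruling a class out.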
An Adaptive Approach to Spam Filtering on a New Corpus
Cited by 3 (0 self)
Motivated by the absence of rigorous experimentation in the area of spam filtering using realistic email data, we present a newly-assembled corpus of genuine and unsolicited (spam) email, dubbed GenSpam, to be made publicly available.
Behavior-based Email Analysis with Application to Spam Detection
- 2006
Cited by 3 (0 self)
Email is the “killer network application”. Email is ubiquitous and pervasive. In a relatively short timeframe, the Internet has become irrevocably and deeply entrenched in our modern society primarily due to the power of its communication substrate linking people and organizations around the globe. Much work on email technology has focused on making email easy to use, permitting a wide variety of information and information types to be conveniently, reliably, and efficiently sent throughout the Internet. However, the analysis of the vast storehouse of email content accumulated or produced by individual users has received relatively little attention other than for specific tasks such as spam and virus filtering. As one paper in the literature puts it, “the state of the art is still a messy desktop” (Denning,
Session Boundary Detection for Association Rule Learning Using n-Gram Language Models
Cited by 3 (1 self)
We present a statistical method using n-gram language models to identify session boundaries in a large collection of Livelink log data.
Incremental Mining from News Streams
- Encyclopedia of Data Warehousing and Mining, Idea Group Inc, 2004
Cited by 2 (2 self)
With the rapid growth of the World Wide Web, Internet users are now

