Spam filtering using statistical data compression models
- Journal of Machine Learning Research
, 2006
Cited by 33 (12 self)
Spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. In this paper, we investigate a novel approach to spam filtering based on adaptive statistical data compression models. The nature of these models allows them to be employed as probabilistic text classifiers based on character-level or binary sequences. By modeling messages as sequences, tokenization and other error-prone preprocessing steps are omitted altogether, resulting in a method that is very robust. The models are also fast to construct and incrementally updateable. We evaluate the filtering performance of two different compression algorithms: dynamic Markov compression and prediction by partial matching. The results of our empirical evaluation indicate that compression models outperform currently established spam filters, as well as a number of methods proposed in previous studies.
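The idea of using a compressor as a classifier can be illustrated with a rough sketch. The abstract names dynamic Markov compression and prediction by partial matching; neither is in the Python standard library, so the sketch below swaps in zlib (DEFLATE) purely as a stand-in, and the function names and toy corpora are hypothetical. A message is assigned to the class whose training text lets it compress into the fewest additional bytes.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed form of data."""
    return len(zlib.compress(data, 9))


def extra_bytes(training: bytes, message: bytes) -> int:
    """Approximate cost of encoding message given the class's training text."""
    return compressed_size(training + message) - compressed_size(training)


def classify(message: str, corpora: dict) -> str:
    """Return the label whose training text compresses the message best.

    corpora maps a label (e.g. "spam") to that class's training text.
    No tokenization is performed; messages are treated as raw byte sequences.
    """
    msg = message.encode("utf-8")
    return min(
        corpora,
        key=lambda label: extra_bytes(corpora[label].encode("utf-8"), msg),
    )
```

Because the whole training text is recompressed for each message, this sketch is far slower than the incrementally updateable models the abstract describes; it only shows the decision rule.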
Spam filtering using compression models
, 2005
Cited by 13 (2 self)
Spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. This paper summarizes our experiments for the TREC 2005 spam track, in which we consider the use of adaptive statistical data compression models for the spam filtering task. The nature of these models allows them to be employed as Bayesian text classifiers based on character sequences. Since messages are modeled as sequences of characters, tokenization and other error-prone preprocessing steps are omitted altogether, resulting in a method that is very robust. The models are also fast to construct and incrementally updateable. We present experimental results indicating that compression models perform well in comparison to established spam filters. We also show that the method is extremely robust to noise, which should make such filters difficult to defeat.
Voting Experts: An Unsupervised Algorithm for Segmenting Sequences
, 2006
Cited by 9 (3 self)
We describe a statistical signature of chunks and an algorithm for finding chunks. While there is no formal definition of chunks, they may be reliably identified as configurations with low internal entropy or unpredictability and high entropy at their boundaries. We show that the log frequency of a chunk is a measure of its internal entropy. The Voting-Experts algorithm exploits the signature of chunks to find word boundaries in text from four languages and episode boundaries in the activities of a mobile robot.
Spam Filtering using Character-level Markov Models: Experiments for the TREC 2005 Spam Track
- In Proc. 14th Text REtrieval Conference (TREC 2005)
, 2005
Cited by 7 (3 self)
This paper summarizes our participation in the TREC 2005 spam track, in which we consider the use of adaptive statistical data compression models for the spam filtering task. The nature of these models allows them to be employed as Bayesian text classifiers based on character sequences. We experimented with two different compression algorithms under varying model parameters. All four filters that we submitted exhibited strong performance in the official evaluation, indicating that data compression models are well suited to the spam filtering problem.
On compression-based text classification
- In Proc. ECIR-05, 300–314
, 2005
Cited by 7 (0 self)
Compression-based text classification methods are easy to apply, requiring virtually no preprocessing of the data. Most such methods are character-based, and thus have the potential to automatically capture non-word features of a document, such as punctuation, word stems, and features spanning more than one word. However, compression-based classification methods have drawbacks (such as slow running time), and not all such methods are equally effective. We present the results of a number of experiments designed to evaluate the effectiveness and behavior of different compression-based text classification methods on English text. Among our experiments are some specifically designed to test whether the ability to capture non-word (including super-word) features causes character-based text compression methods to achieve more accurate classification.
Comparing Natural Language Identification Methods based on Markov Processes
- In: Slovko, International Seminar on Computer Treatment of Slavic and East European Languages
, 2007
Cited by 3 (0 self)
We explore and experiment with categorization-based methods for natural language identification. Two approaches to language identification based on Markov processes are compared; both methods treat the incoming text at the character level. We performed a series of experiments to verify the high precision of the selected methods on the language identification task and to compare them against each other. Experimental evaluation was based on the large-scale Multilingual Reuters Corpus, covering various European and Slavic languages. Our results show that both methods are comparable in the task of natural language identification, achieving recall as high as 99.75%.
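As a minimal illustration of character-level Markov language identification, the sketch below trains one first-order (bigram) character model per language and picks the language under which a text is most probable. The abstract does not specify the two methods compared, so the model order, the add-one smoothing, the `alphabet_size` default, and all function names here are assumptions made for the example.

```python
import math
from collections import Counter


def train_bigram_model(text: str):
    """Count character bigrams and their left contexts in a training text."""
    pairs = Counter(zip(text, text[1:]))
    contexts = Counter(text[:-1])
    return pairs, contexts


def log_likelihood(model, text: str, alphabet_size: int) -> float:
    """Log probability of text under the model, with add-one smoothing."""
    pairs, contexts = model
    total = 0.0
    for a, b in zip(text, text[1:]):
        # Unseen bigrams get a small non-zero probability via smoothing.
        total += math.log((pairs[(a, b)] + 1) / (contexts[a] + alphabet_size))
    return total


def identify_language(models: dict, text: str, alphabet_size: int = 256) -> str:
    """Return the language label whose model scores the text highest."""
    return max(
        models,
        key=lambda lang: log_likelihood(models[lang], text, alphabet_size),
    )
```

In practice each model would be trained on a large corpus per language; with `models = {"en": train_bigram_model(english_text), "de": train_bigram_model(german_text)}`, a call to `identify_language(models, query)` returns one of the two labels.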
Catching the drift: Using feature-free case-based reasoning for spam filtering
- In Procs. of the 7th International Conference on Case Based Reasoning
, 2007
Cited by 2 (0 self)
In this paper, we compare case-based spam filters, focusing on their resilience to concept drift. In particular, we evaluate how to track concept drift using a case-based spam filter that uses a feature-free distance measure based on text compression. In our experiments, we compare two ways to normalise such a distance measure, finding that the one proposed in [1] performs better. We show that a policy as simple as retaining misclassified examples has a hugely beneficial effect on handling concept drift in spam but, on its own, it results in the case base growing by over 30%. We then compare two different retention policies and two different forgetting policies (one a form of instance selection, the other a form of instance weighting) and find that they perform roughly as well as each other while keeping the case base size constant. Finally, we compare a feature-based textual case-based spam filter with our feature-free approach. In the face of concept drift, the feature-based approach requires the case base to be rebuilt periodically so that we can select a new feature set that better predicts the target concept. We find feature-free approaches to have lower error rates than their feature-based equivalents.
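A feature-free, compression-based distance of the kind mentioned above can be sketched as follows. The abstract does not say which two normalisations were compared, so this example assumes one widely used choice, the normalised compression distance (NCD), with zlib as the compressor; the nearest-neighbour helper, the names, and the `(text, label)` case-base layout are hypothetical.

```python
import zlib


def c(data: bytes) -> int:
    """Compressed length in bytes, used as a proxy for information content."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalised compression distance: near 0 for similar texts, near 1 otherwise."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = c(bx), c(by)
    return (c(bx + by) - min(cx, cy)) / max(cx, cy)


def nearest_label(message: str, case_base) -> str:
    """1-nearest-neighbour classification over (text, label) cases using NCD."""
    nearest_case = min(case_base, key=lambda case: ncd(message, case[0]))
    return nearest_case[1]
```

Retention and forgetting policies would then amount to adding cases to, or removing them from, `case_base`; no feature set needs to be reselected when the target concept drifts.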
Measuring Historical Word Sense Variation
Cited by 2 (1 self)
We describe here a method for automatically identifying word sense variation in a dated collection of historical books in a large digital library. By leveraging a small set of known translation book pairs to induce a bilingual sense inventory and labeled training data for a WSD classifier, we are able to automatically classify the Latin word senses in a 389 million word corpus and track the rise and fall of those senses over a span of two thousand years. We evaluate the performance of seven different classifiers both in a tenfold test on 83,892 words from the aligned parallel corpus and on a smaller, manually annotated sample of 525 words, measuring both the overall accuracy of each system and how well that accuracy correlates (via mean square error) to the observed historical variation.
A Comparison of Language Identification Approaches on Short, Query-Style Texts
Cited by 1 (0 self)
In a multi-language Information Retrieval setting, knowledge of the language of a user query is important for further processing. Hence, we compare the performance of some typical approaches for language detection on very short, query-style texts. The results show that an accuracy of more than 80% can already be achieved for single words; for slightly longer texts we even observed accuracy values close to 100%.
LINKING GENES IN LOCUSLINK/ENTREZ GENE TO MEDLINE CITATIONS WITH LINGPIPE
This paper demonstrates that the human-curated citations associated with genes in Entrez Gene (formerly LocusLink) provide an accurate method for tracking gene references through the biomedical literature as represented by MEDLINE. We show how the Entrez Gene citations for a gene can be used to build character language-model-based classifiers that pick out MEDLINE citations about that gene. This problem is harder than it may appear given the range of overlapping aliases and contexts. We use the language modeling and classification modules of LingPipe, a natural language processing toolkit distributed with source code.

