## Spam filtering using statistical data compression models (2006)

### Download Links

- [www.jmlr.org]
- [jmlr.csail.mit.edu]
- [www2.in.tu-clausthal.de]
- [jmlr.org]
- DBLP

### Other Repositories/Bibliography

Venue: Journal of Machine Learning Research

Citations: 52 (12 self)

### BibTeX

@ARTICLE{Bratko06spamfiltering,
  author = {Andrej Bratko and Gordon V. Cormack and Bogdan Filipič and Thomas R. Lynam and Blaž Zupan},
  title = {Spam filtering using statistical data compression models},
  journal = {Journal of Machine Learning Research},
  year = {2006},
  volume = {7},
  pages = {2673--2698}
}

### Abstract

Spam filtering poses a special problem in text categorization: its defining characteristic is that filters face an active adversary that constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. In this paper, we investigate a novel approach to spam filtering based on adaptive statistical data compression models. The nature of these models allows them to be employed as probabilistic text classifiers based on character-level or binary sequences. By modeling messages as sequences, tokenization and other error-prone preprocessing steps are omitted altogether, resulting in a method that is very robust. The models are also fast to construct and incrementally updatable. We evaluate the filtering performance of two different compression algorithms: dynamic Markov compression and prediction by partial matching. The results of our empirical evaluation indicate that compression models outperform currently established spam filters, as well as a number of methods proposed in previous studies.
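The approach sketched in the abstract (train one compression model per class, then classify a message by which class's model compresses it best) can be illustrated with an off-the-shelf compressor standing in for the paper's adaptive DMC/PPM models; the zlib stand-in, the function names, and the corpora are assumptions made for illustration, not the authors' implementation:

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed representation, in bytes."""
    return len(zlib.compress(data, 9))

def classify(message: str, spam_corpus: str, ham_corpus: str) -> str:
    # Approximate the per-class code length of the message as the extra
    # bytes needed to compress it when appended to that class's training
    # text. The class whose model "explains" the message best wins.
    msg = message.encode("utf-8")
    spam = spam_corpus.encode("utf-8")
    ham = ham_corpus.encode("utf-8")
    extra_spam = compressed_size(spam + msg) - compressed_size(spam)
    extra_ham = compressed_size(ham + msg) - compressed_size(ham)
    return "spam" if extra_spam < extra_ham else "ham"
```

Because the compressor works on raw byte sequences, there is no tokenization step at all, which is the robustness property the abstract emphasizes.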

### Citations

1170 | Modeling by shortest data description
- Rissanen
- 1978
Citation Context ...y of each observed character by 1/2 occurrence and uses the gained probability mass for the escape probability. 4. Minimum Description Length Principle The minimum description length (MDL) principle (Rissanen, 1978; Barron et al., 1998; Grünwald, 2005) favors models that yield compact representations of the data. The traditional two-part MDL principle states that the preferred model results in the shortest desc...
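The two-part MDL principle quoted in this context (prefer the model giving the shortest description of the model plus the data given the model) can be sketched with a toy comparison between a parameter-free uniform model and an empirical order-0 character model; the function name and the flat 8-bit cost per parameter are assumptions made for illustration, not the paper's formulation:

```python
import math
from collections import Counter

def two_part_mdl(text: str, bits_per_param: float = 8.0) -> dict:
    # Two-part MDL: total description length = bits to encode the model
    # parameters + bits to encode the data under that model.
    symbols = Counter(text)
    n = len(text)
    k = len(symbols)

    # Candidate 1: uniform over the observed alphabet (no parameters).
    uniform_len = n * math.log2(k)

    # Candidate 2: empirical order-0 model (one parameter per symbol).
    order0_data = -sum(c * math.log2(c / n) for c in symbols.values())
    order0_len = k * bits_per_param + order0_data

    return {"uniform": uniform_len, "order0": order0_len}
```

On skewed data the empirical model pays for its parameters but saves far more on the data; on balanced data the parameter cost is pure overhead, so the uniform model is preferred.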

765 | A comparison of event models for naive bayes text classification
- McCallum, Nigam
Citation Context ...ption length principle. Although we are aware of no parallel to this in existing text classification research, the same approach could easily be adopted for the popular multinomial naive Bayes model (McCallum and Nigam, 1998) and possibly also for other incremental models. We believe this to be an interesting avenue for future research. The large memory requirements of compression models are a major disadvantage of this ...

471 | Making large-scale support vector machine learning practical
- Joachims
- 1999
Citation Context ...ng of naive Bayes and k-nearest neighbors (Sakkis et al., 2001). s-kNN ⋄ k-nearest neighbors with attribute and distance weighting (Sakkis et al., 2003). SVM ⋆ An adaptation of the SVM light package (Joachims, 1998) for the PU1 data set due to Tretyakov (2004), linear kernel with C = 1. Perceptron ⋆ Implementation of the perceptron algorithm due to Tretyakov (2004). Table 2: Reference systems and results of pre...

307 | The minimum description length principle in coding and modeling
- Barron, Rissanen, et al.
- 1998
Citation Context ...ed character by 1/2 occurrence and uses the gained probability mass for the escape probability. 4. Minimum Description Length Principle The minimum description length (MDL) principle (Rissanen, 1978; Barron et al., 1998; Grünwald, 2005) favors models that yield compact representations of the data. The traditional two-part MDL principle states that the preferred model results in the shortest description of the model ...

285 | Universal coding, information, prediction, and estimation
- Rissanen
- 1984
Citation Context ...ws that codes exist that achieve this bound. Two-part codes are universal, since only a finite code length is required to specify the model. It turns out that adaptive codes are also universal codes (Rissanen, 1984). In fact, adaptive compression algorithms exist that are proven to achieve Rissanen’s lower bound relative to the class of all finite-memory tree sources (e.g., Willems et al., 1995). The redundancy...

275 | Fisher information and stochastic complexity
- Rissanen
- 1996
Citation Context ... the cost of separately encoding the model in two-part codes. 4.2 Predictive MDL The limitations of the original two-part MDL principle were largely overcome with the modern version of the principle (Rissanen, 1996), which advocates the use of one-part universal codes for measuring description length relative to a chosen model class. The use of adaptive codes for this task is sometimes denoted predictive MDL an...

189 | The similarity metric
- Li, Chen, et al.
Citation Context ...ession algorithms in machine learning and data mining problems, in which data compression algorithms are most often used to produce a distance or (dis)similarity measure between pairs of data points (Li et al., 2003; Keogh et al., 2004; Sculley and Brodley, 2006). Most relevant to our study is the work of Frank et al. (2000), who first proposed the use of compression models for automated text categorization. The...

118 | Towards parameter-free data mining
- Keogh, Lonardi, et al.
- 2004
Citation Context ... in machine learning and data mining problems, in which data compression algorithms are most often used to produce a distance or (dis)similarity measure between pairs of data points (Li et al., 2003; Keogh et al., 2004; Sculley and Brodley, 2006). Most relevant to our study is the work of Frank et al. (2000), who first proposed the use of compression models for automated text categorization. They investigate using ...

111 | Unbounded length contexts for PPM
- Cleary, Teahan
- 1997
Citation Context ...ICTION BY PARTIAL MATCHING The prediction by partial matching (PPM) algorithm (Cleary and Witten, 1984) has set the standard for lossless text compression since its introduction over two decades ago (Cleary and Teahan, 1997). Essentially, the PPM algorithm is a back-off smoothing technique for finite-order Markov models, similar to back-off models used in natural language processing. It is convenient to assume that an o...

94 | Boosting trees for anti-spam e-mail filtering
- Carreras, Márquez
Citation Context .... b-Stack ⋄ Stacking of linear support vector machine classifiers built from different message fields (Bratko and Filipič, 2006). c-AdaBoost ⋄ Boosting of decision trees with real-valued predictions (Carreras and Márquez, 2001). gh-Bayes ⋄ Naive Bayes (exact model unknown) with weighting of training instances according to misclassification cost ratio (Hidalgo, 2002). gh-SVM ⋄ Linear support vector machine with weighting of...

82 | Language trees and zipping - Benedetto, Caglioti, et al. - 2002

79 | Data compression using dynamic Markov modelling
- Cormack, Horspool
- 1987
Citation Context ...or fast, incremental and robust learning algorithms. In this paper, we consider the use of adaptive data compression models for spam filtering. Specifically, we employ the dynamic Markov compression (Cormack and Horspool, 1987) and prediction by partial matching (Cleary and Witten, 1984) algorithms. Classification is done by first building two compression models from the training corpus, one from examples of spam and one f...

65 | An estimate of an upper bound for the entropy of English
- Brown, Pietra, et al.
- 1992
Citation Context ...approximated by applying the model M to sufficiently long sequences of symbols, with the expectation that these sequences are representative samples of all possible sequences generated by the source (Brown et al., 1992; Teahan, 2000): H(X,M) ≈ (1/|x|) L(x|M). (2) As |x| becomes large, this estimate will approach the actual cross-entropy in the limit almost surely if the source is ergodic (Algoet and Cover, 1988). Rec...
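Equation (2) in this excerpt estimates cross-entropy as per-symbol code length, H(X, M) ≈ L(x|M)/|x|. A minimal sketch of such an estimate, using an adaptive order-0 byte model in place of the paper's DMC/PPM models (the function names and Laplace smoothing are assumptions for illustration):

```python
import math
from collections import Counter

def code_length_bits(text: str) -> float:
    # Adaptive order-0 model with Laplace smoothing over a 256-byte
    # alphabet: each symbol is coded with -log2 p(symbol | counts so far),
    # then the count is updated, as an adaptive coder would do.
    counts = Counter()
    total = 0
    alphabet = 256
    bits = 0.0
    for b in text.encode("utf-8"):
        p = (counts[b] + 1) / (total + alphabet)
        bits += -math.log2(p)
        counts[b] += 1
        total += 1
    return bits

def cross_entropy_estimate(text: str) -> float:
    # H(X, M) ≈ L(x | M) / |x|, in bits per symbol.
    n = len(text.encode("utf-8"))
    return code_length_bits(text) / n
```

Highly regular sequences yield a low bits-per-symbol estimate, while varied sequences yield a high one, which is exactly what makes the quantity usable as a classification score.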

59 | A tutorial introduction to the minimum description length principle
- Grünwald
- 2005
Citation Context ...ccurrence and uses the gained probability mass for the escape probability. 4. Minimum Description Length Principle The minimum description length (MDL) principle (Rissanen, 1978; Barron et al., 1998; Grünwald, 2005) favors models that yield compact representations of the data. The traditional two-part MDL principle states that the preferred model results in the shortest description of the model and the data, gi...

53 | Complexity of strings in the class of Markov sources
- Rissanen
- 1986
Citation Context ...a universal code and the best model in the model class increases sublinearly with the length of the sequence. Rissanen gives a precise non-asymptotic lower bound on this difference in the worst case (Rissanen, 1986), which turns out to be linearly related to the complexity of the data generating process (in terms of the number of parameters). He also shows that codes exist that achieve this bound. Two-part code...

49 | The design and analysis of efficient lossless data compression systems
- Howard
- 1993
Citation Context ...texts (up to order k) of the current symbol. Many versions of the PPM algorithm exist, differing mainly in the way the escape probability is estimated. In our implementation, we used escape method D (Howard, 1993), which simply discounts the frequency of each observed character by 1/2 occurrence and uses the gained probability mass for the escape probability. 4. Minimum Description Length Principle The minimu...
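Escape method D, as described in this excerpt, discounts each observed character's frequency by 1/2 and assigns the freed probability mass to the escape event. A minimal sketch (the function name is an assumption; a real PPM coder applies this within each context):

```python
def escape_method_d(counts: dict) -> tuple:
    """Escape method D (PPMD): discount each seen symbol's count by 1/2.

    Returns (symbol_probs, escape_prob); the probabilities sum to 1.
    """
    total = sum(counts.values())
    # Each observed symbol loses half an occurrence of probability mass...
    probs = {s: (c - 0.5) / total for s, c in counts.items()}
    # ...and the freed mass (1/2 per distinct symbol) becomes the escape.
    escape = 0.5 * len(counts) / total
    return probs, escape
```

For counts {a: 3, b: 1} this gives p(a) = 2.5/4 and p(b) = 0.5/4, with an escape probability of 2/8 = 1/4, so the distribution still sums to one.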

47 | Augmenting Naive Bayes classifiers with statistical language models
- Peng, Schuurmans, et al.
- 2004
Citation Context ...t tokens are not considered independently. Their probability is always evaluated with respect to the local context, which was already found to be beneficial for word-level models in previous studies (Peng et al., 2004). Compression models offer the additional advantage over language models considered by Peng et al. (2004) in their effective strategy for adapting the model structure incrementally. By design, the co...

40 | A memory-based approach to anti-spam filtering for mailing lists
- Sakkis, Androutsopoulos, et al.
- 2003
Citation Context ...ines with linear kernels (Michelakis et al., 2004). s-Stack ⋄ Stacking of naive Bayes and k-nearest neighbors (Sakkis et al., 2001). s-kNN ⋄ k-nearest neighbors with attribute and distance weighting (Sakkis et al., 2003). SVM ⋆ An adaptation of the SVM light package (Joachims, 1998) for the PU1 data set due to Tretyakov (2004), linear kernel with C = 1. Perceptron ⋆ Implementation of the perceptron algorithm due to ...

37 | Stacking Classifiers for AntiSpam Filtering of E-mail
- Sakkis, Androutsopoulos, et al.
- 2001
Citation Context ...us heuristic scoring functions (Pampapathi et al., 2006). m-Filtron ⋄ Support vector machines with linear kernels (Michelakis et al., 2004). s-Stack ⋄ Stacking of naive Bayes and k-nearest neighbors (Sakkis et al., 2001). s-kNN ⋄ k-nearest neighbors with attribute and distance weighting (Sakkis et al., 2003). SVM ⋆ An adaptation of the SVM light package (Joachims, 1998) for the PU1 data set due to Tretyakov (2004), ...

36 | A statistical approach to the spam problem - Robinson - 2003

34 | TREC 2005 spam track overview - Cormack, Lynam - 2005

33 | An MDL framework for data clustering - Kontkanen, Myllymäki, et al. - 2006

28 | Learning to filter unsolicited commercial e-mail
- Androutsopoulos, Paliouras, et al.
- 2004
Citation Context ...l with binary features (Androutsopoulos et al., 2000). a-FlexBayes ⋄ Flexible naive Bayes—uses kernel density estimation for estimating class-conditional probabilities of continuous valued attributes (Androutsopoulos et al., 2004). a-LogitBoost ⋄ LogitBoost (variant of boosting) with decision stumps as base classifiers (Androutsopoulos et al., 2004). a-SVM ⋄ Linear kernel support vector machines (Androutsopoulos et al., 2004)...

26 | I.H.: Text categorization using compression models
- Frank, Chui, et al.
- 2000
Citation Context ...rate achieved using these two models on the target message determines the classification outcome. Two variants of the method with different theoretical underpinnings are evaluated. The first variant (Frank et al., 2000; Teahan, 2000) estimates the probability of a document using compression models derived from the training data, and assigns the class label based on the model that deems the target document most prob...

25 | Text classification and segmentation using minimum cross-entropy
- Teahan
Citation Context ...these two models on the target message determines the classification outcome. Two variants of the method with different theoretical underpinnings are evaluated. The first variant (Frank et al., 2000; Teahan, 2000) estimates the probability of a document using compression models derived from the training data, and assigns the class label based on the model that deems the target document most probable. The seco...

19 | Evaluating cost-sensitive unsolicited bulk email categorization
- Hidalgo
- 2002
Citation Context ... decision trees with real-valued predictions (Carreras and Márquez, 2001). gh-Bayes ⋄ Naive Bayes (exact model unknown) with weighting of training instances according to misclassification cost ratio (Hidalgo, 2002). gh-SVM ⋄ Linear support vector machine with weighting of training instances according to misclassification cost ratio (Hidalgo, 2002). h-Bayes ⋄ Multinomial naive Bayes (Hovold, 2005). ks-Bayes ⋄ M...

18 | A Learning-Based Anti-Spam Filter
- Michelakis
- 2004
Citation Context ...ern matching of character sequences based on the suffix tree data structure and various heuristic scoring functions (Pampapathi et al., 2006). m-Filtron ⋄ Support vector machines with linear kernels (Michelakis et al., 2004). s-Stack ⋄ Stacking of naive Bayes and k-nearest neighbors (Sakkis et al., 2001). s-kNN ⋄ k-nearest neighbors with attribute and distance weighting (Sakkis et al., 2003). SVM ⋆ An adaptation of the ...

11 | Hackers & Painters: Big Ideas from the Computer Age. O’Reilly - Graham - 2004

11 | Compression and machine learning: A new perspective on feature space vectors
- Sculley, Brodley
- 2006
Citation Context ... and data mining problems, in which data compression algorithms are most often used to produce a distance or (dis)similarity measure between pairs of data points (Li et al., 2003; Keogh et al., 2004; Sculley and Brodley, 2006). Most relevant to our study is the work of Frank et al. (2000), who first proposed the use of compression models for automated text categorization. They investigate using the prediction by partial m...

9 | Spam filtering using character-level Markov models: Experiments for the TREC 2005 spam track
- Bratko, Filipič
- 2005
Citation Context ...ld email collections in the framework of the 2005 Text REtrieval Conference (TREC). The results of this evaluation showed promise in the use of statistical data compression models for spam filtering (Bratko and Filipič, 2005). This article describes the methods used at TREC in greater detail and extends the TREC paper in a number of ways: We compare the use of adaptive and static models for classification, extend our ana...

9 | Naive Bayes spam filtering using word-position-based attributes
- Hovold
- 2005
Citation Context ...n cost ratio (Hidalgo, 2002). gh-SVM ⋄ Linear support vector machine with weighting of training instances according to misclassification cost ratio (Hidalgo, 2002). h-Bayes ⋄ Multinomial naive Bayes (Hovold, 2005). ks-Bayes ⋄ Multinomial naive Bayes (Schneider, 2003). p-Suffix ⋄ Pattern matching of character sequences based on the suffix tree data structure and various heuristic scoring functions (Pampapathi ...

8 | Stopping spam
- Goodman, Heckerman, et al.
- 2005
Citation Context ...ftware. Many different approaches for fighting spam have been proposed, ranging from various sender authentication protocols to charging senders indiscriminately, in money or computational resources (Goodman et al., 2005). A promising approach is the use of content-based filters, capable of discerning spam and legitimate email messages automatically. Machine learning methods are particularly attractive for this task,...

6 | Machine learning techniques in spam filtering - Tretyakov - 2004

5 | CRM114 versus Mr. X: CRM114 notes for the TREC 2005 Spam Track - Assis - 2005

3 | A TREC along the spam track with SpamBayes - Meyer - 2005

2 | DBACL at the TREC 2005
- Breyer
- 2005
Citation Context ...or TREC. Labeled “CRMSPAM2” at TREC 2005. dbacl a ⋆ Version 1.91, default parameters (http://dbacl.sourceforge.net). dbacl b • A custom version of dbacl prepared by the author for evaluation at TREC (Breyer, 2005). Labeled “lbSPAM2” at TREC 2005. SpamAssassin a ⋆ Version 3.0.2, combination of rule-based and learning components (http://spamassassin.apache.org). SpamAssassin b • Version 3.0.2, learning compone...

1 | An evaluation of naive Bayesian anti-spam filtering - Androutsopoulos, Koutsias, et al. - 2000

1 | Chung-Kwei: A pattern-discovery-based system for the automatic identification of unsolicited e-mail messages (spam) - Rigoutsos, Huang - 2004