## An evaluation of statistical spam filtering techniques (2004)

### Download Links

- [homepages.inf.ed.ac.uk]
- [www.mts.jhu.edu]
- [www.cis.uab.edu]
- DBLP

### Other Repositories/Bibliography

Venue: | ACM Transactions on Asian Language Information Processing (TALIP) |

Citations: | 8 - 0 self |

### BibTeX

@ARTICLE{Zhang04anevaluation,

author = {Le Zhang and Jingbo Zhu and Tianshun Yao},

title = {An evaluation of statistical spam filtering techniques},

journal = {ACM Transactions on Asian Language Information Processing (TALIP)},

year = {2004},

volume = {3},

pages = {2004}

}

### Abstract

This paper evaluates five supervised learning methods in the context of statistical spam filtering. We study the impact of different feature pruning methods and feature set sizes on each learner's performance using cost-sensitive measures. We observe that the significance of feature selection varies greatly from classifier to classifier. In particular, we found that Support Vector Machine, AdaBoost, and Maximum Entropy Model are the top performers in this evaluation, sharing similar characteristics: they are not sensitive to the feature selection strategy, scale easily to very high feature dimensions, and perform well across different datasets. In contrast, Naive Bayes, a commonly used classifier in spam filtering, is found to be sensitive to feature selection methods on small feature sets, and fails to function well in scenarios where false positives are penalized heavily. The experiments also suggest that aggressive feature pruning should be avoided when building filters for applications where legitimate mails are assigned a cost much higher than spams (such as λ = 999), so as to maintain better-than-baseline performance. An interesting finding is the effect of mail headers on spam filtering, which is often ignored in previous studies. Experiments show that classifiers using features from the message header alone can achieve performance comparable to or better than filters utilizing body features only. This suggests that message headers can be a reliable and powerfully discriminative feature source for spam filtering.
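The cost-sensitive measure used in this line of work is typically the Total Cost Ratio (TCR) of Androutsopoulos et al.: the baseline cost of using no filter (every spam gets through) divided by the filter's weighted error cost, where λ weights a false positive (legitimate mail lost) against a missed spam. A minimal sketch, with the function name and counts purely illustrative:

```python
def total_cost_ratio(n_spam, n_legit_as_spam, n_spam_as_legit, lam):
    """Total Cost Ratio (TCR): baseline cost of filtering nothing
    (= n_spam, one unit per spam delivered) divided by the filter's
    weighted error cost. TCR > 1 means the filter beats the
    do-nothing baseline; higher is better."""
    return n_spam / (lam * n_legit_as_spam + n_spam_as_legit)

# With lambda = 999, a single false positive is as costly as 999
# missed spams, so even one legitimate mail lost can push a filter
# below the baseline:
print(total_cost_ratio(n_spam=500, n_legit_as_spam=1,
                       n_spam_as_legit=20, lam=999))   # ≈ 0.49 < 1
print(total_cost_ratio(n_spam=500, n_legit_as_spam=0,
                       n_spam_as_legit=100, lam=999))  # 5.0 > 1
```

This illustrates the abstract's point that under λ = 999 a filter must be nearly free of false positives to stay better than baseline.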

### Citations

9092 | Statistical Learning Theory - Vapnik - 1998 |

2206 | Support-vector networks
- Cortes, Vapnik
- 1995
Citation Context ...e according to its similarity between stored instance base. After that, we introduce Support Vector Machine, a learning paradigm that is based on Structured Risk Minimization principle [Vapnik, 1995, Cortes and Vapnik, 1995]. Finally, we discuss AdaBoost [Freund and Schapire, 1996], a relatively new framework for boosting a weak learner into a strong one. These particular techniques are chosen because of their excellent... |

1717 | Text categorization with support vector machines: Learning with many relevant features - Joachims - 1998 |

1653 | Experiments with a new boosting algorithm
- Freund, Schapire
- 1996
Citation Context ...se. After that, we introduce Support Vector Machine, a learning paradigm that is based on Structured Risk Minimization principle [Vapnik, 1995, Cortes and Vapnik, 1995]. Finally, we discuss AdaBoost [Freund and Schapire, 1996], a relatively new framework for boosting a weak learner into a strong one. These particular techniques are chosen because of their excellent performance reported in previous studies. 3.1 Naive Bayes... |

1126 | Machine learning in automated text categorization - Sebastiani |

1084 | A maximum entropy approach to natural language processing
- Berger, Pietra, et al.
- 1996
Citation Context ...ier is expected to achieve an optimal result by setting t = λ/(1 + λ), as long as the probability estimates are accurate [Lewis, 1995]. 3.2 Maximum Entropy Model Maximum Entropy (ME or MaxEnt) Model [Berger et al., 1996] is a general-purpose machine-learning framework that has been successfully applied to a wide range of text processing tasks such as Statistical Language Modeling [Rosenfeld, 1996], Language Ambiguit... |

965 | A Comparative Study on Feature Selection in Text Categorization
- Yang, Pedersen
- 1997
Citation Context ...e of tens of thousands of features. Several attempts have been made to evaluate the performance of machine-learning methods on general text categorization task where there are ten or more categories [Yang and Pedersen, 1997, Yang and Liu, 1999]. However, whether the same conclusion still holds on two-class spam filtering task remains to be an ... (Footnote 1: More information can be found at: http://www.caube.org.au/spamstats.html.) |

769 | A Comparison of Event Models for Naive Bayes Text Classification. AAAI-98 Workshop on Learning for Text Categorization
- McCallum, Nigam
- 1998
Citation Context ...cular techniques are chosen because of their excellent performance reported in previous studies. 3.1 Naive Bayes Naive Bayes (NB) is a widely used classifier in Text Categorization task [Lewis, 1998, McCallum and Nigam, 1998]. It also enjoys a blaze of popularity in anti-spam researches [Sahami et al., 1998, Pantel and Lin, 1998, Androutsopoulos et al., 2000a, Schneider, 2003], and often serves as baseline method for com... |

714 | Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods - Platt - 1999 |

652 | A re-examination of text categorization methods
- Yang, Liu
- 1999
Citation Context ... features. Several attempts have been made to evaluate the performance of machine-learning methods on general text categorization task where there are ten or more categories [Yang and Pedersen, 1997, Yang and Liu, 1999]. However, whether the same conclusion still holds on two-class spam filtering task remains to be an open question, espec... (Footnote 1: More information can be found at: http://www.caube.org.au/spamstats.html.) |

581 | A short introduction to boosting
- Freund, Schapire
- 1999
Citation Context ... SVM and AdaBoost can be clarified by the fact that the two algorithms seek to maximize the minimum margins of training examples that are only different in norms of instance vector and weight vector [Freund and Schapire, 1999]. Viewed in this way, SVM and AdaBoost are very similar. The difference between large margin classifiers (SVM, AdaBoost) and Maximum Likelihood Estimation of exponential model (Maximum Entropy Model)... |

551 | Inducing features of random fields - Pietra, Pietra, et al. - 1997 |

533 | Approximate statistical tests for comparing supervised classification learning algorithms
- Dietterich
- 1998
Citation Context ...ure 1 to 5). The X-axis indicates different feature set sizes; the Y-axis is the TCR measure. All the p-values reported in the rest of this paper were obtained with paired single-tailed t test [Dietterich, 1998]. We can make several useful observations when putting the results together (Figures 1 to 5). The first thing we notice on the figures is that the performance of most classifiers (except Timbl) incre... |

498 | On the Limited Memory BFGS Method for Large Scale Optimization
- Liu, Nocedal
- 1989
Citation Context ...(x, y) are empirical probability distribution. In practice, the parameter Λ can be computed through numerical optimization method. In our experiment, we use the Limited-Memory Variable Metric method [Liu and Nocedal, 1989], a limited-memory version of the quasi-Newton method (also called L-BFGS), to find Λ. Applying L-BFGS requires evaluating the gradient of objective function LΛ in each iteration, which can be computed as: ∂... |
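The cited context describes fitting Λ with L-BFGS, re-evaluating the log-likelihood gradient each iteration. A hedged sketch of that training loop for a tiny two-feature binary logistic (binary maxent) model, using SciPy's off-the-shelf L-BFGS rather than the authors' own implementation; the data and variable names are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

# Toy binary dataset: rows are feature vectors x_i, entries of y are classes.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])

def neg_log_likelihood(lam):
    """Negative conditional log-likelihood of the logistic model
    P(y=1|x) = sigmoid(x . lam); minimizing this maximizes L_Lambda."""
    z = X @ lam
    return float(np.sum(np.log1p(np.exp(z)) - y * z))

def gradient(lam):
    """Gradient = model expectation minus empirical expectation of
    the features, which L-BFGS evaluates once per iteration."""
    p = 1.0 / (1.0 + np.exp(-(X @ lam)))
    return X.T @ (p - y)

res = minimize(neg_log_likelihood, x0=np.zeros(2), jac=gradient,
               method="L-BFGS-B")
print(res.x, neg_log_likelihood(res.x))
```

Because the log-likelihood of this exponential-family model is convex in Λ, any stationary point L-BFGS reaches is the global maximum-likelihood solution, as the Della Pietra et al. context below notes.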

490 | BoosTexter: A Boosting-based System for Text Categorization - Schapire, Singer |

472 | Making large-scale support vector machine learning practical
- Joachims
- 1999
Citation Context ...iables and are used together with constant C ≥ 0 to find solution of (15) in non-separable cases. SVM has been reported to achieve remarkable performance on Text Categorization task with many relevant features [Joachims, 1998b]. It has also been applied to spam filtering task with excellent filtering accuracy [Kolcz and Alspector, 2001, Drucker et al., 1999]. In our evaluation, we used an off-the-shelf SVM implementation:... |

440 | An experimental comparison of three methods for constructing ensembles of decision trees - Dietterich - 2000 |

398 | A Bayesian approach to filtering junk e-mail, Learning for Text Categorization
- Sahami, Dumais, et al.
- 1998
Citation Context ...e predicted are spam and legitimate. A variety of supervised machine-learning algorithms have been successfully applied to mail filtering task. A non-exhaustive list includes: Naive Bayes classifier [Sahami et al., 1998, Androutsopoulos et al., 2000a, Schneider, 2003], RIPPER rule induction algorithm [Cohen, 1996], Support Vector Machines [Drucker et al., 1999, Kolcz and Alspector, 2001], Memory Based Learning [Andr... |

349 | Naive Bayes at forty: The independence assumption in information retrieval
- Lewis
- 1998
Citation Context .... These particular techniques are chosen because of their excellent performance reported in previous studies. 3.1 Naive Bayes Naive Bayes (NB) is a widely used classifier in Text Categorization task [Lewis, 1998, McCallum and Nigam, 1998]. It also enjoys a blaze of popularity in anti-spam researches [Sahami et al., 1998, Pantel and Lin, 1998, Androutsopoulos et al., 2000a, Schneider, 2003], and often serves ... |

259 | Using maximum entropy for text classification
- Nigam
- 1999
Citation Context ... been successfully applied to a wide range of text processing tasks such as Statistical Language Modeling [Rosenfeld, 1996], Language Ambiguity Resolution [Ratnaparkhi, 1998] and Text Categorization [Nigam et al., 1999]. Given a set of training samples T = {(x1, y1), (x2, y2), . . . , (xN, yN)} where xi is a real value feature vector and yi is the target class, the maximum entropy principle [Jaynes, 1983] states t... |

246 | Support vector machines for spam categorization
- Drucker, Wu, et al.
- 1999
Citation Context .... A non-exhaustive list includes: Naive Bayes classifier [Sahami et al., 1998, Androutsopoulos et al., 2000a, Schneider, 2003], RIPPER rule induction algorithm [Cohen, 1996], Support Vector Machines [Drucker et al., 1999, Kolcz and Alspector, 2001], Memory Based Learning [Androutsopoulos et al., 2000b], AdaBoost [Carreras and Márquez, 2001] and Maximum Entropy Model [Zhang and Yao, 2003]. While all these approaches s... |

245 | A Maximum Entropy Approach to Adaptive Statistical Language Modeling. Computer Speech and Language
- Rosenfeld
- 1996
Citation Context ... MaxEnt) Model [Berger et al., 1996] is a general-purpose machine-learning framework that has been successfully applied to a wide range of text processing tasks such as Statistical Language Modeling [Rosenfeld, 1996], Language Ambiguity Resolution [Ratnaparkhi, 1998] and Text Categorization [Nigam et al., 1999]. Given a set of training samples T = {(x1, y1), (x2, y2), . . . , (xN, yN)} where xi is a real value ... |

203 | Maximum Entropy Models for Natural Language Ambiguity Resolution
- Ratnaparkhi
- 1998
Citation Context ...-purpose machine-learning framework that has been successfully applied to a wide range of text processing tasks such as Statistical Language Modeling [Rosenfeld, 1996], Language Ambiguity Resolution [Ratnaparkhi, 1998] and Text Categorization [Nigam et al., 1999]. Given a set of training samples T = {(x1, y1), (x2, y2), . . . , (xN, yN)} where xi is a real value feature vector and yi is the target class, the maxi... |

170 | Prior probabilities
- Jaynes
- 1968
Citation Context ...on [Nigam et al., 1999]. Given a set of training samples T = {(x1, y1), (x2, y2), . . . , (xN, yN)} where xi is a real value feature vector and yi is the target class, the maximum entropy principle [Jaynes, 1983] states that we should summarize data T with a model that is maximally noncommittal with respect to missing information. Among distributions consistent with the constraints imposed by T, there exist... |

167 | Learning rules that classify e-mail
- Cohen
- 1996
Citation Context ...essfully applied to mail filtering task. A non-exhaustive list includes: Naive Bayes classifier [Sahami et al., 1998, Androutsopoulos et al., 2000a, Schneider, 2003], RIPPER rule induction algorithm [Cohen, 1996], Support Vector Machines [Drucker et al., 1999, Kolcz and Alspector, 2001], Memory Based Learning [Androutsopoulos et al., 2000b], AdaBoost [Carreras and Márquez, 2001] and Maximum Entropy Model [Zh... |

123 | C.D.: An Evaluation of Naive Bayesian Anti-Spam Filtering - Androutsopoulos, Koutsias, et al. - 2000 |

116 | An example-based mapping method for text categorization and retrieval - Yang, Chute - 1994 |

105 | Transmission of information
- Fano
- 1949
Citation Context ...mmarized in Table 1. It is worth mentioning that Information Gain is just the weighted average of the Mutual Information between (t, c) and (¯t, c), and is also called the average mutual information [Fano, 1961]. IG has been used in several anti-spam experiments under the name of “Mutual Information” [Sahami et al., 1998, Androutsopoulos et al., 2000c, Schneider, 2003]. We shall use the name IG in the rest ... |
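The identity in this snippet, IG as the weighted average of the pointwise mutual information over term presence/absence, can be computed directly from the 2×2 contingency table of a binary feature t against the class c. A sketch under that reading, with hypothetical cell counts:

```python
import math

def information_gain(n_tc, n_t_notc, n_nott_c, n_nott_notc):
    """Information Gain of binary feature t w.r.t. class c: the
    expected (weighted average) mutual information over the four
    (term, class) cells, i.e. sum over cells of
    P(t,c) * log( P(t,c) / (P(t)P(c)) )."""
    n = n_tc + n_t_notc + n_nott_c + n_nott_notc
    ig = 0.0
    # For each cell: (joint count, row-total * column-total)
    cells = [
        (n_tc,        (n_tc + n_t_notc)       * (n_tc + n_nott_c)),
        (n_t_notc,    (n_tc + n_t_notc)       * (n_t_notc + n_nott_notc)),
        (n_nott_c,    (n_nott_c + n_nott_notc) * (n_tc + n_nott_c)),
        (n_nott_notc, (n_nott_c + n_nott_notc) * (n_t_notc + n_nott_notc)),
    ]
    for joint, row_times_col in cells:
        if joint > 0:
            # P(t,c)/(P(t)P(c)) = joint * n / (row_total * col_total)
            ig += (joint / n) * math.log(joint * n / row_times_col)
    return ig
```

An independent feature (all four cells equal) gives IG = 0, while a feature that perfectly predicts the class gives IG = log 2 nats, the entropy of a balanced two-class problem.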

102 | An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages - Androutsopoulos, Koutsias, et al. |

96 | Boosting trees for anti-spam email filtering - Carreras, Marquez - 2001 |

81 | Boosting and maximum likelihood for exponential models
- Lebanon, Lafferty
- 2002
Citation Context ...ikelihood Estimation of exponential model (Maximum Entropy Model) seems to be obvious at first glance. However, there exists a strong connection between AdaBoost and Maximum Entropy Model. The work of [Lebanon and Lafferty, 2001] showed that AdaBoost and Maximum Likelihood of exponential models actually minimize the same Kullback-Leibler divergence objective function subject to identical feature constraints, and they typical... |

76 | Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach - Androutsopoulos, Paliouras, et al. - 2000 |

43 | Spamcop: A spam classification and organization program,” in Learning for Text Categorization
- Pantel, Lin
- 1998
Citation Context ...es Naive Bayes (NB) is a widely used classifier in Text Categorization task [Lewis, 1998, McCallum and Nigam, 1998]. It also enjoys a blaze of popularity in anti-spam researches [Sahami et al., 1998, Pantel and Lin, 1998, Androutsopoulos et al., 2000a, Schneider, 2003], and often serves as baseline method for comparison with other approaches. Given a feature vector dj = {wj1, wj2, . . . , wj|dj|} of a message... |

35 | A comparison of event models for naive Bayes anti-spam email filtering - Schneider - 2003 |

32 | SVM-based filtering of e-mail spam with content-specific misclassification costs
- Kolcz, Alspector
- 2001
Citation Context ...t includes: Naive Bayes classifier [Sahami et al., 1998, Androutsopoulos et al., 2000a, Schneider, 2003], RIPPER rule induction algorithm [Cohen, 1996], Support Vector Machines [Drucker et al., 1999, Kolcz and Alspector, 2001], Memory Based Learning [Androutsopoulos et al., 2000b], AdaBoost [Carreras and Márquez, 2001] and Maximum Entropy Model [Zhang and Yao, 2003]. While all these approaches seem appealing, it is diffic... |

28 | Jakub Zavrel, Ko van der Sloot and Antal van den Bosch. Timbl: Tilburg memory based learner, version 5.1, reference guide - Daelemans - 2004 |

24 | TIMBL: Tilburg memory-based learner, version 4.3 reference guide - Daelemans, Zavrel, et al. - 1999 |

18 | Evaluating and optimizing autonomous text classification systems
- Lewis
- 1995
Citation Context ...es more costly than wrongly labeling a spam as legitimate, Naive Bayes classifier is expected to achieve an optimal result by setting t = λ/(1 + λ), as long as the probability estimates are accurate [Lewis, 1995]. 3.2 Maximum Entropy Model Maximum Entropy (ME or MaxEnt) Model [Berger et al., 1996] is a general-purpose machine-learning framework that has been successfully applied to a wide range of text proce... |
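The decision rule in this snippet is easy to state in code: with λ the cost of a false positive relative to a missed spam, a message is flagged only when the classifier's estimated spam probability clears the threshold t = λ/(1 + λ). A minimal sketch (function and label names are illustrative):

```python
def decide(p_spam, lam):
    """Cost-sensitive decision rule: flag as spam only if the
    estimated P(spam|d) exceeds t = lambda/(1+lambda), the
    risk-minimizing threshold when probabilities are accurate
    [Lewis, 1995]."""
    t = lam / (1.0 + lam)
    return "spam" if p_spam > t else "legitimate"

# The same 0.95 spam probability is treated differently as the
# cost of false positives grows:
print(decide(0.95, lam=1))    # t = 0.5    -> "spam"
print(decide(0.95, lam=9))    # t = 0.9    -> "spam"
print(decide(0.95, lam=999))  # t = 0.999  -> "legitimate"
```

This makes concrete why the evaluation finds λ = 999 so punishing: the filter must be nearly certain before discarding anything.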

18 | Filtering junk mail with a maximum entropy model - ZHANG, T |

4 | Inducing Features of Random Fields
- Pietra, Pietra, et al.
- 1997
Citation Context ...r, it has been shown that the Maximum Entropy model is also the Maximum Likelihood solution on the training data that minimizes the Kullback-Leibler divergence between PΛ and the uniform model [Della Pietra et al., 1997]. Since the log-likelihood of PΛ(y | x) on training data is convex in the model's parameter space Λ, a unique Maximum Entropy solution is guaranteed and can be found by maximizing the log-likelihood ... |

2 | A corpus-based investigation of junk emails
- Orasan, Krishnamurthy
- 2002
Citation Context ...ransform data into the input of a machine learning algorithm and evaluate the output produced by that algorithm. Relatively few attempts have been made to exploit the linguistic characters of spams. [Orasan and Krishnamurthy, 2002] compared various linguistic features such as average sentence length, POS distribution, lexical frequencies (N-grams, lemma) in one junk corpus to BNC corpus, showing quite different characters exis... |

1 | A maximum entropy approach to adaptive statistical language modeling - Rosenfeld - 1996 |