Spam filtering using statistical data compression models
- Journal of Machine Learning Research, 2006
Cited by 33 (12 self)
Spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. In this paper, we investigate a novel approach to spam filtering based on adaptive statistical data compression models. The nature of these models allows them to be employed as probabilistic text classifiers based on character-level or binary sequences. By modeling messages as sequences, tokenization and other error-prone preprocessing steps are omitted altogether, resulting in a method that is very robust. The models are also fast to construct and incrementally updateable. We evaluate the filtering performance of two different compression algorithms: dynamic Markov compression and prediction by partial matching. The results of our empirical evaluation indicate that compression models outperform currently established spam filters, as well as a number of methods proposed in previous studies.
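The idea of using a compression model as a classifier can be illustrated with a minimal sketch. This is not the paper's dynamic Markov compression or prediction by partial matching; it substitutes a much simpler fixed-order character model with add-one smoothing, and the names `CharModel`, `update`, `code_length`, and `classify` are hypothetical. A message is assigned to the class whose model would encode it in fewer bits.

```python
import math
from collections import defaultdict


class CharModel:
    """Fixed-order character model (an assumed stand-in for DMC/PPM)."""

    def __init__(self, order=3, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size    # assumed symbol alphabet for smoothing
        self.counts = defaultdict(int)        # (context, symbol) -> count
        self.totals = defaultdict(int)        # context -> count

    def _pairs(self, text):
        # Pad with spaces so the first characters also have a full context.
        padded = " " * self.order + text
        for i in range(self.order, len(padded)):
            yield padded[i - self.order:i], padded[i]

    def update(self, text):
        """Incrementally add one message to the model."""
        for ctx, sym in self._pairs(text):
            self.counts[(ctx, sym)] += 1
            self.totals[ctx] += 1

    def code_length(self, text):
        """Bits needed to encode text under the current counts."""
        bits = 0.0
        for ctx, sym in self._pairs(text):
            seen = self.counts.get((ctx, sym), 0)
            total = self.totals.get(ctx, 0)
            bits -= math.log2((seen + 1) / (total + self.alphabet_size))
        return bits


def classify(message, spam_model, ham_model):
    """Pick the class whose model compresses the message better."""
    if spam_model.code_length(message) < ham_model.code_length(message):
        return "spam"
    return "ham"


spam_model, ham_model = CharModel(), CharModel()
for text in ["buy cheap pills now", "cheap pills online buy now"]:
    spam_model.update(text)
for text in ["meeting at noon tomorrow", "see you at the meeting"]:
    ham_model.update(text)

print(classify("cheap pills", spam_model, ham_model))
print(classify("meeting tomorrow", spam_model, ham_model))
```

Because the model works on raw characters, no tokenizer is involved, and `update` can be called on each new labelled message as feedback arrives.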
On-line supervised spam filter evaluation
- ACM Transactions on Information Systems, 2007
Cited by 14 (4 self)
Eleven variants of six widely used open-source spam filters are tested on a chronological sequence of 49086 email messages received by an individual from August 2003 through March 2004. Our approach differs from those previously reported in that the test set is large, comprises uncensored raw messages, and is presented to each filter sequentially with incremental feedback. Misclassification rates and Receiver Operating Characteristic Curve measurements are reported, with statistical confidence intervals. Quantitative results indicate that content-based filters can eliminate 98% of spam while incurring 0.1% legitimate email loss. Qualitative results indicate that the risk of loss depends on the nature of the message, and that messages likely to be lost may be those that are less critical. More generally, our methodology has been encapsulated in a free software toolkit, which may be used to conduct similar experiments.
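The sequential protocol described in this abstract (classify each message in arrival order, then reveal the true label as feedback) can be sketched as a short loop. This is an illustrative sketch only, not the authors' toolkit: `TokenCountFilter` is a toy filter invented for the demonstration, `evaluate_online` is a hypothetical name, and confidence intervals and ROC measurements are omitted.

```python
from collections import Counter


class TokenCountFilter:
    """Toy filter for demonstration only: votes by token counts seen so far."""

    def __init__(self):
        self.seen = {"spam": Counter(), "ham": Counter()}

    def classify(self, message):
        tokens = message.lower().split()
        spam_votes = sum(self.seen["spam"][t] for t in tokens)
        ham_votes = sum(self.seen["ham"][t] for t in tokens)
        return "spam" if spam_votes > ham_votes else "ham"

    def train(self, message, label):
        self.seen[label].update(message.lower().split())


def evaluate_online(spam_filter, stream):
    """Classify messages in order, giving the true label as feedback after each."""
    errors, totals = Counter(), Counter()
    for message, label in stream:
        predicted = spam_filter.classify(message)   # judged before feedback
        totals[label] += 1
        if predicted != label:
            errors[label] += 1
        spam_filter.train(message, label)           # incremental feedback
    return {
        "ham_misclassification": errors["ham"] / totals["ham"] if totals["ham"] else 0.0,
        "spam_misclassification": errors["spam"] / totals["spam"] if totals["spam"] else 0.0,
    }


stream = [
    ("cheap pills now", "spam"),
    ("lunch meeting today", "ham"),
    ("cheap pills online", "spam"),
    ("meeting notes today", "ham"),
]
rates = evaluate_online(TokenCountFilter(), stream)
print(rates)
```

Reporting the two error rates separately matters because losing a legitimate message and letting a spam message through carry very different costs.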
Filtron: A Learning-Based Anti-Spam Filter
- Proceedings of the 1st Conference on Email and Anti-Spam. Mountain, 2004
Cited by 13 (2 self)
We present Filtron, a prototype anti-spam filter that integrates the main empirical conclusions of our comprehensive analysis on using machine learning to construct effective personalized anti-spam filters. Filtron is based on the experimental results over several design parameters on four publicly available benchmark corpora. After describing Filtron's architecture, we assess its behavior in real use over a period of seven months. The results are deemed satisfactory, though they can be improved with more elaborate preprocessing and regular re-training.
An assessment of case-based reasoning for spam filtering
- Artif. Intell. Rev., 2005
Cited by 13 (6 self)
Because of the changing nature of spam, a spam filtering system that uses machine learning will need to be dynamic. This suggests that a case-based (memory-based) approach may work well. Case-Based Reasoning (CBR) is a lazy approach to machine learning where induction is delayed to run time. This means that the case base can be updated continuously and new training data is immediately available to the induction process. In this paper we present a detailed description of such a system called ECUE and evaluate design decisions concerning the case representation. We compare its performance with an alternative system that uses Naïve Bayes (NB). We find that there is little to choose between the two alternatives in cross-validation tests on data sets. However, ECUE does appear to have some advantages in tracking concept drift over time.
Introducing the Webb spam corpus: Using email spam to identify Web spam automatically
- In Proceedings of the 3rd Conference on Email and AntiSpam (CEAS) (Mountain View, 2006
Cited by 10 (4 self)
Just as email spam has negatively impacted the user messaging experience, the rise of Web spam is threatening to severely degrade the quality of information on the World Wide Web. Fundamentally, Web spam is designed to pollute search engines and corrupt the user experience by driving traffic to particular spammed Web pages, regardless of the merits of those pages. In this paper, we identify an interesting link between email spam and Web spam, and we use this link to propose a novel technique for extracting large Web spam samples from the Web. Then, we present the Webb Spam Corpus – a first-of-its-kind, large-scale, and publicly available Web spam data set that was created using our automated Web spam collection method. The corpus consists of nearly 350,000 Web spam pages, making it more than two orders of magnitude larger than any other previously cited Web spam data set. Finally, we identify several application areas where the Webb Spam Corpus may be especially helpful. Interestingly, since the Webb Spam Corpus bridges the worlds of email spam and Web spam, we note that it can be used to aid traditional email spam classification algorithms through an analysis of the characteristics of the Web pages referenced by email messages.
A case-based technique for tracking concept drift in spam filtering
- 2005
Cited by 9 (5 self)
Spam filtering is a particularly challenging machine learning task as the data distribution and concept being learned change over time. It exhibits a particularly awkward form of concept drift as the change is driven by spammers wishing to circumvent spam filters. In this paper we show that lazy learning techniques are appropriate for such dynamically changing contexts. We present a case-based system for spam filtering that can learn dynamically. We evaluate its performance as the case-base is updated with new cases. We also explore the benefit of periodically redoing the feature selection process to bring new features into play. Our evaluation shows that these two levels of model update are effective in tracking concept drift.
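The first level of update mentioned here, adding new cases to the case base, can be illustrated with a small lazy learner. This is a sketch under stated assumptions, not the authors' system: the token-set representation, Jaccard similarity, majority vote, and the names `CaseBase` and `add_case` are all chosen for illustration, and the periodic feature reselection step is not shown.

```python
class CaseBase:
    """Toy lazy learner: stores labelled cases, classifies by nearest neighbours."""

    def __init__(self, k=1):
        self.k = k
        self.cases = []  # list of (token_set, label)

    def add_case(self, message, label):
        """Learning is just storage; no model is rebuilt."""
        self.cases.append((frozenset(message.lower().split()), label))

    def classify(self, message):
        tokens = frozenset(message.lower().split())

        def jaccard(other):
            union = tokens | other
            return len(tokens & other) / len(union) if union else 0.0

        # Induction happens at query time: rank stored cases by similarity.
        ranked = sorted(self.cases, key=lambda case: jaccard(case[0]), reverse=True)
        nearest = ranked[: self.k]
        spam_votes = sum(1 for _, label in nearest if label == "spam")
        return "spam" if spam_votes * 2 > len(nearest) else "ham"


case_base = CaseBase(k=1)
case_base.add_case("project meeting today", "ham")
case_base.add_case("cheap pills now", "spam")

# A new kind of spam is missed until a matching case is stored.
before_update = case_base.classify("replica watches today")
case_base.add_case("replica watches today", "spam")
after_update = case_base.classify("replica watches sale")
print(before_update, after_update)
```

Since a stored case is usable immediately, the classifier can follow a shift in spam content without a separate retraining step.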
Web mining
- In Oded Maimon and Lior Rokach, editors, The Data Mining and Knowledge Discovery Handbook, 2005
Cited by 2 (0 self)
The World-Wide Web provides every internet citizen with access to an abundance of information, but it becomes increasingly difficult to identify the relevant pieces of information. Research in web mining tries to address this problem by applying techniques from data mining and machine learning to Web data and documents. This chapter provides a brief overview of web mining techniques and research areas, most notably hypertext classification, wrapper induction, recommender systems and web usage mining.
Asymmetric Gradient Boosting with Application to Spam Filtering
Cited by 2 (0 self)
In this paper, we propose a new asymmetric boosting method, Boosting with Different Costs. Traditional boosting methods assume the same cost for misclassified instances from different classes, and in this way focus on good performance with respect to overall accuracy. Our method is more generic, and is designed to be more suitable for problems where the major concern is a low false positive (or negative) rate, such as spam filtering. Experimental results on a large scale email spam data set demonstrate the superiority of our method over state-of-the-art techniques.
A Content Vector Model for Text Classification
- 2006
As a popular rank-reduced vector space approach, Latent Semantic Indexing (LSI) has been used in information retrieval and other applications. In this paper, an LSI-based content vector model for text classification is presented, which constructs multiple augmented category LSI spaces and classifies text by their content. The model integrates the class discriminative information from the training data and is equipped with several pertinent feature selection and text classification algorithms. The proposed classifier has been applied to email classification and its experiments on a benchmark spam testing corpus (PU1) have shown that the approach represents a competitive alternative to other email classifiers based on the well-known SVM and naïve Bayes algorithms.
A Mail Client Plugin for Privacy-Preserving Spam Filter Evaluation
We describe a plugin extension to the Thunderbird Mail Client to support standardized evaluation of multiple spam filters on private mail streams. Researchers need not view or handle the subject users' messages and subject users need not be familiar with spam filter evaluation methodology. All that is required of the user is to install the plugin as a standard extension and to run it on his or her mailbox. The plugin evaluates a spam filter, assuming the user's existing classification to be accurate, and sends summary results only to the researcher, after allowing the user to verify exactly what is sent. This plugin addresses an outstanding challenge in spam filter evaluation: that of using a broad base of realistic data while satisfying personal and legislative privacy requirements. Previous efforts have used public data which may not be representative, captured data which may be insufficiently private, and obfuscation techniques which compromise the integrity of the data and may also be insufficiently private. We show preliminary results using the tool to evaluate some filters previously evaluated at TREC.

