A Survey of Learning-Based Techniques of Email Spam Filtering
, 2007
Cited by 40 (0 self)
Email spam is one of the major problems of today’s Internet, bringing financial damage to companies and annoying individual users. Among the approaches developed to stop spam, filtering is an important and popular one. In this paper we give an overview of the state of the art of machine learning applications for spam filtering, and of the ways different filtering methods are evaluated and compared. We also provide a brief description of other branches of anti-spam protection and discuss the use of various approaches in commercial and non-commercial anti-spam software solutions.
Email Spam Filtering: A Systematic Review
- Foundations and Trends in Information Retrieval
Cited by 28 (3 self)
Spam is information crafted to be delivered to a large number of recipients, in spite of their wishes. A spam filter is an automated tool to recognize spam so as to prevent its delivery. The purposes of spam and spam filters are diametrically opposed: spam is effective if it evades filters, while a filter is effective if it recognizes spam. The circular nature of these definitions, along with their appeal to the intent of sender and recipient, makes them difficult to formalize. A typical email user has a working definition no more formal than “I know it when I see it.” Yet current spam filters are remarkably effective: more effective than might be expected given the level of uncertainty and debate over a formal definition of spam, and more effective than might be expected given the state-of-the-art information retrieval and machine learning methods for seemingly similar problems. But are they effective enough? Which are better? How might they be improved? Will their effectiveness be
Visual-Similarity-Based Phishing Detection
Cited by 22 (2 self)
Phishing is a form of online fraud that aims to steal a user’s sensitive information, such as online banking passwords or credit card numbers. The victim is tricked into entering such information on a web page that is crafted by the attacker so that it mimics a legitimate page. Recent statistics about the increasing number of phishing attacks suggest that this security problem still deserves significant attention. In this paper, we present a novel technique to visually compare a suspected phishing page with the legitimate one. The goal is to determine whether the two pages are suspiciously similar. We identify and consider three page features that play a key role in making a phishing page look similar to a legitimate one. These features are text pieces and their style, images embedded in the page, and the overall visual appearance of the page as rendered by the browser. To verify the feasibility of our approach, we performed an experimental evaluation using a dataset composed of 41 real-world phishing pages, along with their corresponding legitimate targets. Our experimental results are satisfactory in terms of false positives and false negatives.
Detecting image spam using visual features and near duplicate detection
- In Proceedings of the 17th International Conference on World Wide Web, WWW ’08
, 2008
Cited by 13 (0 self)
Email spam is a much-studied topic, but even though current email spam detection software has been gaining a competitive edge against text-based email spam, new advances in spam generation have posed a new challenge: image-based spam. Image-based spam is email which includes embedded images containing the spam messages, but in binary format. In this paper, we study the characteristics of image spam to propose two solutions for detecting image-based spam, while drawing a comparison with the existing techniques. The first solution, which uses visual features for classification, offers an accuracy of about 98%, i.e. an improvement of at least 6% compared to existing solutions. SVMs (Support Vector Machines) are used to train classifiers using judiciously chosen color, texture and shape features. The second solution offers a novel approach for near-duplicate detection in images. It involves clustering of image GMMs (Gaussian Mixture Models) based on the Agglomerative Information Bottleneck (AIB) principle, using Jensen-Shannon divergence (JS) as the distance measure.
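The Jensen-Shannon divergence named in the abstract above can be illustrated with a minimal sketch. This is not the paper's method: the paper applies JS to Gaussian Mixture Models, whereas the sketch below assumes two discrete distributions (for example, normalized color histograms) of equal length. The function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits for discrete distributions given as equal-length lists."""
    # Terms with p_i == 0 contribute nothing by convention.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: average KL of p and q to their midpoint."""
    # The midpoint m is positive wherever p or q is positive, so KL never divides by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Unlike plain KL divergence, this measure is symmetric and bounded (between 0 and 1 when measured in bits), which makes it convenient as a distance-like quantity for clustering.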
A campaign-based characterization of spamming strategies
- In CEAS
, 2008
Cited by 12 (3 self)
This paper presents a methodology for the characterization of spamming strategies based on the identification of spam campaigns. To deeply understand how spammers abuse network resources and obfuscate their messages, an aggregated analysis of spam messages is not enough. Grouping spam messages into campaigns is important to unveil behaviors that cannot be noticed when looking at the whole set of spam messages collected. We propose a spam identification technique based on a frequent pattern tree, which naturally captures the invariants in message content and detects campaigns that differ only due to obfuscated fragments. After that, we characterize these campaigns both in terms of content obfuscation and exploitation of network resources. Our methodology includes the use of attribute association analysis: by applying an association rule mining algorithm, we were able to determine co-occurrence of campaign attributes that unveil different spamming strategies. In particular, we found strong relations between the origin of the spam and how it abused the network, and also between operating systems and types of abuse.
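The attribute association analysis mentioned in the abstract above rests on two standard quantities, support and confidence. The sketch below shows how they are computed for a single rule; it is not the paper's algorithm, and the attribute names and campaign records are invented toy data for illustration only.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Toy data (invented): each campaign is a set of attribute labels.
campaigns = [
    {"os:windows", "abuse:open_proxy"},
    {"os:windows", "abuse:open_proxy"},
    {"os:linux", "abuse:open_relay"},
    {"os:windows", "abuse:open_relay"},
]

rule_support = support(campaigns, {"os:windows", "abuse:open_proxy"})
rule_confidence = confidence(campaigns, {"os:windows"}, {"abuse:open_proxy"})
```

An association rule mining algorithm enumerates candidate rules and keeps those whose support and confidence exceed chosen thresholds; this sketch only scores one rule that is given up front.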
A Comprehensive Approach to Image Spam Detection: From Server to Client Solution
- IEEE Transactions on Information Forensics and Security
, 2010
Cited by 4 (0 self)
Image spam is a type of e-mail spam that embeds spam text content into graphical images to bypass traditional text-based e-mail spam filters. To effectively detect image spam, it is desirable to leverage image content analysis technologies. However, most previous work on image spam detection focuses on filtering the image spam on the client side. We propose a more desirable comprehensive solution which embraces both server-side filtering and client-side detection to effectively mitigate image spam. On the server side, we present a nonnegative sparsity induced similarity measure for cluster analysis of spam images to filter the attack activities of spammers and quickly trace back the spam sources. On the client side, we employ the principle of active learning where the learner guides the users to label as few images as possible while maximizing the classification accuracy. The server-side filtering identifies large image clusters as suspicious spam sources and further analysis can be performed to identify the real sources and block them from the beginning. For those spam images which survive the server-side filter, our active learner on the client side will further guide the users to interactively and efficiently filter them out. Our experiments on an image spam data set collected from the e-mail server of our department demonstrate the efficacy of the proposed comprehensive solution. Index Terms: Active learning, clustering, image recognition, image spam, spam filtering, sparse representation.
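The client-side idea of asking the user to label as few images as possible can be illustrated with uncertainty sampling, one common active learning strategy. The abstract does not say which query strategy the authors use, so this is an assumption made for illustration; `select_queries` and its inputs are hypothetical.

```python
def select_queries(spam_probabilities, k):
    """Return indices of the k items the classifier is least sure about.

    `spam_probabilities` holds the current classifier's estimated probability
    that each unlabeled image is spam; values near 0.5 are the most uncertain.
    """
    ranked = sorted(
        range(len(spam_probabilities)),
        key=lambda i: abs(spam_probabilities[i] - 0.5),
    )
    return ranked[:k]

# Example: four unlabeled images scored by some classifier (made-up scores).
scores = [0.95, 0.52, 0.10, 0.40]
to_label = select_queries(scores, 2)  # indices of the images to show the user
```

In a full loop, the user's labels for the selected images would be added to the training set, the classifier retrained, and the selection repeated until the labeling budget is spent.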
Evaluating the Security of Machine Learning Algorithms
, 2008
Graph-based Rare Category Detection
Cited by 4 (2 self)
Rare category detection is the task of identifying examples from rare classes in an unlabeled data set. It is an open challenge in machine learning and plays key roles in real applications such as financial fraud detection, network intrusion detection, astronomy, spam image detection, etc. In this paper, we develop a new graph-based method for rare category detection named GRADE. It makes use of the global similarity matrix motivated by the manifold ranking algorithm, which results in more compact clusters for the minority classes; by selecting examples from the regions where the density changes the most, it eliminates the assumption that the majority classes and the minority classes are separable. Furthermore, when detailed information about the data set is not available, we develop a modified version of GRADE named GRADE-LI, which only needs an upper bound on the proportion of all the minority classes as input. Besides working with data with features, both GRADE and GRADE-LI can also work with graph data, which cannot be processed by existing rare category detection methods. Experimental results on both synthetic and real data sets demonstrate the effectiveness of the GRADE and GRADE-LI algorithms.
Rare Category Analysis
, 2010
Cited by 2 (0 self)
In many real-world problems, rare categories (minority classes) play an essential role despite their extreme scarcity. For example, in financial fraud detection, the vast majority of the financial transactions are legitimate, and only a small number may be fraudulent; in Medicare fraud detection, the percentage of bogus claims is small, but the total loss is significant; in network intrusion detection, malicious network activities are hidden among huge volumes of routine network traffic; in astronomy, only 0.001% of the objects in sky survey images are truly beyond the scope of current science and may lead to new discoveries; in spam image detection, the near-duplicate spam images are difficult to discover among the large number of non-spam images; in rare disease diagnosis, the rare diseases affect less than 1 out of 2000 people, but the consequences can be very severe. Therefore, the discovery, characterization and prediction of rare categories or rare examples may protect us from fraudulent or malicious behaviors, aid scientific discoveries, and even save lives. This thesis focuses on rare category analysis, where the majority classes have a smooth distribution, and the minority classes exhibit a compactness property. Furthermore, we focus on the challenging cases where the support regions of the majority and minority classes overlap