Learning Spam: Simple Techniques for Freely-Available Software
- In USENIX Annual Technical Conference, FREENIX Track, 2003
Cited by 9 (0 self)
Permission is granted for noncommercial reproduction of the work for educational or research purposes.
Combining email models for false positive reduction
- In KDD ’05: Proceeding of the 11th ACM SIGKDD international conference on Knowledge discovery in data mining, 2005
Cited by 8 (2 self)
Machine learning and data mining can be effectively used to model, classify and discover interesting information for a wide variety of data including email. The Email Mining Toolkit, EMT, has been designed to provide a wide range of analyses for arbitrary email sources. Depending upon the task, one can usually achieve very high accuracy, but with some amount of false positive tradeoff. Generally false positives are prohibitively expensive in the real world. In the case of spam detection, for example, even if one email is misclassified, this may be unacceptable if it is a very important email. Much work has been done to improve specific algorithms for the task of detecting unwanted messages, but less work has been reported on leveraging multiple algorithms and correlating models in this particular domain of email analysis. EMT has been updated with new correlation functions allowing the analyst to integrate a number of EMT’s user behavior models available in the core technology. We present results of combining classifier outputs for both improving accuracy and reducing false positives for the problem of spam detection. We apply these methods to a very large email data set and show results of different combination methods on these corpora. We introduce a new method to compare multiple and combined classifiers, and show how it differs from past work. The method analyzes the relative gain and maximum possible accuracy that can be achieved for certain combinations of classifiers to automatically choose the best combination. Categories & Subject Descriptors: H.3.3 [Information Search and Retrieval]: Retrieval models,
Hierarchical Text Categorization and Its Application to Bioinformatics
- 2005
Cited by 7 (0 self)
In a hierarchical categorization problem, categories are partially ordered to form a hierarchy. In this dissertation, we explore two main aspects of hierarchical categorization: learning algorithms and performance evaluation. We introduce the notion of consistent hierarchical classification that makes classification results more comprehensible and easily interpretable for end-users. Among the previously introduced hierarchical learning algorithms, only a local top-down approach produces consistent classification. The present work extends this algorithm to the general case of DAG class hierarchies and possible internal class assignments. In addition, a new global hierarchical approach aimed at performing consistent classification is proposed. This is a general framework of converting a conventional “flat” learning algorithm into a hierarchical one. An extensive set of experiments on real and synthetic data indicate that the proposed approach significantly outperforms the corresponding “flat” as well as the local top-down method. For evaluation purposes, we use a novel hierarchical evaluation measure that is superior to the existing hierarchical and non-hierarchical evaluation techniques according to a number
Coarse-to-Fine, Cost-Sensitive Classification of E-Mail
In many real-world scenarios, it is necessary to make judgments at differing levels of granularity due to computational constraints. Particularly when there are a large number of classifications that must be done in a real-time streaming setting and there is a significant difference in the time required to acquire different subsets of features, it is important to have an intelligent strategy for optimizing classification accuracy versus computational costs. Accurate and timely email classification requires trading off the classification granularity with the feature acquisition costs. To solve this problem, we introduce a Granular Cost-Sensitive Classifier (GCSC) which modulates the cost of feature acquisition with the granularity of the classification, allowing inexpensive classification at a coarse level and more costly classification at finer levels of granularity. Our approach can classify messages with greater accuracy while incurring a lower feature acquisition cost relative to baseline classifiers that do not make use of cost information.

