Results 21 - 30 of 414
An Immunity-Based Technique to Characterize Intrusions in Computer Networks
, 2002
Cited by 103 (19 self)
This paper presents a technique inspired by the negative selection mechanism of the immune system that can detect foreign patterns in the complement (non-self) space. In particular, the novel pattern detectors (in the complement space) are evolved using a genetic search, which could differentiate varying degrees of abnormality in network traffic. The paper demonstrates the usefulness of such a technique to detect a wide variety of intrusive activities on networked computers. We also used a positive characterization method based on a nearest-neighbor classification.
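The core of negative selection can be sketched in a few lines: generate candidate detectors at random and keep only those that match no "self" pattern; the survivors then flag non-self. This is a minimal illustration, not the paper's genetic search — the r-contiguous-bits matching rule and all parameter values below are assumptions.

```python
import random

def matches(detector, pattern, r=4):
    """r-contiguous-bits rule: True if detector and pattern agree on
    at least r consecutive positions."""
    run = 0
    for d, p in zip(detector, pattern):
        run = run + 1 if d == p else 0
        if run >= r:
            return True
    return False

def generate_detectors(self_set, n_bits=8, n_detectors=50, r=4, seed=0):
    """Negative selection: keep only random candidates matching no self pattern."""
    rng = random.Random(seed)
    detectors = []
    while len(detectors) < n_detectors:
        cand = tuple(rng.randint(0, 1) for _ in range(n_bits))
        if not any(matches(cand, s, r) for s in self_set):
            detectors.append(cand)
    return detectors

def is_anomalous(pattern, detectors, r=4):
    """A pattern is flagged as non-self if any detector matches it."""
    return any(matches(d, pattern, r) for d in detectors)
```

By construction no detector ever matches a self pattern, so self traffic is never flagged; the genetic search in the paper replaces the blind random generation above with evolved detectors covering non-self more efficiently.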
Explicitly representing expected cost: an alternative to ROC representation
- KDD
, 2000
Cited by 93 (10 self)
This paper proposes an alternative to ROC representation, in which the expected cost of a classifier is represented explicitly. This expected cost representation maintains many of the advantages of ROC representation, but is easier to understand. It allows the experimenter to immediately see the range of costs and class frequencies where a particular classifier is the best and quantitatively how much better it is than other classifiers. This paper demonstrates there is a point/line duality between the two representations. A point in ROC space representing a classifier becomes a line segment spanning the full range of costs and class frequencies. This duality produces equivalent operations in the two spaces, allowing most techniques used in ROC analysis to be readily reproduced in the cost space.
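The point/line duality can be stated concretely: an ROC point (FPR, TPR) becomes the cost-space line NEC(pc) = (1 - TPR)·pc + FPR·(1 - pc), where pc folds the class prior and the two misclassification costs into a single value in [0, 1]. A sketch under that reading (function names are mine, not the paper's):

```python
def cost_line(fpr, tpr):
    """Map an ROC point to its cost-space line.

    Returns a function giving normalized expected cost as a function of
    pc in [0, 1] (class prior and misclassification costs folded into
    one operating-condition value).  At pc = 0 the cost is FPR; at
    pc = 1 it is the false-negative rate 1 - TPR.
    """
    fnr = 1.0 - tpr
    return lambda pc: fnr * pc + fpr * (1.0 - pc)

def lower_envelope(roc_points, pcs):
    """Best achievable cost at each operating condition: the minimum over
    all classifiers' cost lines, plus the two trivial classifiers."""
    lines = [cost_line(f, t) for f, t in roc_points]
    lines.append(cost_line(0.0, 0.0))  # always predict negative
    lines.append(cost_line(1.0, 1.0))  # always predict positive
    return [min(line(pc) for line in lines) for pc in pcs]
```

Reading off where one classifier's line dips below another's gives exactly the "range of costs and class frequencies where a particular classifier is the best" that the abstract mentions.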
Using AUC and accuracy in evaluating learning algorithms
- IEEE Transactions on Knowledge and Data Engineering
, 2005
Cited by 89 (1 self)
The area under the ROC (Receiver Operating Characteristics) curve, or simply AUC, has recently been proposed as an alternative single-number measure for evaluating the predictive ability of learning algorithms. However, no formal arguments were given as to why AUC should be preferred over accuracy. In this paper, we establish formal criteria for comparing two different measures for learning algorithms, and we show theoretically and empirically that AUC is, in general, a better measure (defined precisely) than accuracy. We then reevaluate well-established claims in machine learning based on accuracy using AUC, and obtain interesting and surprising new results. We also show that AUC is more directly associated with the net profit than accuracy in direct marketing, suggesting that learning algorithms should optimize AUC instead of accuracy in real-world applications. The conclusions drawn in this paper may make a significant impact on machine learning and data mining applications. Note: This paper integrates results in our papers published in IJCAI 2003 [22] and ICDM 2003 [15]. It also includes many new results. For example, the concept of indifferency in Section II-B is new, and Sections III-B, III-C, IV-A, IV-D, and V are all new and unpublished. Index Terms: Evaluation of learning algorithms, AUC vs. accuracy, ROC
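For reference, the AUC discussed here equals the Wilcoxon-Mann-Whitney statistic: the probability that a randomly drawn positive is scored above a randomly drawn negative, with ties counted as one half. A direct (quadratic-time) sketch:

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability that a random positive outranks a random
    negative (Wilcoxon-Mann-Whitney statistic); ties count as 1/2."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

This pairwise formulation is what makes AUC a ranking measure: unlike accuracy, it is unaffected by the choice of decision threshold.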
Learning when data sets are imbalanced and when costs are unequal and unknown
- ICML-2003 Workshop on Learning from Imbalanced Data Sets II
, 2003
Cited by 87 (0 self)
The problem of learning from imbalanced data sets, while not the same problem as learning when misclassification costs are unequal and unknown, can be handled in a similar manner. That is, in both contexts, we can use techniques from ROC analysis to help with classifier design. We present results from two studies in which we dealt with skewed data sets and unequal, but unknown costs of error. We also compare for one domain these results to those obtained by over-sampling and under-sampling the data set. The operations of sampling, moving the decision threshold, and adjusting the cost matrix produced sets of classifiers that fell on the same ROC curve.
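Of the three operations the abstract lists, moving the decision threshold is the simplest to sketch: a single scored model yields a whole family of (FPR, TPR) points as the threshold sweeps from high to low. The scores and labels below are invented for illustration.

```python
def roc_point(scores, labels, threshold):
    """(FPR, TPR) obtained by calling score >= threshold positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    return fp / neg, tp / pos

scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
# Sweeping the threshold traces the ROC curve for this one model,
# from (0, 0) (call nothing positive) to (1, 1) (call everything positive):
curve = [roc_point(scores, labels, t) for t in (1.0, 0.7, 0.5, 0.2, 0.0)]
```

Over-sampling and cost-matrix adjustment move the operating point along this same curve rather than producing a different curve, which is the paper's observation.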
Tree induction vs. logistic regression: A learning-curve analysis
- CeDER Working Paper #IS-01-02, Stern School of Business
, 2001
Cited by 86 (16 self)
Tree induction and logistic regression are two standard, off-the-shelf methods for building models for classification. We present a large-scale experimental comparison of logistic regression and tree induction, assessing classification accuracy and the quality of rankings based on class-membership probabilities. We use a learning-curve analysis to examine the relationship of these measures to the size of the training set. The results of the study show several remarkable things. (1) Contrary to prior observations, logistic regression does not generally outperform tree induction. (2) More specifically, and not surprisingly, logistic regression is better for smaller training sets and tree induction for larger data sets. Importantly, this often holds for training sets drawn from the same domain (i.e., the learning curves cross), so conclusions about induction-algorithm superiority on a given domain must be based on an analysis of the learning curves. (3) Contrary to conventional wisdom, tree induction is effective at producing probability-based rankings, although apparently comparatively less so for a given training-set size than at making classifications. Finally, (4) the domains on which tree induction and logistic regression are ultimately preferable can be characterized surprisingly well by a simple measure of signal-to-noise ratio.
Active Sampling for Class Probability Estimation and Ranking
- Machine Learning
, 2004
Cited by 78 (9 self)
In many cost-sensitive environments class probability estimates are used by decision makers to evaluate the expected utility from a set of alternatives. Supervised learning can be used to build class probability estimates; however, it often is very costly to obtain training data with class labels. Active learning acquires data incrementally, at each phase identifying especially useful additional data for labeling, and can be used to economize on examples needed for learning. We outline the critical features of an active learner and present a sampling-based active learning method for estimating class probabilities and class-based rankings. BOOTSTRAP-LV identifies particularly informative new data for learning based on the variance in probability estimates, and uses weighted sampling to account for a potential example's informative value for the rest of the input space. We show empirically that the method reduces the number of data items that must be obtained and labeled, across a wide variety of domains. We investigate the contribution of the components of the algorithm and show that each provides valuable information to help identify informative examples. We also compare BOOTSTRAP-LV with UNCERTAINTY SAMPLING, an existing active learning method designed to maximize classification accuracy. The results show that BOOTSTRAP-LV uses fewer examples to reach a given estimation accuracy and provides insight into the behavior of the algorithms. Finally, we experiment with another new active sampling algorithm drawing from both UNCERTAINTY SAMPLING and BOOTSTRAP-LV and show that it is significantly more competitive with BOOTSTRAP-LV compared to UNCERTAINTY SAMPLING. The analysis suggests more general implications for improving existing active sampling ...
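The variance-based scoring at the heart of BOOTSTRAP-LV can be sketched generically: fit models on bootstrap resamples of the labeled data and score each unlabeled candidate by the spread of its predicted class probabilities. The weighted-sampling step and the paper's exact estimator are omitted; `fit` here is a placeholder for any procedure mapping a training list to a predict-probability function.

```python
import random

def bootstrap_variance_scores(pool, train, fit, n_boot=20, seed=0):
    """Score each example in `pool` by the variance of its predicted
    class probability across models fit on bootstrap resamples of
    `train`.  High variance marks examples the current data leave
    most uncertain, i.e. the most informative ones to label next."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_boot):
        sample = [train[rng.randrange(len(train))] for _ in range(len(train))]
        models.append(fit(sample))
    scores = []
    for x in pool:
        preds = [m(x) for m in models]
        mean = sum(preds) / len(preds)
        scores.append(sum((p - mean) ** 2 for p in preds) / len(preds))
    return scores
```

An active learner would then label the highest-scoring pool examples (after the weighted-sampling correction described in the abstract) and refit.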
Simple Estimators for Relational Bayesian Classifiers
- In Proceedings of the 3rd IEEE International Conference on Data Mining
, 2003
Cited by 76 (20 self)
This paper evaluates several modifications of the Simple Bayesian Classifier (SBC) to enable estimation and inference over relational data. The resulting Relational Bayesian Classifiers are evaluated on three real-world datasets and compared to a baseline SBC using no relational information.
A Framework for Detection and Measurement of Phishing Attacks
, 2006
Cited by 74 (1 self)
Phishing is a form of identity theft that combines social engineering techniques and sophisticated attack vectors to harvest financial information from unsuspecting consumers. Often a phisher tries to lure her victim into clicking a URL pointing to a rogue page. In this paper, we focus on studying the structure of URLs employed in various phishing attacks. We find that it is often possible to tell whether or not a URL belongs to a phishing attack without requiring any knowledge of the corresponding page data. We describe several features that can be used to distinguish a phishing URL from a benign one. These features are used to model a logistic regression filter that is efficient and has high accuracy. We use this filter to perform thorough measurements on several million URLs and quantify the prevalence of phishing on the Internet today.
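The filter described above can be sketched as feature extraction followed by a logistic score. The specific features and tokens below are illustrative stand-ins, not the authors' feature set, and the weights would come from training.

```python
import math
import re

def url_features(url):
    """Page-independent URL features of the general kind the paper
    studies (this exact set is a hypothetical example): raw-IP host,
    dot count, URL length, suspicious tokens."""
    host = re.sub(r"^https?://", "", url).split("/")[0]
    return [
        1.0 if re.fullmatch(r"[\d.]+", host) else 0.0,   # host is an IP address
        float(url.count(".")),                           # many dots in the URL
        float(len(url)),                                 # unusually long URL
        1.0 if any(t in url.lower()
                   for t in ("login", "verify", "account")) else 0.0,
    ]

def phish_score(url, weights, bias):
    """Logistic-regression filter: sigmoid of a weighted feature sum,
    interpreted as the probability the URL is a phishing URL."""
    z = bias + sum(w * f for w, f in zip(weights, url_features(url)))
    return 1.0 / (1.0 + math.exp(-z))
```

Because no page content is fetched, a filter of this shape can score millions of URLs cheaply, which is what enables the large-scale measurement the abstract describes.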
Cyclic pattern kernels for Predictive graph mining
, 2004
Cited by 73 (2 self)
With applications in biology, the world-wide web, and several other areas, mining of graph-structured objects has received significant interest recently. One of the major research directions in this field is concerned with predictive data mining in graph databases where each instance is represented by a graph. Some of the proposed approaches for this task rely on the excellent classification performance of support vector machines. To control the computational cost of these approaches, the underlying kernel functions are based on frequent patterns. In contrast to these approaches, we propose a kernel function based on a natural set of cyclic and tree patterns independent of their frequency, and discuss its computational aspects. To practically demonstrate the effectiveness of our approach, we use the popular NCI-HIV molecule dataset. Our experimental results show that cyclic pattern kernels can be computed quickly and offer predictive performance superior to recent graph kernels based on frequent patterns.
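A stripped-down illustration of a kernel over cyclic patterns: represent each labeled graph by canonical label sequences of the cycles a DFS finds, and let the kernel count shared patterns. Treat this strictly as a sketch — the paper's kernel is defined over the set of all simple cycles and tree patterns, whereas the DFS cycle basis below depends on traversal order.

```python
def cycle_basis_labels(adj, labels):
    """Canonical label tuples for a DFS cycle basis of an undirected
    graph.  `adj` maps node -> set of neighbours, `labels` node -> label."""
    seen, parent, depth, cycles = set(), {}, {}, set()

    def dfs(u, p, d):
        seen.add(u)
        parent[u], depth[u] = p, d
        for v in adj[u]:
            if v == p:
                continue
            if v in seen:
                if depth[v] < depth[u]:        # back edge: recover the cycle
                    cyc, w = [u], u
                    while w != v:
                        w = parent[w]
                        cyc.append(w)
                    lab = [labels[x] for x in cyc]
                    # canonical form: minimum over rotations and reflection,
                    # so the pattern is independent of start node / direction
                    cycles.add(min(tuple(rot[i:] + rot[:i])
                                   for rot in (lab, lab[::-1])
                                   for i in range(len(rot))))
            else:
                dfs(v, u, d + 1)

    for r in adj:
        if r not in seen:
            dfs(r, None, 0)
    return cycles

def cyclic_pattern_kernel(g1, g2):
    """Kernel value = number of canonical cycle patterns both graphs contain."""
    return len(cycle_basis_labels(*g1) & cycle_basis_labels(*g2))
```

The canonicalization step (minimum over rotations and the reflection) is what makes two occurrences of the same chemical ring compare equal regardless of where the traversal entered it.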
AUC: a statistically consistent and more discriminating measure than accuracy
- In Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI-2003)
, 2003
Cited by 73 (5 self)
Predictive accuracy has been used as the main and often only evaluation criterion for the predictive performance of classification learning algorithms. In recent years, the area under the ROC (Receiver Operating Characteristics) curve, or simply AUC, has been proposed as an alternative single-number measure for evaluating learning algorithms. In this paper, we prove that AUC is a better measure than accuracy. More specifically, we present rigorous definitions of consistency and discriminancy in comparing two evaluation measures for learning algorithms. We then present empirical evaluations and a formal proof to establish that AUC is indeed statistically consistent and more discriminating than accuracy. Our result is significant in that, for the first time, we formally prove that AUC is a better measure than accuracy in the evaluation of learning algorithms.
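The discriminancy claim is easy to see on a toy example (scores invented for illustration): two rankers can tie exactly on accuracy at a fixed threshold while AUC still separates them.

```python
def auc(pos, neg):
    """Probability a random positive outranks a random negative (ties = 1/2)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(pos, neg, t=0.5):
    """Accuracy when scores >= t are called positive."""
    correct = sum(1 for p in pos if p >= t) + sum(1 for n in neg if n < t)
    return correct / (len(pos) + len(neg))

# Both rankers misclassify exactly one example at t = 0.5, so accuracy
# cannot tell them apart ...
a_pos, a_neg = [0.9, 0.4], [0.3, 0.1]    # ... but A still ranks perfectly,
b_pos, b_neg = [0.9, 0.4], [0.45, 0.1]   # ... while B has one ranking inversion.
```

Here both rankers have accuracy 0.75, yet A's AUC is 1.0 and B's is 0.75 — AUC discriminates where accuracy ties, which is the paper's point.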