An Empirical Comparison of Supervised Learning Algorithms
In Proc. 23rd Intl. Conf. Machine Learning (ICML’06), 2006
Cited by 114 (6 self)
A number of supervised learning methods have been introduced in the last decade. Unfortunately, the last comprehensive empirical evaluation of supervised learning was the Statlog Project in the early 90’s. We present a large-scale empirical comparison between ten supervised learning methods: SVMs, neural nets, logistic regression, naive bayes, memory-based learning, random forests, decision trees, bagged trees, boosted trees, and boosted stumps. We also examine the effect that calibrating the models via Platt Scaling and Isotonic Regression has on their performance. An important aspect of our study is the use of a variety of performance criteria to evaluate the learning methods.
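The abstract mentions calibration via Isotonic Regression without describing it. As a rough illustration only (not the authors' code), the following sketch fits a non-decreasing step function to 0/1 labels that are assumed to be already sorted by model score, using the standard pool-adjacent-violators procedure. The function name `isotonic_fit` and the toy labels are hypothetical.

```python
def isotonic_fit(values):
    """Least-squares non-decreasing fit via pool-adjacent-violators (PAV).

    `values` are assumed to be ordered by increasing model score,
    e.g. the 0/1 labels of a held-out calibration set.
    """
    blocks = []  # each block is [sum of values, number of values]
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)  # every member of a block gets the block mean
    return fitted


# Toy labels sorted by score: the (1, 0) inversion is pooled to 0.5.
labels_by_score = [0, 1, 0, 1, 1]
calibrated = isotonic_fit(labels_by_score)
```

The fitted values can be read as calibrated probability estimates for the corresponding score ranks.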
AUC optimization vs. error rate minimization
Advances in Neural Information Processing Systems, 2004
Cited by 110 (2 self)
The area under an ROC curve (AUC) is a criterion used in many applications to measure the quality of a classification algorithm. However, the objective function optimized in most of these algorithms is the error rate and not the AUC value. We give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate. Our results show that the average AUC is monotonically increasing as a function of the classification accuracy, but that the standard deviation for uneven distributions and higher error rates can be large. Thus, algorithms designed to minimize the error rate may not lead to the best possible AUC values. We show that under certain conditions the global function optimized by the RankBoost algorithm is exactly the AUC. We report results of our experiments with RankBoost in several datasets that demonstrate the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC.
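To make the point that a fixed error rate does not pin down the AUC more concrete, here is a minimal sketch. It computes the AUC as the fraction of (positive, negative) pairs that are ranked correctly, which is the usual pairwise definition, and compares two made-up score vectors. The scores, labels, and the 0.5 threshold are illustrative assumptions, not data from the paper.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def error_rate(scores, labels, threshold=0.5):
    """Fraction of examples misclassified when thresholding the scores."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p != y for p, y in zip(preds, labels)) / len(labels)


labels = [1, 1, 1, 0, 0, 0]
scores_a = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]  # misranked positive sits near the threshold
scores_b = [0.9, 0.8, 0.1, 0.6, 0.3, 0.2]  # same mistakes, but ranked far worse

err_a, err_b = error_rate(scores_a, labels), error_rate(scores_b, labels)
auc_a, auc_b = auc(scores_a, labels), auc(scores_b, labels)
```

Both score vectors make the same two mistakes at the threshold, so their error rates coincide, yet the second ranks one positive below every negative and therefore has a lower AUC.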
Active Sampling for Class Probability Estimation and Ranking
Machine Learning, 2004
Cited by 64 (9 self)
In many cost-sensitive environments class probability estimates are used by decision makers to evaluate the expected utility from a set of alternatives. Supervised learning can be used to build class probability estimates; however, it often is very costly to obtain training data with class labels. Active learning acquires data incrementally, at each phase identifying especially useful additional data for labeling, and can be used to economize on examples needed for learning. We outline the critical features of an active learner and present a sampling-based active learning method for estimating class probabilities and class-based rankings. BOOTSTRAP-LV identifies particularly informative new data for learning based on the variance in probability estimates, and uses weighted sampling to account for a potential example's informative value for the rest of the input space. We show empirically that the method reduces the number of data items that must be obtained and labeled, across a wide variety of domains. We investigate the contribution of the components of the algorithm and show that each provides valuable information to help identify informative examples. We also compare BOOTSTRAP-LV with UNCERTAINTY SAMPLING, an existing active learning method designed to maximize classification accuracy. The results show that BOOTSTRAP-LV uses fewer examples to exhibit a certain estimation accuracy and provide insights into the behavior of the algorithms. Finally, we experiment with another new active sampling algorithm drawing from both UNCERTAINTY SAMPLING and BOOTSTRAP-LV and show that it is significantly more competitive with BOOTSTRAP-LV compared to UNCERTAINTY SAMPLING. The analysis suggests more general implications for improving existing active sampling ...
Distribution-based aggregation for relational learning with identifier attributes
Machine Learning, 2004
Cited by 35 (10 self)
Feature construction through aggregation plays an essential role in modeling relational domains with one-to-many relationships between tables. One-to-many relationships lead to bags (multisets) of related entities, from which predictive information must be captured. This paper focuses on aggregation from categorical attributes that can take many values (e.g., object identifiers). We present a novel aggregation method as part of a relational learning system ACORA, that combines the use of vector distance and metadata about the class-conditional distributions of attribute values. We provide a theoretical foundation for this approach deriving a “relational fixed-effect” model within a Bayesian framework, and discuss the implications of identifier aggregation on the expressive power of the induced model. One advantage of using identifier attributes is the circumvention of limitations caused either by missing/unobserved object properties or by independence assumptions. Finally, we show empirically that the novel aggregators can generalize in the presence of identifier (and other high-dimensional) attributes, and also explore the limitations of the applicability of the methods.
Learning ensembles from bites: A scalable and accurate approach
Cited by 35 (6 self)
Bagging and boosting are two popular ensemble methods that typically achieve better accuracy than a single classifier. These techniques have limitations on massive datasets, as the size of the dataset can be a bottleneck. Voting many classifiers built on small subsets of data ("pasting small votes") is a promising approach for learning from massive datasets, one that can utilize the power of boosting and bagging. We propose a framework for building hundreds or thousands of such classifiers on small subsets of data in a distributed environment. Experiments show this approach is fast, accurate, and scalable.
Handling missing values when applying classification models
Journal of Machine Learning Research, 2007
Cited by 18 (4 self)
Much work has studied the effect of different treatments of missing values on model induction, but little work has analyzed treatments for the common case of missing values at prediction time. This paper first compares several different methods (predictive value imputation, the distribution-based imputation used by C4.5, and using reduced models) for applying classification trees to instances with missing values (and also shows evidence that the results generalize to bagged trees and to logistic regression). The results show that for the two most popular treatments, each is preferable under different conditions. Strikingly, the reduced-models approach, seldom mentioned or used, consistently outperforms the other two methods, sometimes by a large margin. The lack of attention to reduced modeling may be due in part to its (perceived) expense in terms of computation or storage. Therefore, we then introduce and evaluate alternative, hybrid approaches that allow users to balance between more accurate but computationally expensive reduced modeling and the other, less accurate but less computationally expensive treatments. The results show that the hybrid methods can scale gracefully to the amount of investment in computation/storage, and that they outperform imputation even for small investments.
ROC Confidence Bands: An Empirical Evaluation
In: Proceedings of the Twenty-Second International Conference on Machine Learning, 2005
Cited by 18 (1 self)
This paper is about constructing confidence bands around ROC curves. We first introduce to the machine learning community three band-generating methods from the medical field, and evaluate how well they perform. Such confidence bands represent the region where the "true" ROC curve is expected to reside, with the designated confidence level. To assess the containment of the bands we begin with a synthetic world where we know the true ROC curve, specifically, where the class-conditional model scores are normally distributed. The only method that attains reasonable containment out of the box produces nonparametric, "fixed-width" bands (FWBs). Next we move to a context more appropriate for machine learning evaluations: bands that with a certain confidence level will bound the performance of the model on future data. We introduce a correction to account for the larger uncertainty, and the widened FWBs continue to have reasonable containment. Finally, we assess the bands on 10 relatively large benchmark data sets. We conclude by recommending these FWBs, noting that being nonparametric they are especially attractive for machine learning studies, where the score distributions (1) clearly are not normal, and (2) even for the same data set vary substantially from learning method to learning method.
Model Selection via the AUC
In Proceedings of the 21st International Conference on Machine Learning, 2004
Cited by 17 (0 self)
We present a statistical analysis of the AUC as an evaluation criterion for classification scoring models. First, we consider significance tests for the difference between AUC scores of two algorithms on the same test set. We derive exact moments under simplifying assumptions and use them to examine approximate practical methods from the literature. We then compare AUC to empirical misclassification error when the prediction goal is to minimize future error rate. We show that the AUC may be preferable to empirical error even in this case and discuss the tradeoff between approximation error and estimation error underlying this phenomenon.
Distributed Learning with Bagging-Like Performance
Pattern Recognition Letters, 2003
Cited by 14 (7 self)
Bagging forms a committee of classifiers by bootstrap aggregation of training sets from a pool of training data. A simple alternative to bagging is to partition the data into disjoint subsets. Experiments with decision tree and neural network classifiers on various datasets show that, given the same size partitions and bags, disjoint partitions result in performance equivalent to, or better than, bootstrap aggregates (bags). Many applications (e.g., protein structure prediction) involve use of datasets that are too large to handle in the memory of the typical computer. Hence, bagging with samples the size of the data is impractical. Our results indicate that, in such applications, the simple approach of creating a committee of n classifiers from disjoint partitions each of size 1/n (which will be memory resident during learning) in a distributed way results in a classifier which has a bagging-like performance gain. The use of distributed disjoint partitions in learning is significantly less complex and faster than bagging.
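The disjoint-partition committee described in this abstract can be sketched roughly as follows. This is an illustrative toy, not the paper's setup: the paper uses decision trees and neural networks, whereas the base learner here is a hypothetical one-dimensional nearest-class-mean classifier, and the synthetic data, seeds, and function names are all assumptions.

```python
import random
from collections import Counter


def disjoint_partitions(data, n, seed=0):
    """Shuffle the data and split it into n disjoint, near-equal subsets."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n] for i in range(n)]


def train_centroid(part):
    """Hypothetical base learner: the per-class mean of a 1-D feature."""
    sums, counts = Counter(), Counter()
    for x, y in part:
        sums[y] += x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in counts}


def predict_centroid(model, x):
    """Predict the class whose mean is closest to x."""
    return min(model, key=lambda y: abs(x - model[y]))


def committee_predict(models, x):
    """Majority vote over the classifiers trained on each partition."""
    votes = Counter(predict_centroid(m, x) for m in models)
    return votes.most_common(1)[0][0]


# Synthetic two-class data: class 0 centered at 0.0, class 1 centered at 3.0.
rng = random.Random(1)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(300)]
data += [(rng.gauss(3.0, 1.0), 1) for _ in range(300)]

parts = disjoint_partitions(data, n=5)       # each part holds about 1/n of the data
models = [train_centroid(p) for p in parts]  # could be trained on separate machines
```

Because each classifier only ever sees its own partition, training can proceed independently per partition, which is the property the abstract relies on for large datasets.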