A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection
International Joint Conference on Artificial Intelligence, 1995
Cited by 749 (12 self)
We review accuracy estimation methods and compare the two most common methods: cross-validation and bootstrap. Recent experimental results on artificial data and theoretical results in restricted settings have shown that for selecting a good classifier from a set of classifiers (model selection), ten-fold cross-validation may be better than the more expensive leave-one-out cross-validation. We report on a large-scale experiment (over half a million runs of C4.5 and a Naive-Bayes algorithm) to estimate the effects of different parameters on these algorithms on real-world datasets. For cross-validation, we vary the number of folds and whether the folds are stratified or not; for bootstrap, we vary the number of bootstrap samples. Our results indicate that for real-world datasets similar to ours, the best method to use for model selection is ten-fold stratified cross-validation, even if computation power allows using more folds.
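As a rough illustration of the stratified k-fold procedure the abstract refers to, the sketch below assigns examples to folds so that each fold keeps roughly the class proportions of the full dataset, then averages held-out accuracy. The names `stratified_folds`, `cross_val_accuracy`, and `train_and_predict` are hypothetical and do not come from the paper; this is a minimal sketch, not the authors' experimental code.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Split example indices into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0  # keeps dealing round-robin across class boundaries
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[(offset + pos) % k].append(idx)
        offset += len(indices)
    return folds

def cross_val_accuracy(train_and_predict, X, y, k=10, seed=0):
    """Estimate accuracy by training on k-1 folds and testing on the held-out fold.

    train_and_predict(train_X, train_y, test_X) is an assumed callable that
    returns one predicted label per test example.
    """
    correct = 0
    for test_idx in stratified_folds(y, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        preds = train_and_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        correct += sum(p == y[i] for p, i in zip(preds, test_idx))
    return correct / len(y)
```

Setting `k` to the number of examples would give leave-one-out cross-validation, at which point stratification no longer has any effect.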
The Lack of A Priori Distinctions Between Learning Algorithms
1996
Cited by 123 (5 self)
This is the first of two papers that use off-training set (OTS) error to investigate the assumption-free relationship between learning algorithms. This first paper discusses the senses in which there are no a priori distinctions between learning algorithms. (The second paper discusses the senses in which there are such distinctions.) In this first paper it is shown, loosely speaking, that for any two algorithms A and B, there are "as many" targets (or priors over targets) for which A has lower expected OTS error than B as vice versa, for loss functions like zero-one loss. In particular, this is true if A is cross-validation and B is "anti-cross-validation" (choose the learning algorithm with largest cross-validation error). This paper ends with a discussion of the implications of these results for computational learning theory. It is shown that one cannot say: if empirical misclassification rate is low; the Vapnik-Chervonenkis dimension of your generalizer is small; and the trainin...
The Power of Decision Tables
Proceedings of the European Conference on Machine Learning, 1995
Cited by 100 (5 self)
We evaluate the power of decision tables as a hypothesis space for supervised learning algorithms. Decision tables are one of the simplest hypothesis spaces possible, and usually they are easy to understand. Experimental results show that on artificial and real-world domains containing only discrete features, IDTM, an algorithm inducing decision tables, can sometimes outperform state-of-the-art algorithms such as C4.5. Surprisingly, performance is quite good on some datasets with continuous features, indicating that many datasets used in machine learning either do not require these features, or that these features have few values. We also describe an incremental method for performing cross-validation that is applicable to incremental learning algorithms including IDTM. Using incremental cross-validation, it is possible to cross-validate a given dataset and IDTM in time that is linear in the number of instances, the number of features, and the number of label values. The time for incre...
Extracting Comprehensible Models from Trained Neural Networks
1996
Cited by 69 (4 self)
To Mom, Dad, and Susan, for their support and encouragement.
Automatic Parameter Selection by Minimizing Estimated Error
In Proceedings of the Twelfth International Conference on Machine Learning, 1995
Cited by 48 (4 self)
We address the problem of finding the parameter settings that will result in optimal performance of a given learning algorithm using a particular dataset as training data. We describe a "wrapper" method, considering determination of the best parameters as a discrete function optimization problem. The method uses best-first search and cross-validation to wrap around the basic induction algorithm: the search explores the space of parameter values, running the basic algorithm many times on training and holdout sets produced by cross-validation to get an estimate of the expected error of each parameter setting. Thus, the final selected parameter settings are tuned for the specific induction algorithm and dataset being studied. We report experiments with this method on 33 datasets selected from the UCI and StatLog collections using C4.5 as the basic induction algorithm. At a 90% confidence level, our method improves the performance of C4.5 on nine domains, degrades performance on one, and is...
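The wrapper idea described above can be sketched as a best-first search over discrete parameter settings, where each candidate is scored by an error estimate. Everything named here is an assumption for illustration: `best_first_search`, the `neighbors` and `estimate_error` callables, and the `patience` stopping rule are hypothetical, and the abstract does not specify the paper's actual operators or termination condition. In the paper's setting, `estimate_error` would be a cross-validation estimate for the wrapped induction algorithm.

```python
import heapq
from itertools import count

def best_first_search(start, neighbors, estimate_error, patience=5):
    """Best-first search over hashable parameter settings.

    Expands the lowest-error unexpanded setting first and stops after
    `patience` consecutive expansions that fail to improve on the best
    setting found so far (an assumed stopping rule).
    """
    tie = count()  # tie-breaker so the heap never compares settings directly
    start_err = estimate_error(start)
    open_heap = [(start_err, next(tie), start)]
    seen = {start}
    best, best_err = start, start_err
    stale = 0
    while open_heap and stale < patience:
        _, _, node = heapq.heappop(open_heap)
        improved = False
        for candidate in neighbors(node):
            if candidate in seen:
                continue
            seen.add(candidate)
            err = estimate_error(candidate)
            heapq.heappush(open_heap, (err, next(tie), candidate))
            if err < best_err:
                best, best_err = candidate, err
                improved = True
        stale = 0 if improved else stale + 1
    return best, best_err
```

Because every evaluation of `estimate_error` may involve many training runs, the `seen` set matters in practice: it ensures no parameter setting is scored twice.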
The Supervised Learning No-Free-Lunch Theorems
In Proc. 6th Online World Conference on Soft Computing in Industrial Applications, 2001
Cited by 25 (0 self)
This paper reviews the supervised learning versions of the no-free-lunch theorems in a simplified form. It also discusses the significance of those theorems, and their relation to other aspects of supervised learning.
Finding the Most Interesting Patterns in a Database Quickly by Using Sequential Sampling
Journal of Machine Learning Research, 2001
Cited by 24 (4 self)
Many discovery problems, e.g., subgroup or association rule discovery, can naturally be cast as n-best hypotheses problems where the goal is to find the n hypotheses from a given hypothesis space that score best according to a certain utility function. We present a sampling algorithm that solves this problem by issuing a small number of database queries while guaranteeing precise bounds on confidence and quality of solutions. Known sampling approaches have treated single hypothesis selection problems, assuming that the utility is the average (over the examples) of some function, which is not the case for many frequently used utility functions. We show that our algorithm works for all utilities that can be estimated with bounded error. We provide these error bounds and resulting worst-case sample bounds for some of the most frequently used utilities, and prove that there is no sampling algorithm for a popular class of utility functions that cannot be estimated with bounded error. The algorithm is sequential in the sense that it starts to return (or discard) hypotheses that already seem to be particularly good (or bad) after a few examples. Thus, the algorithm is almost always faster than its worst-case bounds.
On Bias Plus Variance
Neural Computation, 1996
Cited by 22 (10 self)
This paper presents a Bayesian additive "correction" to the familiar quadratic loss bias-plus-variance formula. It then discusses some other loss-function-specific aspects of supervised learning. It ends by presenting a version of the bias-plus-variance formula appropriate for log loss, and then the Bayesian additive correction to that formula. Both the quadratic loss and log loss correction terms are a covariance, between the learning algorithm and the posterior distribution over targets. Accordingly, in the context in which those terms apply, there is not a "bias-variance tradeoff", or a "bias-variance dilemma", as one often hears. Rather there is a bias-variance-covariance tradeoff. I. INTRODUCTION: The bias-plus-variance formula [Geman et al. 1992] is a powerful tool for analyzing supervised learning scenarios that have quadratic loss functions. In this paper an additive "Bayesian" correction to the formula is presented, appropriate when the target is not fixed. Next is a bri...
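For orientation, the familiar quadratic-loss formula that the abstract takes as its starting point can be written as follows. The notation is an assumption chosen for this note rather than taken from the paper: $f$ is a fixed target function, $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and $h_D$ is the hypothesis learned from training set $D$. The covariance correction that the paper adds is not spelled out in this listing, so it is not reproduced here.

```latex
\[
\mathbb{E}_{D,\varepsilon}\!\left[\bigl(y - h_D(x)\bigr)^2\right]
  = \underbrace{\sigma^2}_{\text{noise}}
  + \underbrace{\bigl(\mathbb{E}_D[h_D(x)] - f(x)\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\bigl(h_D(x) - \mathbb{E}_D[h_D(x)]\bigr)^2\right]}_{\text{variance}}
\]
```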
Off-Training Set Error and a Priori Distinctions Between . . .
1995
Cited by 17 (3 self)
This paper uses off-training set (OTS) error to investigate the assumption-free relationship between learning algorithms. It is shown, loosely speaking, that for any two algorithms A and B, there are as many targets (or priors over targets) for which A has lower expected OTS error than B as vice versa, for loss functions like zero-one loss. In particular, this is true if A is cross-validation and B is "anti-cross-validation" (choose the generalizer with largest cross-validation error). On the other hand, for loss functions other than zero-one (e.g., quadratic loss), there are a priori distinctions between algorithms. However, even for such loss functions, any algorithm is equivalent on average to its "randomized" version, and in this respect still has no first-principles justification in terms of average error. Nonetheless, it may be that (for example) cross-validation has better minimax properties than anti-cross-validation, even for zero-one loss. This paper also analyzes averages over hypotheses rather than targets. Such analyses hold for all possible priors. Accordingly they prove, as a particular example, that cross-validation cannot be justified as a Bayesian procedure. In fact, for a very natural restriction of the class of learning algorithms, one should use anti-cross-validation rather than cross-validation (!). This paper ends with a discussion of the implications of these results for computational learning theory. It is shown that one cannot say: if empirical misclassification rate is low; the VC dimension of your generalizer is small; and the training set is large, then with high probability your OTS error is small. Other implications for "membership queries" algorithms and "punting" algorithms are also discussed.
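A toy enumeration can make the "as many targets" claim concrete for zero-one loss. The sketch below is an illustration only, under simplifying assumptions that are not the paper's general setting: a tiny input space, noise-free boolean targets, a fixed set of training inputs, and a uniform average over all targets. The learners and function names are hypothetical. Under these assumptions, a majority-vote learner and its deliberately contrary counterpart have the same average OTS error.

```python
from itertools import product

def average_ots_error(learner, n_points=5, n_train=3):
    """Average zero-one OTS error over every boolean target on n_points inputs.

    Inputs 0..n_train-1 form the training set; the remaining inputs are the
    off-training-set points on which error is measured.
    """
    train_x = list(range(n_train))
    test_x = list(range(n_train, n_points))
    targets = list(product([0, 1], repeat=n_points))
    total = 0.0
    for target in targets:
        train_y = [target[x] for x in train_x]
        predict = learner(train_x, train_y)
        errors = sum(predict(x) != target[x] for x in test_x)
        total += errors / len(test_x)
    return total / len(targets)

def majority_learner(train_x, train_y):
    """Predict the majority training label everywhere."""
    label = int(sum(train_y) * 2 > len(train_y))
    return lambda x: label

def anti_majority_learner(train_x, train_y):
    """Predict the opposite of the majority training label everywhere."""
    label = 1 - int(sum(train_y) * 2 > len(train_y))
    return lambda x: label

print(average_ots_error(majority_learner))
print(average_ots_error(anti_majority_learner))
```

Both calls average over the same 32 targets; for any fixed training labels, the off-training-set labels range over all combinations, which is why neither learner can come out ahead under this uniform average.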