Bagging Predictors
 Machine Learning
, 1996
Cited by 2479 (1 self)
Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions when predicting a numerical outcome and does a plurality vote when predicting a class. The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets. Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy. 1. Introduction A learning set $\mathcal{L}$ consists of data $\{(y_n, x_n),\ n = 1, \ldots, N\}$ where the $y$'s are either class labels or a numerical response. We have a procedure for using this learning set to form a predictor $\varphi(x, \mathcal{L})$; if the input is $x$ we ...
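The procedure this abstract describes (bootstrap replicates of the learning set, one predictor per replicate, then averaging or a plurality vote) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function names, the default of 25 replicates, and the one-split regression stump used as the unstable base learner are all assumptions made for the example.

```python
import random
from collections import Counter


def bootstrap_replicate(learning_set, rng):
    """Draw N cases with replacement from a learning set of N (y, x) pairs."""
    n = len(learning_set)
    return [learning_set[rng.randrange(n)] for _ in range(n)]


def fit_stump(learning_set):
    """Hypothetical base learner: a one-split regression stump on scalar x."""
    mean_all = sum(y for y, _ in learning_set) / len(learning_set)
    xs = sorted({x for _, x in learning_set})
    best = (float("inf"), None, mean_all, mean_all)
    # Try every split "x <= t" where t is a distinct x value below the maximum.
    for t in xs[:-1]:
        left = [y for y, x in learning_set if x <= t]
        right = [y for y, x in learning_set if x > t]
        mean_left = sum(left) / len(left)
        mean_right = sum(right) / len(right)
        sse = sum((y - mean_left) ** 2 for y in left)
        sse += sum((y - mean_right) ** 2 for y in right)
        if sse < best[0]:
            best = (sse, t, mean_left, mean_right)
    _, t, mean_left, mean_right = best
    if t is None:  # only one distinct x value: fall back to the overall mean
        return lambda x: mean_all
    return lambda x: mean_left if x <= t else mean_right


def bag(learning_set, fit, n_replicates=25, seed=0):
    """Fit one predictor per bootstrap replicate (n_replicates is an assumed default)."""
    rng = random.Random(seed)
    return [fit(bootstrap_replicate(learning_set, rng)) for _ in range(n_replicates)]


def predict_average(predictors, x):
    """Aggregate a numerical outcome by averaging over the versions."""
    return sum(p(x) for p in predictors) / len(predictors)


def predict_plurality(predictors, x):
    """Aggregate a class prediction by plurality vote over the versions."""
    return Counter(p(x) for p in predictors).most_common(1)[0][0]


# Toy learning set of (y, x) pairs with a step between x = 4 and x = 5.
toy_set = [(0.0, float(x)) for x in range(5)] + [(1.0, float(x)) for x in range(5, 10)]
bagged = bag(toy_set, fit_stump)
print(predict_average(bagged, 4.5))
```

The stump is chosen only because its split point moves when the learning set is perturbed, which is the kind of instability the abstract identifies as the condition under which bagging helps.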
An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants
 MACHINE LEARNING
, 1999
Cited by 539 (2 self)
Methods for voting classification algorithms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classifiers for artificial and real-world datasets. We review these algorithms and describe a large empirical study comparing several variants in conjunction with a decision tree inducer (three variants) and a Naive-Bayes inducer. The purpose of the study is to improve our understanding of why and when these algorithms, which use perturbation, reweighting, and combination techniques, affect classification error. We provide a bias and variance decomposition of the error to show how different methods and variants influence these two terms. This allowed us to determine that Bagging reduced variance of unstable methods, while boosting methods (AdaBoost and Arc-x4) reduced both the bias and variance of unstable methods but increased the variance for Naive-Bayes, which was very stable. We observed that Arc-x4 behaves differently than AdaBoost if reweighting is used instead of resampling, indicating a fundamental difference. Voting variants, some of which are introduced in this paper, include: pruning versus no pruning, use of probabilistic estimates, weight perturbations (Wagging), and backfitting of data. We found that Bagging improves when probabilistic estimates in conjunction with no-pruning are used, as well as when the data was backfit. We measure tree sizes and show an interesting positive correlation between the increase in the average tree size in AdaBoost trials and its success in reducing the error. We compare the mean-squared error of voting methods to non-voting methods and show that the voting methods lead to large and significant reductions in the mean-squared errors. Practical problems that arise in implementing boosting algorithms are explored, including numerical instabilities and underflows. We use scatterplots that graphically show how AdaBoost reweights instances, emphasizing not only "hard" areas but also outliers and noise.
Bias plus variance decomposition for zero-one loss functions
 In Machine Learning: Proceedings of the Thirteenth International Conference
, 1996
Cited by 172 (4 self)
We present a bias-variance decomposition of expected misclassification rate, the most commonly used loss function in supervised classification learning. The bias-variance decomposition for quadratic loss functions is well known and serves as an important tool for analyzing learning algorithms, yet no decomposition was offered for the more commonly used zero-one (misclassification) loss functions until the recent work of Kong & Dietterich (1995) and Breiman (1996). Their decomposition suffers from some major shortcomings though (e.g., potentially negative variance), which our decomposition avoids. We show that, in practice, the naive frequency-based estimation of the decomposition terms is by itself biased and show how to correct for this bias. We illustrate the decomposition on various algorithms and datasets from the UCI repository.
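The abstract does not state the decomposition itself. As a hedged illustration, the Python sketch below computes one decomposition of zero-one loss at a single input x into noise, squared bias, and variance terms, each built from class-probability vectors so that none can be negative. The exact form of the three terms is an assumption taken from how this decomposition is usually presented, not from the text above, and the function and argument names are hypothetical. It also does not include the correction for estimation bias that the abstract mentions.

```python
def zero_one_decomposition(p_target, p_hypothesis):
    """Split expected zero-one loss at one input x into three terms.

    p_target[y]     : assumed P(true label = y | x)
    p_hypothesis[y] : assumed P(predicted label = y | x), taken over training sets
    Returns (noise, bias_squared, variance); the form of each term is an
    assumption for illustration, not quoted from the abstract.
    """
    classes = set(p_target) | set(p_hypothesis)
    f = {y: p_target.get(y, 0.0) for y in classes}
    h = {y: p_hypothesis.get(y, 0.0) for y in classes}
    noise = 0.5 * (1.0 - sum(v * v for v in f.values()))
    bias_squared = 0.5 * sum((f[y] - h[y]) ** 2 for y in classes)
    variance = 0.5 * (1.0 - sum(v * v for v in h.values()))
    return noise, bias_squared, variance


def expected_zero_one_loss(p_target, p_hypothesis):
    """P(prediction != label | x) when prediction and label are independent given x."""
    classes = set(p_target) | set(p_hypothesis)
    agree = sum(p_target.get(y, 0.0) * p_hypothesis.get(y, 0.0) for y in classes)
    return 1.0 - agree


# A deterministic target ("a") and a learner that predicts "a" 70% of the time.
terms = zero_one_decomposition({"a": 1.0}, {"a": 0.7, "b": 0.3})
print(terms, sum(terms), expected_zero_one_loss({"a": 1.0}, {"a": 0.7, "b": 0.3}))
```

Under these definitions the three terms sum to the expected zero-one loss at x, which follows by expanding the squared difference in the bias term.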
Error Reduction through Learning Multiple Descriptions
, 1996
Cited by 126 (3 self)
Learning multiple descriptions for each class in the data has been shown to reduce generalization error but the amount of error reduction varies greatly from domain to domain. This paper presents a novel empirical analysis that helps to understand this variation. Our hypothesis is that the amount of error reduction is linked to the "degree to which the descriptions for a class make errors in a correlated manner." We present a precise and novel definition for this notion and use twenty-nine data sets to show that the amount of observed error reduction is negatively correlated with the degree to which the descriptions make errors in a correlated manner. We empirically show that it is possible to learn descriptions that make less correlated errors in domains in which many ties in the search evaluation measure (e.g. information gain) are experienced during learning. The paper also presents results that help to understand when and why multiple descriptions are a help (irrelevant attribute...
Wrappers For Performance Enhancement And Oblivious Decision Graphs
, 1995
Cited by 107 (8 self)
In this doctoral dissertation, we study three basic problems in machine learning and two new hypothesis spaces with corresponding learning algorithms. The problems we investigate are: accuracy estimation, feature subset selection, and parameter tuning. The latter two problems are related and are studied under the wrapper approach. The hypothesis spaces we investigate are: decision tables with a default majority rule (DTMs) and oblivious read-once decision graphs (OODGs).
Meta-Learning in Distributed Data Mining Systems: Issues and Approaches
 Advances of Distributed Data Mining
, 2000
Cited by 80 (0 self)
Data mining systems aim to discover patterns and extract useful information from facts recorded in databases. A widely adopted approach to this objective is to apply various machine learning algorithms to compute descriptive models of the available data. Here, we explore one of the main challenges in this research area, the development of techniques that scale up to large and possibly physically distributed databases. Meta-learning is a technique that seeks to compute higher-level classifiers (or classification models), called meta-classifiers, that integrate in some principled fashion multiple classifiers computed separately over different databases. This study describes meta-learning and presents the JAM system (Java Agents for Meta-learning), an agent-based meta-learning system for large-scale data mining applications. Specifically, it identifies and addresses several important desiderata for distributed data mining systems that stem from their additional complexity co...
Improving Simple Bayes
, 1997
Cited by 59 (1 self)
The simple Bayesian classifier (SBC), sometimes called Naive-Bayes, is built based on a conditional independence model of each attribute given the class. The model was previously shown to be surprisingly robust to obvious violations of this independence assumption, yielding accurate classification models even when there are clear conditional dependencies. We examine different approaches for handling unknowns and zero counts when estimating probabilities. Large scale experiments on 37 datasets were conducted to determine the effects of these approaches and several interesting insights are given, including a new variant of the Laplace estimator that outperforms other methods for dealing with zero counts. Using the bias-variance decomposition [15, 10], we show that while the SBC has performed well on common benchmark datasets, its accuracy will not scale up as the dataset sizes grow. Even with these limitations in mind, the SBC can serve as an excellent tool for initial exp...
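To make the zero-count problem concrete, the Python sketch below shows the textbook Laplace (additive) correction for a conditional probability estimate. It is not the new variant the abstract refers to, whose details are not given here; the function name, the `alpha` parameter, and the toy counts are assumptions for illustration.

```python
def laplace_estimate(count, total, n_values, alpha=1.0):
    """Smoothed estimate of P(attribute = v | class).

    count    : number of class instances where the attribute takes value v
    total    : number of class instances
    n_values : number of distinct values the attribute can take
    alpha    : pseudo-count added per value (1.0 gives the classic Laplace correction)
    """
    return (count + alpha) / (total + alpha * n_values)


# Toy counts: value "red" never appears with this class in 10 instances.
counts = {"red": 0, "green": 7, "blue": 3}
total = sum(counts.values())

raw = {v: c / total for v, c in counts.items()}
smoothed = {v: laplace_estimate(c, total, len(counts)) for v, c in counts.items()}

print(raw["red"], smoothed["red"])
```

Without the correction, a single zero count forces the product of per-attribute probabilities for that class to zero regardless of the other attributes; with it, every value keeps a small positive probability and the estimates still sum to one.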
Using Correspondence Analysis to Combine Classifiers
 Machine Learning
, 1998
Cited by 51 (0 self)
Several effective methods have been developed recently for improving predictive performance by generating and combining multiple learned models. The general approach is to create a set of learned models either by applying an algorithm repeatedly to different versions of the training data, or by applying different learning algorithms to the same data. The predictions of the models are then combined according to a voting scheme. This paper focuses on the task of combining the predictions of a set of learned models. The method described uses the strategies of stacking and Correspondence Analysis to model the relationship between the learning examples and their classification by a collection of learned models. A nearest neighbor method is then applied within the resulting representation to classify previously unseen examples. The new algorithm does not perform worse than, and frequently performs significantly better than other combining techniques on a suite of data sets. Keywords: Clas...
Option Decision Trees with Majority Votes
 ICML97
, 1997
Cited by 41 (8 self)
We describe an experimental study of Option Decision Trees with majority votes. Option Decision Trees generalize regular decision trees by allowing option nodes in addition to decision nodes; such nodes allow for several possible tests to be conducted instead of the commonly used single test. Our goal was to explore when option nodes are most useful and to control the growth of the trees so that additional complexity of little utility is limited. Option Decision Trees can reduce the error of decision trees on real-world problems by combining multiple options, with the motivation similar to that of voting algorithms that learn multiple models and combine the predictions. However, unlike voting algorithms, an Option Decision Tree provides a single structured classifier (one decision tree), which can be interpreted more easily by humans. Our results show that for the tested problems, we can achieve significant reduction in error rates for trees restricted to two levels of o...