An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants
 MACHINE LEARNING
, 1999
Cited by 539 (2 self)
Methods for voting classification algorithms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classifiers for artificial and real-world datasets. We review these algorithms and describe a large empirical study comparing several variants in conjunction with a decision tree inducer (three variants) and a Naive-Bayes inducer.
The purpose of the study is to improve our understanding of why and
when these algorithms, which use perturbation, reweighting, and
combination techniques, affect classification error. We provide a
bias and variance decomposition of the error to show how different
methods and variants influence these two terms. This allowed us to
determine that Bagging reduced variance of unstable methods, while
boosting methods (AdaBoost and Arc-x4) reduced both the bias and
variance of unstable methods but increased the variance for Naive-Bayes,
which was very stable. We observed that Arc-x4 behaves differently
than AdaBoost if reweighting is used instead of resampling,
indicating a fundamental difference. Voting variants, some of which
are introduced in this paper, include: pruning versus no pruning,
use of probabilistic estimates, weight perturbations (Wagging), and
backfitting of data. We found that Bagging improves when
probabilistic estimates in conjunction with no-pruning are used, as
well as when the data was backfit. We measure tree sizes and show
an interesting positive correlation between the increase in the
average tree size in AdaBoost trials and its success in reducing the
error. We compare the mean-squared error of voting methods to
non-voting methods and show that the voting methods lead to large
and significant reductions in the mean-squared errors. Practical
problems that arise in implementing boosting algorithms are
explored, including numerical instabilities and underflows. We use
scatterplots that graphically show how AdaBoost reweights instances,
emphasizing not only "hard" areas but also outliers and noise.
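The reweighting step mentioned in this abstract can be illustrated with a minimal sketch of one AdaBoost.M1-style weight update. The function name `adaboost_reweight` and its argument layout are assumptions made for illustration and are not taken from the paper; the sketch only shows how misclassified instances gain relative weight.

```python
def adaboost_reweight(weights, correct):
    """One AdaBoost.M1-style reweighting step (illustrative sketch).

    weights: current instance weights (any positive scale).
    correct: booleans, True where the current classifier was right.
    Returns (new_weights, beta), with new_weights summing to 1.
    """
    total = sum(weights)
    w = [x / total for x in weights]          # normalize to a distribution
    eps = sum(wi for wi, ok in zip(w, correct) if not ok)  # weighted error
    if not 0.0 < eps < 0.5:
        # The update is only defined for a weak learner that errs on some
        # instances but does better than chance.
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    beta = eps / (1.0 - eps)
    # Shrink the weight of correctly classified instances, then renormalize.
    new_w = [wi * beta if ok else wi for wi, ok in zip(w, correct)]
    z = sum(new_w)
    return [x / z for x in new_w], beta


new_weights, beta = adaboost_reweight([0.25] * 4, [True, True, True, False])
```

After this update the misclassified instances carry half of the total weight, which is why persistent outliers and noisy instances can come to dominate later rounds. Products of many small `beta` factors are also one place where the underflow problems noted above could plausibly arise if weights are not renormalized.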
Supervised and unsupervised discretization of continuous features
 in A. Prieditis & S. Russell, eds, Machine Learning: Proceedings of the Twelfth International Conference
, 1995
Cited by 408 (10 self)
Many supervised machine learning algorithms require a discrete feature space. In this paper, we review previous work on continuous feature discretization, identify defining characteristics of the methods, and conduct an empirical evaluation of several methods. We compare binning, an unsupervised discretization method, to entropy-based and purity-based methods, which are supervised algorithms. We found that the performance of the Naive-Bayes algorithm significantly improved when features were discretized using an entropy-based method. In fact, over the 16 tested datasets, the discretized version of Naive-Bayes slightly outperformed C4.5 on average. We also show that in some cases, the performance of the C4.5 induction algorithm significantly improved if features were discretized in advance; in our experiments, the performance never significantly degraded, an interesting phenomenon considering the fact that C4.5 is capable of locally discretizing features.
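The contrast between unsupervised binning and a supervised entropy-based cut can be sketched as follows. This is a simplified illustration, not the full methods evaluated in the paper: `equal_width_cuts` and `best_entropy_cut` are hypothetical helper names, and the entropy variant picks only a single binary cut with no recursion or stopping criterion.

```python
import math
from collections import Counter


def entropy(labels):
    """Class entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def equal_width_cuts(values, k):
    """Unsupervised binning: k equal-width bins, ignoring class labels."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def best_entropy_cut(values, labels):
    """Supervised: the single cut minimizing weighted class entropy."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    best_cut, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut is possible between equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if score < best_score:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_cut


values = [1, 2, 3, 4, 5, 20]
labels = [0, 0, 1, 1, 1, 1]
width_cuts = equal_width_cuts(values, 2)
entropy_cut = best_entropy_cut(values, labels)
```

On this toy data the equal-width cut is pulled toward the outlying value 20, while the entropy-based cut falls exactly where the class changes.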
An analysis of Bayesian classifiers
 IN PROCEEDINGS OF THE TENTH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE
, 1992
Cited by 333 (17 self)
In this paper we present an average-case analysis of the Bayesian classifier, a simple induction algorithm that fares remarkably well on many learning tasks. Our analysis assumes a monotone conjunctive target concept, and independent, noise-free Boolean attributes. We calculate the probability that the algorithm will induce an arbitrary pair of concept descriptions and then use this to compute the probability of correct classification over the instance space. The analysis takes into account the number of training instances, the number of attributes, the distribution of these attributes, and the level of class noise. We also explore the behavioral implications of the analysis by presenting
General and Efficient Multisplitting of Numerical Attributes
, 1999
Cited by 40 (7 self)
Often in supervised learning numerical attributes require special treatment and do not fit the learning scheme as well as one could hope. Nevertheless, they are common in practical tasks and, therefore, need to be taken into account. We characterize the well-behavedness of an evaluation function, a property that guarantees the optimal multipartition of an arbitrary numerical domain to be defined on boundary points. Well-behavedness reduces the number of candidate cut points that need to be examined in multisplitting numerical attributes. Many commonly used attribute evaluation functions possess this property; we demonstrate that the cumulative functions Information Gain and Training Set Error as well as the non-cumulative functions Gain Ratio and Normalized Distance Measure are all well-behaved. We also devise a method of finding optimal multisplits efficiently by examining the minimum number of boundary point combinations that is required to produce partitions which are optimal wit...
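To illustrate why restricting attention to boundary points shrinks the search, here is a minimal sketch that lists candidate cuts only where the class makeup changes between adjacent attribute values. The function `boundary_cut_points` and its exact boundary test are assumptions for illustration; the paper's own definition may differ in detail.

```python
from itertools import groupby


def boundary_cut_points(values, labels):
    """Midpoints between adjacent distinct values that are boundary points.

    A cut between two adjacent values is skipped only when both values
    are pure and share the same single class (illustrative definition).
    """
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    # Group examples sharing an attribute value and record their class set.
    groups = [
        (value, {y for _, y in group})
        for value, group in groupby(pairs, key=lambda p: p[0])
    ]
    cuts = []
    for (v1, c1), (v2, c2) in zip(groups, groups[1:]):
        if len(c1) > 1 or len(c2) > 1 or c1 != c2:
            cuts.append((v1 + v2) / 2)
    return cuts


cuts = boundary_cut_points([1, 2, 3, 4, 5, 20], [0, 0, 1, 1, 1, 1])
mixed = boundary_cut_points([1, 1, 2, 3], ["a", "b", "a", "a"])
```

In the first call, five possible cuts between distinct values collapse to a single boundary point, so an evaluation function with the property described above would need to score only that one candidate.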
Average-Case Analysis of a Nearest Neighbor Algorithm
 PROCEEDINGS OF THE THIRTEENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE (PP. 889-894). CHAMBERY
, 1993
Cited by 36 (4 self)
In this paper we present an average-case analysis of the nearest neighbor algorithm, a simple induction method that has been studied by many researchers. Our analysis assumes a conjunctive target concept, noise-free Boolean attributes, and a uniform distribution over the instance space. We calculate the probability that the algorithm will encounter a test instance that is distance d from the prototype of the concept, along with the probability that the nearest stored training case is distance e from this test instance. From this we compute the probability of correct classification as a function of the number of observed training cases, the number of relevant attributes, and the number of irrelevant attributes. We also explore the behavioral implications of the analysis by presenting predicted learning curves for artificial domains, and give experimental results on these domains as a check on our reasoning.
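The setting analyzed here, nearest neighbor classification over Boolean attributes, can be sketched in a few lines. Using Hamming distance as the distance measure is an assumption suggested by the Boolean attributes, and the names `hamming` and `nearest_neighbor_predict` are hypothetical.

```python
def hamming(a, b):
    """Number of attribute positions at which two Boolean vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbor_predict(train, query):
    """Return the label of the stored case closest to the query.

    train: list of (attributes, label) pairs; ties go to the earliest case.
    """
    _, label = min(train, key=lambda case: hamming(case[0], query))
    return label


# Toy concept: the first two attributes are relevant, the third is not.
train = [((1, 1, 0), True), ((0, 0, 0), False)]
pred_a = nearest_neighbor_predict(train, (1, 1, 1))
pred_b = nearest_neighbor_predict(train, (0, 0, 1))
```

Because every attribute counts equally in the distance, irrelevant attributes can pull a query toward the wrong stored case, which is consistent with the abstract treating the number of irrelevant attributes as a separate factor.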
A Survey of Methods for Scaling Up Inductive Learning Algorithms
, 1997
Cited by 18 (1 self)
Each year, one of the explicit challenges for the KDD research community is to develop methods that facilitate the use of inductive learning algorithms for mining very large databases. By collecting, categorizing, and summarizing past work on scaling up inductive learning algorithms, this paper serves to establish a common ground for researchers addressing the challenge. We begin with a discussion of important, but often tacit, issues related to scaling up learning algorithms. We highlight similarities among methods by categorizing them into three main approaches. For each approach, we then describe, compare, and contrast the different constituent methods, drawing on specific examples from the published literature. Finally, we use the preceding analysis to suggest how one should proceed when dealing with a large problem, and where future research efforts should be focused.
Boosting and Microarray Data
 MACHINE LEARNING
, 2003
Cited by 16 (1 self)
We have found one reason why AdaBoost tends not to perform well on gene expression data, and identified simple modifications that improve its ability to find accurate class prediction rules. These modifications appear especially to be needed when there is a strong association between expression profiles and class designations. Cross-validation analysis of six microarray datasets with different characteristics suggests that, suitably modified, boosting provides competitive classification accuracy in general. Sometimes the goal
Feature Selection and Generalisation for Retrieval of Textual Cases
 IN PROCEEDINGS OF 7TH EUROPEAN CONFERENCE ON CASE-BASED REASONING (ECCBR’04), VOLUME 3155
, 2004
Cited by 15 (2 self)
Textual CBR systems solve problems by reusing experiences that are in textual form. Knowledge-rich comparison of textual cases remains an important challenge for these systems. However, mapping text data into a structured case representation requires a significant knowledge engineering effort. In this paper we look at automated acquisition of the case indexing vocabulary as a two-step process involving feature selection followed by feature generalisation. Boosted decision stumps are employed as a means to select features that are predictive and relatively orthogonal. Association rule induction is employed to capture feature co-occurrence patterns. Generalised features are constructed by applying these rules. Essentially, rules preserve implicit semantic relationships between features, and applying them has the desired effect of bringing together cases that would have otherwise been overlooked during case retrieval. Experiments with four textual data sets show significant improvement in retrieval accuracy whenever generalised features are used. The results further suggest that boosted decision stumps with generalised features are a promising combination.
Seer: Maximum Likelihood Regression for Learning-Speed Curves
 University of Illinois at
, 1995
Cited by 10 (0 self)
The research presented here focuses on modeling machine-learning performance. The thesis introduces Seer, a system that generates empirical observations of classification-learning performance and then uses those observations to create statistical models. The models can be used to predict the number of training examples needed to achieve a desired level and the maximum accuracy possible given an unlimited number of training examples. Seer advances the state of the art with 1) models that embody the best constraints for classification learning and most useful parameters, 2) algorithms that efficiently find maximum-likelihood models, and 3) a demonstration on real-world data from three domains of a practicable application of such modeling. The first part of the thesis gives an overview of the requirements for a good maximum-likelihood model of classification-learning performance. Next, reasonable design choices for such models are explored. Selection among such models is a task of nonlinear programming, but by exploiting appropriate problem constraints, the task is reduced to a nonlinear regression task that can be solved with an efficient iterative algorithm. The latter part of the thesis describes almost 100 experiments in the domains of soybean disease, heart disease, and audiological problems. The tests show that Seer is excellent at characterizing learning performance and that it seems to be as good as possible at predicting learning
In Defense of C4.5: Notes on Learning One-Level Decision Trees
 Proc. of the 11th Int. Conf. on Machine Learning
, 1994
Cited by 9 (1 self)
We discuss the implications of Holte's recently published article, which demonstrated that on the most commonly used data very simple classification rules are almost as accurate as decision trees produced by Quinlan's C4.5. We consider, in particular, what is the significance of Holte's results for the future of top-down induction of decision trees. To an extent, Holte questioned the sense of further research on multilevel decision tree learning. We go in detail through all the parts of Holte's study. We try to put the results into perspective. We argue that the (in absolute terms) small difference in accuracy between 1R and C4.5 that was witnessed by Holte is still significant. We claim that C4.5 possesses additional accuracy-related advantages over 1R. In addition we discuss the representativeness of the databases used by Holte. We compare empirically the optimal accuracies of multilevel and one-level decision trees and observe some significant differences. We point out several defici...
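For readers unfamiliar with one-level rules of the kind discussed in this abstract, the following is a simplified sketch of a 1R-style learner for nominal attributes. It is an illustration under stated assumptions, not Holte's full procedure (which also deals with continuous attributes and missing values); the name `one_r` is hypothetical.

```python
from collections import Counter, defaultdict


def one_r(rows, labels):
    """Pick the single attribute whose value-to-class rule errs least.

    rows: list of tuples of nominal attribute values.
    Returns (attribute_index, rule, training_errors), where rule maps
    each observed value to its majority class.
    """
    best = None
    for a in range(len(rows[0])):
        counts = defaultdict(Counter)
        for row, y in zip(rows, labels):
            counts[row[a]][y] += 1
        # Each attribute value predicts its most frequent class.
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(rule[row[a]] != y for row, y in zip(rows, labels))
        if best is None or errors < best[2]:
            best = (a, rule, errors)
    return best


rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "hot"), ("rainy", "mild")]
labels = ["no", "no", "yes", "yes"]
attr, rule, errors = one_r(rows, labels)
```

A rule of this form is equivalent to a decision tree with a single test at the root, which is the sense in which the title speaks of one-level decision trees.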