Results 1  10
of
515
A DecisionTheoretic Generalization of onLine Learning and an Application to Boosting
, 1997
"... In the first part of the paper we consider the problem of dynamically apportioning resources among a set of options in a worstcase online framework. The model we study can be interpreted as a broad, abstract extension of the wellstudied online prediction model to a general decisiontheoretic set ..."
Abstract

Cited by 2307 (59 self)
 Add to MetaCart
In the first part of the paper we consider the problem of dynamically apportioning resources among a set of options in a worstcase online framework. The model we study can be interpreted as a broad, abstract extension of the wellstudied online prediction model to a general decisiontheoretic setting. We show that the multiplicative weightupdate rule of Littlestone and Warmuth [20] can be adapted to this model yielding bounds that are slightly weaker in some cases, but applicable to a considerably more general class of learning problems. We show how the resulting learning algorithm can be applied to a variety of problems, including gambling, multipleoutcome prediction, repeated games and prediction of points in R n . In the second part of the paper we apply the multiplicative weightupdate technique to derive a new boosting algorithm. This boosting algorithm does not require any prior knowledge about the performance of the weak learning algorithm. We also study generalizations of...
Experiments with a New Boosting Algorithm
, 1996
"... In an earlier paper, we introduced a new “boosting” algorithm called AdaBoost which, theoretically, can be used to significantly reduce the error of any learning algorithm that consistently generates classifiers whose performance is a little better than random guessing. We also introduced the relate ..."
Abstract

Cited by 1625 (21 self)
 Add to MetaCart
In an earlier paper, we introduced a new “boosting” algorithm called AdaBoost which, theoretically, can be used to significantly reduce the error of any learning algorithm that consistently generates classifiers whose performance is a little better than random guessing. We also introduced the related notion of a “pseudoloss ” which is a method for forcing a learning algorithm of multilabel conceptsto concentrate on the labels that are hardest to discriminate. In this paper, we describe experiments we carried out to assess how well AdaBoost with and without pseudoloss, performs on real learning problems. We performed two sets of experiments. The first set compared boosting to Breiman’s “bagging ” method when used to aggregate various classifiers (including decision trees and single attributevalue tests). We compared the performance of the two methods on a collection of machinelearning benchmarks. In the second set of experiments, we studied in more detail the performance of boosting using a nearestneighbor classifier on an OCR problem.
Additive Logistic Regression: a Statistical View of Boosting
 Annals of Statistics
, 1998
"... Boosting (Freund & Schapire 1996, Schapire & Singer 1998) is one of the most important recent developments in classification methodology. The performance of many classification algorithms can often be dramatically improved by sequentially applying them to reweighted versions of the input data, and t ..."
Abstract

Cited by 1217 (21 self)
 Add to MetaCart
Boosting (Freund & Schapire 1996, Schapire & Singer 1998) is one of the most important recent developments in classification methodology. The performance of many classification algorithms can often be dramatically improved by sequentially applying them to reweighted versions of the input data, and taking a weighted majority vote of the sequence of classifiers thereby produced. We show that this seemingly mysterious phenomenon can be understood in terms of well known statistical principles, namely additive modeling and maximum likelihood. For the twoclass problem, boosting can be viewed as an approximation to additive modeling on the logistic scale using maximum Bernoulli likelihood as a criterion. We develop more direct approximations and show that they exhibit nearly identical results to boosting. Direct multiclass generalizations based on multinomial likelihood are derived that exhibit performance comparable to other recently proposed multiclass generalizations of boosting in most...
Wrappers for feature subset selection
 ARTIFICIAL INTELLIGENCE
, 1997
"... In the feature subset selection problem, a learning algorithm is faced with the problem of selecting a relevant subset of features upon which to focus its attention, while ignoring the rest. To achieve the best possible performance with a particular learning algorithm on a particular training set, a ..."
Abstract

Cited by 1023 (3 self)
 Add to MetaCart
In the feature subset selection problem, a learning algorithm is faced with the problem of selecting a relevant subset of features upon which to focus its attention, while ignoring the rest. To achieve the best possible performance with a particular learning algorithm on a particular training set, a feature subset selection method should consider how the algorithm and the training set interact. We explore the relation between optimal feature subset selection and relevance. Our wrapper method searches for an optimal feature subset tailored to a particular algorithm and a domain. We study the strengths and weaknesses of the wrapper approach and show a series of improved designs. We compare the wrapper approach to induction without feature subset selection and to Relief, a filter approach to feature subset selection. Significant improvement in accuracy is achieved for some datasets for the two families of induction algorithms used: decision trees and
Gradientbased learning applied to document recognition
 Proceedings of the IEEE
, 1998
"... Multilayer neural networks trained with the backpropagation algorithm constitute the best example of a successful gradientbased learning technique. Given an appropriate network architecture, gradientbased learning algorithms can be used to synthesize a complex decision surface that can classify hi ..."
Abstract

Cited by 731 (58 self)
 Add to MetaCart
Multilayer neural networks trained with the backpropagation algorithm constitute the best example of a successful gradientbased learning technique. Given an appropriate network architecture, gradientbased learning algorithms can be used to synthesize a complex decision surface that can classify highdimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of two dimensional (2D) shapes, are shown to outperform all other techniques. Reallife document recognition systems are composed of multiple modules including field extraction, segmentation, recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN’s), allows such multimodule systems to be trained globally using gradientbased methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank check is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal checks. It is deployed commercially and reads several million checks per day.
Boosting the margin: A new explanation for the effectiveness of voting methods
 In Proceedings International Conference on Machine Learning
, 1997
"... Abstract. One of the surprising recurring phenomena observed in experiments with boosting is that the test error of the generated classifier usually does not increase as its size becomes very large, and often is observed to decrease even after the training error reaches zero. In this paper, we show ..."
Abstract

Cited by 721 (52 self)
 Add to MetaCart
Abstract. One of the surprising recurring phenomena observed in experiments with boosting is that the test error of the generated classifier usually does not increase as its size becomes very large, and often is observed to decrease even after the training error reaches zero. In this paper, we show that this phenomenon is related to the distribution of margins of the training examples with respect to the generated voting classification rule, where the margin of an example is simply the difference between the number of correct votes and the maximum number of votes received by any incorrect label. We show that techniques used in the analysis of Vapnik’s support vector classifiers and of neural networks with small weights can be applied to voting methods to relate the margin distribution to the test error. We also show theoretically and experimentally that boosting is especially effective at increasing the margins of the training examples. Finally, we compare our explanation to those based on the biasvariance decomposition. 1
Solving multiclass learning problems via errorcorrecting output codes
 Journal of Artificial Intelligence Research
, 1995
"... Multiclass learning problems involve nding a de nition for an unknown function f(x) whose range is a discrete set containing k>2values (i.e., k \classes"). The de nition is acquired by studying collections of training examples of the form hx i;f(x i)i. Existing approaches to multiclass learning ..."
Abstract

Cited by 564 (9 self)
 Add to MetaCart
Multiclass learning problems involve nding a de nition for an unknown function f(x) whose range is a discrete set containing k>2values (i.e., k \classes"). The de nition is acquired by studying collections of training examples of the form hx i;f(x i)i. Existing approaches to multiclass learning problems include direct application of multiclass algorithms such as the decisiontree algorithms C4.5 and CART, application of binary concept learning algorithms to learn individual binary functions for each of the k classes, and application of binary concept learning algorithms with distributed output representations. This paper compares these three approaches to a new technique in which errorcorrecting codes are employed as a distributed output representation. We show that these output representations improve the generalization performance of both C4.5 and backpropagation on a wide range of multiclass learning tasks. We also demonstrate that this approach is robust with respect to changes in the size of the training sample, the assignment of distributed representations to particular classes, and the application of over tting avoidance techniques such as decisiontree pruning. Finally,we show thatlike the other methodsthe errorcorrecting code technique can provide reliable class probability estimates. Taken together, these results demonstrate that errorcorrecting output codes provide a generalpurpose method for improving the performance of inductive learning programs on multiclass problems. 1.
An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants
 MACHINE LEARNING
, 1999
"... Methods for voting classification algorithms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classifiers for artificial and realworld datasets. We review these algorithms and describe a large empirical study comparing several variants in co ..."
Abstract

Cited by 539 (2 self)
 Add to MetaCart
Methods for voting classification algorithms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classifiers for artificial and realworld datasets. We review these algorithms and describe a large empirical study comparing several variants in conjunction with a decision tree inducer (three variants) and a NaiveBayes inducer.
The purpose of the study is to improve our understanding of why and
when these algorithms, which use perturbation, reweighting, and
combination techniques, affect classification error. We provide a
bias and variance decomposition of the error to show how different
methods and variants influence these two terms. This allowed us to
determine that Bagging reduced variance of unstable methods, while
boosting methods (AdaBoost and Arcx4) reduced both the bias and
variance of unstable methods but increased the variance for NaiveBayes,
which was very stable. We observed that Arcx4 behaves differently
than AdaBoost if reweighting is used instead of resampling,
indicating a fundamental difference. Voting variants, some of which
are introduced in this paper, include: pruning versus no pruning,
use of probabilistic estimates, weight perturbations (Wagging), and
backfitting of data. We found that Bagging improves when
probabilistic estimates in conjunction with nopruning are used, as
well as when the data was backfit. We measure tree sizes and show
an interesting positive correlation between the increase in the
average tree size in AdaBoost trials and its success in reducing the
error. We compare the meansquared error of voting methods to
nonvoting methods and show that the voting methods lead to large
and significant reductions in the meansquared errors. Practical
problems that arise in implementing boosting algorithms are
explored, including numerical instabilities and underflows. We use
scatterplots that graphically show how AdaBoost reweights instances,
emphasizing not only "hard" areas but also outliers and noise.
An Efficient Boosting Algorithm for Combining Preferences
, 1999
"... The problem of combining preferences arises in several applications, such as combining the results of different search engines. This work describes an efficient algorithm for combining multiple preferences. We first give a formal framework for the problem. We then describe and analyze a new boosting ..."
Abstract

Cited by 515 (18 self)
 Add to MetaCart
The problem of combining preferences arises in several applications, such as combining the results of different search engines. This work describes an efficient algorithm for combining multiple preferences. We first give a formal framework for the problem. We then describe and analyze a new boosting algorithm for combining preferences called RankBoost. We also describe an efficient implementation of the algorithm for certain natural cases. We discuss two experiments we carried out to assess the performance of RankBoost. In the first experiment, we used the algorithm to combine different WWW search strategies, each of which is a query expansion for a given domain. For this task, we compare the performance of RankBoost to the individual search strategies. The second experiment is a collaborativefiltering task for making movie recommendations. Here, we present results comparing RankBoost to nearestneighbor and regression algorithms.