
## An introduction to boosting and leveraging (2003)

Venue: Advanced Lectures on Machine Learning, LNCS

Citations: 136 (9 self)

### Citations

6592 | Neural Networks for Pattern Recognition
- Bishop
- 1996
Citation Context: ...blem, yn is a K-dimensional vector of the form (0, 0, ..., 0, 1, 0, ..., 0) with a 1 in the k-th position if, and only if, xn belongs to the k-th class. The log-likelihood function can then be constructed as [20] (see also Page 84) G = Σ_{n=1}^N Σ_{k=1}^K {yn,k log p(k|xn) + (1 − yn,k) log(1 − p(k|xn))}, (50) where yn,k is the k-th component of the vector yn, and p(k|xn) is the model probability that xn belongs to the...

5955 | Classification and Regression Trees
- Breiman, Friedman, et al.
- 1984
Citation Context: ...riefly mention some weak learners which have been used successfully in applications. Decision trees and stumps. Decision trees have been widely used for many years in the statistical literature (e.g. [29, 144, 85]) as powerful, effective and easily interpretable classification algorithms that are able to automatically select relevant features. Hence, it is not surprising that some of the most successful initia...

5955 | Neural networks: a comprehensive foundation
- Haykin
- 1994
Citation Context: ...gistic regression with decision trees are described in [76]. This work showed that Boosting significantly enhances the performance of decision trees and stumps. Neural networks. Neural networks (e.g. [88, 20]) were extensively used during the 1990s in many applications. The feed-forward neural network, by far the most widely used in practice, is essentially a highly non-linear function representation for...

4624 | A New Look at the Statistical Model Identification
- Akaike
- 1974
Citation Context: ...n statistical learning procedures when using complex non-linear models, for instance for neural networks, where a regularization term is used to appropriately limit the complexity of the models (e.g. [2, 159, 143, 133, 135, 187, 191, 36]). Before proceeding to discuss ensemble methods we briefly review the strong and weak PAC models for learning binary concepts [188]. Let S be a sample consisting of N data points {(xn, yn)}_{n=1}^N, whe...

3688 | Support-vector networks
- Cortes, Vapnik
- 1995
Citation Context: ...f the α's are positive (as in Algorithm 2.1), then ‖w‖₁ = ‖α‖₁ holds. 4.2 Geometric Interpretation of p-Norm Margins. Margins have been used frequently in the context of Support Vector Machines (SVMs) [22, 41, 190] and Boosting. These so-called large margin algorithms focus on generating hyperplanes/functions with large margins on most training examples. Let us therefore study some properties of the maximum mar...

3642 | Bagging predictors.
- Breiman
- 1996
Citation Context: ...eraging Consider a combination of hypotheses in the form of (1). Clearly there are many approaches for selecting both the coefficients αt and the base hypotheses ht. In the so-called Bagging approach [25], the hypotheses {ht}_{t=1}^T are chosen based on a set of T bootstrap samples, and the coefficients αt are set to αt = 1/T (see e.g. [142] for more refined choices). Although this algorithm seems somewh...
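The Bagging recipe in this excerpt — T bootstrap samples, uniform coefficients αt = 1/T — is simple enough to sketch directly. A minimal illustration; the threshold-stump base learner and the toy data are hypothetical additions, not part of the excerpt:

```python
import random

def train_bagging(data, fit_base, T, seed=0):
    """data: list of (x, y) pairs, y in {-1, +1}.
    Returns T hypotheses, each fit on a bootstrap sample."""
    rng = random.Random(seed)
    hypotheses = []
    for _ in range(T):
        # Bootstrap: draw N points with replacement.
        sample = [rng.choice(data) for _ in data]
        hypotheses.append(fit_base(sample))
    return hypotheses

def bag_predict(hypotheses, x):
    # Uniform combination: alpha_t = 1/T for every t.
    vote = sum(h(x) for h in hypotheses) / len(hypotheses)
    return 1 if vote >= 0 else -1

def fit_stump(sample):
    # Hypothetical base learner: best threshold stump on a scalar input.
    best = None
    for theta in sorted({x for x, _ in sample}):
        for sign in (1, -1):
            err = sum(1 for x, y in sample
                      if (sign if x >= theta else -sign) != y)
            if best is None or err < best[0]:
                best = (err, theta, sign)
    _, theta, sign = best
    return lambda x, t=theta, s=sign: (s if x >= t else -s)

data = [(0.1, -1), (0.2, -1), (0.4, -1), (0.6, 1), (0.8, 1), (0.9, 1)]
ensemble = train_bagging(data, fit_stump, T=11)
print(bag_predict(ensemble, 0.15), bag_predict(ensemble, 0.85))
```

As the excerpt notes, the scheme's appeal is variance reduction: each stump fluctuates with its bootstrap sample, but the uniform vote is far more stable than any single stump.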

3488 | A decision-theoretic generalization of on-line learning and an application to boosting
- Freund, Schapire
- 1995
Citation Context: ...mial time Boosting algorithm, while [55] were the first to apply the Boosting idea to a real-world OCR task, relying on neural networks as base learners. The AdaBoost (Adaptive Boosting) algorithm by [67, 68, 70] (cf. Algorithm 2.1) is generally considered as a first step towards more practical Boosting algorithms. Very similar to AdaBoost is the Arcing algorithm, for which convergence to a linear programming...
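For readers who want the Algorithm 2.1 referenced here in executable form, the following is a minimal sketch of AdaBoost for {−1, +1}-valued base hypotheses. The finite hypothesis pool and toy data are illustrative assumptions; AdaBoost itself only assumes a weak learner that returns some ht with an edge over random guessing:

```python
import math

def adaboost(data, pool, T):
    """data: list of (x, y), y in {-1, +1}; pool: candidate base hypotheses.
    Returns [(alpha_t, h_t)] defining f(x) = sum_t alpha_t * h_t(x)."""
    N = len(data)
    d = [1.0 / N] * N                       # initial distribution d^(1)
    ensemble = []
    for _ in range(T):
        # Step (3a): weak learner = pool member with smallest weighted error.
        def weighted_error(h):
            return sum(dn for dn, (x, y) in zip(d, data) if h(x) != y)
        h = min(pool, key=weighted_error)
        eps = weighted_error(h)
        if eps == 0:                        # perfect base hypothesis: stop
            return ensemble + [(1.0, h)]
        if eps >= 0.5:                      # no edge left: stop
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, h))
        # Reweight: d_n <- d_n * exp(-y_n * alpha * h(x_n)), then renormalize.
        d = [dn * math.exp(-y * alpha * h(x)) for dn, (x, y) in zip(d, data)]
        Z = sum(d)
        d = [dn / Z for dn in d]
    return ensemble

def predict(ensemble, x):
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

data = [(0.1, -1), (0.3, -1), (0.5, 1), (0.7, 1), (0.9, 1)]
pool = [lambda x, t=t, s=s: s * (1 if x >= t else -1)
        for t in (0.2, 0.4, 0.6, 0.8) for s in (1, -1)]
ens = adaboost(data, pool, T=10)
print([predict(ens, x) for x, _ in data])
```

The "adaptive" part is the reweighting step: examples the current hypothesis misclassifies gain weight, so the next weak learner concentrates on them.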

3072 | An Introduction to Probability Theory and Its Applications
- Feller
- 1971
Citation Context: ...f approximating the minimum of the risk (2) by the minimum of the empirical risk L̂(f) = (1/N) Σ_{n=1}^N λ(f(xn), yn). (3) From the law of large numbers (e.g. [61]) one expects that L̂(f) → L(f) as N → ∞. However, in order to guarantee that the function obtained by minimizing L̂(f) also attains asymptotically the minimum of L(f) a stronger condition is require...
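The convergence L̂(f) → L(f) promised by the law of large numbers is easy to observe numerically. A toy sketch; the data model (uniform inputs with 10% label noise) is an assumption chosen so that the true risk of the fixed classifier below is exactly 0.1:

```python
import random

def empirical_risk(f, sample):
    # Equation (3) with the 0/1 loss: L_hat(f) = (1/N) * sum_n loss(f(x_n), y_n).
    return sum(1 for x, y in sample if f(x) != y) / len(sample)

def draw(n, rng):
    # Assumed model: x ~ U(0, 1); y = sign(x - 0.5), flipped with probability 0.1.
    sample = []
    for _ in range(n):
        x = rng.random()
        y = 1 if x >= 0.5 else -1
        if rng.random() < 0.1:
            y = -y
        sample.append((x, y))
    return sample

f = lambda x: 1 if x >= 0.5 else -1   # Bayes rule for this model; L(f) = 0.1
rng = random.Random(42)
for n in (100, 10_000, 100_000):
    print(n, empirical_risk(f, draw(n, rng)))
```

For a single fixed f this is exactly the law of large numbers; the "stronger condition" the excerpt goes on to require (uniform convergence over the class F) is what the minimizer of L̂ needs.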

2701 | Atomic decomposition by basis pursuit
- Chen, Donoho, et al.
- 1998
Citation Context: ... for arbitrary loss functions and any concave regularizer such as the ℓ1-norm [147]. Note that N + 1 is an upper bound on the number of non-zero elements – in practice much sparser solutions are observed [37, 48, 23, 8]. In neural networks, SVMs, matching pursuit (e.g. [118]) and many other algorithms, one uses the ℓ2-norm for regularization. In this case the optimal solution w∗ can be expressed as a linear combina...

2209 | Experiments with a New Boosting Algorithm
- Freund, Schapire
- 1996
Citation Context: ...mial time Boosting algorithm, while [55] were the first to apply the Boosting idea to a real-world OCR task, relying on neural networks as base learners. The AdaBoost (Adaptive Boosting) algorithm by [67, 68, 70] (cf. Algorithm 2.1) is generally considered as a first step towards more practical Boosting algorithms. Very similar to AdaBoost is the Arcing algorithm, for which convergence to a linear programming...

1859 | A training algorithm for optimal margin classifiers
- Boser, Guyon, et al.
- 1992
Citation Context: ...f the α's are positive (as in Algorithm 2.1), then ‖w‖₁ = ‖α‖₁ holds. 4.2 Geometric Interpretation of p-Norm Margins. Margins have been used frequently in the context of Support Vector Machines (SVMs) [22, 41, 190] and Boosting. These so-called large margin algorithms focus on generating hyperplanes/functions with large margins on most training examples. Let us therefore study some properties of the maximum mar...

1745 | Additive logistic regression: a statistical view of boosting
- Friedman, Hastie, et al.
- 1998
Citation Context: ...ntuitive in the context of algorithmic design, a step forward in transparency was taken by explaining Boosting in terms of a stage-wise gradient descent procedure in an exponential cost function (cf. [27, 63, 74, 127, 153]). A further interesting step towards practical applicability is worth mentioning: large parts of the early Boosting literature persistently contained the misconception that Boosting would not overfit...

1416 | The elements of statistical learning: data mining, inference and prediction
- Hastie, Tibshirani, et al.
- 2008
Citation Context: ...riefly mention some weak learners which have been used successfully in applications. Decision trees and stumps. Decision trees have been widely used for many years in the statistical literature (e.g. [29, 144, 85]) as powerful, effective and easily interpretable classification algorithms that are able to automatically select relevant features. Hence, it is not surprising that some of the most successful initia...

1366 | Nearest neighbor pattern classification
- Cover, Hart
- 1967
Citation Context: ...g status of electric appliances (with and without inverter) for the purpose of constructing a non-intrusive monitoring system. In this study, RBF networks, K-nearest neighbor classifiers (KNNs) (e.g. [42]), SVMs and ν-Arc (cf. Section 6.2) were compared. The data set available for this task is rather small (36 examples), since the collection and labeling of data is manual and therefore expensive. As a...

1318 | A Probabilistic Theory of Pattern Recognition
- Devroye, Gyorfi, et al.
- 1996
Citation Context: ..., followed by a discussion of different aspects of ensemble learning. 2.1 Learning from Data and the PAC Property. We focus in this review (except in Section 7) on the problem of binary classification [50]. The task of binary classification is to find a rule (hypothesis), which, based on a set of observations, assigns an object to one of two classes. We represent objects as belonging to an input space ...

1287 | An Introduction to Support Vector Machines
- Cristianini, Shawe-Taylor
- 2000
Citation Context: ...e data. Data-dependent bounds depending explicitly on the weights αt of the weak learners are given in [130]. In addition, bounds which take into account the full margin distribution are presented in [180, 45, 181]. Such results are particularly useful for the purpose of model selection, but are beyond the scope of this review. 3.5 Consistency. The bounds presented in Theorems 2 and 3 depend explicitly on the d...

1214 | Nonlinear Programming. Athena Scientific
- Bertsekas
- 1999
Citation Context: ... (31), which is minimized with respect to ρ and w for some fixed β. We denote the optimum by wβ and ρβ. One can show that any limit point of (wβ, ρβ) for β → 0 is a solution of (30) [136, 19, 40]. This also holds when one only has a sequence of approximate minimizers and the approximation becomes better for decreasing β [40]. Additionally, the quantities d̃n = exp(ρβ Σ_j wj − yn fwβ(xn))/Z (...

879 | Hierarchical mixtures of experts and the em algorithm.
- Jordan, Jacobs
- 1994
Citation Context: ...k of this approach as the representation of a general non-linear function by piece-wise linear functions. This type of representation forms the basis for the so-called mixture of experts models (e.g. [99]). An important observation concerns the complexity of the functions αt(x). Clearly, by allowing these functions to be arbitrarily complex (e.g. δ-functions), we can easily fit any finite data set. Thi...

797 | A short introduction to boosting
- Freund, Schapire
- 1999
Citation Context: ...literature, as will become clear from the bibliography, is so extensive that a full treatment would require a book-length treatise. The present review differs from other reviews, such as the ones of [72, 165, 166], mainly in the choice of the presented material: we place more emphasis on robust algorithms for Boosting, on connections to other margin based approaches (such as support vector machines), and on co...

724 | An efficient boosting algorithm for combining preferences
- Freund, Iyer, et al.
- 2003
Citation Context: ...learning. However, the general Boosting framework is much more widely applicable. In this section we present a brief survey of several extensions and generalizations, although many others exist, e.g. [76, 158, 62, 18, 31, 3, 155, 15, 44, 52, 82, 164, 66, 38, 137, 147, 16, 80, 185, 100, 14]. 7.1 Single Class. A classical unsupervised learning task is density estimation. Assuming that the unlabeled observations x1, ..., xN were generated independently at random according to some unknown dis...

721 | Solving multiclass learning problems via error-correcting output codes
- Dietterich, Bakiri
- 1995
Citation Context: ...learning. However, the general Boosting framework is much more widely applicable. In this section we present a brief survey of several extensions and generalizations, although many others exist, e.g. [76, 158, 62, 18, 31, 3, 155, 15, 44, 52, 82, 164, 66, 38, 137, 147, 16, 80, 185, 100, 14]. 7.1 Single Class. A classical unsupervised learning task is density estimation. Assuming that the unlabeled observations x1, ..., xN were generated independently at random according to some unknown dis...

705 | An empirical comparison of voting classification algorithms: Bagging, boosting, and variants.
- Bauer, Kohavi
- 1999
Citation Context: ...L̃θ(fT) ≤ Π_{t=1}^T (1 − γt)^{(1−θ)/2} (1 + γt)^{(1+θ)/2}. (11) Proof. We present a proof from [167] for the case where ht ∈ {−1, +1}. We begin by showing that for every {αt}, L̃θ(fT) ≤ exp(θ Σ_{t=1}^T αt) Π_{t=1}^T Zt. (12) By definition, Zt = Σ_{n=1}^N d_n^(t) e^{−yn αt ht(xn)} = Σ_{n: yn = ht(xn)} d_n^(t) e^{−αt} + Σ_{n: yn ≠ ht(xn)} d_n^(t) e^{αt} = (1 − εt) e^{−αt} + εt e^{αt}. From the definition of fT it follows that y fT(x) ≤ θ ⇒ ...
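The chain of equalities for Zt in this excerpt — the definition Σn dn(t) e^{−yn αt ht(xn)} versus the closed form (1 − εt)e^{−αt} + εt e^{αt} — can be checked numerically. A small sketch with a made-up weighted sample (the weights, labels, and predictions below are arbitrary illustrative values):

```python
import math

def Z_definition(d, ys, hs, alpha):
    # Z_t = sum_n d_n^(t) * exp(-y_n * alpha_t * h_t(x_n))
    return sum(dn * math.exp(-y * alpha * h) for dn, y, h in zip(d, ys, hs))

def Z_closed_form(d, ys, hs, alpha):
    # Split the sum into correctly and incorrectly classified examples:
    # Z_t = (1 - eps_t) * exp(-alpha_t) + eps_t * exp(alpha_t),
    # where eps_t is the weighted error of h_t.
    eps = sum(dn for dn, y, h in zip(d, ys, hs) if y != h)
    return (1 - eps) * math.exp(-alpha) + eps * math.exp(alpha)

# Toy weighted sample: h, y in {-1, +1}, normalized weights d.
d  = [0.1, 0.2, 0.3, 0.4]
ys = [1, -1, 1, -1]
hs = [1, 1, 1, -1]                         # h is wrong on the second example
alpha = 0.5 * math.log((1 - 0.2) / 0.2)    # AdaBoost's choice for eps = 0.2
print(Z_definition(d, ys, hs, alpha), Z_closed_form(d, ys, hs, alpha))
```

With AdaBoost's α = ½ ln((1 − ε)/ε), the closed form collapses to 2√(ε(1 − ε)), which is what makes the product of the Zt shrink whenever the weak learner has an edge.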

701 | An introduction to computational learning theory
- Kearns, Vazirani
Citation Context: ...ed analogously, except that it is only required to satisfy the conditions for particular ε and δ, rather than all pairs. Various extensions and generalization of the basic PAC concept can be found in [87, 102, 4]. 2.2 Ensemble Learning, Boosting, and Leveraging. Consider a combination of hypotheses in the form of (1). Clearly there are many approaches for selecting both the coefficien...

669 | Inducing Features of Random Fields
- Pietra, Pietra, et al.
- 1997
Citation Context: .... Although parts of the analysis in [39] hold for any strictly convex cost function of Legendre-type (cf. [162], p. 258), one needs to demonstrate the existence of a so-called auxiliary function (cf. [46, 39, 114, 106]) for each cost function other than the exponential or the logistic loss. This has been done for the general case in [147] under very mild assumptions on the base learning algorithm and the loss funct...

609 | An experimental comparison of three methods for constructing ensembles of decision trees: bagging
- Dietterich
- 2000
Citation Context: ...oost algorithm and many of its early variants were tested on standard data sets from the UCI repository, and often found to compare favorably with other state of the art algorithms (see, for example, [68, 168, 51]). However, it was clear from [146, 168, 51] that AdaBoost tends to overfit if the data is noisy and no regularization is enforced. More recent experiments, using the regularized forms of Boosting desc...

559 | Reducing multiclass to binary: a unifying approach for margin classifiers
- Allwein, Schapire, et al.
- 2000
Citation Context: ...tionship opened the field to new types of Boosting algorithms. Among other options it now became possible to rigorously define Boosting algorithms for regression (cf. [58, 149]), multi-class problems [3, 155], unsupervised learning (cf. [32, 150]) and to establish convergence proofs for Boosting algorithms by using results from the Theory of Optimization. Further extensions to Boosting algorithms can be f...

510 | Boosting a Weak Learning Algorithm by Majority
- Freund
- 1995
Citation Context: ...istribution over X. The correlation between f and H, with respect to D, is given by C_{H,D}(f) = sup_{h∈H} E_D{f(x)h(x)}. The distribution-free correlation between f and H is given by C_H(f) = inf_D C_{H,D}(f). [64] shows that if T > 2 log(2) d C_H(f)^{−2} then f can be represented exactly as f(x) = sign(Σ_{t=1}^T ht(x)). In other words, if H is highly correlated with the target function f, then f can be exactly represe...

488 | The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming
- Bregman
- 1967
Citation Context: ... show in Sections 4 and 5 that this relates to barrier optimization techniques [156] and coordinate descent methods [197, 151]. Additionally, we will briefly discuss relations to information geometry [24, 34, 110, 104, 39, 147, 111] and column generation techniques (cf. Section 6.2). An important issue in the context of the algorithms discussed in this review pertains to the construction of the weak learner (e.g. step (3a) in Al...

479 | Estimation with quadratic loss
- James, Stein
- 1961
Citation Context: ...ly. However, it turns out that in many cases inconsistent procedures perform better for finite amounts of data than consistent ones. A classic example of this is the so-called James-Stein estimator ([96, 160], Section 2.4.5). In order to establish consistency one needs to assume (or prove in specific cases) that as T → ∞ the class of functions coT(H) is dense in T. The consistency of Boosting algorithms ha...

468 | Choosing multiple parameters for support vector machines
- Chapelle, Vapnik, et al.
Citation Context: ...n statistical learning procedures when using complex non-linear models, for instance for neural networks, where a regularization term is used to appropriately limit the complexity of the models (e.g. [2, 159, 143, 133, 135, 187, 191, 36]). Before proceeding to discuss ensemble methods we briefly review the strong and weak PAC models for learning binary concepts [188]. Let S be a sample consisting of N data points {(xn, yn)}_{n=1}^N, whe...

424 | Decision theoretic generalizations of the PAC model for neural net and other learning applications
- Haussler
- 1992
Citation Context: ...ed analogously, except that it is only required to satisfy the conditions for particular ε and δ, rather than all pairs. Various extensions and generalization of the basic PAC concept can be found in [87, 102, 4]. 2.2 Ensemble Learning, Boosting, and Leveraging. Consider a combination of hypotheses in the form of (1). Clearly there are many approaches for selecting both the coefficien...

416 | Neural Network Learning – Theoretical Foundations
- Anthony, Bartlett
- 1999
Citation Context: ...he sample may be found, which, however, leads to very poor generalization. A sufficient condition for preventing this phenomenon is the requirement that L̂(f) converge uniformly (over F) to L(f); see [50, 191, 4] for further details. While it is possible to provide conditions for the learning machine which ensure that asymptotically (as N → ∞) the empirical risk minimizer will perform optimally, for small samp...

407 | Learning to order things
- Cohen, Schapire, et al.
- 1999
Citation Context: ...learning. However, the general Boosting framework is much more widely applicable. In this section we present a brief survey of several extensions and generalizations, although many others exist, e.g. [76, 158, 62, 18, 31, 3, 155, 15, 44, 52, 82, 164, 66, 38, 137, 147, 16, 80, 185, 100, 14]. 7.1 Single Class. A classical unsupervised learning task is density estimation. Assuming that the unlabeled observations x1, ..., xN were generated independently at random according to some unknown dis...

393 | Rademacher and Gaussian complexities: Risk bounds and structural results
- Bartlett, Mendelson
Citation Context: ...rectangles the VC dimension is 2d. Many more results and bounds on the VC dimension of various classes can be found in [50] and [4]. We present an improved version of the classic VC bound, taken from [11]. Theorem 2 ([192]). Let F be a class of {−1, +1}-valued functions defined over a set X. Let P be a probability distribution on X × {−1, +1}, and suppose that N samples S = {(x1, y1), ..., (xN, yN)} are ge...

372 | Parallel Optimization: Theory, Algorithms, and Applications
- Censor, Zenios
- 1997
Citation Context: ... show in Sections 4 and 5 that this relates to barrier optimization techniques [156] and coordinate descent methods [197, 151]. Additionally, we will briefly discuss relations to information geometry [24, 34, 110, 104, 39, 147, 111] and column generation techniques (cf. Section 6.2). An important issue in the context of the algorithms discussed in this review pertains to the construction of the weak learner (e.g. step (3a) in Al...

347 | Cryptographic limitations on learning Boolean formulae and finite automata
- Kearns, Valiant
- 1994
Citation Context: ...ble member ht is combined; both αt and the learner or hypothesis ht are to be learned within the Boosting procedure. The idea of Boosting has its roots in PAC learning (cf. [188]). Kearns and Valiant [101] proved the astonishing fact that learners, each performing only slightly better than random, can be combined to form an arbitrarily good ensemble hypothesis (when enough data is available). Schapire ...

284 | Stochastic gradient boosting
- Friedman
- 2002
Citation Context: ...replacement based on the distribution d^(t). The latter approach is more general as it is applicable to any weak learner; however, the former approach has been more widely used in practice. Friedman [73] has also considered sampling based approaches within the general framework described in Section 5. He found that in certain situations (small samples and powerful weak learners) it is advantageous to...

259 | Feature Selection via Concave Minimization and Support Vector
- Bradley, Mangasarian
- 1998
Citation Context: ... for arbitrary loss functions and any concave regularizer such as the ℓ1-norm [147]. Note that N + 1 is an upper bound on the number of non-zero elements – in practice much sparser solutions are observed [37, 48, 23, 8]. In neural networks, SVMs, matching pursuit (e.g. [118]) and many other algorithms, one uses the ℓ2-norm for regularization. In this case the optimal solution w∗ can be expressed as a linear combina...

258 | Logistic regression, adaboost and bregman distances
- Collins, Schapire, et al.
- 2000
Citation Context: ... show in Sections 4 and 5 that this relates to barrier optimization techniques [156] and coordinate descent methods [197, 151]. Additionally, we will briefly discuss relations to information geometry [24, 34, 110, 104, 39, 147, 111] and column generation techniques (cf. Section 6.2). An important issue in the context of the algorithms discussed in this review pertains to the construction of the weak learner (e.g. step (3a) in Al...

237 | Robust linear programming discrimination of two linearly inseparable sets
- Bennett, Mangasarian
- 1992
Citation Context: ... describe the DOOM approach [125] that uses a non-convex, monotone upper bound to the training error motivated from the margin-bounds. Then we discuss a linear program (LP) implementing a soft-margin [17, 153] and outline algorithms to iteratively solve the linear programs [154, 48, 146]. The latter techniques are based on modifying the AdaBoost margin loss function to achieve better noise robustness. Howe...

225 | On the learnability and design of output codes for multiclass problems
- Crammer, Singer
- 2000

207 | Boosting with the L2 loss: regression and classification
- Bühlmann, Yu
- 2003

193 | Nonintrusive appliance load monitoring
- Hart
- 1992
Citation Context: ...m, in particular for appliances with inverter systems, whereas non-intrusive measuring systems have already been developed for conventional on/off (non-inverter) operating electric equipment (cf. [83, 33]). The study in [139] presents a first evaluation of machine learning techniques to classify the operating status of electric appliances (with and without inverter) for the purpose of constructing a n...

191 | Boosting for tumor classification with gene expression data
- Dettling, Bühlmann
Citation Context: ...attempts to develop effective classification procedures to solve it. Early work applied the AdaBoost algorithm to this data; however, the results seemed to be rather disappointing. The recent work in [49] applied the LogitBoost algorithm [74], using decision trees as base learners, together with several modifications, and achieved state of the art performance on this difficult task. It turned out that...

168 | Prediction games and arcing algorithms
- Breiman
- 1999
Citation Context: ...rally considered as a first step towards more practical Boosting algorithms. Very similar to AdaBoost is the Arcing algorithm, for which convergence to a linear programming solution can be shown (cf. [27]). Although the Boosting scheme seemed intuitive in the context of algorithmic design, a step forward in transparency was taken by explaining Boosting in terms of a stage-wise gradient descent procedu...

166 | Semi-infinite programming: theory, methods, and applications
- Hettich, Kortanek
- 1993
Citation Context: ...table hypothesis classes. For uncountable classes, one can establish the same results under some regularity conditions on H, in particular that the real-valued hypotheses h are uniformly bounded (cf. [93, 149, 157, 147]). To avoid confusion, note that the hypotheses indexed as elements of the hypothesis set H are marked by a tilde, i.e. h̃1, ..., h̃J, whereas the hypotheses returned by th...

163 | Adaptive game playing using multiplicative weights - Freund, Schapire - 1999

161 | Linear programming boosting via column generation
- Demiriz
- 2002
Citation Context: ... of the function class (cf. Section 6). When trying to develop means for achieving robust Boosting it is important to elucidate the relations between Optimization Theory and Boosting procedures (e.g. [71, 27, 48, 156, 149]). Developing this interesting relationship opened the field to new types of Boosting algorithms. Among other options it now became possible to rigorously define Boosting algorithms for regression (cf...

156 | Game theory, on-line prediction and boosting - Freund, Schapire - 1996

123 | Boosting in the limit: Maximizing the margin of learned ensembles
- Grove, Schuurmans
- 1998
Citation Context: ...th mentioning: large parts of the early Boosting literature persistently contained the misconception that Boosting would not overfit even when running for a large number of iterations. Simulations by [81, 153] on data sets with higher noise content could clearly show overfitting effects, which can only be avoided by regularizing Boosting so as to limit the complexity of the function class (cf. Section 6). ...

105 | An adaptive version of the boost by majority algorithm
- Freund
- 2001
Citation Context: ... algorithms that implement the intuitive idea of limiting the influence of a single example. First we present AdaBoostReg [153], which trades off the influence with the margin, then discuss BrownBoost [65], which gives up on examples for which one cannot achieve large enough margins within a given number of iterations. We then discuss SmoothBoost [178], which prevents overfitting by disallowing overly sk...

104 | Bias, variance, and arcing classifiers
- Breiman
- 1996
Citation Context: .... Although this algorithm seems somewhat simplistic, it has the nice property that it tends to reduce the variance of the overall estimate f(x), as discussed for regression [27, 79] and classification [26, 75, 51]. Thus, Bagging is quite often found to improve the performance of complex (unstable) classifiers, such as neural networks or decision trees [25, 51]. For Boosting the combination of the hypotheses is...

92 | Asymptotic Analysis of Penalized Likelihood and Related Estimators
- Cox, O'Sullivan
- 1990
Citation Context: ...e the complexity of Boosting is limited we might not encounter the effect of overfitting at all. However, when using Boosting procedures on noisy real-world data, it turns out that regularization (e.g. [103, 186, 143, 43]) is mandatory if overfitting is to be avoided (cf. Section 6). This is in line with the general experience in statistical learning procedures when using complex non-linear models, for instance for ne...

89 | A linear programming approach to novelty detection
- Campbell, Bennett
- 2000
Citation Context: ...s of Boosting algorithms. Among other options it now became possible to rigorously define Boosting algorithms for regression (cf. [58, 149]), multi-class problems [3, 155], unsupervised learning (cf. [32, 150]) and to establish convergence proofs for Boosting algorithms by using results from the Theory of Optimization. Further extensions to Boosting algorithms can be found in [32, 150, 74, 168, 169, 183, 3...

86 | Boosting and other ensemble methods
- Drucker, Cortes, et al.
- 1994
Citation Context: ...may cause some numerical problems, we do not dwell upon it in this review. 6 Robustness, Regularization, and Soft-Margins. It has been shown that Boosting rarely overfits in the low noise regime (e.g. [54, 70, 167]); however, it clearly does so for higher noise levels (e.g. [145, 27, 81, ...

86 | Boosting performance in neural networks
- Drucker, Schapire, et al.
- 1993
Citation Context: ...to provide a provably polynomial time Boosting algorithm, while [55] were the first to apply the Boosting idea to a real-world OCR task, relying on neural networks as base learners. The AdaBoost (Adaptive Boosting) algorithm by [67, 68, 70] (cf. Algorithm 2.1) is gene...

73 | Boosting Applied to Tagging and PP Attachment
- Abney, Schapire, et al.
- 1999
Citation Context: ...m (cf. [108]) provides a bound on the probability of misclassification using a margin-based loss function. Theorem 3 ([108]). Let F be a class of real-valued functions from X to [−1, +1], and let θ ∈ [0, 1]. Let P be a probability distribution on X × {−1, +1}, and suppose that N samples S = {(x1, y1), ..., (xN, yN)} are generated independently at random according to P. Then, for any integer N, with probabil...

67 | Legendre functions and the method of random Bregman projections
- Bauschke, Borwein
- 1997
Citation Context: ...ike the soft-margin or the ε-insensitive loss. ⋆⋆ Extended to τ-relaxed in [147]. ⋆⋆⋆ There are a few more technical assumptions. These functions are usually referred to as functions of Legendre-type [162, 13]. ... In this section, we summarize techniques that yield state-of-the-art results and extend the applicability of boosting to the noisy case. The margin distribution is central to the ...

66 | Boosting applied to word sense disambiguation
- Escudero, Marquez, et al.
- 2000
Citation Context: ...as studied in [161] and applications to the problem of modeling auction price uncertainty were introduced in [171]. Applications of boosting methods to natural language processing have been reported in [1, 60, 84, 194], and approaches to Melanoma Diagnosis are presented in [132]. Some further applications to Pose Invariant Face Recognition [94], Lung Cancer Cell Identification [200] and Volatility Estimation for Fi...

64 | Exploiting unlabeled data in ensemble methods
- Bennett, Demiriz, et al.
- 2002

61 | The logarithmic potential method of convex programming - Frisch - 1955

52 | The densest hemisphere problem
- Johnson, Preparata
- 1978
Citation Context: ...(xn))). However, since this loss is non-convex and not even differentiable, the problem of finding the best linear combination is a very hard problem. In fact, the problem is provably intractable [98]. One idea is to use another loss function which bounds the 0/1-loss from above. For instance, AdaBoost employs the exponential loss G^AB(fw, S) := Σ_{n=1}^N exp(−yn fw(xn)), and the LogitBoost algorithm ...
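The excerpt's point that such surrogates bound the intractable 0/1-loss from above holds pointwise for AdaBoost's exponential loss: exp(−yf(x)) ≥ 1 whenever the margin yf(x) ≤ 0. A direct check over a few sample margin values:

```python
import math

def zero_one_loss(margin):
    # margin = y * f(x); margin <= 0 counts as a misclassification.
    return 1.0 if margin <= 0 else 0.0

def exp_loss(margin):
    # AdaBoost's convex surrogate: exp(-y * f(x)).
    return math.exp(-margin)

# The exponential loss upper-bounds the 0/1 loss at every margin value.
for m in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert exp_loss(m) >= zero_one_loss(m)
    print(f"margin={m:+.1f}  0/1={zero_one_loss(m):.0f}  exp={exp_loss(m):.3f}")
```

Because the bound holds pointwise, minimizing the average exponential loss over a sample also pushes down the (non-convex, non-differentiable) training error it dominates.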

49 | Generalized additive models
- Hastie, Tibshirani
- 1990
Citation Context: ...ing to minimize a suitable cost function. In this sense Boosting is strongly related to other algorithms that were known within the Statistics literature for many years, in particular additive models [86] and matching pursuit [118]. However, the recent work on Boosting has brought to the fore many issues which were not studied previously. (i) The important concept of the margin, and its impact on lear...

47 | Classification on proximity data with LP-machines
- Graepel, Herbrich, et al.
- 1999
(Show Context)
Citation Context ...n Section 6.1. In addition, there seems to be a direct connection to the SmoothBoost algorithm. The following proposition shows that ν has an immediate interpretation: Proposition 2 (ν-Property, e.g. =-=[174, 78, 150]-=-). The solution to the optimization problem (45) possesses the following properties: 1. ν upper-bounds the fraction of margin errors. 2. 1 − ν is greater than the fraction of examples with a margin la... |

41 |
Some infinity theory for predictor ensembles
- Breiman
- 2004
(Show Context)
Citation Context ...an be seen in Figure 4, both loss functions bound the classification error G^{0/1}(f,S) from above. Other loss functions have been proposed in the literature, e.g. [74, 127, 154, 196]. It can be shown =-=[116, 196, 28]-=- that in the infinite sample limit, where the sample average converges to the expectation, minimizing either G^AB(f_w,S) or ... [p. 144, R. Meir and G. Rätsch; figure residue omitted: plot of loss against yf(x) comparing the 0/1-loss, squared loss and logistic loss] ...
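The infinite-sample claim cited here can be verified for the exponential loss: at a point with P(y=+1|x) = p, the population minimizer is f* = ½ log(p/(1−p)). A small sketch (the grid search and names are mine, purely for illustration):

```python
import math

def pop_exp_loss(f, p):
    # conditional expected exponential loss at a point with P(y=+1|x)=p
    return p * math.exp(-f) + (1.0 - p) * math.exp(f)

def argmin_on_grid(p, lo=-5.0, hi=5.0, steps=100001):
    # brute-force minimizer over a fine grid, accurate to the grid spacing
    best_f, best_v = lo, float("inf")
    for i in range(steps):
        f = lo + (hi - lo) * i / (steps - 1)
        v = pop_exp_loss(f, p)
        if v < best_v:
            best_f, best_v = f, v
    return best_f
```

The numeric minimizer matches ½ log(p/(1−p)), i.e. half the log-odds, which is why the population minimizer of the exponential loss recovers the conditional class probability.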

40 | Y.: Using decision trees to construct a practical parser
- Haruno, Shirai, et al.
- 1998
(Show Context)
Citation Context ...as studied in [161] and applications to the problem of modeling auction price uncertainty were introduced in [171]. Applications of boosting methods to natural language processing have been reported in =-=[1, 60, 84, 194]-=-, and approaches to Melanoma Diagnosis are presented in [132]. Some further applications to Pose Invariant Face Recognition [94], Lung Cancer Cell Identification [200] and Volatility Estimation for Fi...

39 | Relative Loss Bounds for Single Neurons
- Helmbold, Kivinen, et al.
- 1999
(Show Context)
Citation Context ... learners is not convex with respect to the parameters (see Section 5.4). It would be interesting to see whether this problem could be circumvented (e.g. by designing appropriate cost functions as in =-=[89]-=-). Acknowledgements. We thank Klaus-R. Müller for discussions and his contribution to writing this manuscript. Additionally, we thank Shie Mannor, Sebastian Mika, Takashi Onoda, Bernhard Schölkopf, Al... |

35 | Pose invariant face recognition
- Huang, Hou, et al.
- 2000
(Show Context)
Citation Context ...methods to natural language processing have been reported in [1, 60, 84, 194], and approaches to Melanoma Diagnosis are presented in [132]. Some further applications to Pose Invariant Face Recognition =-=[94]-=-, Lung Cancer Cell Identification [200] and Volatility Estimation for Financial Time Series [7] have also been developed. A detailed list of currently known applications of Boosting and Leveraging met...

34 | Agnostic Boosting.
- Ben-David, Long, et al.
- 2001
(Show Context)
Citation Context |

29 | Duality and auxiliary functions for Bregman distances.
- Pietra, Pietra, et al.
- 2002
(Show Context)
Citation Context ...n the common sense) point on the hyperplane. For another Bregman function G, one finds another projected point d_B^{(t+1)}, since closeness is measured differently. The work of [104, 110] and later of =-=[39, 47, 111, 196]-=- led to a more general understanding of Boosting methods in the context of Bregman distance optimization and Information Theory. Given an arbitrary strictly convex function G : R^N_+ → R (of Legendre...
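A sketch of the Bregman-distance construction this snippet refers to (names are mine): for the choice G(d) = Σ d_i log d_i, the divergence Δ_G(p, q) = G(p) − G(q) − ⟨∇G(q), p − q⟩ reduces to the unnormalized relative entropy, so "closeness" is measured information-theoretically rather than Euclideanly.

```python
import math

def bregman_div(G, gradG, p, q):
    # Delta_G(p, q) = G(p) - G(q) - <grad G(q), p - q>
    return G(p) - G(q) - sum(
        g * (pi - qi) for g, pi, qi in zip(gradG(q), p, q))

def neg_entropy(d):
    # G(d) = sum_i d_i log d_i (negative entropy, strictly convex on R^N_+)
    return sum(di * math.log(di) for di in d)

def grad_neg_entropy(d):
    return [math.log(di) + 1.0 for di in d]
```

For this G the divergence equals Σ p_i log(p_i/q_i) − Σ p_i + Σ q_i, i.e. the KL divergence when both arguments are normalized.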

29 | Boosting Methods for Regression
- Duffy, D
(Show Context)
Citation Context ...rvised learning (cf. [32, 150]) and to establish convergence proofs for Boosting algorithms by using results from the Theory of Optimization. Further extensions to Boosting algorithms can be found in =-=[32, 150, 74, 168, 169, 183, 3, 57, 129, 53]-=-. Recently, Boosting strategies have been quite successfully used in various real-world applications. For instance [176] and earlier [55] and [112] used boosted ensembles of neural networks for OCR. [... |

27 | A geometric approach to leveraging weak learners.
- Duffy, Helmbold
- 2002
(Show Context)
Citation Context .... 5 Leveraging as Stagewise Greedy Optimization In Section 4 we focused mainly on AdaBoost. We now extend our view to more general ensemble learning methods which we refer to as leveraging algorithms =-=[56]-=-. We will relate these methods to numerical optimization techniques. These techniques served as powerful tools to prove the convergence of leveraging algorithms (cf. Section 5.4). We demonstrate the c... |
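A minimal sketch of such a leveraging loop, assuming decision stumps as base learners and the exponential loss (which recovers AdaBoost-style example weights; all names are mine, not the cited formulation):

```python
import math

def stump(threshold, sign):
    # base hypothesis: sign * (+1 if x > threshold else -1)
    return lambda x: sign * (1.0 if x > threshold else -1.0)

def best_stump(X, y, d):
    # base learner: stump maximizing the edge sum_n d_n y_n h(x_n)
    best, best_edge = None, -float("inf")
    for t in sorted(set(X)):
        for s in (+1, -1):
            h = stump(t, s)
            e = sum(dn * yn * h(xn) for dn, xn, yn in zip(d, X, y))
            if e > best_edge:
                best, best_edge = h, e
    return best, best_edge

def leverage(X, y, rounds=10):
    ensemble = []          # list of (alpha, h)
    F = [0.0] * len(X)     # current combined scores
    for _ in range(rounds):
        # reweight examples from the current ensemble's exponential loss
        d = [math.exp(-yn * fn) for yn, fn in zip(y, F)]
        Z = sum(d)
        d = [dn / Z for dn in d]
        h, edge = best_stump(X, y, d)
        eps = min(max((1.0 - edge) / 2.0, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        ensemble.append((alpha, h))
        F = [fn + alpha * h(xn) for fn, xn in zip(F, X)]
    return lambda x: sum(a * h(x) for a, h in ensemble)
```

Each round is a greedy stagewise step: the base learner supplies a direction, and the coefficient alpha plays the role of the line-search step size discussed in Section 5.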

25 | Algorithmic luckiness
- Herbrich, Williamson
(Show Context)
Citation Context ...s independent of T , the number of Boosting iterations. As a final comment we add that a considerable amount of recent work has been devoted to the derivation of so-called data-dependent bounds (e.g. =-=[5, 9, 92]-=-), where the second term on the r.h.s. of (16) is made to depend explicitly on the data. Data-dependent bounds depending explicitly on the weights αt of the weak learners are given in [130]. In additi... |

25 | Boosting for document routing."
- Iyer, Lewis
- 2000
(Show Context)
Citation Context ... global optimum of the loss with respect to the combination coefficients can be solved efficiently. Other loss functions have been used to approach multi-class problems [70, 3, 164], ranking problems =-=[95]-=-, unsupervised learning [150] and regression [76, 58, 149]. See Section 7 for more details on some of these approaches. 5.2 A Generic Algorithm Most Boosting algorithms have in common that they iterat... |

23 | Leveraging for regression - Duffy, Helmbold |

22 | Potential boosters
- Duffy, Helmbold
- 2000
(Show Context)
Citation Context ...g property refers to schemes which are able to guarantee that weak learning algorithms are indeed transformed into strong learning algorithms in the sense described in Section 2.1. Duffy and Helmbold =-=[59]-=- reserve the term ‘boosting’ for algorithms for which the PAC-boosting property can be proved to hold, while using ‘leveraging’ in all other cases. Since we are not overly concerned with PAC learning ... |

17 | Data-dependent margin-based generalization bounds for classification
- Antos, Kégl, et al.
- 2002
(Show Context)
Citation Context ...s independent of T , the number of Boosting iterations. As a final comment we add that a considerable amount of recent work has been devoted to the derivation of so-called data-dependent bounds (e.g. =-=[5, 9, 92]-=-), where the second term on the r.h.s. of (16) is made to depend explicitly on the data. Data-dependent bounds depending explicitly on the weights αt of the weak learners are given in [130]. In additi... |

16 |
Multicategory separation via linear programming
- Bennett, Mangasarian
- 1993
(Show Context)
Citation Context |

16 | A simple cost function for boosting
- Frean, Downs
- 1998
(Show Context)
Citation Context ...ntuitive in the context of algorithmic design, a step forward in transparency was taken by explaining Boosting in terms of a stage-wise gradient descent procedure in an exponential cost function (cf. =-=[27, 63, 74, 127, 153]-=-). A further interesting step towards practical applicability is worth mentioning: large parts of the early Boosting literature persistently contained the misconception that Boosting would not overfit... |

15 | Some theoretical aspects of boosting in the presence of noisy data.
- Jiang
- 2001
(Show Context)
Citation Context ... prove in specific cases) that as T → ∞ the class of functions co_T(H) is dense in T. The consistency of Boosting algorithms has recently been established in [116, 123], following related previous work =-=[97]-=-. The work of [123] also includes rates of convergence for specific weak learners and target classes T. We point out that the full proof of consistency must tackle at least three issues. First, it mus...

13 | Improving algorithms for boosting
- Aslam
- 2000
(Show Context)
Citation Context ...ed in advance (cf. recent work by [30]). We are not aware of any applications of SmoothBoost to real data. Other approaches aimed at directly reducing the effect of difficult examples can be found in =-=[109, 6, 183]-=-. 6.2 Optimization of the Margins Let us return to the analysis of AdaBoost based on margin distributions as discussed in Section 3.4. Consider a base-class of binary hypotheses H, characterized by VC... |

13 |
MadaBoost: A modification of AdaBoost
- Domingo, Watanabe
- 2000
(Show Context)
Citation Context ...rvised learning (cf. [32, 150]) and to establish convergence proofs for Boosting algorithms by using results from the Theory of Optimization. Further extensions to Boosting algorithms can be found in =-=[32, 150, 74, 168, 169, 183, 3, 57, 129, 53]-=-. Recently, Boosting strategies have been quite successfully used in various real-world applications. For instance [176] and earlier [55] and [112] used boosted ensembles of neural networks for OCR. [... |

12 | Volatility Estimation with Functional Gradient Descent for Very High-Dimensional Financial Time Series.
- Audrino, Buhlmann
- 2003
(Show Context)
Citation Context ...Melanoma Diagnosis are presented in [132]. Some further applications to Pose Invariant Face Recognition [94], Lung Cancer Cell Identification [200] and Volatility Estimation for Financial Time Series =-=[7]-=- have also been developed. A detailed list of currently known applications of Boosting and Leveraging methods will be posted on the web at the Boosting homepage http://www.boosting.org/applications.s9... |

12 | Bounds on approximate steepest descent for likelihood maximization in exponential families
- Cesa-Bianchi, Krogh, et al.
- 1994
(Show Context)
Citation Context ... one example, then predicts its label and incurs a loss. The important question in this setting relates to the speed at which the algorithm is able to learn to produce predictions with small loss. In =-=[35, 106, 105]-=- the total loss was bounded in terms of the loss of the best predictor. To derive these results, Bregman divergences [24] and generalized projections were extensively used. In the case of boosting, on... |

10 | A boosting algorithm for regression
- Bertoni, Campadelli, et al.
- 1997
(Show Context)
Citation Context |

8 |
Greedy function approximation
- Friedman
- 1999
(Show Context)
Citation Context ... of algorithms that are able to generate a combined hypothesis f converging to the minimum of some loss function G[f] (if it exists). Special cases are AdaBoost [70], Logistic Regression and LS-Boost =-=[76]-=-. While assuming rather mild conditions on the base learning algorithm and the loss function G, linear convergence rates (e.g. [115]) of the type G[f_{t+1}] − G[f^*] ≤ ...

7 |
A stable exponential penalty algorithm with superlinear convergence
- Cominetti
- 1994
(Show Context)
Citation Context ...ax_{h̃∈H} Σ_{n=1}^N d_n^t h̃(x_n). (34) This is a rather restrictive assumption, being one among many that can be made (cf. [168, 199, 152]). We will later significantly relax this condition (cf. =-=(40)-=-). Let us discuss why the choice in (34) is useful. For this we compute the gradient of the loss function with respect to the weight w_{h̃} of each hypothesis h̃ in the hypoth...
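The selection rule (34) can be sketched directly, assuming (as in the snippet's formulation) that the weighting d already absorbs the label signs; names are mine:

```python
def edge(h, X, d):
    # correlation of the hypothesis outputs with the current weighting d
    return sum(dn * h(xn) for dn, xn in zip(d, X))

def select_hypothesis(hypotheses, X, d):
    # the base learner's job under (34): return the edge-maximizing h
    return max(hypotheses, key=lambda h: edge(h, X, d))
```

This is the restrictive assumption the text mentions: the base learner must return the exact maximizer, not merely a hypothesis with positive edge.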

6 | On boosting with polynomially bounded distributions
- Bshouty, Gavinsky
(Show Context)
Citation Context ...ems like a very promising algorithm for dealing with noisy situations, it should be kept in mind that it is not fully adaptive, in that both κ and γ need to be supplied in advance (cf. recent work by =-=[30]-=-). We are not aware of any applications of SmoothBoost to real data. Other approaches aimed at directly reducing the effect of difficult examples can be found in [109, 6, 183]. 6.2 Optimization of the... |

6 | Bagging can stabilize without reducing variance
- Grandvalet
- 2001
(Show Context)
Citation Context ...2] for more refined choices). Although this algorithm seems somewhat simplistic, it has the nice property that it tends to reduce the variance of the overall estimate f(x) as discussed for regression =-=[27, 79]-=- and classification [26, 75, 51]. Thus, Bagging is quite often found to improve the performance of complex (unstable) classifiers, such as neural networks or decision trees [25, 51]. For Boosting the ... |
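A minimal Bagging sketch along the lines described in this snippet (the `train` callback and all names are my assumptions, not the cited implementation): train each model on a bootstrap resample and average the predictions, which tends to reduce the variance of the combined estimate.

```python
import random

def bag(train, X, y, n_models=25, seed=0):
    # train: (X, y) -> callable predictor
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # bootstrap resample: draw len(X) indices with replacement
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(train([X[i] for i in idx], [y[i] for i in idx]))
    # the bagged predictor averages the individual models
    return lambda x: sum(m(x) for m in models) / n_models
```

Usage with a trivially unstable base learner, e.g. `bag(lambda X, y: (lambda x, m=sum(y)/len(y): m), X, y)`, shows the averaged prediction concentrating around the true mean.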

6 |
Learning Linear Classifiers - Theory and Algorithms
- Herbrich
- 2000
(Show Context)
Citation Context ...and which, at the same time, lead to sparse solutions. The motivation for the former aspect is clear, while the motivation for sparsity is that it often leads to superior generalization results (e.g. =-=[91, 90]-=-) and also to smaller, and therefore computationally more efficient, ensemble hypotheses. Moreover, in the case of infinite hypothesis spaces, sparsity leads to a precise representation in terms of a ... |

5 |
Localized rademacher averages
- Bartlett, Bousquet, et al.
- 2002
(Show Context)
Citation Context ... sup_{f∈F} (1/N) Σ_{n=1}^N σ_n f(x_n), where the expectation is taken with respect to both {σ_n} and {x_n}. The Rademacher complexity has proven to be essential in the derivation of effective generalization bounds (e.g. =-=[189, 11, 108, 10]-=-). The basic intuition behind the definition of RN(F) is its interpretation as a measure of correlation between the class F and a random set of labels. For very rich function classes F we expect a lar...
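The quantity in this snippet can be estimated by Monte Carlo over the random signs. A sketch (representing F as a finite list of value vectors is my simplification; names are mine):

```python
import random

def rademacher_estimate(F_values, trials=2000, seed=0):
    # F_values: one vector (f(x_1), ..., f(x_N)) per function f in F
    rng = random.Random(seed)
    N = len(F_values[0])
    total = 0.0
    for _ in range(trials):
        sigma = [rng.choice((-1, 1)) for _ in range(N)]
        # sup over F of the correlation with the random labels sigma
        total += max(sum(s * v for s, v in zip(sigma, fv)) / N
                     for fv in F_values)
    return total / trials
```

As the intuition in the text suggests, a class rich enough to match any sign pattern attains the maximal value 1, while a single fixed function is nearly uncorrelated with random labels.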

5 | Sparsity vs. large margins for linear classifiers
- Herbrich, Graepel, et al.
- 2000
(Show Context)
Citation Context ...and which, at the same time, lead to sparse solutions. The motivation for the former aspect is clear, while the motivation for sparsity is that it often leads to superior generalization results (e.g. =-=[91, 90]-=-) and also to smaller, and therefore computationally more efficient, ensemble hypotheses. Moreover, in the case of infinite hypothesis spaces, sparsity leads to a precise representation in terms of a ... |

4 |
Nonintrusive appliance load monitoring system
- Carmichael
(Show Context)
Citation Context ...m, in particular for appliances with inverter systems, 20 whereas non-intrusive measuring systems have already been developed for conventional on/off (non-inverter) operating electric equipment (cf. =-=[83, 33]-=-). The study in [139] presents a first evaluation of machine learning techniques to classify the operating status of electric appliances (with and without inverter) for the purpose of constructing a n...

2 |
Multiclass learning, boosting, and error-correcting codes
- Guruswami, Sahai
- 1999
(Show Context)
Citation Context |

2 |
On the boosting ability of top-down decision tree learning algorithms
- Kearns
- 1996
(Show Context)
Citation Context |

1 |
The Linguistic Basis of Text Generation
- unknown authors
- 1987
(Show Context)
Citation Context ... for arbitrary loss functions and any concave regularizer such as the ℓ1-norm [147]. Note that N + 1 is an upper bound on the number of non-zero elements – in practice much sparser solutions are observed =-=[37, 48, 23, 8]-=-. In neural networks, SVMs, matching pursuit (e.g. [118]) and many other algorithms, one uses the ℓ2-norm for regularization. In this case the optimal solution w^* can be expressed as a linear combina...
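The sparsity-inducing effect of the ℓ1-norm alluded to here can be illustrated via proximal steps (my illustration, not from the source): the ℓ1 step soft-thresholds coefficients to exactly zero, while the ℓ2 step only shrinks them.

```python
import math

def prox_l1(w, lam):
    # proximal step of lam * ||w||_1: soft-thresholding, zeroes small coords
    return [math.copysign(max(abs(wi) - lam, 0.0), wi) for wi in w]

def prox_l2(w, lam):
    # proximal step of (lam/2) * ||w||_2^2: uniform shrinkage, never zero
    return [wi / (1.0 + lam) for wi in w]
```

Applied to the same vector, the ℓ1 step kills the small coefficients outright, which is the mechanism behind the sparse solutions reported in the snippet.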

1 |
Eser Emine Erguvanli. The Function of Word Order in Turkish Grammar
- unknown authors
- 1984
(Show Context)
Citation Context ...s independent of T , the number of Boosting iterations. As a final comment we add that a considerable amount of recent work has been devoted to the derivation of so-called data-dependent bounds (e.g. =-=[5, 9, 92]-=-), where the second term on the r.h.s. of (16) is made to depend explicitly on the data. Data-dependent bounds depending explicitly on the weights αt of the weak learners are given in [130]. In additi... |

1 |
Improving regressors using boosting techniques
- Fisher, editor
- 1997
(Show Context)
Citation Context |

1 | On bias, variance, 0/1–loss, and the curse of dimensionality - Friedman - 1997 |

1 |
Boosting mixture models for semi-supervised tasks
- Grandvalet, D’alché-Buc, et al.
- 2001
(Show Context)
Citation Context |