Results 1–10 of 21
Boosting the margin: A new explanation for the effectiveness of voting methods
In Proceedings of the International Conference on Machine Learning, 1997
Cited by 721 (52 self)
Abstract. One of the surprising recurring phenomena observed in experiments with boosting is that the test error of the generated classifier usually does not increase as its size becomes very large, and often is observed to decrease even after the training error reaches zero. In this paper, we show that this phenomenon is related to the distribution of margins of the training examples with respect to the generated voting classification rule, where the margin of an example is simply the difference between the number of correct votes and the maximum number of votes received by any incorrect label. We show that techniques used in the analysis of Vapnik’s support vector classifiers and of neural networks with small weights can be applied to voting methods to relate the margin distribution to the test error. We also show theoretically and experimentally that boosting is especially effective at increasing the margins of the training examples. Finally, we compare our explanation to those based on the bias-variance decomposition.
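The margin defined in this abstract can be computed directly from vote counts. A minimal sketch with hypothetical vote tallies (not data from the paper):

```python
def margin(votes, correct_label):
    """Margin of one example under a voting classifier: votes for the
    correct label minus the maximum votes for any incorrect label."""
    correct = votes[correct_label]
    best_wrong = max(v for label, v in votes.items() if label != correct_label)
    return correct - best_wrong

# Hypothetical tallies from an ensemble of 10 voters over three labels.
votes = {"a": 6, "b": 3, "c": 1}
print(margin(votes, "a"))   # 3: confidently correct
print(margin(votes, "b"))   # -3: misclassified, since "a" outvotes "b"
```

A large positive margin indicates a confident, correct classification; a negative margin indicates a misclassified example.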
Performance guarantees for regularized maximum entropy density estimation
Proceedings of the 17th Annual Conference on Computational Learning Theory, 2004
Cited by 53 (8 self)
Abstract. We consider the problem of estimating an unknown probability distribution from samples using the principle of maximum entropy (maxent). To alleviate overfitting with a very large number of features, we propose applying the maxent principle with relaxed constraints on the expectations of the features. By convex duality, this turns out to be equivalent to finding the Gibbs distribution minimizing a regularized version of the empirical log loss. We prove non-asymptotic bounds showing that, with respect to the true underlying distribution, this relaxed version of maxent produces density estimates that are almost as good as the best possible. These bounds are in terms of the deviation of the feature empirical averages relative to their true expectations, a number that can be bounded using standard uniform-convergence techniques. In particular, this leads to bounds that drop quickly with the number of samples, and that depend very moderately on the number or complexity of the features. We also derive and prove convergence for both sequential-update and parallel-update algorithms. Finally, we briefly describe experiments on data relevant to the modeling of species geographical distributions.
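The duality described in this abstract can be sketched numerically. The following is an illustrative setup (random synthetic features, plain subgradient descent, hypothetical parameter values; not the paper's algorithm): minimizing the empirical log loss of a Gibbs distribution plus an ℓ1 penalty of width beta, which by convex duality corresponds to maxent with feature-expectation constraints relaxed to a box of width beta.

```python
import numpy as np

rng = np.random.default_rng(0)
F = rng.integers(0, 2, size=(20, 5)).astype(float)  # 20 domain points, 5 binary features
samples = rng.integers(0, 20, size=100)             # indices of observed sample points
emp = F[samples].mean(axis=0)                       # empirical feature averages

w, beta, lr = np.zeros(5), 0.01, 0.1                # beta = relaxation width / l1 strength
for _ in range(3000):
    q = np.exp(F @ w); q /= q.sum()                 # Gibbs distribution q_w(x) ~ exp(w . f(x))
    grad = F.T @ q - emp                            # model feature means minus empirical means
    w -= lr * (grad + beta * np.sign(w))            # subgradient step on regularized log loss

q = np.exp(F @ w); q /= q.sum()
# At (approximate) optimality, the model's feature expectations lie within
# about beta of the empirical averages -- the "relaxed constraints".
print(np.abs(F.T @ q - emp).max())
```

The final deviation settling near beta, rather than at zero, is exactly the constraint relaxation that controls overfitting when the number of features is large.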
Measuring the VC-Dimension of a Learning Machine
Neural Computation, 1994
Cited by 44 (2 self)
A method for measuring the capacity of learning machines is described. The method is based on fitting a theoretically derived function to empirical measurements of the maximal difference between the error rates on two separate data sets of varying sizes. Experimental measurements of the capacity of various types of linear classifiers are presented. 1 Introduction. Many theoretical and experimental studies have shown the influence of the capacity of a learning machine on its generalization ability (Vapnik, 1982; Baum and Haussler, 1989; Le Cun et al., 1990; Weigend, Rumelhart and Huberman, 1991; Guyon et al., 1992; Abu-Mostafa, 1993). Learning machines with a small capacity may not require large training sets to approach the best possible solution (lowest error rate on test sets). High-capacity learning machines, on the other hand, may provide better asymptotic solutions (i.e. lower test error rate for very large training sets), but may require large amounts of training data to reach...
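The measurement at the heart of this method can be sketched as follows (an assumed simplified protocol with random labels and a least-squares linear classifier, not the paper's exact procedure): estimate, as a function of the set size n, the maximal difference between the error rates of one classifier on two separate data sets. The curve of this gap versus n is what the theoretically derived function is fitted to.

```python
import numpy as np

rng = np.random.default_rng(1)

def max_error_gap(n, d=5, trials=20):
    """Average maximal error-rate gap between two sets of size n
    for a d-dimensional linear classifier, over random-label trials."""
    gaps = []
    for _ in range(trials):
        X1, X2 = rng.normal(size=(n, d)), rng.normal(size=(n, d))
        y1, y2 = rng.choice([-1, 1], n), rng.choice([-1, 1], n)
        # Fit one linear rule to agree with set 1 and disagree with set 2,
        # driving the error-rate difference between the two sets as high as possible.
        X, y = np.vstack([X1, X2]), np.concatenate([y1, -y2])
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        err1 = np.mean(np.sign(X1 @ w) != y1)
        err2 = np.mean(np.sign(X2 @ w) != y2)
        gaps.append(err2 - err1)
    return float(np.mean(gaps))

for n in (10, 40, 160):
    print(n, max_error_gap(n))
```

For n small relative to the machine's capacity the gap is large; as n grows past the capacity the gap shrinks toward zero, and the rate of that decay is what reveals the effective capacity.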
Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling
Cited by 20 (1 self)
We present a unified and complete account of maximum entropy density estimation subject to constraints represented by convex potential functions or, alternatively, by convex regularization. We provide fully general performance guarantees and an algorithm with a complete convergence proof. As special cases, we easily derive performance guarantees for many known regularization types, including ℓ1, ℓ2, ℓ2², and ℓ1 + ℓ2² style regularization. We propose an algorithm solving a large and general subclass of generalized maximum entropy problems, including all discussed in the paper, and prove its convergence. Our approach generalizes and unifies techniques based on information geometry and Bregman divergences as well as those based more directly on compactness. Our work is motivated by a novel application of maximum entropy to species distribution modeling, an important problem in conservation biology and ecology. In a set of experiments on real-world data, we demonstrate the utility of maximum entropy in this setting. We explore effects of different feature types, sample sizes, and regularization levels on the performance of maxent, and discuss interpretability of the resulting models.
Distribution-Dependent Vapnik-Chervonenkis Bounds
1999
Cited by 6 (1 self)
Vapnik-Chervonenkis (VC) bounds play an important role in statistical learning theory as they are the fundamental result which explains the generalization ability of learning machines. There has been consequent mathematical work over the years on improving the VC rates of convergence of empirical means to their expectations. The result obtained by Talagrand in 1994 seems to provide more or less the final word on this issue as far as universal bounds are concerned. For fixed distributions, though, this bound can be outperformed in practice. We show indeed that it is possible to replace the 2ε² under the exponential of the deviation term by the corresponding Cramér transform, as shown by large-deviations theorems. We then formulate rigorous distribution-sensitive VC bounds, and we also explain why these theoretical results on such bounds can lead to practical estimates of the effective VC dimension of learning structures. 1 Introduction and motivations One of t...
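The improvement claimed here can be illustrated in the simplest case (the Bernoulli case only; the paper's results are more general): for a Bernoulli mean p, the Cramér transform of a deviation of size ε is the binary relative entropy kl(p+ε, p), which is always at least the universal Hoeffding exponent 2ε² and strictly larger whenever p ≠ 1/2, giving a tighter exponential bound.

```python
import math

def kl(q, p):
    """Binary relative entropy: the Cramér/Chernoff exponent
    for the deviation of a Bernoulli(p) empirical mean up to q."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

p, eps = 0.1, 0.05
hoeffding = 2 * eps ** 2       # universal exponent in the 2ε² bound
cramer = kl(p + eps, p)        # distribution-dependent Cramér exponent
print(hoeffding, cramer)       # the Cramér exponent is larger for p != 1/2
```

Since the deviation probability decays like exp(-n · exponent), a larger exponent means a strictly sharper distribution-sensitive bound.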
Inequalities For A New Data-Based Method For Selecting Nonparametric Density Estimates
In M.L. Puri (editor), Festschrift in Honour of George Roussas, VSP International Science Publishers, 1998
Cited by 5 (4 self)
We continue the development of a method for the selection of a bandwidth or a number of design parameters in density estimation. We provide explicit non-asymptotic density-free inequalities that relate the L1 error of the selected estimate with that of the best possible estimate, and study in particular the connection between the richness of the class of density estimates and the performance bound. For example, our method allows one to pick the bandwidth and kernel order in the kernel estimate simultaneously and still assure that, for all densities, the L1 error of the corresponding kernel estimate is not larger than about three times the error of the estimate with the optimal smoothing factor and kernel, plus a constant times √(log n / n), where n is the sample size and the constant only depends on the complexity of the family of kernels used in the estimate. Further applications include multivariate kernel estimates, transformed kernel estimates, and variable kernel estimates.
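The quantity these inequalities control, the L1 error of a kernel estimate as the bandwidth varies, can be simulated directly (an illustrative simulation against a known standard normal density, not the paper's data-based selection method):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
data = rng.normal(size=n)                       # sample from N(0, 1)
grid = np.linspace(-5, 5, 1001)
dx = grid[1] - grid[0]
true = np.exp(-grid ** 2 / 2) / np.sqrt(2 * np.pi)

def kde(h):
    """Gaussian kernel density estimate with bandwidth h, on the grid."""
    K = np.exp(-((grid[:, None] - data[None, :]) / h) ** 2 / 2)
    return K.sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

for h in (0.05, 0.3, 1.5):
    l1 = np.sum(np.abs(kde(h) - true)) * dx     # L1 distance on the grid
    print(h, round(l1, 3))                      # moderate h gives the smallest L1 error
```

A bandwidth that is too small undersmooths and one too large oversmooths; the inequalities in the paper guarantee that the data-based selection comes within a constant factor of the best bandwidth in this L1 sense, without knowing the true density.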
Maximum entropy density estimation and modeling geographic distributions of species
2007
Cited by 5 (0 self)
The maximum entropy (maxent) approach, formally equivalent to maximum likelihood, is a widely used density-estimation method. When input datasets are small, maxent is likely to overfit. Overfitting can be eliminated by various smoothing techniques, such as regularization and constraint relaxation, but the theory explaining their properties is often missing or needs to be derived for each case separately. In this dissertation, we propose a unified treatment for a large and general class of smoothing techniques. We provide fully general guarantees on their statistical performance and propose optimization algorithms with complete convergence proofs. As special cases, we can easily derive performance guarantees for many known regularization types, including L1 and L2-squared regularization. Furthermore, our general approach enables us to derive entirely new regularization functions with superior statistical guarantees. The new regularization functions use information about the structure of the feature space, incorporate information about sample selection bias, and combine information across several related density-estimation tasks. We propose algorithms solving a large and general subclass of generalized maxent problems, including all
A Bound Concerning the Generalization Ability of a Certain Class of Learning Algorithms
1999
Cited by 3 (2 self)
A classifier is said to have good generalization ability if it performs on test data almost as well as it does on the training data. The main result of this paper provides a sufficient condition for a learning algorithm to have good finite-sample generalization ability. This criterion applies in some cases where the set of all possible classifiers has infinite VC dimension. We apply the result to prove the good generalization ability of support vector machines. Introduction I consider the classical problem of learning a classifier from examples, which can be formalized as follows: Let Z_i = (X_i, Y_i), i = 1, 2, …, be iid random variables taking values in Z = X × {−1, +1}. The problem is predicting Y_{l+1} given X_1, …, X_{l+1} and Y_1, …, Y_l. The solution to the problem is a map M : Z^l → F, where F is a space of classifier functions, i.e., each f ∈ F is a function f : X → {−1, +1}. Thus the prediction is Y_{l+1} = f(X_{l+1}) where f = M(Z_1, ...
Deviation Bounds And Limit Theorems For The Maxima Of Some Stochastic Processes
1996
Cited by 2 (2 self)
We consider stochastic processes which may be defined as averages F_n = (1/n) Σ_{i=1}^n f_i of n small, slowly varying, independent (or almost independent) random càdlàg functions. Such processes arise in many contexts, including queueing and storage problems. Using slight modifications of standard empirical-process methods, we derive general inequalities for the maximum fluctuations of such processes from their means. This allows us to rederive a well-known functional central limit theorem for these processes due to E. Giné and J. Zinn. When the expectation E[F_n(t)] itself has a unique maximum, at a point t_0, we may then also derive second-order bounds for the difference between max F_n(t) and F_n(t_0). While |max F_n(t) − E[F_n(t_0)]| is stochastically on the order of n^{−1/2} for large n, max F_n(t) − F_n(t_0) is strictly smaller, being only on the order of n^{−q/(2q−1)}, where q is the order of the maximum of E[...
A refined margin analysis for boosting algorithms via equilibrium margin
Journal of Machine Learning Research, 2011
Cited by 2 (1 self)
Much attention has been paid to the theoretical explanation of the empirical success of AdaBoost. The most influential work is the margin theory, which is essentially an upper bound on the generalization error of any voting classifier in terms of the margin distribution over the training data. However, important questions were raised about the margin explanation. Breiman (1999) proved a bound in terms of the minimum margin, which is sharper than the margin-distribution bound. He argued that the minimum margin would be better at predicting the generalization error. Grove and Schuurmans (1998) developed an algorithm called LP-AdaBoost which maximizes the minimum margin while keeping all other factors the same as AdaBoost. In experiments, however, LP-AdaBoost usually performs worse than AdaBoost, putting the margin explanation into serious doubt. In this paper, we make a refined analysis of the margin theory. We prove a bound in terms of a new margin measure called the Equilibrium margin (Emargin). The Emargin bound is uniformly
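The two margin statistics contrasted in this abstract are easy to compute side by side. A sketch with random ensemble outputs (illustrative data, not from the paper): the minimum margin is a single worst-case number, while the margin distribution records, for each threshold theta, the fraction of training examples with margin at most theta.

```python
import numpy as np

rng = np.random.default_rng(3)
H = rng.choice([-1, 1], size=(100, 50))      # 50 base classifiers on 100 examples
y = rng.choice([-1, 1], size=100)            # true labels
margins = (H * y[:, None]).mean(axis=1)      # normalized voting margins in [-1, 1]

min_margin = float(margins.min())            # Breiman's worst-case statistic

def dist_at(theta):
    """Margin distribution: fraction of examples with margin <= theta."""
    return float(np.mean(margins <= theta))

print(min_margin)
print(dist_at(0.0), dist_at(0.2))
```

Breiman's bound depends only on min_margin, whereas margin-distribution bounds (and the Emargin refinement) use the whole function dist_at, which is why the two explanations can disagree about algorithms like LP-AdaBoost.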