Results 1  10
of
59
Optimal aggregation of classifiers in statistical learning
 Ann. Statist
, 2004
"... Classification can be considered as nonparametric estimation of sets, where the risk is defined by means of a specific distance between sets associated with misclassification error. It is shown that the rates of convergence of classifiers depend on two parameters: the complexity of the class of cand ..."
Abstract

Cited by 153 (5 self)
 Add to MetaCart
Classification can be considered as nonparametric estimation of sets, where the risk is defined by means of a specific distance between sets associated with misclassification error. It is shown that the rates of convergence of classifiers depend on two parameters: the complexity of the class of candidate sets and the margin parameter. The dependence is explicitly given, indicating that optimal fast rates approaching O(n−1) can be attained, where n is the sample size, and that the proposed classifiers have the property of robustness to the margin. The main result of the paper concerns optimal aggregation of classifiers: we suggest a classifier that automatically adapts both to the complexity and to the margin, and attains the optimal fast rates, up to a logarithmic factor. 1. Introduction. Let (Xi,Yi)
Empirical margin distributions and bounding the generalization error of combined classifiers
 Ann. Statist
, 2002
"... Dedicated to A.V. Skorohod on his seventieth birthday We prove new probabilistic upper bounds on generalization error of complex classifiers that are combinations of simple classifiers. Such combinations could be implemented by neural networks or by voting methods of combining the classifiers, such ..."
Abstract

Cited by 112 (8 self)
 Add to MetaCart
Dedicated to A.V. Skorohod on his seventieth birthday We prove new probabilistic upper bounds on generalization error of complex classifiers that are combinations of simple classifiers. Such combinations could be implemented by neural networks or by voting methods of combining the classifiers, such as boosting and bagging. The bounds are in terms of the empirical distribution of the margin of the combined classifier. They are based on the methods of the theory of Gaussian and empirical processes (comparison inequalities, symmetrization method, concentration inequalities) and they improve previous results of Bartlett (1998) on bounding the generalization error of neural networks in terms of ℓ1norms of the weights of neurons and of Schapire, Freund, Bartlett and Lee (1998) on bounding the generalization error of boosting. We also obtain rates of convergence in Lévy distance of empirical margin distribution to the true margin distribution uniformly over the classes of classifiers and prove the optimality of these rates.
Local Rademacher complexities
 Annals of Statistics
, 2002
"... We propose new bounds on the error of learning algorithms in terms of a datadependent notion of complexity. The estimates we establish give optimal rates and are based on a local and empirical version of Rademacher averages, in the sense that the Rademacher averages are computed from the data, on a ..."
Abstract

Cited by 106 (18 self)
 Add to MetaCart
We propose new bounds on the error of learning algorithms in terms of a datadependent notion of complexity. The estimates we establish give optimal rates and are based on a local and empirical version of Rademacher averages, in the sense that the Rademacher averages are computed from the data, on a subset of functions with small empirical error. We present some applications to classification and prediction with convex function classes, and with kernel classes in particular.
A Hilbert space embedding for distributions
 In Algorithmic Learning Theory: 18th International Conference
, 2007
"... Abstract. We describe a technique for comparing distributions without the need for density estimation as an intermediate step. Our approach relies on mapping the distributions into a reproducing kernel Hilbert space. Applications of this technique can be found in twosample tests, which are used for ..."
Abstract

Cited by 53 (26 self)
 Add to MetaCart
Abstract. We describe a technique for comparing distributions without the need for density estimation as an intermediate step. Our approach relies on mapping the distributions into a reproducing kernel Hilbert space. Applications of this technique can be found in twosample tests, which are used for determining whether two sets of observations arise from the same distribution, covariate shift correction, local learning, measures of independence, and density estimation. Kernel methods are widely used in supervised learning [1, 2, 3, 4], however they are much less established in the areas of testing, estimation, and analysis of probability distributions, where information theoretic approaches [5, 6] have long been dominant. Recent examples include [7] in the context of construction of graphical models, [8] in the context of feature extraction, and [9] in the context of independent component analysis. These methods have by and large a common issue: to compute quantities such as the mutual information, entropy, or KullbackLeibler divergence, we require sophisticated space partitioning and/or
Mechanism Design via Machine Learning
 IN PROC. OF THE 46TH IEEE SYMP. ON FOUNDATIONS OF COMPUTER SCIENCE
, 2005
"... We use techniques from samplecomplexity in machine learning to reduce problems of incentivecompatible mechanism design to standard algorithmic questions, for a broad class of revenuemaximizing pricing problems. Our reductions imply that for these problems, given an optimal (or #approximation) al ..."
Abstract

Cited by 47 (11 self)
 Add to MetaCart
We use techniques from samplecomplexity in machine learning to reduce problems of incentivecompatible mechanism design to standard algorithmic questions, for a broad class of revenuemaximizing pricing problems. Our reductions imply that for these problems, given an optimal (or #approximation) algorithm for the standard algorithmic problem, we can convert it into a (1 + #)approximation (or #(1 + #)approximation) for the incentivecompatible mechanism design problem, so long as the number of bidders is sufficiently large as a function of an appropriate measure of complexity of the comparison class of solutions. We apply these results to the problem of auctioning a digital good, to the attribute auction problem which includes a wide variety of discriminatory pricing problems, and to the problem of itempricing in unlimitedsupply combinatorial auctions. From a machine learning perspective, these settings present several challenges: in particular, the loss function is discontinuous and asymmetric, and the range of bidders' valuations may be large.
Moment Inequalities for Functions of Independent Random Variables
"... this paper is to provide such generalpurpose inequalities. Our approach is based on a generalization of Ledoux's entropy method (see [26, 28]). Ledoux's method relies on abstract functional inequalities known as logarithmic Sobolev inequalities and provide a powerful tool for deriving exponential i ..."
Abstract

Cited by 39 (9 self)
 Add to MetaCart
this paper is to provide such generalpurpose inequalities. Our approach is based on a generalization of Ledoux's entropy method (see [26, 28]). Ledoux's method relies on abstract functional inequalities known as logarithmic Sobolev inequalities and provide a powerful tool for deriving exponential inequalities for functions of independent random variables, see Boucheron, Massart, and AMS 1991 subject classifications. Primary 60E15, 60C05, 28A35; Secondary 05C80 Key words and phrases. Moment inequalities, Concentration inequalities; Empirical processes; Random graphs Supported by EU Working Group RANDAPX, binational PROCOPE Grant 05923XL The work of the third author was supported by the Spanish Ministry of Science and Technology and FEDER, grant BMF200303324 Lugosi [6, 7], Bousquet [8], Devroye [14], Massart [30, 31], Rio [36] for various applications. To derive moment inequalities for general functions of independent random variables, we elaborate on the pioneering work of Latala and Oleszkiewicz [25] and describe socalled #Sobolev inequalities which interpolate between Poincare's inequality and logarithmic Sobolev inequalities (see also Beckner [4] and Bobkov's arguments in [26])
Rademacher Processes And Bounding The Risk Of Function Learning
 High Dimensional Probability II
, 1999
"... We construct data dependent upper bounds on the risk in function learning problems. The bounds are based on the local norms of the Rademacher process indexed by the underlying function class and they do not require prior knowledge about the distribution of training examples or any specific propertie ..."
Abstract

Cited by 39 (6 self)
 Add to MetaCart
We construct data dependent upper bounds on the risk in function learning problems. The bounds are based on the local norms of the Rademacher process indexed by the underlying function class and they do not require prior knowledge about the distribution of training examples or any specific properties of the function class. Using Talagrand's type concentration inequalities for empirical and Rademacher processes, we show that the bounds hold with high probability that decreases exponentially fast when the sample size grows. In typical situations that are frequently encountered in the theory of function learning, the bounds give nearly optimal rate of convergence of the risk to zero. 1. Local Rademacher norms and bounds on the risk: main results Let (S; A) be a measurable space and let F be a class of Ameasurable functions from S into [0; 1]: Denote P(S) the set of all probability measures on (S; A): Let f 0 2 F be an unknown target function. Given a probability measure P 2 P(S) (also unknown), let (X 1 ; : : : ; Xn ) be an i.i.d. sample in (S; A) with common distribution P (defined on a probability space(\Omega ; \Sigma; P)). In computer learning theory, the problem of estimating f 0 ; based on the labeled sample (X 1 ; Y 1 ); : : : ; (Xn ; Yn ); where Y j := f 0 (X j ); j = 1; : : : ; n; is referred to as function learning problem. The so called concept learning is a special case of function learning. In this case, F := fI C : C 2 Cg; where C ae A is called a class of concepts (see Vapnik (1998), Vidyasagar (1996), Devroye, Gyorfi and Lugosi (1996) for the account on statistical learning theory). The goal of function learning is to find an estimate
A Review of Kernel Methods in Machine Learning
, 2006
"... We review recent methods for learning with positive definite kernels. All these methods formulate learning and estimation problems as linear tasks in a reproducing kernel Hilbert space (RKHS) associated with a kernel. We cover a wide range of methods, ranging from simple classifiers to sophisticate ..."
Abstract

Cited by 35 (3 self)
 Add to MetaCart
We review recent methods for learning with positive definite kernels. All these methods formulate learning and estimation problems as linear tasks in a reproducing kernel Hilbert space (RKHS) associated with a kernel. We cover a wide range of methods, ranging from simple classifiers to sophisticated methods for estimation with structured data.
Learning from multiple sources
 In Advances in Neural Information Processing Systems 19
, 2007
"... We consider the problem of learning accurate models from multiple sources of “nearby ” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This ..."
Abstract

Cited by 34 (3 self)
 Add to MetaCart
We consider the problem of learning accurate models from multiple sources of “nearby ” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decisiontheoretic learning framework, and yields general results for classification and regression. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. We discuss the related problem of learning parameters of a distribution from multiple data sources. Finally, we illustrate our theory through a series of synthetic simulations.