Results 1  10
of
150
Empirical margin distributions and bounding the generalization error of combined classifiers
 Ann. Statist
, 2002
"... Dedicated to A.V. Skorohod on his seventieth birthday We prove new probabilistic upper bounds on generalization error of complex classifiers that are combinations of simple classifiers. Such combinations could be implemented by neural networks or by voting methods of combining the classifiers, such ..."
Abstract

Cited by 122 (9 self)
 Add to MetaCart
Dedicated to A.V. Skorohod on his seventieth birthday We prove new probabilistic upper bounds on generalization error of complex classifiers that are combinations of simple classifiers. Such combinations could be implemented by neural networks or by voting methods of combining the classifiers, such as boosting and bagging. The bounds are in terms of the empirical distribution of the margin of the combined classifier. They are based on the methods of the theory of Gaussian and empirical processes (comparison inequalities, symmetrization method, concentration inequalities) and they improve previous results of Bartlett (1998) on bounding the generalization error of neural networks in terms of ℓ1norms of the weights of neurons and of Schapire, Freund, Bartlett and Lee (1998) on bounding the generalization error of boosting. We also obtain rates of convergence in Lévy distance of empirical margin distribution to the true margin distribution uniformly over the classes of classifiers and prove the optimality of these rates.
Convexity, Classification, and Risk Bounds
 JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
, 2003
"... Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, can be viewed as minimum contrast methods that minimize a convex surrogate of the 01 loss function. The convexity makes these algorithms computationally efficien ..."
Abstract

Cited by 121 (12 self)
 Add to MetaCart
(Show Context)
Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, can be viewed as minimum contrast methods that minimize a convex surrogate of the 01 loss function. The convexity makes these algorithms computationally efficient. The use of a surrogate, however, has statistical consequences that must be balanced against the computational virtues of convexity. To study these issues, we provide a general quantitative relationship between the risk as assessed using the 01 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial upper bounds on excess risk under the weakest possible condition on the loss function: that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a simple variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we
Local Rademacher complexities
 Annals of Statistics
, 2002
"... We propose new bounds on the error of learning algorithms in terms of a datadependent notion of complexity. The estimates we establish give optimal rates and are based on a local and empirical version of Rademacher averages, in the sense that the Rademacher averages are computed from the data, on a ..."
Abstract

Cited by 110 (18 self)
 Add to MetaCart
(Show Context)
We propose new bounds on the error of learning algorithms in terms of a datadependent notion of complexity. The estimates we establish give optimal rates and are based on a local and empirical version of Rademacher averages, in the sense that the Rademacher averages are computed from the data, on a subset of functions with small empirical error. We present some applications to classification and prediction with convex function classes, and with kernel classes in particular.
Introduction to Statistical Learning Theory
 In , O. Bousquet, U.v. Luxburg, and G. Rsch (Editors
, 2004
"... ..."
(Show Context)
Theory of classification: A survey of some recent advances
, 2005
"... The last few years have witnessed important new developments in the theory and practice of pattern classification. We intend to survey some of the main new ideas that have led to these recent results. ..."
Abstract

Cited by 56 (3 self)
 Add to MetaCart
The last few years have witnessed important new developments in the theory and practice of pattern classification. We intend to survey some of the main new ideas that have led to these recent results.
Rademacher Processes And Bounding The Risk Of Function Learning
 High Dimensional Probability II
, 1999
"... We construct data dependent upper bounds on the risk in function learning problems. The bounds are based on the local norms of the Rademacher process indexed by the underlying function class and they do not require prior knowledge about the distribution of training examples or any specific propertie ..."
Abstract

Cited by 46 (8 self)
 Add to MetaCart
(Show Context)
We construct data dependent upper bounds on the risk in function learning problems. The bounds are based on the local norms of the Rademacher process indexed by the underlying function class and they do not require prior knowledge about the distribution of training examples or any specific properties of the function class. Using Talagrand's type concentration inequalities for empirical and Rademacher processes, we show that the bounds hold with high probability that decreases exponentially fast when the sample size grows. In typical situations that are frequently encountered in the theory of function learning, the bounds give nearly optimal rate of convergence of the risk to zero. 1. Local Rademacher norms and bounds on the risk: main results Let (S; A) be a measurable space and let F be a class of Ameasurable functions from S into [0; 1]: Denote P(S) the set of all probability measures on (S; A): Let f 0 2 F be an unknown target function. Given a probability measure P 2 P(S) (also unknown), let (X 1 ; : : : ; Xn ) be an i.i.d. sample in (S; A) with common distribution P (defined on a probability space(\Omega ; \Sigma; P)). In computer learning theory, the problem of estimating f 0 ; based on the labeled sample (X 1 ; Y 1 ); : : : ; (Xn ; Yn ); where Y j := f 0 (X j ); j = 1; : : : ; n; is referred to as function learning problem. The so called concept learning is a special case of function learning. In this case, F := fI C : C 2 Cg; where C ae A is called a class of concepts (see Vapnik (1998), Vidyasagar (1996), Devroye, Gyorfi and Lugosi (1996) for the account on statistical learning theory). The goal of function learning is to find an estimate
Concentration inequalities
 Advanced Lectures in Machine Learning
, 2004
"... Abstract. Concentration inequalities deal with deviations of functions of independent random variables from their expectation. In the last decade new tools have been introduced making it possible to establish simple and powerful inequalities. These inequalities are at the heart of the mathematical a ..."
Abstract

Cited by 38 (1 self)
 Add to MetaCart
(Show Context)
Abstract. Concentration inequalities deal with deviations of functions of independent random variables from their expectation. In the last decade new tools have been introduced making it possible to establish simple and powerful inequalities. These inequalities are at the heart of the mathematical analysis of various problems in machine learning and made it possible to derive new efficient algorithms. This text attempts to summarize some of the basic tools. 1
Reinforcement Learning by Policy Search
, 2000
"... One objective of artificial intelligence is to model the behavior of an intelligent agent interacting with its environment. The environment's transformations could be modeled as a Markov chain, whose state is partially observable to the agent and affected by its actions; such processes are know ..."
Abstract

Cited by 29 (2 self)
 Add to MetaCart
One objective of artificial intelligence is to model the behavior of an intelligent agent interacting with its environment. The environment's transformations could be modeled as a Markov chain, whose state is partially observable to the agent and affected by its actions; such processes are known as partially observable Markov decision processes (POMDPs). While the environment's dynamics are assumed to obey certain rules, the agent does not know them and must learn. In this dissertation we focus on the agent's adaptation as captured by the reinforcement learning framework. Reinforcement learning means learning a policya mapping of observations into actionsbased on feedback from the environment. The learning can be viewed as browsing a set of policies while evaluating them by trial through interaction with the environment. The set of policies being searched is constrained by the architecture of the agent's controller. POMDPs require a controller to have a memory. We investigate various architectures for controllers with memory, including controllers with external memory, finite state controllers and distributed controllers for multiagent system. For these various controllers we work out the details of the algorithms which learn by ascending the gradient of expected cumulative reinforcement. Building on statistical learning theory and experiment design theory, a policy evaluation algorithm is developed for the case of experience reuse. We address the question of sufficient experience for uniform convergence of policy evaluation and obtain sample complexity bounds for various estimators. Finally, we demonstrate the performance of the proposed algorithms on several domains, the most complex of which is simulated adaptive packet routing in a telecommunication network.
Monte Carlo algorithms for optimal stopping and statistical learning. Working paper
, 2003
"... Abstract. We extend the LongstaffSchwartz algorithm for approximately solving optimal stopping problems on highdimensional state spaces. We reformulate the optimal stopping problem for Markov processes in discrete time as a generalized statistical learning problem. Within this setup we apply devia ..."
Abstract

Cited by 28 (2 self)
 Add to MetaCart
Abstract. We extend the LongstaffSchwartz algorithm for approximately solving optimal stopping problems on highdimensional state spaces. We reformulate the optimal stopping problem for Markov processes in discrete time as a generalized statistical learning problem. Within this setup we apply deviation inequalities for suprema of empirical processes to derive consistency criteria, and to estimate the convergence rate and sample complexity. Our results strengthen and extend earlier results obtained by Clément, Lamberton and Protter (2002). 1.