Results 1 - 10 of 20
Information-Theoretic Determination of Minimax Rates of Convergence
Ann. Stat., 1997
Cited by 98 (18 self)
In this paper, we present some general results determining minimax bounds on statistical risk for density estimation based on certain information-theoretic considerations. These bounds depend only on metric entropy conditions and are used to identify the minimax rates of convergence.
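As a quick illustration of how metric entropy conditions identify a rate (a standard heuristic sketch, not taken from the paper; the symbols M, epsilon_n, s, d are assumptions of this note):

```latex
% Let M(\epsilon) be the metric entropy (log of the \epsilon-covering number)
% of the density class. The minimax rate \epsilon_n under squared L_2 (or
% Hellinger) loss is typically identified by balancing entropy against
% sample size:
M(\epsilon_n) \asymp n\,\epsilon_n^{2} .
% Example: a smoothness class with M(\epsilon) \asymp \epsilon^{-d/s}
% yields \epsilon_n^{2} \asymp n^{-2s/(2s+d)}, the familiar nonparametric rate.
```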
Generalization Performance of Regularization Networks and Support . . .
IEEE TRANSACTIONS ON INFORMATION THEORY, 2001
Cited by 73 (20 self)
We derive new bounds for the generalization error of kernel machines, such as support vector machines and related regularization networks, by obtaining new bounds on their covering numbers. The proofs make use of a viewpoint that is apparently novel in the field of statistical learning theory. The hypothesis class is described in terms of a linear operator mapping from a possibly infinite-dimensional unit ball in feature space into a finite-dimensional space. The covering numbers of the class are then determined via the entropy numbers of the operator. These numbers, which characterize the degree of compactness of the operator, can be bounded in terms of the eigenvalues of an integral operator induced by the kernel function used by the machine. As a consequence, we are able to theoretically explain the effect of the choice of kernel function on the generalization performance of support vector machines.
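The eigenvalues that the abstract says control the entropy numbers can be approximated from data: the eigenvalues of the kernel Gram matrix, scaled by 1/n, estimate the spectrum of the integral operator induced by the kernel. A minimal sketch (this is an illustration of that general idea, not the paper's construction; the function names are hypothetical):

```python
import numpy as np

def empirical_kernel_spectrum(X, kernel):
    """Estimate the integral-operator eigenvalues from sample points X
    via the Gram matrix: eigvals(K) / n approximates the operator spectrum."""
    n = len(X)
    K = np.array([[kernel(x, y) for y in X] for x in X])
    eigs = np.linalg.eigvalsh(K) / n   # scale Gram eigenvalues by 1/n
    return np.sort(eigs)[::-1]         # descending order

# Gaussian (RBF) kernel on 50 points in [-1, 1]
rbf = lambda x, y, s=1.0: np.exp(-((x - y) ** 2) / (2 * s * s))
X = np.linspace(-1.0, 1.0, 50)
spectrum = empirical_kernel_spectrum(X, rbf)
# A rapidly decaying spectrum (as for the Gaussian kernel) means small
# entropy numbers and hence a tighter covering-number bound.
```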
Combining Discriminant Models with new Multi-Class SVMs, 2000
Cited by 39 (10 self)
The idea of combining models instead of simply selecting the best one, in order to improve performance, is well known in statistics and has a long theoretical background. However, making full use of theoretical results is ordinarily subject to the satisfaction of strong hypotheses (weak correlation among the errors, availability of large training sets, possibility to rerun the training procedure an arbitrary number of times, etc.). In contrast, the practitioner who has to make a decision is frequently faced with the difficult problem of combining a given set of pre-trained classifiers, with highly correlated errors, using only a small training sample. Overfitting is then the main risk, which can only be overcome with a strict complexity control of the combiner selected. This suggests that SVMs, which implement the SRM inductive principle, should be well suited for these difficult situations. Investigating this idea, we introduce a new family of multi-class SVMs and assess them as ensemble methods on a real-world problem. This task, protein secondary structure prediction, is an open problem in biocomputing for which model combination appears to be an issue of central importance. Experimental evidence highlights the gain in quality resulting from combining some of the most widely used prediction methods with our SVMs rather than with the ensemble methods traditionally used in the field. The gain is increased when the outputs of the combiners are post-processed with a simple DP algorithm.
Reinforcement Learning by Policy Search, 2000
Cited by 27 (2 self)
One objective of artificial intelligence is to model the behavior of an intelligent agent interacting with its environment. The environment's transformations can be modeled as a Markov chain, whose state is partially observable to the agent and affected by its actions; such processes are known as partially observable Markov decision processes (POMDPs). While the environment's dynamics are assumed to obey certain rules, the agent does not know them and must learn. In this dissertation we focus on the agent's adaptation as captured by the reinforcement learning framework. Reinforcement learning means learning a policy, a mapping of observations into actions, based on feedback from the environment. The learning can be viewed as browsing a set of policies while evaluating them by trial through interaction with the environment. The set of policies being searched is constrained by the architecture of the agent's controller. POMDPs require a controller to have a memory. We investigate various architectures for controllers with memory, including controllers with external memory, finite-state controllers, and distributed controllers for multi-agent systems. For these various controllers we work out the details of the algorithms which learn by ascending the gradient of expected cumulative reinforcement. Building on statistical learning theory and experiment design theory, a policy evaluation algorithm is developed for the case of experience reuse. We address the question of sufficient experience for uniform convergence of policy evaluation and obtain sample complexity bounds for various estimators. Finally, we demonstrate the performance of the proposed algorithms on several domains, the most complex of which is simulated adaptive packet routing in a telecommunication network.
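"Ascending the gradient of expected cumulative reinforcement" can be sketched in a few lines on a toy problem. This is a minimal REINFORCE-style policy-gradient example on a two-armed bandit, assumed here for illustration; it is not the dissertation's memory-based POMDP controllers, though those follow the same principle:

```python
import math
import random

def softmax_policy(theta):
    """Probability of each action under a single logit parameter theta."""
    p1 = 1.0 / (1.0 + math.exp(-theta))  # sigmoid gives P(action 1)
    return [1.0 - p1, p1]

def reinforce(rewards=(0.2, 0.8), lr=0.1, episodes=2000, seed=0):
    """Gradient ascent on expected reward for a Bernoulli two-armed bandit."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(episodes):
        probs = softmax_policy(theta)
        a = 0 if rng.random() < probs[0] else 1      # sample an action
        r = 1.0 if rng.random() < rewards[a] else 0.0  # sample its reward
        grad = a - probs[1]       # d/dtheta log pi(a | theta) for this policy
        theta += lr * r * grad    # ascend the reward-weighted score function
    return theta

theta = reinforce()
# After training, the policy should prefer the better arm (action 1).
```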
Minimax nonparametric classification - Part I: Rates of convergence, 1998
Cited by 16 (0 self)
This paper studies minimax aspects of nonparametric classification. We first study minimax estimation of the conditional probability of a class label, given the feature variable. This function, say f, is assumed to be in a general nonparametric class. We show that the minimax rate of convergence under squared L2 loss is determined by the massiveness of the class as measured by metric entropy. The second part of the paper studies minimax classification. The loss of interest is the difference between the probability of misclassification of a classifier and that of the Bayes decision. As is well known, an upper bound on risk for estimating f gives an upper bound on the risk for classification, but the rate is known to be suboptimal for the class of monotone functions. This suggests that one does not have to estimate f well in order to classify well. However, we show that the two problems are in fact of the same difficulty in terms of rates of convergence under a sufficient condition, which is satisfied by many function classes including Besov (Sobolev), Lipschitz, and bounded variation. This is somewhat surprising in view of a result of Devroye, Györfi, and Lugosi (1996).
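The link between estimating f and classifying can be made concrete with the standard plug-in inequality (a textbook fact stated with assumed notation, not a claim of this paper):

```latex
% For the plug-in classifier \hat{g}(x) = \mathbf{1}\{\hat{f}(x) \ge 1/2\}
% and the Bayes rule g^{*}, the excess misclassification risk satisfies
R(\hat{g}) - R(g^{*})
  \;\le\; 2\,\mathbb{E}\,\bigl|\hat{f}(X) - f(X)\bigr|
  \;\le\; 2\,\bigl(\mathbb{E}\,|\hat{f}(X) - f(X)|^{2}\bigr)^{1/2},
% so any squared-L_2 rate for estimating f transfers to classification;
% the question studied above is when this transfer is rate-optimal.
```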
Regression and Classification with Regularization, 2002
Cited by 12 (6 self)
The purpose of this chapter is to present a theoretical framework for the problem of learning from examples. Learning from examples can be regarded [13] as the problem of approximating a multivariate function from sparse data. The function can be real valued as in regression or binary valued as in classification. The problem of approximating a function from sparse data is ill-posed, and a classical solution is regularization theory [19]. Regularization theory, as we will consider here, formulates the regression problem as the variational problem of finding the function f that minimizes the functional

    (1/ℓ) Σ_{i=1}^{ℓ} V(y_i, f(x_i)) + λ ‖f‖²_K    (6.1)

where V(·,·) is a loss function (in the classical formulation the square loss was used), ‖f‖_K is a norm in a Reproducing Kernel Hilbert Space (RKHS) H defined by the positive definite function K, ℓ is the number of data points or examples (the ℓ training pairs (x_i, y_i)), and λ is the regularization parameter. Under rather general conditions [14, 22, ...
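With the square loss, minimizing a regularization functional of the form (6.1) reduces, by the representer theorem, to a linear system: the minimizer is f(x) = Σ_i c_i K(x_i, x) with coefficients solving (G + λℓI)c = y for the Gram matrix G. A minimal kernel ridge regression sketch under that standard assumption (function names are illustrative, not from the chapter):

```python
import numpy as np

def fit_kernel_ridge(X, y, kernel, lam):
    """Minimize (1/l) * sum (y_i - f(x_i))^2 + lam * ||f||_K^2 over the RKHS."""
    l = len(X)
    G = np.array([[kernel(a, b) for b in X] for a in X])  # Gram matrix
    c = np.linalg.solve(G + lam * l * np.eye(l), y)       # (G + lam*l*I) c = y
    return lambda x: sum(ci * kernel(xi, x) for ci, xi in zip(c, X))

# Fit a noiseless sine curve with a Gaussian kernel
rbf = lambda a, b: np.exp(-(a - b) ** 2)
X = np.linspace(0.0, 3.0, 20)
y = np.sin(X)
f = fit_kernel_ridge(X, y, rbf, lam=1e-3)
```

Larger λ trades data fit for a smaller RKHS norm, which is exactly the complexity control the functional expresses.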
Margin Error and Generalization Capabilities of Multi-Class Discriminant Systems, 2001
Cited by 5 (3 self)
The theory and practice of discriminant analysis have been mainly developed for two-class problems (computation of dichotomies). This phenomenon can easily be explained, since there is an obvious way to perform multi-category discrimination tasks using solely models computing dichotomies. It consists in dividing the problem at hand into several one-against-all ones and applying a simple rule to construct the global discriminant function from the partial ones. Adopting a direct approach, however, should make it possible to improve the results, be they theoretical (bounds on the expected risk) or practical (empirical risk). Although multi-category extensions of the main models computing dichotomies can often be conceived simply, as in the case of multilayer perceptrons, in other cases this can be done only at the expense of losing part of the theoretical foundations. This is, for instance, the main shortcoming of the multi-category support vector machines developed so far. One of the major difficulties of multi-category discriminant analysis rests in the fact that it requires specific uniform convergence results. Indeed, the uniform strong laws of large numbers established for dichotomies do not extend nicely to multi-category problems: they become significantly looser. This is indeed problematic, since the question of the quality of bounds is of central importance if one wants to implement with confidence the structural risk minimization inductive principle, which is precisely what grounds the support vector method. In this paper, building upon the notions of margin used in the context of statistical learning theory and boosting theory, and the corresponding generalization error bounds, we derive sharper bounds on the expected risk (generalization error) of m...
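The "obvious" one-against-all scheme the abstract contrasts with direct multi-category machines amounts to training one binary scorer per class and combining them with an argmax rule. A minimal sketch (the toy scorers are hypothetical, purely for illustration):

```python
from typing import Callable, List

def one_against_all(scorers: List[Callable[[list], float]], x: list) -> int:
    """Predict the class whose binary discriminant scores highest on x."""
    scores = [g(x) for g in scorers]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy linear/radial scorers for a 3-class problem on 2-D inputs:
scorers = [
    lambda x: x[0] - x[1],                   # class 0: first coordinate large
    lambda x: x[1] - x[0],                   # class 1: second coordinate large
    lambda x: 1.0 - abs(x[0]) - abs(x[1]),   # class 2: near the origin
]
```

The uniform convergence results the abstract discusses are needed precisely because the combined hypothesis class is richer than any single dichotomy.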
From Uniform Laws of Large Numbers to Uniform Ergodic Theorems
Cited by 4 (1 self)
The purpose of these lectures is to present three different approaches, each with its own methods, for establishing uniform laws of large numbers and uniform ergodic theorems for dynamical systems. The presentation follows the principle according to which the i.i.d. case is considered first in great detail, and then attempts are made to extend these results to the case of more general dependence structures. The lectures begin (Chapter 1) with a review and description of classic laws of large numbers and ergodic theorems, their connection and interplay, and their infinite-dimensional extensions towards uniform theorems with applications to dynamical systems. The first approach (Chapter 2) is that of metric entropy with bracketing, which relies upon the Blum-DeHardt law of large numbers and Hoffmann-Jørgensen's extension of it. The result extends to general dynamical systems using the uniform ergodic lemma (or Kingman's subadditive ergodic theorem). In this context, metric entropy and majorizing measure type conditions are also considered. The second approach (Chapter 3) is that of Vapnik and Chervonenkis. It relies
Part 1: Overview of the Probably Approximately Correct (PAC) Learning Framework, 1995
Cited by 4 (0 self)
Here we survey some recent theoretical results on the efficiency of machine learning algorithms. The main tool described is the notion of Probably Approximately Correct (PAC) learning, introduced by Valiant. We define this learning model and then look at some of the results obtained in it. We then consider some criticisms of the PAC model and the extensions proposed to address these criticisms. Finally, we look briefly at other models recently proposed in computational learning theory.
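The core quantitative statement of Valiant's PAC model is a sample-complexity bound. For a finite hypothesis class in the realizable setting, the standard bound says that m >= (1/eps) * (ln|H| + ln(1/delta)) examples suffice for any consistent learner to be eps-accurate with probability at least 1 - delta. A minimal sketch of that textbook bound (not a result specific to this survey):

```python
import math

def pac_sample_bound(h_size: int, eps: float, delta: float) -> int:
    """Samples sufficient for a consistent learner over a finite class
    to have error <= eps with probability >= 1 - delta (realizable case)."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# e.g. a class of 2^20 hypotheses, 5% error, 99% confidence:
m = pac_sample_bound(h_size=2 ** 20, eps=0.05, delta=0.01)  # -> 370
```

Note the logarithmic dependence on |H| and 1/delta: doubling the confidence is cheap, while halving the error doubles the sample size.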
Analysis of Regularized Linear Functions for Classification Problems, 1999
Cited by 2 (1 self)
Recently, sample complexity bounds have been... In this paper, we extend some theoretical results in this area by providing a convergence analysis for regularized linear functions, with an emphasis on classification problems. The class of methods we study in this paper generalizes support vector machines and is conceptually very simple. To analyze these methods within the traditional PAC learning framework, we derive dimension-independent covering number bounds for linear systems under certain regularization conditions, and obtain relevant generalization bounds. We also present an analysis of these methods from the asymptotic statistical point of view. We show that this technique provides a better description of the large-sample behavior of these algorithms. Furthermore, we investigate numerical aspects of the proposed methods and establish their relationship with the ill-posed problems studied in numerical mathematics.