Results 1  10
of
63
Regularization networks and support vector machines
 Advances in Computational Mathematics
, 2000
"... Regularization Networks and Support Vector Machines are techniques for solving certain problems of learning from examples – in particular the regression problem of approximating a multivariate function from sparse data. Radial Basis Functions, for example, are a special case of both regularization a ..."
Abstract

Cited by 266 (33 self)
 Add to MetaCart
Regularization Networks and Support Vector Machines are techniques for solving certain problems of learning from examples – in particular the regression problem of approximating a multivariate function from sparse data. Radial Basis Functions, for example, are a special case of both regularization and Support Vector Machines. We review both formulations in the context of Vapnik’s theory of statistical learning which provides a general foundation for the learning problem, combining functional analysis and statistics. The emphasis is on regression: classification is treated as a special case.
Scalesensitive Dimensions, Uniform Convergence, and Learnability
, 1997
"... Learnability in Valiant's PAC learning model has been shown to be strongly related to the existence of uniform laws of large numbers. These laws define a distributionfree convergence property of means to expectations uniformly over classes of random variables. Classes of realvalued functions enjoy ..."
Abstract

Cited by 208 (1 self)
 Add to MetaCart
Learnability in Valiant's PAC learning model has been shown to be strongly related to the existence of uniform laws of large numbers. These laws define a distributionfree convergence property of means to expectations uniformly over classes of random variables. Classes of realvalued functions enjoying such a property are also known as uniform GlivenkoCantelli classes. In this paper we prove, through a generalization of Sauer's lemma that may be interesting in its own right, a new characterization of uniform GlivenkoCantelli classes. Our characterization yields Dudley, Gin'e, and Zinn's previous characterization as a corollary. Furthermore, it is the first based on a simple combinatorial quantity generalizing the VapnikChervonenkis dimension. We apply this result to obtain the weakest combinatorial condition known to imply PAC learnability in the statistical regression (or "agnostic") framework. Furthermore, we show a characterization of learnability in the probabilistic concept model, solving an open problem posed by Kearns and Schapire. These results show that the accuracy parameter plays a crucial role in determining the effective complexity of the learner's hypothesis class.
A Model of Inductive Bias Learning
 Journal of Artificial Intelligence Research
, 2000
"... A major problem in machine learning is that of inductive bias: how to choose a learner's hypothesis space so that it is large enough to contain a solution to the problem being learnt, yet small enough to ensure reliable generalization from reasonablysized training sets. Typically such bias is suppl ..."
Abstract

Cited by 143 (0 self)
 Add to MetaCart
A major problem in machine learning is that of inductive bias: how to choose a learner's hypothesis space so that it is large enough to contain a solution to the problem being learnt, yet small enough to ensure reliable generalization from reasonablysized training sets. Typically such bias is supplied by hand through the skill and insights of experts. In this paper a model for automatically learning bias is investigated. The central assumption of the model is that the learner is embedded within an environment of related learning tasks. Within such an environment the learner can sample from multiple tasks, and hence it can search for a hypothesis space that contains good solutions to many of the problems in the environment. Under certain restrictions on the set of all hypothesis spaces available to the learner, we show that a hypothesis space that performs well on a sufficiently large number of training tasks will also perform well when learning novel tasks in the same environment. Exp...
Bounds on the Sample Complexity of Bayesian Learning Using Information Theory and the VC Dimension
 Machine Learning
, 1994
"... In this paper we study a Bayesian or averagecase model of concept learning with a twofold goal: to provide more precise characterizations of learning curve (sample complexity) behavior that depend on properties of both the prior distribution over concepts and the sequence of instances seen by the l ..."
Abstract

Cited by 108 (12 self)
 Add to MetaCart
In this paper we study a Bayesian or averagecase model of concept learning with a twofold goal: to provide more precise characterizations of learning curve (sample complexity) behavior that depend on properties of both the prior distribution over concepts and the sequence of instances seen by the learner, and to smoothly unite in a common framework the popular statistical physics and VC dimension theories of learning curves. To achieve this, we undertake a systematic investigation and comparison of two fundamental quantities in learning and information theory: the probability of an incorrect prediction for an optimal learning algorithm, and the Shannon information gain. This study leads to a new understanding of the sample complexity of learning in several existing models. 1 Introduction Consider a simple concept learning model in which the learner attempts to infer an unknown target concept f , chosen from a known concept class F of f0; 1gvalued functions over an instance space X....
Sphere Packing Numbers for Subsets of the Boolean nCube with Bounded VapnikChervonenkis Dimension
, 1992
"... : Let V ` f0; 1g n have VapnikChervonenkis dimension d. Let M(k=n;V ) denote the cardinality of the largest W ` V such that any two distinct vectors in W differ on at least k indices. We show that M(k=n;V ) (cn=(k + d)) d for some constant c. This improves on the previous best result of ((cn ..."
Abstract

Cited by 93 (4 self)
 Add to MetaCart
: Let V ` f0; 1g n have VapnikChervonenkis dimension d. Let M(k=n;V ) denote the cardinality of the largest W ` V such that any two distinct vectors in W differ on at least k indices. We show that M(k=n;V ) (cn=(k + d)) d for some constant c. This improves on the previous best result of ((cn=k) log(n=k)) d . This new result has applications in the theory of empirical processes. 1 The author gratefully acknowledges the support of the Mathematical Sciences Research Institute at UC Berkeley and ONR grant N0001491J1162. 1 1 Statement of Results Let n be natural number greater than zero. Let V ` f0; 1g n . For a sequence of indices I = (i 1 ; . . . ; i k ), with 1 i j n, let V j I denote the projection of V onto I, i.e. V j I = f(v i 1 ; . . . ; v i k ) : (v 1 ; . . . ; v n ) 2 V g: If V j I = f0; 1g k then we say that V shatters the index sequence I. The VapnikChervonenkis dimension of V is the size of the longest index sequence I that is shattered by V [VC71] (t...
A unified framework for Regularization Networks and Support Vector Machines
, 1999
"... This report describers research done at the Center for Biological & Computational Learning and the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology. This research was sponsored by theN ational Science Foundation under contractN o. IIS9800032, the O#ce ofN aval Researc ..."
Abstract

Cited by 50 (13 self)
 Add to MetaCart
This report describers research done at the Center for Biological & Computational Learning and the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology. This research was sponsored by theN ational Science Foundation under contractN o. IIS9800032, the O#ce ofN aval Research under contractN o.N 0001493 10385 and contractN o.N 000149510600. Partial support was also provided by DaimlerBenz AG, Eastman Kodak, Siemens Corporate Research, Inc., ATR and AT&T. Contents Introductic 3 2 OverviF of stati.48EF learni4 theory 5 2.1 Unifo6 Co vergence and the VapnikChervo nenkis bo und ............. 7 2.2 The metho d o Structural Risk Minimizatio ..................... 10 2.3 #unifo8 co vergence and the V # ..................... 10 2.4 Overviewo fo urappro6 h ............................... 13 3 Reproduci9 Kernel HiT ert Spaces: a briL overviE 14 4RegulariEqq.L Networks 16 4.1 Radial Basis Functio8 ................................. 19 4.2 Regularizatioz generalized splines and kernel smo oxy rs .............. 20 4.3 Dual representatio o f Regularizatio Netwo rks ................... 21 4.4 Fro regressioto 5 Support vector machiT9 22 5.1 SVMin RKHS ..................................... 22 5.2 Fro regressioto 6SRMforRNsandSVMs 26 6.1 SRMfo SVMClassificatio .............................. 28 6.1.1 Distributio dependent bo undsfo SVMC .................. 29 7 A BayesiL Interpretatiq ofRegulariTFqEL and SRM? 30 7.1 Maximum A Po terio6 Interpretatio o f ............... 30 7.2 Bayesian interpretatio o f the stabilizer in the RN andSVMfunctio6I6 ...... 32 7.3 Bayesian interpretatio o f the data term in the Regularizatio andSVMfunctioy8 33 7.4 Why a MAP interpretatio may be misleading .................... 33 Connectine between SVMs and Sparse Ap...
Convergence rates of posterior distributions
 Ann. Statist
, 2000
"... We consider the asymptotic behavior of posterior distributions and Bayes estimators for infinitedimensional statistical models. We give general results on the rate of convergence of the posterior measure. These are applied to several examples, including priors on finite sieves, logspline models, D ..."
Abstract

Cited by 43 (11 self)
 Add to MetaCart
We consider the asymptotic behavior of posterior distributions and Bayes estimators for infinitedimensional statistical models. We give general results on the rate of convergence of the posterior measure. These are applied to several examples, including priors on finite sieves, logspline models, Dirichlet processes and interval censoring. 1. Introduction. Suppose
Covering Number Bounds of Certain Regularized Linear Function Classes
 Journal of Machine Learning Research
, 2002
"... Recently, sample complexity bounds have been derived for problems involving linear functions such as neural networks and support vector machines. In many of these theoretical studies, the concept of covering numbers played an important role. It is thus useful to study covering numbers for linear ..."
Abstract

Cited by 42 (3 self)
 Add to MetaCart
Recently, sample complexity bounds have been derived for problems involving linear functions such as neural networks and support vector machines. In many of these theoretical studies, the concept of covering numbers played an important role. It is thus useful to study covering numbers for linear function classes. In this paper, we investigate two closely related methods to derive upper bounds on these covering numbers. The first method, already employed in some earlier studies, relies on the socalled Maurey's lemma; the second method uses techniques from the mistake bound framework in online learning. We compare results from these two methods, as well as their consequences in some learning formulations.
Mutual Information, Metric Entropy, and Cumulative Relative Entropy Risk
 Annals of Statistics
, 1996
"... Assume fP ` : ` 2 \Thetag is a set of probability distributions with a common dominating measure on a complete separable metric space Y . A state ` 2 \Theta is chosen by Nature. A statistician gets n independent observations Y 1 ; : : : ; Y n from Y distributed according to P ` . For each time ..."
Abstract

Cited by 39 (2 self)
 Add to MetaCart
Assume fP ` : ` 2 \Thetag is a set of probability distributions with a common dominating measure on a complete separable metric space Y . A state ` 2 \Theta is chosen by Nature. A statistician gets n independent observations Y 1 ; : : : ; Y n from Y distributed according to P ` . For each time t between 1 and n, based on the observations Y 1 ; : : : ; Y t\Gamma1 , the statistician produces an estimated distribution P t for P ` , and suffers a loss L(P ` ; P t ). The cumulative risk for the statistician is the average total loss up to time n. Of special interest in information theory, data compression, mathematical finance, computational learning theory and statistical mechanics is the special case when the loss L(P ` ; P t ) is the relative entropy between the true distribution P ` and the estimated distribution P t . Here the cumulative Bayes risk from time 1 to n is the mutual information between the random parameter \Theta and the observations Y 1 ; : : : ;...