Bounds on the Sample Complexity of Bayesian Learning Using Information Theory and the VC Dimension
 Machine Learning
, 1994
Abstract
Cited by 108 (12 self)
In this paper we study a Bayesian or average-case model of concept learning with a twofold goal: to provide more precise characterizations of learning curve (sample complexity) behavior that depend on properties of both the prior distribution over concepts and the sequence of instances seen by the learner, and to smoothly unite in a common framework the popular statistical physics and VC dimension theories of learning curves. To achieve this, we undertake a systematic investigation and comparison of two fundamental quantities in learning and information theory: the probability of an incorrect prediction for an optimal learning algorithm, and the Shannon information gain. This study leads to a new understanding of the sample complexity of learning in several existing models.
1 Introduction
Consider a simple concept learning model in which the learner attempts to infer an unknown target concept f, chosen from a known concept class F of {0,1}-valued functions over an instance space X....
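The two quantities the abstract compares can be made concrete numerically. The sketch below uses a hypothetical toy setup of my own (not taken from the paper): concepts are all {0,1}-valued functions on a 3-point instance space with a uniform prior, and we track the Bayes-optimal prediction error on the next instance alongside the Shannon information (in bits) gained from observing its label.

```python
import itertools
import math

# Toy sketch (hypothetical setup, not from the paper): concepts are all
# {0,1}-valued functions on a 3-point instance space, prior is uniform.

X = [0, 1, 2]
concepts = list(itertools.product([0, 1], repeat=len(X)))  # tuple f encodes f[x]

def posterior(examples):
    """Concepts consistent with the observed (x, label) pairs (uniform prior)."""
    return [f for f in concepts if all(f[x] == y for x, y in examples)]

target = (1, 0, 1)   # hypothetical target concept
examples = []
for x in X:
    post = posterior(examples)
    p1 = sum(f[x] for f in post) / len(post)   # posterior probability the label is 1
    bayes_error = min(p1, 1 - p1)              # error of the majority-vote prediction
    y = target[x]
    gain = math.log2(len(post)) - math.log2(len(posterior(examples + [(x, y)])))
    print(f"x={x}: Bayes error {bayes_error:.2f}, information gain {gain:.2f} bits")
    examples.append((x, y))
```

Each observed label halves the consistent set here, so every round gains exactly one bit while the Bayes error on an unseen instance stays at 1/2; richer priors and instance sequences decouple the two quantities, which is the behavior the paper characterizes.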
Sphere Packing Numbers for Subsets of the Boolean n-Cube with Bounded Vapnik-Chervonenkis Dimension
, 1992
Abstract
Cited by 93 (4 self)
Let V ⊆ {0,1}^n have Vapnik-Chervonenkis dimension d. Let M(k/n; V) denote the cardinality of the largest W ⊆ V such that any two distinct vectors in W differ on at least k indices. We show that M(k/n; V) ≤ (cn/(k + d))^d for some constant c. This improves on the previous best result of ((cn/k) log(n/k))^d. This new result has applications in the theory of empirical processes. The author gratefully acknowledges the support of the Mathematical Sciences Research Institute at UC Berkeley and ONR grant N00014-91-J-1162.
1 Statement of Results
Let n be a natural number greater than zero. Let V ⊆ {0,1}^n. For a sequence of indices I = (i_1, ..., i_k), with 1 ≤ i_j ≤ n, let V|_I denote the projection of V onto I, i.e. V|_I = {(v_{i_1}, ..., v_{i_k}) : (v_1, ..., v_n) ∈ V}. If V|_I = {0,1}^k then we say that V shatters the index sequence I. The Vapnik-Chervonenkis dimension of V is the size of the longest index sequence I that is shattered by V [VC71] ...
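The definitions above (shattering, VC dimension, and the packing number M(k/n; V)) can be checked by brute force on a tiny example. The notation follows the abstract; the example set V (the zero vector plus the unit vectors in {0,1}^4) is my own choice for illustration.

```python
import itertools

# Brute-force versions of the abstract's definitions, feasible only for tiny V.

def shatters(V, I):
    """Does V project onto all of {0,1}^|I| on the index tuple I?"""
    return {tuple(v[i] for i in I) for v in V} == set(itertools.product([0, 1], repeat=len(I)))

def vc_dimension(V, n):
    d = 0
    for size in range(1, n + 1):
        if any(shatters(V, I) for I in itertools.combinations(range(n), size)):
            d = size
    return d

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def packing_number(V, k):
    """M(k/n; V): size of the largest W in V with pairwise Hamming distance >= k."""
    for size in range(len(V), 0, -1):
        for W in itertools.combinations(V, size):
            if all(hamming(u, v) >= k for u, v in itertools.combinations(W, 2)):
                return size
    return 0

n = 4
V = [v for v in itertools.product([0, 1], repeat=n) if sum(v) <= 1]  # zero vector plus unit vectors
print("VC dimension:", vc_dimension(V, n))
print("M(2/n; V):", packing_number(V, 2))
```

For this V no pair of indices is shattered (the pattern 11 never occurs), so d = 1, while the four unit vectors form a 2-separated packing of size 4.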
Preservation theorems for Glivenko-Cantelli and uniform Glivenko-Cantelli classes
In High Dimensional Probability II (ed. Evarist Giné), 134
, 2000
Abstract
Cited by 19 (8 self)
We show that the P-Glivenko-Cantelli property of classes of functions F1, ..., Fk is preserved by a continuous function ϕ from R^k to R, in the sense that the new class of functions x → ϕ(f1(x), ..., fk(x)), fi ∈ Fi, i = 1, ..., k, is again a Glivenko-Cantelli class of functions if it has an integrable envelope. We also prove an analogous result for preservation of the uniform Glivenko-Cantelli property. Corollaries of the main theorem include two preservation theorems of Dudley (1998). We apply the main result to reprove a theorem of Schick and Yu (1999) concerning consistency of the NPMLE in a model for “mixed case” interval censoring. Finally, a version of the consistency result of Schick and Yu (1999) is established for a general model for “mixed case” interval censoring in which a general sample space Y is partitioned into sets which are members of some VC-class C of subsets of Y.
1 Glivenko-Cantelli theorems
Let (X, A, P) be a probability space, and suppose that F ⊂ L1(P). For
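As a minimal numerical illustration of the Glivenko-Cantelli property the section above introduces (my own toy example, not the paper's preservation results): for the class of indicators 1{x ≤ t} under the uniform distribution on [0, 1], the empirical CDF converges uniformly to the true CDF F(t) = t.

```python
import random

# sup_t |F_n(t) - t| for a uniform sample: the classical Glivenko-Cantelli
# quantity, which should shrink (roughly like 1/sqrt(n)) as n grows.

def sup_deviation(sample):
    """sup over t of |F_n(t) - t| for the empirical CDF of a uniform sample."""
    xs = sorted(sample)
    n = len(xs)
    # Between sample points F_n is flat, so the supremum is attained at the points.
    return max(max(abs((i + 1) / n - x), abs(i / n - x)) for i, x in enumerate(xs))

random.seed(0)
for n in [100, 1000, 10000]:
    sample = [random.random() for _ in range(n)]
    print(n, round(sup_deviation(sample), 4))
```

The preservation theorems in the paper concern classes built from such F by continuous maps ϕ; this sketch only exhibits the base property being preserved.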
Strong Minimax Lower Bounds for Learning
, 1998
Abstract
Cited by 12 (4 self)
Minimax lower bounds for concept learning state, for example, that for each sample size n and learning rule g_n, there exists a distribution of the observation X and a concept C to be learnt such that the expected error of g_n is at least a constant times V/n, where V is the VC dimension of the concept class. However, these bounds say nothing about the rate of decrease of the error for a fixed distribution-concept pair. In this paper we investigate minimax lower bounds in such a stronger sense. We show that for several natural k-parameter concept classes, including the class of linear halfspaces, the class of balls, the class of polyhedra with a certain number of faces, and a class of neural networks, for any sequence of learning rules {g_n}, there exists a fixed distribution of X and a fixed concept C such that the expected error is larger than a constant times k/n for infinitely many n. We also obtain such strong minimax lower bounds for the tail distribution of the probability of error, which extend the corresponding minimax lower bounds.
Neural Networks with Local Receptive Fields and Superlinear VC Dimension
 Neural Computation
, 2002
Abstract
Cited by 3 (2 self)
Local receptive field neurons comprise such well-known and widely used unit types as radial basis function neurons and neurons with center-surround receptive field. We study the Vapnik-Chervonenkis (VC) dimension of feedforward neural networks with one hidden layer of these units. For several variants of local receptive field neurons we show that the VC dimension of these networks is superlinear.
The discrepancy method in computational geometry
 In Handbook of Discrete and Computational Geometry
, 2004
Abstract
Cited by 3 (0 self)
Discrepancy theory investigates how uniform nonrandom structures can be. For example, given n points in the plane, how should we color them red and blue so as to minimize the difference between the number of red points and the number of blue ones within any disk? Or, how should we place n points in the unit square
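The coloring question in the abstract can be played with directly in a simplified one-dimensional setting of my own: intervals on the line stand in for disks in the plane, and we brute-force all red/blue colorings of a small point set to find the one minimizing the worst imbalance over all intervals.

```python
import itertools

# Toy 1-D analogue of geometric discrepancy (intervals instead of disks);
# brute force is only feasible for a handful of points.

def interval_discrepancy(points, coloring):
    """Max over intervals of |#red - #blue|; an interval picks out a contiguous run of the sorted points."""
    order = sorted(range(len(points)), key=lambda i: points[i])
    signs = [1 if coloring[i] else -1 for i in order]  # red = +1, blue = -1
    worst = 0
    for i in range(len(signs)):
        s = 0
        for j in range(i, len(signs)):
            s += signs[j]
            worst = max(worst, abs(s))
    return worst

points = [0.1, 0.25, 0.4, 0.6, 0.75, 0.9]
best = min(interval_discrepancy(points, c)
           for c in itertools.product([0, 1], repeat=len(points)))
print("optimal discrepancy:", best)
```

On the line, alternating colors give every interval an imbalance of at most 1, which is optimal since a single-point interval always has imbalance 1; for disks in the plane the optimal discrepancy grows with n, which is what the discrepancy method exploits.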
Metric Entropy and Minimax Risk in Classification
 In Lecture Notes in Comp. Sci.: Studies in Logic and
, 1997
Abstract
We apply recent results on the minimax risk in density estimation to the related problem of pattern classification. The notion of loss we seek to minimize is an information-theoretic measure of how well we can predict the classification of future examples, given the classification of previously seen examples. We give an asymptotic characterization of the minimax risk in terms of the metric entropy properties of the class of distributions that might be generating the examples. We then use these results to characterize the minimax risk in the special case of noisy two-valued classification problems in terms of the Assouad density and the Vapnik-Chervonenkis dimension.
1 Introduction
The most basic problem in pattern recognition is the problem of classifying instances consisting of vectors of measurements into one of a finite number of types or classes. One standard example is the recognition of isolated capital characters, in which the instances are measurements on images of letters ...