## Bounds on the Sample Complexity of Bayesian Learning Using Information Theory and the VC Dimension (1994)

### Download From

- IEEE

### Download Links

- [ftp.cse.ucsc.edu]
- [www.research.att.com]
- [classes.cec.wustl.edu]
- DBLP

### Other Repositories/Bibliography

Venue: Machine Learning

Citations: 114 (12 self)

### BibTeX

@INPROCEEDINGS{Haussler94boundson,
  author    = {David Haussler and Michael Kearns and Robert Schapire},
  title     = {Bounds on the Sample Complexity of Bayesian Learning Using Information Theory and the VC Dimension},
  booktitle = {Machine Learning},
  year      = {1994},
  pages     = {61--74},
  publisher = {Morgan Kaufmann}
}


### Abstract

In this paper we study a Bayesian or average-case model of concept learning with a twofold goal: to provide more precise characterizations of learning curve (sample complexity) behavior that depend on properties of both the prior distribution over concepts and the sequence of instances seen by the learner, and to smoothly unite in a common framework the popular statistical physics and VC dimension theories of learning curves. To achieve this, we undertake a systematic investigation and comparison of two fundamental quantities in learning and information theory: the probability of an incorrect prediction for an optimal learning algorithm, and the Shannon information gain. This study leads to a new understanding of the sample complexity of learning in several existing models.

1 Introduction

Consider a simple concept learning model in which the learner attempts to infer an unknown target concept f, chosen from a known concept class F of {0,1}-valued functions over an instance space X. ...
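
The two quantities compared in the abstract, the Bayes-optimal probability of mistake and the Shannon information gain, can be illustrated on a toy version of this model. Everything below (a 4-point instance space, the class of all {0,1}-valued functions on it, a uniform prior) is an illustrative assumption, not the paper's construction:

```python
import math
from itertools import product

# Toy setup: uniform prior over all {0,1}-valued concepts on 4 points.
X = range(4)
F = list(product([0, 1], repeat=4))          # 16 concepts

def version_space(history):
    """Concepts consistent with the labeled history [(x, label), ...]."""
    return [f for f in F if all(f[x] == b for x, b in history)]

target = (1, 0, 1, 1)
history, total_gain = [], 0.0
for x in X:
    V = version_space(history)
    p1 = sum(f[x] == 1 for f in V) / len(V)  # predictive prob. of label 1
    p_obs = p1 if target[x] == 1 else 1 - p1
    mistake = 1 - max(p1, 1 - p1)            # Bayes-optimal error probability
    gain = -math.log2(p_obs)                 # Shannon information gain (bits)
    total_gain += gain
    print(f"x={x}: P(mistake)={mistake:.2f}, gain={gain:.2f} bits")
    history.append((x, target[x]))

# With a uniform prior, the total gain equals log2|F| = 4 bits.
print(total_gain)   # → 4.0
```

Here both quantities are 0.5 and 1 bit at every step, since with this (maximally uninformative) class each new label is a fair coin; a structured class makes both decay, which is the learning-curve behavior the paper studies.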

### Citations

9231 | Elements of Information Theory - Cover, Thomas - 1990

Citation Context: "...bound the probability of mistake by the information gain. We also provide an information-theoretic lower bound on the probability of mistake, which can be viewed as a special case of Fano's inequality [9,14]. Together these bounds provide a general characterization of learning curve behavior that is accurate to within a logarithmic factor. In Section 6 we exploit the learning curve bounds of Section 5 an..."

4179 | Pattern Classification and Scene Analysis - Duda, Hart - 1973

Citation Context: "...define to be that of predicting the label f(x_{m+1}) given only the previous labels f(x_1), ..., f(x_m). The first learning algorithm we consider is called the Bayes optimal classification algorithm [12], or the Bayes algorithm for short. It is a special case of the weighted majority algorithm [20]. For any m and b ∈ {0,1}, define F^b_m(x, f) = F^b_m(f) = {f̄ ∈ F_m(x, f) : f̄(x_{m+1}) = b}. Then the..."
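
A minimal sketch of the Bayes algorithm as described in this excerpt, assuming a finite class and a uniform prior; the threshold class below is an illustrative choice, not the paper's:

```python
# Sketch of the Bayes optimal classification algorithm on a toy class of
# threshold functions over 5 points, with a uniform prior (illustrative).
F = [tuple(1 if x >= t else 0 for x in range(5)) for t in range(6)]

def bayes_predict(history, x_next):
    """Predict the label of x_next from the labeled history [(x, b), ...].

    F_m   : version space, i.e. concepts consistent with the history.
    F_m^b : concepts in F_m assigning label b to x_next.
    Output the b whose class F_m^b has the larger (uniform-prior) mass.
    """
    Fm = [f for f in F if all(f[x] == b for x, b in history)]
    mass = {b: sum(f[x_next] == b for f in Fm) for b in (0, 1)}
    return max(mass, key=mass.get)

print(bayes_predict([(0, 0)], 3))   # → 1 (3 of 5 consistent concepts say 1)
```

Weighting each consistent concept by a non-uniform prior instead of counting them recovers the general posterior-weighted vote.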

1754 | A theory of the learnable - Valiant - 1984

Citation Context: "...ful viewpoints of the learning process in terms of learning curves and cumulative mistakes. One of these, arising from the study of Valiant's distribution-free or probably approximately correct model [32] and having roots in the pattern recognition and minimax decision theory literature, characterizes the distribution-free, worst-case sample complexity of concept learning in terms of a combinatorial p..."

976 | On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications - Vapnik, Chervonenkis - 1971

Citation Context: "...also based on a worst-case assumption over instance space distributions. In addition to the VC dimension, Vapnik and Chervonenkis have a distribution-specific formulation that overcomes this limitation [35], but apart from Natarajan's work [23], it has not been used much in computational learning theory. We extend this idea further in Section 9. The third and perhaps most subtle pessimistic assumption c..."

835 | Estimation of Dependences Based on Empirical Data - Vapnik - 1979

Citation Context: "...decision theory literature, characterizes the distribution-free, worst-case sample complexity of concept learning in terms of a combinatorial parameter known as the Vapnik-Chervonenkis (VC) dimension [34,4]. In contrast, the average-case sample complexity of learning in neural networks has recently been investigated from a standpoint that is essentially Bayesian¹, and is strongly influenced by ideas a..."

701 | The weighted majority algorithm - Littlestone, Warmuth - 1994

Citation Context: "...learning algorithm, and the Shannon information gain from the labels of the instance sequence. In doing so, we borrow from and contribute to the work on weighted majority and aggregating learning strategies [18,20,36,11,2,19], as well as to the VC dimension and statistical physics work. This study leads to a new understanding of the sample complexity of learning in several existing models. ¹ More general Bayesian approach..."

640 | Learnability and the Vapnik-Chervonenkis dimension - Blumer, Ehrenfeucht, et al. - 1989

Citation Context: "...decision theory literature, characterizes the distribution-free, worst-case sample complexity of concept learning in terms of a combinatorial parameter known as the Vapnik-Chervonenkis (VC) dimension [34,4]. In contrast, the average-case sample complexity of learning in neural networks has recently been investigated from a standpoint that is essentially Bayesian¹, and is strongly influenced by ideas a..."

322 | What size net gives valid generalization? - Baum, Haussler - 1989

Citation Context: "...nonzero, finite VC dimensions d_1, d_2, ..., respectively [34]. A typical decomposition might let F_i be all neural networks of a given type with at most i weights, in which case d_i = O(i log i) [3]. We can also look at this from a Bayesian point of view by letting the prior P be over all concepts in F, and decomposing it as a linear sum P = Σ_{i=1}^∞ α_i P_i, where P_i is an arbitrary prior ov..."
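
The mixture decomposition P = Σ α_i P_i in this excerpt can be illustrated with a hypothetical two-subclass prior; the classes (constants and thresholds) and the weights are invented for illustration:

```python
# Illustrative mixture prior P = sum_i alpha_i * P_i over two subclasses:
# F1 = constant functions, F2 = threshold functions, each P_i uniform.
n = 4
F1 = [tuple(b for _ in range(n)) for b in (0, 1)]
F2 = [tuple(1 if x >= t else 0 for x in range(n)) for t in range(n + 1)]
alpha = {1: 0.7, 2: 0.3}                  # hypothetical mixture weights

def prior(f):
    p = 0.0
    if f in F1: p += alpha[1] / len(F1)
    if f in F2: p += alpha[2] / len(F2)   # subclasses may overlap
    return p

def posterior_mass(history):
    """Posterior over concepts consistent with the history, normalized."""
    concepts = set(F1) | set(F2)
    w = {f: prior(f) for f in concepts
         if all(f[x] == b for x, b in history)}
    Z = sum(w.values())
    return {f: v / Z for f, v in w.items()}

# Data from a genuinely non-constant concept rules out F1's concepts and
# concentrates the posterior on F2's interior thresholds:
post = posterior_mass([(0, 0), (3, 1)])
print(post)
```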

263 | Aggregating strategies - Vovk - 1990

Citation Context: "...learning algorithm, and the Shannon information gain from the labels of the instance sequence. In doing so, we borrow from and contribute to the work on weighted majority and aggregating learning strategies [18,20,36,11,2,19], as well as to the VC dimension and statistical physics work. This study leads to a new understanding of the sample complexity of learning in several existing models. ¹ More general Bayesian approach..."

247 | On the density of families of sets - Sauer - 1972

Citation Context: "...are given in the papers of Dudley [13] and Blumer et al. [4] and elsewhere. The following important combinatorial result relating dim_m(F; x) and |Π^F_m(x)| has been proven independently by Sauer [27], Vapnik and Chervonenkis [34], and others (see Assouad [1]): for all x, log |Π^F_m(x)| ≤ log Σ_{i=0}^{dim_m(F; x)} C(m, i) ≤ (1 + o(1)) dim_m(F; x) log(m / dim_m(F; x)) (17), where o(1) is a quantity that g..."
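
The Sauer bound in inequality (17) is easy to evaluate numerically; a small sketch with arbitrary parameter values:

```python
import math

def sauer_bound(m, d):
    """Sauer's bound: a class of VC dimension d realizes at most
    sum_{i=0}^{d} C(m, i) distinct labelings on any m points."""
    return sum(math.comb(m, i) for i in range(d + 1))

m, d = 20, 3
print(sauer_bound(m, d))                       # → 1351 (vs 2^20 = 1048576)
# For m >= d the bound is polynomial in m, roughly m^d up to constants:
assert sauer_bound(m, d) <= (math.e * m / d) ** d
```

The gap between 1351 and 2^20 is exactly what makes the right-hand side of (17) useful: the log of the number of labelings grows like d log(m/d) rather than m.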

240 | Probability theory - Renyi - 1970

Citation Context: "...½ H_P(Π^F_1(x_1)) − ½ H_P(Π^F_1(x_2)). However, since the partition Π^F_2(x_1, x_2) is a refinement of the partitions Π^F_1(x_1) and Π^F_1(x_2), we have (see e.g. Renyi [25]) H_P(Π^F_2(x_1, x_2)) ≤ H_P(Π^F_1(x_1)) + H_P(Π^F_1(x_2)). Thus E_{f∈P, x∈{(x_1,x_2),(x_2,x_1)}}[I_2(x, f)] ≤ ½ H_P(Π^F_1(x_1)) + ½ H_P(Π^F_1(x_2)) = E_{f∈P, x∈{(x_1,x_2)..."
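
The refinement inequality quoted here, that the entropy of the joint partition is at most the sum of the single-point partition entropies, can be checked numerically on a toy class (illustrative, not the paper's construction):

```python
import math
from collections import Counter

# Toy class: threshold functions on 3 points, uniform prior P (illustrative).
F = [tuple(1 if x >= t else 0 for x in range(3)) for t in range(4)]
prior = 1 / len(F)

def H(points):
    """Entropy (bits) of the partition of F induced by labelings of `points`."""
    blocks = Counter(tuple(f[x] for x in points) for f in F)
    return -sum(c * prior * math.log2(c * prior) for c in blocks.values())

h1, h2, h12 = H([0]), H([1]), H([0, 1])
print(h12, h1 + h2)
assert h12 <= h1 + h2 + 1e-9    # H_P(refinement) <= H_P(part 1) + H_P(part 2)
```

For this class the inequality is strict (1.5 bits versus about 1.81), since the labels of the two points are correlated under the prior.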

161 | Bayesian Methods for Adaptive Models - MacKay - 1992

Citation Context: "...study leads to a new understanding of the sample complexity of learning in several existing models. ¹ More general Bayesian approaches to learning in neural networks are described in the recent papers [21,6]. One of our main motivations for this research arises from the frequent claims of machine learning practitioners that sample complexity bounds derived via the VC dimension are overly pessimistic in p..."

109 | Information-theoretic asymptotics of Bayes methods - Clarke, Barron - 1990

Citation Context: "...more varied and realistic models; some of this ongoing work is outlined in Section 12. Many beautiful results on the performance of Bayesian methods are also given in the statistics literature; see e.g. [8,7] and references therein. 2 Summary of results: Following a brief introduction of some notation in Section 3, our results begin in Section 4. Here we define the Shannon information gain of an example, a..."

108 | Mistake bounds and logarithmic linear-threshold learning algorithms - Littlestone - 1989

96 | Sphere packing numbers for subsets of the boolean n-cube with bounded Vapnik-Chervonenkis dimension - Haussler - 1995

Citation Context: "...to obtain the desired result. Alternately, we can also derive the result directly from the lemmas used in establishing their Theorem 2.3. This latter approach is outlined in the discussion section of [16]. From Equation (21) we can also obtain similar upper bounds for the Gibbs algorithm. In particular, using Equation (11) and Equation (21) we have for all P, E_{f∈P, x∈D}[Gibbs_m(x, f)] ≤ E_{x∈D}[2 dim_m..."

85 | A Course on Empirical Processes - Dudley - 1982

Citation Context: "...rectangles in ℝ^n then dim(F) = 2n; also, if F is the set of all indicator functions for n-fold unions of intervals on X = ℝ then dim(F) = 2n. These and many other examples are given in the papers of Dudley [13] and Blumer et al. [4] and elsewhere. The following important combinatorial result relating dim_m(F; x) and |Π^F_m(x)| has been proven independently by Sauer [27], Vapnik and Chervonenkis [34], an..."

81 | A theory of learning classification rules - Buntine - 1990

Citation Context: "...One of our main motivations for this research arises from the frequent claims of machine learning practitioners that sample complexity bounds derived via the VC dimension are overly pessimistic in practice [5,26]. This pessimism can be traced to three assumptions that are implicit in results that are based on the VC dimension. The first pessimistic assumption is that only the worst-case performance over possi..."

80 | Large automatic learning, rule extraction, and generalization - Denker, Schwartz, et al. - 1987

Citation Context: "...neural networks has recently been investigated from a standpoint that is essentially Bayesian¹, and is strongly influenced by ideas and tools from statistical physics, as well as by information theory [10,31,15,29,24]. While each of these theories has its own distinct strengths and drawbacks, there is little understanding of what relationships hold between them. In this paper, we study an average-case or Bayesian..."

50 | Predicting {0,1} functions on randomly drawn points - Haussler, Littlestone, et al. - 1994

Citation Context: "...E_{f∈P, x∈D}[Σ_{i=1}^m Bayes_i(x, f)] ≤ E_{f∈P, x∈D}[Σ_{i=1}^m Gibbs_i(x, f)] ≤ (1 + o(1)) E_{x∈D}[(dim_m(F; x)/2) log(m / dim_m(F; x))] ≤ (1 + o(1)) (dim(F)/2) log(m / dim(F)) (19). Haussler, Littlestone and Warmuth [17] (Section 3, latter part) show that specific distributions D and priors P can be constructed for each of the classes F listed above (i.e., (homogeneous) linear threshold functions, indicator functions..."
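
The Bayes/Gibbs comparison underlying inequality (19) can be illustrated per step: in expectation over the prior, the Gibbs error (predicting with a single posterior-sampled concept) lies between the Bayes error and twice the Bayes error. A sketch under illustrative assumptions (a threshold class, a uniform prior, a fixed instance sequence):

```python
# Expected per-step errors of the Bayes and Gibbs algorithms, averaged over
# a uniform prior on a toy threshold class (illustrative setup).
F = [tuple(1 if x >= t else 0 for x in range(6)) for t in range(7)]
xs = list(range(6))                      # fixed instance sequence

def expected_step_errors(m):
    """E_f[Bayes_m] and E_f[Gibbs_m] when predicting f(x_m) from x_0..x_{m-1}."""
    bayes = gibbs = 0.0
    for f in F:                          # target drawn from the uniform prior
        hist = [(x, f[x]) for x in xs[:m]]
        Fm = [g for g in F if all(g[x] == b for x, b in hist)]
        p1 = sum(g[xs[m]] == 1 for g in Fm) / len(Fm)
        p_true = p1 if f[xs[m]] == 1 else 1 - p1
        bayes += 0.0 if p_true > 0.5 else (1.0 if p_true < 0.5 else 0.5)
        gibbs += 1 - p_true              # chance a posterior-sampled concept errs
    return bayes / len(F), gibbs / len(F)

for m in range(len(xs)):
    b, g = expected_step_errors(m)
    assert b - 1e-9 <= g <= 2 * b + 1e-9   # E[Bayes] <= E[Gibbs] <= 2 E[Bayes]
```

The factor of 2 follows from the per-block identities: with posterior label probability p, the expected Gibbs error is 2p(1−p) while the Bayes error is min(p, 1−p).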

49 | On the prediction of general recursive functions - Barzdin, Freivalds - 1972

Citation Context: "...learning algorithm, and the Shannon information gain from the labels of the instance sequence. In doing so, we borrow from and contribute to the work on weighted majority and aggregating learning strategies [18,20,36,11,2,19], as well as to the VC dimension and statistical physics work. This study leads to a new understanding of the sample complexity of learning in several existing models. ¹ More general Bayesian approach..."

45 | Consistent inference of probabilities in layered networks: Predictions and generalization - Tishby, Levin, et al. - 1989

Citation Context: "...neural networks has recently been investigated from a standpoint that is essentially Bayesian¹, and is strongly influenced by ideas and tools from statistical physics, as well as by information theory [10,31,15,29,24]. While each of these theories has its own distinct strengths and drawbacks, there is little understanding of what relationships hold between them. In this paper, we study an average-case or Bayesian..."

42 | On-line learning of linear functions - Littlestone, Long, et al. - 1991

34 | Learning probabilistic prediction functions - DeSantis, Markowsky, et al. - 1988

27 | Statistical theory of learning a rule - Gyorgyi, Tishby - 1990

Citation Context: "...neural networks has recently been investigated from a standpoint that is essentially Bayesian¹, and is strongly influenced by ideas and tools from statistical physics, as well as by information theory [10,31,15,29,24]. While each of these theories has its own distinct strengths and drawbacks, there is little understanding of what relationships hold between them. In this paper, we study an average-case or Bayesian..."

26 | Learning from examples in large neural networks - Sompolinsky, Tishby, et al. - 1990

25 | Bounding Sample Size with the Vapnik-Chervonenkis Dimension - Shawe-Taylor, Anthony, et al. - 1989

19 | Rates of convergence in the central limit theorem for empirical processes - Massart - 1986

17 | Calculation of the learning curve of Bayes optimal classification algorithm for learning a perceptron with noise - Opper, Haussler - 1991

13 | Densité et dimension. Annales de l'Institut Fourier 33 - Assouad - 1983

Citation Context: "...of the expression inside the expectation of Equation (4) is pG(p) + (1 − p)G(1 − p) (using the substitution p = P_{m+1}(f)), and is suggestive of a binary 'entropy', in which we interpret p ∈ [0,1] as a probability and G(p) as the 'information' conveyed by the occurrence of an event whose probability is p. We now apply Equation (4) to the three forms of G we have been considering, namely G(..."
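
With the log-loss choice G(p) = log2(1/p), one of the forms of G discussed in this excerpt, the quoted expression p·G(p) + (1 − p)·G(1 − p) is exactly the binary entropy; a small sketch:

```python
import math

def G(p):
    """'Information' of an event of probability p, in bits: G(p) = log2(1/p)."""
    return -math.log2(p)

def binary_entropy(p):
    """p*G(p) + (1-p)*G(1-p), with the convention 0 * log(1/0) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return p * G(p) + (1 - p) * G(1 - p)

print(binary_entropy(0.5))    # → 1.0, the maximum (one full bit)
print(binary_entropy(0.0))    # → 0.0, a certain event conveys nothing
```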

8 | Probably-approximate learning over classes of distributions. Unpublished manuscript - Natarajan - 1989

Citation Context: "...over instance space distributions. In addition to the VC dimension, Vapnik and Chervonenkis have a distribution-specific formulation that overcomes this limitation [35], but apart from Natarajan's work [23], it has not been used much in computational learning theory. We extend this idea further in Section 9. The third and perhaps most subtle pessimistic assumption can be seen by noting that the VC dimen..."

7 | Average case analysis of empirical and explanation-based learning algorithms - Sarrett, Pazzani - 1989

Citation Context: "...One of our main motivations for this research arises from the frequent claims of machine learning practitioners that sample complexity bounds derived via the VC dimension are overly pessimistic in practice [5,26]. This pessimism can be traced to three assumptions that are implicit in results that are based on the VC dimension. The first pessimistic assumption is that only the worst-case performance over possi..."

6 | Theorie der Zeichenerkennung - Vapnik - 1979

Citation Context: "...instance space distribution-dependent form of the bound for the 1-inclusion graph algorithm for all target² concepts, and then apply the argument described in the previous paragraph to obtain the desired result. Alternately, we can..." (² Vapnik had obtained the special case of this result for homogeneous linear threshold functions [33]. Also, see [30] for further interesting properties of E_{x∈D}[dim_m(F; x)].)

4 | Entropy, risk and the Bayesian central limit theorem. Unpublished manuscript - Clarke, Barron - 1991

Citation Context: "...more varied and realistic models; some of this ongoing work is outlined in Section 12. Many beautiful results on the performance of Bayesian methods are also given in the statistics literature; see e.g. [8,7] and references therein. 2 Summary of results: Following a brief introduction of some notation in Section 3, our results begin in Section 4. Here we define the Shannon information gain of an example, a..."

4 | Donsker classes of sets. Probability Theory and Related Fields, 78:169–191 - Talagrand - 1988

Citation Context: "...distribution-dependent form of the bound for the 1-inclusion graph algorithm for all target² concepts, and then apply the argument described in the previous paragraph to obtain the desired result. Alternately, we can also derive the..." (² Vapnik had obtained the special case of this result for homogeneous linear threshold functions [33]. Also, see [30] for further interesting properties of E_{x∈D}[dim_m(F; x)].)

3 | Bayesian back propagation. Unpublished manuscript - Buntine, Weigend - 1991

Citation Context: "...study leads to a new understanding of the sample complexity of learning in several existing models. ¹ More general Bayesian approaches to learning in neural networks are described in the recent papers [21,6]. One of our main motivations for this research arises from the frequent claims of machine learning practitioners that sample complexity bounds derived via the VC dimension are overly pessimistic in p..."

3 | Class notes for course 6.574 - Fano - 1952

Citation Context: "...bound the probability of mistake by the information gain. We also provide an information-theoretic lower bound on the probability of mistake, which can be viewed as a special case of Fano's inequality [9,14]. Together these bounds provide a general characterization of learning curve behavior that is accurate to within a logarithmic factor. In Section 6 we exploit the learning curve bounds of Section 5 an..."