## Fat-shattering and the learnability of real-valued functions (1996)

Venue: Journal of Computer and System Sciences

Citations: 62 (10 self)

### BibTeX

```bibtex
@ARTICLE{Bartlett96fat-shatteringand,
  author  = {Peter L. Bartlett and Philip M. Long and Robert C. Williamson},
  title   = {Fat-shattering and the learnability of real-valued functions},
  journal = {Journal of Computer and System Sciences},
  year    = {1996}
}
```

### Abstract

We consider the problem of learning real-valued functions from random examples when the function values are corrupted with noise. With mild conditions on independent observation noise, we provide characterizations of the learnability of a real-valued function class in terms of a generalization of the Vapnik-Chervonenkis dimension: the fat-shattering function, introduced by Kearns and Schapire. We show that, given some restrictions on the noise, a function class is learnable in our model if and only if its fat-shattering function is finite. With different (also quite mild) restrictions, satisfied for example by Gaussian noise, we show that a function class is learnable from polynomially many examples if and only if its fat-shattering function grows polynomially. We prove analogous results in an agnostic setting, where there is no assumption of an underlying function class.
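Since the abstract turns on the fat-shattering function, it may help to recall its standard definition (due to Kearns and Schapire). The notation below follows common usage in the literature and is not quoted from the paper itself:

```latex
% Standard definition of the fat-shattering function (scale-sensitive
% dimension), stated in common notation rather than the paper's own.
A set $\{x_1,\dots,x_m\} \subseteq X$ is said to be $\gamma$-shattered by a
class $F$ of real-valued functions if there are witnesses
$r_1,\dots,r_m \in \mathbb{R}$ such that for every pattern
$b \in \{0,1\}^m$ some $f_b \in F$ satisfies
\[
  f_b(x_i) \ge r_i + \gamma \ \text{when } b_i = 1,
  \qquad
  f_b(x_i) \le r_i - \gamma \ \text{when } b_i = 0 .
\]
The fat-shattering function maps each scale $\gamma > 0$ to the size of the
largest $\gamma$-shattered set:
\[
  \mathrm{fat}_F(\gamma) = \max\bigl\{\, m : \text{some set of $m$ points is
  $\gamma$-shattered by } F \,\bigr\}.
\]
```

At scale $\gamma \to 0$ this recovers a pseudo-dimension-like quantity; the paper's results tie learnability under noise to finiteness (and polynomial growth) of $\mathrm{fat}_F(\gamma)$ for each fixed $\gamma$.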

### Citations

1754 | A theory of the learnable
- Valiant
- 1984
Citation Context: ...function. The function is assumed to be a member of some known class. Using a popular definition of the problem of learning {0, 1}-valued functions (probably approximately correct learning — see [9], [22]), Blumer, Ehrenfeucht, Haussler, and Warmuth have shown [9] that the Vapnik-Chervonenkis dimension (see [23]) of a function class characterizes its learnability, in the sense that a function class is...

976 | On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications
- Vapnik, Chervonenkis
- 1971
Citation Context: ...em of learning {0, 1}-valued functions (probably approximately correct learning — see [9], [22]), Blumer, Ehrenfeucht, Haussler, and Warmuth have shown [9] that the Vapnik-Chervonenkis dimension (see [23]) of a function class characterizes its learnability, in the sense that a function class is learnable if and only if its Vapnik-Chervonenkis dimension is finite. Natarajan [15] and Ben-David, Cesa-Bia...

647 | Convergence of Stochastic Processes
- Pollard
- 1984
Citation Context: ...lly, if F is a set of functions from W to R, let F|_w ⊆ R^m be defined by F|_w = {f|_w : f ∈ F}. The following theorem is due to Haussler [11] (Theorem 3, p. 107); it is an improvement of a result of Pollard [18]. We say a function class is PH-permissible if it satisfies the mild measurability condition defined in Haussler's Section 9.2 [11]. We say a class F of real-valued functions is permissible if the cla...

640 | Learnability and the Vapnik-Chervonenkis dimension
- Blumer, Ehrenfeucht, et al.
- 1989
Citation Context: ...that function. The function is assumed to be a member of some known class. Using a popular definition of the problem of learning {0, 1}-valued functions (probably approximately correct learning — see [9], [22]), Blumer, Ehrenfeucht, Haussler, and Warmuth have shown [9] that the Vapnik-Chervonenkis dimension (see [23]) of a function class characterizes its learnability, in the sense that a function cl...

384 | Decision theoretic generalizations of the PAC model for neural net and other learning applications
- Haussler
- 1992
Citation Context: ...ils and be symmetric about zero; Gaussian noise satisfies these conditions) if and only if the fat-shattering function of the class has a polynomial rate of growth. We also consider agnostic learning [11], [13], in which there is no assumption of an underlying function generating the training examples, and the performance of the learning algorithm is measured by comparison with some function class F. ...

365 | Real Analysis
- Royden
- 1963
Citation Context: ...Q = {q ∈ ∏_{i=1}^∞ {0, 1} : ∀ j_0 ∃ j > j_0, q(j) = 0}. Informally, Q represents the set of all infinite binary sequences that do not end with repeating 1's. Each real number in [0, 1) has a unique representation in Q [23]. Suppose F = {f_q : q ∈ Q}. Since X is countable, F is permissible. Trivially, fat_F(1/4) = ∞, so F is not learnable in any sense described in this paper. However, since for any q_1, q_2 ∈ Q for which q_1 ≠ q_2, for ...

267 | Probabilistic algorithms for Hamiltonian circuits and matchings
- Angluin, Valiant
- 1979
Citation Context: ...ils of the Gaussian density) imply that, with probability at least 3/4, a noisy observation is closer to the value f(x) than to any other integral multiple of 2^{-d-2}. From the standard Chernoff bounds [2], if m ≥ 12 log(1/δ), the probability that the algorithm will store the correct label for fewer than half of the examples is less than δ. So this algorithm can (ε, δ)-learn from 12 log(1/δ) examples, for...
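The Chernoff-bound step quoted in this context (each label independently correct with probability at least 3/4, majority vote over m ≥ 12 log(1/δ) examples) can be checked numerically. The sketch below is illustrative only: the parameter values come from the quoted context, and `majority_failure_rate` is a hypothetical helper, not code from the paper.

```python
import math
import random

random.seed(0)

def majority_failure_rate(m, p_correct=0.75, trials=20000):
    """Estimate the probability that fewer than half of m labels are
    correct, when each label is independently correct with p_correct."""
    failures = 0
    for _ in range(trials):
        correct = sum(random.random() < p_correct for _ in range(m))
        if correct < m / 2:
            failures += 1
    return failures / trials

delta = 0.05
m = math.ceil(12 * math.log(1 / delta))  # sample size from the quoted bound
rate = majority_failure_rate(m)
print(f"m = {m}, empirical failure rate = {rate:.4f} (target delta = {delta})")
```

With δ = 0.05 this gives m = 36, and the empirical failure rate comes out well below δ, consistent with the exponential tail bound the context appeals to.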

211 | Scale-sensitive dimensions, uniform convergence and learnability
- Alon, Ben-David, et al.
- 1997
Citation Context: ...haracterized the learnability of {0, ..., n}-valued functions for fixed n. Alon, Ben-David, Cesa-Bianchi, and Haussler have proved an analogous result for the problem of learning probabilistic concepts [1]. In this case, there is an unknown [0, 1]-valued function, but the learner does not receive a sequence of values of the function at random points. Instead, with each random point it sees either 0 or ...

201 | Towards efficient agnostic learning
- Kearns, Schapire, et al.
- 1992
Citation Context: ...nd be symmetric about zero; Gaussian noise satisfies these conditions) if and only if the fat-shattering function of the class has a polynomial rate of growth. We also consider agnostic learning [11], [13], in which there is no assumption of an underlying function generating the training examples, and the performance of the learning algorithm is measured by comparison with some function class F. We sh...

199 | Efficient distribution-free learning of probabilistic concepts
- Kearns, Schapire
- 1994
Citation Context: ...learnability. If the "loss" of the learning algorithm was measured with (h(x) − y)^2 instead of |h(x) − y|, then the necessity part of Theorem 22 would follow from the results of Kearns and Schapire [16]. The following result proves the "only if" parts of the theorem. Theorem 23. Let F be a class of [0, 1]-valued functions defined on X. Suppose 0 < γ < 1, 0 < ε ≤ γ/65, 0 < δ ≤ 1/16, and d ∈ N. If fat_F(γ) ≥ d > 100...

47 | On learning sets and functions
- Natarajan
- 1989
Citation Context: ...vonenkis dimension (see [23]) of a function class characterizes its learnability, in the sense that a function class is learnable if and only if its Vapnik-Chervonenkis dimension is finite. Natarajan [15] and Ben-David, Cesa-Bianchi, Haussler and Long [7] have characterized the learnability of {0, ..., n}-valued functions for fixed n. Alon, Ben-David, Cesa-Bianchi, and Haussler have proved an analogous ...

30 | Function learning from interpolation
- Anthony, Bartlett
Citation Context: ...ity. Anthony, Bartlett, Ishai, and Shawe-Taylor [4] give necessary and sufficient conditions that a function that approximately interpolates the target function is a good approximation to it (see also [5, 3]). Natarajan [20] considers the problem of learning a class of real-valued functions in the presence of bounded observation noise and presents sufficient conditions for learnability. (Theorem 2 in [4]...

29 | Universal schemes for sequential decision from individual data sequences
- Merhav, Feder
- 1993
Citation Context: ...Natarajan [16] considers the problem of learning a class of real-valued functions in the presence of bounded observation noise, and presents sufficient conditions for learnability. Merhav and Feder [14], and Auer, Long, Maass, and Woeginger [4] study function learning in a worst-case setting. In the next section, we define admissible noise distribution classes and the learning problems, and present ...

28 | Learnability with respect to a fixed distribution
- Benedek, Itai
- 1991
Citation Context: ...ic algorithms, but an almost identical proof gives the same result for randomized algorithms. We will make use of the following lemma, which is implicit in the results of Benedek and Itai. Theorem 9 ([8]). Choose X, a probability distribution P on X, and f ∈ {0, 1}^X. If (a) h_1, ..., h_r ∈ {0, 1}^X are such that there exists i for which er_{P,f}(h_i) ≥ 1/32, (b) m = ⌈96 ln(8r)⌉, (c) x_1, ..., x_m are drawn indep...

25 | Bounding Sample Size with the Vapnik-Chervonenkis Dimension
- Shawe-Taylor, Anthony, et al.
- 1989
Citation Context: ...s needed for learning, we will make use of the following lemma. Lemma 16. For any y_1, y_2, y_4 > 0 and y_3 ≥ 1, if m ≥ ..., then y_1 exp(y_2 ln^2(y_3 m) − y_4 m) ≤ ... The proof uses the fact [20] that for all a, b > 0, ln a ≤ ab + ln(1/b), in a manner similar to [20]. We can now present the upper bound. Again, the constants have not been optimized. Theorem 17. For any permissible class F of funct...

21 | Approximation and learning of convex superpositions
- Gurvits, Koiran
- 1997
Citation Context: ...F|_z that is no bigger. Now, choose (x_1, y_1), ..., (x_m, y_m) ∈ X × [a, b], and f, g : X → [0, 1]. (Recently, Gurvits and Koiran have proved a result relating the fat-shattering functions of lF and F [14].) We have (1/m) Σ_{i=1}^m |(g(x_i) − y_i)^2 − (f(x_i) − y_i)^2| = (1/m) Σ_{i=1}^m |(g(x_i) − y_i)^2 − ((f(x_i) − g(x_i)) + g(x_i) − y_i)^2| = (1/m) Σ_{i=1}^m |(f(x_i) − g(x_i))^2 + 2(f(x_i) − g(x_i))(g(x_i) − y_i)| ...

20 | Characterizations of learnability for classes of {0, ..., n}-valued functions
- Ben-David
- 1995
Citation Context: ...racterizes its learnability, in the sense that a function class is learnable if and only if its Vapnik-Chervonenkis dimension is finite. Natarajan [19] and Ben-David, Cesa-Bianchi, Haussler, and Long [11] have characterized the learnability of {0, ..., n}-valued functions for fixed n. Alon, Ben-David, Cesa-Bianchi, and Haussler have proved an analogous result for the problem of learning probabilistic ...

17 | More theorems about scale-sensitive dimensions and learning
- Bartlett, Long
- 1995
Citation Context: ...etween constant factors in the argument of the fat-shattering function. If the domain X is infinite, this gap alone can lead to an arbitrarily large gap in the sample complexity bounds. Recent results [9] for agnostic learning narrow this gap to a factor of two. The lower bound on the sample complexity of real-valued learning (Theorem 11) does not increase with 1/ε and 1/δ. In fact, the lower bound of ...

14 | Learning with a slowly changing distribution
- Bartlett
- 1992
Citation Context: ...We will use the following lemma. The proof is by induction, and is implicit in the proof of Lemma 12 in [6]. Lemma 7. If P_i and Q_i are distributions on a set Y (i = 1, ..., m), and E is a measurable subset of Y^m, then |(∏_{i=1}^m P_i)(E) − (∏_{i=1}^m Q_i)(E)| ≤ (1/2) Σ_{i=1}^m d_TV(P_i, Q_i). Proof (of Lemma 5). We will de...

14 | Characterizations of learnability for classes of {0, ..., n}-valued functions
- Ben-David, Cesa-Bianchi, et al.
- 1995
Citation Context: ...aracterizes its learnability, in the sense that a function class is learnable if and only if its Vapnik-Chervonenkis dimension is finite. Natarajan [15] and Ben-David, Cesa-Bianchi, Haussler and Long [7] have characterized the learnability of {0, ..., n}-valued functions for fixed n. Alon, Ben-David, Cesa-Bianchi, and Haussler have proved an analogous result for the problem of learning probabilistic co...

10 | Simulating access to hidden information while learning
- Auer, Long
- 1994
Citation Context: ...gh for our purposes because of the restrictions on ... required to show that learning F is not much harder than learning Q(F). In this section, we present a new technique, inspired by the techniques of [5]. We show that an algorithm for learning a class of discrete-valued functions can effectively be used as a subroutine in an algorithm for learning binary-valued functions. We then apply a lower bound ...

9 | Bounds on the number of examples needed for learning functions
- Simon
- 1997
Citation Context: ...e upper and lower bounds with 1/ε and 1/δ is essential. However, if the noise variance is sufficiently large, it seems likely that there is a general lower bound that grows with these quantities. Simon [21] shows that a stronger notion of shattering provides a lower bound for the problem of learning without noise. However, the finiteness of this strong-fat-shattering function is not sufficient for learn...

7 | Valid generalisation from approximate interpolation
- Anthony, Bartlett, et al.
Citation Context: ...d papers, other general results about learning real-valued functions have been obtained. Haussler [15] gives sufficient conditions for agnostic learnability. Anthony, Bartlett, Ishai, and Shawe-Taylor [4] give necessary and sufficient conditions that a function that approximately interpolates the target function is a good approximation to it (see also [5, 3]). Natarajan [20] considers the problem of l...

6 | Occam's razor for functions
- Natarajan
- 1993
Citation Context: ...ions for agnostic learnability. Anthony and Shawe-Taylor [3] provide sufficient conditions that a function that approximately interpolates the target function is a good approximation to it. Natarajan [16] considers the problem of learning a class of real-valued functions in the presence of bounded observation noise, and presents sufficient conditions for learnability. Merhav and Feder [14], and Auer, ...

4 | On the complexity of function learning
- Auer, Long, et al.
- 1993
Citation Context: ...earning a class of real-valued functions in the presence of bounded observation noise, and presents sufficient conditions for learnability. Merhav and Feder [14], and Auer, Long, Maass, and Woeginger [4] study function learning in a worst-case setting. In the next section, we define admissible noise distribution classes and the learning problems, and present the characterizations of learnability. Sec...

2 | Approximation and learning of convex superpositions, Computational Learning Theory: EUROCOLT '95 - Gurvits, Koiran - 1995

1 | Function learning from interpolation, Computational Learning Theory: EUROCOLT '95 - Anthony, Bartlett - 1995

1 | Valid generalization from approximate interpolation, Computational Learning Theory: EUROCOLT '93
- Anthony, Shawe-Taylor
- 1993
Citation Context: ...the aforementioned papers, other general results about learning real-valued functions have been obtained. Haussler [11] gives sufficient conditions for agnostic learnability. Anthony and Shawe-Taylor [3] provide sufficient conditions that a function that approximately interpolates the target function is a good approximation to it. Natarajan [16] considers the problem of learning a class of real-value...

1 | Regression analysis and empirical processes, Centrum voor Wiskunde en Informatica
- van de Geer
- 1988
Citation Context: ...of learnability in terms of finiteness of the fat-shattering function to weaker noise models. It seems likely that it could be extended to the case of unbounded noise; perhaps the techniques used in [13] to prove uniform convergence with unbounded noise could be useful here. There are several ways in which our results could be improved. The sample complexity upper bound in Theorem 19 increases at lea...

1 | Occam's razor for functions - Natarajan - 1993

1 | Handbook of the Normal Distribution
- Patel, Read
- 1982
Citation Context: ...Obviously, f is an even function. Standard bounds on the area under the tails of the Gaussian density (see [17], p. 64, Fact 3.7.3) give G{ξ ∈ R : |ξ| > s/2} ≤ exp(−s²/(8σ²)), (1) and if s > 8σ, exp(−s²/(8σ²)) < exp(−s/σ), so the constants c_0 = 1 and s_0 = 8σ will satisfy Condition 3′. So the class G of Gaussian di...

1 | Regression Analysis and Empirical Processes, Centrum voor Wiskunde en Informatica
- van de Geer
- 1988
Citation Context: ...oblem. It seems likely that the characterization of learnability in terms of finiteness of the fat-shattering function could be extended to the case of unbounded noise. Perhaps the techniques used in [10] to prove uniform convergence with unbounded noise could be useful here. There are several ways in which our results could be improved. The sample complexity upper bound in Theorem 17 increases at lea...