## Local Rademacher complexities (2002)

Venue: | Annals of Statistics |

Citations: | 111 - 18 self |

### BibTeX

@INPROCEEDINGS{Bartlett02localrademacher,
    author = {Peter L. Bartlett and Olivier Bousquet and Shahar Mendelson},
    title = {Local Rademacher complexities},
    booktitle = {Annals of Statistics},
    year = {2002},
    pages = {44--58}
}

### Abstract

We propose new bounds on the error of learning algorithms in terms of a data-dependent notion of complexity. The estimates we establish give optimal rates and are based on a local and empirical version of Rademacher averages, in the sense that the Rademacher averages are computed from the data, on a subset of functions with small empirical error. We present some applications to classification and prediction with convex function classes, and with kernel classes in particular.
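As a rough illustration of the "local and empirical" idea in the abstract, the sketch below estimates an empirical Rademacher average restricted to the functions that pass a localization test. It is a minimal sketch and not the paper's procedure: the name `local_rademacher_average`, the representation of a finite class as a matrix of function values, the choice of the empirical second moment $P_n f^2 \le r$ as the localization condition, and the Monte Carlo averaging over random signs are all assumptions made for illustration.

```python
import numpy as np


def local_rademacher_average(values, radius, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_{f in F_r} (1/n) sum_i sigma_i f(X_i).

    values : array of shape (m, n); row j holds f_j(X_1), ..., f_j(X_n)
             for a finite class of m functions (hypothetical representation).
    radius : keep only functions with empirical second moment P_n f^2 <= radius.
    """
    values = np.asarray(values, dtype=float)
    _, n = values.shape

    # Localization step: restrict the class to a small empirical "ball".
    keep = np.mean(values ** 2, axis=1) <= radius
    if not np.any(keep):
        return 0.0  # convention chosen here for an empty local class
    local = values[keep]

    # Draw independent Rademacher signs and average the supremum over draws.
    rng = np.random.default_rng(seed)
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    correlations = sigma @ local.T / n  # shape (n_draws, number of kept functions)
    return float(np.mean(np.max(correlations, axis=1)))
```

With a fixed seed the same sign draws are reused, so enlarging `radius` can only enlarge the local class and therefore cannot decrease the estimate. That monotonicity in the radius is the property that fixed-point arguments over the radius rely on.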

### Citations

1076 |
A Probabilistic Theory of Pattern Recognition
- Devroye, Györfi, et al.
- 1996
Citation Context ...the infimum.) Then, by definition of $\hat f$, $P_n \ell_{\hat f} \le P_n \ell_{f^*}$. Since the variance of $\ell_{f^*}(X_i, Y_i)$ is no more than some constant times $L^*$, we can apply Bernstein’s inequality (see, e.g., [10], Theorem 8.2) to show that with probability at least $1 - e^{-x}$, $P_n \ell_{\hat f} \le P_n \ell_{f^*} \le P\ell_{f^*} + c\left(\sqrt{\frac{P\ell_{f^*}\, x}{n}} + \frac{x}{n}\right) = L^* + c\left(\sqrt{\frac{L^* x}{n}} + \frac{x}{n}\right)$. Thus, by Theorem 3.3, with probability at least $1 - 2e^{-x}$, ... |

985 |
On the uniform convergence of relative frequencies of events to their probabilities
- Vapnik, Chervonenkis
- 1971
Citation Context ...complexity might depend on the (unknown) underlying probability measure according to which the data is produced. Distribution-free notions of the complexity, such as the Vapnik–Chervonenkis dimension [35] or the metric entropy [28], typically give conservative estimates. Distribution-dependent estimates, based for example on entropy numbers in the L2(P) distance, where P is the underlying distributio... |

660 |
Convergence of stochastic processes
- Pollard
- 1984
Citation Context ...the (unknown) underlying probability measure according to which the data is produced. Distribution-free notions of the complexity, such as the Vapnik–Chervonenkis dimension [35] or the metric entropy [28], typically give conservative estimates. Distribution-dependent estimates, based for example on entropy numbers in the L2(P) distance, where P is the underlying distribution, are not useful when P is... |

510 |
Asymptotic Statistics
- van der Vaart
- 1998
Citation Context ...d with the class of functions (rather than to the expected supremum of that empirical process). This modulus of continuity is well understood from the empirical processes theory viewpoint (see, e.g., [33, 34]). Also, from the point of view of M-estimators, the quantity which determines the rate of convergence is actually a fixed point of this modulus of continuity. Results of this type have been obtained ... |

393 |
Weak Convergence of Empirical Processes
- van der Vaart, Wellner
- 2000
Citation Context ...with the class of functions (rather than to the expected supremum of that empirical process). This modulus of continuity is well understood from the empirical processes theory point of view (see e.g. [35] and [36]). Also, from the point of view of M-estimators, the quantity which determines the rate of convergence is actually a fixed point of this modulus of continuity. Results of this type have been ... |

387 |
Decision theoretic generalizations of the PAC model for neural net and learning applications
- Haussler
- 1992
Citation Context ...results of Vapnik and Chervonenkis based on weighting [35] are restricted to classes of nonnegative functions. Also, most previous results, such as those of Pollard [28], van de Geer [32] or Haussler [13], give complexity terms that involve “global” measures of complexity of the class, such as covering numbers. None of these results uses the recently introduced Rademacher averages as measures of compl... |

314 | Probability in Banach Spaces - Ledoux, Talagrand - 1991 |

306 |
Weak convergence and empirical processes: with applications in statistics
- van der Vaart, Wellner
- 1996
Citation Context ...d with the class of functions (rather than to the expected supremum of that empirical process). This modulus of continuity is well understood from the empirical processes theory viewpoint (see, e.g., [33, 34]). Also, from the point of view of M-estimators, the quantity which determines the rate of convergence is actually a fixed point of this modulus of continuity. Results of this type have been obtained ... |

276 | Rademacher and Gaussian complexities: risk bounds and structural results
- Bartlett, Mendelson
- 2002
Citation Context ...hanks to symmetrization inequalities), it was first proposed as an effective complexity measure by Koltchinskii [15], Bartlett, Boucheron and Lugosi [1] and Mendelson [25] and then further studied in [3]. Unfortunately, one of the shortcomings of the Rademacher averages is that they provide global estimates of the complexity of the function class, that is, they do not reflect the fact that the algori... |

234 |
A Distribution-free Theory of Non-parametric Regression
- Györfi, Kohler, et al.
- 1996
Citation Context ...dures. To analyze the case of unbounded targets, one usually truncates the values at a certain threshold and bounds the probability of exceeding that threshold (see, e.g., the techniques developed in [12]). The training sample is a sequence (X1,Y1),...,(Xn,Yn) of n independent and identically distributed (i.i.d.) pairs sampled according to P. A loss function ℓ : Y × Y → [0, 1] is defined and the goal ... |

186 |
Probability in Banach spaces. Isoperimetry and Processes
- Ledoux, Talagrand
- 1991
Citation Context ...$\sigma_i f(X_i') + E \sup_{f \in F} \frac{1}{n} \sum_{i=1}^{n} -\sigma_i f(X_i)$. Using an identical argument, the same holds for $P_n f - Pf$. In addition, recall the following contraction property, which is due to Ledoux and Talagrand [17]. THEOREM A.6. Let φ be a contraction, that is, |φ(x) − φ(y)| ≤ |x − y|. Then, for every class F, $E_\sigma R_n \phi \circ F \le E_\sigma R_n F$, where φ ∘ F := {φ ∘ f : f ∈ F}. The interested reader may find some additional usef... |

173 |
Asymptotic Statistics
- van der Vaart
- 2000
Citation Context ...the class of functions (rather than to the expected supremum of that empirical process). This modulus of continuity is well understood from the empirical processes theory viewpoint (see e.g. [34] and [33]). Also, from the point of view of M-estimators, the quantity which determines the rate of convergence is actually a fixed point of this modulus of continuity. Results of this type have been obtained ... |

156 |
Sharper bounds for Gaussian and empirical processes
- Talagrand
- 1994
Citation Context ...over, the same results hold for the quantity $\sup_{f \in F}(P_n f - Pf)$. This theorem, which is proved in Appendix A.2, is a more or less direct consequence of Talagrand’s inequality for empirical processes [30]. However, the actual statement presented here is new in the sense that it displays the best known constants. Indeed, compared to the previous result of Koltchinskii and Panchenko [16] which was based... |

149 |
Uniform Central Limit Theorems
- Dudley
Citation Context ..., $Z \le EZ + \sqrt{2xv} + \frac{cx}{3}$. The assertion of this theorem is stated for a countable class of functions to avoid measurability problems. Since only mild assumptions are needed to resolve this issue (see [7] for more details), we ignore it. In a similar way, one can obtain a concentration result for the Rademacher averages of a class (see e.g. [3]). Theorem 2.3 Let F be a class of functions that map X... |

111 | Smooth discrimination analysis
- Mammen, Tsybakov
- 1999
Citation Context ...m 4.1 is that under the assumptions of the theorem, the minimizers of the empirical loss and of the true loss are close with respect to the L2(P) and the L2(Pn) distances (this has also been used in [20] and [31, 32]). PROOF OF THEOREM 5.4. Define the function ψ as $\psi(r) = \frac{c_1}{2} E R_n\{f \in F : L^2 P(f - f^*)^2 \le r\} + \frac{(c_2 - c_1)x}{n}$ (5.2). Notice that since F is convex and thus star-shaped around each of its... |

97 | Sphere packing numbers for subsets of the boolean n-cube with bounded Vapnik-Chervonenkis dimension
- Haussler
- 1995
Citation Context ... ε-cover for star(F, 0) using an ε/2-cover for F and an ε/2-cover for the interval [0, 1], which implies $\log N(\varepsilon, \mathrm{star}(F, 0), L_2(P_n)) \le \log\left\{ N\left(\frac{\varepsilon}{2}, F, L_2(P_n)\right) \left(\left\lceil \frac{2}{\varepsilon} \right\rceil + 1\right) \right\}$. Now, recall that [14] for any probability distribution P and any class F with VC-dimension d < ∞, $\log N\left(\frac{\varepsilon}{2}, F, L_2(P)\right) \le c d \log\frac{1}{\varepsilon}$. Therefore $E R_n\{f \in \mathrm{star}(F, 0) : P_n f^2 \le 2r^*\} \le \sqrt{\frac{cd}{n}} \int_0^{\sqrt{2r^*}} \sqrt{\log\frac{1}{\varepsilon}}\, d\varepsilon$ ... |

92 |
Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse
- Massart
- 2000
Citation Context ...ght behavior in the general, noisy case is to analyze the increments of the empirical process, in other words, to directly consider the functions f − f∗. This approach was first proposed by Massart [22]; see also [26]. Massart introduces the assumption $\mathrm{Var}[\ell_f(X) - \ell_{f^*}(X)] \le d^2(f, f^*) \le B(P\ell_f - P\ell_{f^*})$, where $\ell_f$ is the loss associated with the function f [in other words, $\ell_f(X, Y) = \ell(f(X), Y)$, which measures ... |

89 |
About the constants in Talagrand’s concentration inequalities for empirical processes
- Massart
- 2000
Citation Context ...s new in the sense that it displays the best known constants. Indeed, compared to the previous result of Koltchinskii and Panchenko [16] which was based on Massart’s version of Talagrand’s inequality [21], we have used the most refined concentration inequalities available: that of Bousquet [7] for the supremum of the empirical process and that of Bouch... |

88 |
A sharp concentration inequality with applications. Random Struct
- Boucheron, Lugosi, et al.
- 2000
Citation Context ...efined concentration inequalities available: that of Bousquet [7] for the supremum of the empirical process and that of Boucheron, Lugosi and Massart [5] for the Rademacher averages. This last inequality is a powerful tool to obtain data-dependent bounds, since it allows one to replace the Rademacher average (which measures the complexity of the class... |

81 | Model selection and error estimation
- Bartlett, Boucheron, et al.
Citation Context ...o the expected supremum of the empirical process (thanks to symmetrization inequalities), it was first proposed as an effective complexity measure by Koltchinskii [15], Bartlett, Boucheron and Lugosi [1] and Mendelson [25] and then further studied in [3]. Unfortunately, one of the shortcomings of the Rademacher averages is that they provide global estimates of the complexity of the function class, th... |

77 | Rademacher penalties and structural risk minimization
- Koltchinskii
- 2001
Citation Context ...known for a long time to be related to the expected supremum of the empirical process (thanks to symmetrization inequalities), it was first proposed as an effective complexity measure by Koltchinskii [15], Bartlett, Boucheron and Lugosi [1] and Mendelson [25] and then further studied in [3]. Unfortunately, one of the shortcomings of the Rademacher averages is that they provide global estimates of the ... |

73 |
Uniform Central Limit Theorems, Cambridge
- Dudley
- 1999
Citation Context ...re measurable without explicitly mentioning it. In other words, we assume that the class F and the distribution P satisfy appropriate (mild) conditions for measurability of this supremum (we refer to [11, 28] for a detailed account of such issues). The following theorem is the main result of this section and is at the core of all the proofs presented later. It shows that if the functions in a class have s... |

70 | Efficient agnostic learning of neural networks with bounded fan-in - Lee, Bartlett, et al. - 1996 |

66 | A Bennett concentration inequality and its application to suprema of empirical processes
- Bousquet
Citation Context ...h was based on Massart’s version of Talagrand’s inequality [21], we have used the most refined concentration inequalities available: that of Bousquet [7] for the supremum of the empirical process and that of Boucheron, Lugosi and Massart [5] for the Rademacher averages. This last inequality is a powerful tool to obtain data-dependent bounds, since it ... |

66 |
Empirical Processes in M-Estimation
- van de Geer
- 2000
Citation Context ...he point of view of M-estimators, the quantity which determines the rate of convergence is actually a fixed point of this modulus of continuity. Results of this type have been obtained by van de Geer [31, 32] (among others), who also provides non-asymptotic exponential inequalities. Unfortunately, these are in terms of entropy (or random entropy) and hence are not useful when the probability distribution ... |

54 | A few notes on statistical learning theory
- Mendelson
- 2002
Citation Context ...is, |φ(x) − φ(y)| ≤ |x − y|. Then, for every class F, $E_\sigma R_n \phi \circ F \le E_\sigma R_n F$, where φ ∘ F := {φ ∘ f : f ∈ F}. The interested reader may find some additional useful properties of the Rademacher averages in [3, 27]. A.2. Proofs. PROOF OF THEOREM 2.1. Define $V^+ = \sup_{f \in F}(Pf - P_n f)$. Since $\sup_{f \in F} \mathrm{Var}[f(X_i)] \le r$ and $\|f - Pf\|_\infty \le b - a$, Theorem A.1 implies that, with probability ... |

49 | The Importance of Convexity in Learning with Squared Loss
- Lee, Bartlett, et al.
- 1998
Citation Context ...y − y′)², when the function class F is convex and uniformly bounded. In particular, if |f(x) − y| ∈ [0, 1] for all f ∈ F, x ∈ X and y ∈ Y, then the conditions are satisfied with L = 2 and B = 1 (see [18]). Other examples are described in [26] and in [2]. The first result we present is a direct but instructive corollary of Theorem 3.3. COROLLARY 5.3. Let F be a class of functions with ranges in [−1, 1... |

47 | Rademacher processes and bounding the risk of function learning
- Koltchinskii, Panchenko
- 2000
Citation Context ...gative uniformly bounded functions (or increments with respect to a fixed null function). In this case, one trivially has for all f ∈ F, Var[f] ≤ cPf. This is exploited by Koltchinskii and Panchenko [16], who consider the case of prediction with absolute loss when functions in F have values in [0, 1] and there is a perfect function f∗ in the class, that is, Pf∗ = 0. They introduce an iterative meth... |

44 | Concentration inequalities using the entropy method
- Boucheron, Lugosi, et al.
Citation Context ...t $1 - e^{-x}$, $Z \le EZ + \sqrt{2xv} + \frac{cx}{3}$. In a similar way one can obtain a concentration result for the Rademacher averages of a class (using the result of [5]; see also [6]). In order to obtain the appropriate constants, notice that $E_\sigma \sup_{f \in F} \sum_{i=1}^{n} \sigma_i f(X_i) = E_\sigma \sup_{f \in F} \sum_{i=1}^{n} \sigma_i \big(f(X_i) - (b - a)/2\big)$ and $|f - (b - a)/2| \le (b - a)/2$. THEOREM A.2. Let F be a class of fu... |

38 | Improving the sample complexity using global data
- Mendelson
Citation Context ... the general, noisy case is to analyze the increments of the empirical process, in other words, to directly consider the functions f − f∗. This approach was first proposed by Massart [22]; see also [26]. Massart introduces the assumption $\mathrm{Var}[\ell_f(X) - \ell_{f^*}(X)] \le d^2(f, f^*) \le B(P\ell_f - P\ell_{f^*})$, where $\ell_f$ is the loss associated with the function f [in other words, $\ell_f(X, Y) = \ell(f(X), Y)$, which measures ... |

30 | Rademacher averages and phase transition in Glivenko-Cantelli classes
- Mendelson
- 2002
Citation Context ...remum of the empirical process (thanks to symmetrization inequalities), it was first proposed as an effective complexity measure by Koltchinskii [15], Bartlett, Boucheron and Lugosi [1] and Mendelson [25] and then further studied in [3]. Unfortunately, one of the shortcomings of the Rademacher averages is that they provide global estimates of the complexity of the function class, that is, they do not ... |

27 | Complexity regularization via localized random penalties
- Lugosi, Wegkamp
Citation Context ...nction instead of the one that minimizes Pf. As a consequence, the fixed point $\hat r^*$ cannot be expected to converge to zero when $\inf_{f \in F} Pf > 0$. In order to remove this limitation, Lugosi and Wegkamp [19] use localized Rademacher averages of a small ball around the minimizer $\hat f$ of $P_n$. However, their result is restricted to nonnegative functions, and in particular functions with values in {0, 1}. Moreove... |

26 |
Concentration inequalities using the entropy method, Ann Probab 31
- Boucheron, Lugosi, et al.
- 2003
Citation Context ...log(1 + x) − x and $v = n\sigma^2 + 2cEZ$. Also, with probability at least $1 - e^{-x}$, $Z \le EZ + \sqrt{2xv} + \frac{cx}{3}$. In a similar way, one can obtain a concentration result for the Rademacher averages of a class (using the result of [5], see also [6]). Theorem A.2 Let F be a class of functions that map X into [a, b]. Let $Z = E_\sigma \sup_{f \in F} \sum_{i=1}^{n} \sigma_i f(X_i) = n E_\sigma R_n F$. Then for all x ≥ 0, Pr[Z ≥ EZ + ... and ... Lemma A.3 For u, v ≥ 0, and for any α > 0, ... |

18 |
Concentration inequalities for sub-additive functions using the entropy method
- Bousquet
- 2003
Citation Context ...derive from classical results. We present proofs for the sake of completeness. Recall the following improvement of Rio’s [29] version of Talagrand’s concentration inequality, which is due to Bousquet [7, 8]. THEOREM A.1. Let c > 0, let Xi be independent random variables distributed according to P and let F be a set of functions from X to R. Assume that all functions f in F satisfy Ef = 0 and ‖f‖∞ ≤ c. Le... |

18 | Concentration - McDiarmid - 1998 |

16 | Some local measures of complexity of convex hulls and generalization bounds
- Bousquet, Koltchinskii, et al.
- 2002
Citation Context ...the error of an estimator, and in the presence of noise there may not be any perfect estimator (even the best in the class can have nonzero error). More recently, Bousquet, Koltchinskii and Panchenko [9] have obtained a more general result avoiding the iterative procedure. Their result is that for functions with values in [0, 1], with probability at least $1 - e^{-x}$, ∀ f ∈ F, Pf ≤ c(Pnf + r̂∗ + (t + lo... |

14 |
Une inégalité de Bennett pour les maxima de processus empiriques
- Rio
Citation Context ...of results that is needed in the proofs. Most of them are classical or easy to derive from classical results. We present proofs for the sake of completeness. Recall the following improvement of Rio’s [29] version of Talagrand’s concentration inequality, which is due to Bousquet [7, 8]. THEOREM A.1. Let c > 0, let Xi be independent random variables distributed according to P and let F be a set of functio... |

14 |
A new approach to least squares estimation, with applications
- van de Geer
- 1987
Citation Context ...he point of view of M-estimators, the quantity which determines the rate of convergence is actually a fixed point of this modulus of continuity. Results of this type have been obtained by van de Geer [31, 32] (among others), who also provides nonasymptotic exponential inequalities. Unfortunately, these are in terms of entropy (or random entropy) and hence might not be useful when the probability distribut... |

13 | Probabilistic methods for algorithmic discrete mathematics - In - 1998 |

10 |
Empirical processes in M-estimation. Cambridge university press
- van de Geer
- 2000
Citation Context ...he point of view of M-estimators, the quantity which determines the rate of convergence is actually a fixed point of this modulus of continuity. Results of this type have been obtained by van de Geer [31, 32] (among others), who also provides nonasymptotic exponential inequalities. Unfortunately, these are in terms of entropy (or random entropy) and hence might not be useful when the probability distribut... |

10 | Combinatorial methods in density estimation, Springer Series in Statistics - Devroye, Lugosi - 2001 |

9 | Geometric parameters of kernel machines
- Mendelson
- 2002
Citation Context ...y)f(y)dP(y). It is possible to show that $T_k$ is a positive semi-definite compact operator. Let $(\lambda_j)_{j=1}^{\infty}$ be its eigenvalues, arranged in a non-increasing order. The following result was proved in [16]. Theorem 6.4 For every probability measure P and r > 0, $E R_n\{f \in F : Pf^2 \le r\} \le \left(\frac{2}{n} \sum_{j=1}^{\infty} \min\{r, \lambda_j\}\right)^{1/2}$. Moreover, there exists an absolute constant c such that if $\lambda_1 \ge 1/n$, then f... |

6 |
Model selection and error estimation
- Bartlett, Boucheron, Lugosi
Citation Context ... not known) and thus it is desirable to obtain data-dependent estimates. One of the most interesting data-dependent complexity estimates is the so-called Rademacher averages associated with the class [9, 1]. Unfortunately, one of the shortcomings of the Rademacher averages is that they provide global estimates on the complexity of the function class, that is, they do not reflect the fact that the algori... |

6 |
Agnostic learning of non-convex classes of functions
- Mendelson, Williamson
- 2002
Citation Context ...pected squared loss), then for every f ∈ F, Pf² ≤ 16Pf [7]. Related results are known for other loss classes, such as those defined using p-norms with p > 2 [9], and for certain non-convex classes [10]. This article is organized as follows; first, we present some definitions and notation. Then, we present the basic properties of the Rademacher (both local and global) complexities. In Section 4 we r... |

4 |
Empirical minimization. Probab. Theory Related Fields 135
- Bartlett, Mendelson
- 2006
Citation Context ...lasses: the functions with T(f) ≤ r and the ones with T(f) > r, and this is done by introducing the weighting factor T(f) ∨ r. This idea was exploited in the work of Mendelson [26] and, more recently, in [4]. Moreover, when one considers the set Fr = star(F, 0) ∩ {T(f) ≤ r}, any function f′ ∈ F with T(f′) > r will have a scaled down representative in that set. So even though it seems that we look at the ... |

3 |
A Bennett concentration inequality and its application to empirical processes. Comptes Rendus de l'Academie des Sciences
- Bousquet
- 2002
Citation Context ...es the deviation of $\sup_{f \in F} \left|\sum_{i=1}^{n} f(X_i)\right|$ (or $E_\sigma \left|\frac{1}{n} \sum_{i=1}^{n} \sigma_i f(X_i)\right|$) from its mean value. The result we use is a version of Talagrand’s concentration inequality for empirical processes [4]. The benefit of this result is that it enables one to control the deviation in terms of the Rademacher averages and the largest variance of a class member. As an application of this result we obtain ... |

1 |
Geometric parameters of kernel machines. Computational Learning Theory
- Mendelson
- 2002
Citation Context ... the normalized Gram matrix (or kernel matrix) $\hat T_n$ defined as $\hat T_n = \frac{1}{n}(k(X_i, X_j))_{i,j=1,\dots,n}$. Let $(\hat\lambda_i)_{i=1}^{n}$ be its eigenvalues, arranged in a nonincreasing order. The following result was proved in [24]. THEOREM 6.5. For every r > 0, $E R_n\{f \in F : Pf^2 \le r\} \le \left(\frac{2}{n} \sum_{i=1}^{\infty} \min\{r, \lambda_i\}\right)^{1/2}$. Moreover, there exists an absolute constant c such that if $\lambda_1 \ge 1/n$, then for every r ≥ 1/n, $E R_n\{f \in F : Pf^2$ ... |
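The kernel-class bound quoted in the context above has the form $\left(\frac{2}{n} \sum_i \min\{r, \lambda_i\}\right)^{1/2}$, which is easy to evaluate once a spectrum is available. The sketch below is a hedged illustration only. The theorem is stated for the eigenvalues of the integral operator, and plugging in the eigenvalues of the normalized Gram matrix instead is an assumption made here, not a claim from the source. The function names `kernel_local_bound` and `normalized_gram_eigenvalues`, the Gaussian kernel, and the `gamma` parameter are likewise hypothetical choices.

```python
import numpy as np


def kernel_local_bound(eigenvalues, r, n):
    """Evaluate sqrt((2 / n) * sum_i min(r, lambda_i)) for a given spectrum."""
    lam = np.clip(np.asarray(eigenvalues, dtype=float), 0.0, None)
    return float(np.sqrt(2.0 / n * np.sum(np.minimum(r, lam))))


def normalized_gram_eigenvalues(X, gamma=1.0):
    """Eigenvalues of (1/n) * K in nonincreasing order.

    K is built from a Gaussian kernel exp(-gamma * ||x - y||^2), which is
    an illustrative choice and not taken from the source.
    """
    X = np.asarray(X, dtype=float)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    K = np.exp(-gamma * np.maximum(d2, 0.0))  # clip tiny negative distances
    return np.sort(np.linalg.eigvalsh(K / len(X)))[::-1]
```

For example, `kernel_local_bound(normalized_gram_eigenvalues(X), r, len(X))` gives a plug-in value for a sample `X`. Because each term is `min(r, lambda_i)`, only eigenvalues above the level `r` are truncated, so the value depends on how fast the spectrum decays rather than on the trace alone.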

1 | Convex optimization. Manuscript (available at http://www.stanford.edu/class/ee364/index.html#book - Boyd, Vandenberghe - 2001 |

1 |
A few remarks on statistical learning theory
- Mendelson
- 2002
Citation Context ... φ(y)| ≤ |x − y|). Then, for every class F, $E_\sigma R_n \phi \circ F \le E_\sigma R_n F$, where φ ∘ F := {φ ∘ f : f ∈ F}. The interested reader may find some additional useful properties of the Rademacher averages in [2, 15]. 3 Error Bounds with Local Complexity. In this section we show that the Rademacher averages associated with a small subset of the class may be considered as a complexity term in an error bound. Sinc... |

1 |
Concentration inequalities using the entropy method. Preprint, CNRS-Université Paris-Sud
- Boucheron, Lugosi, et al.
- 2002
Citation Context ... = (1 + x) log(1 + x) − x and $v = n\sigma^2 + 2E[Z]$. Also, $P\left[Z \ge E[Z] + \sqrt{2xv} + \frac{x}{3}\right] \le e^{-x}$. In a similar way one can obtain a concentration result for the Rademacher averages of a class (see e.g. [3]). Theorem 2. Assume |f(x)| ≤ 1. Let $Z := E_\sigma \sup_{f \in F} \sum_{i=1}^{n} \sigma_i f(X_i)$ or $Z := E_\sigma \sup_{f \in F} \left|\sum_{i=1}^{n} \sigma_i f(X_i)\right|$, then for all x ≥ 0, $P\left[Z \ge E[Z] + \sqrt{2xE[Z]} + \frac{x}{3}\right] \le e^{-x}$ and ... |