## Risk bounds for Statistical Learning

Citations: 43 (2 self)

### BibTeX

@MISC{Massart_riskbounds,
  author = {Pascal Massart and Élodie Nédélec},
  title = {Risk bounds for Statistical Learning},
  year = {}
}

### Abstract

We propose a general theorem providing upper bounds for the risk of an empirical risk minimizer (ERM). We essentially focus on the binary classification framework. We extend Tsybakov's analysis of the risk of an ERM under margin type conditions by using concentration inequalities for conveniently weighted empirical processes. This allows us to deal with other ways of measuring the "size" of a class of classifiers than entropy with bracketing as in Tsybakov's work. In particular we derive new risk bounds for the ERM when the classification rules belong to some VC-class under margin conditions and discuss the optimality of those bounds in a minimax sense.
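The abstract's central object, an empirical risk minimizer, selects from a class S the classifier with the smallest empirical 0-1 risk. A minimal sketch over a toy class of one-dimensional threshold rules (the class, the data-generating rule, and the noise level are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def erm_threshold(x, y, thresholds):
    """Empirical risk minimization over a finite class of 1-D
    threshold classifiers s_t(x) = 1{x >= t}: return the threshold
    with the smallest empirical 0-1 risk on the sample (x, y)."""
    risks = [np.mean((x >= t).astype(int) != y) for t in thresholds]
    return thresholds[int(np.argmin(risks))]

# Illustrative data: labels follow a threshold rule at 0.5,
# flipped with probability 0.1 (label noise).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200)
y = ((x >= 0.5) ^ (rng.uniform(0.0, 1.0, 200) < 0.1)).astype(int)
t_hat = erm_threshold(x, y, np.linspace(0.0, 1.0, 101))
```

By construction the returned threshold attains the minimal empirical risk over the candidate grid; how close that risk is to the Bayes risk is exactly what the paper's bounds quantify.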

### Citations

830 |
Estimation of dependences based on empirical data
- Vapnik
- 1982
Citation Context: ...usly the bias inf_{t∈S} ℓ(s∗, t) is small enough and the "size" of S is not too large represents the main challenge of model selection procedures. Since the early work of Vapnik and his celebrated book [23], there have been many works on this topic and several attempts to improve on the penalization method of the empirical risk (the structural risk minimization) initially proposed by Vapnik to select am...

706 |
Algorithms in Combinatorial Geometry
- Edelsbrunner
- 1987
Citation Context: ...his first example could appear to be rather artificial. More interestingly, our result also applies to half-spaces in R^d, for d ≥ 2. Indeed, a very nice combinatorial geometric result to be found in [7] says that, for every integer N ≥ d + 1, there exist N distinct points x_1, x_2, ..., x_N of R^d such that the trace of the collection of half-spaces in R^d on {x_1, x_2, ..., x_N} contains all the subsets of {x_1, x...

400 |
On the method of bounded differences
- McDiarmid, C
- 1989
Citation Context: ... ≤ v and ‖f‖_∞ ≤ b, then, for every positive y, the following inequality holds for Z = sup_{f∈F} (P_n − P)(f): (18) P[Z − E[Z] ≥ √(2(v + 4bE[Z])y/n) + 2by/(3n)] ≤ e^{−y}. Unlike McDiarmid's inequality (see [18]), which has been widely used in statistical learning theory (see [13]), a concentration inequality like (18) offers the possibility of controlling the empirical process locally. Applying this inequali...
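The concentration inequality (18) quoted in this context bounds the upper deviation of Z = sup_{f∈F}(P_n − P)(f) above its mean by a level depending on the variance proxy v, the sup-norm bound b, E[Z], the sample size n, and the confidence parameter y. A minimal numeric sketch (the function name and the parameter values are illustrative assumptions):

```python
import math

def bousquet_deviation(v, b, ez, y, n):
    """Deviation level t such that P[Z - E[Z] >= t] <= exp(-y)
    in Bousquet's (2002) version of Talagrand's inequality:
    t = sqrt(2 (v + 4 b E[Z]) y / n) + 2 b y / (3 n)."""
    return math.sqrt(2.0 * (v + 4.0 * b * ez) * y / n) + 2.0 * b * y / (3.0 * n)

# Illustrative values: confidence level 95%, i.e. y = log(1/0.05).
t = bousquet_deviation(v=1.0, b=1.0, ez=0.5, y=math.log(1.0 / 0.05), n=1000)
```

As expected from the formula, the deviation level grows with the confidence parameter y and shrinks as the sample size n grows.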

207 |
Risk bounds for model selection via penalization. Probab. Theory Related Fields 113
- Barron, Birgé, et al.
- 1999
Citation Context: ...e one developed in [5]. Let us now state some of the results that we prove in this paper. In order to take into account the margin condition (5) within a minimax approach, we introduce, for every h ∈ [0, 1], the set P(h, S) of probability distributions P satisfying the conditions (6) |2η(x) − 1| ≥ h for all x ∈ X and s∗ ∈ S (one should keep in mind that η as well as s∗ depends on P, which gives a sense...

178 | Probability in Banach Spaces. Isoperimetry and Processes. Ergebnisse der Mathematik und ihrer Grenzgebiete (3) 23 - Ledoux, Talagrand - 1991

155 | Optimal aggregation of classifiers in statistical learning
- Tsybakov
Citation Context: ...close to 1/2. This tends to indicate that maybe some analysis taking into account the way η behaves around 1/2 could be sharper than the preceding one. 1.2.3. Faster rates under margin conditions. In [22], Tsybakov attracted attention to rates faster than 1/√n that can be achieved by the ERM estimator under a "margin" type condition which is of a different nature from the Devroye and Lugosi conditio...

152 |
Minimax Theory of Image Reconstruction
- Korostelev, Tsybakov
- 1993
Citation Context: ...µ) ≥ K_2 ε^{−r} for every ε ∈ (0, ε_0], then, for some positive constant C_2 depending on K_1, K_2, ε_0 and r, one has inf_{s̃} sup_{P∈P(h,S,µ)} E[ℓ(s∗, s̃)] ≥ C_2 (1 − h)^{1/(r+1)} ((nh^{1−r})^{−1/(r+1)} ∧ n^{−1/2}). In [11, 14] or [6], one can find some explicit examples of classes of subsets of R^d with smooth boundaries which satisfy both (10) and (11) when µ is equivalent to the Lebesgue measure on the unit cube. The pap...

124 |
New concentration inequalities in product spaces
- Talagrand
- 1996
Citation Context: ...r S with respect to d and the modulus of uniform continuity of d with respect to ℓ. The main tool that we shall use is Talagrand's inequality for empirical processes (see [21]), which will allow us to control the oscillations of the empirical process γ_n by the modulus of uniform continuity of γ_n in expectation. More precisely, we shall use the following version of it due to...

108 | Smooth discrimination analysis
- Mammen, Tsybakov
- 1999
Citation Context: ...ator under a "margin" type condition which is of a different nature from the Devroye and Lugosi condition above, as we shall see below. This condition was first introduced by Mammen and Tsybakov (see [14]) in the related context of discriminant analysis and can be stated as (4) ℓ(s∗, t) ≥ h^θ ‖s∗ − t‖_1^θ for every t ∈ S, where ‖·‖_1 denotes the L_1(µ)-norm, h is some positive constant [that we can...

101 | Theory of Pattern Recognition - Chervonenkis - 1974

99 | Information-theoretic determination of minimax rates of convergence. Ann
- Yang, Barron
Citation Context: ...lassifier t ∈ S, ℓ(s, t) ≥ h‖t − s‖_1, then, for any finite subset C of S, the following lower bound is available: R_n(h, S) ≥ (h/2) inf_{ŝ∈C} sup E_s[‖s − ŝ‖_1]. We use now an argument due to Yang and Barron [25]. Given ε > 0, the idea is to construct an ε-net (i.e., a maximal set of points such that the mutual distances between the elements of this net stay of order ε, less or equal to 2Cε, say, for some con...
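The ε-net invoked in this excerpt, a maximal set of mutually ε-separated points, can be built greedily: scan the points and keep each one that is farther than ε from everything already kept. A minimal sketch using the L1 distance (the point set, the distance choice, and the function name are illustrative assumptions):

```python
import numpy as np

def greedy_eps_net(points, eps):
    """Greedily select a maximal eps-separated subset of `points`
    w.r.t. the L1 distance. Maximality means every remaining point
    lies within eps of some selected point, so the output is an
    eps-packing and hence also an eps-net."""
    net = []
    for p in points:
        if all(np.abs(p - q).sum() > eps for q in net):
            net.append(p)
    return net

# Illustrative point set: (0.1, 0) is absorbed by (0, 0) at eps = 0.5.
pts = [np.array(v, dtype=float) for v in
       [(0, 0), (0.1, 0), (1, 0), (1, 1), (0, 1)]]
net = greedy_eps_net(pts, eps=0.5)
```

In the minimax argument, the cardinality of such a net at scale ε is what feeds the multiple-testing lower bound.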

95 | Sphere packing numbers for subsets of the boolean n-cube with bounded Vapnik-Chervonenkis dimension
- Haussler
- 1995
Citation Context: ...probability measures on X. The universal metric entropy is related to the VC-dimension via Haussler's bound H_univ(ε, A) ≤ κV(1 + log(ε^{−1} ∨ 1)), where κ denotes some absolute positive constant (see [8]). The way of expressing φ by using either the random combinatorial entropy or the universal metric entropy is detailed in the Appendix. Precisely, to use the maximal inequalities proved in the Append...

89 |
Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse
- Massart
- 2000
Citation Context: ...ction 2 we give a general theorem which provides an upper bound for the risk of an ERM via the techniques based on concentration inequalities for weighted empirical processes which were introduced in [15]. The nature of the weight that we are using is absolutely crucial because this is exactly what makes the difference at the end of the day between our upper bounds for VC-classes and those of Devroye ...

82 |
Minimum contrast estimators on sieves: exponential bounds and rates of convergence
- Birgé, Massart
- 1998
Citation Context: ...hat ‖f‖_∞ ≤ 1 and Z_f = Σ_{i=1}^n f(ξ_i) − E[f(ξ_i)], where ξ_1, ..., ξ_n are independent random variables. Then, setting v = sup_{f∈F} Σ_{i=1}^n E[f²(ξ_i)], as a by-product of the proof of Bernstein's inequality (see [3]), assumption (A.1) is satisfied with ψ(λ) = λ²v/(2(1 − λ/3)) and, therefore, (A.4) E[sup_{f∈F} Z_f] ≤ √(2v log N) + (1/3) log N. We are now ready to prove a maximal inequality for Rademacher processes...

72 |
Uniform Central Limit Theorems. Cambridge Univ
- Dudley
- 1999
Citation Context: ...hat the lower bound (42) and the upper bound (36) coincide. Note that (43) is, in particular, satisfied when A is a collection of sets with smooth boundaries in various senses as shown in [11, 14] or [6]. 4. Proofs of the main results. 4.1. The upper bound: proof of Theorem 2. Since S satisfies (M), we notice that, by dominated convergence, for every t ∈ S, considering the sequence (t_k) provided by c...

61 | A Bennett concentration inequality and its application to suprema of empirical processes
- Bousquet
- 2002
Citation Context: ...ll allow us to control the oscillations of the empirical process γ_n by the modulus of uniform continuity of γ_n in expectation. More precisely, we shall use the following version of it due to Bousquet [4], which has the advantage of providing explicit constants and of dealing with one-sided suprema. If F is a countable family of measurable functions such that, for some positive constants v and b, one h...

53 |
Predicting {0, 1}-functions on randomly drawn points
- Haussler, Littlestone, et al.
- 1994
Citation Context: ...≥ √(V − 1)/(6√n) (38) for every n ≥ 5(V − 1). If h = 1, we are in the zero-error case for which Y = s∗(X) and we have at our disposal a lower bound proved by Vapnik and Chervonenkis in [24] (see also [9]): (39) R_n(1, S) ≥ (V − 1)/(4en) for every n ≥ 2 ∨ (V − 1). As expected, the order of these lower bounds for the minimax risk is very sensitive to the set of joint distributions over which the supremum is ...

35 |
Adaptive estimation of the intensity of inhomogeneous Poisson processes via concentration inequalities. Probab. Theory Related Fields
- Reynaud-Bouret
Citation Context: ...lly sufficiently distant (w.r.t. the Hamming distance). This can be done thanks to a combinatorial argument due to Birgé and Massart (see [17]). We more precisely use the version of it to be found in [20], which is more convenient for our needs here. So by Lemma 8 in [20], since N ≥ 4D, we can choose C in such a way that (54) δ(b, b′) > D/2 for every b, b′ in C with b ≠ b′, and log(#C) ≥ ρD l...

27 |
Concentration inequalities and model selection. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6–23, 2003. With a foreword by Jean Picard
- Massart
- 2007
Citation Context: ... F. Let δ be such that sup_{f∈F} ‖f‖_2 ≤ δ. Then (A.5) E[sup_{f∈F} Z_f] ≤ 3δ Σ_{j=0}^∞ 2^{−j} √(H_2(2^{−j−1}δ, F)). The proof being straightforward, we skip it. The interested reader will find a detailed proof in [16]. We turn now to maximal inequalities for set-indexed empirical processes. The VC-case will be treated via symmetrization by using the preceding bounds for Rademacher processes, while the bracketing c...

26 |
Lower bounds in pattern recognition and learning
- Devroye, Lugosi
- 1995
Citation Context: ...hniques and the notion of universal entropy, which was introduced in [10] and [19] independently. Furthermore, this upper bound is optimal in the minimax sense since, whenever 2 ≤ V ≤ n, one has (see [5]) inf_{s̃} sup_{P∈P(S)} E[ℓ(s∗, s̃)] ≥ κ_2 √(V/n), for some absolute positive constant κ_2, where the infimum is taken over the family of all estimators. Apparently this sounds like the end of the story,...

21 |
A central limit theorem for empirical processes
- Pollard
- 1982
Citation Context: ...based on direct combinatorial methods on empirical processes. This factor can be removed (see [13]) by using chaining techniques and the notion of universal entropy, which was introduced in [10] and [19] independently. Furthermore, this upper bound is optimal in the minimax sense since, whenever 2 ≤ V ≤ n, one has (see [5]) inf_{s̃} sup_{P∈P(S)} E[ℓ(s∗, s̃)] ≥ κ_2 √(V/n), for some absolute positive con...

17 | Pattern classification and learning theory
- Lugosi
- 2002
Citation Context: ...the regression function η as well as the Bayes classifier s∗ depend on P), then (under some convenient measurability condition on A) the following uniform risk bound is available for the ERM ŝ (see [13], e.g.): sup_{P∈P(S)} E[ℓ(s∗, ŝ)] ≤ κ_1 √(V/n), where κ_1 denotes some absolute constant. Note that the initial upper bounds on the expected risk for a VC-class found in [23] involved an extra logarit...

8 |
A new lower bound for multiple hypothesis testing
- Birgé
Citation Context: ...(A_{N,D}) for every N ≥ 4D. Fano's lemma is one of the classical tools used to build minimax lower bounds. We would rather use the following very convenient bound for multiple testing due to Birgé (see [2]), which has the advantage of being relevant even when testing only two hypotheses. Lemma 8. Let N ≥ 1, (P_i)_{0≤i≤N} be a family of probability distributions and (A_i)_{0≤i≤N} be a family of disjoint events....

7 |
On the central limit theorem for empirical measures
- Koltchinskii
- 1981
Citation Context: ...they were based on direct combinatorial methods on empirical processes. This factor can be removed (see [13]) by using chaining techniques and the notion of universal entropy, which was introduced in [10] and [19] independently. Furthermore, this upper bound is optimal in the minimax sense since, whenever 2 ≤ V ≤ n, one has (see [5]) inf_{s̃} sup_{P∈P(S)} E[ℓ(s∗, s̃)] ≥ κ_2 √(V/n), for some absolute pos...

2 | Theory of Pattern Recognition (in Russian), Nauka; German translation: Theorie der Zeichenerkennung, Akademie-Verlag - 1979

1 |
A uniform Marcinkiewicz–Zygmund strong law of large numbers for empirical processes
- Massart, Rio
- 1998
Citation Context: ...C with maximal cardinality such that the points of C are mutually sufficiently distant (w.r.t. the Hamming distance). This can be done thanks to a combinatorial argument due to Birgé and Massart (see [17]). We more precisely use the version of it to be found in [20], which is more convenient for our needs here. So by Lemma 8 in [20], since N ≥ 4D, we can choose C in such a way that (54) δ(b, b′) >...