## Learning by mirror averaging

Venue: The Annals of Statistics

Citations: 30 (2 self)

### BibTeX

@ARTICLE{Juditsky_learningby,
  author  = {A. Juditsky and P. Rigollet and A. B. Tsybakov},
  title   = {Learning by mirror averaging},
  journal = {The Annals of Statistics},
  year    = {2008}
}

### Abstract

Given a finite collection of estimators or classifiers, we study the problem of model selection type aggregation, that is, we construct a new estimator or classifier, called aggregate, which is nearly as good as the best among them with respect to a given risk criterion. We define our aggregate by a simple recursive procedure which solves an auxiliary stochastic linear programming problem related to the original nonlinear one and constitutes a special case of the mirror averaging algorithm. We show that the aggregate satisfies sharp oracle inequalities under some general assumptions. The results are applied to several problems including regression, classification and density estimation.
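The recursive procedure described in the abstract amounts to exponentially weighting the M candidates by their cumulative losses and then averaging the weight iterates over time. Here is a minimal sketch of that idea in Python; the function name, the NumPy-based interface, and the temperature parameter `beta` are illustrative assumptions, not the paper's notation:

```python
import numpy as np

def mirror_averaging(losses, beta=1.0):
    """Aggregate M candidates from per-observation losses.

    losses : (n, M) array, losses[i, j] = loss Q(Z_{i+1}, e_j) of
             candidate j on observation i+1.
    Returns the averaged weight vector (length M, sums to 1).
    """
    n, M = losses.shape
    cum = np.zeros(M)                # cumulative losses, start at zero
    iterates = []
    for i in range(n):
        # exponential-weights step on losses seen so far (the "mirror" update)
        z = -cum / beta
        z -= z.max()                 # stabilize the softmax numerically
        w = np.exp(z)
        iterates.append(w / w.sum())
        cum += losses[i]
    # averaging step: return the mean of the iterates, not the last iterate
    return np.mean(np.stack(iterates), axis=0)
```

The final averaging of the iterates is the step that distinguishes this stochastic setting from the online prediction methods discussed in the citation contexts below, which work with cumulative losses on deterministic sequences and skip it.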

### Citations

246 | Aggregating strategies - Vovk - 1990
Citation Context: ...ction is defined as the cumulative loss over the trajectory. Interestingly, for such problems, which are quite different from ours, methods similar to (3.3) constitute one of the principal tools; cf. [11, 13, 23, 26, 27]. However, in contrast to our procedure, those methods do not involve the averaging step (3.4); they do not need it because they deal with non-random observations and cumulative losses. Note that the ...

175 | Limit Theorems of Probability Theory - Petrov - 1995
Citation Context: ...PX-a.s., E(exp(b|ξ|)|X) ≤ D. Since E(ξ|X) = 0, this assumption is equivalent to the existence of positive constants b0 and σ^2 such that, PX-a.s., (5.8) E(exp(tξ)|X) ≤ exp(σ^2 t^2/2) for all |t| ≤ b0; cf. [21], page 56. In this case, application of Corollary 5.1 leads to suboptimal rates because of the term E[Rβ(Y)] in (5.4). We show now that, using Theorem 4.2, we can obtain an oracle inequality with opt...

170 | Problem complexity and method efficiency in optimization - Nemirovskii, Yudin - 1983
Citation Context: ...Procedures with values in Θ, that is, convex mixtures of the initial estimators, can be constructed in various ways. One of them originates from the idea of mirror descent due to Nemirovski and Yudin [19]. This idea has been further developed in [3, 20], mainly in the deterministic optimization framework. A version of the mirror descent method due to Nesterov [20] has been applied to the aggregation p...

81 | Statistical learning theory and stochastic optimization - Catoni
Citation Context: ...1 = 0, so that Corollary 5.4 follows directly from Lemma 4.1. Writing the proof of Lemma 4.1 for this particular Q we essentially recover the proof of Theorem 3.1.1 in [9]. Extension of Corollary 5.4 to β ≥ 1 is straightforward, but the oracle inequality for the corresponding aggregate (“Gibbs estimator”; cf. [9]) is less interesting because it has obviously a larger re...

79 | Aggregation for Gaussian regression - Bunea, Tsybakov, et al. - 2014
Citation Context: ...n the optimal one (1.2); a detailed account can be found in the survey [4] or in the lecture notes [17]. We mention here only some recent work where aggregation of arbitrary estimators is considered: [1, 6, 16, 22, 28, 30]. These results are useful for statistical applications, especially if the leading constant K is close to 1. However, the inequalities with K > 1 do not provide valid bounds for the excess risk EnA(θ̃...

75 | Model selection and error estimation - Bartlett, Boucheron, et al.
Citation Context: ...n the optimal one (1.2); a detailed account can be found in the survey [4] or in the lecture notes [17]. We mention here only some recent work where aggregation of arbitrary estimators is considered: [1, 6, 16, 22, 28, 30]. These results are useful for statistical applications, especially if the leading constant K is close to 1. However, the inequalities with K > 1 do not provide valid bounds for the excess risk EnA(θ̃...

73 | Primal-dual subgradient methods for convex problems - Nesterov - 2009
Citation Context: ...ixtures of the initial estimators, can be constructed in various ways. One of them originates from the idea of mirror descent due to Nemirovski and Yudin [19]. This idea has been further developed in [3, 20], mainly in the deterministic optimization framework. A version of the mirror descent method due to Nesterov [20] has been applied to the aggregation problem in [12] under the name of mirror averaging...

73 | Introduction à l'estimation non-paramétrique - Tsybakov - 2004
Citation Context: ...lback–Leibler divergence between p^k_n and p^1_n is given explicitly by K(p^k_n, p^1_n) = (log M)/4, for any k = 2,...,M, where p^k_n denotes the density of P^k_n. We can therefore apply Proposition 2.3 in [25] with α* = (log M)/4. Taking in that proposition τ = 1/M we get (6.1) with some c > 0, which finishes the proof. Acknowledgments. We would like to thank Jean-Yves Audibert, Arnak Dalalyan and Gilles Sto...

63 | Competitive on-line statistics - Vovk
Citation Context: ...ction is defined as the cumulative loss over the trajectory. Interestingly, for such problems, which are quite different from ours, methods similar to (3.3) constitute one of the principal tools; cf. [11, 13, 23, 26, 27]. However, in contrast to our procedure, those methods do not involve the averaging step (3.4); they do not need it because they deal with non-random observations and cumulative losses. Note that the ...

57 | Averaging expert predictions - Kivinen, Warmuth - 1999
Citation Context: ...of deterministic sequences discussed in Section 3 above. We sketch here the argument that can be used. If written in our notation, some results of that theory (see, e.g., [13, 23] or Section 3.3 of [10]) are as follows: under exponential concavity of θ ↦→ −ηQ(z,θ) for some η > 0 and boundedness of sup_{z,θ} |Q(z,θ)|, for any fixed sequence Z_i we have (4.7) (1/n) ∑_{i=1}^{n} Q(Z_i, θ_{i−1}) ≤ min...

46 | Optimal rates of aggregation - Tsybakov - 2003
Citation Context: ...m ∆n,M. Lower bounds can be established showing that, under some assumptions, the smallest possible value of ∆n,M in a minimax sense has the form (1.2) ∆n,M = C (log M)/n, with some constant C > 0; cf. [24]. Besides being in themselves precise finite sample results, oracle inequalities of the type (1.1) are very useful in adaptive nonparametric estimation. They allow one t...

39 | Concentration inequalities and model selection. École d'été de Probabilités de Saint-Flour 2003 - Massart - 2006
Citation Context: ...onstant K > 1, instead of min_j A(e_j) in (1.1) and with a remainder term which is sometimes larger than the optimal one (1.2); a detailed account can be found in the survey [4] or in the lecture notes [17]. We mention here only some recent work where aggregation of arbitrary estimators is considered: [1, 6, 16, 22, 28, 30]. These results are useful for statistical applications, especially if the leadin...

39 | Mixing strategies for density estimation - Yang - 2000
Citation Context: ... For two special cases [density estimation with the Kullback–Leibler (KL) loss, and regression model with squared loss] such bounds have been proved earlier in the works of Catoni [7, 8, 9] and Yang [29]. They independently obtained the bound for density estimation with the KL loss, and Catoni [8, 9] solved the problem for the regression model with squared loss. Bunea and Nobel [5] improved the regre...

38 | Universal linear prediction by model order weighting - Singer, Feder - 1999
Citation Context: ...ction is defined as the cumulative loss over the trajectory. Interestingly, for such problems, which are quite different from ours, methods similar to (3.3) constitute one of the principal tools; cf. [11, 13, 23, 26, 27]. However, in contrast to our procedure, those methods do not involve the averaging step (3.4); they do not need it because they deal with non-random observations and cumulative losses. Note that the ...

37 | Information theory and mixing least-squares regressions - Leung, Barron - 2006
Citation Context: ...to ours (MS aggregation in the Gaussian white noise model with squared loss) Nemirovski [18], page 226, established an inequality similar to (1.1), with a suboptimal remainder term. Leung and Barron [15] improved upon this result to achieve the optimal remainder term. Several other works provided less precise bounds than (1.1)–(1.2), with K min_j A(e_j) where the leading constant K > 1, instead of min_j...

28 | Sequential procedures for aggregating arbitrary estimators of a conditional mean - Bunea, Nobel - 2005
Citation Context: ...7, 8, 9] and Yang [29]. They independently obtained the bound for density estimation with the KL loss, and Catoni [8, 9] solved the problem for the regression model with squared loss. Bunea and Nobel [5] improved the regression with squared loss result of [8, 9] in the case of bounded response, and obtained some related inequalities under weaker conditions. For a problem which is different but close ...

28 | A mixture approach to universal model selection. Preprint LMENS 97-30, available from http://www.dma.ens.fr/edition/preprints/Index.97.html - Catoni - 1997
Citation Context: ...the loss function Q. For two special cases [density estimation with the Kullback–Leibler (KL) loss, and regression model with squared loss] such bounds have been proved earlier in the works of Catoni [7, 8, 9] and Yang [29]. They independently obtained the bound for density estimation with the KL loss, and Catoni [8, 9] solved the problem for the regression model with squared loss. Bunea and Nobel [5] impr...

27 | Model selection in nonparametric regression - Wegkamp - 2003
Citation Context: ...n the optimal one (1.2); a detailed account can be found in the survey [4] or in the lecture notes [17]. We mention here only some recent work where aggregation of arbitrary estimators is considered: [1, 6, 16, 22, 28, 30]. These results are useful for statistical applications, especially if the leading constant K is close to 1. However, the inequalities with K > 1 do not provide valid bounds for the excess risk EnA(θ̃...

24 | “Universal” aggregation rules with exact bias bounds - Catoni - 1999
Citation Context: ...the loss function Q. For two special cases [density estimation with the Kullback–Leibler (KL) loss, and regression model with squared loss] such bounds have been proved earlier in the works of Catoni [7, 8, 9] and Yang [29]. They independently obtained the bound for density estimation with the KL loss, and Catoni [8, 9] solved the problem for the regression model with squared loss. Bunea and Nobel [5] impr...

24 | Complexity regularization via localized random penalties - Lugosi, Wegkamp - 2004

13 | Topics in Non-parametric Statistics. École d'Été de Probabilités de Saint-Flour - Nemirovski - 2000
Citation Context: ...They allow one to prove that the aggregate estimator θ̃_n^⊤ H is adaptive in a minimax asymptotic sense (and even sharp minimax adaptive in several cases; for more discussion see, e.g., [18]). The aim of this paper is to obtain bounds of the form (1.1)–(1.2) under some general conditions on the loss function Q. For two special cases [density estimation with the Kullback–Leibler (KL) loss...

12 | Theory of classification: some recent advances - Boucheron, Bousquet, et al. - 2005
Citation Context: ...n_j A(e_j) where the leading constant K > 1, instead of min_j A(e_j) in (1.1) and with a remainder term which is sometimes larger than the optimal one (1.2); a detailed account can be found in the survey [4] or in the lecture notes [17]. We mention here only some recent work where aggregation of arbitrary estimators is considered: [1, 6, 16, 22, 28, 30]. These results are useful for statistical applicati...

12 | Aggregation for regression learning - Bunea, Tsybakov, et al. - 2004

11 | Aggregating regression procedures for a better performance. Bernoulli, 10: 25–47 - Yang - 2004

11 | Optimal rates of aggregation. Computational Learning Theory and Kernel Machines. B. Schölkopf and M. Warmuth, eds - Tsybakov - 2003

10 | Recursive aggregation of estimators via the Mirror Descent Algorithm with averaging. Problems of Information Transmission - Juditsky, Nazin, et al. - 2005
Citation Context: ...a has been further developed in [3, 20], mainly in the deterministic optimization framework. A version of the mirror descent method due to Nesterov [20] has been applied to the aggregation problem in [12] under the name of mirror averaging. As shown in [12], for convex loss functions Q the mirror averaging estimator θ̃_n satisfies under mild assumptions the following oracle inequality: (3.1) EnA(θ̃_n...

9 | Aggregation of density estimators and dimension reduction - Samarov, Tsybakov - 2007

7 | Spatial aggregation of local likelihood estimates with applications to classification - Belomestny, Spokoiny
Citation Context: ...orithm, satisfies (5.16) E_{n−1} K(a*, ã_n) ≤ min_{1≤j≤M} K(a*, a_j) + β (log M)/n. Aggregation procedures can be used to construct pointwise adaptive locally parametric estimators in nonparametric regression [2]. In this case inequality (5.16) can be applied to prove the corresponding adaptive risk bounds. We now check that Assumption 5.1 is satisfied for several standard parametric families. • Univariate Ga...

7 | From epsilon-entropy to KL-complexity: analysis of minimum information complexity density estimation - Zhang - 2006

5 | Local likelihood modeling via stagewise aggregation - Belomestny, Spokoiny - 2007

4 | The conjugate barrier mirror descent method for non-smooth convex optimization - Ben-Tal, Nemirovski - 1999
Citation Context: ...ixtures of the initial estimators, can be constructed in various ways. One of them originates from the idea of mirror descent due to Nemirovski and Yudin [19]. This idea has been further developed in [3, 20], mainly in the deterministic optimization framework. A version of the mirror descent method due to Nesterov [20] has been applied to the aggregation problem in [12] under the name of mirror averaging...

2 | Sequential prediction of individual sequences under general loss functions - Haussler, Kivinen, et al. - 1998

2 | Efficient agnostic learning with bounded fan-in - Lee, Bartlett, et al. - 1996
Citation Context: ...E[A_k(T_n)] − min_{1≤j≤M} A_k(e_j)} ≥ cσ √((log M)/n), k = 1,...,M (2.3), where the infimum is taken over all the selectors T_n. A weaker result of similar type [with the rate 1/√n instead of √((log M)/n)] is given in [14]. Proposition 2.1 implies that the slow rate √((log M)/n) is the best attainable rate for selectors, since the standard ERM selector satisfies the oracle inequality (1.1) with rate ∆n,M ∼ √((log M)/n). P...

1 | Saint-Flour Lecture Notes. École d'Été de Probabilités de Saint-Flour XXXIII - Massart - 2006