## Aggregation by exponential weighting and sharp oracle inequalities


### Download Links

- [www.proba.jussieu.fr]
- [eprints.pascal-network.org]
- [arxiv.org]
- DBLP

### Other Repositories/Bibliography

Citations: 22 (3 self)

### BibTeX

@MISC{Dalalyan_aggregationby,
  author = {A. Dalalyan and A. B. Tsybakov},
  title = {Aggregation by exponential weighting and sharp oracle inequalities},
  year = {}
}


### Abstract

In the present paper, we study the problem of aggregation under the squared loss in the model of regression with deterministic design. We obtain sharp oracle inequalities for convex aggregates defined via exponential weights, under general assumptions on the distribution of errors and on the functions to aggregate. We show how these results can be applied to derive a sparsity oracle inequality.
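The exponentially weighted aggregate described in the abstract can be sketched numerically. The snippet below is an illustrative reconstruction, not the paper's code: it forms the convex combination Σλ θλ fλ with weights θλ ∝ π(λ) exp(−n‖Y − fλ‖²n/β), which is the generic shape of the procedure; all names (`Y`, `F`, `beta`, `prior`) and the toy data are assumptions.

```python
import numpy as np

def exp_weight_aggregate(Y, F, beta, prior=None):
    """Exponentially weighted aggregate (illustrative sketch).

    Y     : (n,) observations
    F     : (M, n) array; row j holds the j-th function to aggregate,
            evaluated at the n design points
    beta  : temperature parameter (beta > 0)
    prior : (M,) prior weights pi (defaults to uniform)
    """
    M, n = F.shape
    if prior is None:
        prior = np.full(M, 1.0 / M)
    # empirical squared risk of each candidate, ||Y - f_j||_n^2
    risks = np.mean((Y[None, :] - F) ** 2, axis=1)
    # exponential weights, computed stably via a log-sum-exp shift
    logw = np.log(prior) - n * risks / beta
    logw -= logw.max()
    w = np.exp(logw)
    w /= w.sum()
    # convex combination of the candidate functions
    return w @ F, w

# usage: aggregate three crude candidates for a noisy signal
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
f_true = np.sin(2 * np.pi * x)
Y = f_true + 0.3 * rng.standard_normal(50)
F = np.vstack([np.zeros(50), np.ones(50), np.sin(2 * np.pi * x)])
f_hat, w = exp_weight_aggregate(Y, F, beta=4 * 0.3 ** 2)
```

The choice β = 4σ² is only a plausible setting for this toy; the paper ties the admissible temperature to the error distribution.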

### Citations

911 | Continuous martingales and Brownian motion, volume 293 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]
- Revuz, Yor
- 1999

669 | The weighted majority algorithm
- Littlestone, Warmuth
- 1994
Citation Context: ...an approach. For finite Λ, procedures (2)–(3) were independently introduced for prediction of deterministic individual sequences with expert advice. Representative work and references can be found in [24, 17, 11]; in this framework the results are proved for cumulative loss and no assumption is made on the statistical nature of the data, whereas the observations Yi are supposed to be uniformly bounded by a kn...

424 | The Dantzig selector: Statistical estimation when p is much larger than n
- Candes, Tao
- 2007
Citation Context: ...‖fλ − f‖²n, we get S = −βE log ∫Λ exp{−[(n+1)‖fλ − f‖²n − 2(ξ + ζ)⊤hλ]/β} π(dλ) + βE log ∫Λ exp{−[n‖f − fλ‖²n − 2ξ⊤hλ]/β} π(dλ) = βE log ∫Λ e^{−nρ(λ)} π(dλ) − βE log ∫Λ e^{−(n+1)ρ(λ)} π(dλ), (7) where we used the notation ρ(λ) = (‖f − fλ‖²n − 2n⁻¹ξ⊤hλ)/β and the fact that ξ + ζ can be replaced by (1 + 1/n)ξ inside the expectation. The Hölder inequali...

297 | Stable recovery of sparse overcomplete representations in the presence of noise
- Donoho, Elad, et al.
- 2006
Citation Context: ...The result of Corollary 3 can be compared with the SOI obtained for other procedures [5–7]. These papers impose heavy restrictions on the Gram matrix Φ either in terms of the coherence introduced in [12] or analogous local characteristics. Our result is not of that kind: we need only that the maximal eigenvalue of Φ be bounded. On the other hand, we assume that the oracle vector λ* belongs to a ba...

246 | Aggregating strategies
- Vovk
- 1990
Citation Context: ...an approach. For finite Λ, procedures (2)–(3) were independently introduced for prediction of deterministic individual sequences with expert advice. Representative work and references can be found in [24, 17, 11]; in this framework the results are proved for cumulative loss and no assumption is made on the statistical nature of the data, whereas the observations Yi are supposed to be uniformly bounded by a kn...

185 | Simultaneous analysis of lasso and dantzig selector
- Bickel, Ritov, et al.
Citation Context: ...are used [17, 8]. It is proved that if M(λ*) ≪ n and if the dictionary {φ1, . . . , φM} satisfies certain conditions, then the vector λ* and the function f can be estimated with reasonable accuracy [18, 6, 7, 8, 40, 3]. However, the conditions on the dictionary {φ1, . . . , φM} required to get risk bounds for the Lasso and Dantzig selector are quite restrictive. One of the consequences of our results in Section 7 is...

175 | Limit Theorems of Probability Theory
- Petrov
- 1995
Citation Context: ...convolution of two distributions from Dn belongs to Dn. Finally, note that the intersection D = ∩n≥1 Dn is included in the set of all infinitely divisible distributions and is called the L-class (see [19], Theorem 3.6, p. 102). However, some basic distributions such as the uniform or the Bernoulli distribution do not belong to Dn. To show this, let us recall that the characteristic function of the uni...

134 | On the generalization ability of on-line learning algorithms
- Cesa-Bianchi, Conconi, Gentile
Citation Context: ...e Λ. Papers [13], [14] point out that f̂n and its averaged version can be obtained as a special case of mirror descent algorithms that were considered earlier in deterministic minimization. Finally, [10] establishes an interesting link between the results for cumulative risks proved in the theory of prediction of deterministic sequences and generalization error bounds for the aggregates in the stocha...

81 | Statistical learning theory and stochastic optimization
- Catoni
Citation Context: ...estimators is finite, i.e., w.l.o.g. Λ = {1, . . . , M}, and the distribution π is uniform on Λ. Procedures of the type (2)–(3) with general sets Λ and priors π came into consideration quite recently [9, 8, 3, 29, 30, 1, 2, 25], partly in connection to the PAC-Bayesian approach. For finite Λ, procedures (2)–(3) were independently introduced for prediction of deterministic individual sequences with expert advice. Representat...

79 | Aggregation for Gaussian regression
- Bunea, Tsybakov, et al.
- 2007
Citation Context: ...this case, as a consequence of our main result we obtain a sparsity oracle inequality (SOI). We refer to [22] where the notion of SOI is introduced in a general context. Examples of SOI are proved in [15, 5, 4, 6, 23]. In particular, [5] deals with the regression model with fixed design that we consider here and proves approximate SOI for BIC type and Lasso type aggregates. We show that the aggregate with exponent...

63 | Competitive on-line statistics
- Vovk
Citation Context: ...estimators is finite, i.e., w.l.o.g. Λ = {1, . . . , M}, and the distribution π is uniform on Λ. Procedures of the type (2)–(3) with general sets Λ and priors π came into consideration quite recently [9, 8, 3, 29, 30, 1, 2, 25], partly in connection to the PAC-Bayesian approach. For finite Λ, procedures (2)–(3) were independently introduced for prediction of deterministic individual sequences with expert advice. Representat...

56 | High-dimensional generalized linear models and the lasso
- van de Geer
Citation Context: ...this case, as a consequence of our main result we obtain a sparsity oracle inequality (SOI). We refer to [22] where the notion of SOI is introduced in a general context. Examples of SOI are proved in [15, 5, 4, 6, 23]. In particular, [5] deals with the regression model with fixed design that we consider here and proves approximate SOI for BIC type and Lasso type aggregates. We show that the aggregate with exponent...

39 | Adaptive regression by mixing
- Yang
- 2001
Citation Context: ... = fλ(xi) + ξ′i, where ξ′i are iid normally distributed with mean 0 and variance β/2. The idea of mixing with exponential weights has been discussed by many authors apparently since the 1970s (see [27] for a nice overview of the subject). Most of the work focused on the important particular case where the set of estimators is finite, i.e., w.l.o.g. Λ = {1, . . . , M}, and the distribution π is unif...

37 | Information theory and mixing least-squares regressions
- Leung, Barron
- 2006
Citation Context: ...or the aggregate f̂n under the squared loss, i.e., oracle inequalities with leading constant 1 and optimal rate of the remainder term. For a particular case, such an inequality has been pioneered in [16]. The result of [16] is proved for a finite set Λ and Gaussian errors. It makes use of Stein’s unbiased risk formula, and gives a very precise constant in the remainder term of the inequality. The ine...

36 | Combining different procedures for adaptive regression
- Yang
- 2000
Citation Context: ...ive exponential weighting methods: there the aggregate is defined as the average n⁻¹ Σ_{k=1}^n f̂k. For regression models with random design, such procedures are introduced and analyzed in [8], [9] and [26]. In particular, [8] and [9] establish a sharp oracle inequality, i.e., an inequality with leading constant 1. This result is further refined in [3] and [13]. In addition, [13] derives sharp oracle in...

30 | Learning by mirror averaging
- Juditsky, Rigollet, et al.
Citation Context: ...re introduced and analyzed in [8], [9] and [26]. In particular, [8] and [9] establish a sharp oracle inequality, i.e., an inequality with leading constant 1. This result is further refined in [3] and [13]. In addition, [13] derives sharp oracle inequalities not only for the squared loss but also for general loss functions. However, these techniques are not helpful in the framework that we consider her...

28 | Sequential procedures for aggregating arbitrary estimators of a conditional mean
- Bunea, Nobel
- 2005
Citation Context: ...estimators is finite, i.e., w.l.o.g. Λ = {1, . . . , M}, and the distribution π is uniform on Λ. Procedures of the type (2)–(3) with general sets Λ and priors π came into consideration quite recently [9, 8, 3, 29, 30, 1, 2, 25], partly in connection to the PAC-Bayesian approach. For finite Λ, procedures (2)–(3) were independently introduced for prediction of deterministic individual sequences with expert advice. Representat...

24 | “Universal” aggregation rules with exact bias bounds
- Catoni
- 1999

24 | Regression with multiple candidate models: selecting or mixing
- Yang
- 2003
Citation Context: ...he desired result follows from Theorem 1. Assume now that ξi are distributed with the double exponential density fξ(x) = (1/(√2 σ)) e^(−√2|x|/σ), x ∈ R. Aggregation under this assumption is discussed in [28] where it is recommended to modify the shape of the aggregate in order to match the shape of the distribution of the errors. The next proposition shows that sharp risk bounds can be obtained without m...
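As a side check on the double exponential density in the snippet above: fξ(x) = (1/(√2 σ)) e^(−√2|x|/σ) is a Laplace(0, b) density with scale b = σ/√2, so its variance is 2b² = σ². A short Monte Carlo verification (illustrative only; σ = 1.5 and all names are assumptions):

```python
import math
import random

# Laplace(0, b) with b = sigma/sqrt(2) has density (1/(sqrt(2)*sigma)) * exp(-sqrt(2)*|x|/sigma)
# and variance 2*b^2 = sigma^2; check this by Monte Carlo.
sigma = 1.5
b = sigma / math.sqrt(2)
rng = random.Random(0)
# a Laplace draw = exponential draw with a random sign
samples = [rng.expovariate(1 / b) * (1 if rng.random() < 0.5 else -1)
           for _ in range(200_000)]
var = sum(x * x for x in samples) / len(samples)  # should be close to sigma**2
```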

23 | Sparsity in penalized empirical risk minimization
- Koltchinskii
Citation Context: ...this case, as a consequence of our main result we obtain a sparsity oracle inequality (SOI). We refer to [22] where the notion of SOI is introduced in a general context. Examples of SOI are proved in [15, 5, 4, 6, 23]. In particular, [5] deals with the regression model with fixed design that we consider here and proves approximate SOI for BIC type and Lasso type aggregates. We show that the aggregate with exponent...

21 | Aggregation and sparsity via ℓ1 penalized least squares. Learning Theory
- Bunea, Tsybakov, et al.
- 2006

15 | The Skorokhod embedding problem and its offspring
- Obłój
- 2004
Citation Context: ...o strong as compared to what we really need in the proof of Theorem 1. Below we come to a weaker condition invoking a version of Skorokhod embedding (a detailed survey on this subject can be found in [18]). For simplicity we assume that the errors ξi are symmetric, i.e., P(ξi > a) = P(ξi < −a) for all a ∈ R. The argument can be adapted to the asymmetric case as well, but we do not discuss it here. W...

13 | Information theoretical upper and lower bounds for statistical estimation
- Zhang
- 2006

11 | A randomized online learning algorithm for better variance control
- Audibert
- 2006

11 | Optimal rates of aggregation. Computational Learning Theory and Kernel Machines, B. Schölkopf and M. Warmuth, eds.
- Tsybakov
- 2003
Citation Context: ...d inequality is an obvious consequence of the first one. Remark. The rate of convergence (log M)/n obtained in (10) is the optimal rate of model selection type aggregation when the errors ξi are Gaussian [21, 5]. 4 Checking assumptions (A) and (B) In this section we give some sufficient conditions for assumptions (A) and (B). Denote by Dn the set of all probability distributions of ξ1 satisfying assumption (...

10 | Recursive aggregation of estimators via the Mirror Descent Algorithm with averaging. Problems of Information Transmission
- Juditsky, Nazin, et al.
- 2005
Citation Context: ...ully adapted to models with non-identically distributed observations. Aggregate f̂n can be computed on-line. This, in particular, motivated its use for on-line prediction with finite Λ. Papers [13], [14] point out that f̂n and its averaged version can be obtained as a special case of mirror descent algorithms that were considered earlier in deterministic minimization. Finally, [10] establishes an in...

7 | From epsilon-entropy to KL-complexity: analysis of minimum information complexity density estimation
- Zhang
- 2006

3 | Une approche PAC-bayésienne de la théorie statistique de l’apprentissage
- Audibert
- 2004

2 | Regularization, boosting and mirror averaging. Comments on “Regularization in Statistics” by P. Bickel and B. Li. Test 15
- Tsybakov
- 2006
Citation Context: ...where fλ is a linear combination of M known functions with the vector of weights λ ∈ R^M. For this case, as a consequence of our main result we obtain a sparsity oracle inequality (SOI). We refer to [22] where the notion of SOI is introduced in a general context. Examples of SOI are proved in [15, 5, 4, 6, 23]. In particular, [5] deals with the regression model with fixed design that we consider here...