Aggregation by exponential weighting and sharp oracle inequalities
"... Abstract. In the present paper, we study the problem of aggregation under the squared loss in the model of regression with deterministic design. We obtain sharp oracle inequalities for convex aggregates defined via exponential weights, under general assumptions on the distribution of errors and on t ..."
Abstract

Cited by 22 (3 self)
 Add to MetaCart
Abstract. In the present paper, we study the problem of aggregation under the squared loss in the model of regression with deterministic design. We obtain sharp oracle inequalities for convex aggregates defined via exponential weights, under general assumptions on the distribution of errors and on the functions to aggregate. We show how these results can be applied to derive a sparsity oracle inequality.
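As a rough illustration of the kind of estimator studied in this abstract, the sketch below forms a convex aggregate via exponential weights under the squared loss; the function name, the temperature parameter `beta`, and the array shapes are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def exponential_weights_aggregate(preds, y, beta):
    """Convex aggregate of candidate predictions via exponential weights.

    preds : (M, n) array, preds[j] = predictions of the j-th candidate at the design points
    y     : (n,) array of observed responses
    beta  : temperature parameter (> 0), assumed given
    """
    # Empirical squared risk of each candidate.
    risks = np.mean((preds - y) ** 2, axis=1)
    # Exponential weights: softmax of the negative cumulative losses scaled by beta.
    logits = -len(y) * risks / beta
    logits -= logits.max()          # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum()
    # Convex aggregate: weighted average of the candidates.
    return weights @ preds, weights
```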
Sparse Regression Learning by Aggregation and Langevin Monte-Carlo
, 2009
"... We consider the problem of regression learning for deterministic design and independent random errors. We start by proving a sharp PACBayesian type bound for the exponentially weighted aggregate (EWA) under the expected squared empirical loss. For a broad class of noise distributions the presented ..."
Abstract

Cited by 14 (3 self)
 Add to MetaCart
We consider the problem of regression learning for deterministic design and independent random errors. We start by proving a sharp PAC-Bayesian type bound for the exponentially weighted aggregate (EWA) under the expected squared empirical loss. For a broad class of noise distributions the presented bound is valid whenever the temperature parameter β of the EWA is larger than or equal to 4σ², where σ² is the noise variance. A remarkable feature of this result is that it is valid even for unbounded regression functions and the choice of the temperature parameter depends exclusively on the noise level. Next, we apply this general bound to the problem of aggregating the elements of a finite-dimensional linear space spanned by a dictionary of functions φ1,...,φM. We allow M to be much larger than the sample size n but we assume that the true regression function can be well approximated by a sparse linear combination of functions φj. Under this sparsity scenario, we propose an EWA with a heavy-tailed prior and we show that it satisfies a sparsity oracle inequality with leading constant one. Finally, we propose several Langevin Monte-Carlo algorithms to approximately compute such an EWA when the number M of aggregated functions can be large. We discuss in some detail the convergence of these algorithms and present numerical experiments that confirm our theoretical findings.
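The sketch below shows one way to approximate such an EWA with an unadjusted Langevin iteration: sample coefficient vectors from a posterior proportional to exp(-||y - Φλ||²/β) times a heavy-tailed prior, and average the fitted values. A Cauchy-type prior, the step size, and the iteration counts are placeholder assumptions, not the paper's exact choices.

```python
import numpy as np

def langevin_ewa(Phi, y, beta, tau=1.0, step=1e-4, n_iter=20000, burn_in=5000, seed=None):
    """Unadjusted Langevin sampler targeting a density proportional to
    exp(-||y - Phi @ lam||^2 / beta) * heavy_tailed_prior(lam),
    then averaging Phi @ lam over the kept samples (an approximate EWA)."""
    rng = np.random.default_rng(seed)
    n, M = Phi.shape
    lam = np.zeros(M)
    avg, kept = np.zeros(n), 0
    for t in range(n_iter):
        # Gradient of the log-posterior: data-fit term plus a Cauchy-type prior term.
        grad_fit = 2.0 * Phi.T @ (y - Phi @ lam) / beta
        grad_prior = -2.0 * lam / (tau ** 2 + lam ** 2)
        lam = lam + step * (grad_fit + grad_prior) \
              + np.sqrt(2.0 * step) * rng.standard_normal(M)
        if t >= burn_in:
            avg += Phi @ lam
            kept += 1
    return avg / kept
```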
Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: shrinking procedures and optimal algorithms
, 2010
"... In this paper we present a generic algorithmic framework, namely, the accelerated stochastic approximation (ACSA) algorithm, for solving strongly convex stochastic composite optimization (SCO) problems. While the classical stochastic approximation (SA) algorithms are asymptotically optimal for solv ..."
Abstract

Cited by 12 (1 self)
 Add to MetaCart
In this paper we present a generic algorithmic framework, namely, the accelerated stochastic approximation (AC-SA) algorithm, for solving strongly convex stochastic composite optimization (SCO) problems. While the classical stochastic approximation (SA) algorithms are asymptotically optimal for solving differentiable and strongly convex problems, the AC-SA algorithm, when employed with proper stepsize policies, can achieve optimal or nearly optimal rates of convergence for different classes of SCO problems within a given number of iterations. Moreover, we investigate these AC-SA algorithms in more detail, for example by establishing the large-deviation results associated with the convergence rates and introducing an efficient validation procedure to check the accuracy of the generated solutions.
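For context, the sketch below is the classical SA baseline referred to above: a plain stochastic gradient iteration with the standard 1/(μk) stepsize for a μ-strongly convex objective. It is not the AC-SA algorithm itself (which adds acceleration and handles a composite term); names and signatures are illustrative.

```python
import numpy as np

def classical_sa(stoch_grad, x0, mu, n_iter):
    """Classical stochastic approximation for a mu-strongly convex objective:
    SGD-style iteration with stepsize 1 / (mu * k).  The AC-SA scheme described
    in the abstract accelerates this kind of iteration.
    """
    x = np.array(x0, dtype=float)
    for k in range(1, n_iter + 1):
        g = stoch_grad(x)          # unbiased stochastic gradient at x
        x = x - g / (mu * k)       # classical strongly convex stepsize policy
    return x
```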
Progressive mixture rules are deviation suboptimal
 Advances in Neural Information Processing Systems
"... We consider the learning task consisting in predicting as well as the best function in a finite reference set G up to the smallest possible additive term. If R(g) denotes the generalization error of a prediction function g, under reasonable assumptions on the loss function (typically satisfied by th ..."
Abstract

Cited by 12 (3 self)
 Add to MetaCart
We consider the learning task consisting in predicting as well as the best function in a finite reference set G, up to the smallest possible additive term. If R(g) denotes the generalization error of a prediction function g, under reasonable assumptions on the loss function (typically satisfied by the least squares loss when the output is bounded), it is known that the progressive mixture rule ĝ satisfies

E R(ĝ) ≤ min_{g ∈ G} R(g) + Cst (log |G|)/n,   (1)

where n denotes the size of the training set, and E denotes the expectation w.r.t. the training set distribution. This work shows that, surprisingly, for appropriate reference sets G, the deviation convergence rate of the progressive mixture rule is no better than Cst/√n: it fails to achieve the expected Cst/n. We also provide an algorithm which does not suffer from this drawback, and which is optimal in both deviation and expectation convergence rates.
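A minimal sketch of the progressive mixture (mirror averaging) rule for the squared loss: the prediction at a new point is the average of the exponential-weights aggregates built on the successive prefixes of the training sample. The inverse temperature `lam` and the precomputed-prediction interface are assumptions made for brevity.

```python
import numpy as np

def progressive_mixture_predict(preds, y, new_preds, lam):
    """Progressive mixture rule over a finite reference set G.

    preds     : (|G|, n) candidate predictions at the n training points
    y         : (n,) training responses
    new_preds : (|G|,) candidate predictions at a new point x
    lam       : inverse temperature of the exponential weights
    """
    G, n = preds.shape
    cum_loss = np.zeros(G)
    mixture = 0.0
    for i in range(n):
        # Exponential weights based on the losses of the first i observations ...
        logits = -lam * cum_loss
        logits -= logits.max()
        w = np.exp(logits)
        w /= w.sum()
        # ... define the i-th aggregate; the final rule averages these n aggregates.
        mixture += w @ new_preds
        cum_loss += (preds[:, i] - y[i]) ** 2   # squared loss of each candidate at point i
    return mixture / n
```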
Suboptimality of penalized empirical risk minimization in classification
 In Proceedings of the 20th annual conference on Computational Learning Theory (COLT). Lecture Notes in Computer Science 4539 142–156
, 2007
"... Abstract. Let F be a set of M classification procedures with values in [−1, 1]. Given a loss function, we want to construct a procedure which mimics at the best possible rate the best procedure in F. This fastest rate is called optimal rate of aggregation. Considering a continuous scale of loss func ..."
Abstract

Cited by 12 (3 self)
 Add to MetaCart
Abstract. Let F be a set of M classification procedures with values in [−1, 1]. Given a loss function, we want to construct a procedure which mimics at the best possible rate the best procedure in F. This fastest rate is called the optimal rate of aggregation. Considering a continuous scale of loss functions with various types of convexity, we prove that optimal rates of aggregation can be either ((log M)/n)^{1/2} or (log M)/n. We prove that, if all the M classifiers are binary, the (penalized) Empirical Risk Minimization procedures are suboptimal (even under the margin/low noise condition) when the loss function is somewhat more than convex, whereas, in that case, aggregation procedures with exponential weights achieve the optimal rate of aggregation.
Sparsity regret bounds for individual sequences in online linear regression
 JMLR Workshop and Conference Proceedings, 19 (COLT 2011 Proceedings):377–396
, 2011
"... We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T. We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in th ..."
Abstract

Cited by 6 (0 self)
 Add to MetaCart
We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T. We introduce the notion of a sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW, based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular, our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design.
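The sketch below combines the two ingredients named in the abstract, exponential weighting and truncation of the predictions, in a generic online forecaster. It is not the SeqSEW algorithm itself: the finite expert set and the simple clipping rule are simplified stand-ins for the sparse linear predictors and the data-driven truncation level used in the paper.

```python
import numpy as np

def online_ew_with_truncation(experts, xs, ys, eta):
    """Online exponentially weighted forecaster with truncated predictions.

    experts : list of callables x -> prediction
    xs, ys  : sequences of covariates and responses, revealed one round at a time
    eta     : learning rate of the exponential weights
    """
    K = len(experts)
    cum_loss = np.zeros(K)
    total_loss, B = 0.0, 0.0
    for x, y in zip(xs, ys):
        # Exponential weights from the cumulative past losses of the experts.
        w = np.exp(-eta * (cum_loss - cum_loss.min()))
        w /= w.sum()
        preds = np.array([f(x) for f in experts])
        # Truncation: clip the aggregated prediction to the range seen so far.
        yhat = np.clip(w @ preds, -B, B)
        total_loss += (y - yhat) ** 2
        cum_loss += (preds - y) ** 2
        B = max(B, abs(y))
    return total_loss
```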
Model selection for density estimation with L2-loss. arXiv e-prints
, 2008
"... We consider here estimation of an unknown probability density s belonging to L2(µ) where µ is a probability measure. We have at hand n i.i.d. observations with density s and use the squared L2norm as our loss function. The purpose of this paper is to provide an abstract but completely general metho ..."
Abstract

Cited by 6 (0 self)
 Add to MetaCart
We consider here estimation of an unknown probability density s belonging to L2(µ), where µ is a probability measure. We have at hand n i.i.d. observations with density s and use the squared L2-norm as our loss function. The purpose of this paper is to provide an abstract but completely general method for estimating s by model selection, making it possible to handle arbitrary families of finite-dimensional (possibly nonlinear) models and any s ∈ L2(µ). We shall, in particular, consider the cases of unbounded densities and bounded densities with unknown L∞-norm and investigate how the L∞-norm of s may influence the risk. We shall also provide applications to adaptive estimation and aggregation of preliminary estimators. Although of a purely theoretical nature, our method leads to results that cannot presently be reached by more concrete ones.
Closed-Form MMSE Estimation for Signal Denoising Under Sparse Representation Modeling Over a Unitary Dictionary
"... This paper deals with the Bayesian signal denoising problem, assuming a prior based on a sparse representation modeling over a unitary dictionary. It is well known that the Maximum Aposteriori Probability (MAP) estimator in such a case has a closedform solution based on a simple shrinkage. The foc ..."
Abstract

Cited by 5 (0 self)
 Add to MetaCart
This paper deals with the Bayesian signal denoising problem, assuming a prior based on sparse representation modeling over a unitary dictionary. It is well known that the Maximum A-Posteriori Probability (MAP) estimator in such a case has a closed-form solution based on a simple shrinkage. The focus in this paper is on the better performing and less familiar Minimum Mean-Squared-Error (MMSE) estimator. We show that this estimator also leads to a simple formula, in the form of a plain recursive expression for evaluating the contribution of every atom in the solution. An extension of the model to real-world signals is also offered, considering heteroscedastic nonzero entries in the representation, and allowing varying probabilities for the chosen atoms and the overall cardinality of the sparse representation. The MAP and MMSE estimators are re-developed for this extended model, again resulting in simple closed-form algorithms. Finally, the superiority of the MMSE estimator is demonstrated both on synthetically generated signals and on real-world signals (image patches).
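As a simplified illustration, the sketch below denoises over a unitary dictionary under an i.i.d. Bernoulli-Gaussian prior on the coefficients, where the MMSE estimate factorizes atom by atom into a posterior-weighted Wiener shrinkage. The i.i.d. prior is an assumption made here for brevity; the model of the paper (varying atom probabilities, controlled cardinality) leads to the recursive per-atom formula instead.

```python
import numpy as np

def mmse_denoise_unitary(D, y, p, sigma_x2, sigma2):
    """MMSE denoising over a unitary dictionary D under an i.i.d. Bernoulli-Gaussian prior:
    each coefficient is nonzero with probability p and then N(0, sigma_x2); noise is N(0, sigma2).
    """
    beta = D.T @ y                                   # analysis coefficients (D is unitary)
    # Marginal likelihood of each coefficient under "atom active" vs. "atom inactive"
    # (the common 1/sqrt(2*pi) factor cancels in the ratio).
    var_on, var_off = sigma_x2 + sigma2, sigma2
    like_on = np.exp(-beta ** 2 / (2 * var_on)) / np.sqrt(var_on)
    like_off = np.exp(-beta ** 2 / (2 * var_off)) / np.sqrt(var_off)
    q = p * like_on / (p * like_on + (1 - p) * like_off)    # posterior P(atom active | y)
    coeffs = q * (sigma_x2 / (sigma_x2 + sigma2)) * beta    # posterior-weighted Wiener shrinkage
    return D @ coeffs
```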
Greedy Model Averaging
"... This paper considers the problem of combining multiple models to achieve a prediction accuracy not much worse than that of the best single model for least squares regression. It is known that if the models are misspecified, model averaging is superior to model selection. Specifically, let n be the ..."
Abstract

Cited by 3 (0 self)
 Add to MetaCart
This paper considers the problem of combining multiple models to achieve a prediction accuracy not much worse than that of the best single model for least squares regression. It is known that if the models are misspecified, model averaging is superior to model selection. Specifically, if n is the sample size, then the worst-case regret of the former decays at the rate O(1/n) while the worst-case regret of the latter decays at the rate O(1/√n). In the literature, the most important and widely studied model averaging method that achieves the optimal O(1/n) average regret is the exponentially weighted model averaging (EWMA) algorithm. However, this method suffers from several limitations. The purpose of this paper is to present a new greedy model averaging procedure that improves on EWMA. We prove strong theoretical guarantees for the new procedure and illustrate our theoretical results with empirical examples.
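A hedged sketch of a generic greedy (Frank-Wolfe-style) model averaging loop for least squares: at each step the current aggregate is mixed, with weight 2/(k+1), with the single model that most reduces the empirical loss. The selection rule and step sizes are illustrative assumptions, not the exact procedure proposed in the paper.

```python
import numpy as np

def greedy_model_average(preds, y, n_steps):
    """Greedy model averaging for least squares regression.

    preds : (M, n) array of model predictions on the sample
    y     : (n,) responses
    """
    # Start from the single model with the smallest empirical squared loss.
    agg = preds[np.argmin(np.mean((preds - y) ** 2, axis=1))].copy()
    for k in range(2, n_steps + 1):
        alpha = 2.0 / (k + 1)
        # Empirical loss of mixing the current aggregate with each candidate model.
        trial = (1 - alpha) * agg + alpha * preds
        j = np.argmin(np.mean((trial - y) ** 2, axis=1))
        agg = (1 - alpha) * agg + alpha * preds[j]
    return agg
```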
Hyper-sparse optimal aggregation
, 2010
"... Given a finite set F of functions and a learning sample, the aim of an aggregation procedure is to have a risk as close as possible to risk of the best function in F. Up to now, optimal aggregation procedures are convex combinations of every elements of F. In this paper, we prove that optimal aggreg ..."
Abstract

Cited by 2 (0 self)
 Add to MetaCart
Given a finite set F of functions and a learning sample, the aim of an aggregation procedure is to have a risk as close as possible to the risk of the best function in F. Up to now, optimal aggregation procedures have been convex combinations of all the elements of F. In this paper, we prove that there exist optimal aggregation procedures combining only two functions in F. Such algorithms are of particular interest when F contains many irrelevant functions that should not appear in the aggregation procedure. Since selectors are suboptimal aggregation procedures, this proves that two is the minimal number of elements of F required for the construction of an optimal aggregation procedure in every situation. We then perform a numerical study for the problem of selecting the regularization parameters of the Lasso and the Elastic-net estimators, comparing on simulated examples our aggregation algorithms to aggregation with exponential weights, to Mallows' Cp and to cross-validation selection procedures.
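One classical way to aggregate using only two functions is the star-algorithm-style construction sketched below: an empirical risk minimizer is computed on a first sub-sample, and the empirical risk on a second sub-sample is then minimized over the segments joining it to each other function of F. This is an illustrative sketch under those assumptions, not necessarily the exact procedure of the paper; the grid of mixing weights is an arbitrary choice.

```python
import numpy as np

def two_function_aggregate(preds_train, y_train, preds_val, y_val):
    """Aggregate combining at most two functions of F.

    preds_train, preds_val : (M, n1) and (M, n2) predictions of the M functions
    y_train, y_val         : responses of the two sub-samples
    Returns (index of ERM, index of second function, mixing weight t), so that the
    aggregate is (1 - t) * f_erm + t * f_j.
    """
    # Step 1: empirical risk minimizer on the first sub-sample.
    erm = np.argmin(np.mean((preds_train - y_train) ** 2, axis=1))
    best = (erm, 0.0, np.inf)
    # Step 2: search the segments [f_erm, f_j] on a grid of mixing weights.
    for j in range(preds_val.shape[0]):
        for t in np.linspace(0.0, 1.0, 101):
            mix = (1 - t) * preds_val[erm] + t * preds_val[j]
            risk = np.mean((mix - y_val) ** 2)
            if risk < best[2]:
                best = (j, t, risk)
    j, t, _ = best
    return erm, j, t
```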