Results 1–10 of 30
Adaptive Regression by Mixing
 Journal of the American Statistical Association
Cited by 39 (7 self)
Adaptation over different procedures is of practical importance. Different procedures perform well under different conditions. In many practical situations, it is rather hard to assess which conditions are (approximately) satisfied so as to identify the best procedure for the data at hand. Thus automatic adaptation over various scenarios is desirable. A practically feasible method, named Adaptive Regression by Mixing (ARM), is proposed to convexly combine general candidate regression procedures. Under mild conditions, the resulting estimator is theoretically shown to perform optimally in rates of convergence without knowing which of the original procedures works best. Simulations are conducted in several settings, including comparing a parametric model with nonparametric alternatives, comparing a neural network with projection pursuit in multidimensional regression, and combining bandwidths in kernel regression. The results clearly support the theoretical property of ARM. The ARM ...
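The ARM scheme described above combines candidates with convex weights computed from data held out from fitting. A minimal sketch, assuming Gaussian noise with known variance and already-fitted candidates (the function name and signature are ours, not the paper's):

```python
import numpy as np

def arm_weights(y_holdout, preds_holdout, sigma2=1.0):
    """Illustrative ARM-style convex weights (hypothetical helper).

    y_holdout:     (m,) responses held out from fitting the candidates.
    preds_holdout: (K, m) holdout predictions of the K candidate procedures.
    Assuming Gaussian noise with variance sigma2, each weight is proportional
    to the candidate's likelihood on the holdout half of the data.
    """
    sse = np.sum((preds_holdout - y_holdout) ** 2, axis=1)  # per-candidate error
    log_w = -sse / (2.0 * sigma2)
    log_w -= log_w.max()                                    # numerical stability
    w = np.exp(log_w)
    return w / w.sum()                                      # convex weights
```

The final estimator would be the convex combination `w @ candidate_predictions` at new design points; candidates with poor holdout fit receive exponentially small weight.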
Fast learning rates in statistical inference through aggregation
 Submitted to the Annals of Statistics
, 2008
Cited by 24 (3 self)
We develop minimax optimal risk bounds for the general learning task consisting in predicting as well as the best function in a reference set G up to the smallest possible additive term, called the convergence rate. When the reference set is finite and n denotes the size of the training data, we provide minimax convergence rates of the form C (log|G| / n)^v, with a tight evaluation of the positive constant C and with the exact exponent 0 < v ≤ 1, the latter value depending on the convexity of the loss function and on the level of noise in the output distribution. The risk upper bounds are based on a sequential randomized algorithm which, at each step, concentrates on functions having both low risk and low variance with respect to the previous step's prediction function. Our analysis puts forward the links between the probabilistic and worst-case viewpoints, and allows us to obtain risk bounds unachievable with the standard statistical learning approach. One of the key ideas of this work is to use probabilistic inequalities with respect to appropriate (Gibbs) distributions on the prediction function space, instead of using them with respect to the distribution generating the data. The risk lower bounds are based on refinements of the Assouad lemma taking particular account of the properties of the loss function. Our key example to illustrate the upper and lower bounds is the L_q-regression setting, for which an exhaustive analysis of the convergence rates is given as q ranges over [1, +∞).
Aggregation by exponential weighting and sharp oracle inequalities
Cited by 23 (2 self)
In the present paper, we study the problem of aggregation under the squared loss in the model of regression with deterministic design. We obtain sharp oracle inequalities for convex aggregates defined via exponential weights, under general assumptions on the distribution of errors and on the functions to aggregate. We show how these results can be applied to derive a sparsity oracle inequality.
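The exponential-weighting aggregate studied here is simple to state: a Gibbs-weighted convex combination of the candidate functions, with weights driven by empirical squared loss at the fixed design points. A minimal sketch under those assumptions (function and parameter names are ours):

```python
import numpy as np

def exp_weight_aggregate(y, preds, temperature):
    """Illustrative aggregation by exponential weighting.

    y:     (n,) observations at the fixed design points.
    preds: (K, n) values of the K functions to aggregate at those points.
    Returns the convex aggregate sum_k w_k * preds[k], with Gibbs weights
    w_k proportional to exp(-n * risk_k / temperature), where risk_k is the
    empirical squared loss of candidate k.
    """
    risks = np.mean((preds - y) ** 2, axis=1)
    logits = -len(y) * risks / temperature
    logits -= logits.max()              # shift to avoid overflow in exp
    w = np.exp(logits)
    w /= w.sum()
    return w @ preds, w
```

The temperature trades off between mimicking the single best candidate (small values) and averaging broadly (large values); the sharp oracle inequalities in the paper concern specific choices of this parameter.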
Combining forecasting procedures: some theoretical results
 Econometric Theory
, 2004
Cited by 14 (2 self)
We study some methods of combining procedures for forecasting a continuous random variable. Statistical risk bounds under the squared error loss are obtained under mild distributional assumptions on the future given the current outside information and the past observations. The risk bounds show that the combined forecast automatically achieves the best performance among the candidate procedures up to a constant factor and an additive penalty term. In terms of the rate of convergence, the combined forecast performs as well as if one knew in advance which candidate forecasting procedure is best. Empirical studies suggest that combining procedures can sometimes improve forecasting accuracy compared to the original procedures. Risk bounds are derived to theoretically quantify the potential gain from, and the price of, linearly combining forecasts for improvement. The result supports the empirical finding that it is not automatically a good idea to combine forecasts: blind combining can degrade performance dramatically owing to the large variability in estimating the best combining weights. An automated combining method is shown in theory to achieve a balance between the potential gain and the complexity penalty (the price of combining); to take advantage (if any) of sparse combining; and to maintain the best performance (in rate) among the candidate forecasting procedures if linear or sparse combining does not help.
Simultaneous adaptation to the margin and to complexity in classification
, 2005
Cited by 14 (6 self)
We consider the problem of adaptation to the margin and to complexity in binary classification. We suggest a learning method with a numerically easy aggregation step. Adaptivity to both the margin and the complexity in classification usually involves empirical risk minimization or Rademacher complexities, which lead to numerical difficulties. On the other hand, there exist classifiers that are easy to compute and converge at fast rates but are not adaptive. Combining these classifiers with our aggregation procedure, we obtain numerically realizable adaptive classifiers that converge at fast rates.
Aggregation for regression learning
 Laboratoire de Probabilités, Université Paris VI, 2004, http://www.proba.jussieu.fr/mathdoc/preprints/index.html# 2004. L. Birgé
, 2004
Cited by 12 (2 self)
This paper studies statistical aggregation procedures in the regression setting. A motivating factor is the existence of many different methods of estimation, leading to possibly competing estimators. We consider here three different types of aggregation: model selection (MS) aggregation, convex (C) aggregation, and linear (L) aggregation. The objective of (MS) is to select the optimal single estimator from the list; that of (C) is to select the optimal convex combination of the given estimators; and that of (L) is to select the optimal linear combination of the given estimators. We are interested in evaluating the rates of convergence of the excess risks of the estimators obtained by these procedures. Our approach is motivated by recent minimax results in Nemirovski (2000) and Tsybakov (2003). There exist competing aggregation procedures achieving optimal convergence separately for each of the (MS), (C), and (L) cases. Since the bounds in these results are not directly comparable with each other, we suggest an alternative solution. We prove that all three optimal bounds can be nearly achieved via a single "universal" aggregation procedure. We propose such a procedure, which consists in mixing the initial estimators with weights obtained by penalized least squares. Two different penalties are considered: one related to hard-thresholding techniques, the other a data-dependent L1-type penalty.
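The data-dependent L1-type penalty mentioned above amounts to a penalized least squares problem over the candidate weights. A minimal illustrative solver via proximal gradient descent (ISTA), not the authors' exact procedure (the function name and solver choice are ours):

```python
import numpy as np

def l1_aggregate_weights(y, preds, lam, n_iter=2000):
    """Sketch: weights minimizing ||y - preds.T @ w||^2 / n + lam * ||w||_1.

    y:     (n,) observed responses.
    preds: (K, n) predictions of the K initial estimators at the design points.
    """
    K, n = preds.shape
    X = preds.T                                          # (n, K) design matrix
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2 / n)   # 1 / Lipschitz constant
    w = np.zeros(K)
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ w - y) / n               # gradient of the fit term
        z = w - step * grad
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return w
```

The soft-thresholding step is what produces sparse weight vectors, so the mixture can effectively perform model selection (few nonzero weights) or richer linear combination depending on the penalty level `lam`.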
Progressive mixture rules are deviation suboptimal
 Advances in Neural Information Processing Systems
Cited by 12 (1 self)
We consider the learning task consisting in predicting as well as the best function in a finite reference set G up to the smallest possible additive term. If R(g) denotes the generalization error of a prediction function g, under reasonable assumptions on the loss function (typically satisfied by the least squares loss when the output is bounded), it is known that the progressive mixture rule ĝ satisfies ER(ĝ) ≤ min_{g∈G} R(g) + Cst (log|G|)/n, (1) where n denotes the size of the training set, and E denotes the expectation with respect to the training set distribution. This work shows that, surprisingly, for appropriate reference sets G, the deviation convergence rate of the progressive mixture rule is no better than Cst/√n: it fails to achieve the expected Cst/n. We also provide an algorithm which does not suffer from this drawback, and which is optimal in both deviation and expectation convergence rates.
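The progressive mixture rule in (1) averages Gibbs mixtures computed on growing prefixes of the training data. A minimal sketch for a finite set of fixed candidate functions, where averaging the mixtures reduces to averaging the weight vectors (the function name and parameterization are ours):

```python
import numpy as np

def progressive_mixture(y, preds, beta):
    """Illustrative progressive mixture rule.

    y:     (n,) training responses.
    preds: (K, n) predictions of the K reference functions on the training points.
    At step i, Gibbs weights are formed from the cumulative squared loss on the
    first i examples; the final rule is the Cesàro average of these mixtures.
    """
    K, n = preds.shape
    cum_loss = np.zeros(K)
    avg_w = np.zeros(K)
    for i in range(n):
        logits = -beta * (cum_loss - cum_loss.min())  # shift for stability
        w = np.exp(logits)
        w /= w.sum()
        avg_w += w / n                                # running Cesàro average
        cum_loss += (preds[:, i] - y[i]) ** 2         # update after step i
    return avg_w   # final predictor: sum_k avg_w[k] * g_k
```

The expectation bound (1) holds for this averaged rule, while the paper's point is that its deviations can still be of order 1/√n for adversarial reference sets.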
Aggregating Regression Procedures for a Better Performance
 Bernoulli
, 1999
Cited by 11 (2 self)
Methods have been proposed to linearly combine candidate regression procedures to improve estimation accuracy. Applications of these methods in many examples are very successful, pointing to the great potential of combining procedures. A fundamental question regarding combining procedures is: what is the potential gain, and how much does one need to pay for it? A partial answer to this question is obtained by Juditsky and Nemirovski (1996) for the case in which a large number of procedures are to be combined. We attempt to give a more general solution. Under an l1 constraint on the linear coefficients, we show that, for pursuing the best linear combination over n^α procedures, in terms of the rate of convergence under the squared L2 loss, one pays a price of order O(log n / n^{1−α}) when 0 < α < 1/2, and a price of order O((log n / n)^{1/2}) when 1/2 ≤ α ≤ 1. These rates cannot be essentially improved in a uniform sense. This result suggests that one should be cautious...
Adaptive Estimation in Pattern Recognition by Combining Different Procedures
 Statistica Sinica
Cited by 6 (3 self)
We study a problem of adaptive estimation of a conditional probability function in a pattern recognition setting. In many applications, for more flexibility, one may want to consider various estimation procedures targeted at different scenarios and/or under different assumptions. For example, when the feature dimension is high, to overcome the familiar curse of dimensionality, one may seek a good parsimonious model among a number of candidates such as CART, neural nets, additive models, and others. For such a situation, one wishes to have an automated final procedure that always performs as well as the best candidate. In this work, we propose a method to combine a countable collection of procedures for estimating the conditional probability. We show that the combined procedure has the property that its statistical risk is bounded above by that of any of the procedures being considered, plus a small penalty. Thus, in an asymptotic sense, the strengths of the different estimation procedures i...
ESTIMATOR SELECTION WITH RESPECT TO HELLINGER-TYPE RISKS
, 2009
Cited by 3 (2 self)
We observe a random measure N and aim at estimating its intensity s. This statistical framework allows us to deal simultaneously with the problems of estimating a density, the marginals of a multivariate distribution, the mean of a random vector with nonnegative components, and the intensity of a Poisson process. Our estimation strategy is based on estimator selection. Given a family of estimators of s based on the observation of N, we propose a selection rule, also based on N, for choosing among them. Little is assumed about the collection of estimators. The procedure offers the possibility to perform model selection and also to select among estimators associated with different model selection strategies. Besides, it provides an alternative to the T-estimators studied recently in Birgé (2006). For illustration, we consider the problems of estimation and (complete) variable selection in various regression settings.