High dimensional graphs and variable selection with the Lasso
- Annals of Statistics
, 2006
Cited by 232 (17 self)
The pattern of zero entries in the inverse covariance matrix of a multivariate normal distribution corresponds to conditional independence restrictions between variables. Covariance selection aims at estimating those structural zeros from data. We show that neighborhood selection with the Lasso is a computationally attractive alternative to standard covariance selection for sparse high-dimensional graphs. Neighborhood selection estimates the conditional independence restrictions separately for each node in the graph and is hence equivalent to variable selection for Gaussian linear models. We show that the proposed neighborhood selection scheme is consistent for sparse high-dimensional graphs. Consistency hinges on the choice of the penalty parameter. The oracle value for optimal prediction does not lead to a consistent neighborhood estimate. Controlling instead the probability of falsely joining some distinct connectivity components of the graph, consistent estimation for sparse graphs is achieved (with exponential rates), even when the number of variables grows as the number of observations raised to an arbitrary power.
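The node-by-node regression idea in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the coordinate-descent solver, the function names, the fixed penalty `lam`, and the "and"/"or" rule for combining the two regressions that touch each pair are all assumptions made here for illustration, and the data columns are assumed to be centered.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1.

    X is a list of n rows (each a list of p floats); y is a list of n floats.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, with beta = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes beta_j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def neighborhood_selection(data, lam, rule="and"):
    """Estimate graph edges by Lasso-regressing each variable on all others."""
    p = len(data[0])
    neighbors = []
    for a in range(p):
        others = [j for j in range(p) if j != a]
        X = [[row[j] for j in others] for row in data]
        y = [row[a] for row in data]
        beta = lasso_cd(X, y, lam)
        neighbors.append({others[k] for k, b in enumerate(beta) if b != 0.0})
    edges = set()
    for a in range(p):
        for b in range(a + 1, p):
            in_a, in_b = b in neighbors[a], a in neighbors[b]
            keep = (in_a and in_b) if rule == "and" else (in_a or in_b)
            if keep:
                edges.add((a, b))
    return edges
```

The penalty is passed in by hand here; the abstract's point is that its choice decides whether the estimated neighborhoods are consistent, so a prediction-optimal value should not be assumed to recover the graph.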
Consistency of the group lasso and multiple kernel learning
- Journal of Machine Learning Research
, 2007
Cited by 81 (14 self)
We consider the least-squares regression problem with regularization by a block 1-norm, i.e., a sum of Euclidean norms over spaces of dimensions larger than one. This problem, referred to as the group Lasso, extends the usual regularization by the 1-norm, in which all spaces have dimension one and which is commonly referred to as the Lasso. In this paper, we study the asymptotic model consistency of the group Lasso. We derive necessary and sufficient conditions for the consistency of the group Lasso under practical assumptions, such as model misspecification. When the linear predictors and Euclidean norms are replaced by functions and reproducing kernel Hilbert norms, the problem is usually referred to as multiple kernel learning and is commonly used for learning from heterogeneous data sources and for nonlinear variable selection. Using tools from functional analysis, and in particular covariance operators, we extend the consistency results to this infinite-dimensional case and also propose an adaptive scheme to obtain a consistent model estimate, even when the necessary condition required for the non-adaptive scheme is not satisfied.
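To make the block 1-norm concrete, here is a minimal sketch of the penalty and of its proximal operator (block soft-thresholding), a standard building block in group Lasso solvers. The abstract does not mention this operator; the function names and the threshold `lam` are assumptions introduced for illustration.

```python
import math


def block_norm_penalty(groups):
    """Block 1-norm: the sum of the Euclidean norms of the coefficient groups."""
    return sum(math.sqrt(sum(x * x for x in g)) for g in groups)


def block_soft_threshold(v, lam):
    """Proximal operator of lam * ||v||_2 for a single group v.

    The whole group is set to zero when its norm is at most lam;
    otherwise every coordinate is shrunk by the same factor.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= lam:
        return [0.0] * len(v)
    scale = 1.0 - lam / norm
    return [scale * x for x in v]
```

Because the shrinkage acts on the norm of a whole group, coefficients enter or leave the model together. When every group has dimension one, the operator reduces to the scalar soft-thresholding of the ordinary Lasso.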
SpAM: Sparse additive models
- In Advances in Neural Information Processing Systems 20
, 2007
Cited by 37 (9 self)
We present a new class of models for high-dimensional nonparametric regression and classification called sparse additive models (SpAM). Our methods combine ideas from sparse linear modeling and additive nonparametric regression. We derive a method for fitting the models that is effective even when the number of covariates is larger than the sample size. A statistical analysis of the properties of SpAM is given together with empirical results on synthetic and real data, showing that SpAM can be effective in fitting sparse nonparametric models in high-dimensional data.
Sequential procedures for aggregating arbitrary estimators of a conditional mean
, 2005
Cited by 19 (0 self)
In this paper we describe and analyze a sequential procedure for aggregating linear combinations of a finite family of regression estimates, with particular attention to linear combinations having coefficients in the generalized simplex. The procedure is based on exponential weighting, and has a computationally tractable approximation. Analysis of the procedure is based in part on techniques from the sequential prediction of non-random sequences. Here these techniques are applied in a stochastic setting to obtain cumulative loss bounds for the aggregation procedure. From the cumulative loss bounds we derive an oracle inequality for the aggregate estimator for an unbounded response having a suitable moment generating function. The inequality shows that the risk of the aggregate estimator is less than the risk of the best candidate linear combination in the generalized simplex, plus a complexity term that depends on the size of the coefficient set. The inequality readily yields convergence rates for aggregation over the unit simplex that are within logarithmic factors of known minimax bounds. Some preliminary results on model selection are also presented.
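The exponential-weighting idea can be sketched in a simplified setting. The version below aggregates a finite list of candidate predictors under squared loss. The paper itself works with linear combinations whose coefficients lie in the generalized simplex, so this is an illustration of the weighting principle only. The function names and the learning rate `eta` are assumptions.

```python
import math


def exponential_weights(cum_losses, eta):
    """Weights proportional to exp(-eta * cumulative loss), normalized to sum to 1."""
    base = min(cum_losses)  # subtract the minimum for numerical stability
    raw = [math.exp(-eta * (c - base)) for c in cum_losses]
    total = sum(raw)
    return [r / total for r in raw]


def sequential_aggregate(predictions, targets, eta):
    """Run the aggregation sequentially.

    predictions[t][k] is candidate k's prediction at step t.
    Returns the aggregate predictions and the final weights.
    """
    n_candidates = len(predictions[0])
    cum = [0.0] * n_candidates
    aggregated = []
    for preds, y in zip(predictions, targets):
        # Weights use only losses observed before the current step.
        w = exponential_weights(cum, eta)
        aggregated.append(sum(wk * pk for wk, pk in zip(w, preds)))
        for k in range(n_candidates):
            cum[k] += (preds[k] - y) ** 2
    return aggregated, exponential_weights(cum, eta)
```

Starting from uniform weights, mass shifts toward candidates with small cumulative loss, so the aggregate tracks the best candidate without knowing in advance which one it is.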
Linear and convex aggregation of density estimators
, 2004
Cited by 10 (1 self)
We study the problem of learning the best linear and convex combination of M estimators of a density with respect to the mean squared risk. We suggest aggregation procedures and we prove sharp oracle inequalities for their risks, i.e., oracle inequalities with leading constant 1. We also obtain lower bounds showing that these procedures attain optimal rates of aggregation. As an example, we consider aggregation of multivariate kernel density estimators with different bandwidths. We show that linear and convex aggregates mimic the kernel oracles in an asymptotically exact sense. We prove that, for Pinsker’s kernel, the proposed aggregates are sharp asymptotically minimax simultaneously over a large scale of Sobolev classes of densities. Finally, we provide simulations demonstrating the performance of the convex aggregation procedure.
Simultaneous adaptation to the margin and to complexity in classification, (2005), Available at http://hal.ccsd.cnrs.fr/ccsd-00009241/en
Cited by 9 (3 self)
We consider the problem of adaptation to the margin and to complexity in binary classification. We suggest a learning method with a numerically easy aggregation step. Adaptivity to both the margin and complexity in classification usually involves empirical risk minimization or Rademacher complexities, which lead to numerical difficulties. On the other hand, there exist classifiers that are easy to compute and that converge with fast rates but are not adaptive. Combining these classifiers by our aggregation procedure, we get numerically realizable adaptive classifiers that converge with fast rates.
Aggregation for regression learning
- Laboratoire de Probabilités, Université Paris VI, 2004, http://www.proba.jussieu.fr/mathdoc/preprints/index.html# 2004. L. Birgé
, 2004
Cited by 8 (1 self)
This paper studies statistical aggregation procedures in the regression setting. A motivating factor is the existence of many different methods of estimation, leading to possibly competing estimators. We consider here three different types of aggregation: model selection (MS) aggregation, convex (C) aggregation and linear (L) aggregation. The objective of (MS) is to select the optimal single estimator from the list; that of (C) is to select the optimal convex combination of the given estimators; and that of (L) is to select the optimal linear combination of the given estimators. We are interested in evaluating the rates of convergence of the excess risks of the estimators obtained by these procedures. Our approach is motivated by recent minimax results in Nemirovski (2000) and Tsybakov (2003). There exist competing aggregation procedures achieving optimal convergence separately for each one of the (MS), (C) and (L) cases. Since the bounds in these results are not directly comparable with each other, we suggest an alternative solution. We prove that all three optimal bounds can be nearly achieved via a single “universal” aggregation procedure. We propose such a procedure, which consists in mixing the initial estimators with weights obtained by penalized least squares. Two different penalties are considered: one of them is related to hard thresholding techniques, the second one is a data-dependent L1-type penalty.
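The three aggregation targets described in this abstract can be written compactly. The notation below is assumed for illustration and does not come from the listing: $f_1, \dots, f_M$ are the given estimators, $R(\cdot)$ is the risk, and $\Lambda^M$ is the unit simplex.

```latex
\begin{align*}
\text{(MS)}\quad & \min_{1 \le j \le M} R(f_j), \\
\text{(C)}\quad  & \min_{\lambda \in \Lambda^M} R\Big(\sum_{j=1}^{M} \lambda_j f_j\Big),
  \qquad \Lambda^M = \Big\{\lambda \in \mathbb{R}^M : \lambda_j \ge 0,\ \sum_{j=1}^{M} \lambda_j = 1\Big\}, \\
\text{(L)}\quad  & \min_{\lambda \in \mathbb{R}^M} R\Big(\sum_{j=1}^{M} \lambda_j f_j\Big).
\end{align*}
```

The three feasible sets are nested (single estimators correspond to the vertices of the simplex, and the simplex lies inside $\mathbb{R}^M$), so the oracle risks are ordered (L) $\le$ (C) $\le$ (MS). The price of mimicking a richer oracle is paid in the rate of the excess risk, which is why the three bounds are not directly comparable.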
Large deviations of vector-valued martingales in 2-smooth normed spaces
- Submitted to the Annals of Probability
, 2008
Cited by 4 (0 self)
In this paper, we derive exponential bounds on probabilities of large deviations for “light tail” martingales taking values in finite-dimensional normed spaces. Our primary emphasis is on the case where the bounds are dimension-independent or nearly so. We demonstrate that this is the case when the norm on the space can be approximated, within an absolute constant factor, by a norm which is differentiable on the unit sphere with a Lipschitz continuous gradient. We also present various examples of spaces possessing the latter property.
Density estimation with stagewise optimization of the empirical risk (http://www.rni.helsinki.fi/ jsk/ps/kitera.pdf)
, 2005
Cited by 2 (0 self)
We consider multivariate density estimation with identically distributed observations. We study a density estimator which is a convex combination of functions in a dictionary and the convex combination is chosen by minimizing the L2 empirical risk in a stagewise manner. We derive the convergence rates of the estimator when the estimated density belongs to the L2 closure of the convex hull of a class of functions which satisfies entropy conditions. The L2 closure of a convex hull is a large non-parametric class but under suitable entropy conditions the convergence rates of the estimator do not depend on the dimension, and density estimation is feasible also in high dimensional cases. The variance of the estimator does not increase when the number of components of the estimator increases. Instead, we control the bias-variance trade-off by the choice of the dictionary from which the components are chosen.
On Minimax Prediction for Nonparametric Autoregressive Models
, 1997
Cited by 2 (0 self)
We consider the problem of nonparametric prediction for a multi-dimensional functional autoregression $y_t = f(y_{t-1}, \ldots, y_{t-d}) + e_t$ on the basis of $N$ observations of $y_t$. In the case when the unknown nonlinear function $f$ belongs to the Barron class, we propose an estimation algorithm which provides approximations of $f$ with expected $L_2$ accuracy $O(N^{-1/4} \ln^{1/4} N)$. We also show that this approximation rate cannot be significantly improved. The proposed algorithms are "computationally efficient": the total number of elementary computations necessary to complete the estimate grows polynomially with $N$. Key-words: Non-parametric estimation, stochastic approximation. Affiliations: INRIA Rhône-Alpes, 655 avenue de l'Europe, 38330 Montbonnot Saint Martin, France; IRISA-INRIA, Campus de Beaulieu, 35042 Rennes Cedex, France.

