Results 1 - 10 of 1,271
Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties
, 2001
"... Variable selection is fundamental to high-dimensional statistical modeling, including nonparametric regression. Many approaches in use are stepwise selection procedures, which can be computationally expensive and ignore stochastic errors in the variable selection process. In this article, penalized ..."
Abstract
-
Cited by 948 (62 self)
- Add to MetaCart
Variable selection is fundamental to high-dimensional statistical modeling, including nonparametric regression. Many approaches in use are stepwise selection procedures, which can be computationally expensive and ignore stochastic errors in the variable selection process. In this article, penalized likelihood approaches are proposed to handle these kinds of problems. The proposed methods select variables and estimate coefficients simultaneously. Hence they enable us to construct confidence intervals for estimated parameters. The proposed approaches are distinguished from others in that the penalty functions are symmetric, nonconcave on (0, ∞), and have singularities at the origin to produce sparse solutions. Furthermore, the penalty functions should be bounded by a constant to reduce bias and satisfy certain conditions to yield continuous solutions. A new algorithm is proposed for optimizing penalized likelihood functions. The proposed ideas are widely applicable. They are readily applied to a variety of parametric models such as generalized linear models and robust regression models. They can also be applied easily to nonparametric modeling by using wavelets and splines. Rates of convergence of the proposed penalized likelihood estimators are established. Furthermore, with proper choice of regularization parameters, we show that the proposed estimators perform as well as the oracle procedure in variable selection; namely, they work as well as if the correct submodel were known. Our simulation shows that the newly proposed methods compare favorably with other variable selection techniques. Furthermore, the standard error formulas are tested to be accurate enough for practical applications.
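For least squares with an orthonormal design, a nonconcave penalty of this kind reduces to a coordinate-wise thresholding rule. Below is a minimal sketch of the SCAD thresholding rule under that assumption; the tuning constants (lam, a = 3.7) and the toy data are illustrative choices, not the paper's simulation settings, and this is not a full penalized-likelihood solver.

```python
# Minimal sketch: SCAD thresholding applied coordinate-wise (orthonormal design).
# lam and a are illustrative tuning constants; a = 3.7 is a commonly used default.
import numpy as np

def scad_threshold(z, lam, a=3.7):
    """Apply the SCAD thresholding rule to an array of least squares estimates z."""
    z = np.asarray(z, dtype=float)
    absz = np.abs(z)
    out = np.empty_like(z)
    m1 = absz <= 2 * lam                         # soft-thresholding region
    out[m1] = np.sign(z[m1]) * np.maximum(absz[m1] - lam, 0.0)
    m2 = (absz > 2 * lam) & (absz <= a * lam)    # transition region (reduces bias)
    out[m2] = ((a - 1) * z[m2] - np.sign(z[m2]) * a * lam) / (a - 2)
    m3 = absz > a * lam                          # large signals are left unshrunk
    out[m3] = z[m3]
    return out

# Toy example: a sparse coefficient vector observed with noise
rng = np.random.default_rng(0)
theta = np.array([3.0, -2.5, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0])
print(scad_threshold(theta + 0.3 * rng.standard_normal(theta.size), lam=0.5))
```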
The Dantzig selector: statistical estimation when p is much larger than n
, 2005
"... In many important statistical applications, the number of variables or parameters p is much larger than the number of observations n. Suppose then that we have observations y = Ax + z, where x ∈ R p is a parameter vector of interest, A is a data matrix with possibly far fewer rows than columns, n ≪ ..."
Abstract
-
Cited by 879 (14 self)
- Add to MetaCart
In many important statistical applications, the number of variables or parameters p is much larger than the number of observations n. Suppose then that we have observations y = Ax + z, where x ∈ ℝᵖ is a parameter vector of interest, A is a data matrix with possibly far fewer rows than columns, n ≪ p, and the zᵢ's are i.i.d. N(0, σ²). Is it possible to estimate x reliably based on the noisy data y? To estimate x, we introduce a new estimator, the Dantzig selector, which is the solution to the ℓ1-regularization problem min_{x̃ ∈ ℝᵖ} ‖x̃‖_{ℓ1} subject to ‖Aᵀr‖_{ℓ∞} ≤ (1 + t⁻¹)√(2 log p) · σ, where r is the residual vector y − Ax̃ and t is a positive scalar. We show that if A obeys a uniform uncertainty principle (with unit-normed columns) and if the true parameter vector x is sufficiently sparse (which here roughly guarantees that the model is identifiable), then with very large probability ‖x̂ − x‖²_{ℓ2} ≤ C² · 2 log p · (σ² + Σᵢ min(xᵢ², σ²)). Our results are nonasymptotic and we give values for the constant C. In short, our estimator achieves a loss within a logarithmic factor of the ideal mean squared error one would achieve with an oracle which would supply perfect information about which coordinates are nonzero and which are above the noise level. In multivariate regression and from a model selection viewpoint, our result says that it is possible nearly to select the best subset of variables by solving a very simple convex program, which in fact can easily be recast as a convenient linear program (LP).
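Since the abstract notes that the Dantzig selector can be recast as a linear program, here is a minimal sketch of that recasting using SciPy's linprog. The problem sizes, the synthetic data, and the constraint level delta (here simply √(2 log p)·σ, i.e. dropping the (1 + t⁻¹) factor) are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(A, y, delta):
    """min ||x||_1  subject to  ||A^T (y - A x)||_inf <= delta, posed as an LP."""
    n, p = A.shape
    AtA, Aty = A.T @ A, A.T @ y
    # LP variables are [x (p entries), u (p entries)] with |x_i| <= u_i.
    c = np.concatenate([np.zeros(p), np.ones(p)])       # minimize sum(u) = ||x||_1
    I, Z = np.eye(p), np.zeros((p, p))
    A_ub = np.vstack([
        np.hstack([ I,   -I]),    #  x - u <= 0
        np.hstack([-I,   -I]),    # -x - u <= 0
        np.hstack([-AtA,  Z]),    #  A^T(y - Ax) <= delta
        np.hstack([ AtA,  Z]),    # -A^T(y - Ax) <= delta
    ])
    b_ub = np.concatenate([np.zeros(2 * p), delta - Aty, delta + Aty])
    bounds = [(None, None)] * p + [(0, None)] * p        # x free, u >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:p]

# Toy example with p >> n and a 3-sparse truth
rng = np.random.default_rng(1)
n, p, sigma = 40, 100, 0.1
A = rng.standard_normal((n, p)) / np.sqrt(n)             # roughly unit-norm columns
x_true = np.zeros(p); x_true[:3] = [2.0, -1.5, 1.0]
y = A @ x_true + sigma * rng.standard_normal(n)
x_hat = dantzig_selector(A, y, delta=np.sqrt(2 * np.log(p)) * sigma)
print(np.round(x_hat[:5], 2))
```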
A Practical Guide to Wavelet Analysis
, 1998
"... A practical step-by-step guide to wavelet analysis is given, with examples taken from time series of the El Nio-- Southern Oscillation (ENSO). The guide includes a comparison to the windowed Fourier transform, the choice of an appropriate wavelet basis function, edge effects due to finite-length t ..."
Abstract
-
Cited by 869 (3 self)
- Add to MetaCart
A practical step-by-step guide to wavelet analysis is given, with examples taken from time series of the El Niño–Southern Oscillation (ENSO). The guide includes a comparison to the windowed Fourier transform, the choice of an appropriate wavelet basis function, edge effects due to finite-length time series, and the relationship between wavelet scale and Fourier frequency. New statistical significance tests for wavelet power spectra are developed by deriving theoretical wavelet spectra for white and red noise processes and using these to establish significance levels and confidence intervals. It is shown that smoothing in time or scale can be used to increase the confidence of the wavelet spectrum. Empirical formulas are given for the effect of smoothing on significance levels and confidence intervals. Extensions to wavelet analysis such as filtering, the power Hovmöller, cross-wavelet spectra, and coherence are described. The statistical significance tests are used to give a qu...
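As a rough illustration of the wavelet power spectrum such a guide is organized around, here is a minimal frequency-domain Morlet transform. The scale set, omega0 = 6, and the toy two-period signal are assumptions for the example; the guide's exact normalization, cone of influence, and red-noise significance tests are not reproduced.

```python
# Minimal sketch: Morlet continuous wavelet transform computed in the frequency domain.
# Scales, omega0, and the toy signal are illustrative choices.
import numpy as np

def morlet_cwt_power(x, dt, scales, omega0=6.0):
    """Return wavelet power |W(s, t)|^2 with scales in rows and time in columns."""
    n = x.size
    x_hat = np.fft.fft(x - x.mean())
    omega = 2.0 * np.pi * np.fft.fftfreq(n, d=dt)        # angular frequencies
    power = np.empty((len(scales), n))
    for i, s in enumerate(scales):
        # Fourier transform of the Morlet wavelet at scale s (analytic: omega > 0 only)
        psi_hat = (np.pi ** -0.25) * np.sqrt(2.0 * np.pi * s / dt) \
                  * np.exp(-0.5 * (s * omega - omega0) ** 2) * (omega > 0)
        W = np.fft.ifft(x_hat * np.conj(psi_hat))
        power[i] = np.abs(W) ** 2
    return power

# Toy example: a signal whose dominant period changes halfway through
dt = 0.25
t = np.arange(0, 128, dt)
x = np.where(t < 64, np.sin(2 * np.pi * t / 4), np.sin(2 * np.pi * t / 8))
scales = 2.0 * dt * 2.0 ** np.arange(0, 7, 0.5)
print(morlet_cwt_power(x, dt, scales).shape)             # (n_scales, n_times)
```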
Regularization paths for generalized linear models via coordinate descent
, 2009
"... We develop fast algorithms for estimation of generalized linear models with convex penalties. The models include linear regression, twoclass logistic regression, and multinomial regression problems while the penalties include ℓ1 (the lasso), ℓ2 (ridge regression) and mixtures of the two (the elastic ..."
Abstract
-
Cited by 724 (15 self)
- Add to MetaCart
(Show Context)
We develop fast algorithms for estimation of generalized linear models with convex penalties. The models include linear regression, two-class logistic regression, and multinomial regression problems, while the penalties include ℓ1 (the lasso), ℓ2 (ridge regression) and mixtures of the two (the elastic net). The algorithms use cyclical coordinate descent, computed along a regularization path. The methods can handle large problems and can also deal efficiently with sparse features. In comparative timings we find that the new algorithms are considerably faster than competing methods.
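A minimal sketch of the coordinate-descent idea in its simplest instance, the Gaussian (lasso) case: each coefficient is updated by soft-thresholding a partial residual, cycling over coordinates. The objective scaling, fixed iteration count, and synthetic data are assumptions for illustration; this is not the glmnet implementation.

```python
# Minimal sketch: cyclical coordinate descent for the lasso (Gaussian case).
# The objective is (1/2n)||y - X b||^2 + lam * ||b||_1.
import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    n, p = X.shape
    b = np.zeros(p)
    r = y - X @ b                        # current residual
    col_sq = (X ** 2).sum(axis=0) / n    # per-coordinate curvature
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the partial residual (coordinate j removed)
            rho = X[:, j] @ r / n + col_sq[j] * b[j]
            b_new = soft_threshold(rho, lam) / col_sq[j]
            r += X[:, j] * (b[j] - b_new)    # incremental residual update
            b[j] = b_new
    return b

# Synthetic data with a sparse true coefficient vector
rng = np.random.default_rng(2)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[[0, 3, 7]] = [1.0, -2.0, 0.5]
y = X @ beta + 0.5 * rng.standard_normal(n)
print(np.round(lasso_cd(X, y, lam=0.1)[:10], 2))
```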
Image denoising using a scale mixture of Gaussians in the wavelet domain
- IEEE TRANS IMAGE PROCESSING
, 2003
"... We describe a method for removing noise from digital images, based on a statistical model of the coefficients of an overcomplete multiscale oriented basis. Neighborhoods of coefficients at adjacent positions and scales are modeled as the product of two independent random variables: a Gaussian vecto ..."
Abstract
-
Cited by 513 (17 self)
- Add to MetaCart
(Show Context)
We describe a method for removing noise from digital images, based on a statistical model of the coefficients of an overcomplete multiscale oriented basis. Neighborhoods of coefficients at adjacent positions and scales are modeled as the product of two independent random variables: a Gaussian vector and a hidden positive scalar multiplier. The latter modulates the local variance of the coefficients in the neighborhood, and is thus able to account for the empirically observed correlation between the coefficient amplitudes. Under this model, the Bayesian least squares estimate of each coefficient reduces to a weighted average of the local linear estimates over all possible values of the hidden multiplier variable. We demonstrate through simulations with images contaminated by additive white Gaussian noise that the performance of this method substantially surpasses that of previously published methods, both visually and in terms of mean squared error.
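The key computation, a Bayes least squares estimate that averages local Wiener (linear) estimates over the hidden multiplier, can be illustrated in a heavily simplified scalar form. The prior on the multiplier, its grid, and all parameters below are assumptions; the actual method operates on neighborhoods of oriented pyramid coefficients with full covariance matrices.

```python
# Heavily simplified scalar sketch of a Bayes least squares estimate under a
# Gaussian scale mixture model: y = sqrt(z)*u + w, u ~ N(0, sigma_u2), w ~ N(0, sigma_w2).
# The grid and prior on the hidden multiplier z are illustrative assumptions.
import numpy as np

def bls_gsm_scalar(y, sigma_u2, sigma_w2, z_grid, z_prior):
    """E[x | y] where x = sqrt(z)*u, averaging Wiener estimates over the posterior of z."""
    var_y = z_grid * sigma_u2 + sigma_w2
    lik = np.exp(-0.5 * y ** 2 / var_y) / np.sqrt(2 * np.pi * var_y)   # p(y | z)
    post = lik * z_prior
    post /= post.sum()                                                  # p(z | y)
    wiener = (z_grid * sigma_u2 / var_y) * y                            # E[x | y, z]
    return np.sum(post * wiener)

z_grid = np.logspace(-2, 2, 41)          # candidate multipliers
z_prior = 1.0 / z_grid                   # Jeffreys-like prior (unnormalized)
z_prior /= z_prior.sum()
for y in (0.1, 1.0, 5.0):
    print(y, round(bls_gsm_scalar(y, 1.0, 0.25, z_grid, z_prior), 3))
```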
From Sparse Solutions of Systems of Equations to Sparse Modeling of Signals and Images
, 2007
"... A full-rank matrix A ∈ IR n×m with n < m generates an underdetermined system of linear equations Ax = b having infinitely many solutions. Suppose we seek the sparsest solution, i.e., the one with the fewest nonzero entries: can it ever be unique? If so, when? As optimization of sparsity is combin ..."
Abstract
-
Cited by 427 (36 self)
- Add to MetaCart
(Show Context)
A full-rank matrix A ∈ ℝⁿˣᵐ with n < m generates an underdetermined system of linear equations Ax = b having infinitely many solutions. Suppose we seek the sparsest solution, i.e., the one with the fewest nonzero entries: can it ever be unique? If so, when? As optimization of sparsity is combinatorial in nature, are there efficient methods for finding the sparsest solution? These questions have been answered positively and constructively in recent years, exposing a wide variety of surprising phenomena; in particular, the existence of easily verifiable conditions under which optimally sparse solutions can be found by concrete, effective computational methods. Such theoretical results inspire a bold perspective on some important practical problems in signal and image processing. Several well-known signal and image processing problems can be cast as demanding solutions of underdetermined systems of equations. Such problems have previously seemed, to many, intractable. There is considerable evidence that these problems often have sparse solutions. Hence, advances in finding sparse solutions to underdetermined systems energize research on such signal and image processing problems – to striking effect. In this paper we review the theoretical results on sparse solutions of linear systems, empirical...
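As a concrete example of the "concrete, effective computational methods" this literature studies, here is a minimal sketch of a greedy pursuit (orthogonal matching pursuit) for Ax = b. The fixed sparsity level k and the random unit-norm dictionary are illustrative assumptions.

```python
# Minimal sketch: orthogonal matching pursuit (OMP) for a sparse solution of Ax = b.
# The sparsity level k and the random dictionary are illustrative choices.
import numpy as np

def omp(A, b, k):
    """Greedily select k columns of A and least-squares fit b on that support."""
    n, m = A.shape
    r = b.copy()
    support = []
    for _ in range(k):
        # Pick the column most correlated with the current residual
        j = int(np.argmax(np.abs(A.T @ r)))
        if j not in support:
            support.append(j)
        # Re-fit on the selected support and update the residual
        x_s, *_ = np.linalg.lstsq(A[:, support], b, rcond=None)
        r = b - A[:, support] @ x_s
    x = np.zeros(m)
    x[support] = x_s
    return x

rng = np.random.default_rng(3)
n, m = 30, 80
A = rng.standard_normal((n, m))
A /= np.linalg.norm(A, axis=0)               # unit-norm columns
x_true = np.zeros(m); x_true[[5, 17, 42]] = [1.5, -2.0, 0.8]
b = A @ x_true
print(np.nonzero(omp(A, b, k=3))[0])         # recovered support
```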
Sure independence screening for ultra-high dimensional feature space
, 2006
"... Variable selection plays an important role in high dimensional statistical modeling which nowa-days appears in many areas and is key to various scientific discoveries. For problems of large scale or dimensionality p, estimation accuracy and computational cost are two top concerns. In a recent paper, ..."
Abstract
-
Cited by 283 (26 self)
- Add to MetaCart
Variable selection plays an important role in high dimensional statistical modeling, which nowadays appears in many areas and is key to various scientific discoveries. For problems of large scale or dimensionality p, estimation accuracy and computational cost are two top concerns. In a recent paper, Candes and Tao (2007) propose the Dantzig selector using L1 regularization and show that it achieves the ideal risk up to a logarithmic factor log p. Their innovative procedure and remarkable result are challenged when the dimensionality is ultra high, as the factor log p can be large and their uniform uncertainty principle can fail. Motivated by these concerns, we introduce the concept of sure screening and propose a sure screening method based on correlation learning, called Sure Independence Screening (SIS), to reduce dimensionality from high to a moderate scale that is below sample size. In a fairly general asymptotic framework, SIS is shown to have the sure screening property even for exponentially growing dimensionality. As a methodological extension, an iterative SIS (ISIS) is also proposed to enhance its finite sample performance. With dimension reduced accurately from high to below sample size, variable selection can be improved in both speed and accuracy, and can then be ac-
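The screening step itself is simple to sketch: rank features by the magnitude of their marginal correlation with the response and keep the top d, reducing p to below the sample size before any refined selection. The choice d = n / log n and the synthetic data below are illustrative assumptions.

```python
# Minimal sketch of Sure Independence Screening by componentwise correlation.
# The screening size d = n/log(n) and the synthetic data are illustrative choices.
import numpy as np

def sis(X, y, d):
    """Indices of the d features with the largest |marginal correlation| with y."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = (y - y.mean()) / y.std()
    omega = np.abs(Xc.T @ yc) / len(y)            # componentwise sample correlations
    return np.argsort(omega)[::-1][:d]

rng = np.random.default_rng(4)
n, p = 200, 2000                                  # ultra-high dimensional: p >> n
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[[0, 1, 2]] = [2.0, -1.5, 1.0]
y = X @ beta + rng.standard_normal(n)
kept = sis(X, y, d=int(n / np.log(n)))
print(sorted(set(kept.tolist()) & {0, 1, 2}))     # true features retained by the screen
```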
Wavelet Thresholding via a Bayesian Approach
- J. R. STATIST. SOC. B
, 1996
"... We discuss a Bayesian formalism which gives rise to a type of wavelet threshold estimation in non-parametric regression. A prior distribution is imposed on the wavelet coefficients of the unknown response function, designed to capture the sparseness of wavelet expansion common to most applications. ..."
Abstract
-
Cited by 262 (33 self)
- Add to MetaCart
We discuss a Bayesian formalism which gives rise to a type of wavelet threshold estimation in non-parametric regression. A prior distribution is imposed on the wavelet coefficients of the unknown response function, designed to capture the sparseness of wavelet expansion common to most applications. For the prior specified, the posterior median yields a thresholding procedure. Our prior model for the underlying function can be adjusted to give functions falling in any specific Besov space. We establish a relation between the hyperparameters of the prior model and the parameters of those Besov spaces within which realizations from the prior will fall. Such a relation gives insight into the meaning of the Besov space parameters. Moreover, the established relation makes it possible in principle to incorporate prior knowledge about the function's regularity properties into the prior model for its wavelet coefficients. However, prior knowledge about a function's regularity properties might b...
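The thresholding behaviour of the posterior median can be shown numerically for a single coefficient under a simple "point mass at zero plus Gaussian" prior; the hyperparameters below are assumptions, whereas the paper ties them to Besov space parameters and applies the rule coefficient-by-coefficient across the wavelet decomposition.

```python
# Minimal numerical sketch: posterior median of a wavelet coefficient theta under
# a spike-and-slab prior theta ~ pi*N(0, tau^2) + (1 - pi)*delta_0, observed as
# d = theta + N(0, sigma^2) noise. Hyperparameters are illustrative assumptions.
import numpy as np
from scipy.stats import norm

def posterior_median(d, sigma, tau, pi):
    s2 = sigma ** 2 + tau ** 2
    # Posterior weights of the slab (nonzero) and spike (zero) components
    w1 = pi * norm.pdf(d, 0.0, np.sqrt(s2))
    w0 = (1 - pi) * norm.pdf(d, 0.0, sigma)
    w1, w0 = w1 / (w1 + w0), w0 / (w1 + w0)
    # Conditional on theta != 0, the posterior is Gaussian:
    mu = tau ** 2 * d / s2
    sd = np.sqrt(sigma ** 2 * tau ** 2 / s2)
    # Posterior CDF is w1*Phi((m - mu)/sd) plus a jump of w0 at m = 0; find where it hits 0.5
    cdf_at_zero = w1 * norm.cdf((0.0 - mu) / sd)
    if cdf_at_zero <= 0.5 <= cdf_at_zero + w0:
        return 0.0                                  # the atom at zero covers the median: threshold
    if 0.5 < cdf_at_zero:
        return mu + sd * norm.ppf(0.5 / w1)
    return mu + sd * norm.ppf((0.5 - w0) / w1)

for d in (0.3, 1.0, 3.0):
    print(d, round(posterior_median(d, sigma=1.0, tau=2.0, pi=0.5), 3))
```

Small observed values are mapped exactly to zero while large values are only mildly shrunk, which is the thresholding behaviour the abstract describes.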
A Sparse Signal Reconstruction Perspective for Source Localization With Sensor Arrays
, 2005
"... We present a source localization method based on a sparse representation of sensor measurements with an overcomplete basis composed of samples from the array manifold. We enforce sparsity by imposing penalties based on the 1-norm. A number of recent theoretical results on sparsifying properties of ..."
Abstract
-
Cited by 231 (6 self)
- Add to MetaCart
We present a source localization method based on a sparse representation of sensor measurements with an overcomplete basis composed of samples from the array manifold. We enforce sparsity by imposing penalties based on the ℓ1-norm. A number of recent theoretical results on sparsifying properties of ℓ1 penalties justify this choice. Explicitly enforcing the sparsity of the representation is motivated by a desire to obtain a sharp estimate of the spatial spectrum that exhibits super-resolution. We propose to use the singular value decomposition (SVD) of the data matrix to summarize multiple time or frequency samples. Our formulation leads to an optimization problem, which we solve efficiently in a second-order cone (SOC) programming framework by an interior point implementation. We propose a grid refinement method to mitigate the effects of limiting estimates to a grid of spatial locations and introduce an automatic selection criterion for the regularization parameter involved in our approach. We demonstrate the effectiveness of the method on simulated data by plots of spatial spectra and by comparing the estimator variance to the Cramér–Rao bound (CRB). We observe that our approach has a number of advantages over other source localization techniques, including increased resolution, improved robustness to noise, limitations in data quantity, and correlation of the sources, as well as not requiring an accurate initialization.
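A stripped-down sketch of the core idea, for a single snapshot, with the paper's SOC/interior-point solver replaced by plain proximal gradient (ISTA) on an ℓ1-penalized least squares surrogate: the array geometry, angle grid, penalty level, and source scenario are all illustrative assumptions, and the SVD summarization and grid refinement steps are not reproduced.

```python
# Minimal sketch: l1-penalized spatial spectrum over an angle grid for one array
# snapshot, solved by proximal gradient (ISTA) rather than the paper's SOC approach.
import numpy as np

def steering_matrix(m, angles_deg, spacing=0.5):
    """Steering vectors of an m-sensor uniform linear array (half-wavelength spacing)."""
    k = np.arange(m)[:, None]
    return np.exp(2j * np.pi * spacing * k * np.sin(np.deg2rad(angles_deg))[None, :])

def l1_spectrum(y, A, lam, n_iter=500):
    """Minimize 0.5*||y - A s||^2 + lam*||s||_1 over complex s."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2            # 1 / Lipschitz constant
    s = np.zeros(A.shape[1], dtype=complex)
    for _ in range(n_iter):
        g = s - step * (A.conj().T @ (A @ s - y))     # gradient step
        mag = np.maximum(np.abs(g), 1e-12)
        s = g * np.maximum(1 - step * lam / mag, 0)   # complex soft threshold
    return np.abs(s)

rng = np.random.default_rng(5)
m = 16
grid = np.arange(-90.0, 90.1, 5.0)                    # candidate directions (degrees)
A = steering_matrix(m, grid)
y = steering_matrix(m, [-20.0, 25.0]) @ np.array([1.0, 0.8]) \
    + 0.05 * (rng.standard_normal(m) + 1j * rng.standard_normal(m))
spec = l1_spectrum(y, A, lam=1.0)
print(grid[np.argsort(spec)[-2:]])                    # two largest peaks of the spectrum
```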
Bivariate Shrinkage Functions for Wavelet-Based Denoising Exploiting Interscale Dependency
, 2002
"... Most simple nonlinear thresholding rules for wavelet-based denoising assume that the wavelet coefficients are independent. However, wavelet coefficients of natural images have significant dependencies. In this paper, we will only consider the dependencies between the coefficients and their parents i ..."
Abstract
-
Cited by 206 (8 self)
- Add to MetaCart
Most simple nonlinear thresholding rules for wavelet-based denoising assume that the wavelet coefficients are independent. However, wavelet coefficients of natural images have significant dependencies. In this paper, we will only consider the dependencies between the coefficients and their parents in detail. For this purpose, new non-Gaussian bivariate distributions are proposed, and corresponding nonlinear threshold functions (shrinkage functions) are derived from the models using Bayesian estimation theory. The new shrinkage functions do not assume the independence of wavelet coefficients. We present three image denoising examples to illustrate the performance of these new bivariate shrinkage rules. In the second example, a simple subband-dependent, data-driven image denoising system is described and compared with effective data-driven techniques in the literature, namely VisuShrink, SureShrink, BayesShrink, and hidden Markov models. In the third example, the same idea is applied to the dual-tree complex wavelet coefficients.
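A bivariate shrinkage rule of the form derived in this line of work shrinks a child coefficient by an amount that depends on the joint magnitude of the child and its parent. The sketch below assumes that specific form, with the noise level sigma_n and the signal standard deviation sigma treated as known constants; in practice both would be estimated per subband.

```python
# Minimal sketch: bivariate shrinkage of a child coefficient y1 using its parent y2.
# The sqrt(3) constant, sigma_n, and sigma are assumptions for this illustration.
import numpy as np

def bivariate_shrink(y1, y2, sigma_n, sigma):
    """Shrink child coefficients y1 based on the joint magnitude with their parents y2."""
    r = np.sqrt(y1 ** 2 + y2 ** 2)
    gain = np.maximum(r - np.sqrt(3.0) * sigma_n ** 2 / sigma, 0.0) / np.maximum(r, 1e-12)
    return gain * y1

# Example: a strong parent keeps a small child coefficient from being shrunk to zero
y1 = np.array([0.5, 0.5, 3.0])
y2 = np.array([0.1, 4.0, 4.0])          # parent coefficients at the coarser scale
print(bivariate_shrink(y1, y2, sigma_n=1.0, sigma=2.0))
```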