## Covariance regularization by thresholding (2007)

Citations: 65 (9 self)

### BibTeX

@TECHREPORT{Bickel07covarianceregularization,
  author      = {Peter J. Bickel and Elizaveta Levina},
  title       = {Covariance regularization by thresholding},
  institution = {},
  year        = {2007}
}

### Abstract

This paper considers regularizing a covariance matrix of p variables estimated from n observations, by hard thresholding. We show that the thresholded estimate is consistent in the operator norm as long as the true covariance matrix is sparse in a suitable sense, the variables are Gaussian or sub-Gaussian, and (log p)/n → 0, and obtain explicit rates. The results are uniform over families of covariance matrices which satisfy a fairly natural notion of sparsity. We discuss an intuitive resampling scheme for threshold selection and prove a general cross-validation result that justifies this approach. We also compare thresholding to other covariance estimators in simulations and on an example from climate data.
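The estimator described in the abstract is simple to state concretely: compute the sample covariance and zero out every entry whose magnitude falls below a threshold of order $\sqrt{\log p / n}$. A minimal sketch in Python/NumPy (our own illustration, not code from the paper; the constant in the threshold and all variable names are ours, and the paper chooses the threshold by resampling rather than fixing a constant):

```python
import numpy as np

def hard_threshold(sigma_hat, s):
    """Entrywise hard thresholding: keep entries with |entry| >= s, zero the rest.

    This is the operator T_s(M) = [m_ij * 1(|m_ij| >= s)] from the paper;
    because it acts entrywise, it is invariant to variable permutations.
    """
    return np.where(np.abs(sigma_hat) >= s, sigma_hat, 0.0)

rng = np.random.default_rng(0)
p, n = 30, 200
# A sparse "true" covariance: tridiagonal, so most entries are exactly zero.
true_cov = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
X = rng.multivariate_normal(np.zeros(p), true_cov, size=n)
sample_cov = np.cov(X, rowvar=False)

# Threshold at the rate sqrt(log p / n) suggested by the theory; the
# constant 2.0 is an arbitrary illustrative choice (the paper selects
# the threshold by a resampling/cross-validation scheme).
s = 2.0 * np.sqrt(np.log(p) / n)
thresholded = hard_threshold(sample_cov, s)
```

Note that hard thresholding does not guarantee a positive-definite result; the operator-norm consistency established in the paper is what keeps the eigenvalues of the thresholded estimate close to the truth when the covariance is sparse.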

### Citations

2046 | Matrix Computations
- Golub, Van Loan
- 1996
Citation Context: ...where $\|x\|_r^r = \sum_{j=1}^p |x_j|^r$. In particular, we write $\|M\| = \|M\|_{(2,2)}$ for the operator norm, which for a symmetric matrix is given by $\|M\| = |\lambda_{\max}(M)|$. For symmetric matrices we have (see e.g. [15]) $\|M\| \le (\|M\|_{(1,1)} \|M\|_{(\infty,\infty)})^{1/2} = \|M\|_{(1,1)} = \max_j \sum_i |m_{ij}|$. We also use the Frobenius matrix norm, $\|M\|_F^2 = \sum_{i,j} m_{ij}^2 = \mathrm{tr}(MM^T)$. We define the thresholding operator by (2) $T_s(M) = [m_{ij} 1(|m_{i}...$
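The norm facts quoted in this context are easy to check numerically. A small sketch (ours, not from the paper) verifying, for a random symmetric matrix, that the operator norm is bounded by the maximum absolute row-sum norm $\|M\|_{(1,1)}$ and that $\|M\|_F^2 = \mathrm{tr}(MM^T)$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
M = (A + A.T) / 2  # symmetrize so the quoted identities apply

# ||M|| = max |eigenvalue| for a symmetric matrix (operator/spectral norm)
op_norm = np.max(np.abs(np.linalg.eigvalsh(M)))
# ||M||_(1,1) = max_j sum_i |m_ij|; for symmetric M this equals ||M||_(inf,inf)
row_sum_norm = np.max(np.abs(M).sum(axis=0))
# ||M||_F^2 = tr(M M^T)
frob_sq = np.trace(M @ M.T)

assert op_norm <= row_sum_norm + 1e-12   # ||M|| <= ||M||_(1,1)
assert np.isclose(frob_sq, np.linalg.norm(M, "fro") ** 2)
```

The bound $\|M\| \le \|M\|_{(1,1)}$ for symmetric matrices is what lets the paper control the operator norm entrywise, which is why entrywise thresholding yields operator-norm consistency.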

884 | Ideal Spatial Adaptation by Wavelet Shrinkage, Biometrika
- Donoho, Johnstone
- 1994
Citation Context: ...ue structure. Thresholding, on the other hand, is applicable to many more situations. In fact, our treatment is in many respects similar to the pioneering work on thresholding of Donoho and Johnstone [8] and the recent work of Johnstone and Silverman [19] and Abramovich et al. [1]. The rest of this paper is organized as follows. In Section 2 we introduce the thresholding estimator and our notion of s...

370 | Variable selection via nonconcave penalised likelihood and its oracle properties
- Fan, Li
- 2001
Citation Context: ...ut all of these are computationally intensive. A faster algorithm that employs the lasso was proposed by Friedman et al. [16]. This approach has also been extended to more general penalties like SCAD [15] by Lam and Fan [25] and Fan et al. [14]. In specific applications, there have been other permutation-invariant approaches that use different notions of sparsity: Zou et al. [35] apply the lasso penal...

338 | The Concentration of Measure Phenomenon
- Ledoux
- 2001
Citation Context: ...can take $J \sim n^\kappa$ for any $\kappa < \infty$ if $q > 0$, and if $p \sim n^\delta$, even if $q = 0$. 2. Similar results can be obtained for banding. 3. The assumption of Gaussianity can be relaxed. By applying Corollary 4.10 from Ledoux [27], we can include distributions $F$ of $X = A\varepsilon$, where $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_p)^T$ and the $\varepsilon_j$ are i.i.d. with $|\varepsilon_j| \le c < \infty$ (thanks to N. El Karoui for pointing this out). 4. Simulation results. The simulation results we pres...

260 | Weak Convergence and Empirical Processes: With Applications to Statistics
- van der Vaart, Wellner
- 1996
Citation Context: ...pect $\rho(J)$ to be slowly varying. We begin with two essential technical results of possibly independent interest. 3.2. An inequality. We note an inequality derivable from a classic one of Pinelis; see [28] for instance. Proposition 1. Let $U_1, \ldots, U_n$ be i.i.d. $p$-variate vectors with $E|U_1|^2 \le K$, $EU_1 = 0$. Let $v_1, \ldots, v_J$ be fixed $p$-variate vectors of length 1. Define for $x \in \mathbb{R}^p$... Then, $E\|\sum_{i=...}$

251 | Sparse inverse covariance estimation with the graphical lasso
- Friedman, Hastie, et al.
Citation Context: ...omposition to re-parametrize the concentration matrix [30], but all of these are computationally intensive. A faster algorithm that employs the lasso was proposed by Friedman et al. [16]. This approach has also been extended to more general penalties like SCAD [15] by Lam and Fan [25] and Fan et al. [14]. In specific applications, there have been other permutation-invariant approache...

211 | A Distribution-Free Theory of Nonparametric Regression
- Györfi, Kohler, Krzyżak, et al.
- 2002
Citation Context: ...imate $\hat\mu_c$ is defined as follows. Let $\bar W_B = \frac{1}{B}\sum_{j=1}^B W_{n+j}$. Then, $\hat\mu_c \equiv \arg\min_j |\bar W_B - \hat\mu_j|^2$. Here is our basic result, which has in some form appeared in Györfi et al. [16] (Ch. 7, Theorem 7.1, p. 101), Bickel et al. [4], and Dudoit and van der Laan [9]. The published proof in [16] appears to be in error and does not directly apply to our case, so we give the proof of...
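The rule quoted in this context, $\hat\mu_c = \arg\min_j |\bar W_B - \hat\mu_j|^2$, simply picks the candidate estimate closest to a held-out average. A minimal sketch of that selection rule (all names, dimensions, and numbers here are ours, for illustration only):

```python
import numpy as np

def select_by_heldout(candidates, heldout_mean):
    """Return the index j minimizing |W_bar_B - mu_hat_j|^2.

    A sketch of the cross-validation rule: among candidate estimates,
    pick the one closest (in squared Euclidean distance) to the average
    of a held-out block of observations.
    """
    dists = [np.sum((heldout_mean - c) ** 2) for c in candidates]
    return int(np.argmin(dists))

rng = np.random.default_rng(2)
mu = np.array([1.0, -2.0])
# Held-out block W_{n+1}, ..., W_{n+B} with B = 50 (an assumed setup).
W = rng.normal(loc=mu, scale=1.0, size=(50, 2))
W_bar = W.mean(axis=0)

# Three hypothetical candidate estimates; the second is closest to mu.
candidates = [np.zeros(2), np.array([1.0, -2.0]), np.array([5.0, 5.0])]
best = select_by_heldout(candidates, W_bar)
```

The cross-validation theorem cited here justifies exactly this kind of selection: under mild conditions, the candidate chosen by the held-out criterion performs nearly as well as the best candidate in the list.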

209 | On the distribution of the largest eigenvalue in principal components analysis
- Johnstone
Citation Context: ...large. Many results in random matrix theory illustrate this, from the classical Marčenko–Pastur law [24] to the more recent work of Johnstone and his students on the theory of the largest eigenvalues [12, 20, 25] and associated eigenvectors [21]. However, with the exception of a method for estimating the covariance spectrum [11], these probabilistic results do not offer alternatives to the sample covariance m...

183 | Distribution of eigenvalues for some sets of random matrices
- Marčenko, Pastur
- 1967
Citation Context: ...iate Gaussian distribution, $N_p(\mu, \Sigma_p)$, is not a good estimator of the population covariance if $p$ is large. Many results in random matrix theory illustrate this, from the classical Marčenko–Pastur law [24] to the more recent work of Johnstone and his students on the theory of the largest eigenvalues [12, 20, 25] and associated eigenvectors [21]. However, with the exception of a method for estimating th...

173 | A direct formulation for sparse PCA using semidefinite programming
- d’Aspremont, Ghaoui, et al.
- 2004
Citation Context: ...-invariant approaches that use different notions of sparsity: Zou et al. [31] apply the lasso penalty to loadings in PCA to achieve sparse representation; d’Aspremont et al. [6] compute sparse principal components by semi-definite programming; Johnstone and Lu [21] regularize PCA by moving to a sparse basis and thresholding; and Fan et al. [13] impose sparsity on the covaria...

141 | A well-conditioned estimator for large-dimensional covariance matrices
- Ledoit, Wolf
- 2004
Citation Context: ...at all. These applications require estimators invariant under variable permutations. Shrinkage estimators are in this category and have been proposed early on [7, 17]. More recently, Ledoit and Wolf [22] proposed an estimator where the optimal amount of shrinkage is estimated from data. Shrinkage estimators shrink the over-dispersed sample covariance eigenvalues, but they do not change the eigenvecto...

140 | Sparse Principal Component Analysis
- Zou, Hastie, et al.
- 2004
Citation Context: ...D [15] by Lam and Fan [25] and Fan, Fan and Lv [14]. In specific applications, there have been other permutation-invariant approaches that use different notions of sparsity: Zou, Hastie and Tibshirani [36] apply the lasso penalty to loadings in PCA to achieve sparse representation; d’Aspremont et al. [6] compute sparse principal components by semidefinite programming; Johnstone and Lu [24] regularize P...

119 | Model selection and estimation in the Gaussian graphical model
- Yuan, Lin
Citation Context: ...ators shrink the over-dispersed sample covariance eigenvalues, but they do not change the eigenvectors, which are also inconsistent [21], and do not result in sparse estimators. Several recent papers [5, 26, 30] construct a sparse permutation-invariant estimate of the inverse of the covariance matrix, also known as the concentration or precision matrix. Sparse concentration matrices are of interest in graphi...

113 | Adapting to unknown sparsity by controlling the False Discovery Rate
- Abramovich, Benjamini, et al.
- 2006
Citation Context: ...ations. In fact, our treatment is in many respects similar to the pioneering work on thresholding of Donoho and Johnstone [8] and the recent work of Johnstone and Silverman [19] and Abramovich et al. [1]. The rest of this paper is organized as follows. In Section 2 we introduce the thresholding estimator and our notion of sparsity, prove the convergence result, and compare to results of El Karoui (Se...

94 | Regularized Estimation of Large Covariance Matrices
- Bickel, Levina
- 2008
Citation Context: ...e that variables far apart in the ordering are only weakly correlated, and those invariant to variable permutations. The first class includes regularizing the covariance matrix by banding or tapering [2, 3, 14], which we will discuss below. It also includes estimators based on regularizing the Cholesky factor of the inverse covariance matrix. These methods use the fact that the entries of the Cholesky facto...

89 | Empirical Bayes selection of wavelet thresholds. Unpublished manuscript
- Johnstone, Silverman
- 2003
Citation Context: ...pplicable to many more situations. In fact, our treatment is in many respects similar to the pioneering work on thresholding of Donoho and Johnstone [8] and the recent work of Johnstone and Silverman [19] and Abramovich et al. [1]. The rest of this paper is organized as follows. In Section 2 we introduce the thresholding estimator and our notion of sparsity, prove the convergence result, and compare t...

82 | Sparse permutation invariant covariance estimation
- Rothman, Bickel, et al.
Citation Context: ...ators shrink the over-dispersed sample covariance eigenvalues, but they do not change the eigenvectors, which are also inconsistent [21], and do not result in sparse estimators. Several recent papers [5, 26, 30] construct a sparse permutation-invariant estimate of the inverse of the covariance matrix, also known as the concentration or precision matrix. Sparse concentration matrices are of interest in graphi...

77 | Some theory for Fisher’s linear discriminant function, “naive Bayes,” and some alternatives when there are many more variables than observations
- Bickel, Levina
- 2004
Citation Context: ...e that variables far apart in the ordering are only weakly correlated, and those invariant to variable permutations. The first class includes regularizing the covariance matrix by banding or tapering [2, 3, 14], which we will discuss below. It also includes estimators based on regularizing the Cholesky factor of the inverse covariance matrix. These methods use the fact that the entries of the Cholesky facto...

55 | First-order methods for sparse covariance selection
- d’Aspremont, Banerjee, et al.
Citation Context: ...ators shrink the over-dispersed sample covariance eigenvalues, but they do not change the eigenvectors, which are also inconsistent [21], and do not result in sparse estimators. Several recent papers [5, 26, 30] construct a sparse permutation-invariant estimate of the inverse of the covariance matrix, also known as the concentration or precision matrix. Sparse concentration matrices are of interest in graphi...

54 | Covariance matrix selection and estimation via penalised normal likelihood
- Huang, Liu, et al.
Citation Context: ...These methods use the fact that the entries of the Cholesky factor have a regression interpretation, which allows application of regression regularization tools such as the lasso and ridge penalties [18], or the nested lasso penalty [23] specifically designed for the ordered-variables situation. Banding the Cholesky factor has also been proposed [3, 29]. These estimators are appropriate for a number ...

48 | Estimation of high-dimensional prior and posterior covariance matrices in Kalman filter variants
- Furrer, Bengtsson
- 2007
Citation Context: ...e that variables far apart in the ordering are only weakly correlated, and those invariant to variable permutations. The first class includes regularizing the covariance matrix by banding or tapering [2, 3, 14], which we will discuss below. It also includes estimators based on regularizing the Cholesky factor of the inverse covariance matrix. These methods use the fact that the entries of the Cholesky facto...

43 | Tracy-Widom limit for the largest eigenvalue of a large class of complex sample covariance matrices
- El Karoui
Citation Context: ...large. Many results in random matrix theory illustrate this, from the classical Marčenko–Pastur law [24] to the more recent work of Johnstone and his students on the theory of the largest eigenvalues [12, 20, 25] and associated eigenvectors [21]. However, with the exception of a method for estimating the covariance spectrum [11], these probabilistic results do not offer alternatives to the sample covariance m...

43 | Sparsistency and rates of convergence in large covariance matrix estimation
- Lam, Fan
Citation Context: ...computationally intensive. A faster algorithm that employs the lasso was proposed by Friedman et al. [16]. This approach has also been extended to more general penalties like SCAD [15] by Lam and Fan [25] and Fan et al. [14]. In specific applications, there have been other permutation-invariant approaches that use different notions of sparsity: Zou et al. [35] apply the lasso penalty to loadings in PC...

42 | High-dimensional covariance matrix estimation using a factor model, Journal of Econometrics
- Fan, Fan, et al.
- 2007
Citation Context: ...resentation; d’Aspremont et al. [6] compute sparse principal components by semi-definite programming; Johnstone and Lu [21] regularize PCA by moving to a sparse basis and thresholding; and Fan et al. [13] impose sparsity on the covariance via a factor model, which is often appropriate in finance applications. In this paper, we propose thresholding of the sample covariance matrix as a simple and permut...

38 | Nonparametric estimation of large covariance matrices of longitudinal data
- Wu, Pourahmadi
- 2003
Citation Context: ...zation tools such as the lasso and ridge penalties [18], or the nested lasso penalty [23] specifically designed for the ordered-variables situation. Banding the Cholesky factor has also been proposed [3, 29]. These estimators are appropriate for a number of applications with ordered data (time series, spectroscopy, climate data). For climate applications and other spatial data, since there is no total or...

37 | Sparse principal components analysis
- Johnstone, Lu
- 2008
Citation Context: ...ry illustrate this, from the classical Marčenko–Pastur law [24] to the more recent work of Johnstone and his students on the theory of the largest eigenvalues [12, 20, 25] and associated eigenvectors [21]. However, with the exception of a method for estimating the covariance spectrum [11], these probabilistic results do not offer alternatives to the sample covariance matrix...

32 | Operator norm consistent estimation of large-dimensional sparse covariance matrices
- El Karoui
- 2008
Citation Context: ...pose thresholding of the sample covariance matrix as a simple and permutation-invariant method of covariance regularization. This idea has been simultaneously and independently developed by El Karoui [10], who studied it under a special notion of sparsity called β-sparsity (see details in Section 2.4). Here we develop a natural permutation-invariant notion of sparsity which, though more specialized than...

32 | Sparse Estimation of Large Covariance Matrices via a Nested Lasso Penalty
- Levina, Rothman, et al.
- 2008
Citation Context: ...he entries of the Cholesky factor have a regression interpretation, which allows application of regression regularization tools such as the lasso and ridge penalties [18], or the nested lasso penalty [23] specifically designed for the ordered-variables situation. Banding the Cholesky factor has also been proposed [3, 29]. These estimators are appropriate for a number of applications with ordered data ...

28 | Spectrum estimation for large dimensional covariance matrices using random matrix theory
- El Karoui
- 2008
Citation Context: ...rk of Johnstone and his students on the theory of the largest eigenvalues [12, 20, 25] and associated eigenvectors [21]. However, with the exception of a method for estimating the covariance spectrum [11], these probabilistic results do not offer alternatives to the sample covariance matrix...

26 | Estimation of a covariance matrix under Stein’s loss
- Dey, Srinivasan
- 1985
Citation Context: ...no notion of distance between variables at all. These applications require estimators invariant under variable permutations. Shrinkage estimators are in this category and have been proposed early on [7, 17]. More recently, Ledoit and Wolf [22] proposed an estimator where the optimal amount of shrinkage is estimated from data. Shrinkage estimators shrink the over-dispersed sample covariance eigenvalues, ...

26 | Empirical Bayes estimation of the multivariate normal covariance matrix
- Haff
- 1980
Citation Context: ...no notion of distance between variables at all. These applications require estimators invariant under variable permutations. Shrinkage estimators are in this category and have been proposed early on [7, 17]. More recently, Ledoit and Wolf [22] proposed an estimator where the optimal amount of shrinkage is estimated from data. Shrinkage estimators shrink the over-dispersed sample covariance eigenvalues, ...

22 | Limit Theorems for Large Deviations
- Saulis, Statulevičius
- 1991
Citation Context: ...The second term above is, by the union sum inequality, (10) $\max_i |\bar X_i|^2 = O_P\!\left(\frac{\log p}{n}\right)$, since $F$ is Gaussian and $\sigma_{ii} \le M$ for all $i$. By a result of Saulis and Statulevičius [27] adapted for this context in Lemma 3 of [3], and $\sigma_{ii} \le M$ for all $i$, (11) $P\!\left(\max_{i,j} |\hat\sigma^0_{ij} - \sigma_{ij}| \ge t\right) \le p^2 e^{-\delta n t^2}$, if $t = o(1)$. We now recap an argument of Donoho and Johnstone [8]. Bound $\|T_t...$

21 | Asymptotics of the leading sample eigenvalues for a spiked covariance model. Available at http://www-stat.stanford.edu/ debashis
- Paul
- 2004
Citation Context: ...large. Many results in random matrix theory illustrate this, from the classical Marčenko–Pastur law [24] to the more recent work of Johnstone and his students on the theory of the largest eigenvalues [12, 20, 25] and associated eigenvectors [21]. However, with the exception of a method for estimating the covariance spectrum [11], these probabilistic results do not offer alternatives to the sample covariance m...

21 | Network exploration via the adaptive LASSO and SCAD penalties
- Fan, Feng, et al.
- 2009
Citation Context: ...nsive. A faster algorithm that employs the lasso was proposed by Friedman et al. [16]. This approach has also been extended to more general penalties like SCAD [15] by Lam and Fan [25] and Fan et al. [14]. In specific applications, there have been other permutation-invariant approaches that use different notions of sparsity: Zou et al. [35] apply the lasso penalty to loadings in PCA to achieve sparse ...

20 | Operator norm consistent estimation of large-dimensional sparse covariance matrices
- El Karoui
- 2008
Citation Context: ...pose thresholding of the sample covariance matrix as a simple and permutation-invariant method of covariance regularization. This idea has been simultaneously and independently developed by El Karoui [10], who studied it under a special notion of sparsity called β-sparsity (see details in Section 2.4). Here we develop a natural permutation-invariant notion of sparsity which, though more specialized th...

12 | Some theory for generalized boosting algorithms
- Bickel, Ritov, et al.
- 2006
Citation Context: ...$\bar W_B = \frac{1}{B}\sum_{j=1}^B W_{n+j}$. Then, $\hat\mu_c \equiv \arg\min_j |\bar W_B - \hat\mu_j|^2$. Here is our basic result, which has in some form appeared in Györfi et al. [16] (Ch. 7, Theorem 7.1, p. 101), Bickel et al. [4], and Dudoit and van der Laan [9]. The published proof in [16] appears to be in error and does not directly apply to our case, so we give the proof of our statement for completeness. Theorem 3. Supp...

7 | Asymptotics of cross-validated risk estimation in estimator selection and performance assessment
- Dudoit, van der Laan
- 2005
Citation Context: ...Then, $\hat\mu_c \equiv \arg\min_j |\bar W_B - \hat\mu_j|^2$. Here is our basic result, which has in some form appeared in Györfi et al. [16] (Ch. 7, Theorem 7.1, p. 101), Bickel et al. [4], and Dudoit and van der Laan [9]. The published proof in [16] appears to be in error and does not directly apply to our case, so we give the proof of our statement for completeness. Theorem 3. Suppose, (A1) $|\hat\mu_o - \mu(P)|^2 = \Omega_p(r...$

5 | Sparse principal components analysis. Unpublished manuscript
- Johnstone, Lu
- 2004
Citation Context: ...ry illustrate this, from the classical Marčenko–Pastur law [29] to the more recent work of Johnstone and his students on the theory of the largest eigenvalues [12, 23, 30] and associated eigenvectors [24]. However, with the exception of a method for estimating the covariance spectrum [11], these probabilistic results do not offer alternatives to the sample covariance matrix. Alternative estimators for...

3 | Sparse principal components analysis. Journal of Computational and Graphical Statistics, 15:265–286
- Zou, Hastie, et al.
- 2006
Citation Context: ...these are very computationally intensive. In specific applications, there have been other permutation-invariant approaches that use different notions of sparsity: Zou et al. [31] apply the lasso penalty to loadings in PCA to achieve sparse representation; d’Aspremont et al. [6] compute sparse principal components by semi-definite programming; Johnstone and Lu [21] regularize ...

3 | Estimation of a covariance matrix under Stein’s loss
- Dey, Srinivasan
- 1985
Citation Context: ...no notion of distance between variables at all. These applications require estimators invariant under variable permutations. Shrinkage estimators are in this category and have been proposed early on [7, 20]. More recently, Ledoit and Wolf [26] proposed an estimator where the optimal amount of shrinkage is estimated from data. Shrinkage estimators shrink the over-dispersed sample covariance eigenvalues, b...

2 | Limit Theorems for Large Deviations. Kluwer
- Saulis, Statulevičius
- 1991
Citation Context: ...(9) $\hat\Sigma = \hat\Sigma^0 - \bar X \bar X^T$, where $\hat\Sigma^0 \equiv [\hat\sigma^0_{ij}] = \frac{1}{n}\sum_{k=1}^n X_k X_k^T$. Note that, by (8), $\max_{i,j} |\hat\sigma^0_{ij} - \sigma_{ij}| \le \max_{i,j} |\hat\sigma_{ij} - \sigma_{ij}| + \max_{i,j} |\bar X_i \bar X_j|$. By a result of Saulis and Statulevičius [32] adapted for this context in Lemma 3 of [3], and $\sigma_{ii} \le M$ for all $i$, (10) $P\!\left(\max_{i,j} |\hat\sigma^0_{ij} - \sigma_{ij}| \ge t\right) \le C_1 p^2 e^{-C_2 n t^2}$, for $|t| < \delta$, where $C_1$, $C_2$ and $\delta$ are constants depending only on $M$. In particul...

1 | A well-conditioned estimator for large-dimensional covariance matrices
- Ledoit, Wolf
- 2003
Citation Context: ...at all. These applications require estimators invariant under variable permutations. Shrinkage estimators are in this category and have been proposed early on [7, 20]. More recently, Ledoit and Wolf [26] proposed an estimator where the optimal amount of shrinkage is estimated from data. Shrinkage estimators shrink the over-dispersed sample covariance eigenvalues, but they do not change the eigenvecto...

1 | Spectrum estimation for large-dimensional covariance matrices using random matrix theory
- El Karoui
Citation Context: ...rk of Johnstone and his students on the theory of the largest eigenvalues [12, 23, 30] and associated eigenvectors [24]. However, with the exception of a method for estimating the covariance spectrum [11], these probabilistic results do not offer alternatives to the sample covariance matrix. Alternative estimators for large covariance matrices have therefore attracted a lot of attention recently. Two ...