Results 1–10 of 173
Consistency of spectral clustering
, 2004
Abstract

Cited by 286 (15 self)
Consistency is a key property of statistical algorithms when the data are drawn from some underlying probability distribution. Surprisingly, despite decades of work, little is known about the consistency of most clustering algorithms. In this paper we investigate the consistency of a popular family of spectral clustering algorithms, which cluster the data with the help of eigenvectors of graph Laplacian matrices. We show that one of the two major classes of spectral clustering (normalized clustering) converges under some very general conditions, while the other (unnormalized) is consistent only under strong additional assumptions, which, as we demonstrate, are not always satisfied in real data. We conclude that our analysis provides strong evidence for the superiority of normalized spectral clustering in practical applications. We believe that the methods used in our analysis will provide a basis for future exploration of Laplacian-based methods in a statistical setting.
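The normalized variant discussed above can be sketched in a few lines: build a similarity graph, form the symmetric normalized Laplacian L = I − D^(−1/2) W D^(−1/2), and cluster from its low eigenvectors. This is a minimal illustration, not the paper's construction: the Gaussian similarity graph, the bandwidth sigma, and the sign-based split (a stand-in for k-means on the spectral embedding when k = 2) are all illustrative choices.

```python
import numpy as np

# Hypothetical toy data: two well-separated groups on the real line.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 0.1, 10), rng.normal(5.0, 0.1, 10)])

# Gaussian similarity graph; the bandwidth sigma is an illustrative choice.
sigma = 1.0
W = np.exp(-((X[:, None] - X[None, :]) ** 2) / (2 * sigma**2))
np.fill_diagonal(W, 0.0)

# Symmetric normalized graph Laplacian: L = I - D^(-1/2) W D^(-1/2).
d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt

# For k = 2, the sign pattern of the eigenvector for the second-smallest
# eigenvalue already separates the two groups (a stand-in for k-means).
eigvals, eigvecs = np.linalg.eigh(L)
labels = (eigvecs[:, 1] > 0).astype(int)
```

For k > 2 clusters one would instead run k-means on the rows of the first k eigenvectors.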
On Model Selection Consistency of Lasso
, 2006
Abstract

Cited by 231 (16 self)
Sparsity or parsimony of statistical models is crucial for their proper interpretations, as in sciences and social sciences. Model selection is a commonly used method to find such models, but usually involves a computationally heavy combinatorial search. Lasso (Tibshirani, 1996) is now being used as a computationally feasible alternative to model selection.
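As an illustration of that computational feasibility, the Lasso can be fit by cyclic coordinate descent with soft-thresholding, with no combinatorial search at all. A minimal sketch on synthetic data (the objective scaling, the penalty level lam, and the toy problem are illustrative assumptions, not from the paper):

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=200):
    """Minimize (1/(2n))*||y - X b||^2 + lam*||b||_1 by coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X**2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]          # partial residual
            rho = X[:, j] @ r_j / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b

# Toy problem: 10 candidate variables, only the first two truly active.
rng = np.random.default_rng(3)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:2] = 3.0
y = X @ beta + 0.1 * rng.standard_normal(n)
b_hat = lasso_cd(X, y, lam=0.1)   # recovers the two active coefficients
```

The soft-threshold update zeroes out coordinates whose partial correlation falls below lam, which is exactly where the sparsity of the solution comes from.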
On the distribution of the largest eigenvalue in principal components analysis
 Ann. Statist
, 2001
Abstract

Cited by 197 (2 self)
Let x₍₁₎ denote the square of the largest singular value of an n × p matrix X, all of whose entries are independent standard Gaussian variates. Equivalently, x₍₁₎ is the largest principal component variance of the covariance matrix X′X, or the largest eigenvalue of a p-variate Wishart distribution on n degrees of freedom with identity covariance. Consider the limit of large p and n with n/p → γ ≥ 1. When centered by µ_p = (√(n−1) + √p)² and scaled by σ_p = (√(n−1) + √p)(1/√(n−1) + 1/√p)^(1/3), the distribution of x₍₁₎ approaches the Tracy–Widom law of order 1, which is defined in terms of the Painlevé II differential equation and can be numerically evaluated and tabulated in software. Simulations show the approximation to be informative for n and p as small as 5. The limit is derived via a corresponding result for complex Wishart matrices using methods from random matrix theory. The result suggests that some aspects of large-p multivariate distribution theory may be easier to apply in practice than their fixed-p counterparts.
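The centering and scaling constants in the abstract are explicit enough to check by simulation. A small Monte Carlo sketch (the dimensions and the tolerance on the rescaled value are illustrative; the Tracy–Widom law of order 1 has mean ≈ −1.21 and standard deviation ≈ 1.27, so the rescaled statistic should be of order one):

```python
import numpy as np

# Monte Carlo check of the centering/scaling above (dimensions illustrative):
#   mu_p    = (sqrt(n-1) + sqrt(p))^2
#   sigma_p = (sqrt(n-1) + sqrt(p)) * (1/sqrt(n-1) + 1/sqrt(p))**(1/3)
rng = np.random.default_rng(1)
n, p = 200, 100
mu_p = (np.sqrt(n - 1) + np.sqrt(p)) ** 2
sigma_p = (np.sqrt(n - 1) + np.sqrt(p)) * (1 / np.sqrt(n - 1) + 1 / np.sqrt(p)) ** (1 / 3)

X = rng.standard_normal((n, p))
x1 = np.linalg.eigvalsh(X.T @ X).max()   # largest eigenvalue of X'X
z = (x1 - mu_p) / sigma_p                # approximately Tracy-Widom, order 1
```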
No eigenvalues outside the support of the limiting spectral distribution of large-dimensional sample covariance matrices
 Ann. Probab. 26
, 1998
Abstract

Cited by 107 (18 self)
We consider a class of matrices of the form Cn = (1/N)(Rn + σXn)(Rn + σXn)∗, where Xn is an n × N matrix consisting of independent standardized complex entries, Rn is an n × N nonrandom matrix, and σ > 0. Among several applications, Cn can be viewed as a sample correlation matrix, where information is contained in (1/N)RnRn∗, but each column of Rn is contaminated by noise. As n → ∞, if n/N → c > 0 and the empirical distribution of the eigenvalues of (1/N)RnRn∗ converges to a proper probability distribution, then the empirical distribution of the eigenvalues of Cn converges a.s. to a nonrandom limit. In this paper we show that, under certain conditions on Rn, for any closed interval in R⁺ outside the support of the limiting distribution, almost surely no eigenvalues of Cn appear in that interval for all large n.
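The special case Rn = 0, σ = 1 reduces to a white Wishart matrix, whose limiting spectrum is the Marchenko–Pastur law supported on [(1 − √c)², (1 + √c)²], and the no-eigenvalues-outside-the-support phenomenon is then easy to observe numerically. A sketch with real entries for simplicity (the support formula is the same; dimensions are illustrative):

```python
import numpy as np

# White-noise special case Rn = 0, sigma = 1 (dimensions illustrative):
# the spectrum of (1/N) X X' stays inside the Marchenko-Pastur support
# [(1 - sqrt(c))^2, (1 + sqrt(c))^2], with c = n/N.
rng = np.random.default_rng(5)
n, N = 200, 400
X = rng.standard_normal((n, N))
eigs = np.linalg.eigvalsh(X @ X.T / N)
c = n / N
lo, hi = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
```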
Sure independence screening for ultrahigh dimensional feature space
, 2006
Abstract

Cited by 90 (12 self)
Variable selection plays an important role in high-dimensional statistical modeling, which nowadays appears in many areas and is key to various scientific discoveries. For problems of large scale or dimensionality p, estimation accuracy and computational cost are two top concerns. In a recent paper, Candes and Tao (2007) propose the Dantzig selector using L1 regularization and show that it achieves the ideal risk up to a logarithmic factor log p. Their innovative procedure and remarkable result are challenged when the dimensionality is ultra-high, as the factor log p can be large and their uniform uncertainty principle can fail. Motivated by these concerns, we introduce the concept of sure screening and propose a sure screening method based on correlation learning, called Sure Independence Screening (SIS), to reduce dimensionality from high to a moderate scale that is below the sample size. In a fairly general asymptotic framework, SIS is shown to have the sure screening property even for exponentially growing dimensionality. As a methodological extension, an iterative SIS (ISIS) is also proposed to enhance its finite-sample performance. With dimension reduced accurately from high to below the sample size, variable selection can be improved in both speed and accuracy, and can then be ac ...
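The correlation-learning screen at the heart of SIS is simple to sketch: rank all features by the magnitude of their marginal correlation with the response and keep the top d. A toy illustration (the dimensions, signal strength, and cutoff d are illustrative assumptions, not the paper's recommended choices):

```python
import numpy as np

# Toy SIS sketch: p = 1000 candidate features, only the first 3 active
# (dimensions, signal strength, and cutoff d are illustrative).
rng = np.random.default_rng(2)
n, p, d = 100, 1000, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 5.0
y = X @ beta + rng.standard_normal(n)

# Rank features by absolute marginal correlation with y; keep the top d.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
selected = np.argsort(-np.abs(corr))[:d]   # indices surviving the screen
```

A full pipeline would then run a refined selector such as the Lasso on the d surviving features, now a problem below the sample size.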
The sparsity and bias of the Lasso selection in high-dimensional linear regression
 Ann. Statist. 36(4), 1567–1594
, 2008
Abstract

Cited by 80 (14 self)
showed that, for neighborhood selection in Gaussian graphical models, under a neighborhood stability condition, the LASSO is consistent even when the number of variables is of greater order than the sample size. Zhao and Yu [(2006) J. Machine Learning Research 7 2541–2567] formalized the neighborhood stability condition in the context of linear regression as a strong irrepresentable condition. That paper showed that under this condition, the LASSO selects exactly the set of nonzero regression coefficients, provided that these coefficients are bounded away from zero at a certain rate. In this paper, the regression coefficients outside an ideal model are assumed to be small, but not necessarily zero. Under a sparse Riesz condition on the correlation of design variables, we prove that the LASSO selects a model of the correct order of dimensionality, controls the bias of the selected model at a level determined by the contributions of small regression coefficients and threshold bias, and selects all coefficients of greater order than the bias of the selected model. Moreover, as a consequence of this rate consistency of the LASSO in model selection, it is proved that the sum of error squares for the mean response and the ℓα-loss for the regression coefficients converge at the best possible rates under the given conditions. An interesting aspect of our results is that the logarithm of the number of variables can be of the same order as the sample size for certain random dependent designs.
Concentration of the Spectral Measure for Large Matrices
, 2000
Abstract

Cited by 65 (11 self)
We derive concentration inequalities for functions of the empirical measure of eigenvalues for large, random, self-adjoint matrices, with not necessarily Gaussian entries. The results presented apply in particular to non-Gaussian Wigner and Wishart matrices. We also provide concentration bounds for noncommutative functionals of random matrices. 1 Introduction and statement of results. Consider a random N × N Hermitian matrix X with i.i.d. complex entries (except for the symmetry constraint) satisfying a moment condition. It is well known since Wigner [28] that the spectral measure of N^(−1/2)X converges to the semicircle law. This observation has been generalized to a large class of matrices, e.g. sample covariance matrices of the form XRX∗ where R is a deterministic diagonal matrix ([19]), band matrices (see [5, 16, 20]), etc. For the Wigner case, this convergence has been supplemented by central limit theorems, see [15] for the case of Gaussian entries and [17], [22] for the gen...
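The Wigner convergence recalled in the introduction above — the spectral measure of N^(−1/2)X approaching the semicircle law on [−2, 2] — is easy to observe numerically. A small sketch with real symmetric entries (matrix size and tolerances are illustrative):

```python
import numpy as np

# Sample a real symmetric Wigner matrix with unit off-diagonal variance
# and check that the spectrum of N^(-1/2) W concentrates on [-2, 2].
rng = np.random.default_rng(4)
N = 400
A = rng.standard_normal((N, N))
W = (A + A.T) / np.sqrt(2)
eigs = np.linalg.eigvalsh(W / np.sqrt(N))
```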
A note on universality of the distribution of the largest eigenvalues in certain sample covariance matrices
 J. Statist. Phys
, 2002
Abstract

Cited by 60 (3 self)
Recently Johansson (21) and Johnstone (16) proved that the distribution of the (properly rescaled) largest principal component of the complex (real) Wishart matrix X∗X (XᵗX) converges to the Tracy–Widom law as n, p (the dimensions of X) tend to ∞ in some ratio n/p → c > 0. We extend these results in two directions. First of all, we prove that the joint distribution of the first, second, third, etc. eigenvalues of a Wishart matrix converges (after a proper rescaling) to the Tracy–Widom distribution. Second of all, we explain how the combinatorial machinery developed for Wigner random matrices in refs. 27, 38, and 39 allows one to extend the results by Johansson and Johnstone to the case of X with non-Gaussian entries, provided n − p = O(p^(1/3)). We also prove that λmax ≤ (√n + √p)² + O(√p · log(p)) (a.e.) for general c > 0. KEY WORDS: Sample covariance matrices; principal component; Tracy–Widom distribution.
High-SNR power offset in multiantenna communication
 IEEE Transactions on Information Theory
, 2005
Abstract

Cited by 59 (13 self)
Abstract—The analysis of multiple-antenna capacity in the high-SNR regime has hitherto focused on the high-SNR slope (or maximum multiplexing gain), which quantifies the multiplicative increase in capacity as a function of the number of antennas. This traditional characterization is unable to assess the impact of prominent channel features since, for a majority of channels, the slope equals the minimum of the number of transmit and receive antennas. Furthermore, a characterization based solely on the slope captures only the scaling but has no notion of the power required for a certain capacity. This paper advocates a more refined characterization whereby, as a function of SNR (in dB), the high-SNR capacity is expanded as an affine function where the impact of channel features such as antenna correlation, unfaded components, etc., resides in the zero-order term or power offset. The power offset, for which we find insightful closed-form expressions, is shown to play a chief role for SNR levels of practical interest. Index Terms—Antenna correlation, channel capacity, coherent communication, fading channels, high-SNR analysis, multiantenna arrays, Ricean channels.
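The affine characterization is easiest to see in the simplest case: for the scalar AWGN channel the high-SNR slope is 1 and the power offset is zero, so log2(1 + SNR) converges to log2(SNR) as SNR grows. A sanity-check sketch (the SNR grid is an illustrative choice):

```python
import numpy as np

# Scalar AWGN sanity check of the affine high-SNR expansion: the capacity
# log2(1 + SNR) approaches slope-1 * log2(SNR) with zero power offset.
snr_db = np.array([30.0, 40.0, 50.0])
snr = 10 ** (snr_db / 10)
C = np.log2(1 + snr)           # exact capacity, bits per channel use
affine = np.log2(snr)          # affine expansion: slope 1, power offset 0
err = np.abs(C - affine)       # vanishes as SNR grows
```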