## Sure independence screening for ultra-high dimensional feature space (2006)

Citations: 90 (12 self)

### BibTeX

```bibtex
@techreport{Fan06sureindependence,
  author      = {Jianqing Fan and Jinchi Lv},
  title       = {Sure independence screening for ultra-high dimensional feature space},
  institution = {},
  year        = {2006}
}
```

### Abstract

Variable selection plays an important role in high dimensional statistical modeling, which nowadays appears in many areas and is key to various scientific discoveries. For problems of large scale or dimensionality p, estimation accuracy and computational cost are two top concerns. In a recent paper, Candes and Tao (2007) propose the Dantzig selector using L1 regularization and show that it achieves the ideal risk up to a logarithmic factor log p. Their innovative procedure and remarkable result are challenged when the dimensionality is ultra high, as the factor log p can be large and their uniform uncertainty principle can fail. Motivated by these concerns, we introduce the concept of sure screening and propose a sure screening method based on correlation learning, called Sure Independence Screening (SIS), to reduce dimensionality from high to a moderate scale that is below the sample size. In a fairly general asymptotic framework, SIS is shown to have the sure screening property even for exponentially growing dimensionality. As a methodological extension, an iterative SIS (ISIS) is also proposed to enhance its finite sample performance. With dimension reduced accurately from high to below the sample size, variable selection can be improved in both speed and accuracy, and can then be accomplished by a well-developed method such as the SCAD, the Dantzig selector, the Lasso, or the adaptive Lasso.
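The screening step the abstract describes is simple: rank the features by the absolute value of their marginal (componentwise) correlation with the response and keep the top d < n. A minimal NumPy sketch, assuming the paper's vanilla (non-iterative) SIS; the function name `sis` and the toy data are illustrative, not from the paper:

```python
import numpy as np

def sis(X, y, d):
    """Sure Independence Screening (sketch): return the indices of the
    d features whose marginal correlation with y is largest in absolute
    value. Assumes no column of X is constant."""
    Xc = X - X.mean(axis=0)          # center each feature
    yc = y - y.mean()                # center the response
    # omega_j = sample correlation between feature j and y
    omega = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(omega))[:d]

# Toy example: n = 50 samples, p = 200 features, y driven by feature 3 only.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))
y = X[:, 3].copy()
selected = sis(X, y, d=5)            # feature 3 should survive screening
```

After screening, a refined but more expensive selector (SCAD, Lasso, Dantzig selector) is run on only the d surviving features, which is the two-stage pipeline the abstract outlines.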

### Citations

2308 | A decision-theoretic generalization of on-line learning and an application to boosting. EuroCOLT - Freund, Schapire - 1995 |

1652 | Atomic decomposition by basis pursuit - Chen, Donoho, et al. - 2001 |

1227 | Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring - Golub, Slonim, et al. - 1999 |

842 | Ideal spatial adaptation by wavelet shrinkage - Donoho, Johnstone - 1994 |

567 | Convergence of Stochastic Processes - Pollard - 1984 |

> Citation context: ...implies that the moment generating function of the random variable $\xi_1 - 1$ is $E e^{t(\xi_1 - 1)} = (1 - 2t)^{-1/2} e^{-t}$ for $t \in (-\infty, 1/2)$. Thus, for any $\varepsilon > 0$ and $0 < t < 1/2$, by Chebyshev's inequality (see, e.g., Pollard, 1984, or van der Vaart and Wellner, 1996) we have $P\left(\frac{\xi_1 + \cdots + \xi_n}{n} > 1 + \varepsilon\right) \le \frac{1}{e^{tn\varepsilon}} E \exp\{t(\xi_1 - 1) + \cdots + t(\xi_n - 1)\} = \exp(-n f_\varepsilon(t))$, where $f_\varepsilon(t) = \frac{1}{2}\log(1 - 2t) + (1 + \varepsilon)t$. Setting the d...

424 | The Dantzig selector: Statistical estimation when p is much larger than n - Candes, Tao - 2007 |

382 | High-dimensional graphs and variable selection with the Lasso - Meinshausen, Buhlmann - 2006 |

362 | Diagnosis of Multiple Cancer Types by Shrunken Centroids of Gene Expression - Tibshirani, Hastie, et al. - 2002 |

359 | Uncertainty principles and ideal atomic decomposition - Donoho, Huo - 2001 |

343 | Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties - Fan, Li - 2001 |

341 | Weak Convergence and Empirical Processes - Vaart, Wellner - 1996 |

326 | The Concentration of Measure Phenomenon - Ledoux - 2001 |

325 | Toeplitz Forms and their Applications - Grenander, Szego - 1958 |

325 | Statistical Significance for Genomewide Studies - Storey, Tibshirani - 2003 |

> Citation context: ...on with componentwise regression is using the two-sample t-test statistic to select features. This has been widely used in the significance analysis of gene selection in microarray data analysis (see Storey and Tibshirani, 2003; Fan and Ren, 2006), including the nearest shrunken centroids method of Tibshirani et al. (2002). In other words, the componentwise regression technique is an insightful and natural extension of a tw...

324 | Regression shrinkage and selection via the Lasso - Tibshirani - 1996 |

> Citation context: ...agram of the approach. When it is desired to reduce the model size further, we can further single out $d'_n$ variables with $d'_n < d_n$ using the Dantzig selector along with hard thresholding, or Lasso (Tibshirani, 1996) with a suitable choice of the penalty parameter. From there, one can apply a more refined but computationally more intensive method such as the SCAD or adaptive Lasso. See Figure 2. These two method...

252 | The adaptive lasso and its oracle properties - Zou - 2006 |

232 | A Statistical View of Some Chemometrics Regression Tools (with discussion), Technometrics - Friedman - 1993 |

222 | On Model Selection Consistency of Lasso - Zhao, Yu - 2006 |

196 | On the distribution of the largest eigenvalue in principal components analysis - Johnstone - 2001 |

190 | Methodologies in spectral analysis of large dimensional random matrices - Bai - 1999 |

185 | Simultaneous analysis of lasso and dantzig selector - Bickel, Ritov, et al. |

167 | Pathwise coordinate optimization - Friedman, Hastie, et al. - 2007 |

149 | Heuristics of instability and stabilization in model selection. The Annals of Statistics 24 - Breiman - 1996 |

> Citation context: ...For example, we may want to keep certain important predictors in the model and choose not to penalize their coefficients. The regularization parameters $\lambda_j$ can be chosen by cross-validation (see, e.g., Breiman, 1996, and Tibshirani, 1996). A unified and efficient algorithm for optimizing penalized likelihood, called local quadratic approximation (LQA), was proposed in Fan and Li (2001) and well studied in Hunter ...

140 | The group lasso for logistic regression - Meier, Geer, et al. |

137 | Asymptotics for Lasso-type estimators - Knight, Fu - 2000 |

126 | Approaches for Bayesian variable selection - George, McCulloch - 1997 |

> Citation context: ...iable selection drastically. It also makes the model selection problem efficient and modular. SIS can be used in conjunction with any model selection techniques, including Bayesian methods (see, e.g., George and McCulloch, 1997). 1.5. Outline of the paper. In Section 2, we present a simple and fast sure screening method, and study its accuracy. We discuss applications of SIS to classification problems in Section 3. In Sectio...

103 | Better Subset Regression Using the Nonnegative Garrote - Breiman - 1995 |

95 | A limit theorem for the norm of random matrices - Geman - 1980 |

91 | High-dimensional data analysis: The curses and blessings of dimensionality - Donoho - 2000 |

90 | Least Angle Regression (with discussion) - Efron, Hastie, et al. - 2004 |

89 | Regularized estimation of large covariance matrices - Bickel, Levina - 2008 |

84 | Reflections on compressed sensing - Candès, Tao |

79 | Nonconcave penalized likelihood with a diverging number of parameters - Fan, Peng |

76 | Sparse additive models - Ravikumar, Lafferty, et al. - 2009 |

76 | The sparsity and bias of the lasso selection in highdimensional linear regression. The Annals of Statistics - Zhang, Huang |

73 | Some theory for Fisher's linear discriminant, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli 10 - Bickel, Levina - 2004 |

> Citation context: ...permutation of the predictors. Therefore, the largest eigenvalue of $\Sigma_n$ usually does not grow too fast with $n$. In addition, Condition 4 holds for the covariance matrix of a stationary time series (see Bickel and Levina, 2004, 2006). See Grenander and Szegö (1984) for more details about the characterization of extreme eigenvalues of the covariance matrix of a stationary process in terms of its spectral density. It is in...

73 | Persistence in high-dimensional linear predictor selection and the virtue of overparametrization - Greenshtein, Ritov |

67 | Limit of the smallest eigenvalue of a large dimensional sample covariance matrix - Bai, Yin - 1993 |

62 | Multivariate Statistics: A Vector Space Approach - Eaton - 1983 |

56 | One-step Sparse Estimates in Nonconcave Penalized Likelihood Models (With Discussion - Zou, Li - 2008 |

53 | Local strong homogeneity of a regularized estimator - Nikolova |

46 | Regularization of wavelet approximations (with discussion) - Antoniadis, Fan |

46 | Variable selection for Cox’s proportional hazards model and frailty - Fan, Li |

46 | Geometric representation of high dimension, low sample size data - Hall, Marron, et al. |

40 | Sparsistency and rates of convergence in large covariance matrix estimation - Lam, Fan - 2009 |

39 | The smallest eigenvalue of a large dimensional Wishart matrix - Silverstein - 1985 |

38 | Asymptotic properties of bridge estimators in sparse high-dimensional regression models - Huang, Horowitz, et al. |

36 | Variable selection using MM algorithms - Hunter, Li |

35 | Statistical challenges with high dimensionality: Feature selection in knowledge discovery - Fan, Li - 2006 |

29 | Maximal sparsity representation via l1 minimization - Donoho, Elad - 2003 |