Regularization and variable selection via the Elastic Net
- Journal of the Royal Statistical Society, Series B, 2005
Cited by 159 (5 self)
Summary. We propose the elastic net, a new regularization and variable selection method. Real world data and a simulation study show that the elastic net often outperforms the lasso, while enjoying a similar sparsity of representation. In addition, the elastic net encourages a grouping effect, where strongly correlated predictors tend to be in or out of the model together. The elastic net is particularly useful when the number of predictors (p) is much bigger than the number of observations (n). By contrast, the lasso is not a very satisfactory variable selection method in the p ≫ n case. An algorithm called LARS-EN is proposed for computing elastic net regularization paths efficiently, much like algorithm LARS does for the lasso.
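The grouping effect described in this abstract can be seen in a small experiment. The sketch below is an illustration only: it minimises a combined L1 plus squared-L2 penalised least-squares objective by cyclic coordinate descent, which is a swapped-in solver and not the LARS-EN algorithm the abstract proposes. It also omits the coefficient rescaling and any intercept or standardisation. The function names, the objective scaling, the penalty values and the toy data are all assumptions made for the example.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def naive_elastic_net(X, y, lam1, lam2, sweeps=500):
    """Minimise (1/2n)||y - Xb||^2 + lam1*||b||_1 + (lam2/2)*||b||^2.

    Hypothetical helper: plain cyclic coordinate descent, no intercept,
    no standardisation, fixed number of sweeps instead of a stopping rule.
    """
    n, p = len(X), len(X[0])
    b = [0.0] * p
    resid = list(y)  # residual y - Xb; equals y because b starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            # (1/n) * x_j' (partial residual with coordinate j removed)
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * b[j]
            new_bj = soft_threshold(rho, lam1) / (col_sq[j] + lam2)
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                b[j] = new_bj
    return b


# Toy data (assumed): columns 0 and 1 are exact copies, column 2 is noise.
random.seed(0)
n = 40
z = [random.gauss(0, 1) for _ in range(n)]
noise = [random.gauss(0, 1) for _ in range(n)]
X = [[z[i], z[i], noise[i]] for i in range(n)]
y = [3.0 * z[i] + 0.1 * random.gauss(0, 1) for i in range(n)]

coef = naive_elastic_net(X, y, lam1=0.1, lam2=0.5)
print(coef)  # the two duplicated predictors receive (numerically) equal weights
```

With lam2 > 0 the objective is strictly convex, so two identical columns must share the weight equally. With lam2 = 0 (the lasso) any split of the weight between them gives the same objective value, so the solution is not unique.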
Sparse graphical models for exploring gene expression data
- Journal of Multivariate Analysis, 2004
Cited by 98 (19 self)
DMS-0112069. Any opinions, findings, and conclusions or recommendations expressed in this material are
Consensus clustering -- A resampling-based method for class discovery and visualization of gene expression microarray data
- Machine Learning, Functional Genomics Special Issue, 2003
Correcting sample selection bias by unlabeled data
Cited by 69 (5 self)
We consider the scenario where training and test data are drawn from different distributions, commonly referred to as sample selection bias. Most algorithms for this setting try to first recover sampling distributions and then make appropriate corrections based on the distribution estimate. We present a nonparametric method which directly produces resampling weights without distribution estimation. Our method works by matching distributions between training and testing sets in feature space. Experimental results demonstrate that our method works well in practice.
Bayesian Factor Regression Models in the "Large p, Small n" Paradigm
- Bayesian Statistics, 2003
Cited by 66 (11 self)
TOR REGRESSION MODELS 1.1 SVD Regression. Begin with the linear model y = Xβ + ε where y is the n-vector of responses, X is the n × p matrix of predictors, β is the p-vector regression parameter, and ε ~ N(0, σ²I) is the n-vector error term. Of key interest are cases when p >> n, when X is "long and skinny." The standard empirical factor (principal component) regression is best represented using the reduced singular-value decomposition (SVD) of X, namely X = FA where F is the n × k factor matrix (columns are factors, rows are samples) and A is the k × p SVD "loadings" matrix, subject to AA′ = I and F′F = D² where D is the diagonal matrix of k positive singular values, arranged in decreasing order. This reduced form assumes factors with zero singular values have been ignored without loss; k ≤ n with equality only if all singular values are positive. Now the regression transforms via Xβ = Fθ where θ = Aβ is the k-vector of regression parameters for the factor variables, representing
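As a worked illustration of why the factor form is convenient, consider plain least squares on the factor variables. This is not the Bayesian model the paper develops. Assume X = UDV′ is the reduced SVD with k positive singular values d_1, ..., d_k, so that F = UD and A = V′. The symbols U, V, f_j (the j-th column of F) and the hat notation are introduced here for the example.

```latex
\begin{aligned}
X &= U D V^{\top}, \qquad F = U D, \qquad A = V^{\top}, \\
F^{\top} F &= D\, U^{\top} U\, D = D^{2}, \\
\hat{\theta} &= (F^{\top} F)^{-1} F^{\top} y = D^{-2} F^{\top} y,
\qquad
\hat{\theta}_j = \frac{f_j^{\top} y}{d_j^{2}}, \quad j = 1, \dots, k .
\end{aligned}
```

Because F′F is diagonal, each factor coefficient is estimated separately from the others, and the problem has k ≤ n unknowns rather than p.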
Boosting for high-dimensional linear models
- The Annals of Statistics
Cited by 21 (4 self)
We prove that boosting with the squared error loss, L2Boosting, is consistent for very high-dimensional linear models, where the number of predictor variables is allowed to grow essentially as fast as O(exp(sample size)), assuming that the true underlying regression function is sparse in terms of the ℓ1-norm of the regression coefficients. In the language of signal processing, this means consistency for de-noising using a strongly overcomplete dictionary if the underlying signal is sparse in terms of the ℓ1-norm. We also propose here an AIC-based method for tuning, namely for choosing the number of boosting iterations. This makes L2Boosting computationally attractive since it is not required to run the algorithm multiple times for cross-validation as commonly used so far. We demonstrate L2Boosting for simulated data, in particular where the predictor dimension is large in comparison to sample size, and for a difficult tumor-classification problem with gene expression microarray data.
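A minimal sketch of boosting with squared error loss in a linear model follows. It assumes a componentwise least-squares base learner: at each step the single predictor that best fits the current residuals is selected, and its coefficient is moved a small step toward the least-squares fit. The function name `l2boost`, the step length `nu`, the fixed iteration count (used here in place of the AIC-based stopping rule the abstract mentions) and the toy data are assumptions for illustration. Intercepts and centering are left out.

```python
import random


def l2boost(X, y, steps=200, nu=0.1):
    """Componentwise least-squares boosting for a linear model (sketch).

    Hypothetical helper: X is a list of rows, y a list of responses.
    Returns the accumulated coefficient vector after `steps` iterations.
    """
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    resid = list(y)  # current residuals; equal to y before any update
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(steps):
        best_j, best_gain, best_beta = None, 0.0, 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            dot = sum(X[i][j] * resid[i] for i in range(n))
            gain = dot * dot / col_sq[j]  # drop in RSS from a full fit on column j
            if gain > best_gain:
                best_j, best_gain, best_beta = j, gain, dot / col_sq[j]
        if best_j is None:
            break  # residuals are orthogonal to every column
        coef[best_j] += nu * best_beta  # shrunken update of one coefficient
        for i in range(n):
            resid[i] -= nu * best_beta * X[i][best_j]
    return coef


# Toy data (assumed): more predictors than samples, two true signals.
random.seed(1)
n, p = 30, 60
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] - 1.5 * row[5] + 0.1 * random.gauss(0, 1) for row in X]

coef = l2boost(X, y)
rss_before = sum(v * v for v in y)
fitted = [sum(c * x for c, x in zip(coef, row)) for row in X]
rss_after = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
print(rss_before, rss_after)
```

Each update changes only one coefficient, so the number of non-zero coefficients grows slowly with the number of iterations. That is why the stopping iteration acts as the tuning parameter.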
Finding Predictive Gene Groups from Microarray Data
- Journal of Multivariate Analysis, 2004
Cited by 20 (4 self)
Microarray experiments generate large datasets with expression values for thousands of genes, but not more than a few dozen samples. A challenging task with these data is to reveal groups of genes which act together and whose collective expression is strongly associated with an outcome variable of interest. To find these groups, we suggest the use of supervised algorithms: these are procedures which use external information about the response variable for grouping the genes. We present Pelora, an algorithm based on penalized logistic regression analysis, that combines gene selection, gene grouping and sample classification in a supervised, simultaneous way. With an empirical study on six different microarray datasets, we show that Pelora identifies gene groups whose expression centroids have very good predictive potential and yield results that can keep up with state-of-the-art classification methods based on single genes. Thus, our gene groups can be beneficial in medical diagnostics and prognostics, but they may also provide more biological insights into gene function and regulation.
A robust procedure for gaussian graphical model search from microarray data with p larger than n
- Journal of Machine Learning Research, 2006
Cited by 19 (1 self)
Learning of large-scale networks of interactions from microarray data is an important and challenging problem in bioinformatics. A widely used approach is to assume that the available data constitute a random sample from a multivariate distribution belonging to a Gaussian graphical model. As a consequence, the prime objects of inference are full-order partial correlations, which are partial correlations between two variables given the remaining ones. In the context of microarray data the number of variables exceeds the sample size, and this precludes the application of traditional structure learning procedures because a sampling version of full-order partial correlations does not exist. In this paper we consider limited-order partial correlations, that is, partial correlations computed on marginal distributions of manageable size, and provide a set of rules that allow one to assess the usefulness of these quantities to derive the independence structure of the underlying Gaussian graphical model. Furthermore, we introduce a novel structure learning procedure based on a quantity, obtained from limited-order partial correlations, that we call the non-rejection rate. The applicability and usefulness of the procedure are demonstrated by both simulated and real data.
Regression approaches for microarray data analysis
- Journal of Computational Biology, 2003
Cited by 18 (2 self)
A variety of new procedures have been devised to handle the two sample comparison (e.g., tumor versus normal tissue) of gene expression values as measured with microarrays. Such new methods are required in part because of some defining characteristics of microarray-based studies: (i) the very large number of genes contributing expression measures which far exceeds the number of samples (observations) available, and (ii) the fact that by virtue of pathway/network relationships, the gene expression measures tend to be highly correlated. These concerns are exacerbated in the regression setting, where the objective is to relate gene expression, simultaneously for multiple genes, to some external outcome or phenotype. Correspondingly, several methods have been recently proposed for addressing these issues. We briefly critique some of these methods prior to a detailed evaluation of gene harvesting. This reveals that gene harvesting, without additional constraints, can yield artifactual solutions. Results obtained employing such constraints motivate the use of regularized regression procedures such as the lasso, least angle regression, and support vector machines. Model selection and solution multiplicity issues are also discussed. The methods are evaluated using a microarray-based study of cardiomyopathy in transgenic mice.
Key words: cardiomyopathy, covariance inflation criterion, gene harvesting, lasso, least angle regression, microarray, model selection, support vector machine.

