Supervised harvesting of expression trees (2001)

by T Hastie
Venue: Genome Biol

Results 1 - 10 of 16

Regularization and variable selection via the Elastic Net

by Hui Zou, Trevor Hastie - Journal of the Royal Statistical Society, Series B, 2005
Cited by 159 (5 self)
Summary. We propose the elastic net, a new regularization and variable selection method. Real world data and a simulation study show that the elastic net often outperforms the lasso, while enjoying a similar sparsity of representation. In addition, the elastic net encourages a grouping effect, where strongly correlated predictors tend to be in or out of the model together. The elastic net is particularly useful when the number of predictors (p) is much bigger than the number of observations (n). By contrast, the lasso is not a very satisfactory variable selection method in the p ≫ n case. An algorithm called LARS-EN is proposed for computing elastic net regularization paths efficiently, much like algorithm LARS does for the lasso.
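
The grouping effect described in this abstract can be illustrated with a small sketch. It is not the LARS-EN algorithm from the paper: it swaps in plain coordinate descent on a rescaled form of the naive elastic net objective, 0.5·||y − Xb||² + lam1·||b||₁ + 0.5·lam2·||b||², and omits the paper's final rescaling of the coefficients. The function names, the toy data, and the penalty values are all assumptions made for illustration.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_cd(X, y, lam1, lam2, sweeps=500):
    """Coordinate descent (an assumed stand-in for LARS-EN) minimizing
    0.5 * ||y - X b||^2 + lam1 * ||b||_1 + 0.5 * lam2 * ||b||^2.
    X is a list of rows; returns the coefficient list b."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Partial residual: leave predictor j out of the current fit.
            r = [y[i] - sum(X[i][k] * b[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            norm_sq = sum(X[i][j] ** 2 for i in range(n))
            # The L1 part thresholds, the L2 part inflates the denominator.
            b[j] = soft_threshold(rho, lam1) / (norm_sq + lam2)
    return b


# Toy data: two perfectly correlated predictors (identical columns).
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
y = [2.0, 4.0, 6.0]

b_enet = elastic_net_cd(X, y, lam1=1.0, lam2=1.0)   # weight shared by both
b_lasso = elastic_net_cd(X, y, lam1=1.0, lam2=0.0)  # weight on one predictor
```

With lam2 > 0 the two identical predictors receive equal coefficients. With lam2 = 0 (the lasso) the solution is not unique on this data, and this particular update order happens to put essentially all the weight on the first predictor.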

Cluster Analysis for Gene Expression Data: A Survey

by Daxin Jiang, Chun Tang, Aidong Zhang - IEEE Transactions on Knowledge and Data Engineering, 2004
Cited by 48 (3 self)
Abstract—DNA microarray technology has now made it possible to simultaneously monitor the expression levels of thousands of genes during important biological processes and across collections of related samples. Elucidating the patterns hidden in gene expression data offers a tremendous opportunity for an enhanced understanding of functional genomics. However, the large number of genes and the complexity of biological networks greatly increases the challenges of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. A first step toward addressing this challenge is the use of clustering techniques, which is essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying data. Cluster analysis seeks to partition a given data set into groups based on specified features so that the data points within a group are more similar to each other than the points in different groups. A very rich literature on cluster analysis has developed over the past three decades. Many conventional clustering algorithms have been adapted or directly applied to gene expression data, and also new algorithms have recently been proposed specifically aiming at gene expression data. These clustering algorithms have been proven useful for identifying biologically relevant groups of genes and samples. In this paper, we first briefly introduce the concepts of microarray technology and discuss the basic elements of clustering on gene expression data. In particular, we divide cluster analysis for gene expression data into three categories. Then, we present specific challenges pertinent to each clustering category and introduce several representative approaches. We also discuss the problem of cluster validation in three aspects and review various methods to assess the quality and reliability of clustering results. Finally, we conclude this paper and suggest the promising trends in this field. 
Index Terms—Microarray technology, gene expression data, clustering.

Class prediction by nearest shrunken centroids, with applications to DNA microarrays

by Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, Gilbert Chu - Stat Sci, 2003
Cited by 36 (9 self)
Abstract. We propose a new method for class prediction in DNA microarray studies based on an enhancement of the nearest prototype classifier. Our technique uses “shrunken” centroids as prototypes for each class to identify the subsets of the genes that best characterize each class. The method is general and can be applied to other high-dimensional classification problems. The method is illustrated on data from two gene expression studies: lymphoma and cancer cell lines. Key words and phrases: Sample classification, gene expression arrays.
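
The idea of shrunken centroids as class prototypes can be sketched roughly as follows. This is a simplified illustration, not the authors' procedure: it soft-thresholds the raw difference between each class centroid and the overall centroid, and leaves out the per-gene standardization used in the published method. The function names, the threshold `delta`, and the toy data are assumptions.

```python
def soft_threshold(z, delta):
    """Move z toward zero by delta, setting it to 0 if |z| <= delta."""
    if z > delta:
        return z - delta
    if z < -delta:
        return z + delta
    return 0.0


def shrunken_centroids(samples, labels, delta):
    """Return {class: shrunken centroid}. Each class centroid is pulled
    toward the overall centroid, gene by gene (no standardization here)."""
    p = len(samples[0])
    overall = [sum(s[g] for s in samples) / len(samples) for g in range(p)]
    shrunk = {}
    for cls in sorted(set(labels)):
        members = [s for s, lab in zip(samples, labels) if lab == cls]
        centroid = [sum(m[g] for m in members) / len(members) for g in range(p)]
        shrunk[cls] = [overall[g] + soft_threshold(centroid[g] - overall[g], delta)
                       for g in range(p)]
    return shrunk


def predict(x, shrunk):
    """Assign x to the class whose shrunken centroid is nearest (squared Euclidean)."""
    return min(shrunk,
               key=lambda cls: sum((xi - ci) ** 2 for xi, ci in zip(x, shrunk[cls])))


# Toy data: gene 0 separates the classes, gene 1 is nearly constant.
samples = [[5.0, 1.0], [7.0, 1.2], [1.0, 1.1], [3.0, 0.9]]
labels = ["A", "A", "B", "B"]
shrunk = shrunken_centroids(samples, labels, delta=0.5)
```

In this toy example gene 1 ends up with the same value in both shrunken centroids, so it no longer influences the classification. That is the sense in which shrinkage selects a subset of genes.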

Prediction by supervised principal components

by Eric Bair, Debashis Paul, Robert Tibshirani - Journal of the American Statistical Association, 2006
Cited by 36 (5 self)
In regression problems where the number of predictors greatly exceeds the number of observations, conventional regression techniques may produce unsatisfactory results. We describe a technique called supervised principal components that can be applied to this type of problem. Supervised principal components is similar to conventional principal components analysis except that it uses a subset of the predictors selected based on their association with the outcome. Supervised principal components can be applied to regression and generalized regression problems, such as survival analysis. It compares favorably to other techniques for this type of problem, and can also account for the effects of other covariates and help identify which predictor variables are most important. We also provide asymptotic consistency results to help support our empirical findings. These methods could become important tools for DNA microarray data, where they may be used to more accurately diagnose and treat cancer. KEY WORDS: Gene expression; Microarray; Regression; Survival analysis.

Treelets — An Adaptive Multi-Scale Basis for Sparse Unordered Data

by Ann B Lee, Boaz Nadler, Larry Wasserman
Cited by 6 (2 self)
In many modern applications, including analysis of gene expression and text documents, the data are noisy, high-dimensional, and unordered — with no particular meaning to the given order of the variables. Yet, successful learning is often possible due to sparsity: the fact that the data are typically redundant with underlying structures that can be represented by only a few features. In this paper, we present treelets — a novel construction of multi-scale bases that extends wavelets to non-smooth signals. The method is fully adaptive, as it returns a hierarchical tree and an orthonormal basis which both reflect the internal structure of the data. Treelets are especially well-suited as a dimensionality reduction and feature selection tool prior to regression and classification, in situations where sample sizes are small and the data are sparse with unknown groupings of correlated or collinear variables. The method is also simple to implement and analyze theoretically. Here we describe a variety of situations where treelets perform better than principal component analysis as well as some common variable selection and cluster averaging schemes. We illustrate treelets on a blocked covariance model and on several data sets (hyperspectral image data, DNA microarray data, and internet advertisements) with highly complex dependencies between variables.

Identifying splits with clear separation: a new class discovery method for gene expression data

by Anja von Heydebreck, Wolfgang Huber, Annemarie Poustka, Martin Vingron, 2001
Cited by 4 (1 self)
We present a new class discovery method for microarray gene expression data. Based on a collection of gene expression profiles from different tissue samples, the method searches for binary class distinctions in the set of samples that show clear separation in the expression levels of specific subsets of genes. Several mutually independent class distinctions may be found, which is difficult to obtain from most commonly used clustering algorithms. Each class distinction can be biologically interpreted in terms of its supporting genes. The mathematical characterization of the favored class distinctions is based on statistical concepts. By analyzing three data sets from cancer gene expression studies, we demonstrate that our method is able to detect biologically relevant structures, for example cancer subtypes, in an unsupervised fashion.

A robust clustering method and visualization tool based on data depth

by Rebecka Jornsten, Yehuda Vardi, Cun-hui Zhang - In Statistical data analysis based on the L1-norm and related methods (Neuchâtel, 2002), Stat. Ind. Technol, 2002
Cited by 3 (1 self)
Abstract. We present a robust clustering method based on a modified Weiszfeld algorithm for the multivariate median, and associated data depth. The multivariate medians are used to represent the clusters, while the induced relative L1-depths are used to identify outliers and to select the number of clusters. We develop a cluster validation and visualization tool based on the within-cluster data depths, and the cluster data depths with respect to competing clusters. We apply our method to high-dimensional gene expression data, and several simulated data sets. Our method successfully identifies the number of clusters in noisy data sets, and generates accurate cluster assignments. 1. Introduction. In this paper we present a robust clustering method, and a novel cluster validation tool based on data depth. We consider a clustering method robust if it is not affected by small perturbations of the data, i.e. noisy observations, or by the inclusion of unrelated variables, i.e. clustering in a higher dimension than necessary.
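
For readers unfamiliar with the multivariate median mentioned in this abstract, here is a minimal sketch of the classical Weiszfeld iteration. It is the textbook version, not the modified algorithm the authors describe; the `eps` guard against a zero distance is a crude assumption of this sketch, and the function name and toy points are illustrative only.

```python
import math


def weiszfeld_median(points, iters=1000, tol=1e-12, eps=1e-12):
    """Approximate the multivariate (geometric) median: the point that
    minimizes the sum of Euclidean distances to the given points."""
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    current = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    for _ in range(iters):
        # Each point is weighted by the inverse of its distance to the
        # current estimate; eps avoids division by zero (crude guard).
        weights = [1.0 / max(math.dist(p, current), eps) for p in points]
        total = sum(weights)
        updated = [sum(w * p[d] for w, p in zip(weights, points)) / total
                   for d in range(dim)]
        step = math.dist(updated, current)
        current = updated
        if step < tol:
            break
    return current


# Four points around the origin plus one distant outlier.
points = [(-1.0, 0.0), (1.0, 0.0), (0.0, -1.0), (0.0, 1.0), (0.0, 50.0)]
median = weiszfeld_median(points)
mean = [sum(p[d] for p in points) / len(points) for d in range(2)]
```

On these points the mean is pulled out to (0, 10) by the outlier, while the median stays near the bulk of the data. This robustness is what makes the median a candidate cluster representative.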

Averaged gene expressions for regression

by Mee Young Park, Trevor Hastie, Robert Tibshirani - Biostatistics, 2007
Cited by 3 (0 self)
Although averaging is a simple technique, it plays an important role in reducing variance. We use this essential property of averaging in regression of the DNA microarray data, which poses the challenge of having far more features than samples. In this paper, we introduce a two-step procedure that combines (1) hierarchical clustering and (2) Lasso. By averaging the genes within the clusters obtained from hierarchical clustering, we define supergenes and use them to fit regression models, thereby attaining concise interpretation and accuracy. Our methods are supported with theoretical justifications and demonstrated on simulated and real data sets.
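
The averaging step behind the "supergenes" in this abstract might look like the sketch below. It assumes the gene-to-cluster assignment is already available (for instance from cutting a hierarchical clustering tree) and does not show the clustering or the Lasso fit. The function name and data layout are hypothetical.

```python
def supergenes(expr, cluster_of_gene):
    """Average gene expression within clusters.

    expr: list of samples, each a list of expression values (one per gene).
    cluster_of_gene: one cluster id per gene, assumed to be given.
    Returns one row per sample with one averaged value per cluster,
    ordered by sorted cluster id.
    """
    clusters = sorted(set(cluster_of_gene))
    result = []
    for sample in expr:
        row = []
        for c in clusters:
            vals = [v for v, cid in zip(sample, cluster_of_gene) if cid == c]
            row.append(sum(vals) / len(vals))
        result.append(row)
    return result


# Two samples, three genes; genes 0 and 1 belong to the same cluster.
expr = [[1.0, 3.0, 10.0], [2.0, 4.0, 20.0]]
cluster_of_gene = [0, 0, 1]
averaged = supergenes(expr, cluster_of_gene)
```

The averaged columns would then serve as predictors in a sparse regression, giving far fewer features than the original gene set.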

Hierarchical Testing of Variable Importance

by Nicolai Meinshausen
Cited by 3 (0 self)
Abstract. A frequently encountered challenge in high-dimensional regression is the detection of relevant variables. Variable selection suffers from instability and the power to detect relevant variables is typically low if predictor variables are highly correlated. When taking the multiplicity of the testing problem into account, the power diminishes even further. To gain power and insight, it can be advantageous to look for influence not at the level of individual variables but rather at the level of clusters of highly correlated variables. We propose a hierarchical approach. Variable importance is first tested at the coarsest level, corresponding to the global null hypothesis. If possible, the method tries then to attribute any effect to smaller sub-clusters or even individual variables. The smallest possible clusters which still exhibit a significant influence on the response variable are retained. It is shown that the proposed testing procedure controls the family-wise error rate at a prespecified level, simultaneously over all resolution levels. The method has comparable power to Bonferroni-Holm on the level of individual variables and dramatically larger power for coarser resolution levels. The best resolution level is selected adaptively.

Stable Feature Selection via Dense Feature Groups

by Lei Yu, Chris Ding, Steven Loscalzo
Cited by 2 (1 self)
Many feature selection algorithms have been proposed in the past focusing on improving classification accuracy. In this work, we point out the importance of stable feature selection for knowledge discovery from high-dimensional data, and identify two causes of instability of feature selection algorithms: selection of a minimum subset without redundant features and small sample size. We propose a general framework for stable feature selection which emphasizes both good generalization and stability of feature selection results. The framework identifies dense feature groups based on kernel density estimation and treats features in each dense group as a coherent entity for feature selection. An efficient algorithm DRAGS (Dense Relevant Attribute Group Selector) is developed under this framework. We also introduce a general measure for assessing the stability of feature selection algorithms. Our empirical study based on microarray data verifies that dense feature groups remain stable under random sample hold out, and the DRAGS algorithm is effective in identifying a set of feature groups which exhibit both high classification accuracy and stability.