Results 1 - 10
of
327
An introduction to variable and feature selection
- Journal of Machine Learning Research
, 2003
"... Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available. ..."
Abstract
-
Cited by 431 (8 self)
- Add to MetaCart
Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available.
Linear models and empirical Bayes methods for assessing differential expression in microarray experiments
- STAT. APPL. GENET. MOL. BIOL
, 2004
"... ..."
Regularization and variable selection via the Elastic Net
- Journal of the Royal Statistical Society, Series B
, 2005
"... Summary. We propose the elastic net, a new regularization and variable selection method. Real world data and a simulation study show that the elastic net often outperforms the lasso, while enjoying a similar sparsity of representation. In addition, the elastic net encourages a grouping effect, where ..."
Abstract
-
Cited by 159 (5 self)
- Add to MetaCart
Summary. We propose the elastic net, a new regularization and variable selection method. Real world data and a simulation study show that the elastic net often outperforms the lasso, while enjoying a similar sparsity of representation. In addition, the elastic net encourages a grouping effect, where strongly correlated predictors tend to be in or out of the model together.The elastic net is particularly useful when the number of predictors (p) is much bigger than the number of observations (n). By contrast, the lasso is not a very satisfactory variable selection method in the p n case. An algorithm called LARS-EN is proposed for computing elastic net regularization paths efficiently, much like algorithm LARS does for the lasso.
Replicated Microarray Data
- Statistica Sinica
, 2001
"... cDNA microarrays permit us to study the expression of thousands of genes simultaneously. They are now used in many dierent contexts to compare mRNA levels between two or more samples of cells. Microarray experiments typically give us expression measurements on a large number of genes, say 10,000-20, ..."
Abstract
-
Cited by 81 (4 self)
- Add to MetaCart
cDNA microarrays permit us to study the expression of thousands of genes simultaneously. They are now used in many dierent contexts to compare mRNA levels between two or more samples of cells. Microarray experiments typically give us expression measurements on a large number of genes, say 10,000-20,000, but with few, if any replicates for each gene. Traditional methods using means and standard deviations to detect dierential expression are not completely satisfactory in this context, and so a dierent approach seems desirable. In this paper we present an empirical Bayes method for analysing replicated microarray data. Data from all the genes in a replicate set of experiments are combined into estimates of parameters of a prior distribution. These parameter estimates are then combined at the gene level with means and standard deviations to form a statistic B which can be used to decide whether dierential expression has occurred. The statistic B avoids the problems of using averages or t-statistics. The method is illustrated using data from an experiment comparing the expression of genes in the livers of SR-BI transgenic mice with that of the corresponding wild-type mice. In addition we present the results of a simulation study estimating the ROC curve of B and three other statistics for determining dierential expression: the average and two simple modications of the usual t-statistic. B was found to be the most powerful of the four, though the margin was not great. The data were simulated to resemble the SR-BI data. Keywords: cDNA microarray, dierential expression, empirical Bayes, replication, ROC curve, t-statistic Department of Mathematics, Uppsala University y Correspondence should be addressed to Ingrid Lonnstedt, telephone/fax +46-18-4712842/4713201, e...
F: Evolving gene/ transcript definitions significantly alter the interpretation of GeneChip data
- Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H, Watson SJ, Meng
, 2005
"... ..."
Cluster Analysis for Gene Expression Data: A Survey
- IEEE Transactions on Knowledge and Data Engineering
, 2004
"... Abstract—DNA microarray technology has now made it possible to simultaneously monitor the expression levels of thousands of genes during important biological processes and across collections of related samples. Elucidating the patterns hidden in gene expression data offers a tremendous opportunity f ..."
Abstract
-
Cited by 48 (3 self)
- Add to MetaCart
Abstract—DNA microarray technology has now made it possible to simultaneously monitor the expression levels of thousands of genes during important biological processes and across collections of related samples. Elucidating the patterns hidden in gene expression data offers a tremendous opportunity for an enhanced understanding of functional genomics. However, the large number of genes and the complexity of biological networks greatly increases the challenges of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. A first step toward addressing this challenge is the use of clustering techniques, which is essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying data. Cluster analysis seeks to partition a given data set into groups based on specified features so that the data points within a group are more similar to each other than the points in different groups. A very rich literature on cluster analysis has developed over the past three decades. Many conventional clustering algorithms have been adapted or directly applied to gene expression data, and also new algorithms have recently been proposed specifically aiming at gene expression data. These clustering algorithms have been proven useful for identifying biologically relevant groups of genes and samples. In this paper, we first briefly introduce the concepts of microarray technology and discuss the basic elements of clustering on gene expression data. In particular, we divide cluster analysis for gene expression data into three categories. Then, we present specific challenges pertinent to each clustering category and introduce several representative approaches. We also discuss the problem of cluster validation in three aspects and review various methods to assess the quality and reliability of clustering results. Finally, we conclude this paper and suggest the promising trends in this field. Index Terms—Microarray technology, gene expression data, clustering.
Resampling-Based Multiple Testing for Microarray Data Analysis
, 2003
"... The burgeoning field of genomics has revived interest in multiple testing procedures by raising new methodological and computational challenges. For example, microarray experiments generate large multiplicity problems in which thousands of hypotheses are tested simultaneously. In their 1993 book, We ..."
Abstract
-
Cited by 40 (0 self)
- Add to MetaCart
The burgeoning field of genomics has revived interest in multiple testing procedures by raising new methodological and computational challenges. For example, microarray experiments generate large multiplicity problems in which thousands of hypotheses are tested simultaneously. In their 1993 book, Westfall & Young propose resampling-based p-value adjustment procedures which are highly relevant to microarray experiments. This article discusses different criteria for error control in resampling-based multiple testing, including (a) the family wise error rate of Westfall & Young (1993) and (b) the false discovery rate developed by Benjamini & Hochberg (1995), both from a frequentist viewpoint; and (c) the positive false discovery rate of Storey (2002), which has a Bayesian motivation. We also introduce our recently developed fast algorithm for implementing the minP adjustment to control familywise error rate. Adjusted p-values for different approaches are applied to gene expression data from two recently published microarray studies. The properties of these procedures for multiple testing are compared.
Gene selection: a Bayesian variable selection approach
- BIOINFORMATICS
, 2003
"... Selection of significant genes via expression patterns is an important problem in microarray experiments. Owing to small sample size and the large number of variables (genes), the selection process can be unstable. This paper proposes a hierarchical Bayesian model for gene (variable) selection. We e ..."
Abstract
-
Cited by 39 (8 self)
- Add to MetaCart
Selection of significant genes via expression patterns is an important problem in microarray experiments. Owing to small sample size and the large number of variables (genes), the selection process can be unstable. This paper proposes a hierarchical Bayesian model for gene (variable) selection. We employ latent variables to specialize the model to a regression setting and uses a Bayesian mixture prior to perform the variable selection. We control the size of the model by assigning a prior distribution over the dimension (number of significant genes) of the model. The posterior distributions of the parameters are not in explicit form and we need to use a combination of truncated sampling and Markov Chain Monte Carlo (MCMC) based computation techniques to simulate the parameters from the posteriors. The Bayesian model is flexible enough to identify significant genes as well as to perform future predictions. The method is applied to cancer classification via cDNA microarrays where the genes BRCA1 and BRCA2 are associated with a hereditary disposition to breast cancer, and the method is used to identify a set of significant genes. The method is also applied successfully to the leukemia data.
Statistical Issues in cDNA Microarray Data Analysis
, 2003
"... This article summarizes some of the issues involved and provides a brief review of the analysis tools which are available to researchers to deal with them. Any microarray experiment involves a number of distinct stages. Firstly there is the design of the experiment. The researchers must decide which ..."
Abstract
-
Cited by 39 (2 self)
- Add to MetaCart
This article summarizes some of the issues involved and provides a brief review of the analysis tools which are available to researchers to deal with them. Any microarray experiment involves a number of distinct stages. Firstly there is the design of the experiment. The researchers must decide which genes are to be printed on the arrays, which sources of RNA are to be hybridized to the arrays and on how many arrays the hybridizations will be replicated. Secondly, after hybridization, there follows a number of data-cleaning steps or `low-level analysis' of the microarray data. The microarray images must be processed to acquire red and green foreground and background intensities for each spot. The acquired red/green ratios must be normalized to adjust for dye-bias and for any systematic variation other than that due to the differences between the RNA samples being studied. Thirdly, the normalized ratios are analyzed by various graphical and numerical means to select differentially expressed genes or to find groups of genes whose expression profiles can reliably classify the different RNA sources into meaningful groups. The sections of this article correspond roughly to the various analysis steps. The following notation will be used throughout the article. The foreground red and green intensities will be written Pp and 9p for each spot. The background intensities will be Pf and 9f . The background-corrected intensities will be P and 9 where usually P Pp Pf 0 # and 9 9p 9f 0 # . The log-differential expression ratio will be vyq # E P 9 0 for each spot. Finally, the log-intensity of the spot will be vyq 3 P9 0 , a measure of the overall brightness of the spot. (The letter E is a mnemonic for minus as vyq vyq E P 9 0 # while 3 is a mnemonic for add as #vyq vyq #...
Prediction by supervised principal components
- Journal of the American Statistical Association
, 2006
"... In regression problems where the number of predictors greatly exceeds the number of observations, conventional regression techniques may produce unsatisfactory results. We describe a technique called supervised principal components that can be applied to this type of problem. Supervised principal co ..."
Abstract
-
Cited by 36 (5 self)
- Add to MetaCart
In regression problems where the number of predictors greatly exceeds the number of observations, conventional regression techniques may produce unsatisfactory results. We describe a technique called supervised principal components that can be applied to this type of problem. Supervised principal components is similar to conventional principal components analysis except that it uses a subset of the predictors selected based on their association with the outcome. Supervised principal components can be applied to regression and generalized regression problems, such as survival analysis. It compares favorably to other techniques for this type of problem, and can also account for the effects of other covariates and help identify which predictor variables are most important. We also provide asymptotic consistency results to help support our empirical findings. These methods could become important tools for DNA microarray data, where they may be used to more accurately diagnose and treat cancer. KEY WORDS: Gene expression; Microarray; Regression; Survival analysis. 1.

