Results 11 – 20 of 301
Estimation and confidence sets for sparse normal mixtures
, 2006
"... Estimation and confidence sets for sparse normal mixtures ..."
Abstract

Cited by 35 (17 self)
Semilinear high-dimensional model for normalization of microarray data: a theoretical analysis and partial consistency
 J. Amer. Statist. Assoc.
, 2005
"... Normalization of microarray data is essential for removing experimental biases and revealing meaningful biological results. Motivated by a problem of normalizing microarray data, a semilinear inslide model (SLIM) has been proposed. To aggregate information from other arrays, SLIM is generalized to ..."
Abstract

Cited by 27 (9 self)
Normalization of microarray data is essential for removing experimental biases and revealing meaningful biological results. Motivated by a problem of normalizing microarray data, a semilinear in-slide model (SLIM) has been proposed. To aggregate information from other arrays, SLIM is generalized to account for across-array information, resulting in an even more dynamic semiparametric regression model. This model can be used to normalize microarray data even when there is no replication within an array. We demonstrate that this semiparametric model has a number of interesting features. The parametric component and the nonparametric component that are of primary interest can be consistently estimated, the former at a parametric rate and the latter at a nonparametric rate, whereas the nuisance parameters cannot be consistently estimated. This is an interesting extension of the partial consistency phenomenon, which is itself of theoretical interest. The asymptotic normality of the parametric component and the rate of convergence of the nonparametric component are established. The results are augmented by simulation studies and illustrated by an application to the cDNA microarray analysis of neuroblastoma cells in response to the macrophage migration inhibitory factor.
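The estimation idea behind such partially linear models can be sketched with Robinson's double-residual approach: smooth the nonparametric trend out of both the response and the linear covariate, then regress the residuals. This is a generic stand-in under simplifying assumptions, not the paper's SLIM estimator; `kernel_smooth`, the bandwidth, and the simulation setup are all illustrative.

```python
import numpy as np

def kernel_smooth(t, v, h=0.05):
    """Nadaraya-Watson smoother of values v observed at points t."""
    w = np.exp(-0.5 * ((t[:, None] - t[None, :]) / h) ** 2)
    return (w * v[None, :]).sum(axis=1) / w.sum(axis=1)

def partially_linear_fit(y, x, t, h=0.05):
    """Fit y = x*beta + g(t) + eps by Robinson's double-residual method:
    remove the estimated trend in t from both y and x, regress the
    residuals to get beta, then smooth y - x*beta to recover g."""
    ry = y - kernel_smooth(t, y, h)
    rx = x - kernel_smooth(t, x, h)
    beta = (rx @ ry) / (rx @ rx)
    g_hat = kernel_smooth(t, y - x * beta, h)
    return beta, g_hat
```

On simulated data with y = 2x + sin(2πt) + noise, beta is recovered at the fast parametric rate while g is recovered more slowly, mirroring the two rates discussed in the abstract.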
Large-scale multiple testing under dependence
 J. Roy. Statist. Soc. B
, 2009
"... Summary. The paper considers the problem of multiple testing under dependence in a compound decision theoretic framework. The observed data are assumed to be generated from an underlying twostate hidden Markov model.We propose oracle and asymptotically optimal datadriven procedures that aim to mini ..."
Abstract

Cited by 25 (2 self)
Summary. The paper considers the problem of multiple testing under dependence in a compound decision theoretic framework. The observed data are assumed to be generated from an underlying two-state hidden Markov model. We propose oracle and asymptotically optimal data-driven procedures that aim to minimize the false nondiscovery rate (FNR) subject to a constraint on the false discovery rate (FDR). It is shown that the performance of a multiple-testing procedure can be substantially improved by adaptively exploiting the dependence structure among hypotheses, and hence conventional FDR procedures that ignore this structural information are inefficient. Both theoretical properties and numerical performance of the proposed procedures are investigated. It is shown that the proposed procedures control FDR at the desired level, enjoy certain optimality properties and are especially powerful in identifying clustered non-null cases. The new procedure is applied to an influenza-like illness surveillance study for detecting the timing of epidemic periods.
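The adaptive thresholding recipe underlying such oracle procedures can be sketched as: rank cases by their posterior probability of being null (the local FDR) and grow the rejection set while its running average stays below the target level. The lfdr values below are hypothetical inputs; in the paper they come from a fitted two-state hidden Markov model rather than this generic setup.

```python
import numpy as np

def lfdr_threshold(lfdr, alpha=0.10):
    """Reject the cases with the smallest local FDR values, taking the
    largest set whose average lfdr (an estimate of the FDR of that set)
    stays at or below alpha."""
    lfdr = np.asarray(lfdr)
    order = np.argsort(lfdr)
    running_mean = np.cumsum(lfdr[order]) / np.arange(1, len(lfdr) + 1)
    k = int(np.sum(running_mean <= alpha))  # running mean is nondecreasing
    reject = np.zeros(len(lfdr), dtype=bool)
    reject[order[:k]] = True
    return reject
```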
Assessing stability of gene selection in microarray data analysis
 BMC Bioinformatics
, 2006
"... Background. The number of genes declared differentially expressed is a random variable and its variability can be assessed by resampling techniques. Another important stability indicator is the frequency with which a given gene is selected across subsamples. We have conducted studies to assess sta ..."
Abstract

Cited by 24 (4 self)
Background. The number of genes declared differentially expressed is a random variable, and its variability can be assessed by resampling techniques. Another important stability indicator is the frequency with which a given gene is selected across subsamples. We have conducted studies to assess stability and some other properties of several gene selection procedures with biological and simulated data. Results. Using cross-validation techniques we have found that some genes are selected much less frequently (across cross-validation samples) than other genes with the same adjusted p-values. The extent to which this type of instability manifests itself depends on the specific multiple testing procedure and the choice of a test statistic. The effect of correlation between gene expression levels on the performance of multiple testing procedures is studied by computer simulations. Conclusions. Cross-validation represents a tool for reducing the set of initially selected genes to those with a sufficiently high selection frequency. Using cross-validation it is also possible to assess the variability of different performance indicators. Stability properties of several multiple testing procedures are described at length in the present paper.
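The selection-frequency diagnostic described here is easy to prototype: repeatedly subsample the arrays, rerun the selection rule, and count how often each gene survives. Everything below (the t-statistic selector, the subsample fraction, the threshold of 2) is an illustrative assumption, not the paper's exact protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def selection_frequency(X, y, select, n_splits=50, frac=0.8):
    """Fraction of random subsamples in which each gene is selected;
    `select` maps a data subset (X, y) to a boolean mask over genes."""
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_splits):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        counts += select(X[idx], y[idx])
    return counts / n_splits

def t_select(X, y):
    """Toy selector: genes with |two-sample t statistic| > 2."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(0, ddof=1) / len(a) + b.var(0, ddof=1) / len(b))
    return np.abs(a.mean(0) - b.mean(0)) / se > 2
```

A truly differential gene should come out with a selection frequency near 1, while genes with borderline statistics are selected only in some subsamples.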
Local False Discovery Rates
, 2005
"... Modern scientific technology is providing a new class of largescale simultaneous inference problems, with hundreds or thousands of hypothesis tests to consider at the same time. Microarrays epitomize this type of technology but similar problems arise in proteomics, time of flight spectroscopy, flow ..."
Abstract

Cited by 24 (1 self)
Modern scientific technology is providing a new class of large-scale simultaneous inference problems, with hundreds or thousands of hypothesis tests to consider at the same time. Microarrays epitomize this type of technology, but similar problems arise in proteomics, time-of-flight spectroscopy, flow cytometry, fMRI, and massive social science surveys. This paper uses local false discovery rate methods to carry out size and power calculations on large-scale data sets. An empirical Bayes approach allows the fdr analysis to proceed from a minimum of frequentist or Bayesian modeling assumptions. Microarray and simulated data sets are used to illustrate a convenient estimation methodology whose accuracy can be calculated in closed form. A crucial part of the methodology is an fdr assessment of “thinned counts”: what the histogram of test statistics would look like for just the non-null cases.
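The central empirical-Bayes quantity here is the local false discovery rate fdr(z) = p0 f0(z) / f(z). A minimal sketch, assuming a theoretical N(0,1) null f0 and a plain Gaussian-kernel estimate of the mixture density f (the paper fits f by smoother regression on binned counts instead):

```python
import numpy as np

def local_fdr(z, p0=1.0):
    """Local FDR estimate fdr(z) = p0 * f0(z) / f(z), clipped at 1."""
    z = np.asarray(z, dtype=float)
    f0 = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)   # N(0,1) null density
    h = 1.06 * z.std() * len(z) ** (-1 / 5)          # Silverman bandwidth
    diffs = (z[:, None] - z[None, :]) / h
    f = np.exp(-0.5 * diffs**2).mean(axis=1) / (h * np.sqrt(2 * np.pi))
    return np.minimum(p0 * f0 / f, 1.0)
```

On a mixture of null N(0,1) statistics and a minority of shifted non-null ones, fdr(z) is near 1 in the center of the histogram and small in the non-null tail.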
Simultaneous Inference: When Should Hypothesis Testing Problems Be Combined?
"... Modern statisticians are often presented with hundreds or thousands of hypothesis testing problems to evaluate at the same time, generated from new scientific technologies such as microarrays, medical and satellite imaging devices, or flow cytometry counters. The relevant statistical literature ten ..."
Abstract

Cited by 24 (3 self)
Modern statisticians are often presented with hundreds or thousands of hypothesis testing problems to evaluate at the same time, generated from new scientific technologies such as microarrays, medical and satellite imaging devices, or flow cytometry counters. The relevant statistical literature tends to begin with the tacit assumption that a single combined analysis, for instance a False Discovery Rate assessment, should be applied to the entire set of problems at hand. This can be a dangerous assumption, as the examples in the paper show, leading to overly conservative or overly liberal conclusions within any particular subclass of the cases. A simple Bayesian theory yields a succinct description of the effects of separation or combination on false discovery rate analyses. The theory allows efficient testing within small subclasses, and has applications to “enrichment”, the detection of multi-case effects. Key Words: false discovery rates, two-class model, enrichment
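The hazard of a single combined analysis is easy to demonstrate with the Benjamini-Hochberg procedure: a subclass full of strong signals can sweep a borderline case from another subclass past the threshold, even though that case would not be rejected if its subclass were analyzed on its own. The two p-value sets below are fabricated purely for illustration.

```python
import numpy as np

def bh_reject(p, alpha=0.10):
    """Benjamini-Hochberg step-up: reject the k smallest p-values,
    where k is the largest index with p_(k) <= alpha * k / m."""
    p = np.asarray(p)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    k = below.nonzero()[0].max() + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

# class A: nine strong signals; class B: one borderline case among nulls
p_a = np.full(9, 1e-5)
p_b = np.array([0.02, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95])
combined = bh_reject(np.concatenate([p_a, p_b]))  # B's 0.02 is swept along
separate = bh_reject(p_b)                         # B alone rejects nothing
```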
Inferring gene dependency networks from genomic longitudinal data: a functional data approach
, 2006
"... A key aim of systems biology is to unravel the regulatory interactions among genes and gene products in a cell. Here we investigate a graphical model that treats the observed gene expression over time as realizations of random curves. This approach is centered around an estimator of dynamical pairw ..."
Abstract

Cited by 24 (2 self)
A key aim of systems biology is to unravel the regulatory interactions among genes and gene products in a cell. Here we investigate a graphical model that treats the observed gene expression over time as realizations of random curves. This approach is centered around an estimator of dynamical pairwise correlation that takes account of the functional nature of the observed data, and it extends the graphical Gaussian modeling framework from i.i.d. data to longitudinal genomic data. The new method is illustrated by analyzing highly replicated data from a genome experiment concerning the expression response of human T-cells to PMA and ionomycin treatment.
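A toy version of a correlation estimator that respects the functional nature of the data: average the replicated time courses, smooth each gene's curve, and correlate the smoothed curves on a fine grid. The polynomial smoother and the function names are simplifying assumptions; the paper's dynamical correlation estimator is more refined.

```python
import numpy as np

def curve_correlation(t, Y1, Y2, degree=3, grid_size=100):
    """Correlation between two genes' smoothed expression curves.
    Y1, Y2 are (replicates x timepoints) arrays observed at times t."""
    grid = np.linspace(t.min(), t.max(), grid_size)
    f1 = np.polyval(np.polyfit(t, Y1.mean(axis=0), degree), grid)
    f2 = np.polyval(np.polyfit(t, Y2.mean(axis=0), degree), grid)
    return np.corrcoef(f1, f2)[0, 1]
```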
False discovery rate analysis of brain diffusion direction maps
 Ann. Appl. Statist.
, 2008
"... Diffusion tensor imaging (DTI) is a novel modality of magnetic resonance imaging that allows noninvasive mapping of the brain’s white matter. A particular map derived from DTI measurements is a map of water principal diffusion directions, which are proxies for neural fiber directions. We consider a ..."
Abstract

Cited by 22 (5 self)
Diffusion tensor imaging (DTI) is a novel modality of magnetic resonance imaging that allows noninvasive mapping of the brain’s white matter. A particular map derived from DTI measurements is a map of water principal diffusion directions, which are proxies for neural fiber directions. We consider a study in which diffusion direction maps were acquired for two groups of subjects. The objective of the analysis is to find regions of the brain in which the corresponding diffusion directions differ between the groups. This is attained by first computing a test statistic for the difference in direction at every brain location using a Watson model for directional data. Interesting locations are subsequently selected with control of the false discovery rate. More accurate modeling of the null distribution is obtained using an empirical null density based on the empirical distribution of the test statistics across the brain. Further, substantial improvements in power are achieved by local spatial averaging of the test statistic map. Although the focus is on one particular study and imaging technology, the proposed inference methods can be applied to other large-scale simultaneous hypothesis testing problems with a continuous underlying spatial structure.
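The empirical-null step — estimating a null N(mu, sigma) from the bulk of the statistics instead of assuming N(0,1) — can be sketched with a robust quantile fit. This is a simplified stand-in for the maximum-likelihood fit to central histogram counts used in empirical-null methodology:

```python
import numpy as np

def empirical_null(z):
    """Estimate an empirical null N(mu, sigma) from test statistics z,
    using the median for mu and the interquartile range for sigma
    (the IQR of a standard normal is about 1.349), so that a minority
    of non-null cases in the tails barely moves the fit."""
    q25, q50, q75 = np.quantile(z, [0.25, 0.50, 0.75])
    return q50, (q75 - q25) / 1.349
```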
Statistical validation of peptide identifications in large-scale proteomics using target-decoy database search strategy and flexible mixture modeling
 J. Proteome Res.
"... Reliable statistical validation of peptide and protein identifications is a top priority in largescale mass spectrometry based proteomics. PeptideProphet is one of the computational tools commonly used for assessing the statistical confidence in peptide assignments to tandem mass spectra obtained u ..."
Abstract

Cited by 21 (1 self)
Reliable statistical validation of peptide and protein identifications is a top priority in large-scale mass spectrometry based proteomics. PeptideProphet is one of the computational tools commonly used for assessing the statistical confidence in peptide assignments to tandem mass spectra obtained using database search programs such as SEQUEST, MASCOT, or X! TANDEM. We present two flexible methods, the variable component mixture model and the semiparametric mixture model, that remove the restrictive parametric assumptions in the mixture modeling approach of PeptideProphet. Using a control protein mixture data set generated on a linear ion trap Fourier transform (LTQ-FT) mass spectrometer, we demonstrate that both methods improve on parametric models in terms of the accuracy of probability estimates and the power to detect correct identifications while controlling the false discovery rate to the same degree. The statistical approaches presented here require that the data set contain a sufficient number of decoy (known to be incorrect) peptide identifications, which can be obtained using the target-decoy database search strategy.
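The target-decoy logic reduces to a simple ratio: decoy hits passing a score threshold estimate how many of the passing target hits are incorrect, so FDR ≈ #decoys / #targets above the threshold. A minimal sketch with fabricated scores:

```python
def target_decoy_fdr(scores, is_decoy, threshold):
    """Estimated FDR among target identifications scoring at or above
    `threshold`, using passing decoy hits as a proxy for the number
    of incorrect passing target hits."""
    decoys = sum(1 for s, d in zip(scores, is_decoy) if d and s >= threshold)
    targets = sum(1 for s, d in zip(scores, is_decoy) if not d and s >= threshold)
    return decoys / targets if targets else 0.0
```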
Feature selection in omics prediction problems using cat scores and false nondiscovery rate control
 Ann. Appl. Statist.
, 2009
"... We revisit the problem of feature selection in linear discriminant analysis (LDA), i.e. when features are correlated. First, we introduce a pooled centroids formulation of the multiclass LDA predictor function, in which the relative weights of Mahalanobistranformed predictors are given by correlat ..."
Abstract

Cited by 21 (11 self)
We revisit the problem of feature selection in linear discriminant analysis (LDA), i.e., when features are correlated. First, we introduce a pooled centroids formulation of the multiclass LDA predictor function, in which the relative weights of Mahalanobis-transformed predictors are given by correlation-adjusted t-scores (cat scores). Second, for feature selection we propose thresholding cat scores by controlling false nondiscovery rates (FNDR). We show that, contrary to previous claims, this FNDR procedure performs very well, similarly to “higher criticism”. Third, training of the classifier function is conducted by plug-in of James-Stein shrinkage estimates of correlations and variances, using analytic procedures for choosing regularization parameters. Overall, this results in an effective and computationally inexpensive framework for high-dimensional prediction with natural feature selection. The proposed shrinkage discriminant procedures are implemented in the R package “sda” available from the R repository CRAN.
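The cat-score construction can be sketched as decorrelating the usual two-sample t-scores by R^(-1/2), the inverse matrix square root of the feature correlation matrix. The version below uses the empirical correlation matrix (so it needs n > p); the paper plugs in James-Stein shrinkage estimates instead, which is what makes the method work in high dimensions.

```python
import numpy as np

def cat_scores(X, y):
    """Correlation-adjusted t-scores: tau = R^(-1/2) @ t, where t are
    pooled two-sample t-scores and R is the feature correlation matrix
    (empirical here; shrinkage-estimated in the paper)."""
    a, b = X[y == 0], X[y == 1]
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(0, ddof=1) + (nb - 1) * b.var(0, ddof=1)) / (na + nb - 2)
    t = (a.mean(0) - b.mean(0)) / np.sqrt(pooled_var * (1 / na + 1 / nb))
    vals, vecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
    r_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return r_inv_sqrt @ t
```

With nearly uncorrelated features, the cat scores essentially reduce to the ordinary t-scores, so the truly differential feature gets the largest score.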