A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics
, 2005
An Empirical Bayes Approach to Inferring Large-Scale Gene Association Networks
 BIOINFORMATICS
, 2004
Cited by 171 (6 self)
Motivation: Genetic networks are often described statistically by graphical models (e.g. Bayesian networks). However, inferring the network structure offers a serious challenge in microarray analysis where the sample size is small compared to the number of considered genes. This renders many standard algorithms for graphical models inapplicable, and inferring genetic networks an “ill-posed” inverse problem. Methods: We introduce a novel framework for small-sample inference of graphical models from gene expression data. Specifically, we focus on so-called graphical Gaussian models (GGMs) that are now frequently used to describe gene association networks and to detect conditionally dependent genes. Our new approach is based on (i) improved (regularized) small-sample point estimates of partial correlation, (ii) an exact test of edge inclusion with adaptive estimation of the degree of freedom, and (iii) a heuristic network search based on false discovery rate multiple testing. Steps (ii) and (iii) correspond to an empirical Bayes estimate of the network topology. Results: Using computer simulations we investigate the sensitivity (power) and specificity (true negative rate) of the proposed framework to estimate GGMs from microarray data. This shows that it is possible to recover the true network topology with high accuracy even for small-sample data sets. Subsequently, we analyze gene expression data from a breast cancer tumor study and illustrate our approach by inferring a corresponding large-scale gene association network for 3,883 genes. Availability: The authors have implemented the approach in the R package “GeneTS” that is freely available from
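As a rough illustration of step (i) in the abstract above, the following Python sketch computes regularized partial correlations when there are fewer samples than variables. It is not the GeneTS implementation: the function name `shrunk_partial_correlations`, the fixed shrinkage intensity `lam`, and the choice of shrinking the correlation matrix toward the identity are all assumptions made for this example.

```python
import numpy as np

def shrunk_partial_correlations(X, lam=0.2):
    """Partial correlations from a regularized correlation matrix.

    X   : array of shape (n_samples, n_variables)
    lam : hypothetical fixed shrinkage intensity in (0, 1]; the paper
          estimates its regularization from data, which is not done here.
    """
    R = np.corrcoef(X, rowvar=False)            # sample correlation matrix
    p = R.shape[0]
    # Linear shrinkage toward the identity makes the matrix invertible
    # even when n_samples < n_variables.
    R_shrunk = (1.0 - lam) * R + lam * np.eye(p)
    omega = np.linalg.inv(R_shrunk)             # precision (inverse) matrix
    d = np.sqrt(np.diag(omega))
    pcor = -omega / np.outer(d, d)              # standardize and flip sign
    np.fill_diagonal(pcor, 1.0)
    return pcor

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 50))               # 10 samples, 50 variables
pcor = shrunk_partial_correlations(X, lam=0.2)
```

Without the shrinkage term the sample correlation matrix of this 10 x 50 data set would be singular, which is the small-sample problem the abstract describes.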
Correlation and Large-Scale Simultaneous Significance Testing
 Journal of the American Statistical Association
Cited by 63 (8 self)
Large-scale hypothesis testing problems, with hundreds or thousands of test statistics “zi” to consider at once, have become familiar in current practice. Applications of popular analysis methods such as false discovery rate techniques do not require independence of the zi’s, but their accuracy can be compromised in high-correlation situations. This paper presents computational and theoretical methods for assessing the size and effect of correlation in large-scale testing. A simple theory leads to the identification of a single omnibus measure of correlation. The theory relates to the correct choice of a null distribution for simultaneous significance testing, and its effect on inference.
1. Introduction. Modern computing machinery and improved scientific equipment have combined to revolutionize experimentation in fields such as biology, medicine, genetics, and neuroscience. One effect on statistics has been to vastly magnify the scope of multiple hypothesis testing, now often involving thousands of cases considered simultaneously. The cases themselves are typically of familiar form, each perhaps a simple two-sample comparison,
Statistical challenges with high dimensionality: feature selection in knowledge discovery
, 2006
Capturing heterogeneity in gene expression studies by ‘surrogate variable analysis’. PLoS Genetics 3:e161
, 2007
Cited by 41 (8 self)
It has unambiguously been shown that genetic, environmental, demographic, and technical factors may have substantial effects on gene expression levels. In addition to the measured variable(s) of interest, there will tend to be sources of signal due to factors that are unknown, unmeasured, or too complicated to capture through simple models. We show that failing to incorporate these sources of heterogeneity into an analysis can have widespread and detrimental effects on the study. Not only can this reduce power or induce unwanted dependence across genes, but it can also introduce sources of spurious signal to many genes. This phenomenon is true even for well-designed, randomized studies. We introduce “surrogate variable analysis” (SVA) to overcome the problems caused by heterogeneity in expression studies. SVA can be applied in conjunction with standard analysis techniques to accurately capture the relationship between expression and any modeled variables of interest. We apply SVA to disease class, time course, and genetics of gene expression studies. We show that SVA increases the biological accuracy and reproducibility of analyses in genome-wide expression studies.
Microarrays, empirical Bayes and the two-groups model
 STATIST. SCI
, 2006
Cited by 37 (10 self)
The classic frequentist theory of hypothesis testing developed by Neyman, Pearson, and Fisher has a claim to being the Twentieth Century’s most influential piece of applied mathematics. Something new is happening in the Twenty-First Century: high throughput devices, such as microarrays, routinely require simultaneous hypothesis tests for thousands of individual cases, not at all what the classical theory had in mind. In these situations empirical Bayes information begins to force itself upon frequentists and Bayesians alike. The two-groups model is a simple Bayesian construction that facilitates empirical Bayes analysis. This article concerns the interplay of Bayesian and frequentist ideas in the two-groups setting, with particular attention focussed on Benjamini and Hochberg’s False Discovery Rate method. Topics include the choice and meaning of the null hypothesis in large-scale testing situations, power considerations, the limitations of permutation methods, significance testing for groups of cases (such as pathways in microarray studies), correlation effects, multiple confidence intervals, and Bayesian competitors to the two-groups model.
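For readers unfamiliar with the Benjamini and Hochberg method named in this abstract, here is a minimal Python sketch of the standard step-up procedure. The function name `benjamini_hochberg`, the target level `q`, and the example p-values are illustrative assumptions, not material from the article.

```python
import numpy as np

def benjamini_hochberg(pvalues, q=0.1):
    """Return a boolean mask of hypotheses rejected at FDR level q."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                        # indices sorting p ascending
    thresholds = q * np.arange(1, m + 1) / m     # q * i / m for rank i
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])         # largest rank meeting its threshold
        reject[order[:k + 1]] = True             # reject all hypotheses up to that rank
    return reject

# Made-up p-values, used only to exercise the function.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(pvals, q=0.05)     # rejects the two smallest p-values
```

The step-up rule rejects every hypothesis ranked at or below the largest rank whose p-value meets its threshold, so a p-value can be rejected even if it misses its own threshold.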
Size, power and false discovery rates
, 2007
Cited by 30 (4 self)
Modern scientific technology has provided a new class of large-scale simultaneous inference problems, with thousands of hypothesis tests to consider at the same time. Microarrays epitomize this type of technology, but similar situations arise in proteomics, spectroscopy, imaging, and social science surveys. This paper uses false discovery rate methods to carry out both size and power calculations on large-scale problems. A simple empirical Bayes approach allows the fdr analysis to proceed with a minimum of frequentist or Bayesian modeling assumptions. Closed-form accuracy formulas are derived for estimated false discovery rates, and used to compare different methodologies: local or tail-area fdr’s, theoretical, permutation, or empirical null hypothesis estimates. Two microarray data sets as well as simulations are used to evaluate the methodology, the power diagnostics showing why non-null cases might easily fail to appear on a list of “significant” discoveries. Short Title: “Size, Power, and Fdr’s”
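The distinction between local and tail-area false discovery rates mentioned in this abstract can be written out as follows. The notation is the usual one for a two-class mixture and is an assumption here, since the listing does not show the paper's own symbols: \pi_0 and \pi_1 are the prior proportions of null and non-null cases, f_0 and f_1 their densities, and F_0 and F the tail areas (for example, left-tail probabilities) under the null and under the mixture.

```latex
\begin{align}
  f(z) &= \pi_0 f_0(z) + \pi_1 f_1(z), \qquad \pi_0 + \pi_1 = 1, \\
  \mathrm{fdr}(z) &= \frac{\pi_0 f_0(z)}{f(z)}
      \quad \text{(local: posterior probability of null at the point } z\text{)}, \\
  \mathrm{Fdr}(z) &= \frac{\pi_0 F_0(z)}{F(z)}
      \quad \text{(tail-area: posterior probability of null within the tail beyond } z\text{)}.
\end{align}
```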
Bayesian robust inference for differential gene expression in microarrays with multiple samples
 Biometrics
Cited by 27 (6 self)
We consider the problem of identifying differentially expressed genes under different conditions using gene expression microarrays. Because of the many steps involved in the experimental process, from hybridization to image analysis, cDNA microarray data often contain outliers. For example, an outlying data value could occur because of scratches or dust on the surface, imperfections in the glass, or imperfections in the array production. We develop a robust Bayesian hierarchical model for testing for differential expression. Errors are modeled explicitly using a t-distribution, which accounts for outliers. The model includes an exchangeable prior for the variances which allows different variances for the genes but still shrinks extreme empirical variances. Our model can be used for testing for differentially expressed genes among multiple samples, and it can distinguish between the different possible patterns of differential expression when there are three or more samples. Parameter estimation is carried out using a novel version of Markov chain Monte Carlo that is appropriate when the model puts mass on subspaces of the full parameter space. The method is illustrated using two publicly available gene expression data sets. We compare our method to six other baseline and commonly used techniques, namely the t-test, the Bonferroni-adjusted t-test, Significance Analysis of Microarrays (SAM), Efron’s empirical Bayes, and EBarrays in both its Lognormal-Normal and Gamma-Gamma forms. In an experiment with HIV data, our method performed better than these alternatives, on the basis of between-replicate agreement and disagreement.
Estimating the null and the proportion of non-null effects in large-scale multiple comparisons
 J. Amer. Statist. Assoc
, 2007
Cited by 24 (5 self)
An important issue raised by Efron [7] in the context of large-scale multiple comparisons is that in many applications the usual assumption that the null distribution is known is incorrect, and seemingly negligible differences in the null may result in large differences in subsequent studies. This suggests that a careful study of estimation of the null is indispensable. In this paper, we consider the problem of estimating a null normal distribution, and a closely related problem, estimation of the proportion of non-null effects. We develop an approach based on the empirical characteristic function and Fourier analysis. The estimators are shown to be uniformly consistent over a wide class of parameters. Numerical performance of the estimators is investigated using both simulated and real data. In particular, we apply our
Semilinear high-dimensional model for normalization of microarray data: a theoretical analysis and partial consistency
 J. Amer. Statist. Assoc
, 2005
Cited by 23 (8 self)
Normalization of microarray data is essential for removing experimental biases and revealing meaningful biological results. Motivated by a problem of normalizing microarray data, a semilinear in-slide model (SLIM) has been proposed. To aggregate information from other arrays, SLIM is generalized to account for across-array information, resulting in an even more dynamic semiparametric regression model. This model can be used to normalize microarray data even when there is no replication within an array. We demonstrate that this semiparametric model has a number of interesting features. The parametric component and the nonparametric component that are of primary interest can be consistently estimated, the former having a parametric rate and the latter having a nonparametric rate, whereas the nuisance parameters cannot be consistently estimated. This is an interesting extension of the partial consistent phenomena, which itself is of theoretical interest. The asymptotic normality for the parametric component and the rate of convergence for the nonparametric component are established. The results are augmented by simulation studies and illustrated by an application to the cDNA microarray analysis of neuroblastoma cells in response to the macrophage migration inhibitory factor.