Results 11–20 of 137
Estimation and confidence sets for sparse normal mixtures
, 2005
Abstract

Cited by 14 (7 self)
For high dimensional statistical models, researchers have begun to focus on situations which can be described as having relatively few moderately large coefficients. Such situations lead to some very subtle statistical problems. In particular, Ingster and Donoho and Jin have considered a sparse normal means testing problem, in which they described the precise demarcation, or detection boundary. Meinshausen and Rice have shown that it is even possible to estimate consistently the fraction of nonzero coordinates on a subset of the detectable region, but left unanswered the question of exactly where in the detectable region consistent estimation is possible. In the present paper we develop a new approach for estimating the fraction of nonzero means for problems where the nonzero means are moderately large. We show that the detection region described by Ingster and Donoho and Jin turns out to be the region where it is possible to consistently estimate the expected fraction of nonzero coordinates. This theory is developed further and minimax rates of convergence are derived. A procedure is constructed which attains the optimal rate of convergence in this setting. Furthermore, the procedure also provides an honest lower bound for confidence intervals while minimizing the expected length of such an interval. Simulations are used to enable comparison with the work of Meinshausen and Rice, where a procedure is given but rates of convergence are not discussed. Extensions to more general Gaussian mixture models are also given.
Assessing stability of gene selection in microarray data analysis
 BMC BIOINFORMATICS
, 2006
Abstract

Cited by 13 (2 self)
Background. The number of genes declared differentially expressed is a random variable and its variability can be assessed by resampling techniques. Another important stability indicator is the frequency with which a given gene is selected across subsamples. We have conducted studies to assess stability and some other properties of several gene selection procedures with biological and simulated data. Results. Using cross-validation techniques we have found that some genes are selected much less frequently (across cross-validation samples) than other genes with the same adjusted p-values. The extent to which this type of instability manifests itself depends on the specific multiple testing procedure and the choice of a test statistic. The effect of correlation between gene expression levels on the performance of multiple testing procedures is studied by computer simulations. Conclusions. Cross-validation represents a tool for reducing the set of initially selected genes to those with a sufficiently high selection frequency. Using cross-validation it is also possible to assess the variability of different performance indicators. Stability properties of several multiple testing procedures are described at length in the present paper.
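The selection-frequency idea in this abstract can be sketched in a few lines: repeatedly subsample the arrays, rank genes by a two-group t-statistic, and record how often each gene lands in the selected set. (The simulated data, the top-k selection rule, and the 80% subsampling fraction below are illustrative assumptions, not the paper's protocol.)

```python
import numpy as np

def selection_frequency(X, y, n_subsamples=50, top_k=10, seed=None):
    """Fraction of subsamples in which each feature's absolute two-group
    t-statistic ranks among the top_k."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=int(0.8 * n), replace=False)  # 80% subsample
        Xs, ys = X[idx], y[idx]
        g0, g1 = Xs[ys == 0], Xs[ys == 1]
        t = (g1.mean(0) - g0.mean(0)) / np.sqrt(
            g1.var(0, ddof=1) / len(g1) + g0.var(0, ddof=1) / len(g0))
        counts[np.argsort(-np.abs(t))[:top_k]] += 1
    return counts / n_subsamples

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 100))   # 40 arrays, 100 genes
y = np.repeat([0, 1], 20)        # two groups of 20
X[y == 1, :5] += 2.0             # genes 0-4 truly differential
freq = selection_frequency(X, y, seed=1)
print(freq[:5])                  # near 1.0 for the truly differential genes
```

Genes with a sufficiently high frequency survive; the rest, even with comparable adjusted p-values, are flagged as unstable.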
Simultaneous Inference: When Should Hypothesis Testing Problems Be Combined?
Abstract

Cited by 10 (3 self)
Modern statisticians are often presented with hundreds or thousands of hypothesis testing problems to evaluate at the same time, generated from new scientific technologies such as microarrays, medical and satellite imaging devices, or flow cytometry counters. The relevant statistical literature tends to begin with the tacit assumption that a single combined analysis, for instance a False Discovery Rate assessment, should be applied to the entire set of problems at hand. This can be a dangerous assumption, as the examples in the paper show, leading to overly conservative or overly liberal conclusions within any particular subclass of the cases. A simple Bayesian theory yields a succinct description of the effects of separation or combination on false discovery rate analyses. The theory allows efficient testing within small subclasses, and has applications to “enrichment”, the detection of multicase effects. Key Words: false discovery rates, two-class model, enrichment
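The combination-versus-separation effect the abstract warns about is easy to see numerically: applying Benjamini-Hochberg to a signal-rich subclass pooled with a mostly-null subclass is more conservative for the signal-rich class than analyzing it on its own. (The subclass sizes and the Beta-distributed p-values below are illustrative assumptions, not the paper's two-class model.)

```python
import numpy as np

def bh_reject(p, q=0.1):
    """Benjamini-Hochberg: boolean mask of hypotheses rejected at FDR level q."""
    p = np.asarray(p)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    k = below.nonzero()[0].max() + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

rng = np.random.default_rng(0)
p_a = rng.beta(0.2, 5, size=200)   # signal-rich subclass
p_b = rng.uniform(size=800)        # mostly-null subclass
combined = bh_reject(np.concatenate([p_a, p_b]))
separate = bh_reject(p_a)
# separate analysis of the enriched subclass rejects more of its cases
print(separate.sum(), combined[:200].sum())
```

Pooling dilutes the enriched class because the BH thresholds q·k/m shrink as the mostly-null hypotheses inflate m, which is the conservativeness the paper's Bayesian theory quantifies.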
Genome-wide requirements for resistance to functionally distinct DNA-damaging agents
 PLoS Genet
, 2005
Abstract

Cited by 10 (2 self)
The mechanistic and therapeutic differences in the cellular response to DNA-damaging compounds are not completely understood, despite intense study. To expand our knowledge of DNA damage, we assayed the effects of 12 closely related DNA-damaging agents on the complete pool of ~4,700 barcoded homozygous deletion strains of Saccharomyces cerevisiae. In our protocol, deletion strains are pooled together and grown competitively in the presence of compound. Relative strain sensitivity is determined by hybridization of PCR-amplified barcodes to an oligonucleotide array carrying the barcode complements. These screens identified genes in well-characterized DNA-damage-response pathways as well as genes whose role in the DNA-damage response had not been previously established. High-throughput individual growth analysis was used to independently confirm microarray results. Each compound produced a unique genome-wide profile. Analysis of these data allowed us to determine the relative importance of DNA-repair modules for resistance to each of the 12 profiled compounds. Clustering the data for 12 distinct compounds uncovered both known and novel functional interactions that comprise the DNA-damage response and allowed us to define the genetic determinants required for repair of interstrand crosslinks. Further genetic …
Feature selection in omics prediction problems using cat scores and false nondiscovery rate control
 Ann. Appl. Stat
, 2009
Abstract

Cited by 9 (4 self)
We revisit the problem of feature selection in linear discriminant analysis (LDA), i.e. when features are correlated. First, we introduce a pooled centroids formulation of the multiclass LDA predictor function, in which the relative weights of Mahalanobis-transformed predictors are given by correlation-adjusted t scores (cat scores). Second, for feature selection we propose thresholding cat scores by controlling false nondiscovery rates (FNDR). We show that, contrary to previous claims, this FNDR procedure performs very well, similarly to “higher criticism”. Third, training of the classifier function is conducted by plug-in of James-Stein shrinkage estimates of correlations and variances, using analytic procedures for choosing regularization parameters. Overall, this results in an effective and computationally inexpensive framework for high-dimensional prediction with natural feature selection. The proposed shrinkage discriminant procedures are implemented in the R package “sda” available from the R repository CRAN.
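The core of the cat score is decorrelation of the ordinary t-scores: tau = R^(-1/2) t, where R is the feature correlation matrix. A minimal sketch, omitting the James-Stein shrinkage the paper uses to estimate R in high dimensions:

```python
import numpy as np

def cat_scores(t, R):
    """Correlation-adjusted t-scores tau = R^(-1/2) t, with R the feature
    correlation matrix (no shrinkage; R must be positive definite)."""
    vals, vecs = np.linalg.eigh(R)   # eigendecomposition of symmetric R
    R_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return R_inv_sqrt @ t

# two strongly correlated features with identical ordinary t-scores
R = np.array([[1.0, 0.8], [0.8, 1.0]])
t = np.array([3.0, 3.0])
print(cat_scores(t, R))   # both scores shrink from 3.0 toward ~2.24
```

Decorrelation down-weights redundant correlated features, which is why thresholding cat scores selects features differently from thresholding raw t-scores.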
Inferring gene dependency networks from genomic longitudinal data: a functional data approach
 RevStat
, 2006
Abstract

Cited by 8 (2 self)
A key aim of systems biology is to unravel the regulatory interactions among genes and gene products in a cell. Here we investigate a graphical model that treats the observed gene expression over time as realizations of random curves. This approach is centered around an estimator of dynamical pairwise correlation that takes account of the functional nature of the observed data. This allows us to extend the graphical Gaussian modeling framework from i.i.d. data to the analysis of longitudinal genomic data. The new method is illustrated by analyzing highly replicated data from a genomic experiment concerning the expression response of human T-cells to PMA and ionomycin treatment. Key Words: graphical model, longitudinal data, dynamical correlation, gene dependency networks
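On a uniform time grid, the dynamical-correlation idea reduces in its simplest form to centering each observed curve over time, standardizing, and averaging the product across the grid; this is only a stand-in for the paper's functional estimator, which works on estimated random curves and averages over replicates.

```python
import numpy as np

def dynamical_correlation(x, y):
    """Correlation of two curves sampled on a common uniform time grid:
    center each curve over time, then take the normalized average product
    (simplified sketch of a dynamical pairwise correlation)."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).mean() / np.sqrt((xc ** 2).mean() * (yc ** 2).mean())

t = np.linspace(0.0, 1.0, 200)
r_same = dynamical_correlation(np.sin(2 * np.pi * t), np.sin(2 * np.pi * t))
r_orth = dynamical_correlation(np.sin(2 * np.pi * t), np.cos(2 * np.pi * t))
print(r_same, r_orth)   # ~1 for identical curves, ~0 for quarter-phase-shifted ones
```

Two genes whose expression curves rise and fall together score near 1, which is the pairwise quantity the gene dependency network is built from.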
Rodeo: Sparse nonparametric regression in high dimensions
 in Advances in Neural Information Processing Systems (NIPS)
, 2005
Abstract

Cited by 8 (2 self)
We present a method for simultaneously performing bandwidth selection and variable selection in nonparametric regression. The method starts with a local linear estimator with large bandwidths, and incrementally decreases the bandwidth in directions where the gradient of the estimator with respect to bandwidth is large. When the unknown function satisfies a sparsity condition, the approach avoids the curse of dimensionality. The method—called rodeo (regularization of derivative expectation operator)—conducts a sequence of hypothesis tests, and is easy to implement. A modified version that replaces testing with soft thresholding may be viewed as solving a sequence of lasso problems. When applied in one dimension, the rodeo yields a method for choosing the locally optimal bandwidth.
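The greedy bandwidth scheme can be caricatured in one dimension: shrink the bandwidth while the derivative of the fit with respect to the bandwidth stays large, and stop once it is small. (Assumptions: a Nadaraya-Watson estimator stands in for the paper's local linear fit, and a fixed threshold lam for its variance-calibrated hypothesis test.)

```python
import numpy as np

def rodeo_bandwidth(x0, X, Y, h0=1.0, beta=0.9, h_min=0.05, lam=0.02):
    """Toy one-dimensional rodeo step at the point x0."""
    def fit(h):
        w = np.exp(-0.5 * ((X - x0) / h) ** 2)  # Gaussian kernel weights
        return (w * Y).sum() / w.sum()          # kernel regression estimate
    h = h0
    while h * beta > h_min:
        dh = 1e-4 * h
        Z = (fit(h + dh) - fit(h - dh)) / (2 * dh)  # d fit / d h
        if abs(Z) < lam:   # derivative small: smoothing no longer biases the fit
            break
        h *= beta          # derivative large: keep shrinking the bandwidth
    return h

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 300)
h_flat = rodeo_bandwidth(0.0, X, np.zeros(300))   # irrelevant direction
h_curved = rodeo_bandwidth(0.0, X, X ** 2)        # relevant direction
print(h_flat, h_curved)   # flat target keeps the large starting bandwidth
```

An irrelevant direction retains a large bandwidth (it is effectively averaged out), while a direction the function depends on is driven to a small bandwidth, which is the mechanism behind the rodeo's implicit variable selection.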
Estimating high-dimensional intervention effects from observational data
 Ann. Stat.
Abstract

Cited by 8 (2 self)
We assume that we have observational data generated from an unknown underlying directed acyclic graph (DAG) model. A DAG is typically not identifiable from observational data, but it is possible to consistently estimate the equivalence class of a DAG. Moreover, for any given DAG, causal effects can be estimated using intervention calculus. In this paper, we combine these two parts. For each DAG in the estimated equivalence class, we use intervention calculus to estimate the causal effects of the covariates on the response. This yields a collection of estimated causal effects for each covariate. We show that the distinct values in this set can be consistently estimated by an algorithm that uses only local information of the graph. This local approach is computationally fast and feasible in high-dimensional problems. We propose to use summary measures of the set of possible causal effects to determine variable importance. In particular, we use the minimum absolute value of this set, since that is a lower bound on the size of the causal effect. We demonstrate the merits of our methods in a simulation study and on a data set about riboflavin production.
To how many simultaneous hypothesis tests can the normal, Student’s t, or bootstrap calibration be applied?
, 2007
Abstract

Cited by 7 (2 self)
ABSTRACT. In the analysis of microarray data, and in some other contemporary statistical problems, it is not uncommon to apply hypothesis tests in a highly simultaneous way. The number, N say, of tests used can be much larger than the sample sizes, n, to which the tests are applied, yet we wish to calibrate the tests so that the overall level of the simultaneous test is accurate. Often the sampling distribution is quite different for each test, so there may not be an opportunity for combining data across samples. In this setting, how large can N be, as a function of n, before level accuracy becomes poor? In the present paper we answer this question in cases where the statistic under test is of Student’s t type. We show that if either the Normal or the Student’s t distribution is used for calibration then the level of the simultaneous test is accurate provided log N increases at a strictly slower rate than n^(1/3) as n diverges. On the other hand, if bootstrap methods are used for calibration then we may choose log N almost as large as n^(1/2) and still achieve asymptotic level accuracy. The implications of these results are explored both theoretically and numerically. KEYWORDS. Bonferroni’s inequality, Edgeworth expansion, genetic data, large-deviation expansion, level accuracy, microarray data, quantile estimation, skewness, Student’s t statistic.
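The two growth conditions differ enormously in practice. Evaluating exp(n^(1/3)) and exp(n^(1/2)) for a few sample sizes shows how many more simultaneous tests bootstrap calibration can support; this is an order-of-magnitude illustration only, ignoring the slowly-varying factors in the actual theorems.

```python
import math

def max_tests(n, exponent):
    """Order of magnitude of the largest number of tests N for which
    calibration stays accurate when log N grows like n**exponent."""
    return math.exp(n ** exponent)

for n in (100, 1000, 10000):
    print(f"n={n:>6}:  t/Normal N ~ {max_tests(n, 1/3):.3g},"
          f"  bootstrap N ~ {max_tests(n, 1/2):.3g}")
```

For n = 1000 arrays per test, normal or Student's t calibration tolerates on the order of tens of thousands of tests, while bootstrap calibration tolerates many orders of magnitude more.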
Using regularized dynamic correlation to infer gene dependency networks from time-series microarray data
 In Proceedings of the 4th International Workshop on Computational Systems Biology (WCSB 2006)
, 2006
Abstract

Cited by 7 (1 self)
Graphical models allow us to understand regulatory interactions among genes and gene products in a cell, and hence contribute to an enhanced understanding of systems biology. Here we investigate a graphical model that treats the observed gene expression over time as realizations of random curves. This approach is centered around a regularized estimator of dynamical pairwise correlation that takes account of the functional nature of the observed data. The new method is illustrated by analyzing highly replicated gene expression time series data.