Multicriteria gene screening for analysis of differential expression with DNA microarrays
- EURASIP Journal on Applied Signal Processing, 2004
Cited by 10 (5 self)
Abstract: This paper introduces a statistical methodology for identifying differentially expressed genes in DNA microarray experiments based on multiple criteria. These criteria are: false discovery rate (FDR); variance-normalized differential expression levels (paired t statistics); and minimum acceptable difference (MAD). The methodology also provides a set of simultaneous FDR confidence intervals on the true expression differences. The analysis can be implemented as a two-stage algorithm: an initial screen that controls only FDR, followed by a second screen that controls both FDR and MAD. It can also be implemented by computing and thresholding the set of FDR p-values for each gene that satisfies the MAD criterion. We illustrate the procedure by identifying differentially expressed genes in a wild-type vs. knockout comparison of microarray data.
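The two-stage screen in this abstract can be sketched as follows. This is a simplified illustration, not the paper's procedure: it assumes per-gene p-values (for example from paired t statistics) and mean expression differences are already computed, uses the Benjamini-Hochberg step-up rule for the FDR stage, and replaces the paper's FDR confidence intervals with a plain threshold on the observed difference. The function names and the `q` and `mad` parameters are hypothetical.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans: True where the BH rule rejects at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k (1-based) with p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            k_max = rank
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


def two_stage_screen(pvalues, mean_diffs, q=0.05, mad=1.0):
    """Stage 1: FDR screen. Stage 2: keep genes whose |difference| >= mad.

    Assumption: a simple threshold stands in for the confidence-interval
    based MAD criterion described in the abstract.
    """
    stage1 = benjamini_hochberg(pvalues, q)
    return [i for i, ok in enumerate(stage1) if ok and abs(mean_diffs[i]) >= mad]


# Toy data (made up for illustration): five genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
diffs = [2.0, 0.3, 1.5, 0.1, 3.0]
selected = two_stage_screen(pvals, diffs, q=0.05, mad=1.0)
```

In this toy input, genes 0 and 1 pass the FDR stage, and only gene 0 also meets the minimum difference, so `selected` contains a single index.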
On optimality of stepdown and stepup multiple test procedures
- Ann. Statist, 2005
Cited by 8 (2 self)
Consider the multiple testing problem of testing k null hypotheses, where the unknown family of distributions is assumed to satisfy a certain monotonicity assumption. Attention is restricted to procedures that control the familywise error rate in the strong sense and which satisfy a monotonicity condition. Under these assumptions, we prove certain maximin optimality results for some well-known stepdown and stepup procedures.
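To make the stepdown/stepup distinction concrete, here is a minimal sketch of two familiar procedures. The abstract does not name the procedures it analyzes, so Holm (stepdown) and Hochberg (stepup) are chosen here purely as assumed examples; both compare the ordered p-values with the same critical values alpha / (m - i + 1), but they scan in opposite directions. Hochberg's procedure additionally relies on dependence assumptions (such as independence) that Holm's does not need.

```python
def holm_stepdown(pvalues, alpha=0.05):
    """Holm: start at the smallest p-value and stop at the first failure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, idx in enumerate(order):  # rank 0 is the smallest p-value
        if pvalues[idx] > alpha / (m - rank):
            break  # this and all larger p-values are retained
        reject[idx] = True
    return reject


def hochberg_stepup(pvalues, alpha=0.05):
    """Hochberg: start at the largest p-value and stop at the first success."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank in range(m - 1, -1, -1):  # rank m-1 is the largest p-value
        if pvalues[order[rank]] <= alpha / (m - rank):
            for idx in order[: rank + 1]:
                reject[idx] = True  # reject this and all smaller p-values
            break
    return reject


# Toy p-values (made up) where the two procedures disagree.
p = [0.01, 0.04, 0.03, 0.045]
holm_result = holm_stepdown(p)
hochberg_result = hochberg_stepup(p)
```

On these toy p-values Holm rejects only the first hypothesis, while Hochberg rejects all four because the largest p-value already falls below alpha.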
On the Statistical Comparison of Inductive Learning Methods
- In D. Fisher & H.-J. Lenz (Eds.), Learning from Data: Artificial Intelligence and Statistics V, 1996
Cited by 5 (0 self)
Experimental comparisons between statistical and machine learning methods appear with increasing frequency in the literature. However, there does not seem to be a consensus on how such a comparison should be performed in a methodologically sound way. In particular, the effect of testing multiple hypotheses on the probability of producing a "false alarm" is often ignored. We transfer multiple comparison procedures from the statistical literature to the type of study discussed in this paper. These testing procedures take the number of tests performed into account, thereby controlling the probability of generating "false alarms". The multiple comparison procedures selected are illustrated on well-known regression and classification data sets.
26.1 Introduction. Recent interactions between the statistical and artificial intelligence communities (see e.g. [Han93, CO94]) have led to many studies that compare the performance of empirical statistical and machine learning methods on real-life data sets; ...
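The "false alarm" effect mentioned in this abstract can be illustrated with a short calculation. The sketch below assumes k independent tests, each run at level alpha with all null hypotheses true; the independence assumption and the function names are illustrative choices, not taken from the paper.

```python
def familywise_error_rate(alpha, k):
    """Probability of at least one false alarm in k independent level-alpha tests."""
    return 1.0 - (1.0 - alpha) ** k


def bonferroni_level(alpha, k):
    """Per-test level that keeps the familywise error rate at or below alpha."""
    return alpha / k


# Ten comparisons, each at the 5% level.
uncorrected = familywise_error_rate(0.05, 10)
corrected = familywise_error_rate(bonferroni_level(0.05, 10), 10)
```

With ten uncorrected tests the chance of at least one false alarm is roughly 0.40, whereas testing each at the Bonferroni level of 0.005 keeps it below 0.05.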
Data snooping, dredging and fishing: The dark side of data mining. A SIGKDD99 panel report
- SIGKDD Explorations, 2000
Cited by 5 (0 self)
This article briefly describes a panel discussion at SIGKDD99.
SOME NON-ASYMPTOTIC RESULTS ON RESAMPLING IN HIGH DIMENSION, I: CONFIDENCE REGIONS
- SUBMITTED TO THE ANNALS OF STATISTICS, 2009
Cited by 5 (0 self)
We study generalized bootstrap confidence regions for the mean of a random vector whose coordinates have an unknown dependency structure. The random vector is supposed to be either Gaussian or to have a symmetric and bounded distribution. The dimensionality
Robust non-negative matrix factorization analysis of microarray data
- Bioinformatics
Cited by 3 (0 self)
Motivation: Modern methods like microarrays, proteomics and metabolomics often produce data sets where there are many more predictor variables than observations. Research in these areas is often exploratory; even so, there is interest in statistical methods that accurately point to effects that are likely to replicate. Correlations among predictors are used to improve the statistical analysis. We exploit two ideas: nonnegative matrix factorization methods that create ordered sets of predictors; and statistical testing within ordered sets, which is done sequentially, removing the need for correction for multiple testing within the set. Results: Simulations and theory point to increased statistical power. Computational algorithms are described in detail. The analysis and biological interpretation of a real data set are given. In addition to the increased power, the benefit of our method is that the organized gene lists are likely to lead to a better understanding of the biology. Availability: A SAS JMP executable script is available from
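The sequential testing idea in this abstract resembles a fixed-sequence procedure, sketched below under assumptions: the ordering of predictors is taken as given (in the paper it comes from the matrix factorization, which is not reproduced here), each hypothesis is tested at the full level alpha, and testing stops at the first non-rejection. The function name is hypothetical.

```python
def fixed_sequence_test(ordered_pvalues, alpha=0.05):
    """Test hypotheses in a pre-specified order, each at level alpha.

    Returns how many leading hypotheses are rejected. Testing stops at the
    first p-value above alpha, so later small p-values are never examined.
    """
    rejected = 0
    for p in ordered_pvalues:
        if p > alpha:
            break
        rejected += 1
    return rejected


# Toy p-values (made up), already in the pre-specified order.
n_rejected = fixed_sequence_test([0.01, 0.03, 0.2, 0.001])
```

Here the first two hypotheses are rejected and the fourth is not reached even though its p-value is small; that stopping rule is what allows each test to use the unadjusted level.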
NON-ASYMPTOTIC RESAMPLING-BASED CONFIDENCE REGIONS AND MULTIPLE TESTS IN HIGH DIMENSION
Cited by 3 (0 self)
Abstract. We study generalized bootstrapped confidence regions for the mean of a random vector whose coordinates have an unknown dependence structure. The dimensionality of the vector can possibly be much larger than the number of observations and we focus on a non-asymptotic control of the confidence level. The random vector is supposed to be either Gaussian or to have a symmetric bounded distribution. We consider two approaches, the first based on a concentration principle and the second on a direct bootstrapped quantile. The first one allows us to deal with a very large class of resampling weights while our results for the second are specific to Rademacher weights. We present an application of these results to the one-sided and two-sided multiple testing problem, in which we derive several resampling-based step-down procedures providing a non-asymptotic FWER control. We compare our different procedures in a simulation study, and we show that they can outperform Bonferroni’s or Holm’s procedures as soon as the observed vector has sufficiently correlated coordinates.
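The direct quantile approach with Rademacher weights can be sketched roughly as follows. This is an assumption-laden illustration rather than the paper's construction: it resamples the centered data with random signs, takes the empirical (1 - alpha) quantile of the sup-norm of the reweighted mean, and omits any correction terms that a non-asymptotic guarantee may require. The function name and parameters are hypothetical.

```python
import math
import random


def rademacher_sup_quantile(data, alpha=0.05, n_draws=1000, seed=0):
    """Return (coordinate means, resampled quantile of the sup-norm deviation).

    data: list of n observations, each a list of K coordinates.
    """
    rng = random.Random(seed)
    n, k = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(k)]
    centered = [[row[j] - means[j] for j in range(k)] for row in data]
    stats = []
    for _ in range(n_draws):
        # One Rademacher sign per observation, shared across coordinates,
        # so the dependence between coordinates is preserved.
        signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        sup = max(
            abs(sum(s * row[j] for s, row in zip(signs, centered)) / n)
            for j in range(k)
        )
        stats.append(sup)
    stats.sort()
    idx = min(n_draws - 1, math.ceil((1 - alpha) * n_draws) - 1)
    return means, stats[idx]


# Toy data (made up): two observations of a two-dimensional vector.
means, threshold = rademacher_sup_quantile([[0.0, 0.0], [2.0, 4.0]])
```

A candidate confidence region is then the set of vectors whose sup-norm distance from `means` is at most `threshold`.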
Using Gene Ontology on genome-scale studies to find significant associations of biologically relevant terms to group of genes
- In Neural Networks for Signal Processing XIII. IEEE, 2003
Cited by 1 (1 self)
The pdf and the html versions of this paper (and related ones) are available from
Efficient Algorithms for Genome-Wide Association Study
Cited by 1 (1 self)
Studying the association between a quantitative phenotype (such as height or weight) and single nucleotide polymorphisms (SNPs) is an important problem in biology. To understand the underlying mechanisms of complex phenotypes, it is often necessary to consider joint genetic effects across multiple SNPs. The ANOVA (analysis of variance) test is routinely used in association studies. Important findings from studying gene-gene (SNP-pair) interactions are appearing in the literature. However, the number of SNPs can run into the millions, so evaluating joint effects of SNPs is a challenging task even for SNP-pairs. Moreover, because many SNPs are correlated, a permutation procedure is preferred over a simple Bonferroni correction for properly controlling the family-wise error rate and retaining mapping power, which dramatically increases the computational cost of an association study. In this article, we study the problem of finding SNP-pairs that have significant associations with a given quantitative phenotype. We propose an efficient algorithm, FastANOVA, for performing ANOVA tests on SNP-pairs in a batch mode, which also supports large permutation tests. We derive an upper bound on the SNP-pair ANOVA test, which can be expressed as the sum of two terms. The first term is based on the single-SNP ANOVA test. The second term is based on the SNPs and is independent of any phenotype permutation. Furthermore, SNP-pairs can be organized into groups, each of which shares a common upper bound. This allows for maximum reuse of intermediate computation, efficient upper bound estimation, and effective SNP-pair pruning. Consequently, FastANOVA only needs to perform the ANOVA test on a small number of candidate SNP-pairs without the risk of missing any significant ones. Extensive experiments demonstrate that FastANOVA is orders of magnitude faster than the brute-force implementation of ANOVA tests on all SNP pairs.
The principles used in FastANOVA can be applied to categorical phenotypes and other statistics such as the Chi-square test.
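The pruning principle described in this abstract can be sketched generically. The code below is not FastANOVA: the bound and the exact statistic are passed in as hypothetical callbacks, and the sketch only shows why a valid upper bound lets the search skip candidates without missing any significant one.

```python
def pruned_search(candidates, upper_bound, exact_statistic, critical_value):
    """Evaluate the exact statistic only where the upper bound allows significance.

    upper_bound(c) must never be smaller than exact_statistic(c); under that
    assumption, skipping candidates with a bound below critical_value cannot
    discard a significant candidate.
    Returns (list of (candidate, statistic), number of exact evaluations).
    """
    significant = []
    n_exact = 0
    for cand in candidates:
        if upper_bound(cand) < critical_value:
            continue  # pruned: the exact statistic cannot reach the threshold
        n_exact += 1
        stat = exact_statistic(cand)
        if stat >= critical_value:
            significant.append((cand, stat))
    return significant, n_exact


# Toy statistics (made up) for three SNP pairs; the bound adds a slack of 1.0.
toy_stats = {("a", "b"): 1.0, ("a", "c"): 5.0, ("b", "c"): 2.5}
hits, evaluated = pruned_search(
    list(toy_stats),
    upper_bound=lambda pair: toy_stats[pair] + 1.0,
    exact_statistic=lambda pair: toy_stats[pair],
    critical_value=3.0,
)
```

In the toy run one pair is pruned by its bound, two are evaluated exactly, and one is reported as significant. Per the abstract, the real algorithm additionally shares one bound across a group of pairs so the bound itself is cheap to obtain.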
A systematic comparison and evaluation of biclustering methods for gene expression data
- 2005
Cited by 1 (0 self)
Motivation: In recent years, there have been various efforts to overcome the limitations of standard clustering approaches for the analysis of gene expression data by grouping genes and samples simultaneously. The underlying concept, which is often referred to as biclustering, allows the identification of sets of genes sharing compatible expression patterns across subsets of samples, and its usefulness has been demonstrated for different organisms and data sets. Several biclustering methods have been proposed in the literature; however, it is not clear how the different techniques compare to each other with respect to the biological relevance of the clusters as well as to other characteristics such as robustness and sensitivity to noise. Accordingly, no guidelines concerning the choice of the biclustering method are currently available. Results: First, this paper provides a methodology for comparing and

