The control of the false discovery rate in multiple testing under dependency
Annals of Statistics, 2001. Cited by 267 (3 self)
Benjamini and Hochberg suggest that the false discovery rate may be the appropriate error rate to control in many applied multiple testing problems. A simple procedure was given there as an FDR controlling procedure for independent test statistics and was shown to be much more powerful than comparable procedures which control the traditional familywise error rate. We prove that this same procedure also controls the false discovery rate when the test statistics have positive regression dependency on each of the test statistics corresponding to the true null hypotheses. This condition for positive dependency is general enough to cover many problems of practical interest, including the comparisons of many treatments with a single control, multivariate normal test statistics with positive correlation matrix and multivariate t. Furthermore, the test statistics may be discrete, and the tested hypotheses composite without posing special difficulties. For all other forms of dependency, a simple conservative modification of the procedure controls the false discovery rate. Thus the range of problems for which
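As a rough illustration of the step-up procedure this abstract refers to, here is a minimal Python sketch. It assumes the usual form of the rule: reject the k smallest p-values, where k is the largest rank with p_(k) <= k*q/m. For the "simple conservative modification" it assumes the threshold is divided by the harmonic sum 1 + 1/2 + ... + 1/m. The function name `fdr_reject` and its parameters are hypothetical, chosen for this example only.

```python
def fdr_reject(p_values, q=0.05, arbitrary_dependence=False):
    """Step-up FDR procedure; returns a list of booleans (True = reject).

    Hypothetical helper for illustration, not code from the paper.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Assumed conservative factor sum_{i=1..m} 1/i for arbitrary dependence;
    # equal to 1 for the plain procedure.
    c = sum(1.0 / i for i in range(1, m + 1)) if arbitrary_dependence else 1.0
    # Indices of the hypotheses, ordered by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k whose p-value falls under the line k * q / (m * c).
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / (m * c):
            k_max = rank
    # Step-up: reject every hypothesis ranked at or below k_max.
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


if __name__ == "__main__":
    p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(fdr_reject(p))
    print(fdr_reject(p, arbitrary_dependence=True))
```

On these ten made-up p-values the plain rule rejects two hypotheses and the conservative variant rejects one, which is the expected direction of the trade-off.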
EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis
J. Neurosci. Methods. Cited by 129 (16 self)
We have developed a toolbox and graphic user interface, EEGLAB, running under the cross-platform MATLAB environment (The Mathworks, Inc.) for processing collections of single-trial and/or averaged EEG data of any number of channels. Available functions include EEG data, channel and event information importing, data visualization (scrolling, scalp map and dipole model plotting, plus multi-trial ERP-image plots), preprocessing (including artifact rejection, filtering, epoch selection, and averaging), Independent Component Analysis (ICA) and time/frequency decompositions including channel and component cross-coherence supported by bootstrap statistical methods based on data resampling. EEGLAB functions are organized into three layers. Top-layer functions allow users to interact with the data through the graphic interface without needing to use MATLAB syntax. Menu options allow users to tune the behavior of EEGLAB to available memory. Middle-layer functions allow users to customize data processing using command history and interactive ‘pop’ functions. Experienced MATLAB users can use EEGLAB data structures and stand-alone signal processing functions to write custom and/or batch analysis scripts. Extensive function help and tutorial information are included. A ‘plug-in’ facility allows easy incorporation of new EEG modules into the main menu. EEGLAB is freely available
Statistical Comparisons of Classifiers over Multiple Data Sets
2006. Cited by 120 (0 self)
While methods for comparing two learning algorithms on a single data set have been scrutinized for quite some time already, the issue of statistical tests for comparisons of more algorithms on multiple data sets, which is even more essential to typical machine learning studies, has been all but ignored. This article reviews the current practice and then theoretically and empirically examines several suitable tests. Based on that, we recommend a set of simple, yet safe and robust non-parametric tests for statistical comparisons of classifiers: the Wilcoxon signed ranks test for comparison of two classifiers and the Friedman test with the corresponding post-hoc tests for comparison of more classifiers over multiple data sets. Results of the latter can also be neatly presented with the newly introduced CD (critical difference) diagrams.
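To make the recommended Friedman test concrete, the sketch below computes per-classifier mean ranks and the basic Friedman statistic, assuming the textbook form chi2_F = 12N / (k(k+1)) * (sum_j R_j^2 - k(k+1)^2 / 4), where N is the number of data sets, k the number of classifiers, and R_j the mean rank of classifier j. The helper names are hypothetical, and no tie correction is applied to the statistic itself.

```python
def average_ranks(scores):
    """Rank one data set's scores (highest score gets rank 1), averaging ties."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Extend j over the run of scores tied with position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # mean of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def friedman_statistic(table):
    """table[i][j] = score of classifier j on data set i.

    Returns (mean_ranks, chi2), using the basic formula without tie correction.
    """
    n = len(table)
    k = len(table[0])
    sums = [0.0] * k
    for row in table:
        for j, r in enumerate(average_ranks(row)):
            sums[j] += r
    mean_ranks = [s / n for s in sums]
    chi2 = 12.0 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4.0
    )
    return mean_ranks, chi2


if __name__ == "__main__":
    # Made-up accuracies: 4 data sets (rows) x 3 classifiers (columns).
    accuracies = [
        [0.90, 0.85, 0.80],
        [0.75, 0.70, 0.65],
        [0.88, 0.82, 0.79],
        [0.95, 0.91, 0.90],
    ]
    print(friedman_statistic(accuracies))
```

Under the null hypothesis the statistic is usually compared against a chi-square distribution with k - 1 degrees of freedom; that lookup is left out here to keep the sketch dependency-free.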
A three dimensional statistical analysis for CBF activation studies in human brain
Journal of Cerebral Blood Flow and Metabolism, 1992. Cited by 84 (27 self)
Many studies of brain function with positron emission tomography (PET) involve the interpretation of a subtracted PET image, usually the difference between two images under baseline and stimulation conditions. The purpose of these studies is to see which areas of the brain are activated by the stimulation condition. In many cognitive studies, the activation is so slight that the experiment must be repeated on several subjects and the subtracted images are averaged to improve the signal to noise ratio. The averaged image is then standardized to have unit variance and then searched for local maxima (Fox et al., 1988). The main problem facing investigators is which of these local maxima are statistically significant. We describe a simple method for determining an approximate p-value for the global maximum based on the theory of Gaussian random fields as developed by Adler and Hasofer (1976) and Adler (1981). The p-value is proportional to the volume searched divided by the product of the FWHMs of the image reconstruction process, or number of resolution elements (resels). Rather than working with local maxima as in Fox et al. (1988), our method focuses on the Euler characteristic of the set of voxels with a value larger than a given threshold. The Euler characteristic depends only on the topology of the regions of high activation, irrespective
Resampling-Based Multiple Testing for Microarray Data Analysis
2003. Cited by 40 (0 self)
The burgeoning field of genomics has revived interest in multiple testing procedures by raising new methodological and computational challenges. For example, microarray experiments generate large multiplicity problems in which thousands of hypotheses are tested simultaneously. In their 1993 book, Westfall & Young propose resampling-based p-value adjustment procedures which are highly relevant to microarray experiments. This article discusses different criteria for error control in resampling-based multiple testing, including (a) the familywise error rate of Westfall & Young (1993) and (b) the false discovery rate developed by Benjamini & Hochberg (1995), both from a frequentist viewpoint; and (c) the positive false discovery rate of Storey (2002), which has a Bayesian motivation. We also introduce our recently developed fast algorithm for implementing the minP adjustment to control the familywise error rate. Adjusted p-values for different approaches are applied to gene expression data from two recently published microarray studies. The properties of these procedures for multiple testing are compared.
Characterizing gene sets with FuncAssociate
Bioinformatics, 2003. Cited by 36 (0 self)
Summary: FuncAssociate is a web-based tool to help researchers use Gene Ontology attributes to characterize large sets of genes derived from experiment. Distinguishing features of FuncAssociate include the ability to handle ranked input lists, and a Monte Carlo simulation approach that is more appropriate to determine significance than other methods, such as Bonferroni or Šidák p-value correction. FuncAssociate currently supports 10 organisms (Vibrio cholerae, Shewanella oneidensis, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Arabidopsis thaliana,
A linear non-gaussian acyclic model for causal discovery
J. Machine Learning Research, 2006. Cited by 33 (16 self)
In recent years, several methods have been proposed for the discovery of causal structure from non-experimental data. Such methods make various assumptions on the data generating process to facilitate its identification from purely observational data. Continuing this line of research, we show how to discover the complete causal structure of continuous-valued data, under the assumptions that (a) the data generating process is linear, (b) there are no unobserved confounders, and (c) disturbance variables have non-Gaussian distributions of non-zero variances. The solution relies on the use of the statistical method known as independent component analysis, and does not require any pre-specified time-ordering of the variables. We provide a complete Matlab package for performing this LiNGAM analysis (short for Linear Non-Gaussian Acyclic Model), and demonstrate the effectiveness of the method using artificially generated data and real-world data.
Controlling the familywise error rate in functional neuroimaging: a comparative review
Statistical Methods in Medical Research, 2003. Cited by 31 (3 self)
Functional neuroimaging data embodies a massive multiple testing problem, where 100 000 correlated test statistics must be assessed. The familywise error rate, the chance of any false positives, is the standard measure of Type I errors in multiple testing. In this paper we review and evaluate three approaches to thresholding images of test statistics: Bonferroni, random field and the permutation test. Owing to recent developments, improved Bonferroni procedures, such as Hochberg’s methods, are now applicable to dependent data. Continuous random field methods use the smoothness of the image to adapt to the severity of the multiple testing problem. Also, increased computing power has made both permutation and bootstrap methods applicable to functional neuroimaging. We evaluate these approaches on t images using simulations and a collection of real datasets. We find that Bonferroni-related tests offer little improvement over Bonferroni, while the permutation method offers substantial improvement over the random field method for low smoothness and low degrees of freedom. We also show the limitations of trying to find an equivalent number of independent tests for an image of correlated test statistics.
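For readers unfamiliar with the Bonferroni family mentioned above, the following Python sketch contrasts plain Bonferroni with a Hochberg-style step-up rule. It assumes the standard definitions: Bonferroni rejects when p <= alpha/m, and the step-up rule rejects the k smallest p-values, where k is the largest rank with p_(k) <= alpha/(m - k + 1). Function names are hypothetical, and nothing here reproduces the paper's random field or permutation methods.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Single-step Bonferroni: compare every p-value with alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def hochberg_reject(p_values, alpha=0.05):
    """Hochberg-style step-up rule (illustrative helper, assumed standard form)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k with p_(k) <= alpha / (m - k + 1).
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha / (m - rank + 1):
            k_max = rank
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


if __name__ == "__main__":
    p = [0.01, 0.02, 0.03, 0.04]  # made-up p-values
    print(bonferroni_reject(p))
    print(hochberg_reject(p))
```

With only four tests the step-up rule rejects noticeably more than Bonferroni; with very many tests and few small p-values the two tend to agree, which fits the abstract's remark that Bonferroni-related tests offer little improvement.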
Discovering significant patterns
2007. Cited by 25 (3 self)
Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patterns that appear due to chance alone to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard pattern discovery techniques can discover numerous spurious patterns when applied to random data and when applied to real-world data result in large numbers of patterns that are rejected when subjected to sound statistical evaluation. They also reveal that a number of pragmatic choices about how such tests are performed can greatly affect their power.
Detecting differentially expressed genes in microarrays using Bayesian model selection
J. Amer. Statist. Assoc., 2003. Cited by 22 (7 self)
DNA microarrays open up a broad new horizon for investigators interested in studying the genetic determinants of disease. The high throughput nature of these arrays, where differential expression for thousands of genes can be measured simultaneously, creates an enormous wealth of information, but also poses a challenge for data analysis because of the large multiple testing problem involved. The solution has generally been to focus on optimizing false-discovery rates while sacrificing power. The drawback of this approach is that more subtle expression differences will be missed that might give investigators more insight into the genetic environment necessary for a disease process to take hold. We introduce a new method for detecting differentially expressed genes based on a high-dimensional model selection technique, Bayesian ANOVA for microarrays (BAM), which strikes a balance between false rejections and false nonrejections. The basis of the new approach involves a weighted average of generalized ridge regression estimates that provides the benefits of using shrinkage estimation combined with model averaging. A simple graphical tool based on the amount of shrinkage is developed to visualize the trade-off between low false-discovery rates and finding more genes. Simulations are used to illustrate BAM’s performance, and the method is applied to a large database of colon cancer gene expression data. Our working hypothesis in the colon cancer analysis is that large differential expressions may not be the only ones contributing to metastasis; in fact, moderate changes in expression of genes may be involved in modifying the genetic environment to a sufficient extent for metastasis to occur. A functional biological analysis of gene effects found by BAM, but not other false-discovery-based approaches, lends support to this hypothesis.

