Results 1–10 of 75
Empirical Bayes estimates for large-scale prediction problems
, 2008
"... Classical prediction methods such as Fisher’s linear discriminant function were designed for smallscale problems, where the number of predictors N is much smaller than the number of observations n. Modern scientific devices often reverse this situation. A microarray analysis, for example, might inc ..."
Abstract

Cited by 37 (4 self)
Classical prediction methods such as Fisher’s linear discriminant function were designed for small-scale problems, where the number of predictors N is much smaller than the number of observations n. Modern scientific devices often reverse this situation. A microarray analysis, for example, might include n = 100 subjects measured on N = 10,000 genes, each of which is a potential predictor. This paper proposes an empirical Bayes approach to large-scale prediction, where the optimum Bayes prediction rule is estimated employing the data from all the predictors. Microarray examples are used to illustrate the method. The results show a close connection with the shrunken centroids algorithm of Tibshirani et al. (2002), a frequentist regularization approach to large-scale prediction, and also with false discovery rate theory.
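The shrunken centroids idea the abstract compares against can be sketched briefly. This is an illustrative simplification (overall rather than pooled within-class standard deviation, no s0 fudge offset), not the authors' implementation; soft-thresholding the standardized centroid offsets zeroes out uninformative genes.

```python
import numpy as np

def shrunken_centroids(X, y, delta):
    """Soft-thresholded class centroids (nearest shrunken centroids sketch).

    Each class centroid's standardized offset from the overall centroid
    is soft-thresholded by delta; genes whose offset falls below delta
    are dropped from that class's centroid. Classification would then
    assign a sample to the class with the nearest shrunken centroid.
    """
    overall = X.mean(axis=0)
    s = X.std(axis=0, ddof=1) + 1e-8   # simplified scale estimate
    n = len(X)
    cents = {}
    for k in np.unique(y):
        Xk = X[y == k]
        mk = np.sqrt(max(1.0 / len(Xk) - 1.0 / n, 1e-12))
        d = (Xk.mean(axis=0) - overall) / (mk * s)
        d = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)  # soft threshold
        cents[k] = overall + mk * s * d
    return cents
```

With a very large delta every offset is thresholded to zero and all class centroids collapse onto the overall centroid, which is the mechanism that performs feature selection.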
Feature selection in omics prediction problems using cat scores and false non-discovery rate control
 Ann. Appl. Stat
, 2009
"... We revisit the problem of feature selection in linear discriminant analysis (LDA), i.e. when features are correlated. First, we introduce a pooled centroids formulation of the multiclass LDA predictor function, in which the relative weights of Mahalanobistranformed predictors are given by correlat ..."
Abstract

Cited by 21 (11 self)
We revisit the problem of feature selection in linear discriminant analysis (LDA), i.e. when features are correlated. First, we introduce a pooled centroids formulation of the multiclass LDA predictor function, in which the relative weights of Mahalanobis-transformed predictors are given by correlation-adjusted t-scores (cat scores). Second, for feature selection we propose thresholding cat scores by controlling false non-discovery rates (FNDR). We show that, contrary to previous claims, this FNDR procedure performs very well, similar to “higher criticism”. Third, training of the classifier function is conducted by plug-in of James–Stein shrinkage estimates of correlations and variances, using analytic procedures for choosing regularization parameters. Overall, this results in an effective and computationally inexpensive framework for high-dimensional prediction with natural feature selection. The proposed shrinkage discriminant procedures are implemented in the R package “sda” available from the R repository CRAN.
Gene ranking and biomarker discovery under correlation. Bioinformatics
"... Motivation: Biomarker discovery and gene ranking is a standard task in genomic high throughput analysis.Typically, the ordering of markers is based on a stabilized variant of the tscore, such as the moderated t or the SAM statistic. However, these procedures ignore genegene correlations, which may ..."
Abstract

Cited by 16 (3 self)
Motivation: Biomarker discovery and gene ranking is a standard task in genomic high-throughput analysis. Typically, the ordering of markers is based on a stabilized variant of the t-score, such as the moderated t or the SAM statistic. However, these procedures ignore gene–gene correlations, which may have a profound impact on the gene orderings and on the power of the subsequent tests. Results: We propose a simple procedure that adjusts gene-wise t-statistics to take account of correlations among genes. The resulting correlation-adjusted t-scores (“cat” scores) are derived from a predictive perspective, i.e. as a score for variable selection to discriminate group membership in two-class linear discriminant analysis. In the absence of correlation the cat score reduces to the standard t-score. Moreover, using the cat score it is straightforward to evaluate groups of features (i.e. gene sets). For computation of the cat score from small-sample data we propose a shrinkage procedure. In a comparative study comprising six different synthetic and empirical correlation structures we show that the cat score improves estimation of gene orderings and leads to higher power for fixed true discovery rate, and vice versa. Finally, we also illustrate the cat score by analyzing metabolomic data. Availability: The shrinkage cat score is implemented in the R package “st” available from
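The decorrelation step the abstract describes can be sketched as follows. This is a minimal sketch of the idea — ordinary two-sample t-scores premultiplied by the inverse matrix square root of the feature correlation matrix — without the shrinkage estimation of R and the variances that the paper's "st" package applies for small samples.

```python
import numpy as np

def cat_scores(X, y):
    """Correlation-adjusted t-scores (cat-score sketch) for two classes.

    Computes ordinary pooled-variance t-scores t, then decorrelates
    them as cat = R^{-1/2} t, where R is the empirical correlation
    matrix of the (within-class centered) features. When R = I the
    result reduces to the plain t-scores.
    """
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # pooled variance and ordinary two-sample t-scores
    v = ((n0 - 1) * X0.var(axis=0, ddof=1)
         + (n1 - 1) * X1.var(axis=0, ddof=1)) / (n0 + n1 - 2)
    t = (m1 - m0) / np.sqrt(v * (1.0 / n0 + 1.0 / n1))
    # inverse square root of the within-class correlation matrix
    R = np.corrcoef(np.vstack([X0 - m0, X1 - m1]), rowvar=False)
    w, V = np.linalg.eigh(R)
    R_inv_sqrt = V @ np.diag(1.0 / np.sqrt(np.clip(w, 1e-8, None))) @ V.T
    return R_inv_sqrt @ t
```

The eigenvalue clipping guards against the near-singular correlation matrices that arise when the number of genes exceeds the sample size, which is exactly the regime where the paper's shrinkage estimator is needed instead.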
Tweedie’s Formula and Selection Bias
"... We suppose that the statistician observes some large number of estimates zi, each with its own unobserved expectation parameter µi. The largest few of the zi’s are likely to substantially overestimate their corresponding µi’s, this being an example of selection bias, or regression to the mean. Tweed ..."
Abstract

Cited by 12 (0 self)
We suppose that the statistician observes some large number of estimates zi, each with its own unobserved expectation parameter µi. The largest few of the zi’s are likely to substantially overestimate their corresponding µi’s, this being an example of selection bias, or regression to the mean. Tweedie’s formula, first reported by Robbins in 1956, offers a simple empirical Bayes approach for correcting selection bias. This paper investigates its merits and limitations. In addition to the methodology, Tweedie’s formula raises more general questions concerning empirical Bayes theory, discussed here as “relevance” and “empirical Bayes information.” There is a close connection between applications of the formula and James–Stein estimation. Keywords: Bayesian relevance, empirical Bayes information, James–Stein, false discovery rates, regret, winner’s curse
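The correction itself is short enough to sketch. Assuming zi ~ N(µi, σ²), Tweedie's formula gives E[µi | zi] = zi + σ² d/dz log f(zi), where f is the marginal density of the z's. The sketch below estimates f with a Gaussian kernel density and differentiates numerically; the paper itself fits log f more smoothly (e.g. by Lindsey's method), so this is an illustration of the formula, not Efron's estimator.

```python
import numpy as np
from scipy.stats import gaussian_kde

def tweedie_correct(z, sigma=1.0, eps=1e-3):
    """Empirical Bayes selection-bias correction via Tweedie's formula.

    E[mu | z] = z + sigma^2 * d/dz log f(z), with the marginal density
    f estimated by a Gaussian KDE and its log-derivative approximated
    by a central difference. Extreme z's get pulled back toward the
    bulk of the distribution, countering the winner's curse.
    """
    z = np.asarray(z, dtype=float)
    kde = gaussian_kde(z)
    dlogf = (np.log(kde(z + eps)) - np.log(kde(z - eps))) / (2.0 * eps)
    return z + sigma**2 * dlogf
```

Because log f is decreasing in the right tail and increasing in the left tail, the largest observations are shrunk downward and the smallest upward, which is precisely the regression-to-the-mean correction the abstract describes.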
Are a set of microarrays independent of each other?
, 2009
"... Having observed an m × n matrix X whose rows are possibly correlated, we wish to test the hypothesis that the columns are independent of each other. Our motivation comes from microarray studies, where the rows of X record expression levels for m different genes, often highly correlated, while the co ..."
Abstract

Cited by 12 (1 self)
Having observed an m × n matrix X whose rows are possibly correlated, we wish to test the hypothesis that the columns are independent of each other. Our motivation comes from microarray studies, where the rows of X record expression levels for m different genes, often highly correlated, while the columns represent n individual microarrays, presumably obtained independently. The presumption of independence underlies all the familiar permutation, cross-validation, and bootstrap methods for microarray analysis, so it is important to know when independence fails. We develop nonparametric and normal-theory testing methods. The row and column correlations of X interact with each other in a way that complicates test procedures, essentially by reducing the accuracy of the relevant estimators.
Power-Enhanced Multiple Decision Functions Controlling Family-Wise Error and False Discovery Rates
, 2009
"... 2 ..."
Signal identification for rare and weak features: higher criticism or false discovery rates? Biostatistics
, 2012
"... Signal identification in largedimensional settings is a challenging problem in biostatistics. Recently, the method of higher criticism (HC) was shown to be an effective means for determining appropriate decision thresholds. Here, we study HC from a false discovery rate (FDR) perspective. We show th ..."
Abstract

Cited by 8 (2 self)
Signal identification in large-dimensional settings is a challenging problem in biostatistics. Recently, the method of higher criticism (HC) was shown to be an effective means for determining appropriate decision thresholds. Here, we study HC from a false discovery rate (FDR) perspective. We show that the HC threshold may be viewed as an approximation to a natural class boundary (CB) in two-class discriminant analysis, which in turn is expressible as an FDR threshold. We demonstrate that, in a rare-weak setting, in the region of the phase space where signal identification is possible, both thresholds are practically indistinguishable, and thus HC thresholding is identical to using a simple local FDR cutoff. The relationship of the HC and CB thresholds and their properties are investigated both analytically and by simulations, and are further compared by application to four cancer gene expression data sets.
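The HC threshold under discussion can be sketched in a few lines. This follows the standard Donoho–Jin construction (not necessarily the exact variant used in the paper): maximize a standardized gap between the empirical and uniform p-value CDFs over the lower order statistics, and flag every feature whose p-value falls at or below the maximizer.

```python
import numpy as np

def hc_threshold(pvals, alpha0=0.5):
    """Higher-criticism decision threshold (Donoho-Jin style sketch).

    For sorted p-values p_(1) <= ... <= p_(N), computes
        HC(i) = sqrt(N) * (i/N - p_(i)) / sqrt(p_(i) * (1 - p_(i)))
    and returns the p-value at which HC is maximized over the first
    alpha0 fraction of the order statistics.
    """
    p = np.sort(np.asarray(pvals, dtype=float))
    N = len(p)
    i = np.arange(1, N + 1)
    denom = np.sqrt(p * (1.0 - p))
    denom = np.where(denom > 0, denom, np.inf)  # guard p = 0 or 1
    hc = np.sqrt(N) * (i / N - p) / denom
    k = max(int(alpha0 * N), 1)
    istar = int(np.argmax(hc[:k]))
    return p[istar]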
An efficient hierarchical generalized linear mixed model for pathway analysis of genome-wide association studies
 Bioinformatics
, 2011
"... Motivation: In genomewide association studies (GWAS) of complex diseases, genetic variants having real but weak associations often fail to be detected at the stringent genomewide significance level. Pathway analysis, which tests disease association with combined association signals from a group of ..."
Abstract

Cited by 7 (1 self)
Motivation: In genome-wide association studies (GWAS) of complex diseases, genetic variants having real but weak associations often fail to be detected at the stringent genome-wide significance level. Pathway analysis, which tests disease association with combined association signals from a group of variants in the same pathway, has become increasingly popular. However, because of the complexities in genetic data and the large sample sizes in typical GWAS, pathway analysis remains challenging. We propose a new statistical model for pathway analysis of GWAS. This model includes a fixed effects component that models mean disease association for a group of genes, and a random effects component that models how each gene’s association with disease varies about the gene group mean; it thus belongs to the class of mixed effects models. Results: The proposed model is computationally efficient and uses only summary statistics. In addition, it corrects for the presence of overlapping genes and linkage disequilibrium (LD). Via simulated and real GWAS data, we showed that our model improved power over currently available pathway analysis methods while preserving the type I error rate. Furthermore, using the WTCCC Type 1 Diabetes (T1D) dataset, we demonstrated that mixed model analysis identified meaningful biological processes that agreed well with previous reports on T1D. Therefore, the proposed methodology provides an efficient statistical modeling framework for systems analysis of
False Discovery Rates and Copy Number Variation
"... Huge data sets, simple questions ..."
The Future of Indirect Evidence
"... Stanford UniversityWhat is Statistics? • The theory of learning from experience when experience arrives a little bit at a time • Collecting and combining many small pieces of sometimes contradictory evidence • Direct and Indirect statistical evidence ..."
Abstract

Cited by 6 (0 self)
Stanford University. What is Statistics? • The theory of learning from experience when experience arrives a little bit at a time • Collecting and combining many small pieces of sometimes contradictory evidence • Direct and indirect statistical evidence