Results 1 - 10
of
85
A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics
, 2005
"... ..."
Correlation and Large-Scale Simultaneous Significance Testing
- Journal of the American Statistical Association
"... Large-scale hypothesis testing problems, with hundreds or thousands of test statistics “zi ” to consider at once, have become familiar in current practice. Applications of popular analysis methods such as false discovery rate techniques do not require independence of the zi’s, but their accuracy can ..."
Abstract
-
Cited by 30 (7 self)
- Add to MetaCart
Large-scale hypothesis testing problems, with hundreds or thousands of test statistics “zi ” to consider at once, have become familiar in current practice. Applications of popular analysis methods such as false discovery rate techniques do not require independence of the zi’s, but their accuracy can be compromised in high-correlation situations. This paper presents computational and theoretical methods for assessing the size and effect of correlation in large-scale testing. A simple theory leads to the identification of a single omnibus measure of correlation. The theory relates to the correct choice of a null distribution for simultaneous significance testing, and its effect on inference. 1. Introduction Modern computing machinery and improved scientific equipment have combined to revolutionize experimentation in fields such as biology, medicine, genetics, and neuroscience. One effect on statistics has been to vastly magnify the scope of multiple hypothesis testing, now often involving thousands of cases considered simultaneously. The cases themselves are typically of familiar form, each perhaps a simple two-sample comparison,
Statistical challenges with high dimensionality: Feature selection in knowledge discovery
- Proceedings of the International Congress of Mathematicians
, 2006
"... Abstract. Technological innovations have revolutionized the process of scientific research and knowledge discovery. The availability of massive data and challenges from frontiers of research and development have reshaped statistical thinking, data analysis and theoretical studies. The challenges of ..."
Abstract
-
Cited by 25 (7 self)
- Add to MetaCart
Abstract. Technological innovations have revolutionized the process of scientific research and knowledge discovery. The availability of massive data and challenges from frontiers of research and development have reshaped statistical thinking, data analysis and theoretical studies. The challenges of high-dimensionality arise in diverse fields of sciences and the humanities, ranging from computational biology and health studies to financial engineering and risk management. In all of these fields, variable selection and feature extraction are crucial for knowledge discovery. We first give a comprehensive overview of statistical challenges with high dimensionality in these diverse disciplines. We then approach the problem of variable selection and feature extraction using a unified framework: penalized likelihood methods. Issues relevant to the choice of penalty functions are addressed. We demonstrate that for a host of statistical problems, as long as the dimensionality is not excessively large, we can estimate the model parameters as well as if the best model is known in advance. The persistence property in risk minimization is also addressed. The applicability of such a theory and method to diverse statistical problems is demonstrated. Other related problems with high-dimensionality are also discussed.
Microarrays, empirical Bayes and the two-groups model
- STATIST. SCI
, 2006
"... The classic frequentist theory of hypothesis testing developed by Neyman, Pearson, and Fisher has a claim to being the Twentieth Century’s most influential piece of applied mathematics. Something new is happening in the Twenty-First Century: high throughput devices, such as microarrays, routinely re ..."
Abstract
-
Cited by 25 (9 self)
- Add to MetaCart
The classic frequentist theory of hypothesis testing developed by Neyman, Pearson, and Fisher has a claim to being the Twentieth Century’s most influential piece of applied mathematics. Something new is happening in the Twenty-First Century: high throughput devices, such as microarrays, routinely require simultaneous hypothesis tests for thousands of individual cases, not at all what the classical theory had in mind. In these situations empirical Bayes information begins to force itself upon frequentists and Bayesians alike. The two-groups model is a simple Bayesian construction that facilitates empirical Bayes analysis. This article concerns the interplay of Bayesian and frequentist ideas in the two-groups setting, with particular attention focussed on Benjamini and Hochberg’s False Discovery Rate method. Topics include the choice and meaning of the null hypothesis in large-scale testing situations, power considerations, the limitations of permutation methods, significance testing for groups of cases (such as pathways in microarray studies), correlation effects, multiple confidence intervals, and Bayesian competitors to the two-groups model.
Semilinear high-dimensional model for normalization of microarray data: a theoretical analysis and partial consistency
- J. Amer. Statist. Assoc
, 2005
"... Normalization of microarray data is essential for removing experimental biases and revealing meaningful biological results. Motivated by a problem of normalizing microarray data, a semilinear in-slide model (SLIM) has been proposed. To aggregate information from other arrays, SLIM is generalized to ..."
Abstract
-
Cited by 15 (4 self)
- Add to MetaCart
Normalization of microarray data is essential for removing experimental biases and revealing meaningful biological results. Motivated by a problem of normalizing microarray data, a semilinear in-slide model (SLIM) has been proposed. To aggregate information from other arrays, SLIM is generalized to account for across-array information, resulting in an even more dynamic semiparametric regression model. This model can be used to normalize microarray data even when there is no replication within an array. We demonstrate that this semiparametric model has a number of interesting features. The parametric component and the nonparametric component that are of primary interest can be consistently estimated, the former having a parametric rate and the latter having a nonparametric rate, whereas the nuisance parameters cannot be consistently estimated. This is an interesting extension of the partial consistent phenomena, which itself is of theoretical interest. The asymptotic normality for the parametric component and the rate of convergence for the nonparametric component are established. The results are augmented by simulation studies and illustrated by an application to the cDNA microarray analysis of neuroblastoma cells in response to the macrophage migration inhibitory factor.
Size, power and false discovery rates
, 2007
"... Modern scientific technology has provided a new class of large-scale simultaneous inference problems, with thousands of hypothesis tests to consider at the same time. Microarrays epitomize this type of technology, but similar situations arise in proteomics, spectroscopy, imaging, and social science ..."
Abstract
-
Cited by 14 (4 self)
- Add to MetaCart
Modern scientific technology has provided a new class of large-scale simultaneous inference problems, with thousands of hypothesis tests to consider at the same time. Microarrays epitomize this type of technology, but similar situations arise in proteomics, spectroscopy, imaging, and social science surveys. This paper uses false discovery rate methods to carry out both size and power calculations on large-scale problems. A simple empirical Bayes approach allows the fdr analysis to proceed with a minimum of frequentist or Bayesian modeling assumptions. Closed-form accuracy formulas are derived for estimated false discovery rates, and used to compare different methodologies: local or tail-area fdr’s, theoretical, permutation, or empirical null hypothesis estimates. Two microarray data sets as well as simulations are used to evaluate the methodology the power diagnostics showing why non-null cases might easily fail to appear on a list of “significant ” discoveries. Short Title “Size, Power, and Fdr’s”
Estimating the null and the proportion of non-null effects in large-scale multiple comparisons
- J. Amer. Statist. Assoc
, 2007
"... An important issue raised by Efron [7] in the context of large-scale multiple comparisons is that in many applications the usual assumption that the null distribution is known is incorrect, and seemingly negligible differences in the null may result in large differences in subsequent studies. This s ..."
Abstract
-
Cited by 12 (3 self)
- Add to MetaCart
An important issue raised by Efron [7] in the context of large-scale multiple comparisons is that in many applications the usual assumption that the null distribution is known is incorrect, and seemingly negligible differences in the null may result in large differences in subsequent studies. This suggests that a careful study of estimation of the null is indispensable. In this paper, we consider the problem of estimating a null normal distri-bution, and a closely related problem, estimation of the proportion of non-null effects. We develop an approach based on the empirical characteristic function and Fourier analysis. The estimators are shown to be uniformly consistent over a wide class of parameters. Numerical performance of the estimators is investigated using both simulated and real data. In particular, we apply our
Local False Discovery Rates
, 2005
"... Modern scientific technology is providing a new class of large-scale simultaneous inference problems, with hundreds or thousands of hypothesis tests to consider at the same time. Microarrays epitomize this type of technology but similar problems arise in proteomics, time of flight spectroscopy, flow ..."
Abstract
-
Cited by 10 (1 self)
- Add to MetaCart
Modern scientific technology is providing a new class of large-scale simultaneous inference problems, with hundreds or thousands of hypothesis tests to consider at the same time. Microarrays epitomize this type of technology but similar problems arise in proteomics, time of flight spectroscopy, flow cytometry, FMRI imaging, and massive social science surveys. This paper uses local false discovery rate methods to carry out size and power calculations on large-scale data sets. An empirical Bayes approach allows the fdr analysis to proceed from a minimum of frequentist or Bayesian modeling assumptions. Microarray and simulated data sets are used to illustrate a convenient estimation methodology whose accuracy can be calculated in closed form. A crucial part of the methodology is an fdr assessment of “thinned counts”, what the histogram of test statistics would look like for just the non-null cases
Estimation and confidence sets for sparse normal mixtures
, 2005
"... For high dimensional statistical models, researchers have begun to focus on situations which can be described as having relatively few moderately large coefficients. Such situations lead to some very subtle statistical problems. In particular, Ingster and Donoho and Jin have considered a sparse norm ..."
Abstract
-
Cited by 9 (5 self)
- Add to MetaCart
For high dimensional statistical models, researchers have begun to focus on situations which can be described as having relatively few moderately large coefficients. Such situations lead to some very subtle statistical problems. In particular, Ingster and Donoho and Jin have considered a sparse normal means testing problem, in which they described the precise demarcation, or the detection boundary. Meinshausen and Rice have shown that it is even possible to estimate consistently the fraction of nonzero coordinates on a subset of the detectable region, but leave unanswered the question of exactly which parts of the detectable region that consistent estimation is possible. In the present paper we develop a new approach for estimating the fraction of nonzero means for problems where the nonzero means are moderately large. We show that the detection region described by Ingster and Donoho and Jin turns out to be the region where it is possible to consistently estimate the expected fraction of nonzero coordinates. This theory is developed further and minimax rates of convergence are derived. A procedure is constructed which attains the optimal rate of convergence in this setting. Furthermore, the procedure also provides an honest lower bound for confidence intervals while minimizing the expected length of such an interval. Simulations are used to enable comparison with the work of Meinshausen and Rice, where a procedure is given but where rates of convergence have not been discussed. Extensions to more general Gaussian mixture models are also given.
To how many simultaneous hypothesis tests the normal, Student’s t, or bootstrap calibration be applied
, 2007
"... ABSTRACT. In the analysis of microarray data, and in some other contemporary statistical problems, it is not uncommon to apply hypothesis tests in a highly simultaneous way. The number, N say, of tests used can be much larger than the sample sizes, n, to which the tests are applied, yet we wish to c ..."
Abstract
-
Cited by 6 (2 self)
- Add to MetaCart
ABSTRACT. In the analysis of microarray data, and in some other contemporary statistical problems, it is not uncommon to apply hypothesis tests in a highly simultaneous way. The number, N say, of tests used can be much larger than the sample sizes, n, to which the tests are applied, yet we wish to calibrate the tests so that the overall level of the simultaneous test is accurate. Often the sampling distribution is quite different for each test, so there may not be an opportunity for combining data across samples. In this setting, how large can N be, as a function of n, before level accuracy becomes poor? In the present paper we answer this question in cases where the statistic under test is of Student’s t type. We show that if either Normal or Student’s t distribution is used for calibration then the level of the simultaneous test is accurate provided log N increases at a strictly slower rate than n 1/3 as n diverges. On the other hand, if bootstrap methods are used for calibration then we may choose log N almost as large as n 1/2 and still achieve asymptotic level accuracy. The implications of these results are explored both theoretically and numerically. KEYWORDS. Bonferroni’s inequality, Edgeworth expansion, genetic data, large-deviation expansion, level accuracy, microarray data, quantile estimation, skewness, Student’s t statistic.

