Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology. Advance online publication. http://dx.doi.org/10.1002/ejsp.2023
Venue: European Journal of Social Psychology
Citations: 3 (0 self)
BibTeX
@ARTICLE{Lakens14performinghigh-powered,
author = {Daniël Lakens},
title = {Performing high-powered studies efficiently with sequential analyses},
journal = {European Journal of Social Psychology},
year = {2014},
doi = {10.1002/ejsp.2023}
}
Abstract
Running studies with high statistical power, while effect size estimates in psychology are often inaccurate, leads to a practical challenge when designing an experiment. This challenge can be addressed by performing sequential analyses while the data collection is still in progress. At an interim analysis, data collection can be stopped whenever the results are convincing enough to conclude that an effect is present, more data can be collected, or the study can be terminated whenever it is extremely unlikely that the predicted effect will be observed if data collection were to be continued. Such interim analyses can be performed while controlling the Type 1 error rate. Sequential analyses can greatly improve the efficiency with which data are collected. Additional flexibility is provided by adaptive designs where sample sizes are increased on the basis of the observed effect size. The need for pre-registration, ways to prevent experimenter bias, and a comparison between Bayesian approaches and null-hypothesis significance testing (NHST) are discussed. Sequential analyses, which are widely used in large-scale medical trials, provide an efficient way to perform high-powered informative experiments. I hope this introduction will provide a practical primer that allows researchers to incorporate sequential analyses in their research. Copyright © 2014 John Wiley & Sons, Ltd.

Repeatedly analyzing results while data collection is in progress has many advantages. Researchers can stop the data collection when observed differences reach a desired confidence level or when unexpected data patterns occur that warrant a reconsideration of the aims of the study. When, after an interim analysis, the effect is smaller than expected, researchers might decide to collect more data or even stop collecting data for specific conditions. One could easily argue that psychological researchers have an ethical obligation to repeatedly analyze accumulating data, given that continuing data collection after the desired level of confidence is reached, or after it is sufficiently clear that the expected effects are not present, is a waste of the time of participants and the money provided by taxpayers. In addition to this ethical argument, designing studies that make use of sequential analyses is more efficient than not performing sequential analyses. Incorporating sequential analyses into the study design can easily reduce the sample size of studies by 30% or more.

In psychology, sequential analyses are rarely, if ever, used. In recent years, researchers have been reminded of the fact that repeatedly analyzing data, and continuing the data collection when results are not significant, increases the likelihood of a Type 1 error, or a significant test result in the absence of any differences in the population (e.g., …). I believe sequential analyses are relevant for psychological science. There is an increasing awareness that underpowered studies in combination with publication bias (the tendency to only accept manuscripts for publication that reveal statistically significant findings) yield a scientific literature that potentially consists of a large number of Type 1 errors (e.g., …).

PRACTICAL ISSUES WHEN DESIGNING AN ADEQUATELY POWERED STUDY

One problem with planning the sample size on the basis of the size of an effect (as is done in an a priori power analysis) is that the effect size is precisely the information that the researcher is trying to uncover by performing the experiment.
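As a rough illustration of this challenge (the following sketch is not part of the original article), a standard a priori power calculation shows how strongly the required sample size depends on the assumed effect size; the effect size values and the use of the statsmodels library are illustrative assumptions:

# A minimal sketch (not from the article): required sample size per group for an
# independent-samples t-test at 80% power, for several plausible effect size estimates.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.3, 0.4, 0.5):  # illustrative Cohen's d estimates (assumed values)
    n_per_group = analysis.solve_power(effect_size=d, alpha=0.05,
                                       power=0.80, alternative='two-sided')
    print(f"d = {d:.1f}: about {n_per_group:.0f} participants per group")

Dropping the assumed d from 0.5 to 0.3 roughly triples the required sample size, so an inaccurate effect size estimate translates directly into an under- or overpowered design.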
As a consequence, there is always some uncertainty regarding the sample size required to observe a statistically significant effect. Nevertheless, a priori power analyses are often recommended when designing studies to provide at least some indication of the required sample size (e.g., Lakens, 2013), and researchers therefore need to estimate the expected effect size when designing a study. One approach to obtaining an effect size estimate is to perform a pilot study. However, to provide a reasonably accurate effect size estimate, a pilot study must already be quite large (e.g., Lakens & Evers, in press), which somewhat defeats its purpose. A second approach is to base the effect size estimate on an effect size observed in a highly related study, while acknowledging that effect sizes might vary considerably because of differences between the studies. Regardless of how effect sizes are estimated, estimated effect sizes have their own confidence intervals (as does any other sample statistic) and should be expected to vary between the lower and upper confidence limits across studies. Because statistical power is a function of sample size that increases concave downwards (especially for larger effect sizes), …

The idea that we need to collect large amounts of data without any flexibility worries researchers, and some researchers have argued against a fixation on Type 1 error control. Ellemers (2013, p. 3) argues "we are at risk of becoming methodological fetishists," which "would reverse the means and the ends of doing research and stifles the creativity that is essential to the advancement of science." Although flexibility in the generation of hypotheses is, in principle, completely orthogonal to how strictly these hypotheses are tested empirically, there is a real risk that researchers will become more conservative in the ideas they test. If researchers believe they should perform high-powered experiments with large samples without looking at the data until all participants have been collected, they might not pursue hypotheses that initially seem more unlikely.

Murayama, Pekrun, and Fiedler (2013) discuss the practice of adding observations to a study on the basis of the observed p-value, and warn against jumping to the extreme conclusion that continuing data collection after analyzing the data should be banned. They examine what happens when researchers collect additional observations only when an analysis reveals a p-value between .05 and .10. Such a practice, they show, would lead to only a modest increase in Type 1 error rates (as long as the number of times additional data are collected is limited). Although this is important to realize, underpowered studies will often yield p-values higher than .10 when there is a real effect in the population. Because using sequential analyses is not very complex, it is preferable to know and use procedures that control the Type 1 error rate while performing interim analyses. In the remainder of this article, I will explain how Type 1 error control is possible and provide a practical primer on how to perform sequential analyses.
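To make the inflation problem concrete, the following simulation sketch (not taken from the article; the batch size, number of looks, and random seed are arbitrary assumptions) mimics a researcher who tests after every batch of participants and stops at the first p < .05 even though the null hypothesis is true:

# A minimal simulation sketch (not from the article): under the null hypothesis,
# testing after every batch and stopping at the first p < .05 inflates the
# Type 1 error rate well beyond the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, batch, n_looks = 5000, 20, 4      # illustrative settings
false_positives = 0
for _ in range(n_sims):
    # draw all data upfront; analyzing the first n values is equivalent to collecting sequentially
    a = rng.normal(size=batch * n_looks)
    b = rng.normal(size=batch * n_looks)  # both groups come from the same population
    for look in range(1, n_looks + 1):
        n = look * batch
        if stats.ttest_ind(a[:n], b[:n]).pvalue < .05:  # uncorrected interim test
            false_positives += 1
            break
print(f"Observed Type 1 error rate: {false_positives / n_sims:.3f}")  # typically around .11-.13

With four uncorrected looks, the long-run false positive rate is roughly double the nominal 5%, which is exactly the inflation that the procedures discussed in the next section are designed to prevent.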
TYPE 1 ERROR CONTROL WHILE PERFORMING INTERIM ANALYSES

Statistical procedures to perform sequential interim analyses while data collection is still in progress have been available for a long time (e.g., …). One-sided tests or asymmetric boundaries are sometimes used (see …). This lack of flexibility is impractical in medical settings, where data and safety monitoring boards meet at fixed times each year, and it is not always feasible to control the number of patients between these meetings. In psychological research, it might be difficult to pause an experiment after a predefined number of observations have been collected and wait for the data analysis to be performed.

Stopping a trial early has clear benefits, such as saving time and money when the available data are considered convincing, but it also has disadvantages, as noted by …. It is also possible that the effect size estimate at an interim analysis is (close to) zero, which indicates that the effect is non-existent or very small. Because the chance of observing a statistically significant difference is very small, or would require huge sample sizes, researchers can decide to terminate the data collection early for futility, to spare time and resources. Obviously, researchers might want to continue a study even when the conditional or predictive power after an interim analysis is very low; for example, when they are also interested in demonstrating that there are no effects in a study. In many situations, the decision to stop a study for futility will be more complex, and I will return to this issue when discussing how to define a smallest effect size of interest.

An Illustrative Example When a Reliable Effect Size Estimate Exists

As an example of how sequential analyses can be used, suppose a researcher is interested in whether a relatively …
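The kind of correction discussed in this section can be sketched with a small simulation (not from the article) in which each of two looks is tested against a Pocock-style constant threshold; the per-look alpha of .0294 is the standard Pocock value for two looks at an overall alpha of .05, and the batch size and seed are arbitrary assumptions:

# A minimal sketch (not from the article) of Type 1 error control with one interim
# and one final analysis, using a Pocock-style constant per-look threshold.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims, batch = 5000, 40            # illustrative settings
alpha_per_look = .0294              # Pocock boundary for 2 looks, overall alpha = .05
false_positives = 0
for _ in range(n_sims):
    a = rng.normal(size=2 * batch)
    b = rng.normal(size=2 * batch)  # the null hypothesis is true
    for n in (batch, 2 * batch):    # interim look, then final look
        if stats.ttest_ind(a[:n], b[:n]).pvalue < alpha_per_look:
            false_positives += 1    # stop early, or reject at the final look
            break
print(f"Overall Type 1 error rate: {false_positives / n_sims:.3f}")  # close to .05

Under the null hypothesis the overall rejection rate stays close to the nominal .05, while under a true effect the same design allows data collection to stop at the halfway point whenever the interim result already crosses the boundary.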