Modeling the hemodynamic response function in fMRI: efficiency, bias and mis-modeling. Neuroimage (2009)
BibTeX
@ARTICLE{Lindquist09modelingthe,
author = {Martin A. Lindquist and Ji Meng Loh and Lauren Y. Atlas and Tor D. Wager},
title = {Modeling the hemodynamic response function in fMRI: efficiency, bias and mis-modeling},
journal = {NeuroImage},
volume = {45},
pages = {S187--S198},
year = {2009}
}
Abstract
Most brain research to date has focused on studying the amplitude of evoked fMRI responses, though there has recently been increased interest in measuring the onset, peak latency and duration of the responses as well. A number of modeling procedures provide measures of the latency and duration of fMRI responses. In this work we compare several techniques that vary in their assumptions, model complexity, and interpretation. For each technique, we describe how to estimate amplitude, peak latency, and duration, and how to perform inference in a multi-subject fMRI setting. We then assess the techniques' relative sensitivity and their propensity for mis-attributing task effects on one parameter (e.g., duration) to another (e.g., amplitude). Finally, we introduce methods for quantifying model misspecification and assessing bias and power loss related to the choice of model. Overall, the results show that it is surprisingly difficult to accurately recover true task-evoked changes in BOLD signal and that there are substantial differences among models in terms of power, bias and parameter confusability. Because virtually all fMRI studies in cognitive and affective neuroscience employ these models, the results bear on the interpretation of hemodynamic response estimates across a wide variety of psychological and neuroscientific studies. © 2008 Elsevier Inc. All rights reserved.

Introduction

Functional magnetic resonance imaging (fMRI) is based on studying the vascular response in the brain to neuronal activity and can be used to study mental activity. It is most commonly performed using blood oxygenation level-dependent (BOLD) contrast. To date, most fMRI studies have focused primarily on estimating the amplitude of evoked HRFs across different tasks. However, there is growing interest in studying the time-to-peak and duration of activation as well. In this paper, we focus on estimation of response amplitude/height (H), time-to-peak (T), and full-width at half-maximum (W) of the HRF as potential measures of the response magnitude, latency and duration of neuronal activity.

Ideally, the parameters of the HRF should be directly interpretable in terms of changes in neuronal activity, and should be estimated so that statistical power is maximized. An accurate estimate of the HRF shape may help prevent both false positive and false negative results from arising due to ill-fitting constrained statistical models, as even minor amounts of mis-modeling can lead to severe loss in power and validity. The issue of interpretability is complex, and the problem can be divided into two parts. In spite of these challenges, well over a thousand studies have to date demonstrated relationships between task-evoked changes in brain metabolic activity and measured BOLD responses. These studies treat the evoked BOLD response as the signal of interest, without attempting to make a direct quantitative link to neuronal activity. Early studies presented events with large separation in time (e.g., visual stimuli every 20-30 s), so that task-evoked average BOLD responses could be recovered and H, T, and W estimated directly. However, this design is highly inefficient, as very few stimuli can be presented in a session, and it has become common practice to present events rapidly enough that the BOLD responses to different events overlap.
The dominant analysis strategy is to assume that BOLD responses to events add linearly. Choices of HRF models range from the use of a single canonical HRF to the use of a basis set of smooth functions or more flexible models. Thus, in sum, the nature of the underlying BOLD physiology limits the ultimate interpretability of the parameter estimates in terms of neuronal and metabolic function, but modeling task-evoked BOLD responses is useful, and is in fact critical for inference in virtually all neuroscientific fMRI studies. Because of the complexity of the relationship between neural activity and BOLD, we do not attempt to relate BOLD signal directly to underlying neuronal activity in this work. Instead, we concentrate on the second issue. In previous work we showed that with virtually all models of evoked BOLD responses, true changes in one parameter (e.g., T) can be mistaken for changes in others (e.g., H and W). This problem is independent of the issue of how neuronal activity actually leads to the BOLD response.

The goal of this paper is to expand on our previous work assessing the validity and power of various hemodynamic modeling techniques by introducing techniques for performing inference on estimated H, T and W parameters in a multi-subject fMRI setting, as well as methods for quantifying the amount of mis-modeling each model gives rise to. We consider a number of BOLD response models, which vary significantly in their assumptions, model complexity and interpretation, under a range of different conditions, including variations in true BOLD amplitude, latency, and duration. Overall, the results reported here show that it is surprisingly difficult to accurately recover true task-evoked changes in BOLD H, T, and W parameters, and that there are substantial differences among models in power, bias and parameter confusability. Because virtually all fMRI studies in cognitive and affective neuroscience employ these models, the results bear on the way HRFs are estimated in hundreds of neuroscientific studies published per year. Thus, the current results can inform the choice of BOLD response models used in these studies, until it becomes practical to incorporate more complete models of BOLD hemodynamics (including nonlinear neuro-vascular coupling) on a voxel-by-voxel basis in cognitive studies.

Methods

Modeling the hemodynamic response function

The relationship between the stimulus and BOLD response is typically modeled using a linear time invariant (LTI) system, where the signal at time t, y(t), is modeled as the convolution of a stimulus function s(t) and the hemodynamic response h(t), i.e.

y(t) = (s ⁎ h)(t).

In many studies h(t) is assumed to take a fixed canonical shape. However, to increase the flexibility of the approach, h(t) is often modeled as a linear combination of B basis functions g_i(t), where i = 1, …, B. We can then write

h(t) = β_1 g_1(t) + … + β_B g_B(t),

where the β_i are unknown model parameters. Typically the vectors (s ⁎ g_i)(t) are collated into the columns of a design matrix, X, and the model is expressed as

Y = Xβ + ε,

where β is a vector of regression coefficients, Y is a vector containing the observed data, and ε is a vector of unexplained error values. For most statistical analyses, the use of an LTI system is considered a reasonable assumption that provides for valid statistical inference. Therefore, in this work we assume an LTI system, and our main focus will be finding flexible models for the impulse response function of the LTI system, i.e. the HRF.
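To make the GLM formulation above concrete, the following is a minimal Python sketch (not the authors' code): it convolves an assumed event-onset stimulus function with a single canonical double-gamma basis, collects the result into a design matrix X, and estimates β by ordinary least squares. The TR, run length, onsets, gamma parameters and noise level are all illustrative assumptions.

import numpy as np
from scipy.stats import gamma

TR = 1.0                                   # assumed repetition time (s)
n_scans = 200                              # assumed run length in scans
t_hrf = np.arange(0, 30, TR)               # assumed 30 s HRF support

def canonical_hrf(t, peak=6.0, under=16.0, ratio=1.0 / 6.0):
    # Difference of two gamma densities, a common canonical double-gamma form
    h = gamma.pdf(t, peak) - ratio * gamma.pdf(t, under)
    return h / np.abs(h).max()

# Stimulus ("neuronal") indicator function s(t) with assumed event onsets
s = np.zeros(n_scans)
s[[10, 50, 90, 130, 170]] = 1.0

# Convolve each basis function with s(t) and collate into the design matrix X
basis = [canonical_hrf(t_hrf)]             # single canonical basis (GAM-like model)
X = np.column_stack([np.convolve(s, g)[:n_scans] for g in basis] + [np.ones(n_scans)])

# Simulate a noisy voxel time series Y and estimate beta by ordinary least squares
rng = np.random.default_rng(0)
Y = np.convolve(s, canonical_hrf(t_hrf))[:n_scans] + rng.normal(0.0, 0.5, n_scans)
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
fitted_hrf = beta_hat[0] * canonical_hrf(t_hrf)   # fitted HRF under this one-basis model

With a richer basis (e.g., adding derivative terms or FIR regressors), the only change is that more columns are appended to X and the fitted HRF becomes a weighted sum of the basis functions.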
A number of models, varying greatly in their flexibility, have been suggested in the literature. In the most rigid model, the shape of the HRF is completely fixed and only the height (i.e., amplitude) of the response is allowed to vary. In general, the more basis functions used in a linear model, or the more free parameters in a nonlinear one, the more flexible the model is in measuring the parameters of interest. However, including more parameters also means more potential for error in estimating them, fewer degrees of freedom, and decreased power and validity if regressors are collinear. It is also easier and statistically more powerful to interpret differences between task conditions on a single parameter such as height than it is to test for differences in multiple parameters, conditional, of course, on the interpretability of those parameter estimates. Together, these considerations argue for using a single canonical HRF, which does in fact offer optimal power if the shape is specified exactly correctly. However, the shape of the HRF varies as a function of both task and brain region, and any fixed model will be misspecified for much of the brain.

HRF models

In this work we study seven HRF models in a series of simulation studies and an application to real data. We briefly introduce each model below, but leave a more detailed description for Section A of the Appendix. The first model under consideration is SPM's canonical HRF (here denoted GAM), which consists of a linear combination of two gamma functions (Eq. (A1) in the Appendix). To increase its ability to fit responses that are shifted in time or have extended activation durations, we also consider models using the canonical HRF plus its temporal derivative (TD) and the canonical HRF plus its temporal and dispersion derivatives (DD). The next class of models is based on the use of the finite impulse response (FIR) basis set, which is the most flexible basis set that can be applied directly in a linear regression framework. In this work, we study both the standard FIR model and a semi-parametric smooth FIR model (sFIR). Finally, we also consider two models fit using nonlinear techniques: one with the same functional form as the canonical HRF but with 6 variable parameters (NL), and the inverse logit model (IL), which consists of the superposition of three separate inverse logit functions.

Estimating parameters

After modeling the HRF we seek to estimate its height (H), time-to-peak (T) and width (W). Several of the models have closed-form expressions describing the fits (e.g., the gamma-based models and IL), and hence estimates of H, T and W can be derived analytically. The lack of a closed-form solution (e.g., for the FIR models) does not preclude estimating values from the fitted HRFs, and procedures for doing so are described in Section B of the Appendix. However, when possible we use the parameter estimates to calculate H, T and W. Estimates for IL have been derived in previous work. When using models that include the canonical HRF and its derivatives, it is common to use only the non-derivative term as an estimate of the HRF amplitude. However, this solution will be biased, and therefore for TD and DD we use a "derivative boost" to counteract the anticipated delay-induced negative amplitude bias. For TD the boosted amplitude estimate is

H = sign(β̂_1) · sqrt(β̂_1² + β̂_2²),

where β̂_1 and β̂_2 are the regression parameters for the canonical HRF and first derivative term, respectively. For DD it is

H = sign(β̂_1) · sqrt(β̂_1² + β̂_2² + β̂_3²),

where β̂_3 is the regression parameter corresponding to the second derivative term.
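For models without closed-form expressions, H, T and W must be read off the fitted HRF itself. The sketch below shows one plausible way to do this for an HRF sampled on a regular time grid, together with the TD/DD "derivative boost" amplitude just described; the exact conventions in the Appendix (e.g., interpolation of the half-maximum crossings) may differ.

import numpy as np

def hrf_summaries(hrf, dt):
    # Height H: value at the global extreme point (minimum or maximum) of the fitted HRF
    idx = int(np.argmax(np.abs(hrf)))
    H = hrf[idx]
    T = idx * dt                               # time-to-peak (s)
    # Width W: full width at half maximum, taken as the contiguous run of samples
    # around the peak whose magnitude stays at or above |H| / 2
    above = np.abs(hrf) >= np.abs(H) / 2.0
    left = idx
    while left > 0 and above[left - 1]:
        left -= 1
    right = idx
    while right < len(hrf) - 1 and above[right + 1]:
        right += 1
    W = (right - left) * dt
    return H, T, W

def derivative_boost(b1, b2, b3=0.0):
    # Amplitude estimate combining the canonical beta with its derivative betas:
    # TD uses (b1, b2); DD additionally includes the dispersion-derivative beta b3
    return np.sign(b1) * np.sqrt(b1 ** 2 + b2 ** 2 + b3 ** 2)

Applied to the fitted_hrf from the previous sketch, hrf_summaries(fitted_hrf, TR) would return its height, time-to-peak and width in seconds.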
Inference

We also seek to compare several techniques for performing population inference on the estimated amplitude. Let H_i be the estimated amplitude for subject i, i = 1, …, M, defined for hypothesis-testing purposes as the global extreme point of the HRF, i.e. either a minimum or a maximum. The goal is to test whether H differs significantly from 0 in the population. In this work we compare three statistical techniques: the standard summary statistics approach (Holmes and Friston, 1998), a bootstrap procedure, and a sign permutation test.

Detecting model misspecification

The models presented in this paper differ in their ability to handle unexpected HRF shapes. Using an ill-fitting model will violate the assumptions (e.g., mean-zero noise) required for valid inference, and even a small amount of mis-modeling can result in severe power loss and inflate the false positive rate beyond the nominal value. Due to the massive amount of data, performing model diagnostics is challenging, and only limited attention has been given to this problem. Suppose r(i), i = 1, …, T, are the whitened residuals and K(t) is a kernel function. Let Z_w(t) be the kernel-weighted moving average of w consecutive residuals, starting at time t. Under the null hypothesis that the model is correct, Z_w(t) has mean 0 for all values of t. Thus a large value of Z_w(t) indicates that model mis-fit might be present, and the statistic S = max_t Z_w(t) measures the strongest evidence against the null hypothesis. Choosing a Gaussian kernel allows Gaussian random field theory to be used to determine a p-value for each subject. The subject-level p-values can then be combined across the M subjects using Fisher's statistic

Q = −2 Σ_{i=1}^{M} log(p_i),

where p_i is the p-value for subject i. Under the null hypothesis of no effect, Q follows a chi-square distribution with 2M degrees of freedom. As a follow-up, we have proposed techniques for determining whether there is task-related signal remaining in the residuals and for quantifying the amount of power loss and bias directly attributable to model misspecification. Estimates of bias and power loss can be computed from the residuals for each voxel, and bias and power-loss maps can be constructed. The details of this procedure are beyond the scope of this paper and are described elsewhere.

Comparing HRF models: simulation studies

The simulations described below were designed to compare the performance of the HRF modeling methods, specifically with respect to their ability to model variations in stimulus onset and duration relative to the assumed experimental reference ("neuronal") signal (Eq. (1)). We also assess the validity and power of each method using different types of inference: the summary statistic, the bootstrap test, and the sign permutation test.

Creation of "ground truth" data for simulation

To mimic a group "random effects" analysis, we generated 15 subject datasets for each simulation, each consisting of the "true" BOLD time series at each voxel plus white noise, creating a plausible effect size (Cohen's d = 0.5) based on observed effect sizes in the visual and motor cortex.

Simulation 1

An event-related stimulus function with a single spike was used to generate the data.

Simulation 2

Data were simulated for 15 subjects in the same manner as in Simulation 1. After fitting each of the 7 methods, the value of H was estimated for each voxel and subject. Population inference was performed using the three testing procedures to determine whether the population height differed significantly from zero. The whole procedure was repeated 30 times and the number of times each voxel was deemed significant at the α = 0.001 level was recorded.
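As a concrete illustration of this group-inference comparison, here is a small Python sketch, under assumed signal and noise levels, of two of the tests named above (the summary-statistics one-sample t-test and a sign permutation test on subject-level amplitude estimates), together with Fisher's combination of subject-level p-values used for the misspecification statistic. The number of permutations, the effect size stand-in, and the repetition scheme are illustrative, not taken from the paper.

import numpy as np
from scipy.stats import chi2, ttest_1samp

def sign_permutation_pvalue(H, n_perm=5000, seed=0):
    # Two-sided sign permutation test of mean(H) = 0: randomly flip the sign of each
    # subject's amplitude estimate and compare the observed mean with the flipped means
    rng = np.random.default_rng(seed)
    H = np.asarray(H, dtype=float)
    observed = H.mean()
    flips = rng.choice([-1.0, 1.0], size=(n_perm, H.size))
    null_means = (flips * H).mean(axis=1)
    return float(np.mean(np.abs(null_means) >= np.abs(observed)))

def fisher_combine(p_values):
    # Fisher's statistic Q = -2 * sum(log p_i); chi-square with 2M d.o.f. under the null
    p = np.asarray(p_values, dtype=float)
    Q = -2.0 * np.log(p).sum()
    return Q, float(chi2.sf(Q, df=2 * p.size))

# Toy version of the repetition logic: 15 subjects per repetition, 30 repetitions,
# counting how often each test rejects at alpha = 0.001 (signal and noise levels assumed)
rng = np.random.default_rng(1)
alpha, n_subjects, n_reps = 0.001, 15, 30
rejections = {"summary": 0, "sign_perm": 0}
for rep in range(n_reps):
    H_est = 1.0 + rng.normal(0.0, 2.0, n_subjects)   # stand-in amplitude estimates, d = 0.5
    _, p = ttest_1samp(H_est, 0.0)                   # summary-statistics approach
    rejections["summary"] += p < alpha
    rejections["sign_perm"] += sign_permutation_pvalue(H_est, seed=rep) < alpha
print(rejections)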
Experimental procedures: thermal pain

Participants (n = 20) provided informed consent, and all procedures were approved by the Columbia University IRB. During fMRI scanning, 48 thermal stimuli, 12 at each of 4 temperatures, were delivered to the left forearm. Temperatures were calibrated individually for each participant before scanning to be warm, mildly painful, moderately painful, and near tolerance. Heat stimuli, preceded by a 2 s warning cue and a 6 s anticipation period, lasted 10 s and were followed by a 30 s inter-trial interval. Functional T2⁎-weighted EPI-BOLD images (TR = 2 s, 3.5 × 3.5 × 4 mm voxels) were collected during functional runs lasting 6 min 8 s. Gradient artifacts were removed from reconstructed images prior to preprocessing. Images were slice-time corrected and adjusted for head motion using SPM5 software (http://www.fil.ion.ucl.ac.uk/spm/). A high-resolution anatomical image (T1-weighted spoiled-GRASS [SPGR] sequence, 1 × 1 × 1 mm voxels, TR = 19 ms) was coregistered to the mean functional image using a mutual information cost function, and segmented and warped to the MNI template. The warps were also applied to the functional images, which were smoothed with a 6 mm FWHM Gaussian kernel, high-pass filtered with a 120 s (0.0083 Hz) discrete cosine basis set, and Winsorized at 3 standard deviations prior to analysis. Each of the 7 models was fit voxel-wise to the data in a single axial slice (z = −22 mm) covering several pain-related regions of interest, including the anterior cingulate cortex. Separate HRFs were estimated for stimuli of each temperature, though we focus on the responses to the highest heat level in the results. The misspecification statistic was calculated using a Gaussian kernel (8 s FWHM) and p-values were determined using Gaussian random field theory.

Results

Simulation studies

Simulation 1

The results of Simulation 1 are summarized in bias maps for H, T and W, with one model per column. The GAM model (first column) gives reasonable results for delayed onsets within 3 s and widths up to 3 s (squares in the upper left-hand corner), but under-estimates amplitude dramatically as onset and duration increase. This is natural, as the GAM model is correctly specified for the square in the upper left-hand corner (and thus optimal), but not well equipped to handle a large amount of model misspecification. Of special interest is the fact that there is no bias in the squares contained in the first two columns of the W map. This is because in these cases the fixed width of the canonical HRF exactly coincides with the width of the simulated data. The same is true for the square in the upper left-hand corner of the T map. However, the results in the bottom row indicate severe mis-modeling for voxels in the lower right-hand corner. The second and third columns show equivalent results for TD and DD, indicating that the inclusion of derivative terms provides a slight improvement over GAM for squares where there is a minor amount of mis-modeling of the onset and duration. However, there is again a drastic decrease in model fit with delayed onsets greater than about 3 s or extended durations greater than 3 s. Interestingly, there appear to be only minor differences between the results for DD and TD, which indicates that the inclusion of the dispersion derivative does not lead to an improvement in model robustness across onset and duration shifts.
It is also interesting to note that for each of the gamma-based models (GAM, TD and DD) there is a consistent negative bias in the estimates of T and W (all voxels are blue). The estimation procedure was repeated using only the non-derivative term as an estimate of the HRF amplitude, as is common practice in the field. The results (not presented here) showed that the "derivative boost" resulted in a slight decrease in reported bias. Both models based on the use of FIR basis sets (FIR and sFIR) give rise to some bias in all three model parameters, with estimates tending to be negatively biased (e.g., shrunk towards zero for positive activations). The results for sFIR are consistent with similar simulations performed previously. Of special interest is that for FIR, the W map shows a strong, systematic negative bias (all squares are blue), because the full response width is almost never captured due to the roughness of the FIR estimates. The sFIR model performs substantially better in estimating width, with substantial bias only for 4-5 s onset shifts, at a small cost in under-estimating H. This cost is likely due to the fact that the Gaussian prior term leads to shrinkage of the amplitude of the fitted HRF. Both methods showed some bias in estimates of T, but without a clear, consistent directionality. Finally, the sFIR model appears to have some minor (occasional) problems with model mis-specification, while the FIR model shows no significant model misspecification. The NL model shows reasonable results with respect to bias, except for a strong tendency to under-estimate duration, but shows severe problems with model mis-specification.

Simulation 2

Simulation 2 assessed two nonparametric alternatives to the standard parametric summary statistics approach. In summary, the results are consistent across the three inference techniques.

Experiment

The results of the pain experiment are summarized in Figs. 6-7. The location of the slice used and an illustration of the areas of interest (rdACC and S2, two brain regions known to process pain intensity) are shown there.

Discussion

Though most brain research to date has focused on studying the amplitude of evoked activation, the onset and peak latencies of the HRF can provide information about the timing of activation in various brain areas, and the width of the HRF provides information about the duration of activation. However, the independence of these parameter estimates has not been properly assessed: even if basis functions are independent (or a nonlinear fitting procedure provides nominally independent estimates), the parameter estimates from real data may not be independent. The present study extends previously presented work that seeks to bridge this gap in the literature. To assess independence, we determine the amount of confusability between estimates of height (H), time-to-peak (T) and full-width at half-maximum (W) and actual manipulations of the amplitude, time-to-peak and duration of the stimulus. This was investigated using a simulation study that compares model fits across a variety of popular methods. Even models that attempt to account for delay, such as a gamma function with nonlinear fitting, are not immune to these problems. A key point of this paper is that model misspecification can result in bias in addition to loss in power. This bias may inflate the Type I error rate beyond the nominal α level, so that p-values for the test are inaccurate.
For example, a statistical parametric map thresholded at p < 0.001 may actually only control the false positive rate at p < 0.004. We find that even relatively minor model misspecification can result in substantial power loss. In light of our results, it seems important for studies that use a single canonical HRF or a highly constrained basis set to construct maps of bias and power loss, so that regions with low sensitivity or increased false positive rates can be identified. We discuss a procedure for detecting deviations in fMRI time series residuals. Using these ideas, it is possible to construct whole-brain bias and power-loss maps due to systematic mis-modeling. Matlab implementations of the IL model and a mis-modeling toolbox can be obtained by contacting the authors.

Conflict of interest

The authors declare that there are no conflicts of interest.