Literature DB >> 30318709

Cluster failure revisited: Impact of first level design and physiological noise on cluster false positive rates.

Anders Eklund^1,2,3, Hans Knutsson^1,3, Thomas E Nichols^4,5,6.

Abstract

Methodological research rarely generates a broad interest, yet our work on the validity of cluster inference methods for functional magnetic resonance imaging (fMRI) created intense discussion on both the minutia of our approach and its implications for the discipline. In the present work, we take on various critiques of our work and further explore the limitations of our original work. We address issues about the particular event-related designs we used, considering multiple event types and randomization of events between subjects. We consider the lack of validity found with one-sample permutation (sign flipping) tests, investigating a number of approaches to improve the false positive control of this widely used procedure. We found that the combination of a two-sided test and cleaning the data using ICA FIX resulted in nominal false positive rates for all data sets, meaning that data cleaning is not only important for resting state fMRI, but also for task fMRI. Finally, we discuss the implications of our work on the fMRI literature as a whole, estimating that at least 10% of the fMRI studies have used the most problematic cluster inference method (p = .01 cluster defining threshold), and how individual studies can be interpreted in light of our findings. These additional results underscore our original conclusions, on the importance of data sharing and thorough evaluation of statistical methods on realistic null data.

Entities: Chemical Disease Gene Species

Keywords: ICA FIX; cluster inference; false positives; functional magnetic resonance imaging; permutation; physiological noise

Mesh：

Year: 2018 PMID： 30318709 PMCID： PMC6445744 DOI： 10.1002/hbm.24350

Source DB: PubMed Journal: Hum Brain Mapp ISSN： 1065-9471 Impact factor: 5.038

INTRODUCTION

In our previous work (Eklund, Nichols, & Knutsson, 2016a), we used freely available resting state functional magnetic resonance imaging (fMRI) data to evaluate the validity of standard fMRI inference methods. Group analyses involving only healthy controls were used to empirically estimate the degree of false positives, after correcting for multiple comparisons, based on the idea that a two‐sample t test using only healthy controls should lead to nominal false positive rates (e.g., 5%). By considering resting state fMRI as null task fMRI data, the same approach was used to evaluate the statistical methods for one‐sample t tests. Briefly, we found that parametric statistical methods (e.g., Gaussian random field theory [GRFT]) perform well for voxel inference, where each voxel is separately tested for significance, but the combination of voxel inference and familywise error (FWE) correction is seldom used due to its low statistical power. For this reason, the false discovery rate is in neuroimaging (Genovese, Lazar, & Nichols, 2002) often used to increase statistical power. For cluster inference, where groups of voxels are tested together by looking at the size of each cluster, we found that parametric methods perform well for a high cluster defining threshold (CDT; p = .001) but result in inflated false positive rates for low CDTs (e.g., p = .01). GRFT is for cluster inference based on two additional assumptions, compared to GRFT for voxel inference, and we found that these assumptions are violated in the analyzed data. First, the spatial autocorrelation function (SACF) is assumed to be Gaussian, but real fMRI data have a SACF with a much longer tail. Second, the spatial smoothness is assumed to be constant over the brain, which is not the case for fMRI data. The nonparametric permutation test is not based on these assumptions (Winkler, Ridgway, Webster, Smith, & Nichols, 2014) and, therefore, produced nominal results for all two‐sample t tests, but in some cases failed to control FWE for one‐sample t tests.

Related work

Our article has generated intense discussions regarding cluster inference in fMRI (Cox, 2018; Cox, Chen, Glen, Reynolds, & Taylor, 2017a, 2017b; Eklund, Nichols, & Knutsson, 2017; Flandin & Friston, 2017; Gopinath, Krishnamurthy, & Sathian, 2018; Kessler, Angstadt, & Sripada, 2017), on the validity of using resting state fMRI data as null data (Nichols, Eklund, & Knutsson, 2017; Slotnick, 2016, 2017), how the spatial resolution can affect parametric cluster inference (Mueller, Lepsien, Möller, & Lohmann, 2017), how to obtain residuals with a Gaussian SACF (Gopinath, Krishnamurthy, Lacey, & Sathian, 2018), how to model the long‐tail SACF (Cox et al., 2017b), as well as how different MR sequences can change the SACF and thereby cluster inference (Wald & Polimeni, 2017). Furthermore, some of our results have been reproduced and extended (Cox et al., 2017b; Flandin & Friston, 2017; Kessler et al., 2017; Mueller et al., 2017), using the same freely available fMRI data (Biswal et al., 2010; Poldrack et al., 2013) and our processing scripts available on github.1 Cluster‐based methods have now also been evaluated for surface‐based group analyses of cortical thickness, surface area, and volume (using FreeSurfer; Greve & Fischl, 2018), with a similar conclusion that the nonparametric permutation test showed good control of the FWE for all settings, while traditional Monte Carlo methods fail to control FWE for some settings.

Realistic first level designs

The event related paradigms (E1, E2) used in our study were criticized by some for not being realistic designs, as only a single regressor was used (Slotnick, 2016) and the rest between the events was too short. The concern here is that this design may have a large transient at the start (due to the delay of the hemodynamic response function) and then only small variation (due to the short interstimulus interval), which may be overly‐sensitive to transients at the start of the acquisition (Figure 1a, however, shows this is not really the case). Another criticism was that exactly the same task was used for all subjects (Flandin & Friston, 2017), meaning that our “false positives” actually reflect consistent pretend‐stimulus‐linked behavior over subjects. Yet another concern was if the first few volumes (often called dummy scans) in each fMRI data set were included in the analysis or not,2 which can affect the statistical analyses. This last point we can definitively address, as according to a NITRC document,3 the first 5 time points of each time series were discarded for all data included in the 1,000 functional connectomes project release. In the Methods section we, therefore, describe new analyses based on two new first level designs.

Figure 1

A comparison of the paradigms used in the original paper (a) E1, (b) E2), and the new paradigms used in this article (c) E3, (d) E4, for the Beijing data sets (sampled with a TR of 2 s). A single task was used for both E1 and E2, while two pretended tasks where used for E3 and E4 (and all first‐level analyses tested for a difference in activation between these two tasks). Paradigms E1, E2, and E3 are the same for all subjects, while E4 is randomized over subjects. For all paradigms, the default hemodynamic response function in SPM 8 (double gamma) was used to generate these plots

Nonparametric inference

Nonparametric group inference is now available in the AFNI function 3dttest++ (Cox et al., 2017a, 2017b), meaning that the three most common fMRI softwares now all support nonparametric group inference (SPM users can use the SnPM toolbox (http://warwick.ac.uk/snpm), and FSL users can use the randomize function (Winkler et al., 2014)). Permutation tests cannot only be applied to simple designs such as one‐sample and two‐sample t tests, but to virtually any regression model with independent errors (Winkler et al., 2014). To increase statistical power, permutation tests enable more advanced thresholding approaches (Smith & Nichols, 2009) as well as the use of multivariate approaches with complicated null distributions (Friman, Borga, Lundberg, & Knutsson, 2003; Stelzer, Chen, & Turner, 2013). The nonparametric permutation test produced nominal results for all two sample t tests, but not for the one sample t tests (Eklund et al., 2016a), and the Oulu data were more problematic compared to Beijing and Cambridge. As described in Section 2, we investigated numerous ways to achieve nominal results, and finally concluded that (physiological) artifacts are a problem for one‐sample t tests, especially for the Oulu data. This is a good example of the challenge of validating statistical methods. One can argue that real fMRI data are essential since they contain all types of noise (Birn, Diamond, Smith, & Bandettini, 2006; Chang & Glover, 2009; Glover, Li, & Ress, 2000; Greve, Brown, Mueller, Glover, & Liu, 2013; Lund, Madsen, Sidaros, Luo, & Nichols, 2006) which are difficult to simulate. From this perspective, the Oulu data are helpful since they highlight the problem of (physiological) noise. On the other hand, one can argue that a pure fMRI simulation (Welvaert & Rosseel, 2014) is better, since the researcher then can control all parameters of the data and independently test different settings. From this perspective, the Oulu data should be avoided, because the assumption of no consistent activation over subjects is violated by the (physiological) noise, however data quality varies dramatically over sites and studies and we expect there is plenty of data collected that has quality comparable to Oulu.

Implications

The original publication (Eklund et al., 2016a) inadvertently implied that a large, unspecified proportion of the fMRI literature was affected by our findings, principally the severe inflation of false positive risk for a CDT of p = .01; this was clarified in a subsequent correction (Eklund, Nichols, & Knutsson, 2016b). In Section 4, we consider the interpretation of our findings and their impact on the literature as a whole. We estimate that at least 10% of 23,000 published fMRI studies have used the problematic CDT p = .01.

METHODS

New paradigms

To address the concerns regarding realistic first level designs, we have made new analyses using two new event‐related paradigms, called E3 and E4. For both E3 and E4, two pretended tasks are used instead of a single task, and each first‐level analysis tests for a difference in activation between the two tasks. Additionally, the rest between the events is longer. For Beijing data, 13 events were used for each of the two tasks. Each task is 3–7 s long, and the rest between each event is 11–13 s. For Cambridge data, 11 events of 3–6 s were used for each task. For Oulu data, 13 events of 3–6 s were used for each task. See Figure 1 for a comparison between E1, E2, E3, and E4. For E4, the regressors are randomized over subjects, such that each subject has the same number of events for each task, but the order and the timing of the events is different for every subject. For E3, the same regressors are used for all subjects. First‐level analyses as well as group level analyses were performed as in the original study (Eklund et al., 2016a), using the same data (Beijing, Cambridge, Oulu) from the 1,000 functional connectomes project (Biswal et al., 2010). Analyses were performed with SPM 8 (Ashburner, 2012), FSL 5.09 (Jenkinson, Beckmann, Behrens, Woolrich, & Smith, 2012) and AFNI 16.3.14 (Cox, 1996). FWE rates were estimated for different levels of smoothing (4–10 mm), one‐sample as well as two‐sample t tests, and two CDTs (p = .01 and p = .001). Group analyses using 3dMEMA in AFNI were not performed, as the results for 3dttest++ and 3dMEMA were very similar in the original study (Eklund et al., 2016a). Another difference is that cluster thresholding for AFNI was performed using the new ACF (autocorrelation function) option in 3dClustSim (Cox et al., 2017b), which uses a long‐tail spatial ACF instead of a Gaussian one. To be able to compare the AFNI results for the new paradigms (E3, E4) and the old paradigms (B1, B2, E1, E2), the group analyses for the old paradigms were reevaluated using the ACF option (note, that this ACF AFNI option still assumes a stationary spatial autocorrelation structure). Interested readers are referred to our github account for further details.

Using ICA‐FIX for denoising

We investigated numerous ways to achieve nominal FWE rates for the one‐sample (sign flipping) permutation test; Applying the Yeo and Johnson (2000) transform (signed Box‐Cox) to reduce skew (as the sign flipping test is based on an assumption of symmetric errors) Using robust regression (in every permutation) to suppress the influence of outliers (Mumford, 2017; Wager, Keller, Lacey, & Jonides, 2005; Woolrich, 2008) Using two‐sided tests instead of one‐sided Increasing the number of head motion regressors from 6 to 24 Using bootstrap instead of sign flipping, and Including the global mean as a covariate in each first‐level analysis (Murphy, Birn, Handwerker, Jones, & Bandettini, 2009; Murphy & Fox, 2017; which is normally not done for task fMRI). While some of these approaches resulted in nominal FWE rates for a subset of the parameter combinations, no approach worked well for all settings and data sets. In our original study, we only used one‐sided tests, but this is based on an implicit assumption that a random regressor is equally likely to be positively or negatively correlated with resting state fMRI data. Additionally, most fMRI studies that use a one‐sample t test take advantage of a one‐sided test to increase statistical power (Chen et al., 2018). To understand the spatial distribution of clusters, we created images of prevalence of false positive clusters, computed by summing the binary maps of FWE‐significant clusters over the random analyses. In our original study, we found a rather structured spatial distribution for the two‐sample t test (supplementary fig. 18 in Eklund et al. (2016a)), with large clusters more prevalent in the posterior cingulate. We have now created the same sort of maps for one‐sample t tests, with a small modification: to increase the number of clusters observed, we created clusters at a CDT of p = .01 for both increases and decreases on a given statistic map. As discussed in Section 3, there appears to be physiological artifacts which ideally would be remediated by respiration or cardiac time series modeling (Birn et al., 2006; Bollmann, Puckett, Cunnington, & Barth, 2018; Chang & Glover, 2009; Glover et al., 2000; Lund et al., 2006), but unfortunately the 1,000 functional connectomes data sets (Biswal et al., 2010) do not have these physiological recordings. To suppress the influence of artifacts, we therefore instead applied ICA FIX (version 1.065) in FSL (Griffanti et al., 2014, 2017; Salimi‐Khorshidi et al., 2014) to all 499 subjects, to remove ICA components that correspond to noise or artifacts. We applied 4 mm of spatial smoothing for MELODIC (Beckmann & Smith, 2004), and used the classifier weights for standard fMRI data available in ICA FIX (trained for 5 mm smoothing). To use ICA FIX for 8 or 10 mm of smoothing would require retraining the classifier. The cleanup was performed using the aggressive (full variance) option instead of the default less‐aggressive option, and motion confounds were also included in the cleanup. To study the effect of retraining the ICA FIX classifier specifically for each data set (Beijing, Cambridge, Oulu), instead of using the pretrained weights available in ICA FIX, we manually labeled the ICA components of 10 subjects for each data set (giving a total of 350–450 ICA components per data set). Indeed, a large portion of the ICA components are artifacts that are similar across subjects. Interested readers are referred to the github account for ICA FIX processing scripts and the retrained classifier weights for Beijing, Cambridge, and Oulu. First‐level analyses for B1, B2, E1, E2, E3, and E4 were performed using FSL for all 499 subjects after ICA FIX, with motion correction and smoothing turned off. Group level analyses were finally performed using the nonparametric one‐sample t‐test available in BROCCOLI (Eklund, Dufort, Villani, & LaConte, 2014).

RESULTS

Figures 2 and 3 show estimated FWE rates for the two new paradigms (E3, E4), for 40 subjects in each group analysis and a CDT of p = .001. Figures A11 and A12 show the FWE rates for a CDT of p = .01. The four old paradigms (B1, B2, E1, E2) are included as well for the sake of comparison. In brief, the new paradigm with two pretend tasks (E3) does not lead to lower FWE rates, compared to the old paradigms. Likewise, randomizing task events over subjects (E4) has if anything worse FWE rates compared to not randomizing the task over subjects. As noted in our original paper, the very low FWE of FSL's FLAME1 is anticipated behavior when there is zero random effects variance. When fitting anything other than a one‐sample group model this conservativeness may not hold; in particular, we previously reported on two‐sample null analyses on task data, where each sample has non‐zero but equal effects, and found that FLAME1’s FWE was equivalent to that of FSL OLS (Eklund et al., 2016a)

Figure 2

Figure 3

Results for two sample t test and cluster‐wise inference using a CDT of p = .001, showing estimated FWE rates for 4–10 mm of smoothing and six different activity paradigms (old paradigms B1, B2, E1, E2, and new paradigms E3, E4), for SPM, FSL, AFNI, and a permutation test. These results are for a group size of 20, giving a total of 40 subjects. Each statistic map was first thresholded using a CDT of p = .001, uncorrected for multiple comparisons, and the surviving clusters were then compared to a FWE‐corrected cluster extent threshold, p FWE = .05. The estimated FWE rates are simply the number of analyses with any significant group activations divided by the number of analyses (1,000). (a) Results for Beijing data (b) results for Cambridge data (c) results for Oulu data

Results for one sample t‐test and cluster‐wise inference using a CDT of p = .001, showing estimated FWE rates for 4–10 mm of smoothing and six different activity paradigms (old paradigms B1, B2, E1, E2, and new paradigms E3, E4), for SPM, FSL, AFNI, and a permutation test. These results are for a group size of 40. Each statistic map was first thresholded using a CDT of p = .001, uncorrected for multiple comparisons, and the surviving clusters were then compared to a FWE‐corrected cluster extent threshold, p FWE = .05. The estimated FWE rates are simply the number of analyses with any significant group activations divided by the number of analyses (1,000). (a) Results for Beijing data (b) results for Cambridge data (c) results for Oulu data Results for two sample t test and cluster‐wise inference using a CDT of p = .001, showing estimated FWE rates for 4–10 mm of smoothing and six different activity paradigms (old paradigms B1, B2, E1, E2, and new paradigms E3, E4), for SPM, FSL, AFNI, and a permutation test. These results are for a group size of 20, giving a total of 40 subjects. Each statistic map was first thresholded using a CDT of p = .001, uncorrected for multiple comparisons, and the surviving clusters were then compared to a FWE‐corrected cluster extent threshold, p FWE = .05. The estimated FWE rates are simply the number of analyses with any significant group activations divided by the number of analyses (1,000). (a) Results for Beijing data (b) results for Cambridge data (c) results for Oulu data By looking at Figure 2, it is possible to compare the parametric methods (who are simultaneously affected by non‐Gaussian SACF, nonstationary smoothness and physiological noise) and the nonparametric permutation test (only affected by physiological noise, as no assumptions are made regarding the SACF and stationary smoothness). For the Beijing data, the permutation test performs rather well, while all parametric approaches struggle despite a strict CDT. It is also clear that the Oulu data is more problematic compared to Beijing and Cambridge.

ICA denoising

Figure 4 shows FWE rates for the nonparametric one‐sample t test, for no ICA FIX, pretrained ICA FIX and retrained ICA FIX, for one‐sided as well as two‐sided tests. For the Beijing data, the FWE rates are almost within the 95% confidence interval even without ICA FIX, and come even closer to the expected 5% after ICA FIX. For the Cambridge data, it is necessary to combine ICA FIX with a two‐sided test to achieve nominal results (only using a two‐sided test is not sufficient). For the Oulu data, neither ICA FIX in isolation nor in combination with two‐sided inference was sufficient to bring false positives to a nominal rate. However, retraining the ICA FIX classifier specifically for the Oulu data set finally resulted in nominal false positive rates.

Figure 4

To test if using ICA FIX also results in nominal FWE rates for FSL OLS, we performed group analyses for no ICA FIX, pretrained ICA FIX and retrained ICA FIX, for one‐sided as well as two‐sided tests, see Figure 5. As ICA FIX cleaning and all first‐level analyses were performed using FSL, we only performed the group analyses using FSL. Clearly, using ICA FIX does not lead to nominal FWE rates for FSL OLS, and using a two‐sided test leads to even higher FWE rates compared to a one‐sided test. A possible explanation is that (two) parametric tests for p = .025 are even more inflated compared to parametric tests for p = .05. To test this hypothesis, we performed 18,000 one‐sided one‐sample group analyses (three data sets and six activity paradigms, 1,000 analyses each, for first‐level analyses with no ICA FIX, CDT p = .001) with FWE significance thresholds of 2.5% and 1%. False positives at FWE 2.5% and 1% should occur 1/2 = 0.5 and 1/5 = 0.2 times as often with FWE 5% results. We found nominal FWE 2.5% false positives occurred at a rate 0.879× the 5% FWE results, and nominal FWE 1% false positives occurred at a rate 0.694× the 5% FWE results. That is, the relative inflation of false positives for parametric methods is much higher for more stringent significance thresholds. For permutation, inaccuracies arise due to a disparity between the upper tail of the sign flipping null and the actual null's upper tail. For the absolute value statistic used for two‐sided tests, the upper tail is essentially an average of f(x) and f(−x) for x > 0 and is less sensitive to violations of symmetry assumption.

Figure 5

Results for FSL OLS one‐sample t tests for cluster‐wise inference using a CDT of p = .001, for no ICA FIX, pretrained ICA FIX and retrained ICA FIX. (a) Results for Beijing data (b) results for Cambridge data (c) results for Oulu data. Results are only shown for 4 mm smoothing, as other smoothing levels would require retraining the ICA FIX classifier. The two‐sided tests show a higher degree of false positives compared to the one‐sided tests. This is explained by the fact that a two‐sided test involves two tests at p = .025, instead of one test at p = .05, and parametric methods are relatively more inflated at more stringent significance thresholds (as the statistical assumptions are more critical for the tail of the null distribution)

Results for nonparametric (sign flipping) one‐sample t tests for cluster‐wise inference using a CDT of p = .001, for no ICA FIX, pretrained ICA FIX and retrained ICA FIX. (a) Results for Beijing data (b) results for Cambridge data (c) results for Oulu data. Results are only shown for 4 mm smoothing, as other smoothing levels would require retraining the ICA FIX classifier. For both Beijing and Cambridge, the pretrained classifier weights for ICA FIX are sufficient to achieve nominal false positive rates, while it is necessary to retrain the ICA FIX classifier specifically for the Oulu data (a possible explanation is that the Oulu data have a spatial resolution of 4 x 4 x 4.4 mm3, while ICA FIX for standard fMRI data is pretrained on data with a spatial resolution of 3.5 x 3.5 x 3.5 mm3) Results for FSL OLS one‐sample t tests for cluster‐wise inference using a CDT of p = .001, for no ICA FIX, pretrained ICA FIX and retrained ICA FIX. (a) Results for Beijing data (b) results for Cambridge data (c) results for Oulu data. Results are only shown for 4 mm smoothing, as other smoothing levels would require retraining the ICA FIX classifier. The two‐sided tests show a higher degree of false positives compared to the one‐sided tests. This is explained by the fact that a two‐sided test involves two tests at p = .025, instead of one test at p = .05, and parametric methods are relatively more inflated at more stringent significance thresholds (as the statistical assumptions are more critical for the tail of the null distribution) Figures 6, 7, 8 show cluster prevalence maps for group analyses without first running ICA FIX, with pretrained ICA FIX and with retrained ICA FIX, for first level designs E2 and E4. Using ICA FIX leads to false cluster maps that are more uniform across the brain, with fewer false clusters in white matter, and using ICA FIX made the biggest difference for the Oulu data. While Beijing and Cambridge sites have a concentration of clusters in posterior cingulate, frontal, and parietal areas, Oulu has more clusters and a more diffuse pattern. Further inspection of these maps suggested a venous artifact, and running a PCA on the Oulu activity maps for design E2 finds substantial variation in the sagittal sinus picked up by the task regressor (see Figure 9). The posterior part of the artifact is suppressed by the pretrained ICA FIX classifier, and the retrained ICA FIX classifier is even better at suppressing the artifact. Also see Figure 10 for activation maps from five Oulu subjects, analyzed with design E4. In several cases, significant activity differences between two random task regressors are detected close to the superior sagittal sinus, indicating a vein artifact.

Figure 6

Figure 7

The maps show voxel‐wise incidence of false clusters for the Cambridge data, for two of the six different first level designs (a,b) no ICA FIX (c,d) ICA FIX pretrained (e,f) ICA FIX retrained for Cambridge. Left: results for design E2, right: results for design E4. Image intensity is the number of times, out of 10,000 random analyses, a significant cluster occurred at a given voxel (CDT p = .01) for FSL OLS. Each analysis is a one‐sample t test using 20 subjects. The maps represent axial slice 50 (MNI z coordinate = 26) for the MNI152 2 mm brain template used in FSL

Figure 8

The maps show voxel‐wise incidence of false clusters for the Oulu data, for two of the six different first level designs (a,b) no ICA FIX (c,d) ICA FIX pretrained (e,f) ICA FIX retrained for Oulu. Left: results for design E2, right: results for design E4. Image intensity is the number of times, out of 10,000 random analyses, a significant cluster occurred at a given voxel (CDT p = .01) for FSL OLS. Each analysis is a one‐sample t test using 20 subjects. The retrained ICA FIX classifier is clearly better at suppressing artifacts compared to the pretrained classifier, especially for design E4. The maps represent axial slice 50 (MNI z coordinate = 26) for the MNI152 2 mm brain template used in FSL

Figure 9

The maps show an axial and a sagittal view of the first Eigen component after running PCA on the 103 activity maps for Oulu E2, (a) without ICA FIX (b) with ICA FIX, using the pretrained classifier, (c) with ICA FIX, after retraining the ICA FIX classifier specifically for Oulu data. Using ICA FIX clearly suppresses the posterior part of the vein artifact in the superior sagittal sinus, but a portion of the artifact is still present. The retrained ICA FIX classifier is clearly better at suppressing the artifact. The axial maps represent axial slice 50 (MNI z coordinate = 26) for the MNI152 2 mm brain template used in FSL. The sagittal maps represent sagittal slice 48 (MNI x coordinate = −4)

Figure 10

Activity maps (thresholded at CDT p = .01 and cluster FWE corrected at p = .05, FSL default) for five Oulu subjects analyzed with 4 mm of smoothing and first level design E4. Despite testing for a difference between two random regressors, which are for design E4 also randomized over subjects, significant voxels are in several cases detected close to the superior sagittal sinus (indicating a vein artifact). As many subjects have an activation difference in the same spatial location, this caused inflated false positive rates for the one‐sample t test. The two‐sample t test is not affected by these artifacts, since they cancel out when testing for a group difference

The maps show voxel‐wise incidence of false clusters for the Beijing data, for two of the six different first level designs (a,b) no ICA FIX (c,d) ICA FIX pretrained (e,f) ICA FIX retrained for Beijing. Left: results for design E2, right: results for design E4. Image intensity is the number of times, out of 10,000 random analyses, a significant cluster occurred at a given voxel (CDT p = .01) for FSL OLS. Each analysis is a one‐sample t test using 20 subjects. The maps represent axial slice 50 (MNI z coordinate = 26) for the MNI152 2 mm brain template used in FSL The maps show voxel‐wise incidence of false clusters for the Cambridge data, for two of the six different first level designs (a,b) no ICA FIX (c,d) ICA FIX pretrained (e,f) ICA FIX retrained for Cambridge. Left: results for design E2, right: results for design E4. Image intensity is the number of times, out of 10,000 random analyses, a significant cluster occurred at a given voxel (CDT p = .01) for FSL OLS. Each analysis is a one‐sample t test using 20 subjects. The maps represent axial slice 50 (MNI z coordinate = 26) for the MNI152 2 mm brain template used in FSL The maps show voxel‐wise incidence of false clusters for the Oulu data, for two of the six different first level designs (a,b) no ICA FIX (c,d) ICA FIX pretrained (e,f) ICA FIX retrained for Oulu. Left: results for design E2, right: results for design E4. Image intensity is the number of times, out of 10,000 random analyses, a significant cluster occurred at a given voxel (CDT p = .01) for FSL OLS. Each analysis is a one‐sample t test using 20 subjects. The retrained ICA FIX classifier is clearly better at suppressing artifacts compared to the pretrained classifier, especially for design E4. The maps represent axial slice 50 (MNI z coordinate = 26) for the MNI152 2 mm brain template used in FSL The maps show an axial and a sagittal view of the first Eigen component after running PCA on the 103 activity maps for Oulu E2, (a) without ICA FIX (b) with ICA FIX, using the pretrained classifier, (c) with ICA FIX, after retraining the ICA FIX classifier specifically for Oulu data. Using ICA FIX clearly suppresses the posterior part of the vein artifact in the superior sagittal sinus, but a portion of the artifact is still present. The retrained ICA FIX classifier is clearly better at suppressing the artifact. The axial maps represent axial slice 50 (MNI z coordinate = 26) for the MNI152 2 mm brain template used in FSL. The sagittal maps represent sagittal slice 48 (MNI x coordinate = −4) Activity maps (thresholded at CDT p = .01 and cluster FWE corrected at p = .05, FSL default) for five Oulu subjects analyzed with 4 mm of smoothing and first level design E4. Despite testing for a difference between two random regressors, which are for design E4 also randomized over subjects, significant voxels are in several cases detected close to the superior sagittal sinus (indicating a vein artifact). As many subjects have an activation difference in the same spatial location, this caused inflated false positive rates for the one‐sample t test. The two‐sample t test is not affected by these artifacts, since they cancel out when testing for a group difference

DISCUSSION

We have presented results that support our original findings of inflated false positives with parametric cluster size inference. Specifically, new random null group task fMRI analyses, based on first level models with two fix regressors and models with two intersubject‐randomized regressors, produced essentially the same results as the previous first level designs we considered. This argues against the charge that idiosyncratic attributes of our first level designs gave rise to our observed inflated false positives rates for cluster inference. Instead, we maintain that the best explanations for this behavior are the long‐tail spatial autocorrelation data (also present in MR phantom data (Kriegeskorte, Bodurka, & Bandettini, 2008)) and spatially‐varying smoothness. Recently, Greve and Fischl (2018) showed that group analyses of cortical thickness, surface area, and volume (using only the structural MRI data in the fcon1000 data set (Biswal et al., 2010)) also lead to inflated false positive rates in some cases, indicating that these issues affect structural analyses on the cortical surface as well, and thus is not specific to fMRI paradigms. It should be noted that AFNI provides another function for cluster thresholding, ETAC (equitable thresholding and clustering; Cox, 2018), which performs better than the long‐tail ACF function (Cox et al., 2017b) used here, but ETAC was not available when we started the new group analyses. AFNI also provides nonparametric group inference in the 3dttest++ function.

Influence of artifacts on one‐sample t tests

Another objective of this work was to understand and remediate the less‐than‐perfect false positive rate control for one‐sample permutation tests. We tried various alternative modeling strategies, including data transformations and robust regression, but none yielded consistent control of FWE. It appears that (physiological) artifacts are a major problem for the Oulu data, although the MRIQC tool (Esteban et al., 2017; Gorgolewski et al., 2017) did not reveal any major quality differences between Beijing, Cambridge, and Oulu. The contribution of physiological noise in fMRI depends on the spatial resolution (see e.g., Bodurka, Ye, Petridou, Murphy, & Bandettini, 2007); larger voxels lead to a lower temporal signal to noise ratio. The Oulu data have a spatial resolution of 4 x 4 x 4.4 mm3, compared to 3.13 x 3.13 x 3.6 mm3 for Beijing and 3 x 3 x 3 mm3 for Cambridge. Oulu voxels are thereby two times larger compared to Beijing voxels, and 2.6 times larger compared to Cambridge voxels, and this will make the Oulu data more prone to physiological noise. As mentioned in Section 1, one can argue that a pure simulation (Welvaert & Rosseel, 2014) would avoid the problem of physiological noise, or that the Oulu data should be set aside, but we here opted to show results after denoising with ICA FIX, as many fMRI data sets have been collected without recordings of breathing and pulse. Some of our random regressors are strongly correlated with the fMRI data in specific brain regions (especially the superior sagittal sinus, the transverse sinus, and the sigmoid sinus), which lead to inflated false positive rates. Other artifacts, such as CSF artifacts and susceptibility weighted artifacts, are also present in the data (compared to examples given by Griffanti et al., 2017). For a two‐sample t test, artifacts in the same spatial location for all subjects cancel out, as one tests for a difference between two groups, but this is not the case for a one‐sample t test. Combining ICA FIX with a two‐sided test led to nominal FWE rates for Beijing and Cambridge, but not for Oulu. As can be seen in Figures 6, 7, 8, using ICA FIX clearly leads to false cluster maps which are more uniform across the brain, with a lower number of false clusters in white matter. Retraining the ICA FIX classifier finally lead to nominal results for the Oulu data. A possible explanation is that the pretrained classifier for standard fMRI data in ICA FIX is trained on fMRI data with a spatial resolution of 3.5 x 3.5 x 3.5 mm3 (i.e., 1.6 times smaller voxels than Oulu). Figure 8 shows that the retrained classifier leads to more uniform false cluster maps, compared to the pretrained classifier, for design E4 for Oulu. As can be seen in Figure 9, the retrained classifier is better at suppressing the artifact in the sagittal sinus, compared to the pretrained classifier. We here trained the classifier for each data set (Beijing, Cambridge, Oulu) using labeled ICA components from 10 subjects, as recommended by the ICA FIX user guide, and labeling components from more subjects can lead to even better results. Using ICA FIX for resting state fMRI data is rather easy (but it currently requires a specific version of the R software), as pretrained weights are available for different kinds of fMRI data. However, using ICA FIX for task fMRI data will require more work, as it is necessary to first manually classify ICA components (Griffanti et al., 2017) to provide training data for the classifier (Salimi‐Khorshidi et al., 2014). An open database of manually classified fMRI ICA components, similar to NeuroVault (Gorgolewski et al., 2015), could potentially be used for fMRI researchers to automatically denoise their task fMRI data. A natural extension of MRIQC (Esteban et al., 2017; Gorgolewski et al., 2017) would then be to also measure the presence of artifacts in each fMRI data set, by doing ICA and then comparing each component to the manually classified components in the open database. We also recommend researchers to collect physiological data, such that signal related to breathing and pulse can be modeled (Birn et al., 2006; Bollmann et al., 2018; Chang & Glover, 2009; Glover et al., 2000; Lund et al., 2006). This is especially important for 7T fMRI data, for which the physiological noise is often stronger compared to the thermal noise (Hutton et al., 2011; Triantafyllou et al., 2005). Alternatives to collecting physiological data, or using ICA FIX, include ICA AROMA (Pruim et al., 2015), DPARSF (Yan & Zang, 2010), and FMRIPrep (Esteban et al., 2018). DPARSF and FMRIPrep can automatically generate nuisance regressors (e.g., from CSF and white matter) to be included in the statistical analysis.

Effect of multiband data on cluster inference

We note that multiband MR sequences (Moeller et al., 2010) are becoming increasingly common to improve temporal and/or spatial resolution, for example as provided by the Human Connectome Project (Essen et al., 2013) and the enhanced NKI‐Rockland sample (Nooner et al., 2012). Multiband data have a potentially complex spatial autocorrelation (see, e.g, Risk, Kociuba, and Rowe (2018)), and an important topic for future work is establishing how this impacts parametric cluster inference. The nonparametric permutation test (Winkler et al., 2014) does not make any assumption regarding the shape of the SACF, and is therefore expected to perform well for any MR sequence.

Interpretation of affected studies

In Appendix A, we provide a rough bibliographic analysis to provide an estimate of how many articles used this particular CDT p = .01 setting. For a review conducted in January 2018, we estimated that out of 23,000 fMRI publications about 2,500, over 10%, of all studies have used this most problematic setting with parametric inference. While this calculation suggests how the literature as a whole can be interpreted, a more practical question is how one individual affected study can be interpreted. When examining a study that uses CDT p = .01, or one that uses no correction at all, it is useful to consider three possible states of nature: State 1: Effect is truly present, and with revised methods, significance is retained. State 2: Effect is truly present, but with revised methods, significance is lost. State 3: Effect is truly null, absent; the study's detection is a false positive. In each of these, the statement about “truth” reflects presence or absence of the effect in the population from which the subjects were drawn. When considering heterogeneity of different populations used for research, we could also add a fourth state: State 4: Effect is truly null in population sampled, and this study's detection is a false positive; but later studies find and replicate the effect in other populations. These could be summarized as “State 1: Robust true positive,” “State 2: Fragile true positive,” “State 3: False positive,” and “State 4: Idiosyncratic false positive.” Unfortunately, we can never know the true state of an effect, and, because of a lack of data archiving and sharing, we will mostly never know whether significance is retained or lost with reanalysis. All we can do is make qualitative judgments on individual works. To this end, we can suggest that findings with no form of corrected significance receive the greatest skepticism; likewise, CDT p = .01 cluster size inference cluster p‐values that just barely fall below 5% FWE significance should be judged with great skepticism. In fact, given small perturbations arising from a range of methodological choices, all research findings on the edge of a significance threshold deserves such skepticism. On the other hand, findings based on large clusters with p‐values far below .05 could possibly survive a reanalysis with improved methods.

CONCLUSIONS

To summarize, our new results confirm that inflated FWE rates for parametric cluster inference are also present when testing for a difference between two tasks, and when randomizing the task over subjects. Furthermore, the inflated FWE rates for the nonparametric one‐sample t tests are due to random correlations with artifacts in the fMRI data, which for Beijing and Cambridge we found could be suppressed using the pretrained ICA FIX classifier for standard fMRI data. The Oulu data were collected with a lower spatial resolution, and are therefore more prone to physiological noise. By retraining the ICA FIX classifier specifically for the Oulu data, nominal results were finally obtained for Oulu as well. Data cleaning is clearly important for task fMRI, and not only for resting state fMRI.

Table A1

Upon request, the authors of Woo, Krishnan, and Wager (2014) provided a detailed cross‐tabulation of the frequencies of different CDTs (the data presented in fig. 2(b) of their paper). Among the 607 studies that used cluster thresholding, they found 480 studies for which sufficient detail could be obtained to record the software and the particular CDT used

CDT	AFNI	BrainVoyager	FSL	SPM	Others	Total
>0.01	9	5	9	8	4	35
0.01	9	4	44	20	3	80
0.005	24	6	1	48	3	82
0.001	13	20	11	206	5	255
<0.001	2	5	3	16	2	28
Total	57	40	68	298	17	480

61 in total

1. Increased sensitivity in neuroimaging analyses using robust regression.

Authors: Tor D Wager; Matthew C Keller; Steven C Lacey; John Jonides
Journal: Neuroimage Date: 2005-05-15 Impact factor: 6.556

2. Robust group analysis using outlier inference.

Authors: Mark Woolrich
Journal: Neuroimage Date: 2008-03-06 Impact factor: 6.556

3. Equitable Thresholding and Clustering: A Novel Method for Functional Magnetic Resonance Imaging Clustering in AFNI.

Authors: Robert W Cox
Journal: Brain Connect Date: 2019-09

4. Reevaluating "cluster failure" in fMRI using nonparametric control of the false discovery rate.

Authors: Daniel Kessler; Mike Angstadt; Chandra S Sripada
Journal: Proc Natl Acad Sci U S A Date: 2017-04-18 Impact factor: 11.205

5. Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates.

Authors: Anders Eklund; Thomas E Nichols; Hans Knutsson
Journal: Proc Natl Acad Sci U S A Date: 2016-06-28 Impact factor: 11.205

Review 6. A comprehensive review of group level model performance in the presence of heteroscedasticity: Can a single model control Type I errors in the presence of outliers?

Authors: Jeanette A Mumford
Journal: Neuroimage Date: 2016-12-25 Impact factor: 6.556

7. The impact of global signal regression on resting state correlations: are anti-correlated networks introduced?

Authors: Kevin Murphy; Rasmus M Birn; Daniel A Handwerker; Tyler B Jones; Peter A Bandettini
Journal: Neuroimage Date: 2008-10-11 Impact factor: 6.556

Review 8. A review of fMRI simulation studies.

Authors: Marijke Welvaert; Yves Rosseel
Journal: PLoS One Date: 2014-07-21 Impact factor: 3.240

9. Towards a consensus regarding global signal regression for resting state functional connectivity MRI.

Authors: Kevin Murphy; Michael D Fox
Journal: Neuroimage Date: 2016-11-22 Impact factor: 6.556

10. Toward open sharing of task-based fMRI data: the OpenfMRI project.

Authors: Russell A Poldrack; Deanna M Barch; Jason P Mitchell; Tor D Wager; Anthony D Wagner; Joseph T Devlin; Chad Cumba; Oluwasanmi Koyejo; Michael P Milham
Journal: Front Neuroinform Date: 2013-07-08 Impact factor: 4.081

18 in total

1. An integrated cluster-wise significance measure for fMRI analysis.

Authors: Yunjiang Ge; Gang Chen; James A Waltz; Liyi Elliot Hong; Peter Kochunov; Shuo Chen
Journal: Hum Brain Mapp Date: 2022-03-02 Impact factor: 5.399

2. Alcohol- and non-alcohol-related interference: An fMRI study of treatment-seeking adults with alcohol use disorder.

Authors: Laura Murray; Julia C Welsh; Chase G Johnson; Roselinde H Kaiser; Todd J Farchione; Amy C Janes
Journal: Drug Alcohol Depend Date: 2022-04-15 Impact factor: 4.852

3. Acute social isolation evokes midbrain craving responses similar to hunger.

Authors: Livia Tomova; Kimberly L Wang; Todd Thompson; Gillian A Matthews; Atsushi Takahashi; Kay M Tye; Rebecca Saxe
Journal: Nat Neurosci Date: 2020-11-23 Impact factor: 24.884

4. Bayes estimate of primary threshold in clusterwise functional magnetic resonance imaging inferences.

Authors: Yunjiang Ge; Stephanie Hare; Gang Chen; James A Waltz; Peter Kochunov; L Elliot Hong; Shuo Chen
Journal: Stat Med Date: 2021-07-26 Impact factor: 2.373

5. Resting-state functional connectivity of the human hippocampus in periadolescent children: Associations with age and memory performance.

Authors: David E Warren; Anthony J Rangel; Nicholas J Christopher-Hayes; Jacob A Eastman; Michaela R Frenzel; Julia M Stephen; Vince D Calhoun; Yu-Ping Wang; Tony W Wilson
Journal: Hum Brain Mapp Date: 2021-05-12 Impact factor: 5.038

6. Bridging the Brain and Data Sciences.

Authors: John Darrell Van Horn
Journal: Big Data Date: 2020-11-18 Impact factor: 4.426

7. Comment to: "Cluster failure revisited: Impact of first level design and physiological noise on cluster false positive rates".

Authors: Vesa Kiviniemi
Journal: Hum Brain Mapp Date: 2019-12-13 Impact factor: 5.038

8. Cluster failure revisited: Impact of first level design and physiological noise on cluster false positive rates.

Authors: Anders Eklund; Hans Knutsson; Thomas E Nichols
Journal: Hum Brain Mapp Date: 2018-10-15 Impact factor: 5.038

9. Neural Circuitry of Novelty Salience Processing in Psychosis Risk: Association With Clinical Outcome.

Authors: Gemma Modinos; Paul Allen; Andre Zugman; Danai Dima; Matilda Azis; Carly Samson; Ilaria Bonoldi; Beverly Quinn; George W G Gifford; Sophie E Smart; Mathilde Antoniades; Matthijs G Bossong; Matthew R Broome; Jesus Perez; Oliver D Howes; James M Stone; Anthony A Grace; Philip McGuire
Journal: Schizophr Bull Date: 2020-04-10 Impact factor: 9.306

Review 10. Improving practices and inferences in developmental cognitive neuroscience.

Authors: John C Flournoy; Nandita Vijayakumar; Theresa W Cheng; Danielle Cosme; Jessica E Flannery; Jennifer H Pfeifer
Journal: Dev Cogn Neurosci Date: 2020-06-30 Impact factor: 6.464