Alexander Bowring1, Camille Maumet2, Thomas E Nichols1,3,4.
Abstract
A wealth of analysis tools are available to fMRI researchers in order to extract patterns of task variation and, ultimately, understand cognitive function. However, this "methodological plurality" comes with a drawback. While conceptually similar, two different analysis pipelines applied to the same dataset may not produce the same scientific results. Differences in methods, implementations across software, and even operating systems or software versions all contribute to this variability. Consequently, attention in the field has recently been directed to reproducibility and data sharing. In this work, our goal is to understand how the choice of software package affects analysis results. We use publicly shared data from three published task fMRI neuroimaging studies, reanalyzing each study using the three main neuroimaging software packages, AFNI, FSL, and SPM, using parametric and nonparametric inference. We obtain all information on how to process, analyse, and model each dataset from the publications. We make quantitative and qualitative comparisons between our replications to gauge the scale of variability in our results and assess the fundamental differences between each software package. Qualitatively we find similarities between packages, backed up by Neurosynth association analyses that correlate similar words and phrases to all three software packages' unthresholded results for each of the studies we reanalyse. However, we also discover marked differences, such as Dice similarity coefficients ranging from 0.000 to 0.684 in comparisons of thresholded statistic maps between software. We discuss the challenges involved in trying to reanalyse the published studies, and highlight our efforts to make this research reproducible.
Keywords: AFNI; FSL; SPM; analytic flexibility; analytic variability; fMRI; reproducibility; software comparison; task-fMRI
Year: 2019 PMID: 31050106 PMCID: PMC6618324 DOI: 10.1002/hbm.24603
Source DB: PubMed Journal: Hum Brain Mapp ISSN: 1065-9471 Impact factor: 5.038
Software processing steps
| | Processing step | AFNI | FSL | SPM |
|---|---|---|---|---|
| Preprocessing | Script | @SSWarper; afni_proc.py | FEAT First‐level analysis | Batch ( |
| Preprocessing | Slice‐timing | -tshift_opts_ts -tpattern | Prestats: Slice timing correction | Slice timing |
| Preprocessing | Realignment/motion correction | -volreg_align_e2a | Prestats: Motion correction: MCFLIRT | Realign: estimate and reslice |
| Preprocessing | Segmentation | | | Segment |
| Preprocessing | Brain extraction (anatomical) | -copy_anat [@SSWarper result] -anat_has_skull no | bet ( | Image calculator |
| Preprocessing | Brain extraction (functional) | | Prestats: BET brain extraction | |
| Preprocessing | Intrasubject coregistration | -align_opts_aea -giant_move -check_flip | Registration: Normal search, BBR | Coregister: Estimate |
| Preprocessing | Intersubject registration | -tlrc_base -volreg_tlrc_warp -tlrc_NL_warp -tlrc_NL_warped_dsets [@SSWarper result] | Registration: nonlinear, warp resolution 10 mm | Normalise: Write |
| Preprocessing | Analysis voxel size | -volreg_warp_dxyz ( | | Normalise: Write: Writing options: Voxel sizes |
| Preprocessing | Smoothing | -blur_size | Prestats: Spatial smoothing FWHM (mm) | Smooth |
| First‐level | Script | afni_proc.py | FEAT First‐level analysis | Specify first‐level |
| First‐level | Model specification | -regress_stim_times -regress_stim_labels -regress_basis_multi -regress_stim_types | Stats: Full model setup: EVs | fMRI model specification |
| First‐level | Inclusion of 6 motion parameters | | Stats: Standard motion parameters | fMRI model specification: Data & Design: Multiple regressors: Realignment Param file |
| First‐level | Model estimation | | | Model estimation |
| First‐level | Contrasts | -regress_3dD_stop -regress_reml_exec -regress_opts_3dD -gltsym | Stats: Full model setup: contrasts | Contrast manager |
| Second‐level | Script | 3dMEMA; 3dMVM | FEAT Higher‐level analysis | Specify second‐level |
| Second‐level | Model specification | 3dMEMA -set -missing_data 0; 3dMVM -dataTable | Stats: Full model setup: EVs | Factorial design specification: One‐sample; Full factorial |
| Second‐level | Model estimation | | | Model estimation |
| Second‐level | Contrasts | | Stats: Full model setup: contrasts | Contrast manager |
| Second‐level | Second‐level inference | 3dMask_Tool (obtain group‐mask); 3dClustSim; 3dClust; 3dCalc (Binarizing cluster masks and masking t_stat); 3dTcat (obtaining one image in a 4d volume) | Poststats | Results report |
| Results sharing | NIDM‐results export | | nidmfsl | Results report |
| Results sharing | NeuroVault upload | | | |
Implementation of each of the processing steps (ds000001, ds000109, ds000120) within AFNI, FSL and SPM.
The @SSWarper program was run on each subject prior to afni_proc.py for brain extraction of the anatomical image, and to apply the nonlinear warp of the anatomy to MNI space.
ds000120 only.
Image calculator was used to create a brain mask from grey matter, white matter, and CSF images; see text.
Figure 1 (a) Comparison of the thresholded statistic maps from our reanalysis with the main figures from each of the three publications. Left: For ds000001 data, thresholded T‐statistic images contrasting the parametric modulation of pumps of reward balloons versus the parametric modulation of the control balloon; beneath, a sagittal slice taken from fig. 3 in Schonberg et al. (2012). Middle: For ds000109, thresholded T‐statistic maps of the false belief versus false photo contrast; beneath, a midsagittal render from Moran et al. (2012). Right: For ds000120, thresholded F‐statistic images of the main effect of time contrast; beneath, a midsagittal render from fig. 3 in Padmanabhan et al. (2011). Note that for ds000109 and ds000120 the publication's figures are renderings onto the cortical surface while our results are slice views. While each major activation area found in the original study exists in the reanalyses, there is substantial variation between the reanalyses. (b) Comparison of the thresholded statistic maps from our reanalysis displayed as a series of axial slices. Top: ds000001’s thresholded T‐statistic maps contrasting parametric modulations of the reward balloons versus pumps of the control balloons. Middle: ds000109’s thresholded T‐statistic maps of the false belief versus false photo contrast. Bottom: ds000120’s thresholded F‐statistic maps of the main effect of time contrast. This figure complements the single slice views shown in (a) [Color figure can be viewed at http://wileyonlinelibrary.com]
Figure 2 Comparison of the unthresholded statistic maps from our reanalysis of the three studies within each software package. Left: ds000001’s unthresholded T‐statistic maps of the parametric modulation of pumps of reward balloons versus the parametric modulation of the control balloon contrast. Middle: ds000109’s unthresholded T‐statistic maps of the false belief versus false photo contrast. Right: ds000120’s unthresholded F‐statistic maps of the main effect of time contrast. While areas of strong activation are somewhat consistent across all three sets of reanalyses, there is substantial variation in nonextreme values.
Neurosynth analyses
| | AFNI: Neurosynth analysis | Corr. | FSL: Neurosynth analysis | Corr. | SPM: Neurosynth analysis | Corr. |
|---|---|---|---|---|---|---|
| ds000001 | Anterior insula | 0.359 | Anterior insula | 0.240 | Anterior insula | 0.322 |
| ds000001 | Insula | 0.276 | | 0.233 | Anterior | 0.245 |
| ds000001 | Anterior | 0.243 | | 0.203 | Insula | 0.240 |
| ds000001 | Insula anterior | 0.233 | Parietal | 0.190 | | 0.229 |
| ds000001 | Thalamus | 0.221 | | 0.188 | | 0.225 |
| ds000001 | | 0.211 | | 0.184 | Insula anterior | 0.214 |
| ds000001 | | 0.198 | | 0.181 | Thalamus | 0.201 |
| ds000001 | Supplementary | 0.197 | Basal ganglia | 0.173 | Acc | 0.199 |
| ds000001 | Premotor | 0.196 | Ganglia | 0.172 | Anterior cingulate | 0.196 |
| ds000001 | Anterior cingulate | 0.192 | Basal | 0.169 | Ganglia | 0.188 |
| ds000109 | Medial prefrontal | 0.422 | Medial prefrontal | 0.355 | Medial prefrontal | 0.361 |
| ds000109 | Medial | 0.381 | Medial | 0.309 | | 0.331 |
| ds000109 | | 0.366 | Default | 0.301 | | 0.329 |
| ds000109 | | 0.348 | Posterior cingulate | 0.299 | Precuneus | 0.314 |
| ds000109 | | 0.341 | | 0.290 | | 0.310 |
| ds000109 | Precuneus | 0.334 | | 0.282 | Medial | 0.301 |
| ds000109 | Posterior cingulate | 0.327 | Cingulate | 0.275 | | 0.296 |
| ds000109 | | 0.322 | | 0.270 | Prefrontal | 0.294 |
| ds000109 | | 0.311 | | 0.261 | | 0.289 |
| ds000109 | | 0.287 | Precuneus | 0.259 | Posterior cingulate | 0.287 |
| ds000120 | | 0.377 | | | | 0.481 |
| ds000120 | v1 | 0.317 | | | Occipital | 0.367 |
| ds000120 | Occipital | 0.293 | | | v1 | 0.340 |
| ds000120 | | 0.261 | | | Visual cortex | 0.267 |
| ds000120 | | 0.252 | | | | 0.248 |
| ds000120 | Visual cortex | 0.243 | | | Spl | 0.245 |
| ds000120 | Early visual | 0.241 | | | | 0.242 |
| ds000120 | | 0.232 | | | Early visual | 0.238 |
| ds000120 | | 0.229 | | | Lingual | 0.238 |
| ds000120 | Parietal | 0.222 | | | Intraparietal | 0.237 |
The Neurosynth analysis terms most strongly associated (via Pearson correlation) with each software's group‐level statistic map across the three studies. Nonanatomical terms are shown in bold.
Figure 3 (a) Cross‐software Bland–Altman 2D histograms comparing the unthresholded group‐level T‐statistic maps computed as part of our reanalyses of the ds000001 and ds000109 studies within AFNI, FSL, and SPM. Left: Comparisons for ds000001’s balloon analog risk task, T‐statistic images contrasting the parametric modulation of pumps of the reward balloons versus parametric modulation of pumps of the control balloon. Right: Comparisons for ds000109’s false belief task, T‐statistic images contrasting the false belief versus false photo conditions. Density images show the relationship between the average T‐statistic value (abscissa) and difference of T‐statistic values (ordinate) at corresponding voxels in the unthresholded T‐statistic images for each pairwise combination of software packages. While there is no particular pattern of bias, as the T‐statistic differences are centered about zero, there is remarkable range, with differences exceeding ±4 in all comparisons. (b) Cross‐software Bland–Altman 2D histogram comparing the unthresholded main effect of time F‐statistic maps computed in AFNI and SPM for reanalyses of the ds000120 study. The differences are generally centered about zero, with a trend toward larger F‐statistics for AFNI. (The funnel‐like pattern is a consequence of the F‐statistic taking on only positive values.)
Summary of test statistics mean differences and correlations for each pair of test statistic images
| | | ds000001 Mean diff. | ds000001 Corr | ds000109 Mean diff. | ds000109 Corr | ds000120 Mean diff. | ds000120 Corr |
|---|---|---|---|---|---|---|---|
| AFNI vs. FSL | Parametric | 0.009 | 0.616 | 0.035 | 0.585 | | |
| AFNI vs. FSL | Nonparametric | 0.271 | 0.577 | 0.006 | 0.573 | | |
| AFNI vs. SPM | Parametric | 0.061 | 0.614 | −0.490 | 0.747 | 0.415 | 0.748 |
| AFNI vs. SPM | Nonparametric | −0.096 | 0.628 | −0.445 | 0.787 | n/a | n/a |
| FSL vs. SPM | Parametric | −0.047 | 0.684 | −0.529 | 0.429 | | |
| FSL vs. SPM | Nonparametric | −0.479 | 0.720 | −0.439 | 0.438 | | |
| AFNI | Para. vs. NonP. | 0.155 | 0.984 | −0.048 | 0.981 | | |
| FSL | Para. vs. NonP. | 0.382 | 0.844 | −0.064 | 0.946 | | |
| SPM | Para. vs. NonP. | 0.000 | 1.000 | 0.000 | 1.000 | | |
Mean differences correspond to the y‐axes of the Bland–Altman plots displayed in Figures 3a,b, and 7. Each mean difference is the first item minus second; for example, AFNI versus FSL mean difference is AFNI‐FSL. Correlation is the Pearson's r between the test statistic values for the pair compared. Intersoftware differences are greater than intrasoftware.
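As a rough illustration of how the two summaries in this table can be obtained from a pair of statistic images, the sketch below computes the mean difference (first minus second) and Pearson's r over corresponding voxels, along with the per-voxel (average, difference) pairs that a Bland–Altman plot displays. It is a minimal sketch, not the authors' code: the function names and the flat-list input format are assumptions, and real statistic maps would first need to be masked and flattened.

```python
import math


def bland_altman_summary(stat_a, stat_b):
    """Return (mean difference a - b, Pearson r) for two voxel-value sequences.

    Both inputs are equal-length flat sequences of test statistic values,
    one entry per voxel (an assumed input format for this sketch).
    """
    n = len(stat_a)
    if n != len(stat_b) or n < 2:
        raise ValueError("inputs must have the same length, at least 2")
    mean_a = sum(stat_a) / n
    mean_b = sum(stat_b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(stat_a, stat_b))
    var_a = sum((x - mean_a) ** 2 for x in stat_a)
    var_b = sum((y - mean_b) ** 2 for y in stat_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant map")
    return mean_a - mean_b, cov / math.sqrt(var_a * var_b)


def bland_altman_points(stat_a, stat_b):
    """Per-voxel (average, difference) pairs: abscissa and ordinate of the plot."""
    return [((x + y) / 2, x - y) for x, y in zip(stat_a, stat_b)]


# Hypothetical toy values standing in for two packages' T-statistics.
t_first = [1.0, 2.0, 3.0, 4.0]
t_second = [2.0, 3.0, 4.0, 5.0]
mean_diff, r = bland_altman_summary(t_first, t_second)
print(mean_diff, r)
```

Comparing a map with itself returns a mean difference of zero and a correlation of one, which is the pattern reported for SPM's parametric versus nonparametric comparison.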
Figure 7 Intrasoftware Bland–Altman 2D histograms for the ds000001 and ds000109 studies comparing the unthresholded group‐level T‐statistic maps computed for parametric and nonparametric inference methods in AFNI, FSL and SPM. Each comparison here uses the same preprocessed data, varying only the second‐level statistical model. SPM's parametric and nonparametric analyses both use the same (unweighted) one‐sample T‐test, and thus show no differences. AFNI and FSL's parametric models use iterative estimation of between‐subject variance and weighted least squares and thus show some differences, but these are still smaller than in the between‐software comparisons.
Figure 4 Dice coefficients comparing the thresholded positive and negative T‐statistic maps computed for each pair of software packages and inference methods for each of the three reproduced studies. Dice coefficients were computed over the intersection of the pair of analysis masks, to assess only regions where activation could occur in both packages. The percentage of “spill over” activation, that is, the percentage of activation in one software's thresholded statistic map that fell outside of the analysis mask of the other software, is displayed in grey; left value for row software, right value for column software. For ds000001 increases, FSL permutation obtained no significant results, thus generating Dice coefficients of zero; for ds000109 decreases, only AFNI and FSL parametric obtained a result and hence only one coefficient is displayed. Dice coefficients are mostly below 0.5, while parametric‐nonparametric intrasoftware results are generally higher; ds000120’s F‐statistic results are notably high, at 0.684, perhaps because it is testing a main effect with ample power.
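The Dice and "spill over" quantities described in the Figure 4 caption can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names, the flat boolean-list inputs, and the choice to return NaN when neither map has activation in the shared mask are all assumptions made here.

```python
def dice_in_shared_mask(sig_a, sig_b, mask_a, mask_b):
    """Dice coefficient of two thresholded maps within the intersection of masks.

    All arguments are equal-length flat sequences of booleans, one per voxel:
    sig_* marks suprathreshold voxels, mask_* marks each package's analysis mask.
    """
    shared = [ma and mb for ma, mb in zip(mask_a, mask_b)]
    a = [s and m for s, m in zip(sig_a, shared)]
    b = [s and m for s, m in zip(sig_b, shared)]
    n_a, n_b = sum(a), sum(b)
    if n_a + n_b == 0:
        return float("nan")  # assumption: undefined when both maps are empty
    overlap = sum(x and y for x, y in zip(a, b))
    return 2.0 * overlap / (n_a + n_b)


def spill_over_percent(sig, own_mask, other_mask):
    """Percentage of one package's activation outside the other package's mask."""
    active = [s and m for s, m in zip(sig, own_mask)]
    total = sum(active)
    if total == 0:
        return 0.0
    outside = sum(act and not om for act, om in zip(active, other_mask))
    return 100.0 * outside / total


# Hypothetical six-voxel example.
sig_a = [True, True, True, False, False, False]
sig_b = [False, True, True, True, False, False]
mask_a = [True, True, True, True, True, False]
mask_b = [False, True, True, True, True, True]

print(dice_in_shared_mask(sig_a, sig_b, mask_a, mask_b))  # 0.8
print(spill_over_percent(sig_a, mask_a, mask_b))
print(spill_over_percent(sig_b, mask_b, mask_a))
```

With this definition, a pair in which one package has no significant voxels yields a Dice coefficient of zero, matching the behaviour noted in the caption for FSL permutation on ds000001.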
Figure 5 (a) Euler characteristic (EC) plots for ds000001 and ds000109. On top, comparisons of the Euler characteristic computed for each software's T‐statistic map from our reanalyses using a range of T‐value thresholds between −6 and 6. Below, comparisons of the ECs calculated using the same thresholds on the corresponding T‐statistic images for permutation inference within each package. For each T‐value the EC summarises the topology of the thresholded image, and the curves provide a signature of the structure of the entire image. For extreme thresholds the EC approximates the number of clusters, allowing a simple interpretation of the curves: For example, for ds000001 parametric analyses, FSL clearly has the fewest clusters for positive thresholds. (b) Cluster count plots for ds000001 and ds000109. On top, comparisons of the number of clusters found in each software's T‐statistic map from our reanalyses using a range of T‐value thresholds between −6 and 6. Below, comparisons of the cluster counts calculated using the same thresholds on the corresponding T‐statistic images for permutation inference within each package.
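To make the cluster count curves in Figure 5(b) concrete, the sketch below counts connected suprathreshold clusters over a range of thresholds. It is a simplified illustration, not the procedure used in the paper: it works on a single 2D slice stored as a list of lists, treats voxels as connected only when they share an edge, and counts only voxels above each threshold. These choices, and the function names, are assumptions of the sketch.

```python
def count_clusters(stat_map, threshold):
    """Count connected clusters of voxels whose value exceeds `threshold`.

    `stat_map` is a 2D list of floats (one slice). Voxels are connected when
    they share an edge (4-connectivity, an assumption of this sketch).
    """
    rows, cols = len(stat_map), len(stat_map[0])
    seen = [[False] * cols for _ in range(rows)]
    clusters = 0
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or stat_map[r][c] <= threshold:
                continue
            # New cluster found: flood-fill all voxels connected to it.
            clusters += 1
            seen[r][c] = True
            stack = [(r, c)]
            while stack:
                i, j = stack.pop()
                for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                    if (0 <= ni < rows and 0 <= nj < cols
                            and not seen[ni][nj]
                            and stat_map[ni][nj] > threshold):
                        seen[ni][nj] = True
                        stack.append((ni, nj))
    return clusters


def cluster_count_curve(stat_map, thresholds):
    """Pair each threshold with the number of clusters found above it."""
    return [(t, count_clusters(stat_map, t)) for t in thresholds]


# Hypothetical 4 x 4 slice of T-values.
t_slice = [
    [5.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 3.5],
    [0.0, 2.5, 0.0, 3.2],
    [0.0, 0.0, 0.0, 0.0],
]
print(cluster_count_curve(t_slice, [2.0, 3.0, 4.5, 6.0]))
```

Plotting such a curve for each package on shared axes gives the kind of comparison the caption describes, with fewer clusters at a given threshold indicating sparser suprathreshold activation.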
Figure 6 Cross‐software Bland–Altman 2D histograms for the ds000001 and ds000109 studies comparing the unthresholded group‐level T‐statistic maps computed using permutation inference methods within AFNI, FSL, and SPM. Similar to the results obtained using parametric inferences in Figure 3, all of the densities indicate large differences in the size of activations determined within each package.