John Kruper1,2, Jason D Yeatman3,4, Adam Richie-Halford2, David Bloom1,2, Mareike Grotheer5,6, Sendy Caffarra3,4,7, Gregory Kiar8, Iliana I Karipidis9, Ethan Roy3, Bramsh Q Chandio10, Eleftherios Garyfallidis10, Ariel Rokem1,2. 1. Department of Psychology, University of Washington, Seattle, WA, 98195, USA. 2. eScience Institute, University of Washington, Seattle, WA, 98195, USA. 3. Graduate School of Education, Stanford University, Stanford, CA, 94305, USA. 4. Division of Developmental-Behavioral Pediatrics, Stanford University School of Medicine, Stanford, CA, 94305, USA. 5. Center for Mind, Brain and Behavior - CMBB, Hans-Meerwein-Straße 6, Marburg 35032, Germany. 6. Department of Psychology, University of Marburg, Marburg 35039, Germany. 7. Basque Center on Cognition, Brain and Language, BCBL, 20009, Spain. 8. Department of Biomedical Engineering, McGill University, Montreal, H3A 0E9, Canada. 9. Center for Interdisciplinary Brain Sciences Research, Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine,Stanford, CA, 94305, USA. 10. Department of Intelligent Systems Engineering, Luddy School of Informatics, Computing and Engineering, Indiana University Bloomington, Bloomington, IN, 47408, USA.
Abstract
The validity of research results depends on the reliability of analysis methods. In recent years, there have been concerns about the validity of research that uses diffusion-weighted MRI (dMRI) to understand human brain white matter connections in vivo, in part based on the reliability of analysis methods used in this field. We defined and assessed three dimensions of reliability in dMRI-based tractometry, an analysis technique that assesses the physical properties of white matter pathways: (1) reproducibility, (2) test-retest reliability, and (3) robustness. To facilitate reproducibility, we provide software that automates tractometry (https://yeatmanlab.github.io/pyAFQ). In measurements from the Human Connectome Project, as well as clinical-grade measurements, we find that tractometry has high test-retest reliability that is comparable to most standardized clinical assessment tools. We find that tractometry is also robust: showing high reliability with different choices of analysis algorithms. Taken together, our results suggest that tractometry is a reliable approach to analysis of white matter connections. The overall approach taken here both demonstrates the specific trustworthiness of tractometry analysis and outlines what researchers can do to establish the reliability of computational analysis pipelines in neuroimaging.
The white matter of the brain contains the long-range connections between distant cortical regions. The integration and coordination of brain activity through the fascicles containing these connections are important for information processing and for brain health (1, 2). Using voxel-specific directional diffusion information from diffusion-weighted MRI (dMRI), computational tractography produces three-dimensional trajectories through the white matter within the MRI volume that are called streamlines (3, 4). Collections of streamlines that match the location and direction of major white matter pathways within an individual can be generated with different strategies: using probabilistic (5, 6) or streamline-based (7, 8) atlases or known anatomical landmarks (9–12). Because these are models of the anatomy, we refer to these estimates as bundles to distinguish them from the anatomical pathways themselves. The delineation of well-known anatomical pathways overcomes many of the concerns about confounds in dMRI-based tractography (13, 14), because “brain connections derived from diffusion MRI tractography can be highly anatomically accurate – if we know where white matter pathways start, where they end, and where they do not go” (15).

The physical properties of brain tissue affect the diffusion of water, and the microstructure of tissue within the white matter along the length of computationally generated bundles can be assessed using a variety of models (16, 17). Taken together, computational tractography, bundle recognition, and diffusion modeling provide so-called tract profiles: estimates of microstructural properties of tissue along the length of major pathways. This is the basis of tractometry: statistical analysis that compares different groups or assesses individual variability in brain connection structure (9, 18–21).
For the inferences made from tractometry to be valid and useful, tract profiles need to be reliable. In the present work, we provide an assessment of three different ways in which scientific results can be reliable: reproducibility, test-retest reliability (TRR), and robustness. These terms are often debated, and conflicting definitions for these terms have been proposed (22, 23). Here, we use the definitions proposed in (24). Reproducibility is defined as the case in which data and methods are fully accessible and usable: running the same code with the same data should produce an identical result. Use of different data (e.g., in a test-retest experiment) resulting in quantitatively comparable results would denote TRR. In clinical science and psychology in general, TRR (e.g., in the form of inter-rater reliability) is considered a key metric of the reliability of a measurement. Use of a different analysis approach or different analysis system (e.g., a different software implementation of the same ideas) could result in similar conclusions, denoting their robustness to implementation details. The recent findings of Botvinik-Nezer et al. (25) show that even when full computational reproducibility is achieved, the results of analyzing a single functional MRI (fMRI) dataset can vary significantly between teams and analysis pipelines, demonstrating issues of robustness.

The contribution of the present work is three-fold. First, to support reproducible research using tractometry, we developed an open-source software library called Automated Fiber Quantification in Python (pyAFQ; https://yeatmanlab.github.io/pyAFQ). Given dMRI data that has undergone standard preprocessing (e.g., using QSIprep (26)), pyAFQ automatically performs tractography, classifies streamlines into bundles representing the major tracts, and extracts tract profiles of diffusion properties along those bundles, producing “tidy” CSV output files (27) that are amenable to further statistical analysis (Fig. S1).
The library implements the major functionality provided by a previous MATLAB implementation of tractometry analysis (9) and offers a menu of configurable algorithms allowing researchers to tune the pipeline to their specific scientific questions (Fig. S2). Second, we use pyAFQ to assess TRR of tractometry results. Third, we assess robustness of tractometry results to variations across different models of the diffusion in individual voxels, across different bundle recognition approaches, and across different implementations.
MATERIALS AND METHODS
pyAFQ
We developed an open-source tractometry software library to support computational reproducibility: pyAFQ. The software relies heavily on methods implemented in Diffusion Imaging in Python (DIPY) (28). Our implementation was also guided by a previous MATLAB implementation of tractometry (mAFQ) (9). More details are available in the “Automated Fiber Quantification in Python (pyAFQ)” section of Supplementary Methods.
Tractometry
The pyAFQ software is configurable, allowing users to specify methods and parameters for different stages of the analysis (Fig. S2). Here, we will describe the default setting. In the first step, computational tractography methods, implemented in DIPY (28), are used to generate streamlines throughout the brain white matter (Fig. S1A). Next, the T1-weighted Montreal Neurological Institute (MNI) template (29, 30) is registered to the anisotropic power map (APM) (31, 32) computed from the diffusion data, which has a T1-like contrast (Fig. S1B), using the symmetric image normalization method (33) implemented in DIPY (28). The next step is to perform bundle recognition, where each tractography streamline is either classified as belonging to a particular bundle or discarded. We use the transformation found during registration to bring canonical anatomical landmarks, such as waypoint regions of interest (ROIs) and probability maps, from template space to the individual subject’s native space. Waypoint ROIs are used to delineate the trajectory of the bundles (34). See Table S1 for the bundle abbreviations we use in this paper. Streamlines that pass through the inclusion waypoint ROIs for a particular bundle, and do not pass through an exclusion ROI, are selected as candidates to include in the bundle. In addition, a probabilistic atlas (35) is used as a tiebreaker to determine whether a streamline is more likely to belong to one bundle or another (in cases where the streamline matches the criteria for inclusion in either). For example, the corticospinal tract is identified by finding streamlines that pass through an axial waypoint ROI in the brainstem and another ROI axially oriented in the white matter of the corona radiata but that do not pass through the midline (Fig. S1C).
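The inclusion/exclusion logic of waypoint-ROI selection can be sketched as follows. This is a simplified NumPy illustration, not the pyAFQ implementation: real bundle recognition also involves registration, orientation checks, and probability-map tiebreaking, all of which are omitted here.

```python
import numpy as np

def passes_through(streamline, roi_mask):
    """Check whether any point of a streamline falls inside a binary ROI mask.

    streamline : (N, 3) array of point coordinates in voxel space
    roi_mask   : 3D boolean array on the same voxel grid
    """
    idx = np.round(streamline).astype(int)
    # Discard points that fall outside the volume
    in_bounds = np.all((idx >= 0) & (idx < roi_mask.shape), axis=1)
    idx = idx[in_bounds]
    return bool(roi_mask[idx[:, 0], idx[:, 1], idx[:, 2]].any())

def select_bundle(streamlines, include_rois, exclude_rois):
    """Keep streamlines that pass through every inclusion ROI and no exclusion ROI."""
    return [
        s for s in streamlines
        if all(passes_through(s, roi) for roi in include_rois)
        and not any(passes_through(s, roi) for roi in exclude_rois)
    ]
```

For the corticospinal-tract example above, `include_rois` would hold the brainstem and corona radiata waypoints, and `exclude_rois` a midline plane.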
The final step is to extract the tract profile: each streamline is resampled to a fixed number of points, and the mean value of a diffusion-derived scalar (e.g., fractional anisotropy (FA) and mean diffusivity (MD)) is found for each one of these nodes. The values are summarized by weighting the contribution of each streamline, based on how concordant the trajectory of this streamline is with respect to the other streamlines in the bundle (Fig. S1D). To make sure that profiles represent properties of the core white matter, we remove the first and last five nodes of the profile, then further remove any nodes where either the FA is less than 0.2 or the MD is greater than 0.002. This removes nodes that contain partial volume artifacts (16).
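The profile-extraction and cleaning steps above can be sketched in NumPy. This is illustrative only: pyAFQ's actual weighting follows the mAFQ approach, based on each streamline's concordance with the bundle core; here a simple Gaussian of the distance to the mean streamline stands in for that weighting.

```python
import numpy as np

def weighted_profile(values, coords):
    """Weighted mean profile along a bundle.

    values : (n_streamlines, n_nodes) scalar samples (e.g., FA) at each node
    coords : (n_streamlines, n_nodes, 3) resampled streamline coordinates
    Streamlines far from the bundle core at a node receive lower weight
    (a Gaussian of distance-to-mean stands in for the Mahalanobis-based
    weighting used in practice).
    """
    core = coords.mean(axis=0)                      # (n_nodes, 3) mean trajectory
    dist = np.linalg.norm(coords - core, axis=2)    # (n_streamlines, n_nodes)
    w = np.exp(-0.5 * (dist / (dist.std() + 1e-12)) ** 2)
    w /= w.sum(axis=0)                              # normalize weights per node
    return (w * values).sum(axis=0)                 # (n_nodes,) tract profile

def clean_profile(fa, md, n_trim=5, fa_min=0.2, md_max=0.002):
    """Trim profile ends and drop nodes with partial-volume signatures."""
    fa, md = fa[n_trim:-n_trim], md[n_trim:-n_trim]
    keep = (fa >= fa_min) & (md <= md_max)
    return fa[keep], md[keep]
```

The FA < 0.2 and MD > 0.002 thresholds match the defaults described above for removing nodes contaminated by partial volume artifacts.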
Data
We used two datasets with test-retest measurements. The first is the Human Connectome Project test-retest (HCP-TR) dataset, with dMRI measurements from 44 neurologically healthy subjects aged 22–35 (36). The second is an experimental dataset (UW-PREK), with dMRI from 48 five-year-old children, collected at the University of Washington. More details about the measurements are available in the “Data” section of Supplementary Methods.
HCP-TR configurations
We processed HCP-TR with three different pyAFQ configurations. In the first configuration, we used the diffusional kurtosis imaging (DKI) model as the orientation distribution function (ODF) model. In the second configuration, we used constrained spherical deconvolution (CSD) as the ODF model. For the final configuration, we used RecoBundles (8) for bundle recognition instead of the default waypoint ROI approach, and DKI as the ODF model. More details are available in the “Configurations” section of Supplementary Methods.
Measures of reliability
Tract recognition of each bundle was compared across measurements and methods using the Dice coefficient, weighted by streamline count (wDSC) (37). Tract profiles were compared with three measures: (1) profile reliability: mean intraclass correlation coefficient (ICC) across points in different tract profiles for different data, which quantifies the agreement of tract profiles (38, 39); (2) subject reliability: Spearman’s rank correlation coefficient (Spearman’s ρ) between the means of the tract profiles across individuals, which quantifies the consistency of the mean of tract profiles; and (3) an adjusted contrast index profile (ACIP): to directly compare the values of individual nodes in the tract profiles in different measurements. To estimate TRR, the above measures were calculated for each individual across different measurements, and to estimate robustness, these were calculated for each individual across different analysis methods. For example, if we calculated the subject reliability across measurements, we would call that “subject TRR,” and if we calculated the subject reliability across analysis methods, we would call that “subject robustness.” We explain profile and subject reliability in more detail below; we explain wDSC and ACIP in more detail in equations 1 and 2 in the “Measures of Reliability” section of the Supplementary Methods.
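The wDSC comparison of bundle overlap can be sketched with voxel-wise streamline counts. This follows our reading of the streamline-count-weighted Dice of the cited reference (37); the exact formulation is given in equation 1 of the Supplementary Methods and may differ in detail.

```python
import numpy as np

def weighted_dice(count_a, count_b):
    """Weighted Dice coefficient between two bundles.

    count_a, count_b : 3D arrays of streamline visitation counts per voxel
    for the same bundle delineated by two measurements or methods. Voxels
    visited by many streamlines contribute more to the overlap score.
    """
    both = (count_a > 0) & (count_b > 0)            # voxels in the intersection
    num = count_a[both].sum() + count_b[both].sum()
    den = count_a.sum() + count_b.sum()
    return num / den if den > 0 else 0.0
```

Identical bundles yield wDSC = 1, fully disjoint bundles yield wDSC = 0, and disagreement in sparsely visited fringe voxels is penalized less than disagreement in the densely visited core.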
Profile reliability
We use profile reliability to compare the shapes of profiles per bundle and per scalar. Given two sets of data (either from test-retest analysis or from different analyses), we first calculate the ICC between tract profiles for each subject in a given bundle and scalar. Then, we take the mean of those correlations. We do this for every bundle and for every scalar. We call this profile reliability because larger differences in the overall values along the profiles will result in a smaller mean of the ICC. Consistent profile shapes are important for distinguishing bundles. Profile reliability provides an assessment of the overall reliability of the tract profiles, summarizing over the full length of the bundle, for a particular scalar. We calculate the 95% confidence interval on profile reliabilities using the standard error of the measurement.

In some cases, there is low between-subject variance in tract profile shape (e.g., this is often the case in the corticospinal tract (CST)). We use ICC to account for this, as ICC will penalize low between-subject variance in addition to rewarding high within-subject variance. Profile reliability is a way of quantifying the agreement between profiles. Qualitatively, we use four descriptions for profile reliability: excellent (ICC > 0.75), good (ICC = 0.60 to 0.74), fair (ICC = 0.40 to 0.59), and poor (ICC < 0.40) (40).
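The profile reliability computation can be sketched as follows. This minimal NumPy illustration assumes a two-way random-effects, absolute-agreement, single-measurement ICC (ICC(2,1)), with profile nodes as targets and sessions as raters; the variant actually used follows the cited references (38, 39) and may differ in detail.

```python
import numpy as np

def icc_2_1(profiles):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    profiles : (n_nodes, k_sessions) array holding the same tract profile
    measured in k sessions; nodes play the role of 'targets' and sessions
    the role of 'raters'.
    """
    n, k = profiles.shape
    grand = profiles.mean()
    row_means = profiles.mean(axis=1)
    col_means = profiles.mean(axis=0)
    ss_total = ((profiles - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()   # between-node variation
    ss_cols = n * ((col_means - grand) ** 2).sum()   # between-session variation
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

def profile_reliability(profiles_a, profiles_b):
    """Mean ICC across subjects; each input is (n_subjects, n_nodes)."""
    return np.mean([
        icc_2_1(np.column_stack([a, b]))
        for a, b in zip(profiles_a, profiles_b)
    ])
```

Because ICC(2,1) requires absolute agreement, a uniform offset between the two sessions' profiles lowers the score, which is the behavior described above.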
Subject reliability
We calculate subject reliability to compare individual differences in profiles, per bundle and per scalar, following (41). Given two measurements for each subject, we first take the mean of each profile within each individual, measurement, and scalar. Then, we calculate Spearman’s ρ between the means from different subjects for a given bundle and scalar across the measurements. High subject reliability means that the ordering of an individual’s tract profile mean among other individuals is consistent across measurements or methods. This is akin to the test reliability that is computed for any clinical measure.

One downside of subject reliability is that the shape of the extracted profile is not considered. Additionally, if one measurement or method produces uniformly higher values for all subjects, subject reliability would not be affected. Instead, the intent of subject reliability is to summarize how well relative differences between individuals in mean tract profiles are preserved. In other words, subject reliability quantifies the consistency of mean profiles. The 95% confidence interval on subject reliabilities is parametric.
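Subject reliability can be sketched in a few lines of NumPy. This minimal version assumes no ties among subject means (tie handling via average ranks is omitted for brevity).

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rank correlation (assumes no ties; no tie correction)."""
    rx = np.argsort(np.argsort(x)).astype(float)  # rank of each value in x
    ry = np.argsort(np.argsort(y)).astype(float)  # rank of each value in y
    rx -= rx.mean()
    ry -= ry.mean()
    # Pearson correlation of the ranks
    return (rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry))

def subject_reliability(profiles_a, profiles_b):
    """Spearman's rho between per-subject mean profiles of two measurements.

    profiles_a, profiles_b : (n_subjects, n_nodes) tract profiles for one
    bundle and one scalar, from two sessions (TRR) or two analysis
    methods (robustness).
    """
    return spearman_rho(profiles_a.mean(axis=1), profiles_b.mean(axis=1))
```

Because only ranks enter the computation, a uniform shift or monotone rescaling of one measurement leaves subject reliability unchanged, matching the description above.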
RESULTS
Tractometry using pyAFQ classifies streamlines into bundles that represent major anatomical pathways. The streamlines are used to sample dMRI-derived scalars into bundle profiles that are calculated for every individual and can be summarized for a group of subjects. An example of the tract profile extraction process and its results is shown in Fig. S3, together with the results of this process across the 18 major white matter pathways for all subjects in the HCP-TR dataset.
Assessing TRR of tractometry
In datasets with scan-rescan data, we can assess TRR at several different levels of tractometry. For example, the correlation between two profiles provides a measure of the reliability of the overall tract profile in that subject. Analyzing the HCP-TR dataset, we find that for FA calculated using DKI, the values of profile reliability vary across subjects (Fig. 1A), but they overall tend to be rather high, with the average value within each bundle in the range of 0.77 ± 0.05 to 0.92 ± 0.02 and a median across bundles of 0.86 (Fig. 1B). We find similar results for MD (Fig. S4) and replicate similar results in a second dataset (Fig. 3B).
Fig. 1.
(A) Histograms of individual subject intraclass correlation coefficient (ICC) between the FA tract profiles across sessions for a given bundle. Colors encode the bundles, matching the diagram showing the rough anatomical positions of the bundles for the left side of the brain (center). (B) Mean (± 95% confidence interval) TRR for each bundle, color-coded to match the histograms and the bundles diagram, with median across bundles in red.
Fig. 3.
Weighted Dice similarity coefficient (wDSC), profile, and subject test-retest reliability (TRR) of Python Automated Fiber Quantification (pyAFQ) and MATLAB Automated Fiber Quantification (mAFQ) on University of Washington (UW-PREK); pyAFQ on Human Connectome Project test-retest (HCP-TR) using different orientation distribution function (ODF) models; and Reproducible Tract Profile (RTP) on HCP-TR.
Colors indicate bundle. (A) Texture indicates the dataset and methods being compared. Error bars show the 95% confidence interval. (B, D, and F) Profile TRR and (C, E, and G) subject TRR. Profile and subject TRR calculations are demonstrated with HCP-TR using diffusional kurtosis imaging (DKI) in Figs. 1 and 2, respectively. (B, C) Comparison of the TRR of mAFQ and pyAFQ on UW-PREK. (D, E) Comparison of pyAFQ and RTP on HCP-TR using only single-shell data. (F, G) Comparison of DKI and CSD TRR on HCP-TR. Point shapes indicate the extracted scalar. The red dotted line is equal TRR between methods.
Subject reliability assesses the reliability of mean tract profiles across individuals. Subject FA TRR in the HCP-TR also tends to be high, but the values vary more across bundles with a range of 0.57 ± 0.24 to 0.85 ± 0.12 and a median across bundles of 0.73. We can see that subject TRR is lower than profile TRR (Fig. 2). This trend is consistent for MD (Fig. S5) as well as for another dataset (Fig. 3C).
Fig. 2.
Subject test-retest reliability.
(A) Mean tract profiles for a given bundle and the fractional anisotropy (FA) scalar for each subject using the first and second session of Human Connectome Project test-retest (HCP-TR). Colors encode bundle information, matching the core of the bundles (center). (B) Subject reliability is calculated from the Spearman’s ρ of these distributions, with median across bundles in red (± 95% confidence interval).
TRR of tractometry in different implementations, datasets, and tractography methods
We compared TRR across datasets and implementations. In both datasets, we found high TRR in the results of tractography and bundle recognition: wDSC was larger than 0.7 for all but one bundle (Fig. 3A); the delineation of the anterior forceps (FA bundle) seems relatively unreliable using pyAFQ in the UW-PREK dataset (using the FA scalar, pyAFQ subject TRR is only 0.37 ± 0.28, compared to mAFQ’s 0.84 ± 0.10). We found overall high profile TRR that did not always translate to high subject TRR (Fig. 3B–G). For example, for FA in UW-PREK, median profile TRRs are 0.75 for pyAFQ and 0.77 for mAFQ, while median subject TRRs are 0.70 for pyAFQ and 0.75 for mAFQ. Note that profile and subject TRRs have different denominators (e.g., subjects that have similar mean profiles to each other would have low subject TRR, even if the profiles are reliable, because it is harder to distinguish between subjects in this case). mAFQ is one of the most popular software pipelines currently available for tractometry analysis, so it provides an important point for comparison. In comparing different software implementations, we found that mAFQ has higher subject TRR relative to pyAFQ in the UW-PREK dataset, when TRR is relatively low for pyAFQ (see the FA bundle, CST L, and ATR L in Fig. 3C). On the other hand, on the HCP-TR dataset, we compared pyAFQ to the Reproducible Tract Profile (RTP) pipeline (42, 43), an extension of mAFQ, and found that pyAFQ tends to have slightly higher profile TRR than RTP for MD but slightly lower profile TRR for FA (Fig. 3D). The pyAFQ and RTP subject TRRs are highly comparable (Fig. 3E): for FA, the median pyAFQ subject TRR is 0.76, while the median RTP subject TRR is 0.74. Comparing different ODF models in pyAFQ, we found that the DKI and CSD ODF models have highly similar TRR, both at the level of wDSC (Fig. 3A) and at the level of profile and subject TRRs (Fig. 3F, G).
Robustness: comparison between distinct tractography models and bundles recognition algorithms
To assess the robustness of tractometry results to different models and algorithms, we used the same measures that were used to calculate TRR.
Tractometry results can be robust to differences in ODF models used in tractography
We compared tractography using DKI- and CSD-derived ODFs. The weighted Dice similarity coefficient (wDSC) for this comparison is rather high in some cases (e.g., the uncinate and corticospinal tracts; Fig. 4A), but the two models produce results that appear very different for some bundles, such as the arcuate and superior longitudinal fasciculi (ARC and SLF; see also Fig. 4D). Despite these discrepancies, profile and subject robustness are high for most bundles (median FA of 0.77 and 0.75, respectively) (Fig. 4B, C). In contrast to the results found in TRR, MD subject robustness is consistently higher than FA subject robustness. The two bundles with the most marked differences between the two ODF models are the SLF and ARC (Fig. 4D). These bundles have low wDSC and profile robustness, yet their subject robustness remains remarkably high (in FA, 0.75 ± 0.17 for ARC R and 0.88 ± 0.09 for SLF R) (Fig. 4C). These differences are partially explained by systematic biases in the sampling of white matter by bundles generated with the two ODF models, as demonstrated by the non-zero ACIP between the two models (Fig. 4E).
Fig. 4.
Orientation distribution function (ODF) model robustness.
We compared diffusional kurtosis imaging (DKI)- and constrained spherical deconvolution (CSD)-derived tractography. Colors encode bundle information as in Figs. 1 and 2. Textured hatching encodes fractional anisotropy/mean diffusivity (FA/MD) information. (A) Weighted Dice similarity coefficient (wDSC) robustness. (B) Profile robustness. (C) Subject robustness. Error bars represent the 95% confidence interval. (D, E) Adjusted contrast index profile (ACIP) between left arcuate and left superior longitudinal fasciculi (ARC L and SLF L) tract profiles of each algorithm. Positive adjusted contrast index (ACI) indicates DKI found a higher value of FA than CSD at that node. The 95% confidence interval on the mean is shaded. (F) Tractography and bundle recognition results for ARC L and SLF L, respectively, for one example subject.
Most white matter bundles are highly robust across bundle recognition methods
We compared bundle recognition with the same tractography results using two different approaches: the default waypoint ROI approach (9) and an alternative approach (RecoBundles) that uses atlas templates in the space of the streamlines (44). Between these algorithms, wDSC is around or above 0.6 for all but one bundle, the right inferior longitudinal fasciculus (ILF R) (Fig. 5A). There is an asymmetry in the ILF atlas bundle (7), which results in discrepancies between the ILF R recognized with waypoint ROIs and with RecoBundles. Aside from this bundle, we find high robustness overall. For MD, the first-quartile subject robustness is 0.82 (Fig. 5C).
Fig. 5.
Recognition algorithm robustness.
(A) Weighted Dice similarity coefficient (wDSC). (B) Profile robustness. (C) Subject robustness. Error bars show the 95% confidence interval. (D) The right inferior longitudinal fasciculus (ILF R) fractional anisotropy (FA) adjusted contrast index profile (ACIP), where positive ACI indicating RecoBundles found a higher value of FA than the waypoint regions of interest (ROIs) approach at that node. (E) The ILF R found by each algorithm for an example subject.
Tractometry results are robust to differences in software implementation
Overall, we found that robustness of tractometry across these different software implementations is high in most white matter bundles. In the mAFQ/pyAFQ comparison, most bundles have a wDSC around or above 0.8, except the two callosal bundles (FA bundle and forceps posterior (FP)), which have a much lower overlap (Fig. 6A). Consistent with this pattern, profile and subject robustness are also overall rather high (Fig. 6B, C). The median values across bundles are 0.71 and 0.77 for FA profile and subject robustness, respectively.
Fig. 6.
Robustness between Python Automated Fiber Quantification (pyAFQ) and MATLAB Automated Fiber Quantification (mAFQ) on University of Washington (UW-PREK) session #1 data.
(A) Weighted Dice similarity coefficient (wDSC) robustness. (B) Profile robustness. (C) Subject robustness. Error bars show the 95% confidence interval. (D) Adjusted contrast index profile (ACIP) between the fractional anisotropy (FA) tract profiles from UW-PREK using pyAFQ and mAFQ. Positive ACI indicates pyAFQ found a higher value than mAFQ at that node. The 95% confidence interval on the mean is shaded.
For some bundles, like the right and left uncinate (UNC R and UNC L), there is strong agreement between pyAFQ and mAFQ (for subject FA: UNC L ρ = 0.90 ± 0.07, UNC R ρ = 0.89 ± 0.08). However, the callosal bundles have particularly low MD profile robustness (0.07 ± 0.09 for FP, 0.18 ± 0.09 for FA) (Fig. 6B). The robustness of tractometry to the differences between the pyAFQ and mAFQ implementations thus depends on the bundle, scalar, and reliability metric. In addition, for many bundles, the ACIP between mAFQ and pyAFQ results is very close to 0, indicating no systematic differences (Fig. 6D). In some bundles – the CST and the anterior thalamic radiations (ATR) – there are small systematic differences between mAFQ and pyAFQ. In the forceps posterior (FP), pyAFQ consistently finds smaller FA values than mAFQ in a section on the left side. Notice that the forceps anterior has an ACIP that deviates only slightly from 0, even though the forceps recognitions did not have as much overlap as other bundle recognitions (see Fig. 6A).
DISCUSSION
Previous work has called into question the reliability of neuroimaging analysis (e.g., (25, 45, 46)). We assessed the reliability of a specific approach, tractometry, which is grounded in decades of anatomical knowledge, and we demonstrated that this approach is reproducible, reliable, and robust. A tractometry analysis typically combines the outputs of tractography with diffusion reconstruction at the level of the individual voxels within each bundle. One of the major challenges facing researchers who use tractometry is that there are many ways to analyze diffusion data, including different models of diffusion at the level of individual voxels; techniques to connect voxels through tractography; and approaches to classify tractography results into major white matter bundles. Here, we analyzed the reliability of tractometry analysis at several different levels. We analyzed both TRR of tractometry results and their robustness to changes in analytic details, such as choice of tractography method, bundle recognition algorithm, and software implementation (Fig. 6).
Test-retest reliability of tractometry
TRR of tractometry is usually rather high, comparable in some tracts and measurements to the TRR of the underlying measurement. In comparing the HCP-TR and UW-PREK analyses, we note that higher measurement reliability goes hand in hand with tractometry reliability.

In terms of the anatomical definitions of the bundles, quantified as the TRR wDSC, we find reliable results in both datasets, with both software implementations, and with both tractography methods that we tested. With pyAFQ, we found a relatively low TRR in the frontal callosal bundle (FA bundle) in the UW-PREK dataset. This could be due to the sensitivity of the definition of this bundle to susceptibility distortion artifacts in the frontal poles of the two hemispheres. This low TRR was not found with mAFQ, suggesting that it is not a necessary feature of the analysis and is a potential avenue for improvement to pyAFQ. Although the two implementations were created by partially overlapping teams, and although pyAFQ drew both inspiration and specific implementation details from mAFQ, many implementation details still differ substantially. For example, the implementations of tractography algorithms are quite different: pyAFQ relies on DIPY (28) for its tractography, while mAFQ uses implementations provided in Vistasoft (47). The two pipelines also use different registration algorithms, with pyAFQ relying on the symmetric diffeomorphic registration (SyN) algorithm (33), while mAFQ relies on registration methods implemented as part of the Statistical Parametric Mapping (SPM) software (48). These differences may explain the discrepancies observed.

We also find that TRR is high at the level of profiles within subjects and at the level of mean tract profiles across subjects. This is generally observed in both datasets that we examined and using different analysis methods and software implementations. For the UW-PREK dataset, subject TRR tends to be higher in mAFQ than in pyAFQ.
On the other hand, for the HCP-TR dataset, pyAFQ subject TRR tends to be higher than that obtained with RTP, which is a fork and extension of mAFQ (42, 43). Generally, TRR of FA profiles and TRR of mean FA across subjects tend to be higher than those of MD. This could be because the assessment of MD is more sensitive to partial volume effects. In contrast to FA, MD is also not bounded, which means that extreme values at the boundaries of tissue types can have a substantial effect on TRR.
Robustness of tractometry
As highlighted in the recent work by Botvinik-Nezer et al. (25) and in parallel by Schilling et al. (45), inferences from even a single dataset can vary significantly, depending on the decisions and analysis pipelines that are used. The analysis approaches used in tractometry embody many assumptions made at the different stages of analysis: the model of the signal in each individual voxel, the manner in which streamlines are generated in tractography, the definition of bundles, and the extraction of tract profiles. While TRR is important, it does not guard against systematic errors in the analysis approach. One way to test model assumptions and software failures is to create ground truth data against which different methods and implementations can be tested (13, 49, 50). However, this approach also relies on certain assumptions about the mechanisms that generate the data that is considered ground truth, making this approach more straightforward for some methods than others. Here, we instead assessed the robustness of tractometry results to perturbations of analytic components, focusing on the modeling of ODFs in individual voxels and the approach taken to bundle recognition.
Subject robustness remains high despite differences in the spatial extent of bundles
We replicated previous findings that the definition of major bundles can vary in terms of their spatial extent (quantified via wDSC) (13, 37, 40, 45), depending on the software implementation or the ODF model used. As we showed, low wDSC robustness often corresponds to low profile robustness and vice versa (Figs. 4A and B, 5A and B, 6B and C). That is, when two algorithms detect bundles with small spatial overlap, the shapes of the resulting tract profiles also differ from each other. However, low wDSC and profile robustness do not always translate to low subject robustness. Algorithms can detect bundles with low spatial overlap and of different shapes yet still agree on the ordering of the means of the profiles, that is, on which subjects have high or low FA in a given bundle. A clear example of this is the SLF and ARC in Fig. 4 (wDSC and profile robustness are low, yet subject robustness is very high). This suggests that tractometry can overcome failures in precise delineation of the major bundles by averaging tissue properties within the core of the white matter. Conversely, important details that are sensitive to these choices may be missed when averaging along the length of the tracts. Moreover, this may also reflect biases in the measurement that cannot be overcome at either stage of the analysis: tractography or bundle recognition.

Our high subject-level robustness results (Figs. 4C, 5C, 6C) dovetail with the results of a recently published study that used tractometry in a sample of 45 participants (51) and found high subject-level correlations between the mean tract values of FA and MD for two different pipelines: deterministic tractography with the diffusion tensor model (DTI) as the ODF model (essentially identical to a pipeline used in our supplementary analysis, described in "DTI Configuration"), and probabilistic tractography with CSD as the ODF model.
Consistent with our results on the HCP-TR dataset, slightly higher subject robustness was found for MD than for FA.
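The two robustness levels contrasted above can be expressed compactly: profile robustness correlates the two pipelines' profiles within each subject, while subject robustness correlates the per-subject profile means across subjects. The sketch below uses Pearson r for both; the exact statistic used in the analyses may differ.

```python
import numpy as np

def profile_and_subject_robustness(profiles_a, profiles_b):
    """profiles_a, profiles_b : (n_subjects, n_nodes) arrays of a tissue
    property (e.g., FA) along one bundle, from two different pipelines."""
    # Profile robustness: per-subject correlation between the two
    # pipelines' tract profiles, averaged across subjects.
    profile_r = np.mean(
        [np.corrcoef(pa, pb)[0, 1] for pa, pb in zip(profiles_a, profiles_b)]
    )
    # Subject robustness: correlation across subjects of the mean tract
    # values -- do the pipelines agree on which subjects are high or low?
    subject_r = np.corrcoef(
        profiles_a.mean(axis=1), profiles_b.mean(axis=1)
    )[0, 1]
    return profile_r, subject_r
```

This decomposition makes the SLF/ARC pattern in Fig. 4 easy to state: the per-subject profile correlations can be low while the correlation of the means across subjects remains very high.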
Exceptions and limitations
High profile robustness did not always imply high subject robustness (e.g., the FP in Fig. 4 has high profile robustness but low subject robustness). This suggests that there are other sources of between-subject variance that do not correspond directly to profile robustness within an individual.

There are still significant challenges to robustness that arise from the way in which the major bundles are defined. This problem was highlighted in recent work that demonstrated that different researchers use different criteria to define bundles of streamlines that represent the same tract (45). In our case, this challenge is represented by the relatively low robustness between the waypoint ROI algorithm for bundle definition and the RecoBundles algorithm. In this comparison, the wDSC exceeds 0.8 in only one bundle and is below 0.4 in two cases. While both algorithms identify a bundle of streamlines that represents the right ILF, this bundle differs substantially between the two algorithms. Even so, profile and subject robustness can still be rather high, even in cases in which a rather middling overlap is found between the anatomical extents of the bundles. This challenge not only highlights the need for more precise definitions of the models of brain tracts that are derived from dMRI but also highlights the need for clear, automated, and reproducible software to perform bundle recognition.

In addition to decisions about analysis approach, which may be theoretically motivated, software implementations may contain systematic errors in executing the different steps, and different software may be prone to different kinds of failure modes. Since other software implementations (9, 42) of the AFQ approach have been in widespread use in multiple different datasets and research settings, we also compared the results across different software implementations (Fig. 6).
While there are some systematic differences between implementations, tractometry is overall quite robust to differences between software implementations.

Another important limitation of this work is that we have only analyzed samples of healthy individuals. Where brains are severely deformed (e.g., in TBI, brain tumors, and so forth), particular care would be needed to check the results of bundle recognition, and separate considerations would be needed in order to reach conclusions about the reliability of the inferences made.
Computational reproducibility via open-source software
Reproducibility is a bedrock of science, but achieving full computational reproducibility is a high bar that requires access to the software, data, and computational environment that a researcher uses (22). One of the goals of pyAFQ is to provide a platform for reproducible tractometry. It is embedded in an ecosystem of tools for reproducible neuroimaging and is extensible. This is shown in Fig. S6 and Fig. S2 and is further discussed in "Supplementary Discussion of pyAFQ." Results from the present article and supplements can be reproduced using a set of Jupyter notebooks provided here: https://github.com/36000/Tractometry_TRR_and_robustness. After installing the version of pyAFQ that we used (0.6), reproduction should be straightforward on standard operating systems and architectures or in cloud computing systems (see the set of Jupyter notebooks linked above, and Supplementary Methods). For the UW-PREK dataset, we share the tract profiles and provide web-based visualizations using a tool that was previously developed for transparent sharing of tractometry data (52): https://yeatmanlab.github.io/UW_PREK_pyAFQ_pre_browser and https://yeatmanlab.github.io/UW_PREK_pyAFQ_post_browser.

The HCP-TR dataset is relatively straightforward for others to access in its preprocessed form through the HCP, and because the study IDs can be openly shared in our code, anyone with such access should be able to reproduce the figures in full. Using these resources, it should be possible to re-execute our workflows and replicate most of our results (53).
For example, researchers interested in comparing our TRR results to another popular tractometry pipeline (e.g., TRACULA (11)) or to another bundle recognition algorithm (e.g., TractSeg (54), which uses a neural network to recognize bundles, or Classifyber (55), which uses a linear classifier) could do so with the HCP-TR dataset, guided by our scripts and the visualization tools in the pyAFQ software.
Future work
There are many aspects of reliability that could be further explored. We explored robustness with respect to ODF models and bundle recognition algorithms; robustness could also be explored with respect to data acquisition parameters within the same subject; preprocessing methods; profile extraction method (e.g., comparing our current approach with the BUndle ANalytics (BUAN) approach (56)); and the effects of profile realignment on tract profile reliability (57). Another possibility for teasing apart measurement and tractography effects would be to test profile TRR using the streamlines of one scan on the results of the second scan (by registering the streamlines themselves, to avoid data interpolation in volume registration). This could tease apart the effects of tractography from the voxel-level models of tissue properties, because these are not necessarily sensitive to the same constraints (e.g., different sensitivity to noise). The methods we demonstrate and resources we provide in this paper should be useful for anyone wishing to further explore reliability in tractometry.