Literature DB >> 34724000

Numerical uncertainty in analytical pipelines lead to impactful variability in brain networks.

Gregory Kiar¹, Yohan Chatelain², Pablo de Oliveira Castro³, Eric Petit⁴, Ariel Rokem⁵, Gaël Varoquaux⁶, Bratislav Misic¹, Alan C Evans¹, Tristan Glatard².

Abstract

The analysis of brain-imaging data requires complex processing pipelines to support findings on brain function or pathologies. Recent work has shown that variability in analytical decisions, small amounts of noise, or computational environments can lead to substantial differences in the results, endangering the trust in conclusions. We explored the instability of results by instrumenting a structural connectome estimation pipeline with Monte Carlo Arithmetic to introduce random noise throughout. We evaluated the reliability of the connectomes, the robustness of their features, and the eventual impact on analysis. The stability of results was found to range from perfectly stable (i.e. all digits of data significant) to highly unstable (i.e. 0 - 1 significant digits). This paper highlights the potential of leveraging induced variance in estimates of brain connectivity to reduce the bias in networks without compromising reliability, alongside increasing the robustness and potential upper-bound of their applications in the classification of individual differences. We demonstrate that stability evaluations are necessary for understanding error inherent to brain imaging experiments, and how numerical analysis can be applied to typical analytical workflows both in brain imaging and other domains of computational sciences, as the techniques used were data and context agnostic and globally relevant. Overall, while the extreme variability in results due to analytical instabilities could severely hamper our understanding of brain organization, it also affords us the opportunity to increase the robustness of findings.

Entities: Chemical

Mesh：

Year: 2021 PMID： 34724000 PMCID： PMC8559953 DOI： 10.1371/journal.pone.0250755

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

Introduction

The modelling of brain networks, called connectomics, has shaped our understanding of the structure and function of the brain across a variety of organisms and scales over the last decade [1-6]. In humans, these wiring diagrams are obtained in vivo through Magnetic Resonance Imaging (MRI), and show promise towards identifying biomarkers of disease. This can not only improve understanding of so-called “connectopathies”, such as Alzheimer’s Disease and Schizophrenia, but potentially pave the way for therapeutics [7-11]. However, the analysis of brain imaging data relies on complex computational methods and software. Tools are trusted to perform everything from pre-processing tasks to downstream statistical evaluation. While these tools undoubtedly undergo rigorous evaluation on bespoke datasets, in the absence of ground-truth this is often evaluated through measures of reliability [12-16], proxy outcome statistics, or agreement with existing theory. Importantly, this means that tools and datasets are not necessarily of known or consistent quality, and it is not uncommon that equivalent experiments may lead to diverging conclusions [17-23]. While many scientific disciplines suffer from a lack of reproducibility [24], this was recently explored in brain imaging by a 70 team consortium which performed equivalent analyses and found widely inconsistent results [17], and it is likely that software instabilities played a role. This study does not broach dataset differences, but there are considerable works which demonstrate that data selection may compound these effects(e.g. [14, 16]). The present study approached evaluating reproducibility from a computational perspective in which a series of brain imaging studies were numerically perturbed in such a way that the plausibility of results was not affected, and the implications of the observed instabilities on downstream analyses were quantified. We accomplished this through the use of Monte Carlo Arithmetic (MCA) [25, 26], a technique which enables characterization of the sensitivity of a system to small numerical perturbations. This is importantly distinct from data perturbation experiments where the underlying datasets are manipulated or pathologies may be simulated, and allows for the evaluation of experimental uncertainty in real-world settings. We explored the impact of numerical perturbations through the direct comparision of structural connectomes, the consistency of their features, and their eventual application in a neuroscience study. We also characterized the consequences of instability in these pipelines on the reliability of derived datasets, and discuss how the induced variability may be harnessed to increase the discriminability of datasets, in an approach akin to ensemble learning. Finally, we make recommendations for the roles perturbation analyses may play in brain imaging research and beyond.

Results

Graphs vary widely with perturbations

Prior to exploring the analytic impact of instabilities, a direct understanding of the induced variability was required. A subset of the Nathan Kline Institute Rockland Sample (NKIRS) dataset [27] was randomly selected to contain 25 individuals with two sessions of imaging data, each of which was subsampled into two components, resulting in four samples per individual and 100 samples total (25 × 2 × 2 samples). Structural connectomes were generated with canonical deterministic and probabilistic pipelines [28, 29] which were instrumented with MCA, replicating computational noise either sparsely or densely throughout the pipelines [19, 26]. In the sparse case, a small subset of the libraries were instrumented with MCA, allowing for the evaluation of the cascading effects of numerical instabilities that may arise. In the dense case, operations are more uniformly perturbed and thus the law of large numbers suggests that perturbations will quickly offset one-another and only dramatic local instabilities will have propagating effects. Importantly, the perturbations resulting from the sparse setting represent a strict subset of the possible outcomes of the dense implementation. The random perturbations are statistically independent from one another across both settings and simulations. Instrumenting pipelines with MCA increases their computation time, in this case by multiplication factors of 1.2 × and 7 × for the sparse and dense settings, respectively [19]. The results obtained were compared to unperturbed (e.g. reference) connectomes in both cases. The connectomes were sampled 20 times per sample and once without perturbations, resulting in a total of 8,400 connectomes. Two versions of the unperturbed connectomes were generated and compared such that the absence of variability aside from that induced via MCA could be confirmed. The stability of structural connectomes was evaluated through the normalized percent deviation from reference [19] and the number of significant digits (Fig 1). The comparisons were grouped according to differences across simulations, subsampling of data, sessions of acquisition, or subjects, and accordingly sorted from most to least similar. While the similarity of connectomes decreases as the collections become more distinct, connectomes generated with sparse perturbations show considerable variability, often reaching deviations equal to or greater than those observed across individuals or sessions (Fig 1A; right). Interpretting these results with respect to the distinct MCA environments used suggests that the tested pipelines may not suffer from single dominant sources of instability, but that nevertheless there exist minor local instabilities which may the propagate throughout the pipeline. Furthermore, this finding suggests that instabilities inherent to these pipelines may mask session or individual differences, limiting the trustworthiness of derived connectomes. While both pipelines show similar performance, the probabilistic pipeline was more stable in the face of dense perturbations whereas the deterministic was more stable to sparse perturbations (p < 0.0001 for all; exploratory). As an alternative to the normalized percent deviation, the stability of correlations between networks can be found in S1 Section in S1 File.

Fig 1

Exploration of perturbation-induced deviations from reference structural connectomes.

Exploration of perturbation-induced deviations from reference structural connectomes.

(A) The absolute deviations between connectomes, in the form of normalized percent deviation from reference. The difference in MCA-perturbed connectomes is shown as the across MCA series, and is presented relative to the variability observed across subsamples, sessions, and subjects. (B) The number of significant decimal digits in each set of connectomes as obtained by evaluating the complete distribution of networks. In the case of 16, values can be fully relied upon, whereas in the case of 1 only the first digit of a value can be trusted. Dense and sparse perturbations are shown on the left and right, respectively. The number of significant digits per edge across connectomes (Fig 1B) similarly decreases alongside the decreasing similarity between comparison groups. While the cross-MCA comparison of connectomes generated with dense perturbations show nearly perfect precision for many edges (approaching the maximum of 15.7 digits for 64-bit data), this evaluation uniquely shows considerable drop off in performance when comparing networks across subsamplings (average of <4 digits). In addition, sparsely perturbed connectomes show no more than an average of 3 significant digits across all comparison groups, demonstrating a significant limitation in the reliability of independent edge weights. The number of significant digits across individuals did not exceed a single digit per edge in any case, indicating that only the order of magnitude of edges in naively computed groupwise average connectomes can be trusted. The combination of these results with those presented in Fig 1A suggests that while specific edge weights are largely affected by instabilities, macro-scale network structure is stable.

Sparse perturbations reduce off-target signal

We assessed the reproducibility of the dataset through mimicking and extending a typical test-retest experiment [14] in which the similarity of samples across sessions were compared to distinct samples in the dataset (Table 1, with additional experiments and explanation of the measure and its scaling in S2 Section in S1 File). The ability to discriminate connectomes across subjects (Hypothesis 1) is an essential prerequisite for the application of brain imaging towards identifying individual differences [6]. In testing hypothesis 1, we observe that the dataset is discriminable with a scaled score of 0.82 (p < 0.001; optimal score: 1.0; chance: 0.04) for both pipelines in the absence of MCA. We can see that inducing instabilities through MCA preserves the discriminability in the dense perturbtion setting, and and discriminability decreased slightly but remained above the unscaled reference value of 0.65 in the sparse case. This lack of significant decrease in discriminability across MCA perturbations suggests its utility for capturing variance within datasets without compromising the robustness and reliability of the dataset as a whole, and possibly suggests this technique as a cost effective and context-agnostic method for dataset augmentation.

Table 1

The impact of instabilities as evaluated through the discriminability of the dataset based on individual (or subject) differences, session, and subsample.

Comparison	Chance	Target	Unscaled Ref.		Scaled Ref.		Dense MCA		Sparse MCA
Comparison	Chance	Target	Det.	Prob.	Det.	Prob.	Det.	Prob.	Det.	Prob.
H₁: Across Subjects	0.04	1.0	0.64	0.65	0.82	0.82	0.82	0.82	0.77	0.75
H₂: Across Sessions	0.5	0.5	1.00	1.00	1.00	1.00	1.00	1.00	0.88	0.85
H₃: Across Subsamples	0.5	0.5					0.99	1.00	0.71	0.61

The performance is reported as mean discriminability. While a perfectly discriminable dataset would be represented by a score of 1.0, the chance performance, indicating minimal discriminability, is 1/the number of classes. H3 could not be tested using the reference executions due to too few possible comparisons. The alternative hypothesis, indicating significant discrimination, was accepted for all experiments, with p < 0.005 after correcting for multiple comparisons. While the discriminability of individuals is essential for the identification of individual brain networks, it is similarly reliant on network similarity—or lack of discriminability—across equivalent acquisitions (Hypothesis 2). In this case, connectomes were grouped based upon session, rather than subject, and the ability to distinguish one session from another based on subsamples was computed within-individual and aggregated. Both the unperturbed and dense perturbation settings perfectly preserved differences between sessions with a score of 1.0 (p < 0.005; optimal score: 0.5; chance: 0.5), indicating a dominant session-dependent signal for all individuals despite no intended biological differences. However, while still significant relative to chance (score: 0.85 and 0.88; p < 0.005 for both), sparse perturbations lead to significantly lower discriminability of the dataset (p < 0.005 for all). This reduction of the difference between sessions suggests that the added variance due to perturbations reduces the relative impact of non-biological acquisition-dependent bias inherent in the networks. Though the previous sets of experiments inextricably evaluate the interaction between data acquisition and tool, the use of subsampling allowed for characterizing the discriminability of networks sampled from within a single acquisition (Hypothesis 3). While this experiment could not be evaluated using reference executions, the networks generated with dense perturbations showed near perfect discrimination between subsamples, with scores of 0.99 and 1.0 (p < 0.005; optimal: 0.5; chance: 0.5). Given that there was no variability in data acquisition, due to undesired effects such as participant motion, or preprocessing, the ability to discriminate between equivalent subsamples in this experiment may only be due to instability or bias inherent to the pipelines. The high variability introduced through sparse perturbations considerably lowered the discriminability towards chance (score: 0.71 and 0.61; p < 0.005 for all), further supporting this as an effective method for obtaining lower-bias estimates of individual connectivity. Across all cases, the induced perturbations maintained the ability to discriminate networks on the basis of meaningful biological signal alongside a reduction in discriminability due to of off-target signal in the sparse perturbation setting. This result appears strikingly like a manifestation of the well-known bias-variance tradeoff [30] in machine learning, a concept which observes a decrease in bias as variance is favoured by a model. In particular, this highlights that numerical perturbations can be used to not only evaluate the stability of pipelines, but that the induced variance may be leveraged for the interpretation as a robust distribution of possible results.

Distributions of graph statistics are reliable, but individual statistics are not

Exploring the stability of topological features of structural connectomes is relevant for typical analyses, as low dimensional features are often more suitable than full connectomes for many analytical methods in practice [5]. A separate subset of the NKIRS dataset was randomly selected to contain a single non-subsampled session for 100 individuals (100 × 1 × 1) using the pipelines and instrumentation methods to generate connectomes as above. Connectomes were generated 20 times each, resulting in a dataset which also contained 8,400 connectomes with the MCA simulations serving as the only source of repeated measurements. The stability of several commonly-used multivariate graph features [31] were explored and are presented in Fig 2. The cumulative density of the features was computed within individuals and the mean cumulative density and associated standard error were computed for across individuals (Fig 2A and 2B). There was no significant difference between the distributions for each feature across the two perturbation settings, suggesting that the topological features summarized by these multivariate features are robust across both perturbation modes.

Fig 2

Distribution and stability assessment of multivariate graph statistics.

Distribution and stability assessment of multivariate graph statistics.

(A, B) The cumulative distribution functions of multivariate statistics across all subjects and perturbation settings. There was no significant difference between the distributions in A and B. (C, D) The number of significant digits in the first 5 five moments of each statistic across perturbations. The dashed red line refers to the maximum possible number of significant digits. In addition to the comparison of distributions, the stability of the first 5 moments of these features was evaluated (Fig 2C and 2D). In the face of dense perturbations, the feature-moments were stable with more than 10 significant digits with the exception of edge weight when using the deterministic pipeline, though the probabilistic pipeline was more stable for all comparisons (p < 0.0001; exploratory). In stark contrast, sparse perturbations led to highly unstable feature-moments (Fig 2D), such that none contained more than 5 significant digits of information and several contained less than a single significant digit, indicating a complete lack of reliability. This dramatic degradation in stability for individual measures strongly suggests that these features may be unreliable as individual biomarkers when derived from a single pipeline evaluation, though their reliability may be increased when studying their distributions across perturbations. A similar analysis was performed for univariate statistics which obtained similar findings and can be found in S3 Section in S1 File.

Uncertainty in brain-phenotype relationships

While the variability of connectomes and their features was summarized above, networks are commonly used as inputs to machine learning models tasked with learning brain-phenotype relationships [6]. To explore the stability of these analyses, we modelled the relationship between high- or low- Body Mass Index (BMI) groups and brain connectivity using standard dimensionality reduction and classification tools. In particular, we used Principal Component Analysis followed by a Logistic Regression classifier to predict BMI label, and demonstrated similar performance to previous work which adopted similar techniques for this task [32, 33]. We compared the performance achieved across numerically perturbed samples to both the reference and random performance (Fig 3).

Fig 3

Variability in BMI classification across the sampling of an MCA-perturbed dataset.

The dashed red lines indicate random-chance performance, and the orange dots show the performance using the reference executions.

Variability in BMI classification across the sampling of an MCA-perturbed dataset.

The dashed red lines indicate random-chance performance, and the orange dots show the performance using the reference executions. The analysis was perturbed through distinct samplings of the dataset across both pipelines and perturbation methods. The accuracy and F1 score for the perturbed models varied from 0.520–0.716 and 0.510–0.725, respectively, ranging from at or below random performance to outperforming performance on the reference dataset. This large variability illustrates a previously uncharacterized margin of uncertainty in the modelling of this relationship, and limits confidence in reported accuracy scores on singly processed datasets. The portion of explained variance in these samples ranged from 88.6% -– 97.8%, similar to the reference of 90.3%, suggesting that the range in performance was not due to a gain or loss of meaningful signal, but rather the reduction of bias towards specific outcome. Importantly, this finding does not suggest that modelling brain-phenotype relationships is not possible, but rather it sheds light on impactful uncertainty that must be accounted for in this process, and supports the use of ensemble modeling techniques. One distinction between the results presented here and the previous is that while networks derived from dense perturbations had been shown to exhibit less dramatic instabilities in general, the results here show similar variability in classification performance across the two methods. This consistency suggests that the desired method of instrumentation may vary across experiments. While sparse perturbations result in considerably more variability in networks directly, the two techniques capture similar variability when relating networks to this phenotypic variable. Given the dramatic reduction in computational overhead, a sparse instrumentation may be preferred when processing datasets for eventual application in modelling brain-phenotype relationships.

Discussion

The perturbation of structural connectome estimation pipelines with small amounts of noise, on the order of machine error, led to considerable variability in derived brain graphs. Across all analyses the stability of results ranged from nearly perfectly trustworthy (i.e. no variation) to completely unreliable (i.e. containing no trustworthy information). Given that the magnitude of introduced numerical noise is to be expected in computational workflows, this finding has potentially significant implications for inferences in brain imaging as it is currently performed. In particular, this bounds the success of studying individual differences, a central objective in brain imaging [6], given that the quality of relationships between phenotypic data and brain networks will be limited by the stability of the connectomes themselves. This issue is accentuated through the crucial finding that individually derived network features were unreliable despite there being no significant difference in their aggregated distributions. This finding is not damning for the study of brain networks as a whole, but rather is strong support for the aggregation of networks, either across perturbations for an individual or across groups, over the use of individual estimates.

Underestimated false positive rates

While the instability of brain networks was used here to demonstrate the limitations of modelling brain-phenotype relationships in the context of machine learning, this limitation extends to classical hypothesis testing, as well. Though performing individual comparisons in a hypothesis testing framework will be accompanied by reported false positive rates, the accuracy of these rates is critically dependent upon the reliability of the samples used. In reality, the true false positive rate for a test would be a combination of the reported confidence and the underlying variability in the results, a typically unknown quantity. When performing these experiments outside of a repeated-measure context, such as that afforded here through MCA, it is impossible to empirically estimate the reliability of samples. This means that the reliability of accepted hypotheses is also unknown, regardless of the reported false positive rate. In fact, it is a virtual certainty that the true false positive rate for a given hypothesis exceeds the reported value simply as a result of numerical instabilities. This uncertainty inherent to derived data is compounded with traditional arguments limiting the trustworthiness of claims [34], and hampers the ability of researchers to evaluate the quality of results. The accompaniment of brain imaging experiments with direct evaluations of their stability, as was done here, would allow researchers to simultaneously improve the numerical stability of their analyses and accurately gauge confidence in them. The induced variability in derived brain networks may be leveraged to estimate aggregate connectomes with lower bias than any single independent observation, leading to learned relationships that are more generalizable and ultimately more useful.

Cost-effective data augmentation

The evaluation of reliability in brain imaging has historically relied upon the expensive collection of repeated measurements choreographed by massive cross-institutional consortia [35, 36]. The finding that perturbing experiments using MCA both preserved the discriminability of the dataset due to biological signal and decreased the discriminability due to off-target differences across acquisitions and subsamples opens the door for a promising paradigm shift. Given that MCA is data-agnostic, this technique could be used effectively in conjunction with, or in lieu of, realistic noise models to augment existing datasets. While this of course would not replace the need for repeated measurements when exploring the effect of data collection paradigm or study longitudinal progressions of development or disease, it could be used in conjunction with these efforts to decrease the bias of each distinct sample within a dataset. In contexts where repeated measurements are typically collected to increase the fidelity of the dataset, MCA could potentially serve as an alternative solution to capture more biological variability, with the added benefit being the savings of millions of dollars on data collection.

Shortcomings and future questions

Given the complexity of recompiling complex software libraries, pre-processing was not perturbed in these experiments as the instrumentation of the canonical workflow used in diffusion image processing would have added considerable technical complexity and computational overhead to the large set of experiments performed here; similarly, this complexity along with the added layer of difficulty in comparing instrumentations meant that only algorithms within a single library were tested. Other work has shown that linear registration, a core piece of many elements of pre-processing such as motion correction and alignment, is sensitive to minor perturbations [21]. It is likely that the instabilities across the entire processing workflow would be compounded with one another, resulting in even greater variability. While the analyses performed in this paper evaluated a single dataset and set of pipelines, extending this work to other modalities and analyses, alongside the detection of local sources of instability within pipelines, is of interest for future projects. This paper does not explore methodological flexibility or compare this to numerical instability. Recently, the nearly boundless space of analysis pipelines and their impact on outcomes in brain imaging has been clearly demonstrated [17]. The approach taken in these studies complement one another and explore instability at the opposite ends of the spectrum, with human variability in the construction of an analysis workflow on one end and the unavoidable error implicit in the digital representation of data on the other. It is of extreme interest to combine these approaches and explore the interaction of these scientific degrees of freedom with effects from software implementations, libraries, and parametric choices. Finally, it is important to state explicitly that the work presented here does not invalidate analytical pipelines used in brain imaging, but merely sheds light on the fact that many studies are accompanied by an unknown degree of uncertainty due to machine-introduced errors. The presence of unknown error-bars associated with experimental findings limits the impact of results due to increased uncertainty. The desired outcome of this paper is to motivate a shift in scientific computing—both in neuroimaging and more broadly—towards a paradigm that favours the explicit evaluation of the trustworthiness of claims alongside the claims themselves.

Materials & methods

Dataset

The Nathan Kline Institute Rockland Sample (NKI-RS) [27] dataset contains high-fidelity imaging and phenotypic data from over 1,000 individuals spread across the lifespan. A subset of this dataset was chosen for each experiment to both match sample sizes presented in the original analyses and to minimize the computational burden of performing MCA. The selected subset comprises 100 individuals ranging in age from 6–79 with a mean of 36.8 (original: 6–81, mean 37.8), 60% female (original: 60%), with 52% having a BMI over 25 (original: 54%). Each selected individual had at least a single session of both structural T1-weighted (MPRAGE) and diffusion-weighted (DWI) MR imaging data. DWI data was acquired with 137 diffusion directions in a single shell; more information regarding the acquisition of this dataset can be found in the NKI-RS data release [27]. In addition to the 100 sessions mentioned above, 25 individuals had a second session to be used in a test-retest analysis. Two additional copies of the data for these individuals were generated, including only the odd or even diffusion directions (64 + 9 B0 volumes = 73 in either case) such that the acquired data was evenly represented across both portions, and each subsample represented a realistic complete acquisition. This allowed for an extra level of stability evaluation to be performed between the levels of MCA and session-level variation. In total, the dataset is composed of 100 subsampled sessions of data originating from 50 acquisitions and 25 individuals for in depth stability analysis, and an additional 100 sessions of full-resolution data from 100 individuals for subsequent analyses.

Processing

The dataset was preprocessed using a standard FSL [37] workflow consisting of eddy-current correction and alignment. The MNI152 atlas [38] was aligned to each session of data via the structural images, and the resulting transformation was applied to the DKT parcellation [39]. Subsampling the diffusion data took place after preprocessing was performed on full-resolution sessions, ensuring that an additional confound was not introduced in this process when comparing between downsampled sessions. The preprocessing described here was performed once without MCA, and thus is not being evaluated. Structural connectomes were generated from preprocessed data using two canonical pipelines from Dipy [28]: deterministic and probabilistic. In the deterministic pipeline, a constant solid angle model was used to estimate tensors at each voxel and streamlines were then generated using the EuDX algorithm [29]. In the probabilistic pipeline, a constrained spherical deconvolution model was fit at each voxel and streamlines were generated by iteratively sampling the resulting fiber orientation distributions. In both cases tracking occurred with 8 seeds per 3D voxel and edges were added to the graph based on the location of terminal nodes with weight determined by fiber count. The random state of both pipelines was fixed for all analyses. Fixing this random state led to entirely deterministic repeated-evaluations of the tools, and allowed for explicit attribution of observed variability to limitations in tool precision as provoked by Monte Carlo simulations, rather than the internal state of the algorithm.

Perturbations

All connectomes were generated with one reference execution where no perturbation was introduced in the processing. For all other executions, all floating point operations were instrumented with Monte Carlo Arithmetic (MCA) [25] through Verificarlo [26]. MCA simulates the distribution of errors implicit to all instrumented floating point operations (flop). This rounding is performed on a value x at precision t by: where e is the exponent value of x and ξ is a uniform random variable in the range (, ). MCA can be introduced in two places for each flop: before or after evaluation. Performing MCA on the inputs of an operation limits its precision, while performing MCA on the output of an operation highlights round-off errors that may be introduced. The former is referred to as Precision Bounding (PB) and the latter is called Random Rounding (RR). Using MCA, the execution of a pipeline may be performed many times to produce a distribution of results. Studying the distribution of these results can then lead to insights on the stability of the instrumented tools or functions. To this end, a complete software stack was instrumented with MCA and is made available on GitHub at https://github.com/verificarlo/fuzzy. The RR variant of MCA was used for all experiments. As was presented in [19], both the degree of instrumentation (i.e. number of affected libraries) and the perturbation mode have an effect on the distribution of observed results. For this work, the RR-MCA was applied across the bulk of the relevant operations (those occurring in BLAS, LAPACK, Python, Cython, and Numpy) and is referred to as dense perturbation. In this case the bulk of numerical operations were affected by MCA. Conversely, the case in which RR-MCA was applied across the operations in a small subset of operations (those ocurring in Python and Cython) is here referred to as sparse perturbation. In this case, the inputs to operations within the instrumented libraries were perturbed, resulting in less frequent, data-centric perturbations. Alongside the stated theoretical differences, sparse perturbation is considerably less computationally expensive than dense perturbation. All perturbations targeted the least-significant-bit for all data (t = 24 and t = 53 in float32 and float64, respectively [26]). Perturbing the least significant bit importantly serves as a perturbation of machine error, and thus is the appropriate precision to be applied globally in complex pipelines. Simulations were performed 20 times for each pipeline execution for the 100 sample dataset and 10 times for the repeated measures dataset. A detailed motivation for the number of simulations can be found in [40].

Evaluation

The magnitude and importance of instabilities in pipelines can be considered at a number of analytical levels, namely: the induced variability of derivatives directly, the resulting downstream impact on summary statistics or features, or the ultimate change in analyses or findings. We explore the nature and severity of instabilities through each of these lenses. Unless otherwise stated, all p-values were computed using Wilcoxon signed-rank tests and corrected for multiple comparisons. To avoid biasing these statistics in this unique repeated-measures context, tests were performed across sets of independent observations and then the results were aggregated in all cases.

Direct evaluation of the graphs

The differences between perturbation-generated graphs was measured directly through both a direct variance quantification and a comparison to other sources of variance such as individual- and session-level differences. Quantification of variability. Graphs, in the form of adjacency matrices, were compared to one another using three metrics: normalized percent deviation, Pearson correlation, and edgewise significant digits. The normalized percent deviation measure, defined in [19], scales the norm of the difference between a simulated graph and the reference execution (that without intentional perturbation) with respect to the norm of the reference graph, and is defined as [19]: where A and B each represent a graph, and ▫ are elements therein corresponding to row and column i and j, respectively. For these experiments, the A graph always refers to the reference, where B represents a perturbed value. The purpose of this comparison is to provide insight on the scale of differences in observed graphs relative to the original signal intensity. A Pearson correlation coefficient [41] was computed in complement to normalized percent deviation to identify the consistency of structure and not just intensity between observed graphs, though the result of this experiment is shown only in S1 Section in S1 File. Finally, the estimated number of significant digits, s′, for each edge in the graph is calculated as: where μ and σ are the mean and unbiased estimator of standard deviation across graphs, respectively. The upper bound on significant digits is 15.7 for 64-bit floating point data. The percent deviation, correlation, and number of significant digits were each calculated within a single session of data, thereby removing any subject- and session-effects and providing a direct measure of the tool-introduced variability across perturbations. A distribution was formed by aggregating these individual results. Class-based variability evaluation To gain a concrete understanding of the significance of observed variations we explore the separability of our results with respect to understood sources of variability, such as subject-, session-, and pipeline-level effects. This can be probed through Discriminability [14], a technique similar to ICC [12] which relies on the mean of a ranked distribution of distances between observations belonging to a defined set of classes. The discriminability statistic is formalized as follows: where g is a graph belonging to class i that was measured at observation j, where i ≠ i′ and j ≠ j′. Discriminability can then be read as the probability that an observation belonging to a given class will be more similar to other observations within that class than observations of a different class. It is a measure of reproducibility, and is discussed in detail in [14]. This definition allows for the exploration of deviations across arbitrarily defined classes that in practice can be any of those listed above. We combine this statistic with permutation testing to test hypotheses on whether differences between classes are statistically significant in each of these settings. This statistic is similar to ICC [12] in a two-measurement setting, however, given the dependence on a rank distribution from all measurements, discriminability scores do not become meaningless by the addition of more samples which are highly similar to the originals, whereas ICC scores would much more rapidly trend towards 1, making discriminability appropriate in this context. The scaling properties of discriminability are described more fully in S2 Section in S1 File. With this in mind, three hypotheses were defined. For each setting, we state the alternate hypotheses, the variable(s) which were used to determine class membership, and the remaining variables which may be sampled when obtaining multiple observations. Each hypothesis was tested independently for each pipeline and perturbation mode. H: Individuals are distinct from one another Class definition: Subject ID Comparisons: , Subsample (1 session), MCA (1 subsample, 1 session) H: Sessions within an individual are distinct Class definition: Session ID | Subject ID Comparisons: , MCA (1 subsample) H: Subsamples are distinct Class definition: Subsample | Subject ID, Session ID Comparisons: As a result, we tested 3 hypotheses across 6 MCA experiments and 3 reference experiments on 2 pipelines and 2 perturbation modes, resulting in a total of 30 distinct tests. While results from all tests can be found within S2 Section in S1 File, only the bolded comparisons in the list above have been presented in the main body of this article. Correction for repeated testing was performed.

Evaluating graph-theoretical metrics

While connectomes may be used directly for some analyses, it is common practice to summarize them with structural measures, that can then be used as lower-dimensional proxies of connectivity in so-called graph-theoretical studies [5]. We explored the stability of several commonly-used univariate (graphwise) and multivariate (nodewise or edgewise) features. The features computed and subsequent methods for comparison in this section were selected to closely match those computed in [31]. Univariate differences. For each univariate statistic (edge count, mean clustering coefficient, global efficiency, modularity of the largest connected component, assortativity, and mean path length) a distribution of values across all perturbations within subjects was observed. A Z-score was computed for each sample with respect to the distribution of feature values within an individual, and the proportion of “classically significant” Z-scores, i.e. corresponding to p < 0.05, was reported and aggregated across all subjects. There was no correction for multiple comparisons in these statistics, as they were not used to interpret a hypothesis but demonstrate the false-positive rate due to perturbations. The number of significant digits contained within an estimate derived from a single subject were calculated and aggregated. The results of this analysis can be found in S3 Section in S1 File. Multivariate differences. In the case of both nodewise (degree distribution, clustering coefficient, betweenness centrality) and edgewise (weight distribution, connection length) features, the cumulative density functions of their distributions were evaluated over a fixed range and subsequently aggregated across individuals. The number of significant digits for each moment of these distributions (sum, mean, variance, skew, and kurtosis) were calculated across observations within a sample and aggregated.

Evaluating a brain-phenotype analysis

Though each of the above approaches explores the instability of derived connectomes and their features, many modern studies employ modeling or machine-learning approaches, for instance to learn brain-phenotype relationships or identify differences across groups. We carried out one such study and explored the instability of its results with respect to the upstream variability of connectomes characterized in the previous sections. We performed the modeling task with a single sampled connectome per individual and repeated this sampling and modelling 20 times. We report the model performance for each sampling of the dataset and summarize its variance. BMI classification. Structural changes have been linked to obesity in adolescents and adults [42]. We classified normal-weight and overweight individuals from their structural networks (using for overweight a cutoff of BMI >25 [33]). We reduced the dimensionality of the connectomes through principal component analysis (PCA), and provided the first N-components to a logistic regression classifier for predicting BMI class membership, similar to methods shown in [32, 33]. The number of components was selected as the minimum set which explained >90% of the variance when averaged across the training set for each fold within the cross validation of the original graphs; this resulted in a feature of 20 components. We trained the model using k-fold cross validation, with k = 2,5,10, and N (equivalent to leave-one-out; LOO).

Data & code provenance

The unprocessed dataset is available through The Consortium of Reliability and Reproducibility (http://fcon_1000.projects.nitrc.org/indi/enhanced/), including both the imaging data as well as phenotypic data which may be obtained upon submission and compliance with a Data Usage Agreement. The connectomes generated through simulations have been bundled and stored permanently (https://doi.org/10.5281/zenodo.4041549), and are made available through The Canadian Open Neuroscience Platform (https://portal.conp.ca/search, search term “Kiar”). All software developed for processing or evaluation is publicly available on GitHub at https://github.com/gkpapers/2021ImpactOfInstability. Experiments were launched using Boutiques [43] and Clowdr [44] in Compute Canada’s HPC cluster environment. MCA instrumentation was achieved through Verificarlo [26] available on Github at https://github.com/verificarlo/verificarlo. A set of MCA instrumented software containers is available on Github at https://github.com/gkiar/fuzzy. (PDF) Click here for additional data file. 30 Jul 2021 PONE-D-21-12648 Numerical Uncertainty in Analytical Pipelines Lead to Impactful Variability in Brain Networks PLOS ONE Dear Dr. Kiar, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. ============================== The authors explored how Numerical Uncertainty in Analytical Pipelines can alter brain network topology. The study is interesting but further clarifications and modifications are needed. Below, I summarize my comments: 1) Your analysis cannot be generalized to all the different types of brain connectomes, unless you extended further the description of your analysis. There is no title in the introduction, no material & methods, and no results section. 2) You analyzed dMRI and explore how numerical errors can violate brain network topologies in terms of brain fingerprinting, repeatability across sessions, and brain-phenotype relationships. There is no description of the dataset. You can add it in supp. material 3) It's difficult to understand the whole idea behind your study and types of perturbations (sparse and dense) etc. You didn't compare for example two independent software that implement the same tractographic algorithm in order to explore their outcome. The whole study demands substantial revision and rewriting explaining further the rationale and the source of numerical uncertainties. Also, why did you generalize the outcome of your experiments across every type of modality-based brain networks? ============================== Please submit your revised manuscript by Sep 13 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript: A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Stavros I. Dimitriadis Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. Thank you for stating the following in the Acknowledgments Section of your manuscript: ”This research was financially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) (award 360 no. CGSD3-519497-2018). This work was also supported in part by funding provided by Brain Canada, in partnership with 361 Health Canada, for the Canadian Open Neuroscience Platform initiative. “ We note that you have provided additional information within the Acknowledgements Section that is not currently declared in your Funding Statement. Please note that funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: ”GK was fully supported for this work by the Natural Sciences and Engineering Research Council of Canada Award number: CGSD3 - 519497 - 2018. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.” Please include your amended statements within your cover letter; we will change the online submission form on your behalf. Additional Editor Comments (if provided): The authors explored how Numerical Uncertainty in Analytical Pipelines can alter brain network topology. The study is interesting but further clarifications and modifications are needed. Below, I summarize my comments: 1) Your analysis cannot be generalized to all the different types of brain connectomes, unless you extended further the description of your analysis. There is no title in the introduction, no material & methods, and no results section. 2) You analyzed dMRI and explore how numerical errors can violate brain network topologies in terms of brain fingerprinting, repeatability across sessions, and brain-phenotype relationships. There is no description of the dataset. You can add it in supp. material 3) It's difficult to understand the whole idea behind your study and types of perturbations (sparse and dense) etc. You didn't compare for example two independent software that implement the same tractographic algorithm in order to explore their outcome. The whole study demands substantial revision and rewriting explaining further the rationale and the source of numerical uncertainties. Also, why did you generalize the outcome of your experiments across every type of modality-based brain networks? [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: No ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: The present study deals with the notion of reproducibility which makes research difficult to make progress nowadays. The manuscript is well-written but titles like introduction and so on are missing making it difficult to follow. A few comments before the acceptance of this work 1. Abstract References should be avoided from the abstract and the abstract should be reword to reveal better the novelty/findings of the present work 2. Introduction (even though there is not title for this) 2nd paragraph: The problem here is not only the variability among toolboxes but also among subjects with regard for example their dielectric properties. See for example Antonakakis et al., 2020, NeuroImage (https://pubmed.ncbi.nlm.nih.gov/32919058/). Inter-subject variability of skull conductivity is a special parameter that can really affect the reproduction of a neuroimaging result. 3. A strong concern of mine is if there is any kind of False detection rate applied on the statistical comparison throughout the entire study? Page 6, last paragraph the classification model is not described well, the author need to name the classifier they used. 'logistic regression classifier is one case and it might be weak for the purpose of this study. ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. 4 Aug 2021 The response to reviewers was provided in a supplied attachment. Submitted filename: response_to_reviewers_impact_of_instability.pdf Click here for additional data file. 26 Aug 2021 Numerical Uncertainty in Analytical Pipelines Lead to Impactful Variability in Brain Networks PONE-D-21-12648R1 Dear Dr. Kiar, We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org. If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org. Kind regards, Stavros I. Dimitriadis Academic Editor PLOS ONE Additional Editor Comments (optional): I have no further comments. Reviewers' comments: 12 Oct 2021 PONE-D-21-12648R1 Numerical Uncertainty in Analytical Pipelines Lead to Impactful Variability in Brain Networks Dear Dr. Kiar: I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department. If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org. If we can help with anything else, please email us at plosone@plos.org. Thank you for submitting your work to PLOS ONE and supporting open access. Kind regards, PLOS ONE Editorial Office Staff on behalf of Dr. Stavros I. Dimitriadis Academic Editor PLOS ONE

35 in total

1. Complex network measures of brain connectivity: uses and interpretations.

Authors: Mikail Rubinov; Olaf Sporns
Journal: Neuroimage Date: 2009-10-09 Impact factor: 6.556

Review 2. Brain networks in schizophrenia.

Authors: Martijn P van den Heuvel; Alex Fornito
Journal: Neuropsychol Rev Date: 2014-02-06 Impact factor: 7.444

3. 1,500 scientists lift the lid on reproducibility.

Authors: Monya Baker
Journal: Nature Date: 2016-05-26 Impact factor: 49.962

Review 4. Building a Science of Individual Differences from fMRI.

Authors: Julien Dubois; Ralph Adolphs
Journal: Trends Cogn Sci Date: 2016-04-30 Impact factor: 20.229

Review 5. Human connectomics.

Authors: Timothy E J Behrens; Olaf Sporns
Journal: Curr Opin Neurobiol Date: 2011-09-09 Impact factor: 6.627