Stephanie Noble1, Marisa N Spann2, Fuyuze Tokoglu3, Xilin Shen3, R Todd Constable1,3,4, Dustin Scheinost3. 1. Interdepartmental Neuroscience Program, Yale University, New Haven, CT 06520, USA. 2. Department of Psychiatry, College of Physicians and Surgeons, Columbia University, New York, NY 10032, USA. 3. Department of Radiology and Biomedical Imaging, Yale School of Medicine, New Haven, CT 06520, USA. 4. Department of Neurosurgery, Yale School of Medicine, New Haven, CT 06520, USA.
Abstract
Best practices are currently being developed for the acquisition and processing of resting-state magnetic resonance imaging data used to estimate brain functional organization, or "functional connectivity." Standards have been proposed based on test–retest reliability, but open questions remain. These include how the amount of data per subject influences whole-brain reliability, the influence of increasing runs versus sessions, the spatial distribution of reliability, the reliability of multivariate methods, and, crucially, how reliability maps onto prediction of behavior. We collected a dataset of 12 extensively sampled individuals (144 min of data each across 2 identically configured scanners) to assess test–retest reliability of whole-brain connectivity within the generalizability theory framework. We used Human Connectome Project data to replicate these analyses and relate reliability to behavioral prediction. Overall, the historical 5-min scan produced poor reliability averaged across connections. Increasing the number of sessions was more beneficial than increasing runs. Reliability was lowest for subcortical connections and highest for within-network cortical connections. Multivariate reliability was greater than univariate. Finally, reliability could not be used to improve prediction; these findings are among the first to underscore this distinction for functional connectivity. A comprehensive understanding of test–retest reliability, including its limitations, supports the development of best practices in the field.
The cornerstones of scientific progress are reliability, reproducibility, and validity.
These concepts provide complementary facets for understanding the accuracy of a measure:
reliability and reproducibility refer to the consistency of a measure across repeated tests
and/or varying conditions, and validity refers to the association of a measure with a
predefined measure of “ground truth” (Cronbach
1988; Open Science Collaboration
2015). Unfortunately, many scientists argue that biomedical research is in the midst
of a sort of “crisis of reproducibility” (Baker
2016; cf. Begley and Ellis 2012; Pashler and Wagenmakers 2012; Open Science Collaboration 2015). Accordingly, the field of
neuroimaging has begun to compile best practices to increase reliability, reproducibility,
and validity (Nichols et al. 2017; Poldrack et al. 2017). A fundamental measure
increasingly used to inform these best practices is “test–retest reliability,” intended to
reflect intrinsic stability over repeated tests.

Given the promise of resting-state functional connectivity as a research and clinical tool
(Smith 2012), characterizing its test–retest
reliability is an active line of research and has resulted in estimates of test–retest
reliability ranging from poor to good (Shehzad et al.
2009; Van Dijk et al. 2010; Anderson et al. 2011; Birn et al. 2013; Shou et al.
2013; Laumann et al. 2015; Noble et al. 2016; Shah et al. 2016; Tomasi et
al. 2016; O'Connor et al. 2017; Pannunzi et al. 2017). Some of these conflicting
results are due to differences in measures of test–retest reliability (Birn et al. 2013; Laumann et
al. 2015) and the selection of regions used for connectivity analyses (Anderson et al. 2011; Shah et al. 2016). This variability across studies has given rise
to widely varying recommendations regarding the amount of data needed for the reliable
measurement of functional connectivity. As such, understanding how much data is needed to
achieve different levels of test–retest reliability for different connections across the
whole brain remains an open question. A whole-brain scope is relevant to exploratory or
data-driven research where anatomical constraints are not established a priori. Furthermore,
mounting evidence suggests that there is rich anatomical variability in test–retest
reliability across the brain (Shehzad et al.
2009; Birn et al. 2013; Laumann et al. 2015; Noble et al. 2016; Shah et
al. 2016; Tomasi et al. 2016; O’Connor et al. 2017). Additionally, novel methods
are being developed that treat connectivity as a multivariate object (as opposed to
univariate analyses operating at the level of single connections, or “edges”; Shirer et al. 2012; Smith et al. 2015; La Rosa et
al. 2016; Shen et al. 2017) or
incorporate test–retest reliability into various analyses (Strother et al. 2004; Shou et
al. 2014; Mueller et al. 2015; Shirer et al. 2015). However, there is a lack of
work investigating test–retest reliability using multivariate frameworks or associating
test–retest reliability of any single connection with measures of its behavioral utility (i.e., its association with or ability to predict behavior).

Here, we used the generalizability theory framework (Webb and Shavelson 2005) to investigate the test–retest reliability of whole-brain
resting-state functional connectivity. The matrix connectivity approach employed here
represents a generalization of classical seed connectivity (Finn et al. 2014), allowing us to examine the connectivity amongst
268 seed regions spanning the whole brain. To assess test–retest reliability, we collected a
state-of-the-art high spatial and temporal resolution dataset from 12 subjects scanned over
four 36-min sessions, each session about a week apart, on 2 separate but harmonized
scanners. Finally, we replicated a portion of the test–retest reliability results and
related each connection's test–retest reliability to its utility in predicting behavior
using the Human Connectome Project (HCP) 900 Subjects Data Release (Van Essen et al. 2013). In the following, we investigate the
influence on test–retest reliability of increasing runs and sessions, the spatial
distribution of reliable edges, multivariate test–retest reliability and discriminability,
and, crucially, how test–retest reliability relates to the prediction of behavior.
Methods
Test–Retest Cohort
In total, 12 healthy subjects between the ages of 27 and 56 (mean = 40, standard deviation [SD] = 11; 6 males and 6 females) were recruited. Exclusion criteria included
history of psychiatric illness or any magnetic resonance imaging (MRI) contraindications.
All subjects gave informed consent and were compensated for their participation.

Data were acquired on 2 identically configured Siemens 3 T Tim Trio scanners at Yale
University using a 32-channel head coil. After the localizer, high-resolution T1-weighted
3D anatomical scans were acquired using a magnetization prepared rapid gradient echo
(MPRAGE) sequence (208 contiguous slices acquired in the sagittal plane, TR = 2400 ms, TE = 1.18 ms, flip angle = 8°, thickness = 1 mm, in-plane resolution = 1 mm × 1 mm, matrix size =
256 × 256). Next, T1-weighted 3D anatomical scans were acquired using a fast low angle
shot (FLASH) sequence (75 contiguous slices acquired in the axial plane, TR = 440 ms, TE =
2.61 ms, flip angle = 70°, thickness = 2 mm, in-plane resolution = 2 mm × 2 mm, matrix size =
256 × 256). Next, functional images were acquired in the same slice locations as the axial
T1-weighted data using a multiband echo-planar imaging (EPI) pulse sequence (75 contiguous
slices acquired in the axial plane, TR = 1000 ms, TE = 30 ms, flip angle = 55°, thickness = 2 mm, in-plane resolution = 2 mm × 2 mm, matrix size = 110 × 110).

Each subject underwent scans over 4 sessions approximately 1 week apart (mean inter-scan
interval=9.4 days, SD = 5.3 days), with all 4 scans completed within a maximum period of
1.5 months. Each subject was scanned using the 2 scanners; 2 sessions used “Scanner A”,
and the other 2 sessions used “Scanner B.” The order of visits to scanners was
counterbalanced across subjects, and visits to a single scanner were not necessarily
consecutive (e.g., all visit orders were allowed). Six functional runs were collected at
each session; a single 6-min run of resting-state functional data consisted of 360
continuous EPI functional volumes. Altogether, 144 min of data was collected for each
subject (4 sessions/subject × 6 runs/session × 6 min/run), as shown in Figure 1. Subjects were instructed to remain still in the
scanner with their eyes open, keeping their gaze on a fixation cross.
Figure 1.
Study design. A total of 12 subjects were each scanned at 4 sessions (2 scanners × 2
days), with each session comprising six 6-min runs for a total of 36 min of data per
session. All scans were acquired with subjects at rest with eyes open. Connectivity
matrices were obtained independently for each run.
Human Connectome Project Cohort
The Human Connectome Project (HCP) 900 Subjects Data Release (Van Essen et al. 2013) was used to assess behavioral utility in a
much larger set of subjects. The full dataset contains 877 individuals; this dataset was
narrowed by restricting analysis only to individuals who had resting-state data from all 4
scans (n = 823/877), low motion (mean frame-to-frame displacement [mFFD]
<0.1 mm; n = 610/823), and behavioral records of fluid intelligence
(gF; n = 606/610). Thus, behavioral predictions were made from a final
cohort of 606 subjects.
Image Analysis
Test–Retest Preprocessing
All analyses were performed using BioImage Suite (Joshi et al. 2011) unless otherwise noted. Functional images were
motion-corrected using SPM5 (http://www.fil.ion.ucl.ac.uk/spm/software/spm5/). The data were then
iteratively smoothed to an equivalent smoothness of a 2.5 mm Gaussian kernel in order to
ensure uniform smoothness across the dataset (Friedman et al. 2006; Scheinost et al.
2014). White matter and CSF were defined on a MNI-space template brain and
eroded in order to minimize inclusion of grey matter in the mask. The template was then
warped to subject space using a series of transformations described in the next section.
This ensured that mainly grey matter voxels were used in subsequent analyses. The
following noise covariates were regressed from the data: linear, quadratic, and cubic
drift, a 24-parameter model of motion (Satterthwaite et al. 2013), mean cerebrospinal fluid signal, mean white matter
signal, and mean global signal. Finally, data were temporally smoothed with a zero mean
unit variance Gaussian filter (cutoff frequency = 0.19 Hz).
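As a concrete illustration of this denoising step, the following is a minimal Python sketch (not the BioImage Suite/SPM pipeline actually used) of confound regression on a voxels × time array followed by temporal Gaussian smoothing; the mapping from the 0.19 Hz cutoff to a Gaussian width is an assumption and depends on the TR and filter definition.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def denoise(data, confounds, tr=1.0, cutoff_hz=0.19):
    """Regress confounds from BOLD data, then smooth in time.

    data:      (n_voxels, n_timepoints) array
    confounds: (n_timepoints, n_regressors) array; per the text, drift
               terms, 24 motion parameters, and mean CSF/WM/global signals
    """
    t = data.shape[1]
    X = np.column_stack([np.ones(t), confounds])   # design with intercept
    beta, *_ = np.linalg.lstsq(X, data.T, rcond=None)
    resid = data - (X @ beta).T                    # confound-cleaned data
    # Temporal Gaussian smoothing; converting the 0.19 Hz cutoff to a
    # width in samples this way is an assumption, not the paper's filter.
    sigma = 1.0 / (2 * np.pi * cutoff_hz * tr)
    return gaussian_filter1d(resid, sigma=sigma, axis=1)
```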
Test–Retest Common Space Registration
In order to warp single subject results into MNI space, a series of linear and
nonlinear transformations were estimated using BioImage Suite. Anatomical data were
first skull-stripped using FSL (Smith
2002). Functional data for each subject, scanner, and session were linearly
registered to the corresponding FLASH images. FLASH images were then linearly registered
to MPRAGE images. Next, an average MPRAGE image for each subject was created by linearly
registering and averaging all 4 anatomical images (from 2 scanner × 2 sessions) for each
subject. These average MPRAGE images were used for an iterative nonlinear registration
to MNI space. The use of the average anatomical images and a single nonlinear
registration for each subject ensures that any potential anatomical distortions due to the different scanners do not introduce a systematic bias into the registration. The
average MPRAGE images were nonlinearly registered to an evolving group average template
in MNI space as described previously (Scheinost et
al. 2017). All transformation pairs were calculated independently and then
combined into a single transform that warps single participant results into common
space. From this, all subjects’ images can be transformed into common space using a
single transformation, which reduces interpolation error.
HCP Preprocessing
Starting with the minimally preprocessed HCP data (Glasser et al. 2013), further preprocessing steps were
performed using BioImage Suite. These included regressing 24 motion parameters,
regressing the mean time courses of the white matter and CSF as well as the global
signal, removing the linear trend, and low-pass filtering as described for the
test–retest cohort.
Matrix Connectivity
Regions were delineated according to a 268-node whole-brain grey matter atlas (Shen et al. 2013). This atlas, defined in an
independent dataset, provides a parcellation of the whole gray matter (including
subcortex) into 268 contiguous, functionally coherent regions. These 268 nodes have also been grouped into 10 functionally coherent "networks" (cf. Finn et al. 2015) using the same parcellation procedure, and by anatomy into 10 "anatomical regions." Note that the subcortical-cerebellar network (previously Network 4 in Finn et al. 2015) is further divided here into 3 networks: cerebellar, subcortical, and limbic.

For each scan, the average timecourse within each region was obtained, and the
Pearson's correlation between the mean time courses of each pair of regions was
calculated. These correlation values provided the edge strengths for a 268 × 268
symmetric correlation matrix for each combination of subject, session, and run. These
correlations were converted to z-scores using a Fisher transformation,
providing a “connectivity matrix” for each combination of subject, session, and run.
This connectivity matrix is analogous to a conventional seed connectivity analysis
wherein a volumetric region of interest is defined and correlations are calculated
between that region and all whole-brain grey matter voxels, except that other regions
are used instead of voxels. Therefore, the connectivity matrix represents a set of
regions (“nodes”) and connections between each pair of regions (“edges”).
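For illustration, the construction just described can be sketched in a few lines of Python (hypothetical variable names; not the authors' implementation). The 268 nodes and 35 778 unique edges match the atlas dimensions given in the text.

```python
import numpy as np

def connectivity_matrix(timecourses):
    """Fisher-z connectivity matrix from node-average time courses.

    timecourses: (n_nodes, n_timepoints) array of mean regional signals
    Returns an (n_nodes, n_nodes) symmetric matrix of Fisher-transformed
    Pearson correlations, with the diagonal zeroed.
    """
    r = np.corrcoef(timecourses)      # Pearson r between all node pairs
    np.fill_diagonal(r, 0.0)          # avoid atanh(1) = inf on the diagonal
    return np.arctanh(r)              # Fisher r-to-z transformation

# Toy example: 268 nodes, 360 volumes (one 6-min run at TR = 1000 ms)
tc = np.random.randn(268, 360)
z = connectivity_matrix(tc)
edges = z[np.triu_indices(268, k=1)]  # the 35 778 unique edges
```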
Test–Retest Reliability Analyses in the Test–Retest Cohort
Univariate Test–Retest Reliability
The generalizability theory (G theory) framework was used to assess test–retest
reliability. G theory is a generalization of classical test theory that allows the
inclusion of more than one facet of measurement, or source of error (Webb and Shavelson 2005; Webb et al. 2006). G theory has been previously used to assess
the test–retest reliability of multisite functional connectivity (Forsyth et al. 2014; Gee
et al. 2015; Noble et al. 2016).
The first step is a generalizability study (G-study), which involves estimating variance
components for all factors: the “object” of measurement (here, people), “facets” of
measurement (here, sessions and runs), and their interactions. The residual contains
variance due to both the 3-way interaction and residual error.

A 3-way ANOVA model was used to estimate variance due to all factors modeled as random—to maximize generalizability—using the Matlab OLS-based function "anovan." Negative variance components were small in magnitude and therefore set to 0, as previously described (Shavelson et al. 1993). The model of variance is as follows, with subscripts representing factors p = person, s = session, r = run, and e = residual:

σ²(X_psr) = σ²_p + σ²_s + σ²_r + σ²_ps + σ²_pr + σ²_sr + σ²_e

These variance components can then be used to estimate test–retest reliability. For
this study, we calculated “absolute reliability,” which is measured by the dependability
coefficient (D-coefficient, Φ) and reflects the absolute agreement of
measurements. The D-coefficient is a form of the intraclass correlation
coefficient (ICC; Shrout and Fleiss 1979),
which is based on the ratio of variance due to the object of measurement versus sources
of error (akin to an F-statistic). For the univariate case,
D-coefficients (Φ_x) were calculated for each edge x as follows:

Φ_x = σ²_p / [σ²_p + σ²_s/n_s + σ²_r/n_r + σ²_ps/n_s + σ²_pr/n_r + (σ²_sr + σ²_e)/(n_s·n_r)]

where σ²_i represents a variance component associated with factor i or an interaction between facets, n_i represents the number of conditions of facet i used for an average measurement (discussed in the next paragraph), and the factors are, again, p (person), s (session), and r (run). Note that the treatment of facets as possibly similar across subjects yet randomly selected from a larger population makes this D-coefficient of the form ICC(2,1) (cf. Webb et al. 2006). To summarize all edges, the mean and SD of
the D-coefficient across all edges is presented.
D-coefficients (like most ICC coefficients) range from 0 to 1, and can
be interpreted as follows: <0.4 poor; 0.4–0.59 fair; 0.60–0.74 good; > 0.74
excellent (Cicchetti and Sparrow 1981).

This formulation enables the construction of a decision study (D-study), which provides information about what combination of measurements from each facet of measurement yields the desired level of test–retest reliability. To build a D-study matrix, D-coefficients are re-calculated with n_s and n_r allowed to vary as free parameters. Increasing n_s and n_r in the calculation of the D-coefficients is akin to assessing the test–retest reliability of connectivity matrices averaged over multiple sessions or runs, for example, how test–retest reliability increases with the addition of more data from either sessions or runs. Therefore, test–retest reliability obtained from n_s = 1 and n_r = 1 is the projected test–retest reliability of a connectivity matrix obtained from a single 6-min run (6 min total), whereas test–retest reliability obtained from n_s = 2 and n_r = 4 is the projected test–retest reliability of a connectivity matrix obtained from the average of matrices over 2 sessions of four 6-min runs of data (48 min total). After calculating test–retest reliability across all edges, test–retest reliability was then summarized by taking the mean within (1) nodes, (2) networks, and (3) anatomical regions. Using the univariate D-study results, the minimum scan duration required to obtain different mean levels of test–retest reliability was calculated (a sketch of this calculation follows at the end of this subsection).

To facilitate comparisons with previous studies, test–retest reliability was also
assessed within the context of classical test theory. In brief, 2 versions of the ICC
were calculated: (1) a coefficient describing short-term (intrasession) test–retest
reliability, and (2) a coefficient describing long-term (intersession) test–retest
reliability. Details can be found in the Supplementary Methods.
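To make the D-study calculation concrete, here is an illustrative Python sketch of the dependability coefficient and D-study grid. The variance components would come from the random-effects ANOVA described above (the paper uses Matlab's "anovan"); the numeric values below are hypothetical placeholders that loosely echo the proportions reported in the Results, not the study's estimates.

```python
import numpy as np

def dependability(var, n_s=1, n_r=1):
    """G-theory absolute dependability coefficient (Phi) for one edge.

    var: dict of G-study variance components, keys
         'p', 's', 'r', 'ps', 'pr', 'sr', 'e'
    n_s, n_r: number of sessions and runs averaged over in the D-study
    """
    v = {k: max(val, 0.0) for k, val in var.items()}  # floor negatives at 0
    error = (v['s'] / n_s + v['r'] / n_r + v['ps'] / n_s +
             v['pr'] / n_r + (v['sr'] + v['e']) / (n_s * n_r))
    return v['p'] / (v['p'] + error)

# Hypothetical variance-component proportions for one edge, loosely echoing
# the magnitudes reported in the Results (not the study's actual estimates)
components = {'p': 0.18, 's': 0.01, 'r': 0.01, 'ps': 0.12,
              'pr': 0.02, 'sr': 0.01, 'e': 0.65}

# D-study grid: projected Phi for 1-4 sessions by 1-6 runs per session
d_study = np.array([[dependability(components, n_s, n_r)
                     for n_r in range(1, 7)] for n_s in range(1, 5)])
```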
Region and Motion Influences on Univariate Test–Retest Reliability
The extent to which test–retest reliability was associated with node size (voxel-wise
volume) and location was explored via Matlab's “fitlm.” Three models were constructed:
(1) test–retest reliability ~ node size, (2) test–retest reliability ~ location, and (3)
test–retest reliability ~ node size + location. Location was coded as a binary variable
(1 = cortex, 0 = subcortex). The fits of both single-predictor models (1 or 2) were
compared against the dual-predictor model (3) via F-test (a sketch follows at the end of this subsection).

The influence of motion on test–retest reliability was investigated by comparing
test–retest reliability estimated in lower and higher motion groups. To estimate motion,
mFFD was calculated for each subject as the average FFD across all runs and sessions. A
threshold of mFFD = 0.1 was chosen to separate subjects into higher and lower motion
groups, based on the mean FFD across medium motion adults in Power et al. (2014; see Fig. S1). Test–retest reliability was then calculated separately for
both groups.
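The node size/location model comparison described earlier in this subsection might be sketched as follows; the paper uses Matlab's "fitlm", so this Python/statsmodels version with simulated node data is only illustrative.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 268
size = rng.normal(5000, 1800, n)              # node volume in voxels (toy)
cortex = (rng.random(n) < 0.8).astype(float)  # 1 = cortex, 0 = subcortex
phi = 0.1 + 2e-5 * size + 0.05 * cortex + rng.normal(0, 0.05, n)

fit_size = sm.OLS(phi, sm.add_constant(size)).fit()  # reliability ~ size
fit_both = sm.OLS(phi, sm.add_constant(
    np.column_stack([size, cortex]))).fit()          # ~ size + location
# Nested-model F-test: does adding location improve on size alone?
f_val, p_val, df_diff = fit_both.compare_f_test(fit_size)
```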
Multivariate Analyses
Two multivariate analyses were performed: (1) whole-brain multivariate test–retest
reliability, using a method derived from the image intraclass correlation coefficient
(I2C2; Shou et al. 2013), and (2)
whole-brain discriminability, using 2 discrimination procedures related to the
“fingerprinting” approach (Finn et al.
2015). All 35 778 unique edges in the 268-node atlas were used for these
analyses.
Multivariate Test–Retest Reliability
For multivariate test–retest reliability, a single test–retest reliability coefficient
was calculated by summing variance components over all edges, as described by Shou et al. (2013):

Φ = Σ_x σ²_p(x) / Σ_x [σ²_p(x) + σ²_s(x)/n_s + σ²_r(x)/n_r + σ²_ps(x)/n_s + σ²_pr(x)/n_r + (σ²_sr(x) + σ²_e(x))/(n_s·n_r)]

where, as above, σ²_i(x) represents the variance component associated with factor i or an interaction between factors at edge x, n_i represents the number of levels in facet i, and factors are p (person), s (session), and r (run). In accordance with Shou et al. (2013), the I2C2 approach represents a multivariate
image measurement error model because these combined variance components reflect the
true overall image variance, in contrast with the univariate ICC which reflects a
univariate (marginal) measurement error model. Scan durations required for each level of
test–retest reliability were then calculated as in the univariate analysis.
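A minimal sketch of this pooled coefficient, reusing the structure of the univariate sketch above and again assuming per-edge variance components have been estimated elsewhere:

```python
import numpy as np

def multivariate_dependability(v, n_s=1, n_r=1):
    """I2C2-style multivariate Phi: variance components summed over edges.

    v: dict mapping 'p', 's', 'r', 'ps', 'pr', 'sr', 'e' to arrays holding
       that variance component for every edge (negatives floored at 0)
    """
    tot = {k: np.maximum(a, 0.0).sum() for k, a in v.items()}
    error = (tot['s'] / n_s + tot['r'] / n_r + tot['ps'] / n_s +
             tot['pr'] / n_r + (tot['sr'] + tot['e']) / (n_s * n_r))
    return tot['p'] / (tot['p'] + error)
```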
Multivariate Discriminability
The fingerprinting approach (Finn et al.
2015) is a complementary method of assessing the reproducibility of functional
connectivity. We used 2 related discrimination approaches to determine whether subjects
could be identified (using the fingerprinting procedure) or separated (using a new,
related procedure). Accordingly, 2 summary statistics were calculated: the
identification success rate (ISR) and the perfect separability rate (PSR). Summary
statistics were calculated 6 times to assess the effect of different amounts of data,
each time adding a 6-min run before re-calculating the connectivity matrix (e.g., for 6,
12, 18, 24, 30, and 36 min). Runs were added sequentially (i.e., in the order acquired)
to reflect real-world conditions.

The ISR measures whether scans from the same subject can be accurately selected from a
set of all scans. Each of the 48 scans (12 subjects × 4 sessions = 48 scans) served as a
reference once. For each reference scan, the maximum correlation between the reference
scan matrix and all other 47 scan matrices was selected. If the selected matrix belonged
to the same subject as the reference matrix, the identification for that scan was coded
as a success. After repeating this procedure for all scans, the ISR was then calculated
as follows:

ISR = (number of successful identifications) / (n_p · n_s)

where p = person and s = session. Each successful identification can be compared with the probability of choosing one of 3 correct scans out of 47 total scans at random.

The PSR is more challenging than the ISR; it describes whether all within-subject
correlations exceeded all between-subject correlations for each scan. Again, each of the
48 scans were used as reference. For each reference scan, if all 3 correlations between
the reference and other within-subject scan matrices exceeded all 44 correlations
between the reference and between-subject scan matrices—that is, the reference subject
did not once look more like another subject than him/herself—then the separation for the
reference scan was recorded as a success. The PSR was calculated as follows:

PSR = (number of perfect separations) / (n_p · n_s)

where p = person and s = session. Each perfect separation for a scan can be compared with the probability of the minimum of 3 randomly chosen integers exceeding the maximum of 44 randomly chosen integers.

Mean correlations within and between subjects for the above procedures are also
presented. Note that correlations will always exceed ICC (Müller and Büttner 1994); therefore, a perfect Pearson's
correlation can occur in the absence of identical measurements, but identical
measurements must always be accompanied by a high correlation.
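Both discrimination procedures can be sketched compactly, assuming a (48 × 48) matrix of correlations between vectorized connectivity matrices and an array of subject labels (hypothetical names; not the authors' code):

```python
import numpy as np

def isr_and_psr(corr, subject):
    """Identification success rate and perfect separability rate.

    corr:    (n_scans, n_scans) correlations between vectorized matrices
    subject: (n_scans,) integer subject label for each scan
    """
    n = corr.shape[0]
    hits, perfect = 0, 0
    for i in range(n):
        others = np.delete(np.arange(n), i)    # every scan but the reference
        same = others[subject[others] == subject[i]]
        diff = others[subject[others] != subject[i]]
        best = others[np.argmax(corr[i, others])]
        hits += subject[best] == subject[i]    # ISR: top match, same subject
        # PSR: all within-subject correlations beat all between-subject ones
        perfect += corr[i, same].min() > corr[i, diff].max()
    return hits / n, perfect / n
```

With the study's 12 subjects × 4 sessions, n = 48, and each reference scan has 3 within-subject and 44 between-subject comparisons.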
Estimating Scanner and Day Effects in the Test–Retest Cohort
We investigated the effect of acquiring test–retest data using identical scanners at a
single site. This step is needed to determine whether it is appropriate to collapse
scanner and day factors into a single session factor, which was done in the above analyses
in order to estimate the presence of a session effect. A model including person, scanner,
day, run, and all interactions was created to assess the presence of scanner and day
effects, with subscripts representing factors p = person,
s = scanner, d = day, r = run, and
e = residual:

σ²(X_psdr) = σ²_p + σ²_s + σ²_d + σ²_r + σ²_ps + σ²_pd + σ²_pr + σ²_sd + σ²_sr + σ²_dr + σ²_psd + σ²_psr + σ²_pdr + σ²_sdr + σ²_e

These variance components were then compared with the variance components obtained with the model including person, session, run, and all interactions.

Next, we estimated whether person, scanner, and day factors were associated with
significant variance across different numbers of runs, as described previously (Noble et al. 2016). We first assessed for the
effect of each factor (via ANOVA), then assessed for the effect of each individual level
within each factor (via GLM). For each number of runs, a connectivity matrix was
re-calculated over that number of runs. Then the regression estimation procedure was
repeated using that number of runs. Run was not explicitly included in either model due to
the potential for inflation of estimates of significance resulting from the within-factor
repeated-measures nature of the data.

Effects due to each factor were assessed as follows. The contribution of all factors to
the variability in connectivity was estimated using a 3-way ANOVA with all factors modeled
as random effects, as above, except without the interactions. This ensures more accurate
estimates for the denominator of the F-test. The model is as follows,
with subscripts representing p = person, s = scanner, d = day, and e = residual:

σ²(X_psd) = σ²_p + σ²_s + σ²_d + σ²_e

Next, the F-test statistic was used to assess whether each factor was
associated with significant variability in connectivity. Finally, correction via
estimation of the false discovery rate (FDR) was performed separately for each factor
using "mafdr" in Matlab (Storey 2002).

Next, effects due to each individual person, scanner, and day were assessed using a
general linear model (GLM). Each of the 3 factors (person, scanner, and day) was modeled
separately, so that 3 GLMs—1 per factor—were constructed for each edge or voxel and fit
using the Matlab function “glmfit.” Since the aims of this analysis are exploratory, GLMs
were independently estimated to facilitate interpretation of the direct effects (Hayes 2013). The results of this analysis show
whether any of the measures are significantly different from the group mean as a function
of these factors. While the inclusion of the level of interest in the grand mean can
decrease the power of this test, this setup is useful for comparing the effects of each
level to one another because each is compared with a common reference. As above, FDR
correction was performed separately for each level of each factor.
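As a simplified stand-in for this per-edge testing scheme (the paper fits a 3-way random-effects ANOVA per edge and applies Storey's FDR via Matlab's "mafdr"), the sketch below tests only the person factor with a one-way F-test and uses Benjamini-Hochberg FDR:

```python
import numpy as np
from scipy import stats

def person_effect_fdr(edges_by_scan, subject, q=0.05):
    """Per-edge F-test for a person effect with Benjamini-Hochberg FDR.

    edges_by_scan: (n_scans, n_edges) edge strengths
    subject:       (n_scans,) subject label per scan
    Returns a boolean mask of edges significant after FDR correction.
    """
    groups = [edges_by_scan[subject == s] for s in np.unique(subject)]
    pvals = np.array([stats.f_oneway(*[g[:, e] for g in groups]).pvalue
                      for e in range(edges_by_scan.shape[1])])
    order = np.argsort(pvals)
    thresh = q * np.arange(1, pvals.size + 1) / pvals.size
    passed = pvals[order] <= thresh
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    sig = np.zeros(pvals.size, dtype=bool)
    sig[order[:k]] = True
    return sig
```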
Test–Retest Reliability Analysis in the HCP Cohort
For the HCP cohort, we performed univariate and multivariate test–retest reliability
analyses similar to those used in the test–retest cohort. The following model was used to
estimate variance components, with subscripts representing p = person,
s = session, and e = residual:

σ²(X_ps) = σ²_p + σ²_s + σ²_e

The 2 runs acquired on the same day for the HCP cohort were acquired with different phase
encodings, which could artificially increase within-session variance and confound
test–retest reliability. As such, connectivity matrices for the right-left and left-right
phase encoding acquired on the same day (i.e., REST2_LR and REST2_RL) were averaged
together and no run factor was included in the test–retest reliability analysis. Thus,
test–retest reliability estimated with n_s = 1 is the projected test–retest reliability from a 30-min session.
Behavioral Analysis Using Connectome-Based Predictive Modeling in the HCP
Cohort
We then explored the association between an edge's test–retest reliability and its
behavioral utility in the HCP cohort, using Connectome-based Predictive Modeling (CPM;
Shen et al. 2017) to define useful edges.
CPM is a data-driven protocol for developing predictive models of brain-behavior
relationships from connectomes using leave-one-subject-out cross-validation. Full details
of the CPM protocols are provided elsewhere (Shen et
al. 2017). Briefly, CPM is composed of (1) selecting edges that are significantly
correlated with behavior (P < 0.05), (2) summing those edge strengths
into a single-subject summary score, (3) building a linear model to predict behavior based
on the single-subject summary scores, and (4) testing this predictive model in novel
subjects. Fluid intelligence scores (gF) obtained via the Raven Progressive Matrices
(“pmat” variable in HCP dataset) were used as the behavioral measure, and connectivity
matrices were averaged over both days (60 min of data in total). Our final predictive
model was only composed of edges that appeared in every round of cross-validation (cf.
Rosenberg et al. 2015).
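A minimal sketch of this protocol (simplified to the positive-tail network, omitting the final every-fold consensus step; variable names are hypothetical):

```python
import numpy as np
from scipy import stats

def cpm_loo(edges, behavior, p_thresh=0.05):
    """Leave-one-subject-out CPM sketch (positive network only).

    edges:    (n_subjects, n_edges) Fisher-z edge strengths
    behavior: (n_subjects,) behavioral scores, e.g., gF
    Returns the predicted score for each left-out subject.
    """
    n = edges.shape[0]
    pred = np.zeros(n)
    for i in range(n):
        train = np.arange(n) != i
        # (1) select edges significantly (positively) related to behavior
        rs, ps = zip(*(stats.pearsonr(edges[train, e], behavior[train])
                       for e in range(edges.shape[1])))
        keep = (np.array(ps) < p_thresh) & (np.array(rs) > 0)
        # (2) sum selected edge strengths into one summary score per subject
        score = edges[:, keep].sum(axis=1)
        # (3) fit a linear model on the training subjects...
        slope, intercept = np.polyfit(score[train], behavior[train], 1)
        # (4) ...and apply it to the held-out subject
        pred[i] = slope * score[i] + intercept
    return pred
```

Performance would then be summarized as the correlation between predicted and observed scores across left-out subjects, as reported in the Results.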
Results
Test–Retest Reliability of Univariate Functional Connectivity
Overall Test–Retest Reliability of Individual Edges
For a single, 6-min session, test–retest reliability across all edges was found to be
poor (Φ = 0.18 ± 0.13). This is mainly due to the large
contribution of the residual (65.8%) relative to the other variance components,
including subject (18.3%; Supplementary Table 1). This large residual indicates that all scans in the
dataset are unique in a way that is not explained by the specified factors. All other
variance components were relatively small (<2%) except the person by session
interaction (12.0%), suggesting that subjects differ in how they progress through
sessions.

To assess the influence of quantity of data on test–retest reliability, a Decision
Study map was created by estimating test–retest reliability for different combinations
of numbers of sessions and runs (Fig. 2). The
Decision Study map is asymmetric across the diagonal, indicating that increasing the
number of sessions boosts test–retest reliability more than increasing the number of
runs within a session. As such, fair test–retest reliability could not be obtained
within a single session, even using the maximum amount of data acquired, 36 min
(Φ = 0.39 ± 0.21). In contrast, using the minimal
amount of data (6 min), it is possible to achieve fair test–retest reliability in 4
sessions (Φ = 0.40 ± 0.21; 24 min in total). Good test–retest
reliability can be obtained with a minimum of 3 sessions, requiring 35 min of data per
session (Φ = 0.60 ± 0.23; 105 min in total). However,
excellent test–retest reliability cannot be obtained even using all data acquired (144
min total; Φ = 0.65 ± 0.23).
Figure 2.
Effect of number of sessions and scan duration on mean test–retest reliability over
all edges. A Decision Study was performed to estimate absolute reliability (Φ) as a
function of scan duration (min) and number of sessions. Brighter colors correspond
to higher levels of test–retest reliability, and results are categorized as follows:
poor < 0.4, fair = 0.4–0.59, good = 0.6–0.74, excellent ≥ 0.74 (Cicchetti and Sparrow 1981). The asymmetry
across the diagonal indicates that test–retest reliability improves more quickly
with increasing number of sessions than with increasing scan durations. Accordingly,
for a single session, even with 36 min of data, only poor test–retest reliability is
obtained. However, fair test–retest reliability can be obtained with only 24 min of
data collected over 4 sessions of 6 min each. Good test–retest reliability requires
96–108 min of data divided over 3–4 sessions. Excellent test–retest reliability
cannot be achieved using the maximum amount of data collected (4 sessions × 36 min,
or 2.4 h in total).
Test–Retest Reliability of Edges Summarized by Nodes and Networks
Although test–retest reliability over all edges was poor at a single session even using
the full scan duration (36 min), edges associated with certain nodes showed fair or good
mean test–retest reliability (Φ = 0.4–0.6; Fig. 3a). Edges associated with cortical nodes
(Φ = 0.20 ± 0.05) were more reliable than those
associated with subcortical nodes (Φ = 0.12 ± 0.04).
More reliable cortical nodes attained fair test–retest reliability with less data per
subject (Fig. 3b). These
differences are attributed to a combination of node size (size_cortex = 5211 ± 1854 voxels; size_subcortex = 3408 ± 1116 voxels) and location (cortex or subcortex), since both were found to uniquely contribute to test–retest reliability when both were simultaneously included as predictors (node size: β = 0.67, P < 0.0001; location: β = 0.15, P < 0.001). This model explained significantly more variance than
either single-predictor model (Supplementary Fig. 1).
Figure 3.
Spatial distribution of test–retest reliability, organized by node.
(a) Mean test–retest reliability (Φ) of connectivity at each
node. For each node, the mean test–retest reliability of all edges associated with
that node was calculated for a single session with a 36-min scan duration. Brighter
colors correspond to higher levels of test–retest reliability, and results are
categorized as follows: poor < 0.4, fair = 0.4–0.59, good = 0.6–0.74,
excellent ≥ 0.74 (Cicchetti and Sparrow
1981). Cortical nodes exhibited greater test–retest reliability than
noncortical nodes. (b) Minimum scan duration needed to achieve mean
fair test–retest reliability at each node for a single session. Brighter colors
correspond to shorter scan durations, and are scaled differently than for
(a). Cortical nodes became reliable at shorter scan durations
than noncortical nodes.
Edges associated with different large-scale functional networks exhibited different
levels of test–retest reliability (Fig. 4a, Supplementary Fig. 2). Networks with more reliable edges
required less data per subject to obtain a fair level of test–retest reliability (Fig.
4b, Supplementary Fig.
2). For cortical networks, within-network test–retest
reliability (values on the diagonal in Fig. 4a) was greater than between-network test–retest reliability
(values off the diagonal in Fig. 4a). For all networks, test–retest reliability increased with
more data per subject. Results organized by anatomical region are also included (Supplementary Fig. 3).
Figure 4.
(a) Spatial distribution of test–retest reliability, summarized by
network. For each pair of networks, the mean test–retest reliability (Φ) of all
edges between those networks was calculated for a single session with variable scan
duration (as noted in figures, scan duration = 6, 12, 18, 24, 30, and 36 min). The
frontoparietal, medial frontal, and secondary visual networks showed the highest
test–retest reliability. Brighter colors correspond to higher levels of test–retest
reliability, and results are categorized as follows: poor < 0.4, fair = 0.4–0.59,
good = 0.6–0.74, excellent ≥ 0.74 (Cicchetti and
Sparrow 1981). (b) Minimum scan duration needed to achieve
mean fair, good, and excellent test–retest reliability, summarized by network. For
each pair of networks, the mean test–retest reliability of all edges between those
networks was calculated (Φ) for a single session with variable scan duration (scan
duration = 6, 12, 18, 24, 30, and 36 min), as above. The minimum scan duration
resulting in mean fair, good, and excellent test–retest reliability for that network
pair was then determined. Only within-network frontoparietal connectivity reached
good test–retest reliability in a single session. No network pair reached excellent
test–retest reliability. Subcortical regions never reached fair test–retest
reliability. Brighter colors correspond with shorter scan durations. MF, medial
frontal; FP, frontoparietal; DMN, default mode; Mot, Motor; VI, visual I; VII,
visual II; VAs, visual association; Lim, limbic; BG, basal ganglia (including
thalamus and striatum); CBL, cerebellum.
Test–retest reliability estimated via classical test theory was also found to be poor
with a similar spatial distribution; test–retest reliability and associated variance
components for classical test theory can be found in Supplementary Figure 4.
Influence of Motion on Univariate Test–Retest Reliability
Average motion for each subject ranged between 0.0374 and 0.1818 mm mFFD (mean = 0.0993
mm, SD = 0.0506 mm; Supplementary
Table 2). Overall, both higher and lower motion groups showed similar mean
test–retest reliability. Test–retest reliability was found to be
Φ = 0.15 ± 0.12 for the higher motion group (mFFD >
0.1 mm; n = 5) and Φ = 0.15 ± 0.13 for
the lower motion group (mFFD < 0.1 mm; n = 7). In both cases,
test–retest reliability was estimated to be lower than that obtained using the entire
cohort. The spatial distribution of test–retest reliability was also found to be similar
across higher and lower motion groups, with higher test–retest reliability of
within-network compared with between-network edges (Supplementary Fig. 5). However,
this spatial pattern was more pronounced in the lower motion group compared with the
higher motion group.
Multivariate Test–Retest Reliability and Discriminability of Functional
Connectivity
The D-study for multivariate test–retest reliability is shown in Figure 5. For a single session with 6 min of data, multivariate test–retest reliability was greater than univariate test–retest reliability (multivariate Φ = 0.23 vs. univariate Φ = 0.19 ± 0.13). As for univariate test–retest reliability, the D-study map is asymmetric across
the diagonal. Fair test–retest reliability (Φ≥0.4) was
obtained within a single session using 18 or more min of data. Using the minimum amount
of data per session (6 min), 3 sessions were required to achieve fair test–retest
reliability (18 min in total). Good test–retest reliability was obtained with a minimum
of 2 sessions, requiring 23 min of data per session (46 min in total). The minimum
amount of data per session required to attain good test–retest reliability is 9 min,
requiring 4 sessions (36 min in total). Excellent test–retest reliability was obtained
with 4 sessions of 22 min of data (88 min in total). Not only is less data required to
achieve higher test–retest reliability for multivariate, compared with univariate,
measures, but there is a greater rate of change in test–retest reliability with
increasing amounts of data.
Figure 5.
Effect of number of sessions and scan duration on multivariate test–retest
reliability of the connectivity matrix. A Decision Study was performed to estimate
multivariate absolute test–retest reliability (Φ) as a function of scan duration
(min) and number of sessions. Brighter colors correspond to higher levels of
test–retest reliability, and results are categorized as follows: poor < 0.4, fair
= 0.4–0.59, good = 0.6–0.74, excellent ≥ 0.74 (Cicchetti and Sparrow 1981). The asymmetry across the diagonal indicates
that test–retest reliability improves more quickly with increasing number of
sessions than with increasing scan durations. Accordingly, fair test–retest
reliability can be obtained with less total data if data is acquired with multiple
sessions than if only a single session is used. Unlike in the univariate case, mean
fair test–retest reliability can be achieved in a single session and mean excellent
test–retest reliability can be achieved using the maximum amount of data collected
(4 sessions × 36 min).
The correlations between all sessions of all subjects were calculated (Supplementary Fig. 6) and used in
2 measures of between-subject discrimination. Overall, subjects were found to be unique,
but not perfectly so (Table 1, Supplementary Fig.
7). With 6 min of data, the ISR was 98% (only a single
session was misidentified) and reached the maximum of 100% with ≥12 min of data, which
corresponded with poor univariate and multivariate test–retest reliability. The PSR,
which requires all within-subject correlations to exceed all between-subject
correlations, was lower. PSR was 71% when using 6 min of data and 90% when using 36 min
of data. Correlations between the reference session and the session with the maximum
similarity increased from r = 0.59 using 6 min of data to
r = 0.82 using 36 min of data. As the amount of data per subject
increased, the worst within-subject correlation increased at a greater rate than the
best between-subject correlation (Supplementary Fig. 7), which underlies the improvement in
discrimination with increasing data.
Table 1
Multivariate discriminability increases with scan duration
                 6 min    12 min   18 min   24 min   30 min   36 min
Discriminability
  ISR            98%      100%     100%     100%     100%     100%
  PSR            71%      73%      79%      91%      90%      90%
Mean correlations
  Match          0.5938   0.7101   0.7615   0.792    0.8105   0.8191
  Within-sub     0.5512   0.6643   0.7152   0.7477   0.7683   0.7791
  Between-sub    0.3828   0.4644   0.502    0.5265   0.5424   0.551
  Worst within   0.5036   0.617    0.6679   0.7041   0.723    0.7348
  Best between   0.4686   0.5501   0.594    0.6133   0.6272   0.6353
Two discriminability measures were used here: the identification success rate
(ISR) and the perfect separability rate (PSR). The ISR measures whether matrices
from the same subject can be accurately chosen from a set of all other 47 scan
matrices. The PSR measures whether, for each reference scan, all 3
within-subject correlations exceed all 44 between-subject correlations. The
following metrics are reported for each scan duration from 6 to 36 min: ISR,
PSR, mean match correlation, mean within-subject correlation, mean worst
(lowest) within-subject correlation, and mean best (highest) between-subject
correlation.
Day and Scanner Effects
Variance was estimated for a model including scanner and day (Supplementary Fig.
8). No main effects of scanner or day were found for any amount
of data (except at 36 min, where a single edge showed a scanner effect). In other words,
subject connectivity does not appear to systematically increase or decrease from the first
scanner to the second or the first day to the second. In contrast, many edges were
influenced by person with 6 min of data (~80% edges by factor, ~5% edges by level), and
the number of affected edges increased with larger amounts of within-session data (at 36
min, 100% edges by factor, 25% edges by level; Supplementary Fig. 8). For correspondence with the
session effect, see Supplementary
Results. Finally, mean matrix correlations were high within-person across scanners and days (r_scanner = 0.75 ± 0.07, r_day = 0.80 ± 0.06; Supplementary Fig. 8). Altogether, these results
suggest minimal systematic session or scanner effects and support pooling data across
sessions or scanners, in the case of harmonized scanners.
Replication of Test–Retest Reliability in the HCP Cohort
With 30 min of data, univariate test–retest reliability was largely similar in the HCP cohort compared with the TRT cohort (Φ_HCP = 0.28 ± 0.15; Φ_TRT = 0.37 ± 0.20). Test–retest reliability
(Φ) was correlated between the 2 cohorts at the single
edge level (r = 0.57, P
< 0.0001; Supplementary Fig.
9a). Between cohorts, mean reliabilities were almost perfectly correlated at the
network level (r_network = 0.94, P < 0.0001; Supplementary Fig. 9b) and were highly correlated at the node- and anatomical region-levels (r_node = 0.79, P < 0.0001; r_anat.region = 0.91, P < 0.0001).
Multivariate test–retest reliability was slightly lower for the HCP compared with the TRT cohort (Φ_HCP = 0.33; Φ_TRT = 0.48).
Association Between Test–Retest Reliability and Behavioral Utility
The model obtained from the CPM analysis significantly predicted gF (correlation between
predicted and observed value, r = 0.22, P < 0.0001;
Supplementary Fig.
10). Edges used to predict gF were associated with both
cortical and subcortical regions. The difference between test–retest reliability
(Φ) of predictive and nonpredictive edges exhibited a
small effect size (Cohen's d = 0.14; Fig. 6). Similarly, using all edges, the association between the
test–retest reliability of an edge and its behavioral relevance (i.e., magnitude of its
correlation with gF) was found to be small (r = 0.05), with test–retest
reliability accounting for less than 1% of the variance in behavioral relevance of edges
(Fig. 7). Finally, we repeated our CPM analysis
after removing the least reliable edges (or the most reliable edges) from the analysis in
an iterative fashion, increasing the number of edges removed at each iteration by 3200 (9%
of all edges; Fig. 8; see Supplementary Fig.
10 for additional results from positive and negative predictive
networks). Prediction performance remained mainly stable when removing edges in either
direction (reliable or unreliable). The influence of removing edges in either direction
was inconsistent, with performance never increasing by more than +0.025 from performance
with all edges included. Altogether, these results illustrate a dissociation between an
edge's test–retest reliability and its utility in predicting behavior.
Figure 6.
Difference in test–retest reliability between edges predictive and not predictive of
gF. Edges categorized as predictive of gF are those included in predictive networks
for every fold of the cross-validation; the distribution of predictive edges is shown
at left. Tukey boxplots show median (red line), data between first and third quartile
(edges of box), and suspected outliers (whiskers and red crosses; beyond 1.5
inter-quartile range [IQR]).
Figure 7.
Edge-wise relationship between test–retest reliability and behavioral relevance.
Behavioral relevance refers to the correlation between that edge's strength and fluid
intelligence (gF) across all subjects (abs(r)). For the plot of the
spatial distribution of behavioral relevance (top left), warmer colors are more
positively correlated with behavior, and cooler colors are more negatively correlated
with behavior. For the plot of test–retest reliability (bottom left), brighter colors
are more reliable. For the scatterplot (right), each point represents a single edge.
Here, behavioral relevance is the absolute value of behavioral relevance to facilitate
finding effects related to magnitude of relevance.
Figure 8.
Influence of reliable and unreliable edges on prediction of fluid intelligence. An
increasing number of unreliable edges are removed toward the right of the
x-axis, and an increasing number of reliable edges are removed
toward the left. Removal of unreliable edges starts with all edges showing Φ = 0, then
removing in intervals of 3200; removal of reliable edges also occurs in increments of
3200 (+1 for the first interval). The removal of 2/3 of all edges (23 852 edges
removed) are marked with vertical lines; changes in performance greater than 0.025
from performance with all edges are marked with horizontal lines.
Discussion
As studies continue to suggest the promise of resting-state functional connectivity, full
characterization of its reliability is necessary. To add to previous studies of the
test–retest reliability of resting-state functional connectivity, we investigated the
influence of the amount of data per subject on test–retest reliability, characterized the
whole-brain spatial distribution of test–retest reliability, compared univariate and
multivariate measures of test–retest reliability, and explored the association between
test–retest reliability and behavioral prediction. Overall, our results suggest that the
historical 5-min resting-state scan is associated with poor average test–retest reliability
of whole-brain connectivity, and support ongoing efforts to collect more data per subject
(Laumann et al. 2015). While test–retest
reliability averaged over all edges was poor, this was mainly due to the lower test–retest
reliability of noncortical edges. Within-network connectivity was the most reliable,
consistent with previous studies (Shehzad et al.
2009; Birn et al. 2013; O'Connor et al. 2017). Multivariate test–retest
reliability was substantially greater than univariate test–retest reliability, indicating
that the connectivity matrix, or connectome, as a whole contains more stable information
than any particular edge. Finally, we found that an edge's test–retest reliability was not
meaningfully correlated with that edge's contribution to behavioral prediction and that
removing the least and most reliable edges did not substantially change prediction
performance. These findings are among the first to underscore the distinction between
reliability and utility, suggesting that the most reliable edges are not necessarily the
most informative edges, and vice versa.
Support for More Resting-State Functional Connectivity Data per Subject
Our results suggest that a relatively large amount of data per subject (>36 min) is
needed to reach good test–retest reliability of functional connectivity using both
univariate and multivariate test–retest reliability metrics. This concurs with prior
evidence that 5–10 min of data may not result in reliable functional connectivity (Shou et al. 2013) and that reproducibility
improves with more data (Anderson et al. 2011;
Birn et al. 2013), even up to 90 min (Laumann et al. 2015)—but may seem in conflict
with other recent studies suggesting that about 10 min of data is sufficient to generate a
stable measurement of connectivity (Shehzad et al.
2009; Van Dijk et al. 2010; Birn et al. 2013; Choe et al. 2015; Tomasi et
al. 2016). These contradictions arise primarily from differences in data
summarization strategies and measures of similarity. For example, while Shehzad et al. (2009) report moderate to high
test–retest reliability with less than 10 min of data, this result is specific to a subset
of within-network data, and they also report poor test–retest reliability across all edges
(ICC < 0.4 from Table 1 in Shehzad et al. 2009). These results are
consistent with the overall poor test–retest reliability we report in Figure 1 and the fair to good within-network test–retest
reliability we report in Figure 4a. Crucially, several reports have determined the amount of connectivity data needed (ranging from 6 to 12 min) by showing limited gains in reproducibility as the amount of data increases (Van Dijk et al. 2010; Birn et al. 2013; Tomasi et al. 2016). While a plateau in test–retest reliability is a practical criterion for balancing the tradeoff between test–retest reliability and the amount of data collected, it is not an endorsement of good test–retest reliability. The latter studies still show poor
between-session test–retest reliability across all scan durations (ICC < 0.4 in Fig.
7 from Tomasi et al. 2016, and in Fig. 3a from Birn et al.
2013) and do not in themselves support high test–retest reliability of average
whole-brain connectivity using approximately 10 min of data.
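To see why a plateau need not imply good reliability, consider a toy dependability calculation: with nonzero session-level variance, adding within-session data shrinks only the noise term, so reliability saturates below its ceiling. The variance component values below are arbitrary assumptions for illustration, not estimates from this study:

```python
import numpy as np

# Assumed (arbitrary) variance components for one edge:
var_person  = 1.0   # stable between-person differences
var_session = 0.6   # session-level fluctuations (incl. person-by-session)
var_noise   = 4.0   # within-session noise, averaged down by scan length

for minutes in [5, 10, 20, 40, 90]:
    # Only the noise term shrinks with more data per session, so the
    # dependability coefficient plateaus below 1.
    phi = var_person / (var_person + var_session + var_noise / minutes)
    print(f"{minutes:3d} min: phi = {phi:.2f}")

print(f"asymptote: {var_person / (var_person + var_session):.2f}")
```

In this toy example, dependability climbs from about 0.42 at 5 min toward an asymptote of 0.62: the curve flattens well before reliability becomes good.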
Within-Network Cortical Edges are Most Reliable; Subcortical Edges are Least
Reliable
While subsets of nodes have been used to demonstrate the spatial distribution of
test–retest reliability (Shehzad et al. 2009;
Shah et al. 2016) or to summarize the
average influence of scan duration on connectivity (Birn et al. 2013; Shah et al.
2016), to our knowledge, the spatial distribution of edge-wise test–retest
reliability from a whole-brain atlas and its association with scan duration have not been
previously shown.
Connectivity was most reliable within networks defined a priori, particularly the
frontoparietal and default mode networks. The frontoparietal network has been shown to
underlie individual differences (Finn et al.
2015; Gordon et al. 2016), the
default mode network has been a mainstay of much functional connectivity
research (Broyd et al. 2009; Whitfield-Gabrieli and Ford 2012; Raichle 2015), and altered connectivity in
these networks has been linked to a variety of neuropsychiatric disorders (Uddin and Menon 2009; Öngür et al. 2010; Whitfield-Gabrieli and Ford 2012; Baker et
al. 2014; Di Martino et al. 2014;
Northoff 2014; Abbott et al. 2015; Alonso-Solís et al. 2015; Kaiser et al.
2015). It is therefore encouraging that connectivity within the frontoparietal, default mode, and other canonical networks was more reliable, suggesting that these networks may be dependable targets in studies of normal cognition and disorders.
Still, a few caveats bear mentioning. Test–retest reliability results were variable at
the level of individual edges, meaning that unreliable edges will be found within reliable
networks and vice versa. Furthermore, anatomy may play a role in these findings; for example, frontoparietal anatomy in particular has been found to vary across subjects (Hill et al. 2010).
In contrast, connectivity between noncortical regions was the least reliable, as
previously suggested (Shah et al. 2016). This
is likely due in part to the smaller size of subcortical nodes, since node size was found
to be correlated with test–retest reliability. However, subcortical location was found to
independently contribute to test–retest reliability in addition to node size; this may be
due to other factors such as unique activity, proximity to non-neuronal sources (e.g.,
susceptibility variations associated with breathing [Raj et al. 2001], cerebrospinal fluid), or lower central SNR with highly
parallel array coils (Wiggins et al.
2009).
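A minimal sketch of how one might test whether subcortical location contributes beyond node size is an ordinary least-squares regression on per-node reliability; all values here are random stand-ins and the variable names are our own assumptions:

```python
import numpy as np

# Hypothetical per-node values (random stand-ins): mean dependability,
# node size in voxels, and a subcortical indicator.
rng = np.random.default_rng(1)
n_nodes = 268
size = rng.integers(20, 400, n_nodes).astype(float)
subcortical = (rng.random(n_nodes) < 0.1).astype(float)
phi_node = rng.uniform(0, 1, n_nodes)

# Does subcortical location contribute beyond node size? Ordinary
# least-squares fit of reliability on both predictors plus an intercept.
X = np.column_stack([np.ones(n_nodes), size, subcortical])
beta, *_ = np.linalg.lstsq(X, phi_node, rcond=None)
print(f"intercept={beta[0]:.3f}, size={beta[1]:.5f}, subcortical={beta[2]:.3f}")
```

A nonzero coefficient on the subcortical indicator, with node size already in the model, is the pattern consistent with an independent subcortical contribution.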
The Connectome as a Whole is More Reliable than Individual Edges
Our multivariate results imply that the connectivity matrix in its entirety holds more
reliable information than simply the sum of its parts. Similarly, recent work has
demonstrated the relative unreliability of edges compared with whole-brain connectivity
fingerprinting (Pannunzi et al. 2017). We
interpret these findings as highlighting the potential of multivariate methods—such as the
CPM approach used here (Shen et al. 2017), or
related approaches (Shirer et al. 2012; Smith et al. 2015; La Rosa et al. 2016)—when analyzing connectivity data, instead of
univariate comparison of single edges. This is loosely analogous to the classic comparison between task-based activation and multivoxel pattern analysis (MVPA), in which each variable contributes unequally and uniquely to the final measure (Norman et al. 2006).
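As a loose illustration of why whole-matrix measures can be stable when single edges are not, the following sketch implements connectome fingerprinting in the spirit of Finn et al. (2015), with random stand-in data; identification accuracy can approach ceiling even when the per-edge test–retest correlation is modest:

```python
import numpy as np

# Hypothetical connectomes from two sessions, built from a shared
# subject-specific "trait" component plus session noise (random stand-ins).
rng = np.random.default_rng(2)
n_subjects, n_edges = 12, 35778
trait = rng.standard_normal((n_subjects, n_edges))
sess1 = trait + rng.standard_normal((n_subjects, n_edges))
sess2 = trait + rng.standard_normal((n_subjects, n_edges))

def zrows(a):
    """Z-score each row so a matrix product yields Pearson correlations."""
    return (a - a.mean(1, keepdims=True)) / a.std(1, keepdims=True)

# Fingerprinting: identify each subject by the maximum correlation between
# their session-1 connectome and every session-2 connectome.
corr = zrows(sess1) @ zrows(sess2).T / n_edges
accuracy = np.mean(corr.argmax(axis=1) == np.arange(n_subjects))
print(f"identification accuracy: {accuracy:.2f}")
```

Here each edge carries a trait-to-noise ratio of only 1:1, yet pooling tens of thousands of weakly reliable edges yields near-perfect identification.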
Data can be Pooled Across Well-Harmonized Scanners and Sessions
We found no main effect of scanner or day when scanners were identically configured and harmonized, suggesting that it is acceptable to pool data across different scanning sessions and harmonized scanners. Pooling data across sessions may enable the acquisition of large amounts of data per subject in populations that cannot tolerate long scanning sessions. For example, children and elderly subjects could be scanned in multiple shorter sessions and the data subsequently pooled.
People are Highly Variable Across Runs and Sessions
In contrast, the non-negligible person-by-session interaction suggests that each subject
varied substantially across sessions. Other work has shown similar indications of greater
between- than within-session variance, often in the form of smaller intersession compared
with intrasession test–retest reliability (Shehzad
et al. 2009; Birn et al. 2013; Pannunzi et al. 2017). Similarly, data from
different tasks completed within the same session were found to be more similar than
different sessions (O'Connor et al. 2017).
The present work, coupled with the work of others, suggests that repeated
measurements across sessions rather than runs may result in a more stable trait
measurement of an individual. Although this high intersession variability may reflect
meaningful individual variation, it may also increase the likelihood of finding an effect
by chance, especially for studies with repeated measures (e.g., pre-/post-treatment
studies). Additionally, the large residual (50–60%), which has also been found in previous task-based and resting-state fMRI studies (Forsyth et al. 2014; Gee et al. 2015; Noble et al. 2016), suggests the presence of
substantial individual variability across all runs and sessions. This may be from either
random or structured sources unaccounted for here. Parsing the contribution of real,
brain-derived dynamics underlying variability at shorter timescales remains a challenge
(Hutchison et al. 2013; Tagliazucchi and Laufs 2014; Hindriks et al. 2016; Laumann et al. 2016). It is possible that the mental states of individuals change from measurement to measurement; as the ancient adage attributed to Heraclitus puts it, “You cannot step into the same river twice.”
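As a concrete illustration of the variance decomposition behind these percentages, here is a minimal generalizability-theory-style sketch for a single edge in a person × session × run design; the variance components are assumed values used only to simulate data, and the code is not the authors' pipeline (which also includes a scanner facet):

```python
import numpy as np

# Simulate person x session x run data for one edge; the variance
# components used here are assumed values, not estimates from this study.
rng = np.random.default_rng(3)
n_p, n_s, n_r = 12, 4, 3
p_eff  = rng.normal(0, 1.0, (n_p, 1, 1))    # person (trait) effects
s_eff  = rng.normal(0, 0.3, (1, n_s, 1))    # session main effects
ps_eff = rng.normal(0, 0.8, (n_p, n_s, 1))  # person-by-session interaction
y = p_eff + s_eff + ps_eff + rng.normal(0, 1.2, (n_p, n_s, n_r))

# Mean squares from the two-way random-effects ANOVA with replication.
grand = y.mean()
mp, ms, mps = y.mean((1, 2)), y.mean((0, 2)), y.mean(2)
ms_p  = n_s * n_r * ((mp - grand) ** 2).sum() / (n_p - 1)
ms_s  = n_p * n_r * ((ms - grand) ** 2).sum() / (n_s - 1)
ms_ps = n_r * ((mps - mp[:, None] - ms[None, :] + grand) ** 2).sum() \
        / ((n_p - 1) * (n_s - 1))
ms_e  = ((y - mps[:, :, None]) ** 2).sum() / (n_p * n_s * (n_r - 1))

# Expected-mean-squares estimators of the variance components.
var_e  = ms_e
var_ps = max((ms_ps - ms_e) / n_r, 0)
var_p  = max((ms_p - ms_ps) / (n_s * n_r), 0)
var_s  = max((ms_s - ms_ps) / (n_p * n_r), 0)
total  = var_p + var_s + var_ps + var_e
print(f"person {var_p/total:.0%}, session {var_s/total:.0%}, "
      f"person-x-session {var_ps/total:.0%}, residual {var_e/total:.0%}")
```

In this framework, a large residual share corresponds to run-to-run variability unexplained by person, session, or their interaction, which is the quantity discussed above.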
The Most Reliable Data is not Necessarily the Most Useful Data
Potentially our most important result is that higher test–retest reliability is not
meaningfully associated with higher data utility. We observed high discriminability in the
absence of good test–retest reliability, suggesting that meaningful information unique to
each individual can be captured by data with relatively low test–retest reliability.
Similarly, we showed that an edge's test–retest reliability does not account for its
usefulness in predicting behavior. Since the test–retest reliability of a measure
establishes the upper limit of its predictive validity (Carmines and Zeller 1979), it is tempting to incorporate
test–retest reliability information into analytical procedures to improve utility (Strother et al. 2004; Shou et al. 2014; Mueller
et al. 2015; Shirer et al. 2015).
However, this may not ultimately facilitate the development of predictive biomarkers of
behavior. While it is possible that such procedures can support utility for specific
behavioral traits or under specific conditions, benefits must be determined on a
case-by-case basis. Others have also shown a disconnect between the processing procedures
that optimize for test–retest reliability versus those that optimize for prediction of
behavioral traits (Strother et al. 2004;
Shirer et al. 2015). Altogether, these
results suggest that optimizing data acquisition and analysis only with respect to
test–retest reliability, while ignoring other metrics closer to utility, may not provide
the most meaningful results.
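For reference, the upper limit invoked above is the classical attenuation inequality from psychometrics (our notation, not a formula reproduced from this paper):

```latex
r_{xy} \leq \sqrt{r_{xx}\, r_{yy}}
```

where r_{xy} is the observed correlation between a measure and a criterion and r_{xx}, r_{yy} are their respective reliabilities; low reliability bounds predictive validity but does not determine it, consistent with the dissociation reported here.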
Limitations
The statistical models used here are designed to maximize generalizability to a healthy,
post-adolescent population. Generalizability to different populations is less certain, as
others have demonstrated (Somandepalli et al.
2015). Additionally, differences in acquisition and processing procedures
impact test–retest reliability (Zuo et al.
2013; Aurich et al. 2015; Shirer et al. 2015; Varikuti et al. 2017). For example, increasing the temporal
resolution of the data has also been shown to increase test–retest reliability (Birn et al. 2013; Zuo and Xing 2014; Shah et
al. 2016). Notably, global signal regression has heterogeneous effects on
test–retest reliability of functional connectivity (Shirer et al. 2015; Varikuti et al.
2017), complicated by the high test–retest reliability of motion itself (Zuo and Xing 2014). This study is therefore most
relevant to others using this common denoising technique (cf. Power et al. 2015 for a more complete discussion of motion
denoising methods). Finally, test–retest reliability may be related in different ways to
other models and measures of behavior. While these considerations are not expected to
impact the generalizability of the central conclusions presented here—namely, that
test–retest reliability improves with more data and can be distinct from the utility of
the data—it is important to try to parse the effects of population, acquisition,
processing, and analysis when characterizing test–retest reliability (cf. Poldrack et al. 2017).
Conclusion
In conclusion, this work helps to address several open questions in characterizing the test–retest reliability of resting-state functional connectivity. Our results support the need for more data per subject to improve test–retest reliability and show the spatial distribution of test–retest reliability of connectivity across the whole brain. Significantly, our results
are among the first to highlight the increase in test–retest reliability when treating the
connectivity matrix as a multivariate object and the dissociation between test–retest
reliability and behavioral utility. As standards for the acquisition and analysis of
functional connectivity data continue to be developed (Nichols et al. 2017), these results provide important considerations for
establishing best practices in the field.