
Comparison of traveling-subject and ComBat harmonization methods for assessing structural brain characteristics.

Norihide Maikusa^1,2, Yinghan Zhu¹, Akiko Uematsu¹, Ayumu Yamashita³, Kousaku Saotome¹, Naohiro Okada^4,5, Kiyoto Kasai^4,5,6,7, Kazuo Okanoya^1,4,6,7, Okito Yamashita^3,8, Saori C Tanaka³, Shinsuke Koike^1,4,6,7.

Abstract

Multisite magnetic resonance imaging (MRI) is increasingly used in clinical research and development. Measurement biases, caused by site differences in scanner/image-acquisition protocols, negatively influence the reliability and reproducibility of image-analysis methods. Harmonization can reduce bias and improve the reproducibility of multisite datasets. Herein, a traveling-subject (TS) dataset including 56 T1-weighted MRI scans of 20 healthy participants in three different MRI procedures (20, 19, and 17 subjects in Procedures 1, 2, and 3, respectively) was used to compare the reproducibility of the TS-GLM, ComBat, and TS-ComBat harmonization methods. Cohen's d between different MRI procedures was evaluated as a measurement-bias indicator. The measurement-bias reduction realized with the different methods was evaluated by comparison with test-retest scans of 20 healthy participants, and the minimum subject count required for harmonization was determined from the same comparison. The results revealed that TS-GLM and TS-ComBat reduced measurement bias by up to 85.0% and 81.3%, respectively. Meanwhile, ComBat showed a reduction of only 59.0%. At least 6 TSs were required to harmonize data obtained from different MRI scanners that complied with the imaging protocol predetermined for multisite investigations and were operated with similar scan parameters. The results indicate that TS-based harmonization outperforms ComBat for measurement-bias reduction and is optimal for MRI data in well-prepared multisite investigations. One drawback is the small sample size used, which potentially limits the applicability of ComBat. Investigating the number of subjects needed for a large-scale study is an interesting future problem.


Keywords: ComBat; FreeSurfer; MRI; harmonization; multisite; traveling subject


Year: 2021 PMID： 34402132 PMCID： PMC8519865 DOI： 10.1002/hbm.25615

Source DB: PubMed Journal: Hum Brain Mapp ISSN： 1065-9471 Impact factor: 5.038

INTRODUCTION

The archiving and sharing of large‐scale clinical data from multisite brain magnetic resonance imaging (MRI) studies have gained considerable interest due to their potential to elucidate disease characteristics in the field of neuropsychiatric disorders. For example, the Global Alzheimer's Association Interactive Network project (Toga, Neu, Bhatt, Crawford, & Ashish, 2016), which aggregated data archives from 53 neuroimaging studies, such as the Alzheimer's Disease Neuroimaging Initiative (Jack et al., 2008), Open Access Series of Imaging Study (Marcus et al., 2007; Marcus, Fotenos, Csernansky, Morris, & Buckner, 2010), and Australian Imaging, Biomarker, & Lifestyle Flagship Study of Aging (Ellis et al., 2009), can be used to explore the data of more than 500,000 subjects. Other large‐scale multisite imaging cohorts include the Human Connectome Project (HCP; Glasser et al., 2013), Human Brain Project (Rose, 2014), and UK Biobank (Sudlow et al., 2015). Multisite clinical research on MRI data has been expanding worldwide in recent years. Multisite MRI data can cause nonbiological measurement biases resulting from the differences in the properties of MRI scanners and procedures. These differences include scanner manufacturer, image‐acquisition protocol, scanner coil, and field strength differences. Each of these can cause unwanted measurement biases that can negatively influence the reproducibility of the results and the ability to detect disease‐related changes (He, Byge, & Kennedy, 2020; Koike et al., 2021; Ma et al., 2019). In their initial multisite MRI research, Jack et al. (2008) standardized imaging protocols across sites in the ADNI project to reduce the effects of different imaging protocols on the MRI quality. Moreover, previous researchers have attempted to correct measurement biases using image pre‐processing methods to treat raw MRI data. 
For instance, Sled, Zijdenbos, and Evans (1998) proposed a method to rectify the MRI‐intensity inhomogeneity (also referred to as the “bias field”) using the nonparametric nonuniformity intensity normalization (N3) approach. Further, Tustison et al. (2010) proposed an improved version of N3 referred to as Nick's N3 (N4). N3 is built into the FreeSurfer pipeline as a preprocessing step. Janke, Zhao, Cowin, Galloway, and Doddrell (2004) developed a gradient nonlinearity correction software package called Gradunwarp to reduce the spatial‐distortion bias based on the spherical‐harmonic expansion, and Maikusa et al. (2013) proposed a phantom‐based distortion–correction method. Gradunwarp has been adopted by the ADNI project and HCP. The aforementioned attempts were made to standardize imaging protocols and image pre‐processing, but the measurement bias could not be eliminated (Beer et al., 2020). Therefore, sites were utilized as covariates in a general linear model (GLM) using dummy variables in previous studies (Fortin et al., 2017, 2018). However, this method is limited in that the effects of individual subjects and sites cannot be separated. Thus, the existing GLM methods remove biologically meaningful differences along with the biologically meaningless ones. Recently, several large‐scale investigations have considered the application of the meta‐analytic approach to multisite datasets (Koshiyama et al., 2020; Okada et al., 2016; Van Erp et al., 2018). However, this approach is limited in that (a) publication bias, wherein negative results are less likely to be published, skews the original data to be combined toward positive results; (b) the quality of brain images and clinical assessments varies significantly; (c) individual‐based statistics cannot be obtained; and (d) survey literature and/or records are missing. ComBat is an empirical Bayesian‐based harmonization method that was originally designed for genomic microarray data (Johnson, Li, & Rabinovic, 2007). 
Fortin et al. effectively harmonized fractional anisotropy and mean diffusivity data from diffusion tensor imaging using ComBat (Fortin et al., 2017) and estimated the cortical thickness (Fortin et al., 2018) using ComBat to improve the statistical and machine‐learning classification power (Radua et al., 2020). Nevertheless, site differences include measurement and sampling biases. A sampling bias is a difference in biological information (e.g., age, sex, and pathology) between sites and can affect MRI signals. Numerous subjects or sites are required to separate measurement bias from sampling bias; however, most multisite MRI studies have failed to perform this distinction because both bias types have been characterized based on their respective sites (Yamashita et al., 2019). To eliminate the effects of measurement bias, Yamashita et al. (2019) extended GLM harmonization using a TS dataset. The machine and protocol availabilities for each site can be inferred from TS data, and therefore, TS measurements facilitate advanced preparation for multisite projects. Thus, because TS data are free from sampling bias, they can be considered to differentiate between measurement and sampling biases. In addition, the TS‐GLM harmonization method can correct measurement bias and improve the signal‐to‐noise ratio of resting‐state functional connectivity data (Yamashita et al., 2019). Limited research has been conducted on the reproducibility of data obtained at different sites or by different scanners using the TS method. Furthermore, no researchers have compared the reproducibility of harmonized variables using the test–retest reproducibility, where only the sampling bias or number of subjects required for harmonization could be considered. A test–retest dataset provides variables without measurement or sampling bias. 
Given that a TS dataset requires many subjects and incurs high scanning and travel costs, it would be useful to define the minimum number of subjects required for harmonization. In this study, we evaluated the performance of three harmonization methods: TS‐GLM, ComBat, and TS‐ComBat (a combination of TS and ComBat). ComBat is a convenient method because it does not require additional scans and is adaptable to retrospective data. TS‐GLM, on the other hand, can only be applied to prospective data and has a high scanning cost (it requires traveling), but it is better able to separate biological bias from measurement bias (Yamashita et al., 2019). TS‐ComBat is similar in principle. Therefore, a quantitative comparison of these methods in terms of their ability to improve reproducibility will help determine whether TS, which has a high imaging cost, should be implemented. We assessed the abilities of the three methods to reduce the measurement bias and improve the reproducibility of T1‐weighted MRI scans for the same subject. Furthermore, by evaluating the differences between the test and retest results, we determined the minimum number of subjects required to achieve reproducibility.

MATERIALS AND METHODS

TS and test–retest datasets

We evaluated the effects of scan‐procedure differences on the structural characteristics of T1‐weighted brain images taken from 20 healthy control participants. We used two 3‐T scanners and three procedures to acquire data (Table 1). In Procedure 1, a Philips Achieva with an 8‐channel head coil was used; in Procedure 2, a Siemens Prisma with a 64‐channel head coil was used; and in Procedure 3, the same Siemens Prisma with a 32‐channel head coil was used. The protocols of Procedures 1 and 2 were determined according to previous Japanese multisite projects and had similar scan parameters (Koizumi et al., 2016; Taschereau‐Dumouchel et al., 2018; Yamada et al., 2017; Yamashita et al., 2019; Yamashita, Hayasaka, Kawato, & Imamizu, 2017). The protocol of Procedure 3 was the same as that provided by the HCP (Glasser et al., 2013).

TABLE 1

Scanner information and demographics of participants

	Procedure 1	Procedure 2	Procedure 3
Manufacturer	PHILIPS	SIEMENS	SIEMENS
Scanner model	Achieva	Prisma	Prisma
Head coil (ch)	8	64	32
Repetition time (ms)	7	1,900	2,400
Echo time (ms)	3.17	2.53	2.22
In‐plane resolution (mm²)	1.0 × 1.0	1.0 × 1.0	0.8 × 0.8
Matrix size	256 × 256	256 × 256	256 × 240
Slice thickness (mm)	1.2	1.2	0.8
Slice direction	AP	AP	AP
Slice orientation	Sagittal	Sagittal	Sagittal
Pulse sequence	MPRAGE	MPRAGE	MPRAGE
Flip angle (°)	9	9	8
Number of participants	20	19	17

Because one subject was missing in Procedure 2 and three other subjects were missing in Procedure 3, 56 measurements were performed in total: 20, 19, and 17 measurements in Procedures 1, 2, and 3, respectively. The control participants included 7 women and 13 men, with a mean [SD] age of 24.3 [6.56] years, mean [SD] height of 168.1 [6.58] cm, and mean [SD] weight of 61.1 [10.4] kg at the first measurement. The median duration of the three scans was 22 days (range = 0–448 days). More detailed information, that is, age, sex, height, weight, and scan duration for each subject, can be found in Table S1. To compare the harmonization methods in terms of their test–retest result reproducibility, we performed TS‐independent scans of 40 images of 20 healthy participants (11 women and 9 men; mean [SD] age of 15.4 [0.42] years, height of 163.3 [7.03] cm, and weight of 52.1 [7.29] kg) with Procedure 3, considering a median interval of 2.5 days (range = 1–54 days) between successive scans. The preliminary analysis results reveal the Cohen's d between the test and retest datasets to be quite high for long intervals. This study was approved by the Ethics Committee at the University of Tokyo (Approval No. 19‐298), and all the participants provided informed consent to participate in this study prior to performing the initial measurement.

Image pre‐processing

To extract the cortical and subcortical volumes and cortical thickness, we used FreeSurfer software (version 6.0) (Dale, Fischl, & Sereno, 1999; Fischl, 2012; Fischl et al., 2002, 2004; Fischl & Dale, 2000; Fischl, Sereno, & Dale, 1999) with a CentOS PC and the “recon‐all” pipeline with the default parameters. The FreeSurfer pipeline performs N3 (Sled et al., 1998) as part of the pre‐processing to minimize the effect of intensity inhomogeneity. We obtained the cortical volume and thickness from the rh.aparc.a2009s.stats and lh.aparc.a2009s.stats files, derived from “Destrieux Atlas,” and included 74 anatomical cortical regions in each hemisphere. For the subcortical volume, we used the aseg.stats file, which included 41 subcortical anatomical regions (the left‐WM‐hypointensities, right‐WM‐hypointensities, left‐non‐WM‐hypointensities, and right‐non‐WM‐hypointensities were excluded because their values were zero). The cortical volume, cortical thickness, and subcortical volume were used to assess the reduction in measurement bias using the three kinds of harmonization.
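As an illustration of the extraction step, the following is a minimal Python sketch of reading regional gray-matter volume and mean thickness from the text of an aparc-style stats file. It assumes the usual layout of such files (comment lines starting with "#", one "# ColHeaders" line naming the columns, then whitespace-separated rows); the function name parse_aparc_stats and the numbers in SAMPLE are hypothetical and purely illustrative, so the layout should be checked against real recon-all output before use.

```python
def parse_aparc_stats(text, columns=("GrayVol", "ThickAvg")):
    """Return {region: {column: value}} from the text of an aparc-style stats file.

    Assumed layout: '#' comment lines, one '# ColHeaders ...' line that names
    the columns, then one whitespace-separated row per anatomical region.
    """
    header = None
    table = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("#"):
            parts = line.lstrip("#").split()
            if parts and parts[0] == "ColHeaders":
                header = parts[1:]  # column names, in file order
            continue
        if header is None:
            raise ValueError("no '# ColHeaders' line found before the data rows")
        row = dict(zip(header, line.split()))
        table[row["StructName"]] = {c: float(row[c]) for c in columns}
    return table


# Made-up rows for illustration only; these are not values from the study.
SAMPLE = """\
# Table of FreeSurfer cortical parcellation anatomical statistics
# ColHeaders StructName NumVert SurfArea GrayVol ThickAvg ThickStd MeanCurv GausCurv FoldInd CurvInd
G_and_S_frontomargin   1200  800   2100  2.45  0.50  0.130  0.030  15  1.5
G_and_S_occipital_inf  1500  1000  3000  2.60  0.55  0.140  0.035  20  2.0
"""

regions = parse_aparc_stats(SAMPLE)
print(regions["G_and_S_frontomargin"]["ThickAvg"])
```

Because the parser reads column names from the header line rather than fixed positions, the same sketch could in principle be pointed at other stats files by changing the columns argument, provided they carry a StructName column.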

Harmonization methods

In this study, y(i, j, v) was the vth FreeSurfer variable, that is, the cortical thickness, cortical volume, or subcortical volume within an arbitrary anatomical label, for imaging procedure i and the jth subject; k was the number of procedures; and n was the total number of traveling subjects. The harmonization methods considered in this study are described as follows.

ComBat harmonization

ComBat is a tool that was initially developed to correct the batch effect in genomics (Johnson et al., 2007) and has more recently been applied to MRI datasets (Fortin et al., 2017, 2018). ComBat corrects multivariate datasets using an empirical Bayesian estimation approach and can be used to analyze datasets obtained through different scanning procedures. The ComBat model can be described as

$$y(i, j, v) = \alpha(v) + X(i, j)^{\top}\beta(v) + \gamma(i, v) + \delta(i, v)\,\varepsilon(i, j, v),$$

where α(v) is the average anatomical volume at the reference site within the vth anatomical variable, β(v) is the p × 1 vector of coefficients associated with the design matrix of biological covariates of interest (age, sex, weight, and height in this study), X(i, j) is the design matrix of the vth anatomical variable, and p is the number of biological covariates. ε(i, j, v) is the error term, following a normal distribution with a mean of zero and a variance of σ²(v). The terms γ(i, v) and δ(i, v) represent the additive and multiplicative site effects of procedure i on the vth anatomical volume or thickness, respectively. In ComBat harmonization, an empirical Bayesian framework is used to estimate γ*(i, v) and δ*(i, v). The final ComBat harmonized values can be expressed as

$$y^{\text{ComBat}}(i, j, v) = \frac{y(i, j, v) - \hat{\alpha}(v) - X(i, j)^{\top}\hat{\beta}(v) - \gamma^{*}(i, v)}{\delta^{*}(i, v)} + \hat{\alpha}(v) + X(i, j)^{\top}\hat{\beta}(v),$$

where $\hat{\beta}(v)$ and $\hat{\alpha}(v)$ represent the estimated coefficients associated with the biological covariates of interest and the estimated population mean of the vth anatomical variable, respectively.
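To make the additive and multiplicative site terms concrete, the following Python sketch applies a plain location-and-scale adjustment to a single anatomical variable. It is a deliberately simplified stand-in, not ComBat itself: it assumes no biological covariates, and it uses raw per-site estimates where ComBat would use the empirical Bayes estimates γ* and δ*. The function name location_scale_adjust and the example values are hypothetical.

```python
from math import sqrt
from statistics import fmean, pstdev


def location_scale_adjust(values_by_site):
    """Remove per-site shift and scale for ONE variable (no covariates, no shrinkage).

    values_by_site: dict mapping a site/procedure label to a list of measurements.
    Each site needs at least two distinct values so that its SD is nonzero.
    """
    all_values = [y for ys in values_by_site.values() for y in ys]
    alpha_hat = fmean(all_values)  # overall mean, standing in for alpha(v)
    # Residual variance pooled over sites (population form, for simplicity).
    pooled_var = sum(len(ys) * pstdev(ys) ** 2 for ys in values_by_site.values())
    pooled_sd = sqrt(pooled_var / len(all_values))

    harmonized = {}
    for site, ys in values_by_site.items():
        gamma_hat = fmean(ys) - alpha_hat   # additive site effect (raw estimate)
        delta_hat = pstdev(ys) / pooled_sd  # multiplicative site effect (raw estimate)
        harmonized[site] = [
            (y - alpha_hat - gamma_hat) / delta_hat + alpha_hat for y in ys
        ]
    return harmonized


# Hypothetical thickness values (mm) for two procedures.
data = {"P1": [2.0, 2.2, 2.4], "P2": [2.5, 2.9, 3.3]}
adjusted = location_scale_adjust(data)
print({site: round(fmean(ys), 3) for site, ys in adjusted.items()})
```

After the adjustment every site shares the overall mean and the pooled spread. The empirical Bayes step that this sketch omits shrinks the per-site estimates toward a common prior across variables, which is intended to stabilize them when sites are small.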

TS‐GLM harmonization

The use of GLM is the most basic approach to remove the site effects. We followed the TS‐GLM harmonization method reported by Yamashita et al. (2019), which extends the GLM harmonization model using a TS dataset. The TS‐GLM harmonization model can be described as follows:

$$y(i, j, v) = x_p(i, j)^{\top} p(v) + x_m(i, j)^{\top} m(v) + \varepsilon(i, j, v),$$

where p(v) represents the participant factor and x_p(i, j) is the n × 1 vector of the participant indicator. m(v) represents the coefficient of the site factor, namely, the measurement bias, and x_m(i, j) is the k × 1 vector of the site indicator. To estimate the respective parameters, we calculated the inverse matrix for x_p(i, j) and x_m(i, j). In this study, all the subjects were healthy and identical at each site; thus, the sampling bias was not considered. However, the design matrix of the GLM was rank‐deficient; thus, we used the Moore–Penrose pseudoinverse (the “pinv” function in MATLAB R2016b) to estimate $\hat{p}(v)$ and $\hat{m}(v)$. After estimating $\hat{m}(v)$, the TS‐GLM harmonized anatomical volumes and thicknesses were set as follows:

$$y^{\text{TS-GLM}}(i, j, v) = y(i, j, v) - x_m(i, j)^{\top}\hat{m}(v).$$
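The idea of estimating a per-procedure measurement bias from subjects who are scanned everywhere can be sketched in a few lines of Python. This sketch is not the pseudoinverse-based estimate described above: it swaps in a sum-to-zero constraint on the site effects to resolve the rank deficiency, and it assumes a complete design in which every traveling subject has a value for every procedure. Under those assumptions the differences between the estimated site effects are the same as in any least-squares solution of the additive model, but the overall level of the harmonized values may differ. The function name and data are hypothetical.

```python
from statistics import fmean


def ts_glm_sum_to_zero(y):
    """Estimate and remove site effects in a complete traveling-subject design.

    y[i][j] is the value of one anatomical variable for procedure i and
    traveling subject j. Site effects are constrained to sum to zero
    (an assumption of this sketch, not the pseudoinverse solution).
    """
    grand_mean = fmean([value for row in y for value in row])
    # With identical subjects at every site, a site's mean deviation from the
    # grand mean is attributable to measurement bias alone.
    site_bias = [fmean(row) - grand_mean for row in y]
    harmonized = [[value - bias for value in row] for row, bias in zip(y, site_bias)]
    return site_bias, harmonized


# Hypothetical values: two procedures, three traveling subjects.
y = [
    [2.0, 3.0, 4.0],  # procedure 1
    [2.5, 3.5, 4.5],  # procedure 2, reading 0.5 higher for every subject
]
site_bias, harmonized = ts_glm_sum_to_zero(y)
print(site_bias)
print(harmonized)
```

In this toy case each subject's harmonized values coincide across procedures, because the only difference between the rows was a constant offset. With missing scans, as in the study's dataset, the design is unbalanced and a least-squares solver would be needed instead of simple means.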

TS‐ComBat harmonization

We also extended the ComBat harmonization model to a TS dataset. Conventional ComBat uses the covariates X(i, j), such as age and sex, to exclude individual effects from the measurement values, but in this model, individual effects are estimated from the traveling subjects, as in TS‐GLM. This approach enables TS‐ComBat to fully control sampling bias, like TS‐GLM. TS‐ComBat is defined as follows:

$$y(i, j, v) = x_p(i, j)^{\top} p(v) + \gamma(i, v) + \delta(i, v)\,\varepsilon(i, j, v).$$

Thus, the TS‐ComBat harmonized volumes can be set as follows:

$$y^{\text{TS-ComBat}}(i, j, v) = \frac{y(i, j, v) - x_p(i, j)^{\top}\hat{p}(v) - \gamma^{*}(i, v)}{\delta^{*}(i, v)} + x_p(i, j)^{\top}\hat{p}(v).$$

ComBat, TS‐GLM, and TS‐ComBat all assume a normal distribution. We used the Kolmogorov–Smirnov test (KS‐test) on all input data to check the normality assumption.

Evaluation metrics

To investigate and compare the reproducibility of the different procedures before and after the implementation of these harmonization methods, we computed the Cohen's d effect size between MRI procedures as the metric of the measurement bias, or reproducibility, of the anatomical variables (cortical volume/thickness and subcortical volume):

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s}, \qquad s = \sqrt{\frac{(n_1 - 1)s_1^{2} + (n_2 - 1)s_2^{2}}{n_1 + n_2 - 2}}.$$

In the above expression, n_1 and n_2 denote the numbers of subjects in groups 1 and 2, respectively, and $\bar{x}_1$ and $\bar{x}_2$ and s_1 and s_2 denote the average and SD of each variable in groups 1 and 2, respectively. In this study, Cohen's d was calculated between Procedures 1 and 2, 1 and 3, and 2 and 3 for each FreeSurfer variable, that is, cortical thickness, cortical volume, and subcortical volume within the brain region. If there is no difference between the procedures, Cohen's d equals zero.
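The effect size can be computed directly from two lists of measurements. The Python sketch below assumes the standard pooled-SD form of Cohen's d, built from the group sizes, means, and SDs named in the text; the function name and the example numbers are hypothetical.

```python
from math import sqrt
from statistics import fmean, stdev


def cohens_d(group1, group2):
    """Cohen's d with a pooled SD; each group needs at least two values."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = stdev(group1), stdev(group2)  # sample SDs (n - 1 denominator)
    pooled_sd = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (fmean(group1) - fmean(group2)) / pooled_sd


# Hypothetical values of one variable under two procedures.
procedure_a = [1.0, 2.0, 3.0]
procedure_b = [2.0, 3.0, 4.0]
print(cohens_d(procedure_a, procedure_b))  # the means differ by exactly one pooled SD
```

The sign only reflects which procedure is listed first; when effect sizes are averaged over regions as a bias indicator, taking the absolute value first would be a natural choice, although that is an assumption rather than something stated in the text.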

Statistical analyses

Comparison of three harmonization methods

To explore the effects of the harmonization methods on reproducibility, we employed a general linear mixed model (GLMM) with Cohen's d as the dependent variable, the procedure and harmonization method as independent variables, and the anatomical structures as within‐subject variables. In this manner, we investigated whether each harmonization method improved the Cohen's d between Procedures 1 and 2, 1 and 3, and 2 and 3 with respect to the corresponding raw values. Furthermore, by comparison with the Cohen's d between test and retest, we investigated whether each harmonization method achieved the same reproducibility as the test–retest dataset. To assess the potential differences in the associations between the dependent and independent variables among the anatomical structures, we set a random effect for the intercept of anatomical structures. The GLMMs were estimated using the “lmer” function in the “lmerTest” package for R, version 3.1.2. A p value <.05 was considered significant. The Bonferroni correction for multiple comparisons to control the familywise error (FWE) was used for post‐hoc analyses (FWE‐corrected p = .05/3 = .0167). Next, we tested the differences in Cohen's d between the TS dataset (no harmonization, TS‐GLM, ComBat, and TS‐ComBat) and the test–retest dataset using a two‐sample t‐test. The Bonferroni correction was also applied (FWE‐corrected p = .05/4 = .0125). To test the effect of scan duration on the harmonization, we obtained the effect size, that is, Cohen's d, for cortical thickness and cortical/subcortical volume across MRI procedures, and then performed a repeated‐measures analysis of variance (ANOVA) with Cohen's d as the dependent variable, scan duration as an independent variable, and subject as a within‐subject factor.

Minimum number of participants for TS harmonization

We resampled s subjects from all S TSs, using all $\binom{S}{s}$ possible combinations, and calculated Cohen's d as a function of s, that is, d(s), after ComBat, TS‐GLM, and TS‐ComBat harmonization. Subsequently, we performed a two‐sample t‐test to compare the values of d(s) with those obtained for the test and retest scans. We defined the minimum number of subjects required for TS as the minimum s for which d(s) no longer differed significantly from the test–retest value. We performed a preliminary assessment with the KS‐test to ensure that all the sampled data followed a normal distribution. We applied the Benjamini–Hochberg procedure to control the false discovery rate (FDR) of a family of hypotheses (q < 0.05) because the Bonferroni correction is conservative and, therefore, could overestimate the required number of TSs.
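The resampling and the multiple-comparison control can be sketched in Python as follows. The sketch enumerates every subset of s subjects, computes the effect size on each subset, and averages the absolute values; it also includes a plain Benjamini–Hochberg step. Several details are assumptions made only for illustration: the pooled-SD form of Cohen's d, the use of absolute values, the optional harmonize callback (a hypothetical hook for refitting a harmonization method on each subset), and all names and numbers. For scale, choosing 6 of 20 subjects already gives 38,760 subsets.

```python
from itertools import combinations
from math import comb, sqrt
from statistics import fmean, stdev


def cohens_d(group1, group2):
    """Cohen's d with a pooled SD (assumed form); needs a nonzero pooled SD."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * stdev(group1) ** 2 + (n2 - 1) * stdev(group2) ** 2) / (n1 + n2 - 2)
    return (fmean(group1) - fmean(group2)) / sqrt(pooled_var)


def mean_abs_d(proc_a, proc_b, s, harmonize=None):
    """Average |d| over all subsets of s subjects scanned in both procedures.

    proc_a, proc_b: dicts mapping subject id -> value for one variable.
    harmonize: optional callable (a, b) -> (a, b) applied to each subset.
    Returns (mean |d|, number of subsets); requires s >= 2.
    """
    subjects = sorted(set(proc_a) & set(proc_b))
    d_values = []
    for subset in combinations(subjects, s):
        a = [proc_a[j] for j in subset]
        b = [proc_b[j] for j in subset]
        if harmonize is not None:
            a, b = harmonize(a, b)
        d_values.append(abs(cohens_d(a, b)))
    return fmean(d_values), len(d_values)


def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank  # largest rank that passes its threshold
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical data: four subjects, procedure B reads 0.5 higher.
proc_a = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0}
proc_b = {1: 1.5, 2: 2.5, 3: 3.5, 4: 4.5}
mean_d, n_subsets = mean_abs_d(proc_a, proc_b, s=3)
print(n_subsets, comb(20, 6))
print(benjamini_hochberg([0.001, 0.02, 0.04, 0.30]))
```

Exhaustive enumeration grows quickly with S, so a larger study might need random subsampling instead; that would be a design change relative to the all-combinations approach described in the text.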

RESULTS

Evaluation of three harmonization methods

The GLMM showed that all harmonization methods significantly reduced Cohen's d across all the procedures (p < .001, Figure 1). ComBat harmonization reduced the averaged Cohen's d for each FreeSurfer variable, namely, the cortical thickness, cortical volume, and subcortical volume, in the corresponding brain region by 59.0, 29.1, and 40.1% when comparing Procedures 1 and 2, 2 and 3, and 1 and 3, respectively. Similarly, TS‐GLM and TS‐ComBat reduced the averaged Cohen's d by 85.0%, 50.0%, and 68.5% and 81.3%, 48.1%, and 65.6% in the three above‐mentioned comparisons, respectively.

FIGURE 1

Bee‐swarm plots for Cohen's d values before and after harmonization. Cohen's d values were derived from comparison of (a) Procedures 1 and 2, (b) Procedures 2 and 3, and (c) Procedures 1 and 3. The test–retest results have been plotted in all the subplots for comparison. The colored line indicates Cohen's d of an arbitrary FreeSurfer anatomical label between procedures.

Cohen's d before harmonization was significantly greater than the test–retest difference. Meanwhile, Cohen's d after ComBat harmonization was significantly greater than the test–retest difference when comparing Procedures 2 and 3 (0.187 [0.0191] vs. the test–retest effect size, FWE‐corrected p < .001) and Procedures 1 and 3, but no significant difference was found when comparing Procedures 1 and 2. The TS‐GLM and TS‐ComBat harmonization methods did not differ significantly from the test–retest reproducibility when comparing Procedures 2 and 3; the values of Cohen's d after TS‐GLM and TS‐ComBat were significantly smaller than the test–retest value when comparing Procedures 1 and 2 (FWE‐corrected ps < .0001) and Procedures 1 and 3 (FWE‐corrected ps < .005). Repeated‐measures ANOVA did not show a significant main effect of scan duration on Cohen's d for cortical thickness (F[1, 45] = 0.508, p = .480) or subcortical/cortical volume (F[1, 45] = 1.130, p = .293). The mean Cohen's d for the maximum scan duration (448 days) was 0.01.

Spatial distribution of the different procedures

There was a trend toward a higher averaged Cohen's d in the medial prefrontal cortex and inferior occipital cortex for both volume and thickness; specifically, the right medial orbital sulcus had the largest Cohen's d before harmonization (Figure 2). After ComBat harmonization, the Cohen's d values corresponding to the volume and thickness of the medial prefrontal cortex were reduced but still exceeded those observed in other regions. After applying the TS‐GLM and TS‐ComBat harmonization methods, Cohen's d values were lower in all the cortical and subcortical regions.

FIGURE 2

Averaged Cohen's d maps overlaid on aparc.a2009s + aseg.mgz file. The upper and lower rows show sagittal and coronal images, respectively. The columns indicate raw, ComBat, TS‐GLM, TS‐ComBat, and test–retest results obtained using each harmonization method. Cohen's d values were calculated from (a) the cortical and subcortical volumes and (b) the cortical thickness.

The spatial distributions of Cohen's d between Procedures 1 and 2, 1 and 3, and 2 and 3 are shown in Figures S1, S2, and S3, respectively. Irrespective of the procedures, a high Cohen's d was consistently observed around the medial prefrontal cortex before harmonization, which was well corrected by TS‐GLM and TS‐ComBat. ComBat had a moderate effect on harmonization; in other words, ComBat could not remove the strong site effect around the medial frontal cortex when comparing Procedures 1 and 3 (Figure S2).

Minimum number of participants required for harmonization

After TS‐GLM harmonization, Cohen's d between Procedures 1 and 2 was not significantly different from the test–retest difference when the number of TSs was at least 6 (FDR‐corrected p > .05). Thus, the minimum number of subjects required was 6 for the comparison of Procedures 1 and 2, which involved different MRI scanners with a similar MRI protocol (SRPB) (Figure 3a). Similarly, with TS‐ComBat harmonization, the minimum number of TSs was 13 for this comparison. Furthermore, the minimum number of TSs was 12 for TS‐GLM and 14 for TS‐ComBat for the comparison of Procedures 2 and 3, which involved the same MRI scanner but different MRI protocols, that is, SRPB and CRHD (Figure 3b). In addition, for the comparison of Procedures 1 and 3, which involved different MRI scanners and protocols, the minimum number of TSs for TS‐GLM was 19; however, the Cohen's d value after TS‐ComBat harmonization remained significantly higher than the test–retest difference (Figure 3c). In contrast, ComBat harmonization consistently showed significantly higher Cohen's d values than the test–retest differences (FDR‐corrected p < .05), regardless of the scanning procedure.

FIGURE 3

Average Cohen's d according to the number of resampled subjects. Cohen's d as a function of s, the number of subjects resampled, from the comparison of (a) Procedures 1 and 2, (b) Procedures 2 and 3, and (c) Procedures 1 and 3.

DISCUSSION

We compared the three harmonization methods with three measurement procedures using the TS dataset, as well as a test–retest dataset. Although considerable measurement bias was confirmed prior to harmonization, the results after TS‐based harmonization (TS‐GLM and TS‐ComBat) were comparable to the test–retest results and, hence, reproducible. Because the test–retest dataset did not have measurement bias, Cohen's d was expected to be zero, but the actual results were different. We suspect that additional factors affected the reproducibility, such as image‐analysis error, individual errors, and measurement bias. The advantage of the TS‐GLM and TS‐ComBat methods is that they can harmonize these factors without modeling them explicitly; therefore, these TS‐based harmonization methods showed reproducibility at least as good as that of the test–retest case (Figure 1). In contrast, ComBat harmonization yielded a Cohen's d higher than the test–retest difference. This indicates that the biological covariates used in this study were not sufficient to estimate individual effects and that the measurement bias and individual effects could not be separated. When we focused on the spatial distribution of Cohen's d, a greater measurement bias was found in the medial prefrontal cortex. The results indicate that the measurement bias between the procedures has a moderate effect size, irrespective of procedure differences and structural characteristics. Larger effect sizes were observed in the ventral and medial parts of the frontal cortex (i.e., medial prefrontal cortex). The results agree with those of previous studies; the specific bias in this region coincides with the location of high geometric distortion (Li, Williams, Frisk, Arnold, & Smith, 1995; Maikusa et al., 2013). According to Li et al. 
(1995), the boundary of the nasal cavity below the medial prefrontal cortex induces strong geometrical distortion (i.e., measurement bias) in these areas. Although harmonization methods cannot provide information on these characteristics, they can harmonize and statistically correct differences between scanners without the need to know the details of these characteristics. This is the advantage of TS‐based harmonization methods. We defined the minimum TS sample size as the smallest sample for which the harmonized reproducibility did not differ significantly from the test–retest reproducibility. For TS‐GLM, it was 6 with different MRI scanners but similar protocols (Procedures 1 and 2). TS‐GLM required a minimum TS sample size of 12 and 19 when comparing Procedures 2 and 3 and Procedures 1 and 3, respectively. To the best of our knowledge, this study was the first to compare the effectiveness of harmonization methods and to suggest a sample‐size requirement for TS‐based harmonization. Furthermore, the procedure comparisons, in increasing order of minimum TS sample size, were Procedures 1 versus 2, Procedures 2 versus 3, and Procedures 1 versus 3, whereas the comparisons in decreasing order of the reduction rate of Cohen's d after harmonization (Section 3.1) were Procedures 1 versus 2, Procedures 1 versus 3, and Procedures 2 versus 3; that is, the comparison that required the fewest TSs (Procedures 1 versus 2) also showed the largest reduction. The measurement bias could not be fully corrected when ComBat harmonization was used, perhaps because ComBat cannot reach test–retest reproducibility with 20 subjects or fewer. In contrast, both TS‐GLM and TS‐ComBat successfully corrected the measurement bias, irrespective of procedural and brain region differences. A previous multisite fMRI study revealed the well‐harmonized factors from a functional connectivity matrix using TS‐GLM harmonization (Yamashita et al., 2019), suggesting the applicability of this method to the structural characteristics of the brain. 
Although TS‐based harmonization methods decreased the measurement bias, obtaining a TS dataset requires considerable effort in TS recruitment and in scheduling the scans within a short period. Thus, determining the required number of TSs will help minimize the use of resources. As observed, although the highest number of subjects was required to compare Procedures 1 and 3, that is, different MRI scanners and protocols, only six TSs were required to compare Procedures 1 and 2, which involved different MRI scanners but similar scan parameters predetermined for a multisite investigation (Koizumi et al., 2016; Taschereau‐Dumouchel et al., 2018; Yamada et al., 2017; Yamashita et al., 2017, 2019). The findings suggest that the required number of subjects varies depending on the procedure, and they highlight the importance of unifying imaging protocols when using MRI data obtained from different vendors and MRI scanners in multisite studies. It is considered ideal to scan up to 20 TSs; however, the operational costs increase with increasing site count, and it is difficult for participants to travel to multiple sites. It is practical to change the TS count depending on the differences between scanner configurations. We believe that TS should be implemented for all scanner vendors and MRI protocols to investigate the relation between the minimal sample size and measurement bias. However, this is not realistic, because scanning traveling subjects incurs high cost. Therefore, our study involved a minimal sample size and limited scenarios, that is, two different scanners with similar imaging protocols (the SRPB protocol), different protocols on the same scanner (SRPB and CRHD protocols), and different scanners and protocols. 
Our study provides guidance on the minimum number of TSs for limited scenarios; in particular, it provides guidance for the scenario of different scanners with similar protocols, which is similar to the scenario in a recent multi‐site imaging study. In addition, we do not have sufficient longitudinal data to discuss whether the proposed harmonization methods are applicable to longitudinal data; we would like to examine this possibility with a new dataset in future work. Van Erp et al. (2018) reported that, when compared with healthy subjects, subjects with schizophrenia had lower thickness in the left and right cortices (Cohen's d = −0.530 and −0.516, respectively), as well as in the bilateral fusiform, temporal (inferior, middle, and superior), and left superior frontal gyri; right pars opercularis; and bilateral insula. Therefore, it is necessary to reduce the measurement bias so that it does not obscure the effect size of the target disease. Our results showed that the averaged Cohen's d (measurement bias) for whole‐brain cortical thickness in all scenarios was 0.259 before harmonization, which is approximately half the effect size reported for schizophrenia above, and TS‐GLM can reduce this value to 0.0710. We plan to consider the number of TSs required for the assumed effect sizes between different groups, such as disease and healthy control groups. Our study has some limitations. First, the same datasets were used for both harmonization and validation. Ideally, independent TS validation datasets should be utilized. Moreover, TS‐based harmonization is a method of estimating the measurement bias at the time of scanning, and the TS‐scanning intervals may affect the harmonization accuracy; this tendency was not fully investigated in this study. In addition, we verified ComBat harmonization for only 20 subjects, which may not represent larger imaging studies, and the small sample size may have prevented better ComBat estimations. 
A large‐scale TS project is currently underway, and we hope that it will enable more detailed validation and analyses in future work.

Second, we only investigated the cortical thickness and volume obtained from the FreeSurfer analysis. Although the initial TS‐GLM harmonization was confirmed using functional connectivity during a resting state (Yamashita et al., 2019), other modalities, such as other MRI sequences and positron‐emission tomography imaging, may exhibit different trends.

Third, the dropout of TSs caused an imbalance in the number of subjects between the scanning procedures. For example, the finding in Section 3.3 that the optimal number of TSs was 19 was a direct consequence of using 19 subjects in Procedure 1 and 17 subjects in Procedure 3. This imbalance in the TS count between procedures due to dropout has been identified and considered in similar extant studies.

Fourth, we did not investigate how long a TS‐scanning interval could be feasibly harmonized.

Fifth, our TS data have a wide range of scan intervals, which might have led to sampling bias caused by brain changes with normal aging. In this study, the minimum and maximum ages of the TSs were 20 and 40 years, respectively; brain changes due to aging are minimal in this age group. Therefore, we believe that brain changes are negligible, even though the scan intervals varied widely (maximum of 448 days). Repeated measures ANOVA for cortical thickness and subcortical/cortical volume did not show a significant effect of scan interval on intra‐subject Cohen's d. In fact, intra‐subject Cohen's d was 0.01 at the maximum scan interval of 448 days.

Finally, there is a risk that a harmonization method eliminates not only measurement bias but also biological information; in other words, it could remove sampling bias together with measurement bias. A harmonization method must therefore separate sampling bias from measurement bias, and this separation must be verified.
However, sampling bias could not arise in this study, because the same TSs were scanned at every site and therefore carried the same sampling bias across the sites. We would therefore like to verify this risk using another dataset in future work.

In conclusion, our study showed that TS‐based harmonization methods, namely, TS‐GLM and TS‐ComBat, outperform ComBat harmonization. Furthermore, we demonstrated that at least six subjects are required when the dataset is scanned using different scanners with a similar scanning procedure. As a future endeavor, we intend to undertake a large‐scale TS project to examine and resolve open problems in TS harmonization, such as the effect of the TS‐scanning interval and validation on an independent test set.
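The comparison above between measurement bias and the disease effect size can be worked through numerically. The following is a minimal sketch, not the analysis code used in the study: the `cohens_d` helper assumes the common pooled‐standard‐deviation definition (the exact formula used in the study is not restated in this section), and the names `cohens_d`, `percent_reduction`, `bias_raw`, `bias_ts_glm`, and `disease_d` are hypothetical. Only the three numerical values are taken from the text.

```python
import math
from statistics import mean, stdev


def cohens_d(a, b):
    """Cohen's d between two samples, using the pooled standard deviation.

    Assumption: this is the textbook two-sample definition; the study may
    have used a different variant.
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def percent_reduction(before, after):
    """Relative reduction of a bias measure, in percent."""
    return 100.0 * (before - after) / before


# Values quoted in the Discussion.
bias_raw = 0.259      # averaged Cohen's d before harmonization
bias_ts_glm = 0.0710  # averaged Cohen's d after TS-GLM
disease_d = 0.530     # magnitude of the left-cortex effect (Van Erp et al., 2018)

ratio_raw = bias_raw / disease_d            # bias relative to disease effect, raw
ratio_harmonized = bias_ts_glm / disease_d  # same ratio after TS-GLM
reduction = percent_reduction(bias_raw, bias_ts_glm)

print(f"raw bias / disease effect:        {ratio_raw:.3f}")
print(f"harmonized bias / disease effect: {ratio_harmonized:.3f}")
print(f"relative bias reduction:          {reduction:.1f}%")

# Toy illustration of cohens_d on made-up thickness values (not study data).
toy_d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
print(f"toy Cohen's d: {toy_d:.2f}")
```

With the quoted values, the raw measurement bias is roughly half of the disease effect size, and the residual bias after TS‐GLM is roughly 13% of it. These figures are averages for whole brain cortical thickness across all scenarios, so they are not directly comparable to the maximum reductions reported for individual comparisons.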

CONFLICT OF INTEREST

The authors declare no conflicts of interest.

SUPPORTING INFORMATION

Figure S1 Cohen's d maps in Procedures 1 vs. 2 overlaid on the aparc.a2009s + aseg.mgz file. The upper and lower rows show sagittal and coronal images, respectively. The columns indicate raw, ComBat, TS‐GLM, and TS‐ComBat results. Cohen's d values were calculated from (a) the cortical and subcortical volumes and (b) the cortical thickness.

Figure S2 Cohen's d maps in Procedures 1 vs. 3 overlaid on the aparc.a2009s + aseg.mgz file.

Figure S3 Cohen's d maps in Procedures 2 vs. 3 overlaid on the aparc.a2009s + aseg.mgz file.

Table S1 Demographics of all traveling subjects.

36 in total

1. Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain.

Authors: Bruce Fischl; David H Salat; Evelina Busa; Marilyn Albert; Megan Dieterich; Christian Haselgrove; Andre van der Kouwe; Ron Killiany; David Kennedy; Shuna Klaveness; Albert Montillo; Nikos Makris; Bruce Rosen; Anders M Dale
Journal: Neuron Date: 2002-01-31 Impact factor: 17.173

2. Adjusting batch effects in microarray expression data using empirical Bayes methods.

Authors: W Evan Johnson; Cheng Li; Ariel Rabinovic
Journal: Biostatistics Date: 2006-04-21 Impact factor: 5.899

Review 3. The Human Brain Project: social and ethical challenges.

Authors: Nikolas Rose
Journal: Neuron Date: 2014-06-18 Impact factor: 17.173

4. Harmonization of multi-site diffusion tensor imaging data.

Authors: Jean-Philippe Fortin; Drew Parker; Birkan Tunç; Takanori Watanabe; Mark A Elliott; Kosha Ruparel; David R Roalf; Theodore D Satterthwaite; Ruben C Gur; Raquel E Gur; Robert T Schultz; Ragini Verma; Russell T Shinohara
Journal: Neuroimage Date: 2017-08-18 Impact factor: 6.556

5. A computer simulation of the static magnetic field distribution in the human head.

Authors: S Li; G D Williams; T A Frisk; B W Arnold; M B Smith
Journal: Magn Reson Med Date: 1995-08 Impact factor: 4.668

6. Measuring the thickness of the human cerebral cortex from magnetic resonance images.

Authors: B Fischl; A M Dale
Journal: Proc Natl Acad Sci U S A Date: 2000-09-26 Impact factor: 11.205

7. Fear reduction without fear through reinforcement of neural activity that bypasses conscious exposure.

Authors: Ai Koizumi; Kaoru Amano; Aurelio Cortese; Kazuhisa Shibata; Wako Yoshida; Ben Seymour; Mitsuo Kawato; Hakwan Lau
Journal: Nat Hum Behav Date: 2016-11-21

8. Harmonization of resting-state functional MRI data across multiple imaging sites via the separation of site differences into sampling bias and measurement bias.

Authors: Ayumu Yamashita; Noriaki Yahata; Takashi Itahashi; Giuseppe Lisi; Takashi Yamada; Naho Ichikawa; Masahiro Takamura; Yujiro Yoshihara; Akira Kunimatsu; Naohiro Okada; Hirotaka Yamagata; Koji Matsuo; Ryuichiro Hashimoto; Go Okada; Yuki Sakai; Jun Morimoto; Jin Narumoto; Yasuhiro Shimada; Kiyoto Kasai; Nobumasa Kato; Hidehiko Takahashi; Yasumasa Okamoto; Saori C Tanaka; Mitsuo Kawato; Okito Yamashita; Hiroshi Imamizu
Journal: PLoS Biol Date: 2019-04-18 Impact factor: 8.029

9. Increased power by harmonizing structural MRI site differences with the ComBat batch adjustment method in ENIGMA.

Authors: Joaquim Radua; Eduard Vieta; Russell Shinohara; Peter Kochunov; Yann Quidé; Melissa J Green; Cynthia S Weickert; Thomas Weickert; Jason Bruggemann; Tilo Kircher; Igor Nenadić; Murray J Cairns; Marc Seal; Ulrich Schall; Frans Henskens; Janice M Fullerton; Bryan Mowry; Christos Pantelis; Rhoshel Lenroot; Vanessa Cropley; Carmel Loughland; Rodney Scott; Daniel Wolf; Theodore D Satterthwaite; Yunlong Tan; Kang Sim; Fabrizio Piras; Gianfranco Spalletta; Nerisa Banaj; Edith Pomarol-Clotet; Aleix Solanes; Anton Albajes-Eizagirre; Erick J Canales-Rodríguez; Salvador Sarro; Annabella Di Giorgio; Alessandro Bertolino; Michael Stäblein; Viola Oertel; Christian Knöchel; Stefan Borgwardt; Stefan du Plessis; Je-Yeon Yun; Jun Soo Kwon; Udo Dannlowski; Tim Hahn; Dominik Grotegerd; Clara Alloza; Celso Arango; Joost Janssen; Covadonga Díaz-Caneja; Wenhao Jiang; Vince Calhoun; Stefan Ehrlich; Kun Yang; Nicola G Cascella; Yoichiro Takayanagi; Akira Sawa; Alexander Tomyshev; Irina Lebedeva; Vasily Kaleda; Matthias Kirschner; Cyril Hoschl; David Tomecek; Antonin Skoch; Therese van Amelsvoort; Geor Bakker; Anthony James; Adrian Preda; Andrea Weideman; Dan J Stein; Fleur Howells; Anne Uhlmann; Henk Temmingh; Carlos López-Jaramillo; Ana Díaz-Zuluaga; Lydia Fortea; Eloy Martinez-Heras; Elisabeth Solana; Sara Llufriu; Neda Jahanshad; Paul Thompson; Jessica Turner; Theo van Erp
Journal: Neuroimage Date: 2020-05-26 Impact factor: 6.556

10. The minimal preprocessing pipelines for the Human Connectome Project.

Authors: Matthew F Glasser; Stamatios N Sotiropoulos; J Anthony Wilson; Timothy S Coalson; Bruce Fischl; Jesper L Andersson; Junqian Xu; Saad Jbabdi; Matthew Webster; Jonathan R Polimeni; David C Van Essen; Mark Jenkinson
Journal: Neuroimage Date: 2013-05-11 Impact factor: 6.556

8 in total

1. Gray Matter Network Associated With Attention in Children With Attention Deficit Hyperactivity Disorder.

Authors: Xing-Ke Wang; Xiu-Qin Wang; Xue Yang; Li-Xia Yuan
Journal: Front Psychiatry Date: 2022-07-04 Impact factor: 5.435

2. Application of a Machine Learning Algorithm for Structural Brain Images in Chronic Schizophrenia to Earlier Clinical Stages of Psychosis and Autism Spectrum Disorder: A Multiprotocol Imaging Dataset Study.

Authors: Yinghan Zhu; Hironori Nakatani; Walid Yassin; Norihide Maikusa; Naohiro Okada; Akira Kunimatsu; Osamu Abe; Hitoshi Kuwabara; Hidenori Yamasue; Kiyoto Kasai; Kazuo Okanoya; Shinsuke Koike
Journal: Schizophr Bull Date: 2022-05-07 Impact factor: 7.348

3. Lifespan Volume Trajectories From Non-harmonized T1-Weighted MRI Do Not Differ After Site Correction Based on Traveling Human Phantoms.

Authors: Sarah Treit; Emily Stolz; Julia N Rickard; Cheryl R McCreary; Mercedes Bagshawe; Richard Frayne; Catherine Lebel; Derek Emery; Christian Beaulieu
Journal: Front Neurol Date: 2022-05-09 Impact factor: 4.086

4. Quantitative MRI Harmonization to Maximize Clinical Impact: The RIN-Neuroimaging Network.

Authors: Anna Nigri; Stefania Ferraro; Claudia A M Gandini Wheeler-Kingshott; Michela Tosetti; Alberto Redolfi; Gianluigi Forloni; Egidio D'Angelo; Domenico Aquino; Laura Biagi; Paolo Bosco; Irene Carne; Silvia De Francesco; Greta Demichelis; Ruben Gianeri; Maria Marcella Lagana; Edoardo Micotti; Antonio Napolitano; Fulvia Palesi; Alice Pirastru; Giovanni Savini; Elisa Alberici; Carmelo Amato; Filippo Arrigoni; Francesca Baglio; Marco Bozzali; Antonella Castellano; Carlo Cavaliere; Valeria Elisa Contarino; Giulio Ferrazzi; Simona Gaudino; Silvia Marino; Vittorio Manzo; Luigi Pavone; Letterio S Politi; Luca Roccatagliata; Elisa Rognone; Andrea Rossi; Caterina Tonon; Raffaele Lodi; Fabrizio Tagliavini; Maria Grazia Bruzzone
Journal: Front Neurol Date: 2022-04-14 Impact factor: 4.086

5. Neuroimaging brain growth charts: A road to mental health.

Authors: Li-Zhen Chen; Avram J Holmes; Xi-Nian Zuo; Qi Dong
Journal: Psychoradiology Date: 2021-12-30

6. Comparison of traveling-subject and ComBat harmonization methods for assessing structural brain characteristics.

Authors: Norihide Maikusa; Yinghan Zhu; Akiko Uematsu; Ayumu Yamashita; Kousaku Saotome; Naohiro Okada; Kiyoto Kasai; Kazuo Okanoya; Okito Yamashita; Saori C Tanaka; Shinsuke Koike
Journal: Hum Brain Mapp Date: 2021-08-17 Impact factor: 5.038

Review 7. Quantitative Structural Brain Magnetic Resonance Imaging Analyses: Methodological Overview and Application to Rett Syndrome.

Authors: Tadashi Shiohama; Keita Tsujimura
Journal: Front Neurosci Date: 2022-04-05 Impact factor: 5.152

8. How failure to falsify in high-volume science contributes to the replication crisis.

Authors: Sarah M Rajtmajer; Timothy M Errington; Frank G Hillary
Journal: Elife Date: 2022-08-08 Impact factor: 8.713
