Literature DB >> 35908308

Predicting escitalopram treatment response from pre-treatment and early response resting state fMRI in a multi-site sample: A CAN-BIND-1 report.

Jacqueline K Harris¹, Stefanie Hassel², Andrew D Davis³, Mojdeh Zamyadi⁴, Stephen R Arnott³, Roumen Milev⁵, Raymond W Lam⁶, Benicio N Frey⁷, Geoffrey B Hall⁸, Daniel J Müller⁹, Susan Rotzinger¹⁰, Sidney H Kennedy¹⁰, Stephen C Strother⁴, Glenda M MacQueen², Russell Greiner¹¹.

Abstract

Many previous intervention studies have used functional magnetic resonance imaging (fMRI) data to predict the antidepressant response of patients with major depressive disorder (MDD); however, practical constraints have limited many of those attempts to small, single centre studies which may not adequately reflect how these models will generalize when used in clinical practice. Not only does the act of collecting data at multiple sites generally increase sample sizes (a critical point in machine learning development) it also generates a more heterogeneous dataset due to systematic differences in scanners at different sites, and geographical differences in patient populations. As part of the Canadian Biomarker Integration Network in Depression (CAN-BIND-1) study, 144 MDD patients from six sites underwent resting state fMRI prior to starting escitalopram treatment, and again two weeks after the start. Here, we consider ways to use machine learning techniques to produce models that can predict response (measured at eight weeks after initiation), based on various parcellations, functional connectivity (FC) metrics, dimensionality reduction algorithms, and base learners, and also whether to use scans from one or both time points. Models that use only baseline (pre-treatment) or only week 2 (early-response) whole-brain FC features consistently failed to perform significantly better than default models. Utilizing the change in FC between these two time points, however, yielded significant results, with the best performing analytical pipeline achieving 69.6% (SD 10.8) accuracy. These results appear contrary to findings from many smaller single-site studies, which report substantially higher predictive accuracies from models trained on only baseline resting state FC features, suggesting these models may not generalize well beyond data used for development. Further, these results indicate the potential value of collecting data both before and shortly after treatment initiation.

Entities: Chemical

Keywords: Depression; Functional connectivity; Machine learning; Resting State; Treatment response; fMRI

Mesh：

Substances：
Biomarkers
Escitalopram

Year: 2022 PMID： 35908308 PMCID： PMC9421454 DOI： 10.1016/j.nicl.2022.103120

Source DB: PubMed Journal: Neuroimage Clin ISSN： 2213-1582 Impact factor: 4.891

Introduction

Major depressive disorder (MDD) is a prevalent, and potentially chronic disorder with a highly variable symptom profile (Fried and Nesse, 2015). While there are many effective treatments, including multiple classes of pharmacological interventions, response to any given treatment at the individual level is quite variable. Approximately one third of patients remit from depressive symptoms following their first antidepressant treatment, and many will require multiple trials of different medications or other treatments to reach remission (Rush et al., 2006). While remission after multiple treatments may be possible, many patients drop out before an adequate response is found (Thornicroft et al., 2017), or experience prolonged distress and frustration while undergoing various (often unsuccessful) treatments. This has motivated the development of tools that can guide clinicians to quickly and accurately inform the treatment choice that is best for each individual MDD patient. The heterogeneity of MDD suggests that response to treatment may be complex (Fried and Nesse, 2015). Certain clinical factors appear to correlate with response, such as duration, and severity of the disorder (Kraus et al., 2019), suggesting the existence of several depressive subtypes. In general, however, attempts to find stable clinical subgroups with associations that are robust enough to guide clinical decision making have not been successful (Arnow et al., 2015). Machine learning provides a promising alternative to these approaches, as it can produce models that can map complex interactions among input features to outcome measures, enabling accurate, individual predictions of response. Resting state (RS) functional connectivity (FC) has frequently been used to study neurophysiological alterations in MDD. This type of analysis has been used to reproduce positron emission tomography (PET) findings implicating corticolimbic dysregulation in MDD etiology (Anand et al., 2005, Mayberg, 1997); with subsequent studies consistently reporting abnormal connectivity between the prefrontal cortex (PFC), anterior cingulate cortex (ACC), and limbic regions (Cullen et al., 2009, Veer et al., 2010). Findings have also pointed to alterations in whole RS networks such as hyperconnectivity in the default mode network (DMN) (Greicius et al., 2007, Sheline et al., 2009). Changes in FC following antidepressant treatment have also been reported including normalization of DMN hyperconnectivity (Posner et al., 2013, Wang et al., 2015), modulation of striatal connectivity (Wang et al., 2019), and increased corticolimbic connectivity (Anand et al., 2005). The sensitivity of RS FC measures to detect differences between MDD and healthy control groups as well as changes with antidepressant treatment, make them an attractive measurement for models predicting individual response to treatment, with several studies reporting accuracies exceeding 80% (Cohen et al., 2021, Gao et al., 2018). Most of these studies, however, use relatively small, single-site datasets, which tend to be more homogeneous, making it difficult to know if these results would translate to clinical practice (Schnack and Kahn, 2016). The magnitude of such site-induced biases in FC measures has also been shown to exceed the effect size of many psychiatric disorders (including MDD), making multi-site data collection essential for modelling the overall population (Yamashita et al., 2019). Empirical results from predictive studies further support these claims. Sundermann et al. (2017), for example, found that models derived from a larger, more heterogeneous sample -- presumably more representative of the true target patient population -- predicted MDD diagnosis with lower accuracy (45.0 to 56.1% accuracy) than models that had typically been reported in the literature (based on smaller sample sizes). Ramasubbu et al. (2016) similarly showed that MDD diagnostic models only outperformed default accuracy after limiting their patient population to those with severe depressive symptoms. All of these observations have been recently reinforced by the meta-analysis of Sajjadian et al. (2021). They report a mean accuracy of 63% [95% confidence interval (CI) 56–71] from 8 of 54 eligible literature reports before November 2020 considered to be of high-enough quality to enter into the meta-analysis. The remaining 46 reported studies provided a significantly higher mean accuracy of 75% (95% CI 72–78). Variability in data-processing pipelines can also make it challenging to compare results from different studies. In addition to the pre-processing required in all fMRI analysis, studies utilizing FC must also choose definitions for regions of interest and the metric used to calculate FC, both of which have been shown to have substantial impact on overall model performance (Abraham et al., 2017, Dadi et al., 2019, Kalmady et al., 2019). We must also consider other modelling choices, such as methods to reduce the number of features (if at all), what type of base learner will be used, and with what parameters. Published results may also be artificially inflated if various pipeline choices are tested and chosen only after reviewing performance on the test set (Hosseini et al., 2020). Here we investigate the utility of machine learning models that use whole-brain RS-fMRI features to predict escitalopram treatment response in a relatively large, multi-site MDD sample. We assess both the utility of pre-treatment FC measures and consider the utility of also using data collected after two weeks of treatment, which may serve as an early indicator of response. Further, we consider the impact of altering various steps in the analysis pipeline to assess what effect they may have on prediction accuracy.

Methods

Participants

We used data from CAN-BIND-1, a multi-centre study developed to examine potential biomarkers of response to antidepressant therapies in MDD; see Lam et al. (2016) for a detailed overview of the full study protocol and Kennedy et al. (2019) for a full report on clinical outcomes. Of relevance to this analysis, patients were recruited at six sites across Canada: University of British Columbia (UBC), University of Calgary (UCA), Queen’s University at Kingston (QNS), McMaster University (MCU), Toronto General Hospital (TGH), and The Centre for Addiction and Mental Health, Toronto (CAMH). Informed consent was obtained from all participants, and the protocol was approved by the Research Ethics Boards at each institution, with additional approval from the University of Alberta Research Ethics Board for secondary data analysis. Eligible participants were outpatients between the ages of 18 and 60, fluent English speakers, meeting DSM-IV-TR criteria for MDD in a current Major Depressive Episode, as assessed by the Mini International Neuropsychiatric Interview (MINI; Sheehan et al., 1998), with episode duration of at least three months, and a score of at least 24 on the Montgomery-Asberg Depression Rating Scale (MADRS; Montgomery and Åsberg, 1979). Participants were also required to be free of psychotropic medications for at least five half-lives prior to initial visit and to have not started psychological treatment within the previous three months. Individuals were ineligible if they had a previous unsuccessful, or not tolerated, trial of either of the study medications (escitalopram or aripiprazole), had failed four or more previous pharmacological treatments, or were at high risk for hypomanic switch (history of antidepressant-induced hypomania). Participants were also screened for other psychiatric disorders and excluded if they had a diagnosis of Bipolar I or II, or any primary psychiatric diagnosis other than MDD. Individuals with high suicidal risk, substance abuse/dependence in the past six months, psychosis in the current depressive episode, significant neurological disorders, head trauma, unstable medical conditions, pregnant or breastfeeding, or contraindications to MRI were also excluded. Participants were included in current analyses if they were part of the treatment group and had complete pre-treatment (baseline) and week 2 RS-fMRI data, as well as treatment response data at week 8. Of the 157 participants with complete data, 13 were excluded during manual quality control for severe imaging artefacts or incidental MRI findings. Scans were also screened for excessive motion based on the ‘lenient’ criteria outlined in Parkes et al. (2018), and none exceeded the threshold of 0.55 mm for the temporal mean of the frame displacement time series. Hence, data from 144 participants (age 34.9 ± 12.43, 90 females; see Supplementary Material Table S1 for detailed patient clinical and demographic information) were included in model building and assessment.

Treatment protocol and outcomes

Once enrolled, patients underwent extensive clinical data collection, after which, they began treatment with escitalopram for 8 weeks. MRI data, including RS-fMRI, were collected at baseline, two weeks, and eight weeks after treatment initiation. After eight weeks, patients were assessed for response to escitalopram, as defined as ≥ 50% reduction in total MADRS score from baseline. Those who achieved ≥ 50% decrease in MADRS were classified as ‘responders’, and those who did not, as ‘non-responders’.

MR image acquisition

Imaging data were acquired using 3 Tesla MRI scanners at six centers across Canada, varying by model and manufacturer (GE Healthcare Signa HDxT (TGH), GE Healthcare Discovery MR750 (CAM, MCU, UCA), Phillips Intera (UBC), Siemens Trio Tim (QNS)). Imaging protocols varied slightly among sites, to accommodate for scanner and manufacturer differences (see MacQueen et al., 2019 for detailed protocols). RS-fMRI was collected over a 10-minute scan (2000 ms repetition time) using a whole-brain T2*-sensitive blood-oxygen-level-dependent echo planar imaging sequence (30 ms echo time, 4 mm × 4 mm × 4 mm resolution). Accompanying whole-brain structural 3D T1-weighted images were acquired with a 1 mm isotropic resolution. In addition to protocol harmonization, substantial quality control and assurance efforts were implemented to ensure consistent high-quality data was collected from all sites. These methods included, but were not limited to, automated file name and imaging protocol adherence checks, manual image quality control rating, and longitudinal monitoring of scanner stability using monthly phantom scans from all sites (see MacQueen et al. 2019). Details of image pre-processing are included in the Supplementary Materials.

Data analysis

FC was extracted as a connectome, measuring the pair-wise synchronicity between the time-series of each pair of spatial ROIs; where the time-series for a single ROI is the average temporally changing signal intensity of all voxels included in the ROI. Many studies have noted that in both the calculation and the use of these connectomes in machine learning models, various steps in the pipeline can have a substantial impact on overall model performance (Abraham et al., 2017). To ensure that our reported results are not specific to a single pipeline, we tested multiple pipelines, altering decisions made at various steps, as depicted in Fig. 1. In relation to the computation of the connectome, we considered different combinations of parcellations and connectivity estimation metrics. We also considered various methods for feature dimensionality reduction, and various types of base learners.

Fig. 1

(Top) Diagram of data flow through analytic pipelines. Pre-processed resting state fMRI data from either baseline or week 2 is fed through a series of operations that first generate functional connectivity (FC) features and then generate a predictive model based on these features. Alternate approaches are used for parcellation, connectivity estimation, dimensionality reduction, and base learner, every combination of which is tested for each set of input data, resulting in a total of 240 models for each of the baseline and week 2 datasets (5 parcellations × 3 connectivity metrics × 4 dimensionality reduction techniques × 4 classifiers) – leading to a total of 480 models. Along the highlighted pathway, for example, pre-processed data collected at week 2 is first parcellated using the Power coordinates, resulting in a 259x295 matrix for each participant’s data, where 259 is the number of regions of interest (ROI) included in the power coordinates after SNR masking, and 295 is the length of the temporal dimension of the original fMRI dataset. Between every pair of ROIs, the correlation between time-courses is then calculated, resulting in a single value for every ROI pair. The full set of correlations corresponds to the lower triangle of the full correlation matrix, which is then vectorized with a length of 33,411 for each participant. This process of feature generation is repeated for each participant, resulting in a 144x33,411 matrix of FC features. Since feature generation is independent of response label, this procedure is completed prior to model generation. Features then fed into model generation are scaled, and passed to ANOVA feature selection, which chooses the k (value obtained based on internal cross-validation) most relevant features to be used in the linear SVM classifier. (Bottom) Delta models are processed through the same analytic pathway, with the addition of a subtraction step at the end of feature generation. Here, both baseline and week 2 FC features are generated, and the difference of the two matrices (the delta FC feature matrix) is used in subsequent predictive modelling. An additional 240 models are generated using these delta features, considering all possible combinations of processing steps.

Parcellation

To reduce dimensionality, and obtain neurobiologically relevant features, pre-processed RS images were parcellated based on either an a priori atlas or set of pre-defined ROI coordinates. A time-series for each region from either atlas or coordinate set was generated by taking an element-wise average of all voxels included in the region. Methods for defining these regions, and also the number of regions in each parcellation, vary greatly depending on the parcellation scheme. We considered the five schemes listed below. Parcellations were all implemented using Nilearn (Abraham et al., 2014); with default parameters unless otherwise stated. Automated Anatomical Labelling (AAL) (Tzourio-Mazoyer et al., 2002) − 116 region anatomical atlas based on manual segmentation. Dosenbach (Dosenbach et al., 2010) − 160 ROIs based on meta-analyses of task-based fMRI (radius = 4.5). Harvard-Oxford (Desikan et al., 2006) − 48 region cortical atlas based semi-automated segmentation (atlas_name = ‘cort-maxprob-thr25-2 mm’). Power (Power et al., 2011) − 264 ROIs generated based on functional homogeneity (radius = 5.0). Yeo (Thomas Yeo et al., 2011) − 17 cortical networks based on clustering of functional connectivity in 1000 participants (data = ‘thick_17′). To ensure sufficient signal quality, we generated a binary mask that set a voxel position to zero, for exclusion, if the signal-to-noise ratio (SNR, time-series mean divided by standard deviation) for that position was less than 100 in greater than 5% of participants (Drysdale et al., 2017). Voxels with insufficient SNR were excluded during mean time series calculation for parcellations. Based on this low SNR criteria, we excluded (all voxels in) five regions from the Power parcellation and one from the Dosenbach parcellation, generally along the inferior frontal and temporal lobes (Supplementary Fig. S1).

Connectivity estimation

For each parcellation, we then considered the following three measures to estimate pairwise FC between each pair of brain regions; again, calculated using Nilearn (Abraham et al., 2014). Correlation Partial Correlation – A variant of correlation that infers only direct connectivity between regions, as opposed to including indirect connectivity through other regions. Tangent – optimized covariance metric for statistical learning (Abraham et al., 2017, Dadi et al., 2019). As part of the training procedure, FC features were scaled to zero mean, and unit variance with respect to the entire training set. This, and subsequent steps, including dimensionality reduction techniques, base learners, cross-validation, and hyperparameter optimization, were implemented using Scikit-learn (Pedregosa et al., 2011).

Dimensionality reduction

Various techniques were applied to further reduce the number of input features. The number of features selected, k, was optimized during internal cross-validation to be between 5 and 50; the upper limit selected in consideration of the sample size. In addition to the strategies listed below, models were also tested with this step omitted, to allow the base learners to use the full set of FC features. Principal components analysis (PCA) – Project data into a lower dimensional space; here keeping the first k components. ANOVA – Univariate Feature selection based on ANOVA f-value between the input features and labels, keeping features with k highest f-values. Agglomeration – Recursively merge features, stopping when k clusters are reached. None – Omit dimensionality reduction step and allow base learner to use all features.

Base learner

We then ran base learners on the data described above, to produce models that could discriminate between treatment responders and non-responders. We considered four different base learners, both linear and nonlinear, to assess their discriminative capacity. Here, we considered support vector machine (SVM) learners as they have been shown to be highly effective for classification and are one of the most commonly used. We also considered logistic regression for its simplicity and fast training time, and random forest as a more complex, and often effective, model. Logistic Regression – Basic linear classification model. Support vector Machine (SVM) with linear kernel. SVM with Radial Basis Function (RBF) Kernel – Non-linear SVM. Random Forest – Ensemble bagging algorithm based on decision trees.

Timepoints

We trained each combination of parcellation, connectivity estimation, dimensionality reduction, and base learner described above with each of the three sets of input data based on the time(s) when the data was collected: (1) just baseline data, which was collected prior to treatment, (2) data collected 2 weeks into treatment, and (3) a combination of the two time points. For (3), we computed the features for a given parcellation and connectivity metric for both baseline and week 2, then subtracted the week 2 features from the baseline features; going forward, we refer to such models as “delta” models. We compared the performance of models trained using data from these three different timepoints using Wilcoxon matched-pairs signed rank test, matching results from the same analytical pipelines trained using different data, and corrected for multiple comparisons using Bonferroni correction. Altogether, for each of the three timepoints, we consider 240 = 5x3x4x4 different models, as we consider five different parcellations (AAL, Dosenbach, Harvard-Oxford, Power, and Yeo), three different connectivity measures (correlation, partial correlation, and tangent embedding) four different dimensionality reduction strategies (PCA, ANOVA, agglomeration, and none), and four base learners (logistic regression, linear SVM, SVM with RBF kernel, and random forest). For each timepoint, and each setting (particular parcellation, connectivity measure, dimensionality reduction strategy, and base learner), we used internal 5-fold cross-validation to identify the best hyperparameter values, using a random search of 100 iterations over the prescribed parameter space for both dimensionality reduction and base learner (details of hyperparameter optimization included in Supplementary Materials).

Cross-validation and assessment

The quality of the learned models was assessed using an external 10-fold cross-validation strategy. Here, we partitioned the data into 10 disjoint equal-sized folds, each with roughly the same number of instances from each scan site and each classification label. Since all the parcellations used in this analysis were derived a priori, the parcellation and connectome calculation were performed outside of cross-validation. This approach produces a model for each fold (9/10 of the data), whose accuracy is assessed on the held-out remaining 1/10 of the data. The overall performance reported for each model is the mean accuracy across all folds. A default ‘dummy’ classifier was also generated to assign each patient to the majority class label from the training set. Each model for a given timepoint was compared to the default model to determine if it could significantly outperform chance level predictions. Models from each timepoint were first compared using an ANOVA analysis, with subsequent paired t-tests if significance was reached. During external cross-validation, the resubstitution accuracy was also assessed by making predictions on the training data to be used as an indicator of over-fitting. The impact of changes to the analytical pipeline was statistically tested using the Wilcoxon matched-pairs rank test corrected with Bonferroni correction, matching pipelines differing in only a single step.

Results

The default model, predicting the majority class, had a prediction accuracy of 53.5% (SD 3.7). Fig. 2 shows the cross-validation accuracy results for the top 10 models from each timepoint, along with details of corresponding analytic pipelines.

Fig. 2

Ranked models with highest mean cross-validation accuracy for data from baseline (left), week 2 (center), and delta (right) timepoints. Bar charts depict the mean test accuracy for each pipeline with error bars representing standard deviation across folds. Pipelines that performed significantly better than default accuracy (p-value ≤ 0.05) are indicated with an asterisk, which occurs only in the delta models and a single week 2 model. Tables below further detail the pipeline settings of top performing models along with mean cross-validation accuracy, and standard deviation.

Baseline models

Models trained on baseline data had accuracies ranging from 39.0 (SD 11.7) to 61.2% (SD 10.5), none of which performed significantly differently from the default classifier (p-value > 0.05). Only three models exceeded 60% mean accuracy.

Week 2 models

Models trained using week 2 data showed performance similar to baseline models, with mean accuracies between 37.5 (SD 8.8) to 66.5% (SD 12.0), only the highest of which performed significantly better than the default classifier (p = 0.014). The best performing model was also the only model to exceed 60% mean accuracy, having over 6% higher accuracy than the next best performing model. Corrected Wilcoxon rank test showed no significant difference between models from baseline and week 2 (p = 1.0).

Delta models

40 models of the 240 modelling pipelines tested exceeded a mean accuracy of 60%. Thirty of these 40 models came from pipelines utilizing correlation as the metric to calculate connectome matrices. Mean test accuracies (over all 240 models) ranged from 41.6 (10.1) to 69.6% (SD 10.8), with 17 models performing significantly better than the default model at a p-value of < 0.05. Statistical comparison with models trained with baseline, or week 2 data, show highly significant differences (p = 1.6e-11, p = 2.4e-13, respectively) although only an overall modest mean performance improvement of 3.1 and 3.4% respectively.

Pipeline choices

We assessed the impact of different choices in the analytic pipeline only in the delta models (Fig. 3), as neither baseline nor week 2 models indicated predictive capacity beyond chance level. Connectivity metric showed the greatest impact on predictive accuracy, with partial correlation models performing significantly worse than both correlation and tangent (p = 3.6e-12, p = 3.0e-5, respectively). Models trained with correlation data also significantly outperformed tangent models (p = 3.0e-7).

Fig. 3

Impact of analytic pipeline choices on prediction accuracy. For each evaluated step of the analytic pipeline, the figure shows the mean difference between pipelines with indicated alternatives and mean overall predictive accuracy. Error bars are scaled to one-sixth standard deviation. For each step in the pipeline, alternative operations were compared using a Wilcoxon matched-pairs rank test with Bonferroni correction. Steps involved in feature generation had the greatest impact on model performance, specifically the choice of connectivity metric, where correlation showed the highest mean performance. Neither step in model generation (dimensionality reduction or base learner) had a substantial impact on performance. Parcellation choice had a small impact on accuracy but was significant between AAL and Yeo (p = 0.008), Dosenbach and Yeo (p = 0.003), and Power and Yeo (p = 0.016) parcellations. In general, parcellations with fewer regions tended to outperform those with more. Neither base learner nor dimensionality reduction technique had a significant impact on prediction accuracy, although the random forest base learner and ANOVA-based dimensionality reduction showed moderate improvement over other techniques.

Discussion

Early symptom response to antidepressant medication has been well established to be one of the strongest predictors of treatment response to date (Jakubovski and Bloch, 2014, Szegedi et al., 2009). However, it is not known if neurobiological measures carry this same predictive power. Our analysis, utilizing a relatively large heterogeneous sample, shows that FC features collected from a single time-point, either prior to treatment initiation, or 2 weeks into treatment, were not capable of predicting response significantly above chance level, except for a single anomalous case amongst the week 2 models. In the same sample, however, the change in FC matrices between baseline and week 2 predicting response after eight weeks of treatment, was significantly above chance, with the highest performing model achieving 69.6% (SD 10.8) accuracy. Previous authors utilizing RS-fMRI data to predict individual response to pharmacological treatment have reported predictive accuracies in excess of 80% (Cohen et al., 2021, Gao et al., 2018); although these rarely included imaging data from more than one site, or samples with more than 50 participants. One notable result from the iSPOT trials, utilizing 80 participants, reported prediction accuracy for remission exceeding 80%, based on models trained from intrinsic FC measures between the posterior cingulate cortex (PCC) and anterior cingulate cortex (ACC)/medial prefrontal cortex (mPFC) (Goldstein-Piekarski et al., 2018); regions that have been implicated in depressive symptomology (Cullen et al., 2009, Veer et al., 2010). Problematically, the mPFC, ACC, and other limbic regions lie in the inferior frontal and temporal lobes, which are known to experience substantial magnetic field inhomogeneities due to numerous air/tissue interfaces in the region (Truong et al., 2002), resulting in susceptibility artefacts and signal loss. SNR masks produced as part of this analysis (Supplementary Fig. S1) to ensure adequate signal coverage in regions included in model development excluded significant portions of regions of interest including subgenual ACC and raphe nuclei, which are important components of emotional regulation and serotonergic pathways. Poor signal coverage in these areas may have failed to capture neuronal interactions pivotal to predicting treatment response. Lower accuracies observed in this study are also consistent with a previously observed, paradoxical relationship between sample size and accuracy in psychiatric predictive models (Kalmady et al., 2019, Varoquaux, 2018, Wolfers et al., 2015); wherein larger samples tend to report lower overall accuracy. An overall increase in dataset heterogeneity is likely a large contributing factor to this decline, as it becomes more challenging to control heterogeneity as sample sizes increase. Although this heterogeneity may appear to decrease overall accuracy, results of less tightly controlled studies are likely more generalizable to clinical practice, as they tend to encompass more realistic patient populations and multiple collection sites. Instability in model performance across cross-validation folds, resulting in large error bars, is also a concern in small samples, and leads to uncertainty around the capacity of these models to generalize to new samples. All models tested exhibited a large degree of variance in performance across cross-validation folds; with standard deviations reaching up to 18.7%, which is consistent with empirical results from Varoquaux (2018) for samples of similar size. This result, however, draws into question whether the proposed models are indeed able to classify treatment response above a chance level. Moreover, the number of models run may raise concerns of ‘overhyping’ (Hosseini et al., 2020), where analysis choices are made after observing test set accuracies. This effect becomes clearly evident in the week 2 model results, where the only significant model has an accuracy over six percent higher than the next best performing model; an effect that is more likely to be spurious, rather than a reflection of superior modelling choices. Even the more consistent performance of the delta models may be questionable in light of the number of models run, although they were found to statistically outperform baseline and week 2 models with a high degree of significance. Model stability and performance are also negatively impacted by small sample sizes relative to the number of features; a cited concern of neuroimaging features in machine learning (Du et al., 2018, Varoquaux, 2018). Under these conditions, the training samples inadequately cover the feature space leading to overfitting. Fig. 4 demonstrates that across time point and across model type, resubstitution accuracy consistently exceeds test accuracy, an indication of overfitting. Given the inherently high-dimensional nature of FC data, considerably larger sample sizes may be required to elucidate subtle patterns necessary for prediction, although resubstitution accuracy would still be expected to exceed test accuracy in larger real-world data sets but by smaller amounts. Similar concerns have been suggested in utilizing genetic features (Wray et al., 2013), which was cited as a major concern in a recent report by Shumake et al. (2021). Here, authors failed to show improvement in accuracy when including genetic data in models that predict SSRI treatment response. These results were somewhat disappointing given the considerable support in the literature for a heritable component to SSRI response and the interest in the predictive capacity of genetic data in treatment response (Amare et al., 2017).

Fig. 4

Mean resubstitution (blue) and testing (orange) accuracy across timepoints and base learners. Mean accuracy results for all models separated by time point and base learner are plotted, where each box extends from the first quartile to the third quartile with median line; whiskers extend to show full data range with mean default model accuracy indicated by the red horizontal line through plots. Amassing data sets of adequate size is extremely challenging, due, in part, to the high costs and scarce resources for image acquisition. Prognostic studies, such as ours, where additional follow-up and patient monitoring is required, bring additional challenges. Improved predictive accuracy observed in the delta models, which include imaging data for each patient from two separate timepoints, requires yet additional resources. Results from Klöbl et al. (2020), however, suggest medication-induced FC alterations may be achieved in a much shorter timeframe by an acute intravenous SSRI challenge. This may allow the week 2 timepoint used in delta models to be acquired at the baseline visit following an intravenous challenge, eliminating the need for a second scanning session and 2-week medication trial before response prediction can be made. Whether we consider one- or two-timepoints, our current approach requires a large number of participants to find meaningful patterns, leading to good generalization. An alternative approach to increasing sample size may be to reduce the number of features used to describe each patient; either through parcellation choice or a priori manual feature selection. When comparing performance of models utilizing different parcellations in Fig. 3 (top), a trend emerges favouring those with fewer regions (and therefore fewer features), which may suggest that larger parcellations produce too many features to be adequately modelled within the given sample. Ultimately, however, smaller parcellations require more voxels to be averaged together in a single region, implicitly imposing regional homogeneity, which may lead to information loss (Dornas and Braun, 2018) important in the predictive task. Manual feature selection similarly has the potential to overlook connections that may meaningfully contribute to predictions. Results from delta models in our analysis show a modest improvement over default accuracy, similar to reports from larger, multisite studies such as (1) Chekroud et al. (2016), utilizing clinical measures from the STAR*D study to predict treatment response after 2 weeks reporting a mean accuracy of 67.9% (SD 3.8), which improved over baseline prediction accuracy of 64.65% (SD 3.2), (2) Shumake et al. (2021) using pre-treatment clinical and small nucleotide polymorphism (SNP) features to achieve 62.8% accuracy, and (3) Bartlett et al. (2018) who achieved 63.9% accuracy predicting SSRI remission using both pre-treatment, and also early-treatment cortical thickness measurements. A plateau in accuracy approaching 70% across these larger studies may be an indication of a theoretical upper limit on the capacity of models to predict response to pharmacological treatment in MDD, likely impacted to some degree by uncertainty associated with the response labels (Cortes et al., 1995). This limit, however, may be different for alternative prediction targets of treatment response such as remission, and treatment resistance (Sajjadian et al., 2021). MADRS scores, the basis of the binary labels of response in this analysis, are inherently noisy due to variability in rater assessment (Davidson et al., 1986), and may also be biased due to factors independent of biological response that impact mood such as changes in external stressors, spontaneous recovery, and placebo effects.

Limitations

This study is not without limitations. As an open label trial, there was no placebo group to control for known placebo effects in treatment response. While this dataset is from one of the larger fMRI studies to examine treatment response in MDD, it may still be substantially smaller than what might be necessary for modelling inherently high-dimensional FC data. As discussed earlier, the inability of single-timepoint models (i.e., baseline, or week 2) to exceed chance level performance may be mitigated by including more patients in the training data. In this analysis we chose to use cross-validation splits with equal proportions of data from each acquisition site. This approach, however, does not assess how the model would perform on data with uncontrolled site-specific variability. Alternatively, defining folds by acquisition site, where each fold consists of data from a single site, might better assess a model’s capacity to generalize to new sites. We did not take this approach in our analysis, given the uneven distribution of patients across imaging sites (see Supplementary Materials Table S1) and limited sample size. Ideally, future analyses may evaluate model performance on a completely independent and external testing data set. Further, while we attempted to explore the impact of different modelling choices on overall performance in predicting treatment response in MDD, this was not an exhaustive search of the space. For example, we limited parcellations to pre-defined atlases and co-ordinate sets but did not consider the potential of data-driven strategies. Novel approaches to each step of the analytic pipeline continue to be developed and may improve classification accuracy. In addition, we note there are state-of-the-art approaches for learning from resting-state fMRI (such as Chen et al., 2022, Santana et al., 2019, Zhao et al., 2022). However, as the main point of this paper is showing that including week 2 data can significantly improve the accuracy over just using the baseline data, we decided it was sufficient to show this on these standard machine learning approaches.

Conclusion

This study explored ways to learn a model that can predict the effectiveness of escitalopram for treating MDD, based on RS-fMRI data, taken at baseline and/or at 2 weeks. Our extensive empirical analysis (over 720 possible models) suggests that models that use only the data at baseline, or that use only the data at 2 weeks, are not sufficient for this task. However, we were able to produce several effective learned models that combine the data from both timepoints. While we acknowledge that baseline data might be sufficient, given a larger sample, additional features, and perhaps other learning techniques, we anticipate, even then, including early biological changes following treatment initiation may lead to yet more accurate predictions.

Declaration of Competing Interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: RM has received consulting and speaking honoraria from AbbVie, Allergan, Eisai, Janssen, KYE, Lallemand, Lundbeck, Neonmind, Otsuka, and Sunovion, and research grants from CAN-BIND, Canadian Institutes for Health Research (CIHR), Janssen, Lallemand, Lundbeck, Nubiyota, Ontario Brain Institute (OBI) and Ontario Mental Health Foundation (OMHF). RWL has received honoraria for ad hoc speaking or advising/consulting, or received research funds, from: Asia-Pacific Economic Cooperation, BC Leading Edge Foundation, CIHR, Canadian Network for Mood and Anxiety Treatments, Healthy Minds Canada, Janssen, Lundbeck, Lundbeck Institute, Medscape, Michael Smith Foundation for Health Research, MITACS, Myriad Neuroscience, Ontario Brain Institute, Otsuka, Pfizer, Sanofi, Unity Health, Vancouver Coastal Health Research Institute, and VGH/UBCH Foundation. SHK has received research funding or honoraria from the following sources: Abbott, AbbVie, Alkermes, Allergan, Boehringer-Ingelheim, Brain Canada, CIHR, Field Trip (stock), Janssen, Lundbeck, Lundbeck Institute, Merck, OBI, Ontario Research Fund (ORF), Otsuka, Pfizer, Servier, SPOR, Sun, and Sunovion. SCS is a senior scientific advisor and shareholder of ADMdx, Inc., which receives NIH funding, and he currently has research grants from the Ontario Brain Institute in Canada. The remaining authors have no conflicts to disclose.

56 in total

1. Symptomatic and Functional Outcomes and Early Prediction of Response to Escitalopram Monotherapy and Sequential Adjunctive Aripiprazole Therapy in Patients With Major Depressive Disorder: A CAN-BIND-1 Report.

Authors: Sidney H Kennedy; Raymond W Lam; Susan Rotzinger; Roumen V Milev; Pierre Blier; Jonathan Downar; Kenneth R Evans; Faranak Farzan; Jane A Foster; Benicio N Frey; Peter Giacobbe; Geoffrey B Hall; Kate L Harkness; Stefanie Hassel; Zahinoor Ismail; Francesco Leri; Shane McInerney; Glenda M MacQueen; Luciano Minuzzi; Daniel J Müller; Sagar V Parikh; Franca M Placenza; Lena C Quilty; Arun V Ravindran; Roberto B Sassi; Claudio N Soares; Stephen C Strother; Gustavo Turecki; Anthony L Vaccarino; Fidel Vila-Rodriguez; Joanna Yu; Rudolf Uher
Journal: J Clin Psychiatry Date: 2019-02-05 Impact factor: 4.384

2. The Canadian Biomarker Integration Network in Depression (CAN-BIND): magnetic resonance imaging protocols

Authors: Glenda M. MacQueen; Stefanie Hassel; Stephen R. Arnott; Addington Jean; Christopher R. Bowie; Signe L. Bray; Andrew D. Davis; Jonathan Downar; Jane A. Foster; Benicio N. Frey; Benjamin I. Goldstein; Geoffrey B. Hall; Kate L. Harkness; Jacqueline Harris; Raymond W. Lam; Catherine Lebel; Roumen Milev; Daniel J. Müller; Sagar V. Parikh; Sakina Rizvi; Susan Rotzinger; Gulshan B. Sharma; Claudio N. Soares; Gustavo Turecki; Fidel Vila-Rodriguez; Joanna Yu; Mojdeh Zamyadi; Stephen C. Strother; Sidney H. Kennedy
Journal: J Psychiatry Neurosci Date: 2019-07-01 Impact factor: 6.186

3. Antidepressant effect on connectivity of the mood-regulating circuit: an FMRI study.

Authors: Amit Anand; Yu Li; Yang Wang; Jingwei Wu; Sujuan Gao; Lubna Bukhari; Vincent P Mathews; Andrew Kalnin; Mark J Lowe
Journal: Neuropsychopharmacology Date: 2005-07 Impact factor: 7.853

4. Antidepressants normalize the default mode network in patients with dysthymia.

Authors: Jonathan Posner; David J Hellerstein; Inbal Gat; Anna Mechling; Kristin Klahr; Zhishun Wang; Patrick J McGrath; Jonathan W Stewart; Bradley S Peterson
Journal: JAMA Psychiatry Date: 2013-04 Impact factor: 21.596

5. Prognostic subgroups for citalopram response in the STAR*D trial.

Authors: Ewgeni Jakubovski; Michael H Bloch
Journal: J Clin Psychiatry Date: 2014-07 Impact factor: 4.384

6. Deriving reproducible biomarkers from multi-site resting-state data: An Autism-based example.

Authors: Alexandre Abraham; Michael P Milham; Adriana Di Martino; R Cameron Craddock; Dimitris Samaras; Bertrand Thirion; Gael Varoquaux
Journal: Neuroimage Date: 2016-11-16 Impact factor: 7.400

7. Discovering biomarkers for antidepressant response: protocol from the Canadian biomarker integration network in depression (CAN-BIND) and clinical characteristics of the first patient cohort.

Authors: Raymond W Lam; Roumen Milev; Susan Rotzinger; Ana C Andreazza; Pierre Blier; Colleen Brenner; Zafiris J Daskalakis; Moyez Dharsee; Jonathan Downar; Kenneth R Evans; Faranak Farzan; Jane A Foster; Benicio N Frey; Joseph Geraci; Peter Giacobbe; Harriet E Feilotter; Geoffrey B Hall; Kate L Harkness; Stefanie Hassel; Zahinoor Ismail; Francesco Leri; Mario Liotti; Glenda M MacQueen; Mary Pat McAndrews; Luciano Minuzzi; Daniel J Müller; Sagar V Parikh; Franca M Placenza; Lena C Quilty; Arun V Ravindran; Tim V Salomons; Claudio N Soares; Stephen C Strother; Gustavo Turecki; Anthony L Vaccarino; Fidel Vila-Rodriguez; Sidney H Kennedy
Journal: BMC Psychiatry Date: 2016-04-16 Impact factor: 3.630

8. Accuracy of automated classification of major depressive disorder as a function of symptom severity.

Authors: Rajamannar Ramasubbu; Matthew R G Brown; Filmeno Cortese; Ismael Gaxiola; Bradley Goodyear; Andrew J Greenshaw; Serdar M Dursun; Russell Greiner
Journal: Neuroimage Clin Date: 2016-07-27 Impact factor: 4.881

9. Using Deep Learning and Resting-State fMRI to Classify Chronic Pain Conditions.

Authors: Alex Novaes Santana; Ignacio Cifre; Charles Novaes de Santana; Pedro Montoya
Journal: Front Neurosci Date: 2019-12-17 Impact factor: 4.677

10. Inclusion of genetic variants in an ensemble of gradient boosting decision trees does not improve the prediction of citalopram treatment response.

Authors: Jason Shumake; Travis T Mallard; John E McGeary; Christopher G Beevers
Journal: Sci Rep Date: 2021-02-12 Impact factor: 4.379