Literature DB >> 29879133

Predictive modeling of treatment resistant depression using data from STAR*D and an independent clinical study.

Zhi Nie^1,2, Srinivasan Vairavan^3,4, Vaibhav A Narayan^3,4, Jieping Ye^1,2, Qingqin S Li^3,4.

Abstract

Identification of risk factors of treatment resistance may be useful to guide treatment selection, avoid inefficient trial-and-error, and improve major depressive disorder (MDD) care. We extended the work in predictive modeling of treatment resistant depression (TRD) via partition of the data from the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) cohort into a training and a testing dataset. We also included data from a small yet completely independent cohort RIS-INT-93 as an external test dataset. We used features from enrollment and level 1 treatment (up to week 2 response only) of STAR*D to explore the feature space comprehensively and applied machine learning methods to model TRD outcome at level 2. For TRD defined using QIDS-C16 remission criteria, multiple machine learning models were internally cross-validated in the STAR*D training dataset and externally validated in both the STAR*D testing dataset and RIS-INT-93 independent dataset with an area under the receiver operating characteristic curve (AUC) of 0.70-0.78 and 0.72-0.77, respectively. The upper bound for the AUC achievable with the full set of features could be as high as 0.78 in the STAR*D testing dataset. Model developed using top 30 features identified using feature selection technique (k-means clustering followed by χ2 test) achieved an AUC of 0.77 in the STAR*D testing dataset. In addition, the model developed using overlapping features between STAR*D and RIS-INT-93, achieved an AUC of > 0.70 in both the STAR*D testing and RIS-INT-93 datasets. Among all the features explored in STAR*D and RIS-INT-93 datasets, the most important feature was early or initial treatment response or symptom severity at week 2. These results indicate that prediction of TRD prior to undergoing a second round of antidepressant treatment could be feasible even in the absence of biomarker data.

Entities: Chemical Disease Gene Species

Mesh：

Substances：
Antidepressive Agents

Year: 2018 PMID： 29879133 PMCID： PMC5991746 DOI： 10.1371/journal.pone.0197268

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

Introduction

Treatment resistant depression (TRD) and relapse/recurrence of major depressive episodes (MDE) are important sources of morbidity and mortality. Early Identification of risk factors of resistance using baseline characteristics and initial response may be useful to guide treatment selection, avoid inefficient trial-and-error, alter disease course, and improve major depressive disorder (MDD) care. The term initial and early response are used interchangeably in this article and its refers to response at 2 weeks. In the literature, there are multiple TRD definitions[1] and staging models such as the Antidepressant Treatment History Form (ATHF), the Thase and Rush Model, the European Staging Model, the Massachusetts General Hospital Staging Model, the Maudsley Staging Model (MSM), [2] and the Dutch Measure for quantification of TRD (DM-TRD), an extension of MSM.[3] From the health authority’s perspective, the September 17, 2009 Committee for Medicinal Product for Human Use (CHMP) Concept Paper on “The Need For Revision of Note for Guidance on Clinical Investigation of Medicinal Products In The Treatment of Depression With Regard To Treatment Resistant Depression” stated that “In a clinical pragmatic view a patient has been considered suffering from TRD when consecutive treatment with two products of different pharmacological classes, used for a sufficient length of time at an adequate dose, fail to induce a clinically meaningful effect (inadequate response).” However, since there is no evidence supporting switching of treatment within one class is less effective than switch to a different pharmacologic class, the May 23, 2013 CHMP Guideline on “clinical investigation of medicinal products in the treatment of depression” stated that “For the purpose of this guideline TRD is considered, when treatment with at least two different antidepressant agents (of the same or a different class) prescribed in adequate dosages for adequate duration and adequate affirmation of treatment adherence showed lack of clinically meaningful improvement in the regulatory setting.” We adopted the spirit of 2013 CHMP definition in this study. A failed antidepressant trial could be defined by either failure to respond (less than 50% reduction in depression severity score) or failure to reach remission (such as having the 16-item Quick Inventory of Depressive Symptomatology, clinician-rated (QIDS-C16) score > 5 [4]). The risk factors for TRD as reviewed by Bennabi et al., 2014 [5] include clinical risk factors such as comorbid anxiety disorder, current suicidal risk, non-response to the first antidepressant received in the patient's lifetime and presence of melancholic features [6]; bipolarity, early onset of first depressive episode, high rate of depressive recurrences, and lack of full remission after a previous episode [7]; low reward dependence and low cooperativeness [8]; high neuroticism, low extraversion, low openness, and low conscientiousness [9]. In addition, a subset of patients with diagnosis of "unipolar" treatment resistant depression may have a bipolar diathesis[10]. Although there are many studies using baseline clinical risk factors/symptom clusters and/or early treatment response to predict longer term antidepressant treatment outcome within the same treatment regime in clinical ascertained samples [11-14], the studies using baseline clinical characteristics and initial response to earlier treatment regime to predict response to next treatment option are limited. Perlis et al. [15] developed a clinical decision tool to predict the risk of TRD based on clinical or socio-demographic variables available by or readily amenable to self-report using a logistic regression model with area under the receiver operating characteristic curve (AUC) AUC exceeding 0.71 in training, testing, and validation datasets, all derived from the Sequenced Treatment Alternatives to Relieve Depression (STAR*D). In the work by Perlis et al., non-TRD and TRD subjects were defined as individuals reaching remission with a first or second pharmacological treatment trial and those not reaching remission despite two trials, respectively. In our proposed work, we extended the work by Perlis et al., via partitioning the data from STAR*D cohort into the training and the testing dataset, and included a completely independent cohort RIS-INT-93 as an external test dataset. In addition, we explored the definition of non-TRD and TRD as individuals responding with a first or second treatment trial and those not responding despite two trials, respectively. In the work by Perlis et al., performance of three machine learning models, namely, naïve Bayes classifier, support vector machine (SVM), and random forest, was less consistent. In our work, we further explored other machine learning models such as XGBoost, l2 penalized logistic regression, gradient boosted decision tree (GBDT), the elastic net and compared their performance in the prediction of TRD.

Methods

Cohorts

STAR*D cohort

The multicenter antidepressant effectiveness STAR*D study (ClinicalTrials.gov number NCT00021528) has been described elsewhere.[16, 17] Briefly, patients meeting DSM-IV criteria for MDD went through four levels of treatment options, for up to 12 weeks in length at each level. A patient exited the study and had the option to enter a naturalistic follow up study if achieving remission at the end of each level of treatment. Otherwise the patient had the option to enter the next level of treatment. All subjects signed written informed consent before participation, with the protocol approved by institutional review boards at participating institutions.

RIS-INT-93 cohort

Data were drawn from a Janssen clinical study RIS-INT-93 (ClinicalTrials.gov number NCT00044681)[18]. Patients met DSM-IV criteria for MDD and had history of resistance to therapy with antidepressant medication and were treated (open-label) prospectively with citalopram for up to 6 weeks. Since the first level of antidepressant failure was retrospective, baseline (week 0) and early response at week 2 from the level 2 prospective treatment was used in TRD modeling. All subjects signed written informed consent before participation, with the protocol approved by institutional review boards at participating institutions.

Design

STAR*D training and testing, and an external test dataset

Data from the STAR*D cohort was divided into the training and the testing datasets. The training dataset was drawn from ~80% of the STAR*D data from regional centers 1–8, 10–12), while the testing dataset was drawn from the rest (~20%) of the STAR*D data from regional centers 13–15 also termed as the hold-out testing dataset. Internal cross-validation was performed using the STAR*D training dataset to tune model parameters. Data from RIS-INT-93 cohort served as an independent external test dataset.

Target variable for predictive modeling

The predictive modeling objective is to differentiate TRD from non-TRD as early as possible. Non-TRD and TRD subjects were defined by remission as individuals reaching remission after Level one or Level two of treatment trials and those not reaching remission despite two trials, respectively; and by response as individuals responding after Level one or Level two of treatment trials and those not responding despite two trials, respectively. For the STAR*D cohort, remission was defined as a score of < = 5 on QIDS-C16 (primary analysis) or Quick Inventory of Depressive Symptomatology, Self-Report (QIDS-SR, secondary analysis) at last observation carried forward (LOCF). Response was defined as a reduction of > = 50% in baseline QIDS-C16 or QIDS-SR score. For both remission and response criteria, patients must have a QIDS score greater than 5 at week 0 (baseline) of treatment level one, and remained in the study for at least four weeks. Only data from the first two levels of treatment were used in the STAR*D study. For RIS-INT-93, remission was defined as a score of < = 7 on HAM-D17, while response was defined as reduction of > = 50% in baseline HAM-D17 score after 6 weeks of citalopram treatment. Since all patients were retrospectively reported to have failed one round of antidepressant treatment, all non-remitters/non-responders from the prospective treatment were classified as TRD, while others were classified as non-TRD. In addition, subjects must remain in the study for at least six weeks to be included in the analysis.

Input features for predictive modeling

Three sets of features were included in the modeling, the full set of features (n~700) (referred to later as full set of features), the top n features from feature selection technique (referred to later as the top n features), and the set of 22 overlapping features between the STAR*D and RIS-INT-93 datasets (referred to later as overlapping features). : All the features from enrollment including information from Cumulative Illness Rating Scale (CRS), demographics (DM), psychiatric history (PHX), medication history (MHX), the Patient Rated Inventory of Side Effects (PRISE), Psychiatric Diagnostic Screening Questionnaire (PDSQ), as well as baseline and week 2 of level 1 treatment which include records from Clinic Visit Form, QIDS-C16, QIDS-SR). Each attribute related to medications was expanded into as many features as number of medications that appeared in the data with each feature representing information collected on a specific mediation. Additional features were derived based on prior knowledge and included in predictive modeling. For comorbid condition, any anxiety disorder (including posttraumatic stress disorder, panic disorder with or without agoraphobia, specific phobia, social phobia, and general anxiety disorder) was derived. In addition, studies suggest that the unidimensional subscales of the multidimensional HAM-D17,[19, 20] which capture the core depressive symptoms, outperform the full HAM-D17. We included five unidimensional subscales as discussed by Boessen et al.,[21] including the Bech melancholia scale,[22-24], the Maier-Phillipp severity subscale,[25-27] the Santen Subscale,[28] the Gibbons’ global depression severity scale,[29] and the 7-item abbreviated version (HAM-D7).[30] The five factors derived from HAM-D17 (retardation (the sum of the scores for items 1, 7, 8 and 14, with a score ranging from 0 to 14), anxiety/somatization (the sum of the scores for items 10, 11, 12, 13, 15, and 17, with a score ranging from 0 to 18), sleep disturbance (the sum of scores for items 4, 5, and 6, with a total score ranging from 0 to 6), depression, guilt association) was also included as candidate features and tested if any of them was a better predictor than HAM-D17 item-level data or full HAMD score. Both IDS-C30 and HAM-D17 were only measured at level entry and exit, while QIDS-C16 and QIDS-SR were measured every two weeks, we derived an exploratory variable IDS-C5 that was definable from QIDS-C16 and overlapped with five out of the six items defining IDS-C6/HAM-D6[24, 31], the Bech melancholia scale. Out of the six items (mood sad, involvement, fatigue, anxious mood, outlook (self), and psychomotor slowing) defining IDS-C6, all except anxious mood was definable from QIDS-C16 or QIDS-SR. We included percentage change of IDS-C5 and QIDS-C16 at week 2, QIDS-C16 at week 2 in the predictive model as well. For each data set, we eliminated all the features that have a variance less than 10−8 across the selected set of samples. After this preprocessing, we were left with around 700 starting features for each dataset. All machine learning approaches did not apply feature selection prior to model fitting. Many of the features were correlated with each other (such as QIDS-C16 was correlated with QIDS-SR and HAM-D17) and most of the machine learning approaches considered in this work do not handle correlated features well. To mitigate this and to facilitate model interpretation and application, we applied two strategies to select limited sets of features for machine learning models. The primary feature selection strategy was using a similar approach as proposed in Bühlmann et al.[32]. It is a two-step process and it as follows: Step 1: K-means clustering was first applied to all features to group features into k groups (k = 50, 75, and 100 in our experiment). The feature closest to the centroid of each cluster was chosen as the representative feature for the cluster. Step 2: To this end, a χ2 score as proposed by Liu et al [33] was calculated as a measure of feature importance to select top n features for model creation. Its implementation is available at http://featureselection.asu.edu/software.php. The method discretizes each numeric attribute based on χ2 statistics and continuously merges the intervals of each attribute while keeping the inconsistency rate under certain threshold. We adopted this method due to the ordinal nature of majority of the features such as depression symptom severity score as measured by QIDS-C16 and HAM-D17 and those measuring how depression affects patients’ daily life as exemplified by items from Work and Social Adjustment Scale (WSAS) which gives a quantitative evaluation of to what extent depression impairs activities related to social and work life. The χ2 statistics was calculated using the STAR*D training dataset or the 90% of the training dataset (during 10-fold cross-validation procedure within the training data set for optimization of model parameters). This procedure was applied for each of the target outcome and for each feature clustering configuration (k = 50, 75, 100). This feature selection strategy is called clustering- χ2 throughout the article. The second feature selection strategy was using the elastic net technique [34, 35] and it corresponds to n features with non-zero beta coefficient while optimizing the number of non-zero beta coefficient to ~30. This feature selection strategy is termed as ELNET features throughout the article. As optimizing elastic net to approximately 30 features could result in sub-optimal performance, we selected top n features optimized according to the model’s performance (area under the receiver operating characteristics curve (AUC)) via 10-fold internal cross-validation in the training dataset and the results were comparable. If not explicitly stated, the ELNET features refers to the ~30 features selected by elastic net. : There were 22 overlapping (but not exhaustive list of) features between STAR*D and RIS-INT-93 (week 0 17 HAMD item score, week 0 HAM-D17 total score, HAM-D retardation subtotal score, percentage change of HAM-D17 total score at week 2 from baseline, any anxiety disorders, and recurrent depression) and represented the third set of features for the prediction of TRD.

Predictive modeling

Machine learning methods

Five different machine learning approaches including l2 penalized logistic regression [36, 37] random forest [38] GBDT, XGBoost, and the elastic net were applied in this study. The implementations of l2 penalized logistic regression, random forest, GBDT, the elastic net available from scikit-learn [39] and XGBoost from https://xgboost.readthedocs.org/en/latest/ were used in this study.

Handling class imbalance

In our prediction tasks, subjects from different categories (remission vs. non-remission; response vs. non-response as determined by either QIDS-C16 or QIDS-SR16) were imbalanced to different degrees. A direct application of the machine learning model introduced above would lead to the obtained model biased towards the majority class. The class imbalance handling technique that has been used in Nie et al., 2015 [40] was also applied to the training data in STAR*D cohort. Basically, during the training stage, we used all the samples from minority class in the training dataset and subsample with replacement an equal number of samples from the majority class to build one model. We repeated the process t times and thus obtained t models (t = 30). In obtaining predictions for the test dataset, an ensemble model prediction corresponding to t models are obtained by averaging t model predictions (probabilities) and the performance are assessed against the target. (Fig 1).

Fig 1

Schematic diagram showing the predictive modeling process.

The STAR*D training dataset was used to create 30 subsamples with equal ratios of cases and controls and 30 models were constructed using the entire training dataset. The 30 models were used to predict the outcome for the independent STAR*D test dataset and RIS-INT-93 dataset and the predicted outcome were average across 30 models.

Schematic diagram showing the predictive modeling process.

Model characterization

Internal cross-validation (10-fold) performance metrics was reported for the training dataset using full set of features and top n features (n = 2, 3, …, k, where k = 50, 75, and 100 clusters). The model derived from the STAR*D training dataset was externally tested in a hold-out STAR*D testing dataset. For the overlapping features, the models trained using the STAR*D training dataset was both internally cross-validated in the training dataset and externally validated in the hold-out STAR*D testing data and an independent RIS-INT-93 dataset using machine learning approaches. Relevant descriptions of model performance including AUC, sensitivity, specificity, accuracy, balanced accuracy, positive predictive value (PPV), and negative predictive value (NPV) were determined.

Permutation test

To make sure that the observed model performance is not obtained by chance, the target label was randomly permutated 1,000 times in the START*D training dataset. The model’s training on the STAR*D training dataset with permuted target labels and the prediction process in STAR*D testing dataset was repeated for 1,000 times giving rise to a null distribution of AUC values. The observed AUC value was plotted in reference to the null AUC distribution, and a permutation p-value was calculated. [41]

Interpretation of the relationship between features and outcome via multiple logistic regression procedure

Classical logistic regression with forward selection was used to facilitate model interpretation, although only linear relationship could be captured in the model. With this approach, only the top n features from the STAR*D training data (n = 30 from clustering-χ2) were included in the model as candidates for forward selection. The features selected were included in a multiple logistic regression model and features with p-value less than 0.1 were examined to understand the relationship between features and outcome. Classical logistic regression was fitted using PROC LOGISTIC via SAS 9.2 (SAS Institute Inc., Cary, NC).

Results

Using the remission criteria, there were 501 TRD cases and 1463 non-TRD controls from the STAR*D training dataset and 141 TRD cases and 349 non-TRD controls from the STAR*D testing dataset using QIDS-C16 assessment (primary analysis, S1 Table). For RIS-INT-93, there were 200 TRD cases and 25 non-TRD controls definable using HAM-D17. Using the responder criteria, there were 411 TRD cases and 1797 non-TRD controls from the STAR*D training dataset and 104 TRD cases and 470 non-TRD controls from the STAR*D testing dataset using QIDS-C16 assessment. For RIS-INT-93, there were 190 TRD cases and 35 non-TRD controls definable using HAM-D17. To reduce the correlation between features, we used k-means clustering to group features into clusters and select a candidate feature representing each cluster. To explore the approximate number of top features that should be included in the predictive model without compromising on the model performance, the top n features (n = 2, 3, …, 75) from k-means clustering (k = 75 for k-means clustering, as results from k = 50 and 100 were comparable) were used to train, test the model (as shown in Fig 2) and the model’s performance plateaued at ~10–30 predictors. The variance of AUC obtained from internal cross-validation in the STAR*D training dataset for each fold was averaged across 30 models corresponding to 30 subsamples. The variance for the top n (n = 30) features was 3.9%, 3.3%, and 3.3% for random forest, l2 penalized logistic regression, and GBDT, respectively. The results for full set of features and the top n features (n = 30) were reported for subsequent analysis. We also used the ELNET features generated by elastic net (optimizing for number of features = ~30) and compared the results to the earlier top n features. In addition, the results for the overlapping features was also reported.

Fig 2

AUC for models containing top 2 to 75 representative predictors from k-means cluster (k = 75) was plotted against the number of predictors for each of the machine learning methods in the STAR*D training and test datasets, respectively.

Remission status was used to define TRD using QIDS-C16 data.

AUC for models containing top 2 to 75 representative predictors from k-means cluster (k = 75) was plotted against the number of predictors for each of the machine learning methods in the STAR*D training and test datasets, respectively.

Remission status was used to define TRD using QIDS-C16 data. For TRD phenotype defined by remission criteria using QIDS-C16 assessment, the AUC for the 10-fold internal cross-validation in the training dataset ranged from 0.73 to 0.79 with the full set of features with GBDT performing the best (S1A Fig). The out of bag (OOB) score for random forest was 0.71. The model derived from the training data was externally validated in the STAR*D test dataset with the AUC ranging from 0.70 to 0.78 for all different machine learning approaches and is shown in Fig 3 and S1D Fig. The permutation testing for all the machine learning approaches is shown in Fig 4. Model AUC, accuracy, specificity, sensitivity, PPV, and NPV for selected models were also reported in Table 1, S2 Table, and S3 Table using predicted target (ranging from -1 to 1) with 0 as a threshold for classifying predicted target. The full set of features represented the upper bound of model performance using all clinical variables. To remove the correlated structure among the features, the AUCs for the top n features (n = 30) (from the k-means clustering procedure were 0.76–0.77 in the STAR*D training data (S1B Fig) and 0.75–0.77 in the STAR*D testing data (S1E Fig). To test if the model could be generalized to a completely independent dataset, namely, RIS-INT-93, the overlapping features were both internally cross-validated in the STAR*D training data (10-fold cross-validation AUC 0.70–0.71, S1C Fig) and externally validated in the STAR*D hold out testing data (AUC 0.70–0.73, S1F Fig) and the independent RIS-INT-93 dataset (AUC 0.72–0.77, S1G Fig). Confusion matrix for GBDT model in STAR*D test dataset and independent dataset RIS-INT-93 was also reported in S4 Table. Similar results were observed for target variable using responder criteria defined using QIDS-C16 assessment (S2 Fig) and for target variable using remission and responder criteria defined QIDS-SR16 assessment (S3 Fig and S4 Fig). Overall, the models achieved AUC > = 0.7 in at least one machine learning approaches across the STAR*D testing dataset and RIS-INT-93 dataset and was independent of the phenotype definition by remission or response criteria or using QIDC-C16 or QIDS-SR16 assessment, although the performance of TRD using remission criteria generally was better than that using response criteria and the performance of the full set of features and the top n features outperformed that of the overlapping features as expected. For simplicity of reporting, we reported models with TRD defined using remission criteria and QIDS-C16 assessment moving forward if not explicitly stated. Results from alternative models could be found in the supplemental content. The results for the ELNET features (elastic net optimized for number of features = ~30 for ease of comparison with the top n features selected by clustering-χ2) were comparable and described in S5 Table.

Fig 3

Receiver operating characteristic curves in training and test dataset (STAR*D) using the full set of features, top n features (n ~ 30), and the overlapping features where remission status was used to define TRD (STAR*D remission status was defined using QIDS-C16 data, and RIS-INT-93 remission status was defined using HAM-D17).

Fig 4

Permutation process to access the model robustness.

Table 1

Model performance (outcome defined by remission using QIDS-C16) in the STAR*D testing dataset and RIS-INT-93.

Model	Accuracy	Balanced Accuracy	Sensitivity	Specificity	PPV	NPV	AUC
STAR*D test data set
Random Forest
Full set of features	0.70	0.70	0.69	0.71	0.49	0.85	0.78
Top n features (n = 30)^a	0.67	0.68	0.72	0.65	0.45	0.85	0.77
Top n features (n = 31)^b	0.70	0.69	0.68	0.71	0.48	0.85	0.76
Overlapping features	0.65	0.66	0.67	0.64	0.43	0.83	0.71
GBDT
Full set of features	0.70	0.70	0.69	0.71	0.49	0.85	0.78
Top n features (n = 30)^a	0.70	0.70	0.72	0.68	0.48	0.86	0.77
Top n features (n = 31)^b	0.70	0.70	0.68	0.71	0.49	0.85	0.76
Overlapping features	0.66	0.67	0.70	0.64	0.44	0.84	0.71
XGBoost
Full set of features	0.67	0.68	0.72	0.64	0.45	0.85	0.76
Top n features (n = 30)^a	0.66	0.67	0.67	0.66	0.44	0.83	0.73
Top n features (n = 31)^b	0.67	0.68	0.72	0.65	0.45	0.85	0.76
Overlapping features	0.68	0.68	0.67	0.69	0.47	0.84	0.72
l₂ penalized logistic regression
Full set of features	0.63	0.64	0.65	0.62	0.41	0.81	0.69
Top n features (n = 30)^a	0.71	0.71	0.68	0.73	0.50	0.85	0.73
Top n features (n = 31)^b	0.72	0.71	0.68	0.73	0.51	0.85	0.77
Overlapping features	0.67	0.67	0.68	0.66	0.45	0.84	0.74
Elastic net
Top n features (n = 31)^b	0.70	0.68	0.64	0.73	0.49	0.83	0.76
RIS-INT-93 dataset (Overlapping features)
Random Forest	0.86	0.65	0.92	0.36	0.92	0.39	0.73
GBDT	0.82	0.64	0.88	0.40	0.92	0.29	0.71
XGBoost	0.84	0.67	0.89	0.44	0.93	0.33	0.73
l₂ penalized logistic regression	0.83	0.64	0.88	0.40	0.92	0.29	0.60

GBDT: gradient boosting decision tree

a clustering- χ2

bELNET features

Permutation process to access the model robustness.

The outcome label of the STAR*D training dataset was randomly shuffled 1,000 times and the AUC distribution of the 1,000 null models were plotted for each machine learning machine method (A) XGBoost, (B) Random Forest, (C) l2 penalized logistic regression, and (D) GBDT. In all cases, the observed AUC out-perform the random noise from the 1,000 null models. GBDT: gradient boosting decision tree a clustering- χ2 bELNET features The top groups of features identified to be important based on the clustering- χ2 score and most consistent between outcome definitions (remission and response) and symptom severity measurements (QIDS-C16 and QIDS-SR16) were represented by: QIDS total score at week 2 (self-reported and clinician-rated QIDS total score were strongly correlated and grouped together with percentage change in symptom severity score at week 2 from week 0 into the same cluster by k-means clustering procedure, see S6 Table for top 20 groups of features clustered using k-means clustering), Quality of Life Enjoyment and Satisfaction Questionnaire (QLESQ) total score, Work and Social Adjustment Scale (WSAS) total score, and Short Form Health Survey (SFHS, SF-12) Physical component. The top n (n = 10) representative features from clustering- χ2 were also shown in Fig 5 and S5 Fig. They represented the feature importance in a univariate sense. The ELNET features (S7 Table) and were largely consistent with the univariate features of importance except that multiple correlated features were selected as expected.

Fig 5

variable of importance in statistical learning approaches for outcomes defined by (A) remission status (B) responder status. In both cases, the outcomes were defined using QIDS-C16. SFHS: Short Form Health Survey (SF-12); WSAS: The Work and Social Adjustment Scale; *from PRISE: The Patient Rated Inventory of Side Effects, which collected symptoms one had experienced in the past week. Those symptoms may or may not have been caused by the treatment. To facilitate the interpretation of the relationship between features and target and to consider multiple features simultaneously (in contrast to χ2 statistics), both classical logistic regression model with forward selection and the elastic net and were evaluated. For the classical logistic regression, starting from top 30 features from the STAR*D training data, 8 variables were selected in forward selection procedure and 6 made it to the final logistic regression model (Table 2, S7 Table). In the STAR*D training dataset, more severe QIDS-C16 symptom severity total score at week 2 [or higher percentage change in QIDS-C16 at week 2 from baseline (i.e. less improvement) since both variables were clustered together in k-means clustering procedure (S6 Table) although QIDS-C16 symptom severity total score at week 0 will add additional explanatory value if including percentage change at week 2 from week 0 in the model], anxiety/chronicity (PDSQ questions), lower level of Physical Health Composite Scores (PCS), nervous system symptom (PRISE question) and WSAS total score were associated with increased odds of being TRD (Table 2). For PCS (ranging from 0 to 100), the odds ratio is less than 1 since a zero score indicates the lowest level of health measured by the scales and 100 indicates the highest level of health. Similar results for TRD defined using responder criteria were reported in S8 Table. When testing these six variables in the STAR*D testing dataset, all variables predicted outcome in the same direction as the STAR*D training data and both QIDS-C16 symptom severity total score at week 2 and Physical Health Composite Scores (PCS) reached statistical significance (p-value < 0.05). For the elastic net, the sign of β coefficient (S8 Table) was consistent with the directionality of odds ratio and most variables with non-zero β coefficients were consistent except correlated variables were also grouped in the model.

Table 2

Predictors from PROC LOGISTIC for TRD phenotype defined using remitter criteria in the STAR*D training and testing datasets.

Effect	Dataset	Pr > ChiSq	Odds Ratio Point Estimate	Odds Ratio 95% Wald Confidence Limits
QIDS-C₁₆ total score at week 2 (Level 1)	Training*	< .0001	1.198	(1.161, 1.236)
QIDS-C₁₆ total score at week 2 (Level 1)	Testing*	< .0001	1.211	(1.133, 1.294)
SFHS Physical Health Composite Scores	Training	< .0001	0.978	(0.969, 0.988)
SFHS Physical Health Composite Scores	Testing	0.0072	0.972	(0.952, 0.992)
WSAS total score	Training	< .0001	1.031	(1.016, 1.046)
WSAS total score	Testing	0.1072	1.023	(0.995, 1.051)
PDSQ: being in crowded places make one feel fearful, anxious, or nervous because you were afraid you'd have an anxiety attack in the situation during the past six months	Training	0.0064	1.383	(1.095, 1.747)
	Testing	0.0615	1.567	(0.979, 2.508)
PRISE: how bad was the worst nervous system symptoms (headache, tremors, poor coordination, dizziness) that one has experienced during the past week regardless of cause (0—not present, 1- tolerable, 2- distressing)	Training	0.0336	1.193	(1.014, 1.403)
	Testing	0.1813	1.270	(0.895, 1.802)
PDSQ: did you feel sad or down on most days of the past 2 years?	Training	0.0609	1.261	(0.989, 1.608)
	Testing	0.1653	1.426	(0.864, 2.354)

*Sample size for final multiple logistic regression model is 501 TRD cases vs. 1463 non-TRD controls in the STAR*D training data, and 141 TRD and 349 non-TRD in the STAR*D testing data. SFHS: Short Form Health Survey (SF-12). WSAS: The Work and Social Adjustment Scale. PRISE: The Patient Rated Inventory of Side Effects. PDSQ: Psychiatric Diagnostic Screening Questionnaire™ Lastly, we considered an overly simplified model using early response (defined by having less than 20% improvement at week 2) to predict TRD outcome. In the STAR*D training dataset, the model accuracy for predicting TRD is 62 in the STAR*D training data (sensitivity = 0.52, specificity = 0.77, PPV = 0.75, NPV = 0.54). Model performances in the STAR*D testing dataset and RIS-INT-93 dataset were comparable and is shown in S9 Table.

Discussion

We developed a series of machine learning models to predict TRD after two trials of treatment regimens using clinical and socio-demographic data. The models were trained using three sets of features, namely, all set of features (using all available STAR*D features from enrollment and week 0 (all features) and week 2 (only symptom severity score and percentage change from baseline were used) of level 1 treatment), top n features (a subset of top 30 representative features selected by clustering- χ2 and elastic net), and overlapping features (a subset of 22 overlapping features between STAR*D and RIS-INT-93) were externally validated and performed reasonably well with AUC in the range from 0.70 to-0.78 in the STAR*D hold-out testing dataset and 0.74–0.78 in RIS-INT-93 independent dataset. Although the full set of features provided an upper bound using clinical and socio-demographic features, the large number of features makes it impractical to deploy the model in a real-world setting. A reduced set of top n (n = 30) features from clustering- χ2 procedure is more practical if we were to design a new study and collect features to predict TRD. In order to use data from existing studies as independent testing dataset, we could only work with the limitation of overlapping features. The overall accuracy of the GBDT model using top 30 features (k = 75) was 0.70 for the STAR*D testing dataset (predicted outcome ranging from -1 to 1, using mid-point 0 as the threshold to infer TRD vs. non-TRD status). The accuracy in predicting TRD and non-TRD were comparable (0.72 and 0.68, respectively), Assuming there may be a population of ambiguous subjects who may be difficult to predict based on clinical and socio-demographic features alone, we focused only on extreme 10 and 20 percentiles of the predicted outcome distribution (of being TRD or being non-TRD). The accuracy for extreme 10 and 20 percentiles for inferring non-TRD was 0.98 (1 error out of 49 subjects) and 0.95 (5 errors out of 98 subjects), while the accuracy for extreme 10 and 20 percentiles for inferring TRD was 0.69% (15 errors out of 49 subjects) and 0.60 (39 errors out of 98 subjects), respectively, suggesting it is easier to predict non-TRD than TRD in the STAR*D testing dataset for those especially among those with extreme predicted outcome. The default choice of using mid-point in predicted probability to classify subjects is arbitrary and not necessarily the optimal threshold. We prefer to report AUC as it takes into different threshold cut points into consideration. For a predictive model to be useful in predicting individual outcome, a pre-specified threshold shall be used. We therefore also explored a specific cut point such as having a predicted score less than -0.6 predicts TRD, while having a predicted score great than 0.6 predicts non-TRD. Using these thresholds, the accuracy for non-TRD and TRD prediction was 0.90 and 0.53, respectively, in the STAR*D testing dataset. For the model using overlapping features, the overall accuracy of the GBDT model was 0.82 for RIS-INT-93 dataset. The accuracy in predicting TRD and non-TRD were 0.88 and 0.40, respectively. Using the same -0.6 and 0.6 for predicting TRD and non-TRD, the accuracy for TRD and non-TRD prediction was 0.93 (11 errors out of 154 predictions) and 0.40 (15 errors out of 25 predictions), respectively, in the RIS-INT-93 dataset. It is unlikely that the model performance and these observations merely reflect the group make up as STAR*D samples had approximately 2.5 times more non-TRD subjects than TRD subjects, while RIS-INT-93 study had more TRD than non-TRD subjects, as the actual model out-performed the null models in all cases of the permutation testing. It is somewhat surprising that the performance of models in the independent RIS-INT-93 dataset was generally somewhat superior to the STAR*D testing dataset. In the STAR*D training dataset, symptom severity score at week 2 was selected over symptom severity score at week 0 and percentage change in symptom score, suggesting the closer the symptom measurement is to the future outcome the more predictive the feature is in predicting the future outcome. In RIS-INT-93, level 1 failure was retrospectively reported and therefore only level 2 measurements were available. Therefore, it is possible that RIS-INT-93 prediction problem might be an easier test case compared to STAR*D study design. Furthermore, the outcome measurement was at week 12 (if there is no missing data or last observation carried forward (LOCF) but minimally week 4) in STAR*D, while in RIS-INT-93 the outcome measurement was at week 6 (or earlier but at minimal week 4). In addition, the fewer number of non-TRD subjects in RIS-INT-93 (25 to 35) as against STAR*D (349 to 470) could also be a potential limitation of the current work. It is therefore important to further validate the models developed from the STAR*D training to independent studies of different design but preferably two stages of prospective interventional studies such as the Establishing Moderators and Biosignatures of Antidepressant Response in Clinical Care (EMBARC)[42-45] and the Canadian Biomarker Integration Network in Depression (CAN-BIND)[46].

Area under the receiver operating characteristic curves (ROC AUC) for TRD as defined by remission status and QIDS-C16.

ROC AUC in the training and test dataset (STAR*D) using the full set of features, top n features and the overlapping features in all three datasets. (A-C STAR*D training data; D-F STAR*D test data; and G RIS-INT-93 test data) where remission status was used to define TRD (STAR*D response status was defined using QIDS-C data, while RIS-INT-93 response status was defined using HAM-D17). (DOCX) Click here for additional data file.

ROC AUC for TRD as defined by responder status and QIDS-C16.

ROC curves in the training and test dataset (STAR*D) using full set of features, top n features and the overlapping features in all three datasets. (A-C STAR*D training data; D-F STAR*D test data; G RIS-INT-93 test data) where response status was used to define TRD (STAR*D response status was defined using QIDS-C data, and RIS-INT-93 response status was defined using HAM-D17). (DOCX) Click here for additional data file.

ROC AUC for TRD as defined by remission status and QIDS-SR16.

ROC curves in the training and test dataset (STAR*D) using full set of features, top n features and the overlapping features in all three datasets. (A-C STAR*D training data; D-F STAR*D test data; G RIS-INT-93 test data) where remission status was used to define TRD (STAR*D remission status was defined using QIDS-SR16 data, and RIS-INT-93 remission status was defined using HAM-D17). (DOCX) Click here for additional data file.

ROC AUC for TRD as defined by responder status and QIDS-SR16.

ROC curves in the training and test dataset (STAR*D) using full set of features, top n features and the overlapping features in all three datasets. (A-C STAR*D training data; D-F STAR*D test data; G RIS-INT-93 test data) where response status was used to define TRD (STAR*D response status was defined using QIDS-SR data, and RIS-INT-93 response status was defined using HAM-D17). (DOCX) Click here for additional data file.

Variables of importance for outcomes defined using QIDS-SR16.

Feature importance for outcomes defined by (A) remission status (B) responder status. In both cases, the outcomes were defined using QIDS-SR. (DOCX) Click here for additional data file.

Sample size for the training and the testing datasets from STAR*D and an independent RIS-INT-93 dataset.