| Literature DB >> 33083570 |
Yijun Zhao¹, Tong Wang¹, Riley Bove²,³,⁴, Bruce Cree²,³,⁴, Roland Henry²,³,⁴, Hrishikesh Lokhande⁵, Mariann Polgar-Turcsanyi³,⁴,⁵, Mark Anderson³,⁴,⁵, Rohit Bakshi³,⁴,⁵, Howard L Weiner³,⁴,⁵, Tanuja Chitnis³,⁴,⁵.
Abstract
The rate of disability accumulation varies across multiple sclerosis (MS) patients. Machine learning techniques may offer more powerful means to predict disease course in MS patients. In our study, 724 patients from the Comprehensive Longitudinal Investigation in MS at Brigham and Women's Hospital (CLIMB study) and 400 patients from the EPIC dataset, University of California, San Francisco, were included in the analysis. The primary outcome was an increase in Expanded Disability Status Scale (EDSS) ≥ 1.5 (worsening) or not (non-worsening) at up to 5 years after the baseline visit. Classification models were built using the CLIMB dataset with patients' clinical and MRI longitudinal observations in the first 2 years, and further validated using the EPIC dataset. We compared the performance of three popular machine learning algorithms (SVM, Logistic Regression, and Random Forest) and three ensemble learning approaches (XGBoost, LightGBM, and a meta-learner). A "threshold" was established to trade off performance between the two classes. Predictive features were identified and compared among the different models. Machine learning models achieved AUC scores of 0.79 and 0.83 for the CLIMB and EPIC datasets, respectively, shortly after disease onset. Ensemble learning methods were more effective and robust than standalone algorithms. Two ensemble models, XGBoost and LightGBM, were superior to the other four models evaluated in our study. Of the variables evaluated, EDSS, Pyramidal Function, and Ambulatory Index were the top common predictors of the MS disease course. Machine learning techniques, in particular ensemble methods, offer increased accuracy for the prediction of MS disease course.
Keywords: Multiple sclerosis
Year: 2020 PMID: 33083570 PMCID: PMC7567781 DOI: 10.1038/s41746-020-00338-8
Source DB: PubMed Journal: NPJ Digit Med ISSN: 2398-6352
ML models applied to the CLIMB dataset with varying thresholds.
| Threshold | Model | Sensitivity | Specificity | Overall |
|---|---|---|---|---|
| 0.5 | SVM | 0.60 | 0.70 | 0.68 |
| | Logistic Regression | 0.70 | 0.71 | 0.71 |
| | Random Forest | 0.72 | 0.73 | 0.73 |
| | XGBoost | 0.50 | 0.87 | 0.79 |
| | LightGBM | 0.51 | 0.86 | 0.78 |
| | Meta-Lᵃ | 0.61 | 0.84 | 0.79 |
| 0.45 | SVM | 0.75 | 0.64 | 0.67 |
| | Logistic Regression | 0.76 | 0.62 | 0.65 |
| | Random Forest | 0.83 | 0.51 | 0.58 |
| | XGBoost | 0.58 | 0.82 | 0.77 |
| | LightGBM | 0.52 | 0.85 | 0.77 |
| | Meta-Lᵃ | 0.71 | 0.74 | 0.73 |
| 0.4 | SVM | 0.81 | 0.51 | 0.58 |
| | Logistic Regression | 0.81 | 0.57 | 0.62 |
| | Random Forest | 0.91 | 0.34 | 0.47 |
| | XGBoost | 0.68 | 0.76 | 0.74 |
| | LightGBM | 0.58 | 0.82 | 0.77 |
| | Meta-Lᵃ | | | |
| 0.35 | SVM | 0.92 | 0.34 | 0.47 |
| | Logistic Regression | 0.86 | 0.49 | 0.57 |
| | Random Forest | 0.98 | 0.11 | 0.31 |
| | XGBoost | | | |
| | LightGBM | 0.70 | 0.76 | 0.75 |
| | Meta-Lᵃ | 0.86 | 0.50 | 0.58 |
| 0.3 | SVM | 0.96 | 0.21 | 0.38 |
| | Logistic Regression | 0.91 | 0.41 | 0.52 |
| | Random Forest | 0.99 | 0.06 | 0.27 |
| | XGBoost | | | |
| | LightGBM | | | |
| | Meta-Lᵃ | 0.93 | 0.35 | 0.48 |
Bold numbers indicate models of high practical value.
ᵃEnsemble of SVM, Logistic Regression, and Random Forest.
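The threshold sweep in the table above can be sketched in a few lines: lowering the decision threshold on the predicted "worsening" probability raises sensitivity at the cost of specificity. A minimal numpy illustration with made-up scores (not study data):

```python
import numpy as np

def sensitivity_specificity(y_true, y_prob, threshold):
    """Label a patient 'worsening' when the predicted probability
    meets the threshold, then score each class separately."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    return tp / (tp + fn), tn / (tn + fp)

# Illustrative model scores, not values from the study
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.55, 0.35, 0.6, 0.4, 0.3, 0.2, 0.1])
results = {t: sensitivity_specificity(y_true, y_prob, t)
           for t in (0.5, 0.4, 0.3)}
```

As in the table, each step down in threshold trades specificity for sensitivity, which is the knob the study uses to balance the two classes.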
Model validation using overlapping attributes and annual observations.
| Threshold | Model | Sensitivity (CLIMB) | Sensitivity (EPIC) | Specificity (CLIMB) | Specificity (EPIC) | Overall (CLIMB) | Overall (EPIC) |
|---|---|---|---|---|---|---|---|
| 0.5 | SVM | 0.63 | 0.81 | 0.75 | 0.70 | 0.72 | 0.74 |
| | Logistic Regression | 0.64 | 0.76 | 0.78 | 0.72 | 0.75 | 0.73 |
| | Random Forest | 0.62 | 0.83 | 0.77 | 0.65 | 0.74 | 0.71 |
| | XGBoost | 0.58 | 0.75 | 0.75 | 0.71 | 0.71 | 0.72 |
| | LightGBM | 0.56 | 0.62 | 0.75 | 0.83 | 0.71 | 0.76 |
| | Meta-Lᵃ | 0.61 | 0.78 | 0.79 | 0.76 | 0.75 | 0.77 |
| 0.45 | SVM | | 0.90 | | 0.45 | 0.64 | 0.60 |
| | Logistic Regression | 0.69 | 0.83 | 0.69 | 0.65 | 0.69 | 0.71 |
| | Random Forest | 0.73 | 0.90 | 0.63 | 0.53 | 0.65 | 0.65 |
| | XGBoost | 0.68 | 0.79 | 0.70 | 0.66 | 0.70 | 0.70 |
| | LightGBM | 0.69 | 0.69 | 0.68 | 0.77 | 0.68 | 0.74 |
| | Meta-Lᵃ | 0.70 | 0.85 | 0.68 | 0.70 | 0.68 | 0.75 |
| 0.4 | SVM | 0.84 | 0.93 | 0.47 | 0.42 | 0.55 | 0.59 |
| | Logistic Regression | | | | | | |
| | Random Forest | 0.85 | 0.92 | 0.54 | 0.39 | 0.61 | 0.56 |
| | XGBoost | | | | | | |
| | LightGBM | | | | | | |
| | Meta-Lᵃ | | | | | | |
| 0.35 | SVM | 0.92 | 0.96 | 0.37 | 0.32 | 0.50 | 0.53 |
| | Logistic Regression | 0.86 | 0.92 | 0.51 | 0.51 | 0.59 | 0.64 |
| | Random Forest | 0.89 | 0.96 | 0.45 | 0.31 | 0.55 | 0.52 |
| | XGBoost | 0.85 | | 0.54 | | 0.61 | 0.69 |
| | LightGBM | 0.85 | | 0.52 | | 0.60 | 0.73 |
| | Meta-Lᵃ | 0.88 | 0.93 | 0.49 | 0.52 | 0.58 | 0.65 |
| 0.3 | SVM | 0.93 | 0.98 | 0.25 | 0.23 | 0.40 | 0.47 |
| | Logistic Regression | 0.90 | 0.93 | 0.41 | 0.48 | 0.52 | 0.63 |
| | Random Forest | 0.95 | 0.95 | 0.30 | 0.24 | 0.45 | 0.47 |
| | XGBoost | 0.90 | 0.90 | 0.45 | 0.56 | 0.55 | 0.67 |
| | LightGBM | 0.92 | 0.86 | 0.42 | 0.62 | 0.53 | 0.70 |
| | Meta-Lᵃ | 0.93 | 0.96 | 0.38 | 0.37 | 0.51 | 0.56 |

Regression coef. (p value): Sensitivity 1.08 (6.9E−08), 0.65 (0.81); Specificity 0.77 (8.6E−09), 0.70 (0.84); Overall 0.88 (1.8E−08), 0.68 (0.83).
Bold numbers indicate models of high practical value.
ᵃEnsemble of SVM, Logistic Regression, and Random Forest.
AUC scores of six models across the two datasets.
| Model | CLIMB_allᵃ | CLIMB_partᵇ | EPIC |
|---|---|---|---|
| SVM | 0.75 | 0.76 | 0.81 |
| Logistic Regression | 0.78 | 0.77 | 0.81 |
| Random Forest | 0.77 | 0.77 | 0.82 |
| XGBoost | 0.78 | 0.76 | 0.82 |
| LightGBM | 0.78 | 0.76 | 0.82 |
| Meta-learner | 0.79 | 0.78 | 0.83 |
ᵃModels trained using complete CLIMB data.
ᵇModels trained using overlapping features of the CLIMB and EPIC datasets, and annual observations.
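The AUC scores reported above are threshold-free: AUC equals the probability that a randomly chosen "worsening" patient receives a higher predicted score than a randomly chosen "non-worsening" patient. A small numpy sketch of that rank-based (Mann-Whitney) formulation, using made-up scores rather than study data:

```python
import numpy as np

def auc_score(y_true, y_prob):
    """AUC via the Mann-Whitney formulation: the fraction of
    positive/negative pairs ranked correctly (ties count half)."""
    pos = y_prob[y_true == 1]
    neg = y_prob[y_true == 0]
    wins = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Illustrative scores, not values from the study
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.55, 0.35, 0.6, 0.4, 0.3, 0.2, 0.1])
auc = auc_score(y_true, y_prob)
```

Because AUC depends only on the ranking of scores, it summarizes a model across all thresholds at once, which is why the preceding tables vary the threshold separately.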
Top ten predictive features identified by five models using the CLIMB dataset.
| Rank | SVM | Logistic Regression | Random Forest |
|---|---|---|---|
| 1 | |||
| 2 | |||
| 3 | ∆LESION_VOLUME | ∆AMBULATORY_INDEX | |
| 4 | ∆DISEASE_CATEGORY | MRI_STATUS | AMBULATORY_INDEX |
| 5 | ∆AMBULATORY_INDEX | ∆DISEASE_CATEGORY | DISEASE_ACTIVITY |
| 6 | AMBULATORY_INDEX | BOWEL_BLADDER_FUNCTION | DISEASE_STEP |
| 7 | BOWEL_BLADDER_FUNCTION | DISEASE_ACTIVITY | ∆AMBULATORY_INDEX |
| 8 | ∆TOTAL_GD | ∆TOTAL_GD | ∆SENSORY_FUNCTION |
| 9 | DISEASE_ACTIVITY | AMBULATORY_INDEX | DISEASE_CATEGORY |
| 10 | ∆WALKING_ABILITY | DISEASE_COURSE_SUBTYPE | ∆BPF |
∆: change in the indicated variable.
AMBULATORY_INDEX: ordinal scale of gait capacity.
ATTACKPREV2Y: number of clinical relapses (attacks) in the previous 2 years.
BOWEL_BLADDER_FUNCTION: measure of bowel and bladder function from 0 (normal) to 6 (loss of bowel and bladder function).
DISEASE_ACTIVITY: physician-reported metric of current inflammatory or progressive disease status.
DISEASE_CATEGORY: code indicating disease categories, such as primary progressive, secondary progressive, etc.
DISEASE_STEP: scale of disability.
EDSS: overall neurologic disability score.
FAMILY_MS: code indicating family history of MS, including mother, father, sibling, cousin, etc.
LESION_VOLUME: measured brain T2 lesion volume.
MRI_STATUS: presence of new MRI lesions.
PYRAMIDAL_FUNCTION: measure of pyramidal function from 0 (normal) to 6 (tetraplegia).
SENSORY_FUNCTION: measure of sensory disability ranging from 0 (normal) to 6 (sensation lost below the head).
BPF: brain parenchymal fraction.
TOTAL_GD: total number of Gad+ lesions.
VISIT_AGE: age of the subject.
Top ten predictive features identified by five models using the EPIC dataset.
| Rank | SVM | Logistic Regression | Random Forest |
|---|---|---|---|
| 1 | ∆EDSS | ∆EDSS | ∆EDSS |
| 2 | BRAIN_GREY_VOLUME | ∆PYRAMIDAL_FUNCTION | EDSS |
| 3 | CEREBELLAR_FUNCTION | VISIT_AGE | ∆PYRAMIDAL_FUNCTION |
| 4 | ∆PYRAMIDAL_FUNCTION | VENTRICULAR_CSF_VOLUME | PYRAMIDAL_FUNCTION |
| 5 | ATTACKPREV2Y | CEREBELLAR_FUNCTION | BRAIN_WHITE_VOLUME |
| 6 | PYRAMIDAL_FUNCTION | ATTACKPREV2Y | SENSORY_FUNCTION |
| 7 | VENTRICULAR_CSF_VOLUME | ∆MENTAL_FUNCTION | VENTRICULAR_CSF_VOLUME |
| 8 | VISIT_AGE | MENTAL_FUNCTION | ∆BOWEL_BLADDER_FUNCTION |
| 9 | ∆BRAIN_GREY_VOLUME | PYRAMIDAL_FUNCTION | TIMED_WALK_TRIAL |
| 10 | MENTAL_FUNCTION | BRAIN_GREY_VOLUME | BRAIN_GREY_VOLUME |
∆: change in the indicated variable.
ATTACKPREV2Y: number of clinical relapses (attacks) in the previous 2 years.
BOWEL_BLADDER_FUNCTION: measure of bowel and bladder function from 0 (normal) to 6 (loss of bowel and bladder function).
BRAIN_GREY_VOLUME: total brain gray matter volume.
BRAIN_WHITE_VOLUME: total brain white matter volume.
CEREBELLAR_FUNCTION: measure of cerebellar function from 0 (normal) to 5 (severe ataxia).
EDSS: overall neurologic disability score.
MENTAL_FUNCTION: measure of mental function from 0 (normal) to 5 (dementia).
PYRAMIDAL_FUNCTION: measure of pyramidal function from 0 (normal) to 6 (tetraplegia).
SENSORY_FUNCTION: measure of sensory function from 0 (normal) to 6 (loss of sensation below head).
TIMED_WALK_TRIAL: average time (in seconds) for two trials of the 25-foot walk.
VENTRICULAR_CSF_VOLUME: volume of the cerebrospinal fluid in the ventricles. In the EPIC study, this is usually reported in cm³.
VISIT_AGE: age of the subject.
Comparison of the CLIMB and EPIC datasets.
| Category | CLIMB | EPIC | Common |
|---|---|---|---|
| # of subjects | 724 | 400 | n/a |
| # of “worsening” subjects | 165 | 130 | n/a |
| # of demographic features | 24 | 10 | 5 |
| # longitudinal features | 44 | 35 | 14 |
| Clinical visit frequency | 6 months | 12 months | 12 months |
| Common demographic features | | | Age; Gender; Ethnicity; Race; Smoking history |
| Common longitudinal features | | | Attack previous 6 m; Attack previous 2 y; Bowel–bladder function; Brainstem function; Cerebellar function; Disease category; EDSS; Lesion volume; Mental function; Pyramidal function; Sensory function; Total GD; Visual function; Walk 25 ft time |
Fig. 1: Illustration of three baseline machine learning models.
a Support Vector Machine: red squares and blue circles represent data from different classes. The optimal decision plane achieves the largest separation, or margin, between the two classes. b A Random Forest with n decision trees. Each tree is trained with a randomly sampled subset of training data. Predictions from all trees are combined using majority voting to produce a final decision. c Logistic Regression with one dependent variable. The blue line is the linear regression model of the observed data. The sigmoid function transforms the linear model’s predictions into values between 0 and 1, which indicate the observations’ likelihood of belonging to the positive class.
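Two of the panel mechanics can be sketched directly in code. This is a generic numpy illustration; the line coefficients and tree votes are invented for demonstration, not fitted to the study data:

```python
import numpy as np

# Panel c: the sigmoid maps a linear model's output into (0, 1),
# read as the likelihood of belonging to the positive class.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = 1.2, -0.5                      # hypothetical fitted line z = w*x + b
probs = sigmoid(w * np.array([-2.0, 0.0, 2.0]) + b)

# Panel b: majority voting over n decision trees' 0/1 class votes.
tree_votes = np.array([[1, 0, 1],     # instance 1: votes from 3 trees
                       [0, 0, 1]])    # instance 2
majority = (tree_votes.mean(axis=1) >= 0.5).astype(int)
```

The sigmoid is monotone, so larger linear scores always mean larger predicted probabilities; majority voting simply takes the class predicted by more than half of the trees.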
Fig. 2: Illustration of ensemble learning and adaptive boosting.
a Ensemble learning: L₁, L₂, …, Lₙ are independent learners trained on the entire training data D. The stacked generalizer is a logistic regression model trained to produce a final prediction P based on the decisions from the individual classifiers. Model performance is measured using the final predictions. b Adaptive boosting: checkmarks and crosses indicate correctly and incorrectly classified instances, respectively. The heights of the rectangles are proportional to the weights of the training instances. A sequence of learners, L₁, L₂, …, Lₙ, is generated, with each new model trained on a re-weighted dataset that boosts the weights of the instances misclassified by the previous model.
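The stacked-generalizer idea in panel a can be sketched end to end: base learners score the training data, and a logistic regression meta-model is fit on those scores. In this numpy sketch the base learners are stand-in linear scorers (not the paper's SVM/Logistic Regression/Random Forest), the meta-model is fit by plain gradient descent, and the data are synthetic; every name and number here is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data D: 2 features, binary labels
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Stand-in base learners L1..L3: each maps X to a score in (0, 1).
base_weights = [np.array([1.0, 0.0]),
                np.array([0.0, 1.0]),
                np.array([0.7, 0.7])]
P = np.column_stack([sigmoid(X @ w) for w in base_weights])

# Stacked generalizer: logistic regression on the base learners'
# outputs, fit by gradient descent on the log-loss.
w, b = np.zeros(P.shape[1]), 0.0
for _ in range(2000):
    p = sigmoid(P @ w + b)
    w -= 0.5 * (P.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

final = (sigmoid(P @ w + b) >= 0.5).astype(float)
accuracy = np.mean(final == y)
```

The meta-model learns how much to trust each base learner; here it should lean on the third scorer, whose ranking happens to match the labels, which is the same division of labor the figure's stacked generalizer performs over the study's classifiers.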