Yung-Chuan Huang, Yu-Chen Cheng, Mao-Jhen Jhou, Mingchih Chen, Chi-Jie Lu.
Abstract
Our study aims to develop an effective integrated machine learning (ML) scheme to predict vascular events and bleeding in patients with nonvalvular atrial fibrillation taking dabigatran, and to identify important risk factors. This study is a post-hoc analysis of the Randomized Evaluation of Long-Term Anticoagulant Therapy trial database. One traditional prediction method, logistic regression (LGR), and four ML techniques (naive Bayes, random forest (RF), classification and regression tree, and extreme gradient boosting (XGBoost)) were combined to construct our scheme. In predicting vascular events, the area under the receiver operating characteristic curve (AUC) of RF (0.780) and XGBoost (0.717) was higher than that of LGR (0.674). In predicting bleeding, the AUC of RF (0.684) and XGBoost (0.618) was also higher than that of LGR (0.605). Our integrated ML feature selection scheme, based on these two best-performing techniques, identified age, history of congestive heart failure and myocardial infarction, smoking, kidney function, and body mass index as major variables for vascular events, and age, kidney function, smoking, bleeding history, concomitant use of specific drugs, and dabigatran dosage as major variables for bleeding. ML is an effective approach for analyzing complex medical data. Our results may provide preliminary direction for precision medicine.
Keywords: arrhythmia; cardioembolic stroke; dabigatran; machine learning; non-vitamin K antagonist oral anticoagulants
Year: 2022 PMID: 35629177 PMCID: PMC9146635 DOI: 10.3390/jpm12050756
Source DB: PubMed Journal: J Pers Med ISSN: 2075-4426
Figure 1. Flow chart of the proposed integrated ML feature selection scheme.
Description of predictor and target variables in this study.
| Variables | Description | Units |
|---|---|---|
| V1 | Sex | - |
| V2 | Age | years |
| V3 | BMI | kg/m² |
| V4 | Body weight | kg |
| V5 | Ethnicity | - |
| V6 | Hypertension history | - |
| V7 | Kidney function (GFR) | mL/min/1.73 m² |
| V8 | Previous stroke history | - |
| V9 | Previous bleeding history | - |
| V10 | Concomitant use of drugs | - |
| V11 | History of MI | - |
| V12 | History of DM | - |
| V13 | History of CHF | - |
| V14 | Smoking | - |
| V15 | History of systemic embolism | - |
| V16 | Liver function abnormality # | - |
| V17 | Anemia | g/dL |
| V18 | Medicine dosage (dabigatran) | - |
| P1 | Vascular events † | - |
| P2 | Bleeding events * | - |
Abbr.: BMI, body mass index; GFR, glomerular filtration rate; TIA, transient ischemic attack; NSAID, nonsteroidal anti-inflammatory drug; COX, cyclooxygenase; MI, myocardial infarction; DM, diabetes mellitus; CHF, congestive heart failure. # Liver function abnormality defined as a medical history of cirrhosis or abnormal biochemical data when the patients were enrolled (bilirubin level more than two times the upper limit of normal, plus one or more of aspartate transaminase, alanine transaminase, or alkaline phosphatase level more than three times the upper limit of normal). † Vascular events defined as stroke, myocardial infarction, systemic embolism, and vascular death. * Major bleeding was defined as blood loss with a decrease in hemoglobin level of ≥2 g/dL (1.2 mmol/L), transfusion of ≥2 packed red blood cells, or symptomatic bleeding in a critical area or organ. Critical areas were intraocular, intracranial (including hemorrhagic stroke), intraspinal, intramuscular with compartment syndrome, retroperitoneal, intraarticular, or pericardial.
Subjects’ demographics.
| Characteristics | n (%) |
|---|---|
| V1 Sex | |
| 0: Male | 7519 (63.70) |
| 1: Female | 4284 (36.30) |
| V2 Age (years) | |
| 1: <65 | 1982 (16.79) |
| 2: ≥65 and <75 | 5123 (43.41) |
| 3: ≥75 | 4697 (39.80) |
| V3 BMI (kg/m²) | |
| 1: <18.5 | 123 (1.04) |
| 2: ≥18.5 and <30 | 7589 (64.30) |
| 3: ≥30 | 4091 (34.66) |
| V4 Body weight (kg) | |
| 0: <60 | 1098 (9.30) |
| 1: ≥60 | 10,705 (90.70) |
| V5 Ethnicity | |
| 0: Arab/others | 3594 (30.45) |
| 1: European | 8209 (69.55) |
| V6 Hypertension history | |
| 0: Record of hypertension that required medical treatment | 9301 (78.80) |
| 1: No | 2502 (21.20) |
| V7 Kidney function (GFR, mL/min/1.73 m²) | |
| 1: <30 | 45 (0.38) |
| 2: ≥30 and <50 | 2245 (19.02) |
| 3: ≥50 | 9513 (80.60) |
| V8 Previous stroke history | |
| 0: Yes | 2366 (20.05) |
| 1: No | 9437 (79.95) |
| V9 Previous bleeding history | |
| 0: Yes | 774 (6.56) |
| 1: No | 11,029 (93.44) |
| V10 Concomitant use of drugs | |
| 0: Yes | 2845 (24.10) |
| 1: No | 8958 (75.90) |
| V11 History of myocardial infarction | |
| 0: Yes | 1982 (16.79) |
| 1: No | 9821 (83.21) |
| V12 History of diabetes mellitus | |
| 0: Yes | 2739 (23.21) |
| 1: No | 9064 (76.79) |
| V13 History of congestive heart failure | |
| 0: Yes | 4125 (34.95) |
| 1: No | 7678 (65.05) |
| V14 Smoking | |
| 1: Never | 5781 (48.98) |
| 2: Current | 867 (7.35) |
| 3: Former | 5155 (43.68) |
| V15 History of systemic embolism | |
| 0: Yes | 306 (2.59) |
| 1: No | 11,497 (97.41) |
| V16 Liver function abnormality | |
| 0: Presence of liver function abnormality | 84 (0.71) |
| 1: No | 11,719 (99.29) |
| V17 Anemia | |
| 0: Hemoglobin ≥10 g/dL | 11,773 (99.75) |
| 1: Hemoglobin <10 g/dL | 30 (0.25) |
| V18 Medicine dosage (dabigatran) | |
| 1: 110 mg | 5870 (49.73) |
| 2: 150 mg | 5933 (50.27) |
| P1 Vascular events | |
| 0: No | 11,485 (97.31) |
| 1: Yes | 318 (2.69) |
| P2 Bleeding events | |
| 0: No | 9565 (81.04) |
| 1: Yes | 2238 (18.96) |
Abbr.: BMI, body mass index; GFR, glomerular filtration rate.
Summary of the hyperparameter values used to train the best NB, RF, CART, and XGBoost models.
| Methods | Hyperparameters | Best Value | Meanings |
|---|---|---|---|
| CART | minsplit | 20 | The minimum number of observations that must exist in a node for a split to be attempted. |
| | minbucket | 20 | The minimum number of observations in any terminal node. |
| | maxdepth | 10 | The maximum depth of any node of the final tree. |
| | xval | 10 | Number of cross-validations. |
| | cp | 0.0013 | Complexity parameter: the minimum improvement in the model needed at each node. |
| RF | ntree | 500 | The number of trees in the forest. |
| | mtry | 2 | The number of predictors sampled for splitting at each node. |
| NB | fL | 1 | Adjustment of the Laplace smoother. |
| | usekernel | FALSE | Whether to use a kernel density estimate for continuous variables instead of a Gaussian density estimate. |
| | adjust | 1 | Adjustment of the bandwidth of the kernel density. |
| XGBoost | nrounds | 100 | The number of boosted trees. |
| | maximum depth | 2 | The maximum depth of a tree. |
| | learning rate | 0.4 | Shrinkage coefficient of each tree. |
| | gamma | 0 | The minimum loss reduction required to make a further split. |
| | subsample | 1 | Subsample ratio of training instances (rows) when building each tree. |
| | colsample_bytree | 0.8 | Subsample ratio of columns when constructing each tree. |
| | rate_drop | 0.01 | Rate of trees dropped. |
| | skip_drop | 0.95 | Probability of skipping the dropout procedure during a boosting iteration. |
| | min_child_weight | 1 | The minimum sum of instance weight needed in a child node. |
Abbr: CART, classification and regression tree; RF, random forest; NB, naive Bayes; XGBoost, eXtreme gradient boosting.
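For readers working in Python, the reported values could be carried over roughly as follows. This is only a sketch under stated assumptions: the keyword names are those used by scikit-learn estimators and the XGBoost scikit-learn wrapper, the source does not say that these libraries were used, and `booster="dart"` is inferred solely from the presence of `rate_drop` and `skip_drop`. The `cp` and `xval` settings have no direct counterpart here and are left out.

```python
# Hypothetical translation of the reported hyperparameters into Python
# keyword arguments. The dictionary names are illustrative, not from the source.

# CART: node-size and depth limits (assumed scikit-learn DecisionTreeClassifier names).
CART_PARAMS = {
    "min_samples_split": 20,  # minimum observations in a node before a split is attempted
    "min_samples_leaf": 20,   # minimum observations in any terminal node
    "max_depth": 10,          # maximum tree depth
}

# RF: forest size and predictors sampled per split (assumed RandomForestClassifier names).
RF_PARAMS = {
    "n_estimators": 500,  # ntree
    "max_features": 2,    # mtry
}

# XGBoost: assumed XGBClassifier names; DART booster is an inference, see lead-in.
XGB_PARAMS = {
    "booster": "dart",
    "n_estimators": 100,      # nrounds
    "max_depth": 2,
    "learning_rate": 0.4,
    "gamma": 0,
    "subsample": 1,
    "colsample_bytree": 0.8,
    "rate_drop": 0.01,
    "skip_drop": 0.95,
    "min_child_weight": 1,
}

for name, params in [("CART", CART_PARAMS), ("RF", RF_PARAMS), ("XGBoost", XGB_PARAMS)]:
    print(name, params)
```

If the libraries are installed, each dictionary can be unpacked into the matching constructor, for example `RandomForestClassifier(**RF_PARAMS)`.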
Performance of LGR and the four machine learning methods in predicting (a) vascular events and (b) bleeding.
| Methods | Accuracy | Sensitivity | Specificity | AUC |
|---|---|---|---|---|
| (a) Vascular events | | | | |
| LGR | 0.574 (0.03) | 0.571 (0.04) | 0.707 (0.03) | 0.674 (0.00) |
| NB | 0.569 (0.03) | 0.565 (0.03) | 0.711 (0.04) | 0.674 (0.00) |
| RF | | | | **0.780** |
| CART | 0.599 (0.09) | 0.598 (0.10) | 0.621 (0.09) | 0.637 (0.00) |
| XGBoost | | | | **0.717** |
| (b) Bleeding | | | | |
| LGR | 0.604 (0.03) | 0.622 (0.05) | 0.527 (0.05) | 0.605 (0.00) |
| NB | 0.599 (0.01) | 0.613 (0.02) | 0.537 (0.02) | 0.603 (0.00) |
| RF | | | | **0.684** |
| CART | 0.787 (0.07) | 0.959 (0.12) | 0.052 (0.16) | 0.467 (0.03) |
| XGBoost | | | | **0.618** |
Abbr.: SD, standard deviation; AUC, area under the receiver operating characteristic curve; LGR, logistic regression; NB, naive Bayes; RF, random forest; CART, classification and regression tree; XGBoost, eXtreme gradient boosting. In predicting both vascular events and bleeding, RF and XGBoost demonstrated higher AUC values (indicated in bold) than LGR.
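The four metrics in this table can be illustrated with a minimal sketch. The labels, scores, 0.5 threshold, and function names below are invented for illustration and are not from the study; the AUC is computed with the standard pairwise (Mann-Whitney) formulation, which may differ from whatever software the authors used.

```python
from typing import Dict, Sequence


def threshold_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity, and specificity from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = event, 0 = no event.
y_true = [1, 1, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.7, 0.2, 0.5]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cut-off

metrics = threshold_metrics(y_true, y_pred)
metrics["auc"] = auc(y_true, scores)
print(metrics)
```

Accuracy, sensitivity, and specificity depend on the chosen cut-off, whereas AUC summarizes ranking quality across all cut-offs, which is why a model can show high accuracy alongside a poor AUC.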
Figure 2. Receiver operating characteristic (ROC) curves of the five methods in predicting (a) vascular events and (b) bleeding.
Importance ranking of risk factors in predicting vascular events based on RF and XGBoost.
| Risk Factors | Average Ranking over 10 RF Runs | Average Ranking over 10 XGBoost Runs | Average Ranking of the 2 Models | Final Ranking in Predicting Vascular Events |
|---|---|---|---|---|
| Age | 1 | 5.2 | 3.1 | 1 |
| History of congestive heart failure | 4.6 | 2.1 | 3.35 | 2 |
| History of myocardial infarction | 4 | 2.8 | 3.4 | 3 |
| Smoking | 2.2 | 5.6 | 3.9 | 4 |
| Kidney function | 5.9 | 6.1 | 6 | 5 |
| BMI | 3.5 | 10.5 | 7 | 6 |
| Ethnicity | 7.8 | 7.3 | 7.55 | 7 |
| History of diabetes mellitus | 8.6 | 7 | 7.8 | 8 |
| Medicine dosage (dabigatran) | 8.5 | 7.5 | 8 | 9 |
| Previous stroke history | 9.4 | 9.6 | 9.5 | 10 |
| Body weight | 12.2 | 8.9 | 10.55 | 11 |
| Concomitant use of drugs | 14.3 | 9.8 | 12.05 | 12 |
| Hypertension history | 11.8 | 13.2 | 12.5 | 13 |
| Sex | 11.7 | 14.3 | 13 | 14 |
| Previous bleeding history | 14.5 | 14 | 14.25 | 15 |
| History of systemic embolism | 16.3 | 14.8 | 15.55 | 16 |
| Liver function abnormality | 16.7 | 15 | 15.85 | 17 |
| Anemia | 18 | 18 | 18 | 18 |
Abbr.: RF, random forest; XGBoost, eXtreme gradient boosting; BMI, body mass index.
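The column structure of this table suggests a simple aggregation: average each factor's per-model ranking, then order the factors by that average. A minimal sketch, assuming exactly that procedure and using the first six rows of the table as input, is given below. The function name is hypothetical, and ties are broken by input order, which the source does not specify.

```python
from typing import Dict, Tuple


def aggregate_rankings(*model_ranks: Dict[str, float]) -> Tuple[Dict[str, float], Dict[str, int]]:
    """Average per-model rankings and derive a final rank (1 = most important)."""
    names = list(model_ranks[0])
    average = {n: sum(r[n] for r in model_ranks) / len(model_ranks) for n in names}
    # sorted() is stable, so factors with equal averages keep their input order.
    ordered = sorted(names, key=lambda n: average[n])
    final = {n: i + 1 for i, n in enumerate(ordered)}
    return average, final


# Average rankings over the repeated runs, copied from the table above.
rf_rank = {"Age": 1.0, "History of CHF": 4.6, "History of MI": 4.0,
           "Smoking": 2.2, "Kidney function": 5.9, "BMI": 3.5}
xgb_rank = {"Age": 5.2, "History of CHF": 2.1, "History of MI": 2.8,
            "Smoking": 5.6, "Kidney function": 6.1, "BMI": 10.5}

average, final = aggregate_rankings(rf_rank, xgb_rank)
for name in sorted(final, key=final.get):
    print(f"{final[name]}. {name}: {average[name]:.2f}")
```

Averaging ranks rather than raw importance scores keeps the two models on a common scale, since RF and XGBoost importance values are not directly comparable.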
Overall importance ranking of each risk factor in predicting bleeding based on RF and XGBoost.
| Risk Factors | Average Ranking over 10 RF Runs | Average Ranking over 10 XGBoost Runs | Average Ranking of the 2 Models | Final Ranking in Predicting Bleeding |
|---|---|---|---|---|
| Age | 1 | 1.3 | 1.15 | 1 |
| Kidney function | 3.2 | 3.5 | 3.35 | 2 |
| Smoking | 2.1 | 4.7 | 3.4 | 3 |
| Previous bleeding history | 4.7 | 2.4 | 3.55 | 4 |
| Concomitant use of drugs | 4.8 | 5 | 4.9 | 5 |
| Medicine dosage (dabigatran) | 7 | 6.7 | 6.85 | 6 |
| BMI | 5.2 | 9.6 | 7.4 | 7 |
| History of myocardial infarction | 9.2 | 6.1 | 7.65 | 8 |
| History of congestive heart failure | 10.1 | 10.3 | 10.2 | 9 |
| Ethnicity | 9.1 | 12.2 | 10.65 | 10 |
| Sex | 10.8 | 11.3 | 11.05 | 11 |
| History of diabetes mellitus | 12.8 | 11.5 | 12.15 | 12 |
| Previous stroke history | 11 | 14.2 | 12.6 | 13 |
| Hypertension history | 14 | 12.6 | 13.3 | 14 |
| Body weight | 15 | 12.3 | 13.65 | 15 |
| History of systemic embolism | 16 | 13 | 14.5 | 16 |
| Liver function abnormality | 17 | 17 | 17 | 17 |
| Anemia | 18 | 17.4 | 17.7 | 18 |
Abbr.: RF, random forest; XGBoost, eXtreme gradient boosting; BMI, body mass index.
The nine most important variables in predicting vascular events and bleeding.
| Final Ranking | Variable for Predicting Vascular Events | Variable for Predicting Bleeding |
|---|---|---|
| 1 | Age | Age |
| 2 | History of CHF | Kidney function |
| 3 | History of MI | Smoking |
| 4 | Smoking | Previous bleeding history |
| 5 | Kidney function | Concomitant use of drugs |
| 6 | BMI | Medicine dosage (dabigatran) |
| 7 | Ethnicity | BMI |
| 8 | History of diabetes mellitus | History of MI |
| 9 | Medicine dosage (dabigatran) | History of CHF |
Abbr.: CHF, congestive heart failure; MI, myocardial infarction; BMI, body mass index.
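To see at a glance which of these variables matter for both outcomes and which are specific to one, the two columns can be compared directly. The sketch below simply restates the table as two lists; the variable names `vascular_top9` and `bleeding_top9` are illustrative.

```python
# Top-nine variables per outcome, in rank order, copied from the table above.
vascular_top9 = [
    "Age", "History of CHF", "History of MI", "Smoking", "Kidney function",
    "BMI", "Ethnicity", "History of diabetes mellitus",
    "Medicine dosage (dabigatran)",
]
bleeding_top9 = [
    "Age", "Kidney function", "Smoking", "Previous bleeding history",
    "Concomitant use of drugs", "Medicine dosage (dabigatran)", "BMI",
    "History of MI", "History of CHF",
]

bleeding_set = set(bleeding_top9)
vascular_set = set(vascular_top9)

# List comprehensions preserve the rank order of the source column.
shared = [v for v in vascular_top9 if v in bleeding_set]
vascular_only = [v for v in vascular_top9 if v not in bleeding_set]
bleeding_only = [v for v in bleeding_top9 if v not in vascular_set]

print("Shared:", shared)
print("Vascular events only:", vascular_only)
print("Bleeding only:", bleeding_only)
```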