Lengchen Hou1,2, Longjun Hu1,2, Wenxue Gao1, Wenbo Sheng3, Zedong Hao3, Yiwei Chen3, Jiyu Li1.
Abstract
The purpose of this study was to establish a novel pulmonary embolism (PE) risk prediction model based on machine learning (ML) methods and to evaluate both the predictive performance of the model and the contribution of each variable to that performance. We conducted a retrospective study at the Shanghai Tenth People's Hospital and collected the clinical data of inpatients who underwent pulmonary computed tomography imaging between January 1, 2014 and December 31, 2018. We trained several ML models, including logistic regression (LR), support vector machine (SVM), random forest (RF), and gradient boosting decision tree (GBDT), compared the models with representative baseline algorithms, and investigated their predictive performance and feature interpretation. A total of 3619 patients were included in the study. The GBDT model achieved the best prediction, with an area under the curve (AUC) of 0.799, whereas the AUCs of the RF, LR, and SVM models were 0.791, 0.716, and 0.743, respectively. The sensitivities of the GBDT, LR, RF, and SVM models were 63.9%, 68.1%, 71.5%, and 75%, respectively; the specificities were 81.1%, 66.1%, 72.7%, and 65.1%, respectively; and the accuracies were 77.8%, 66.5%, 72.5%, and 67%, respectively. The maximum D-dimer level contributed the most to the outcome prediction, followed by the extreme growth rate of the plasma fibrinogen level, in-hospital duration, and the extreme growth rate of the D-dimer level. The study demonstrates the superiority of the GBDT model in predicting the risk of PE in hospitalized patients. However, before the model can be applied in clinical practice and support clinical decision-making, its predictive performance needs to be verified prospectively.
Keywords: GBDT; hospital-acquired pulmonary embolism; machine learning; pulmonary embolism; risk prediction model
Year: 2021 PMID: 34558325 PMCID: PMC8495515 DOI: 10.1177/10760296211040868
Source DB: PubMed Journal: Clin Appl Thromb Hemost ISSN: 1076-0296 Impact factor: 2.389
Demographic Characteristics of In-Hospital Patients.
| | Training, PE positive | Training, PE negative | Testing, PE positive | Testing, PE negative |
|---|---|---|---|---|
| Age | ||||
| n | 485 | 2382 | 144 | 608 |
| Mean ± SD | 70.5 ± 12.8 | 65.6 ± 14.1 | 69.1 ± 15 | 65.2 ± 14.7 |
| Median (Q1, Q3) | 73 (63, 80) | 66 (58, 77) | 70 (61.5, 81) | 66 (58, 76.5) |
| Gender | ||||
| Male, n (%) | 236 (48.66%) | 1118 (46.94%) | 60 (41.67%) | 283 (46.55%) |
| Female, n (%) | 249 (51.34%) | 1264 (53.06%) | 84 (58.33%) | 325 (53.45%) |
Abbreviation: PE, pulmonary embolism.
Characteristics Included in the GBDT Model.
| | Training, PE positive | Training, PE negative | Testing, PE positive | Testing, PE negative |
|---|---|---|---|---|
| In-hospital duration | ||||
| n | 485 | 2382 | 144 | 608 |
| Mean ± SD | 14.7 ± 10.6 | 11.3 ± 13 | 16.3 ± 13.2 | 10.9 ± 8.3 |
| Median (Q1, Q3) | 12 (9, 17) | 9 (6, 13) | 14 (9, 20) | 8 (6, 13) |
| Maximum neutrophil count within 2 weeks | ||||
| n | 473 | 2311 | 142 | 587 |
| Mean ± SD | 7.53 ± 3.91 | 6.46 ± 4.27 | 7.63 ± 4.17 | 6.3 ± 4.09 |
| Median (Q1, Q3) | 6.76 (4.74, 9.36) | 5.08 (3.53, 8.21) | 6.99 (4.6, 9.69) | 5.01 (3.4, 7.82) |
| Maximum serum albumin level within 3 days | ||||
| n | 224 | 1188 | 74 | 281 |
| Mean ± SD | 37.48 ± 6.66 | 38.88 ± 6.89 | 36.2 ± 4.98 | 39.12 ± 7.7 |
| Median (Q1, Q3) | 37 (33, 40) | 39 (35, 43) | 36.5 (34, 39) | 39 (35, 43) |
| Minimum plasma fibrinogen level within 1 day | ||||
| n | 232 | 759 | 69 | 196 |
| Mean ± SD | 3.49 ± 1.24 | 3.63 ± 1.49 | 3.49 ± 1.12 | 3.67 ± 1.47 |
| Median (Q1, Q3) | 3.27 (2.69, 4.23) | 3.24 (2.5, 4.54) | 3.3 (2.77, 4.08) | 3.31 (2.58, 4.79) |
| Extreme growth rate of plasma fibrinogen level within 2 weeks | ||||
| n | 216 | 558 | 62 | 145 |
| Mean ± SD | 0.007 ± 0.03 | 0 ± 0.027 | 0.004 ± 0.028 | −0.716 ± 8.65 |
| Median (Q1, Q3) | 0.004 (−0.007, 0.017) | −0.002 (−0.014, 0.01) | 0.002 (−0.01, 0.01) | −0.003 (−0.011, 0.009) |
| Plasma prothrombin time average growth rate within 2 days | ||||
| n | 312 | 1277 | 97 | 320 |
| Mean ± SD | −0.018 ± 0.062 | −0.017 ± 0.075 | −0.026 ± 0.057 | −0.02 ± 0.078 |
| Median (Q1, Q3) | −0.008 (−0.037, 0.013) | −0.004 (−0.038, 0.022) | −0.012 (−0.041, 0.004) | −0.004 (−0.042, 0.024) |
| Plasma prothrombin time average growth rate within 2 weeks | ||||
| n | 465 | 2239 | 140 | 570 |
| Mean ± SD | −0.01 ± 0.04 | −0.01 ± 0.06 | −0.02 ± 0.05 | −0.01 ± 0.07 |
| Median (Q1, Q3) | 0 (−0.02, 0.02) | 0 (−0.03, 0.03) | −0.01 (−0.03, 0.01) | 0 (−0.03, 0.03) |
| Minimum mean red blood cell volume within 2 days | ||||
| n | 322 | 1377 | 103 | 341 |
| Mean ± SD | 90.39 ± 5.75 | 90.35 ± 5.84 | 91.34 ± 7.32 | 90.13 ± 6.3 |
| Median (Q1, Q3) | 90.75 (87.9, 93.6) | 90.6 (87.3, 93.6) | 91.6 (88.4, 94.6) | 90.7 (87, 93.6) |
| Last thrombin time within 1 week | ||||
| n | 441 | 2110 | 134 | 537 |
| Mean ± SD | 19.44 ± 3.23 | 20.45 ± 6.56 | 19.53 ± 3.47 | 20.75 ± 8.6 |
| Median (Q1, Q3) | 18.8 (17.5, 20.7) | 19.7 (18.3, 21.3) | 18.9 (17.7, 20.4) | 19.7 (18.3, 21.5) |
| Extreme growth rate of urea nitrogen level within 2 weeks | ||||
| n | 242 | 748 | 75 | 188 |
| Mean ± SD | −0.007 ± 0.074 | −0.001 ± 0.069 | −0.014 ± 0.056 | 0.003 ± 0.077 |
| Median (Q1, Q3) | −0.011 (−0.04, 0.019) | −0.008 (−0.029, 0.013) | −0.014 (−0.048, 0.015) | −0.007 (−0.031, 0.024) |
| Maximum red blood cell count within 3 days | ||||
| n | 382 | 1705 | 120 | 426 |
| Mean ± SD | 4.04 ± 0.68 | 4.13 ± 0.67 | 4.03 ± 0.63 | 4.16 ± 0.69 |
| Median (Q1, Q3) | 4.09 (3.63, 4.49) | 4.17 (3.74, 4.58) | 4.06 (3.65, 4.42) | 4.19 (3.72, 4.64) |
| Maximum D-dimer level within 2 weeks | ||||
| n | 472 | 2271 | 142 | 572 |
| Mean ± SD | 31.42 ± 460.01 | 3.92 ± 7.42 | 9.15 ± 9.82 | 4.42 ± 9.71 |
| Median (Q1, Q3) | 5.82 (2.38, 12.51) | 1.31 (0.32, 4.26) | 6.77 (2.86, 10.62) | 1.33 (0.31, 4.14) |
| Extreme growth rate of D-dimer level within 2 weeks | ||||
| n | 294 | 916 | 86 | 236 |
| Mean ± SD | 0.729 ± 12.14 | −0.009 ± 0.194 | 0.036 ± 0.205 | −0.05 ± 0.362 |
| Median (Q1, Q3) | 0.018 (−0.018, 0.077) | 0 (−0.019, 0.02) | 0.015 (−0.029, 0.076) | −0.001 (−0.03, 0.018) |
| Maximum C-reactive protein level within 2 weeks | ||||
| n | 433 | 1995 | 134 | 523 |
| Mean ± SD | 54.52 ± 57.31 | 47.36 ± 60.67 | 58.09 ± 55.12 | 46.86 ± 58.36 |
| Median (Q1, Q3) | 31.52 (7.87, 87.14) | 12.7 (3.4, 81) | 41.99 (10.5, 99.9) | 13.5 (3.4, 79.51) |
| Extreme growth rate of C-reactive protein level within 2 weeks | ||||
| n | 245 | 798 | 78 | 203 |
| Mean ± SD | 0.411 ± 1.433 | 0.215 ± 1.48 | 0.27 ± 1.006 | 0.06 ± 1.17 |
| Median (Q1, Q3) | 0.022 (−0.255, 0.512) | −0.019 (−0.392, 0.405) | 0.138 (−0.302, 0.684) | −0.021 (−0.398, 0.381) |
| Any primary care within 1 month | ||||
| Yes | 243 (50.1%) | 761 (31.95%) | 80 (55.56%) | 212 (34.87%) |
| No | 242 (49.9%) | 1621 (68.05%) | 64 (44.44%) | 396 (65.13%) |
| Base excess level | ||||
| Low | 24 (8.28%) | 127 (10.73%) | 8 (8.42%) | 23 (7.8%) |
| Normal | 171 (58.97%) | 743 (62.75%) | 51 (53.68%) | 195 (66.1%) |
| High | 95 (32.76%) | 314 (26.52%) | 36 (37.89%) | 77 (26.1%) |
Abbreviations: PE, pulmonary embolism; GBDT, gradient boosting decision tree.
Figure 1. Receiver operating characteristic curves of the different machine learning models for predicting pulmonary embolism (PE) risk (validation set).
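As a reading aid for the area under the curve (AUC) values summarized by these curves, the following is a minimal sketch of one standard way to compute AUC: the fraction of (PE-positive, PE-negative) pairs in which the positive case receives the higher score, with ties counted as half. The function name `roc_auc` and the toy labels and scores are hypothetical and are not data from the study; the source does not state how the authors computed their AUCs.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy example (not study data): 3 positives, 4 negatives.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for reading the reported values between 0.716 and 0.799.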
Predictive Efficacy Analysis (Verification Set) of Different Machine Learning Models for PE.
| Model | AUC (95% CI) | Sensitivity | Specificity | Accuracy | F1 |
|---|---|---|---|---|---|
| GBDT | 0.799 (0.762, 0.837) | 63.9% | 81.1% | 77.8% | 0.524 |
| Logistic regression | 0.716 (0.672, 0.761) | 68.1% | 66.1% | 66.5% | 0.438 |
| Random forest | 0.791 (0.753, 0.828) | 71.5% | 72.7% | 72.5% | 0.499 |
| SVM | 0.743 (0.701, 0.785) | 75% | 65.1% | 67% | 0.466 |
Abbreviations: PE, pulmonary embolism; GBDT, gradient boosting decision tree; SVM, support vector machine.
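The four threshold-dependent metrics in this table all derive from a single confusion matrix per model. The sketch below shows the arithmetic, using the testing-set sizes from the demographic table (144 PE-positive, 608 PE-negative). The true-positive and true-negative counts are not reported in the source; they are inferred here by back-calculation from the published percentages and should be read as an assumption. The helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy, and F1 from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),            # true-positive rate
        "specificity": tn / (tn + fp),            # true-negative rate
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),        # harmonic mean of precision and recall
    }


# Testing-set class sizes taken from the demographic table.
N_POS, N_NEG = 144, 608

# ASSUMPTION: (true positives, true negatives) inferred from the reported
# percentages; these counts do not appear in the source.
inferred_counts = {
    "GBDT": (92, 493),
    "Logistic regression": (98, 402),
    "Random forest": (103, 442),
    "SVM": (108, 396),
}

results = {}
for model, (tp, tn) in inferred_counts.items():
    results[model] = binary_metrics(tp, N_POS - tp, tn, N_NEG - tn)
    m = results[model]
    print(
        f"{model}: sensitivity={m['sensitivity']:.1%}, "
        f"specificity={m['specificity']:.1%}, "
        f"accuracy={m['accuracy']:.1%}, F1={m['f1']:.3f}"
    )
```

Under these inferred counts, the computed values agree with the table to the reported precision. Because PE-negative patients outnumber PE-positive patients by roughly four to one in the testing set, accuracy is driven mainly by specificity, which is why GBDT has the highest accuracy despite the lowest sensitivity.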
Figure 2. Importance of the top 10 risk factors in the machine learning prediction model: average score of each feature across all gbtree models after cross-validation and downsampling.
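The caption mentions two steps that can be illustrated in isolation: downsampling the majority class and averaging per-feature importance scores across several fitted models. The following is a minimal sketch under stated assumptions, not the authors' procedure, which this excerpt does not describe. It assumes random downsampling of the majority class to the size of the minority class and a simple arithmetic mean of scores. The function names, the `pe` label key, and the feature names and scores are all hypothetical; only the class sizes (485 PE-positive, 2382 PE-negative) come from the training columns of the tables above.

```python
import random
from collections import defaultdict


def downsample_majority(rows, label_key="pe", seed=0):
    """Randomly drop majority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    pos = [r for r in rows if r[label_key] == 1]
    neg = [r for r in rows if r[label_key] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


def mean_importance(per_model_scores, top_k=10):
    """Average each feature's score over all models and return the top_k.

    A feature missing from a model's scores contributes 0 for that model.
    """
    totals = defaultdict(float)
    for scores in per_model_scores:
        for feature, value in scores.items():
            totals[feature] += value
    n_models = len(per_model_scores)
    averaged = {f: total / n_models for f, total in totals.items()}
    return sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Placeholder rows with the training-set class sizes from the tables.
rows = [{"id": i, "pe": 1} for i in range(485)]
rows += [{"id": i, "pe": 0} for i in range(485, 2867)]
balanced = downsample_majority(rows)

# Hypothetical importance scores from three models (invented values).
per_model = [
    {"feature_a": 0.5, "feature_b": 0.3, "feature_c": 0.2},
    {"feature_a": 0.4, "feature_b": 0.4, "feature_c": 0.2},
    {"feature_a": 0.6, "feature_b": 0.2, "feature_c": 0.2},
]
ranking = mean_importance(per_model, top_k=2)
print(len(balanced), ranking)
```

Averaging over models fitted on different downsampled folds is one way to keep a ranking from depending on a single random draw of the majority class.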