Haike Lei1, Mengyang Zhang2, Zeyi Wu2, Chun Liu2, Xiaosheng Li1, Wei Zhou1, Bo Long1, Jiayang Ma2, Huiyi Zhang2, Ying Wang1, Guixue Wang3, Mengchun Gong2, Na Hong2, Haixia Liu1, Yongzhong Wu1.
Abstract
Background: No established model currently exists for predicting the occurrence of venous thromboembolism (VTE) in patients with lung cancer. Machine learning (ML) techniques are increasingly being adapted for use in the medical field because of their analytical capabilities and scalability. This study aimed to develop and validate ML models to predict the incidence of VTE among lung cancer patients.
Keywords: lung cancer; machine learning; random forest; risk prediction model; venous thromboembolism
Year: 2022 PMID: 35321110 PMCID: PMC8934875 DOI: 10.3389/fcvm.2022.845210
Source DB: PubMed Journal: Front Cardiovasc Med ISSN: 2297-055X
Patient demographics and clinical characteristics.
| Characteristic | Unit | All | No VTE | VTE n (%) | P-value |
| Age | years | 64.03 ± 10.31 | 64.08 ± 10.30 | 62.64 ± 10.48 | 0.134 |
| KPS | | 76.80 ± 10.97 | 76.76 ± 11.05 | 77.88 ± 8.76 | 0.166 |
| Weight | kg | 59.61 ± 10.64 | 59.61 ± 10.65 | 59.23 ± 10.56 | 0.830 |
| Height | cm | 161.13 ± 8.07 | 161.16 ± 8.06 | 160.32 ± 8.44 | 0.335 |
| PLT count | ×10^9/L | 222.19 ± 99.73 | 222.52 ± 100.03 | 213.83 ± 91.88 | 0.303 |
| Albumin | g/L | 38.84 ± 6.24 | 38.88 ± 6.20 | 37.77 ± 7.09 | 0.087 |
| D-dimer | mg/L | 2.09 ± 3.65 | 2.05 ± 3.62 | 2.96 ± 4.18 | 0.019 |
| Hemoglobin | g/L | 121.75 ± 20.53 | 121.86 ± 20.48 | 118.94 ± 21.67 | 0.141 |
| Leukocyte count | ×10^12/L | 5.55 ± 5.35 | 5.51 ± 5.32 | 6.54 ± 6.04 | 0.063 |
| Creatinine | μmol/L | 66.75 ± 37.90 | 66.86 ± 38.34 | 64.04 ± 23.88 | 0.211 |
| Sex | | | | | |
| Female | | 1015 | 958 | 57 | <0.001 |
| Male | | 2342 | 2274 | 68 | |
| Histological type | | | | | |
| NSCLC | | 1815 | 1724 | 91 | 0.007 |
| SCLC | | 228 | 225 | 3 | |
| Stage of cancer | | | | | |
| I | | 91 | 90 | 1 | <0.001 |
| II | | 80 | 79 | 1 | |
| III | | 329 | 318 | 11 | |
| IV | | 989 | 887 | 102 | |
| | | | | | |
| VTE history | | 172 | 74 (2.26) | 98 (78.4) | <0.001 |
| Varicosity | | 21 | 16 (0.49) | 5 (4) | <0.001 |
| COPD | | 666 | 633 (19.34) | 33 (26.4) | 0.065 |
| History of malignant tumor | | 49 | 47 (1.44) | 2 (1.6) | 0.701 |
| CVC | | 110 | 96 (2.93) | 14 (11.2) | <0.001 |
| | | | | | |
| Mitomycin | | 8 | 7 (0.21) | 1 (0.8) | 0.259 |
| Recombinant human endostatin | | 108 | 102 (3.11) | 6 (4.8) | 0.291 |
| EGFR-TKI | | 403 | 348 (10.63) | 55 (44) | <0.001 |
| Platinum-based chemotherapy | | 647 | 602 (18.39) | 45 (36) | <0.001 |
| Bevacizumab | | 78 | 60 (1.83) | 18 (14.4) | <0.001 |
KPS, Karnofsky performance status; PLT, platelet; NSCLC, non-small cell lung cancer; SCLC, small cell lung cancer; COPD, chronic obstructive pulmonary disease; CVC, central venous catheter cannulation; EGFR-TKI, epidermal growth factor receptor tyrosine kinase inhibitor. The P-values for all numerical variables were calculated using Welch’s t-test. The Mann-Whitney U test was used to determine the P-value for “Stage of cancer,” and Fisher’s exact test was used to obtain the P-values for all other categorical features.
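As an illustration of the categorical comparison described in the footnote, the following is a minimal sketch of a two-sided Fisher's exact test applied to the female/male counts from the table. The function name `fisher_exact_two_sided` is hypothetical, and the two-sided rule used here (summing every table that is no more probable than the observed one) is an assumption; the source does not state which software or convention the authors used.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(k):
        # Probability that the top-left cell equals k, given fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / total

    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance so that ties with the observed table count.
    p_value = sum(
        p for p in map(prob, range(k_min, k_max + 1))
        if p <= p_obs * (1 + 1e-7)
    )
    return min(1.0, p_value)


# Counts taken from the female and male rows of the table: (VTE, no VTE).
p_sex = fisher_exact_two_sided(57, 958, 68, 2274)
print(f"Fisher's exact test, sex vs. VTE: p = {p_sex:.2g}")
```

The printed value should fall below 0.001, in line with the P-value reported for that pair of rows.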
Evaluation metrics with 95% confidence intervals (CIs) on the test data.
| Model | Accuracy | Sensitivity | Specificity | AUROC | AUPRC |
| Random forest | 0.957 (0.934–0.975) | 0.714 (0.619–0.762) | 0.965 (0.941–0.985) | 0.91 (0.893–0.926) | 0.43 (0.363–0.500) |
| AdaBoost | 0.972 (0.954–0.974) | 0.619 (0.381–0.714) | 0.983 (0.968–0.986) | 0.83 (0.701–0.926) | 0.39 (0.230–0.476) |
| XGBoost | 0.971 (0.965–0.976) | 0.571 (0.429–0.667) | 0.983 (0.977–0.989) | 0.89 (0.861–0.937) | 0.41 (0.323–0.470) |
| Logistic regression | 0.954 (0.922–0.963) | 0.761 (0.619–0.762) | 0.961 (0.927–0.971) | 0.90 (0.830–0.932) | 0.37 (0.294–0.465) |
| KNN | 0.901 (0.882–0.931) | 0.190 (0.190–0.524) | 0.924 (0.901–0.948) | 0.73 (0.615–0.785) | 0.16 (0.090–0.270) |
KNN, k-nearest neighbors.
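The threshold-based metrics in the table (accuracy, sensitivity, specificity) can be computed from a confusion matrix, and an interval can be attached by resampling. The sketch below uses a percentile bootstrap, which is an assumption: the source does not say how the 95% CIs were obtained. The labels and predictions are toy values, not study data, and the function names are hypothetical.

```python
import random


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary labels (1 = event)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for one metric (assumed method)."""
    rng = random.Random(seed)
    n = len(y_true)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        yp = [y_pred[i] for i in idx]
        # Skip resamples missing a class, where a rate would be undefined.
        if 0 < sum(yt) < n:
            values.append(classification_metrics(yt, yp)[metric])
    values.sort()
    lo = values[int((alpha / 2) * len(values))]
    hi = values[int((1 - alpha / 2) * len(values)) - 1]
    return lo, hi


# Toy data (not from the study): 21 events, 15 detected; 79 non-events, 3 false alarms.
y_true = [1] * 21 + [0] * 79
y_pred = [1] * 15 + [0] * 6 + [1] * 3 + [0] * 76

metrics = classification_metrics(y_true, y_pred)
sens_lo, sens_hi = bootstrap_ci(y_true, y_pred, "sensitivity")
print(metrics, (sens_lo, sens_hi))
```

With so few positive cases, the sensitivity interval tends to be much wider than the specificity interval, which matches the pattern visible in the table.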
FIGURE 1. Receiver operating characteristic and precision-recall curves on the training and test datasets. (A) AUROC. (B) AUPRC.
FIGURE 2. Permutation feature importance ranking for the training and test datasets. (A) Training set. (B) Test set.
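Permutation feature importance, as ranked in Figure 2, measures how much a model's score drops when one feature's values are shuffled. The following is a minimal sketch under stated assumptions: accuracy is used as the score, `toy_model` is a hypothetical rule-based predictor standing in for a trained classifier, and the data are synthetic. None of this reproduces the study's models or features.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Synthetic data (not from the study): feature 0 determines the label,
# feature 1 is pure noise.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]


def toy_model(row):
    # Hypothetical predictor that uses only feature 0.
    return int(row[0] > 0.5)


importances = permutation_importance(toy_model, X, y)
print(importances)
```

Because the toy predictor ignores feature 1, shuffling that column leaves accuracy unchanged and its importance is zero, while shuffling feature 0 produces a large drop.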
FIGURE 3. Impurity-based feature importance ranking.
FIGURE 4. Receiver operating characteristic and precision-recall curves on the training and test datasets after feature selection. (A) AUROC. (B) AUPRC.