Chao Lu1, Jiayin Song1, Hui Li1,2, Wenxing Yu1, Yangquan Hao1, Ke Xu1, Peng Xu1.
Abstract
Osteoarthritis (OA) is the most common joint disease associated with pain and disability. Patients with OA are at high risk for venous thrombosis (VTE). Here, we developed an interpretable machine learning (ML)-based model to predict VTE risk in patients with OA. To establish the prediction model, we trained six ML algorithms on 35 candidate variables. Recursive feature elimination (RFE) was used to select the clinical variables most strongly associated with VTE. SHapley Additive exPlanations (SHAP) were applied to interpret the ML model and determine the importance of the selected features. Overall, 3169 patients with OA (average age: 66.52 ± 7.28 years) were recruited from Xi'an Honghui Hospital. Of these, 352 were diagnosed with VTE and 2817 were not. The XGBoost algorithm showed the best performance. Based on the RFE algorithm, 15 variables were retained for further modeling with XGBoost. The top three predictors were Kellgren-Lawrence grade, age, and hypertension. Our study showed that the XGBoost model with 15 variables has high potential to predict VTE risk in patients with OA.
Keywords: VTE risk prediction; machine learning algorithm; osteoarthritis; population-based cohort study; venous thrombosis
Year: 2022 PMID: 35055429 PMCID: PMC8781369 DOI: 10.3390/jpm12010114
Source DB: PubMed Journal: J Pers Med ISSN: 2075-4426
Figure 1. Flow chart of patient enrollment.
Characteristics of the patients, stratified by VTE status.
| Class a | Total | Non-Venous Thrombosis | Venous Thrombosis | p-Value | |
|---|---|---|---|---|---|
| N | 3169 | 2817 | 352 | ||
| Age (year) b | 66.52 ± 7.28 | 66.33 ± 7.31 | 68.05 ± 6.84 | <0.001 | |
| Gender | |||||
| Male | 2400 (75.73%) | 2119 (75.22%) | 281 (79.83%) | 0.066 | |
| Female | 769 (24.27%) | 698 (24.78%) | 71 (20.17%) | ||
| Hypertension | |||||
| No | 1730 (54.59%) | 1543 (54.77%) | 187 (53.12%) | 0.597 | |
| Yes | 1439 (45.41%) | 1274 (45.23%) | 165 (46.88%) | ||
| Diabetes | |||||
| No | 2751 (86.81%) | 2437 (86.51%) | 314 (89.20%) | 0.185 | |
| Yes | 418 (13.19%) | 380 (13.49%) | 38 (10.80%) | ||
| Coronary heart disease | |||||
| No | 2207 (69.64%) | 1974 (70.07%) | 233 (66.19%) | 0.152 | |
| Yes | 962 (30.36%) | 843 (29.93%) | 119 (33.81%) | ||
| Kellgren–Lawrence grade | |||||
| 0 | 2269 (71.60%) | 1943 (68.97%) | 326 (92.61%) | <0.001 | |
| III | 181 (5.71%) | 178 (6.32%) | 3 (0.85%) | ||
| IV | 719 (22.69%) | 696 (24.71%) | 23 (6.54%) | ||
| Eosinophil ratio | |||||
| Normal Range | 2746 (86.65%) | 2431 (86.30%) | 315 (89.49%) | 0.115 | |
| Abnormal | 423 (13.35%) | 386 (13.70%) | 37 (10.51%) | ||
| Hematocrit | |||||
| Normal Range | 2535 (79.99%) | 2254 (80.01%) | 281 (79.83%) | 0.991 | |
| Abnormal | 634 (20.01%) | 563 (19.99%) | 71 (20.17%) | ||
| Mean platelet volume | |||||
| Normal Range | 2782 (87.79%) | 2462 (87.40%) | 320 (90.91%) | 0.070 | |
| Abnormal | 387 (12.21%) | 355 (12.60%) | 32 (9.09%) | ||
| Thrombocytocrit | |||||
| Normal Range | 2858 (90.19%) | 2527 (89.71%) | 331 (94.03%) | 0.013 | |
| Abnormal | 311 (9.81%) | 290 (10.29%) | 21 (5.97%) | ||
| Platelet-large cell ratio | |||||
| Normal Range | 2390 (75.42%) | 2112 (74.97%) | 278 (78.98%) | 0.114 | |
| Abnormal | 779 (24.58%) | 705 (25.03%) | 74 (21.02%) | ||
| Uric acid | |||||
| Normal Range | 2554 (80.59%) | 2261 (80.26%) | 293 (83.24%) | 0.208 | |
| Abnormal | 615 (19.41%) | 556 (19.74%) | 59 (16.76%) | ||
| Glucose | |||||
| Normal Range | 2665 (84.10%) | 2369 (84.10%) | 296 (84.09%) | 0.941 | |
| Abnormal | 504 (15.90%) | 448 (15.90%) | 56 (15.91%) | ||
| Antistreptococcal hemolysin “O” | |||||
| Normal Range | 3074 (97.00%) | 2726 (96.77%) | 348 (98.86%) | 0.045 | |
| Abnormal | 95 (3.00%) | 91 (3.23%) | 4 (1.14%) | ||
| Anti-CCP antibody | |||||
| Normal Range | 2549 (80.44%) | 2255 (80.05%) | 294 (83.52%) | 0.140 | |
| Abnormal | 620 (19.56%) | 562 (19.95%) | 58 (16.48%) | ||
| Rheumatoid factors | |||||
| Normal Range | 2902 (91.57%) | 2577 (91.48%) | 325 (92.33%) | 0.661 | |
| Abnormal | 267 (8.43%) | 240 (8.52%) | 27 (7.67%) |
a Continuous variables were transformed to dichotomous variables according to their normal ranges. b Values are presented as mean ± SD.
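The record does not say which test produced the p-values for the categorical rows. As a minimal sketch, assuming a chi-square test with Yates' continuity correction on each 2×2 block (an assumption, not something the source states), the hypertension row can be checked with the standard library alone. The function name `chi2_yates_2x2` is hypothetical.

```python
import math


def chi2_yates_2x2(a, b, c, d):
    """Yates-corrected chi-square test for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    Assumption: this is only one plausible test; the source does not name its method.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # Continuity correction: shrink |ad - bc| by n/2, floored at zero.
    diff = max(abs(a * d - b * c) - n / 2, 0.0)
    stat = n * diff ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypertension row: counts are (no VTE, VTE) for "No", then for "Yes".
stat, p = chi2_yates_2x2(1543, 187, 1274, 165)
print(f"chi2 = {stat:.3f}, p = {p:.3f}")
```

Under this assumption the result should land close to the 0.597 reported for hypertension; rows with three levels, such as Kellgren-Lawrence grade, would need a general contingency-table test instead.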
Figure 2. The receiver operating characteristic (ROC) curves of the machine learning models on the training set (A) and testing set (B).
The area under the curve (AUC) of each model on the training set and testing set.
| Model | Training Set (AUC, 95% CI) | Testing Set (AUC, 95% CI) |
|---|---|---|
| LR | 0.843 (0.832, 0.855) | 0.690 (0.620, 0.760) |
| RF | 0.872 (0.862, 0.882) | 0.685 (0.618, 0.753) |
| XGBoost | 0.980 (0.977, 0.983) | 0.741 (0.676, 0.806) |
| AdaBoost | 0.858 (0.847, 0.868) | 0.687 (0.619, 0.755) |
| GBDT | 0.965 (0.960, 0.970) | 0.720 (0.656, 0.784) |
| CatBoost | 0.973 (0.969, 0.977) | 0.724 (0.657, 0.790) |
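For readers less familiar with the metric in the table above, AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it by direct pair counting. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not data from the study, and the record does not say how its confidence intervals were obtained.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 outcomes; scores: predicted risks, same order.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values): 3 of the 4 positive/negative pairs are ordered correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(toy_labels, toy_scores)
print(auc)
```

Pair counting is quadratic in the number of cases, so it suits illustration rather than a cohort of thousands, where a rank-based formula would be the usual choice.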
Figure 3. Using the RFE method to screen the optimal variables. (A) The most important variables, screened by the RFE method; (B) The receiver operating characteristic (ROC) curves of the XGBoost model on the training set and testing set.
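The elimination loop behind RFE can be sketched as follows: score the remaining features, drop the weakest, and repeat until the target count is reached. This is a simplified illustration, not the authors' implementation. The names `recursive_feature_elimination` and `abs_correlation_scores` are hypothetical, and the default scorer (absolute Pearson correlation with the outcome) is a stand-in for the model-based importances a real RFE run would recompute after refitting at each step.

```python
import math


def abs_correlation_scores(features, labels):
    """Stand-in scorer: |Pearson correlation| of each feature with the label.

    A model-based scorer would refit on all remaining features here instead.
    """
    n = len(labels)
    mean_y = sum(labels) / n
    var_y = sum((y - mean_y) ** 2 for y in labels)
    scores = {}
    for name, column in features.items():
        mean_x = sum(column) / n
        var_x = sum((x - mean_x) ** 2 for x in column)
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(column, labels))
        scores[name] = 0.0 if var_x == 0 or var_y == 0 else abs(cov) / math.sqrt(var_x * var_y)
    return scores


def recursive_feature_elimination(features, labels, n_keep, score_fn=abs_correlation_scores):
    """Drop the lowest-scoring feature one at a time until n_keep remain.

    features: dict mapping feature name -> list of values (one per patient).
    Returns (kept_names, dropped_names_in_elimination_order).
    """
    remaining = dict(features)
    dropped = []
    while len(remaining) > n_keep:
        scores = score_fn(remaining, labels)  # rescored on the current subset
        weakest = min(scores, key=scores.get)
        dropped.append(weakest)
        del remaining[weakest]
    return list(remaining), dropped


# Toy data (hypothetical): "noise" is uncorrelated with the outcome and goes first.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "strong": [0, 0, 0, 0, 1, 1, 1, 1],
    "medium": [0, 0, 0, 1, 0, 1, 1, 1],
    "noise": [0, 1, 0, 1, 0, 1, 0, 1],
}
kept, dropped = recursive_feature_elimination(features, labels, n_keep=2)
print(kept, dropped)
```

With a univariate scorer the ranking never changes between rounds, so the recursion only matters once `score_fn` wraps a fitted model whose importances shift as features are removed.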
Figure 4. Interpretation and evaluation of the machine learning model. (A) SHAP analysis on the dataset, which shows the 15 most important features and their impact on the model output. Each dot represents one patient, with blue indicating the lowest range and red the highest range of the feature; (B) Ranking of the features' importance indicated by SHAP analysis.
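SHAP attributions are built on Shapley values: each feature's contribution is its marginal effect on the prediction, averaged over all subsets of the other features. The sketch below computes them exactly by enumeration for a tiny hypothetical model. The function `shapley_values`, the model `toy_model`, and its coefficients are assumptions for illustration only. Practical SHAP implementations for tree ensembles use faster specialized algorithms rather than this brute-force form.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline).

    Features outside a coalition are replaced by their baseline value.
    Cost grows as 2**n, so this is only practical for a handful of features.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(f):
    """Hypothetical two-feature model with an interaction term."""
    return 2 * f[0] + 3 * f[1] + f[0] * f[1]


phi = shapley_values(toy_model, x=[1, 2], baseline=[0, 0])
print(phi)
```

The attributions sum to the difference between the prediction at `x` and at the baseline, and the interaction term is split equally between the two features involved.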