Jiaxin Fan, Mengying Chen, Jian Luo, Shusen Yang, Jinming Shi, Qingling Yao, Xiaodong Zhang, Shuang Du, Huiyang Qu, Yuxuan Cheng, Shuyin Ma, Meijuan Zhang, Xi Xu, Qian Wang, Shuqin Zhan.
Abstract
BACKGROUND: Carotid B-mode ultrasonography is frequently used to screen for carotid atherosclerosis (CAS). Because CAS progresses asymptomatically in most patients, early identification is challenging for clinicians, and undetected CAS may trigger ischemic stroke. Recently, machine learning has shown a strong ability to classify data and potential for prediction in the medical field. Combining machine learning with patients' electronic health records could give clinicians a more convenient and precise method to identify asymptomatic CAS.
Keywords: Asymptomatic carotid atherosclerosis; Electronic health records; Machine learning; Prediction
Year: 2021 PMID: 33820531 PMCID: PMC8020544 DOI: 10.1186/s12911-021-01480-3
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 Flowchart illustrating sample selection. (CAS, carotid atherosclerosis)
The characteristics of 18,441 participants
| Feature | Training set | Testing set | Total population |
|---|---|---|---|
| Age | 50.88 (19, 96) | 50.81 (18, 93) | 50.86 (18, 96) |
| Age subgroup, y | | | |
| 18–64 | 11,176 (86.6) | 4815 (87.0) | 15,991 (86.7) |
| > 64 | 1733 (13.4) | 717 (13.0) | 2450 (13.3) |
| Gender (male) | 7738 (59.9) | 3297 (59.6) | 11,035 (59.8) |
| SBP, mmHg | 128.69 ± 17.76 | 128.04 ± 17.43 | 128.49 ± 17.66 |
| Heart rate, beats/min | 75.71 ± 7.80 | 75.48 ± 7.93 | 75.64 ± 7.84 |
| Pulse, beats/min | 77.40 ± 10.88 | 76.92 ± 10.94 | 77.25 ± 10.90 |
| Waistline, cm | 83.66 ± 10.10 | 83.70 ± 9.89 | 83.67 ± 10.04 |
| Hypertension | 1502 (11.6) | 868 (15.7) | 2370 (12.9) |
| Diabetes mellitus | 524 (4.1) | 326 (5.9) | 850 (4.6) |
| Hyperlipidemia | 177 (1.4) | 210 (3.8) | 387 (2.1) |
| Family history | 287 (2.2) | 235 (4.2) | 522 (2.8) |
| Ever-smoker | 563 (4.4) | 284 (5.1) | 847 (4.6) |
| Glucose, mmol/L | 5.42 ± 1.36 | 5.31 ± 1.36 | 5.39 ± 1.36 |
| HDL, mmol/L | 1.23 ± 0.21 | 1.21 ± 0.25 | 1.22 ± 0.22 |
| TC, mmol/L | 4.40 ± 0.62 | 4.40 ± 0.80 | 4.40 ± 0.68 |
| Total protein, g/L | 68.89 ± 4.21 | 69.15 ± 3.68 | 68.96 ± 4.06 |
| Albumin, g/L | 44.30 ± 2.78 | 44.21 ± 2.40 | 44.27 ± 2.67 |
| Albumin/Globulin | 1.84 ± 0.28 | 1.82 ± 0.29 | 1.84 ± 0.28 |
| γ-GLT, U/L | 27.33 ± 26.86 | 27.12 ± 25.18 | 27.27 ± 26.37 |
| Platelets, 10^9/L | 217.36 ± 58.42 | 218.57 ± 57.25 | 217.72 ± 58.07 |
| CAS | | | |
| Yes | 4587 (35.5) | 1966 (35.5) | 6553 (35.5) |
| Age 18–64 y | 3252 (25.2) | 1418 (25.6) | 4670 (25.3) |
| Age > 64 y | 1335 (10.3) | 548 (9.9) | 1883 (10.2) |
| No | 8322 (64.5) | 3566 (64.5) | 11,888 (64.5) |
| Age 18–64 y | 7924 (61.4) | 3397 (61.4) | 11,321 (61.4) |
| Age > 64 y | 398 (3.1) | 169 (3.1) | 567 (3.1) |
Categorical features are presented as frequency (%). Continuous features are presented as mean ± SD, except age, which is presented as median (minimum, maximum). (SBP, systolic blood pressure; HDL, high density lipoprotein; TC, total cholesterol; γ-GLT, γ-glutamyl transpeptidase)
Comparison of the predictive performance for six models (testing set)
| Model | Acc (%) | Sp (%) | Pp (%) | Re (%) | F1 (%) | AUCROC |
|---|---|---|---|---|---|---|
| LR | 74.7 | 86.6 | 68.6 | 53.2 | 59.9 | 0.809 |
| RF | 74.5 | 89.5 | 71.3 | 47.2 | 56.8 | 0.794 |
| DT | 65.4 | 71.8 | 51.2 | 53.8 | 52.5 | 0.628 |
| XGB | 73.4 | 87.8 | 68.0 | 47.2 | 55.7 | 0.788 |
| GNB | 67.0 | 88.0 | 63.1 | 37.2 | 46.8 | 0.753 |
| KNN | 68.8 | 81.5 | 57.7 | 45.6 | 50.9 | 0.704 |
Acc, accuracy; Sp, specificity; Pp, precision; Re, recall; F1, F1 score; AUCROC, the area under the receiver operating characteristic curve; LR, logistic regression; RF, random forest; DT, decision tree; XGB, eXtreme Gradient Boosting; GNB, Gaussian Naïve Bayes; KNN, K-Nearest Neighbour
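The metrics in this table all derive from the four cells of a confusion matrix. The following is a minimal sketch of the standard definitions, not the authors' code; the function name `classification_metrics` and the counts passed to it are hypothetical and are not taken from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, specificity, precision, recall and F1 as fractions."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "Acc": (tp + tn) / (tp + fp + tn + fn),
        "Sp": tn / (tn + fp),
        "Pp": precision,
        "Re": recall,
        # F1 is the harmonic mean of precision and recall.
        "F1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=50, fp=25, tn=75, fn=50)

# The reported F1 of the LR row follows from its Pp and Re columns.
lr_f1 = 2 * 68.6 * 53.2 / (68.6 + 53.2)
print(round(lr_f1, 1))  # 59.9
```

Because F1 depends only on precision and recall, the same check can be applied to any row of the table.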
Fig. 2 Performance characteristic curves for six models (Testing set). (LR, logistic regression; RF, random forest; DT, decision tree; XGB, eXtreme Gradient Boosting; GNB, Gaussian Naïve Bayes; KNN, K-Nearest Neighbour)
Comparison of the performance for six models (tenfold cross-validation)
| Model | AUCROC | Model | AUCROC |
|---|---|---|---|
| LR | 0.812 | XGB | 0.797 |
| RF | 0.799 | GNB | 0.755 |
| DT | 0.630 | KNN | 0.701 |
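A tenfold cross-validated AUCROC of this kind can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the folds are stratified by class (the source does not say whether they were), the one-feature "midpoint" model stands in for the six real classifiers, and the data are synthetic. All function names are hypothetical.

```python
import random

def auc_roc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_folds(ys, k, seed=0):
    """Deal the shuffled indices of each class round-robin into k folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        members = [i for i, label in enumerate(ys) if label == cls]
        rng.shuffle(members)
        for j, i in enumerate(members):
            folds[j % k].append(i)
    return folds

def fit_midpoint(xs, ys):
    """Toy model: midpoint between the class means, plus which side is positive."""
    pos = [x for x, label in zip(xs, ys) if label == 1]
    neg = [x for x, label in zip(xs, ys) if label == 0]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    return (mean_pos + mean_neg) / 2, (1 if mean_pos >= mean_neg else -1)

def score(model, x):
    midpoint, sign = model
    return sign * (x - midpoint)

def cross_validated_auc(xs, ys, k=10, seed=0):
    """Train on k-1 folds, score the held-out fold, and average the k AUCs."""
    aucs = []
    for held_out in stratified_folds(ys, k, seed):
        held = set(held_out)
        train = [i for i in range(len(ys)) if i not in held]
        model = fit_midpoint([xs[i] for i in train], [ys[i] for i in train])
        aucs.append(auc_roc([ys[i] for i in held_out],
                            [score(model, xs[i]) for i in held_out]))
    return sum(aucs) / len(aucs)

# Synthetic single-feature data, not the study data.
rng = random.Random(1)
ys = [0] * 100 + [1] * 100
xs = [rng.gauss(45, 10) for _ in range(100)] + [rng.gauss(60, 10) for _ in range(100)]
mean_auc = cross_validated_auc(xs, ys)
```

Averaging over held-out folds gives an estimate that does not depend on one particular split, which may be why the cross-validated values sit close to the testing-set AUCROC values reported above.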
Fig. 3 Pearson correlation analysis regarding 19 features for Naïve Bayes. (PLT, platelets)
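For reference, the Pearson coefficient between two features can be computed as in the sketch below. The function name `pearson_r` and the toy values are hypothetical; the figure itself was presumably produced with a statistics library rather than code like this.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Toy values for illustration only, not study data.
r_up = pearson_r([1, 2, 3], [2, 4, 6])    # perfectly increasing relationship
r_down = pearson_r([1, 2, 3], [3, 2, 1])  # perfectly decreasing relationship
```

Values near ±1 between two features indicate strong linear dependence, which matters for Naïve Bayes because that model treats features as conditionally independent.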
Fig. 4 The 19 features used for decision tree model generation and their information gain values. Features were ranked by information gain, which reflects the reduction in entropy with respect to the predicted outcome. The longer the blue horizontal bar (the higher the value), the greater the feature's importance for the outcome. (SBP, systolic blood pressure; HDL, high density lipoprotein; TC, total cholesterol; γ-GLT, γ-glutamyl transpeptidase)
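Information gain is the entropy of the outcome minus the weighted entropy that remains after splitting on a feature. The sketch below assumes a categorical feature; how continuous features were binned or thresholded is not stated in the source, and the function names and toy data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy within each feature value."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data for illustration only: 1 = CAS, 0 = no CAS.
outcome = [1, 1, 0, 0]
gain_informative = information_gain(["old", "old", "young", "young"], outcome)
gain_uninformative = information_gain(["m", "f", "m", "f"], outcome)
```

In this toy case the first feature separates the outcome perfectly and recovers the full 1 bit of entropy, while the second leaves the outcome exactly as uncertain as before and gains nothing.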