Wanyue Li1,2, Yanan Song3, Kang Chen4, Jun Ying5, Zhong Zheng6, Shen Qiao3, Ming Yang3, Maonian Zhang1,2, Ying Zhang7.
Abstract
OBJECTIVE: To investigate risk factors for diabetic retinopathy (DR) and to build predictive models with machine learning on a large-sample dataset.
Keywords: diabetes & endocrinology; diabetic retinopathy; statistics & research methods
Year: 2021 PMID: 34836899 PMCID: PMC8628336 DOI: 10.1136/bmjopen-2021-050989
Source DB: PubMed Journal: BMJ Open ISSN: 2044-6055 Impact factor: 2.692
Figure 1. General schema for prediction model building and evaluation. The positive samples were defined as patients with diabetic retinopathy (DR), and negative samples were patients without DR.
Baseline analysis results of 60 variables of 32 452 patients with T2DM
| Variables | Category | Total (n=32 452) | Non-DR (n=30 414) | DR (n=2038) | P value |
| Age | | 59.71±12.64 | 59.86±12.69 | 57.43±11.67 | <0.001** |
| Sex | Female | 10 962 (33.78) | 10 217 (33.59) | 745 (36.56) | 0.007** |
| Nationality | Han | 30 461 (93.86) | 28 550 (93.87) | 1911 (93.77) | 0.834 |
| | Others | 1806 (5.57) | 1689 (5.55) | 117 (5.74) | |
| | Unknown | 185 (0.57) | 175 (0.58) | 10 (0.49) | |
| Marital status | Married | 31 526 (97.15) | 29 544 (97.14) | 1982 (97.25) | 0.820 |
| | Others | 926 (2.85) | 870 (2.86) | 56 (2.75) | |
| Permanent residence | Urban | 27 484 (84.69) | 25 830 (84.93) | 1654 (81.16) | <0.001** |
| | Rural | 4968 (15.31) | 4584 (15.07) | 384 (18.84) | |
| Occupation | Stable | 14 404 (44.39) | 13 570 (44.62) | 834 (40.92) | 0.001** |
| | Unstable | 18 048 (55.61) | 16 844 (55.38) | 1204 (59.08) | |
| Hypertension | Yes | 20 834 (64.20) | 19 328 (63.55) | 1506 (73.90) | <0.001** |
| Hyperlipidaemia | Yes | 9567 (29.48) | 9164 (30.13) | 403 (19.77) | <0.001** |
| Atherosclerosis | Yes | 17 083 (52.64) | 16 022 (52.68) | 1061 (52.06) | 0.604 |
| Stroke | Yes | 2264 (6.98) | 2050 (6.74) | 214 (10.50) | <0.001** |
| Fatty liver | Yes | 9849 (30.35) | 9165 (30.13) | 684 (33.56) | 0.001** |
| Liver cirrhosis | Yes | 550 (1.69) | 525 (1.73) | 25 (1.23) | 0.109 |
| Other chronic liver disease | Yes | 4605 (14.19) | 4311 (14.17) | 294 (14.43) | 0.778 |
| Pancreatic disease | Yes | 726 (2.24) | 691 (2.27) | 35 (1.72) | 0.118 |
| Biliary tract diseases | Yes | 4613 (14.21) | 4291 (14.11) | 322 (15.80) | 0.037* |
| Nephropathy | Yes | 8611 (26.53) | 7383 (24.28) | 1228 (60.26) | <0.001** |
| Kidney failure | Yes | 817 (2.52) | 608 (2.00) | 209 (10.26) | <0.001** |
| Nervous system disease | Yes | 2362 (7.28) | 2238 (7.36) | 124 (6.08) | 0.036* |
| Coronary heart disease | Yes | 13 114 (40.41) | 12 553 (41.27) | 561 (27.53) | <0.001** |
| Myocardial infarction | Yes | 3026 (9.32) | 2919 (9.60) | 107 (5.25) | <0.001** |
| Arrhythmias | Yes | 2790 (8.60) | 2648 (8.71) | 142 (6.97) | 0.008** |
| Respiratory system diseases | Yes | 5545 (17.09) | 5202 (17.10) | 343 (16.83) | 0.774 |
| Diabetic lower extremity arterial disease | Yes | 2963 (9.13) | 2456 (8.08) | 507 (24.88) | <0.001** |
| Hemopathy | Yes | 2556 (7.88) | 2122 (6.98) | 434 (21.30) | <0.001** |
| Rheumatic immune disease | Yes | 1252 (3.86) | 1194 (3.93) | 58 (2.85) | 0.017* |
| Endocrine disease | Yes | 8855 (27.29) | 7992 (26.28) | 863 (42.35) | <0.001** |
| Digestive system neoplasms | Yes | 2593 (7.99) | 2532 (8.33) | 61 (2.99) | <0.001** |
| Urinary neoplasms | Yes | 458 (1.41) | 438 (1.44) | 20 (0.98) | 0.109 |
| Gynaecological neoplasms | Yes | 1149 (3.54) | 1103 (3.63) | 46 (2.26) | 0.001** |
| Lung neoplasms | Yes | 855 (2.63) | 838 (2.76) | 17 (0.83) | <0.001** |
| Other neoplasms | Yes | 3327 (10.25) | 3202 (10.53) | 125 (6.13) | <0.001** |
| Insulin treatment | Yes | 20 037 (61.74) | 18 249 (60.00) | 1788 (87.73) | <0.001** |
| SBP, mm Hg | | 135±19 | 135±19 | 142±21 | <0.001** |
| DBP, mm Hg | | 79±11 | 79±11 | 82±12 | <0.001** |
| FBG, mmol/L | | 7.25 (5.93, 9.51) | 7.23 (5.94, 9.44) | 7.83 (5.78, 10.73) | <0.001** |
| HbA1c, % | | 7.1 (6.4, 8.3) | 7.1 (6.4, 8.2) | 7.9 (6.7, 9.4) | <0.001** |
| TG, mmol/L | | 1.55 (1.10, 2.28) | 1.55 (1.10, 2.27) | 1.53 (1.11, 2.34) | 0.621 |
| TC, mmol/L | | 4.34 (3.62, 5.10) | 4.32 (3.61, 5.09) | 4.52 (3.81, 5.37) | <0.001** |
| HDL, mmol/L | | 1.02 (0.86, 1.23) | 1.02 (0.85, 1.23) | 1.03 (0.87, 1.24) | 0.044* |
| LDL, mmol/L | | 2.71±0.99 | 2.70±0.97 | 2.93±1.19 | <0.001** |
| Fbg, g/L | | 3.27 (2.80, 3.98) | 3.26 (2.80, 3.94) | 3.59 (2.96, 4.62) | <0.001** |
| BUN, mmol/L | | 5.41 (4.43, 6.69) | 5.38 (4.40, 6.60) | 6.30 (4.96, 8.70) | <0.001** |
| SCr, μmol/L | | 70.1 (59.0, 83.5) | 69.9 (59.0, 82.6) | 77.5 (59.8, 114.6) | <0.001** |
| SUA, μmol/L | | 324.3±99.2 | 323.5±99.1 | 335.9±100.6 | <0.001** |
| Hb, g/L | | 137±21 | 137±20 | 128±24 | <0.001** |
| Hct, % | | 41 (37, 44) | 41 (38, 44) | 38 (34, 42) | <0.001** |
| PLT, ×10⁹/L | | 205 (170, 247) | 205 (170, 247) | 208 (172, 252) | 0.023* |
| TBil, μmol/L | | 10.4 (7.7, 14.0) | 10.5 (7.8, 14.1) | 8.9 (6.2, 12.6) | <0.001** |
| DBil, μmol/L | | 3.2 (2.3, 4.5) | 3.3 (2.4, 4.5) | 2.5 (1.6, 3.6) | <0.001** |
| TP, g/L | | 67.34±6.68 | 67.55±6.55 | 64.15±7.77 | <0.001** |
| ALB, g/L | | 41.5 (38.7, 44.1) | 41.7 (38.9, 44.2) | 39.7 (35.4, 42.3) | <0.001** |
| LDH, U/L | | 153.9 (134.9, 180.0) | 153.3 (134.5, 179.3) | 161.4 (140.9, 191.7) | <0.001** |
| ALT, U/L | | 19.6 (13.8, 29.9) | 19.8 (13.9, 30.4) | 16.3 (11.9, 23.4) | <0.001** |
| AST, U/L | | 17.2 (13.8, 22.8) | 17.4 (13.9, 23.0) | 15.6 (12.6, 20.1) | <0.001** |
| GGT, U/L | | 28.1 (18.8, 47.8) | 28.6 (19.1, 48.7) | 22.4 (15.7, 34.7) | <0.001** |
| ALP, U/L | | 68.2 (56.4, 83.2) | 68.2 (56.4, 83.2) | 67.9 (55.7, 82.9) | 0.147 |
| PT, s | | 13.1 (12.6, 13.7) | 13.1 (12.6, 13.7) | 12.9 (12.4, 13.5) | <0.001** |
| PTA, % | | 99 (90, 108) | 99 (90, 108) | 100 (91, 110) | <0.001** |
| APTT, s | | 35.8 (33.3, 38.7) | 35.8 (33.3, 38.7) | 35.7 (33.3, 38.58) | 0.145 |
| GLO, g/L | | 25.9 (22.9, 29.3) | 25.9 (22.9, 29.3) | 25.5 (22.5, 28.7) | <0.001** |
Continuous variables are expressed as mean±SD or median (IQR), depending on the result of the normality test. Categorical variables are expressed as number (percentage).
*P value <0.05; **P value <0.01.
ALB, albumin; ALP, alkaline phosphatase; ALT, alanine aminotransferase; APTT, activated partial thromboplastin time; AST, aspartate aminotransferase; BUN, blood urea nitrogen; DBil, direct bilirubin; DBP, diastolic blood pressure; DR, diabetic retinopathy; FBG, fasting blood glucose; Fbg, fibrinogen; GGT, gamma-glutamyl transferase; GLO, globulin; Hb, haemoglobin; Hct, haematocrit; HDL, high density lipoprotein; LDH, lactate dehydrogenase; LDL, low density lipoprotein; Marital status, others (single, divorced, widowed); non-DR, diabetics without diabetic retinopathy; PLT, platelet count; PT, prothrombin time; PTA, prothrombin activity; SBP, systolic blood pressure; SCr, serum creatinine; SUA, serum uric acid; T2DM, type 2 diabetes mellitus; TBil, total bilirubin; TC, total cholesterol; TG, triglyceride; TP, total protein.
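The P values for the categorical rows of the baseline table can be sanity-checked from the reported counts alone. The record does not say which test the authors used, so the sketch below assumes a Pearson chi-square test without continuity correction on the 2×2 hypertension row; the function names are illustrative, not taken from the paper.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

# Hypertension row of the baseline table: patients with the condition / group size.
dr_yes, dr_total = 1506, 2038
non_dr_yes, non_dr_total = 19328, 30414

chi2 = chi_square_2x2(dr_yes, dr_total - dr_yes,
                      non_dr_yes, non_dr_total - non_dr_yes)
p = p_value_1df(chi2)
print(f"chi-square = {chi2:.1f}, p = {p:.2e}")
```

Under these assumptions the P value falls far below 0.001, in line with the "<0.001" reported for hypertension.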
Figure 2. Feature selection accuracy curve. Accuracy peaked when 17 variables were selected (marked by a solid point).
Performance of prediction models in the validation set
| Method | Accuracy | Sensitivity | Specificity | ROC-AUC |
| XGBoost | 0.90 | 0.70 | 0.90 | 0.90 |
| SVM | 0.89 | 0.45 | 0.90 | 0.79 |
| LR | 0.86 | 0.59 | 0.86 | 0.83 |
| RF | 0.92 | 0.63 | 0.92 | 0.87 |
LR, logistic regression; RF, random forest; ROC-AUC, area under the receiver operating characteristic curve; SVM, support vector machine; XGBoost, Extreme Gradient Boosting.
Figure 3. ROC curves for the validation set. LR, logistic regression; RF, random forest; ROC-AUC, area under the receiver operating characteristic curve; SVM, support vector machine; XGBoost, Extreme Gradient Boosting.
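The four metrics in the performance table answer different questions, which matters here because only 2038 of the 32 452 patients (about 6.3%) have DR: a model can score high accuracy while missing many positives, as the SVM row (accuracy 0.89, sensitivity 0.45) suggests. The following is a minimal sketch of how these quantities are computed; the labels, scores and the 0.5 threshold are made-up illustration data, not values from the study.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = DR and 0 = non-DR."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def classification_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),   # share of DR patients detected
        "specificity": tn / (tn + fp),   # share of non-DR patients cleared
    }

def roc_auc(y_true, scores):
    """Probability that a random positive scores above a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 3 DR patients and 7 non-DR patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.8, 0.4, 0.2, 0.2, 0.1, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed decision threshold

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, round(auc, 3))
```

Accuracy, sensitivity and specificity depend on the chosen threshold, whereas ROC-AUC summarises ranking quality across all thresholds, which is why the table reports it separately.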
Figure 4. SHAP summary plot of the XGBoost model. Each dot corresponds to the contribution of one feature for one patient. Dots are coloured by feature value: red represents a higher value and blue a lower value. The higher the SHAP value of a feature, the higher the predicted risk of DR. DR, diabetic retinopathy; SHAP, Shapley Additive exPlanation.
Figure 5. SHAP dependence plot of the XGBoost model. A SHAP value above zero for a feature indicates an increased risk of DR. HbA1c, nephropathy, serum creatinine, insulin treatment and diabetic lower extremity arterial disease were risk factors for DR. Age was a protective factor against DR. DR, diabetic retinopathy; SHAP, Shapley Additive exPlanation.
Figure 6. Shapley Additive exPlanation (SHAP) force plots for a patient with diabetic retinopathy (DR) and a patient without DR.
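To make the SHAP captions concrete, the sketch below computes exact Shapley values for a deliberately tiny, hypothetical risk score with three binary features. It is not the study's XGBoost model and does not use the SHAP library: the coefficients are invented, and a single all-zero baseline patient stands in for the background data that SHAP would normally average over.

```python
from itertools import permutations

def toy_risk(x):
    """Hypothetical additive risk score with one interaction term (invented coefficients)."""
    return (0.30 * x["hba1c_high"]
            + 0.20 * x["nephropathy"]
            + 0.10 * x["hba1c_high"] * x["nephropathy"]
            - 0.05 * x["older_age"])

def shapley_values(model, patient, baseline):
    """Average each feature's marginal contribution over all feature orderings."""
    names = list(patient)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)          # start from the reference patient
        previous = model(current)
        for name in order:
            current[name] = patient[name]  # reveal this feature's true value
            value = model(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}

baseline = {"hba1c_high": 0, "nephropathy": 0, "older_age": 0}
patient = {"hba1c_high": 1, "nephropathy": 1, "older_age": 1}

phi = shapley_values(toy_risk, patient, baseline)
print(phi)
```

In this toy case the two positive contributions push the score up and the negative one pulls it down, mirroring how risk and protective factors appear on either side of zero in the plots; the contributions sum to the difference between the patient's score and the baseline score, which is the property a force plot visualises.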
Comparison with previous DR prediction or diagnosis models
| Author | Gulshan | Liao | Mendoza-Herrera | Tsao | The present prediction model |
| Year of publication | 2016 | 2018 | 2017 | 2018 | / |
| Number of samples | 9963 (EyePACS-1 data set) | 1055 | 1000 | 536 | 32 452 |
| Algorithm (best result) | Deep convolutional neural network | Logistic regression | Probit model | Support vector machines | XGBoost |
| Sensitivity (validation) | 0.975 (EyePACS-1 data set) | NA | NA | 0.933 | 0.70 |
| Specificity (validation) | 0.934 (EyePACS-1 data set) | NA | NA | 0.724 | 0.90 |
| Accuracy (validation) | NA | NA | NA | 0.795 | 0.90 |
| ROC-AUC (internal validation) | 0.991 (EyePACS-1 data set) | 0.744 | 0.778 | 0.839 | 0.90 |