Shih-Ni Chang1,2, Ya-Luan Hsiao3, Che-Chen Lin1, Chuan-Hu Sun1, Pei-Shan Chen1, Min-Yen Wu1, Sheng-Hsuan Chen1, Hsiu-Yin Chiang1, Chiung-Tzu Hsiao4, Emily K King5, Chun-Min Chang6, Chin-Chi Kuo7,8,9.
Abstract
Fasting blood glucose (FBG) values extracted from electronic medical records (EMR) are assumed to be valid in existing research, which may cause diagnostic bias due to misclassification of fasting status. We propose a machine learning (ML) algorithm to predict the fasting status of blood samples. This cross-sectional study used the EMR of a medical center from 2003 to 2018 and enrolled a total of 2,196,833 ontological FBGs from the outpatient service. The theoretical true fasting status was identified by comparing ontological FBG values with average glucose levels derived from concomitantly tested HbA1c, based on multiple criteria. In addition to multiple logistic regression, we extracted 67 features to predict fasting status using eXtreme Gradient Boosting (XGBoost). The discrimination and calibration of the prediction models were also assessed. Real-world performance was gauged by the prevalence of ineffective glucose measurement (IGM). Of the 784,340 ontologically labeled fasting samples, 77.1% were considered theoretical FBGs. In patients without diabetes mellitus (DM), the median (IQR) glucose and HbA1c levels were 94.0 (87.0, 102.0) mg/dL and 5.6 (5.4, 5.9)% for ontological fasting samples and 92.0 (86.0, 99.0) mg/dL and 5.6 (5.4, 5.9)% for theoretical fasting samples. In the parsimonious approach, XGBoost showed calibration comparable to that of multiple logistic regression, with an AUROC of 0.887 versus 0.868, and identified glucose level, home-to-hospital distance, age, and concomitant serum creatinine and lipid testing as important predictors. The prevalence of IGM dropped from 27.8% based on ontological FBGs to 0.48% with algorithm-verified FBGs. The proposed ML algorithm or multiple logistic regression model aids in verifying fasting status.
Year: 2022 PMID: 35831336 PMCID: PMC9279373 DOI: 10.1038/s41598-022-15161-2
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
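
The labelling step described in the abstract compares a measured glucose value with an average glucose level derived from same-day HbA1c. The sketch below illustrates the idea under stated assumptions: it uses the widely cited ADAG regression (eAG in mg/dL = 28.7 × HbA1c − 46.7) for the conversion, and a single hypothetical comparison rule (`looks_fasting`, with a hypothetical `margin_mg_dl` parameter). The record does not state which conversion or which criteria the authors used, so this is not their multi-criteria definition.

```python
def estimated_average_glucose(hba1c_percent: float) -> float:
    """Convert HbA1c (%) to estimated average glucose (mg/dL).

    Uses the ADAG regression eAG = 28.7 * HbA1c - 46.7.
    Assumption: the source does not say which conversion was applied.
    """
    return 28.7 * hba1c_percent - 46.7


def looks_fasting(glucose_mg_dl: float, hba1c_percent: float,
                  margin_mg_dl: float = 0.0) -> bool:
    """Hypothetical single-criterion rule, for illustration only.

    A sample is treated as plausibly fasting when its glucose does not
    exceed the HbA1c-derived average glucose plus an optional margin.
    """
    threshold = estimated_average_glucose(hba1c_percent) + margin_mg_dl
    return glucose_mg_dl <= threshold


if __name__ == "__main__":
    # Example values taken from the non-DM medians quoted in the abstract.
    print(estimated_average_glucose(5.6))
    print(looks_fasting(94.0, 5.6))
```

A real labelling scheme would combine several such conditions, which is why the rule here is kept deliberately minimal.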
Figure 1. Sample selection process from ontological glucose ante cibum (AC) to theoretical fasting classification, followed by splitting of the dataset into training and testing datasets.
Table 1. Comparison of demographic and biochemical profiles of ontologically fasting samples with concomitant HbA1c measurement according to DM and theoretical fasting status.
| Variables | Non-DM: Overall (n = 118,383) | Non-DM: Fasting (n = 53,080) | Non-DM: Nonfasting (n = 65,303) | P value | DM: Overall (n = 241,019) | DM: Fasting (n = 126,621) | DM: Nonfasting (n = 114,398) | P value |
|---|---|---|---|---|---|---|---|---|
| Age, years | 55.3 (15.0) | 50.7 (15.6) | 59.1 (13.4) | < 0.01 | 61.9 (13.9) | 62.6 (13.3) | 61.1 (14.6) | < 0.01 |
| Male | 66,333 (56.0) | 29,147 (54.9) | 37,186 (56.9) | < 0.01 | 127,301 (52.8) | 66,791 (52.8) | 60,510 (52.9) | 0.47 |
| BMI, kg/m2 | 25.4 (4.85) | 24.8 (4.83) | 26.1 (4.78) | < 0.01 | 26.1 (4.63) | 26.2 (4.65) | 25.9 (4.60) | < 0.01 |
| Timing of the day | | | | < 0.01 | | | | < 0.01 |
| 7:00–12:59 | 109,784 (92.7) | 48,168 (90.8) | 61,616 (94.4) | | 225,770 (96.7) | 119,210 (94.2) | 106,560 (93.2) | |
| 13:00–17:59 | 7465 (6.31) | 4322 (8.14) | 3143 (4.81) | | 11,761 (4.88) | 5805 (4.58) | 5956 (5.21) | |
| 18:00–22:59 | 1134 (0.96) | 590 (1.11) | 544 (0.83) | | 3488 (1.45) | 1606 (1.27) | 1882 (1.65) | |
| Interval between request and sampling day, day | 32.0 (40.8) | 24.9 (37.9) | 37.8 (42.2) | < 0.01 | 57.6 (34.2) | 58.71 (33.3) | 56.39 (35.2) | < 0.01 |
| No. of outpatient visits | 9.42 (17.6) | 7.44 (15.8) | 11.0 (18.7) | < 0.01 | 19.56 (24.7) | 19.07 (23.7) | 20.11 (25.8) | < 0.01 |
| Distance to hospital, km | 15.4 (76.3) | 17.7 (89.7) | 13.6 (63.9) | < 0.01 | 10.6 (55.4) | 10.6 (51.7) | 10.59 (59.2) | 0.87 |
| Concomitant lipid testing | 95,750 (80.9) | 47,735 (89.9) | 48,015 (73.5) | < 0.01 | 147,021 (61.0) | 81,320 (64.2) | 65,701 (57.4) | < 0.01 |
| Division | | | | < 0.01 | | | | < 0.01 |
| General medicine | 19,268 (16.3) | 6913 (13.02) | 12,355 (18.92) | | 19,988 (8.29) | 10,277 (8.12) | 9711 (8.49) | |
| Metabolism/endocrinology | 14,190 (12.0) | 3553 (6.69) | 10,637 (16.29) | | 132,971 (55.2) | 70,496 (55.67) | 62,475 (54.61) | |
| Nephrology | 7461 (6.30) | 2070 (3.9) | 5391 (8.26) | | 21,015 (8.72) | 10,298 (8.13) | 10,717 (9.37) | |
| Cardiology | 20,837 (17.6) | 6260 (11.79) | 14,577 (22.32) | | 36,060 (15.0) | 20,265 (16) | 15,795 (13.81) | |
| Family medicine | 10,810 (9.13) | 4708 (8.87) | 6102 (9.34) | | 16,022 (6.65) | 8243 (6.51) | 7779 (6.8) | |
| Health management center | 34,572 (29.2) | 25,217 (47.51) | 9355 (14.33) | | 766 (0.32) | 458 (0.36) | 308 (0.27) | |
| Surgery | 5469 (4.62) | 2782 (5.24) | 2687 (4.11) | | 2915 (1.21) | 1552 (1.23) | 1363 (1.19) | |
| Pediatrics | 1109 (0.94) | 710 (1.34) | 399 (0.61) | | 3607 (1.50) | 1065 (0.84) | 2542 (2.22) | |
| Chinese medicine | 4232 (3.57) | 735 (1.38) | 3497 (5.36) | | 7226 (3.00) | 3753 (2.96) | 3473 (3.04) | |
| Other | 435 (0.37) | 132 (0.25) | 303 (0.46) | | 449 (0.19) | 214 (0.17) | 235 (0.21) | |
| Hypertension | 33,868 (28.6) | 10,916 (20.6) | 22,952 (35.2) | < 0.01 | 143,831 (59.7) | 76,406 (60.34) | 67,425 (58.94) | < 0.01 |
| Coronary artery disease | 12,906 (10.9) | 3981 (7.50) | 8925 (13.7) | < 0.01 | 46,568 (19.3) | 24,982 (19.73) | 21,586 (18.87) | < 0.01 |
| Stroke | 8713 (7.36) | 3236 (6.10) | 5477 (8.39) | < 0.01 | 31,375 (13.0) | 16,602 (13.11) | 14,773 (12.91) | 0.15 |
| Glucose, mg/dL | 124 (47.6) | 96.9 (19.5) | 145 (52.3) | < 0.01 | 163 (62.5) | 132 (36.7) | 197 (67.1) | < 0.01 |
| HbA1c, % | 6.48 (1.44) | 5.86 (0.98) | 6.97 (1.57) | < 0.01 | 7.55 (1.50) | 7.59 (1.42) | 7.50 (1.58) | < 0.01 |
| Hemoglobin, g/dL | 14.0 (1.97) | 14.1 (1.77) | 13.7 (2.23) | < 0.01 | 11.8 (2.29) | 12.0 (2.21) | 11.6 (2.35) | < 0.01 |
| Total cholesterol, mg/dL | 191 (41.5) | 193 (39.4) | 190 (43.6) | < 0.01 | 177 (43.5) | 174 (41.6) | 180 (45.5) | < 0.01 |
| LDL, mg/dL | 114 (34.6) | 116 (33.7) | 111 (35.4) | < 0.01 | 97.2 (33.0) | 96.6 (32.2) | 97.9 (33.9) | < 0.01 |
| HDL, mg/dL | 47.8 (13.7) | 49.9 (14.1) | 45.2 (12.6) | < 0.01 | 43.6 (12.5) | 43.6 (12.2) | 43.6 (12.8) | 0.95 |
| Triglyceride, mg/dL | 148 (168) | 125 (114) | 172 (208) | < 0.01 | 179 (231) | 160 (191) | 200 (269) | < 0.01 |
| BUN, mg/dL | 14.6 (12.7) | 12.3 (8.83) | 18.2 (16.3) | < 0.01 | 33.0 (24.3) | 31.0 (23.2) | 35.0 (25.3) | < 0.01 |
| Serum creatinine, mg/dL | 1.14 (1.51) | 1.00 (1.09) | 1.28 (1.82) | < 0.01 | 1.63 (2.20) | 1.53 (1.99) | 1.76 (2.41) | < 0.01 |
| Serum sodium, mmol/L | 139 (3.61) | 140 (3.22) | 138 (3.86) | < 0.01 | 137 (3.83) | 138 (3.41) | 136 (4.06) | < 0.01 |
| Serum potassium, mmol/L | 4.09 (0.52) | 4.02 (0.49) | 4.15 (0.54) | < 0.01 | 4.33 (0.61) | 4.34 (0.58) | 4.32 (0.64) | < 0.01 |
| AST, IU/L | 30.5 (26.0) | 27.4 (19.2) | 34.9 (32.8) | < 0.01 | 33.1 (29.1) | 32.1 (25.6) | 34.1 (32.1) | < 0.01 |
| ALT, IU/L | 33.1 (33.1) | 30.0 (29.6) | 36.4 (36.0) | < 0.01 | 31.1 (28.7) | 30.4 (26.6) | 31.9 (30.8) | < 0.01 |
| Uric acid, mg/dL | 6.00 (1.59) | 5.89 (1.55) | 6.13 (1.63) | < 0.01 | 6.29 (1.80) | 6.26 (1.78) | 6.32 (1.83) | < 0.01 |
| Albumin, g/dL | 4.50 (0.41) | 4.55 (0.35) | 4.41 (0.48) | < 0.01 | 3.97 (0.54) | 4.03 (0.50) | 3.92 (0.56) | < 0.01 |
| Estimated blood osmolality | 291 (7.90) | 290 (7.02) | 293 (8.68) | < 0.01 | 296 (10.1) | 294 (9.48) | 297 (10.5) | < 0.01 |
| Urine specific gravity | 1.02 (0.01) | 1.02 (0.01) | 1.02 (0.01) | 0.13 | 1.02 (0.01) | 1.02 (0.01) | 1.02 (0.01) | < 0.01 |
| Urine pH | 6.01 (0.68) | 6.03 (0.70) | 5.98 (0.66) | < 0.01 | 6.01 (0.63) | 6.03 (0.65) | 5.99 (0.62) | < 0.01 |
Values for continuous and categorical variables are expressed as mean (standard deviation) and frequency (%), respectively. Levels of glucose, HbA1c, and other biochemical variables were measured on the same day. P values indicate the significance of differences between theoretically fasting and nonfasting samples. BMI: body mass index; LDL: low-density lipoprotein; HDL: high-density lipoprotein; BUN: blood urea nitrogen; AST: aspartate aminotransferase; ALT: alanine transaminase.
Figure 2. Density plots of ontological glucose AC in selected samples: (A) all samples, stratified by the availability of HbA1c measured on the same day; (B) samples with HbA1c measured on the same day, stratified by theoretical fasting and nonfasting status; (C) samples with HbA1c measured on the same day, stratified by fasting and nonfasting status, in patients without DM; (D) samples with HbA1c measured on the same day, stratified by fasting and nonfasting status, in patients with DM. The dark blue dashed line marks a glucose value of 100 mg/dL, and the red dashed line marks a glucose value of 126 mg/dL.
Figure 3. Scatter plot of HbA1c and fasting glucose levels. The figure is divided into four quadrants (a, b, c, and d) according to the diagnostic criteria of the American Diabetes Association (ADA) by diabetic status.
Table 2. Odds ratios (95% confidence intervals) of being in the theoretically nonfasting status, using the ontological AC sample in the training dataset (n = 277,822*).
| Variables | Univariate analysis: OR (95% CI) | P value | Model 1: OR (95% CI) | P value | Model 2: OR (95% CI) | P value |
|---|---|---|---|---|---|---|
| Glucose, per 5 mg/dL | 1.20 (1.20–1.21) | < 0.001 | 1.16 (1.16–1.16) | < 0.001 | 1.23 (1.23–1.23) | < 0.001 |
| Age, per 5 years | 1.16 (1.16–1.17) | < 0.001 | 1.01 (1.00–1.01) | < 0.001 | 1.05 (1.04–1.05) | < 0.001 |
| Male | 1.05 (1.03–1.06) | < 0.001 | 1.09 (1.07–1.11) | < 0.001 | 1.16 (1.14–1.18) | < 0.001 |
| Timing of the day | | | | | | |
| 7:00–12:59 | Ref | | Ref | | Ref | |
| 13:00–17:59 | 0.96 (0.93–1.00) | 0.03 | 0.90 (0.86–0.94) | < 0.001 | 0.87 (0.83–0.91) | < 0.001 |
| 18:00–22:59 | 1.14 (1.07–1.22) | < 0.001 | 0.76 (0.70–0.84) | < 0.001 | 0.78 (0.70–0.86) | < 0.001 |
| Interval between request and sampling, per 28 day | 1.07 (1.05–1.08) | < 0.001 | 0.95 (0.94–0.96) | < 0.001 | 1.08 (1.07–1.08) | < 0.001 |
| No. of outpatient visits, per 4 visits | 1.03 (1.02–1.04) | < 0.001 | 0.99 (0.99–1.00) | < 0.001 | 1.02 (1.02–1.02) | < 0.001 |
| Distance from home to hospital, per 10 km | 0.998 (0.996–1.000) | 0.04 | 0.998 (0.997–1.000) | 0.03 | 0.998 (0.996–1.00) | 0.01 |
| Division | | | | | | |
| Health management center | Ref | | Ref | | Ref | – |
| General medicine | 3.39 (3.27–3.52) | < 0.001 | 1.49 (1.43–1.56) | < 0.001 | 2.24 (2.14–2.35) | < 0.001 |
| Metabolism/endocrinology | 2.57 (2.49–2.65) | < 0.001 | 0.67 (0.65–0.70) | < 0.001 | 2.11 (2.02–2.21) | < 0.001 |
| Nephrology | 3.45 (3.32–3.58) | < 0.001 | 1.19 (1.13–1.25) | < 0.001 | 2.38 (2.25–2.51) | < 0.001 |
| Cardiology | 3.03 (2.93–3.13) | < 0.001 | 1.09 (1.04–1.14) | < 0.001 | 1.99 (1.89–2.08) | < 0.001 |
| Family medicine | 2.80 (2.69–2.91) | < 0.001 | 1.03 (0.98–1.08) | 0.22 | 1.90 (1.81–2.00) | < 0.001 |
| Surgery | 2.57 (2.43–2.72) | < 0.001 | 1.30 (1.22–1.39) | < 0.001 | 1.74 (1.62–1.87) | < 0.001 |
| Pediatrics | 4.43 (4.13–4.77) | < 0.001 | 0.67 (0.60–0.75) | < 0.001 | 1.78 (1.59–1.99) | < 0.001 |
| Chinese medicine | 4.07 (3.87–4.28) | < 0.001 | 1.00 (0.94–1.06) | 0.92 | 1.86 (1.74–1.99) | < 0.001 |
| Other | 3.98 (3.41–4.64) | < 0.001 | 1.30 (1.09–1.56) | 0.004 | 1.95 (1.60–2.37) | < 0.001 |
| Hypertension | 1.04 (1.03–1.06) | < 0.001 | | | 1.02 (1.00–1.05) | 0.03 |
| Diabetes mellitus | 0.70 (0.68–0.71) | < 0.001 | | | 0.09 (0.09–0.09) | < 0.001 |
| Coronary artery disease | 1.07 (1.05–1.09) | < 0.001 | | | 0.94 (0.91–0.96) | < 0.001 |
| Stroke | 1.02 (0.99–1.04) | 0.12 | | | 0.99 (0.96–1.02) | 0.64 |
| Statin use | 0.78 (0.77–0.80) | < 0.001 | | | 0.82 (0.80–0.83) | < 0.001 |
| Concomitant lipid profile test | 0.69 (0.68–0.70) | < 0.001 | | | 0.78 (0.76–0.80) | < 0.001 |
| AIC | | | 297,987 | | 262,835 | |
| AUC | | | 0.820 (0.819–0.822) | | 0.867 (0.866–0.869) | |
| | | | Ref | | < 0.001 | |
*The sample size was reduced because of missing values for the variable "Distance from home to hospital".
AIC Akaike information criterion, AUC area under the curve.
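
The odds ratios above are reported per increment (for example, per 5 mg/dL of glucose or per 5 years of age). Because a logistic regression coefficient is linear on the log-odds scale, an odds ratio can be converted to a different increment by scaling its logarithm. The helper below is a minimal sketch of that arithmetic; the function name `rescale_odds_ratio` is hypothetical, and the conversion assumes the variable entered the model as a linear term.

```python
import math


def rescale_odds_ratio(odds_ratio: float, from_step: float, to_step: float) -> float:
    """Re-express an odds ratio reported per `from_step` units as per `to_step` units.

    Assumes a linear term in the logistic model, so that
    OR(to_step) = exp(log(OR(from_step)) * to_step / from_step).
    """
    if odds_ratio <= 0 or from_step <= 0:
        raise ValueError("odds_ratio and from_step must be positive")
    beta_per_unit = math.log(odds_ratio) / from_step  # log-odds change per single unit
    return math.exp(beta_per_unit * to_step)


if __name__ == "__main__":
    # Model 2 glucose odds ratio of 1.23 per 5 mg/dL, re-expressed per 10 mg/dL.
    print(round(rescale_odds_ratio(1.23, 5, 10), 4))
```

The same conversion applies to the confidence limits, since each limit is itself an exponentiated log-odds value.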
Table 3. Comparison of the performance of XGBoost, CatBoost, H2O Ensemble, and logistic regression models in determining fasting status in the testing dataset (n = 70,644).
| Algorithm/modeling strategy | Feature | Sensitivity | Specificity | Precision | F1-score | Accuracy | AUC |
|---|---|---|---|---|---|---|---|
| Logistic regression | Model 2* | 0.7608 | 0.8084 | 0.8081 | 0.7804 | 0.7845 | 0.868 (0.865–0.870) |
| XGBoost | Model 2* | 0.8261 | 0.7700 | 0.7844 | 0.8047 | 0.7982 | 0.887 (0.885–0.890) |
| CatBoost | Model 2* | 0.8415 | 0.7614 | 0.7813 | 0.8103 | 0.8017 | 0.889 (0.887–0.892) |
| H2O Ensemble | Model 2* | 0.8823 | 0.7093 | 0.7546 | 0.8135 | 0.7964 | 0.886 (0.884–0.889) |
| XGBoost | 67 | 0.8394 | 0.7785 | 0.7934 | 0.8158 | 0.8092 | 0.896 (0.894–0.898) |
| CatBoost | 67 | 0.8511 | 0.7574 | 0.7805 | 0.8142 | 0.8046 | 0.892 (0.890–0.894) |
| H2O Ensemble | 67 | 0.8770 | 0.7399 | 0.7735 | 0.8220 | 0.8089 | 0.897 (0.894–0.899) |
| XGBoost | Top 45 | 0.8369 | 0.7789 | 0.7932 | 0.8145 | 0.8081 | 0.895 (0.892–0.897) |
| XGBoost | Top 35 | 0.8413 | 0.7735 | 0.7901 | 0.8149 | 0.8076 | 0.894 (0.892–0.897) |
| XGBoost | Top 25 | 0.8414 | 0.7706 | 0.7880 | 0.8138 | 0.8062 | 0.893 (0.891–0.896) |
| XGBoost | Top 10 | 0.8502 | 0.7496 | 0.7748 | 0.8108 | 0.8002 | 0.887 (0.885–0.890) |
*Model 2 involves the features including glucose, age, male, timing of the day, interval between request and sampling, No. of outpatient visits, distance from home to hospital, division, hypertension, diabetes, coronary artery disease, stroke, statin use, and concomitant lipid testing as in Table 2.
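
The metrics in the table above are all functions of a binary confusion matrix. The sketch below shows the standard definitions; the class and function names are illustrative, the counts in the usage example are made up, and the record does not state which status (fasting or nonfasting) was treated as the positive class, so no assumption is made about that here.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts from a binary confusion matrix."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def classification_metrics(c: Confusion) -> dict:
    """Standard threshold-based metrics computed from confusion counts."""
    sensitivity = c.tp / (c.tp + c.fn)  # also called recall
    specificity = c.tn / (c.tn + c.fp)
    precision = c.tp / (c.tp + c.fp)
    accuracy = (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1_from(precision, sensitivity),
        "accuracy": accuracy,
    }


def f1_from(precision: float, recall: float) -> float:
    """F1-score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(classification_metrics(Confusion(tp=80, fp=30, tn=70, fn=20)))
    # Precision and sensitivity of the XGBoost (Model 2) row in the table.
    print(f1_from(0.7844, 0.8261))
```

The AUC column cannot be recovered from a single confusion matrix, because it summarizes performance across all classification thresholds.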
Figure 4. Top-ranked 45 features identified using the proposed XGBoost algorithm. SCr: serum creatinine; ALT: alanine transaminase; AST: aspartate aminotransferase; BUN: blood urea nitrogen; RBC: red blood cell counts.
Figure 5. Discrimination statistics (A) and calibration plots (B) for the multivariable logistic regression model and the parsimonious machine learning models in the testing dataset.
Table 4. Prevalence of fasting glucose ≥ 126 mg/dL and the proportion of ineffective glucose measurement (IGM) from 2003 to 2018 at China Medical University Hospital. The letters A to E label the count for each condition; the percentage shown for each of conditions B to E is the ratio indicated in square brackets.
| | 2003–2004 | 2005–2006 | 2007–2008 | 2009–2010 | 2011–2012 | 2013–2014 | 2015–2016 | 2017–2018 |
|---|---|---|---|---|---|---|---|---|
| (A) Number of patients with ontological AC | 61,066 | 61,730 | 75,105 | 88,008 | 100,581 | 111,988 | 141,614 | 151,371 |
| (B) Ontological AC ≥ 126 mg/dL, n (%) [B/A] | 17,353 (28.4) | 16,749 (27.1) | 17,842 (23.8) | 19,862 (22.6) | 22,832 (22.7) | 24,428 (21.8) | 27,131 (19.2) | 28,655 (18.9) |
| (C) IGM, n (%) [C/B] | 4943 (28.5) | 4771 (28.5) | 5046 (28.3) | 5422 (27.3) | 6595 (28.9) | 6676 (27.3) | 7444 (27.4) | 7584 (26.5) |
| (D) Algorithm-verified AC ≥ 126 mg/dL, n (%) [D/A] | 8607 (14.1) | 8458 (13.7) | 9073 (12.1) | 10,328 (11.7) | 11,456 (11.4) | 12,599 (11.3) | 13,938 (9.84) | 15,267 (10.1) |
| (E) IGM, n (%) [E/D] | 4 (0.05) | 56 (0.66) | 38 (0.42) | 58 (0.56) | 56 (0.49) | 71 (0.56) | 74 (0.53) | 89 (0.58) |
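
The bracketed ratios in the table can be reproduced directly from the counts. The sketch below recomputes the IGM proportion for one period and pools the counts across all periods. Summing the two-year periods is an assumption of this sketch, so the pooled values may differ slightly from the overall figures quoted in the abstract (27.8% and 0.48%). The variable and function names are illustrative.

```python
# Counts transcribed from the table above, one entry per two-year period
# (2003-2004 through 2017-2018).
ontological_ge_126 = [17353, 16749, 17842, 19862, 22832, 24428, 27131, 28655]  # row B
igm_ontological = [4943, 4771, 5046, 5422, 6595, 6676, 7444, 7584]             # row C
verified_ge_126 = [8607, 8458, 9073, 10328, 11456, 12599, 13938, 15267]        # row D
igm_verified = [4, 56, 38, 58, 56, 71, 74, 89]                                 # row E


def percent(numerator: int, denominator: int) -> float:
    """Proportion expressed as a percentage."""
    return 100.0 * numerator / denominator


def pooled_percent(numerators: list, denominators: list) -> float:
    """Pooled percentage across periods: sum of numerators over sum of denominators."""
    return percent(sum(numerators), sum(denominators))


if __name__ == "__main__":
    # IGM proportion [C/B] for the first period.
    print(round(percent(igm_ontological[0], ontological_ge_126[0]), 1))
    # Pooled IGM proportions before and after algorithm verification.
    print(round(pooled_percent(igm_ontological, ontological_ge_126), 2))
    print(round(pooled_percent(igm_verified, verified_ge_126), 2))
```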