Kazuharu Kawano, Yoichiro Otaki, Natsuko Suzuki, Shouichi Fujimoto, Kunitoshi Iseki, Toshiki Moriyama, Kunihiro Yamagata, Kazuhiko Tsuruya, Ichiei Narita, Masahide Kondo, Yugo Shibagaki, Masato Kasahara, Koichi Asahi, Tsuyoshi Watanabe, Tsuneo Konta.
Abstract
Early detection and treatment of diseases through health checkups are effective in improving life expectancy. In this study, we compared the predictive ability for 5-year mortality between two machine learning-based models (gradient boosting decision tree [XGBoost] and neural network) and a conventional logistic regression model in 116,749 health checkup participants. We built prediction models using a training dataset consisting of 85,361 participants in 2008 and evaluated the models using a test dataset consisting of 31,388 participants from 2009 to 2014. The predictive ability was evaluated by the values of the area under the receiver operating characteristic curve (AUC) in the test dataset. The AUC values were 0.811 for XGBoost, 0.774 for neural network, and 0.772 for logistic regression models, indicating that the predictive ability of XGBoost was the highest. The importance rating of each explanatory variable was evaluated using the SHapley Additive exPlanations (SHAP) values, which were similar among these models. This study showed that the machine learning-based model has a higher predictive ability than the conventional logistic regression model and may be useful for risk assessment and health guidance for health checkup participants.
Year: 2022 PMID: 35986034 PMCID: PMC9391467 DOI: 10.1038/s41598-022-18276-8
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Baseline characteristics of training and test data set.
| Characteristic | Training data (2008) | Test data (2009–2014) |
|---|---|---|
| Total participants, number (%) | 85,361 (73.1) | 31,388 (26.9) |
| Male, number (%) | 35,503 (41.5) | 14,022 (44.7) |
| Female, number (%) | 49,858 (58.4) | 17,366 (55.3) |
| Age, year | 61.7 (7.1) | 61.5 (7.3) |
| Height, cm | 157.6 (8.4) | 158.6 (8.5) |
| Body weight, kg | 57.4 (10.4) | 58.1 (10.7) |
| Systolic blood pressure, mmHg | 127.8 (17.0) | 129.1 (17.4) |
| Diastolic blood pressure, mmHg | 76.0 (10.6) | 76.6 (10.9) |
| Uric acid, mg/dL | 5.1 (1.4) | 5.1 (1.4) |
| Triglycerides, mg/dL | 121.4 (83.6) | 124.7 (89.7) |
| HDL-C, mg/dL | 62.1 (16.2) | 62.5 (16.8) |
| LDL-C, mg/dL | 125.3 (30.3) | 126.6 (32.0) |
| AST, U/L | 24.0 (13.0) | 25.3 (16.3) |
| γGTP, IU/L | 37.7 (49.5) | 42.6 (61.2) |
| eGFR, mL/min/1.73m2 | 76.5 (17.4) | 78.8 (19.9) |
| HbA1c, % | 5.7 (0.6) | 5.8 (0.8) |
| Urine protein, number (%) | (−) 74,741 (87.5)/(±) 6283 (7.3)/(+) 2884 (3.3)/(2+) 1009 (1.1)/(3+) 294 (0.3) | (−) 27,288 (86.9)/(±) 2338 (7.4)/(+) 1165 (3.7)/(2+) 399 (1.3)/(3+) 141 (0.4) |
| Urine glucose, number (%) | (−) 82,839 (97.2)/(±) 612 (0.7)/(+) 726 (0.9)/(2+) 453 (0.5)/(3+) 577 (0.7) | (−) 30,144 (96.2)/(±) 316 (1.0)/(+) 327 (1.0)/(2+) 194 (0.6)/(3+) 339 (1.1) |
| Urine occult blood, number (%) | (−) 36,997 (67.6)/(±) 8982 (16.4)/(+) 5077 (9.2)/(2+) 2694 (4.9)/(3+) 1018 (1.9) | (−) 16,852 (69.8)/(±) 3516 (14.6)/(+) 2176 (9.0)/(2+) 1182 (4.9)/(3+) 401 (1.7) |
| Smoking, number (%) | 12,017 (14.0) | 5,308 (16.9) |
| Alcohol intake, number (%) | 39,032 (45.7) | 16,446 (52.4) |
| Antihypertensive medication, number (%) | 23,016 (27.0) | 8850 (28.2) |
| Antidiabetic medication, number (%) | 3730 (4.4) | 1507 (4.8) |
| Lipid-lowering medication, number (%) | 12,387 (14.5) | 4381 (14.0) |
| History of stroke, number (%) | 2534 (3.0) | 1200 (3.8) |
| History of heart disease, number (%) | 4029 (4.7) | 1542 (4.9) |
| History of renal failure, number (%) | 409 (0.5) | 116 (0.4) |
| Weight gain over 10 kg, number (%) | 24,154 (28.3) | 10,157 (32.4) |
| Mild exercise, number (%) | 31,984 (37.5) | 11,260 (35.9) |
| Walking, number (%) | 38,285 (44.9) | 14,482 (46.1) |
| Faster walking, number (%) | 37,508 (43.9) | 15,053 (48.0) |
| Eating speed, number (%) | Quicker 20,349 (23.8) | Quicker 8110 (25.8) |
| | Normal 46,311 (54.3) | Normal 19,618 (62.5) |
| | Late 8208 (9.6) | Late 3114 (9.9) |
| Eating supper 2 h before bedtime, number (%) | 11,936 (14.0) | 5554 (17.7) |
| Sleeping well, number (%) | 57,308 (67.1) | 23,760 (75.7) |
| Skipping breakfast, number (%) | 6335 (7.4) | 2979 (9.5) |
| Late night snack, number (%) | 9774 (13.0) | 4567 (14.8) |
Mean (standard deviation) or number (%).
HDL-C, high-density lipoprotein cholesterol; LDL-C, low-density lipoprotein cholesterol; AST, aspartate aminotransferase; γGTP, γ-glutamyl transpeptidase; eGFR, estimated glomerular filtration rate; HbA1c, hemoglobin A1c.
Parameters of predictive model.
| Predictive model | Parameter | Value |
|---|---|---|
| XGBoost | n_estimators | 100 |
| | learning_rate | 0.1 |
| | max_depth | 5 |
| | min_child_weight | 5 |
| | Gamma | 0.2 |
| | colsample_bytree | 0.4 |
| Neural network | Unit | 16 |
| | Depth | 6 |
| | Activation | ReLU |
| | Batch_size | 512 |
| | Epochs | 60 |
| Logistic regression model | C | 0.1 |
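The parameter table can be read as keyword arguments. The following is a minimal sketch, assuming the XGBoost scikit-learn wrapper (`xgboost.XGBClassifier`) and scikit-learn's `LogisticRegression`; the record does not state which APIs the authors called, and mapping "Gamma" to `gamma` is an assumption. The neural-network framework is not named, so its settings are kept as a plain dictionary with illustrative key names.

```python
# Hyperparameters transcribed from the parameter table.
XGB_PARAMS = {
    "n_estimators": 100,
    "learning_rate": 0.1,
    "max_depth": 5,
    "min_child_weight": 5,
    "gamma": 0.2,  # listed as "Gamma" in the table (assumed mapping)
    "colsample_bytree": 0.4,
}

# The table names no framework for the neural network, so these settings
# stay framework-neutral; the key names are illustrative only.
NN_PARAMS = {
    "units": 16,
    "depth": 6,
    "activation": "relu",
    "batch_size": 512,
    "epochs": 60,
}

LOGREG_PARAMS = {"C": 0.1}


def build_models():
    """Instantiate the two models whose libraries can reasonably be guessed.

    Assumption: xgboost and scikit-learn are installed. The imports are
    local so the dictionaries above remain usable without either library.
    """
    from sklearn.linear_model import LogisticRegression
    from xgboost import XGBClassifier

    return {
        "xgb": XGBClassifier(**XGB_PARAMS),
        "logreg": LogisticRegression(**LOGREG_PARAMS),
    }
```

All other constructor arguments are left at library defaults here, which may differ from what the authors used.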
Figure 1. Predictive ability of the model using test data. xgb, XGBoost; nn, neural network.
Figure 2. Predictive ability of the model using innate validation data. xgb, XGBoost; nn, neural network.
Predictive ability of the model using test data.
| Metric | XGBoost | Neural network | Logistic regression |
|---|---|---|---|
| AUC | 0.811 | 0.774 | 0.772 |
| Accuracy | 0.908 | 0.890 | 0.891 |
| Precision | 0.403 | 0.318 | 0.319 |
| Recall | 0.445 | 0.395 | 0.390 |
| F1 score | 0.423 | 0.352 | 0.351 |
AUC, area under the receiver operating characteristic curve.
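For readers unfamiliar with the AUC, it can be read as the probability that a randomly chosen participant who died receives a higher predicted risk than a randomly chosen survivor. The sketch below computes it directly from that definition; the function name and the toy data are hypothetical and are not taken from the study.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as one half.
    This pairwise loop is O(n_pos * n_neg): fine for illustration, but
    too slow for a dataset the size of the test set.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = died within 5 years, 0 = survived.
toy_labels = [0, 0, 1, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]
toy_auc = roc_auc(toy_labels, toy_scores)  # 8 of 9 positive/negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which puts the reported values of 0.772 to 0.811 in context.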
Figure 3. Confusion matrix of the predictive models using test data.
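Accuracy, precision, recall, and F1 score all derive from the four cells of a confusion matrix. The sketch below shows the standard definitions and checks that the reported F1 scores agree with the reported precision and recall to three decimals. The counts in the usage line are hypothetical, since the confusion-matrix counts themselves are not reproduced in this record.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard metrics from confusion-matrix counts (positive = death)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the test-data results table.
reported = {
    "XGBoost": (0.403, 0.445, 0.423),
    "Neural network": (0.318, 0.395, 0.352),
    "Logistic regression": (0.319, 0.390, 0.351),
}
recomputed_f1 = {
    name: round(f1_score(p, r), 3) for name, (p, r, _) in reported.items()
}

# Hypothetical counts, for illustration only.
example = classification_metrics(tp=40, fp=60, fn=50, tn=850)
```

Because deaths are rare in this cohort, accuracy near 0.9 can coexist with precision and recall below 0.5, which is why the table reports all four metrics.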
Importance ranking of explanatory variables in each model by SHAP values.
| Order | XGBoost | Neural network | Logistic regression |
|---|---|---|---|
| 1 | Age | Age | Age |
| 2 | Sex | Sex | Sex |
| 3 | Smoking | Smoking | Alcohol consumption |
| 4 | AST | Skipping breakfast | LDL-C |
| 5 | Alcohol consumption | Alcohol consumption | Skipping breakfast |
| 6 | Urine occult blood | LDL-C | Smoking |
| 7 | LDL-C | Walking speed | HDL-C |
| 8 | Walking speed | HDL-C | Walking speed |
| 9 | HbA1c | γGTP | Urine protein |
| 10 | Uric acid | AST | Uric acid |
AST, aspartate aminotransferase; LDL-C, low-density lipoprotein cholesterol; HDL-C, high-density lipoprotein cholesterol; HbA1c, hemoglobin A1c; γGTP, γ-glutamyl transpeptidase.
Figure 4. The distribution of SHAP values (impact on mortality) of explanatory variables for predictive models. The effects of the variables on the outcome were plotted for each individual in the test dataset. Cases with high values are shown in red, and those with low values are shown in blue. The variables are ranked in descending order. The horizontal location indicates whether the effect of that value is associated with a higher or lower prediction. AST, aspartate aminotransferase; eGFR, estimated glomerular filtration rate; LDL-C, low-density lipoprotein cholesterol; HDL-C, high-density lipoprotein cholesterol; γGTP, γ-glutamyl transpeptidase; SBP, systolic blood pressure; DBP, diastolic blood pressure; CVD, cardiovascular disease.
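To make the per-individual SHAP values in Figure 4 more concrete, the sketch below computes exact Shapley values for a tiny hand-written risk score by enumerating all feature subsets. It is an illustration of the underlying idea only: the toy model, its coefficients, and the single reference individual are hypothetical, and the study's own SHAP computation (whose tooling is not described in this record) would use approximations suited to each fitted model rather than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley attribution of model(x) - model(baseline) to each feature."""
    n = len(x)

    def value(subset):
        # Features in `subset` take the individual's values; the rest are
        # held at the baseline (a single-reference simplification).
        mixed = [x[j] if j in subset else baseline[j] for j in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Hypothetical risk score, NOT one of the fitted models in this study.
# Features: [age in years, smoker (0/1), LDL-C in mg/dL].
def toy_risk(features):
    age, smoker, ldl = features
    return 0.02 * age + 0.5 * smoker + 0.3 * smoker * (age > 65) - 0.001 * ldl


person = [70, 1, 125]
reference = [60, 0, 125]
phi = shapley_values(toy_risk, person, reference)
```

The attributions sum to the difference between the individual's score and the reference score, and the age-by-smoking interaction term is split equally between the two features involved. A feature whose value matches the reference, such as LDL-C here, receives an attribution of zero.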
Figure 5. Workflow diagram of development and performance evaluation of predictive models.