Yang-Yuan Chen, Chun-Yu Lin, Hsu-Heng Yen, Pei-Yuan Su, Ya-Huei Zeng, Siou-Ping Huang, I-Ling Liu.
Abstract
The rising incidence of fatty liver disease (FLD) poses a health challenge, and FLD is expected to become the leading global cause of liver-related morbidity and mortality in the near future. Early case identification is crucial for disease intervention. A retrospective cross-sectional study was performed on 31,930 Taiwanese subjects (25,544 in the training set and 6386 in the testing set) who had received health check-ups and abdominal ultrasounds at Changhua Christian Hospital from January 2009 to January 2019. Clinical and laboratory factors were analyzed with different machine-learning algorithms, and the performance of these algorithms was compared with that of the fatty liver index (FLI). In total, 6658/25,544 (26.1%) and 1647/6386 (25.8%) subjects had moderate-to-severe fatty liver disease in the training and testing sets, respectively. Five machine-learning models were examined and demonstrated exemplary performance in predicting FLD. Among these models, the xgBoost model had the highest area under the receiver operating characteristic curve (AUROC) (0.882), accuracy (0.833), F1 score (0.829), sensitivity (0.833), and specificity (0.683) compared with the neural network, logistic regression, random forest, and support vector machine models. The xgBoost, neural network, and logistic regression models had a significantly higher AUROC than FLI. Body mass index was the most important feature for predicting FLD according to the feature ranking scores. The xgBoost model had the best overall prediction ability for diagnosing FLD in our study. Machine-learning algorithms provide considerable benefits for screening candidates with FLD.
Keywords: fatty liver disease; machine learning; predicting
Year: 2022 PMID: 35887527 PMCID: PMC9317783 DOI: 10.3390/jpm12071026
Source DB: PubMed Journal: J Pers Med ISSN: 2075-4426
Comparison of the fatty and nonfatty populations.
| Variable | No Fatty Liver | Fatty Liver Disease | p-Value |
|---|---|---|---|
| Categorical variable | N (%) | N (%) | |
| Male sex | 13,484 (57.1%) | 6293 (75.8%) | <0.0001 |
| Continuous variables | Mean ± SD | Mean ± SD | |
| Age (years) | 48.63 ± 10.92 | 50.48 ± 9.93 | <0.0001 |
| Weight (kg) | 63.28 ± 10.6 | 75.011 ± 12.13 | <0.0001 |
| Height (cm) | 164.7 ± 8.02 | 166.34 ± 7.95 | <0.0001 |
| BMI (kg/m²) | 23.244 ± 2.91 | 27.044 ± 3.47 | <0.0001 |
| Waist (cm) | 63.27 ± 10.6 | 75.01 ± 12.13 | <0.0001 |
| SBP (mmHg) | 121.11 ± 16.14 | 130.17 ± 15.77 | <0.0001 |
| DBP (mmHg) | 77.25 ± 10.42 | 83.45 ± 10.62 | <0.0001 |
| ALT (IU/L) | 23.31 ± 20.42 | 39.64 ± 25.71 | <0.0001 |
| AST (IU/L) | 23.86 ± 19.344 | 30.39 ± 16.33 | <0.0001 |
| Cr (mg/dL) | 0.811 ± 0.23 | 0.86 ± 0.23 | <0.0001 |
| Sugar (mg/dL) | 93.88 ± 16.01 | 104.44 ± 25.56 | <0.0001 |
| T-Cho (mg/dL) | 191.666 ± 34.5 | 197.31 ± 36.39 | <0.0001 |
| HDL (mg/dL) | 52.588 ± 13.53 | 43.07 ± 9.39 | <0.0001 |
| LDL (mg/dL) | 118.4 ± 30.38 | 124.08 ± 32.47 | <0.0001 |
| TG (mg/dL) | 98.52 ± 69.62 | 162.64 ± 110.33 | <0.0001 |
| γ-GT (U/L) | 21.91 ± 24.66 | 35.08 ± 36.76 | <0.0001 |
| WBC (×10⁹/L) | 5.4 ± 1.45 | 6.18 ± 1.56 | <0.0001 |
| Hb (g/dL) | 13.99 ± 1.53 | 14.7 ± 1.31 | <0.0001 |
| MCH (pg) | 30.11 ± 2.98 | 30.17 ± 2.71 | 0.1025 |
| MCHC (g/dL) | 33.455 ± 0.95 | 33.61 ± 0.94 | <0.0001 |
| MCV (fL) | 41.8 ± 4.21 | 43.71 ± 3.63 | <0.0001 |
| RBC-RDW (%) | 13.53 ± 1.27 | 13.39 ± 1.04 | <0.0001 |
| RBC Count (10⁶/μL) | 4.67 ± 0.52 | 4.9 ± 0.51 | <0.0001 |
| RBC Volume (fL) | 89.89 ± 7.53 | 89.67 ± 6.81 | 0.0166 |
| Platelet (10³/μL) | 222.64 ± 53.55 | 229.82 ± 52.68 | <0.0001 |
| FIB-4 | 1.21 ± 0.64 | 1.17 ± 0.56 | <0.0001 |
Abbreviations: BMI: body mass index; SBP: systolic blood pressure; DBP: diastolic blood pressure; ALT: alanine aminotransferase; AST: aspartate aminotransferase; Cr: creatinine; T-Cho: total cholesterol; HDL: high-density lipoprotein; LDL: low-density lipoprotein; TG: triglyceride; γ-GT (printed as r-GT in the source): γ-glutamyl transpeptidase; WBC: white blood cell count; Hb: hemoglobin; MCH: mean corpuscular hemoglobin; MCHC: mean corpuscular hemoglobin concentration; MCV: mean corpuscular volume; RBC: red blood cell; RDW: red cell distribution width; FIB-4: fibrosis index based on four factors.
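The record lists FIB-4 but does not spell out how it is computed. A minimal sketch, assuming the commonly used definition FIB-4 = (age × AST) / (platelet count × √ALT), with platelets in 10⁹/L (numerically equal to 10³/μL); the function name and the example patient are hypothetical and are not taken from the study.

```python
import math


def fib4(age_years, ast_iu_l, alt_iu_l, platelets_1e9_l):
    """Return the FIB-4 index under the commonly used definition.

    Assumed formula (not stated in this record):
        FIB-4 = (age * AST) / (platelets * sqrt(ALT))
    with AST and ALT in IU/L and platelets in 10^9/L.
    """
    if alt_iu_l <= 0 or platelets_1e9_l <= 0:
        raise ValueError("ALT and platelet count must be positive")
    return (age_years * ast_iu_l) / (platelets_1e9_l * math.sqrt(alt_iu_l))


# Hypothetical patient, chosen only so the arithmetic is easy to follow:
# 50 * 30 / (200 * sqrt(36)) = 1500 / 1200
example_score = fib4(age_years=50, ast_iu_l=30, alt_iu_l=36, platelets_1e9_l=200)
print(f"FIB-4 = {example_score:.2f}")
```

Because the index is a ratio of per-subject values, plugging the group means from the table into this function will not in general reproduce the mean FIB-4 reported for each group.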
Baseline data of the testing and training population.
| Variable | Testing Population | | Training Population | | p-Value |
|---|---|---|---|---|---|
| Categorical variable | N | % | N | % | |
| Male sex | 3920 | 61.4% | 15,857 | 62.1% | 0.3077 |
| Fatty liver disease | 1647 | 25.8% | 6658 | 26.1% | 0.6552 |
| Continuous variables | Mean | SD | Mean | SD | |
| Age (years) | 49.0338 | 10.6897 | 49.1273 | 10.7045 | 0.5325 |
| Weight (kg) | 66.2086 | 11.8944 | 66.3478 | 12.2305 | 0.4133 |
| Height (cm) | 165.1039 | 8.0304 | 165.1287 | 8.0323 | 0.8254 |
| BMI (kg/m²) | 24.1912 | 3.3860 | 24.2331 | 3.5129 | 0.3896 |
| Waist (cm) | 81.3276 | 9.4895 | 81.4931 | 9.6282 | 0.2178 |
| SBP (mmHg) | 123.3472 | 16.2762 | 123.4969 | 16.5927 | 0.5173 |
| DBP (mmHg) | 78.7839 | 10.8325 | 78.8869 | 10.8106 | 0.4962 |
| ALT (IU/L) | 27.6682 | 20.3123 | 27.5262 | 23.6926 | 0.6599 |
| AST (IU/L) | 25.4887 | 11.4994 | 25.5720 | 20.2435 | 0.7520 |
| Cr (mg/dL) | 0.8184 | 0.2498 | 0.8216 | 0.2228 | 0.3200 |
| Sugar (mg/dL) | 96.8274 | 21.5711 | 96.5744 | 18.9770 | 0.3543 |
| T-Cho (mg/dL) | 192.5857 | 34.7903 | 193.2651 | 35.1615 | 0.1663 |
| HDL (mg/dL) | 50.2668 | 13.2300 | 50.0619 | 13.2686 | 0.2693 |
| LDL (mg/dL) | 119.4998 | 30.8285 | 119.9765 | 31.0895 | 0.2723 |
| TG (mg/dL) | 113.8447 | 101.3611 | 115.5403 | 82.8250 | 0.1629 |
| γ-GT (U/L) | 25.2388 | 28.2466 | 25.3584 | 29.0509 | 0.7674 |
| WBC (×10⁹/L) | 5.6059 | 1.5110 | 5.6049 | 1.5196 | 0.9632 |
| Hb (g/dL) | 14.1648 | 1.5050 | 14.1758 | 1.5102 | 0.6019 |
| MCH (pg) | 30.1294 | 2.8629 | 30.1250 | 2.9260 | 0.9135 |
| MCHC (g/dL) | 33.4929 | 0.9346 | 33.4895 | 0.9532 | 0.7981 |
| MCV (fL) | 42.2640 | 4.1367 | 42.3027 | 4.1600 | 0.5058 |
| RBC-RDW (%) | 13.5007 | 1.2364 | 13.4978 | 1.2132 | 0.8666 |
| RBC Count (10⁶/μL) | 4.7246 | 0.5119 | 4.7314 | 0.5298 | 0.3560 |
| RBC volume (fL) | 89.8414 | 7.2438 | 89.8342 | 7.3755 | 0.9443 |
| Platelet (10³/μL) | 225.1690 | 52.8910 | 224.3424 | 53.5484 | 0.2688 |
| FIB-4 | 1.1890 | 0.6050 | 1.1966 | 0.6242 | 0.3836 |
Abbreviations: BMI: body mass index; SBP: systolic blood pressure; DBP: diastolic blood pressure; ALT: alanine aminotransferase; AST: aspartate aminotransferase; Cr: creatinine; T-Cho: total cholesterol; HDL: high-density lipoprotein; LDL: low-density lipoprotein; TG: triglyceride; γ-GT (printed as r-GT in the source): γ-glutamyl transpeptidase; WBC: white blood cell count; Hb: hemoglobin; MCH: mean corpuscular hemoglobin; MCHC: mean corpuscular hemoglobin concentration; MCV: mean corpuscular volume; RBC: red blood cell; RDW: red cell distribution width; FIB-4: fibrosis index based on four factors.
Performance of different machine-learning models on the testing dataset.
| Model | AUROC | Accuracy | Recall | F1 | Specificity | Precision |
|---|---|---|---|---|---|---|
| xgBoost | 0.882 | 0.833 | 0.833 | 0.829 | 0.683 | 0.827 |
| Neural network | 0.874 | 0.824 | 0.824 | 0.820 | 0.683 | 0.818 |
| Logistic regression | 0.870 | 0.825 | 0.825 | 0.815 | 0.629 | 0.816 |
| Random forest | 0.849 | 0.818 | 0.818 | 0.809 | 0.629 | 0.808 |
| SVM | 0.551 | 0.569 | 0.569 | 0.595 | 0.536 | 0.656 |
Abbreviations: AUROC: area under receiver operating characteristic curve; SVM: support vector machine.
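In the table above, the Recall column equals the Accuracy column in every row. One reading, which this record does not confirm, is that recall, precision, and F1 were averaged across the two classes weighted by class size; under that convention weighted recall is always identical to accuracy. A minimal sketch with a hypothetical confusion matrix (only the class sizes, 1647 positives out of 6386, come from the record):

```python
import math


def summarize(tp, fn, fp, tn):
    """Compute accuracy, per-class recall, and support-weighted recall."""
    n_pos = tp + fn          # subjects with fatty liver disease
    n_neg = tn + fp          # subjects without fatty liver disease
    total = n_pos + n_neg
    accuracy = (tp + tn) / total
    sensitivity = tp / n_pos   # recall of the positive class
    specificity = tn / n_neg   # recall of the negative class
    # Support-weighted average of the two per-class recalls.
    weighted_recall = (n_pos * sensitivity + n_neg * specificity) / total
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "weighted_recall": weighted_recall,
    }


# Hypothetical counts; they are NOT the study's confusion matrix.
result = summarize(tp=1100, fn=547, fp=500, tn=4239)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

If this reading is right, the Recall column is not the positive-class sensitivity, so it should not be compared directly with sensitivities reported by studies that use the positive class only.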
Figure 1. Top ten features of data contributing to the F1 score of the developed xgBoost model.
Pairwise comparison of the AUROC of different machine-learning models and the fatty liver index on the testing dataset.
| Difference between Areas | Neural Network | Logistic Regression | Random Forest | SVM | Fatty Liver Index |
|---|---|---|---|---|---|
| xgBoost | 0.0076 | 0.0114 | 0.0327 | 0.330 | 0.0347 |
| Neural network | | 0.00382 | 0.0251 | 0.323 | 0.0271 |
| Logistic regression | | | 0.0213 | 0.319 | 0.0233 |
| Random forest | | | | 0.298 | 0.00204 |
| SVM | | | | | 0.295 |
Abbreviations: AUROC: area under receiver operating characteristic curve; SVM: support vector machine.
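The "difference between areas" for the five machine-learning models can be cross-checked against the AUROC column of the performance table. A minimal sketch, assuming the differences are simple subtractions of test-set AUROCs; the published differences were presumably computed from unrounded areas, so the last digit may not match, and the fatty liver index is left out because its AUROC is not listed in this record.

```python
from itertools import combinations

# AUROC values copied from the performance table (rounded to 3 decimals).
auroc = {
    "xgBoost": 0.882,
    "Neural network": 0.874,
    "Logistic regression": 0.870,
    "Random forest": 0.849,
    "SVM": 0.551,
}


def pairwise_differences(scores):
    """Return the absolute AUROC difference for every unordered model pair."""
    return {
        (a, b): round(abs(scores[a] - scores[b]), 3)
        for a, b in combinations(scores, 2)
    }


diffs = pairwise_differences(auroc)
for (model_a, model_b), delta in diffs.items():
    print(f"{model_a} vs {model_b}: {delta:.3f}")
```

A check like this only tests internal consistency of the reported areas. It says nothing about statistical significance, which requires a paired test on the underlying predictions.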
Figure 2. ROC curves of five different machine-learning models and the fatty liver index.
Figure 3. Comparison of the precision–recall curves of the xgBoost model and the fatty liver index.
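The fatty liver index used as the comparator is not defined in this record. A minimal sketch, assuming the standard published definition (Bedogni et al.), which combines triglycerides, BMI, γ-GT, and waist circumference in a logistic expression scaled to 0–100. Whether the study computed FLI exactly this way is an assumption, and the function name and example values are hypothetical.

```python
import math


def fatty_liver_index(tg_mg_dl, bmi_kg_m2, ggt_u_l, waist_cm):
    """Return the fatty liver index (0-100) under the assumed standard formula.

    linear = 0.953 * ln(TG) + 0.139 * BMI + 0.718 * ln(GGT)
             + 0.053 * waist - 15.745
    FLI    = 100 * exp(linear) / (1 + exp(linear))
    """
    if tg_mg_dl <= 0 or ggt_u_l <= 0:
        raise ValueError("triglycerides and GGT must be positive")
    linear = (
        0.953 * math.log(tg_mg_dl)
        + 0.139 * bmi_kg_m2
        + 0.718 * math.log(ggt_u_l)
        + 0.053 * waist_cm
        - 15.745
    )
    # Logistic transform, written in the numerically stable sigmoid form.
    return 100.0 / (1.0 + math.exp(-linear))


# Hypothetical subject, not drawn from the study data.
fli = fatty_liver_index(tg_mg_dl=150, bmi_kg_m2=27, ggt_u_l=35, waist_cm=90)
print(f"FLI = {fli:.1f}")
```

Because FLI is a fixed formula over four inputs, it needs no training data, which is why it serves as a natural baseline against models fitted on the full set of clinical and laboratory variables.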
Literature review of previous studies of machine learning for fatty liver disease.
| Author/Year | Setting/Country | Fatty/Total Population (%) | Validation Method | ML Model | Accuracy (%) | Area under Curve (%) |
|---|---|---|---|---|---|---|
| Ma [ | Hospital/China | 2522/10,508 (24%) | 10-fold cross validation | LR | 82.92% | N/A |
| Wu [ | Hospital/Taiwan | 377/577 (65.3%) | 10-fold cross validation | Random forest | 87.48% | 92.25% |
| Liu [ | Hospital/China | 5878/15,315 (38.4%) | 32% of dataset as testing data | xgBoost | 79.5% | 87.3% |
| Atsawarungruangkit [ | Population/USA | 817/3235 (25.3%) | 30% of dataset as testing data | Ensemble of subspace | 77.7% | 78% |
| Pei [ | Hospital/China | 845/3419 (24.7%) | 30% of dataset as testing data | xgBoost | 94.15% | 93.06% |
| Zhao [ | Hospital/China | 9173/39,884 (23%) | 30% of dataset as testing data | xgBoost | 89% | N/A |
| Our Study 2022 | Hospital/Taiwan | 8305/31,930 (26.0%) | 20% of dataset as testing data | xgBoost | 83.3% | 88.2% |