Weidong Ji, Mingyue Xue, Yushan Zhang, Hua Yao, Yushan Wang.
Abstract
Non-alcoholic fatty liver disease (NAFLD) is a common and serious health problem worldwide that lacks effective medical treatment. We aimed to develop and validate machine learning (ML) models that can accurately screen large populations. This study included 304,145 adults who took part in the national physical examination and used their questionnaire and physical measurement parameters as candidate covariates. The least absolute shrinkage and selection operator (LASSO) was used to select features from the candidate covariates; four ML algorithms were then used to build screening models for NAFLD, and the best-performing classifier was used to output an importance score for each covariate. Among the four ML algorithms, XGBoost performed best (accuracy = 0.880, precision = 0.801, recall = 0.894, F-1 = 0.882, and AUC = 0.951), and the covariates ranked by importance were, in order, BMI, age, waist circumference, gender, type 2 diabetes, gallbladder disease, smoking, hypertension, dietary status, physical activity, oil-loving and salt-loving. ML classifiers could help medical agencies achieve early identification and classification of NAFLD, which is particularly useful in economically underdeveloped areas, and the importance ranking of the covariates will be helpful for the prevention and treatment of NAFLD.
Keywords: LASSO; machine learning; non-alcoholic fatty liver disease (NAFLD); predictive models; screening model
Year: 2022 PMID: 35444985 PMCID: PMC9013842 DOI: 10.3389/fpubh.2022.846118
Source DB: PubMed Journal: Front Public Health ISSN: 2296-2565
Characteristics of variables.
| Variable | NAFLD | Non-NAFLD | P-value |
|---|---|---|---|
| Age | 62 (50–71) | 50 (40–65) | <0.001 |
| BMI | 27.27 (25.15–29.64) | 23.71 (21.91–25.80) | <0.001 |
| Waist circumference | 92 (85.55–99) | 84 (78–90) | <0.001 |
| Nationality | | | <0.001 |
| Han | 38,132 (65.01) | 160,708 (65.46) | |
| Uygur | 8,973 (15.30) | 42,775 (17.42) | |
| Kazak | 1,317 (2.25) | 7,898 (3.22) | |
| Hui | 9,151 (15.60) | 27,843 (11.34) | |
| Mongolian | 98 (0.17) | 481 (0.20) | |
| Other nationalities | 983 (1.68) | 5,785 (2.36) | |
| Gender | | | <0.001 |
| Female | 23,069 (39.33) | 104,083 (42.40) | |
| Male | 35,585 (60.67) | 141,407 (57.60) | |
| Physical activity | | | <0.001 |
| Inactive | 43,876 (74.80) | 149,349 (60.84) | |
| Active | 14,778 (25.20) | 96,141 (39.16) | |
| Occupation | | | <0.001 |
| Trader or service people | 35,124 (59.88) | 180,260 (73.43) | |
| Agriculture workers | 19,268 (32.85) | 48,766 (19.86) | |
| Factory workers | 1,839 (3.14) | 6,230 (2.54) | |
| Soldier | 597 (1.02) | 1,058 (0.43) | |
| Others | 1,826 (3.11) | 9,176 (3.74) | |
| Smoking | | | <0.001 |
| No smoking | 50,571 (86.22) | 225,638 (91.91) | |
| 0–20 cigarettes per day | 6,119 (10.43) | 16,981 (6.92) | |
| >20 cigarettes per day | 1,964 (3.35) | 2,871 (1.17) | |
| Dietary status | | | <0.001 |
| Meat based | 55,034 (93.83) | 233,163 (94.98) | |
| Meat balanced | 1,980 (3.38) | 7,255 (2.96) | |
| Vegetarian based | 1,640 (2.80) | 5,072 (2.07) | |
| | | | <0.001 |
| No | 53,524 (91.25) | 233,709 (95.20) | |
| Yes | 5,130 (8.75) | 11,781 (4.80) | |
| | | | <0.001 |
| No | 50,144 (85.49) | 232,123 (94.55) | |
| Yes | 8,510 (14.51) | 13,367 (5.45) | |
| | | | <0.001 |
| No | 53,363 (90.98) | 235,452 (95.91) | |
| Yes | 5,291 (9.02) | 10,038 (4.09) | |
| | | | <0.001 |
| No | 57,187 (97.50) | 240,826 (98.10) | |
| Yes | 1,467 (2.50) | 4,664 (1.90) | |
| | | | <0.001 |
| No | 55,545 (94.70) | 234,197 (95.40) | |
| Yes | 3,109 (5.30) | 11,293 (4.60) | |
| | | | <0.001 |
| No | 47,176 (80.43) | 227,057 (92.49) | |
| Yes | 11,478 (19.57) | 18,433 (7.51) | |
| | | | <0.001 |
| No | 38,136 (65.02) | 223,951 (91.23) | |
| Yes | 20,518 (34.98) | 21,539 (8.77) | |
| | | | <0.001 |
| No | 37,127 (63.30) | 189,935 (89.81) | |
| Yes | 21,527 (36.70) | 55,555 (10.19) | |
BMI, Body Mass Index; T2DM, type 2 diabetes mellitus.
Figure 1. Machine learning flowchart of this study. LR, logistic regression; RF, random forest; NB, Naive Bayesian; ML, machine learning; LASSO, least absolute shrinkage and selection operator.
Figure 2. LASSO algorithm for feature selection. (A) Mean-squared error (10-fold cross-validation criterion) of the LASSO-penalized logistic regression algorithm. (B) A vertical line was drawn at the value selected using 10-fold cross-validation, where the optimal lambda resulted in 12 features with nonzero coefficients.
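The way an L1 penalty drives some coefficients to exactly zero, which is what makes LASSO usable for feature selection, can be illustrated with a minimal sketch. It is not the study's implementation: the data are synthetic, the penalty `lam` is fixed by hand rather than chosen by 10-fold cross-validation, and the helper names `soft_threshold` and `l1_logistic` are hypothetical. The solver is plain proximal gradient descent (ISTA), chosen only because it fits in a few lines.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the L1 norm: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def l1_logistic(X, y, lam, step=1.0, n_iter=500):
    """Fit L1-penalized logistic regression by proximal gradient descent (ISTA)."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
        grad_w = X.T @ (p - y) / n               # gradient of the mean log-loss
        grad_b = np.mean(p - y)
        # Gradient step followed by shrinkage; the intercept is not penalized.
        w = soft_threshold(w - step * grad_w, step * lam)
        b -= step * grad_b
    return w, b

# Synthetic stand-in data (assumption): only the first two of six features carry signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 6))
logits = 2.0 * X[:, 0] - 1.5 * X[:, 1]
y = (rng.random(2000) < 1.0 / (1.0 + np.exp(-logits))).astype(float)

w, b = l1_logistic(X, y, lam=0.05)
selected = [j for j in range(X.shape[1]) if w[j] != 0.0]
print("coefficients:", np.round(w, 3))
print("selected feature indices:", selected)
```

In a cross-validated setting, the same fit would be repeated over a grid of `lam` values and the value with the best held-out error kept; the features with nonzero coefficients at that value are the ones retained.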
Dataset description.
| Dataset | Instances | Ratio | Description |
|---|---|---|---|
| Original data | 245,490/ | 4:1 | Original data with full instances |
| SMOTE data | 245,490/ | 1:1 | Dataset is balanced utilizing SMOTE oversampling |
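The idea behind SMOTE oversampling, creating synthetic minority-class samples by interpolating between a minority point and one of its nearest minority neighbours, can be sketched as follows. This is a simplified NumPy illustration on toy data, not the study's code and not the imbalanced-learn implementation; the function name `smote_like`, the choice `k=5`, and the toy class sizes (picked to mimic a 4:1 ratio) are assumptions.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    minority point and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise squared Euclidean distances within the minority class.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)               # a point is not its own neighbour
    neighbours = np.argsort(d2, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))               # position along the connecting segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy data (assumption) with a 4:1 majority-to-minority ratio.
rng = np.random.default_rng(1)
X_major = rng.normal(0.0, 1.0, size=(400, 2))
X_minor = rng.normal(2.0, 1.0, size=(100, 2))

# Add enough synthetic minority samples to reach a 1:1 ratio.
X_new = smote_like(X_minor, n_new=len(X_major) - len(X_minor))
X_minor_balanced = np.vstack([X_minor, X_new])
print(len(X_major), len(X_minor_balanced))
```

Oversampling of this kind is normally applied to the training split only, so that evaluation data contain no synthetic samples.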
The results of classification algorithms.
| Classifier | Accuracy | Precision | Recall | F-1 | AUC |
|---|---|---|---|---|---|
| | 0.778 | 0.783 | 0.768 | 0.775 | 0.857 |
| | 0.862 | 0.851 | 0.878 | 0.864 | 0.937 |
| XGBoost | 0.880 | 0.801 | 0.894 | 0.882 | 0.951 |
| | 0.716 | 0.762 | 0.626 | 0.687 | 0.814 |
AUC, area under the receiver operating characteristic (ROC) curve.
LR, logistic regression; RF, random forest; NB, naive Bayesian.
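For reference, accuracy, precision, recall and F-1 all follow from the four cells of a binary confusion matrix. The sketch below uses hypothetical counts that are not taken from the study; the function name `classification_metrics` is likewise illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, F-1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)          # share of predicted positives that are correct
    recall = tp / (tp + fn)             # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return accuracy, precision, recall, f1

# Hypothetical counts (assumption), chosen only to make the arithmetic easy to follow.
acc, prec, rec, f1 = classification_metrics(tp=80, fp=20, fn=10, tn=90)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
```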
Figure 3. ROC curve of all algorithms. LR, logistic regression; RF, random forest; NB, Naive Bayesian; XGB, XGBoost.
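The area under a ROC curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half. A minimal sketch of that rank-based definition, using made-up labels and scores rather than any model output from the study (the function name `roc_auc` is hypothetical):

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example (assumption): three positives and three negatives.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

The pairwise loop is quadratic in the number of samples, so it is suited to small illustrations; a sort-based computation is preferable for a dataset of the size used here.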
Figure 4. Feature importance contributed to the XGBoost model, measured by F-score.