Ying Wang1, Zhicheng Du1, Wayne R Lawrence2, Yun Huang1, Yu Deng1, Yuantao Hao1.
Abstract
Despite a decline in the prevalence of hepatitis B in China, the disease burden remains high. Large populations unaware of their infection risk often miss the ideal treatment window, resulting in poor prognosis. The purpose of this study was to develop and evaluate models identifying high-risk populations who should be tested for hepatitis B surface antigen (HBsAg). Data came from a large community-based health screening of 97,173 individuals with an average age of 54.94 years. A total of 33 indicators were collected as model predictors, including demographic characteristics, routine blood indicators, and liver function. The Borderline-Synthetic minority oversampling technique (Borderline-SMOTE) was used to preprocess the data, and four predictive models were then developed: extreme gradient boosting (XGBoost), random forest (RF), decision tree (DT), and logistic regression (LR). The HBsAg positive rate was 8.27%. The areas under the receiver operating characteristic curve for the XGBoost, RF, DT, and LR models were 0.779, 0.752, 0.619, and 0.742, respectively. The combined Borderline-SMOTE XGBoost model outperformed the other models, correctly predicting 13,637/19,435 cases (sensitivity 70.8%, specificity 70.1%), and the variable importance plot of the XGBoost model indicated that age was of high importance. The prediction model can be used to accurately identify populations at high risk of hepatitis B infection who should adopt timely and appropriate medical treatment measures.
Keywords: hepatitis B virus; machine learning; prediction
Year: 2019 PMID: 31810204 PMCID: PMC6926879 DOI: 10.3390/ijerph16234842
Source DB: PubMed Journal: Int J Environ Res Public Health ISSN: 1660-4601 Impact factor: 3.390
Summary of parameter values in each model for predicting hepatitis B virus (HBV) infection. Decision tree (DT), random forest (RF), and extreme gradient boosting (XGBoost).
| Algorithm | Parameter | Value | Meaning |
|---|---|---|---|
| XGBoost | nrounds | 120 | The number of rounds for boosting. |
| | max_depth | 8 | Maximum depth of a tree. |
| | eta | 0.09 | Step size shrinkage used in update to prevent overfitting. |
| | gamma | 0.04 | Minimum loss reduction required to make a further partition on a leaf node of the tree. |
| | colsample_bytree | 0.8 | The subsample ratio of columns when constructing each tree. |
| | min_child_weight | 18 | Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weight less than this value, the building process gives up further partitioning. |
| | subsample | 0.89 | Subsample ratio of the training instances. |
| | n_estimators | 600 | Number of base learners in the integrated model. |
| | max_delta_step | 9 | Maximum delta step allowed for each leaf output. If set to a positive value, it can help make the update step more conservative. |
| DT | minsplit | 20 | The minimum number of observations that must exist in a node for a split to be attempted. |
| | minbucket | 20 | The minimum number of observations in any terminal node. |
| | maxdepth | 10 | The maximum depth of any node of the final tree. |
| | xval | 5 | Number of cross-validations. |
| | cp (complexity parameter) | 0.001 | The minimum improvement in the model needed at each node. |
| RF | mtry | 6 | Number of variables available for splitting at each tree node. |
| | ntree | 700 | Number of trees to grow. |
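For readers who want the tuned values in machine-readable form, the sketch below simply transcribes the parameter table into plain Python dictionaries. It makes no library calls. The dictionary names and the `describe` helper are hypothetical, and the decision-tree keys use the spelling of R's `rpart.control` (`minsplit`, `minbucket`), which is an assumption about the software used. The table lists both `nrounds` and `n_estimators`; in common XGBoost interfaces these usually name the same quantity (the number of boosting rounds), and the source does not say how the two values were combined, so both are kept as given.

```python
# Plain transcription of the tuned values from the parameter table.
# Names of the dictionaries and the helper are illustrative only.
XGBOOST_PARAMS = {
    "nrounds": 120,
    "max_depth": 8,
    "eta": 0.09,
    "gamma": 0.04,
    "colsample_bytree": 0.8,
    "min_child_weight": 18,
    "subsample": 0.89,
    "n_estimators": 600,   # listed alongside nrounds in the source table
    "max_delta_step": 9,
}

# Decision tree: key spelling follows rpart.control (assumed).
DT_PARAMS = {"minsplit": 20, "minbucket": 20, "maxdepth": 10, "xval": 5, "cp": 0.001}

# Random forest.
RF_PARAMS = {"mtry": 6, "ntree": 700}


def describe(name, params):
    """Return a one-line, human-readable summary of a parameter set."""
    settings = ", ".join(f"{key}={value}" for key, value in params.items())
    return f"{name}: {settings}"


for model_name, model_params in [
    ("XGBoost", XGBOOST_PARAMS),
    ("DT", DT_PARAMS),
    ("RF", RF_PARAMS),
]:
    print(describe(model_name, model_params))
```

Keeping the values in one place like this makes it easier to pass them to whichever modelling library is used, after mapping the names to that library's own argument names.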
Summary of participants’ characteristics.
| Characteristics | n / Mean | Proportion (%) / SD |
|---|---|---|
| HBsAg | ||
| Positive | 8034 | 8.27 |
| Negative | 89,139 | 91.73 |
| Gender | ||
| Male | 32,208 | 33.15 |
| Female | 64,965 | 66.85 |
| Age | 54.94 | 21.72 |
| Education level | ||
| Illiteracy, and semi-illiteracy | 8971 | 9.23 |
| Primary school | 26,024 | 26.78 |
| Middle school | 19,667 | 20.24 |
| High and vocational school | 19,417 | 19.98 |
| College and above | 4632 | 4.77 |
| Unknown | 18,462 | 19.00 |
| Career | ||
| Leaders of enterprise unit | 827 | 0.85 |
| Technical personnel | 2681 | 2.76 |
| Handle affairs personnel | 1844 | 1.90 |
| Commercial personnel | 4768 | 4.91 |
| Farming, forestry, and fishery producers | 7843 | 8.07 |
| Transportation equipment operators | 4430 | 4.56 |
| Soldier | 185 | 0.19 |
| Unknown | 74,595 | 76.77 |
| Marital status | ||
| Single | 16,851 | 17.34 |
| Married | 67,821 | 69.79 |
| Widowed | 4127 | 4.25 |
| Divorced | 821 | 0.84 |
| Unknown | 7553 | 7.77 |
| Hepatitis B vaccination | ||
| No | 6017 | 6.19 |
| Yes | 4976 | 5.12 |
| Unknown | 86,180 | 88.69 |
| White blood cell count (WBC, 10^9/L) | 6.45 | 1.75 |
| Percent of monocytes (MON%, %) | 4.44 | 1.87 |
| Monocyte count (MON, 10^9/L) | 0.28 | 0.14 |
| Red cell volume distribution width-variable coefficient (RDW.CV, %) | 14.57 | 1.38 |
| Red cell volume distribution width-standard deviation (RDW.SD, fL) | 55.40 | 6.91 |
| Red blood cell count (RBC, 10^12/L) | 4.58 | 0.52 |
| Hematocrit (HCT, %) | 45.92 | 4.98 |
| Lymphocyte percentage (LYM%, %) | 37.74 | 9.05 |
| Lymphocyte count (LYM, 10^9/L) | 2.39 | 0.77 |
| Mean corpuscular volume (MCV, fL) | 100.97 | 10.66 |
| Mean red blood cell hemoglobin content (MCH, pg) | 29.55 | 3.56 |
| Mean corpuscular hemoglobin concentration (MCHC, g/L) | 293.22 | 25.12 |
| Mean platelet volume (MPV, fL) | 9.03 | 0.95 |
| Percent of basophilic granulocyte (BAS%, %) | 0.58 | 0.31 |
| Basophilic granulocyte count (BASO, 10^9/L) | 0.04 | 0.02 |
| Percentage of eosinophilic granulocyte (EOS%, %) | 3.16 | 2.39 |
| Eosinophil count (EOS, 10^9/L) | 0.20 | 0.17 |
| Hemoglobin (HGB, g/L) | 134.28 | 14.01 |
| Albumin (ALB, g/L) | 45.65 | 3.27 |
| Alanine aminotransferase (ALT, U/L) | 20.68 | 18.35 |
| Aspartate aminotransferase (AST, U/L) | 23.56 | 13.04 |
| Direct bilirubin (DBil, μmol/L) | 3.15 | 1.46 |
| Total bilirubin (TBil, μmol/L) | 10.39 | 4.37 |
| Platelet count (PLT, 10^9/L) | 258.25 | 68.58 |
| Plateletcrit (PCT, %) | 0.23 | 0.06 |
| Percent of neutrophile granulocyte (NEU%, %) | 54.08 | 9.29 |
| Neutrophil count (NEU, 10^9/L) | 3.53 | 1.32 |
| Total | 97,173 | |
SD, standard deviation. HBsAg, hepatitis B surface antigen.
Difference analysis between the training set and the testing set.
| Characteristics | Training Set | Testing Set | P value |
|---|---|---|---|
| HBsAg | |||
| Positive | 6419 (8.26) | 1615 (8.31) | 0.812 |
| Negative | 71,319 (91.74) | 17,820 (91.69) | |
| Gender | |||
| Male | 25,769 (33.14) | 6439 (33.13) | 0.963 |
| Female | 51,969 (66.86) | 12,996 (66.87) | |
| Age (years) | 54.90 ± 21.75 | 55.09 ± 21.64 | 0.282 |
| Education level | |||
| Illiteracy, and semi-illiteracy | 7199 (9.26) | 1772 (9.12) | |
| Primary school | 20,855 (26.82) | 5169 (26.6) | |
| Middle school | 15,663 (20.15) | 4004 (20.6) | 0.437 |
| High and vocational school | 15,553 (20.01) | 3864 (19.88) | |
| College and above | 3666 (4.72) | 966 (4.97) | |
| Unknown | 14,802 (19.04) | 3660 (18.83) | |
| Career | |||
| Leaders of enterprise unit | 650 (0.84) | 177 (0.91) | |
| Technical personnel | 2125 (2.73) | 556 (2.86) | |
| Handle affairs personnel | 1463 (1.88) | 381 (1.96) | |
| Commercial personnel | 3788 (4.87) | 980 (5.04) | |
| Farming, forestry, and fishery producers | 6272 (8.07) | 1571 (8.08) | 0.633 |
| Transportation equipment operators | 3517 (4.53) | 913 (4.7) | |
| Soldier | 149 (0.19) | 36 (0.19) | |
| Unknown | 59,774 (76.89) | 14,821 (76.26) | |
| Marital status | |||
| Single | 13,542 (17.42) | 3309 (17.02) | |
| Married | 54,196 (69.72) | 13,625 (70.11) | |
| Widowed | 3277 (4.22) | 850 (4.37) | 0.294 |
| Divorced | 674 (0.86) | 147 (0.76) | |
| Unknown | 6049 (7.78) | 1504 (7.74) | |
| History of hepatitis B vaccination | |||
| No | 4777 (6.14) | 1240 (6.38) | |
| Yes | 4016 (5.17) | 960 (4.94) | 0.229 |
| Unknown | 68,945 (88.69) | 17,235 (88.68) | |
| WBC (10^9/L) | 6.45 ± 1.75 | 6.45 ± 1.73 | 0.718 |
| MON% (%) | 4.44 ± 1.87 | 4.43 ± 1.88 | 0.768 |
| MON (10^9/L) | 0.28 ± 0.14 | 0.28 ± 0.14 | 0.969 |
| RDW.CV (%) | 14.57 ± 1.38 | 14.56 ± 1.35 | 0.664 |
| RDW.SD (fL) | 55.39 ± 6.91 | 55.45 ± 6.92 | 0.239 |
| RBC (10^12/L) | 4.58 ± 0.52 | 4.58 ± 0.52 | 1.000 |
| HCT (%) | 45.91 ± 4.97 | 45.97 ± 4.97 | 0.142 |
| LYM% (%) | 37.74 ± 9.05 | 37.71 ± 9.08 | 0.616 |
| LYM (10^9/L) | 2.39 ± 0.77 | 2.39 ± 0.77 | 0.869 |
| MCV (fL) | 100.95 ± 10.67 | 101.07 ± 10.64 | 0.157 |
| MCH (pg) | 29.54 ± 3.66 | 29.56 ± 3.13 | 0.548 |
| MCHC (g/L) | 293.26 ± 26.46 | 293.06 ± 18.85 | 0.304 |
| MPV (fL) | 9.03 ± 0.95 | 9.03 ± 0.95 | 0.765 |
| BAS% (%) | 0.58 ± 0.31 | 0.58 ± 0.31 | 0.146 |
| BASO (10^9/L) | 0.04 ± 0.02 | 0.04 ± 0.02 | 0.213 |
| EOS% (%) | 3.16 ± 2.39 | 3.17 ± 2.42 | 0.736 |
| EOS (10^9/L) | 0.20 ± 0.17 | 0.20 ± 0.18 | 0.560 |
| HGB (g/L) | 134.26 ± 14.02 | 134.37 ± 13.98 | 0.332 |
| ALB (g/L) | 45.65 ± 3.28 | 45.66 ± 3.28 | 0.731 |
| ALT (U/L) | 20.69 ± 19.02 | 20.62 ± 15.38 | 0.640 |
| AST (U/L) | 23.57 ± 13.50 | 23.53 ± 11.00 | 0.696 |
| DBil (μmol/L) | 3.15 ± 1.47 | 3.15 ± 1.39 | 0.632 |
| TBil (μmol/L) | 10.40 ± 4.39 | 10.37 ± 4.29 | 0.448 |
| PLT (10^9/L) | 258.25 ± 68.67 | 258.27 ± 68.21 | 0.969 |
| PCT (%) | 0.23 ± 0.06 | 0.23 ± 0.06 | 0.779 |
| NEU% (%) | 54.08 ± 9.28 | 54.12 ± 9.31 | 0.610 |
| NEU (10^9/L) | 3.53 ± 1.31 | 3.54 ± 1.32 | 0.600 |
| Total | 77,738 | 19,435 | |
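The comparison of categorical variables between the two sets can be illustrated with a Pearson chi-square test on the HBsAg row. The sketch below is an assumption about the kind of test behind the last column, since the source does not name the test. It uses only the standard library, and `chi_square_2x2` is a hypothetical helper name. For one degree of freedom, the chi-square tail probability equals `erfc(sqrt(chi2 / 2))`.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table
    [[a, b], [c, d]]. Returns the statistic and the p-value for 1 degree
    of freedom."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Upper-tail probability of a chi-square variable with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# HBsAg counts from the table: training (positive, negative),
# then testing (positive, negative).
chi2, p_value = chi_square_2x2(6419, 71319, 1615, 17820)
print(f"chi2 = {chi2:.4f}, p = {p_value:.3f}")
```

With these counts the p-value comes out very close to the 0.812 listed for HBsAg, which is consistent with an uncorrected chi-square test, although that does not confirm the authors' exact procedure.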
Predictive performance of each model for predicting HBV infection risk.
| Algorithms | AUC | Standard Error | 95% CI | AUC Compared with LR |
|---|---|---|---|---|
| LR | 0.742 | 0.006 | (0.729, 0.754) | - |
| DT | 0.619 | 0.008 | (0.603, 0.634) | −0.123 |
| RF | 0.752 | 0.006 | (0.740, 0.764) | +0.010 |
| XGBoost | 0.779 | 0.006 | (0.768, 0.791) | +0.037 |
| Borderline-SMOTE DT | 0.715 | 0.007 | (0.702, 0.729) | −0.027 |
| Borderline-SMOTE RF | 0.759 | 0.006 | (0.747, 0.771) | +0.017 |
| Borderline-SMOTE XGBoost | 0.782 | 0.006 | (0.771, 0.793) | +0.040 |
LR: logistic regression; SMOTE: synthetic minority oversampling technique; AUC: the area under the receiver operating characteristic curve; CI: confidence interval.
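The interval and difference columns can be approximated from the AUC and standard error alone. The sketch below assumes a normal-approximation (Wald) interval, AUC ± 1.96 × SE. The source does not state how its intervals were computed, and because the standard errors are rounded to three decimals, the recomputed bounds may differ from the table in the last digit. The names `wald_ci` and `RESULTS` are illustrative.

```python
def wald_ci(auc, se, z=1.96):
    """Normal-approximation confidence interval for an AUC estimate."""
    return auc - z * se, auc + z * se


# (AUC, standard error) pairs transcribed from the table.
RESULTS = {
    "LR": (0.742, 0.006),
    "DT": (0.619, 0.008),
    "RF": (0.752, 0.006),
    "XGBoost": (0.779, 0.006),
    "Borderline-SMOTE DT": (0.715, 0.007),
    "Borderline-SMOTE RF": (0.759, 0.006),
    "Borderline-SMOTE XGBoost": (0.782, 0.006),
}

baseline_auc = RESULTS["LR"][0]
for model, (auc, se) in RESULTS.items():
    lower, upper = wald_ci(auc, se)
    delta = auc - baseline_auc  # difference relative to logistic regression
    print(f"{model}: AUC {auc:.3f}, CI ({lower:.3f}, {upper:.3f}), vs LR {delta:+.3f}")
```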
Figure 1. Receiver operating characteristic (ROC) curves of the four models for predicting HBV infection. (XGBoost: extreme gradient boosting; RF: random forest; DT: decision tree; LR: logistic regression; SMOTE: synthetic minority oversampling technique.)
Summary of evaluation metrics values of each model for predicting HBV infection risk.
| Algorithms | TP | FN | TN | FP | Accuracy | Sensitivity | Specificity | Cutoff Point |
|---|---|---|---|---|---|---|---|---|
| LR | 1109 | 506 | 11866 | 5934 | 0.668 | 0.687 | 0.667 | 0.010 |
| DT | 752 | 863 | 13214 | 4606 | 0.719 | 0.466 | 0.742 | 0.086 |
| RF | 1203 | 412 | 11131 | 6689 | 0.634 | 0.745 | 0.625 | 0.091 |
| XGBoost | 1134 | 481 | 12695 | 5125 | 0.711 | 0.702 | 0.712 | 0.082 |
| Borderline-SMOTE DT | 1094 | 521 | 11731 | 6089 | 0.660 | 0.677 | 0.658 | 0.135 |
| Borderline-SMOTE RF | 1124 | 491 | 12121 | 5699 | 0.681 | 0.696 | 0.680 | 0.116 |
| Borderline-SMOTE XGBoost | 1144 | 471 | 12493 | 5327 | 0.702 | 0.708 | 0.701 | 0.088 |
TP: true positives; FN: false negatives; TN: true negatives; FP: false positives.
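Accuracy, sensitivity, and specificity follow directly from the four confusion-matrix counts: accuracy = (TP + TN) / total, sensitivity = TP / (TP + FN), and specificity = TN / (TN + FP). A minimal sketch, with `classification_metrics` as a hypothetical helper name, applied to the Borderline-SMOTE XGBoost row:

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # share of true cases that are detected
        "specificity": tn / (tn + fp),   # share of non-cases correctly ruled out
    }


# Counts for Borderline-SMOTE XGBoost, taken from the table.
metrics = classification_metrics(tp=1144, fn=471, tn=12493, fp=5327)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here TP + TN = 13,637 out of 19,435 testing-set individuals, which matches the count of correctly predicted cases quoted in the abstract.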
Figure 2. Variable importance plot of the XGBoost model for predicting HBV infection risk. (XGBoost: extreme gradient boosting.)