Khishigsuren Davagdorj, Van Huy Pham, Nipon Theera-Umpon, Keun Ho Ryu.
Abstract
Smoking-induced noncommunicable diseases (SiNCDs) have become a significant threat to public health and a significant cause of death globally. In the last decade, numerous studies have used artificial intelligence techniques to predict the risk of developing SiNCDs. However, determining the most significant features and developing interpretable models remain challenging in such systems. In this study, we propose an efficient extreme gradient boosting (XGBoost) based framework that incorporates a hybrid feature selection (HFS) method for SiNCDs prediction among the general population in South Korea and the United States. First, HFS is performed in three stages: (I) significant features are selected by t-test and chi-square test; (II) multicollinearity analysis is used to obtain dissimilar features; (III) the final selection of the best representative features is based on the least absolute shrinkage and selection operator (LASSO). The selected features are then fed into the XGBoost predictive model. The experimental results show that our proposed model outperforms several existing baseline models. In addition, the proposed model reports feature importance, which enhances the interpretability of the SiNCDs prediction model. Consequently, the XGBoost based framework is expected to contribute to the early diagnosis and prevention of SiNCDs in public health.
Keywords: extreme gradient boosting; feature selection; noncommunicable disease; smoking
Year: 2020 PMID: 32906777 PMCID: PMC7558165 DOI: 10.3390/ijerph17186513
Source DB: PubMed Journal: Int J Environ Res Public Health ISSN: 1660-4601 Impact factor: 3.390
Figure 1. Extreme gradient boosting (XGBoost) based framework for smoking-induced noncommunicable diseases prediction. LASSO: least absolute shrinkage and selection operator.
Search space of the XGBoost model.
| Parameters | Symbol | Search Space |
|---|---|---|
| Maximum tree depth | | 2, 4, 6, 8 |
| Minimum child weight | | 2, 3, 4, 5 |
| Early stop round | e | 100 |
| Learning rate | | 0.1 |
| Number of boost | N | 60 |
| Maximum delta step | | 0.4, 0.6, 0.8, 1 |
| Subsample ratio | | 0.9, 0.95, 1 |
| Column subsample ratio | | 0.9, 0.95, 1 |
| Gamma | | 0, 0.001 |
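To make the size of this search space concrete, the following minimal sketch enumerates every combination of the listed values. The dictionary keys are XGBoost-style parameter names chosen for illustration; the mapping from table rows to those names, and the idea that the space is searched exhaustively, are assumptions rather than details stated here.

```python
from itertools import product

# Candidate values copied from the search-space table.
# Keys are XGBoost-style names; the row-to-name mapping is an assumption.
SEARCH_SPACE = {
    "max_depth": [2, 4, 6, 8],
    "min_child_weight": [2, 3, 4, 5],
    "learning_rate": [0.1],
    "max_delta_step": [0.4, 0.6, 0.8, 1],
    "subsample": [0.9, 0.95, 1],
    "colsample_bytree": [0.9, 0.95, 1],
    "gamma": [0, 0.001],
}

# Single-valued settings from the table (nothing to search over).
FIXED = {"n_estimators": 60, "early_stopping_rounds": 100}


def candidate_configs(space, fixed):
    """Yield one configuration dict per combination of candidate values."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        config = dict(zip(keys, values))
        config.update(fixed)
        yield config


configs = list(candidate_configs(SEARCH_SPACE, FIXED))
print(len(configs))  # 4 * 4 * 1 * 4 * 3 * 3 * 2 = 1152
```

Each yielded dictionary could be passed to a model constructor inside a cross-validation loop; that step is omitted because the validation protocol is not described in this record.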
Figure 2. Design of comparative experiments for smoking-induced noncommunicable diseases prediction. K(NHANES): Korean (national health and nutrition examination survey); SVM-RFE: support vector machine recursive feature elimination; RFFS: random forest feature selection; HFS: hybrid feature selection; LR: logistic regression; KNN: k-nearest neighbors; NN: neural network; RF: random forest; MLP: multilayer perceptron; XGBoost: extreme gradient boosting.
Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) Checklist: Prediction Model Development and Validation.
| Section/Topic | Item | D/V | Checklist Item | Page |
|---|---|---|---|---|
| **Title and abstract** | | | | |
| Title | 1 | D; V | Identify the study as developing and/or validating a multivariable prediction model, the target population, and the outcome to be predicted. | 1 |
| Abstract | 2 | D; V | Provide a summary of objectives, study design, setting, participants, sample size, predictors, outcome, statistical analysis, results, and conclusions. | 2 |
| **Introduction** | | | | |
| Background and objectives | 3a | D; V | Explain the medical context (including whether diagnostic or prognostic) and rationale for developing or validating the multivariable prediction model, including references to existing models. | 2–3 |
| | 3b | D; V | Specify the objectives, including whether the study describes the development or validation of the model or both. | 3–5 |
| **Methods** | | | | |
| Source of data | 4a | D; V | Describe the study design or source of data (e.g., randomized trial, cohort, or registry data), separately for the development and validation datasets, if applicable. | 6–7 |
| | 4b | D; V | Specify the key study dates, including start of accrual; end of accrual; and, if applicable, end of follow-up. | 6–7 |
| Participants | 5a | D; V | Specify key elements of the study setting (e.g., primary care, secondary care, general population), including number and location of centers. | 6–7 |
| | 5b | D; V | Describe eligibility criteria for participants. | 6–7 |
| | 5c | D; V | Give details of treatments received, if relevant. | n/a |
| Outcome | 6a | D; V | Clearly define the outcome that is predicted by the prediction model, including how and when assessed. | 8–12 |
| | 6b | D; V | Report any actions to blind assessment of the outcome to be predicted. | 9–12 |
| Predictors | 7a | D; V | Clearly define all predictors used in developing the multivariable prediction model, including how and when they were measured. | 7–8 |
| | 7b | D; V | Report any actions to blind assessment of predictors for the outcome and other predictors. | 7–8 |
| Sample size | 8 | D; V | Explain how the study size was arrived at. | 7–8 |
| Missing data | 9 | D; V | Describe how missing data were handled (e.g., complete-case analysis, single imputation, multiple imputation) with details of any imputation method. | 7 |
| Statistical analysis methods | 10a | D | Describe how predictors were handled in the analyses. | 7–8 |
| | 10b | D | Specify type of model, all model-building procedures (including any predictor selection), and method for internal validation. | 4–5, 8 |
| | 10c | V | For validation, describe how the predictions were calculated. | 8 |
| | 10d | D; V | Specify all measures used to assess model performance and, if relevant, to compare multiple models. | 8 |
| | 10e | V | Describe any model updating (e.g., recalibration) arising from the validation, if done. | 8 |
| Risk groups | 11 | D; V | Provide details on how risk groups were created, if done. | 7 |
| Development vs. validation | 12 | V | For validation, identify any differences from the development data in setting, eligibility criteria, outcome, and predictors. | 8 |
| **Results** | | | | |
| Participants | 13a | D; V | Describe the flow of participants through the study, including the number of participants with and without the outcome and, if applicable, a summary of the follow-up time. A diagram may be helpful. | |
| | 13b | D; V | Describe the characteristics of the participants (basic demographics, clinical features, available predictors), including the number of participants with missing data for predictors and outcome. | |
| | 13c | V | For validation, show a comparison with the development data of the distribution of important variables (demographics, predictors, and outcome). | |
| Model development | 14a | D | Specify the number of participants and outcome events in each analysis. | 7–8 |
| | 14b | D | If done, report the unadjusted association between each candidate predictor and outcome. | 8–9 |
| Model specification | 15a | D | Present the full prediction model to allow predictions for individuals (i.e., all regression coefficients, and model intercept or baseline survival at a given time point). | 10 |
| | 15b | D | Explain how to use the prediction model. | 8–12 |
| Model performance | 16 | D; V | Report performance measures (with CIs) for the prediction model. | 11 |
| Model-updating | 17 | V | If done, report the results from any model updating (i.e., model specification, model performance). | 12–13 |
| **Discussion** | | | | |
| Limitations | 18 | D; V | Discuss any limitations of the study (such as non-representative sample, few events per predictor, missing data). | 15 |
| Interpretation | 19a | V | For validation, discuss the results with reference to performance in the development data, and any other validation data. | 13–14 |
| | 19b | D; V | Give an overall interpretation of the results, considering objectives, limitations, results from similar studies, and other relevant evidence. | 14–15 |
| Implications | 20 | D; V | Discuss the potential clinical use of the model and implications for future research. | 14–15 |
| **Other information** | | | | |
| Supplementary information | 21 | D; V | Provide information about the availability of supplementary resources, such as study protocol, web calculator, and datasets. | 16–18 |
| Funding | 22 | D; V | Give the source of funding and the role of the funders for the present study. | 15 |
Items relevant only to the development of a prediction model are denoted by D, items relating solely to a validation of a prediction model are denoted by V, and items relating to both are denoted D; V.
Figure 3. Sample selection procedure of the Korea National Health and Nutrition Examination Survey (KNHANES) dataset.
Figure 4. Sample selection procedure of the National Health and Nutrition Examination Survey (NHANES) dataset.
Bivariate and Multicollinearity analysis for KNHANES dataset.
| No. | Features | p-Value | Multicollinearity Coefficient |
|---|---|---|---|
| 1 | Gender | <0.01 | 1.081 |
| 2 | Age | <0.01 | 1.523 |
| 3 | Household income | <0.01 | 2.901 |
| 4 | Education | <0.01 | 1.125 |
| 5 | Occupation | <0.01 | 2.016 |
| 6 | Marital status | <0.01 | 3.553 |
| 7 | Subjective health status | <0.01 | 1.778 |
| 8 | Depression diagnosis | <0.01 | 1.286 |
| 9 | Health checkup status | <0.01 | 1.047 |
| 10 | Athletic ability | <0.01 | 1.124 |
| 11 | Self-management | 0.16 | ~ |
| 12 | Daily activities | 0.58 | ~ |
| 13 | Pain/discomfort | <0.01 | 2.229 |
| 14 | Anxious/Depressed | <0.01 | 4.345 |
| 15 | EQ-5D index | <0.01 | 2.473 |
| 16 | Economic activity status | 0.50 | ~ |
| 17 | Weight control: exercise | <0.01 | 3.329 |
| 18 | Lifetime drinking experience | <0.01 | 1.171 |
| 19 | Start drinking age | <0.01 | 1.003 |
| 20 | Frequency of drinking for 1 year | <0.01 | 1.532 |
| 21 | Monthly drinking rate | <0.01 | 3.152 |
| 22 | Stress level | <0.01 | 3.033 |
| 23 | Indoor indirect smoking exposure | <0.01 | 1.221 |
| 24 | The usual time spent sitting (day) | <0.01 | 1.096 |
| 25 | Walk duration (hours) | <0.01 | 1.114 |
| 26 | Family history of chronic disease | <0.01 | 1.087 |
| 27 | Body mass index (kg/m2) | <0.01 | 1.048 |
| 28 | Obesity prevalence | <0.01 | 2.675 |
| 29 | Fasting blood sugar | <0.01 | 2.536 |
| 30 | Total cholesterol | <0.01 | 1.151 |
| 31 | Flexible exercise days per week | <0.01 | 1.038 |
| 32 | Residence area | <0.01 | 1.547 |
Bivariate and Multicollinearity analysis for NHANES dataset.
| No. | Features | p-Value | Multicollinearity Coefficient |
|---|---|---|---|
| 1 | Gender | <0.01 | 2.005 |
| 2 | Age | <0.01 | 1.008 |
| 3 | Body mass index (kg/m2) | <0.01 | 3.005 |
| 4 | Pulse regular or irregular? | 0.19 | ~ |
| 5 | Systolic: blood pressure | <0.01 | 2.015 |
| 6 | Diastolic: blood pressure | <0.01 | 1.875 |
| 7 | Education level | <0.01 | 1.076 |
| 8 | Marital status | <0.01 | 3.092 |
| 9 | Total number of people in the household | 0.32 | ~ |
| 10 | Annual household income | <0.01 | 5.312 |
| 11 | Health risk for diabetes (among family history) | <0.01 | 3.533 |
| 12 | Taking insulin or not | <0.01 | 1.453 |
| 13 | Number of healthcare counseling over past year | <0.01 | 2.027 |
| 14 | Salt usage level | 0.12 | ~ |
| 15 | Total sugars (gm) | <0.01 | 1.298 |
| 16 | Alcohol (gm) | <0.01 | 2.479 |
| 17 | Frequency of alcohol usage | <0.01 | 2.204 |
| 18 | High cholesterol level | <0.01 | 1.340 |
| 19 | General health condition | <0.01 | 1.199 |
| 20 | #times receive healthcare over past year | <0.01 | 1.249 |
| 21 | Received hepatitis A vaccine | <0.01 | 2.012 |
| 22 | Family monthly poverty level category | <0.01 | 1.004 |
| 23 | Doctor ever said you were overweight | <0.01 | 1.012 |
| 24 | Doctor told you to exercise | <0.01 | 1.004 |
| 25 | Feeling down, depressed, or hopeless | <0.01 | 1.012 |
| 26 | Feeling tired or having little energy | <0.01 | 1.004 |
| 27 | Poor appetite or overeating | <0.01 | 5.005 |
| 28 | Trouble concentrating on things | <0.01 | 2.292 |
| 29 | Description of job/work situation | 0.13 | ~ |
| 30 | Ever told doctor had trouble sleeping? | <0.01 | 1.012 |
| 31 | Number of people who live here smoke tobacco? | <0.01 | 1.035 |
| 32 | Number of people who smoke inside this home? | <0.01 | 1.424 |
| 33 | Last 7-d worked at job not at home? | <0.01 | 2.404 |
| 34 | Last 7-d at job someone smoked indoors? | <0.01 | 1.205 |
| 35 | Last 7-d in other indoor area? | <0.01 | 2.108 |
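In the NHANES table above, the features marked "~" are exactly those whose bivariate test result is not below 0.01, which suggests they were dropped before the multicollinearity stage. The sketch below reproduces that first filtering stage on a handful of rows copied from the table. The significance level `ALPHA = 0.05` and the helper name `passes_bivariate_test` are assumptions made for illustration; the threshold actually used is not given in this record.

```python
# Rows copied from the NHANES table: (feature, p-value text, coefficient or None for "~").
ROWS = [
    ("Gender", "<0.01", 2.005),
    ("Age", "<0.01", 1.008),
    ("Pulse regular or irregular?", "0.19", None),
    ("Annual household income", "<0.01", 5.312),
    ("Salt usage level", "0.12", None),
    ("Poor appetite or overeating", "<0.01", 5.005),
]

ALPHA = 0.05  # assumed significance level, not stated in the source


def passes_bivariate_test(p_text, alpha=ALPHA):
    """Return True if the reported p-value is below alpha.

    A value such as "<0.01" is an upper bound, so it passes
    whenever that bound does not exceed alpha.
    """
    if p_text.startswith("<"):
        return float(p_text[1:]) <= alpha
    return float(p_text) < alpha


# Stage I: keep only features with a significant bivariate association.
stage_one = [(name, coef) for name, p, coef in ROWS if passes_bivariate_test(p)]
selected = [name for name, _ in stage_one]
print(selected)
```

Every feature kept by this filter has a multicollinearity coefficient in the table, and every feature it drops is marked "~", which is consistent with the staged procedure described in the abstract.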
Evaluation results of the prediction models in the Korea National Health and Nutrition Examination Survey dataset.
| Feature Selection | Classifier | Accuracy | Sensitivity | Specificity | Precision | F-Score |
|---|---|---|---|---|---|---|
| SVM-RFE | LR | 0.7948 | 0.7818 | 0.7532 | 0.7676 | 0.7746 |
| | RF | 0.7890 | 0.7989 | 0.7984 | 0.8115 | 0.8052 |
| | KNN | 0.7342 | 0.6958 | 0.7381 | 0.7961 | 0.7426 |
| | MLP | 0.8070 | 0.7936 | 0.7791 | 0.8016 | 0.7976 |
| | NN | 0.8197 | 0.8274 | 0.8203 | 0.8387 | 0.8330 |
| | XGBoost | 0.8098 | 0.8108 | 0.8310 | 0.8533 | 0.8315 |
| RFFS | LR | 0.7804 | 0.7371 | 0.7422 | 0.8024 | 0.7684 |
| | RF | 0.8264 | 0.7699 | 0.7338 | 0.8236 | 0.7958 |
| | KNN | 0.8048 | 0.7128 | 0.7661 | 0.7753 | 0.7427 |
| | MLP | 0.7994 | 0.7808 | 0.7396 | 0.8115 | 0.7959 |
| | NN | 0.8507 | | | 0.8522 | 0.8693 |
| | XGBoost | 0.8311 | 0.8782 | 0.7984 | 0.8626 | 0.8703 |
| HFS | LR | 0.7834 | 0.7989 | 0.7813 | 0.7959 | 0.7974 |
| | RF | 0.8362 | 0.7805 | 0.8496 | 0.8115 | 0.7957 |
| | KNN | 0.8032 | 0.8018 | 0.7123 | 0.7872 | 0.7944 |
| | MLP | 0.8421 | 0.8305 | 0.7513 | 0.8257 | 0.8281 |
| | NN | 0.8758 | 0.8518 | 0.8158 | 0.8691 | 0.8604 |
| | XGBoost | | 0.8677 | 0.8126 | | |
SVM-RFE: support vector machine recursive feature elimination; RFFS: random forest feature selection; HFS: hybrid feature selection; LR: logistic regression; KNN: k-nearest neighbors; NN: neural network; RF: random forest; MLP: multilayer perceptron; XGBoost: extreme gradient boosting. Highest scores are marked in bold.
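The five columns of the table above are related through the confusion matrix. As a minimal sketch, assuming the standard definitions of these metrics (the record does not spell them out), the code below computes them from hypothetical counts and then checks that the F-score of the SVM-RFE + LR row follows from its reported precision and sensitivity.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_score": f_score,
    }


def f_score_from(precision, sensitivity):
    """Harmonic mean of precision and sensitivity."""
    return 2 * precision * sensitivity / (precision + sensitivity)


# Hypothetical counts, for illustration only (not taken from the study).
metrics = classification_metrics(tp=80, fp=30, tn=70, fn=20)
print(metrics)

# Consistency check against the SVM-RFE + LR row (precision 0.7676, sensitivity 0.7818).
print(round(f_score_from(0.7676, 0.7818), 4))  # 0.7746, matching the reported F-score
```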
Evaluation results of the prediction models in the National Health and Nutrition Examination Survey dataset.
| Feature Selection | Classifier | Accuracy | Sensitivity | Specificity | Precision | F-Score |
|---|---|---|---|---|---|---|
| SVM-RFE | LR | 0.7349 | 0.6969 | 0.8874 | 0.7086 | 0.7027 |
| | RF | 0.8522 | 0.7904 | 0.8805 | 0.8157 | 0.8029 |
| | KNN | 0.8118 | 0.7432 | 0.8608 | 0.8105 | 0.7754 |
| | MLP | 0.8002 | 0.7171 | 0.8759 | 0.6816 | 0.6989 |
| | NN | 0.8339 | 0.7659 | 0.8397 | 0.7609 | 0.7634 |
| | XGBoost | 0.8248 | 0.7707 | 0.8512 | 0.8066 | 0.7882 |
| RFFS | LR | 0.8356 | 0.7169 | 0.8685 | 0.6938 | 0.7052 |
| | RF | 0.8741 | 0.7863 | 0.9065 | 0.7356 | 0.7601 |
| | KNN | 0.8444 | 0.7716 | 0.8635 | 0.7594 | 0.7655 |
| | MLP | 0.8221 | 0.7043 | 0.8949 | 0.6842 | 0.6941 |
| | NN | 0.8639 | 0.7651 | 0.9003 | 0.7534 | 0.7592 |
| | XGBoost | 0.9029 | 0.8507 | 0.9379 | 0.8264 | 0.8384 |
| HFS | LR | 0.7903 | 0.7781 | 0.8990 | 0.7732 | 0.7756 |
| | RF | 0.8961 | 0.8157 | 0.9136 | 0.7857 | 0.8004 |
| | KNN | 0.8363 | 0.7928 | 0.8990 | 0.7981 | 0.7954 |
| | MLP | 0.7918 | 0.7586 | 0.9083 | 0.7635 | 0.7610 |
| | NN | 0.8553 | 0.8173 | 0.8808 | 0.7934 | 0.8052 |
| | XGBoost | | | | | |
Highest scores are marked in bold.
Figure 5. Boxplot of the accuracy over prediction models in the KNHANES dataset.
Figure 6. Boxplot of the accuracy over prediction models in the NHANES dataset.
Statistical significance test of the area under the curve results for predictive models in the KNHANES and NHANES datasets.
| Feature Selection | Classifier | KNHANES AUC | KNHANES 95% CI | KNHANES p-Value | NHANES AUC | NHANES 95% CI | NHANES p-Value |
|---|---|---|---|---|---|---|---|
| SVM-RFE | LR | 0.7675 | 0.7474–0.7896 | <0.001 | 0.7922 | 0.7731–0.8088 | <0.001 |
| | RF | 0.7987 | 0.7869–0.8118 | <0.001 | 0.8355 | 0.8254–0.8668 | <0.001 |
| | KNN | 0.7170 | 0.7094–0.7390 | <0.001 | 0.8020 | 0.7818–0.8210 | <0.001 |
| | MLP | 0.7864 | 0.7703–0.8001 | <0.001 | 0.7965 | 0.7851–0.8180 | <0.001 |
| | NN | 0.8239 | 0.8017–0.8405 | <0.001 | 0.8028 | 0.7981–0.8447 | <0.001 |
| | XGBoost | 0.8209 | 0.8097–0.8327 | <0.001 | 0.8110 | 0.8041–0.8315 | <0.001 |
| RFFS | LR | 0.7397 | 0.7713–0.7971 | <0.001 | 0.7927 | 0.7806–0.8197 | <0.001 |
| | RF | 0.7519 | 0.7683–0.8111 | <0.001 | 0.8464 | 0.8359–0.8637 | <0.001 |
| | KNN | 0.7395 | 0.7570–0.8037 | <0.001 | 0.8176 | 0.8070–0.8308 | <0.001 |
| | MLP | 0.7602 | 0.7721–0.8267 | <0.001 | 0.7996 | 0.7872–0.8135 | <0.001 |
| | NN | | 0.8659–0.9005 | <0.001 | 0.8327 | 0.8206–0.8492 | <0.001 |
| | XGBoost | 0.8383 | 0.8245–0.8567 | <0.001 | 0.8943 | 0.8757–0.9013 | <0.001 |
| HFS | LR | 0.7901 | 0.7812–0.8253 | <0.001 | 0.8386 | 0.8234–0.8539 | <0.001 |
| | RF | 0.8151 | 0.7947–0.8286 | <0.001 | 0.8647 | 0.8564–0.8859 | <0.001 |
| | KNN | 0.7571 | 0.7401–0.7796 | <0.001 | 0.8459 | 0.8284–0.8653 | <0.001 |
| | MLP | 0.7909 | 0.7846–0.8243 | <0.001 | 0.8335 | 0.8195–0.8506 | <0.001 |
| | NN | 0.8338 | 0.8249–0.8494 | <0.001 | 0.8491 | 0.8310–0.8588 | <0.001 |
| | XGBoost | 0.8402 | 0.8384–0.8635 | <0.001 | | 0.9073–0.9345 | <0.001 |
Highest scores are marked in bold.
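The table reports an AUC with a 95% confidence interval for each model, but this record does not say how the intervals were obtained. One common approach is a percentile bootstrap, sketched below with only the standard library. The function names, the toy labels and scores, and the choice of bootstrap are all illustrative assumptions, not the study's procedure.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # a resample needs both classes for the AUC to be defined
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((1 - level) / 2 * n_boot)]
    upper = stats[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper


# Toy data, for illustration only.
toy_labels = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
toy_scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.55, 0.90, 0.60, 0.30]

point = auc(toy_labels, toy_scores)
ci_low, ci_high = bootstrap_auc_ci(toy_labels, toy_scores)
print(point, (ci_low, ci_high))
```

With a percentile bootstrap the point estimate usually falls inside the interval, which makes a quick sanity check when reading tables of this kind.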
Figure 7. Comparison of receiver operating characteristic (ROC) curves for smoking-induced noncommunicable diseases (SiNCDs) prediction in the KNHANES and NHANES datasets. (a) ROC of SVM-RFE based models in KNHANES; (b) ROC of SVM-RFE based models in NHANES; (c) ROC of RFFS based models in KNHANES; (d) ROC of RFFS based models in NHANES; (e) ROC of HFS based models in KNHANES; (f) ROC of HFS based models in NHANES.
Figure 8. Feature importance of the XGBoost based framework incorporated with hybrid feature selection in the KNHANES dataset.
Figure 9. Feature importance of the XGBoost based framework incorporated with hybrid feature selection in the NHANES dataset.