Aishwariya Dutta1,2, Md Kamrul Hasan3, Mohiuddin Ahmad3, Md Abdul Awal4,5, Md Akhtarul Islam6, Mehedi Masud7, Hossam Meshref7.
Abstract
Diabetes is one of the most rapidly spreading diseases in the world. It leads to an array of serious complications, including cardiovascular disease, kidney failure, diabetic retinopathy, and neuropathy, which increase morbidity and mortality. If diabetes is diagnosed at an early stage, its severity and underlying risk factors can be significantly reduced. However, reliable labeled clinical datasets for diabetes prediction are scarce, and those available often contain outliers and missing values, which makes early prediction challenging. Therefore, we introduce a newly labeled diabetes dataset from a South Asian nation (Bangladesh). In addition, we propose an automated classification pipeline that includes a weighted ensemble of machine learning (ML) classifiers: Naive Bayes (NB), Random Forest (RF), Decision Tree (DT), XGBoost (XGB), and LightGBM (LGB). Grid search is used to tune the critical hyperparameters of these ML models. The framework also incorporates missing value imputation, feature selection, and K-fold cross-validation. A statistical analysis of variance (ANOVA) test reveals that diabetes prediction performance improves significantly when the proposed weighted ensemble (DT + RF + XGB + LGB) is run with the introduced preprocessing, reaching the highest accuracy of 0.735 and an area under the ROC curve (AUC) of 0.832. In conjunction with the proposed ensemble model, our statistical imputation and RF-based feature selection techniques produced the best results for early diabetes prediction. Moreover, the new dataset will contribute to developing and implementing robust ML models for diabetes prediction using population-level data.
Keywords: South Asian diabetes dataset; artificial intelligence; diabetes prediction; ensemble ML classifier; filling missing value; outlier rejection
Year: 2022 PMID: 36231678 PMCID: PMC9566114 DOI: 10.3390/ijerph191912378
Source DB: PubMed Journal: Int J Environ Res Public Health ISSN: 1660-4601 Impact factor: 4.614
Overview of different ML-based methods utilized in the previous literature for diabetes prediction, including the year of publication, used dataset, missing value imputation techniques, feature selection strategies, number of selected features, classifier used, and corresponding performance evaluation metrics.
| Year | Dataset | MVI ¹ | FS | NSF | BPC | Performance |
|---|---|---|---|---|---|---|
| 2016 | ENRC | None | None | 9 | DT | |
| 2018 | LMHC | None | None | All | RF | |
| 2018 | PIDD | None | mRMR | 7 | RF | |
| 2018 | PIDD | None | None | 8 | NB | |
| 2018 | PIDD | KNN impute | BWA | 4 | Linear Kernel SVM | |
| 2019 | PIDD | NB | None | 8 | RF | |
| 2019 | PIDD | None | CRB | 11 | NB | |
| 2019 | PIDD | None | None | 8 | MLP | |
| 2020 | PIDD | Mean | CRB | 6 | Ensemble of AB, XGB | |
| 2020 | NHANES | None | LR | 7 | RF | |
| 2020 | PIDD | Case deletion | None | 2 | SVM | |
| 2021 | PIDD | None | None | 8 | Ensemble of J48, NBT, RF, Simple CART, RT | |
| 2021 | LMHC | Case deletion | ANOVA, GI | 16 | XGB | |
¹ Note: MVI: Missing Value Imputation, FS: Feature Selection, NSF: Number of Selected Features, BPC: Best Performing Classifier, ENRC: Egyptian National Research Center, LMHC: Luzhou Municipal Health Commission, PIDD: PIMA Indians Diabetes Dataset, mRMR: Minimum Redundancy Maximum Relevance, BWA: Boruta Wrapper Algorithm, CRB: Correlation-Based, NHANES: National Health and Nutrition Examination Survey, ANOVA: Analysis of Variance, GI: Gini Impurity, NBT: Naive Bayes Tree, RT: Random Tree.
Class label description and class-wise sample distributions of the proposed DDC-2011 and DDC-2017 datasets.
| Dataset | Diabetic Patients | Non-Diabetic Patients |
|---|---|---|
| DDC-2011 | 4751 | 2814 |
| DDC-2017 | 3492 | 4073 |
Detailed description of the categorical and continuous features employed in this research. For categorical variables, the χ²-test statistic is reported with its p-value in parentheses, whereas for continuous variables the mean ± standard deviation is reported, to indicate each feature's relationship with diabetes.
| Features | Short Description | Categorical? | Continuous? | DDC-2011 | DDC-2017 |
|---|---|---|---|---|---|
| | Division (the respondent's place of residence) | Yes | No | 144.689 (0.000) | 383.774 (0.000) |
| | Location of the respondent's residence area (urban/rural) | Yes | No | 463.00 (0.496) | 93.958 (0.000) |
| | Wealth index (the respondent's financial situation) | Yes | No | 16.104 (0.003) | 482.139 (0.000) |
| | Sex of the household head | Yes | No | 5.858 (0.016) | 4.298 (0.117) |
| | Age of household members | No | Yes | 54.87 ± 12.94 | 39.53 ± 16.21 |
| | Respondent's current educational status | Yes | No | 6.041 (0.110) | 6.960 (0.541) |
| | Occupation type of the respondent | Yes | No | 30.430 (0.063) | 185.659 (0.000) |
| | Eaten anything | Yes | No | 0.663 (0.416) | 3.065 (0.216) |
| | Had caffeinated drink | Yes | No | 1.590 (0.207) | 20.738 (0.000) |
| | Smoked | Yes | No | 0.001 (0.985) | 7.781 (0.020) |
| | Average systolic blood pressure | No | Yes | 77.59 ± 12.05 | 122.63 ± 21.95 |
| | Average diastolic blood pressure | No | Yes | 119.93 ± 21.93 | 80.52 ± 13.67 |
| | Body mass index (BMI) of the respondent | No | Yes | 2065.63 ± 369.25 | 2239.43 ± 416.47 |
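The categorical rows above report a test statistic with a p-value in parentheses. As a minimal sketch, assuming the test is a Pearson chi-square test of independence between a categorical feature and the diabetes label, the statistic and its degrees of freedom can be computed from a contingency table as follows. The function name and the toy counts are hypothetical and are not taken from the DDC datasets.

```python
from typing import List, Tuple


def chi_square_statistic(table: List[List[float]]) -> Tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts: rows = urban/rural, columns = diabetic/non-diabetic.
toy_table = [[30, 70], [50, 50]]
stat, dof = chi_square_statistic(toy_table)
print(f"chi2 = {stat:.3f}, dof = {dof}")
```

The p-value would then come from the chi-square distribution with `dof` degrees of freedom; that step is left out here to keep the sketch free of external libraries.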
Figure 1. Block diagram of the proposed workflow incorporating various ML-based classifiers, a pre-processing step, and hyperparameter tuning through grid search optimization.
Comprehensive empirical results for missing value imputation in terms of AUC, using three imputation techniques, two DDC datasets, and six ML classifiers. The best imputation strategy for each dataset and classifier is underlined in blue.
| Dataset | MVI Techniques | GNB | BNB | RF | DT | XGB | LGB |
|---|---|---|---|---|---|---|---|
| DDC-2017 | Case Deletion | | | | | | |
| | MEDimpute | | | | | | |
| | KNNimpute | | | | | | |
| DDC-2011 | Case Deletion | | | | | | |
| | MEDimpute | | | | | | |
| | KNNimpute | | | | | | |
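As a rough illustration of two of the strategies compared above, the sketch below implements case deletion and per-column median imputation in plain Python. It assumes that MEDimpute denotes median imputation and that missing entries are encoded as `None`; both are assumptions, and the helper names and toy rows are hypothetical.

```python
from statistics import median
from typing import List, Optional

Row = List[Optional[float]]


def case_deletion(rows: List[Row]) -> List[Row]:
    """Drop every row that contains at least one missing value."""
    return [row for row in rows if all(v is not None for v in row)]


def median_impute(rows: List[Row]) -> List[List[float]]:
    """Replace each missing value with the median of its column's observed values."""
    n_cols = len(rows[0])
    col_medians = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        col_medians.append(median(observed))
    return [
        [col_medians[j] if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]


# Hypothetical rows: (age, systolic, diastolic).
toy_rows = [
    [54.0, 120.0, 80.0],
    [39.0, None, 76.0],
    [61.0, 135.0, None],
    [45.0, 118.0, 82.0],
]
deleted = case_deletion(toy_rows)
imputed = median_impute(toy_rows)
print(len(deleted), imputed[1][1], imputed[2][2])
```

Case deletion shrinks the sample, while imputation keeps every row at the cost of injecting estimated values, which is the trade-off the table above evaluates per classifier.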
Feature importance score in accordance with the four different FS strategies (RF, IG, XGB, and LGB). The five most significant features of individual models are underlined in blue.
| FS Methods | Feature Importance Score | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RF | | | | | | | | | | | | | |
| IG | | | | | | | | | | | | | |
| XGB | | | | | | | | | | | | | |
| LGB | | | | | | | | | | | | | |
Figure 2. AUC versus the number of selected features (2–13) in the proposed DDC dataset, considering four feature selection approaches and six ML-based models.
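A curve of AUC against the number of features implies retraining each classifier on the top-k features for increasing k. A minimal sketch of the subset-generation step follows, assuming features are ranked by a single importance score per FS method; the feature names and scores are hypothetical placeholders, not values from the study.

```python
from typing import Dict, List


def top_k_features(importance: Dict[str, float], k: int) -> List[str]:
    """Return the k feature names with the highest importance scores."""
    ranked = sorted(importance, key=lambda name: importance[name], reverse=True)
    return ranked[:k]


# Hypothetical importance scores for a handful of features.
toy_importance = {
    "age": 0.31,
    "bmi": 0.24,
    "systolic": 0.18,
    "diastolic": 0.15,
    "smoked": 0.02,
}

# Nested subsets, smallest first; each subset would be fed to a classifier
# and scored by cross-validated AUC.
subsets = {k: top_k_features(toy_importance, k) for k in range(2, len(toy_importance) + 1)}
for k, names in subsets.items():
    print(k, names)
```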
Highest AUC achieved on the DDC dataset by the six ML models with and without hyperparameter tuning through grid search optimization (GSO).
| Classifiers | Tuned Hyperparameters | AUC (W/ GSO) | AUC (W/O GSO) |
|---|---|---|---|
| GNB | Class prior probabilities (=None) and the portion of the largest feature variance added for calculation stability (= | | |
| BNB | Additive Laplace smoothing parameter (=1.0), class prior probabilities (=None), and whether to learn class priors (=True). | | |
| RF | Whether to use bootstrap samples (=True), split quality function (=gini), number of features considered for the best split (=auto), maximum number of leaf nodes when growing trees (=3), minimum samples per leaf node (=0.4), the samples required to split an internal node ( | | |
| DT | Split quality function (=entropy), number of features considered for the best split (=auto), minimum samples required per leaf node (=0.5), samples required to split an internal node (=0.1), random state controlling the randomness of feature sampling at node splits (=100), and node partition strategy (=best). | | |
| XGB | Initial prediction score ( | | |
| LGB | Boosting method (=gbdt), class weight (=True), subsample ratio of columns when constructing each tree (=1.0), base learner tree depth (= | | |
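Grid search evaluates every combination of candidate hyperparameter values and keeps the best-scoring one. The sketch below shows that loop in plain Python; the parameter grid and the `toy_score` function are hypothetical stand-ins for a cross-validated AUC and do not reproduce the grids used in the study.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Tuple


def grid_search(
    grid: Dict[str, List[Any]],
    score_fn: Callable[[Dict[str, Any]], float],
) -> Tuple[Dict[str, Any], float]:
    """Exhaustively score every parameter combination and return the best."""
    names = list(grid)
    best_params: Dict[str, Any] = {}
    best_score = float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params: Dict[str, Any]) -> float:
    """Hypothetical score surface that peaks at depth 4 with the entropy criterion."""
    bonus = 0.02 if params["criterion"] == "entropy" else 0.0
    return 0.80 - 0.01 * abs(params["max_depth"] - 4) + bonus


toy_grid = {"criterion": ["gini", "entropy"], "max_depth": [2, 4, 8]}
best_params, best_score = grid_search(toy_grid, toy_score)
print(best_params, round(best_score, 3))
```

The cost grows multiplicatively with the number of values per hyperparameter, which is why only the most influential parameters are usually placed on the grid.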
Diabetes classification results obtained with six individual ML models and weighted ensemble models on the proposed DDC-2011 and DDC-2017 datasets, including missing value imputation, feature selection, and hyperparameter tuning. The metrics of the best-performing single model are highlighted in bold, whereas those of the best-performing proposed ensemble model are underlined in blue.
| Datasets | Different Classifiers | Sn ↑ | Sp ↑ | Acc ↑ | AUC ↑ |
|---|---|---|---|---|---|
| DDC-2011 | GNB | | | | |
| | BNB | | | | |
| | RF | | | | |
| | DT | | | | |
| | XGB | | | | |
| | LGB | | | | |
| | GNB + BNB | | | | |
| | RF + DT | | | | |
| | LGB + XGB | | | | |
| | GNB + BNB + DT + RF | | | | |
| | GNB + BNB + XGB + LGB | | | | |
| | DT + RF + XGB + LGB | | | | |
| | GNB + BNB + DT + RF + XGB + LGB | | | | |
| DDC-2017 | GNB | | | | |
| | BNB | | | | |
| | RF | | | | |
| | DT | | | | |
| | XGB | | | | |
| | LGB | | | | |
| | GNB + BNB | | | | |
| | RF + DT | | | | |
| | LGB + XGB | | | | |
| | GNB + BNB + DT + RF | | | | |
| | GNB + BNB + XGB + LGB | | | | |
| | DT + RF + XGB + LGB | | | | |
| | GNB + BNB + DT + RF + XGB + LGB | | | | |
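A weighted ensemble combines the positive-class probabilities of several classifiers using per-model weights. The sketch below assumes soft voting with weights proportional to each model's validation AUC; that weighting rule is an assumption for illustration, since the text here does not specify how the weights are derived. All probabilities and AUC values are hypothetical.

```python
from typing import Dict, List


def weighted_ensemble(
    probs: Dict[str, List[float]],
    aucs: Dict[str, float],
) -> List[float]:
    """Weighted average of positive-class probabilities, weights normalized to sum to 1."""
    total = sum(aucs[name] for name in probs)
    weights = {name: aucs[name] / total for name in probs}
    n_samples = len(next(iter(probs.values())))
    return [
        sum(weights[name] * probs[name][i] for name in probs)
        for i in range(n_samples)
    ]


# Hypothetical positive-class probabilities for three samples.
toy_probs = {
    "DT": [0.9, 0.2, 0.6],
    "RF": [0.8, 0.3, 0.4],
    "XGB": [0.7, 0.1, 0.5],
    "LGB": [0.6, 0.2, 0.5],
}
# Hypothetical validation AUCs used as unnormalized weights.
toy_aucs = {"DT": 0.70, "RF": 0.80, "XGB": 0.80, "LGB": 0.90}

ensemble_probs = weighted_ensemble(toy_probs, toy_aucs)
labels = [int(p >= 0.5) for p in ensemble_probs]
print([round(p, 3) for p in ensemble_probs], labels)
```

Because the output is still a probability per sample, the ensemble can be scored with AUC in the same way as the individual models.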
Figure 3. Box and whisker plots of AUC results acquired from 5-fold cross-validation on various ML classifiers, where M-1 to M-13 represent GNB, BNB, RF, DT, XGB, LGB, GNB + BNB, RF + DT, LGB + XGB, GNB + BNB + DT + RF, GNB + BNB + XGB + LGB, DT + RF + XGB + LGB, and GNB + BNB + DT + RF + XGB + LGB, respectively.
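Each box in such a plot summarizes per-fold AUC values. A minimal sketch of how one such value can be computed follows, using the rank-based (Mann-Whitney) formulation of AUC and a simple interleaved 5-fold split. The split is neither stratified nor shuffled, which a real pipeline would likely want, and the labels and scores are hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Return (train, test) index lists; sample i goes to test fold i % k."""
    folds = []
    for fold in range(k):
        test = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        folds.append((train, test))
    return folds


# Hypothetical labels (1 = diabetic) and predicted scores.
toy_labels = [1, 0, 1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.35, 0.8, 0.2, 0.5]
print(round(roc_auc(toy_labels, toy_scores), 3))
print(k_fold_indices(10, k=5)[0])
```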
Diabetes classification results for case-1, case-2, and case-3, where features are selected from the DDC-2017 dataset, the DDC-2011 dataset, and both datasets, respectively, with missing value imputation and hyperparameter tuning applied. The metrics of the best-performing single model are highlighted in bold, whereas those of the best-performing proposed ensemble model are underlined in blue.
| Cases | Different Classifiers | Sn ↑ | Sp ↑ | Acc ↑ | AUC ↑ |
|---|---|---|---|---|---|
| Merged datasets | GNB | | | | |
| | BNB | | | | |
| | RF | | | | |
| | DT | | | | |
| | XGB | | | | |
| | LGB | | | | |
| | GNB + BNB | | | | |
| | RF + DT | | | | |
| | LGB + XGB | | | | |
| | GNB + BNB + DT + RF | | | | |
| | GNB + BNB + XGB + LGB | | | | |
| | DT + RF + XGB + LGB | | | | |
| | GNB + BNB + DT + RF + XGB + LGB | | | | |
| Merged datasets | GNB | | | | |
| | BNB | | | | |
| | RF | | | | |
| | DT | | | | |
| | XGB | | | | |
| | LGB | | | | |
| | GNB + BNB | | | | |
| | RF + DT | | | | |
| | LGB + XGB | | | | |
| | GNB + BNB + DT + RF | | | | |
| | GNB + BNB + XGB + LGB | | | | |
| | DT + RF + XGB + LGB | | | | |
| | GNB + BNB + DT + RF + XGB + LGB | | | | |
| Merged datasets | GNB | | | | |
| | BNB | | | | |
| | RF | | | | |
| | DT | | | | |
| | XGB | | | | |
| | LGB | | | | |
| | GNB + BNB | | | | |
| | RF + DT | | | | |
| | LGB + XGB | | | | |
| | GNB + BNB + DT + RF | | | | |
| | GNB + BNB + XGB + LGB | | | | |
| | DT + RF + XGB + LGB | | | | |
| | GNB + BNB + DT + RF + XGB + LGB | | | | |
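The Sn, Sp, and Acc columns follow from the confusion matrix of thresholded predictions. A minimal sketch, assuming Sn denotes sensitivity, Sp denotes specificity, and diabetic is the positive class (label 1); the labels below are hypothetical.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, specificity, and accuracy for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "Sn": tp / (tp + fn),    # true positive rate
        "Sp": tn / (tn + fp),    # true negative rate
        "Acc": (tp + tn) / (tp + tn + fp + fn),
    }


# Hypothetical ground truth and predictions for ten respondents.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(toy_true, toy_pred)
print(metrics)
```

Reporting Sn and Sp alongside Acc matters when classes are imbalanced, since accuracy alone can hide poor detection of the minority class.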
Figure 4. Box and whisker plots of AUC results acquired from 5-fold cross-validation on different ML-based classifiers, where M-1 to M-13 represent GNB, BNB, RF, DT, XGB, LGB, GNB + BNB, RF + DT, LGB + XGB, GNB + BNB + DT + RF, GNB + BNB + XGB + LGB, DT + RF + XGB + LGB, and GNB + BNB + DT + RF + XGB + LGB, respectively.
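The abstract reports an ANOVA test on the classification results. As a sketch of what a one-way ANOVA over per-fold AUC values involves, the function below computes the F statistic from between-group and within-group sums of squares. Whether the study used exactly this one-way design is an assumption, and the fold scores are hypothetical.

```python
from typing import List, Tuple


def one_way_anova_f(groups: List[List[float]]) -> Tuple[float, int, int]:
    """Return (F statistic, between-group dof, within-group dof)."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    dof_between = len(groups) - 1
    dof_within = n_total - len(groups)
    f_stat = (ss_between / dof_between) / (ss_within / dof_within)
    return f_stat, dof_between, dof_within


# Hypothetical 5-fold AUC values for two models.
toy_fold_aucs = [
    [0.70, 0.72, 0.71, 0.69, 0.73],
    [0.80, 0.82, 0.81, 0.79, 0.83],
]
f_stat, dof_b, dof_w = one_way_anova_f(toy_fold_aucs)
print(round(f_stat, 2), dof_b, dof_w)
```

A large F relative to the F distribution with (`dof_b`, `dof_w`) degrees of freedom indicates that the models' mean AUCs differ by more than fold-to-fold noise would explain.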