Ahmed I ElSeddawy1, Faten Khalid Karim2, Aisha Mohamed Hussein3, Doaa Sami Khafaga2.
Abstract
Type 2 diabetes mellitus (T2DM) is a common chronic disease that increasingly leads to complications affecting vital organs. Its main characteristic, hyperglycemia, is caused by insufficient insulin secretion and poses a serious risk to human health. The objective of this study is to construct a T2DM prediction model with high classification accuracy, applying advanced machine learning and predictive modeling techniques for the early diagnosis of diabetes. This paper proposes an efficient model to predict and classify the minority class of type 2 diabetes. The impact of oversampling and undersampling approaches on reducing the effect of class imbalance is compared across classification algorithms: the Synthetic Minority Oversampling Technique (SMOTE) and Tomek-links are applied and examined, and the outcomes are compared against the original imbalanced dataset using an artificial neural network (ANN) predictive model. The model is also compared with other state-of-the-art classifiers, namely support vector machine (SVM), random forest (RF), and decision tree (DT). The tuned model achieved a best accuracy of 92.2%. The experimental findings clearly show improved accuracy and better evaluation metrics (AUC and F1-measure) with the SMOTE oversampling strategy than with the baseline and undersampling schemes. The study recommends adopting dynamic hyperparameter optimization to further improve accuracy.
Year: 2022 PMID: 36268149 PMCID: PMC9578843 DOI: 10.1155/2022/3078025
Source DB: PubMed Journal: Comput Intell Neurosci
Summary of related works.
| Reference | Approach | Algorithm | Significance and limitations |
|---|---|---|---|
| Zhu et al. | Dimensionality reduction of Pima diabetes dataset | PCA, K-means, LR | Accuracy: 97.40% |
| Devi et al. | Class imbalance and class overlap on Pima dataset; missing values eliminated | FFNN outperforms NB, SVM | Accuracy: 82.0% |
| Gupta et al. | Dimensionality reduction of Pima diabetes dataset | K-fold CV SVM outperforms NB | Accuracy: 81.1%, 79.2%; no comparable studies |
| Choubey et al. | Dimensionality reduction of Pima diabetes dataset by PCA + LDA | AdaBoost, classification via regression (CVR), RBF, KNN | PCA-CVR: 91% accuracy with excessive feature selection |
| Singh and Singh | Data preprocessing of Pima diabetes dataset | Ensemble model (NSGA-II) outperforms SVM, DT, RBF, and poly-SVM | Accuracy: 83.8%; reduction techniques and comparability not applied |
| Kumari et al. | Data preprocessing of Pima diabetes dataset | Stacking model of RF, NB, LR | Accuracy: 79.04%; needs more effective preprocessing steps |
| Khandegar and Pawar | Dimensionality reduction of Pima diabetes dataset | PCA + NN | Accuracy: 92.2% |
| Kandhasamy and Balamurali | Data preprocessing of Pima diabetes dataset | RF outperforms J48 DT, KNN, SVM | Accuracy: 100%; no comparability with other studies |
| Mercaldo et al. | Dimensionality reduction of Pima diabetes dataset | Hoeffding tree outperforms J48, MLP, JRip, BayesNet, RF | Accuracy: 75.5% |
| Mohebbi et al. | Classification methods with grid search | CNN outperforms MLP | Accuracy: 77.5%; no comparison with other studies |
| Roy et al. | Prediction of diabetes | (i) Lrgbm | Accuracy |
| Jhaldiyal and Mishra | Dimensionality reduction of Pima diabetes dataset | PCA + SVM outperforms PCA + REP | Accuracy: 93.66%; no comparability with other studies |
| Maniruzzaman et al. | Comparative approach on Pima dataset | GPC outperforms LDA, QDA, NB | Accuracy: 81.97% |
| Butt et al. | Classification and prediction on Pima dataset | MLP outperforms RF, LR; LSTM outperforms MA, LR | Accuracy: 86.08% (MLP), 87.26% (LSTM); no feature selection |
| Nnamoko and Korkontzelos | SMOTE oversampling of outliers in Pima diabetes dataset | NB, SVM, RIPPER | Accuracy: 77.0%, 77.7%, 83.6% |
| Zeng et al. | Handle class imbalance in Pima dataset | K | |
| Wang et al. | ADASYN oversampling | NB-ADASYN | Accuracy: 87.10% |
Feature characteristics of diabetes in Pima Indians' dataset.
| Feature and category | Mean ± SD | Diabetes (n = 268) | Nondiabetes (n = 500) | p value |
|---|---|---|---|---|
| Age (years) | 33.24 ± 11.8 | | | |
| <25 | | 31 (11.8%) | 188 (38.0%) | <0.05 |
| 25–30 | | 53 (20.0%) | 124 (25.2%) | |
| 30–35 | | 42 (15.5%) | 50 (9.6%) | |
| 35–40 | | 34 (12.9%) | 39 (8.3%) | |
| >40 | | 108 (40.0%) | 99 (19.9%) | |
| No. of pregnancies | 3.8 ± 3.4 | | | |
| Never | | 38 (14.0%) | 73 (14.0%) | <0.05 |
| 1–3 | | 75 (28.2%) | 238 (45.6%) | |
| 4–6 | | 60 (22.0%) | 115 (25.7%) | |
| >6 | | 95 (35.8%) | 74 (14.7%) | |
| Insulin level (µU/ml) | 79.8 ± 15.2 | | | |
| <200 | | 221 (83.0%) | 458 (92.0%) | <0.05 |
| >200 | | 47 (18.0%) | 42 (8.0%) | |
| BMI (kg/m²) | 31.9 ± 7.9 | | | |
| Normal | | 7 (3.0%) | 108 (22.0%) | <0.05 |
| Overweight | | 122 (46.0%) | 239 (46.9%) | |
| Obesity | | 139 (52.0%) | 153 (31.1%) | |
| Blood pressure (mmHg) | 69 ± 19.4 | | | |
| Low (<65) | | 46 (18.0%) | 155 (31.4%) | <0.05 |
| Normal (65–85) | | 173 (64.6%) | 288 (57.2%) | |
| High (>85) | | 49 (17.4%) | 57 (11.4%) | |
| Glucose (mg/dL) | 120.9 ± 32 | | | |
| Normal (<140) | | 131 (49.9%) | 438 (88.6%) | <0.05 |
| High (>140) | | 137 (50.1%) | 62 (11.4%) | |
| Skin fold (mm) | 20.5 ± 16 | | | |
| <20 | | 103 (38.6%) | 235 (47.4%) | <0.05 |
| 20–40 | | 123 (45.9%) | 217 (43.0%) | |
| >40 | | 42 (15.5%) | 48 (9.6%) | |
| Pedigree function | 0.47 ± 0.3 | | | |
| <0.5 | | 163 (60.8%) | 319 (63.8%) | <0.05 |
| 0.5–1.0 | | 87 (32.5%) | 145 (29.0%) | |
| >1.0 | | 18 (6.7%) | 36 (7.2%) | |
Multiple logistic regression model in diabetes mellitus dataset.
| Independent variable (dependent variable: diabetes) | B | Odds ratio | Lower limit | Upper limit | Significance |
|---|---|---|---|---|---|
| Age (years) | | | | | |
| <25 | −1.2 | 6.55 | 4.1 | 10.6 | 0.0 |
| 25–30 | −0.7 | 2.55 | 1.6 | 3.9 | 0.0 |
| 30–35 | −0.9 | 1.29 | 0.78 | 2.1 | 0.0 |
| 35–40 | −1.15 | 1.26 | 0.72 | 2.13 | 0.0 |
| >40 | 1 | 1 | Ref | | |
| Number of pregnancies | | | | | |
| Never | 1 | 1 | Ref | | |
| 1–3 | 0.69 | 1.66 | 1.0 | 2.6 | 0.0 |
| 4–6 | 0.46 | 1.0 | 0.6 | 1.6 | 0.02 |
| >6 | 0.92 | 0.4 | 0.2 | 0.6 | 0.0 |
| Insulin level (µU/ml) | | | | | |
| <200 | 0.9 | 1 | Ref | | |
| >200 | 1.5 | 2.5 | 1.5 | 3.7 | 0.0 |
| BMI (kg/m²) | | | | | |
| Normal | −2.5 | 8.01 | 3.6 | 16.9 | 0.0 |
| Overweight | 1 | 1 | Ref | | |
| Obesity | 0.1 | 0.6 | 0.39 | 0.69 | 0.3 |
| Blood pressure (mmHg) | | | | | |
| Low (<65) | −1.3 | 2.9 | 1.7 | 4.8 | 0.0 |
| Normal (65–85) | −1.2 | 1.4 | 0.9 | 2.2 | 0.0 |
| High (>85) | 1 | 1 | Ref | | |
| Glucose (mg/dL) | | | | | |
| Normal (<140) | −0.4 | 7.3 | 5.0 | 11.0 | 0.0 |
| High (>140) | 1 | 1 | Ref | | |
| Skin fold (mm) | | | | | |
| <20 | −0.1 | 1.3 | 0.9 | 1.8 | 0.1 |
| 20–40 | 1 | 1 | Ref | | |
| >40 | −1.0 | 0.6 | 0.4 | 1.0 | 0.0 |
Figure 1: Framework for the proposed methodology.
Figure 2: Generation of training and testing datasets (SMOTE: Synthetic Minority Oversampling Technique).
Algorithm 1: SMOTE [10].
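Algorithm 1 interpolates each synthetic sample between a minority point and one of its k nearest minority neighbours. The interpolation step can be sketched in NumPy as follows (a minimal illustration, not the authors' implementation; the `smote` helper and the toy minority set are assumptions):

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples (SMOTE sketch).

    For each synthetic point: pick a random minority sample, choose one
    of its k nearest minority neighbours, and interpolate between the
    two at a random fraction in [0, 1).
    """
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    # Pairwise distances between minority samples.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)           # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]     # k nearest minority neighbours
    synth = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(n)                       # base sample
        b = nn[a, rng.integers(min(k, n - 1))]    # one of its neighbours
        gap = rng.random()                        # interpolation fraction
        synth[i] = X_min[a] + gap * (X_min[b] - X_min[a])
    return synth

# Oversample a toy minority class from 5 to 10 samples (SMOTE 100%).
X_minority = np.array([[1., 1.], [1.2, .9], [.8, 1.1], [1.1, 1.3], [.9, .8]])
new_points = smote(X_minority, n_new=5, k=3, rng=0)
X_balanced = np.vstack([X_minority, new_points])
```

Because each synthetic point is a convex combination of two minority samples, the oversampled set stays inside the minority class's region of the feature space.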
Characteristics of the datasets after preprocessing steps.
| Dataset | Positive | Negative | Total |
|---|---|---|---|
| Baseline | 268 | 500 | 768 |
| SMOTE (100%) | 414 | 414 | 828 |
| SMOTE/undersample (200:100) | 648 | 432 | 1080 |
| Tomek-links | 216 | 329 | 545 |
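Tomek-links undersampling, which produced the reduced dataset in the last row above, removes the majority-class member of every pair of mutual nearest neighbours drawn from opposite classes. A minimal sketch on a toy dataset (the `tomek_links_undersample` helper is an illustrative assumption, not the paper's implementation):

```python
import numpy as np

def tomek_links_undersample(X, y, majority=0):
    """Remove majority-class members of Tomek links (sketch).

    A Tomek link is a pair of samples from opposite classes that are
    each other's nearest neighbour; dropping the majority member of
    each link cleans the class boundary.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)                  # nearest neighbour of each sample
    drop = set()
    for i, j in enumerate(nn):
        if nn[j] == i and y[i] != y[j]:    # mutual neighbours, opposite classes
            drop.add(i if y[i] == majority else j)
    keep = np.array([i for i in range(len(y)) if i not in drop])
    return X[keep], y[keep]

# Toy example: a majority point overlapping the minority region forms a
# Tomek link with its minority neighbour and is removed.
X = np.array([[0., 0.], [1., 0.], [0.9, 0.1], [5., 5.]])
y = np.array([0, 0, 1, 1])
Xr, yr = tomek_links_undersample(X, y, majority=0)
```

Unlike SMOTE, this shrinks the dataset (545 samples in the table above) rather than growing it, which is why the paper treats the two as competing resampling strategies.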
Figure 3: SVM model for the SMOTE-oversampled training dataset, using R software.
Figure 4: SVM model for the Tomek-links-undersampled training and test datasets.
Comparative analysis after applying different SVM kernels on training sets.
| Type | Kernel type | Imbalanced training data | SMOTE oversampling (100%) | SMOTE/undersample (200:100) | Tomek-links |
|---|---|---|---|---|---|
| Accuracy of SVM model | Linear | 69.4% | 73.0% | 80.4% | 80.2% |
| | Linear grid | 82.6% | 90.1% | 85.7% | 84.6% |
| | Radial basis function (RBF) | 76.0% | 97.4% | 96.6% | 88.0% |
Hyperparameters (cost C and kernel width sigma) for the SVM-RBF kernel.
Figure 5: AUC for SVM-RBF after training on SMOTE 100%, sigma = 0.1210985 and C = 128.
Figure 6: AUC for SVM-RBF after training on Tomek-links, sigma = 0.127 and C = 1.0.
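The paper tunes cost C and kernel width sigma for the RBF kernel in R; an equivalent grid search can be sketched with scikit-learn, where `gamma` plays the role of kernlab's sigma (the synthetic dataset and the grid values are assumptions for illustration, not the authors' setup):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the resampled training set.
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.65, 0.35], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Grid over cost C and kernel width gamma (sigma in kernlab's notation),
# scored by AUC as in Figures 5 and 6.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(
    pipe,
    {"svc__C": [1, 8, 32, 128], "svc__gamma": [0.01, 0.1, 1.0]},
    cv=5, scoring="roc_auc")
grid.fit(X_tr, y_tr)
best = grid.best_params_       # e.g. the (C, gamma) pair with highest CV AUC
test_auc = grid.score(X_te, y_te)
```

Scaling inside the pipeline matters here: the RBF kernel is distance-based, so unscaled features would let large-range variables (e.g. insulin) dominate the kernel width.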
Algorithm 2: Random forest pseudocode [57].
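Random forest pseudocode follows the standard recipe: draw a bootstrap sample of the training set for each tree, grow the tree considering a random feature subset at each split, and aggregate predictions by majority vote. A compact sketch (the `TinyForest` class is an illustrative assumption, not the authors' implementation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

class TinyForest:
    """Minimal random forest: bootstrap + random feature subsets + vote."""

    def __init__(self, n_trees=25, seed=0):
        self.n_trees, self.seed = n_trees, seed
        self.trees = []

    def fit(self, X, y):
        rng = np.random.default_rng(self.seed)
        n = len(y)
        for _ in range(self.n_trees):
            idx = rng.integers(n, size=n)     # bootstrap sample (with replacement)
            # max_features="sqrt" restricts each split to a random feature subset.
            t = DecisionTreeClassifier(
                max_features="sqrt",
                random_state=int(rng.integers(1 << 30)))
            t.fit(X[idx], y[idx])
            self.trees.append(t)
        return self

    def predict(self, X):
        votes = np.stack([t.predict(X) for t in self.trees])
        return (votes.mean(axis=0) >= 0.5).astype(int)   # majority vote

# Fit on synthetic binary data as a stand-in for the resampled Pima set.
X, y = make_classification(n_samples=200, n_features=8, random_state=7)
forest = TinyForest(n_trees=25, seed=0).fit(X, y)
train_acc = (forest.predict(X) == y).mean()
```

The two sources of randomness (bootstrap rows, feature subsets per split) decorrelate the trees, which is what makes the vote more stable than any single tree.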
Figure 7: Conditional inference tree for the training set after oversampling. Bonferroni-adjusted significant p values are shown for each inner node, and the class proportion is displayed for every terminal node (diabetes = 1, nondiabetes = 0).
Performance metrics for the classification model.
| Performance metric | Formula |
|---|---|
| Precision | TP/(TP + FP) |
| Recall (sensitivity) | TP/(TP + FN) |
| Specificity (true negative rate) | TN/(TN + FP) |
| F1-score | 2 × (Precision × Recall)/(Precision + Recall) |
| Accuracy | (TP + TN)/(TP + TN + FP + FN) |
| G-mean | √(Sensitivity × Specificity) |
TP = true positive, TN = true negative, FP = false positive, FN = false negative.
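The formulas above follow directly from the four confusion-matrix counts; a small sketch (the `metrics` helper and the example counts are assumptions for illustration):

```python
import math

def metrics(tp, tn, fp, fn):
    """Classification metrics as defined in the table above."""
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)            # sensitivity
    specificity = tn / (tn + fp)
    f1          = 2 * precision * recall / (precision + recall)
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    g_mean      = math.sqrt(recall * specificity)
    return dict(precision=precision, recall=recall, specificity=specificity,
                f1=f1, accuracy=accuracy, g_mean=g_mean)

# Example confusion matrix: 60 TP, 80 TN, 10 FP, 15 FN.
m = metrics(tp=60, tn=80, fp=10, fn=15)
```

On imbalanced data, accuracy alone can look good while the minority class is misclassified; F1 and G-mean penalize that, which is why the paper reports them alongside accuracy.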
Evaluation of best performance of ANN using grid search.
| No. of neurons | No. of iterations | Decay | Accuracy | AUC | Time |
|---|---|---|---|---|---|
| 50 | 250 | 0.01 | 0.752 | 0.801 | 2.45 min |
| 30 | 250 | 0.01 | 0.725 | 0.786 | 2.30 min |
| 10 | 250 | 0.01 | 0.861 | 0.90 | 2.10 min |
| 3 | 250 | 0.01 | 0.79 | 0.801 | 50 sec |
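The grid search above varies the number of hidden neurons at a fixed iteration budget and weight decay. A comparable search can be sketched with scikit-learn's `MLPClassifier`, where the L2 penalty `alpha` stands in for nnet's `decay` (the synthetic data and the decay-to-`alpha` mapping are assumptions, not the authors' R setup):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the preprocessed training set.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Grid mirrors the table: hidden-layer sizes {3, 10, 30, 50},
# 250 iterations, decay 0.01 (alpha is scikit-learn's L2 penalty).
grid = GridSearchCV(
    MLPClassifier(max_iter=250, random_state=1),
    {"hidden_layer_sizes": [(3,), (10,), (30,), (50,)],
     "alpha": [0.01]},
    cv=3, scoring="roc_auc")
grid.fit(X, y)
best_neurons = grid.best_params_["hidden_layer_sizes"][0]
```

Cross-validated AUC, rather than training accuracy, drives the choice of width here, matching the AUC column the table uses to pick 10 neurons.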
Optimized parameters used for training ANN.
| No. | Parameter | Value |
|---|---|---|
| 1 | Loss function | Binary cross-entropy |
| 2 | Optimizer | Gradient descent (backpropagation) |
| 3 | Algorithm | "rprop+" |
| 4 | Activation function | "logistic" |
| 5 | Stepmax | 1e+06 |
| 6 | Learning rate | 0.01 |
| 7 | Metrics | Accuracy |
Figure 8: Feature selection priority using the varImp function in the ANN evaluation.
The ANN model's accuracy, error rate, and training time.
| Training data | Test accuracy | MSE | Training time (sec) |
|---|---|---|---|
| Original set | 0.801 | 0.06 | 31.83 |
| SMOTE (100%) | 0.902 | 0.02 | 310 |
| SMOTE/undersample | 0.811 | 0.03 | |
| Tomek-links | 0.722 | 0.08 | 228.6 |
MSE: mean square error (loss function).
Figure 9: Performance evaluation of the ANN classifier before and after resampling.
Figure 10: AUC of 0.89 for the ANN classifier after SMOTE 100% oversampling.
Comparison between the multilayer perceptron, support vector machine, RF, and decision tree based on various metrics on the test dataset.
| Model | Sensitivity | Specificity | F1-score | Kappa | Precision | Accuracy | ROC | G-mean |
|---|---|---|---|---|---|---|---|---|
| ANN + grid search | 0.662 | 0.853 | 0.682 | 0.51 | 0.704 | 0.801 | 0.71 | 0.75 |
| ANN + grid search + SMOTE (100%) | 0.845 | 0.931 | 0.871 | 0.66 | 0.903 | 0.902 | 0.89 | 0.89 |
| ANN + grid search + SMOTE/undersample | 0.842 | 0.765 | 0.774 | 0.52 | 0.721 | 0.811 | 0.73 | 0.80 |
| ANN + grid search + Tomek-links | 0.735 | 0.724 | 0.643 | 0.42 | 0.575 | 0.722 | 0.64 | 0.73 |
| SVM | 0.645 | 0.751 | 0.609 | 0.38 | 0.657 | 0.691 | 0.69 | 0.70 |
| SVM + SMOTE (100%) | 0.715 | 0.754 | 0.675 | 0.29 | 0.707 | 0.729 | 0.73 | 0.73 |
| SVM + SMOTE/undersample | 0.701 | 0.715 | 0.582 | 0.29 | 0.606 | 0.664 | 0.69 | 0.71 |
| SVM + Tomek-links | 0.661 | 0.794 | 0.593 | 0.44 | 0.675 | 0.716 | 0.70 | 0.72 |
| RF | 0.522 | 0.754 | 0.652 | 0.33 | 0.652 | 0.695 | 0.62 | 0.63 |
| RF + SMOTE (100%) | 0.871 | 0.635 | 0.787 | 0.40 | 0.713 | 0.754 | 0.75 | 0.74 |
| RF + SMOTE/undersample | 0.815 | 0.691 | 0.694 | 0.38 | 0.614 | 0.742 | 0.62 | 0.75 |
| RF + Tomek-links | 0.752 | 0.641 | 0.543 | 0.38 | 0.494 | 0.693 | 0.65 | 0.7 |
| DT | 0.711 | 0.801 | 0.691 | 0.22 | 0.674 | 0.771 | 0.74 | 0.75 |
| DT + SMOTE (100%) | 0.789 | 0.764 | 0.669 | 0.41 | 0.627 | 0.789 | 0.78 | 0.78 |
| DT + SMOTE/undersample | 0.812 | 0.682 | 0.674 | 0.31 | 0.584 | 0.737 | 0.70 | 0.74 |
| DT + Tomek-links | 0.762 | 0.664 | 0.784 | 0.39 | 0.793 | 0.702 | 0.74 | 0.71 |
Figure 11: Performance comparison of all classifiers using accuracy (%).
Figure 12: Performance evaluation of the SVM classifier before and after resampling.
Figure 13: Performance evaluation of the RF classifier before and after resampling.
Figure 14: Performance evaluation of the DT classifier before and after resampling.
Summary of evaluation metrics using SMOTE (100%).
| Dataset | Classifier | Sensitivity | Specificity | F1-score | Kappa | Precision | Accuracy | ROC | G-mean |
|---|---|---|---|---|---|---|---|---|---|
| Test dataset, SMOTE 100% | ANN | 0.84 | 0.93 | 0.87 | 0.66 | 0.903 | 0.902 | 0.89 | 0.89 |
| | SVM | 0.71 | 0.75 | 0.67 | 0.19 | 0.70 | 0.729 | 0.73 | 0.73 |
| | RF | 0.87 | 0.63 | 0.78 | 0.40 | 0.71 | 0.754 | 0.75 | 0.74 |
| | DT | 0.78 | 0.76 | 0.66 | 0.36 | 0.627 | 0.789 | 0.78 | 0.78 |
Figure 15: AUC of 0.78 for the CTree classifier after SMOTE 100% oversampling.
Comparative results of ANN with previous studies based on accuracy%.
| Author | Approach | Algorithm | Accuracy (%) |
|---|---|---|---|
| Alam et al. | Prediction of diabetes; median values and NB imputation | ANN | 75.7 |
| Pradhan et al. | Prediction of diabetes with classifier comparisons | ANN | 85.09 |
| Guldogan et al. | Prediction of diabetes; missing values deleted | MLP | 78.1 |
| | | RBF | 76.8 |
| Ahuja et al. | Prediction of diabetes; missing values imputed by the median | MLP | 78.7 |
| Ramezani et al. | Prediction of diabetes, reducing the number of features from 8 to 5 | LANFIS | 88.05 |
| Our work (ANN) | Prediction of diabetes with preprocessing steps + SMOTE 100% | ANN | 90.2 |