Somayeh Sadeghi, Davood Khalili, Azra Ramezankhani, Mohammad Ali Mansournia, Mahboubeh Parsaeian.
Abstract
BACKGROUND: Early detection and prediction of incident type 2 diabetes mellitus from baseline measurements could reduce associated complications in the future. Because the incidence of diabetes is low compared with non-diabetes, accurate prediction of the minority diabetes class is more challenging.
Keywords: Cost-sensitive learning; Diabetes mellitus; Imbalanced data; Machine learning; Sampling strategies
Year: 2022 PMID: 35139846 PMCID: PMC8830137 DOI: 10.1186/s12911-022-01775-z
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 DNN structure for diabetes classification
Fig. 2 XGBoost structure for diabetes classification
Fig. 3 Flowchart of the SVM-SMOTE algorithm
Baseline characteristics of adult participants of the Tehran Lipid and Glucose Study in phase 3
| Categorical variables | Diabetic individuals, number (%) | Non-diabetic individuals, number (%) | Total population, number (%) |
|---|---|---|---|
| Women | 522 (56.3) | 2690 (57.4) | 3212 (57.2) |
| Men | 405 (43.7) | 1996 (42.6) | 2401 (42.8) |
| Single | 47 (5.1) | 788 (16.8) | 835 (14.9) |
| Married | 819 (88.3) | 3707 (79.1) | 4526 (80.7) |
| Divorced | 8 (0.9) | 60 (1.3) | 68 (1.2) |
| Widowed | 53 (5.7) | 129 (2.8) | 182 (3.2) |
| High (> 12 years) | 135 (14.8) | 1034 (22.4) | 1169 (21.1) |
| Moderate (6–12 years) | 478 (52.35) | 2756 (59.7) | 3234 (58.4) |
| Low (< 6 years) | 300 (32.85) | 830 (18) | 1130 (20.4) |
| Never or in the past | 809 (88.8) | 4137 (89.6) | 4946 (89.5) |
| Current | 102 (11.2) | 480 (10.4) | 582 (10.5) |
| No | 688 (84.7) | 3357 (81.1) | 4045 (81.7) |
| Yes | 124 (15.3) | 783 (18.9) | 907 (18.3) |
| Low | 331 (36.7) | 1647 (36.1) | 1978 (36.2) |
| High | 570 (63.3) | 2918 (63.9) | 3488 (63.8) |
| No | 351 (45.3) | 2001 (57.4) | 2352 (55.2) |
| Yes | 424 (54.7) | 1484 (42.6) | 1908 (44.8) |
| No | 884 (95.4) | 4584 (97.8) | 5468 (97.4) |
| Yes | 43 (4.6) | 102 (2.2) | 145 (2.6) |
| No | 878 (94.7) | 4595 (98.1) | 5473 (97.5) |
| Yes | 49 (5.3) | 91 (1.9) | 140 (2.5) |
| No | 457 (66.6) | 2371 (72.4) | 2828 (71.4) |
| Yes | 229 (33.4) | 902 (27.6) | 1131 (28.6) |
CVD: cardiovascular disease; BMI: body mass index
Optimal hyper-parameter values based on a fivefold stratified cross-validation grid search
| Model | Hyper-parameters |
|---|---|
| DNN | Number of layers = 4, number of nodes in each layer = (100,75,50,1), dropout rate in each layer = (0.5,0.5,0.25), activation function in each layer = (ReLU, ReLU, ReLU, sigmoid) |
| XGBoost | Learning rate = 0.3, maximum depth of each tree = 3, minimum loss reduction to split each node = 1, regularization term on weights = 20, subsample ratio of columns for each tree = 0.5 |
| Random forest | Number of trees in the forest = 1500, maximum depth of each tree = 19, the minimum number of samples to split each node = 8 |
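The search procedure named in the caption above (a grid search scored with fivefold stratified cross-validation) can be sketched in plain Python. This is only a minimal illustration: the helper names `stratified_kfold` and `grid`, the toy labels, and the two-parameter search space are assumptions made for the example, not the authors' code or their actual grid.

```python
import itertools
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs whose class ratios match the full data."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        rng.shuffle(idx)
        # Deal each class round-robin so every fold gets an equal share of it.
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    for f in range(k):
        val = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, val


def grid(param_grid):
    """Enumerate every combination of the listed hyper-parameter values."""
    keys = sorted(param_grid)
    for values in itertools.product(*(param_grid[key] for key in keys)):
        yield dict(zip(keys, values))


# Hypothetical search space; only the reported optimum (0.3, 3) comes from the table.
param_grid = {"learning_rate": [0.1, 0.3], "max_depth": [3, 5]}
labels = [1] * 20 + [0] * 80  # toy imbalanced labels, not study data

splits = list(stratified_kfold(labels, k=5))
candidates = list(grid(param_grid))
```

In a full search, each candidate would be fitted on every `train` split and scored on the matching `val` split, and the candidate with the best mean score would be kept.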
Comparison between deep neural network, extreme gradient boosting and random forest based on various metrics on the test dataset
| Model | Accuracy | F1-measure | G-mean | MCC* | AUROC | AUPRC | Confusion matrix** | |
|---|---|---|---|---|---|---|---|---|
| DNN | 0.862 | 0.575 | 0.713 | 0.747 | 0.857 | 0.603 | 0.926 | 0.074 |
| | | | | | | | 0.452 | 0.548 |
| XGBoost | 0.872 | 0.554 | 0.667 | 0.748 | 0.854 | 0.622 | 0.956 | 0.044 |
| | | | | | | | 0.534 | 0.466 |
| Random forest | 0.869 | 0.543 | 0.659 | 0.741 | 0.840 | 0.578 | 0.955 | 0.045 |
| | | | | | | | 0.545 | 0.455 |
MCC: Matthews Correlation Coefficient; AUROC: Area Under the Receiver Operating Characteristic Curve; AUPRC: Area Under the Precision-Recall Curve
*MCC has been projected from [-1,1] to [0,1] by formula
**The confusion matrix is given as row proportions: rows are the actual classes (non-diabetic, diabetic) and columns are the predicted classes (non-diabetic, diabetic)
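As a rough check on how the reported metrics relate to the confusion matrix, the sketch below recomputes the G-mean from the DNN row proportions (specificity 0.926, sensitivity 0.548). It also shows one common way to map MCC from [-1, 1] to [0, 1], namely (MCC + 1) / 2. The record does not reproduce the authors' formula, so that rescaling is an assumption, and the function names are illustrative.

```python
import math


def g_mean(specificity, sensitivity):
    """Geometric mean of the two per-class recalls."""
    return math.sqrt(specificity * sensitivity)


def normalized_mcc(tp, tn, fp, fn):
    """MCC rescaled to [0, 1] via (MCC + 1) / 2 (assumed form of the projection)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return (mcc + 1) / 2


# Diagonal of the DNN confusion matrix from the table above.
dnn_gmean = g_mean(specificity=0.926, sensitivity=0.548)
```

Under this rescaling, a value of 0.5 corresponds to a classifier no better than chance, which is worth keeping in mind when reading the MCC column.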
Fig. 4 ROC and Precision-Recall curves used to find the best threshold for each algorithm, based on the maximum g-mean and the maximum f1-measure. Note: the star marker indicates the threshold that maximizes the g-mean on the ROC curve and the f1-measure on the P-R curve
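The threshold selection described in the caption can be illustrated with a small pure-Python sketch: every distinct predicted score is tried as a cut-off, and the one that maximizes the chosen metric is kept. The scores and labels below are toy values invented for the example, and the function names are assumptions rather than anything from the paper.

```python
import math


def confusion(y_true, scores, thr):
    """Count TP, FN, FP, TN when predicting positive for score >= thr."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= thr)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < thr)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= thr)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < thr)
    return tp, fn, fp, tn


def g_mean_at(y_true, scores, thr):
    tp, fn, fp, tn = confusion(y_true, scores, thr)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sens * spec)


def f1_at(y_true, scores, thr):
    tp, fn, fp, tn = confusion(y_true, scores, thr)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, scores, metric):
    """Return the candidate threshold with the highest metric value."""
    return max(sorted(set(scores)), key=lambda t: metric(y_true, scores, t))


# Toy imbalanced example: 3 positives, 10 negatives (not study data).
y_true = [1, 1, 1] + [0] * 10
scores = [0.3, 0.7, 0.9,
          0.05, 0.1, 0.12, 0.15, 0.2, 0.22, 0.25, 0.35, 0.4, 0.5]

g_thr = best_threshold(y_true, scores, g_mean_at)
f1_thr = best_threshold(y_true, scores, f1_at)
```

On this toy data the two criteria pick different cut-offs: the g-mean favors a lower threshold that recovers every positive, while the f1-measure favors a higher one that avoids false positives.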
Evaluation of the effect of threshold moving and class weighting on the performance of the algorithms
| Model | Method | Accuracy | F1-measure | G-mean | MCC | AUROC | AUPRC |
|---|---|---|---|---|---|---|---|
| DNN | g-t | 0.784 | 0.554 | 0.786 | 0.732 | 0.857 | 0.603 |
| | f1-t | 0.848 | 0.591 | 0.757 | 0.750 | 0.857 | 0.603 |
| | weighted | 0.822 | 0.581 | 0.780 | 0.744 | 0.858 | 0.606 |
| XGBoost | g-t | 0.774 | 0.538 | 0.774 | 0.721 | 0.854 | 0.622 |
| | f1-t | 0.855 | 0.586 | 0.738 | 0.749 | 0.854 | 0.622 |
| | weighted | 0.832 | 0.588 | 0.776 | 0.748 | 0.853 | 0.620 |
| Random forest | g-t | 0.777 | 0.534 | 0.767 | 0.717 | 0.840 | 0.578 |
| | f1-t | 0.841 | 0.564 | 0.733 | 0.734 | 0.840 | 0.578 |
| | weighted | 0.810 | 0.566 | 0.775 | 0.735 | 0.846 | 0.591 |
g-t: threshold moved to maximize the g-mean; f1-t: threshold moved to maximize the f1-measure
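The "weighted" rows refer to cost-sensitive training, in which errors on the minority class are penalized more heavily. The exact weights used in the study are not given in this record, so the sketch below assumes the common "balanced" heuristic, weight = n / (k × n_class), applied to the whole-cohort class counts from the baseline table (927 diabetic, 4686 non-diabetic). Weights fitted on a training split would differ slightly.

```python
import math


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * class_count)."""
    total = sum(counts.values())
    k = len(counts)
    return {c: total / (k * n) for c, n in counts.items()}


def weighted_log_loss(y_true, probs, weights, eps=1e-12):
    """Binary cross-entropy where each sample is scaled by its class weight."""
    loss = 0.0
    total_w = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        w = weights[y]
        loss += -w * (math.log(p) if y == 1 else math.log(1 - p))
        total_w += w
    return loss / total_w


# Class 1 = diabetic, class 0 = non-diabetic (cohort counts, assumed here
# as a stand-in for the training-set counts).
weights = balanced_class_weights({1: 927, 0: 4686})
```

With these counts a misclassified diabetic case costs roughly five times as much as a misclassified non-diabetic case, which pushes a model toward higher sensitivity.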
Fig. 5 Comparison of the effect of various sampling methods on the distribution of diabetic (black circles) and non-diabetic (red circles) individuals. The x, y and z axes are the first, second and third principal components
Comparison between various sampling methods on the performance of algorithms
| Sampling method | Accuracy | F1-measure | G-mean | MCC | AUROC | AUPRC |
|---|---|---|---|---|---|---|
| RENN | 0.830 | 0.583 | 0.773 | 0.745 | 0.862 | 0.608 |
| OSS | 0.856 | 0.594 | 0.747 | 0.753 | 0.855 | 0.599 |
| SMOTE | 0.805 | 0.556 | 0.768 | 0.729 | 0.855 | 0.594 |
| SVM-SMOTE | 0.827 | 0.580 | 0.773 | 0.743 | 0.856 | 0.602 |
| ENN-SMOTE | 0.818 | 0.563 | 0.763 | 0.733 | 0.850 | 0.599 |
| RENN | 0.814 | 0.572 | 0.779 | 0.740 | 0.856 | 0.588 |
| OSS | 0.831 | 0.554 | 0.733 | 0.727 | 0.842 | 0.591 |
| SMOTE | 0.859 | 0.568 | 0.708 | 0.742 | 0.848 | 0.592 |
| SVM-SMOTE | 0.844 | 0.555 | 0.718 | 0.730 | 0.858 | 0.605 |
| ENN-SMOTE | 0.857 | 0.548 | 0.688 | 0.733 | 0.845 | 0.594 |
| RENN | 0.808 | 0.556 | 0.764 | 0.728 | 0.844 | 0.553 |
| OSS | 0.832 | 0.561 | 0.741 | 0.731 | 0.837 | 0.569 |
| SMOTE | 0.842 | 0.543 | 0.704 | 0.724 | 0.840 | 0.550 |
| SVM-SMOTE | 0.838 | 0.548 | 0.717 | 0.726 | 0.842 | 0.552 |
| ENN-SMOTE | 0.844 | 0.531 | 0.687 | 0.719 | 0.843 | 0.541 |
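Several of the sampling methods compared above build on SMOTE, which creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below shows that core step in plain Python. It is a simplified illustration under stated assumptions (Euclidean distance, toy two-dimensional points, an illustrative `smote` helper), not the implementation used in the study, and it does not cover the SVM-based or ENN-based variants.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points between minority samples and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Toy minority class on the corners of the unit square (not study data).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point lies on a segment between two real minority points, oversampling of this kind fills in the minority region rather than duplicating existing records.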