Elias Dritsas, Maria Trigka.
Abstract
Diabetes mellitus is a chronic condition characterized by a disturbance in the metabolism of carbohydrates, fats and proteins. The most characteristic disorder in all forms of diabetes is hyperglycemia, i.e., elevated blood sugar levels. The modern way of life has significantly increased the incidence of diabetes, which makes early diagnosis of the disease a necessity. Machine Learning (ML) has gained great popularity among healthcare providers and physicians due to its high potential for developing efficient tools for risk prediction, prognosis, treatment and the management of various conditions. This study describes a supervised learning methodology that aims to create highly efficient risk prediction tools for type 2 diabetes occurrence. A feature analysis is conducted to evaluate the importance of the features and explore their association with diabetes. These features are the most common symptoms, which often develop slowly with diabetes, and they are used to train and test several ML models. The models are evaluated in terms of Precision, Recall, F-Measure, Accuracy and AUC and compared under 10-fold cross-validation and data splitting. Both validation methods highlighted Random Forest and K-NN as the best-performing models.
Keywords: Machine Learning; data analysis; diabetes; prediction
Year: 2022 PMID: 35890983 PMCID: PMC9318204 DOI: 10.3390/s22145304
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
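The evaluation metrics named in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' tooling: the function names `binary_metrics` and `auc` are hypothetical, and the sketch scores a single positive class, whereas the averaging scheme behind the reported figures is not stated on this page.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F-measure for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


def auc(y_true, scores, positive=1):
    """AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise form of AUC is quadratic in the number of samples, which is acceptable for a dataset of a few hundred participants but would need a rank-based formulation for larger ones.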
Related works for the subject under consideration.
| Research Work | Use Case | Dataset | Proposed Models | Metrics |
|---|---|---|---|---|
| [ | Diabetes Prediction | Pima Indian Diabetes Dataset | Soft Weighted Voting | AUC: 0.950 |
| [ | Diabetes Classification | Pima Indian Diabetes Dataset | SVM/KNN | SVM: Accuracy 0.89, |
| [ | Diabetes Detection | Not Publicly Available | Simple Linear Regression | RMSE: 0.838 |
| [ | Diabetes Prediction | Pima Indian Diabetes Dataset | Random Forest | Accuracy: 94.1% |
| [ | Classification and | National Health and | Random Forest | Accuracy: 94.25% |
| [ | Diabetes Detection | ELSA Database | Weighted Majority Voting | AUC: 0.884 |
| [ | Diabetes Prediction | [ | Random Forest | Accuracy: 94.1% |
| [ | Diabetes Prediction | [ | KNN | Accuracy: 98.07% |
| [ | Diabetes Prediction | [ | Random Forest | Accuracy, Precision, |
| [ | Diabetes Prediction | [ | Random Forest | Accuracy: 97.88% |
Evaluation of feature importance based on the Pearson Correlation, Gain Ratio, Naive Bayes and Random Forest.
| Feature | Pearson Correlation | Feature | Gain Ratio | Feature | Naive Bayes | Feature | Random Forest |
|---|---|---|---|---|---|---|---|
| polyuria | 0.7046 | polydipsia | 0.4317 | polyuria | 0.3329 | polyuria | 0.3337 |
| polydipsia | 0.6969 | polyuria | 0.4143 | polydipsia | 0.3189 | polydipsia | 0.3189 |
| sudden_weight_loss | 0.5017 | gender | 0.2117 | sudden_weight_loss | 0.2229 | age | 0.2537 |
| gender | 0.4922 | sudden_weight_loss | 0.2088 | gender | 0.2089 | sudden_weight_loss | 0.2232 |
| partial_paresis | 0.4757 | partial_paresis | 0.1814 | partial_paresis | 0.2084 | gender | 0.2092 |
| polyphagia | 0.3450 | irritability | 0.1218 | polyphagia | 0.1454 | partial_paresis | 0.2084 |
| irritability | 0.3398 | polyphagia | 0.0895 | irritability | 0.1174 | polyphagia | 0.1456 |
| alopecia | 0.2771 | alopecia | 0.0588 | alopecia | 0.1099 | irritability | 0.1175 |
| visual_blurring | 0.2564 | age | 0.0533 | visual_blurring | 0.1098 | alopecia | 0.1118 |
| weakness | 0.2547 | visual_blurring | 0.0489 | weakness | 0.1093 | visual_blurring | 0.1103 |
| genital_thrush | 0.1441 | weakness | 0.0477 | age | 0.0584 | weakness | 0.1096 |
| age | 0.1124 | genital_thrush | 0.0209 | genital_thrush | 0.0468 | genital_thrush | 0.0471 |
| muscle_stiffness | 0.1068 | muscle_stiffness | 0.0086 | muscle_stiffness | 0.0324 | muscle_stiffness | 0.0327 |
| obesity | 0.0808 | obesity | 0.0074 | obesity | 0.0180 | obesity | 0.0191 |
| delayed_healing | 0.0471 | delayed_healing | 0.0016 | delayed_healing | 0.0046 | delayed_healing | 0.0049 |
| itching | 0.0156 | itching | 0.0002 | itching | −0.0273 | itching | −0.0260 |
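The Pearson column of the table above can be illustrated with a short sketch that correlates each 0/1-encoded symptom with the class label and sorts the results. The helper names `pearson` and `rank_features`, the 0/1 encoding, the `"class"` key and the ranking by absolute value are all assumptions made for illustration; the toy rows in the usage example are invented and are not taken from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_features(rows, target="class"):
    """Rank features by |r| against the target, strongest first.

    rows: list of dicts mapping feature names (and the target) to 0/1 values.
    """
    labels = [row[target] for row in rows]
    features = [name for name in rows[0] if name != target]
    scored = [
        (name, abs(pearson([row[name] for row in rows], labels)))
        for name in features
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented toy rows, for illustration only.
toy = [
    {"polyuria": 1, "itching": 1, "class": 1},
    {"polyuria": 1, "itching": 0, "class": 1},
    {"polyuria": 0, "itching": 1, "class": 0},
    {"polyuria": 0, "itching": 0, "class": 0},
]
ranking = rank_features(toy)
```

In the toy rows `polyuria` matches the label exactly and `itching` is independent of it, so the ranking places `polyuria` first.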
Figure 1. Participants’ distribution in terms of the age group and gender.
Figure 2. Participants’ distribution in terms of polyuria and polydipsia in the balanced dataset.
Figure 3. Participants’ distribution in terms of sudden weight loss and weakness in the balanced dataset.
Figure 4. Participants’ distribution in terms of polyphagia and obesity in the balanced dataset.
Figure 5. Participants’ distribution in terms of irritability and alopecia in the balanced dataset.
Figure 6. Participants’ distribution in terms of genital thrush and itching in the balanced dataset.
Figure 7. Participants’ distribution in terms of partial paresis and muscle stiffness in the balanced dataset.
Figure 8. Participants’ distribution in terms of delayed healing and visual blurring in the balanced dataset.
Machine Learning models’ settings.
| Models | Parameters |
|---|---|
| | estimator: simpleEstimator |
| | useKernelEstimator: False |
| | eps = 0.001 |
| | ridge = |
| | hidden layers: ‘a’ |
| | K = 1 |
| | reducedErrorPruning: False |
| | errorOnProbabilities: False |
| | maxDepth = 0 |
| | maxDepth = 0 |
| | maxDepth = −1 |
| | classifier: J48 |
| | classifier: DecisionStump |
| | epochs = 500 |
| | Base Models: RF, KNN |
Performance evaluation after SMOTE with 10-fold cross-validation.
| | Accuracy | Precision | Recall | F-Measure | AUC |
|---|---|---|---|---|---|
| | 88.75 ± 5.04% | 88.9 ± 4.8% | 88.8 ± 4.9% | 88.7 ± 5.1% | 95.6 ± 2.1% |
| | 88.91 ± 5.02% | 89.1 ± 4.7% | 88.9 ± 5% | 88.9 ± 5.1% | 95.5 ± 2.4% |
| | 95.62 ± 2.06% | 95.7 ± 1.8% | 95.6 ± 2.1% | 95.6 ± 2.1% | 95.6 ± 2.1% |
| | 93.44 ± 2.64% | 93.4 ± 2.6% | 93.4 ± 2.6% | 93.4 ± 2.7% | 97.6 ± 1.4% |
| | 96.45 ± 2.00% | 97.3 ± 2.40% | 97.3 ± 2.40% | 97.2 ± 2.30% | 99.1 ± 2.60% |
| | 98.59 ± 1.72% | 98.6 ± 1.62% | 98.6 ± 1.70% | 98.6 ± 1.70% | 98.9 ± 1.30% |
| | 97.19 ± 2.74% | 97.2 ± 2.70% | 97.2 ± 2.70% | 97.2 ± 2.70% | 97.2 ± 2.20% |
| | 97.19 ± 1.61% | 97.2 ± 1.60% | 97.2 ± 1.60% | 97.2 ± 1.60% | 98.3 ± 1.30% |
| | 98.59 ± 1.15% | 98.6 ± 1.10% | 98.6 ± 1.12% | 98.6 ± 1.12% | 99.9 ± 0.20% |
| | 97.97 ± 2.09% | 98 ± 2.10% | 98 ± 2.10% | 98 ± 2.10% | 98 ± 2.10% |
| | 93.12 ± 3.23% | 93.2 ± 3.00% | 93.1 ± 3.20% | 93.1 ± 3.20% | 96.4 ± 2.30% |
| | 98.28 ± 2.01% | 98.3 ± 1.17% | 98.3 ± 2.00% | 98.3 ± 2.00% | 99.9 ± 0.20% |
| | 90.78 ± 2.59% | 91.2 ± 2.40% | 90.8 ± 2.60% | 90.8 ± 2.60% | 97.1 ± 2.10% |
| | 94.22 ± 2.56% | 94.3 ± 2.40% | 94.2 ± 2.60% | 94.2 ± 2.60% | 94.2 ± 2.60% |
| | 98.49 ± 1.10% | 98.5 ± 1.10% | 98.5 ± 1.11% | 98.5 ± 1.11% | 99.7 ± 0.20% |
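The "mean ± spread" entries in the table above are what a 10-fold cross-validation loop produces: one score per held-out fold, summarized across folds. The sketch below shows that procedure in plain Python under several assumptions: the ± is taken to be the standard deviation across folds, the classifier is a simple 1-nearest-neighbour rule with Hamming distance standing in for the K = 1 setting, the folds are not stratified, and all function names are hypothetical.

```python
import random
import statistics


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def predict_1nn(train_X, train_y, x):
    """Label of the closest training row under Hamming distance."""
    nearest = min(
        range(len(train_X)),
        key=lambda i: sum(a != b for a, b in zip(train_X[i], x)),
    )
    return train_y[nearest]


def cross_validate(X, y, k=10, seed=0):
    """Return (mean, standard deviation) of accuracy over k held-out folds."""
    accuracies = []
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        correct = sum(predict_1nn(train_X, train_y, X[i]) == y[i] for i in fold)
        accuracies.append(correct / len(fold))
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

A call such as `mean_acc, std_acc = cross_validate(X, y)` would then be reported as `mean_acc ± std_acc`; the same loop applies to any of the other metrics by swapping the per-fold score.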
Model comparison in terms of accuracy with 10-fold cross-validation.
| | Accuracy | | | | |
|---|---|---|---|---|---|
| | Proposed models | [ | [ | [ | [ |
| | 88.75% | - | 86.92% | - | - |
| | 88.91% | 87.4% | 87.11% | 87.1% | - |
| | 95.62% | - | 92.11% | 92.1% | - |
| | 93.44% | 92.4% | - | - | - |
| | 96.45% | - | - | 96.3% | 96.34% |
| | 98.59% | - | 98.07% | - | - |
| | 97.19% | 95.6% | 95.96% | - | - |
| | 98.59% | 97.4% | 97.5% | 97.5% | 97.88% |
| | 97.97% | - | 96.15% | - | - |
Performance evaluation after SMOTE with percentage split (80:20).
| | Accuracy | Precision | Recall | F-Measure | AUC |
|---|---|---|---|---|---|
| | 88.28% | 88.3% | 88.3% | 88.3% | 95.9% |
| | 89.06% | 89.1% | 89.1% | 89.1% | 95.8% |
| | 97.66% | 97.7% | 97.7% | 97.7% | 97.6% |
| | 92.97% | 93% | 93% | 93% | 98.5% |
| | 97.66% | 97.7% | 97.7% | 97.7% | 99.9% |
| | 99.22% | 99.2% | 99.2% | 99.2% | 98.9% |
| | 95.53% | 95.5% | 95.5% | 95.5% | 96.1% |
| | 96.87% | 96.9% | 96.9% | 96.9% | 99.4% |
| | 99.22% | 99.2% | 99.2% | 99.2% | 100% |
| | 97.66% | 97.7% | 97.7% | 97.7% | 97.7% |
| | 92.19% | 92.2% | 92.2% | 92.2% | 95.2% |
| | 97.66% | 97.7% | 97.7% | 97.7% | 99.9% |
| | 92.97% | 93% | 93% | 93% | 97.5% |
| | 93.75% | 93.8% | 93.8% | 93.8% | 93.7% |
| | 99.20% | 99.2% | 99.2% | 99.2% | 100% |
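The two preprocessing steps named in the caption, SMOTE oversampling followed by an 80:20 percentage split, can be sketched as follows. This is a simplified illustration under stated assumptions: features are treated as numeric so that interpolation is meaningful, the functions `smote` and `percentage_split` are hypothetical names, and the order (oversample, then split) simply follows the caption rather than any detail confirmed on this page.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority rows.

    Each synthetic row lies on the segment between a randomly chosen
    minority row and one of its k nearest minority neighbours.
    Requires at least two minority rows.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


def percentage_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of rows and cut it into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]
```

For example, `percentage_split(list(range(10)))` yields eight training rows and two test rows, and `smote(minority_rows, n_new)` would be called with `n_new` equal to the gap between the majority and minority class counts to balance the dataset.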
Model comparison in terms of accuracy with percentage split (80:20).
| | Accuracy | | | |
|---|---|---|---|---|
| | | | | |
| Proposed models | 89.06% | 92.97% | 95.53% | 99.22% |
| [ | 88% | 91% | 95% | 99% |