Yazan Jian, Michel Pasquier, Assim Sagahyroon, Fadi Aloul.
Abstract
Diabetes mellitus (DM) is a chronic, potentially life-threatening disease. Over time it can affect any part of the body, resulting in serious complications such as nephropathy, neuropathy, and retinopathy. In this work, several supervised classification algorithms were applied to build models that predict eight diabetes complications: metabolic syndrome, dyslipidemia, neuropathy, nephropathy, diabetic foot, hypertension, obesity, and retinopathy. The study used a dataset collected by the Rashid Center for Diabetes and Research (RCDR) in Ajman, UAE, consisting of 884 records with 79 features. Preprocessing steps were applied to handle missing values and class imbalance. Furthermore, feature selection was performed to select the top five and top ten features for each complication. The final number of records used to train the binary classifier for each complication was as follows: 428 for metabolic syndrome, 836 for dyslipidemia, 223 for neuropathy, 233 for nephropathy, 240 for diabetic foot, 586 for hypertension, 498 for obesity, and 228 for retinopathy. Repeated stratified k-fold cross-validation (with k = 10 and a total of 10 repetitions) was employed for a better estimation of performance. The models were evaluated using accuracy and F1-score, which reached maxima of 97.8% and 97.7%, respectively. Moreover, comparing the performance achieved with the different attribute sets showed that adequate classifiers can still be built from a small selected subset of features.
Keywords: diabetes complications; diabetes prediction; supervised learning
Year: 2021 PMID: 34946438 PMCID: PMC8702133 DOI: 10.3390/healthcare9121712
Source DB: PubMed Journal: Healthcare (Basel) ISSN: 2227-9032
Figure 1. The developed workflow for diabetes complications prediction.
RMSE results for each imputation method.
| Method | BMI | Triglycerides | Total RMSE |
|---|---|---|---|
| MissForest | 0.6264 | 1.2051 | 15.962 |
|  | 0.9711 | 1.3514 | 18.560 |
| Mean substitution | 0.8972 | 1.3378 | 19.788 |
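As a rough illustration of how an RMSE score like those in the table can be obtained, the sketch below hides some known values, imputes them, and measures the error on the hidden positions only. It uses mean substitution because it is the simplest method listed; the helper names and the sample values are hypothetical and are not taken from the RCDR dataset or the authors' code.

```python
import math

def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def rmse(true_values, imputed_values, hidden_idx):
    """Root-mean-square error over the positions that were hidden."""
    squared = [(true_values[i] - imputed_values[i]) ** 2 for i in hidden_idx]
    return math.sqrt(sum(squared) / len(squared))

# Illustrative BMI-like values (made up for this sketch).
truth = [24.0, 31.0, 28.0, 35.0, 22.0]
hidden_idx = [1, 4]  # positions we pretend are missing

# Mask the chosen positions, impute them, then score the imputation.
masked = [None if i in hidden_idx else v for i, v in enumerate(truth)]
imputed = mean_impute(masked)
score = rmse(truth, imputed, hidden_idx)
print(f"RMSE on hidden values: {score:.4f}")
```

A lower score means the imputed values sit closer to the values that were hidden, which is the sense in which the methods in the table are compared.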
Figure 2. Original and generated albumin test values using MissForest.
Categorical data after applying one-hot encoding.
| Idx | Gender Female | Gender Male | Nationality Name | Nationality Name | … | Type 2 Diabetes, Adult Onset | Type 1 Diabetes, Adult Onset |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | … | 1 | 0 |
| 1 | 0 | 1 | 0 | 1 | … | 1 | 0 |
| 2 | 1 | 0 | 0 | 1 | … | 0 | 1 |
| 3 | 1 | 0 | 0 | 1 | … | 1 | 0 |
| 4 | 1 | 0 | 0 | 1 | … | 1 | 0 |
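The binary columns in the table can be produced by one-hot encoding, where each category of a column becomes its own 0/1 indicator column. The following is a minimal sketch in plain Python; the `one_hot` helper and the three sample records are assumptions made for illustration, with gender values chosen to agree with rows 0 to 2 of the table.

```python
def one_hot(records, column):
    """Expand one categorical column into one 0/1 column per category."""
    categories = sorted({r[column] for r in records})
    encoded = []
    for r in records:
        # Copy every field except the categorical one being expanded.
        row = {key: value for key, value in r.items() if key != column}
        for category in categories:
            row[f"{column} {category}"] = 1 if r[column] == category else 0
        encoded.append(row)
    return encoded

# Hypothetical records before encoding.
patients = [
    {"Idx": 0, "Gender": "Male"},
    {"Idx": 1, "Gender": "Male"},
    {"Idx": 2, "Gender": "Female"},
]

encoded = one_hot(patients, "Gender")
for row in encoded:
    print(row)
```

Exactly one indicator per original column is set to 1 in each row, which is why the Gender Female and Gender Male values in the table are always complementary.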
Figure 3. Class distributions for each complication.
Figure 4. Undersampling using cluster centroids: (a) the data points before applying cluster centroids; (b) the result after undersampling the dataset.
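One common formulation of cluster-centroid undersampling runs k-means on the majority class, with the number of clusters set to the minority-class size, and keeps only the resulting centroids. The sketch below follows that formulation in plain Python on made-up 2-D points. It is an assumption about the general technique, not a reproduction of the authors' implementation, and all function names are hypothetical.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_centroids(points, k, iterations=20, seed=0):
    """Plain k-means (Lloyd's algorithm); returns the k final centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids

def cluster_centroid_undersample(majority, minority, seed=0):
    """Replace the majority class by as many centroids as there are minority points."""
    reduced_majority = kmeans_centroids(majority, k=len(minority), seed=seed)
    return reduced_majority, list(minority)

# Made-up points: six majority samples, two minority samples.
majority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
minority = [(2.0, 2.0), (3.0, 3.0)]

reduced_majority, kept_minority = cluster_centroid_undersample(majority, minority)
print(len(reduced_majority), len(kept_minority))
```

After this step both classes have the same number of points. Note that the retained majority points are synthetic centroids rather than original records.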
Figure 5. Class distributions for each complication after handling the imbalance problem.
Figure 6. The use of k-fold cross-validation (KCV) for both hyperparameter tuning and training [9].
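The abstract describes repeated stratified k-fold cross-validation with k = 10 and 10 repetitions. The sketch below shows, in plain Python, how such splits can be generated so that every test fold keeps roughly the same class proportions as the full data. The function names and the 60/40 label vector are illustrative assumptions; in practice a library routine would normally be used instead.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k, rng):
    """Yield (train_idx, test_idx) pairs with class proportions roughly preserved."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for position, i in enumerate(indices):
            folds[position % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test

def repeated_stratified_kfold(labels, k=10, repeats=10, seed=0):
    """Repeat stratified k-fold with a fresh shuffle on every repetition."""
    rng = random.Random(seed)
    for _ in range(repeats):
        yield from stratified_kfold(labels, k, rng)

# Made-up binary labels: 60 negatives and 40 positives.
labels = [0] * 60 + [1] * 40
splits = list(repeated_stratified_kfold(labels, k=10, repeats=10))
print(len(splits))
```

With k = 10 and 10 repetitions, each model is trained and scored 100 times, and the reported metric is typically the average over those runs.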
Baseline model performance.
| Complication | Accuracy | F1-Score |
|---|---|---|
| Metabolic syndrome | 0.5397 | 0.3505 |
| Dyslipidemia | 0.5263 | 0.3448 |
| Hypertension | 0.5256 | 0.3445 |
| Obesity | 0.5402 | 0.3507 |
| Neuropathy | 0.5874 | 0.3701 |
| Nephropathy | 0.5880 | 0.3703 |
| Diabetic Foot | 0.5875 | 0.3701 |
| Retinopathy | 0.5877 | 0.3702 |
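The source does not state how the baseline was constructed. However, the reported pairs are numerically consistent with a classifier that always predicts the majority class, scored with macro-averaged F1: if the majority class has prevalence p, accuracy is p and macro F1 is p / (1 + p). The sketch below checks this reading against the metabolic syndrome row; the interpretation and the function name are assumptions, not statements from the paper.

```python
def majority_baseline_scores(p):
    """Accuracy and macro F1 when always predicting a majority class of prevalence p."""
    accuracy = p
    # Majority class: precision = p, recall = 1, so F1 = 2p / (1 + p).
    f1_majority = 2 * p / (1 + p)
    # Minority class is never predicted, so its F1 is taken as 0.
    f1_minority = 0.0
    macro_f1 = (f1_majority + f1_minority) / 2
    return accuracy, macro_f1

# Metabolic syndrome row of the baseline table: accuracy 0.5397, F1-score 0.3505.
accuracy, macro_f1 = majority_baseline_scores(0.5397)
print(f"accuracy={accuracy:.4f}, macro F1={macro_f1:.4f}")
```

The other rows agree with the same formula to within rounding of the reported accuracy, which suggests the trained classifiers should be judged against a floor of roughly 0.53 to 0.59 accuracy rather than 0.50.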
Summary of all experiments for the selection of the best-performing classifier for each diabetes complication.
| Complication | Algorithm | All Attributes: Accuracy | All Attributes: F1-Score | Top 10: Accuracy | Top 10: F1-Score | Top 5: Accuracy | Top 5: F1-Score |
|---|---|---|---|---|---|---|---|
|  | LR |  |  |  |  |  |  |
|  | SVM Linear | 0.763 | 0.762 | 0.746 | 0.744 | 0.69 | 0.684 |
|  | CART (DT) | 0.646 | 0.639 | 0.649 | 0.643 | 0.651 | 0.646 |
|  | RF | 0.753 | 0.75 | 0.703 | 0.7 | 0.682 | 0.679 |
|  | AdaBoost | 0.74 | 0.738 | 0.698 | 0.694 | 0.673 | 0.67 |
|  | XGBoost | 0.738 | 0.735 | 0.703 | 0.7 | 0.707 | 0.704 |
|  | LR | 0.697 | 0.677 | 0.694 | 0.666 | 0.695 | 0.659 |
|  | SVM Linear | 0.693 | 0.66 | 0.685 | 0.65 | 0.691 | 0.654 |
|  | CART (DT) | 0.649 | 0.646 | 0.649 | 0.646 | 0.637 | 0.634 |
|  | RF |  |  |  |  |  |  |
|  | AdaBoost | 0.695 | 0.692 | 0.66 | 0.658 | 0.621 | 0.618 |
|  | XGBoost | 0.747 | 0.745 | 0.709 | 0.706 | 0.679 | 0.676 |
|  | LR | 0.735 | 0.732 | 0.726 | 0.723 | 0.702 | 0.698 |
|  | SVM Linear | 0.728 | 0.725 | 0.725 | 0.723 | 0.703 | 0.699 |
|  | CART (DT) | 0.678 | 0.675 | 0.676 | 0.673 | 0.687 | 0.685 |
|  | RF |  |  |  |  |  |  |
|  | AdaBoost | 0.707 | 0.705 | 0.673 | 0.67 | 0.607 | 0.604 |
|  | XGBoost | 0.725 | 0.724 | 0.701 | 0.699 | 0.689 | 0.688 |
|  | LR | 0.788 | 0.786 | 0.775 | 0.773 | 0.79 | 0.788 |
|  | SVM Linear | 0.793 | 0.791 | 0.79 | 0.788 | 0.774 | 0.77 |
|  | CART (DT) | 0.768 | 0.765 | 0.767 | 0.764 | 0.768 | 0.765 |
|  | RF |  |  |  |  |  |  |
|  | AdaBoost | 0.752 | 0.75 | 0.739 | 0.737 | 0.738 | 0.736 |
|  | XGBoost | 0.785 | 0.784 | 0.772 | 0.771 | 0.767 | 0.765 |
|  | LR | 0.778 | 0.764 | 0.757 | 0.744 | 0.708 | 0.688 |
|  | SVM Linear | 0.804 | 0.795 | 0.786 | 0.778 | 0.757 | 0.744 |
|  | CART (DT) | 0.704 | 0.688 | 0.712 | 0.697 | 0.68 | 0.661 |
|  | RF | 0.821 | 0.809 | 0.783 | 0.77 | 0.717 | 0.701 |
|  | AdaBoost | 0.811 | 0.802 | 0.779 | 0.769 | 0.708 | 0.693 |
|  | XGBoost |  |  |  |  |  |  |
|  | LR | 0.825 | 0.811 | 0.8 | 0.784 | 0.772 | 0.753 |
|  | SVM Linear | 0.852 | 0.844 | 0.819 | 0.805 | 0.822 | 0.807 |
|  | CART (DT) | 0.838 | 0.831 | 0.838 | 0.831 | 0.839 | 0.831 |
|  | RF | 0.898 | 0.896 | 0.892 | 0.889 | 0.891 | 0.886 |
|  | AdaBoost |  |  |  |  |  |  |
|  | XGBoost | 0.902 | 0.899 | 0.885 | 0.881 | 0.867 | 0.861 |
|  | LR | 0.893 | 0.888 | 0.868 | 0.862 | 0.843 | 0.837 |
|  | SVM Linear | 0.935 | 0.933 | 0.908 | 0.904 | 0.907 | 0.903 |
|  | CART (DT) | 0.86 | 0.856 | 0.865 | 0.86 | 0.86 | 0.855 |
|  | RF | 0.971 | 0.97 | 0.944 | 0.942 | 0.92 | 0.916 |
|  | AdaBoost | 0.941 | 0.939 | 0.935 | 0.933 | 0.918 | 0.915 |
|  | XGBoost |  |  |  |  |  |  |
|  | LR | 0.796 | 0.786 | 0.792 | 0.779 | 0.784 | 0.771 |
|  | SVM Linear | 0.818 | 0.812 | 0.801 | 0.789 | 0.792 | 0.774 |
|  | CART (DT) | 0.719 | 0.703 | 0.722 | 0.705 | 0.731 | 0.717 |
|  | RF | 0.848 | 0.842 | 0.832 | 0.825 | 0.801 | 0.793 |
|  | AdaBoost | 0.852 | 0.846 | 0.828 | 0.821 | 0.759 | 0.748 |
|  | XGBoost |  |  |  |  |  |  |
1 Numbers in bold highlight the best classifiers.
A comparison of recent works developed for predicting diabetes complications using machine learning.
| Source | Dataset Size | Best Model | Complication | Accuracy |
|---|---|---|---|---|
| Our study | 884 | XGBoost | Diabetic foot | 97.8% |
| [ | 455 | ID3 | Eye, kidney, heart and diabetic Hyperlipidemia | 92.35% |
| [ | 943 | LR | Retinopathy | 77.7% |
| [ | 779 | RF | Nephropathy | 89% |
Figure 7. The confusion matrix of the targets’ correlation.
Figure 8. The average time needed to train a model.