Yukai Li¹, Huling Li¹, Hua Yao².
Abstract
The focus of this study is the use of machine learning methods that combine feature selection with imbalance handling (the SMOTE algorithm) to classify and predict diabetes follow-up control satisfaction data. After feature selection and class balancing, diabetes follow-up data from the New Urban Area of Urumqi, Xinjiang, were used as input variables for a support vector machine (SVM), a decision tree, and two ensemble learning models (Adaboost and Bagging). The experimental results show that the Adaboost algorithm produces the best classification results: on the test set, the G-mean was 94.65%, the area under the ROC curve (AUC) was 0.9817, and the variables most important to classification (fasting blood glucose, age, and BMI) were identified. The decision tree model performed somewhat worse on the test set than the support vector machine and the ensemble learning models, but the predictions of all four classifiers are adequate. Compared with a single classifier, the ensemble learning algorithms improve classification accuracy to varying degrees. The Adaboost algorithm is therefore suitable for predicting diabetes follow-up control satisfaction data.
Year: 2018 PMID: 30112018 PMCID: PMC6077367 DOI: 10.1155/2018/7207151
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.809
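The modeling workflow the abstract describes (four classifiers compared on an imbalanced binary outcome) can be sketched with scikit-learn. This is an illustrative sketch only: the synthetic data, sample sizes, and default hyperparameters below are assumptions, not the study's actual data or settings.

```python
# Sketch of the paper's model comparison on synthetic imbalanced data.
# make_classification stands in for the (non-public) follow-up records;
# the ~5:1 class weights mimic the imbalance reported in the paper.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.83, 0.17], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(probability=True, random_state=0),
    "Adaboost": AdaBoostClassifier(random_state=0),
    "Bagging": BaggingClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```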
Analysis of control satisfaction of diabetes patients in the New Urban Area of Urumqi (n=3406).

| Variable | Control satisfactory (n=574) | Control unsatisfactory (n=2832) | χ² | P |
|---|---|---|---|---|
| Age, years, median (IQR) | 57 (49-65) | 54 (46-62) | - | - |
| Gender | | | 0.35 | 0.555 |
| male | 276 | 1400 | | |
| female | 298 | 1432 | | |
| Nationality | | | 28.05 | |
| Han | 479 | 2544 | | |
| Hui | 57 | 183 | | |
| others | 3 | 28 | | |
| Uighur | 35 | 77 | | |
| Education level | | | 12.62 | |
| junior high school | 193 | 866 | | |
| college specialties and above | 55 | 392 | | |
| high school / technical school | 96 | 559 | | |
| illiteracy and semi-literacy | 56 | 245 | | |
| primary school | 174 | 770 | | |
| Marital status | | | 2.79 | 0.248 |
| divorced / widowed | 59 | 362 | | |
| unmarried | 3 | 13 | | |
| married | 512 | 2457 | | |
| (variable label lost) | | | 73.96 | |
| clinical | 228 | 1673 | | |
| outpatient clinic | 333 | 1099 | | |
| others | 13 | 60 | | |
| Coronary heart disease | | | 9.07 | |
| no | 525 | 2462 | | |
| yes | 49 | 370 | | |
| Hypertension | | | 11.27 | |
| no | 311 | 1317 | | |
| yes | 263 | 1515 | | |
| High cholesterol | | | 25.17 | |
| no | 483 | 2579 | | |
| yes | 91 | 253 | | |
| (variable label lost) | | | 10.94 | |
| no | 270 | 1546 | | |
| yes | 304 | 1286 | | |
| (variable label lost) | | | 15.13 | |
| no | 278 | 1622 | | |
| yes | 296 | 1210 | | |
| (variable label lost) | | | 20.88 | |
| no | 187 | 666 | | |
| yes | 387 | 2166 | | |
| (variable label lost) | | | 8.48 | |
| no | 158 | 621 | | |
| yes | 416 | 2211 | | |
| (variable label lost) | | | 1.10 | 0.295 |
| no | 175 | 802 | | |
| yes | 399 | 2030 | | |
| (variable label lost) | | | 0.88 | 0.349 |
| no | 337 | 1722 | | |
| yes | 237 | 1110 | | |
| (variable label lost) | | | 5.72 | |
| no | 356 | 1903 | | |
| yes | 218 | 929 | | |
| (variable label lost) | | | 12.58 | |
| no | 333 | 1863 | | |
| yes | 241 | 969 | | |
| Follow-up method | | | 9.75 | |
| phone | 50 | 218 | | |
| home | 26 | 234 | | |
| clinic | 498 | 2380 | | |
| (variable label lost) | | | 78.86 | |
| poor | 8 | 13 | | |
| good | 327 | 2123 | | |
| fair | 239 | 696 | | |
| (variable label lost) | | | 191.40 | |
| poor | 98 | 103 | | |
| good | 254 | 1863 | | |
| fair | 222 | 866 | | |
| Medication use | | | 41.89 | |
| no medication | 80 | 421 | | |
| regular | 455 | 2356 | | |
| intermittent | 39 | 55 | | |
| Systolic blood pressure, mmHg, median (IQR) | 130 (120-140) | 130 (120-140) | - | - |
| Diastolic blood pressure, mmHg, median (IQR) | 78 (70-80) | 80 (70-84) | - | - |
| BMI, kg/m², median (IQR) | 25.36 (23.53-27.53) | 26.27 (24.14-28.43) | - | - |
| Fasting blood glucose, mmol/L, median (IQR) | 6.4 (6.0-6.8) | 8.7 (7.5-11.03) | - | - |
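The χ² statistics in the table are Pearson chi-square tests on the group counts. As a sanity check, the gender row (χ² = 0.35, P = 0.555) can be reproduced from its four counts with only the standard library; for one degree of freedom, the p-value reduces to erfc(√(χ²/2)).

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for a 2x2 table
    [[a, b], [c, d]], plus its p-value for 1 degree of freedom."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = 0.0
    for obs, r, cl in ((a, row1, col1), (b, row1, col2),
                       (c, row2, col1), (d, row2, col2)):
        exp = r * cl / n                      # expected count under independence
        chi2 += (obs - exp) ** 2 / exp
    p = math.erfc(math.sqrt(chi2 / 2))        # chi-square survival function, 1 df
    return chi2, p

# Gender row of the table: 276/1400 male, 298/1432 female
chi2, p = chi_square_2x2(276, 1400, 298, 1432)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")      # matches the reported 0.35 / 0.555
```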
Dataset description.
| Dataset | Sample distribution | Ratio | Description |
|---|---|---|---|
| Original data | 2832/574 | 5:1 | Original data with all instances |
| SMOTE-data | 2824/2870 | 1:1 | Dataset balanced by SMOTE oversampling |
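SMOTE balances the two classes by interpolating new minority-class samples between existing ones and their nearest minority neighbors. A minimal NumPy sketch of that idea (illustrative only; the paper uses the standard SMOTE algorithm, for which a library such as imbalanced-learn would normally be used):

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE sketch: for each synthetic point, pick a random
    minority sample, pick one of its k nearest minority neighbours,
    and interpolate at a random position on the joining segment."""
    rng = np.random.default_rng(rng)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]      # k nearest-neighbour indices per point
    out = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(len(X_min))       # random minority sample
        b = nn[a, rng.integers(k)]         # one of its k neighbours
        lam = rng.random()                 # interpolation factor in [0, 1)
        out[i] = X_min[a] + lam * (X_min[b] - X_min[a])
    return out

# Oversample a toy 2-D minority class from 6 points to 20
X_min = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [2., 1.], [1., 2.]])
X_syn = smote(X_min, n_new=20, k=3, rng=0)
```

Because every synthetic point lies on a segment between two real minority points, the oversampled class stays inside the convex hull of the original minority samples.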
Confusion matrix.
| | Predicted positive (1) | Predicted negative (0) |
|---|---|---|
| Actual positive (1) | TP | FN |
| Actual negative (0) | FP | TN |
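All of the evaluation metrics used in the paper follow directly from these four counts; in particular, the G-mean is the geometric mean of sensitivity and specificity. A quick check against the Adaboost row of the comparison table (the counts below are hypothetical, chosen only so that the rates match the reported 0.9576/0.9356; the paper's raw counts are not given in this excerpt):

```python
import math

def metrics(tp, fn, fp, tn):
    """Derive the paper's evaluation metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)               # true-positive rate (recall)
    specificity = tn / (tn + fp)               # true-negative rate
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    g_mean = math.sqrt(sensitivity * specificity)
    return sensitivity, specificity, accuracy, g_mean

# Hypothetical counts matching the Adaboost sensitivity/specificity
sens, spec, acc, g = metrics(tp=9576, fn=424, fp=644, tn=9356)
print(f"G-mean = {g:.4f}")                     # ~0.9465, matching the table
```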
Figure 1. General flowchart of modeling.
Comparison of prediction performance of the four models.
| Algorithms | Accuracy | Sensitivity | Specificity | G-mean | AUC |
|---|---|---|---|---|---|
| Decision Trees | 0.9115 | 0.9050 | 0.9181 | 0.9115 | 0.9115 |
| SVM | 0.9262 | 0.9408 | 0.9128 | 0.9267 | 0.9688 |
| Adaboost | 0.9484 | 0.9576 | 0.9356 | 0.9465 | 0.9817 |
| Bagging | 0.9115 | 0.9050 | 0.9181 | 0.9115 | 0.9164 |
Figure 2. ROC curves for (a) decision tree model, (b) SVM model, (c) Adaboost model, and (d) Bagging model.
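The AUC values reported above can also be computed without tracing the ROC curve, via the Mann-Whitney identity: AUC equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties counting one half). A pure-Python sketch:

```python
def auc(scores_pos, scores_neg):
    """AUC via the Mann-Whitney identity: the fraction of
    (positive, negative) pairs ranked correctly, ties counting 0.5.
    Equivalent to the area under the ROC curve."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# A scorer that ranks every positive above every negative gets AUC = 1.0
assert auc([0.9, 0.8], [0.1, 0.2]) == 1.0
# A scorer with no discriminating power hovers around 0.5
print(auc([0.3, 0.7, 0.5], [0.6, 0.4, 0.5]))   # 0.5 for this toy example
```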