| Literature DB >> 28738059 |
Manal Alghamdi, Mouaz Al-Mallah, Steven Keteyian, Clinton Brawner, Jonathan Ehrman, Sherif Sakr.
Abstract
Machine learning is becoming a popular and important approach in the field of medical research. In this study, we investigate the relative performance of various machine learning methods, such as Decision Tree, Naïve Bayes, Logistic Regression, Logistic Model Tree, and Random Forests, for predicting incident diabetes using medical records of cardiorespiratory fitness. In addition, we apply different techniques to uncover potential predictors of diabetes. This FIT Project study used data from 32,555 patients who were free of any known coronary artery disease or heart failure, who underwent clinician-referred exercise treadmill stress testing at Henry Ford Health Systems between 1991 and 2009, and who had a complete 5-year follow-up. By the end of the fifth year, 5,099 of those patients had developed diabetes. The dataset contained 62 attributes classified into four categories: demographic characteristics, disease history, medication use history, and stress test vital signs. We developed an ensemble-based predictive model using 13 attributes that were selected based on their clinical importance, Multiple Linear Regression, and Information Gain Ranking methods. The negative effect of class imbalance on the constructed model was handled by the Synthetic Minority Oversampling Technique (SMOTE). The overall performance of the predictive model was improved by an ensemble machine learning approach using the Vote method with three decision-tree classifiers (Naïve Bayes Tree, Random Forest, and Logistic Model Tree), achieving high prediction accuracy (AUC = 0.92). The study shows the potential of the ensembling and SMOTE approaches for predicting incident diabetes using cardiorespiratory fitness data.
Year: 2017 PMID: 28738059 PMCID: PMC5524285 DOI: 10.1371/journal.pone.0179805
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
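The abstract's final classifier fuses three tree-based models through the Vote ensemble method. As an illustrative sketch only (the study used WEKA's Vote meta-classifier; the function name and toy data here are ours), an unweighted majority vote over per-model predictions looks like:

```python
from collections import Counter

def majority_vote(model_predictions):
    """Combine predictions from several models by unweighted majority vote.

    model_predictions: one list of class labels per model, all the same length.
    Returns one fused label per instance.
    """
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*model_predictions)]

# Three hypothetical models voting on four patients (1 = incident diabetes):
fused = majority_vote([[1, 0, 1, 0],
                       [1, 1, 0, 0],
                       [0, 1, 1, 0]])
# fused == [1, 1, 1, 0]
```

With an odd number of voters and binary classes there are no ties, which is one reason three base models is a convenient choice.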
Ranking of the dataset attributes based on their Information Gain (IG).
| Rank | Attribute | Information Gain (IG) |
|---|---|---|
| 14 | Coronary Artery Disease | 0.009904 |
| 15 | Nitrate Use | 0.009702 |
| 16 | Diuretic Use | 0.006716 |
| 17 | Beta Blocker Use | 0.006402 |
| 18 | Sex | 0.005626 |
| 19 | Smoking | 0.004923 |
| 20 | Plavix Use | 0.009904 |
| 21 | Angiotensin | 0.001397 |
| 22 | Angiotensin Receptor Blockers Use | 0.001154 |
| 23 | Other Hypertension Medication Use | 0.001132 |
| 24 | Prior Cerebrovascular Accident | 0.0008 |
| 25 | Congestive Heart Failure | 0.000777 |
| 26 | Calcium Channel Blocker | 0.000242 |
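Information Gain scores an attribute by how much knowing its value reduces entropy in the diabetes outcome. The study used an Information Gain Ranking method; a minimal pure-Python sketch of the underlying computation for discrete attributes (helper names are ours, not the paper's):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """IG(feature) = H(labels) - sum over values v of P(feature=v) * H(labels | feature=v)."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

A perfectly predictive binary attribute on a 50/50 outcome yields IG = 1 bit; the small values in the table reflect both weak individual predictors and the skewed roughly 84/16 class distribution.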
Number of instances decreased by the Random Under-Sampling technique.
| Distribution Spread | Class “No” | Class “Yes” |
|---|---|---|
| 2.50 | 12747 | 5099 |
| 1.50 | 7648 | 5099 |
| 1.00 | 5099 | 5099 |
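A distribution spread of s keeps roughly s majority-class ("No") instances for every minority ("Yes") instance; the counts above are consistent with ⌊s × 5,099⌋ (e.g. ⌊2.50 × 5,099⌋ = 12,747). A sketch of such an under-sampler (function name is ours; the study's exact resampling tool is not specified in this excerpt):

```python
import random

def spread_undersample(X, y, spread, majority="No", seed=42):
    """Randomly drop majority-class instances until at most
    spread * n_minority remain; keep every minority instance."""
    rng = random.Random(seed)
    maj = [i for i, label in enumerate(y) if label == majority]
    mino = [i for i, label in enumerate(y) if label != majority]
    keep = rng.sample(maj, min(len(maj), int(spread * len(mino))))
    kept = sorted(keep + mino)
    return [X[i] for i in kept], [y[i] for i in kept]
```

At spread 1.00 the two classes end up exactly balanced, matching the table's last row.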
Number of instances increased by the SMOTE technique.
| Percentage of SMOTE Increase | Class “No” | Class “Yes” |
|---|---|---|
| 100% | 27456 | 10198 |
| 200% | 27456 | 15297 |
| 300% | 27456 | 20396 |
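SMOTE at p% adds ⌊p/100 × 5,099⌋ synthetic "Yes" instances by interpolating between a minority sample and one of its nearest minority-class neighbours, which is why the "Yes" column doubles at 100% (10,198) and quadruples at 300% (20,396). A simplified sketch using only the single nearest neighbour (standard SMOTE samples among k neighbours, typically k = 5):

```python
import random

def smote_1nn(minority, pct, seed=0):
    """Generate pct% synthetic minority samples (tuples of floats) by
    interpolating each chosen sample toward its nearest minority neighbour."""
    rng = random.Random(seed)
    n_new = len(minority) * pct // 100
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # nearest other minority sample by squared Euclidean distance
        nn = min((s for s in minority if s != a),
                 key=lambda s: sum((x - z) ** 2 for x, z in zip(a, s)))
        gap = rng.random()  # random point on the segment a -> nn
        synthetic.append(tuple(x + gap * (z - x) for x, z in zip(a, nn)))
    return synthetic
```

Because synthetic points lie on segments between real minority samples, they stay inside the minority class's region of feature space rather than being exact duplicates, which is what distinguishes SMOTE from simple random oversampling.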
Fig 1. ROC performance of classification models on the imbalanced dataset using the G1 attributes.
Fig 2. ROC performance of classification models on the imbalanced dataset using the G2 attributes.
Evaluation of the performance of classification models on the imbalanced dataset using the G1 attributes.
| Model | Kappa | Recall (%) | Specificity (%) | Precision (%) | Accuracy (%) | F1-Score |
|---|---|---|---|---|---|---|
| | 2.45 | 98.5 | 3.0 | 84.5 | 83.58 | 91 |
| | 5.93 | 98.1 | 5.8 | 84.9 | 83.64 | 91 |
| | (15.4) | 87.4 | (27.7) | (86.7) | 78.8 | 87.1 |
| | 0.92 | 99.8 | 0.70 | 84.4 | 84.29 | (91.5) |
| | 0.7 | (99.9) | 0.6 | 84.4 | (84.3) | (91.5) |
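Every column in these tables derives from the binary confusion matrix, and the pattern above (recall near 100%, specificity and Kappa near 0) is the signature of a classifier that almost always predicts the majority class. The standard definitions, as a sketch:

```python
def confusion_metrics(tp, fn, fp, tn):
    """Recall, specificity, precision, accuracy, F1 and Cohen's kappa
    from a binary confusion matrix."""
    n = tp + fn + fp + tn
    recall = tp / (tp + fn)                 # a.k.a. sensitivity
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / n
    f1 = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: observed agreement corrected for chance agreement
    p_e = ((tp + fn) * (tp + fp) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (accuracy - p_e) / (1 - p_e)
    return {"recall": recall, "specificity": specificity,
            "precision": precision, "accuracy": accuracy,
            "f1": f1, "kappa": kappa}
```

As a hypothetical check: a degenerate model that labels all 100 of a sample as positive when 84 truly are scores recall 1.0 and accuracy 0.84 but specificity 0 and kappa 0, mirroring the rows above.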
Evaluation of the performance of classification models on the imbalanced dataset using the G2 attributes.
| Model | Kappa | Recall (%) | Specificity (%) | Precision (%) | Accuracy (%) | F1-Score |
|---|---|---|---|---|---|---|
| | 1.34 | 99.2 | 1.6 | 84.4 | 83.93 | 91.2 |
| | (3.63) | 99.2 | 3.1 | 84.6 | 84.14 | 91.3 |
| | 1.37 | 90.8 | (21.2) | (86.1) | 79.94 | 88.4 |
| | 0.70 | (99.9) | 0.50 | 84.4 | 84.32 | (91.5) |
| | 1.14 | 99.4 | 1.3 | 84.4 | 84.04 | 91.3 |
Fig 3. Performance of classification models on the balanced dataset using Random Under-Sampling.
Fig 4. Performance of classification models on the balanced dataset using SMOTE.
Evaluation of the performance of classification models on the imbalanced dataset using the G2 attributes.
| Model | ROC | Kappa | Recall (%) | Specificity (%) | Precision (%) | Accuracy (%) | F1-Score |
|---|---|---|---|---|---|---|---|
| | 92.2 | 76.8 | 99.7 | 74.7 | 84.1 | 89.0 | 91.3 |
| | 92.2 | 77 | 99.9 | 74.6 | 84.1 | 89.0 | 91.3 |
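The ROC column is the area under the ROC curve expressed as a percentage (the abstract's AUC = 0.92). AUC equals the probability that a randomly chosen positive instance is scored above a randomly chosen negative one, which gives a compact rank-based sketch (function name is ours):

```python
def auc(scores, labels, positive=1):
    """AUC via the Mann-Whitney formulation: fraction of positive/negative
    pairs where the positive instance gets the higher score (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Unlike accuracy, this measure is insensitive to the 84/16 class skew, which is why it is the headline metric for the final ensemble.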