Hasan T Abbas, Lejla Alic, Madhav Erraguntla, Jim X Ji, Muhammad Abdul-Ghani, Qammer H Abbasi, Marwa K Qaraqe.
Abstract
Diabetes is a large healthcare burden worldwide. There is substantial evidence that lifestyle modifications and drug intervention can prevent diabetes; early identification of high-risk individuals is therefore important for designing targeted prevention strategies. In this paper, we present an automatic tool that uses machine learning techniques to predict the development of type 2 diabetes mellitus (T2DM). Data generated from an oral glucose tolerance test (OGTT) were used to develop a predictive model based on the support vector machine (SVM). We trained and validated the models using the OGTT and demographic data of 1,492 healthy individuals collected during the San Antonio Heart Study. This study collected plasma glucose and insulin concentrations before glucose intake and at three time-points thereafter (30, 60 and 120 min). Furthermore, personal information such as age, ethnicity and body-mass index was also part of the data-set. From 11 OGTT measurements, we derived 61 features, ranked them using the minimum redundancy maximum relevance (mRMR) feature selection algorithm, and shortlisted the top ten. All possible combinations of the 10 best-ranked features were used to generate SVM-based prediction models. This research shows that an individual's plasma glucose levels, and the information derived therefrom, have the strongest predictive performance for the future development of T2DM. Significantly, insulin and demographic features do not provide additional performance improvement for diabetes prediction. The results of this work identify the parsimonious clinical data that need to be collected for an efficient prediction of T2DM. Our approach shows an average accuracy of 96.80% and a sensitivity of 80.09% obtained on a holdout set.
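The exhaustive search over feature subsets mentioned in the abstract can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes that "all possible combinations" means every non-empty subset of the ten shortlisted features, and it only enumerates the subsets (the SVM training step is omitted). The feature names are the ten mRMR-ranked names listed in this record; the function name is illustrative.

```python
from itertools import combinations

# Ten shortlisted features, in the mRMR rank order given in this record.
TOP_FEATURES = [
    "AuC-G0-120", "ΔG120-0", "ΔG120-60", "Ethnicity", "ΔI120-0",
    "ΔG60-0", "ΔG30-0", "ΔG60-30", "ΔI120-60", "ΔI60-0",
]


def all_feature_subsets(features):
    """Yield every non-empty subset of `features` as a tuple."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)


subsets = list(all_feature_subsets(TOP_FEATURES))
print(len(subsets))  # 2**10 - 1 = 1023 candidate feature sets
```

Under that assumption, each of the 1,023 subsets would define one candidate classifier per kernel, which is small enough to evaluate exhaustively.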
Year: 2019 PMID: 31826018 PMCID: PMC6905529 DOI: 10.1371/journal.pone.0219636
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Classification of the 1,492 subjects used in this study by ethnicity (MA: Mexican American; NHW: non-Hispanic white).
| Group | Healthy | T2DM | CVD | T2DM+CVD |
|---|---|---|---|---|
| Total | 1,277 (85.56%) | 161 (10.79%) | 44 (2.95%) | 10 (0.67%) |
| MA | 836 (83.77%) | 131 (13.13%) | 24 (2.40%) | 7 (0.70%) |
| NHW | 441 (89.27%) | 30 (6.07%) | 20 (4.05%) | 3 (0.61%) |
Fig 1. Box plots of glucose and insulin levels for healthy and diabetic subjects measured at the baseline OGTT.
A: Plasma glucose. B: Serum insulin.
Fig 2. Illustration of all 61 features extracted from the SAHS data-set.
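To make the feature names used in this record concrete, the sketch below derives a few glucose features from the four OGTT time points. The definitions are assumptions inferred from the names, not taken from the paper: AuC-G0-120 is taken to be the trapezoidal area under the glucose curve from 0 to 120 min, and ΔGa-b the difference between the glucose values at a and b min. The sample concentrations and helper names are hypothetical.

```python
# OGTT sampling times in minutes, as described in the abstract.
TIMES_MIN = [0, 30, 60, 120]


def auc_trapezoid(times, values):
    """Area under a sampled curve using the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        area += width * (values[i] + values[i - 1]) / 2
    return area


def glucose_features(glucose):
    """Derive a few illustrative features from four glucose readings."""
    g = dict(zip(TIMES_MIN, glucose))
    return {
        "AuC-G0-120": auc_trapezoid(TIMES_MIN, glucose),
        "ΔG120-0": g[120] - g[0],
        "ΔG120-60": g[120] - g[60],
        "ΔG60-0": g[60] - g[0],
    }


# Hypothetical plasma glucose readings (mg/dL) at 0, 30, 60 and 120 min.
features = glucose_features([90.0, 140.0, 130.0, 100.0])
print(features)
```

For these made-up readings the area is 3450 + 4050 + 6900 = 14400 mg·min/dL, and the 120-to-0 min difference is 10 mg/dL.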
List of the ten most relevant features, ranked by the mRMR algorithm.
| Rank | Feature |
|---|---|
| 1 | AuC-G0-120 |
| 2 | ΔG120-0 |
| 3 | ΔG120-60 |
| 4 | Ethnicity |
| 5 | ΔI120-0 |
| 6 | ΔG60-0 |
| 7 | ΔG30-0 |
| 8 | ΔG60-30 |
| 9 | ΔI120-60 |
| 10 | ΔI60-0 |
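The ranking above comes from minimum redundancy maximum relevance (mRMR) selection. As a rough sketch of the idea, the code below implements a greedy mRMR variant under stated assumptions: features are discrete, relevance and redundancy are both measured by mutual information, and the score is relevance minus mean redundancy (the "difference" form). The record does not say which variant or estimator the authors used, and the toy data and names are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items()
    )


def mrmr_rank(features, target, k):
    """Greedily pick k feature names: high relevance, low redundancy."""
    relevance = {
        name: mutual_information(col, target) for name, col in features.items()
    }
    selected, remaining = [], set(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        best = max(sorted(remaining), key=score)  # sorted() makes ties stable
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: the target is `a AND c`; `a_copy` duplicates `a` exactly.
toy = {
    "a": [0, 0, 0, 0, 1, 1, 1, 1],
    "a_copy": [0, 0, 0, 0, 1, 1, 1, 1],
    "c": [0, 0, 1, 1, 0, 0, 1, 1],
}
target = [0, 0, 0, 0, 0, 0, 1, 1]
ranking = mrmr_rank(toy, target, k=3)
print(ranking)  # the redundant copy of `a` is ranked last
```

Although `a_copy` is as relevant to the target as `a`, it is fully redundant with it, so the redundancy penalty pushes it behind the independent feature `c`.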
Fig 3. The g-mean of sensitivity and specificity for A: linear, and B: RBF kernels.
The best-performing feature combination is shown in a different colour scheme.
Fig 4. The classifier performance in terms of accuracy and sensitivity for the best feature combinations.
A: Linear kernel. B: RBF kernel.
Comparison of validation performance of the best SVM classifiers with previous studies.
| Model | Accuracy ± SD | Sensitivity ± SD | Specificity ± SD | g-mean ± SD |
|---|---|---|---|---|
| Linear SVM (10 features) | 95.55% ± 0.24% | 78.09% ± 0.33% | 97.87% ± 0.30% | 0.8742 ± 0.2100 |
| SVM-RBF (4 features) | 96.80% ± 0.41% | 80.09% ± 1.42% | 99.02% ± 0.33% | |
| SADPM [ | 56.329% | 88.80% | 52.00% | 0.6795 |
| Two-step Approach [ | 77.43% | 77.70% | 77.40% | 0.7755 |
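The g-mean column can be checked against the other columns, assuming g-mean denotes the geometric mean of sensitivity and specificity, as the Fig 3 caption suggests. The sketch below recomputes it from the mean sensitivity and specificity in the table. The value it produces for SVM-RBF is derived here and is not a figure reported in this record; an average of per-run g-means could differ slightly from it.

```python
from math import sqrt


def g_mean(sensitivity, specificity):
    """Geometric mean of sensitivity and specificity (both as fractions)."""
    return sqrt(sensitivity * specificity)


# (sensitivity, specificity) pairs taken from the comparison table.
reported = {
    "Linear SVM (10 features)": (0.7809, 0.9787),
    "SVM-RBF (4 features)": (0.8009, 0.9902),
    "SADPM": (0.8880, 0.5200),
    "Two-step Approach": (0.7770, 0.7740),
}

g_means = {
    name: round(g_mean(se, sp), 4) for name, (se, sp) in reported.items()
}
for name, value in g_means.items():
    print(f"{name}: {value:.4f}")
```

The recomputed values for the linear SVM, SADPM and the two-step approach agree with the tabulated 0.8742, 0.6795 and 0.7755, which supports reading the g-mean column as a fraction rather than a percentage.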