| Literature DB >> 35178244 |
Koushik Chandra Howlader, Md Shahriare Satu, Md Abdul Awal, Md Rabiul Islam, Sheikh Mohammed Shariful Islam, Julian M W Quinn, Mohammad Ali Moni.
Abstract
Type 2 Diabetes (T2D) is a chronic disease characterized by abnormally high blood glucose levels due to insulin resistance and reduced pancreatic insulin production. The aim of this work was to identify T2D-associated features that can distinguish T2D sub-types for prognosis and treatment purposes. We therefore employed machine learning (ML) techniques to categorize T2D patients, using data from the Pima Indian Diabetes Dataset from the Kaggle ML repository. After data preprocessing, several feature selection techniques were used to extract feature subsets, which were then analyzed with a range of classification techniques. We compared the resulting classifications to identify the best classifiers by considering accuracy, kappa statistic, area under the receiver operating characteristic curve (AUROC), sensitivity, specificity, and logarithmic loss (logloss), and evaluated classifier performance using summary statistics over a resampling distribution. Generalized Boosted Regression modeling achieved the highest accuracy (90.91%), together with the highest kappa statistic (78.77%) and specificity (85.19%). Sparse Distance Weighted Discrimination, the Generalized Additive Model using LOESS, and Boosted Generalized Additive Models gave the maximum sensitivity (100%), highest AUROC (95.26%), and lowest logarithmic loss (30.98%), respectively. Notably, the Generalized Additive Model using LOESS was the top-ranked algorithm according to non-parametric Friedman testing. Among the features identified by these models, glucose level, body mass index, diabetes pedigree function, and age were consistently the best and most frequent predictors of outcome. These results demonstrate the utility of ML methods in constructing improved prediction models for T2D and identify outcome predictors for this Pima Indian population.
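The evaluation metrics named in the abstract (accuracy, Cohen's kappa, sensitivity, specificity, and log loss) can all be computed directly from a binary classifier's predicted labels and class probabilities. A minimal stdlib-only sketch, using made-up predictions rather than the paper's data:

```python
import math

def confusion(y_true, y_pred):
    """Counts for the 2x2 confusion matrix (positive class = 1, i.e. diabetic)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def metrics(y_true, y_pred, y_prob):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    n = tp + tn + fp + fn
    acc = (tp + tn) / n
    # Cohen's kappa: observed agreement corrected for chance agreement
    pe = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    kappa = (acc - pe) / (1 - pe)
    sens = tp / (tp + fn)   # recall on the diabetic class
    spec = tn / (tn + fp)
    eps = 1e-15             # clip probabilities so log() stays finite
    logloss = -sum(t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
                   for t, p in zip(y_true, y_prob)) / n
    return acc, kappa, sens, spec, logloss
```

The abstract reports log loss as a percentage (30.98%), i.e. the value above scaled by 100.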
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s13755-021-00168-2.
Keywords: Classifiers; Diabetes; Feature selection sets; Machine learning models; Prediction model
Year: 2022 PMID: 35178244 PMCID: PMC8828812 DOI: 10.1007/s13755-021-00168-2
Source DB: PubMed Journal: Health Inf Sci Syst ISSN: 2047-2501
Fig. 1Proposed methodology
Demographic details of the Pima Indian diabetes dataset
| Statistic | Pregnancies | Glucose | BloodPressure | Thickness | Insulin | BMI | DPF | Age |
|---|---|---|---|---|---|---|---|---|
| Feature type | Integer | Real | Real | Real | Real | Real | Real | Integer |
| Unit | Number of times | mg/dL | mm Hg | mm | μU/mL | kg/m² | — | years |
| Distinct count | 17 | 136 | 47 | 51 | 186 | 248 | 517 | 52 |
| Unique (%) | 2.20% | 17.70% | 6.10% | 6.60% | 24.20% | 32.30% | 67.30% | 6.80% |
| Mean | 3.8451 | 120.89 | 69.105 | 20.536 | 79.799 | 31.993 | 0.47188 | 33.241 |
| Range | 0–17 | 0–199 | 0–122 | 0–99 | 0–846 | 0–67.1 | 0.078–2.42 | 21–81 |
| Zeros (%) | 14.50% | 0.70% | 4.60% | 29.60% | 48.70% | 1.40% | 0.00% | 0.00% |
| 5-th percentile | 0 | 79 | 38.7 | 0 | 0 | 21.8 | 0.14035 | 21 |
| Q1 | 1 | 99 | 62 | 0 | 0 | 27.3 | 0.24375 | 24 |
| Median | 3 | 117 | 72 | 23 | 30.5 | 32 | 0.3725 | 29 |
| Q3 | 6 | 140.25 | 80 | 32 | 127.25 | 36.6 | 0.62625 | 41 |
| 95-th percentile | 10 | 181 | 90 | 44 | 293 | 44.395 | 1.1328 | 58 |
| Range (max − min) | 17 | 199 | 122 | 99 | 846 | 67.1 | 2.342 | 60 |
| IQR | 5 | 41.25 | 18 | 32 | 127.25 | 9.3 | 0.3825 | 17 |
| Standard deviation | 3.370 | 31.973 | 19.356 | 15.952 | 115.240 | 7.884 | 0.331 | 11.760 |
| Coef of variation | 0.876 | 0.264 | 0.280 | 0.777 | 1.444 | 0.246 | 0.702 | 0.354 |
| Kurtosis | 0.159 | 0.641 | 5.180 | -0.520 | 7.214 | 3.290 | 5.595 | 0.643 |
| MAD | 2.772 | 25.182 | 12.639 | 13.660 | 84.505 | 5.842 | 0.247 | 9.586 |
| Skewness | 0.902 | 0.174 | -1.844 | 0.109 | 2.272 | -0.429 | 1.920 | 1.130 |
| Sum | 2953 | 92847 | 53073 | 15772 | 61286 | 24570 | 362.4 | 25529 |
| Variance | 11.354 | 1022.2 | 374.65 | 254.47 | 13281 | 62.16 | 0.10978 | 138.3 |
| Memory size | 6.1 KB | 6.1 KB | 6.1 KB | 6.1 KB | 6.1 KB | 6.1 KB | 6.1 KB | 6.1 KB |
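The descriptive statistics in this table can be reproduced from the raw columns. A stdlib-only sketch with an illustrative sample (not the real Pima data); quartiles use linear interpolation, and MAD is assumed to mean the mean absolute deviation about the mean, which is consistent with the reported values (e.g. MAD 2.772 vs. standard deviation 3.370 for Pregnancies):

```python
import statistics

def quantile(sorted_x, q):
    """Linear-interpolation quantile over an already-sorted list."""
    pos = q * (len(sorted_x) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_x) - 1)
    return sorted_x[lo] + (pos - lo) * (sorted_x[hi] - sorted_x[lo])

def summarize(x):
    xs = sorted(x)
    mean = statistics.fmean(x)
    sd = statistics.stdev(x)  # sample standard deviation, as in the table
    q1, med, q3 = (quantile(xs, q) for q in (0.25, 0.5, 0.75))
    return {
        "mean": mean,
        "median": med,
        "IQR": q3 - q1,
        "coef_of_variation": sd / mean,
        "MAD": statistics.fmean(abs(v - mean) for v in x),
        "zeros_%": 100 * sum(v == 0 for v in x) / len(x),
    }
```

Note that the high zero percentages (e.g. 48.70% for Insulin) flag physiologically impossible values that the preprocessing step must address.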
Formulation of Various Feature Subsets
| FS | FST | Tool | SM/TS | Features |
|---|---|---|---|---|
| FS1 | IGAE, GRAE | Orange | Top 5 | Glucose, Age, BMI, Insulin, and Pregnancies |
| FS2 | GIAE, ANOVA, χ² test | Orange | Top 5 | Glucose, BMI, Age, DPF, and Pregnancies |
| FS3 | RFAE | Weka | Ranker, Top 5 | Glucose, Age, Pregnancies, Thickness, and BMI |
| FS4 | FCFS | Orange | Top 5 | Glucose, Age, BMI, DPF, and Insulin |
| FS5 | CFS | Weka | BFS, ES, RS, SS | Glucose, BMI, DPF, and Age |
| FS6 | FSE | Weka | BFS | Glucose, BMI, and Age |
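Several of the ranking techniques in this table are entropy-based; information gain attribute evaluation (IGAE), for instance, scores each feature by how much splitting on it reduces class entropy. An illustrative stdlib-only sketch, assuming binary labels and already-discretized feature values (Orange and Weka handle discretization internally):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """H(labels) minus the expected entropy after splitting on the feature."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        cond += (len(subset) / n) * entropy(subset)
    return entropy(labels) - cond

# Features would then be ranked by gain (higher = more informative), e.g.:
# ranked = sorted(features, key=lambda f: info_gain(columns[f], labels), reverse=True)
```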
Fig. 2Average (Minimum, Median, Mean, and Maximum) Results of Different Classifiers
Fig. 3Wireframe Contour of Average Best Classification Results for Individual Datasets
Classifier rankings and adjusted p-values from post hoc methods (Friedman test), based on average findings
| i | Classifier | Ranking | z | Unadjusted p | p (Bonferroni) | p (Holm) | p (Hochberg) |
|---|---|---|---|---|---|---|---|
| 1 | GAMLOESS | 3.00 | — | — | — | — | — |
| 2 | GAMBoost | 3.17 | 0.10 | 0.9240 | 8.3164 | 0.9240 | 0.9240 |
| 3 | GBM | 5.00 | 1.14 | 0.2526 | 2.2730 | 0.5458 | 0.5051 |
| 4 | SDWD | 5.33 | 1.33 | 0.1819 | 1.6373 | 0.5458 | 0.5051 |
| 5 | BGLM | 5.67 | 1.53 | 0.1271 | 1.1441 | 0.5167 | 0.5051 |
| 6 | GLM | 5.92 | 1.67 | 0.0952 | 0.8568 | 0.5167 | 0.4760 |
| 7 | NB | 6.00 | 1.72 | 0.0861 | 0.7751 | 0.5167 | 0.4760 |
| 8 | PLR | 6.67 | 2.10 | 0.0359 | 0.3235 | 0.2516 | 0.2516 |
| 9 | PMR | 6.92 | 2.24 | 0.0251 | 0.2254 | 0.2004 | 0.2004 |
| 10 | RLR | 7.33 | 2.48 | 0.0132 | 0.1186 | 0.1186 | 0.1186 |
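The z statistics and unadjusted p-values in this table follow the standard Friedman post-hoc comparison against the best-ranked classifier. A stdlib-only sketch, assuming the ranks are averaged over the six feature subsets FS1–FS6 with k = 10 classifiers (consistent with the table, e.g. a rank difference of 2.00 giving z = 1.14); the Bonferroni column reports the uncapped m·p, as the table does (9 × 0.9240 = 8.3164):

```python
import math

def p_from_z(z):
    """Two-sided p-value for a standard normal z score."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def posthoc_vs_best(avg_ranks, k, n_datasets):
    """avg_ranks: {classifier: average Friedman rank} over n_datasets datasets;
    k is the total number of classifiers compared."""
    se = math.sqrt(k * (k + 1) / (6 * n_datasets))
    best = min(avg_ranks, key=avg_ranks.get)
    rows = [(name, (r - avg_ranks[best]) / se)
            for name, r in avg_ranks.items() if name != best]
    m = len(rows)  # number of pairwise comparisons against the best
    return best, [(name, round(z, 2), round(p_from_z(z), 4),
                   round(m * p_from_z(z), 4)) for name, z in rows]
```

Holm and Hochberg corrections additionally order the unadjusted p-values and apply step-down/step-up multipliers, which is why they are uniformly less conservative than Bonferroni in the table.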