Md Maniruzzaman1,2, Md Jahanur Rahman1, Md Al-Mehedi Hasan3, Harman S Suri4, Md Menhazul Abedin5, Ayman El-Baz6, Jasjit S Suri7,8.
Abstract
Diabetes mellitus is a group of metabolic diseases in which blood sugar levels are too high. About 8.8% of the world's population was diabetic in 2017, and this figure is projected to reach nearly 10% by 2045. The major challenge is that machine learning-based classifiers applied to such data sets for risk stratification yield lower performance. Thus, our objective is to develop an optimized and robust machine learning (ML) system under the hypothesis that replacing missing values and outliers with computed medians will yield higher risk stratification accuracy. This ML-based risk stratification system is designed, optimized and evaluated as follows: features are extracted and optimized using six feature selection techniques (random forest, logistic regression, mutual information, principal component analysis, analysis of variance, and Fisher discriminant ratio) and combined with ten types of classifiers (linear discriminant analysis, quadratic discriminant analysis, naïve Bayes, Gaussian process classification, support vector machine, artificial neural network, Adaboost, logistic regression, decision tree, and random forest). The Pima Indian diabetes dataset (768 patients: 268 diabetic and 500 controls) was used. Our results demonstrate that replacing missing values and outliers by the group median and median values, respectively, and then combining random forest feature selection with random forest classification yields an accuracy, sensitivity, specificity, positive predictive value, negative predictive value and area under the curve of 92.26%, 95.96%, 79.72%, 91.14%, 91.20%, and 0.93, respectively. This is an improvement of 10% over techniques previously published in the literature. The system was validated for its stability and reliability.
The RF-based model showed the best performance when outliers were replaced by median values.
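The group-median imputation described in the abstract (each missing entry replaced by the median of that feature within the sample's own class) can be sketched in plain Python. This is a minimal illustration of the idea, not the authors' code; `None` stands in for a missing entry:

```python
from statistics import median

def group_median_impute(rows, labels):
    """Replace missing entries (None) in each feature column with the
    median of that column computed within the sample's own class."""
    n_features = len(rows[0])
    # per-class, per-feature medians over the observed (non-missing) values
    medians = {}
    for cls in set(labels):
        for j in range(n_features):
            observed = [r[j] for r, y in zip(rows, labels)
                        if y == cls and r[j] is not None]
            medians[(cls, j)] = median(observed)
    return [[medians[(y, j)] if r[j] is None else r[j] for j in range(n_features)]
            for r, y in zip(rows, labels)]

# toy example: one glucose-like feature, one missing value per class
X = [[100.0], [None], [150.0], [None], [160.0]]
y = [0, 0, 1, 1, 1]
print(group_median_impute(X, y))  # class 0 gap -> 100.0, class 1 gap -> 155.0
```

Using the class-conditional median (rather than the overall median) preserves the separation between diabetic and control distributions, which is what the paper's hypothesis relies on.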
Keywords: Diabetes; Feature selection; Machine learning; Missing values; Outliers; Risk stratification
Year: 2018 PMID: 29637403 PMCID: PMC5893681 DOI: 10.1007/s10916-018-0940-7
Source DB: PubMed Journal: J Med Syst ISSN: 0148-5598 Impact factor: 4.460
Fig. 1 Preparation of diabetic data by missing value replacement and outlier removal
Demographics of the diabetic patient cohort
| SN | Attributes | Descriptions | Attribute type | Mean ± SD |
|---|---|---|---|---|
| 1 | Pregnant | Number of times pregnant | Continuous | 3.84 ± 3.36 |
| 2 | Glucose | Plasma glucose (2-h) | Continuous | 121.67 ± 30.46 |
| 3 | Pressure | Diastolic blood pressure (mm Hg) | Continuous | 72.38 ± 12.10 |
| 4 | Triceps | Triceps skin fold thickness (mm) | Continuous | 29.08 ± 8.89 |
| 5 | Insulin | 2-h serum insulin (μU/ml) | Continuous | 141.76 ± 89.10 |
| 6 | Mass | Body mass index (weight in kg/(height in m)²) | Continuous | 32.43 ± 6.88 |
| 7 | Pedigree | Diabetes pedigree function | Continuous | 0.47 ± 0.33 |
| 8 | Age | Age (years) | Continuous | 33.24 ± 11.76 |
| 9 | Class | Diabetic (268) vs. control (500) | Categorical | – |
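The outlier-handling step of Fig. 1 replaces outliers with the median. The record does not state the detection rule used, so the sketch below assumes Tukey's 1.5×IQR fence purely for illustration:

```python
from statistics import median

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves (Tukey's hinges)."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    return median(s[:half]), median(s[n - half:])

def replace_outliers_with_median(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the column
    median. The k*IQR fence is an assumed rule, not taken from the paper."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    med = median(values)
    return [med if (v < lo or v > hi) else v for v in values]

# a 0 mm Hg diastolic reading is physiologically impossible -> outlier
bp = [70, 72, 68, 74, 76, 0, 71]
print(replace_outliers_with_median(bp))  # the 0 becomes the median, 71
```

In the Pima dataset, such zeros in Glucose, Pressure, Triceps, Insulin and Mass are widely treated as encoded missing values, which is why a median-replacement policy changes classifier performance so markedly.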
Fig. 2 Architecture of the machine learning system
Fig. 3 Concept showing the hypothesized link between outlier removal and the performance of the ML system
Comparison of mean accuracy of different protocols between O1 and O2 over FST
| FST | O1: K2 | O1: K4 | O1: K5 | O1: K10 | O1: JK | O2: K2 | O2: K4 | O2: K5 | O2: K10 | O2: JK |
|---|---|---|---|---|---|---|---|---|---|---|
| F1 | 81.58 | 81.97 | 84.23 | | | 84.30 | 85.71 | 85.88 | | |
| F2 | 81.84 | 81.45 | 83.23 | 83.56 | 86.77 | 84.66 | 86.16 | 84.40 | 84.40 | 88.40 |
| F3 | 81.92 | 82.73 | 81.88 | 81.90 | 86.19 | 84.50 | 85.27 | 85.64 | 84.73 | 88.45 |
| F4 | 81.48 | 81.98 | 83.09 | 82.23 | 85.66 | 83.71 | 84.60 | 83.73 | 84.60 | 87.91 |
| F5 | 81.94 | 81.94 | 82.51 | 82.47 | 87.89 | 83.77 | 83.44 | 84.20 | 84.01 | 87.75 |
| F6 | 71.48 | 73.51 | 74.90 | 74.82 | 78.13 | 75.53 | 75.82 | 76.77 | 77.35 | 79.32 |
Bold values indicate the highest classification accuracy
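The protocol labels in these tables denote cross-validation schemes: K2, K4, K5 and K10 are K-fold cross-validation with K = 2, 4, 5 and 10, and JK is the jackknife (leave-one-out). A minimal index-splitting sketch, assuming contiguous folds without shuffling:

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds (JK when k == n)."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)  # spread the remainder
        folds.append(list(range(start, start + size)))
        start += size
    return folds

n = 768  # size of the Pima Indian dataset
for k, name in [(2, "K2"), (4, "K4"), (5, "K5"), (10, "K10"), (n, "JK")]:
    folds = kfold_indices(n, k)
    test = folds[0]                              # hold out one fold
    train = [i for f in folds[1:] for i in f]    # train on the rest
    print(name, "train:", len(train), "test:", len(test))
```

JK trains on 767 samples per run, which explains why the JK columns report markedly higher accuracies than K2 in the tables above.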
Comparison of all classifiers and FSTs over protocols in terms of accuracy for O1
| PT* | FST | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| K2 | F1 | 77.21 | 73.83 | 76.56 | 85.23 | 85.18 | 78.88 | 86.33 | 78.54 | 86.93 | |
| | F2 | 77.76 | 76.38 | 77.86 | 84.71 | 84.92 | 76.88 | 85.76 | 79.17 | 87.24 | 87.73 |
| | F3 | 77.24 | 74.27 | 77.03 | 83.88 | 83.93 | 81.98 | 87.16 | 78.57 | 86.25 | 88.88 |
| | F4 | 77.55 | 75.13 | 77.45 | 82.89 | 82.99 | 79.82 | 85.08 | 79.90 | 86.51 | 87.47 |
| | F5 | 77.73 | 75.36 | 77.55 | 84.48 | 85.08 | 78.70 | 85.78 | 79.56 | 87.21 | 87.97 |
| | F6 | 69.64 | 67.97 | 68.78 | 72.29 | 71.61 | 68.62 | 73.67 | 71.20 | 75.18 | 75.81 |
| K4 | F1 | 76.30 | 73.49 | 75.73 | 86.25 | 86.46 | 79.90 | 86.41 | 78.39 | 86.93 | |
| | F2 | 77.34 | 74.84 | 76.93 | 85.68 | 83.39 | 75.68 | 84.74 | 79.27 | 88.02 | 88.65 |
| | F3 | 78.02 | 75.26 | 77.71 | 85.52 | 84.90 | 81.56 | 87.55 | 80.10 | 86.93 | 89.79 |
| | F4 | 77.55 | 75.10 | 77.86 | 84.11 | 82.97 | 80.62 | 85.21 | 80.94 | 86.98 | 88.49 |
| | F5 | 78.96 | 76.72 | 78.28 | 85.68 | 86.35 | 80.10 | 85.00 | 80.73 | 87.66 | 89.06 |
| | F6 | 70.94 | 68.85 | 69.95 | 75.83 | 73.70 | 70.52 | 75.57 | 73.28 | 77.92 | 78.49 |
| K5 | F1 | 80.32 | 77.40 | 79.48 | 88.51 | 87.21 | 81.17 | 87.34 | 82.40 | 88.57 | |
| | F2 | 79.22 | 77.27 | 78.64 | 87.53 | 86.56 | 79.68 | 86.30 | 80.78 | 87.34 | 88.96 |
| | F3 | 77.47 | 73.77 | 76.62 | 84.81 | 84.22 | 81.36 | 85.91 | 78.96 | 87.01 | 88.70 |
| | F4 | 77.92 | 76.30 | 78.18 | 85.39 | 84.03 | 82.66 | 86.36 | 82.60 | 87.86 | 89.55 |
| | F5 | 77.79 | 75.19 | 77.21 | 85.58 | 84.81 | 81.04 | 86.30 | 80.32 | 87.73 | 89.16 |
| | F6 | 71.62 | 71.82 | 71.88 | 78.70 | 75.39 | 72.53 | 74.55 | 74.74 | 78.57 | 79.22 |
| K10 | F1 | 77.62 | 74.48 | 77.03 | 85.57 | 85.00 | 80.93 | 86.63 | 79.62 | 87.07 | 89.59 |
| | F2 | 78.18 | 76.36 | 78.44 | 88.05 | 85.58 | 78.31 | 88.70 | 81.69 | 89.35 | |
| | F3 | 76.75 | 73.38 | 76.10 | 84.81 | 83.12 | 81.69 | 85.71 | 80.39 | 86.88 | 90.13 |
| | F4 | 77.92 | 75.32 | 77.92 | 85.84 | 83.25 | 78.96 | 85.45 | 82.34 | 86.23 | 89.09 |
| | F5 | 76.75 | 75.58 | 77.53 | 86.88 | 85.19 | 81.43 | 84.29 | 80.91 | 87.01 | 89.09 |
| | F6 | 72.73 | 69.87 | 71.56 | 79.09 | 75.58 | 72.34 | 74.29 | 75.97 | 77.66 | 79.09 |
| JK | F1 | 77.92 | 76.05 | 77.44 | 89.01 | 90.41 | 82.16 | 99.92 | 78.32 | 89.28 | |
| | F2 | 78.27 | 81.22 | 78.78 | 88.12 | 89.24 | 83.20 | 99.49 | 79.16 | 90.23 | |
| | F3 | 78.09 | 76.24 | 77.04 | 88.49 | 89.99 | 83.82 | 99.87 | 78.37 | 90.03 | |
| | F4 | 77.77 | 75.60 | 77.34 | 86.77 | 88.97 | 82.98 | 99.62 | 80.29 | 87.31 | |
| | F5 | 83.67 | 83.84 | 82.84 | 88.42 | 88.28 | 79.06 | 98.63 | 84.13 | 88.59 | |
| | F6 | 70.10 | 70.20 | 68.88 | 77.30 | 76.63 | 75.82 | 96.02 | 71.26 | 76.97 | |
Bold values indicate the highest classification accuracy
Comparison of accuracy of all classifiers and FSTs over protocols for O2
| PT* | FST | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| K2 | F1 | 83.88 | 83.78 | 84.37 | 87.34 | 86.15 | 79.14 | 86.67 | 85.29 | 86.54 | |
| | F2 | 84.40 | 84.11 | 84.71 | 86.82 | 85.05 | 77.66 | 84.87 | 85.23 | 86.02 | 87.76 |
| | F3 | 83.39 | 83.41 | 83.70 | 86.48 | 85.21 | 79.87 | 85.42 | 84.82 | 85.05 | 87.66 |
| | F4 | 82.50 | 82.84 | 82.50 | 85.16 | 84.32 | 80.03 | 84.19 | 83.70 | 85.13 | 86.72 |
| | F5 | 83.10 | 83.83 | 83.10 | 85.81 | 83.93 | 77.63 | 84.11 | 84.27 | 84.82 | 86.75 |
| | F6 | 76.56 | 76.28 | 76.90 | 77.37 | 75.99 | 72.16 | 72.16 | 77.94 | 73.88 | 76.04 |
| K4 | F1 | 85.31 | 85.21 | 85.05 | 88.28 | 86.61 | 79.58 | 87.50 | 87.45 | 87.03 | |
| | F2 | 83.80 | 84.17 | 84.06 | 87.24 | 85.99 | 79.90 | 86.35 | 85.83 | 86.41 | 88.96 |
| | F3 | 85.10 | 84.43 | 84.90 | 88.28 | 86.61 | 79.79 | 86.09 | 86.61 | 85.83 | 89.48 |
| | F4 | 83.49 | 83.75 | 83.23 | 86.41 | 84.90 | 79.90 | 84.84 | 85.42 | 85.89 | 88.13 |
| | F5 | 82.45 | 82.50 | 82.40 | 86.15 | 83.91 | 77.71 | 83.33 | 84.01 | 85.26 | 86.67 |
| | F6 | 76.61 | 75.99 | 76.82 | 78.33 | 75.47 | 72.03 | 71.15 | 79.06 | 76.46 | 76.30 |
| K5 | F1 | 84.94 | 84.48 | 84.29 | 88.70 | 86.49 | 79.29 | 86.49 | 86.82 | 87.53 | |
| | F2 | 82.47 | 82.27 | 82.27 | 86.75 | 84.94 | 77.99 | 86.88 | 85.26 | 86.17 | 88.96 |
| | F3 | 84.22 | 84.61 | 83.77 | 88.31 | 86.82 | 79.94 | 87.08 | 85.91 | 86.10 | 89.61 |
| | F4 | 82.08 | 82.08 | 81.95 | 86.17 | 83.70 | 80.58 | 83.05 | 84.35 | 85.71 | 87.66 |
| | F5 | 83.12 | 83.38 | 83.38 | 87.08 | 84.35 | 79.16 | 82.21 | 84.94 | 85.91 | 88.44 |
| | F6 | 77.08 | 76.36 | 77.14 | 78.57 | 77.34 | 72.66 | 73.57 | 79.68 | 76.62 | 78.64 |
| K10 | F1 | 83.38 | 84.16 | 82.73 | 89.35 | 86.49 | 81.17 | 84.42 | 85.97 | 87.27 | |
| | F2 | 85.45 | 85.71 | 85.97 | 89.61 | 86.62 | 78.83 | 87.27 | 88.05 | 87.79 | 90.26 |
| | F3 | 82.86 | 82.60 | 83.38 | 88.05 | 85.19 | 76.75 | 86.62 | 86.62 | 86.49 | 88.70 |
| | F4 | 82.60 | 83.77 | 82.73 | 87.14 | 83.64 | 79.74 | 82.86 | 87.40 | 87.79 | 88.31 |
| | F5 | 82.86 | 83.12 | 82.73 | 87.79 | 85.06 | 76.49 | 82.73 | 86.36 | 85.97 | 87.01 |
| | F6 | 77.66 | 77.27 | 77.53 | 80.13 | 77.14 | 75.32 | 72.99 | 80.26 | 76.62 | 78.57 |
| JK | F1 | 84.12 | 84.31 | 83.74 | 88.43 | 88.66 | 80.40 | 99.82 | 84.78 | 90.20 | |
| | F2 | 84.01 | 84.79 | 84.03 | 88.72 | 88.30 | 79.14 | 99.24 | 85.66 | 90.14 | |
| | F3 | 84.12 | 84.31 | 83.74 | 88.44 | 88.62 | 80.50 | 99.82 | 84.78 | 90.20 | |
| | F4 | 83.45 | 84.00 | 82.76 | 87.30 | 87.45 | 81.67 | 99.81 | 84.13 | 88.58 | |
| | F5 | 81.10 | 82.26 | 83.66 | 88.17 | 89.09 | 83.21 | 99.82 | 81.39 | 90.23 | |
| | F6 | 71.10 | 70.32 | 69.88 | 78.30 | 77.63 | 77.82 | 97.02 | 72.26 | 76.97 | |
Bold values indicate the highest classification accuracy
*Protocol Types
Comparison of accuracy of classifiers between O1 and O2 over protocols and FSTs
| CT* | O1: F1 | O1: F2 | O1: F3 | O1: F4 | O1: F5 | O1: F6 | O2: F1 | O2: F2 | O2: F3 | O2: F4 | O2: F5 | O2: F6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C1 | 78.14 | 78.15 | 77.51 | 77.74 | 78.22 | 71.23 | 84.03 | 83.73 | 83.68 | 82.82 | 83.04 | 76.98 |
| C2 | 75.45 | 77.21 | 74.58 | 75.49 | 76.75 | 69.63 | 83.99 | 83.73 | 83.82 | 83.29 | 83.33 | 76.48 |
| C3 | 77.56 | 78.13 | 76.90 | 77.75 | 78.70 | 70.54 | 83.77 | 83.67 | 83.73 | 82.63 | 82.89 | 77.10 |
| C4 | 87.67 | 86.82 | 85.50 | 85.00 | 85.92 | 76.48 | 88.40 | 87.46 | 87.70 | 86.44 | 87.05 | 78.60 |
| C5 | 87.44 | 85.94 | 85.23 | 84.44 | 85.85 | 74.07 | 86.72 | 85.97 | 86.37 | 84.80 | 85.11 | 76.48 |
| C6 | 81.15 | 78.75 | 82.08 | 81.01 | 80.62 | 71.00 | 79.87 | 78.47 | 79.39 | 80.38 | 78.01 | 73.04 |
| C7 | 89.92 | 89.00 | 89.24 | 88.34 | 88.39 | 74.52 | 88.38 | 89.07 | 89.06 | 86.95 | 86.20 | 72.47 |
| C8 | 80.13 | 80.01 | 79.28 | 81.21 | 80.35 | 73.80 | 85.71 | 85.77 | 85.59 | 85.00 | 84.74 | 79.24 |
| C9 | 88.16 | 88.44 | 87.42 | 86.98 | 87.88 | 77.33 | 87.24 | 87.11 | 86.85 | 86.62 | 86.14 | 75.90 |
| C10 | | 91.25 | 91.50 | 90.92 | 90.84 | 78.15 | | 91.05 | 90.98 | 90.16 | 89.81 | 77.39 |
Bold values indicate the highest classification accuracy
* Classifier types
Fig. 4 Performance evaluation of the machine learning system
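The performance measures reported in the abstract (accuracy, sensitivity, specificity, PPV, NPV) all follow from a 2×2 confusion matrix; AUC additionally needs ranked classifier scores, so it is omitted here. A minimal sketch with hypothetical counts, not the paper's results:

```python
def stratification_metrics(tp, fn, fp, tn):
    """Standard confusion-matrix metrics used for risk stratification."""
    return {
        "accuracy":    (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": tp / (tp + fn),   # true-positive rate (diabetics caught)
        "specificity": tn / (tn + fp),   # true-negative rate (controls cleared)
        "ppv":         tp / (tp + fp),   # positive predictive value
        "npv":         tn / (tn + fn),   # negative predictive value
    }

# hypothetical counts on a 768-sample cohort, only to show the calculation
m = stratification_metrics(tp=240, fn=28, fp=40, tn=460)
print({k: round(v, 4) for k, v in m.items()})
```

On an imbalanced cohort such as this one (268 diabetic vs. 500 controls), accuracy alone is misleading, which is why the paper reports sensitivity, specificity, PPV and NPV alongside it.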
Fig. 5 Comparison of all classifiers over different FSTs based on RI for O1
Fig. 6 Comparison of all classifiers over different FSTs based on RI for O2
Comparison of all classifiers over different FSTs based on RI for O1
| FST | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 |
|---|---|---|---|---|---|---|---|---|---|---|
| F1 | 96.97 | 96.17 | 96.82 | 97.84 | 98.25 | 96.50 | 97.73 | 96.94 | 97.78 | |
| F2 | 96.52 | 96.04 | 96.34 | 97.56 | 97.68 | 96.42 | 97.56 | 97.22 | 97.58 | 98.01 |
| F3 | 96.08 | 95.22 | 95.88 | 96.98 | 97.14 | 96.18 | 97.37 | 96.73 | 97.53 | 98.01 |
| F4 | 96.25 | 95.95 | 96.40 | 97.43 | 97.51 | 97.11 | 97.34 | 97.43 | 97.44 | 98.31 |
| F5 | 96.79 | 96.49 | 97.13 | 97.44 | 97.03 | 96.77 | 97.68 | 97.17 | 97.13 | 98.24 |
| F6 | 96.23 | 96.07 | 95.42 | 96.40 | 95.94 | 95.59 | 96.45 | 96.32 | 96.78 | 97.57 |
Bold values indicate the highest classification accuracy
Comparison of all classifiers over different FSTs based on RI for O2
| FST | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 |
|---|---|---|---|---|---|---|---|---|---|---|
| F1 | 97.73 | 97.21 | 97.14 | 97.77 | 97.74 | 96.81 | 97.06 | 97.54 | 97.16 | |
| F2 | 97.17 | 97.02 | 97.33 | 97.52 | 97.24 | 96.43 | 97.05 | 97.34 | 97.19 | 97.91 |
| F3 | 97.00 | 96.43 | 96.80 | 97.59 | 97.47 | 96.41 | 96.69 | 97.31 | 97.22 | 97.72 |
| F4 | 97.11 | 96.32 | 97.10 | 97.41 | 96.77 | 95.58 | 96.50 | 97.44 | 96.93 | 97.84 |
| F5 | 97.23 | 97.21 | 97.01 | 97.67 | 97.34 | 96.67 | 96.53 | 97.35 | 97.39 | 97.73 |
| F6 | 96.24 | 96.18 | 96.06 | 97.38 | 96.18 | 95.34 | 95.96 | 96.96 | 96.15 | 96.46 |
Bold values indicate the highest classification accuracy
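The reliability index (RI) in the two tables above measures the stability of a classifier across cross-validation runs. The record does not give the formula, so the sketch below assumes the common definition RI = (1 − σ/μ) × 100, where μ and σ are the mean and standard deviation of the per-fold accuracies; an RI near 100 means near-identical accuracy in every fold:

```python
from statistics import mean, pstdev

def reliability_index(accuracies):
    """RI = (1 - sigma/mu) * 100 over per-fold accuracies
    (assumed definition, stated for illustration only)."""
    mu = mean(accuracies)
    sigma = pstdev(accuracies)  # population SD over the folds
    return (1.0 - sigma / mu) * 100.0

# tightly clustered fold accuracies -> RI close to 100 (a stable classifier)
print(round(reliability_index([91.2, 92.3, 91.8, 92.6, 92.0]), 2))  # 99.48
```

Under this definition the RI values of 95–98 in the tables correspond to fold-to-fold accuracy fluctuations of only a few percent of the mean.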
Comparative performance of our proposed method against previous studies
| SN | Authors | Year | Data size & class | # Features | MVIMa | ORTb | FSTc | # Selected features | Classifier types | Performance measure (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Karthikeyani et al. | 2012 | 768 | 8 | Mean | NA | NA | – | | ACC: 74.80 |
| 2 | Karthikeyani et al. | 2013 | 768 | 8 | NA | NA | PLS | 3 | | ACC: 74.40 |
| 3 | Kumari and Chitra | 2013 | 460 | 8 | NA | NA | NA | – | | ACC: 75.50 |
| 4 | Parashar et al. | 2014 | 768 | 8 | NA | NA | LDA | 2 | | ACC: 75.65 |
| 5 | Bozkurt et al. | 2014 | 768 | 8 | NA | NA | NA | – | AIS, | ACC: 76.00 |
| 6 | Iyer et al. | 2015 | 768 | 8 | Mean | | CFS | 4 | | ACC: 74.79 |
| 7 | Kumar Dewangan and Agrawal | 2015 | 768 | 8 | NA | NA | None | – | | ACC: 81.19 |
| 8 | Bashir et al. | 2016 | 768 | 8 | NA | NA | NA | – | NB, SVM, LR, QDA, KNN, RF, ANN, | ACC: 77.21 |
| 9 | Sivanesan et al. | 2017 | 768 | 8 | NA | NA | NA | – | J48 | ACC: 76.58 |
| 10 | Meraj Nabi et al. | 2017 | 768 | | NA | NA | NA | – | NB, | ACC: 80.43 |
| 11 | Maniruzzaman et al. | 2017 | 768 | 8 | Median | NA | NA | – | LDA, QDA, NB, | ACC: 81.97 |
| 12 | Proposed method | – | 768 | 8 | Group Median | Median | RF | 4 | LDA, QDA, NB, ANN, GPC, SVM, Adaboost, LR, DT, RF | ACC: 92.26 |
Bold values indicate the highest classification accuracy
aMissing value imputation method; bOutlier removal technique; cFeature selection technique
Fig. 7 Comparison of our proposed method against the existing methods in the literature. Red arrows show the proposed work