Amin Ul Haq1, Jian Ping Li1, Jalaluddin Khan1, Muhammad Hammad Memon1, Shah Nazir2, Sultan Ahmad3, Ghufran Ahmad Khan4, Amjad Ali5.
Abstract
Accurate detection of diabetes has received significant attention. Developing a diagnosis system that reliably detects diabetes in an e-healthcare environment remains a major challenge for the research community. Machine learning techniques play an emerging role in healthcare services by providing systems that analyze medical data for disease diagnosis. Existing diagnosis systems have drawbacks such as high computation time and low prediction accuracy. To address these issues, we propose a machine learning based diagnosis system for the detection of diabetes. The proposed method has been tested on a diabetes dataset, a clinical dataset compiled from patients' clinical histories. Model validation methods (hold-out, K-fold, and leave-one-subject-out) and performance evaluation metrics (accuracy, specificity, sensitivity, F1-score, receiver operating characteristic curve, and execution time) have been used to check the validity of the proposed system. We propose a filter method based on the Decision Tree (Iterative Dichotomiser 3) algorithm for selecting highly important features. Two ensemble learning algorithms, Ada Boost and Random Forest, are also used for feature selection, and we compare classifier performance against wrapper-based feature selection algorithms. A Decision Tree classifier has been used to classify healthy and diabetic subjects. The experimental results show that the features selected by the proposed feature selection algorithm improve the classification performance of the predictive model and achieve optimal accuracy. Additionally, the proposed system performs better than previous state-of-the-art methods. The high performance of the proposed method is attributed to the different combinations of selected feature sets; plasma glucose concentration, diabetes pedigree function, and body mass index are the most important features in the dataset for predicting diabetes. Furthermore, statistical analysis of the experimental results demonstrated that the proposed method can effectively detect diabetes and can be deployed in an e-healthcare environment.
Keywords: decision tree; diabetes disease; e-healthcare; feature selection; machine learning; medical data; performance
Year: 2020 PMID: 32384737 PMCID: PMC7249007 DOI: 10.3390/s20092649
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Mathematical symbols and notation used in the paper.
| Symbol | Description |
|---|---|
| | Data set |
| | Subset |
| | Feature set |
| | Number of instances in dataset |
| | Input features in dataset |
| | Predicted output class labels |
| | Bias: offset value from the origin |
| | d-dimensional coefficient vector |
| | Target labels for x |
| | Training set |
| | Test set |
| | Finite set |
| IG(F) | Information gain |
| Test probability value | |
| | Degrees of freedom |
| | Feature in dataset |
| MI | Mutual information |
| | ith feature in dataset |
| | Empty set |
| | Probability |
| | Null hypothesis |
| | Alternative hypothesis |
Figure 1. Feature selection process.
Figure 2. Flow chart of the proposed method of diabetes detection.
Description of the diabetes dataset with summary statistics.
| Feature Name | Feature Code | Description | Min–Max | Mean (± STD) |
|---|---|---|---|---|
| Pregnancies | PG | Number of times pregnant | 0.000000–17.000000 | 3.703500 (± 3.306063) |
| Glucose | GL | Plasma glucose concentration | 0.000000–199.000000 | 121.182500 (± 32.068636) |
| Blood Pressure | BP | Blood pressure (mm Hg) | 0.000000–122.000000 | 69.145500 (± 19.188315) |
| Skin Thickness | ST | Triceps skin fold thickness (mm) | 0.000000–110.000000 | 20.935000 (± 16.103243) |
| Insulin | IS | Serum insulin concentration | 0.000000–744.000000 | 80.254000 (± 111.180534) |
| BMI | BMI | Body mass index | 0.000000–80.600000 | 32.193000 (± 8.149901) |
| Diabetes Pedigree Function | DPF | Diabetes pedigree function | 0.078000–2.420000 | 0.470930 (± 0.323553) |
| Age | AGE | Age in years | 21.000000–81.000000 | 33.090500 (± 11.786423) |
| Outcome | 1 = yes, 0 = no | Diabetes = 1, Healthy = 0 | 0.000000–1.000000 | 0.342000 (± 0.474498) |
Figure 3. Histograms for the visual representation of features.
Figure 4. Heat map of the dataset.
Feature ranking and importance by decision tree (DT) (ID3) algorithm.
| S.No | Feature Label | Ranking | Score |
|---|---|---|---|
| 1 | PG | IS | 0.07605 |
| 2 | GL | ST | 0.07947 |
| 3 | BP | BP | 0.10179 |
| 4 | ST | PG | 0.11071 |
| 5 | IS | DPF | 0.11491 |
| 6 | BMI | BMI | 0.13829 |
| 7 | DPF | AGE | 0.14366 |
| 8 | AGE | GL | 0.23511 |
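ID3 chooses splits by information gain, the IG(F) quantity listed in the notation table. The following is a minimal sketch of that criterion, not the authors' implementation. It assumes a numeric feature is split at a single threshold (classic ID3 works on categorical attributes), and the function names and toy readings are hypothetical, chosen only for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting a numeric feature at `threshold`.

    Assumption: a binary split (value <= threshold vs. value > threshold).
    """
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


# Hypothetical toy data (not taken from the dataset): glucose readings
# and outcome labels, where 1 = diabetic and 0 = healthy.
glucose = [85, 90, 110, 150, 160, 180]
outcome = [0, 0, 0, 1, 1, 1]
gain = information_gain(glucose, outcome, threshold=120)
```

In this toy case the threshold separates the two classes perfectly, so the gain equals the full label entropy of 1 bit. A threshold below every reading leaves the labels unsplit and yields a gain of 0.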
Rank and score of features selected by DT (ID3), Ada Boost and Random Forest algorithms.
| S.No | Feature Set | DT (ID3) | Ada Boost | Random Forest |
|---|---|---|---|---|
| 1 | PG | GL | GL | BP |
| 2 | GL | AGE | BMI | GL |
| 3 | BP | IS | DPF | AGE |
| 4 | ST | DPF | BP | ST |
| 5 | IS | BMI | AGE | IS |
| 6 | BMI | BP | IS | BMI |
| 7 | DPF | PG | DPE | |
| 8 | AGE | | | |
Figure 5. Features selected by the DT (ID3) algorithm.
Classification performance on individual features, the full feature set, and the feature set without GL.
| Classifier | Feature | Acc (%) | Sn (%) | Sp (%) | MCC (%) | ROC-AUC (%) | K-Fold (%) | LOSO (%) | Time (s) |
|---|---|---|---|---|---|---|---|---|---|
| DT | GL | 75 | 45 | 88 | 67 | 67 | 77 | 76 | 0.001 |
| DT | BP | 68 | 8 | 74 | 52 | 53 | 67 | 66 | 0.005 |
| DT | BMI | 74 | 45 | 88 | 66 | 66 | 73 | 72 | 0.005 |
| DT | DPF | 84 | 66 | 87 | 78 | 78 | 84 | 83 | 0.002 |
| DT | IS | 73 | 34 | 92 | 64 | 63 | 73 | 73 | 0.001 |
| DT | ST | 68 | 14 | 95 | 54 | 54 | 65 | 66 | 0.001 |
| DT | PG | 69 | 27 | 90 | 59 | 58 | 69 | 70 | 0.0009 |
| DT | AGE | 70 | 40 | 85 | 62 | 63 | 70 | 71 | 0.0018 |
| DT | Full with GL | 98.2 | 100 | 97 | 99 | 99 | 99 | 99.8 | 0.006 |
| DT | Without GL | 97 | 75 | 82 | 97 | 97 | 99.5 | 99.7 | 0.005 |
Classification performance with the full feature set and with feature sets selected by filter FS algorithms.
| Feature Set Selection | Acc (%) | Sn (%) | Sp (%) | MCC (%) | Pre (%) | Rec (%) | F1 (%) | ROC (%) | K-Fold (%) | LOSO (%) | Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Full set | 98.2 | 98 | 97 | 97 | 99.8 | 98 | 98.6 | 98 | 99.2 | 99.6 | 0.006 |
| ID3 | 99 | 100 | 98 | 99 | 100 | 100 | 100 | 99.8 | 99.8 | 99.9 | 0.005 |
| Ada Boost | 98.5 | 98 | 99 | 98 | 98 | 98 | 99 | 98.6 | 99.3 | 99.6 | 0.004 |
| Random Forest | 98.3 | 98 | 98 | 98 | 95 | 98 | 99 | 98.7 | 99.4 | 99.7 | 0.006 |
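The metric columns above follow from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions, under which sensitivity and recall are the same quantity. The function name and the example counts are hypothetical and are not taken from the paper's experiments.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    All values are returned as fractions; multiply by 100 for percentages.
    """
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # also called recall
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (
        2 * precision * sensitivity / (precision + sensitivity)
        if precision + sensitivity
        else 0.0
    )
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=40, fp=5, tn=50, fn=5)
```

Unlike the other metrics, MCC ranges from -1 to 1, so it is shown as a percentage only after scaling by 100.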
Figure 6. Accuracy on the feature set selected by DT-ID3 with different validation methods.
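The validation methods compared here differ only in how records are partitioned into training and test sets. As a rough sketch, K-fold splitting can be written as an index generator. Setting k equal to the number of records gives leave-one-out, which matches leave-one-subject-out only under the assumption that each subject contributes exactly one record. The function name is hypothetical, and shuffling is omitted for brevity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


# Five folds over ten records; each record is tested exactly once.
folds = list(k_fold_indices(10, 5))

# k == n_samples reduces to leave-one-out.
leave_one_out = list(k_fold_indices(4, 4))
```

The reported score for each scheme is then typically the mean of the per-fold scores.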
Classification performance with and without the feature set selected by wrapper-based FS algorithms.
| Feature Set Selection | Acc (%) | Sn (%) | Sp (%) | MCC (%) | Pre (%) | F1 (%) | ROC (%) | K-Fold (%) | LOSO (%) | Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| SBS | 98 | 99 | 98 | 98 | 99 | 98 | 97.6 | 98.5 | 98.9 | 0.007 |
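SBS (sequential backward selection) is a wrapper method: it starts from the full feature set and repeatedly drops the feature whose removal hurts a scoring function least. The sketch below illustrates that greedy loop in general terms. In practice the score would be a cross-validated classifier metric; here a hypothetical additive scorer with made-up weights stands in so the example is self-contained.

```python
def sequential_backward_selection(features, score, target_size):
    """Greedily remove features until `target_size` remain.

    `score` is any callable that maps a list of feature names to a number,
    where higher is better.
    """
    if target_size < 1:
        raise ValueError("target_size must be at least 1")
    selected = list(features)
    while len(selected) > target_size:
        # Try dropping each feature in turn and keep the best remaining subset.
        candidates = [[f for f in selected if f != drop] for drop in selected]
        selected = max(candidates, key=score)
    return selected


# Made-up weights, for illustration only; they are not results from the paper.
toy_weights = {"GL": 0.5, "BMI": 0.3, "AGE": 0.15, "ST": 0.05}


def toy_score(subset):
    """Hypothetical stand-in for a cross-validated accuracy estimate."""
    return sum(toy_weights[f] for f in subset)


chosen = sequential_backward_selection(list(toy_weights), toy_score, target_size=2)
```

With this toy scorer the loop first drops ST, then AGE, leaving GL and BMI. Because each step retrains and scores one candidate subset per remaining feature, wrapper methods generally cost more than filter methods.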
Performance comparison of the proposed method with previous methods on the diabetes dataset.
| Reference | Method | Accuracy (%) | |
|---|---|---|---|
| [ | LANFIS | 88.05 | 0.87 |
| [ | SM-Rule-Miner | 89.87 | 0.92 |
| [ | TSHDE | 91.91 | 0.21 |
| [ | C4.5 algorithm | 92.38 | 0.69 |
| [ | Modified K-Means Clustering +SVM (10-FC) | 96.71 | 0.07 |
| [ | Support Vector Machine | 97.14 | 0.06 |
| [ | Artificial Neural Network (ANN) | 82.35 | 1.23 |
| [ | SBNN + PSO + ALR | 88.75 | 0.31 |
| [ | DPM | 96.74 | 0.08 |
| [ | DNN | 95.6 | 0.09 |
| [ | BN | 99.51 | 0.06 |
| Our study | DT(ID3) + DT | 99 (Hold out) | 0.04 |
| Our study | DT(ID3) + DT | 99.8 (K-fold) | |
| Our study | DT(ID3) + DT | 99.9 (LOSO) | |