Roshi Saxena1, Sanjay Kumar Sharma1, Manali Gupta1, G C Sampada2.
Abstract
Diabetes prediction is an active research area in which experts from the medical field are trying to anticipate the disease with greater accuracy. Surveys conducted by the WHO have shown a remarkable increase in the number of diabetic patients. Diabetes generally remains dormant, and it aggravates other conditions when patients are diagnosed with another disease, such as damage to the kidney vessels, problems in the retina of the eye, or cardiac problems; if it remains unidentified, it can create metabolic disorders and many complications in the body. The main objective of our study is to present a comparative study of different classifiers and feature selection methods for predicting diabetes with greater accuracy. In this paper, we studied multilayer perceptron, decision tree, K-nearest neighbour, and random forest classifiers, and several feature selection techniques were applied to the classifiers to detect diabetes at an early stage. The raw data were preprocessed by removing outliers and imputing missing values with the mean, followed by hyperparameter optimization. Experiments were conducted on the PIMA Indians diabetes dataset using Weka 3.9; the accuracy achieved was 77.60% for the multilayer perceptron, 76.07% for decision trees, 78.58% for K-nearest neighbour, and 79.8% for random forest, which is the best accuracy of the four classifiers.
Year: 2022 PMID: 35463255 PMCID: PMC9033325 DOI: 10.1155/2022/3820360
Source DB: PubMed Journal: Comput Intell Neurosci
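The preprocessing described in the abstract (imputing missing values with the column mean before classification) can be sketched in plain Python. The paper performed this step in Weka 3.9; the function below is only an illustrative analogue, and treating 0 as the missing-value code (common for the PIMA dataset) is an assumption.

```python
# Mean imputation as described in the abstract: replace missing entries
# in a numeric column with the mean of the observed entries.
# Assumption: missing values are coded as 0, as is common in PIMA data.
def impute_mean(column, missing=0):
    """Return a copy of `column` with missing entries replaced by the
    mean of the non-missing entries."""
    observed = [x for x in column if x != missing]
    mean = sum(observed) / len(observed)
    return [mean if x == missing else x for x in column]

# Illustrative glucose readings with one missing (zero) entry.
glucose = [148, 85, 183, 0, 137]
imputed = impute_mean(glucose)  # the 0 becomes the mean of the rest
```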
Hyperparameter optimization.
| K-nearest neighbour | Random forest | Decision trees | Multilayer perceptron |
|---|---|---|---|
| Number of neighbours = 45 | Size of each bag = 53 | Confidence factor = 0.11 | Learning rate = 0.003 |
| Batch size = 100 | Max depth = 0 | Min num. of objects = 1 | Momentum = 0.9 |
| Algorithm = linear search | No. of trees = 100 | Unpruned = false | Hidden layers = 10 |
| Distance function = Manhattan distance |  |  |  |
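The tuned settings above were configured in Weka 3.9; a rough scikit-learn analogue is sketched below. The parameter mapping is an approximation: Weka's "confidence factor" and "size of each bag" have no exact scikit-learn equivalent, and a max depth of 0 in Weka means unlimited depth.

```python
# Approximate scikit-learn counterparts of the Weka hyperparameters
# listed in the table above (mapping is an assumption, not the paper's
# exact configuration).
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

# KNN: 45 neighbours, Manhattan distance, brute-force (linear) search.
knn = KNeighborsClassifier(n_neighbors=45, metric="manhattan",
                           algorithm="brute")
# Random forest: 100 trees; Weka max depth 0 = unlimited (None here).
rf = RandomForestClassifier(n_estimators=100, max_depth=None)
# Decision tree: minimum of 1 object (sample) per leaf.
dt = DecisionTreeClassifier(min_samples_leaf=1)
# MLP: 10 hidden units, SGD with learning rate 0.003 and momentum 0.9.
mlp = MLPClassifier(hidden_layer_sizes=(10,), solver="sgd",
                    learning_rate_init=0.003, momentum=0.9)
```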
Figure 1. Machine learning system.
Figure 2. Partitioning of the dataset using 5-fold cross-validation [38].
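The 5-fold partitioning shown in Figure 2 can be sketched as follows: the dataset is split into five equal folds, each fold serving once as the test set while the remaining four form the training set. This is a minimal illustration, not the exact Weka procedure.

```python
# Minimal 5-fold cross-validation partitioning, as in Figure 2.
def k_fold_indices(n, k=5):
    """Yield (train_indices, test_indices) for each of k contiguous
    folds over n samples (n assumed divisible by k for simplicity)."""
    fold_size = n // k
    indices = list(range(n))
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, test

# Example with 10 samples: each fold holds out 2 samples for testing.
folds = list(k_fold_indices(10, k=5))
```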
Figure 3. ((a)–(f)) Two-dimensional distribution of the PIMA Indians diabetes dataset. (a) Line plot between glucose and blood pressure. (b) Line plot between mass and pedigree function. (c) Line plot between glucose and mass. (d) Line plot between pressure and pedigree. (e) Line plot between pressure and mass. (f) Line plot between glucose and pedigree.
Description of the PIMA Indians diabetes dataset.
| S. No | Attributes | Mean | Standard deviation | Min/max value |
|---|---|---|---|---|
| 1 | No. of times pregnant | 3.8 | 3.4 | 1/17 |
| 2 | Plasma glucose concentration | 120.9 | 32 | 56/197 |
| 3 | Diastolic blood pressure (mm Hg) | 69.1 | 19.4 | 24/110 |
| 4 | Triceps skin fold thickness (mm) | 20.5 | 16 | 7/52 |
| 5 | 2-Hour serum insulin | 79.8 | 115.2 | 15/846 |
| 6 | Body mass index (kg/m2) | 32 | 7.9 | 18.2/57.3 |
| 7 | Diabetes pedigree function | 0.5 | 0.3 | 0.0850/2.32 |
| 8 | Age | 33.2 | 11.8 | 21/81 |
| 9 | Class | Tested positive: diabetic; tested negative: nondiabetic | | |
Accuracy of classifiers for different feature selection techniques.
| No. of features | Algorithm | Correlation attribute | Information gain | Principal component |
|---|---|---|---|---|
| 6 | Multilayer perceptron | 74.8 | 74.0 |  |
| 6 | Decision trees | 74.3 | 74.2 | 73.5 |
| 6 | Random forest | 74.6 | 75.1 |  |
| 6 | K-nearest neighbour | 67.0 | 68.0 | 65.7 |
| 4 | Multilayer perceptron | 75.1 | 76.9 | 72.6 |
| 4 | Decision trees | 74.0 | 74.3 | 72.5 |
| 4 | Random forest | 73.3 | 71.7 | 72.3 |
| 4 | K-nearest neighbour | 70.1 | 68.0 | 65.7 |
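One of the feature selection techniques in the table above, correlation attribute evaluation, can be illustrated by scoring each attribute by its absolute Pearson correlation with the class label and keeping the top-k. The paper used Weka's built-in evaluators; this plain-Python sketch is an analogue, not the exact filter.

```python
import math

# Correlation-based attribute ranking (sketch of Weka's
# CorrelationAttributeEval idea): rank attributes by |r| with the class.
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def top_k_features(X_cols, y, k):
    """Return indices of the k attribute columns most correlated
    (in absolute value) with the class labels y."""
    scores = [abs(pearson(col, y)) for col in X_cols]
    return sorted(range(len(X_cols)), key=lambda i: -scores[i])[:k]

# Toy example: columns 0 and 2 track the class perfectly, column 1 not at all.
ranked = top_k_features([[0, 1, 0, 1], [1, 1, 0, 0], [0, 2, 0, 2]],
                        [0, 1, 0, 1], k=2)
```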
Figure 4. Line diagram of accuracy comparison.
Figure 5. Bar diagram of accuracy comparison.
Comparison of accuracy before and after the proposed method.
| S. no. | Classification algorithm | Before proposed method | After proposed method |
|---|---|---|---|
| 1 | K-nearest neighbour | 70.1 | **78.58** |
| 2 | Random forest | 75.9 | **79.8** |
| 3 | Decision trees | 73.8 | **76.07** |
| 4 | Multilayer perceptron | 75.1 | **77.60** |
Bold means the improved accuracy after the proposed method.
Classification accuracy of different methods with literature.
| Authors | Data size | Techniques | Classification accuracy (%) |
|---|---|---|---|
| Li et al. | 768 | Ensemble of SVM, ANN, and NB | 58.3 |
| Deng and Kasabov | 768 | Self-organizing maps | 78.40 |
| Brahim-Belhouari and Bermak | 768 | NB, SVM, DT | 76.30 |
| Smith et al. | 768 | Neural ADAP algorithm | 76 |
| Choubey et al. | 768 | Ensemble of RF and XB | 78.9 |
| Quinlan et al. | 768 | C4.5 decision trees | 71.10 |
| Bozkurt et al. | 768 | Artificial neural network | 76.0 |
| Parashar et al. | 768 | SVM, LDA | 77.60 |
| Sahan et al. | 768 | Artificial immune system | 75.87 |
| Chatreti et al. | 768 | Linear discriminant analysis | 72 |
| Christobel and Sivaprakasam | 460 | K-nearest neighbour | 78.16 |
| Smith et al. | 768 | Ensemble of MLP and NB | 64.1 |
| Proposed method | 768 | KNN, RF, DT, MLP | 79.8 |
Evaluation parameters.
| S. no. | Classification algorithm | Sensitivity | Specificity | AUC | Accuracy |
|---|---|---|---|---|---|
| 1 | K-nearest neighbour | 0.786 | 0.659 | 0.838 | 78.58 |
| 2 | Random forest |  |  | 0.836 | 79.8 |
| 3 | Decision trees | 0.761 | 0.691 | 0.785 | 76.07 |
| 4 | Multilayer perceptron | 0.776 | 0.679 |  | 77.60 |
Bold indicates improvement in sensitivity, specificity, AUC, and accuracy after the proposed method.
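The evaluation parameters in the table are derived from a confusion matrix: sensitivity = TP/(TP+FN), specificity = TN/(TN+FP), and accuracy = (TP+TN)/total. The sketch below uses illustrative counts, not the paper's actual confusion matrices.

```python
# Evaluation metrics from a binary confusion matrix, as reported in
# the table above (counts below are illustrative assumptions).
def evaluate(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) from confusion counts."""
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

# Hypothetical counts for a classifier on a 175-sample test split.
sens, spec, acc = evaluate(tp=80, fp=15, tn=60, fn=20)
```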