| Literature DB >> 35127623 |
Sivashankari R1, Sudha M1, Mohammad Kamrul Hasan2, Rashid A Saeed3, Suliman A Alsuhibany4, Sayed Abdel-Khalek5,6.
Abstract
Today, automated disease detection is widespread in healthcare systems. Diabetes is a significant problem that has spread all over the world. It is a genetic disease that causes trouble throughout a person's lifespan, and every year the number of people with diabetes rises by millions, affecting children as well. Disease identification has so far involved manual checking, and automation is a current trend in the medical field. Existing methods use a single algorithm to predict diabetes. For complex problems, a single model is not enough, because it may not suit the input data or the parameters used in the approach. To solve complex problems, multiple algorithms are combined, following either a homogeneous or a heterogeneous model: a homogeneous model uses the same algorithm multiple times, whereas a heterogeneous model uses different algorithms. This paper adopts a heterogeneous ensemble, the stacked ensemble model, to predict whether a person tests positive or negative for diabetes. Compared to existing models such as logistic regression (72%), Naïve Bayes (74.4%), and LDA (81%), the proposed stacked ensemble model achieves 93.1% accuracy in predicting diabetes.
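The level-0/level-1 design described in the abstract can be sketched with scikit-learn's `StackingClassifier`. This is a minimal illustration on synthetic data; the specific base learners, meta-learner, and hyperparameters below are assumptions, not the paper's exact configuration.

```python
# Hedged sketch of a heterogeneous stacked ensemble, assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the PIMA dataset (8 features, binary target).
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Level-0: different algorithms (heterogeneous); level-1: logistic regression.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("gb", GradientBoostingClassifier(random_state=42)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(),
)
stack.fit(X_train, y_train)
print(round(stack.score(X_test, y_test), 3))
```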
Keywords: KNN classifier; PIMA dataset; SVM and Gaussian Naïve Bayes; decision tree; gradient boosting; healthcare systems; random forest
Year: 2022 PMID: 35127623 PMCID: PMC8814448 DOI: 10.3389/fpubh.2021.792124
Source DB: PubMed Journal: Front Public Health ISSN: 2296-2565
Summary of the existing work.
| Author | Method | Accuracy (%) | Remarks |
|---|---|---|---|
| Iyer et al. | J48 | 74.87 | The WEKA tool is used for prediction, and the prediction accuracy rate is low. |
| Ahmed | J48 | 73.5 | A more extensive study of the data analysis is missing. |
| Soltani and Jafarian | Probabilistic Neural Network (PNN) | 89.56 | Only type 2 diabetes details are considered for the application development. |
| Kopitar et al. | Naïve Bayes, Random Forest, and KNN | 64.47 | Diabetes prediction accuracy is low compared with the proposed stacking approach. |
| Ashiquzzaman et al. | DNN with Dropout | 88.41 | This method achieved an 88.41% detection rate; a single approach is used. |
| Chugh et al. | Decision Tree and Gradient Boosting machine | 90.00 | The method achieved 90% accuracy in analyzing diabetes, but focused only on children's data for prediction. |
| Rakshit et al. | Two-class neural network | 83.3 | This model achieved an 83.3% detection rate for type 2 diabetes, considering only data from women above age 21. |
| Maniruzzaman et al. | Linear Discriminant Analysis, Quadratic Discriminant Analysis, Naïve Bayes classifier, Gaussian Process modeling | 81.97 | The accuracy is 81.97%, which is lower than the proposed method. |
| Sisodia and Sisodia | Decision Tree | 76.30 | Diabetes prediction accuracy is low compared to the proposed stacking approach. |
| Rao et al. | Decision Tree with radial function | 75.65 | Diabetes prediction accuracy is low compared with the proposed stacking approach. |
| Kopitar et al. | XGBoost | 88.4 | The obtained accuracy is lower, and only the single XGBoost algorithm is used. |
| Naveen et al. | SVM, Decision Tree, Naïve Bayes, Logistic Regression, and KNN | 75 | Several algorithms are used, but they are not combined for the final prediction. |
| Aishwarya et al. | SVM | 95 | A single machine learning algorithm is used for prediction. |
| Kandhasamy and Balamurali | J48, KNN, RF, and SVM | 73.82 | Diabetes prediction accuracy is low compared to the proposed stacking approach. |
Figure 1. Ensemble techniques.
Comparison of ensemble techniques.
| Bagging | Boosting | Stacking |
|---|---|---|
| Multiple classifiers are trained in parallel. | Builds each new learner sequentially. | Multiple classifiers are trained in parallel. |
| The result is obtained by averaging the responses of the N learners. | On each iteration, the model is updated by weights until the desired result is obtained. | The result is obtained from the second-level classifier. |
| Reduces variance. | Reduces bias. | Increases accuracy. |
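The three ensemble styles in the table can be exercised side by side. The sketch below assumes scikit-learn; the base estimators, fold count, and hyperparameters are illustrative choices on synthetic data, not the paper's setup.

```python
# Minimal side-by-side of bagging, boosting, and stacking (illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

models = {
    # Bagging: parallel learners, responses averaged -> lowers variance.
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                                 random_state=0),
    # Boosting: sequential learners, re-weighted each round -> lowers bias.
    "boosting": AdaBoostClassifier(n_estimators=25, random_state=0),
    # Stacking: parallel level-0 learners feed a level-1 meta-classifier.
    "stacking": StackingClassifier(
        estimators=[("dt", DecisionTreeClassifier(random_state=0)),
                    ("lr", LogisticRegression(max_iter=1000))],
        final_estimator=LogisticRegression(),
    ),
}
for name, model in models.items():
    print(name, round(cross_val_score(model, X, y, cv=3).mean(), 3))
```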
Figure 2. Stacked ensemble model.
Figure 3. Proposed system: stacked ensemble model architecture.
Level-0 input set.
| Input attribute vector |
|---|
| AttrVec1 (1st row) |
| AttrVec2 (2nd row) |
| AttrVec3 (3rd row) |
| ⋮ |
| AttrVecn (nth row) |
Figure 4. Level-1 classifier input set.
Figure 5. Proposed system flow chart.
Attribute details of the Pima Indian Diabetes dataset (PIDD).
| S. No. | Description | Short name | Type | Min | Max | Label |
|---|---|---|---|---|---|---|
| 1 | Number of times pregnant | Pregnant | Integer | 0 | 17 | Pregnancies |
| 2 | Glucose concentration (2-h oral glucose test [mg/dL]) | gl | Integer | 0 | 199 | Glucose |
| 3 | Blood pressure (diastolic blood pressure [mm Hg]) | bp | Integer | 0 | 122 | Blood pressure |
| 4 | Skin thickness (triceps skin fold thickness [mm]) | sk | Integer | 0 | 99 | Skin thickness |
| 5 | Serum insulin (2-h serum insulin [mu U/mL]) | in | Integer | 0 | 846 | Insulin |
| 6 | BMI (body mass index [kg/m²]) | bmi | Real | 0 | 67.10 | BMI |
| 7 | Diabetes pedigree function (diabetes in family history) | dp | Real | 0.08 | 2.42 | Diabetes Pedigree Function |
| 8 | Age (in years) | age | Integer | 21 | 81 | Age |
| 9 | Class | Target label | Binary | 0 (tested negative [500]) | 1 (tested positive [268]) | Target output |
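Loading the attributes above with pandas might look like the following. The three rows are the opening rows of the public PIDD file as commonly distributed, used here only to show that several attributes encode missing measurements as 0 (a glucose or BMI of 0 is not physiological).

```python
# Hedged sketch: a tiny frame standing in for the full 768-row PIDD file.
import pandas as pd

cols = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
        "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
rows = [
    [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
    [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
    [8, 183, 64, 0, 0, 23.3, 0.672, 32, 1],
]
df = pd.DataFrame(rows, columns=cols)

# Zeros in these attributes are stand-ins for missing values, not real data.
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
print(df[zero_as_missing].eq(0).sum().to_dict())
```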
Figure 6. Pearson correlation coefficient of Pima dataset input attributes.
Figure 7. Pearson correlation coefficient result of the Pima input attribute set.
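The Pearson screening in Figures 6 and 7 amounts to correlating each input attribute with the target label. A toy sketch assuming pandas, with made-up values standing in for the full attribute set:

```python
# Pearson correlation of each input attribute with the target (toy data).
import pandas as pd

df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Outcome": [1, 0, 1, 0, 1],
})
corr_with_target = df.corr(method="pearson")["Outcome"].drop("Outcome")
print(corr_with_target.round(3))
```

Attributes whose coefficient is near zero contribute little linear signal and are candidates for dropping before training.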
Figure 8. Accuracy comparison chart of various models and the proposed model.
Figure 9. Precision-recall curves of the proposed system and various machine learning models.
Figure 10. Precision, recall, F1-score, and accuracy results of the proposed system and various machine learning models.
Quality metrics results.
| Model | Precision (%) | Recall (%) | F1-score (%) | Accuracy (%) |
|---|---|---|---|---|
| Random forest | 78 | 78.3 | 77.8 | 68.5 |
| KNN | 69.3 | 70.1 | 69.5 | 62.4 |
| Logistic regression | 75.7 | 76.2 | 75.3 | 71 |
| Gradient boosting | 76.1 | 76.6 | 75.9 | 70 |
| Ada boosting | 77.9 | 77.5 | 77.9 | 72.7 |
| SVM | 76.5 | 76.6 | 75.4 | 73.1 |
| Stacking | 84 | 83.9 | 83.5 | 93.1 |
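The four metrics in the table can be reproduced for any prediction vector with scikit-learn. The label arrays below are illustrative only, not the paper's actual predictions:

```python
# Precision, recall, F1-score, and accuracy on an illustrative label set.
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0]  # one false negative, one false positive

print("precision", precision_score(y_true, y_pred))  # TP / (TP + FP) = 4/5
print("recall   ", recall_score(y_true, y_pred))     # TP / (TP + FN) = 4/5
print("f1       ", round(f1_score(y_true, y_pred), 3))
print("accuracy ", accuracy_score(y_true, y_pred))   # 8 of 10 correct
```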
Algorithm: Stacked Ensemble.
1. Input: a training set S and a testing set T := (a1, …), where Y ∈ {0, 1} and the feature set F = {f1, f2, f3, …, fn}.
2. Step 1: Assign the level-0 classifiers.
3. Set the number of level-0 learners.
4. Step 2: Train the level-0 classifiers on the training set.
5. Step 3: Prepare the new training set from the level-0 outputs:
6. Mh = (a1′, …)
7. Step 4: Assign the level-1 classifier.
8. Step 5: Train the level-1 classifier using the new training set Mh.
9. Step 6: …
10. Step 7: …
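The level-0/level-1 steps of the algorithm can be sketched by hand rather than through a prebuilt stacking class. This is a hedged sketch assuming scikit-learn; the level-0 learners, fold count, and synthetic data are illustrative. Out-of-fold predictions build the level-1 training set, mirroring Step 3:

```python
# Manual stacked ensemble: level-0 learners -> new training set -> level-1.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Steps 1-2: assign and train the level-0 classifiers.
level0 = [DecisionTreeClassifier(random_state=1), KNeighborsClassifier()]

# Step 3: build the new training set M from out-of-fold level-0
# predictions, so training labels never leak into the level-1 stage.
M = np.column_stack([cross_val_predict(clf, X, y, cv=5) for clf in level0])

# Steps 4-5: assign and train the level-1 classifier on M.
level1 = LogisticRegression().fit(M, y)

# Steps 6-7: at prediction time, level-0 outputs feed the level-1 model.
for clf in level0:
    clf.fit(X, y)
level0_out = np.column_stack([clf.predict(X) for clf in level0])
print(level1.predict(level0_out)[:5])
```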