| Literature DB >> 35890927 |
Umm E Laila, Khalid Mahboob, Abdul Wahid Khan, Faheem Khan, Whangbo Taekeun.
Abstract
Diabetes is a chronic disease caused by elevated blood sugar levels and can affect various organs if left untreated. It contributes to heart disease, kidney problems, nerve damage, blood vessel damage, and blindness. Timely prediction of the disease can save lives and enables healthcare providers to manage the condition early. Most diabetic patients know little about the risk factors they face before diagnosis. Hospitals today deploy basic information systems that generate vast amounts of data, yet this data is rarely converted into useful information or used to support clinical decision making. Several automated techniques are available for early disease prediction. Ensemble learning is a data analysis technique that combines multiple models into a single optimal predictive system to reduce bias and variance and to improve predictions. Diabetes data comprising 17 variables were gathered from the UCI repository. The predictive models used in this study, AdaBoost, Bagging, and Random Forest, were compared on precision, recall, classification accuracy, and F1-score. The Random Forest ensemble method achieved the best accuracy (97%), whereas the AdaBoost and Bagging algorithms had lower accuracy, precision, recall, and F1-scores.
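The ensemble comparison described in the abstract can be sketched with scikit-learn. The dataset below is a synthetic stand-in (the paper uses a 520-instance UCI diabetes dataset), and all model settings are illustrative defaults rather than the authors' actual configuration.

```python
# Minimal sketch of comparing the three ensemble methods with stratified
# 10-fold cross-validation, on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in for the 520-instance, 16-feature diabetes dataset.
X, y = make_classification(n_samples=520, n_features=16, random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
models = {
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Bagging": BaggingClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

Stratified folds preserve the positive/negative class ratio in every fold, which matters for an imbalanced outcome such as the 320/200 class split reported below.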
Keywords: AdaBoost; Bagging; Random Forest; data mining; diabetes dataset; ensemble techniques; prediction
Year: 2022 PMID: 35890927 PMCID: PMC9324493 DOI: 10.3390/s22145247
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Comparison of studies.
| Author/Year | Purpose | Classifiers Used | Datasets | Validation Parameters | Key Findings |
|---|---|---|---|---|---|
| Chatrati et al. [ | To forecast the presence of diabetes and hypertension | SVM, KNN, DT, LR | PIDD | ACC, scatter plot, CM, ROC curve | ACC for SVM was 75% |
| Maniruzzaman et al. [ | To create a machine learning (ML) system for predicting diabetes patients | LR-RF combination for feature selection; NB, DT, RF, AdaBoost | National Health and Nutrition Examination Survey | ACC, AUC | ACC 94.25% |
| S. Kumari et al. [ | To improve the accuracy of diabetes mellitus prediction using a combination of machine learning techniques | NB, RF, LR | PIDD and breast cancer dataset | ACC, Precision, Recall, F1-score, AUC | 97.02% accuracy on the breast cancer dataset; 79.08% on the PIMA dataset |
| P. Rajendra et al. [ | To create a prediction model and investigate several methods of improving performance and accuracy | LR | PIDD and Vanderbilt | Precision, Recall, F1-score | 78% accuracy on Dataset 1; 93% on Dataset 2 |
| C. Yadav et al. [ | To use a classification technique for diabetes prediction | Chi-Square for feature selection; DT, JRip, OneR, Bagging, Boosting | UCI repository, 9 attributes | ACC, Recall, Precision, and F1-score | ACC for the Bagging ensemble was 98% |
| Goyal et al. [ | To develop a type 2 diabetes prediction model | Ensemble method with 10-fold cross-validation | PIDD | ACC | ACC 77.60% |
| A. Prakash [ | To enhance the performance indicators for early diabetes diagnosis | J48, NB, RF, RT, SimpleCART | PIDD | ACC, computational time, Precision, FM, ROC, and PRC | ACC 79.22% |
| Singh Ashima et al. [ | To use an ensemble of various machine learning techniques for predicting diabetes | SVM, NN, DT, XGBoost, RF | PIDD | ACC, Sen, Spe, Gini index, Precision, AUC, AUCH, minimum error rate, and minimum weighted coefficient | ACC 95% |
| R. Saxena et al. [ | To compare several classifiers and feature selection techniques for more accurate diabetes prediction | MLP, DT, KNN, RF | PIDD | Sen, Spe, ACC, and AUC | ACC: DT 76.07%, KNN 78.58%, RF 79.8% |
| K. Hasan [ | To put forward a robust framework for predicting diabetes | SVM, KNN, DT, MLP, NB, AdaBoost, XGBoost | PIDD | Sen, Spe, and AUC | ACC of 78.9% using AdaBoost; AUC of 95% with gradient boosting |
| Tigga et al. [ | To predict the risk of type 2 diabetes using various machine learning algorithms | NB, RF | PIDD | ACC, Precision, Recall, and F1-score | 74.46% accuracy using RF on both datasets |
| Jashwanth Reddy et al. [ | To create a model with the highest degree of accuracy for predicting human diabetes | SVM, KNN, LR, NB, GB, RF | PIDD | ACC, ROC, Precision, Recall, FM | ACC 80% using RF |
| Jackins et al. [ | To discover a model for diagnosing diabetes, coronary heart disease, and cancer from the available data | NB, RF | PIDD | ACC | NB ACC 74.64%; RF ACC 74.04% |
| Raghavendran et al. [ | To analyze a patient dataset and determine the probability of type 2 diabetes | LR, KNN, RF, SVM, NB, AdaBoost | PIDD | ACC, Precision, Recall | AdaBoost performs well at 95% |
| Laila et al. (this study) | To increase the accuracy of standard machine learning ensemble algorithms | AdaBoost, Bagging, RF | UCI repository | Precision, Recall, ACC, F1-score | RF performs best at 97% |
List of attributes with their value distributions (✓ = Yes, × = No).
| Attribute | Values |
|---|---|
| Age | Numeric |
| Gender | Men = 328, Women = 192 |
| Polyuria | ✓ = 258, × = 262 |
| Polydipsia | ✓ = 233, × = 287 |
| Sudden weight loss | ✓ = 217, × = 303 |
| Weakness | ✓ = 305, × = 215 |
| Polyphagia | ✓ = 237, × = 283 |
| Genital thrush | ✓ = 116, × = 404 |
| Visual blurring | ✓ = 233, × = 287 |
| Itching | ✓ = 253, × = 267 |
| Irritability | ✓ = 126, × = 394 |
| Delayed healing | ✓ = 239, × = 281 |
| Partial paresis | ✓ = 224, × = 296 |
| Muscle stiffness | ✓ = 195, × = 325 |
| Alopecia | ✓ = 179, × = 341 |
| Obesity | ✓ = 88, × = 432 |
| Class | Positive = 320, Negative = 200 |
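The symptom attributes above are binary (✓/×), so a typical preprocessing step maps them to 0/1 before training; the small DataFrame below is an illustrative stand-in, not the paper's actual data or necessarily the authors' encoding scheme.

```python
# Minimal sketch: encode binary Yes/No attributes and the class label
# as 0/1 integers, on a tiny illustrative stand-in for the dataset.
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Polyuria": ["Yes", "No", "Yes"],
    "Polydipsia": ["No", "No", "Yes"],
    "class": ["Positive", "Negative", "Positive"],
})

binary_map = {"Yes": 1, "No": 0, "Male": 1, "Female": 0,
              "Positive": 1, "Negative": 0}

# Map every column's categorical values to integers.
encoded = df.apply(lambda col: col.map(binary_map))
print(encoded)
```

Age is the one numeric attribute in the table, so in a full pipeline it would be passed through (or scaled) rather than mapped this way.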
Figure 1. Preprocessing visualizations of the attributes.
Summary of stratified cross-validation performance metrics.

| Classifier | Accuracy | Kappa Statistic | Mean Absolute Error | Root Mean Squared Error | Relative Absolute Error |
|---|---|---|---|---|---|
| AdaBoost | 90.576% | 0.803 | 0.157 | 0.269 | 33.157% |
| Bagging | 94.615% | 0.887 | 0.109 | 0.224 | 23.153% |
| Random Forest | 97.115% | 0.939 | 0.059 | 0.154 | 12.586% |

| Classifier | Root Relative Squared Error | Precision | Recall | F1-Score |
|---|---|---|---|---|
| AdaBoost | 55.436% | 0.908 | 0.906 | 0.906 |
| Bagging | 46.219% | 0.947 | 0.946 | 0.946 |
| Random Forest | 31.709% | 0.971 | 0.971 | 0.971 |
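The metrics in the table can be reproduced from a classifier's predictions. The sketch below uses small illustrative arrays (not the paper's outputs) and standard scikit-learn/NumPy formulas for the Kappa statistic, the probability-based error measures, and the weighted F1-score.

```python
# Minimal sketch of computing WEKA-style evaluation metrics
# from illustrative labels, predictions, and class probabilities.
import numpy as np
from sklearn.metrics import cohen_kappa_score, precision_recall_fscore_support

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])   # ground-truth labels
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 1])   # predicted labels
proba  = np.array([0.9, 0.1, 0.8, 0.4, 0.2,
                   0.7, 0.3, 0.6, 0.85, 0.95])       # P(class = positive)

kappa = cohen_kappa_score(y_true, y_pred)            # chance-corrected agreement
mae = np.mean(np.abs(y_true - proba))                # mean absolute error
rmse = np.sqrt(np.mean((y_true - proba) ** 2))       # root mean squared error
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted")

print(f"kappa={kappa:.3f} MAE={mae:.3f} RMSE={rmse:.3f} "
      f"precision={prec:.3f} recall={rec:.3f} F1={f1:.3f}")
```

The relative errors in the table (RAE and RRSE) divide MAE and RMSE by the error of a baseline predictor that always outputs the training-set class frequencies, which is why they are reported as percentages.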
Figure 2. Comparison of accuracy, precision, recall, and F-measure of the ensemble classifiers.
Figure 3. Threshold curve of a positive class using AdaBoost.
Figure 4. Threshold curve of a negative class using AdaBoost.
Figure 5. Threshold curve of a positive class using Bootstrap Aggregation (Bagging).
Figure 6. Threshold curve of a negative class using Bootstrap Aggregation (Bagging).
Figure 7. Threshold curve of a positive class using Random Forest.
Figure 8. Threshold curve of a negative class using Random Forest.
Figure 9. Confusion matrix of AdaBoost.
Figure 10. Confusion matrix of Bootstrap Aggregation (Bagging).
Figure 11. Confusion matrix of Random Forest.
Figure 12. Computational representation of the attributes with their scores obtained from the Chi-Square technique.
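Chi-Square attribute scoring of the kind shown in Figure 12 can be sketched with scikit-learn's `chi2`. The data below is randomly generated for illustration, so the scores are not the paper's; the point is only how each attribute receives a score measuring its dependence on the class label.

```python
# Minimal sketch of Chi-Square feature scoring on illustrative binary data.
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(520, 4))        # four binary symptom attributes
y = (X[:, 0] | X[:, 1]).astype(int)          # outcome tied to the first two

scores, pvalues = chi2(X, y)                 # one score per attribute
ranking = np.argsort(scores)[::-1]           # highest-scoring attributes first
print("ranking:", ranking.tolist())
```

Attributes with high Chi-Square scores deviate most from class-independence, so they rank highest; the last two features here are pure noise and score near zero.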