Michael Onyema Edeh, Osamah Ibrahim Khalaf, Carlos Andrés Tavera, Sofiane Tayeb, Samir Ghouali, Ghaida Muttashar Abdulsahib, Nneka Ernestina Richard-Nnabu, AbdRahmane Louni.
Abstract
Diabetes is considered to be one of the leading causes of death globally. If diabetes is not detected and treated early, it can lead to a variety of complications. The aim of this study was to develop a model that can predict the likelihood of a patient developing diabetes as precisely as possible. Classification algorithms are widely used in the medical field to classify data into different categories based on criteria that are relatively restrictive to the individual classifier. Therefore, four supervised machine learning classification algorithms (random forest, SVM, Naïve Bayes, and decision tree) and one unsupervised learning algorithm (k-means) were used in this investigation to identify diabetes in its early stages. The experiments were performed on two databases: one extracted from the Frankfurt Hospital in Germany, and the PIMA Indian Diabetes Database (PIDD) provided by the UCI machine learning repository. On the Frankfurt Hospital database, the random forest algorithm performed best, with the highest accuracy of 97.6%. On the Pima Indian database, the SVM algorithm performed best, with the highest accuracy of 83.1% compared to the other algorithms. The validity of these results is confirmed by separating the data set into two parts, a training set and a test set, as described below. The training set is used to fit the model. The test set is used to evaluate the fitted model and determine its correctness.
Keywords: AI; Bayesian Naive; ML; Support Vector Machine (SVM); classification; decision tree; diabetes; random forest
Year: 2022 PMID: 35433625 PMCID: PMC9008347 DOI: 10.3389/fpubh.2022.829519
Source DB: PubMed Journal: Front Public Health ISSN: 2296-2565
Figure 1. Proposed model diagram.
The observation and analysis of the two databases.
| Attribute | Observation | Zero values (Pima Indian) | Zero values (Frankfurt) |
|---|---|---|---|
| Blood pressure | The data contain zero values for blood pressure. | 35 | 90 |
| Glucose | A person cannot have a glucose value of zero, even when fasting. | 5 | 13 |
| Skin thickness | For normal people, the skin-fold thickness cannot be <10 mm. | 227 | 573 |
| BMI | A person cannot have a BMI of zero. | 11 | 28 |
| Insulin | In rare situations a person may have zero insulin. | 374 | 956 |
| Pregnancies | A zero value is normal for this column, so no cleaning is needed. | 111 | 301 |
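The observations in the table above suggest a cleaning step in which zeros in the physiological columns are treated as missing values, while zeros in the pregnancies column are kept. A minimal sketch of that idea follows. The column names, the sample records, and the choice of median imputation are assumptions made for illustration; they are not taken from the study, which may have handled missing values differently.

```python
from statistics import median

# Columns where a zero is treated as a missing measurement.
# Pregnancies is deliberately excluded: zero is a valid value there.
# The column names are hypothetical.
ZERO_MEANS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def count_zeros(rows, columns):
    """Return {column: number of rows whose value is exactly 0}."""
    return {col: sum(1 for row in rows if row[col] == 0) for col in columns}


def impute_zeros_with_median(rows, columns):
    """Return a copy of rows in which zeros in `columns` are replaced by the
    median of that column's non-zero values (an illustrative choice only)."""
    cleaned = [dict(row) for row in rows]
    for col in columns:
        non_zero = [row[col] for row in rows if row[col] != 0]
        if not non_zero:
            continue  # nothing to impute from; leave the column untouched
        fill = median(non_zero)
        for row in cleaned:
            if row[col] == 0:
                row[col] = fill
    return cleaned


# Tiny made-up sample; these are not records from either database.
sample = [
    {"Pregnancies": 0, "Glucose": 120, "BloodPressure": 70,
     "SkinThickness": 0, "Insulin": 0, "BMI": 30.0},
    {"Pregnancies": 2, "Glucose": 0, "BloodPressure": 80,
     "SkinThickness": 25, "Insulin": 90, "BMI": 0},
    {"Pregnancies": 1, "Glucose": 100, "BloodPressure": 0,
     "SkinThickness": 35, "Insulin": 110, "BMI": 26.0},
]

zero_counts = count_zeros(sample, ZERO_MEANS_MISSING)
cleaned = impute_zeros_with_median(sample, ZERO_MEANS_MISSING)
```

The function returns a cleaned copy rather than modifying the input, so the raw zero counts can still be reported after cleaning.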
Accuracy measures.
| Measure | Definition | Formula |
|---|---|---|
| Accuracy | The proportion of instances that the algorithm predicts correctly | A = (TP + TN) / (total number of samples) |
| Recall | The ability of a classification model to identify all relevant instances | R = TP / (TP + FN) |
| F1-measure | The harmonic mean of precision and recall | F = 2 * [(P * R) / (P + R)] |
| Precision | The proportion of predicted positive instances that are truly positive | P = TP / (TP + FP) |
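The four measures in the table above can be computed directly from the confusion-matrix counts TP, TN, FP, and FN. The following is a minimal sketch; the function name and the example counts are hypothetical and do not come from the study's experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    A ratio whose denominator is zero is reported as 0.0, which is one common
    convention (an assumption here, not something the table specifies).
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * (precision * recall) / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts for illustration: 100 test samples in total.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because F1 is a harmonic mean, it always lies between precision and recall and sits closer to the smaller of the two.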
Evaluation attributes results for different models (Frankfurt Germany).
| Model | Parameter | | | |
|---|---|---|---|---|
| Naïve Bayes | C = 16 | 0.776 | 0.625 | 0.654 |
| SVM | C = 20 | 0.783 | 0.566 | 0.638 |
| Decision tree | C = 285 | 0.971 | 0.975 | 0.958 |
| Random forest | C = 20 | 0.989 | 0.95 | 0.972 |
Evaluation attributes results for the different models (Pima Indian).
| Model | Parameter | | | |
|---|---|---|---|---|
| Naïve Bayes | C = 55 | 0.785 | 0.6 | 0.62 |
| SVM | C = 16 | 0.831 | 0.533 | 0.648 |
| Decision tree | C = 55 | 0.707 | 0.622 | 0.554 |
| Random forest | C = 20 | 0.805 | 0.711 | 0.68 |
Comparison of the proposed work with the existing works (Pima Indian).
| Method | Accuracy | Reference |
|---|---|---|
| Logistic regression | 76.80% | ( |
| Decision table | 79.81% | ( |
| Naïve Bayes | 76.3% | ( |
| Logistic Regression (LR) | 80% | ( |
| SVM | 83.1% | Our study |
Comparison of the proposed work with the existing works (Frankfurt, Germany).
| Method | Missing-value handling | Test size | Accuracy | Reference |
|---|---|---|---|---|
| Random forest | The median | Test size = 0.2 | 91% | ( |
| Gaussian process | The mean | Test size = 0.2 | 98.25% | ( |
| DeepNN | Linear interpolation | Test size = 0.1 | 99.5% | ( |
| Random forest | k-means | Test size = 0.2 | 100% | Our Study |
| Random forest | k-means | Test size = 0.3 | 97.6% | Our Study |
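The "Test size" values in the comparison above are the fraction of records held out for testing, with the remainder used for training. A minimal sketch of such a hold-out split, using only the Python standard library, is given below. The function name, the seed, and the placeholder records are assumptions for illustration and do not reproduce the authors' code.

```python
import random


def train_test_split(rows, test_size, seed=0):
    """Shuffle rows reproducibly and hold out a `test_size` fraction for testing.

    Returns (train_rows, test_rows). `seed` fixes the shuffle so the same
    split can be regenerated.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be strictly between 0 and 1")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


# Placeholder records (plain integers), not rows from either database.
records = list(range(1000))

train_20, test_20 = train_test_split(records, test_size=0.2)
train_30, test_30 = train_test_split(records, test_size=0.3)
print(len(train_20), len(test_20))
print(len(train_30), len(test_30))
```

A smaller test fraction leaves more data for training but gives a noisier accuracy estimate, which is one reason results obtained with different test sizes are not directly comparable.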
| Authors | Summary | Reference |
|---|---|---|
| Kumari and Chitra | The proposed work uses an SVM with a radial basis function (RBF) kernel for classification. The reported classification accuracy (78.2%), sensitivity (80%), and specificity were found to be high, making the method a good option for classification. | ( |
| Ahmed | The accuracy of the proposed models was compared: random forest achieved 74.7%, ANN 75.7%, and k-means clustering 73.6%. | ( |
| Shetty et al. | Used K-Nearest Neighbors (KNN) and Naïve Bayes for diabetes prediction. The technique was implemented as a software program in which users supply patient records and the program reports whether the patient is diabetic. | ( |
| Bhoia et al. | Various supervised learning algorithms were used (CT, SVM, k-NN, NB, RF, NN, AB, and LR), with the training and testing datasets generated using k-fold cross-validation with k = 10. The reported accuracy was 76.80%. | ( |
| Kandhasamy and Balamurali | Using a data sample from the UCI machine learning repository, the authors compared the performance of four common classifiers (J48 DT, K-Nearest Neighbors, Random Forest, and Support Vector Machines) for classifying diabetes mellitus patients. Preliminary results suggest that the J48 DT classifier is the most accurate (73.82%) before data pre-processing, and that the KNN (k = 1) and Random Forest classifiers outperform the others after data pre-processing. | ( |
| Vijayan et al. | The KNN method and the ANFIS algorithm are comparable. According to the experimental results, the combination of KNN and ANFIS gives the highest classification accuracy (80%) among the algorithms tested. | ( |
| Soleh et al. | The data in this study were divided into 75% for training and 25% for testing. The study reports an accuracy of 80%, which is better than the 75.97% of the previous paper. | ( |
| Rajput et al. | The analysis aims to list the risk factors and the correlations that exist among them. Logistic regression, support vector machine, random forest, decision tree, Naive Bayes, and K-nearest neighbor classifiers are used for prediction, and their accuracies are compared to choose the best machine learning model. SVM provides the highest accuracy (96.0) among the chosen algorithms. | ( |
| Deepa et al. | This work proposes an artificial intelligence-based intelligent system for earlier prediction of the disease using a Ridge-Adaline Stochastic Gradient Descent Classifier (RASGD). The results of the proposed scheme were compared with state-of-the-art machine learning algorithms such as support vector machine and logistic regression. The RASGD intelligent system attains an accuracy of 92%, which is better than the other selected classifiers. | ( |