Monika Arya, Hanumat Sastry G, Anand Motwani, Sunil Kumar, Atef Zaguia.
Abstract
Diabetes has been recognized as a global medical problem for more than half a century. Patients with diabetes can benefit from Internet of Things (IoT) devices such as continuous glucose monitoring (CGM) systems, intelligent pens, and similar devices. These smart devices generate continuous data streams that must be processed in real time to benefit users. The medical data collected are vast and heterogeneous because they are gathered from various sources. An accurate diagnosis can be achieved through a variety of scientific and medical techniques, but the streaming data must be processed quickly to extract relevant and significant knowledge. Recent research has concentrated on improving the performance of prediction models by using ensemble-based and Deep Learning (DL) approaches. However, the performance of a DL model can degrade due to overfitting. This paper proposes ETEODL, a predictive framework that combines the Extra-Tree Ensemble (ETE) feature selection technique, which reduces the input feature space, with DL to predict the likelihood of diabetes. In the proposed work, dropout layers follow the hidden layers of the DL model to prevent overfitting. This research used a dataset from the UCI Machine Learning (ML) repository for early-stage prediction of diabetes. The results of the proposed scheme were compared with state-of-the-art ML algorithms, and the comparison validates the effectiveness of the predictive framework. The proposed work outperforms the other selected classifiers and achieves an accuracy of 97.38%. Its F1-score, precision, and recall are 96%, 97.7%, and 97.7%, respectively. The comparison shows the superiority of the suggested approach: the proposed method improves on the performance of earlier ML techniques and recent DL approaches and avoids overfitting.
Keywords: data stream classification; deep learning; diabetes detection; ensemble technique; extra tree ensemble; feature selection; machine learning; overfitting
Year: 2022 PMID: 35242738 PMCID: PMC8885585 DOI: 10.3389/fpubh.2021.797877
Source DB: PubMed Journal: Front Public Health ISSN: 2296-2565
Methodology and limitations of recent relevant work.
| No. | Reference | Methodology | Limitations |
|---|---|---|---|
| 1 | Cho et al. ( | A model which combines Linear SVM classifiers and wrapper or embedded feature selection methods | Wrapper methods for feature selection have high computational costs and are generally prone to overfitting. They are also dependent on the classifiers used. |
| 2 | Le et al. ( | A novel model utilizing Gray Wolf Optimization (GWO) and Adaptive Particle Swarm Optimization (APSO) to optimize the Multilayer Perceptron (MLP) and reduce the number of required input attributes. | In MLP, computations are complex and time-consuming. |
| 3 | Lukmanto et al. ( | A classification framework to identify and classify diabetes datasets using F-Score Feature Selection and Fuzzy SVM. | A disadvantage of the F-score is that it does not reveal mutual information among features. Instead, it only captures the linear relationships between features and labels. |
| 4 | Putri et al. ( | Learning Vector Quantization (LVQ) to classify the diabetes dataset with Chi-Square for feature selection. | Chi-Square for feature selection does not take feature interactions into account. It is best suited to categorical variables. |
| 5 | Sneha and Gangil ( | Classification by selecting the optimal features based on the correlation values. | Correlation values for feature selection uncover only relationships and do not determine what variables have the most influence. Thus, it can be a time-consuming process. |
Figure 1. Flowchart for the proposed framework.
Comparison of proposed model with conventional ML algorithms.
| Classifier | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) | ROC area (%) | Computation time (s) |
|---|---|---|---|---|---|---|
| Naïve Bayes | 87.5 | 88.2 | 87.5 | 87.6 | 94 | 38.65 |
| Decision tree | 80.76 | 85.3 | 80.7 | 81.1 | 83.7 | 45.56 |
| Hoeffding tree | 87.5 | 88.2 | 87.5 | 87.6 | 94 | 50.68 |
| Random forest | 95.19 | 95.55 | 95.19 | 95.2 | 91.1 | 54.72 |
| Ensemble (stacking) | 63.46 | 83.4 | 63.46 | 73.5 | 50 | 69.8 |
| ETEODL (proposed) | 97.38 | 97.7 | 97.7 | 96 | 95 | 28.63 |
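The precision, recall, and F1-score columns above derive from confusion-matrix counts. As a minimal sketch in Python, the binary-class versions can be computed as shown below. The function name and the example counts are hypothetical and are not taken from the paper, which may instead report class-weighted averages.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from binary confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for illustration only (not results from the paper).
metrics = classification_metrics(tp=90, fp=5, fn=10, tn=95)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

Multiplying each value by 100 gives the percentage form used in the table.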
Comparison of the proposed model with recent works.
| Work | Methodology | Accuracy (%) | Precision | Recall | F1-score | ROC area | Computation time (s) | Limitations |
|---|---|---|---|---|---|---|---|---|
| Diabetes detection using DL algorithms ( | Employed long short-term memory (LSTM), convolutional neural network (CNN), and their combinations for extracting complex temporal dynamic features | 95.7 | 0.77 | 0.86 | 0.87 | 0.94 | 58.73 | LSTM has high computational complexity and is prone to overfitting |
| Health care system: stream ML classifier for features prediction in diabetes therapy | Used a combination of probabilistic and ML models | 90 | 0.746 | 0.678 | 0.85 | 0.5 | 45.67 | The probabilistic approach suffers from the problem of selecting suitable metrics for the detection process |
| Diabetes detection using DL approach ( | DL-based Restricted Boltzmann machine approach is used. | 84.32 | 0.86 | 0.75 | 0.77 | 0.911 | 67.83 | In RBM, training is more problematic as it is difficult to calculate the energy gradient function |
| ETEODL (Proposed) | | 97.38 | 0.977 | 0.977 | 0.96 | 0.95 | 28.63 | |
Figure 2. Accuracy (%) comparison with existing methods.
Figure 3. Comparison of computation time (in sec).
Figure 4. Comparison of precision, recall, F1-score, and ROC area.
Figure 5. Comparison of prediction accuracy.
Figure 6. Comparison of computation time.
Figure 7. Comparison of precision, recall, F1-score, and ROC area.
Comparison of training and testing accuracy of ETEODL over 100 epochs.
| Epochs | Training accuracy | Testing accuracy |
|---|---|---|
| 01–10 | 0.836 | 0.954 |
| 11–20 | 0.877 | 0.962 |
| 21–30 | 0.922 | 0.968 |
| 31–40 | 0.934 | 0.971 |
| 41–50 | 0.952 | 0.973 |
| 51–60 | 0.964 | 0.976 |
| 61–70 | 0.965 | 0.977 |
| 71–80 | 0.966 | 0.977 |
| 81–90 | 0.968 | 0.978 |
| 91–100 | 0.968 | 0.978 |
Figure 8. Comparison of training and testing accuracy.
Extra Tree Ensemble Optimized DL Algorithm
1: Call Algorithm-2 for ETE on Dataset X;
2: Ensemble
3: Transform Input Features:
4: for each hidden layer, l do
   a.
   b. ỹ(
   c.
   d.
5: Calculate the probability score for predicting the class of transaction:
6: Calculate objective function such as Error Function: E(W) is calculated as
7: Predict diabetes for the given feature set.
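The per-layer equations of step 4 are not preserved in this copy, so the following Python sketch is only an illustration of the general idea stated in the abstract, namely hidden layers followed by dropout. It assumes standard inverted dropout, ReLU hidden units, a sigmoid output for the probability score, and binary cross-entropy as the error function E(W). All of these choices, and every function and parameter name, are assumptions and not the authors' formulation.

```python
import math
import random


def dense(x, weights, biases):
    """Affine map: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(z):
    return [max(0.0, v) for v in z]


def dropout(y, rate, rng, training):
    """Inverted dropout: zero each unit with probability `rate` and
    rescale the survivors so the expected activation is unchanged."""
    if not training or rate == 0.0:
        return list(y)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in y]


def forward(x, layers, rate, rng, training):
    """Each hidden layer is followed by dropout; the last layer is a
    single sigmoid unit that yields the probability score."""
    y = list(x)
    for weights, biases in layers[:-1]:
        y = dropout(relu(dense(y, weights, biases)), rate, rng, training)
    out_w, out_b = layers[-1]
    z = dense(y, out_w, out_b)[0]
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(p, target, eps=1e-12):
    """Assumed error function for a binary diabetes / no-diabetes label."""
    p = min(max(p, eps), 1.0 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


# Toy weights and input, chosen only for illustration.
layers = [
    ([[0.5, -0.2], [0.1, 0.4]], [0.0, 0.1]),  # hidden layer (2 units)
    ([[1.0, -1.0]], [0.0]),                   # output layer (1 unit)
]
x = [1.0, 2.0]
rng = random.Random(0)
p_train = forward(x, layers, rate=0.5, rng=rng, training=True)
p_eval = forward(x, layers, rate=0.5, rng=rng, training=False)
print(p_train, p_eval, binary_cross_entropy(p_eval, target=1))
```

Dropout is active only while training; at prediction time (step 7) the full network is used, which is why `training=False` gives a deterministic score.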
ETE
1: if (|
      return SS=True;
2: if all features are constant in F
      return SS=True;
3: if the output is constant in F
      return SS=True;
   else
      return SS=False;
   End if
4: if (SS) is TRUE
      return none;
   else
      a) Select k features {
      b) Take k splits {
      c) return a split s* such that Score(s*, F) = max_{i=1..k} Score(s_i, F);
   End if
Rand_Split(s,f)
1: Draw a random cut point
2: return the split [
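The ETE node-splitting procedure and Rand_Split can be sketched together in Python. Several details are truncated in this copy, so the sketch fills them with labeled assumptions: the first stopping test is taken to be a minimum sample count (`n_min`), the cut point is drawn uniformly between the feature's minimum and maximum in the subset, and Score is taken to be the decrease in Gini impurity for 0/1 labels. None of these is confirmed by the text above, and all names are hypothetical.

```python
import random


def rand_split(rows, feature, rng):
    """Draw a cut point uniformly between the feature's min and max in
    `rows` (assumed distribution) and return the split (feature, cut)."""
    values = [r[feature] for r in rows]
    return feature, rng.uniform(min(values), max(values))


def gini(labels):
    """Gini impurity for binary 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def score(split, rows, labels):
    """Impurity decrease of a split (assumed definition of Score)."""
    feature, cut = split
    left = [y for r, y in zip(rows, labels) if r[feature] < cut]
    right = [y for r, y in zip(rows, labels) if r[feature] >= cut]
    n = len(labels)
    return (gini(labels)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))


def stop_split(rows, labels, n_min):
    """SS is True if the subset is too small (assumed test), all
    features are constant, or the output is constant."""
    too_small = len(rows) < n_min
    constant_features = all(r == rows[0] for r in rows)
    constant_output = len(set(labels)) <= 1
    return too_small or constant_features or constant_output


def split_node(rows, labels, k, n_min, rng):
    """Return the best of up to k random splits, or None if SS is True."""
    if stop_split(rows, labels, n_min):
        return None
    n_features = len(rows[0])
    # Only non-constant features can produce a meaningful split.
    candidates = [f for f in range(n_features)
                  if len({r[f] for r in rows}) > 1]
    chosen = rng.sample(candidates, min(k, len(candidates)))
    splits = [rand_split(rows, f, rng) for f in chosen]
    return max(splits, key=lambda s: score(s, rows, labels))


# Toy subset F: feature 1 is constant, so only feature 0 can be chosen.
rows = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0], [4.0, 5.0]]
labels = [0, 0, 1, 1]
print(split_node(rows, labels, k=2, n_min=2, rng=random.Random(0)))
```

Because cut points are drawn at random instead of being optimized, each tree is cheap to build, and the variety across trees is what the ensemble relies on.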