Syed Immamul Ansarullah, Syed Mohsin Saif, Syed Abdul Basit Andrabi, Sajadul Hassan Kumhar, Mudasir M Kirmani, Pradeep Kumar.
Abstract
Heart disease is a severe disorder that imposes an adverse burden on all societies and leads to prolonged suffering and disability. We developed a risk evaluation model based on visible, low-cost, significant noninvasive attributes using hyperparameter optimization of machine learning techniques. A set of risk attributes was selected and ranked with the recursive feature elimination technique, and the rank and value assigned to each attribute were validated and approved by medical domain experts. We tested the gains from applying optimized decision tree, k-nearest neighbor, random forest, and support vector machine classifiers to the risk attributes. Experimental results show that the optimized random forest risk model outperforms the other models, with the highest sensitivity, specificity, precision, accuracy, and AUROC score and the minimum misclassification rate. Compared with prevailing research, the model surpasses existing risk assessment models in predictive accuracy. The model is applicable in rural areas where people lack adequate primary healthcare services and face barriers to integrated elementary healthcare for initial prediction. Although this research develops a low-cost risk evaluation model, additional research is needed to incorporate newly identified discoveries about the disease.
Year: 2022 PMID: 35449846 PMCID: PMC9018172 DOI: 10.1155/2022/9882288
Source DB: PubMed Journal: J Healthc Eng ISSN: 2040-2295 Impact factor: 3.822
Figure 1 World map showing the global distribution of heart disease mortality rates [11].
Contributions of previous studies on heart disease prediction using machine learning techniques.
| Ref | Year | Dataset | Results | Contribution | Future work | Limitations |
|---|---|---|---|---|---|---|
| [ | 2007 | UCI heart disease dataset: 13 attributes, 270 instances | Weighted K-NN / 87% accuracy | Proposed a cardiac arrhythmia model using K-NN-weighted preprocessing and the fuzzy allocation mechanism of an artificial immune recognition system | Enhance the model by applying SVM, decision tree, and hybrid techniques | (i) The model identifies only one type of heart disease; (ii) the small dataset is not ideal for stable performance and yields biased measurements |
| [ | 2008 | Cleveland heart disease dataset: 909 instances, 15 medical risk features | Naive Bayes / 95% accuracy | Developed a risk model using decision tree, neural network, and naive Bayes algorithms | Improve model performance by training on larger data | (i) Small dataset with limited instances; (ii) overfitting on a small dataset leads to poor performance on test data |
| [ | 2009 | Live dataset collected from patients with heart disease | Bagging with naive Bayes / 84.1% accuracy | Developed a risk model using C4.5, bagging with naive Bayes, and bagging with C4.5 | Develop a robust heart disease risk model using Python or R | Weka handles only small datasets; on datasets larger than a few megabytes an OutOfMemoryError occurs |
| [ | 2010 | UCI data repository | 82% accuracy | (i) Developed a fuzzy expert risk model; (ii) generated 44 rules, compared against other rule bases | Build an ensemble model for a more robust, optimal, and efficient system | The rules derived from the cardiovascular disease dataset are large and complex, slowing the system and causing wrong decisions |
| [ | 2011 | Cleveland heart disease dataset: 909 instances, 15 medical risk features | 79.1% accuracy with voting and 84.1% without voting | (i) Developed a classification model using a decision tree; (ii) generated heart disease rules using reduced error pruning | Develop a risk evaluation model using hybrid classification techniques for optimal results | The developed heart disease evaluation models lack generalization ability |
| [ | 2012 | Cleveland heart disease dataset | K-NN / 97% accuracy | Developed a K-NN model for early prediction of heart disease | Work on a primary heart disease dataset of considerable volume | Voting did not improve precision even after evaluating different values of k |
| [ | 2013 | Random dataset of 303 instances | 79.54% accuracy for LAD stenosis | (i) Applied C4.5 and bagging to lab and ECG data; (ii) selected essential features using the Gini index and information gain | Work on a primary heart disease dataset with massive instances using hybrid learning techniques | The model uses invasive risk features, making it difficult for general users and limiting its use to the medical domain |
| [ | 2014 | Cleveland, Hungarian, and Switzerland datasets | 80% and 42% accuracies on the Switzerland and Hungarian datasets, respectively | (i) Proposed a model combining rough set theory with fuzzy sets; (ii) generated fuzzy base rules using the rough set approach and the fuzzy classifier | Develop a risk evaluation model with lower computational complexity and high predictive capability | Only medical domain performance measures are used; model measures (computational complexity, scalability, robustness, and comprehensibility) are not tested |
| [ | 2015 | Five binary-class medical datasets from the UCI repository | Optimal results | Applied k-means clustering first, then built the final model from 12 distinct classifiers using stratified 10-fold cross-validation | Develop a risk model with the best generalization ability and lower computational complexity | (i) The developed risk model is complex and computationally expensive; (ii) Weka handles only small datasets, and an OutOfMemoryError occurs on large datasets |
| [ | 2016 | Cleveland heart disease dataset | Accuracy improves when discrete features are controlled via feature selection | Applied the sequential minimal optimization algorithm to develop the risk model in MATLAB | Extend the model using real-time datasets for earlier, more accurate diagnosis | (i) High time complexity; (ii) overfitting on a small dataset leads to poor performance on test data |
| [ | 2017 | Z-Alizadeh Sani heart disease dataset | Accuracy 84%, sensitivity 85%, specificity 89% | Proposed a hybrid model using error back-propagation in an ANN with an MLP structure and sigmoid activation function | Develop a one-size-fits-all heart disease model to successfully prescribe treatment plans | Error from hidden neurons at the output nodes degrades the network's logic, causing wrong predictions and decisions |
| [ | 2019 | Review paper | Analyzed security models, trends, opportunities, and challenges for future IoT-based healthcare development | (i) Reviewed the latest IoT components, applications, and healthcare market trends; (ii) analyzed how cloud computing, big data, and wearables support sustainable development of IoT and cloud computing in healthcare | Address the challenges hindering IoT and cloud computing in healthcare, such as data security, system development processes, and business models | IoT privacy and security issues (potential threats, attack types, and security setups) are not reviewed |
| [ | 2020 | Unknown | MSSO-ANFIS / accuracy 99.4%, precision 96.54% | Proposed an Internet of Medical Things framework using modified salp swarm optimization and an adaptive neuro-fuzzy inference system for heart disease prediction | Use different feature selection and optimization techniques to improve predictive effectiveness | The prediction model is complex and expensive because of medical attribute examination and IoT use |
| [ | 2020 | Live dataset | MDCNN / 98.2% accuracy | Proposed a wearable IoT-enabled framework that evaluates heart disease using a modified deep convolutional neural network (MDCNN), which classifies the received sensor data as normal or abnormal | (i) Increase performance using other feature selection and optimization techniques; (ii) train the model with fully wearable devices available in the market | The risk model is complex and expensive because of medical attribute examinations and IoT use |
| [ | 2020 | Live dataset | Avg. correlation coefficient 0.045; encryption time 1.032 s; decryption time 1.004 s | Proposed a secure framework using wearable sensor devices that monitor blood pressure, body temperature, serum cholesterol, glucose level, etc. | Capture data from the wearable sensors and perform real-time analysis | The framework is complex because of the IoT components involved |
| [ | 2021 | Wireless body area network (WBAN) framework | Execution time, memory, and energy consumption of the developed WBAN are optimal | Proposed a three-tier security model for WBAN systems suitable for e-health applications | Incorporate further security solutions while keeping execution time, memory, and energy consumption competitive | (i) Lightweight cryptography is used instead of robust crypto-algorithms; (ii) full comparison with other methods is difficult owing to differing security services, device types, and security levels |
| [ | 2022 | Primary heart disease dataset of 5776 records | Random forest / 85% accuracy | Developed an effective, low-cost, reliable risk evaluation model using significant noninvasive risk attributes | (i) Enhance the risk model by adding other noninvasive features; (ii) investigate deep learning and study the significance of other controlled features across age and sex groups | The risk model is developed on a specific population, narrowing its application |
Figure 2 Description of the SEMMA data mining methodology [31].
Risk attribute ranking using the recursive feature elimination technique.
| Risk attribute | Value | Rank | VIF |
|---|---|---|---|
| Age | True | 1 | 1.1 |
| Sex | False | 2 | 1.4 |
| Height | True | 1 | 1.1 |
| Weight | True | 1 | 1.1 |
| Systolic BP | True | 1 | 1.3 |
| Diastolic BP | True | 1 | 1.1 |
| Hereditary | True | 1 | 1.2 |
| Unhealthy diet | True | 1 | 1.1 |
| Physical activity | True | 1 | 1.2 |
| Alcohol consumption | False | 2 | 1.5 |
| Smoking | False | 2 | 1.7 |
| Socioeconomic level | False | 1 | 1.3 |
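A ranking of this shape can be reproduced in outline with scikit-learn's `RFE`, with the variance inflation factor computed directly from its regression definition. This is a sketch, not the paper's code: the data below is a synthetic stand-in (the primary dataset is not public), so the resulting ranks and VIF values are illustrative only.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Attribute names follow the table; the data is a synthetic stand-in.
attributes = ["age", "sex", "height", "weight", "systolic_bp", "diastolic_bp",
              "hereditary", "unhealthy_diet", "physical_activity",
              "alcohol_consumption", "smoking", "socioeconomic_level"]
rng = np.random.default_rng(0)
X = rng.normal(size=(500, len(attributes)))
y = rng.integers(0, 2, size=500)

# Recursive feature elimination: repeatedly drop the weakest attribute;
# kept attributes receive rank 1, eliminated ones rank 2, 3, ...
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=9)
rfe.fit(X, y)
ranking = dict(zip(attributes, rfe.ranking_))

def vif(X, j):
    """Variance inflation factor of column j: 1 / (1 - R^2) from
    regressing column j on the remaining columns (with an intercept)."""
    target = X[:, j]
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(others, target, rcond=None)
    resid = target - others @ beta
    r2 = 1 - (resid ** 2).sum() / ((target - target.mean()) ** 2).sum()
    return 1.0 / (1.0 - r2)

vifs = {name: vif(X, j) for j, name in enumerate(attributes)}
```

A VIF near 1 indicates little multicollinearity, consistent with the low values in the table.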
Figure 3 Single cross-validation technique for hyperparameter optimization.
Experimental results of the optimized decision tree model.
| Max depth | Min samples split | Min samples leaf | Max features | Criterion | Accuracy |
|---|---|---|---|---|---|
| 10 | 18 | 15 | Auto | Entropy | 81% |
| 15 | 25 | 12 | Auto | Gini | 82% |
| 18 | 11 | 10 | Sqrt | Gini | 72% |
| 20 | 15 | 20 | Sqrt | Gini | 83% |
| 25 | 10 | 50 | Auto | Entropy | 71% |
| 30 | 12 | 30 | Auto | Entropy | 74% |
| 35 | 22 | 25 | Sqrt | Entropy | 75% |
| 40 | 8 | 14 | Sqrt | Gini | 78% |
| 45 | 5 | 16 | Auto | Entropy | 73% |
| 50 | 14 | 0 | Auto | Gini | 78% |
| 70 | 17 | 18 | Auto | Entropy | 75% |
| 80 | 13 | 0 | Auto | Entropy | 84% |
| 100 | 20 | 0 | Sqrt | Entropy | 84% |
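A search over these decision tree hyperparameters can be sketched with scikit-learn's `GridSearchCV`. The grid values below echo the table but are illustrative, the data is synthetic, and two constraints of current scikit-learn are worth noting: `min_samples_leaf` must be at least 1 (the table lists 0), and recent versions use `"sqrt"`/`None` for `max_features` rather than `"auto"`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the primary dataset.
X, y = make_classification(n_samples=400, n_features=12, random_state=0)

# Illustrative grid echoing the table's ranges.
param_grid = {
    "max_depth": [10, 20, 80, 100],
    "min_samples_split": [8, 13, 20],
    "min_samples_leaf": [1, 10, 15],   # scikit-learn requires >= 1
    "max_features": ["sqrt", None],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X, y)
best_params, best_accuracy = search.best_params_, search.best_score_
```

Cross-validated accuracy, rather than a single train/test split, guards against the overfitting noted in the survey table for small datasets.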
Experimental results of the optimized K-NN model.
| Leaf size | Metric | Neighbors | Weights | Accuracy |
|---|---|---|---|---|
| 5 | Euclidean | 5 | Distance | 82% |
| 10 | Minkowski | 11 | Uniform | 67% |
| 30 | City block | 13 | Distance | 85% |
| 25 | Euclidean | 9 | Distance | 70% |
| 15 | Minkowski | 7 | Uniform | 72% |
| 20 | City block | 11 | Uniform | 68% |
| 12 | Euclidean | 15 | Distance | 75% |
| 16 | Minkowski | 13 | Uniform | 77% |
| 18 | Minkowski | 7 | Uniform | 80% |
| 28 | Euclidean | 9 | Distance | 82% |
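The K-NN settings in the table map onto `KNeighborsClassifier` parameters; the table's "city block" metric is scikit-learn's `"manhattan"` (L1) distance. The sweep below is a sketch on synthetic data with an illustrative grid, not the paper's exact search.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the primary dataset.
X, y = make_classification(n_samples=400, n_features=12, random_state=0)

param_grid = {
    "leaf_size": [5, 15, 30],
    "metric": ["euclidean", "minkowski", "manhattan"],  # manhattan = city block
    "n_neighbors": [5, 9, 13],
    "weights": ["uniform", "distance"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
```

Note that `leaf_size` only tunes the speed of the internal ball/KD tree and does not change predictions, so accuracy differences across leaf sizes in such a table come from the other parameters.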
Experimental results of the optimized SVM model.
| Kernel | Gamma | Regularization | Accuracy |
|---|---|---|---|
| Linear | 0.001 | 0.11 | 71% |
| Sigmoid | 0.1 | 1.0 | 70% |
| Sqrt | 0.00001 | 0.001 | 68% |
| Rbf | 0.1 | 1.0 | 81% |
| Linear | 0.001 | 0.001 | 72% |
| Rbf | 0.0001 | 0.1 | 80% |
| Linear | 0.01 | 0.10 | 73% |
| Rbf | 0.0011 | 0.0001 | 78% |
| Sqrt | 0.0001 | 0.010 | 75% |
| Sqrt | 0.1 | 0.11 | 76% |
| Sigmoid | 0.01 | 1.0 | 74% |
| Linear | 0.0001 | 1.0 | 71% |
| Sigmoid | 0.010 | 0.11 | 77% |
| Rbf | 0.11 | 0.0001 | 69% |
| Sqrt | 0.10 | 0.001 | 73% |
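An equivalent SVM sweep can be sketched with `SVC` and cross-validation. One caveat: "sqrt" in the table is not a standard `SVC` kernel (the supported string kernels are linear, poly, rbf, and sigmoid), so the sketch below, on synthetic data with illustrative parameter values, sweeps only the three standard kernels that also appear in the table.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the primary dataset.
X, y = make_classification(n_samples=300, n_features=12, random_state=0)

# Sweep kernel, gamma, and the regularization strength C.
scores = {}
for kernel, gamma, C in product(["linear", "rbf", "sigmoid"],
                                [1e-4, 1e-3, 1e-2, 1e-1],
                                [0.001, 0.1, 1.0]):
    model = SVC(kernel=kernel, gamma=gamma, C=C)
    scores[(kernel, gamma, C)] = cross_val_score(model, X, y, cv=5).mean()

best_config = max(scores, key=scores.get)
```

For the linear kernel, `gamma` has no effect, so only `C` differentiates those rows.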
Experimental results of the optimized random forest model.
| Criterion | Max depth | Max features | Min samples split | Min samples leaf | Accuracy |
|---|---|---|---|---|---|
| Gini | 70 | 0 | 0 | 0 | 85% |
| Entropy | 60 | Auto | 0 | 0 | 86% |
| Gini | 50 | Auto | 100 | 0 | 87% |
| Entropy | 80 | Auto | 100 | 100 | 73% |
| Gini | 100 | Auto | 100 | 50 | 76% |
| Entropy | 30 | 0 | 80 | 60 | 80% |
| Gini | 40 | 0 | 90 | 40 | 78% |
| Gini | 25 | Auto | 70 | 30 | 75% |
| Entropy | 20 | Auto | 40 | 25 | 82% |
| Entropy | 35 | Auto | 30 | 20 | 81% |
| Gini | 45 | 0 | 60 | 35 | 80% |
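For the larger random forest space, randomized search is a common alternative to an exhaustive grid. The sketch below uses synthetic data and illustrative values echoing the table; as with the decision tree, scikit-learn requires `min_samples_leaf >= 1` and `min_samples_split >= 2`, so the table's zeros map to these minimums here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the primary dataset.
X, y = make_classification(n_samples=400, n_features=12, random_state=0)

param_dist = {
    "criterion": ["gini", "entropy"],
    "max_depth": [20, 50, 70, 100],
    "max_features": ["sqrt", None],
    "min_samples_split": [2, 40, 100],
    "min_samples_leaf": [1, 20, 50],
}
search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_dist, n_iter=10, cv=5, random_state=0)
search.fit(X, y)
```

Randomized search samples a fixed number of configurations (`n_iter`) instead of evaluating every combination, trading exhaustiveness for a large reduction in fitting time.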
Performance measures of the developed optimized heart disease models.
| Models | TPR | TNR | Accuracy | Precision | Error rate | AUROC |
|---|---|---|---|---|---|---|
| Decision tree | 83% | 80% | 82% | 82% | 5% | 82% |
| K-NN | 87% | 81% | 84% | 83% | 15% | 85% |
| SVM | 80% | 82% | 82% | 86% | 18% | 82% |
| Random forest | 87% | 84% | 87% | 86% | 13% | 87% |
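The measures in this table derive from the confusion matrix in the standard way: TPR (sensitivity), TNR (specificity), precision, accuracy, the error rate as the complement of accuracy, and AUROC from the predicted scores. The sketch below uses small illustrative label/score vectors, not the paper's data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Illustrative ground truth, hard predictions, and predicted scores.
y_true  = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred  = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])
y_score = np.array([.9, .8, .7, .4, .2, .3, .6, .1, .85, .25])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)                       # sensitivity / recall
tnr = tn / (tn + fp)                       # specificity
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
error_rate = 1 - accuracy                  # misclassification rate
auroc = roc_auc_score(y_true, y_score)     # threshold-free ranking quality
```

AUROC is computed from the continuous scores rather than the thresholded predictions, which is why it can differ from accuracy for the same model.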
Figure 4 Combined AUROC of the optimized risk evaluation models.