Nurul Absar, Emon Kumar Das, Shamsun Nahar Shoma, Mayeen Uddin Khandaker, Mahadi Hasan Miraz, M R I Faruque, Nissren Tamam, Abdelmoneim Sulieman, Refat Khan Pathan.
Abstract
Disease is a condition that negatively affects human health. Heart disease (cardiopathy) is one of the most common deadly diseases and, compared with other diseases, is strongly attributed to unhealthy habits. With the help of machine learning (ML) algorithms, heart disease can be detected quickly and at low cost. This study adopted four machine learning models, namely random forest (RF), decision tree (DT), AdaBoost (AB), and K-nearest neighbor (KNN), to detect heart disease. A generalized algorithm was constructed to analyze the strength of the relevant factors that contribute to heart disease prediction. The models were evaluated using the Cleveland, Hungary, Switzerland, and Long Beach (CHSLB) datasets, all collected from Kaggle. On the CHSLB datasets, the RF, AB, DT, and KNN models achieved accuracies of 99.03%, 96.10%, 100%, and 100%, respectively. On the single (Cleveland) dataset, only two models, RF and KNN, showed good accuracy, at 93.478% and 97.826%, respectively. Finally, the study used Streamlit, an internet-based cloud hosting platform, to develop a computer-aided smart system for disease prediction. The proposed tool, together with the ML algorithms, is expected to play a key role in diagnosing heart disease conveniently. Above all, the study contributes a computation of strength scores for the significant predictors in the prognosis of heart disease.
Keywords: AdaBoost; KNN; decision tree; heart disease; prediction; random forest; smart system
Year: 2022 PMID: 35742188 PMCID: PMC9222326 DOI: 10.3390/healthcare10061137
Source DB: PubMed Journal: Healthcare (Basel) ISSN: 2227-9032
Figure 1. The system architecture of the present work.
Two databases were used. One combines four datasets, Cleveland, Hungary, Switzerland, and Long Beach (CHSLB), while the other contains only the Cleveland heart disease dataset. Both databases are described in detail below.
| Sl. No. | Attribute | Type | Range / values |
|---|---|---|---|
| (i) | Age | Integer | 29–77 |
| (ii) | Sex | Integer | male = 1; female = 0 |
| (iii) | Chest pain type | Integer | angina = 0; abnanr = 1; notang = 2; asympt = 3 |
| (iv) | Blood pressure value | Integer | 94–200 |
| (v) | Serum cholesterol | Integer | 126–564 |
| (vi) | Fasting blood sugar | Integer | true = 1; false = 0 |
| (vii) | Resting electrocardiographic results | Integer | 0–2 |
| (viii) | Maximum heart rate | Integer | 71–202 |
| (ix) | Exercise-induced angina | Integer | yes = 1; no = 0 |
| (x) | Old peak | Float | 0.0–6.2 |
| (xi) | Slope of the peak exercise ST segment | Integer | upsloping = 0; flat = 1; downsloping = 2 |
| (xii) | Number of major vessels | Integer | 0–4 |
| (xiii) | Thal | Integer | defect = 6; reversible defect = 7 |
| (xiv) | Coronary heart disease | Integer | present = 1; absent = 0 |
Figure 2. The outliers present in the Cleveland dataset.
Figure 3. The box plots after outlier removal using the IQR in the Cleveland dataset.
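The IQR-based outlier removal referred to in Figures 2 and 3 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the conventional 1.5 × IQR fences, and the function names and the sample cholesterol readings are hypothetical (only the 126–564 range comes from the attribute table).

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    # method="inclusive" interpolates linearly between the observed data points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that lie inside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]


# Hypothetical serum cholesterol readings, not taken from the dataset.
chol = [126, 180, 200, 210, 230, 240, 250, 270, 300, 564]
cleaned = remove_outliers(chol)
print(cleaned)
```

In practice the same filter would be applied column by column to the numeric attributes; whether the study used 1.5 as the multiplier is an assumption here.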
Confusion matrices for evaluating the heart disease detection system on test data from the Cleveland, Hungary, Switzerland, and Long Beach (CHSLB) datasets for the models used.
| Sr. No. | Model (CHSLB datasets) | Actual class | Predicted NO | Predicted YES | Actual total |
|---|---|---|---|---|---|
| 1. | Random Forest (N = 308) | NO | TN = 159 | | 159 |
| | | YES | | | 149 |
| | | Total predicted | 162 | 146 | 308 |
| 2. | AdaBoost (N = 308) | NO | | | 159 |
| | | YES | | | 149 |
| | | Total predicted | 171 | 137 | 308 |
| 3. | Decision Tree (N = 308) | NO | | | 159 |
| | | YES | | | 149 |
| | | Total predicted | 159 | 149 | 308 |
| 4. | KNN (N = 308) | NO | | | 159 |
| | | YES | | | 149 |
| | | Total predicted | 159 | 149 | 308 |
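Because a 2 × 2 confusion matrix is fully determined by one cell plus its row and column totals, the remaining Random Forest cells can be worked out from the values listed above (TN = 159, actual totals 159/149, predicted totals 162/146). The sketch below shows that arithmetic; the function name is hypothetical and the derived FP, FN, and TP values are a reconstruction, not figures quoted from the table.

```python
def cells_from_margins(tn, actual_no, actual_yes, pred_no, pred_yes):
    """Recover FP, FN, and TP from TN plus the row and column totals."""
    fp = actual_no - tn   # actual NO cases predicted as YES
    fn = pred_no - tn     # actual YES cases predicted as NO
    tp = actual_yes - fn  # actual YES cases predicted as YES
    if min(fp, fn, tp) < 0 or fp + tp != pred_yes:
        raise ValueError("margins are inconsistent with the given TN")
    return {"TN": tn, "FP": fp, "FN": fn, "TP": tp}


# Random Forest on the CHSLB test split, using the totals from the table.
rf = cells_from_margins(tn=159, actual_no=159, actual_yes=149,
                        pred_no=162, pred_yes=146)
accuracy = (rf["TN"] + rf["TP"]) / 308
print(rf, f"{accuracy:.2%}")
```

The resulting accuracy agrees with the 99.03% reported for RF on the CHSLB datasets, which is a useful consistency check on the reconstructed cells.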
Confusion matrices for evaluating the heart disease detection system on test data from the Cleveland dataset for the models used.
| Sr. No. | Model (Cleveland dataset) | Actual class | Predicted NO | Predicted YES | Actual total |
|---|---|---|---|---|---|
| 1. | Random Forest (N = 46) | NO | | | 16 |
| | | YES | | | 30 |
| | | Total predicted | 16 | 30 | 46 |
| 2. | AdaBoost (N = 46) | NO | | | 16 |
| | | YES | | | 30 |
| | | Total predicted | 16 | 30 | 46 |
| 3. | Decision Tree (N = 46) | NO | | | 16 |
| | | YES | | | 30 |
| | | Total predicted | 23 | 23 | 46 |
| 4. | KNN (N = 46) | NO | | | 16 |
| | | YES | | | 30 |
| | | Total predicted | 15 | 31 | 46 |
Figure 4. The AUC curve of test data using the CHSLB datasets for the models used.
Figure 5. The AUC curve of test data using the Cleveland dataset for the models used.
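For readers unfamiliar with the quantity plotted in Figures 4 and 5, the area under the ROC curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison; the function name, labels, and scores are hypothetical toy values, not outputs of the models in this study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrong order 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = disease present, scores are predicted probabilities.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

The pairwise form is quadratic in the number of samples, which is acceptable for test sets of the sizes used here (46 and 308).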
Performance metrics for evaluating the heart disease detection system on the CHSLB datasets for the models used.
| Performance metric | RF | AB | DT | KNN |
|---|---|---|---|---|
| Accuracy | 99.03% | 96.10% | 100% | 100% |
| Precision (0) | 0.98 | 0.93 | 1.00 | 1.00 |
| Precision (1) | 1.00 | 1.00 | 1.00 | 1.00 |
| Recall (0) | 1.00 | 1.00 | 1.00 | 1.00 |
| Recall (1) | 0.98 | 0.92 | 1.00 | 1.00 |
| F1-score (0) | 0.99 | 0.96 | 1.00 | 1.00 |
| F1-score (1) | 0.99 | 0.96 | 1.00 | 1.00 |
| MAE | 0.00974 | 0.0389610 | 0.0 | 0.0 |
| R² score | 96.09% | 84.08% | 100% | 100% |
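The metrics in this table all follow from the four confusion-matrix counts. The sketch below computes them for Random Forest on the CHSLB test split, assuming TN = 159, FP = 0, FN = 3, TP = 146; these counts are inferred from the TN value and the row and column totals of the CHSLB confusion matrix, and the function name is hypothetical. For 0/1 labels, MAE reduces to the error rate and R² to one minus the error count divided by the total sum of squares of the true labels.

```python
def binary_metrics(tn, fp, fn, tp):
    """Per-class precision/recall/F1 plus accuracy, MAE, and R² for 0/1 labels."""
    n = tn + fp + fn + tp

    def prf(correct, predicted, actual):
        p = correct / predicted if predicted else 0.0
        r = correct / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    p0, r0, f0 = prf(tn, tn + fn, tn + fp)  # class 0 (disease absent)
    p1, r1, f1 = prf(tp, tp + fp, tp + fn)  # class 1 (disease present)
    errors = fp + fn
    positives = tp + fn
    ss_tot = positives * (n - positives) / n  # sum of (y - mean(y))^2
    return {
        "accuracy": (tn + tp) / n,
        "precision_0": p0, "recall_0": r0, "f1_0": f0,
        "precision_1": p1, "recall_1": r1, "f1_1": f1,
        "mae": errors / n,            # |y - y_hat| is 1 on each error
        "r2": 1 - errors / ss_tot,    # squared error is also 1 on each error
    }


m = binary_metrics(tn=159, fp=0, fn=3, tp=146)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed counts the accuracy, precision, recall, F1, and MAE values round to the RF column above, and R² comes out at about 0.961, in line with the reported 96.09%.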
Performance metrics for evaluating the heart disease detection system on the Cleveland dataset for the models used.
| Performance metric | RF | AB | DT | KNN |
|---|---|---|---|---|
| Accuracy | 93.478% | 91.30% | 71.739% | 97.826% |
| Precision (0) | 0.88 | 0.88 | 0.57 | 1.00 |
| Precision (1) | 0.97 | 0.93 | 0.87 | 0.97 |
| Recall (0) | 0.94 | 0.88 | 0.81 | 0.94 |
| Recall (1) | 0.93 | 0.93 | 0.67 | 1.00 |
| F1-score (0) | 0.91 | 0.88 | 0.67 | 0.97 |
| F1-score (1) | 0.95 | 0.93 | 0.75 | 0.98 |
| MAE | 6.521% | 8.69% | 28.260% | 2.173% |
| R² score | 71.249% | 61.66% | 71.249% | 90.41% |
A comparison of the proposed system’s accuracy with the existing results.
| Sr. No. | Dataset used | RF | AB | DT | KNN |
|---|---|---|---|---|---|
| 1 | CHSLB datasets (1025) (Present study) | 99.03% | 96.10% | 100% | 100% |
| | Cleveland dataset (303) (Present study) | 93.478% | 91.30% | 71.739% | 97.826% |
| 2 | Five-fold in the statlog dataset | 90.46% | - | 96.42% | 96.42% |
| 3 | Cleveland dataset (303) | 75.55% | 90.16% | | |
| 4 | Cleveland dataset (303) | 80% | | | |
| 5 | Armed forces institute of cardiology | 68.6% | 86.6% | | |
| 6 | CHSLB datasets (920) | 80.89% | | | |
| 7 | Kita Hospital Jakarta (450) | 46% | | | |
| 8 | Cleveland dataset (303) | 54.13% | | | |
| 9 | Cleveland dataset (303) | 91.6% | | | |
| 10 | People’s Hospital dataset | 97% | | | |
| 11 | Northern Lebanon | 97.7% | | | |
| 12 | Cleveland dataset (303) | 84% | - | 79% | 87% |
Figure 6. The accuracy performance graph for the Cleveland dataset.
Figure 7. The accuracy performance graph for the CHSLB datasets.
Figure 8. Real-time web-based smart system for heart disease prediction.