K Karthick, S K Aruna, Ravi Samikannu, Ramya Kuppusamy, Yuvaraja Teekaraman, Amruth Ramesh Thelkar.
Abstract
Cardiovascular disease prediction helps practitioners make more accurate health decisions for their patients. Early detection can help people make lifestyle changes and, where necessary, receive effective medical care. Machine learning (ML) is a plausible approach to understanding and reducing the symptoms of heart disease. A chi-square statistical test is used to select attributes from the Cleveland heart disease (HD) dataset. Support vector machine (SVM), Gaussian Naive Bayes, logistic regression, LightGBM, XGBoost, and random forest classifiers were used to develop a heart disease risk prediction model, obtaining test accuracies of 80.32%, 78.68%, 80.32%, 77.04%, 73.77%, and 88.5%, respectively. Data visualizations were generated to illustrate the relationships between the features. In the experiments, the random forest algorithm achieved the highest accuracy, 88.5% during validation, on the 303 data instances and 13 selected features of the Cleveland HD dataset.
Year: 2022 PMID: 35547562 PMCID: PMC9085310 DOI: 10.1155/2022/6517716
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.809
UCI ML repository's Cleveland heart disease dataset—feature subset [24].
| Attribute name | Attribute description |
|---|---|
| Age | Age in years |
| Sex | 1 denotes male and 0 denotes female |
| CP | Chest pain type: 1, typical angina; 2, atypical angina; 3, nonanginal pain; 4, asymptomatic |
| trestbps | Resting blood pressure (in mmHg at entry to the health center) |
| chol | Serum cholesterol in mg/dL |
| fbs | 1 denotes true, i.e., the fasting blood sugar level > 120 mg/dL; 0 denotes false |
| restecg | Resting ECG results: 0, normal; 1, ST-T wave abnormality; and 2, probable or definite left ventricular hypertrophy |
| thalach | Maximum heart rate achieved |
| exang | Exercise-induced angina (1 = yes; 0 = no) |
| oldpeak | ST depression induced by exercise relative to rest |
| slope | The slope of the peak exercise ST segment (1, 2, and 3): 1, upsloping; 2, flat; and 3, downsloping |
| ca | Number of major vessels (0-3) colored by fluoroscopy |
| thal | Thalassemia: 3 = normal, 6 = fixed defect, and 7 = reversible defect |
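The abstract states that a chi-square test is used to select attributes, but this record does not say which implementation was used. As a minimal sketch only, the textbook Pearson chi-square statistic for independence between one categorical attribute and a binary disease label can be computed as follows. The function name `chi_square_statistic` is hypothetical, and the ten-row columns are invented for illustration; they are not values from the Cleveland dataset.

```python
from collections import Counter


def chi_square_statistic(feature, target):
    """Pearson chi-square statistic and degrees of freedom for two
    categorical sequences of equal length (hypothetical helper)."""
    n = len(feature)
    observed = Counter(zip(feature, target))  # joint counts per (feature, target) cell
    row_totals = Counter(feature)
    col_totals = Counter(target)

    statistic = 0.0
    for f_value, row_count in row_totals.items():
        for t_value, col_count in col_totals.items():
            expected = row_count * col_count / n  # count expected under independence
            statistic += (observed[(f_value, t_value)] - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Invented toy columns (NOT taken from the Cleveland data).
disease = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
toy_features = {
    "exang": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "fbs":   [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
}

# Score every attribute, then rank from most to least associated with the label.
scores = {name: chi_square_statistic(col, disease)[0] for name, col in toy_features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name}: chi2 = {scores[name]:.3f}")
```

A larger statistic indicates a stronger association with the label, so a selection step would keep the top-ranked attributes. Continuous attributes such as chol or thalach would need to be binned before this statistic applies.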
Statistical outline of subset attributes.
| Attributes | Age | Sex | CP | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | 54.44 | 0.68 | 3.16 | 131.69 | 246.69 | 0.15 | 0.99 | 149.61 | 0.33 | 1.04 | 1.60 | 0.66 | 4.70 | 0.94 |
| std | 9.04 | 0.47 | 0.96 | 17.60 | 51.78 | 0.36 | 0.99 | 22.88 | 0.47 | 1.16 | 0.62 | 0.93 | 1.97 | 1.23 |
| min | 29 | 0 | 1 | 94 | 126 | 0 | 0 | 71 | 0 | 0 | 1 | 0 | 0 | 0 |
| 25% | 48 | 0 | 3 | 120 | 211 | 0 | 0 | 133.5 | 0 | 0 | 1 | 0 | 3 | 0 |
| 50% | 56 | 1 | 3 | 130 | 241 | 0 | 1 | 153 | 0 | 0.8 | 2 | 0 | 3 | 0 |
| 75% | 61 | 1 | 4 | 140 | 275 | 0 | 2 | 166 | 1 | 1.6 | 2 | 1 | 7 | 2 |
| max | 77 | 1 | 4 | 200 | 564 | 1 | 2 | 202 | 1 | 6.2 | 3 | 3 | 7 | 4 |
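The rows of the table above (mean, std, min, quartiles, max) follow the usual layout of a descriptive summary. A rough sketch of how one column of such a summary could be computed with the Python standard library is shown below. The helper name `summarize` and the nine ages are invented for illustration; the sample standard deviation and linearly interpolated quartiles are assumptions about the conventions behind the table, not something this record confirms.

```python
import statistics


def summarize(values):
    """Return mean, sample std, min, quartiles, and max for one numeric column
    (hypothetical helper)."""
    ordered = sorted(values)
    # method="inclusive" interpolates linearly between the observed min and max.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "mean": statistics.mean(ordered),
        "std": statistics.stdev(ordered),  # sample standard deviation (n - 1)
        "min": ordered[0],
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": ordered[-1],
    }


# Invented ages, NOT the 303 Cleveland records.
toy_ages = [29, 41, 48, 54, 56, 58, 61, 63, 77]
age_summary = summarize(toy_ages)

for key, value in age_summary.items():
    print(f"{key}: {value:.2f}")
```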
Figure 1: Visualization of features of the Cleveland heart dataset.
Figure 2: Heat map of subset attributes.
Figure 3: Distribution of various attributes.
Figure 4: Subset attribute correlation.
Figure 5: Pair plot.
Classification model—prediction accuracy.
| Machine learning classifier | Training set (80%) accuracy (%) | Test set (20%) accuracy (%) |
|---|---|---|
| SVM | 92.56 | 80.32 |
| Gaussian Naive Bayes | 86.77 | 78.68 |
| Logistic regression | 85.95 | 80.32 |
| LightGBM | 98.76 | 77.04 |
| XGBoost | 99.58 | 73.77 |
| Random forest | 100 | 88.5 |
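The percentages in the table can be turned back into counts as a sanity check. The sketch below assumes that the 20% test split of the 303 instances rounds up to 61 instances, leaving 242 for training; the record does not state the exact split sizes, so both numbers are assumptions. It also computes the gap between training and test accuracy for each classifier. The names `REPORTED`, `implied_correct`, and `accuracy_summary` are illustrative.

```python
import math

N_TOTAL = 303
N_TEST = math.ceil(0.20 * N_TOTAL)  # assumed rounding of the 20% split: 61
N_TRAIN = N_TOTAL - N_TEST          # 242 under the same assumption

# (training accuracy %, test accuracy %) as reported in the table above.
REPORTED = {
    "SVM": (92.56, 80.32),
    "Gaussian Naive Bayes": (86.77, 78.68),
    "Logistic regression": (85.95, 80.32),
    "LightGBM": (98.76, 77.04),
    "XGBoost": (99.58, 73.77),
    "Random forest": (100.0, 88.5),
}


def implied_correct(accuracy_percent, n_samples):
    """Nearest whole number of correct predictions implied by a percentage."""
    return round(accuracy_percent / 100 * n_samples)


accuracy_summary = {}
for name, (train_acc, test_acc) in REPORTED.items():
    accuracy_summary[name] = {
        "train_correct": implied_correct(train_acc, N_TRAIN),
        "test_correct": implied_correct(test_acc, N_TEST),
        "gap": round(train_acc - test_acc, 2),  # percentage points
    }

# Classifier with the largest drop from training to test accuracy.
largest_gap = max(accuracy_summary, key=lambda name: accuracy_summary[name]["gap"])

for name, row in accuracy_summary.items():
    print(f"{name:22s} train {row['train_correct']}/{N_TRAIN}  "
          f"test {row['test_correct']}/{N_TEST}  gap {row['gap']:.2f}")
print("Largest train/test gap:", largest_gap)
```

Under these assumptions each test instance is worth about 1.64 percentage points, so the differences between several classifiers on the test set amount to only one or two predictions. A large gap between training and test accuracy is commonly read as a sign of overfitting.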
Figure 6: Prediction accuracy of the ML classification models.
Figure 7: Confusion matrix.
Figure 8: ROC curves showing the performance of the classification models.