Robinson Spencer, Fadi Thabtah, Neda Abdelhamid, Michael Thompson.
Abstract
Machine learning has been used successfully to improve the accuracy of computer-aided diagnosis systems. This paper experimentally assesses the performance of models derived by machine learning techniques when they are trained on relevant features chosen by various feature-selection methods. Four commonly used heart disease datasets were evaluated using principal component analysis, Chi-squared testing, ReliefF and symmetrical uncertainty to create distinctive feature sets. A variety of classification algorithms were then used to create models, which were compared to find the optimal feature combinations for improving the correct prediction of heart conditions. We found that the benefits of feature selection vary depending on the machine learning technique used for the heart datasets we consider. However, the best model we created combined Chi-squared feature selection with the BayesNet algorithm and achieved an accuracy of 85.00% on the considered datasets.
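To make the Chi-squared feature selection step concrete, the following is a minimal sketch of how a Pearson Chi-squared statistic could be used to rank nominal features against the class label. It is not the authors' implementation: the function names `chi_squared` and `rank_features` and the toy records are hypothetical, and the sketch assumes features are already nominal (numerical features such as `age` or `chol` would need to be discretized first).

```python
from collections import Counter


def chi_squared(feature, target):
    """Pearson Chi-squared statistic between two equal-length nominal sequences."""
    n = len(feature)
    observed = Counter(zip(feature, target))   # joint counts per (feature value, class)
    feature_totals = Counter(feature)          # marginal counts per feature value
    target_totals = Counter(target)            # marginal counts per class
    stat = 0.0
    for f_value, f_count in feature_totals.items():
        for t_value, t_count in target_totals.items():
            expected = f_count * t_count / n   # count expected under independence
            stat += (observed[(f_value, t_value)] - expected) ** 2 / expected
    return stat


def rank_features(columns, target):
    """Return feature names ordered from highest to lowest Chi-squared score."""
    scores = {name: chi_squared(values, target) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy, made-up records for illustration only (not taken from the paper's datasets).
toy_target = [1, 1, 1, 1, 0, 0, 0, 0]
toy_columns = {
    "exang": [1, 1, 1, 0, 0, 0, 0, 0],
    "sex": [1, 0, 1, 0, 1, 0, 1, 0],
}
ranking = rank_features(toy_columns, toy_target)
```

In this toy example `exang` is strongly associated with the class while `sex` is independent of it, so `exang` ranks first. A feature set would then be formed by keeping the top-ranked features, with the cut-off being a design choice the abstract does not specify.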
Keywords: Classification; data analysis; feature selection; heart disease; machine learning; prediction
Year: 2020 PMID: 32284873 PMCID: PMC7133070 DOI: 10.1177/2055207620914777
Source DB: PubMed Journal: Digit Health ISSN: 2055-2076
Figure 1. Experimental approach.
Data features.
| Feature | Description | Type | Values |
|---|---|---|---|
| age | Age in years | Numerical | 28–77, mean: 51.9 |
| sex | Gender | Nominal | 0 = female (188); 1 = male (532) |
| cp | Chest pain type | Nominal | 1 = typical angina (38); 2 = atypical angina (160); 3 = non-anginal pain (157); 4 = asymptomatic (365) |
| trestbps | Resting blood pressure in mmHg | Numerical | 80–200, mean: 131.8; missing values (2) |
| chol | Serum cholesterol in mg/dl | Numerical | 0–603, mean: 204; missing values (23) |
| fbs | Fasting blood sugar >120 mg/dl | Nominal | 0 = false (567); 1 = true (70) |
| restecg | Resting electrocardiographic results | Nominal | 0 = normal (471); 1 = having ST-T wave abnormality (86); 2 = showing probable left ventricular hypertrophy (161); missing values (2) |
| thalach | Maximum heart rate achieved | Numerical | 60–202, mean: 140.6; missing values (2) |
| exang | Exercise induced angina | Nominal | 0 = no (476); 1 = yes (242); missing values (2) |
| oldpeak | ST depression induced by exercise relative to rest | Numerical | −2.6–6.2, mean: 0.8; missing values (6) |
| slope | The slope of the peak exercise ST segment | Nominal | 1 = upsloping (187); 2 = flat (292); 3 = downsloping (34); missing values (207) |
| ca | Number of major vessels colored by fluoroscopy | Nominal | 0 (179); 1 (67); 2 (41); 3 (20); missing values (413) |
| thal | Heart rate | Nominal | 3 = normal (192); 6 = fixed defect (38); 7 = reversible defect (170); missing values (320) |
| target | The predicted class: if the patient has heart disease | Nominal | 0 = heart disease not present (360); 1 = heart disease present (360) |
Feature sets created by feature selection methods.
| Heart-ChiSq dataset | Heart-Ref dataset | Heart-SyUn dataset |
|---|---|---|
| cp | cp | cp |
| exang | sex | exang |
| oldpeak | thal | chol |
| chol | ca | oldpeak |
| thalach | chol | thal |
| thal | exang | thalach |
| sex | restecg | sex |
| age | slope | age |
| ca | oldpeak | ca |
| slope | | |
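The Heart-SyUn feature set is derived with symmetrical uncertainty, which normalizes the mutual information between a feature and the class by their entropies: SU(X, Y) = 2 · I(X; Y) / (H(X) + H(Y)). The sketch below computes this score for nominal values. It illustrates the standard formula only and is not the authors' code: the function names are hypothetical, and numerical features are assumed to have been discretized beforehand.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of nominal values."""
    n = len(values)
    return -sum((count / n) * log2(count / n) for count in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), ranging from 0 to 1."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        # Both sequences are constant, so there is no uncertainty to share.
        return 0.0
    h_xy = entropy(list(zip(x, y)))   # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy    # I(X; Y) = H(X) + H(Y) - H(X, Y)
    return 2.0 * mutual_info / (h_x + h_y)
```

A score of 1 means the feature fully determines the class (and vice versa), while 0 means the two are independent. Features would be ranked by this score and the highest-scoring ones retained.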
Figure 2. Accuracy across models.
Figure 3. Precision across models.
Figure 4. Recall across models.
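For reference, the three metrics reported in Figures 2 to 4 can be computed from true and predicted labels as sketched below. This is an illustrative sketch using the standard definitions, with class 1 (heart disease present) treated as the positive class. The function names and the toy labels are hypothetical and do not reproduce any result from the paper.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for binary labels, given the positive class value."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of true positives that the model finds."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy, made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
```

Because the target class in the data features table is balanced (360 records per class), accuracy is a reasonable headline metric here, while precision and recall show how the errors split between false alarms and missed cases.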