Bayu Adhi Tama, Sun Im, Seungchul Lee.
Abstract
Coronary heart disease (CHD) is a severe health issue and one of the most common types of heart disease. It is the most frequent cause of mortality worldwide, largely due to unhealthy lifestyles. Because a heart attack can occur without any apparent symptoms, an intelligent detection method is indispensable. In this article, a new CHD detection method based on a machine learning technique, i.e., classifier ensembles, is proposed. A two-tier ensemble is built, where several ensemble classifiers are exploited as base classifiers of another ensemble. A stacked architecture is designed to blend the class label predictions of three ensemble learners, i.e., random forest, gradient boosting machine, and extreme gradient boosting. The detection model is evaluated on multiple heart disease datasets, i.e., Z-Alizadeh Sani, Statlog, Cleveland, and Hungarian, corroborating the generalisability of the proposed model. A particle swarm optimization-based feature selection is carried out to choose the most significant feature set for each dataset. Finally, a two-fold statistical test is adopted to justify the hypothesis, demonstrating that the observed performance differences between classifiers do not rest on distributional assumptions. The proposed method outperforms all base classifiers in the ensemble under 10-fold cross-validation, and the detection model performs better than existing models based on traditional classifier ensembles and individual classifiers in terms of accuracy, F1, and AUC. This study demonstrates that the proposed model makes a considerable contribution relative to prior published studies in the literature.
Year: 2020 PMID: 32420387 PMCID: PMC7201579 DOI: 10.1155/2020/9816142
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Summary of existing methods for CHD prediction, in chronological order.
| Study | Technique | Feature selection | Validation method | Dataset |
|---|---|---|---|---|
| Ozcift and Gulten [ | Rotation forest | No | 10CV | Cleveland |
| Muthukaruppan and Er [ | Fuzzy expert system | No | Hold-out | Cleveland |
| Nahar et al. [ | SMO | Yes | 10CV | Cleveland |
| Alizadehsani et al. [ | Bagging-C4.5 | Yes | 10CV | Z-Alizadeh Sani |
| Alizadehsani et al. [ | SMO | Yes | 10CV | Z-Alizadeh Sani |
| Alizadehsani et al. [ | SVM | Yes | 10CV | Z-Alizadeh Sani |
| Verma et al. [ | MLP | Yes | 10CV | Cleveland, IGMC |
| Qin et al. [ | EA-MFS | Yes | 10CV | Z-Alizadeh Sani |
| Arabasadi et al. [ | Hybrid NN-GA | Yes | 10CV | Z-Alizadeh Sani, Cleveland, Hungarian, Long-beach-va, and Switzerland |
| Haq et al. [ | SVM | Yes | 10CV | Cleveland |
| Dwivedi [ | Logistic regression | No | 10CV | Statlog |
| Ahmadi et al. [ | NN | Yes | Hold-out | Cleveland |
| Abdar et al. [ | SVM | Yes | 10CV | Z-Alizadeh Sani |
| Raza [ | Voting ensemble | No | 10CV | Statlog |
| Amin et al. [ | Voting ensemble | Yes | 10CV | Cleveland, Statlog |
| Mohan et al. [ | HRFLM | No | Not mentioned | Cleveland, Hungarian, Long-beach-va, and Switzerland |
Summary of each dataset's characteristics.
| Dataset | # features | # instances | Ratio between normal and CHD |
|---|---|---|---|
| Z-Alizadeh Sani | 54 | 303 | 1 : 2.5 |
| Statlog | 13 | 261 | 1 : 0.78 |
| Cleveland | 13 | 303 | 1 : 0.85 |
| Hungarian | 13 | 294 | 1 : 0.56 |
Figure 1: Theoretical framework of heart disease detection.
Parameter settings used in particle swarm optimization-based feature selection.
| Parameter | Value |
|---|---|
| Cognitive acceleration coefficient (ϕ1) | 1.0 |
| Social acceleration coefficient (ϕ2) | 2.0 |
| Maximum generations | 30 |
| Number of particles | 5, 10, 20, 50, 100, 500, 1000, 2000, 5000, and 10000 |
| Mutation type | Bit-flip |
| Mutation probability | 0.01 |
| Prune | False |
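The settings above can be exercised in a minimal binary-PSO feature-selection sketch. Only the 30 generations and the 0.01 bit-flip mutation probability come directly from the table; the sketch assumes the table's unnamed 1.0 and 2.0 values are the cognitive (ϕ1) and social (ϕ2) acceleration coefficients, and the fitness function is a stand-in — a real run would score a classifier under 10-fold cross-validation on the selected columns. The "informative" mask and swarm size of 20 are likewise assumptions for illustration.

```python
# Binary PSO for feature selection: each particle is a bit mask over features;
# velocities are passed through a sigmoid and sampled into new bit vectors.
import numpy as np

rng = np.random.default_rng(42)
n_features, n_particles, n_generations = 13, 20, 30
phi1, phi2, mut_p = 1.0, 2.0, 0.01       # values from the parameter table

informative = np.zeros(n_features, dtype=int)
informative[:6] = 1                       # hypothetical "useful" features

def fitness(mask):
    # Stand-in objective: reward covering informative features,
    # penalise large subsets (a real objective would be CV accuracy).
    return (mask * informative).sum() - 0.2 * mask.sum()

pos = (rng.random((n_particles, n_features)) < 0.5).astype(int)
vel = rng.normal(size=(n_particles, n_features))
pbest = pos.copy()
pbest_fit = np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(n_generations):
    r1 = rng.random(pos.shape)
    r2 = rng.random(pos.shape)
    vel += phi1 * r1 * (pbest - pos) + phi2 * r2 * (gbest - pos)
    prob = 1.0 / (1.0 + np.exp(-vel))            # sigmoid transfer function
    pos = (rng.random(pos.shape) < prob).astype(int)
    flip = rng.random(pos.shape) < mut_p         # bit-flip mutation
    pos = pos ^ flip
    fit = np.array([fitness(p) for p in pos])
    improved = fit > pbest_fit
    pbest[improved] = pos[improved]
    pbest_fit[improved] = fit[improved]
    gbest = pbest[pbest_fit.argmax()].copy()     # best-ever mask

print("selected mask:", gbest)
```

Swarm size is the variable the paper sweeps (5 to 10000 particles); the selected-feature counts in the next table show the search stabilising once the swarm is large enough.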
Figure 2: Classification accuracy of CDT for each CHD dataset with respect to various numbers of particles.
Number of selected features for each CHD dataset with respect to different numbers of particles.
| # particles | Z-Alizadeh Sani | Cleveland | Statlog | Hungarian |
|---|---|---|---|---|
| 5 | 15 | 10 | 10 | 6 |
| 10 | 26 | 7 | 7 | 6 |
| 20 | 27 | 9 | 8 | 6 |
| 50 | 13 | 7 | 7 | 6 |
| 100 | 13 | 7 | 7 | 6 |
| 500 | 13 | 7 | 7 | 6 |
| 1000 | 13 | 7 | 7 | 6 |
| 2000 | 13 | 7 | 7 | 6 |
| 5000 | 13 | 7 | 7 | 6 |
| 10000 | 13 | 7 | 7 | 6 |
The selected features obtained from PSO-based feature selection for each dataset.
| Dataset | # selected features | Feature name |
|---|---|---|
| Z-Alizadeh Sani | 27 | Age, hypertension, airway disease, thyroid disease, congestive heart failure, dyslipidemia, blood pressure, systolic murmur, diastolic murmur, typical chest pain, dyspnea, atypical, nonanginal, low threshold angina, ST elevation, T inversion, poor R progression, fasting blood sugar, LDL, HDL, blood urea nitrogen, erythrocyte sedimentation rate, white blood cell, neutrophil, ejection fraction, region with regional wall motion abnormality, and valvular heart disease. |
| Statlog | 8 | Gender/sex, chest pain type, resting electrocardiographic results, maximum heart rate achieved, exercise induced angina, ST depression, number of major vessels, and thallium stress test result. |
| Cleveland | 7 | Chest pain type, resting electrocardiographic results, maximum heart rate achieved, exercise induced angina, oldpeak, number of major vessels, and thallium stress test result. |
| Hungarian | 6 | Gender/sex, chest pain type, heart rate, old peak, slope, and number of major vessels. |
Results of mean AUC (%) with the Friedman rank and Iman-Davenport tests.
| Algorithm | Z-Alizadeh Sani | Statlog | Cleveland | Hungarian | Friedman rank | Iman-Davenport |
|---|---|---|---|---|---|---|
| DT | 76.30 | 80.30 | 79.80 | 77.10 | 5.50 | 3.69E-10 |
| RT | 69.90 | 78.90 | 75.20 | 73.60 | 7.00 | |
| CART | 78.20 | 79.80 | 78.60 | 80.30 | 5.50 | |
| RF | 92.47 | 89.49 | | 91.54 | 1.75 | |
| GBM | 88.99 | 88.13 | 89.24 | 91.13 | 2.75 | |
| XGBoost | 87.65 | 80.73 | 85.30 | 86.98 | 4.00 | |
| Proposed | | | 85.86 | | | |
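The two-step test reported above can be reproduced in outline: scipy's `friedmanchisquare` gives the Friedman statistic, and the Iman-Davenport correction converts it to an F statistic. The sketch below uses only the classifier rows with all four AUC values; treating N = 4 datasets as blocks and k = 5 classifiers as treatments is the standard setup, and the Iman-Davenport formula is F = (N−1)χ² / (N(k−1) − χ²).

```python
# Two-step test: (1) nonparametric Friedman test over per-dataset AUC ranks,
# (2) Iman-Davenport correction, distributed as F(k-1, (k-1)(N-1)).
from scipy.stats import friedmanchisquare, f

# AUC (%) per dataset (Z-Alizadeh Sani, Statlog, Cleveland, Hungarian).
scores = {
    "DT":      [76.30, 80.30, 79.80, 77.10],
    "RT":      [69.90, 78.90, 75.20, 73.60],
    "CART":    [78.20, 79.80, 78.60, 80.30],
    "GBM":     [88.99, 88.13, 89.24, 91.13],
    "XGBoost": [87.65, 80.73, 85.30, 86.98],
}

chi2, p_friedman = friedmanchisquare(*scores.values())

N, k = 4, len(scores)                          # datasets, classifiers
f_id = (N - 1) * chi2 / (N * (k - 1) - chi2)   # Iman-Davenport statistic
p_id = f.sf(f_id, k - 1, (k - 1) * (N - 1))

print(f"Friedman p = {p_friedman:.4f}, Iman-Davenport p = {p_id:.4g}")
```

A small p-value from both steps, as here, licenses the pairwise post hoc comparisons in the next table.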
Comparative results of all classifiers with respect to the Friedman post hoc test.
| Comparison | Friedman post hoc |
|---|---|
| Proposed vs. DT | 0.0088 |
| Proposed vs. RT | 0.00031 |
| Proposed vs. CART | 0.0088 |
| Proposed vs. RF | 0.86 |
| Proposed vs. GBM | 0.41 |
| Proposed vs. XGBoost | 0.10 |
Comparison of the proposed method with some prior studies using the Z-Alizadeh Sani dataset.
| Study | Technique | # of features | Validation method | Accuracy (%) | F1 (%) | AUC (%) | Statistical test |
|---|---|---|---|---|---|---|---|
| [ | Bagging-DT | 20 | 10CV | 79.54 (LAD), (LCX), 68.96 | Not reported | Not reported | No |
| [ | Information gain-SMO | 34 | 10CV | 94.08 | Not reported | Not reported | No |
| [ | Information gain-SVM | 24 | 10CV | 86.14 (LAD), (LCX), (RCA) | Not reported | Not reported | No |
| [ | Neural network genetic algorithm | 22 | 10CV | 93.85 | Not reported | Not reported | No |
| [ | Ensemble algorithm multiple feature selection | 34 | 10CV | 93.70 | 95.53 | Not reported | No |
| [ | Support vector machine feature engineering | 32 | 10CV | 96.40 | Not reported | Not reported | No |
| [ | | 29 | 10CV | 93.08 | 91.51 | Not reported | No |
| This paper | Two-tier ensemble PSO-based feature selection | 27 | 10CV | | | | Two-step statistical test |
Comparison of the proposed method with some prior studies using the Statlog dataset.
| Study | Technique | # of features | Validation method | Accuracy (%) | F1 (%) | AUC (%) | Statistical test |
|---|---|---|---|---|---|---|---|
| [ | Logistic regression | 13 | 10CV | 85 | 87 | Not reported | No |
| [ | Ensemble voting logistic regression multilayer perceptron naive Bayes | 13 | 10CV | 89 | 87 | 88 | No |
| This paper | Two-tier ensemble PSO-based feature selection | 8 | 10CV | | | | Two-step statistical test |
Comparison of the proposed method with some prior studies using the Cleveland dataset.
| Study | Technique | # of features | Validation method | Accuracy (%) | F1 (%) | AUC (%) | Statistical test |
|---|---|---|---|---|---|---|---|
| [ | Rotation forest-J48-CFS | 7 | 10CV | 84.48 | Not reported | Not reported | No |
| [ | PSO fuzzy expert systems | 76 | Hold-out | | Not reported | Not reported | No |
| [ | SMO-expert-based feature selection | 8 | 10CV | 84.49 | 86.2 | Not reported | No |
| [ | CFS-PSO-clustering-MLP | 5 | 10CV | 90.28 | Not reported | Not reported | No |
| [ | Logistic regression-LASSO | 6 | 10CV | 89 | Not reported | Not reported | No |
| [ | Boosted-C5.0 and neural network | 12 | 10CV | 77.8 & 81.9 | Not reported | Not reported | Paired t-test |
| [ | Voting-naive Bayes-logistic regression | 9 | 10CV | 87.41 | Not reported | Not reported | No |
| This paper | Two-tier ensemble PSO-based feature selection | 7 | 10CV | 85.71 | | 85.86 | Two-step statistical test |
Comparison of the proposed method with some prior studies using the Hungarian dataset.
| Study | Technique | # of features | Validation method | Accuracy (%) | F1 (%) | AUC (%) | Statistical test |
|---|---|---|---|---|---|---|---|
| [ | Neural network genetic algorithm | 14 | 10CV | 87.1 | Not reported | Not reported | No |
| This paper | Two-tier ensemble PSO-based feature selection | 6 | 10CV | | | | Two-step statistical test |