Furqan Rustam, Abid Ishaq, Kashif Munir, Mubarak Almutairi, Naila Aslam, Imran Ashraf.
Abstract
Cardiovascular diseases (CVDs) are the leading cause of death, accounting for 32% of all deaths worldwide. Owing to the large number of symptoms related to age, gender, demographics, and ethnicity, diagnosing CVDs is a challenging and complex task. Furthermore, the lack of experienced staff and medical experts, and the non-availability of appropriate testing equipment, put the lives of millions of people at risk, especially in under-developed and developing countries. Electronic health records (EHRs) have recently been utilized for diagnosing several diseases and show potential for CVD diagnosis as well. However, the accuracy and efficacy of EHR-based CVD diagnosis are limited by the lack of an appropriate feature set: often, the feature set is very small and unable to provide enough features for machine learning models to obtain a good fit. This study addresses this problem by proposing the novel use of feature extraction from a convolutional neural network (CNN). An ensemble model is designed in which a CNN is used to enlarge the feature set used to train the linear models, a stochastic gradient descent classifier, logistic regression, and a support vector machine, that comprise the soft-voting-based ensemble. Extensive experiments are performed to analyze the performance of different ratios of feature set size to training dataset size. Performance analysis is carried out using four different datasets, and results are compared with recent approaches for CVD prediction. Results show the superior performance of the proposed model, with 0.93 accuracy and 0.92 each for precision, recall, and F1 score, and indicate the generalization of the ensemble model across multiple datasets.
Keywords: cardiovascular disease prediction; deep learning; feature extraction; transfer learning
Year: 2022 PMID: 35741283 PMCID: PMC9221641 DOI: 10.3390/diagnostics12061474
Source DB: PubMed Journal: Diagnostics (Basel) ISSN: 2075-4418
Figure 1. The architecture of the proposed methodology.
Description of dataset attributes.
| Attribute | Data Type | Description |
|---|---|---|
| Age | Integer | This attribute contains the age of a patient (years). |
| Sex | String | This attribute contains the sex/gender of the patient in string format, [M = male, F = female] |
| Chest pain type | String | Type of chest pain experienced by the patient, in string format [TA = typical angina, ATA = atypical angina, NAP = non-anginal pain, ASY = asymptomatic] |
| Resting BP | Integer | Resting blood pressure of the patient in mmHg |
| Cholesterol | Integer | Serum cholesterol in mg/dL |
| Fasting BS | Integer | Fasting blood sugar [1 = if fasting BS > 120 mg/dL, 0 = otherwise] |
| Resting ECG | String | Electrocardiogram (ECG) result [Normal = Normal, ST = having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of >0.05 mV), LVH = showing probable or definite left ventricular hypertrophy by Estes’ criteria] |
| Max HR | Integer | This attribute contains the maximum heart rate of a patient [Numeric value between 60 and 202] |
| Exercise Angina | String | Exercise-induced angina [Y = yes, N = no] |
| Old Peak | Float | ST depression induced by exercise relative to rest |
| ST-Slope | String | Slope of the peak exercise ST segment [Up = upsloping, Flat = flat, Down = downsloping] |
| Heart Disease | Integer | Binary Target, [Class 1 = heart disease, Class 0 = normal] |
Results of the encoding technique on the sample dataset.
| Age | Sex | ChestPainType | RestingBP | Choles. | FastingBS | RestingECG | MaxHR | Exer.Angina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|
| *Before Encoding* | | | | | | | | | | | |
| 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| *After Encoding* | | | | | | | | | | | |
| 12 | 1 | 1 | 41 | 147 | 0 | 1 | 98 | 0 | 10 | 2 | 0 |
| 21 | 0 | 2 | 55 | 40 | 0 | 1 | 82 | 0 | 20 | 1 | 1 |
| 9 | 1 | 1 | 31 | 141 | 0 | 2 | 25 | 0 | 10 | 2 | 0 |
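The integer codes above are consistent with fitting a label encoder per column on the full dataset, so each distinct value is mapped to its sorted rank. A sketch of that per-column encoding, assuming scikit-learn's LabelEncoder; since it is fit on only the three sample rows here, the resulting codes differ from those in the table:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

rows = [
    [40, "M", "ATA", 140, 289, 0, "Normal", 172, "N", 0.0, "Up", 0],
    [49, "F", "NAP", 160, 180, 0, "Normal", 156, "N", 1.0, "Flat", 1],
    [37, "M", "ATA", 130, 283, 0, "ST", 98, "N", 0.0, "Up", 0],
]
cols = ["Age", "Sex", "ChestPainType", "RestingBP", "Cholesterol", "FastingBS",
        "RestingECG", "MaxHR", "ExerciseAngina", "Oldpeak", "ST_Slope", "HeartDisease"]
df = pd.DataFrame(rows, columns=cols)

# Fit one encoder per column; each distinct value becomes its rank in
# sorted order (e.g., Sex: F -> 0, M -> 1).
encoded = df.apply(lambda col: LabelEncoder().fit_transform(col))
```

Note that this encodes numeric columns as ranks too (e.g., ages 37/40/49 become 0/1/2 on this sample), which matches the "After Encoding" rows where even Age and RestingBP are remapped.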
Number of samples for training and testing.
| Set | Total Samples | Class 0 | Class 1 |
|---|---|---|---|
| Training | 734 | 321 | 413 |
| Testing | 184 | 89 | 95 |
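The 734/184 counts correspond to an 80/20 train-test partition of the 918 samples; a sketch assuming scikit-learn's stratified train_test_split (placeholder features, class counts taken from the table):

```python
from sklearn.model_selection import train_test_split

n_total = 734 + 184                      # 918 samples overall
X = list(range(n_total))                 # placeholder features
y = [0] * (321 + 89) + [1] * (413 + 95)  # 410 class-0, 508 class-1 labels

# test_size=0.2 of 918 rounds up to 184 test samples, leaving 734 for training;
# stratify=y keeps the class ratio similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```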
Figure 2. Architecture of the proposed ConvSGLV.
Results using the original feature set.
| Model | Accuracy | Class | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| DT | 0.77 | 0 | 0.75 | 0.75 | 0.75 |
| | | 1 | 0.79 | 0.79 | 0.79 |
| | | Avg | 0.77 | 0.77 | 0.77 |
| ADA | 0.83 | 0 | 0.80 | 0.83 | 0.81 |
| | | 1 | 0.85 | 0.82 | 0.84 |
| | | Avg | 0.82 | 0.83 | 0.83 |
| SVM | 0.86 | 0 | 0.85 | 0.85 | 0.85 |
| | | 1 | 0.87 | 0.87 | 0.87 |
| | | Avg | 0.86 | 0.86 | 0.86 |
| RF | 0.85 | 0 | 0.85 | 0.81 | 0.83 |
| | | 1 | 0.85 | 0.88 | 0.86 |
| | | Avg | 0.85 | 0.84 | 0.85 |
| ETC | 0.88 | 0 | 0.87 | 0.86 | 0.86 |
| | | 1 | 0.88 | 0.89 | 0.89 |
| | | Avg | 0.87 | 0.87 | 0.87 |
| LR | 0.86 | 0 | 0.86 | 0.85 | 0.85 |
| | | 1 | 0.87 | 0.88 | 0.88 |
| | | Avg | 0.86 | 0.86 | 0.86 |
| SGDC | 0.71 | 0 | 0.86 | 0.43 | 0.57 |
| | | 1 | 0.66 | 0.94 | 0.78 |
| | | Avg | 0.76 | 0.68 | 0.67 |
| SGLV | 0.87 | 0 | 0.85 | 0.83 | 0.84 |
| | | 1 | 0.86 | 0.88 | 0.87 |
| | | Avg | 0.86 | 0.86 | 0.86 |
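The per-class rows and the Avg (macro-average) rows in these tables follow the standard precision/recall/F1 breakdown; a sketch of how such rows are computed with scikit-learn, using toy labels rather than the study's predictions:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy ground truth and predictions (8 test samples, 2 misclassified).
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)  # overall accuracy column

# One (precision, recall, f1, support) tuple per class -> the class-0/class-1 rows.
per_class = precision_recall_fscore_support(y_true, y_pred, labels=[0, 1])

# Unweighted mean over classes -> the Avg row.
macro = precision_recall_fscore_support(y_true, y_pred, average="macro")
```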
Experiment results using the convolutional feature set.
| Model | Accuracy | Class | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| DT | 0.79 | 0 | 0.78 | 0.76 | 0.77 |
| | | 1 | 0.80 | 0.82 | 0.81 |
| | | Avg | 0.79 | 0.79 | 0.79 |
| ADA | 0.88 | 0 | 0.85 | 0.88 | 0.86 |
| | | 1 | 0.90 | 0.87 | 0.89 |
| | | Avg | 0.87 | 0.88 | 0.87 |
| SVM | 0.88 | 0 | 0.85 | 0.88 | 0.86 |
| | | 1 | 0.90 | 0.87 | 0.89 |
| | | Avg | 0.87 | 0.88 | 0.87 |
| RF | 0.89 | 0 | 0.93 | 0.82 | 0.87 |
| | | 1 | 0.86 | 0.95 | 0.90 |
| | | Avg | 0.90 | 0.89 | 0.89 |
| ETC | 0.85 | 0 | 0.81 | 0.85 | 0.83 |
| | | 1 | 0.88 | 0.84 | 0.86 |
| | | Avg | 0.85 | 0.85 | 0.85 |
| LR | 0.90 | 0 | 0.93 | 0.87 | 0.90 |
| | | 1 | 0.88 | 0.94 | 0.91 |
| | | Avg | 0.90 | 0.90 | 0.90 |
| SGDC | 0.88 | 0 | 0.85 | 0.86 | 0.86 |
| | | 1 | 0.89 | 0.88 | 0.89 |
| | | Avg | 0.87 | 0.87 | 0.87 |
| SGLV | 0.92 | 0 | 0.93 | 0.91 | 0.92 |
| | | 1 | 0.92 | 0.94 | 0.93 |
| | | Avg | 0.92 | 0.92 | 0.92 |
Figure 3. Comparison between model performance with proposed features and original features.
Figure 4. Visualization of the feature sets: (a) original feature set and (b) CNN feature set.
Models' accuracy using different numbers of features.
| Model | 10,000 Features | 15,000 Features | 20,000 Features | 25,000 Features |
|---|---|---|---|---|
| DT | 0.67 | 0.78 | 0.79 | 0.79 |
| ADA | 0.80 | 0.83 | 0.83 | 0.88 |
| SVM | 0.84 | 0.85 | 0.85 | 0.88 |
| RF | 0.86 | 0.82 | 0.84 | 0.89 |
| ETC | 0.83 | 0.81 | 0.83 | 0.85 |
| LR | 0.86 | 0.84 | 0.88 | 0.90 |
| SGDC | 0.86 | 0.83 | 0.81 | 0.88 |
| SGLV | 0.86 | 0.86 | 0.88 | 0.92 |
Number of correct predictions (CP) and wrong predictions (WP) for all models.
| Model | CNN Features CP | CNN Features WP | Original Features CP | Original Features WP |
|---|---|---|---|---|
| DT | 146 | 38 | 142 | 42 |
| ADA | 161 | 23 | 152 | 32 |
| SVM | 161 | 23 | 158 | 26 |
| RF | 164 | 20 | 156 | 28 |
| ETC | 156 | 28 | 161 | 23 |
| LR | 166 | 18 | 159 | 25 |
| SGDC | 161 | 23 | 130 | 54 |
| SGLV | 170 | 14 | 160 | 24 |
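These counts are consistent with the reported accuracies on the 184-sample test set, since accuracy is simply CP / (CP + WP); for example, for SGLV with CNN features:

```python
cp, wp = 170, 14            # SGLV with CNN features, from the table
accuracy = cp / (cp + wp)   # 170 / 184, matching the 0.92 reported for SGLV
```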
Performance validation (accuracy) using four different datasets.
| Dataset | CNN Features | Original Features |
|---|---|---|
| CHD | 0.93 | 0.82 |
| SHD | 0.90 | 0.80 |
| SAHDD | 0.77 | 0.75 |
| HFPD | 0.92 | 0.86 |
Architecture of deep learning models.
| Model | Hyper-Parameter Setting |
|---|---|
| GRU | Embedding(1000, 100, 11) |
| | Dropout(0.2) |
| | GRU(64, return_sequences=True) |
| | Dense(2, activation='softmax') |
| | loss='binary_crossentropy', optimizer='adam', epochs=100 |
| CNN | Embedding(1000, 100, 11) |
| | Conv1D(128, 2, activation='relu') |
| | MaxPooling1D(pool_size=2) |
| | Flatten() |
| | Dense(2, activation='softmax') |
| | loss='binary_crossentropy', optimizer='adam', epochs=100 |
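Under the CNN hyper-parameters listed above, and assuming Keras defaults ('valid' padding, stride 1), the per-layer output sizes, and hence the length of the flattened feature vector per sample, work out as:

```python
# Layer-by-layer output sizes for the CNN in the table above.
seq_len, emb_dim = 11, 100          # Embedding(1000, 100, 11): 11 inputs -> 100-dim vectors
filters, kernel = 128, 2            # Conv1D(128, 2)
conv_len = seq_len - kernel + 1     # 'valid' padding, stride 1 -> 10 positions
pool_len = conv_len // 2            # MaxPooling1D(pool_size=2) -> 5 positions
flat_features = pool_len * filters  # Flatten() -> 640 values per sample
```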
Results for deep learning models used in the study.
| Model | Accuracy | Class | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| CNN | 0.82 | 0 | 0.77 | 0.77 | 0.77 |
| | | 1 | 0.85 | 0.85 | 0.85 |
| | | Avg | 0.81 | 0.81 | 0.81 |
| GRU | 0.81 | 0 | 0.75 | 0.78 | 0.77 |
| | | 1 | 0.85 | 0.83 | 0.84 |
| | | Avg | 0.80 | 0.80 | 0.80 |
Comparison of computational complexity (training time in seconds).
| Model | Using Original Features | Using CNN 25,000 Features |
|---|---|---|
| DT | 0.01 | 9.45 |
| ADA | 0.43 | 105.01 |
| SVM | 1.79 | 4.99 |
| RF | 0.38 | 7.90 |
| ETC | 0.33 | 15.45 |
| LR | 0.01 | 2.97 |
| SGDC | 0.01 | 0.93 |
| SGLV | 7.99 | 17.06 |
Performance comparison with recent studies for heart disease prediction.
| Dataset | Ref. | Year | Model Used | Accuracy |
|---|---|---|---|---|
| CHD | [ ] | 2019 | HRFLM | 0.880 |
| | [ ] | 2021 | LR | 0.868 |
| | [ ] | 2021 | Naïve Bayes | 0.850 |
| | [ ] | 2021 | RF with CDTL | 0.893 |
| | [ ] | 2021 | MBAR-CB, CBC | 0.918 |
| | Current study | 2021 | ConvSGLV | 0.93 |
| SHD | [ ] | 2021 | MBAR-CB, CBC | 0.870 |
| | [ ] | 2021 | Ensemble model | 0.88 |
| | Current study | 2021 | ConvSGLV | 0.90 |
| SAHDD | [ ] | 2021 | MBAR-XGB, XGBC | 0.75 |
| | Current study | 2021 | ConvSGLV | 0.77 |
| HFPD | Current study | 2021 | ConvSGLV | 0.92 |