Madhumita Pal1, Smita Parija1, Ganapati Panda1, Kuldeep Dhama2, Ranjan K Mohapatra3.
Abstract
Cardiovascular disease (CVD) impairs the heart and blood vessels and often leads to death or physical paralysis. Early, automatic detection of CVD can therefore save many lives. Multiple investigations have pursued this objective, but there is still room for improvement in performance and reliability, and this study is a further step in that direction. Two machine learning techniques, multi-layer perceptron (MLP) and K-nearest neighbour (K-NN), were employed for CVD detection using publicly available data from the University of California Irvine repository. Model performance was improved by removing outliers and attributes with null values. The experimental results show that the MLP model achieved a higher detection accuracy (82.47%) and area-under-the-curve value (86.41%) than the K-NN model. The MLP model is therefore recommended for automatic CVD detection. The proposed methodology can also be applied to the detection of other diseases, and the performance of the proposed model can be assessed on other standard data sets.
Keywords: K-nearest neighbour; cardiovascular disease; machine learning algorithms; multi-layer perceptron
Year: 2022 PMID: 35799599 PMCID: PMC9206502 DOI: 10.1515/med-2022-0508
Source DB: PubMed Journal: Open Med (Wars)
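As a rough illustration of two steps named in the abstract, dropping attributes with null values and classifying with K-NN, the following minimal sketch uses only the Python standard library. The toy records, the `extra` attribute, the helper names, and the choice of `k=3` with Euclidean distance are all assumptions made for illustration; they are not the authors' data or settings. Outlier removal is not shown because the abstract does not state which rule was used.

```python
import math
from collections import Counter

def drop_null_attributes(rows):
    """Keep only the attributes (dict keys) that are non-null in every row."""
    complete = [k for k in rows[0] if all(r.get(k) is not None for r in rows)]
    return [{k: r[k] for k in complete} for r in rows]

def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy records (not taken from the UCI data set).
rows = [
    {"age": 40, "thalach": 175, "extra": None, "target": 1},
    {"age": 42, "thalach": 170, "extra": 1.0, "target": 1},
    {"age": 45, "thalach": 168, "extra": 2.0, "target": 1},
    {"age": 62, "thalach": 120, "extra": 1.5, "target": 0},
    {"age": 65, "thalach": 115, "extra": None, "target": 0},
    {"age": 60, "thalach": 125, "extra": 0.5, "target": 0},
]

clean = drop_null_attributes(rows)          # "extra" is removed
X = [(r["age"], r["thalach"]) for r in clean]
y = [r["target"] for r in clean]

prediction = knn_predict(X, y, query=(43, 171), k=3)
print(prediction)
```

In practice the features would usually be scaled before computing distances, since K-NN is sensitive to differences in units between attributes.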
Figure 1. Methodology employed in developing a CVD detection model: (a) training phase of ML models and (b) testing phase of ML models.
Figure 2. Count plot of patients with CVD (1) and without CVD (0).
Figure 3. Correlation plot for CVD data set categorical features (cp: chest pain, fbs: fasting blood sugar, restecg: resting electrocardiographic result, exang: exercise-induced angina, slope: slope of peak ST segment, ca: number of major vessels, thal: thallium stress result, and target: output (1 = patient having CVD, 0 = patient not having CVD)).
Figure 4. Correlation plot of CVD data set continuous features.
Figure 5. Schematic diagram of MLP.
Definition of confusion matrix parameters
| Confusion matrix parameters | Description |
|---|---|
| True positive | Instances where a patient has CVD and was predicted to have CVD |
| True negative | Instances where a patient does not have CVD and was predicted to not have CVD |
| False positive | Instances where a patient does not have CVD, but was predicted to have CVD |
| False negative | Instances where a patient has CVD, but was predicted to not have CVD |
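The four parameters can be counted directly from paired actual and predicted labels. The sketch below follows the usual definitions, with 1 meaning the patient has CVD and 0 meaning the patient does not; the label lists and the function name are hypothetical and serve only to illustrate the counting.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary classification problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # has CVD, predicted CVD
            else:
                fp += 1   # no CVD, predicted CVD
        else:
            if actual == positive:
                fn += 1   # has CVD, predicted no CVD
            else:
                tn += 1   # no CVD, predicted no CVD
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Hypothetical labels for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)
```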
Definition of performance metrics
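| Performance metrics | Definition/explanation |
|---|---|
| Accuracy | $\frac{TP + TN}{TP + TN + FP + FN}$ |
| Precision | $\frac{TP}{TP + FP}$ |
| Recall (TP rate) | $\frac{TP}{TP + FN}$ |
| F1-score | $\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ |
| Support | The number of actual occurrences of a class in the provided data set |
| FP rate | $\frac{FP}{FP + TN}$ |
| Area under the curve (AUC) | AUC summarizes the ROC curve and measures the ability of a classifier to distinguish between classes. The greater the AUC, the better the model's performance |
| ROC | A receiver operating characteristic (ROC) curve is a graphical representation of a binary classifier's performance: it plots the TP rate against the FP rate as the decision threshold varies |
| Macro average | The unweighted mean of the per-class metric, so all classes contribute equally to the final averaged metric |
| Weighted avg. | The mean of the per-class metric in which each class's contribution is weighted by its support |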
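To make the ROC and AUC entries concrete, the sketch below builds ROC points by sweeping a decision threshold over classifier scores and integrates the curve with the trapezoidal rule. The labels, scores, and function names are hypothetical; this is one common way to compute AUC, not necessarily the procedure used in the study.

```python
def roc_points(y_true, scores):
    """Return (FP rate, TP rate) pairs, one per distinct score threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Lowering the threshold step by step moves the point up and to the right.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under a piecewise-linear curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical labels (1 = CVD) and predicted scores for six patients.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(y_true, scores)
area = auc(curve)
print(round(area, 4))
```

Here eight of the nine positive/negative pairs are ranked correctly, so the area equals 8/9.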
Figure 6. Confusion matrix for CVD prediction using the K-NN model.
Figure 7. Confusion matrix for CVD prediction using the MLP model.
Confusion matrix results of ML models
| Confusion matrix parameters | K-NN | MLP |
|---|---|---|
| TN | 21 | 33 |
| TP | 24 | 47 |
| FP | 10 | 8 |
| FN | 6 | 9 |
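The reported counts can be turned into summary metrics with the standard formulas. The sketch below assumes the first value column belongs to K-NN and the second to MLP, an assignment that is consistent with the test-set sizes (61 and 97) and the accuracy scores reported elsewhere in this document. The function and variable names are illustrative.

```python
# Counts as reported in the confusion matrix results table
# (column assignment K-NN / MLP is an assumption, see the lead-in).
reported = {
    "K-NN": {"TN": 21, "TP": 24, "FP": 10, "FN": 6},
    "MLP": {"TN": 33, "TP": 47, "FP": 8, "FN": 9},
}

def summarize(c):
    """Compute accuracy, precision, recall and FP rate from the four counts."""
    total = c["TP"] + c["TN"] + c["FP"] + c["FN"]
    return {
        "accuracy": (c["TP"] + c["TN"]) / total,
        "precision": c["TP"] / (c["TP"] + c["FP"]),
        "recall": c["TP"] / (c["TP"] + c["FN"]),
        "fp_rate": c["FP"] / (c["FP"] + c["TN"]),
    }

results = {name: summarize(c) for name, c in reported.items()}
for name, metrics in results.items():
    print(name, {k: round(100 * v, 2) for k, v in metrics.items()})
```

The computed accuracies, 73.77% for K-NN and 82.47% for MLP, agree with the accuracy scores reported for the two models.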
Figure 8. Data visualization plot showing correlation between features.
Figure 9. Data visualization plot showing a correlation between resting blood pressure and maximum heart rate.
Figure 10. Data visualization plot showing a correlation between chest pain and cholesterol level.
Figure 11. Data visualization plot showing a correlation between resting electrocardiographic and old peak.
Figure 12. Scatter plot of maximum heart rate and age.
Figure 13. Scatter plot showing correlation of CVD data set features.
Figure 14. Correlation matrix plot between features.
Figure 15. ROC plot of the K-NN model.
Classification results of the K-NN model
| Parameter | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| 0 (Without CVD) | 0.78 | 0.68 | 0.72 | 31 |
| 1 (With CVD) | 0.71 | 0.80 | 0.75 | 30 |
| Accuracy | NA | NA | 0.74 | 61 |
| Macro average | 0.74 | 0.74 | 0.74 | 61 |
| Weighted average | 0.74 | 0.74 | 0.74 | 61 |
Figure 16. ROC curve for the MLP model.
Classification results of the MLP model
| Parameter | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| 0 (Without CVD) | 0.79 | 0.80 | 0.80 | 41 |
| 1 (With CVD) | 0.85 | 0.84 | 0.85 | 56 |
| Accuracy | NA | NA | 0.82 | 97 |
| Macro avg | 0.82 | 0.82 | 0.82 | 97 |
| Weighted avg | 0.83 | 0.82 | 0.83 | 97 |
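The per-class rows and the two averages in a report like this can be derived from the confusion matrix counts alone. The sketch below assumes the MLP counts are TN = 33, TP = 47, FP = 8, and FN = 9 (with CVD as the positive class); the helper names are illustrative. For the "without CVD" class the roles of the counts swap: its true positives are the TN, its false positives are the FN, and its false negatives are the FP.

```python
def class_report(tp, fp, fn):
    """Precision, recall, F1-score and support for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1,
            "support": tp + fn}

# Assumed MLP counts, with "has CVD" (1) as the positive class.
TN, TP, FP, FN = 33, 47, 8, 9

per_class = {
    0: class_report(tp=TN, fp=FN, fn=FP),  # class 0: without CVD
    1: class_report(tp=TP, fp=FP, fn=FN),  # class 1: with CVD
}

def macro_avg(per_class, metric):
    """Unweighted mean over classes."""
    return sum(r[metric] for r in per_class.values()) / len(per_class)

def weighted_avg(per_class, metric):
    """Mean over classes, weighted by each class's support."""
    total = sum(r["support"] for r in per_class.values())
    return sum(r[metric] * r["support"] for r in per_class.values()) / total

for label, row in per_class.items():
    print(label, {k: round(v, 2) for k, v in row.items()})
print("macro precision", round(macro_avg(per_class, "precision"), 2))
print("weighted precision", round(weighted_avg(per_class, "precision"), 2))
```

Note that the support-weighted recall is identical to the overall accuracy, which is why both appear as 0.82 in the table.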
Comparison of accuracy and AUC scores obtained from ML models
| ML algorithms | Accuracy score (%) | AUC score (%) |
|---|---|---|
| K-NN | 73.77 | 86.21 |
| MLP | 82.47 | 86.41 |
Performance comparison of the proposed work with existing works
| Existing work | Algorithms | Accuracy (%) | Reference |
|---|---|---|---|
| Kaur et al., 2019 | MLP | 47.54 | [ |
| Nahar et al., 2013 | Naïve Bayes | 69.11 | [ |
| Verma et al., 2016 | Decision Tree | 80.68 | [ |
| Ei-bialy et al., 2015 | Decision Tree | 78.54 | [ |
| Proposed work | MLP | 82.47 | — |
Description of features of the dataset
| SL NO. | Features | Types | Description | Range of features increasing the probability of heart disease |
|---|---|---|---|---|
| 1. | Age | Continuous | Age in years | NA |
| 2. | Sex | Categorical | Male, female | 1 = male, 0 = female |
| 3. | Cp | Categorical | Chest pain type; 0: typical angina; 1: atypical angina; 2: non-anginal pain; 3: asymptomatic | People with cp equal to 1, 2, or 3 are more likely to have heart disease than people with cp equal to 0. |
| 4. | Trestbps | Continuous | Resting blood pressure in mm Hg | Values above 130-140 are a matter of concern. |
| 5. | Chol | Continuous | Serum cholesterol in mg/dl | Serum = LDL + HDL + 0.2 × triglycerides; above 200 is a matter of concern. |
| 6. | Fbs | Categorical | Fasting blood sugar > 120 mg/dl | 1 = true, 0 = false; > 126 mg/dl signals diabetes. People with Fbs equal to 1 have a higher probability of heart disease than people with Fbs equal to 0. |
| 7. | Restecg | Categorical | Resting electrocardiographic results. People with value 1 (a non-normal heartbeat, which can range from mild symptoms to severe problems) are more likely to have heart disease. | 0: nothing to note; 1: non-normal heartbeat, ranging from mild symptoms to severe problems; 2: possible or definite left ventricular hypertrophy (enlargement of the heart's main pumping chamber) |
| 8. | Thalach | Continuous | Maximum heart rate achieved | People with values above 140 are more likely to have heart disease. |
| 9. | Exang | Categorical | Exercise-induced angina | 1 = yes, 0 = no. People with value 0 have a higher probability of heart disease than people with value 1. |
| 10. | Oldpeak | Continuous | ST depression induced by exercise relative to rest; reflects the stress on the heart during exercise | An unhealthy heart is stressed more during exercise. |
| 11. | Slope | Categorical | The slope of the peak exercise ST segment | 0: upsloping, better heart rate with exercise (uncommon); 1: flat, minimal change (typical healthy heart); 2: downsloping, signs of an unhealthy heart |
| 12. | Ca | Categorical | Number of major vessels (0-3) colored by fluoroscopy | People with Ca equal to 0 are more likely to have heart disease. |
| 13. | Thal | Categorical | Thallium stress result | 1, 3: normal; 6: fixed defect (used to be a defect); 7: reversible defect (no proper blood movement when exercising). People with thal equal to 2 are more likely to have heart disease. |
| 14. | Target | Categorical | Have heart disease or not | 1=yes, 0=no |
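The feature types in this table can be captured in a small schema, which is handy when the continuous and categorical features need to be handled separately (as in the two correlation plots). The sketch below is an illustration only: the lowercase key names, the helper functions, and the sample record are assumptions, and the `encode_fbs` rule simply restates the table's "> 120 mg/dl" definition.

```python
# Feature types as listed in the table (lowercase keys are an assumption).
FEATURE_TYPES = {
    "age": "continuous", "sex": "categorical", "cp": "categorical",
    "trestbps": "continuous", "chol": "continuous", "fbs": "categorical",
    "restecg": "categorical", "thalach": "continuous", "exang": "categorical",
    "oldpeak": "continuous", "slope": "categorical", "ca": "categorical",
    "thal": "categorical", "target": "categorical",
}

def split_features(record):
    """Split one patient record into continuous and categorical parts."""
    missing = set(FEATURE_TYPES) - set(record)
    if missing:
        raise ValueError(f"missing features: {sorted(missing)}")
    continuous = {k: float(v) for k, v in record.items()
                  if FEATURE_TYPES.get(k) == "continuous"}
    categorical = {k: int(v) for k, v in record.items()
                   if FEATURE_TYPES.get(k) == "categorical"}
    return continuous, categorical

def encode_fbs(fasting_blood_sugar_mg_dl):
    """Encode fasting blood sugar as 1 (true) if above 120 mg/dl, else 0."""
    return 1 if fasting_blood_sugar_mg_dl > 120 else 0

# Hypothetical patient record, for illustration only.
record = {
    "age": 54, "sex": 1, "cp": 2, "trestbps": 135, "chol": 240,
    "fbs": encode_fbs(130), "restecg": 1, "thalach": 150, "exang": 0,
    "oldpeak": 1.2, "slope": 2, "ca": 0, "thal": 2, "target": 1,
}

continuous, categorical = split_features(record)
print(sorted(continuous))
print(sorted(categorical))
```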