Mirza Rizwan Sajid, Bader A Almehmadi, Waqas Sami, Mansour K Alzahrani, Noryanti Muhammad, Christophe Chesneau, Asif Hanif, Arshad Ali Khan, Ahmad Shahbaz.
Abstract
Criticism of the implementation of existing risk prediction models (RPMs) for cardiovascular diseases (CVDs) in new populations motivates researchers to develop regional models. The predominant use of laboratory features in these RPMs also causes reproducibility issues in low- and middle-income countries (LMICs). Further, conventional logistic regression analysis (LRA) does not consider non-linear associations and interaction terms in developing these RPMs, which might oversimplify the phenomenon. This study aims to develop alternative machine learning (ML)-based RPMs that may predict CVD status from nonlaboratory features better than conventional RPMs. The data came from a case-control study conducted at the Punjab Institute of Cardiology, Pakistan, covering 460 subjects aged between 30 and 76 years with 1:1 gender-based matching. We tested various ML models to identify the best model or models, taking LRA as the baseline RPM. An artificial neural network and a linear support vector machine outperformed the conventional RPM on the majority of performance metrics. The predictive accuracies of the best-performing ML-based RPMs were between 80.86% and 81.09%, higher than the 79.56% of the baseline RPM. The discriminating capabilities of the ML-based RPMs were also comparable to those of the baseline RPM. Further, the ML-based RPMs identified a substantially different ordering of features than the baseline RPM. This study concludes that nonlaboratory feature-based RPMs can be a good choice for early risk assessment of CVDs in LMICs. ML-based RPMs can identify a better ordering of features than the conventional approach, which in turn yielded models with improved prognostic capabilities.
Keywords: LMICs; features importance; machine learning models; nonlaboratory-based features; risk prediction models
Year: 2021 PMID: 34886312 PMCID: PMC8657087 DOI: 10.3390/ijerph182312586
Source DB: PubMed Journal: Int J Environ Res Public Health ISSN: 1660-4601 Impact factor: 3.390
Figure 1. Flowchart for the development of baseline and ML-based RPMs and relative feature importance.
General characteristics of binary features of the study.
| Sr. No | Features | Frequency (%) |
|---|---|---|
| 1 | Gender | |
| | Male | 312 (67.8) |
| | Female | 148 (32.2) |
| 2 | Parental history of CVDs | |
| | Yes | 78 (17.0) |
| | No | 382 (83.0) |
| 3 | Diabetes mellitus | |
| | Present | 115 (25.0) |
| | Absent | 345 (75.0) |
| 4 | Hypertension | |
| | Present | 114 (24.8) |
| | Absent | 346 (75.2) |
| 5 | Smoking history | |
| | Smoker | 142 (30.9) |
| | Never smoker | 318 (69.1) |
| 6 | Physical inactivity | |
| | Low profile physical activity | 160 (34.8) |
| | Moderate to high physical activity | 300 (65.2) |
| 7 | Self-reported general stress | |
| | Sometimes to very stressful | 137 (29.8) |
| | Not at all to rarely stressful | 323 (70.2) |
| 8 | Abdominal obesity | |
| | Obese | 100 (21.7) |
| | Non-obese | 360 (78.3) |
| 9 | Consumption of high-salt foods | |
| | Consumption of high-salt foods or snacks ≥ 1 time a day | 194 (42.2) |
| | Consumption of high-salt foods or snacks < 1 time a day | 266 (57.8) |
| 10 | Low fruit consumption | |
| | < 1 time fruit per day | 316 (68.7) |
| | ≥ 1 time fruit per day | 144 (31.3) |
| 11 | Low vegetable consumption | |
| | < 1 time vegetables daily | 163 (35.4) |
| | ≥ 1 time vegetables daily | 297 (64.6) |
| 12 | High fried foods/trans fats consumption | |
| | Deep-fried foods/snacks/fast foods ≥ 3 times a week | 180 (39.1) |
| | Deep-fried foods/snacks/fast foods < 3 times a week | 280 (60.9) |
| 13 | Red meat/poultry consumption | |
| | ≥ 2 times daily | 58 (12.6) |
| | < 2 times daily | 402 (87.4) |
| 14 | Second-hand smoke exposure | |
| | More than 1 h of passive smoke exposure per week | 226 (49.0) |
| | Less than 1 h of passive smoke exposure per week | 234 (51.0) |
Performance of baseline and ML-based RPMs.
| Models | ANN | | Linear SVM | | RBF-SVM | | RF | | Baseline RPM | |
|---|---|---|---|---|---|---|---|---|---|---|
| Confusion matrix | Case | Control | Case | Control | Case | Control | Case | Control | Case | Control |
| Case | 178 | 52 | 186 | 44 | 185 | 45 | 185 | 45 | 185 | 45 |
| Control | 35 | 195 | 44 | 186 | 54 | 176 | 55 | 175 | 49 | 181 |
| Sensitivity | 0.780 | | 0.809 | | 0.804 | | 0.804 | | 0.804 | |
| Specificity | 0.848 | | 0.809 | | 0.765 | | 0.761 | | 0.787 | |
| Accuracy (%) | 81.09 | | 80.86 | | 78.50 | | 78.30 | | 79.56 | |
| AUC | 0.871 | | 0.864 | | 0.853 | | 0.856 | | 0.859 | |
| Kappa-statistic | 0.622 | | 0.617 | | 0.570 | | 0.565 | | 0.592 | |
| RMSE | 0.378 | | 0.382 | | 0.392 | | 0.386 | | 0.389 | |
| NRI | 3.7% | | 2.7% | | −2.2% | | −2.6% | | | |
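The threshold-based rows of this table can be recomputed from the confusion-matrix counts. The following is a minimal sketch, assuming that rows of each confusion matrix are the actual class and columns the predicted class (an assumption, though the row totals of 230 match the 1:1 case-control design). The names `CONFUSION` and `summarize` are illustrative and do not come from the study.

```python
# Confusion-matrix counts copied from the performance table.
# Assumption: rows = actual class, columns = predicted class.
CONFUSION = {
    "ANN":          {"tp": 178, "fn": 52, "fp": 35, "tn": 195},
    "Linear SVM":   {"tp": 186, "fn": 44, "fp": 44, "tn": 186},
    "RBF-SVM":      {"tp": 185, "fn": 45, "fp": 54, "tn": 176},
    "RF":           {"tp": 185, "fn": 45, "fp": 55, "tn": 175},
    "Baseline RPM": {"tp": 185, "fn": 45, "fp": 49, "tn": 181},
}


def summarize(tp, fn, fp, tn):
    """Return sensitivity, specificity, accuracy (%) and Cohen's kappa."""
    n = tp + fn + fp + tn
    observed = (tp + tn) / n  # proportion of subjects classified correctly
    # Chance agreement from the actual (row) and predicted (column) marginals.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n**2
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": 100 * observed,
        "kappa": (observed - expected) / (1 - expected),
    }


for model, counts in CONFUSION.items():
    stats = summarize(**counts)
    print(model, {name: round(value, 3) for name, value in stats.items()})
```

With the counts as printed, most recomputed values agree with the table to rounding. The ANN sensitivity comes out at 178/230 ≈ 0.774 rather than the tabulated 0.780, and the excerpt does not explain that difference.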
Percentage change in performance metrics of ML-based RPMs relative to the conventional baseline RPM.
| Models * | Sensitivity | Specificity | Accuracy | AUC | Kappa-Statistic | RMSE | BS | Number of Criteria Fulfilled |
|---|---|---|---|---|---|---|---|---|
| ANN | −2.40% | 6.10% | 1.53% | 1.20% | 2.97% | 0.378 | 0.143 | 5/6 |
| Linear SVM | 0.50% | 2.20% | 1.30% | 0.50% | 2.50% | 0.382 | 0.146 | 6/6 |
| RBF-SVM | 0.00% | −2.20% | −1.06% | −0.60% | −2.20% | 0.392 | 0.154 | 0/6 |
| RF | 0.00% | −2.60% | −1.26% | −0.30% | −2.68% | 0.386 | 0.149 | 1/6 |
| LRA (baseline) | | | | | | 0.389 | 0.151 | |
* LRA is the baseline model.
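The sensitivity and specificity changes in this table appear to be differences from the baseline expressed in percentage points, and the BS column is numerically equal to the squared RMSE. The sketch below checks both against the reported values. It also checks the NRI figures against the sum of the sensitivity and specificity changes, which is the usual two-class form of the net reclassification improvement; the excerpt does not state the formula the authors used, so that reading is an assumption. All names in the code are illustrative.

```python
# Reported values copied from the performance table (baseline = LRA).
REPORTED = {
    "ANN":        {"sensitivity": 0.780, "specificity": 0.848, "rmse": 0.378},
    "Linear SVM": {"sensitivity": 0.809, "specificity": 0.809, "rmse": 0.382},
    "RBF-SVM":    {"sensitivity": 0.804, "specificity": 0.765, "rmse": 0.392},
    "RF":         {"sensitivity": 0.804, "specificity": 0.761, "rmse": 0.386},
}
BASELINE = {"sensitivity": 0.804, "specificity": 0.787, "rmse": 0.389}


def point_change(model, metric):
    """Difference from the baseline, in percentage points."""
    return round((REPORTED[model][metric] - BASELINE[metric]) * 100, 1)


def nri(model):
    """Assumed two-class NRI: sensitivity change plus specificity change."""
    delta = (REPORTED[model]["sensitivity"] - BASELINE["sensitivity"]) + (
        REPORTED[model]["specificity"] - BASELINE["specificity"]
    )
    return round(delta * 100, 1)


def brier_from_rmse(rmse):
    """Squared RMSE, which matches the BS column to three decimals."""
    return round(rmse**2, 3)


for model in REPORTED:
    print(
        model,
        point_change(model, "sensitivity"),
        point_change(model, "specificity"),
        nri(model),
        brier_from_rmse(REPORTED[model]["rmse"]),
    )
```

Under these assumptions the script reproduces the tabulated changes, the four NRI values, and the BS column, including 0.151 for the baseline.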
Figure 2. Partial dependence plot (PDP) visualizing the marginal effect of age on CVD status.
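For readers unfamiliar with partial dependence, the curve in such a plot is obtained by fixing the feature of interest at each grid value for every subject and averaging the model's predictions. The sketch below illustrates that procedure only: `toy_risk_model`, its coefficients, and the four toy rows are hypothetical and are not the study's fitted model or data.

```python
import math


def toy_risk_model(age, smoker):
    """Hypothetical logistic risk score; coefficients are made up."""
    return 1 / (1 + math.exp(-(-4.0 + 0.07 * age + 0.8 * smoker)))


# Toy rows of (age, smoker); illustrative only, not study data.
rows = [(35, 0), (48, 1), (59, 0), (67, 1)]


def partial_dependence(model, rows, grid):
    """Average prediction when age is forced to each grid value."""
    curve = []
    for age_value in grid:
        # Keep every other feature as observed; overwrite age only.
        predictions = [model(age_value, smoker) for _, smoker in rows]
        curve.append((age_value, sum(predictions) / len(predictions)))
    return curve


curve = partial_dependence(toy_risk_model, rows, range(30, 80, 10))
for age_value, average_risk in curve:
    print(age_value, round(average_risk, 3))
```

Because the toy model increases monotonically with age, the resulting curve rises across the grid. A fitted ANN or SVM could produce a non-linear shape, which is the kind of pattern a PDP is meant to reveal.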
Figure 3. Relative feature importance extracted from the best-performing ML-based RPMs and the baseline RPM.