Nikolaos Mittas1, Fani Chatzopoulou2,3, Konstantinos A Kyritsis4, Christos I Papagiannopoulos4, Nikoleta F Theodoroula4, Andreas S Papazoglou5, Efstratios Karagiannidis5, Georgios Sofidis5, Dimitrios V Moysidis5, Nikolaos Stalikas5, Anna Papa2, Dimitrios Chatzidimitriou2, Georgios Sianos5, Lefteris Angelis6, Ioannis S Vizirianakis4,7.
Abstract
Our study aims to develop a data-driven framework that uses heterogeneous electronic medical and clinical records and advanced Machine Learning (ML) approaches for: (i) the identification of critical risk factors affecting the complexity of Coronary Artery Disease (CAD), as assessed via the SYNTAX score; and (ii) the development of ML prediction models for accurate estimation of the expected SYNTAX score. We propose a two-part modeling technique that separates the process into two distinct phases: (a) a binary classification task for predicting whether a patient is more likely to present with a non-zero SYNTAX score; and (b) a regression task for predicting the expected SYNTAX score of individual patients with a non-zero SYNTAX score. The framework is based on data collected from the GESS trial (NCT03150680), comprising electronic medical and clinical records for 303 adult patients with suspected CAD who underwent invasive coronary angiography at AHEPA University Hospital of Thessaloniki, Greece. The proposed approach showed that atherogenic index of plasma levels, diabetes mellitus and hypertension can be considered important risk factors for discriminating patients into zero and non-zero SYNTAX score groups, whereas diastolic and systolic arterial blood pressure, peripheral vascular disease and body mass index can be considered significant risk factors for accurately estimating the expected SYNTAX score, given that a patient belongs to the non-zero SYNTAX score group. The experimental findings using the identified set of important risk factors indicate sufficient prediction performance for the Support Vector Machine model (classification task), with an F-measure of ~0.71, and for the Support Vector Regression model (regression task), with a median absolute error of ~6.5.
The proposed data-driven framework presents evidence of the prediction capacity and the potential clinical usefulness of the developed risk-stratification models. However, further experimentation in a larger clinical setting is needed to ensure the practical utility of the presented models, so that they can contribute to a more personalized management and counseling of CAD patients.
Keywords: SYNTAX score; coronary artery disease; machine learning; personalized (precision) medicine; risk-stratification model
Year: 2022 PMID: 35118145 PMCID: PMC8804295 DOI: 10.3389/fcvm.2021.812182
Source DB: PubMed Journal: Front Cardiovasc Med ISSN: 2297-055X
Figure 1. Proposed data-driven framework.
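The two-part technique described in the abstract can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the class `TwoPartModel` and the stand-in estimators `ThresholdClassifier` and `MeanRegressor` are hypothetical names introduced here, and the toy data are invented purely to show the flow of the zero-part and count-part steps.

```python
from statistics import mean


class TwoPartModel:
    """Zero-part classifier followed by a count-part regressor (illustrative sketch)."""

    def __init__(self, classifier, regressor):
        self.classifier = classifier  # needs fit(X, y) and predict(X) returning 0/1
        self.regressor = regressor    # needs fit(X, y) and predict(X) returning numbers

    def fit(self, X, y):
        # Zero-part: learn to separate zero from non-zero scores.
        is_nonzero = [1 if score > 0 else 0 for score in y]
        self.classifier.fit(X, is_nonzero)
        # Count-part: train only on cases with a non-zero score.
        X_pos = [x for x, score in zip(X, y) if score > 0]
        y_pos = [score for score in y if score > 0]
        self.regressor.fit(X_pos, y_pos)
        return self

    def predict(self, X):
        labels = self.classifier.predict(X)
        estimates = self.regressor.predict(X)
        # Cases classified as zero get 0; the rest get the count-part estimate.
        return [est if label == 1 else 0.0 for label, est in zip(labels, estimates)]


class ThresholdClassifier:
    """Hypothetical stand-in: predicts non-zero when the first feature exceeds its training mean."""

    def fit(self, X, y):
        self.cutoff = mean(x[0] for x in X)
        return self

    def predict(self, X):
        return [1 if x[0] > self.cutoff else 0 for x in X]


class MeanRegressor:
    """Hypothetical stand-in: always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.value = mean(y)
        return self

    def predict(self, X):
        return [self.value for _ in X]


# Invented toy data: one feature per case, score as target.
X_train = [[0.1], [0.2], [0.8], [0.9]]
y_train = [0, 0, 10, 20]
model = TwoPartModel(ThresholdClassifier(), MeanRegressor()).fit(X_train, y_train)
predictions = model.predict([[0.0], [1.0]])
```

Any classifier and regressor exposing `fit` and `predict` could be plugged in, which is why the sketch keeps the two parts as separate objects.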
Classification and regression methods for building the zero- and count-part models.
| Method | Description |
|---|---|
| Regression Analysis (RA) variant with Logistic Regression and Linear Regression for fitting the zero- and count-part models, respectively. | Logistic Regression employs a logit function to estimate the log odds of a binary response and the probabilities used to assign cases to the negative (absence) or positive (presence) class. Linear Regression estimates the parameters (regression coefficients) of a known explicit linear function describing the relationship between a continuous response and a set of predictors by minimizing the sum of squared residuals. |
| Classification and Regression Tree (CART) for fitting the zero- and count-part models. | Builds hierarchical models composed of decision nodes and leaves to predict the class (or continuous outcome) of a response based on a set of predictors. |
| Random Forest (RF) for fitting the zero- and count-part models. | An ensemble algorithm that combines the votes (or continuous outcomes) produced by a set of individual decision trees. |
| Support Vector (SV) variant with Support Vector Machines (SVM) and Support Vector Regression (SVR) for fitting the zero- and count-part models, respectively. | SVM finds the optimal hyperplane that separates the cases into negative (absence)/positive (presence) classes by maximizing the margin between the data points of the two classes. SVR is an extension of SVM that shares the same principles but aims to estimate a continuous outcome for a response variable. |
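To make the Regression Analysis row more concrete, the following sketch shows how a logistic model turns log odds into a class label and how a one-predictor linear model is fitted by least squares. The function names, coefficients and data are hypothetical and chosen only for illustration; they are not taken from the study.

```python
import math


def log_odds(intercept, coefficients, features):
    """Linear predictor of a logistic model: the log odds of the positive class."""
    return intercept + sum(b * x for b, x in zip(coefficients, features))


def probability(log_odds_value):
    """Inverse of the logit function: maps log odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds_value))


def classify(p, threshold=0.5):
    """Assign the positive (presence) class when the probability reaches the threshold."""
    return 1 if p >= threshold else 0


def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical coefficients for two predictors.
p = probability(log_odds(-1.0, [0.5, 1.5], [1.0, 1.0]))
label = classify(p)
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

The 0.5 threshold is a common default rather than a value stated in the source; a different cut-off would trade precision against recall.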
Performance evaluation metrics for classification and regression tasks.
| Task | Metric |
|---|---|
| Classification (zero-part model) | Accuracy |
| | Balanced accuracy |
| | Precision |
| | Recall |
| | F-measure |
| Regression (count-part model) | Median Error (MdE) |
| | Median Absolute Error (MdAE) |
| | Median Magnitude of Relative Error (MdMRE) |
| | Median Magnitude of Relative Error to the Estimate (MdMER) |
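The metrics listed above can be computed as in the sketch below, which uses the standard textbook definitions. Two points are assumptions rather than facts from this page: the error is taken as actual minus predicted (the sign convention is not stated here), and the helper names `classification_metrics` and `regression_metrics` are hypothetical. The classification helper assumes both classes occur and at least one positive prediction is made; the regression helper assumes non-zero actual and predicted values, which fits the count-part setting.

```python
from statistics import median


def classification_metrics(actual, predicted):
    """Metrics for binary labels coded as 0 (negative) and 1 (positive)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    recall = tp / (tp + fn)        # sensitivity
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
    }


def regression_metrics(actual, predicted):
    """Median-based error metrics; error sign (actual - predicted) is an assumption."""
    errors = [a - p for a, p in zip(actual, predicted)]
    abs_errors = [abs(e) for e in errors]
    return {
        "MdE": median(errors),
        "MdAE": median(abs_errors),
        # Relative to the actual value.
        "MdMRE": median(ae / a for ae, a in zip(abs_errors, actual)),
        # Relative to the estimate.
        "MdMER": median(ae / p for ae, p in zip(abs_errors, predicted)),
    }


# Invented toy values for illustration only.
clf = classification_metrics([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
reg = regression_metrics([10, 20, 40], [8, 25, 40])
```

Median-based measures are less sensitive to a few very large errors than mean-based ones, which suits a skewed outcome such as the SYNTAX score.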
Descriptive and exploratory analyses for categorical risk factors and SYNTAX score.
| Risk factor | Level | N (%) | SYNTAX score: mean (SD) | SYNTAX score: median [min, max] | p-value |
|---|---|---|---|---|---|
| Gender | Female | 90 (29.70) | 8.29 (11.46) | 0.00 [0, 49.0] | 0.061 |
| | Male | 213 (70.30) | 10.79 (12.78) | 7.00 [0, 54.5] | |
| Hypertension | No | 109 (35.97) | 8.45 (12.00) | 0.00 [0, 49.0] | |
| | Yes | 194 (64.03) | 10.94 (12.62) | 7.00 [0, 54.5] | |
| Diabetes mellitus | No | 215 (70.96) | 8.29 (11.18) | 2.00 [0, 45.0] | |
| | Yes | 88 (29.04) | 14.34 (14.27) | 9.75 [0, 54.5] | |
| Dyslipidaemia | No | 163 (53.80) | 10.04 (12.64) | 5.00 [0, 49.0] | 0.757 |
| | Yes | 140 (46.20) | 10.05 (12.24) | 6.00 [0, 54.5] | |
| Positive (+) family history of CAD | No | 252 (83.17) | 10.00 (12.64) | 5.00 [0, 54.5] | 0.705 |
| | Yes | 51 (16.83) | 10.29 (11.51) | 7.00 [0, 41.5] | |
| Smoking | No | 196 (64.69) | 9.71 (12.55) | 5.00 [0, 54.5] | 0.353 |
| | Yes | 107 (35.31) | 10.66 (12.25) | 7.00 [0, 44.5] | |
| Chronic kidney failure | No | 290 (95.71) | 10.07 (12.39) | 5.00 [0, 54.5] | 0.651 |
| | Yes | 13 (4.29) | 9.62 (13.93) | 0.00 [0, 42.0] | |
| Peripheral vascular disease | No | 292 (96.37) | 9.60 (11.95) | 5.00 [0, 49.0] | |
| | Yes | 11 (3.63) | 21.82 (18.88) | 20.50 [0, 54.5] | |
| ST-T changes | No | 252 (83.17) | 10.48 (12.55) | 6.00 [0, 54.5] | 0.088 |
| | Yes | 51 (16.83) | 7.89 (11.74) | 0.00 [0, 42.0] | |
Risk factors presenting a statistically significant effect on SYNTAX score are highlighted in bold font.
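The per-level summaries in this table (count and percentage, mean and standard deviation, median and range) can be produced with a small helper such as the one below. It is a sketch under assumptions: `describe_by_level` is a hypothetical name, the sample standard deviation (n - 1 denominator) is assumed, each level needs at least two cases, and the data are invented. The statistical test behind the p-values is not stated on this page, so none is implemented here.

```python
from statistics import mean, median, stdev


def describe_by_level(levels, scores):
    """Summarize a numeric outcome for each level of a categorical risk factor."""
    total = len(scores)
    groups = {}
    for level, score in zip(levels, scores):
        groups.setdefault(level, []).append(score)
    summary = {}
    for level, values in groups.items():
        summary[level] = {
            "n": len(values),
            "percent": round(100 * len(values) / total, 2),
            "mean": round(mean(values), 2),
            "sd": round(stdev(values), 2),  # sample SD, an assumption
            "median": median(values),
            "min": min(values),
            "max": max(values),
        }
    return summary


# Invented toy data: risk-factor level and score per case.
levels = ["No", "No", "No", "Yes", "Yes", "Yes"]
scores = [0, 0, 6, 4, 10, 16]
summary = describe_by_level(levels, scores)
```

As a consistency check on such a table, the percentages of the levels of one risk factor should sum to 100 and the counts should sum to the cohort size (303 here).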
Descriptive and exploratory analyses for continuous risk factors and SYNTAX score.
| Risk factor | Mean | SD | Median | Min | Max | Correlation with SYNTAX score (p-value) |
|---|---|---|---|---|---|---|
| Age (in years) | 64.25 | 11.13 | 66.00 | 24.00 | 87.00 | 0.092 (0.110) |
| Body mass index (BMI) (kg/m²) | 28.85 | 4.97 | 28.40 | 12.20 | 44.30 | −0.083 (0.151) |
| Systolic arterial pressure (SAP) (mmHg) | 136.08 | 17.85 | 135.00 | 85.00 | 200.00 | |
| Diastolic arterial pressure (DAP) (mmHg) | 80.17 | 9.62 | 80.00 | 49.00 | 110.00 | −0.017 (0.767) |
| Glomerular filtration rate (GFR) by CKD-EPI (mL/min/1.73 m²) | 93.15 | 34.43 | 89.00 | 6.10 | 254.40 | −0.089 (0.121) |
| UREA (mg/dL) | 40.97 | 20.58 | 36.00 | 0.84 | 177.00 | 0.062 (0.285) |
| Total cholesterol (CHOL) (mg/dL) | 164.73 | 42.14 | 162.00 | 5.50 | 341.00 | −0.045 (0.438) |
| High density lipoprotein cholesterol (HDL) (mg/dL) | 45.47 | 14.80 | 43.00 | 18.00 | 109.00 | |
| Aspartate aminotransferase (SGOT) (units/L) | 21.38 | 11.20 | 19.00 | 4.00 | 102.00 | −0.032 (0.578) |
| Alanine aminotransferase (SGPT) (units/L) | 23.25 | 14.76 | 19.00 | 3.00 | 114.00 | −0.013 (0.819) |
| Hemoglobin (HGB) (g/dL) | 14.02 | 1.65 | 14.00 | 4.52 | 18.80 | −0.053 (0.354) |
| Platelets (PLT) (*1000) | 232.38 | 65.02 | 227.00 | 70.00 | 599.00 | 0.065 (0.262) |
| White blood cells (WBC) (*1000) | 7.51 | 1.98 | 7.31 | 1.06 | 14.90 | |
| | 1.51 | 0.81 | 1.41 | 0.04 | 6.77 | |
| | 0.33 | 0.17 | 0.29 | 0.01 | 1.47 | 0.013 (0.826) |
| Atherogenic index of plasma levels | 0.47 | 0.32 | 0.46 | −0.21 | 1.73 | |
Statistically significant correlations are highlighted in bold font.
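The last column pairs a correlation coefficient with its p-value, but this page does not say which coefficient was used. As a hedged illustration, the sketch below computes Spearman's rank correlation, a common choice for skewed outcomes; treating the column as Spearman is an assumption, the function names are hypothetical, and the p-value calculation is left out. Inputs are assumed to be non-constant.

```python
import math
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equally long, non-constant sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented toy data for illustration only.
rho_up = spearman([1, 2, 3, 4], [10, 20, 30, 40])
rho_down = spearman([1, 2, 3, 4], [4, 3, 2, 1])
tied = average_ranks([10, 20, 20, 30])
```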
Figure 2. The boxplots and violin plots represent the distributions of the SYNTAX score of patients (dots) for each level of categorical risk factor.
Figure 3. Importance of features extracted by the Boruta algorithm (zero-part) (the abbreviations of the risk factors can be found in Table 4) [Ratio 1: Monocyte-to-HDL-cholesterol ratio; Ratio 2: Lymphocyte-to-monocyte ratio; Ratio 3: Atherogenic Index of Plasma levels].
Figure 4. Importance of features extracted by the Boruta algorithm (count-part) (the abbreviations of the risk factors can be found in Table 4).
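Boruta judges each real feature against randomized "shadow" copies of the features, keeping those that repeatedly score higher than the best shadow. The sketch below illustrates only that core idea and departs from the real algorithm in two labeled ways: it swaps the random-forest importance for the absolute Pearson correlation with the target so that it runs on the standard library alone, and it merely counts "hits" instead of applying Boruta's statistical test to confirm or reject features. The names `abs_correlation` and `shadow_hits` and the data are hypothetical.

```python
import math
import random


def abs_correlation(xs, ys):
    """Stand-in importance score: |Pearson correlation| (0.0 for constant input)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy / math.sqrt(sxx * syy))


def shadow_hits(features, target, n_iter=50, seed=0):
    """Count how often each feature beats the best shuffled (shadow) feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_iter):
        shadow_scores = []
        for values in features.values():
            shadow = list(values)
            rng.shuffle(shadow)  # shuffling breaks any link with the target
            shadow_scores.append(abs_correlation(shadow, target))
        best_shadow = max(shadow_scores)
        for name, values in features.items():
            if abs_correlation(values, target) > best_shadow:
                hits[name] += 1
    return hits


# Invented toy data: one feature tied to the target, one scrambled feature.
target = list(range(20))
features = {
    "signal": [2 * t + 1 for t in target],
    "noise": [(7 * t) % 20 for t in target],
}
hits = shadow_hits(features, target, n_iter=50, seed=0)
```

A feature that carries real information should collect a hit in nearly every iteration, while an uninformative one should win only occasionally.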
Performance evaluation results of zero-part models (classification task).
| | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Accuracy | | 0.5545 | 0.6304 | | 0.6436 | | 0.5941 | 0.5809 |
| Balanced accuracy | | 0.5391 | 0.6075 | | 0.6085 | 0.5748 | 0.5394 | 0.5240 |
| Precision | | 0.6022 | 0.6488 | | 0.6407 | 0.6190 | 0.5926 | 0.5839 |
| Recall | 0.7688 | 0.6474 | 0.7679 | 0.7399 | 0.8555 | 0.8266 | | |
| F1 | | 0.6240 | 0.7037 | 0.6845 | 0.7327 | 0.7079 | 0.7223 | |
The best classifier in terms of each performance measure is denoted in bold font for both training and test sets.
Performance evaluation results of count-part models (regression task).
| | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| MdE | −1.3125 | −1.3548 | −2.3666 | −2.4494 | −1.5827 | −1.8972 | | |
| MdAE | 6.3125 | 8.5806 | 7.5393 | 7.7431 | | 7.5513 | 6.3216 | |
| MdMRE | 0.3567 | 0.4885 | 0.4511 | 0.4679 | | 0.4451 | 0.4340 | |
| MdMER | | | 0.4469 | 0.4624 | 0.3715 | 0.4604 | 0.4352 | 0.4535 |
The best regression model in terms of each performance measure is denoted in bold font for both training and test sets.