Shelda Sajeev, Stephanie Champion, Alline Beleigoli, Derek Chew, Richard L Reed, Dianna J Magliano, Jonathan E Shaw, Roger L Milne, Sarah Appleton, Tiffany K Gill, Anthony Maeder.
Abstract
Effective cardiovascular disease (CVD) prevention relies on timely identification of, and intervention for, individuals at risk. Conventional formula-based techniques have been shown to over- or under-predict CVD risk in the Australian population. This study assessed the ability of machine learning models to predict CVD mortality risk in the Australian population and compared their performance with the well-established Framingham model. Data were drawn from three Australian cohort studies: the North West Adelaide Health Study (NWAHS), the Australian Diabetes, Obesity, and Lifestyle study (AusDiab), and the Melbourne Collaborative Cohort Study (MCCS). Four machine learning models for predicting 15-year CVD mortality risk were developed and compared to the 2008 Framingham model. The machine learning models performed significantly better than the Framingham model when applied to the three Australian cohorts, improving prediction by 2.7% to 5.2% across the cohorts. In an aggregated cohort, machine learning models improved prediction by up to 5.1% (area-under-curve (AUC) 0.852, 95% CI 0.837–0.867). Net reclassification improvement (NRI) was up to 26% with machine learning models. Machine learning based models also showed improved performance when stratified by sex and diabetes status. The results suggest a potential for improving CVD risk prediction in the Australian population using machine learning models.
Keywords: artificial intelligence; cardiovascular disease; cardiovascular risk factors; clinical decision support; machine learning; risk prediction
Year: 2021 PMID: 33808743 PMCID: PMC8003399 DOI: 10.3390/ijerph18063187
Source DB: PubMed Journal: Int J Environ Res Public Health ISSN: 1660-4601 Impact factor: 3.390
Data collection methods and measures for the cardiovascular disease (CVD) risk factor variables used in the analysis.
| Risk Factor | Data Collection Methods | Measures |
|---|---|---|
| Age | Self-report | Years |
| Sex | Self-report | Male/Female |
| Total Cholesterol | Biomedical measure | Fasting blood sample |
| High-density lipoprotein (HDL) Cholesterol | Biomedical measure | Fasting blood sample |
| Systolic blood pressure | Biomedical measure | Dinamap/mercury sphygmomanometer, average of two recorded measures |
| Hypertension medication | Self-report | No/Yes |
| Diabetes | Self-report or biological measure | Told by a doctor that they have diabetes, or fasting plasma glucose (FPG) level of at least 7.0 mmol/L |
| Smoking status | Self-report | No/Yes |
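
The diabetes row combines two sources: a self-reported doctor's diagnosis and a fasting plasma glucose threshold. The following is a minimal sketch of that rule. The field names (`told_by_doctor`, `fpg_mmol_l`) and the handling of missing values are assumptions made for illustration; the study's own variable coding is not given in the text.

```python
from typing import Optional

FPG_THRESHOLD_MMOL_L = 7.0  # fasting plasma glucose cut-off stated in the table


def has_diabetes(told_by_doctor: Optional[bool],
                 fpg_mmol_l: Optional[float]) -> Optional[bool]:
    """Return True if either criterion is met, None if nothing is known.

    Hypothetical helper: argument names and missing-value handling are
    illustrative assumptions, not the study's actual coding.
    """
    if told_by_doctor is True:
        return True
    if fpg_mmol_l is not None and fpg_mmol_l >= FPG_THRESHOLD_MMOL_L:
        return True
    if told_by_doctor is None and fpg_mmol_l is None:
        return None  # both sources missing: leave the value for imputation
    return False
```

Returning `None` when both sources are missing keeps such records distinguishable from confirmed non-diabetic participants.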
Missing numbers and summary data (mean ± standard deviation, or n (%)) for the three study cohorts and the combined cohort: the North West Adelaide Health Study (NWAHS), the Australian Diabetes, Obesity, and Lifestyle study (AusDiab), and the Melbourne Collaborative Cohort Study (MCCS). The values for n, age, male, female, total cholesterol, HDL cholesterol, systolic blood pressure, hypertension medication, diabetes, and smoker were calculated after removing participants with a history of CVD or with missing CVD history or death data, and after imputing the other missing risk factor variables.
| Variable | NWAHS: Summary | NWAHS: Missing | AusDiab: Summary | AusDiab: Missing | MCCS: Summary | MCCS: Missing | Combined: Summary | Combined: Missing |
|---|---|---|---|---|---|---|---|---|
| n | 3654 | | 10,150 | | 32,611 | | 46,305 | |
| Age, y | 48.5 ± 15.8 | nil | 50.0 ± 7.5 | nil | 54.4 ± 8.6 | nil | 53.0 ± 10.9 | nil |
| Male, n (%) | 1693 (46.3) | nil | 4437 (43.7) | nil | 12,790 (39.3) | nil | 18,919 (40.8) | nil |
| Female, n (%) | 1961 (53.7) | nil | 5713 (56.3) | nil | 19,722 (60.7) | nil | 27,386 (59.2) | nil |
| Total cholesterol (mg/dL) | 94.9 ± 18.8 | 41 | 102.1 ± 23.4 | 2 | 99.2 ± 19.0 | 151 | 99.5 ± 19.1 | 194 |
| HDL cholesterol (mg/dL) | 24.7 ± 6.8 | 41 | 25.8 ± 1.6 | 4 | 29.4 ± 7.9 | 10,503 | 29.7 ± 42.4 | 10,548 |
| Systolic blood pressure (mm Hg) | 126.6 ± 17.9 | 0 | 128.4 ± 7.5 | 54 | 135.9 ± 18.7 | 117 | 133.5 ± 18.9 | 171 |
| Hypertension medication, n (%) | 451 (12.3) | 0 | 792 (7.8) | 98 | 4671 (14.4) | 94 | 6452 (13.9) | 192 |
| Diabetes, n (%) | 233 (6.4) | 13 | 1252 (12.3) | 169 | 1051 (3.2) | 9 | 3791 (8.2) | 191 |
| Smoker, n (%) | 1957 (53.6) | 22 | 2124 (20.9) | 212 | 13,382 (41.2) | 10 | 19,833 (42.8) | 244 |
| History of CVD | 326 | 6 | 938 | 142 | 7035 | nil | 8299 | 148 |
| CVD death, n (%) | 121 (3.3) | 70 | 341 (3.4) | 17 | 520 (1.6) | 1867 | 982 (2.1) | 1954 |
Figure 1. Flowchart describing the machine learning approach. CVD, cardiovascular disease; SMOTE, Synthetic Minority Over-sampling Technique; ML, machine learning; LR, logistic regression; LDA, linear discriminant analysis; SVM, support vector machine with linear kernel; RF, random forest.
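
The flowchart names SMOTE as the step that addresses class imbalance (CVD deaths make up roughly 2% of the combined cohort). As a rough illustration of the idea, the sketch below generates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The function name, `k`, and `seed` are illustrative assumptions; this is not the study's implementation, and the library and parameters the authors used are not stated here.

```python
import math
import random
from typing import List, Sequence

Point = Sequence[float]


def smote(minority: List[Point], n_new: int, k: int = 5,
          seed: int = 0) -> List[List[float]]:
    """Create n_new synthetic samples from the minority class.

    Each sample lies on the segment between a randomly chosen minority
    point and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

Oversampling of this kind is normally applied to the training fold only, so that the evaluation data keep the original class balance.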
Two-fold cross validation: Comparison of the performance of Framingham Score (baseline model (BL)) and four machine learning (ML) models predicting 15-year risk of CVD mortality of NWAHS, AusDiab and MCCS participants, and combined cohorts.
| Models | Area-under-curve (AUC) (95% CI) | p-Value | Difference from BL |
|---|---|---|---|
| NWAHS | |||
| BL: Framingham Score | 0.837 (0.792–0.882) | – | – |
| ML: Logistic Regression | 0.874 (0.833–0.915) | <0.001 | +3.7% |
| ML: Linear Discriminant Analysis | 0.874 (0.833–0.915) | <0.001 | +3.7% |
| ML: Support Vector Machine | 0.873 (0.832–0.914) | <0.001 | +3.6% |
| ML: Random Forest | 0.854 (0.811–0.897) | 0.0162 | +1.7% |
| AusDiab | |||
| BL: Framingham Score | 0.850 (0.824–0.876) | – | – |
| ML: Logistic Regression | 0.900 (0.878–0.922) | <0.001 | +5.0% |
| ML: Linear Discriminant Analysis | 0.901 (0.879–0.923) | <0.001 | +5.1% |
| ML: Support Vector Machine | 0.902 (0.880–0.924) | <0.001 | +5.2% |
| ML: Random Forest | 0.891 (0.868–0.914) | <0.001 | +4.1% |
| MCCS | |||
| BL: Framingham Score | 0.754 (0.730–0.778) | – | – |
| ML: Logistic Regression | 0.753 (0.729–0.777) | 0.230 | −0.1% |
| ML: Linear Discriminant Analysis | 0.756 (0.732–0.780) | 0.070 | +0.2% |
| ML: Support Vector Machine | 0.758 (0.734–0.782) | 0.008 | +0.4% |
| ML: Random Forest | 0.781 (0.757–0.805) | <0.001 | +2.7% |
| Combined | |||
| BL: Framingham Score | 0.802 (0.783–0.817) | – | – |
| ML: Logistic Regression | 0.852 (0.837–0.867) | <0.001 | +5.1% |
| ML: Linear Discriminant Analysis | 0.852 (0.837–0.867) | <0.001 | +5.1% |
| ML: Support Vector Machine | 0.851 (0.836–0.866) | <0.001 | +5.1% |
| ML: Random Forest | 0.832 (0.814–0.848) | 0.001 | +3.0% |
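
To make the two reported quantities concrete, the sketch below computes AUC as the probability that a randomly chosen case receives a higher risk score than a randomly chosen non-case, and expresses a model's gain over the baseline in percentage points. The tabulated differences appear to be absolute AUC differences multiplied by 100 (for example, 0.874 − 0.837 gives +3.7), which is the assumption used here. Function names are illustrative, and the pairwise loop is chosen for clarity rather than speed.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC via pairwise comparison of cases (1) and non-cases (0).

    Ties between a case and a non-case count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def difference_from_baseline(auc_model: float, auc_baseline: float) -> float:
    """Absolute AUC difference in percentage points, rounded to one decimal."""
    return round((auc_model - auc_baseline) * 100, 1)
```

For instance, `difference_from_baseline(0.874, 0.837)` reproduces the +3.7 listed for logistic regression in NWAHS.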
Two-fold cross validation: Comparison of classification (Sensitivity, Specificity, Precision) and net reclassification improvement (NRI) performance of Framingham Score (baseline model (BL)) and four machine learning (ML) models predicting 15-year risk of CVD mortality of NWAHS, AusDiab and MCCS participants, and the combined cohort.
| Models | Sensitivity | Specificity | Precision | NRI % (95% CI) | p-Value |
|---|---|---|---|---|---|
| NWAHS | |||||
| BL: Framingham Score | 41.3 | 91.3 | 14.0 | – | |
| ML: Logistic Regression | 79.5 | 81.7 | 13.2 | 28.5 (25.9–30.5) | <0.001 |
| ML: Linear Discriminant Analysis | 77.7 | 84.1 | 14.5 | 29.1 (26.1–30.6) | <0.001 |
| ML: Support Vector Machine | 80.7 | 81.0 | 12.9 | 29.0 (26.0–31.8) | <0.001 |
| ML: Random Forest | 79.4 | 80.8 | 12.7 | 27.5 (25.7–29.6) | <0.001 |
| AusDiab | |||||
| BL: Framingham Score | 57.1 | 88.2 | 14.4 | – | |
| ML: Logistic Regression | 84.6 | 84.1 | 16.1 | 23.3 (21.1–25.2) | <0.001 |
| ML: Linear Discriminant Analysis | 85.2 | 84.0 | 15.7 | 23.8 (20.7–26.1) | <0.001 |
| ML: Support Vector Machine | 84.0 | 85.4 | 16.7 | 24.1 (22.7–27.7) | <0.001 |
| ML: Random Forest | 84.3 | 83.6 | 15.3 | 22.5 (20.5–24.4) | <0.001 |
| MCCS | |||||
| BL: Framingham Score | 31.2 | 91.4 | 5.6 | – | |
| ML: Logistic Regression | 71.1 | 68.4 | 3.5 | 16.9 (13.6–19.9) | <0.001 |
| ML: Linear Discriminant Analysis | 70.4 | 69.5 | 3.6 | 17.3 (14.1–20.2) | <0.001 |
| ML: Support Vector Machine | 72.0 | 68.1 | 3.6 | 17.5 (13.6–20.4) | <0.001 |
| ML: Random Forest | 81.6 | 63.1 | 3.5 | 22.1 (19.1–24.8) | <0.001 |
| Combined | |||||
| BL: Framingham Score | 41.5 | 90.7 | 8.8 | – | |
| ML: Logistic Regression | 81.0 | 77.7 | 8.1 | 26.5 (20.1–29.8) | <0.001 |
| ML: Linear Discriminant Analysis | 80.5 | 78.2 | 8.2 | 26.5 (20.0–29.9) | <0.001 |
| ML: Support Vector Machine | 80.8 | 77.8 | 8.1 | 26.4 (19.8–29.5) | <0.001 |
| ML: Random Forest | 77.4 | 76.9 | 6.8 | 22.0 (16.5–27.5) | <0.001 |
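
The classification measures in this table follow from a confusion matrix, and the tabulated NRI values are consistent, to within rounding, with the two-category form of NRI: the gain in sensitivity plus the gain in specificity relative to the baseline. The text does not state the formula, so treating NRI this way is an assumption. A short sketch with illustrative function names:

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Sensitivity, specificity and precision (in %) from confusion-matrix counts."""
    return {
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
        "precision": 100.0 * tp / (tp + fp),
    }


def two_category_nri(sens_new: float, spec_new: float,
                     sens_old: float, spec_old: float) -> float:
    """Two-category NRI in percentage points.

    Net gain among events (sensitivity) plus net gain among
    non-events (specificity); assumed form, not stated in the source.
    """
    return (sens_new - sens_old) + (spec_new - spec_old)


# Combined cohort, logistic regression vs. Framingham (values from the table).
nri_lr = two_category_nri(81.0, 77.7, 41.5, 90.7)
```

Here `nri_lr` evaluates to 26.5, matching the tabulated value: the 39.5-point gain in sensitivity outweighs the 13.0-point loss in specificity. Precision stays low for every model because CVD death is rare in these cohorts.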
Variable ranking based on their contribution to the prediction for NWAHS, AusDiab, and MCCS populations. Variables are listed based on their contribution (Score) to the predictions.
| NWAHS | Score | AusDiab | Score | MCCS | Score | Combined | Score |
|---|---|---|---|---|---|---|---|
| Age | 0.412 | Age | 0.429 | Age | 0.422 | Age | 0.563 |
| Systolic blood pressure | 0.251 | Systolic blood pressure | 0.301 | Systolic blood pressure | 0.222 | Systolic blood pressure | 0.201 |
| Hypertension medication | 0.141 | Hypertension medication | 0.116 | Hypertension medication | 0.141 | Hypertension medication | 0.125 |
| Diabetes status | 0.089 | Diabetes status | 0.077 | HDL | 0.105 | Diabetes status | 0.070 |
| Total cholesterol | 0.057 | HDL | 0.036 | Total cholesterol | 0.066 | HDL | 0.020 |
| HDL | 0.028 | Total cholesterol | 0.028 | Diabetes status | 0.032 | Sex | 0.011 |
| Sex | 0.011 | Sex | 0.008 | Sex | 0.005 | Total cholesterol | 0.008 |
| Smoking status | 0.010 | Smoking status | 0.004 | Smoking status | 0.004 | Smoking status | 0.005 |
Two-fold cross validation: Comparison of the performance of Framingham Score (baseline model (BL)) and four machine learning (ML) models predicting 15-year risk of CVD mortality using combined data based on Sex stratification.
| Models | AUC (95% CI) | p-Value | Difference from BL |
|---|---|---|---|
| Men | |||
| BL: Framingham Score | 0.799 (0.776–0.823) | – | – |
| ML: Logistic Regression | 0.816 (0.793–0.839) | <0.001 | +1.7% |
| ML: Linear Discriminant Analysis | 0.818 (0.795–0.841) | <0.001 | +1.9% |
| ML: Support Vector Machine | 0.818 (0.795–0.841) | <0.001 | +1.9% |
| ML: Random Forest | 0.812 (0.791–0.837) | <0.001 | +1.7% |
| Women | |||
| BL: Framingham Score | 0.836 (0.814–0.858) | – | – |
| ML: Logistic Regression | 0.871 (0.851–0.892) | <0.001 | +3.5% |
| ML: Linear Discriminant Analysis | 0.869 (0.848–0.890) | <0.001 | +3.4% |
| ML: Support Vector Machine | 0.870 (0.850–0.891) | <0.001 | +3.4% |
| ML: Random Forest | 0.854 (0.833–0.876) | <0.001 | +2.0% |
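
The two-fold cross-validation behind these tables can be sketched as follows: the data are split into two halves, and each half serves once as the test set while the other is used for training. Whether the study stratified the folds by outcome is not stated; stratification is an assumption here, made because CVD deaths are rare. The function name and `seed` are illustrative.

```python
import random
from typing import List, Sequence, Tuple


def two_fold_indices(labels: Sequence[int],
                     seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return two (train_idx, test_idx) pairs with each class split across folds."""
    rng = random.Random(seed)
    folds: List[List[int]] = [[], []]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        half = len(idx) // 2
        folds[0].extend(idx[:half])
        folds[1].extend(idx[half:])
    # Each fold is the test set once; the other fold is the training set.
    return [
        (sorted(folds[1]), sorted(folds[0])),
        (sorted(folds[0]), sorted(folds[1])),
    ]
```

For a sex- or diabetes-stratified analysis, the same split would be applied within each subgroup separately.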
Two-fold cross validation: Comparison of the classification (Sensitivity, Specificity, Precision) and NRI performance of Framingham Score (baseline model (BL)) and four machine learning (ML) models predicting 15-year risk of CVD mortality using combined data based on Sex stratification.
| Models | Sensitivity | Specificity | Precision | NRI % (95% CI) | p-Value |
|---|---|---|---|---|---|
| Men | |||||
| BL: Framingham Score | 66.3 | 79.3 | 8.0 | – | |
| ML: Logistic Regression | 75.9 | 75.8 | 8.6 | 6.1 (5.0–8.4) | <0.001 |
| ML: Linear Discriminant Analysis | 76.2 | 75.5 | 8.8 | 6.1 (5.0–8.8) | <0.001 |
| ML: Support Vector Machine | 76.1 | 76.0 | 8.6 | 6.5 (6.1–7.7) | <0.001 |
| ML: Random Forest | 77.1 | 74.0 | 7.6 | 5.5 (4.0–6.4) | <0.001 |
| Women | |||||
| BL: Framingham Score | 15.6 | 98.5 | 15.7 | – | |
| ML: Logistic Regression | 83.4 | 79.1 | 7.7 | 48.4 (46.4–50.1) | <0.001 |
| ML: Linear Discriminant Analysis | 81.9 | 80.8 | 8.6 | 48.7 (46.0–50.0) | <0.001 |
| ML: Support Vector Machine | 83.4 | 79.4 | 8.1 | 48.7 (47.3–49.6) | <0.001 |
| ML: Random Forest | 80.6 | 77.6 | 6.1 | 44.1 (43.6–46.5) | <0.001 |
Two-fold cross validation: Comparison of the performance of Framingham Score (baseline model (BL)) and four machine learning (ML) models predicting 15-year risk of CVD mortality using combined data based on diabetes stratification.
| Models | AUC (95% CI) | p-Value | Difference from BL |
|---|---|---|---|
| Diabetes | |||
| BL: Framingham Score | 0.734 (0.696–0.771) | – | – |
| ML: Logistic Regression | 0.823 (0.790–0.856) | <0.001 | +9.0% |
| ML: Linear Discriminant Analysis | 0.824 (0.791–0.857) | <0.001 | +9.1% |
| ML: Support Vector Machine | 0.824 (0.791–0.857) | <0.001 | +9.0% |
| ML: Random Forest | 0.800 (0.766–0.835) | <0.001 | +6.6% |
| Non-Diabetes | |||
| BL: Framingham Score | 0.789 (0.770–0.88) | – | – |
| ML: Logistic Regression | 0.842 (0.824–0.860) | <0.001 | +5.3% |
| ML: Linear Discriminant Analysis | 0.843 (0.825–0.861) | <0.001 | +5.4% |
| ML: Support Vector Machine | 0.844 (0.826–0.862) | <0.001 | +5.5% |
| ML: Random Forest | 0.831 (0.813–0.850) | <0.001 | +4.2% |
Two-fold cross validation: Comparison of the classification (Sensitivity, Specificity, Precision) and NRI performance of Framingham Score (baseline model (BL)) and four machine learning (ML) models predicting 15-year risk of CVD mortality using combined data based on diabetes stratification.
| Models | Sensitivity | Specificity | Precision | NRI % (95% CI) | p-Value |
|---|---|---|---|---|---|
| Diabetes | |||||
| BL: Framingham Score | 70.1 | 63.4 | 11.1 | – | |
| ML: Logistic Regression | 78.8 | 72.7 | 16.0 | 17.9 (15.1–19.6) | <0.001 |
| ML: Linear Discriminant Analysis | 80.0 | 72.2 | 16.0 | 18.7 (16.9–20.0) | <0.001 |
| ML: Support Vector Machine | 79.6 | 72.2 | 15.8 | 18.2 (15.6–20.0) | <0.001 |
| ML: Random Forest | 79.7 | 70.7 | 15.3 | 16.8 (14.5–19.2) | <0.001 |
| Non-Diabetes | |||||
| BL: Framingham Score | 32.6 | 93.0 | 7.7 | – | |
| ML: Logistic Regression | 81.2 | 75.3 | 5.7 | 30.8 (28.6–34.2) | <0.001 |
| ML: Linear Discriminant Analysis | 83.7 | 73.1 | 5.6 | 31.2 (27.6–34.4) | <0.001 |
| ML: Support Vector Machine | 80.2 | 76.2 | 6.6 | 30.8 (28.7–34.0) | <0.001 |
| ML: Random Forest | 77.4 | 76.7 | 5.7 | 28.5 (26.4–32.5) | <0.001 |
External Validation: Comparison of the classification and NRI performance of Framingham Score (baseline model (BL)) and four machine learning (ML) models predicting 15-year risk of CVD mortality using combined AusDiab and MCCS dataset as the training set and NWAHS as the external validation set.
| Models | AUC | Sensitivity | Specificity | Precision | NRI |
|---|---|---|---|---|---|
| BL: Framingham Score | 0.837 | 41.3 | 91.3 | 14.0 | - |
| ML: Logistic Regression | 0.879 | 76.0 | 85.7 | 15.4 | 29.1 |
| ML: Linear Discriminant Analysis | 0.880 | 75.2 | 86.8 | 16.4 | 29.4 |
| ML: Support Vector Machine | 0.880 | 72.5 | 89.0 | 18.5 | 28.9 |
| ML: Random Forest | 0.866 | 79.4 | 80.4 | 12.2 | 27.2 |
| Men | |||||
| BL: Framingham Score | 0.841 | 72.1 | 82.4 | 13.3 | - |
| ML: Logistic Regression | 0.858 | 73.8 | 83.8 | 14.6 | 3.1 |
| ML: Linear Discriminant Analysis | 0.857 | 73.7 | 83.5 | 14.3 | 2.7 |
| ML: Support Vector Machine | 0.856 | 73.9 | 84.6 | 14.8 | 1.3 |
| ML: Random Forest | 0.846 | 72.13 | 82.65 | 13.5 | 0.28 |
| Women | |||||
| BL: Framingham Score | 0.871 | 10.5 | 97.4 | 22.2 | - |
| ML: Logistic Regression | 0.898 | 87.3 | 78.8 | 11.6 | 58.2 |
| ML: Linear Discriminant Analysis | 0.898 | 88.1 | 78.6 | 11.7 | 58.8 |
| ML: Support Vector Machine | 0.900 | 88.4 | 78.4 | 13.5 | 58.9 |
| ML: Random Forest | 0.891 | 84.5 | 83.1 | 11.6 | 59.7 |
| Diabetes | |||||
| BL: Framingham Score | 0.675 | 66.7 | 57.8 | 15.3 | - |
| ML: Logistic Regression | 0.744 | 74.4 | 71.4 | 23.1 | 21.3 |
| ML: Linear Discriminant Analysis | 0.741 | 75.0 | 70.5 | 22.5 | 21.0 |
| ML: Support Vector Machine | 0.738 | 75.8 | 65.3 | 19.8 | 16.6 |
| ML: Random Forest | 0.706 | 62.5 | 79.1 | 25.4 | 17.1 |
| Non-Diabetes | |||||
| BL: Framingham Score | 0.841 | 35.1 | 93.4 | 13.5 | - |
| ML: Logistic Regression | 0.889 | 80.4 | 83.6 | 12.5 | 35.5 |
| ML: Linear Discriminant Analysis | 0.888 | 83.5 | 80.4 | 11.1 | 35.4 |
| ML: Support Vector Machine | 0.890 | 87.6 | 76.0 | 9.7 | 35.1 |
| ML: Random Forest | 0.866 | 78.4 | 81.9 | 11.0 | 31.8 |
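
The external validation design differs from cross-validation in that the split follows cohort membership rather than random assignment: models are trained on AusDiab and MCCS and evaluated on NWAHS only. A minimal sketch of that split, assuming each record carries a hypothetical `cohort` field (the record layout is not described in the text):

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def external_validation_split(records: List[Record],
                              holdout: str = "NWAHS"
                              ) -> Tuple[List[Record], List[Record]]:
    """Train on every cohort except `holdout`; validate on `holdout` only."""
    train = [r for r in records if r["cohort"] != holdout]
    external = [r for r in records if r["cohort"] == holdout]
    if not train or not external:
        raise ValueError("both the training and external sets must be non-empty")
    return train, external
```

Because no NWAHS participant contributes to model fitting, performance on the held-out cohort gives an indication of how the models transfer to a population they have not seen.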