Ahmed M Alaa, Thomas Bolton, Emanuele Di Angelantonio, James H F Rudd, Mihaela van der Schaar.
Abstract
BACKGROUND: Identifying people at risk of cardiovascular diseases (CVD) is a cornerstone of preventative cardiology. Risk prediction models currently recommended by clinical guidelines are typically based on a limited number of predictors with sub-optimal performance across all patient groups. Data-driven techniques based on machine learning (ML) might improve the performance of risk predictions by agnostically discovering novel risk predictors and learning the complex interactions between them. We tested (1) whether ML techniques based on a state-of-the-art automated ML framework (AutoPrognosis) could improve CVD risk prediction compared to traditional approaches, and (2) whether considering non-traditional variables could increase the accuracy of CVD risk predictions.
Year: 2019 PMID: 31091238 PMCID: PMC6519796 DOI: 10.1371/journal.pone.0213653
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. An illustrative schematic for AutoPrognosis.
In this depiction, AutoPrognosis constructs an ensemble of three ML pipelines. Pipeline 1 uses the MissForest algorithm to impute missing data, and then compresses the data into a lower-dimensional space using the principal component analysis (PCA) algorithm, before using the random forest algorithm to issue predictions. Pipelines 2 and 3 use different algorithms for imputation, feature processing, classification and calibration. AutoPrognosis uses the algorithm in [19] to make decisions on what pipelines to select and how to tune the pipelines’ parameters.
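The structure described in this caption (several pipelines, each chaining imputation, feature processing and prediction, combined into one ensemble) can be illustrated with a minimal standard-library sketch. This is not the AutoPrognosis implementation: mean imputation stands in for MissForest, keeping the first k features stands in for PCA, a fixed-weight logistic function stands in for a trained classifier, and the weighted-average combination rule, all function names and all numbers are assumptions made for illustration.

```python
import math
from statistics import mean


def impute_mean(rows):
    """Replace None entries with the column mean (simple stand-in for MissForest)."""
    n_cols = len(rows[0])
    col_means = [mean(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]
    return [[col_means[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]


def keep_first_k(k):
    """Keep only the first k features (crude stand-in for PCA)."""
    return lambda rows: [r[:k] for r in rows]


def logistic_scorer(weights, bias):
    """Fixed-weight logistic model (stand-in for a trained classifier)."""
    def predict(rows):
        return [
            1.0 / (1.0 + math.exp(-(bias + sum(w * x for w, x in zip(weights, r)))))
            for r in rows
        ]
    return predict


def run_pipeline(stages, rows):
    """Apply each stage in order; the last stage returns risk scores."""
    for stage in stages:
        rows = stage(rows)
    return rows


def ensemble_predict(pipelines, weights, rows):
    """Weighted average of the per-pipeline risk scores (assumed combination rule)."""
    per_pipeline = [run_pipeline(p, rows) for p in pipelines]
    total = sum(weights)
    return [
        sum(w * preds[i] for w, preds in zip(weights, per_pipeline)) / total
        for i in range(len(rows))
    ]


# Hypothetical toy data with missing values.
rows = [[1.0, None, 0.5], [0.0, 2.0, None], [2.0, 4.0, 1.5]]
pipelines = [
    [impute_mean, keep_first_k(2), logistic_scorer([0.5, -0.2], 0.0)],
    [impute_mean, logistic_scorer([0.1, 0.1, 0.1], -0.3)],
]
risks = ensemble_predict(pipelines, [0.6, 0.4], rows)
```

Representing each pipeline as a list of callables keeps the stages interchangeable, which mirrors the idea that the framework selects among alternative algorithms at each stage.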
List of algorithms included in AutoPrognosis.
| Pipeline Stage | Algorithms |
|---|---|
| Imputation | missForest, Mean, MICE, Median, EM, None, Most-frequent, Matrix completion |
| Feature processing | Feature agglomeration, R. kitchen sinks, Select Rates, Kernel PCA, Fast ICA, Nystroem, Polynomial, PCA, Linear SVM |
| Classification | Bernoulli NB, Linear SVM, Gaussian NB, Multinomial NB, Light GBM, Survival Forest, Cox Regression, AdaBoost, Gradient Boosting, XGBoost, Random Forest, Logistic Regression, Bagging, Ridge Classifier, Decision Tree, LDA, Extr. Random Trees, Neural Network, Gaussian Process |
| Calibration | Sigmoid, None |
MICE: multiple imputation by chained equations, EM: expectation maximization, PCA: principal component analysis, ICA: independent component analysis, SVM: support vector machines, NB: Naïve Bayes, NN: nearest neighbors, LDA: linear discriminant analysis, GBM: gradient boosting machine.
Performance of all prediction models under consideration.
| Model | AUC-ROC | Absolute AUC-ROC Change |
|---|---|---|
| Framingham Score | 0.724 ± 0.004 | Baseline model |
| Cox PH Model (7 core variables) | 0.734 ± 0.005 | + 1.0% |
| Cox PH Model (all variables) | 0.758 ± 0.005 | + 3.4% |
| Support Vector Machines | 0.709 ± 0.061 | - 1.5% |
| Random Forest | 0.730 ± 0.004 | + 0.6% |
| Neural Networks | 0.755 ± 0.005 | + 3.1% |
| AdaBoost | 0.759 ± 0.004 | + 3.5% |
| Gradient Boosting | 0.769 ± 0.005 | + 4.5% |
| AutoPrognosis (7 core variables) | 0.744 ± 0.005 | + 2.0% |
| AutoPrognosis (369 non-lab. variables) | 0.761 ± 0.005 | + 3.7% |
| AutoPrognosis (104 lab. variables) | 0.735 ± 0.008 | + 1.1% |
| AutoPrognosis (all variables) | 0.774 ± 0.005 | + 5.0% |
The Framingham score is provided as the reference model for comparative purposes.
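The "Absolute AUC-ROC Change" column is the difference between each model's AUC-ROC and the Framingham baseline, expressed in percentage points. A short sketch that reproduces a few of the entries from the table's AUC-ROC values (the function and variable names are illustrative, not from the paper):

```python
# AUC-ROC point estimates copied from the performance table.
BASELINE_AUC = 0.724  # Framingham score

models = {
    "Cox PH Model (7 core variables)": 0.734,
    "Support Vector Machines": 0.709,
    "Gradient Boosting": 0.769,
    "AutoPrognosis (all variables)": 0.774,
}


def absolute_change_pct(auc, reference=BASELINE_AUC):
    """Absolute AUC-ROC change versus the reference, in percentage points."""
    return round((auc - reference) * 100, 1)


changes = {name: absolute_change_pct(auc) for name, auc in models.items()}

for name, delta in changes.items():
    print(f"{name}: {delta:+.1f}%")
```

The change is an absolute difference in AUC-ROC, not a relative improvement: 0.774 versus 0.724 is reported as +5.0%, not as 0.050 / 0.724.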
Variable ranking by their contribution to the predictions of AutoPrognosis.
| Variable (Men) | Score | Variable (Women) | Score |
|---|---|---|---|
| | 0.346 | | 0.370 |
| | 0.101 | | 0.099 |
| Usual walking pace | 0.052 | Usual walking pace | 0.057 |
| | 0.040 | Ankle spacing width | 0.035 |
| Microalbumin in urine | 0.032 | Self-reported health rating | 0.030 |
| High blood pressure | 0.030 | | 0.026 |
| Red blood cell distribution width | 0.025 | High blood pressure | 0.024 |
| Self-reported health rating | 0.019 | Red blood cell distribution width | 0.023 |
| Haematocrit percentage | 0.014 | Microalbumin in urine | 0.017 |
| Father age at death | 0.014 | Father age at death | 0.017 |
| | 0.013 | White blood cell count | 0.011 |
| Diastolic blood pressure | 0.012 | Number of Treatments | 0.011 |
| White blood cell count | 0.012 | Mean reticulocyte volume | 0.008 |
| Impedance of arm (left) | 0.009 | Leg predicted mass (right) | 0.006 |
| Haemoglobin concentration | 0.007 | Neutrophil count | 0.006 |
| Neutrophil count | 0.005 | Basal metabolic rate | 0.005 |
| Number of Treatments | 0.004 | Hormone-replacement therapy usage | 0.005 |
| Mean reticulocyte volume | 0.004 | Blood clot in the leg | 0.004 |
| Urinary sodium concentration | 0.004 | Forced expiratory volume | 0.004 |
| Monocyte count | 0.004 | Duration of fitness test | 0.004 |
* Risk factors utilized by existing risk prediction algorithms.
Explanations for the different variables in this table are provided in S2 Appendix.
Performance of AutoPrognosis in the diabetic patient subgroup.
| Model | AUC-ROC (No diabetes) | AUC-ROC (Diabetes) |
|---|---|---|
| Framingham score | 0.724 ± 0.004 | 0.578 ± 0.018 |
| AutoPrognosis | 0.774 ± 0.005 | 0.713 ± 0.010 |
Performance of AutoPrognosis and the Framingham score validated separately on a testing cohort of diabetic patients (1,790 participants), and a testing cohort of non-diabetic patients (40,570 participants) via 10-fold cross-validation. AutoPrognosis was trained using the entire training cohort that combines both diabetic and non-diabetic individuals (381,244 participants).
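The subgroup evaluation described here amounts to scoring one trained model and computing AUC-ROC separately within each group. The sketch below uses the pairwise definition of AUC-ROC (the probability that a randomly chosen case receives a higher score than a randomly chosen non-case, with ties counted as one half). The labels, scores and group assignments are hypothetical toy values, and the helper names are assumptions; cross-validation is not shown.

```python
def auc_roc(labels, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


def auc_by_group(labels, scores, groups):
    """Compute AUC-ROC separately for each subgroup label."""
    result = {}
    for g in sorted(set(groups)):
        idx = [i for i, v in enumerate(groups) if v == g]
        result[g] = auc_roc([labels[i] for i in idx], [scores[i] for i in idx])
    return result


# Hypothetical toy data: 1 = CVD event, scores = predicted risk.
labels = [0, 1, 0, 1, 0, 1, 1, 0]
scores = [0.2, 0.7, 0.4, 0.3, 0.1, 0.9, 0.6, 0.5]
groups = ["no_diabetes"] * 4 + ["diabetes"] * 4

per_group = auc_by_group(labels, scores, groups)
```

The pairwise loop is quadratic in the number of participants, so a rank-based formulation would be preferable at the cohort sizes reported above; the simple form is used here for readability.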
Variable ranking for the diabetic population.
| Variable | Score |
|---|---|
| Age | 0.207 |
| | 0.110 |
| Usual walking pace | 0.078 |
| Smoking status | 0.064 |
| Systolic blood pressure | 0.034 |
| Red blood cell distribution width | 0.027 |
| Neutrophil count | 0.018 |
| Number of Treatments | 0.018 |
| High blood pressure | 0.014 |
| Urinary sodium concentration | 0.014 |
Fig 2. Predictive ability of the UK Biobank variables for men and women.
Each point represents a variable in the UK Biobank ordered by the ability to predict CVD events for men and women. Predictions based solely on age achieved an AUC-ROC of 0.632 ± 0.003 for men and 0.665 ± 0.002 for women. We report the AUC-ROC from models trained with individual variables in addition to age, and only display variables that achieved a statistically significant improvement in AUC-ROC compared to predictions based on age only. Each color represents a different variable category. Variables deviating from the (dotted gray) regression line have an AUC-ROC that differs between men and women more than expected in view of the overall association between the two genders, suggesting a stronger relative importance in one gender group.
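The reading of the figure described above (variables far from the regression line matter relatively more in one sex) can be sketched as a least-squares fit of the women's AUC-ROC on the men's AUC-ROC, followed by a look at the residuals. The caption does not state how the regression line was fitted, so ordinary least squares is an assumption here, and the variable names and AUC-ROC values below are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(x, y):
    """Vertical distance of each point from the fitted line."""
    slope, intercept = fit_line(x, y)
    return [b - (slope * a + intercept) for a, b in zip(x, y)]


# Hypothetical AUC-ROC values for "age + one variable" models.
variables = ["var_a", "var_b", "var_c", "var_d"]
auc_men = [0.640, 0.650, 0.660, 0.670]
auc_women = [0.672, 0.680, 0.700, 0.698]

deviation = dict(zip(variables, residuals(auc_men, auc_women)))
# A positive residual means the variable predicts better in women than the
# overall men/women trend would suggest; a negative one favours men.
most_divergent = max(deviation, key=lambda v: abs(deviation[v]))
```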