Benjamin A Goldstein1,2, Ann Marie Navar2, Rickey E Carter3.
Abstract
Risk prediction plays an important role in clinical cardiology research. Traditionally, most risk models have been based on regression models. While useful and robust, these statistical methods are limited to using a small number of predictors which operate in the same way on everyone, and uniformly throughout their range. The purpose of this review is to illustrate the use of machine-learning methods for development of risk prediction models. Typically presented as black box approaches, most machine-learning methods are aimed at solving particular challenges that arise in data analysis that are not well addressed by typical regression approaches. To illustrate these challenges, as well as how different methods can address them, we consider trying to predict mortality after diagnosis of acute myocardial infarction. We use data derived from our institution's electronic health record and abstract data on 13 regularly measured laboratory markers. We walk through different challenges that arise in modelling these data and then introduce different machine-learning approaches. Finally, we discuss general issues in the application of machine-learning methods including tuning parameters, loss functions, variable importance, and missing data. Overall, this review serves as an introduction for those working on risk modelling to approach the diffuse field of machine learning.
Keywords: Electronic health records; Personalized medicine; Precision medicine; Risk prediction
Year: 2017 PMID: 27436868 PMCID: PMC5837244 DOI: 10.1093/eurheartj/ehw302
Source DB: PubMed Journal: Eur Heart J ISSN: 0195-668X Impact factor: 29.983
Model fits for different algorithms
| Algorithm | AUC | Squared-error loss | Logistic loss | Misclassification rate^a |
|---|---|---|---|---|
| **Regression based** | | | | |
| Logistic regression | 0.702 | 0.049 | 0.995 | 0.23 |
| Forward selection | 0.761 | | 0.995 | 0.24 |
| LASSO | 0.750 | | 0.995 | 0.26 |
| Ridge | 0.753 | 0.047 | 0.996 | 0.27 |
| PCR | 0.546 | 0.049 | 0.998 | 0.41 |
| Generalized additive model | 0.708 | 0.050 | | 0.22 |
| **Tree based** | | | | |
| CART | 0.623 | 0.053 | 0.997 | |
| Random forests | 0.741 | 0.048 | 0.995 | 0.32 |
| Boosting | | 0.047 | 0.996 | 0.20 |
| **Other** | | | | |
| Nearest neighbours | 0.583 | 0.050 | 0.998 | 0.22 |
| Neural networks | 0.598 | 0.065 | 0.996 | 0.44 |
Bold values represent the best algorithm for each performance metric.
^a Misclassification rate is discretized at the mean event rate.
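The four performance metrics in the table above can be computed with standard tools. A minimal sketch in Python with scikit-learn, using synthetic data as a stand-in for the paper's EHR cohort of 13 laboratory markers (the authors' exact metric definitions, e.g. the base of the logarithm in the logistic loss, may differ):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, brier_score_loss, log_loss

# Hypothetical stand-in for the 13 regularly measured lab markers;
# not the paper's acute-MI cohort.
X, y = make_classification(n_samples=2000, n_features=13, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X, y)
p = model.predict_proba(X)[:, 1]  # predicted mortality risk

auc = roc_auc_score(y, p)          # discrimination (AUC)
sq_err = brier_score_loss(y, p)    # squared-error (Brier) loss
logloss = log_loss(y, p)           # logistic loss (natural log here)
# Misclassification rate, discretized at the mean event rate (footnote a):
misclass = np.mean((p > y.mean()) != y)
```

In practice these metrics would be estimated on held-out data (cross-validation or a test split) rather than the training set, as in-sample values overstate performance.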
Variable importance rankings
| Variable rank | GLM | LASSO | GAM | Random forests | Boosting | |
|---|---|---|---|---|---|---|
| 1 | CO2 Min | Ca2+ Max | Ca2+ Median | Ca2+ Min | CO2 Min | CO2 Min |
| 2 | CO2 Median | K+ Min | Ca2+ Max | Ca2+ Max | CO2 Median | WBC Max |
| 3 | WBC Max | Hgb Median | Hgb Median | CO2 Median | WBC Max | CO2 Median |
| 4 | K+ Max | Ca2+ Median | K+ Median | Ca2+ Median | Glucose Max | Ca2+ Median |
| 5 | CO2 Max | Hgb Min | K+ Min | RDW Median | WBC Median | K+ Max |
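Rankings like those in the table above can be produced for tree-based methods from impurity-based importances, one of several variable-importance definitions. A minimal sketch for random forests in scikit-learn, again on synthetic data with hypothetical marker names standing in for the paper's lab summaries:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical names standing in for the 13 lab-marker summaries.
names = [f"marker_{i}" for i in range(13)]
X, y = make_classification(n_samples=1000, n_features=13, random_state=1)

rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)

# Rank features from most to least important (mean decrease in impurity).
order = np.argsort(rf.feature_importances_)[::-1]
top5 = [names[i] for i in order[:5]]
```

Note that each algorithm defines importance differently (e.g. coefficient magnitude for penalized regression vs. impurity decrease for forests), which is one reason the rankings in the table only partially agree across methods.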