Ben Van Calster, David J McLernon, Maarten van Smeden, Laure Wynants, Ewout W Steyerberg.
Abstract
BACKGROUND: The assessment of calibration performance of risk prediction models based on regression or more flexible machine learning algorithms receives little attention. MAIN TEXT: Herein, we argue that this needs to change immediately because poorly calibrated algorithms can be misleading and potentially harmful for clinical decision-making. We summarize how to avoid poor calibration at algorithm development and how to assess calibration at algorithm validation, emphasizing balance between model complexity and the available sample size. At external validation, calibration curves require sufficiently large samples. Algorithm updating should be considered for appropriate support of clinical practice.
Keywords: Calibration; Heterogeneity; Model performance; Overfitting; Predictive analytics; Risk prediction models
Year: 2019 PMID: 31842878 PMCID: PMC6912996 DOI: 10.1186/s12916-019-1466-7
Source DB: PubMed Journal: BMC Med ISSN: 1741-7015 Impact factor: 8.775
Fig. 1. Illustrations of different types of miscalibration. Illustrations are based on an outcome with a 25% event rate and a model with an area under the ROC curve (AUC or c-statistic) of 0.71. Calibration intercept and slope are indicated for each illustrative curve. (a) General over- or underestimation of predicted risks. (b) Predicted risks that are too extreme or not extreme enough.
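The calibration intercept and slope named in the Fig. 1 caption are commonly obtained from logistic regressions of the observed outcome on the logit of the predicted risk. The sketch below is illustrative only and is not the authors' code: it assumes the slope is the coefficient `b` in `logit(P(y=1)) = a + b * logit(p)`, and the intercept is the constant `a` when the slope is fixed at 1 (the logit of the predicted risk entered as an offset). The function names, the Newton-Raphson fitting loop, and the simulated data are all assumptions made for the example. Under these definitions, targets are an intercept of 0 and a slope of 1; a slope below 1 suggests risks that are too extreme, and a negative intercept suggests general overestimation.

```python
import math
import random


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def expit(z):
    """Inverse of logit, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def calibration_slope(y, p, iters=50):
    """Fit logit(P(y=1)) = a + b * logit(p) by Newton-Raphson; return (a, b).

    Assumes the predicted risks are not all identical (otherwise the
    2x2 information matrix below is singular).
    """
    x = [logit(pi) for pi in p]
    a, b = 0.0, 1.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            mu = expit(a + b * xi)
            w = mu * (1.0 - mu)
            g0 += yi - mu            # score for a
            g1 += (yi - mu) * xi     # score for b
            h00 += w                 # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) + abs(db) < 1e-10:
            break
    return a, b


def calibration_intercept(y, p, iters=50):
    """Fit logit(P(y=1)) = a + logit(p), i.e. slope fixed at 1; return a."""
    x = [logit(pi) for pi in p]
    a = 0.0
    for _ in range(iters):
        g = h = 0.0
        for xi, yi in zip(x, y):
            mu = expit(a + xi)
            g += yi - mu
            h += mu * (1.0 - mu)
        da = g / h
        a += da
        if abs(da) < 1e-10:
            break
    return a


def simulate(n, seed=0):
    """Hypothetical data: true risks and outcomes drawn from those risks."""
    rng = random.Random(seed)
    p_true = [expit(rng.gauss(-1.2, 0.8)) for _ in range(n)]
    y = [1 if rng.random() < pi else 0 for pi in p_true]
    return y, p_true


y, p_true = simulate(5000, seed=1)
p_extreme = [expit(2.0 * logit(pi)) for pi in p_true]  # risks too extreme
p_high = [expit(logit(pi) + 0.5) for pi in p_true]     # risks too high overall
print("slope, too-extreme risks:", calibration_slope(y, p_extreme)[1])
print("intercept, overestimated risks:", calibration_intercept(y, p_high))
```

In this simulation the too-extreme predictions are built by doubling the true log-odds, so the estimated slope should land near 0.5; the overestimated predictions add 0.5 to the true log-odds, so the estimated intercept should land near -0.5. In practice an established logistic regression routine would normally replace the hand-written fitting loops.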
Fig. 2. Calibration curves when validating a model for obstructive coronary artery disease before and after updating. (a) Calibration curve before updating. (b) Calibration curve after updating by re-estimating the model coefficients. The flexible curve with pointwise confidence intervals (gray area) was based on local regression (loess). At the bottom of the graphs, histograms of the predicted risks are shown for patients with (1) and patients without (0) coronary artery disease. Figure adapted from Edlinger et al. [38], which was published under the Creative Commons Attribution–Noncommercial (CC BY-NC 4.0) license.
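A flexible calibration curve of the kind described in the Fig. 2 caption plots a smoothed estimate of the observed event proportion against the predicted risk. The sketch below is a rough illustration, not the method behind the figure: it swaps loess for a simpler Gaussian kernel-weighted average, omits the pointwise confidence intervals, and uses a hypothetical function name, bandwidth, and simulated data. For a well-calibrated model the curve should track the diagonal, that is, observed proportion roughly equal to predicted risk.

```python
import math
import random


def smoothed_calibration_curve(y, p, grid, bandwidth=0.05):
    """Kernel-weighted observed event proportion at each predicted risk in grid.

    A Gaussian kernel average is used here as a simple stand-in for loess.
    """
    curve = []
    for g in grid:
        num = den = 0.0
        for yi, pi in zip(y, p):
            w = math.exp(-0.5 * ((pi - g) / bandwidth) ** 2)
            num += w * yi
            den += w
        # den is 0 only if no prediction lies anywhere near g
        curve.append(num / den if den > 0 else float("nan"))
    return curve


# Hypothetical, perfectly calibrated data: outcomes drawn from the predicted risks.
rng = random.Random(2)
p = [rng.uniform(0.05, 0.95) for _ in range(3000)]
y = [1 if rng.random() < pi else 0 for pi in p]

grid = [0.1 * k for k in range(1, 10)]
for g, obs in zip(grid, smoothed_calibration_curve(y, p, grid)):
    print(f"predicted {g:.1f}  observed {obs:.2f}")
```

The bandwidth controls how wiggly the curve is; with few patients in a risk range, any smoother becomes unstable there, which is one way to see why calibration curves need sufficiently large validation samples.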
Summary points on calibration
| Why calibration matters | - Decisions are often based on risk, so predicted risks should be reliable |
| | - Poor calibration may make a prediction model clinically useless or even harmful |
| Causes of poor calibration | - Statistical overfitting and measurement error |
| | - Heterogeneity in populations in terms of patient characteristics, disease incidence or prevalence, patient management, and treatment policies |
| Assessment of calibration in practice | - Perfect calibration, where predicted risks are correct for every covariate pattern, is utopic; we should not aim for that |
| | - At model development, focus on nonlinear effects and interaction terms only if a sufficiently large sample size is available; low sample sizes require simpler modeling strategies or that no model is developed at all |
| | - Avoid the Hosmer–Lemeshow test to assess or prove calibration |
| | - At internal validation, focus on the calibration slope as a part of the assessment of statistical overfitting |
| | - At external validation, focus on the calibration curve, intercept and slope |
| | - Model updating should be considered in case of poor calibration; re-estimating the model entirely requires sufficient data |