Generalizability of Cardiovascular Disease Clinical Prediction Models: 158 Independent External Validations of 104 Unique Models.

Gaurav Gulati^1,2, Jenica Upshaw^1,2, Benjamin S Wessler^1,2, Riley J Brazil^1, Jason Nelson^1, David van Klaveren^1,3, Christine M Lundquist^1, Jinny G Park^1, Hannah McGinnes^1, Ewout W Steyerberg^3, Ben Van Calster^3,4,5, David M Kent^1.

Abstract

BACKGROUND: While clinical prediction models (CPMs) are used increasingly commonly to guide patient care, the performance and clinical utility of these CPMs in new patient cohorts is poorly understood.
METHODS: We performed 158 external validations of 104 unique CPMs across 3 domains of cardiovascular disease (primary prevention, acute coronary syndrome, and heart failure). Validations were performed in publicly available clinical trial cohorts and model performance was assessed using measures of discrimination, calibration, and net benefit. To explore potential reasons for poor model performance, CPM-clinical trial cohort pairs were stratified based on relatedness, a domain-specific set of characteristics to qualitatively grade the similarity of derivation and validation patient populations. We also examined the model-based C-statistic to assess whether changes in discrimination were because of differences in case-mix between the derivation and validation samples. The impact of model updating on model performance was also assessed.
RESULTS: Discrimination decreased significantly between model derivation (0.76 [interquartile range 0.73-0.78]) and validation (0.64 [interquartile range 0.60-0.67], P<0.001), but approximately half of this decrease was because of narrower case-mix in the validation samples. CPMs had better discrimination when tested in related compared with distantly related trial cohorts. Calibration slope was also significantly higher in related trial cohorts (0.77 [interquartile range, 0.59-0.90]) than distantly related cohorts (0.59 [interquartile range 0.43-0.73], P=0.001). When considering the full range of possible decision thresholds between half and twice the outcome incidence, 91% of models had a risk of harm (net benefit below default strategy) at some threshold; this risk could be reduced substantially via updating model intercept, calibration slope, or complete re-estimation.
CONCLUSIONS: There are significant decreases in model performance when applying cardiovascular disease CPMs to new patient populations, resulting in substantial risk of harm. Model updating can mitigate these risks. Care should be taken when using CPMs to guide clinical decision-making.

Keywords: cardiovascular; cardiovascular diseases; clinical decision; decision support techniques; models; risk; validation study

Year: 2022 PMID: 35354282 PMCID: PMC9015037 DOI: 10.1161/CIRCOUTCOMES.121.008487

Source DB: PubMed Journal: Circ Cardiovasc Qual Outcomes ISSN: 1941-7713

Clinical prediction models (CPMs) are used routinely to guide clinical decision making, yet the majority of published CPMs have never been externally validated. How well these models perform on new populations, as well as how likely these models are to improve clinical decision making, is unknown. In this collection of cardiovascular CPMs, discrimination and calibration decrease substantially when models are validated on external databases, with the largest decrease when derivation and validation cohorts are the most dissimilar. The majority of CPMs had the potential to motivate harmful clinical decisions, particularly when decision thresholds were far from the population average risk. Model updating can reduce the risk of harm and should be performed before widespread clinical use of a CPM.
See Editorial by
Clinical prediction models (CPMs) are multivariable statistical algorithms that produce patient-specific estimates of clinically important outcome risks based on ascertainable clinical characteristics. They are designed to improve prognostication and thus clinical decision making. CPMs are increasingly common and important tools for patient-centered outcomes research and clinical care. Recent reviews have demonstrated the abundance of CPMs in the literature but have also pointed at shortcomings.[1] Our own database, the Tufts Predictive Analytics and Comparative Effectiveness Center CPM Registry,[2] currently includes 1382 CPMs just for patients with cardiovascular disease (CVD), including 344 CPMs for patients with coronary artery disease, 195 for population-based samples (ie, predicting incident CVD), and 135 for patients with heart failure (HF). How well these CPMs are likely to perform when tested on a new patient population is poorly understood.
Large-scale evaluations of model development methods have revealed that the vast majority of models do not follow best practice and are classified as having a high risk of bias.[3,4] The concern that prediction models may fail when disseminated into clinical practice has grown increasingly urgent, now that models are being broadly distributed by vendors and influencing care at a large scale. Examples of failure of clinically influential and widely disseminated models have recently come to light.[5] Our prior literature review found that approximately 60% of published CPMs have never been externally validated. Most of those that have been externally validated have been evaluated only once.[6,7] Yet, our prior analysis also called into question the value of these single validations, since discriminatory performance typically varies tremendously when a single model is evaluated on multiple databases.[6] However, literature reviews have inherent limitations in showing how well models perform when evaluated on external data. For example, when discrimination in a new database is poor, it can be due either to model invalidity or to a case-mix in the external database that is substantially restricted compared with the derivation database. The published literature does not distinguish between these possibilities. Further, when CPMs are validated, typically no clinically interpretable measure of calibration is reported, even though poor calibration is known to lead to harmful decision making. Finally, there are no widely accepted criteria by which one can claim that a model has been validated, since models are assessed for statistical accuracy, but scant attention is paid to whether models can improve the quality of decisions. Given these limitations, it is difficult to understand the quality of CPMs reported in the literature, and how they might influence decision-making if widely disseminated.
Thus, we sought to perform a large-scale, systematic external validation of published CPMs, using both conventional and novel measures of model performance to address some of the above limitations. In particular, we sought to examine both discrimination and calibration, to examine the proportion of decreased performance that might be due to model invalidity versus case-mix, and to examine the influence that predictions might have on decisions through the use of decision curve analysis. We were especially interested in evaluating when CPMs might lead to harmful decision making. We also evaluated the effect of simple updating procedures.

Methods

Source of Models

The Tufts Predictive Analytics and Comparative Effectiveness Center CPM Registry is a registry of CPMs published between January 1990 and December 2015 that predict outcomes in patients at risk for or with known cardiovascular disease. Detailed methods for development of the registry have been reported previously.[2,8] Briefly, for inclusion in the registry, articles must (1) develop a CPM as a primary aim, (2) contain at least 2 outcome predictors, and (3) present enough information to estimate the outcome probability for an individual patient. For this analysis, we selected from the registry all CPMs predicting outcomes for 3 index conditions: (1) acute coronary syndrome, (2) HF (both preserved and reduced ejection fraction), and (3) healthy patients at risk for CVD (primary prevention or population sample). Some data and materials for this analysis have been made publicly available and can be accessed at www.pacecpmregistry.org, and other data are available from the corresponding author upon reasonable request. The Tufts Health Sciences Institutional Review Board (IRB) approved this study.

Source of Validation Cohorts

Deidentified patient-level data from clinical trials were obtained from the National Heart, Lung, and Blood Institute via application to the Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC). For the acute coronary syndrome index condition, we used the AMIS[9] (Aspirin-Myocardial Infarction Study), TIMI-II[10] (Thrombolysis in Myocardial Infarction: phase II), TIMI-III[11] (Thrombolysis in Myocardial Ischemia), MAGIC[12] (Magnesium in Coronaries), and ENRICHD[13] (Enhanced Recovery in Coronary Heart Disease) trials. For the HF index condition, we used the TOPCAT[14] (Treatment of Preserved Cardiac Function Heart Failure with an Aldosterone Antagonist), HEAAL[15] (Heart Failure evaluation of Angiotensin II Antagonist Losartan), HF-ACTION[16] (Heart Failure: A Controlled Trial Investigating Outcomes of Exercise Training), EVEREST[17] (Efficacy of Vasopressin Antagonism in Heart Failure Outcome Study with Tolvaptan), SCD-HeFT[18] (Sudden Cardiac Death in Heart Failure), BEST[19] (Beta Blocker Evaluation of Survival), DIG[20] (Digitalis Investigation Group), and SOLVD[21] (Studies of Left Ventricular Dysfunction) trials. For the primary prevention index condition, we used the ACCORD[22] (Action to Control Cardiovascular Risk in Diabetes), ALLHAT-HTN[23] (Antihypertensive and Lipid-Lowering Treatment to Prevent Heart Attack), ALLHAT-LLT[24] (Lipid-Lowering Therapy), and WHI[25] (Women’s Health Initiative) trials. Details of the trials have been reported previously and are summarized in Tables S1 through S3.

CPM-Dataset Matching Process

To identify which clinical trial dataset could be used to validate which CPMs, we employed a hierarchical matching procedure. First, each CPM was compared with each dataset by nonclinical research staff to identify pairs that had grossly similar inclusion criteria and outcomes, which were then reviewed for appropriateness by clinical experts. Potential pairs passing these screening steps were carefully reviewed at a granular level, and only pairs where sufficient patient-level data existed in the trial dataset such that the CPM could be used to generate a predicted outcome probability for each patient were included in the analysis. Observed outcomes in the patient-level data were defined using the CPM outcome definition and prediction time horizon. A list of CPMs included in this analysis is shown in Table S4.

Measuring CPM Performance

The performance of CPMs in external validation cohorts was evaluated with measures of discrimination, calibration, and net benefit. For all model validations, observed outcome events that occurred after the prediction time horizon were censored. For time-to-event models, the Kaplan-Meier estimator was used for right-censored follow-up times. For binary outcome models, unobserved outcomes (ie, due to loss to follow-up before the prediction time horizon) were considered missing and excluded from analyses. For each CPM-database pair, the linear predictor was calculated for each patient in the dataset using the intercept and coefficients from the published CPM. Model discrimination was assessed using the C-statistic. The percent change in a CPM’s discrimination from the derivation cohort to the validation cohort was calculated as [(Validation C-statistic − 0.5) − (Derivation C-statistic − 0.5)] / (Derivation C-statistic − 0.5) × 100.[26] If the C-statistic at model derivation was not reported, the model was excluded from the assessment of decrement in C-statistic relative to derivation. Since changes in case-mix between derivation and validation population will affect the C-statistic in the validation population even without any change in measured effects, change in discrimination was also compared relative to the model-based C-statistic (MB-c). The MB-c is the C-statistic that would be obtained in the validation database under the assumption that the CPM is perfectly valid in the validation database and depends only on the distribution of the linear predictors in the validation database.[27] For example, even a model with no invalidity would have both a C-statistic of 0.5 and an MB-c of 0.5 in a validation cohort if all patients in that cohort were identical with respect to their covariates.
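The two discrimination quantities defined above can be illustrated with a minimal Python sketch. This is an illustration only, not the code used in the analysis (which was run in R). The function names are hypothetical, and the pairwise MB-c formula assumes a binary outcome with model-predicted probabilities; time-to-event models would need a different formulation. Applied to the median derivation and validation C-statistics from the abstract (0.76 and 0.64), the percent change is about −46%, which need not equal a median of pair-level percent changes.

```python
from itertools import combinations


def percent_change_in_discrimination(derivation_c, validation_c):
    """Percent change in discrimination, measured as the excess over 0.5."""
    return ((validation_c - 0.5) - (derivation_c - 0.5)) / (derivation_c - 0.5) * 100


def model_based_c(predicted_probs):
    """Expected C-statistic if the predictions were perfectly valid.

    Assumes a binary outcome. Only predicted probabilities are used;
    observed outcomes never enter the calculation.
    """
    concordant = 0.0
    informative = 0.0
    for p_i, p_j in combinations(predicted_probs, 2):
        # Probability that exactly one patient of the pair has the event.
        i_only = p_i * (1 - p_j)
        j_only = p_j * (1 - p_i)
        informative += i_only + j_only
        if p_i > p_j:
            concordant += i_only
        elif p_j > p_i:
            concordant += j_only
        else:
            # Tied predictions contribute half, as in the usual C-statistic.
            concordant += 0.5 * (i_only + j_only)
    if informative == 0:
        raise ValueError("MB-c is undefined when all predictions are 0 or all are 1")
    return concordant / informative


# Median C-statistics from the abstract: 0.76 at derivation, 0.64 at validation.
print(round(percent_change_in_discrimination(0.76, 0.64), 1))

# A cohort of identical patients gives an MB-c of 0.5, as in the example above.
print(model_based_c([0.2, 0.2, 0.2]))
```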
Thus, any difference between the derivation C-statistic and the validation MB-c reflects differences in case-mix, while the difference between the validation MB-c and the C-statistic in the validation cohort reflects model invalidity. Because calculation of the MB-c depends entirely on the validation cohort, MB-c could be calculated for all pairs. Model calibration was assessed by converting the linear predictor to a predicted probability (including a specified time point if Cox proportional hazards modeling was used). From the predicted probabilities, calibration slope and Harrell’s EAVG and E90 statistics were calculated. Harrell’s EAVG and E90 statistics measure the mean and 90th percentile, respectively, of the absolute difference between the predicted and observed event probabilities, where observed probabilities are estimated nonparametrically using locally weighted scatterplot smoothing curves. For this analysis, EAVG and E90 values were standardized by dividing by the outcome rate in the validation cohort to improve comparability between CPM-validation pairs. If point estimates of outcome incidences at similar time points in the CPM derivation cohort and paired validation cohort were not able to be calculated with published information, that pair was excluded from analysis of calibration. Finally, decision curve analysis[28] was used to estimate the net benefit of each model in each paired validation dataset. Decision curve analysis presents a comprehensive assessment of the potential population-level clinical consequences of using CPMs to inform treatment decisions by examining misclassification of patients across a relevant range of decision thresholds, while weighting the relative utility of false-positive and false-negative predictions as implicitly determined by the threshold. 
As each model could be used to guide many different decisions, each with a unique threshold probability, we assessed whether each model resulted in a positive net benefit (above the best default strategy of treat all or treat none) or negative net benefit (below the best default strategy) first at 3 threshold probabilities spanning a broad range of plausible thresholds: half the outcome incidence, outcome incidence, and twice the outcome incidence, and then over the entire range of threshold probabilities from half the outcome incidence to twice the outcome incidence. A range centered around the outcome incidence was also chosen because model net benefit is most likely to differ from that of the default strategy at thresholds close to the outcome incidence. Models with negative net benefit were considered harmful, while models with positive net benefit were considered not harmful. We used standardized net benefit[29] to make results comparable across validations by controlling for variation in the incidence of the outcome.
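The harm classification described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study code: it assumes fully observed binary outcomes, treats patients whose predicted risk is at or above the threshold, and uses the standard net benefit definition (true positives minus false positives weighted by the threshold odds, per patient). All function names are hypothetical, and the sketch assumes an outcome incidence below 0.5 so that twice the incidence remains a valid probability.

```python
def net_benefit(predicted_probs, outcomes, threshold):
    """Net benefit of treating patients with predicted risk >= threshold."""
    n = len(outcomes)
    weight = threshold / (1 - threshold)  # odds of the threshold probability
    pairs = list(zip(predicted_probs, outcomes))
    true_pos = sum(1 for p, y in pairs if p >= threshold and y == 1)
    false_pos = sum(1 for p, y in pairs if p >= threshold and y == 0)
    return true_pos / n - weight * false_pos / n


def best_default_net_benefit(outcomes, threshold):
    """Better of the two default strategies: treat all, or treat none (0)."""
    incidence = sum(outcomes) / len(outcomes)
    treat_all = incidence - (1 - incidence) * threshold / (1 - threshold)
    return max(treat_all, 0.0)


def standardized_net_benefit(predicted_probs, outcomes, threshold):
    """Net benefit divided by outcome incidence, for comparison across cohorts."""
    incidence = sum(outcomes) / len(outcomes)
    return net_benefit(predicted_probs, outcomes, threshold) / incidence


def three_thresholds(outcomes):
    """Half the outcome incidence, the incidence, and twice the incidence."""
    incidence = sum(outcomes) / len(outcomes)
    return [incidence / 2, incidence, 2 * incidence]


def is_harmful(predicted_probs, outcomes, thresholds, tol=1e-12):
    """True if net benefit falls below the best default at any threshold."""
    return any(
        net_benefit(predicted_probs, outcomes, t)
        < best_default_net_benefit(outcomes, t) - tol
        for t in thresholds
    )


# Toy cohort with 10% incidence (illustrative data, not from the study).
outcomes = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
sharp_model = [0.5] + [0.01] * 9   # separates the one event from the rest
overestimating_model = [0.5] * 10  # predicts high risk for everyone

print(is_harmful(sharp_model, outcomes, three_thresholds(outcomes)))
print(is_harmful(overestimating_model, outcomes, three_thresholds(outcomes)))
```

In this toy cohort the overestimating model treats everyone at every threshold, so it falls below the treat-none default once the threshold exceeds the incidence, which is the pattern of harm at thresholds far from the average risk.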

Stratification by Relatedness

To explore sources of variability in model performance in external validation, we categorized each CPM-dataset pair based on the relatedness of the underlying study populations. Study populations were reviewed in detail by clinical experts on the basis of key clinical characteristics, such as inclusion/exclusion criteria, patient demographics, outcome, enrollment period, and follow-up duration. These characteristics were index condition-specific and are detailed in Tables S5 through S7. Pairs were categorized as related when there were no clinically relevant differences in inclusion criteria, exclusion criteria, recruitment setting, and baseline clinical characteristics. Any matches with clinically relevant differences in any criterion were categorized as distantly related. Clinical experts scoring relatedness were blinded to the derivation C-statistic of the CPM and outcome rates in the derivation and validation cohorts.

Model Updating

We assessed the impact of model updating on discrimination, calibration, and net benefit. Models were updated using data from each paired validation dataset in a sequential fashion: (1) by updating the model intercept using the observed outcome rate in the validation cohort (recalibration-in-the-large); (2) by updating the intercept and rescaling all the model coefficients by the calibration slope; and (3) by re-estimating all regression coefficients using data from the validation database (but maintaining the predictors from the original model).[30]
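The first updating step can be sketched as below. This is a hedged illustration, not the study code: it assumes a logistic CPM, for which shifting the intercept until the mean predicted risk equals the observed outcome rate is equivalent to refitting the intercept with the linear predictor as an offset. The function names are hypothetical and the shift is found by simple bisection. Steps (2) and (3) require fitting a regression in the validation data and are not shown.

```python
import math


def sigmoid(x):
    """Inverse logit: converts a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def intercept_shift(linear_predictors, observed_rate, lo=-20.0, hi=20.0, tol=1e-10):
    """Shift that makes the mean predicted risk equal the observed rate.

    Mean predicted risk increases monotonically with the shift, so
    bisection over a wide bracket converges to the unique solution.
    """
    def mean_risk(shift):
        total = sum(sigmoid(lp + shift) for lp in linear_predictors)
        return total / len(linear_predictors)

    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean_risk(mid) < observed_rate:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Illustrative linear predictors from a published model (made-up values).
lps = [-3.0, -2.0, -1.0, 0.0]
observed_rate = 0.30  # outcome rate in the hypothetical validation cohort

shift = intercept_shift(lps, observed_rate)
updated_risks = [sigmoid(lp + shift) for lp in lps]
print(round(sum(updated_risks) / len(updated_risks), 4))
```

Because every patient receives the same shift on the logit scale, this update leaves the ranking of patients (and hence the C-statistic) unchanged and only corrects average over- or underestimation of risk.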

Statistical Analysis

Differences in various model performance measures were assessed using the Wilcoxon rank-sum test. All analyses were performed in R version 3.5.3 (R Foundation for Statistical Computing, Vienna, Austria).

Results

CPM-Validation Cohort Matching

From a set of 674 potential CPMs across all 3 index conditions, 548 (81%) were screened as potential matches based on title and abstract review and underwent granular review to assess for sufficient patient-level variable and outcome data within the publicly available clinical trial databases. We matched 104 (15%) CPMs to at least one database, yielding 158 CPM-database pairs across the 3 index conditions (Figure 1). The matching success frequency varied by index condition, from 6% (23 of 344) for acute coronary syndrome to 32% (59 of 195) for primary prevention. Details about the CPMs used in this analysis are summarized in Table S4.

Figure 1.

Flowchart of clinical prediction model-database matching process. ACS indicates acute coronary syndrome; and HF, heart failure.

CPM Discrimination in Independent External Validations

Of the 158 total CPM-database pairs, there were 111 pairs in which the CPM reported a C-statistic at model derivation. Among these, the median C-statistic in the derivation cohorts was 0.76 (interquartile range [IQR], 0.73–0.78) and the median C-statistic at model validation was 0.64 (IQR, 0.60–0.67, P<0.001; Table 1). Discriminative ability decreased by a median of 49% (IQR, 29%–64%). Approximately half the loss in discriminatory power was attributable to a decrease in case-mix heterogeneity, while half was attributable to model invalidity. When stratified by relatedness, 57 (36%) pairs were graded as related and 101 (64%) were graded as distantly related. CPM-trial pairs that were related had significantly higher MB-c and validation C-statistics than pairs that were distantly related (Table 1). Median percentage decrement in discrimination among related pairs was 30% (IQR, 16%–45%), of which approximately two-thirds was due to a decrease in case-mix heterogeneity and one-third due to model invalidity. In contrast, CPM-trial pairs that were distantly related had a median percentage decrement in discrimination of 55% (IQR, 40%–68%, P<0.001 versus related pairs), approximately half of which was due to case-mix heterogeneity and half due to model invalidity.
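The attribution of lost discrimination to case-mix versus model invalidity follows from the MB-c definition in the Methods: the gap between the derivation C-statistic and the validation MB-c is assigned to case-mix, and the gap between the MB-c and the validation C-statistic to invalidity. The sketch below is a simple illustration with a hypothetical function name, applied to the acute coronary syndrome medians quoted in the Discussion (0.76, 0.71, and 0.70). Because those are medians of separate distributions, the resulting shares are only approximate.

```python
def decompose_discrimination_loss(derivation_c, validation_mbc, validation_c):
    """Split the drop in C-statistic into case-mix and invalidity shares."""
    total = derivation_c - validation_c
    if total == 0:
        raise ValueError("No change in discrimination to decompose")
    case_mix = derivation_c - validation_mbc    # narrower or broader case-mix
    invalidity = validation_mbc - validation_c  # model does not fit new data
    return {
        "total_drop": total,
        "case_mix_share": case_mix / total,
        "invalidity_share": invalidity / total,
    }


# Acute coronary syndrome medians reported in the Discussion.
result = decompose_discrimination_loss(0.76, 0.71, 0.70)
print({name: round(value, 2) for name, value in result.items()})
```

With these inputs roughly five-sixths of the drop is assigned to case-mix, consistent with the statement that almost all of the decrease in that domain reflected case-mix rather than model performance.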

Table 1.

Discriminative Ability of Clinical Prediction Models in Derivation and Validation Cohorts Stratified by Cohort Relatedness

CPM Calibration in Independent External Validations

Of the 158 total CPM-database pairs, there were 132 pairs in which the validation was assessed for calibration. The median calibration slope in the external validations was 0.64 (IQR, 0.48–0.84). Median calibration slope among related pairs was 0.77 (IQR, 0.59–0.90), significantly higher than the median calibration slope among distantly related pairs (0.59; IQR, 0.43–0.73; P=0.001). Median EAVG and median E90 standardized to the outcome incidence among all pairs were 0.53 (IQR, 0.38–0.72) and 0.95 (IQR, 0.62–1.25), respectively, and did not differ significantly between related and distantly related pairs (Table 2).

Table 2.

Calibration Performance of Clinical Prediction Models on External Validation Stratified by Cohort Relatedness

Net Benefit

When we assessed net benefit at 3 thresholds (half the outcome incidence, outcome incidence, and twice the outcome incidence), the vast majority of models (110 of 132, 83%) were harmful when used in their paired database at one or more thresholds. Even more models (120 of 132, 91%) were harmful at some point within the range of thresholds between half and twice the outcome incidence. However, when we considered each threshold individually, models were much less likely to be harmful at a threshold equal to the outcome incidence than at either of the more extreme thresholds. In particular, only 10 of 132 (8%) models were harmful at a threshold equal to the outcome incidence, while 72 (55%) were harmful at half the outcome incidence and 58 (44%) were harmful at twice the outcome incidence (Table 3). When exploring the likelihood of a model being harmful at any point within the range of thresholds explored, we found that 97% (103/106) of models with a validation C-statistic below 0.7 were potentially harmful at some threshold, while only 65% (17/26) of those with a validation C-statistic above 0.7 were potentially harmful. Similarly, we found that 96% (106/111) of models with a standardized EAVG above 0.3 were potentially harmful, while only 67% (14/21) of models with a standardized EAVG below 0.3 were potentially harmful (Figure 2).

Table 3.

Net Benefit Analysis of Models at 3 Representative Decision Thresholds Before and After Sequential Model Updating

Figure 2.

Model harm status before and after sequential model updating based on validation c-statistic and standardized calibration error. A model was considered harmful if net benefit was below default strategy at any point across a range of thresholds from half the outcome incidence to twice the outcome incidence.

Effects of Updating

EAVG improved by a median of 56% (IQR, 23%–79%) across all the CPM-trial pairs with updating of the intercept and by a median of 93% (IQR, 80%–99%) after updating the intercept and slope (Table 4). Similar results were seen for E90. No further improvement in calibration error was seen with re-estimation. Plots of harmful and nonharmful models across the range of validation C-statistic and standardized EAVG showed the points moving progressively downward (and rightward after re-estimation), reflecting improved calibration (when the intercept and slope are adjusted) and discrimination (when the coefficients are re-estimated, Figure 2). Similar results were seen when net benefit was assessed only at the 3 thresholds (Figure S1). Significant improvement in net benefit was seen with sequential model updating (Table 3). Updating the intercept alone reduced the likelihood of model harm at any threshold in the full range of thresholds considered from 91% to 73% (97/132). Updating the intercept and slope reduced this likelihood further, to 53% (70/132), and complete re-estimation reduced the likelihood to 48% (63/132).

Table 4.

Impact of Model Updating on Calibration Error and Risk of Harm

Discussion

The major finding of this analysis is that off-the-shelf CPMs often perform poorly in new populations, and this very frequently results in potential for net harm. Indeed, only 12/132 (9%) of the unique evaluations we performed were either beneficial or neutral in the full range of thresholds examined, and only 22/132 (17%) were either beneficial or neutral at each of the 3 thresholds. In contrast to what is often assumed, use of an explicit data-driven CPM is often not likely to be better than nothing. Model updating substantially reduced the risk of harm, although half the evaluations showed potential harm at some threshold within the range considered even after re-estimation. These findings emphasize the need for close oversight, governance, and regulation of CPMs as they are more broadly deployed in clinical practice.[31,32] The risk of harm of using CPMs in clinical practice is most salient when decision thresholds depart substantially from the average risk in the patient population of interest. For example, risk of harm would be substantial when trying to deselect a very low risk population for a test or treatment that is clearly beneficial on average, or when trying to select a very high risk population for a test or treatment that is clearly not indicated for those at average risk. However, when the point of indifference lies closer to the average risk (ie, decision is unclear for a typical patient), CPMs seem to be more likely to yield net clinical benefit and to be tolerant of some miscalibration. These findings were consistent across the 3 index conditions we tested. That the decision threshold emerged as a very important determinant of the utility of applying the CPMs in this sample emphasizes the importance of selecting the right decision context for CPM application, an often neglected issue.
Based on our results, CPMs yielding typical (ie, nonexcellent) performance should generally be reserved for applications where the decision threshold is near the population average risk, particularly when model updating is not feasible (as it often is not). Intuitively, the value of risk information is the highest when the decision threshold is near the average risk, since even relatively small shifts from the average risk due to using a CPM can reclassify patients into more appropriate decisions. Our prior literature review[6] was unable to examine calibration because it is frequently unreported and, when reported, the metrics used vary from study to study and are largely uninformative with regard to the magnitude of miscalibration (eg, Hosmer Lemeshow, which yields only a P, which tends to be large in small samples and small in large samples). The validations we performed ourselves revealed that CPM-predicted outcome rates frequently deviate from observed outcome rates even when discrimination was good. The typical standardized EAVG was 0.5 (IQR, 0.4–0.7), which means that the absolute error is half the average risk. In exploratory analysis, we found that when the standardized EAVG was > 0.3 (average prediction was off by at least 30%), models were generally found to yield harmful decisions at least at one threshold within the range examined (half the outcome rate to twice the outcome rate). The importance of good calibration in guarding against harmful decision making has recently been emphasized.[33-35] Similarly, it was very unusual to find models that were consistently nonharmful at all examined thresholds when the validation C-statistic dropped below 0.7. We found that the risk of harm can be substantially mitigated often simply by adjusting the intercept alone. Indeed, updating the intercept alone resulted in 100% of the models yielding positive net benefit when the decision threshold was set at the average risk. 
Yet for the more extreme thresholds, there was still substantial risk of harm; 60% (79/132) of CPMs tested yielded harmful predictions at one or more of the extreme thresholds, even after intercept updating. When both the slope and the intercept were updated, 62/132 (47%) of models were consistently beneficial or nonharmful across all examined thresholds. This underscores the importance of calibration in determining the risk of harm—and also the importance of clear and consistent reporting of calibration, which is largely absent from the literature. Unfortunately, in many clinical settings, recalibration may not be possible. Among other notable findings, we discovered that the vast majority of CPMs were impossible to validate on publicly available patient-level trial databases. The most common reason was a mismatch between the variables in the models and those collected in the publicly available databases. Among the CPMs that we were able to validate, we found that discrimination and calibration deteriorated substantially when compared with the derivation cohorts. Interestingly, much of the decrease in discrimination was due to a narrower case-mix in the validation cohorts as well as model invalidity—although this varied somewhat across the different index conditions. For example, in our examination of acute coronary syndrome models, the median derivation C-statistic was 0.76. This was found to decrease on validation to 0.70. Almost all of this decrease, however, was due to changes in case-mix, not model performance (median MB-c=0.71). In heart failure and population models, the decrement in discriminatory performance appeared more evenly due to case-mix and model invalidity. Our analysis also showed the potential usefulness of the MB-c. This is the first large-scale evaluation to apply this rarely utilized tool. 
By permitting an estimation of the C-statistic based on the variation of predictions only (ie, independent of the actual outcomes), the MB-c permits comparison of the actual c-statistic to a more appropriate baseline determined by the case-mix in the validation sample, rather than to that in the derivation population. This was particularly germane for our study since we used publicly available clinical trials to evaluate the CPMs. These databases are generally assumed to have a narrower case-mix than registry or real world populations derived from electronic health records—an assumption supported by our results. Our analysis also showed a larger decrement in discrimination when externally validating a CPM on a distantly related cohort than if the cohort were more closely related, a result that confirms findings from our literature review.[6] Furthermore, the proportion of decrement in model discrimination attributable to model invalidity was somewhat higher when the cohorts were distantly related. Relatedness often hinged on subtle but clinically relevant differences between cohorts, such as years of enrollment or the distribution of baseline comorbidities, which required careful review from expert clinicians to identify.

Limitations

Our analysis has several limitations. Models published after 2015 were not included, so our results may not generalize to more recent models in these clinical domains. The sample of databases used was a convenience sample and this sample determined the CPMs selected, since many models were not compatible with the available databases, usually because of incongruence between variables required for prediction and those collected in the trial. The validation databases generally represent older therapeutic eras, as this work reflects databases that are currently available through BioLINCC. Using derivation and validation databases from different eras, however, might be thought to simulate the kind of calibration problems that models are likely to confront from data shifts over time and in different settings.[36] Since the databases are randomized trials, we anticipate poorer discrimination in these samples just on the basis of the restricted case-mix. Many potential CPM-validation database matches were not possible because of missing or differently defined variables in the validation databases. Given the small number of CPM-validation database matches, we were seldom able to match a CPM to > 1 validation database. A given CPM may perform differently when validated against different cohorts, and more research is required to understand the sources of this variation before validation performance can be used to grade the quality of a model. Our relatedness categorization was one such attempt, but it requires content area expertise, is inherently subjective, and is difficult to generalize to CPMs for other clinical domains. Further, our net benefit analyses used a range of decision thresholds that may be considered clinically arbitrary in that they were not informed by the relative cost of overtreatment versus undertreatment in the specific clinical context.
However, we would anticipate that most clinically relevant thresholds would fall within this range, since risk prediction is much less likely to be useful for decision thresholds that are even more extreme. Nevertheless, considering any negative net benefit within this range as indicative of a potentially harmful model may provide an unduly pessimistic view, since many models that are labeled harmful may be beneficial at most thresholds, including the clinically most relevant ones.
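The threshold-range criterion discussed above can be sketched as follows. This is a minimal illustration using the standard decision-curve definition of net benefit (true positives minus false positives weighted by the threshold odds, per patient), with the default strategy taken as the better of treat-all and treat-none. The function names, the evenly spaced threshold grid, and the example data are assumptions for illustration, not the study's code.

```python
def net_benefit(preds, outcomes, threshold):
    """Net benefit of treating everyone whose predicted risk >= threshold."""
    n = len(preds)
    tp = sum(1 for p, y in zip(preds, outcomes) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(preds, outcomes) if p >= threshold and y == 0)
    odds = threshold / (1.0 - threshold)
    return tp / n - (fp / n) * odds


def default_net_benefit(outcomes, threshold):
    """Best default strategy: treat all or treat none (net benefit 0)."""
    incidence = sum(outcomes) / len(outcomes)
    treat_all = incidence - (1.0 - incidence) * threshold / (1.0 - threshold)
    return max(treat_all, 0.0)


def harmful_thresholds(preds, outcomes, n_grid=25):
    """Thresholds between half and twice the outcome incidence at which the
    model does worse than the default strategy (grid size is an assumption)."""
    incidence = sum(outcomes) / len(outcomes)
    low, high = incidence / 2.0, min(2.0 * incidence, 0.99)
    grid = [low + k * (high - low) / (n_grid - 1) for k in range(n_grid)]
    return [
        t for t in grid
        # Small tolerance so floating-point noise is not counted as harm.
        if net_benefit(preds, outcomes, t) < default_net_benefit(outcomes, t) - 1e-12
    ]


# Hypothetical cohort with 20% incidence.
outcomes = [1, 0, 0, 0, 0]
well_separated = [0.90, 0.05, 0.05, 0.05, 0.05]
underpredicting = [0.01, 0.01, 0.01, 0.01, 0.01]

harm_good = harmful_thresholds(well_separated, outcomes)
harm_bad = harmful_thresholds(underpredicting, outcomes)
```

In this toy cohort the underpredicting model is flagged only at thresholds below the incidence, where treating everyone would have been better. It matches the default strategy at higher thresholds, which illustrates why an any-threshold definition of harm can look pessimistic.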

Conclusions

Discrimination and calibration often decrease substantially when CPMs for cardiovascular disease are tested in external populations, especially when validation cohorts are only distantly related to model derivation cohorts. This leads to substantial risk of net harm, particularly when decision thresholds are not near the population average risk. Model updating can reduce this risk substantially and will likely be needed to realize the full potential of risk-based decision making. Our findings underscore the need for more thorough model evaluation, including the use of novel measures assessing utility, and better model oversight and stewardship.
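As an illustration of the simplest form of model updating, the sketch below re-estimates only the model intercept. It searches for the shift on the log-odds scale that makes the mean updated prediction equal the observed event rate, which is the maximum-likelihood solution for an intercept-only logistic recalibration with the original linear predictor as an offset. The function names, the bisection bracket, and the example data are assumptions for illustration and do not reproduce the study's procedure.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def update_intercept(preds, outcomes, tol=1e-10):
    """Log-odds shift that matches mean predicted risk to the observed rate.

    Solved by bisection, since the expected number of events increases
    monotonically with the shift. Assumes predictions lie strictly in (0, 1)
    and that the required shift lies within [-30, 30].
    """
    events = sum(outcomes)
    if events == 0 or events == len(outcomes):
        raise ValueError("need both events and non-events to update the intercept")
    linear_predictors = [logit(p) for p in preds]
    low, high = -30.0, 30.0
    while high - low > tol:
        mid = (low + high) / 2.0
        expected_events = sum(expit(mid + lp) for lp in linear_predictors)
        if expected_events < events:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


def apply_shift(preds, shift):
    """Apply the intercept update to each prediction."""
    return [expit(shift + logit(p)) for p in preds]


# Hypothetical cohort: the model predicts 20% risk, but half of patients have events.
preds = [0.2, 0.2, 0.2, 0.2]
outcomes = [1, 1, 0, 0]
shift = update_intercept(preds, outcomes)
updated = apply_shift(preds, shift)
```

An intercept update corrects systematic over- or underprediction but leaves the ranking of patients unchanged, so it cannot improve discrimination. Updating the calibration slope or re-estimating the model goes further at the cost of needing more local data.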

Article Information

Sources of Funding

Research reported in this work was funded through a Patient-Centered Outcomes Research Institute (PCORI) Award (ME-1606-35555).

Disclosures

None.

Supplemental Material

Figure S1; Tables S1–S7; References 9–25, 37–107

107 in total

1. A trial of the beta-blocker bucindolol in patients with advanced chronic heart failure.

Authors: Eric J Eichhorn; Michael J Domanski; Heidi Krause-Steinrauf; Michael R Bristow; Philip W Lavori
Journal: N Engl J Med Date: 2001-05-31 Impact factor: 91.245

2. External validation of new risk prediction models is infrequent and reveals worse prognostic discrimination.

Authors: George C M Siontis; Ioanna Tzoulaki; Peter J Castaldi; John P A Ioannidis
Journal: J Clin Epidemiol Date: 2014-10-23 Impact factor: 6.437

3. Prediction of coronary heart disease using risk factor categories.

Authors: P W Wilson; R B D'Agostino; D Levy; A M Belanger; H Silbershatz; W B Kannel
Journal: Circulation Date: 1998-05-12 Impact factor: 29.690

4. A simple benchmark for evaluating quality of care of patients following acute myocardial infarction.

Authors: M F Dorsch; R A Lawrance; R J Sapsford; J Oldham; D C Greenwood; B M Jackson; C Morrell; S G Ball; M B Robinson; A S Hall
Journal: Heart Date: 2001-08 Impact factor: 5.994

5. Simplified scoring system for predicting mortality after percutaneous coronary intervention.

Authors: Mansoor A Qureshi; Robert D Safian; Cindy L Grines; James A Goldstein; Douglas C Westveer; Susan Glazier; Mamtha Balasubramanian; William W O'Neill
Journal: J Am Coll Cardiol Date: 2003-12-03 Impact factor: 24.094

6. A simple prognostic classification model for postprocedural complications after percutaneous coronary intervention for acute myocardial infarction (from the New York State percutaneous coronary intervention database).

Authors: Abdissa Negassa; E Scott Monrad; Vankeepuram S Srinivas
Journal: Am J Cardiol Date: 2009-02-14 Impact factor: 2.778

7. Cardiovascular disease risk profiles.

Authors: K M Anderson; P M Odell; P W Wilson; W B Kannel
Journal: Am Heart J Date: 1991-01 Impact factor: 4.749

8. Predictors of mortality after discharge in patients hospitalized with heart failure: an analysis from the Organized Program to Initiate Lifesaving Treatment in Hospitalized Patients with Heart Failure (OPTIMIZE-HF).

Authors: Christopher M O'Connor; William T Abraham; Nancy M Albert; Robert Clare; Wendy Gattis Stough; Mihai Gheorghiade; Barry H Greenberg; Clyde W Yancy; James B Young; Gregg C Fonarow
Journal: Am Heart J Date: 2008-10 Impact factor: 4.749

9. Stroke risk profile: adjustment for antihypertensive medication. The Framingham Study.

Authors: R B D'Agostino; P A Wolf; A J Belanger; W B Kannel
Journal: Stroke Date: 1994-01 Impact factor: 7.914

10. External Validations of Cardiovascular Clinical Prediction Models: A Large-Scale Review of the Literature.

Authors: Benjamin S Wessler; Jason Nelson; Jinny G Park; Hannah McGinnes; Gaurav Gulati; Riley Brazil; Ben Van Calster; David van Klaveren; Esmee Venema; Ewout Steyerberg; Jessica K Paulus; David M Kent
Journal: Circ Cardiovasc Qual Outcomes Date: 2021-08-03

1 in total

1. ACCEPT 2·0: Recalibrating and externally validating the Acute COPD exacerbation prediction tool (ACCEPT).

Authors: Abdollah Safari; Amin Adibi; Don D Sin; Tae Yoon Lee; Joseph Khoa Ho; Mohsen Sadatsafavi
Journal: EClinicalMedicine Date: 2022-07-22

1 in total