BACKGROUND: An established risk model may demonstrate miscalibration, meaning predicted risks do not accurately capture event rates. In some instances, investigators can identify and address the cause of miscalibration. In other circumstances, it may be appropriate to recalibrate the risk model. Existing recalibration methods do not account for settings in which the risk score will be used for risk-based clinical decision making. METHODS: We propose 2 new methods for risk model recalibration when the intended purpose of the risk model is to prescribe an intervention to high-risk individuals. Our measure of risk model clinical utility is standardized net benefit. The first method is a weighted strategy that prioritizes good calibration at or around the critical risk threshold. The second method uses constrained optimization to produce a recalibrated risk model with maximum possible net benefit, thereby prioritizing good calibration around the critical risk threshold. We also propose a graphical tool for assessing the potential for recalibration to improve the net benefit of a risk model. We illustrate these methods by recalibrating the American College of Cardiology (ACC)-American Heart Association (AHA) atherosclerotic cardiovascular disease (ASCVD) risk score within the Multi-Ethnic Study of Atherosclerosis (MESA) cohort. RESULTS: New methods are implemented in the R package ClinicalUtilityRecal. Recalibrating the ACC-AHA-ASCVD risk score for a MESA subcohort results in higher estimated net benefit using the proposed methods compared with existing methods, with improved calibration in the most clinically impactful regions of risk. CONCLUSION: The proposed methods target good calibration for critical risks and can improve the net benefit of a risk model. We recommend constrained optimization when the risk model net benefit is paramount. The weighted approach can be considered when good calibration over an interval of risks is important.
Keywords: calibration; clinical utility; net benefit; recalibration; risk prediction
Risk models can help clinicians and patients make health care decisions. Recommendations
for specific interventions can be based on comparing patients’ estimated risk of a
particular clinical outcome to a predefined risk threshold. In 2013, the American
College of Cardiology (ACC) and the American Heart Association (AHA) published
guidelines recommending that individuals with an estimated 10-y risk of atherosclerotic
cardiovascular disease (ASCVD) greater than 7.5% receive statin therapy.
Paired with this guideline, the panel developed the ACC-AHA-ASCVD risk calculator
to estimate 10-y ASCVD risk, with recommendations to reassess risk every 4 to 6 y in
adults aged 40 to 79 y free of ASCVD.

In such settings, risk model calibration carries heightened importance. The calibration
of a risk model refers to the agreement between predicted risks and observed rates of
events. There is evidence that the ACC-AHA-ASCVD risk calculator substantially
overestimates the risk of ASCVD.
Following risk-based treatment guidelines, using overestimated risks implies
overtreatment in the population. Hence, miscalibration can have a serious public health impact.

When an established risk model is applied to a new population, we are particularly
concerned that predicted risks may not be well calibrated. In addition, a
well-calibrated model may become miscalibrated over time.
Ideally, when miscalibration appears, one can identify and address the cause.
However, this is not always possible. Miscalibration can arise for complex reasons or
because of fundamental differences between populations. In such instances, it will not
be possible to eliminate the source of miscalibration.
When a model is poorly calibrated and development of a new model is infeasible or
undesirable, then it may be prudent to use statistical methods to recalibrate the risk
model. However, existing methods for risk model recalibration do not account for how the
risk model will be used in clinical practice. In this work, we propose 2 methods for
risk model recalibration when the purpose of the risk model is to recommend for or
against an intervention based on a predetermined risk threshold.

Before implementing recalibration methods, researchers may wish to understand whether
recalibration has the potential to improve the usefulness of a risk model. We propose a
graphical tool to help with this assessment. The tool indicates when specialized methods
of recalibration, such as those proposed, have the potential to improve the clinical
utility of a risk model beyond standard methods of recalibration.

First, we define terminology and notation and summarize key background material. Next,
we introduce a graphical device to help researchers assess the potential for
recalibration to improve the clinical utility of a risk model. Following this, we
propose 2 new methods of recalibration: weighted logistic recalibration and constrained
logistic recalibration. We then present simulation results, illustrate the use of the
graphical tool, and apply the proposed methods to recalibrate the ACC-AHA-ASCVD risk
model within the ethnically diverse Multi-Ethnic Study of Atherosclerosis (MESA) cohort.
We close with a discussion of the materials presented.
Methods
Preliminaries
Notation and definitions
$D$ denotes the clinical outcome (ASCVD events in the example
given above). Throughout this work, we refer to individuals who experience
the event without intervention as cases (i.e., $D = 1$) and individuals who do not
experience the event (i.e., $D = 0$) as controls. In the population without
intervention, $P(D = 1) = \rho$; we refer to $\rho$ as the prevalence of the outcome, as is
customary in the biomarker literature.

The expected benefit of the intervention to a would-be case is $B$. Controls expect
harm or cost of the intervention $C$. We note that $B$ encapsulates both the positive
and negative aspects of the intervention for cases. In our application, the benefit $B$
is the reduction of ASCVD events (due to statins) to an
individual who would have an event without intervention, after accounting
for the monetary costs and side effects of statins.
$R$ is the risk threshold for recommending for/against the
intervention. As noted above, $R = 7.5\%$ in the ASCVD example. Here and throughout, the term
risk threshold refers to the clinically relevant
threshold used to assign intervention, defined a priori. We use the term
cutpoint to refer to any generic threshold.

Let $r(X)$ be a risk model for $D$, based on 1 or more predictors (risk factors) $X$. We call
$r(x_i)$ the predicted risk, estimated risk, or
risk score (equivalently) for individual $i$.
$\operatorname{logit}(r(x_i)) = \log\{r(x_i) / (1 - r(x_i))\}$ is the logit-transformed risk score. The premise of this
article is that there is an existing risk model $r(X)$
that we are interested in recalibrating. We assume that $r(X)$
is monotonically nondecreasing with risk, meaning
$P(D = 1 \mid r(X) = r)$ is nondecreasing in $r$. If $r(X)$
did not have this monotonicity property, we would likely
not be interested in recalibrating it.

Here and throughout, we assume a data set is available for recalibrating the
risk model $r(X)$. The data set is a random sample of the relevant
population with observed outcomes $D$
without the intervention.
$r^*(X)$ is a recalibration of $r(X)$ as long as
$r^*(X) = g(r(X))$ for some monotone nondecreasing function $g$.
Risk model calibration
Different notions of calibration have appeared in the literature on risk
models. Van Calster et al.
presented 4 notions of risk model calibration: strong calibration,
moderate calibration, weak or logistic calibration, and
calibration-in-the-large. These types of calibration are hierarchical:
strong calibration implies moderate calibration, moderate calibration
implies weak calibration, and weak calibration implies
calibration-in-the-large. The definition of calibration in this article is
“moderate calibration,” as defined by Van Calster et al.
and is formally expressed as follows. For risk model $r(X)$
estimating risk of binary outcome $D$, $r(X)$
is calibrated at $r$ if
$P(D = 1 \mid r(X) = r) = r$. If $P(D = 1 \mid r(X) = r) = r$
for all $r \in (0, 1)$, then we say $r(X)$
is calibrated.

The calibration of a risk model can be assessed by examining observed event
rates in groups with similar predicted risks. In Hosmer-Lemeshow plots,
predicted risks are typically grouped by deciles; for each decile, the event
rate for the decile is plotted against the average predicted risk in that group.
Alternatively, smoothing functions (such as a LOESS smoother) can be
used to generate a calibration curve.
The calibration curve for a calibrated risk model is the identity
line.
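The grouped comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `calibration_by_group` and its arguments are hypothetical, and groups are formed by sorting and splitting into equal-sized bins rather than by any particular plotting package.

```python
def calibration_by_group(risks, outcomes, n_groups=10):
    """Compare mean predicted risk with the observed event rate per risk group.

    risks: predicted risks in [0, 1]; outcomes: 0/1 event indicators.
    Returns a list of (mean predicted risk, observed event rate) pairs,
    one per group, ordered from lowest to highest predicted risk.
    """
    pairs = sorted(zip(risks, outcomes))
    n = len(pairs)
    rows = []
    for g in range(n_groups):
        lo = g * n // n_groups
        hi = (g + 1) * n // n_groups
        chunk = pairs[lo:hi]
        if not chunk:
            continue  # skip empty groups when n < n_groups
        mean_pred = sum(r for r, _ in chunk) / len(chunk)
        obs_rate = sum(d for _, d in chunk) / len(chunk)
        rows.append((mean_pred, obs_rate))
    return rows
```

For a calibrated risk model, the two entries of each pair should be close, which corresponds to points near the identity line on the calibration plot.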
Logistic recalibration and other methods
Logistic recalibration, proposed by Cox in 1958, is the most prominent method
of recalibration.
Under logistic recalibration, the recalibrated risk score $r^*(X)$
has the form
$\operatorname{logit}(r^*(X)) = \alpha_0 + \alpha_1 \operatorname{logit}(r(X))$, where
$\alpha_0$ is the recalibration intercept and
$\alpha_1$ is the recalibration slope. The recalibration intercept
and slope are estimated by fitting a simple linear logistic
regression model in which the outcome $D$
is regressed on the logit-transformed risk scores
$\operatorname{logit}(r(X))$. Recalibrated risk scores are generated by scaling
$\operatorname{logit}(r(X))$ by $\hat{\alpha}_1$, shifting by $\hat{\alpha}_0$,
then transforming back to the risk scale via the inverse
of the logit function. Note that this is a family of valid recalibration
functions for any real $\alpha_0$ and positive $\alpha_1$.

More recently, more flexible methods of recalibration have been
proposed.[10-14] The
greater flexibility of such methods raises the possibility of overfitting.
Some alternative methods are not guaranteed to produce a monotone
transformation of the original risk score. We consider that a nonmonotone
transformation fundamentally changes a risk model and should not strictly be
considered a recalibration of the risk model. Some flexible methods of
recalibration have been seen to perform poorly for risk models constructed
using logistic regression.
Although not presented as a method of recalibration per se, the
risk-mapping plot developed under the relative utility framework has
potential to produce a recalibrated risk marker through similarly flexible
methods, with requirements that ensure monotonicity.
The goal of the approaches proposed in this article is to retain the
parsimony of Cox’s logistic recalibration while prioritizing calibration
near the clinically important risk threshold.
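As a small illustration of Cox's logistic recalibration described earlier in this section, the sketch below applies a given intercept and slope to a list of risk scores. The function names (`logit`, `expit`, `recalibrate`) are hypothetical and are not taken from the ClinicalUtilityRecal package; the intercept and slope are assumed to have been estimated elsewhere.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def expit(z):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-z))


def recalibrate(risks, intercept, slope):
    """Scale logit risks by the slope, shift by the intercept, map back to risks.

    A positive slope keeps the transformation monotone increasing, so the
    rank order of the risk scores is preserved.
    """
    if slope <= 0:
        raise ValueError("slope must be positive for a valid recalibration")
    return [expit(intercept + slope * logit(r)) for r in risks]
```

With intercept 0 and slope 1 the scores are returned unchanged; a negative intercept with slope 1 lowers every predicted risk, which is the direction of correction one would expect for a model that overestimates risk.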
Clinical utility of risk models for treatment decisions based on risk
The clinical utility of a risk model refers to the usefulness of a risk model
for its intended clinical application. The standardized net benefit
($sNB$) is the measure of risk model clinical utility considered
in this article. Given a risk model $r(X)$ for outcome $D$ and risk threshold $R$
for recommending an intervention to prevent or ameliorate $D$,

$$sNB_R = \frac{NB_R}{\rho} = TPR_R - \frac{1 - \rho}{\rho} \cdot \frac{R}{1 - R} \cdot FPR_R, \tag{1}$$

where $TPR_R$ ($FPR_R$) is the true-positive rate (false-positive rate) for the
risk model using risk threshold $R$, and $\rho$ is the prevalence.
$sNB_R$ captures the utility of the risk model to correctly assign
intervention to cases, discounted by the proportion of controls receiving
intervention, where the “discounting factor” accounts for the prevalence and
harms and benefits of intervention.[17-20] Henceforth, we
suppress notation showing the dependence of $sNB$ on $R$.
Unless stated explicitly otherwise, we presume the risk
threshold $R$ for all calculations of $sNB$.

A key assumption is that the risk threshold $R$
accurately represents the benefits and harms of the
intervention according to the relation
$R / (1 - R) = C / B$.[17,20-22] In the ASCVD example, the risk threshold $R = 7.5\%$
implies the benefit ($B$) of statins to a case is about 12 times greater than the
harm of statin therapy ($C$) to a control, since $(1 - 0.075) / 0.075 \approx 12.3$.
Further, the harm-to-benefit ratio must be
independent of the risk model.
We adopt these assumptions throughout.

We note that we use the “opt-in” formulation of $sNB$,
indicating that the default treatment policy without a
risk model should be treat none (rather than treat all).[16,23] This
article focuses on the standardized version of net benefit ($sNB$),
but methods could easily be formulated in terms of net
benefit ($NB$) instead. As shown in equation (1), $sNB$
divides $NB$ (the net benefit of intervention less net harms) by the
prevalence. The maximum value of $sNB$
is always 1, which would occur for a risk model that
perfectly discriminates ($TPR = 1$, $FPR = 0$).
We find this theoretical maximum to be useful for gauging a risk
model’s clinical utility relative to the maximum possible clinical utility.
There are other measures of clinical utility in the literature (notably
relative utility) that we do not consider here.[16,20]
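To make the standardized net benefit concrete, the following sketch computes an empirical version from risk scores and observed outcomes. It is a minimal illustration under stated assumptions: the function names are hypothetical, individuals are treated when their risk is strictly greater than the threshold, and the prevalence is estimated by the sample proportion of cases.

```python
def standardized_net_benefit(risks, outcomes, threshold):
    """Empirical sNB: TPR minus the discounted FPR at the given risk threshold.

    The discount factor is ((1 - prevalence) / prevalence) * (R / (1 - R)),
    with prevalence estimated as the sample proportion of cases, so that
    (1 - prevalence) / prevalence equals n_controls / n_cases.
    """
    n_cases = sum(outcomes)
    n_controls = len(outcomes) - n_cases
    if n_cases == 0 or n_controls == 0:
        raise ValueError("need at least one case and one control")
    true_pos = sum(1 for r, d in zip(risks, outcomes) if d == 1 and r > threshold)
    false_pos = sum(1 for r, d in zip(risks, outcomes) if d == 0 and r > threshold)
    tpr = true_pos / n_cases
    fpr = false_pos / n_controls
    odds = threshold / (1.0 - threshold)
    return tpr - (n_controls / n_cases) * odds * fpr


def benefit_to_harm_ratio(threshold):
    """Benefit-to-harm ratio B / C implied by a risk threshold R: (1 - R) / R."""
    return (1.0 - threshold) / threshold
```

For example, `benefit_to_harm_ratio(0.075)` is roughly 12.3, which matches the statement that a 7.5% threshold implies a benefit about 12 times the harm.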
Calibration of a risk model and its clinical utility
Van Calster and Vickers
give examples using simulated data in which miscalibration reduces
the clinical utility of risk-based treatment policies. As the authors note,
these results are expected because net benefit is a proper scoring rule.
Baker et al.
established the connection between the calibration of a risk model,
the slope of its receiver-operating characteristic (ROC) curve, and the
prevalence, $\rho$. Metz
related ROC analyses to a cost-benefit framework for decision making.
We provide an alternative presentation of the result in Baker et al.,
relating the height of the calibration curve for the risk model $r(X)$ to the prevalence $\rho$
and the slope of the ROC curve. Supplementary Material A provides the full statement of our
version of this lemma and its proof. The relationship yields the following
corollary, with proof given in Supplementary Material A.

Corollary 1
(sNB of risk-based treatment policies and calibration of
Graphically Assessing Potential Net Benefit under Recalibration
Before presenting our methods, we introduce a graphical tool to help researchers
assess the potential for recalibration to improve the clinical utility of a risk
model. Recalibration preserves the rank order of risk scores, meaning that under
recalibration, some subset of individuals with similar predicted risks will move
from below the risk threshold to above the risk threshold, or vice versa. Given
a fixed risk threshold $R$, for every recalibration $r^*(X)$
that results in a new value of $sNB$, there is an equivalent cutpoint, $\tau$,
that produces the same $sNB$
when paired with the original risk score $r(X)$.
Using this relationship, varying the cutpoint between 0 and 1
for the original risk score $r(X)$
and fixed harm-benefit ratio $R / (1 - R)$
yields all values of $sNB$
that can be achieved by recalibrating the risk model. We
propose that investigators assess this space to understand the potential for
recalibration to improve net benefit. Specifically, we propose that
investigators plot estimates of

$$sNB(\tau; R) = TPR_\tau - \frac{1 - \rho}{\rho} \cdot \frac{R}{1 - R} \cdot FPR_\tau \tag{2}$$

on the vertical axis against the cutpoint $\tau$ on the horizontal axis, where
$TPR_\tau$ and $FPR_\tau$ are the true- and false-positive rates of the original risk
score at cutpoint $\tau$ and $\rho$ is the prevalence. We emphasize that $R$
is constant in this expression of $sNB$
because it represents benefits and harms. We note that when
the cutpoint $\tau$ equals the risk threshold $R$, equation (2) is the
standardized net benefit of the risk model. In addition, when evaluated at the
cutpoint $\tau$ that maximizes equation (2), equation (2) is the relative
utility evaluated at risk threshold $R$.

Figure 1 shows 2
examples. In Figure 1,
the horizontal axis gives all possible cutpoints, and the vertical axis gives
the estimated $sNB$ for each cutpoint and the fixed harm-benefit ratio
$R / (1 - R)$. The maximum of the curve estimates the maximum
$sNB$ that can be achieved via recalibration of the risk model. The
estimated $sNB$ of the original risk score and the recalibrated risk score
under standard logistic recalibration are noted on the curves, and these can be
compared with the maximum. If the estimated $sNB$
of the original risk score is far below the maximum of the
curve, then there are potentially recalibration parameters
that can increase the clinical utility of the risk model.
Similarly, if standard logistic recalibration does not produce a risk model near
the maximum, then alternative methods of recalibration may produce superior
results. The graphical tool also provides researchers a sense of how much loss
in $sNB$ occurs due to miscalibration. From corollary 1, the maximum of
this curve estimates the $sNB$
of a risk score if calibrated at the risk threshold $R$.
The vertical distance between the maximum of the curve and
the observed $sNB$ of the risk score estimates the loss in $sNB$
from miscalibration at $R$.
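The curve described in this section can be sketched as follows: the risk threshold stays fixed in the harm-benefit ratio while the cutpoint applied to the original risk score varies. This is a minimal illustration, not the package's plotting function; the function names and the evenly spaced cutpoint grid are assumptions, and no standard error is computed.

```python
def snb_at_cutpoint(risks, outcomes, threshold, cutpoint):
    """sNB when treating risk > cutpoint, with harm-benefit ratio fixed by threshold."""
    n_cases = sum(outcomes)
    n_controls = len(outcomes) - n_cases
    if n_cases == 0 or n_controls == 0:
        raise ValueError("need at least one case and one control")
    tpr = sum(1 for r, d in zip(risks, outcomes) if d == 1 and r > cutpoint) / n_cases
    fpr = sum(1 for r, d in zip(risks, outcomes) if d == 0 and r > cutpoint) / n_controls
    odds = threshold / (1.0 - threshold)  # fixed: depends on R, not on the cutpoint
    return tpr - (n_controls / n_cases) * odds * fpr


def potential_snb_curve(risks, outcomes, threshold, n_grid=99):
    """Return (cutpoint, sNB) pairs over an evenly spaced grid of cutpoints in (0, 1)."""
    cutpoints = [k / (n_grid + 1) for k in range(1, n_grid + 1)]
    return [(c, snb_at_cutpoint(risks, outcomes, threshold, c)) for c in cutpoints]
```

Comparing the largest value on the curve with the value at `cutpoint == threshold` gives a rough sense of how much net benefit a recalibration could recover.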
Figure 1.
Potential $sNB$ under recalibration. The dotted line shows 1 standard
error below the estimated maximum possible $sNB$. In both (A) and (B), the estimated
$sNB$ for the original risk model is more than 1 standard
error lower than the estimated maximum possible $sNB$,
indicating that a recalibrated risk score could yield
higher net benefit. In (A), the estimated $sNB$
for the risk model after standard logistic
recalibration is near the maximum value. Alternative methods of
recalibration may not be worth pursuing in this setting. In (B), the
recalibrated risk model produced by standard logistic recalibration
yields estimated $sNB$
more than 1 standard error lower than the estimated
maximum possible $sNB$, suggesting that alternative recalibration methods may
be useful.

In light of sampling variability, it may be unclear whether a risk model is
“close” to the maximum. Following Friedman et al.,
we suggest a “1-standard error” rule to decide if the estimated $sNB$
of a risk model is near the maximum. Each plot in Figure 1 includes a
dotted horizontal line 1 standard error below the maximum. (The standard error
for the maximum of the curve is derived via the delta method; see Supplementary Material A.)

In both Figures 1A and 1B, the original
risk model has notably lower $sNB$
than the maximum possible value. In Figure 1A, the estimated $sNB$
of the risk model is close to the maximum possible value after
recalibration via standard logistic recalibration. In contrast, standard
logistic recalibration makes little difference in Figure 1B. Alternative methods of
recalibration, such as those we propose, may be worthwhile to pursue in
situations such as that in Figure 1B.
Weighted Logistic Recalibration for Improved Clinical Utility
We propose a weighted variant of Cox’s logistic recalibration to prioritize
calibration near the risk threshold, which corollary 1 implies should maximize
$sNB$. The weighted recalibration intercept $\alpha_0^w$
and slope $\alpha_1^w$
are estimated by maximizing the weighted log-likelihood

$$\ell_w(\alpha_0, \alpha_1) = \sum_{i=1}^{n} w_i \left[ D_i \log p_i + (1 - D_i) \log(1 - p_i) \right], \tag{3}$$

where $p_i = \operatorname{expit}(\alpha_0 + \alpha_1 \operatorname{logit}(r(x_i)))$ and $w_i$ is the
weight for observation $i$. We propose a weight function, equation (4), defined in terms of
$o(r(x_i))$, a smoothed observed event rate obtained via LOESS
regression of the outcome $D$ on the risk scores $r(X)$.
Notation reflects the dependence of observed event rates on
the risk model $r(X)$. The values $o(r(x_i))$
are presented on the vertical axis of the calibration plot.
$\lambda$ and $\delta$
are tuning parameters and control the degree of differential
weighting of observations. As $\lambda$
increases, all weights tend to 1, and the weighted
recalibration method approaches standard logistic recalibration. The parameter $\delta$
prescribes how much weight is assigned to observations outside
a critical risk interval $[R_l, R_u]$,
where clinicians may be additionally concerned about good
calibration. $\delta$
is bounded below by 0 and bounded above by the infimum of the
weights within the interval.

The weight function (4) encompasses 2 useful forms. The first has the form of an
exponential decay weight (Figure 2A). Under this weighting scheme, observations with event
rates at or near the risk threshold have the largest contribution to the
likelihood, which decays exponentially moving away from
the risk threshold $R$. The second form (Figure 2B) approximates a step function
and is useful to prioritize calibration over a range of risks instead of a
single risk threshold. In the ASCVD example, additional guidelines and current
practices in cardiology indicate that 5% to 10% is an interval of critical risks
that may affect clinical decisions. For settings in which good calibration is
important for the interval $[R_l, R_u]$, the decay parameter $\lambda$
can be set to a large value so that weights within the
interval are all close to 1 (e.g.,
), and only specification of the outside-interval weight $\delta$
is needed. For settings in which good calibration at the risk
threshold $R$
is most important, the exponential decay form can be used, and
only specification of the decay parameter $\lambda$
is needed. Supplementary Material B provides guidance for obtaining
weights, including tuning parameter selection using a cross-validation
procedure.
Figure 2.
Example of weight functions used in the weighting scheme. The horizontal
axis shows the event rate, and the vertical axis gives the weight. (A)
Exponential decay weight with
(and therefore no need to specify
) and
and
. This weight may be appropriate when good calibration
at the risk threshold is of primary interest. (B) Weight function
approximating a step function with
,
, and specified risk interval
around
. This weighting scheme may be appropriate when good
calibration over an interval of risk is of interest in addition to good
calibration at
.

These weighting schemes down-weight, to a greater or lesser degree, observations
with smoothed event rate (
) away from the risk threshold. In a sense, we use less data to
achieve a more targeted calibration, and therefore, there is a tradeoff between
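the improved calibration at or near the risk threshold (and therefore also
$sNB$) and the precision of results (more variability in
the estimates $\hat{\alpha}_0^w$ and $\hat{\alpha}_1^w$ of the weighted recalibration
intercept and slope). When using this method, we recommend reporting the effective
sample proportion. Since all weights are between 0 and 1, the effective sample
proportion can be calculated as the average weight, $\frac{1}{n} \sum_{i=1}^{n} w_i$,
where $w_i$ is the weight for observation $i$ and $n$ is the sample size.
In standard logistic recalibration, all $w_i = 1$,
and the effective sample proportion is 1.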
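The weighted fit and the effective sample proportion can be sketched as below. This is an illustration under loud assumptions: `decay_weights` uses a Gaussian-shaped decay chosen only for demonstration and is not the paper's equation (4); the smoothed event rates are taken as an input rather than computed by LOESS; and the fit is a plain Newton-Raphson solver for a two-parameter weighted logistic regression, with hypothetical function names throughout.

```python
import math


def decay_weights(event_rates, threshold, lam):
    """Illustrative weights (assumed form): largest at the threshold, tending to 1 as lam grows."""
    return [math.exp(-((o - threshold) ** 2) / lam ** 2) for o in event_rates]


def effective_sample_proportion(weights):
    """Average weight; equals 1 when every weight is 1 (standard recalibration)."""
    return sum(weights) / len(weights)


def fit_weighted_recalibration(risks, outcomes, weights=None, n_iter=50):
    """Weighted logistic regression of outcomes on logit(risk) via Newton-Raphson.

    Returns (intercept, slope). With weights=None, all weights are 1.
    """
    if weights is None:
        weights = [1.0] * len(risks)
    z = [math.log(r / (1.0 - r)) for r in risks]
    a0, a1 = 0.0, 1.0  # start at the identity recalibration
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for zi, di, wi in zip(z, outcomes, weights):
            p = 1.0 / (1.0 + math.exp(-(a0 + a1 * zi)))
            g0 += wi * (di - p)          # gradient w.r.t. intercept
            g1 += wi * (di - p) * zi     # gradient w.r.t. slope
            v = wi * p * (1.0 - p)
            h00 += v                     # entries of the negative Hessian
            h01 += v * zi
            h11 += v * zi * zi
        det = h00 * h11 - h01 * h01
        if abs(det) < 1e-12:
            break  # singular system, e.g. a single distinct risk value
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        a0 += step0
        a1 += step1
        if abs(step0) + abs(step1) < 1e-10:
            break
    return a0, a1
```

Reporting `effective_sample_proportion(weights)` alongside the fitted parameters indicates how much of the data effectively drives the targeted fit.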
Constrained Logistic Recalibration
In our second approach to recalibration, we propose estimating the recalibration
intercept and slope by maximizing the logistic likelihood over a restricted
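parameter space. The restricted space only includes recalibration parameters
$\alpha_0$ and $\alpha_1$
that produce a recalibrated risk model with high
$sNB$. The concepts in the “Graphically Assessing Potential Net
Benefit under Recalibration” section make this possible, because we are able to
estimate the maximum possible $sNB$
among all possible recalibrations of the risk score $r(X)$.

Given a risk score $r(X)$, risk threshold $R$, and recalibration parameters
$\alpha_0$ and $\alpha_1$, the plug-in estimator of the standardized net benefit of the
recalibrated risk model is

$$\widehat{sNB}(\alpha_0, \alpha_1) = \frac{1}{n_1} \sum_{i: D_i = 1} 1\{r^*(x_i) > R\} - \frac{n_0}{n_1} \cdot \frac{R}{1 - R} \cdot \frac{1}{n_0} \sum_{i: D_i = 0} 1\{r^*(x_i) > R\},$$

where $\operatorname{logit}(r^*(x_i)) = \alpha_0 + \alpha_1 \operatorname{logit}(r(x_i))$, and $n_1$ and $n_0$
are the number of cases and controls, respectively, in the
sample of data available for recalibration. We propose estimating recalibration
parameters $(\alpha_0, \alpha_1)$
via the following constrained maximization problem:

$$\max_{\alpha_0, \alpha_1} \ell(\alpha_0, \alpha_1) \quad \text{subject to} \quad \widehat{sNB}(\alpha_0, \alpha_1) \ge \widehat{sNB}_{\max} - \widehat{SE}(\widehat{sNB}_{\max}),$$

where $\ell$ is the logistic log-likelihood and $\widehat{sNB}_{\max}$
is the estimated maximum achievable $sNB$
among all risk scores of the form $\operatorname{expit}(\alpha_0 + \alpha_1 \operatorname{logit}(r(X)))$.
That is, we propose estimating $(\alpha_0, \alpha_1)$
by maximizing the likelihood of the logistic model over a
constrained parameter space. For fixed harm-benefit ratio $R / (1 - R)$,
$\widehat{sNB}_{\max}$ is found by varying the decision threshold $\tau$
(see the “Graphically Assessing Potential Net Benefit under
Recalibration” section). That is, we solve the 1-dimensional optimization
problem

$$\widehat{sNB}_{\max} = \max_{\tau \in (0, 1)} \left\{ \widehat{TPR}_\tau - \frac{n_0}{n_1} \cdot \frac{R}{1 - R} \cdot \widehat{FPR}_\tau \right\}.$$

Acknowledging that there is uncertainty in $\widehat{sNB}_{\max}$,
we use a 1-standard-error type of rule in the inequality
constraint. Such rules are often used when tuning penalized regression methods.
The constrained parameter space includes all $(\alpha_0, \alpha_1)$
that produce a risk model with $\widehat{sNB}$
within 1 standard error of $\widehat{sNB}_{\max}$.
Supplementary Material A provides an estimate of this standard
error.

The constrained logistic recalibration solution differs from the standard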
logistic recalibration solution when the latter is outside the constrained
parameter space. These are exactly the instances in which there is evidence that
standard logistic recalibration is inadequate in terms of the clinical utility
of the recalibrated risk model. In situations lacking such evidence, the
constrained and standard logistic recalibration solutions will be the same.
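The constrained approach described in this section can be illustrated with a coarse grid search. This is only a sketch: the source does not state which optimizer is used, so the grid search is a stand-in; the standard error `se_max` is supplied by the caller because the delta-method estimate from the supplement is not reproduced here; and all function names are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def snb(scores, outcomes, threshold, cutpoint):
    """Empirical sNB treating score > cutpoint, harm-benefit ratio fixed by threshold."""
    n1 = sum(outcomes)
    n0 = len(outcomes) - n1
    tpr = sum(1 for s, d in zip(scores, outcomes) if d == 1 and s > cutpoint) / n1
    fpr = sum(1 for s, d in zip(scores, outcomes) if d == 0 and s > cutpoint) / n0
    return tpr - (n0 / n1) * (threshold / (1.0 - threshold)) * fpr


def max_achievable_snb(risks, outcomes, threshold):
    """Largest sNB over all cutpoints on the original score (1-dimensional search)."""
    cutpoints = [0.0] + sorted(set(risks))
    return max(snb(risks, outcomes, threshold, c) for c in cutpoints)


def log_likelihood(risks, outcomes, a0, a1):
    """Logistic log-likelihood of the recalibration parameters (a0, a1)."""
    total = 0.0
    for r, d in zip(risks, outcomes):
        p = expit(a0 + a1 * logit(r))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += math.log(p) if d == 1 else math.log(1.0 - p)
    return total


def constrained_recalibration(risks, outcomes, threshold, se_max, grid0, grid1):
    """Maximize the log-likelihood over grid points whose sNB is within se_max of the maximum.

    Returns (log-likelihood, intercept, slope), or None if no grid point is feasible.
    """
    target = max_achievable_snb(risks, outcomes, threshold) - se_max
    best = None
    for a0 in grid0:
        for a1 in grid1:
            recal = [expit(a0 + a1 * logit(r)) for r in risks]
            if snb(recal, outcomes, threshold, threshold) < target:
                continue  # outside the constrained parameter space
            ll = log_likelihood(risks, outcomes, a0, a1)
            if best is None or ll > best[0]:
                best = (ll, a0, a1)
    return best
```

When the unconstrained maximizer on the grid already satisfies the constraint, this search returns it unchanged, mirroring the statement that the constrained and standard solutions then coincide.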
Results
Simulation Results
In this section, we compare weighted and constrained logistic recalibration to
standard logistic recalibration using simulated data. We present 4 different
simulation examples representing different types of miscalibration. For all
examples, we use the risk threshold
. For the weighted approach, an exponential and step weight
function are used, with risk interval
. For brevity, results for the step function appear in
Supplementary Material C. Tuning parameters are selected using
25 replications of 5-fold cross-validation.

Recalibration parameters are estimated from data sets of size 500, 1000, 5000,
and 10,000. We use a large independent validation data set of size
to evaluate the true (rather than estimated) risk model
performance before and after recalibration. Table 6 in Supplementary Material C summarizes results for each example
with 500 repeated simulations.

We simulate the data as follows. First, true risks ($p_i$) are generated from a mixture Beta distribution composed of
3 subdistributions. The subdistributions are defined by the tendency to have
low, medium, or high true risks. Beta hyperparameters and mixing proportions
vary by example. Outcomes $D_i$
are generated from a $\text{Bernoulli}(p_i)$
distribution. The overall event rate is
$\sum_{k=1}^{3} \pi_k \, a_k / (a_k + b_k)$, where $a_k$ and $b_k$
are the Beta hyperparameters for 3 different subpopulations,
and $\pi_k$ is the mixing proportion for subpopulation $k$.
Finally, we induce miscalibration by applying a piecewise
polynomial function to the true risk model. We vary the type of miscalibration
to capture different scenarios. Full details for each scenario are provided in
Supplementary Material C.

We present 4 types of miscalibration: underestimation of risk scores near the
risk threshold and overestimation elsewhere (example 1); underestimation of
risks for all risk scores (example 2); overestimation of risk scores near the
risk threshold and underestimation far from the risk threshold (example 3); and
overestimation of risks for all risk scores (example 4). Table 1 shows the $sNB$
of the original and recalibrated risk models for all examples
and sample sizes. Figure
3 shows calibration curves for all examples and sample size of
. Calibration plots for other sample sizes and additional
simulation results are in Supplementary Material C.
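The data-generating steps described above (true risks from a 3-component Beta mixture, then Bernoulli outcomes) can be sketched as follows. The hyperparameters and mixing proportions used in the paper are given only in its supplement, so any values passed to these functions are hypothetical; the function names are likewise illustrative, and the miscalibration step is omitted because its piecewise polynomial is not specified here.

```python
import random


def simulate_true_risks(n, components, rng):
    """Draw n true risks from a Beta mixture.

    components: list of (mixing proportion, a, b) tuples, one per subpopulation.
    """
    proportions = [c[0] for c in components]
    risks = []
    for _ in range(n):
        _, a, b = rng.choices(components, weights=proportions, k=1)[0]
        risks.append(rng.betavariate(a, b))
    return risks


def simulate_outcomes(true_risks, rng):
    """Bernoulli outcome for each individual with success probability equal to the true risk."""
    return [1 if rng.random() < p else 0 for p in true_risks]


def expected_event_rate(components):
    """Overall event rate: mixture-weighted sum of the Beta means a / (a + b)."""
    return sum(pi * a / (a + b) for pi, a, b in components)
```

With a large simulated sample, the observed proportion of events should be close to `expected_event_rate(components)`, which offers a quick sanity check on the generator.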
Table 1. Summary of Simulation Results

Example     Sample Size   Original   Standard   Weighted   Constrained
Example 1   500           0.430      0.446      0.446      0.458
            1000          0.430      0.445      0.445      0.481
            5000          0.430      0.455      0.503      0.504
            10,000        0.430      0.457      0.499      0.501
Example 2   500           0.475      0.510      0.513      0.524
            1000          0.475      0.491      0.491      0.491
            5000          0.475      0.511      0.523      0.524
            10,000        0.475      0.511      0.521      0.525
Example 3   500           0.440      0.459      0.461      0.475
            1000          0.440      0.440      0.457      0.476
            5000          0.440      0.458      0.484      0.488
            10,000        0.440      0.449      0.502      0.503
Example 4   500           0.282      0.417      0.418      0.417
            1000          0.282      0.417      0.417      0.417
            5000          0.282      0.413      0.421      0.435
            10,000        0.282      0.422      0.430      0.431

$sNB$ of original risk score, and after standard,
weighted, and constrained recalibration. All $sNB$
calculations are obtained from a large set of
independent data (
) to show the true performance of each risk
model.
Figure 3
Calibration curves and the distributions of risk scores for examples 1 to
4 and sample size
. Calibration curves for the original, standard
recalibrated, weighted recalibrated, and constrained recalibrated risk
models are shown for each example. The histogram shows the distribution
of risk scores before any recalibration. Dotted lines indicate the
clinically important risk threshold,
. The dashed line is the identity line.

In example 1, the original risk model underestimates risk at the risk threshold.
The calibration curves in Figure 3A show good calibration at the risk threshold under the
weighted and constrained approaches. In contrast, after standard recalibration,
risks continue to be underestimated at the risk threshold. Weighted and
constrained logistic recalibration increase $sNB$
by 0.042 and 0.044, respectively, compared with standard logistic
recalibration at the largest sample size (10,000; Table 1). However, gains are smaller for smaller sample sizes, particularly
for the weighted approach. The smaller gains in $sNB$
under the weighted approach can be attributed to the tuning
parameter selected via the cross-validation procedure. When the sample size is
inadequate to support targeted recalibration, the cross-validated tuning parameter
selection is designed to make the weighted approach approximate standard logistic
recalibration.

Next, we consider an example in which risks are underestimated across all
predicted risks. The calibration curves shown in Figure 3B show slight improvement in
calibration at the risk threshold for the weighted approach compared with
standard logistic recalibration when
. Both weighted and constrained logistic recalibration yield a
recalibrated risk model with larger
compared with standard logistic recalibration for all sample
sizes except
. Weighted recalibration and constrained logistic recalibration
produce similar
, with slightly higher
for the constrained approach. For the smallest sample size,
, the constrained logistic recalibration approach has over 0.01
higher
compared with standard recalibration, while the weighted
approach offers smaller improvement.In this example, when the sample size is small, there are too little data near
the risk threshold to support the weighted approach. Therefore, weighted
recalibration approaches standard logistic recalibration. Similarly, when
, the constrained logistic recalibration is the same as
standard logistic recalibration because there is relatively large uncertainty in
the estimated maximum sNB, and the constraint space includes the standard logistic
recalibration solution.

In example 3 (Figure 3C), risks are overestimated at the risk threshold and underestimated for
very high and low predicted risks. Both the weighted and constrained
recalibration methods produce a recalibrated risk model with higher sNB
than standard logistic recalibration. As the sample size
decreases, the sNB
for weighted recalibration becomes similar to that for standard
logistic recalibration, while the constrained recalibration approach has
sustained increases in sNB
compared with standard logistic recalibration. Weighted and
constrained logistic recalibration sacrifice calibration away from the risk
threshold to achieve better calibration near the risk threshold. These methods
were designed to make this tradeoff, since miscalibration away from the risk
threshold does not affect clinical decisions.

Finally, in example 4, standard, weighted, and constrained logistic recalibration
all have similar sNB
when recalibration parameters are estimated in smaller data
sets. For larger sample sizes, the sNB
of the constrained approach is larger than that of all other methods,
while the weighted method still offers higher sNB
than standard logistic recalibration. For sample sizes
and
, the estimated recalibration parameters under the weighted
approach closely approximate those from standard recalibration (Table 5 in Supplementary Material C).
Recalibration of the ACC-AHA-ASCVD Risk Model
MESA is a large, prospective, nationwide, multiethnic cohort study of
cardiovascular disease (CVD) in men and women free of CVD at enrollment.
Demographic and clinical data were collected at baseline, and
participants were monitored for more than 10 y for cardiovascular clinical
events. Recalibrating the ACC-AHA-ASCVD risk model using the MESA cohort and
prioritizing good calibration at the treatment threshold of 7.5% could improve
the clinical utility of the risk tool for the population. Figure 4 shows the estimated potential
gains in sNB under recalibration of the ACC-AHA-ASCVD risk model. After standard logistic
recalibration, the estimated sNB
is near the maximum, suggesting that alternative recalibration
methods may not be worthwhile.
Figure 4
Potential gains in sNB under recalibration of the American College of
Cardiology (ACC)–American Heart Association (AHA)–atherosclerotic
cardiovascular disease (ASCVD) model for all Multi-Ethnic Study of
Atherosclerosis participants eligible for risk score application
(N = 4830). The plot indicates the potential for
recalibration to achieve higher clinical utility than the original risk
model since its estimated sNB is more than 1 standard error lower than
the estimated maximum sNB. Standard logistic recalibration produces a
risk model with near maximum sNB, so results do not support pursuing
specialized methods of recalibration.

MESA is an ethnically diverse cohort, and there is interest in evaluating and
correcting miscalibration of the ACC-AHA-ASCVD risk score within different
subgroups defined by sex and/or ethnicity.
Applying the graphical tool to different subgroups in MESA, we found
potential for improvement for the Black male cohort (Figure 5). The 10-y event rate of CVD in
Black men (within age range and low-density lipoprotein range, and diabetes
free, N = 538) was 7.1%. The average estimated 10-y risk of CVD from the
ACC-AHA-ASCVD risk score was 12.5%, indicating overestimation of risks.
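
As a rough gauge of the size of this overestimation, the two figures just quoted can be combined into an expected-to-observed ratio. This ratio is a common calibration summary but is an addition here, not a quantity reported in the analysis:

$$
\frac{E}{O} = \frac{\text{mean predicted 10-y risk}}{\text{observed 10-y event rate}} = \frac{0.125}{0.071} \approx 1.76
$$

Under this summary, predicted risks in this subgroup are on average about 76% higher than the observed event rate. The ratio describes average miscalibration only and says nothing specific about calibration at the 7.5% treatment threshold.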
Figure 5
Potential gains in sNB under recalibration of the American College of
Cardiology (ACC)–American Heart Association (AHA)–atherosclerotic
cardiovascular disease (ASCVD) risk model for Black males eligible for
risk score application (N = 538). The plot indicates
the potential to achieve higher clinical utility than the original risk
model or the risk model after standard logistic recalibration since the
estimated sNB of those risk models is more than 1 standard error
lower than the estimated maximum sNB.

We applied standard, weighted, and constrained logistic recalibration to the
ACC-AHA-ASCVD risk score in the Black, male MESA cohort. Table 2 shows the estimated
recalibration parameters (α0, α1), standardized net benefit (and its components), event rate in
the risk interval, and proportion treated. We used bootstrap methods to correct
for optimistic bias in estimating sNB.
The estimated maximum achievable sNB
under recalibration for this sample was 0.362, with estimated
standard error 0.102. Therefore, the lower bound used to define the
constrained parameter space was 0.362 - 0.102 = 0.260, 1 standard error below
the estimated maximum. Both weighted and constrained recalibration offered
improvements in sNB over standard recalibration. Figure 6 shows similar calibration
among the three methods at the risk threshold.
Table 2
Comparison of Recalibration Methods in the MESA Black, Male Cohort for
RAW = 0.01

| Measure | Original | Standard | Weighted | Constrained |
|---|---|---|---|---|
| Estimated (α0, α1) | n/a | (-0.911, 0.856) | (0.088, 1.271) | (-0.699, 0.960) |
| Effective sample proportion, % | n/a | 100 | 70 | 100 |
| Estimated sNB (95% CI) | 0.175 (-0.082, 0.432) | 0.179 (-0.068, 0.426) | 0.304 (0.023, 0.586) | 0.274 (-0.018, 0.473) |
| Optimism-corrected estimated sNB (95% CI) | n/a | 0.165 (-0.082, 0.411) | 0.295 (0.013, 0.577) | 0.211 (-0.034, 0.457) |
| Optimism-corrected estimated TPR | 0.868 | 0.550 | 0.759 | 0.582 |
| Optimism-corrected estimated FPR | 0.650 | 0.350 | 0.430 | 0.355 |

For weighted recalibration, an indicator weight is used where a
constant weight is applied to observations within the clinically
defined risk interval [2.5%, 10%], and a smaller constant weight is
applied to observations outside that interval. The optimism-bias
correction for sNB and the confidence intervals were estimated via the
bootstrap method with 500 replications.
Figure 6
Calibration curves for the original risk score as well as standard,
weighted, and constrained recalibrated risk score in the Multi-Ethnic
Study of Atherosclerosis Black, male cohort.
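
To make the Table 2 parameters concrete, the sketch below converts each estimated (α0, α1) pair into the original-scale risk that maps onto the 7.5% treatment threshold. It assumes the usual logistic recalibration form, logit(recalibrated risk) = α0 + α1 · logit(original risk); the function and variable names are illustrative and are not taken from the ClinicalUtilityRecal package.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def expit(z):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-z))


def recalibrate(risk, alpha0, alpha1):
    """Assumed recalibration form: logit(new) = alpha0 + alpha1 * logit(old)."""
    return expit(alpha0 + alpha1 * logit(risk))


def original_risk_at_threshold(threshold, alpha0, alpha1):
    """Original-scale risk whose recalibrated value equals the threshold."""
    return expit((logit(threshold) - alpha0) / alpha1)


# Estimated recalibration parameters as reported in Table 2.
TABLE2_PARAMS = {
    "standard": (-0.911, 0.856),
    "weighted": (0.088, 1.271),
    "constrained": (-0.699, 0.960),
}
THRESHOLD = 0.075  # treatment threshold of 7.5%

cutoffs = {
    method: original_risk_at_threshold(THRESHOLD, a0, a1)
    for method, (a0, a1) in TABLE2_PARAMS.items()
}

for method, cutoff in cutoffs.items():
    print(f"{method:12s} original-scale cutoff: {cutoff:.3f}")
```

Because each slope is positive, treating when the recalibrated risk reaches 7.5% is equivalent to treating when the original score reaches the printed cutoff. Under this assumed form, the weighted fit gives the lowest cutoff (about 11%) and standard recalibration the highest (about 13%), which agrees in direction with the higher true- and false-positive rates reported for the weighted method in Table 2.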

We acknowledge wide confidence intervals in these results. The small sample size
and resulting uncertainty make it difficult to draw definitive conclusions about
improved clinical utility. However, despite the small sample size, both the
graphical device and optimism-corrected estimates of sNB
suggest the proposed methods are advantageous.
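
The link between the TPR, FPR, and sNB rows of Table 2 can be checked approximately with the standard definition of standardized net benefit, sNB = TPR - ((1 - ρ)/ρ) · (R/(1 - R)) · FPR, where ρ is the event rate and R the risk threshold. The sketch below is an illustration under two assumptions: that ρ equals the 7.1% event rate quoted for this cohort and that R = 7.5%. The function name is hypothetical.

```python
def standardized_net_benefit(tpr, fpr, prevalence, threshold):
    """sNB = TPR - ((1 - rho) / rho) * (R / (1 - R)) * FPR."""
    threshold_odds = threshold / (1.0 - threshold)
    return tpr - ((1.0 - prevalence) / prevalence) * threshold_odds * fpr


PREVALENCE = 0.071  # assumed: 10-y event rate quoted for the cohort
THRESHOLD = 0.075   # treatment threshold of 7.5%

# (TPR, FPR) pairs as reported in Table 2.
RATES = {
    "original": (0.868, 0.650),
    "standard": (0.550, 0.350),
    "weighted": (0.759, 0.430),
    "constrained": (0.582, 0.355),
}

snb = {
    method: standardized_net_benefit(tpr, fpr, PREVALENCE, THRESHOLD)
    for method, (tpr, fpr) in RATES.items()
}

for method, value in snb.items():
    print(f"{method:12s} sNB from TPR and FPR: {value:.3f}")
```

The values produced this way are close to, but not identical with, the sNB estimates in Table 2. Exact agreement should not be expected, since the table applies a bootstrap optimism correction to each quantity and reports rounded inputs.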
Discussion
We presented methods for risk model recalibration that aim to optimize a risk model’s
clinical utility for making risk-based decisions. Box 1 compares the 2 proposed methods,
which are both generalizations of standard logistic recalibration. Moreover, both
methods can be expected to approximate or reproduce standard logistic recalibration
when it produces good calibration at the critical risk threshold. We consider this
feature a strength of these approaches.
Box 1
Comparison of Proposed Recalibration Methods
Weighted logistic recalibration
• Aims to improve calibration at the risk threshold, with
improved sNB as a result
• Flexible weight function allows researchers to specify either
a single risk or an interval where good calibration should be
prioritized
• Down-weighting reduces effective sample size
• Requires tuning parameters
Constrained logistic recalibration
• Aims to maximize sNB, with improved calibration at the risk
threshold as a consequence
• If variance of sNB^max is large, the constrained parameter space will
also be large, and the recalibration will be identical to
standard logistic recalibration
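
The weighted approach summarized in Box 1 can be illustrated with a minimal sketch. It assumes the weights enter as observation weights in the logistic log-likelihood and uses an indicator weight over a risk interval. The data are simulated, the outside weight of 0.2 is a fixed hypothetical value (the cross-validated tuning described in the article is omitted), and none of the names correspond to the ClinicalUtilityRecal package.

```python
import math
import random


def logit(p):
    return math.log(p / (1.0 - p))


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def indicator_weights(risks, lower, upper, outside_weight):
    """Weight 1 inside [lower, upper], a smaller constant weight outside."""
    return [1.0 if lower <= r <= upper else outside_weight for r in risks]


def weighted_score(a0, a1, x, y, w):
    """Weighted score (gradient of the weighted log-likelihood)."""
    s0 = s1 = 0.0
    for xi, yi, wi in zip(x, y, w):
        resid = wi * (yi - expit(a0 + a1 * xi))
        s0 += resid
        s1 += resid * xi
    return s0, s1


def fit_weighted_recalibration(risks, outcomes, weights, max_iter=50, tol=1e-10):
    """Newton-Raphson fit of logit(P(event)) = a0 + a1 * logit(risk)."""
    x = [logit(r) for r in risks]
    a0, a1 = 0.0, 1.0  # start from "no recalibration"
    for _ in range(max_iter):
        s0, s1 = weighted_score(a0, a1, x, outcomes, weights)
        h00 = h01 = h11 = 0.0
        for xi, wi in zip(x, weights):
            mu = expit(a0 + a1 * xi)
            v = wi * mu * (1.0 - mu)
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        step0 = (h11 * s0 - h01 * s1) / det
        step1 = (h00 * s1 - h01 * s0) / det
        a0 += step0
        a1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    return a0, a1


# Simulated data: the true recalibration is (-0.5, 0.9) by construction.
random.seed(0)
risks = [random.uniform(0.01, 0.40) for _ in range(2000)]
outcomes = [1 if random.random() < expit(-0.5 + 0.9 * logit(r)) else 0 for r in risks]

weights = indicator_weights(risks, 0.025, 0.10, outside_weight=0.2)
a0_w, a1_w = fit_weighted_recalibration(risks, outcomes, weights)
a0_s, a1_s = fit_weighted_recalibration(risks, outcomes, [1.0] * len(risks))

print("weighted fit:", a0_w, a1_w)
print("standard fit:", a0_s, a1_s)
```

Setting every weight to 1 reproduces standard logistic recalibration, which is why the weighted method can be viewed as a generalization of it. As the outside weight shrinks, the fit relies increasingly on observations inside the interval, at the cost of a smaller effective sample.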

We additionally proposed a graphical device to help researchers assess the potential
for recalibration to improve the clinical utility of a risk model. For a predefined
risk model, we also provided methods to estimate its maximum possible net benefit
and its variance. These results enable researchers to evaluate whether specialized
methods of recalibration, such as the 2 we propose, are likely to be advantageous.
Both methods and the graphical tool are implemented in the R package ClinicalUtilityRecal.

As discussed, this work assumes all conditions required for net benefit metrics to be
meaningful. We also emphasize that we do not think recalibration should be an
automatic response to observing miscalibration. Miscalibration can indicate issues,
such as measurement or population heterogeneity, that might be resolved in other
ways. When possible, identifying the source of miscalibration can provide
researchers with a better understanding of avenues for correction, as well as
indications of complex changes in populations. Moreover, if there are adequate data
to develop a new risk model, refitting may be preferred over recalibration. Other
work has compared standard logistic recalibration to refitting methods.[33-35] However, even when refitting
is possible, investigators might prefer recalibration to maintain a connection with
the original model. In this article, we presume a context in which investigators
have decided that recalibration is their best course of action.

Standard logistic recalibration is a parsimonious method to address miscalibration.
In settings where the miscalibration pattern at the risk threshold is similar to the
pattern for the bulk of the data (e.g., systematic under- or overestimation) or
settings where there is under- or overfitting, standard logistic recalibration may
adequately improve calibration at the risk threshold. In settings where standard
logistic recalibration does not adequately correct miscalibration at the risk
threshold, alternative recalibration methods are useful to ensure risk-guided
clinical decisions are made appropriately. However, it may be unappealing to use
methods that increase the number of recalibration parameters estimated, particularly
if this leads to overfitting. Our methods leverage the parsimony of standard
logistic recalibration while allowing researchers to focus on the regions where good
calibration matters most. Furthermore, we note that our methods could naturally be
applied with other families of recalibration functions, such as the 3-parameter
family proposed by Kull et al.
The methods we propose are not intrinsically tied to the logistic
recalibration family.

Statistical software may return elements of statistical inference (standard errors,
confidence intervals, P values) when estimating the recalibration
intercept and slope. These elements might be useful when the model is fit to detect
miscalibration, but we do not find them to be useful for the actual process of
recalibration. Instead, there are 2 instances in which elements of statistical
inference play a key role in our proposed methods. First, we propose a
1-standard-error rule for assessing the potential for recalibration to improve
clinical utility using our proposed graphical device. Second, we suggest that
investigators use a 1-standard-error rule when implementing constrained logistic
recalibration.

Weighted logistic recalibration requires tuning parameters to specify the weighting
scheme. We envision using the weighting scheme in 1 of 2 special forms, each
requiring a single tuning parameter. The computational burden of cross-validation is
a disadvantage of the weighted method. When there are few events, heavy
down-weighting may be undesirable. In these instances, the cross-validation
procedure paired with a 1-standard-error rule will indicate that the data do not
support the weighted approach, and weighted logistic recalibration will approximate
standard logistic recalibration. In general, we recommend reporting the effective
sample proportion to gauge the impact of weighting.

As risk prediction becomes more ubiquitous with the increase in both data
availability and more sophisticated prediction methods, opportunities to observe
miscalibration are also more common. A recent article describes and classifies
reasons for “data set drift” and implications for the performance of artificial
intelligence systems.
A risk model’s miscalibration has been called “clinically harmful” if it
reduces the net benefit of using the risk model below that of the uniform treatment
policies (treat all and treat none).
However, Kerr et al.
give an example in which the net benefit of a miscalibrated risk model is
higher than both uniform treatment policies, but addressing the miscalibration could
substantially improve the model’s net benefit to the relevant population. This is a
situation in which we consider miscalibration to be clinically harmful. It is
important to assess calibration even if a risk model outperforms treat-all and
treat-none rules. Investigators should consider recalibrating a risk model whenever
there is evidence that its clinical utility could be meaningfully improved.

Supplemental material, sj-pdf-1-mdm-10.1177_0272989X211044697 for Recalibration
Methods for Improved Clinical Utility of Risk Scores by Anu Mishra, Robyn L.
McClelland, Lurdes Y. T. Inoue and Kathleen F. Kerr in Medical Decision
Making