
Characterizing Decision-Analysis Performances of Risk Prediction Models Using ADAPT Curves.

Wen-Chung Lee1, Yun-Chun Wu.   

Abstract

The area under the receiver operating characteristic curve is a widely used index to characterize the performance of diagnostic tests and prediction models. However, the index does not explicitly acknowledge the utilities of risk predictions. Moreover, for most clinical settings, what counts is whether a prediction model can guide therapeutic decisions in a way that improves patient outcomes, rather than simply updating probabilities. Based on decision theory, the authors propose an alternative index, the "average deviation about the probability threshold" (ADAPT). An ADAPT curve (a plot of ADAPT value against the probability threshold) neatly characterizes the decision-analysis performances of a risk prediction model. Several prediction models can be compared for their ADAPT values at a chosen probability threshold, for a range of plausible threshold values, or for the whole ADAPT curves. This should greatly facilitate the selection of diagnostic tests and prediction models.


Year:  2016        PMID: 26765451      PMCID: PMC4718277          DOI: 10.1097/MD.0000000000002477

Source DB:  PubMed          Journal:  Medicine (Baltimore)        ISSN: 0025-7974            Impact factor:   1.817


INTRODUCTION

In clinical practice and epidemiologic research, one often needs to evaluate the performance of diagnostic and prognostic prediction models. For example, does a test discriminate the diseased from the nondiseased, and to what extent? Does a model predict the outcome of a patient under chemotherapy, and how accurate is the prognostication? The area under the receiver operating characteristic curve (AUC) is by far the most commonly used performance index for diagnostic tests or prognostic prediction models.[1] Other indices have also been proposed, such as the receiver operating characteristic curve-based “projected length of the curve” and “area swept out by the curve” indices,[2] the Lorenz curve-based Gini and Pietra indices,[3,4] and the Brier and scaled Brier scores.[4,5] The AUC index has been criticized as insensitive to the addition of strong markers to a baseline prediction model.[6-9] In comparison, the Pietra and scaled Brier indices are more sensitive to “gray-zone resolving markers,” that is, markers with discrimination powers concentrated in the gray zone of the baseline model.[4] The “predictiveness curve” proposed by Huang et al[10,11] and the 2 criteria of the “proportion of cases followed” and the “proportion needed to follow-up” proposed by Pfeiffer and Gail[12] are closely related to the Lorenz curve. Rather than summarizing the curve using a single index, these methods display and examine the whole curve and hence are more informative. However, none of the above methods explicitly acknowledges the “utilities” of risk predictions; for a lethal but curable disease, a larger utility value should be assigned to the true positive than to the true negative, whereas for a fairly innocuous condition, the opposite is true. Moreover, for most clinical settings, what counts is whether a prediction model can guide therapeutic decisions in a way that improves patient outcomes.
A prediction model may be good at revising disease probabilities, but if it cannot identify high-risk patients (those with disease probabilities larger than a certain threshold) for intervention, it is clinically useless. In this paper, we propose an alternative performance index for risk prediction based on decision theory:[13,14] the “average deviation about the probability threshold” (ADAPT). The index is simple to calculate and easy to interpret. We propose to plot the ADAPT value against the probability threshold (which we called the “ADAPT curve”). The ADAPT curve neatly characterizes the decision-analysis performances of a risk prediction model. We use 3 examples to demonstrate our method.

METHODS

A Primer on Decision Analysis

Assume that a prediction model has been developed, yielding a predicted probability of a certain disease (D) for each and every subject in the population (a total of N subjects with a disease prevalence of π): pi for i = 1,2,…,N. We assume that the prediction model is well calibrated (unbiased), so that close to x of every 100 subjects with a predicted probability of x% are actually diseased. Denote the utilities (U's) associated with the 4 different prediction outcomes as UTP for true positive (TP), UTN for true negative (TN), UFP for false positive (FP), and UFN for false negative (FN), respectively. Next, calculate the probability threshold, t = B_D̄/(B_D + B_D̄), where B_D = UTP − UFN is the “benefit” or “profit” of a correct prediction for a subject with disease, and B_D̄ = UTN − UFP is the corresponding value for a subject without disease. (Equivalently, B_D̄ is the harm or loss of a wrong prediction for a subject without disease.) Standard decision analysis[13,14] dictates that using t as the cutoff probability for calling a result positive will maximize the expected utility (EU). The threshold t is also the “indifference point”: for a subject with a predicted probability of exactly t, calling the result positive or negative yields the same EU. A decision analysis based on the 4 utility values (UTP, UTN, UFP, and UFN) is therefore exactly the same as one based on a single probability-threshold value (t). This means that we can bypass the utility values altogether and conduct a decision analysis based directly on the chosen probability threshold. This is certainly good news for decision makers, because determining the whole suite of utility values is by no means easy. Still, the probability threshold needs to be determined beforehand. In some clinical settings, it is already dictated in clinical practice guidelines and can readily be adopted in decision analyses.
If such guidelines are lacking, one can convene a panel of experts or conduct a physician/patient survey to determine the probability threshold (the indifference point) for the decision problem at hand.
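The threshold calculation above is easy to mechanize. The following is a minimal sketch (our own, not from the paper); the utility values are hypothetical, chosen for a lethal but curable disease where a true positive is worth far more than a true negative, which pushes the threshold down.

```python
def probability_threshold(u_tp, u_tn, u_fp, u_fn):
    """t = B_D-bar / (B_D + B_D-bar), where B_D = U_TP - U_FN is the benefit
    of a correct prediction for a diseased subject and B_D-bar = U_TN - U_FP
    is the benefit of a correct prediction for a nondiseased subject."""
    b_d = u_tp - u_fn        # benefit of ruling in the diseased
    b_dbar = u_tn - u_fp     # benefit of ruling out the nondiseased
    return b_dbar / (b_d + b_dbar)

# Hypothetical utilities: B_D = 1.0, B_D-bar = 0.2, so t = 0.2/1.2 ≈ 0.167.
t = probability_threshold(u_tp=1.0, u_tn=0.25, u_fp=0.05, u_fn=0.0)
```

Calling a result positive whenever the predicted probability exceeds this t then maximizes the EU, without ever handling the 4 utilities again.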

The ADAPT Index

The ADAPT index (for a well-calibrated prediction model) is defined as: ADAPT = (1/N) × Σ_{i=1}^{N} |pi − t|. If the prediction model is inherently a binary model that classifies subjects into high-risk (H) and low-risk (L) groups (eg, a pap smear that has only 2 outcomes: screen positive and screen negative), the ADAPT index is: ADAPT = Pr(H) × |pH − t| + Pr(L) × |pL − t|, where Pr(H) and Pr(L) = 1 − Pr(H) are the probabilities for a subject to be labeled as being high risk and low risk, respectively, and pH = Pr(D|H) and pL = Pr(D|L) are the predicted disease probabilities for high-risk and low-risk subjects, respectively. As its name suggests, the ADAPT index is to be interpreted as the average deviation of the predicted probabilities about the probability threshold. As noted above, the probability threshold is also the indifference point. We believe any well-rounded decision maker will abhor a model that is often indifferent to positive and negative results (a model with a smaller ADAPT index), and will prefer a model that is on average farther away from this indifference point (a model with a larger ADAPT index). For a perfect prediction model (which yields a predicted probability of 1 for any subject with the disease, and 0 for any subject without the disease), the ADAPT index is: ADAPT_perfect = π × (1 − t) + (1 − π) × t. For a null prediction model (which uses the disease prevalence in the population to predict disease for each and every subject), the index is: ADAPT_null = |π − t|. These 2 values also serve as the upper and lower bounds, respectively, for the ADAPT index. Between these 2 extremes, a prediction model with a higher ADAPT value is better than a model with a lower ADAPT value, because the EU from using the former model is larger than that from using the latter (eAppendix 1).
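The index and its 2 bounds can be sketched in a few lines of Python (our own illustration; the predicted probabilities below are hypothetical and assumed well calibrated):

```python
def adapt(probs, t):
    """ADAPT: mean absolute deviation of predicted probabilities about t."""
    return sum(abs(p - t) for p in probs) / len(probs)

def adapt_perfect(prevalence, t):
    # Perfect model: deviation 1 - t for the diseased, t for the nondiseased.
    return prevalence * (1 - t) + (1 - prevalence) * t

def adapt_null(prevalence, t):
    # Null model: every subject gets the prevalence as predicted probability.
    return abs(prevalence - t)

probs = [0.03, 0.12, 0.40, 0.85]   # hypothetical well-calibrated predictions
pi = sum(probs) / len(probs)       # implied prevalence = 0.35
a = adapt(probs, 0.2)              # mean |p_i - 0.2| = 0.275
```

For these numbers, the value 0.275 indeed falls between the null bound |0.35 − 0.2| = 0.15 and the perfect bound 0.35 × 0.8 + 0.65 × 0.2 = 0.41.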

Relations with Other Decision-Analysis Indices

The ADAPT index is the key building block for a number of previously proposed decision-analysis indices. First, the (per-subject) EU for a prediction model (eAppendix 1) is: EU = π × UFN + (1 − π) × UTN + (B_D + B_D̄) × (ADAPT + π − t)/2. We therefore see that the index of relative utility (RU)[15] can be expressed in either the EU scale or the ADAPT scale: RU = (EU − EU_null)/(EU_perfect − EU_null) = (ADAPT − |π − t|)/[π(1 − t) + (1 − π)t − |π − t|]. Next, the index of weighted accuracy (WA) suggested by previous researchers[15,16] is a weighted average of sensitivity (Se) and specificity (Sp) that takes into account the disease prevalence and the probability threshold (the 1st equality in the following equation). It is also the accuracy rate of the prediction model using “person-benefit” (instead of “individual person”) as the basis for the calculation (the 2nd equality). Therefore, the WA and ADAPT indices are related through the following simple equation (the 3rd equality): WA = [π(1 − t) × Se + (1 − π)t × Sp]/[π(1 − t) + (1 − π)t] = [Se × π × B_D + Sp × (1 − π) × B_D̄]/[π × B_D + (1 − π) × B_D̄] = (1/2) × {1 + ADAPT/[π(1 − t) + (1 − π)t]}. From this, we see that for a perfect prediction model, WA = 1 (100% accuracy, as guaranteed). For a null prediction model, we have WA = (1/2) × {1 + |π − t|/[π(1 − t) + (1 − π)t]} > 1/2 whenever π ≠ t. Note that a null model uses the disease prevalence in the population (the a priori disease probability) to predict disease, and therefore it is better than tossing a coin as long as π ≠ t. Finally, the “net benefit” (NB) is the basic index for decision curve analysis.[17-20] It subtracts harm from benefit of a prediction model, with the benefit of a true positive prediction being standardized to 1 (the 1st equality in the following equation). For a well-calibrated model the index can be expressed as (the 2nd equality): NB^P = Se × π − (1 − Sp) × (1 − π) × t/(1 − t) = (1/N) × Σ_{i: pi > t} (pi − t)/(1 − t). Note that the index in fact counts only those predictions from the model that are positive (TP and FP); likewise, the summation in the above formula is restricted to those i with pi > t. To be more exact, we therefore put a superscript “P” on the index and refer to it as the NB of positive prediction.
(Rousson and Zumbrunn[20] called this the “net benefit for the treated.”) Following the same logic, the NB of negative prediction (NB^N; restricted to TN and FN, or those i with pi ≤ t), with the benefit of a true negative prediction being standardized to 1, is: NB^N = Sp × (1 − π) − (1 − Se) × π × (1 − t)/t = (1/N) × Σ_{i: pi ≤ t} (t − pi)/t. (Rousson and Zumbrunn[20] called this the “net benefit for the untreated.”) Now, it is of interest to see that a weighted average of the 2 NB conjugates is complete, accounting for all 4 possible prediction outcomes and for all i ∈ {1,2,…,N} (the 1st equality in the following equation). It also turns out to be the ADAPT index itself (the 2nd equality): (1 − t) × NB^P + t × NB^N = (1/N) × Σ_{i=1}^{N} |pi − t| = ADAPT. The mathematical relations shown here also mean that, given the same set of utility values (or the same probability threshold), the performance ranking of various prediction models in the same population should be the same,[21] whether using EU, RU, WA, or the proposed ADAPT. Note that the 2 incomplete NBs are not on this list.
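The identity linking the 2 NB conjugates to ADAPT can be checked numerically. The snippet below is our own sketch, using the expected-value (well-calibrated) forms of the 2 net benefits, in which NB of positive prediction sums (pi − t)/(1 − t) over positive calls and NB of negative prediction sums (t − pi)/t over negative calls; the probabilities are made up.

```python
def nb_positive(probs, t):
    # Net benefit of positive prediction (well-calibrated, expected form):
    # only subjects called positive (p > t) contribute.
    return sum((p - t) / (1 - t) for p in probs if p > t) / len(probs)

def nb_negative(probs, t):
    # Net benefit of negative prediction: only subjects called negative
    # (p <= t) contribute, with a true negative standardized to benefit 1.
    return sum((t - p) / t for p in probs if p <= t) / len(probs)

probs = [0.02, 0.10, 0.35, 0.60, 0.90]   # hypothetical predictions
t = 0.2
adapt_value = sum(abs(p - t) for p in probs) / len(probs)
# Weighted average of the 2 conjugates recovers ADAPT exactly:
combined = (1 - t) * nb_positive(probs, t) + t * nb_negative(probs, t)
```

The algebra is immediate: (1 − t) cancels the denominator of each positive term and t cancels that of each negative term, leaving the sum of |pi − t| over all subjects.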

The ADAPT Curve

The ADAPT curve is a plot of the ADAPT value against the probability threshold. Along their respective ADAPT curves, several prediction models can be compared at a chosen probability threshold (such as when that value is dictated in the clinical guideline), for a range of plausible threshold values (such as when a sensitivity analysis about an estimated threshold value is indicated), or for the whole curves (such as when one wishes to acquire a global picture of the prediction performances).
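As a sketch of how the curve's values could be computed (our own code, with hypothetical well-calibrated probabilities), one simply evaluates ADAPT on a grid of thresholds; every point must lie between the null bound |π − t| and the perfect bound π(1 − t) + (1 − π)t that form the triangle in the figures below.

```python
def adapt_curve(probs, thresholds):
    """ADAPT evaluated at each threshold on a grid (the ADAPT curve)."""
    n = len(probs)
    return [sum(abs(p - t) for p in probs) / n for t in thresholds]

probs = [0.05, 0.15, 0.30, 0.70]          # hypothetical predictions
prevalence = sum(probs) / len(probs)      # 0.30 for a calibrated model
grid = [i / 100 for i in range(1, 100)]   # thresholds 0.01 ... 0.99
curve = adapt_curve(probs, grid)

# Sanity check: the curve stays inside the null/perfect triangle.
for t, a in zip(grid, curve):
    assert abs(prevalence - t) <= a + 1e-12                       # null bound
    assert a <= prevalence * (1 - t) + (1 - prevalence) * t + 1e-12  # perfect
```

Plotting `curve` against `grid` for each candidate model then yields the comparison described above.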

Example Data

Ethical approval is not necessary for the 3 datasets used in this study: a table in a published paper (Example 1), hypothetical data (Example 2), and simulated data (Example 3).

RESULTS

Example 1: Computed Tomographic Diagnosis of Neurological Problems

To illustrate, we use the data of Hanley and McNeil,[1] which consist of the rating results of 109 computed tomographic images obtained from 51 diseased subjects and 58 nondiseased subjects (Table 1). Assume that the disease prevalence in the study population is π = 0.1 and that the benefit of ruling in a diseased subject is 4 times the benefit of ruling out a nondiseased one [t = B_D̄/(B_D + B_D̄) = 1/(4 + 1) = 0.2].
TABLE 1

Data of Computed Tomographic Diagnosis of Neurological Problems (1)

Table 1 presents, for each of the 5 possible rating results, the proportion of subjects, the predicted disease probability, and the deviation about the threshold. The ADAPT index is therefore 0.5180 × 0.1886 + 0.0970 × 0.1596 + 0.0970 × 0.1596 + 0.1923 × 0.0878 + 0.0957 × 0.4758 = 0.1911. From this, we see that after the computed tomographic examination, the predicted probabilities of neurological problems for subjects in the study population will be distanced from the indifference point of t = 0.2 by an average amount of 0.1911. At π = 0.1 and t = 0.2, the upper and lower bounds for ADAPT are 0.1 × (1 − 0.2) + (1 − 0.1) × 0.2 = 0.26 and |0.1 − 0.2| = 0.1, respectively. The ADAPT value of 0.1911 sits roughly in the middle of these 2 bounds. Thus, we see that the computed tomographic diagnosis achieves roughly 50% of the performance of a perfect prediction model.
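The arithmetic of this example can be reproduced directly from the proportions and deviations quoted in the text (our own snippet; it uses only numbers stated above):

```python
# Proportions of subjects and deviations |p - t| for the 5 rating results
# of Table 1, as quoted in the text (pi = 0.1, t = 0.2).
props = [0.5180, 0.0970, 0.0970, 0.1923, 0.0957]
devs  = [0.1886, 0.1596, 0.1596, 0.0878, 0.4758]

adapt_value = sum(pr * d for pr, d in zip(props, devs))   # ≈ 0.1911

upper = 0.1 * (1 - 0.2) + (1 - 0.1) * 0.2   # perfect model: 0.26
lower = abs(0.1 - 0.2)                      # null model: 0.10
```

The computed value, about 0.1911, sits between the bounds 0.10 and 0.26, as described above.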

Example 2: Mammographic Diagnosis of Breast Cancers

As a 2nd example, we consider the 2 hypothetical tests for breast cancers (test A and test B) in Lee's paper.[22] The AUCs for these 2 tests are exactly equal (AUC = 0.83) and are higher than that of mammography (AUC = 0.79).[23] But the question is which to choose as an alternative to the conventional mammographic examination. We can turn to the ADAPT index to facilitate the selection of diagnostic tests; the test with the higher ADAPT value should be the one that yields the larger EU (eAppendix 1). Figure 1 presents the ADAPT curves for these 2 tests. Note that the curves are confined within the triangles, bounded below by the null model and above by the perfect model. We see that the choice depends on the probability threshold for the disease to be diagnosed and on the disease prevalence itself: when the prevalence is lower than the threshold (when everyone is to be called negative a priori, but a prediction model may rule in some people), test A performs better; when the 2 are equal, both tests perform the same; and when the prevalence is higher than the threshold (when everyone is to be called positive a priori, but a prediction model may rule out some people), test B performs better.
FIGURE 1

The average deviation about the probability threshold (ADAPT) curves for 2 hypothetical diagnostic tests for breast cancer (test A: dotted lines; test B: dashed lines). The upper sides of the triangles are for the perfect models, and the lower 2 sides, for the null models.

If one agrees that the benefit of a correct diagnosis for a breast cancer patient is 50 times the benefit of ruling out a disease-free person [t = B_D̄/(B_D + B_D̄) = 1/(50 + 1) ≈ 0.02], then for a community screening (disease prevalence <2%), one can use test A to rule in disease, and for a clinical diagnosis (disease prevalence >2%), one can use test B to rule out disease.

Example 3: Prediction Using Gray-Zone Resolving Markers

Wu and Lee[4] previously studied the performances of prediction models using the baseline score (B), with or without an additional new marker (B + M1 or B + M2). The new marker M1 is an “ordinary” marker with its discrimination power independent of the baseline score, whereas M2 is a gray-zone resolving marker with its discrimination power concentrated in the gray zone of the baseline model (where the predicted probability using the baseline model is close to the a priori probability). Here the performances of the 3 prediction models (B, B + M1, and B + M2) are compared using the ADAPT index. We follow Wu and Lee's[4] simulation scheme (500 subjects as the training sample, and another 500 as the validation sample; 10,000 simulations for each scenario), and the results are shown in Figure 2. We see that model B + M1 performs better than model B uniformly across all probability thresholds. By comparison, the model with the gray-zone resolving marker added (model B + M2) is rather intriguing; it performs better than model B + M1 (and model B) when the probability threshold is less than ∼0.5, but worse than model B + M1 (and surprisingly, even model B) when the probability threshold gets larger.
FIGURE 2

The average deviation about the probability threshold (ADAPT) curves for 3 prediction models (model B: solid line; model B + M1: dotted line; model B + M2: dashed lines). The upper side of the triangle is for the perfect model, and the lower 2 sides, for the null model.


DISCUSSION

From a decision-analysis point of view, a binary model that simply revises disease probabilities is of no use. The probability revision has to be such that the predicted disease probability for the high-risk subjects is larger than the probability threshold, and the predicted disease probability for the low-risk ones, smaller than it. The 2 predicted probabilities being on the 2 flanks of the probability threshold means that the high-risk and the low-risk subjects can be managed differently. Otherwise, there will be no point in making the prediction. This is prescribed right in the ADAPT formula. We readily recognize that the 2 predicted probabilities (pH and pL) have to flank the probability threshold, pL < t < pH, for the model to be better than the null (ie, to have an ADAPT index larger than |π − t|). A detailed mathematical analysis of the conditions for flanking is given in eAppendix 2. Sometimes, people may just want to learn their disease probabilities from a prediction model, irrespective of whether such knowledge leads to treatment plan changes, that is, “knowing for the sake of knowing,” as Asch et al[24] aptly pointed out. In such circumstances, one can simply let t = π and calculate the ADAPT index as usual. The ADAPT index now measures specifically the probability revision potential of a prediction model, and is mathematically related to the Pietra index[3,4]: ADAPT = 2π(1 − π) × Pietra, when t = π (eAppendix 3). This also helps to explain the rather peculiar behavior of the gray-zone resolving marker in our Example 3. A gray-zone resolving marker should work best in revising probabilities in the neighborhood of π (∼0.4 in the example), so that when t for the disease under study is near π, or when knowing for the sake of knowing is emphasized (t = π), the marker is of high value. However, if t is too high (such as when t > 0.5 in the example), the probability revision may seldom lead to threshold crossings, and then the marker becomes useless.
In summary, the ADAPT index is simple to calculate and easy to interpret. It is the key building block for a number of previously proposed decision-analysis indices. An ADAPT curve neatly characterizes the decision-analysis performances of a diagnostic/prognostic prediction model. Several prediction models can be compared for their ADAPT values at a chosen probability threshold, for a range of plausible threshold values, or for the whole ADAPT curves. This should greatly facilitate the selection of diagnostic tests and prediction models. Further studies are warranted to develop statistical inference methods regarding the ADAPT index and ADAPT curve.

Supplementary Materials

The reader is referred to the on-line eAppendices for more information.
REFERENCES (24 in total)

1.  Probabilistic analysis of global performances of diagnostic tests: interpreting the Lorenz curve-based summary measures.

Authors:  W C Lee
Journal:  Stat Med       Date:  1999-02-28       Impact factor: 2.373

2.  Assessing risk prediction models in case-control studies using semiparametric and nonparametric methods.

Authors:  Ying Huang; Margaret Sullivan Pepe
Journal:  Stat Med       Date:  2010-06-15       Impact factor: 2.373

3.  Statistical evaluation of prognostic versus diagnostic models: beyond the ROC curve.

Authors:  Nancy R Cook
Journal:  Clin Chem       Date:  2007-11-16       Impact factor: 8.327

4.  Evaluating the predictiveness of a continuous marker.

Authors:  Ying Huang; Margaret Sullivan Pepe; Ziding Feng
Journal:  Biometrics       Date:  2007-05-08       Impact factor: 2.571

5.  Alternative summary indices for the receiver operating characteristic curve.

Authors:  W C Lee; C K Hsiao
Journal:  Epidemiology       Date:  1996-11       Impact factor: 4.822

6.  Decision curve analysis: a novel method for evaluating prediction models.

Authors:  Andrew J Vickers; Elena B Elkin
Journal:  Med Decis Making       Date:  2006 Nov-Dec       Impact factor: 2.583

7.  Assessing the performance of prediction models: a framework for traditional and novel measures.

Authors:  Ewout W Steyerberg; Andrew J Vickers; Nancy R Cook; Thomas Gerds; Mithat Gonen; Nancy Obuchowski; Michael J Pencina; Michael W Kattan
Journal:  Epidemiology       Date:  2010-01       Impact factor: 4.822

8.  C-reactive protein, the metabolic syndrome, and prediction of cardiovascular events in the Framingham Offspring Study.

Authors:  Martin K Rutter; James B Meigs; Lisa M Sullivan; Ralph B D'Agostino; Peter W F Wilson
Journal:  Circulation       Date:  2004-07-19       Impact factor: 29.690

9.  Using relative utility curves to evaluate risk prediction.

Authors:  Stuart G Baker; Nancy R Cook; Andrew Vickers; Barnett S Kramer
Journal:  J R Stat Soc Ser A Stat Soc       Date:  2009-10-01       Impact factor: 2.483

10.  Genotype score in addition to common risk factors for prediction of type 2 diabetes.

Authors:  James B Meigs; Peter Shrader; Lisa M Sullivan; Jarred B McAteer; Caroline S Fox; Josée Dupuis; Alisa K Manning; Jose C Florez; Peter W F Wilson; Ralph B D'Agostino; L Adrienne Cupples
Journal:  N Engl J Med       Date:  2008-11-20       Impact factor: 91.245

