
Evaluation of prediction models for decision-making: beyond calibration and discrimination.

Lars Holmberg, Andrew Vickers.

Abstract

Lars Holmberg and Andrew Vickers discuss the importance of ensuring prediction models lead to better decision making in light of new research into breast, endometrial, and ovarian cancer risk by Ruth Pfeiffer and colleagues. Please see later in the article for the Editors' Summary.

Year:  2013        PMID: 23935462      PMCID: PMC3728013          DOI: 10.1371/journal.pmed.1001491

Source DB:  PubMed          Journal:  PLoS Med        ISSN: 1549-1277            Impact factor:   11.069


Linked Research Article

This Perspective discusses the following new study published in PLOS Medicine: Pfeiffer RM, Park Y, Kreimer AR, Lacey JV Jr, Pee D, et al. (2013) Risk Prediction for Breast, Endometrial, and Ovarian Cancer in White Women Aged 50 Years or Older: Derivation and Validation from Population-Based Cohort Studies. PLoS Med 10(7): e1001492. doi:10.1371/journal.pmed.1001492. Ruth Pfeiffer and colleagues describe models to calculate absolute risks for breast, endometrial, and ovarian cancers for white, non-Hispanic women over 50 years old using easily obtainable risk factors.

In this week's issue of PLOS Medicine, Ruth Pfeiffer and colleagues present risk prediction models for breast, endometrial, and ovarian cancer [1]. Improvement of existing models and a new model for endometrial cancer can, as the authors say, be useful for several purposes. However, the paper also raises issues about the challenges of model improvement, interpretation, and application to public health and to clinical decision-making. Pfeiffer and colleagues present models for absolute risks and thereby avoid the common mistake of proclaiming a substantial relative risk as clinically relevant without considering the background risk. For example, a relative risk of 3.0 corresponding to a risk increase from 1% to 3% may have quite different implications than an increase from 10% to 30% [2].

The key claim of the paper is that the models “may assist in clinical decision-making.” While the examples in the paper predominantly concern prevention, rather than what many readers would intuitively think of as clinical decision-making—situations such as primary treatment of early prostate cancer or choice of adjuvant chemotherapy for early breast cancer—the emphasis on decision-making is laudable. What we want from models is that they help us make better decisions, leading to better outcomes for our patients. This raises the question of how to evaluate whether a model does indeed improve decision-making.
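The relative- versus absolute-risk point above can be spelled out numerically. A minimal illustration (the function name is ours, not from the paper): the same relative risk of 3.0 implies very different numbers of additional events depending on the baseline risk.

```python
# Illustrative arithmetic only: one relative risk, two hypothetical baselines.
def absolute_increase(baseline_risk: float, relative_risk: float) -> float:
    """Absolute risk difference implied by applying a relative risk."""
    return baseline_risk * relative_risk - baseline_risk

low = absolute_increase(0.01, 3.0)   # 1% -> 3%: 2 extra events per 100 people
high = absolute_increase(0.10, 3.0)  # 10% -> 30%: 20 extra events per 100 people
print(f"absolute increase at 1% baseline:  {low:.2f}")
print(f"absolute increase at 10% baseline: {high:.2f}")
```

The relative risk is identical in both cases, but the second scenario produces ten times as many additional events, which is why the absolute scale matters for public health and clinical decisions.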
As the authors state, good calibration is essential for good decision-making. A model is well calibrated if, for every 100 individuals given a risk of x%, close to x will indeed have the event of interest. Calibration concerns average risk in a population, and a well-calibrated model may assist in prevention decisions; a miscalibrated model, by contrast, may assign a low predicted probability to an individual at high risk, who then forgoes effective preventive intervention. However, calibration is necessary but not sufficient for clinical utility, as the example of mammography screening shows: breast cancer risk prediction models are rarely used to determine eligibility for screening, which is instead based predominantly on age, because very large differences in risk between women would be needed to justify separating women into higher versus lower intensities of mammography. The statistical measure of how well a model separates risk is known as discrimination. But traditional analyses of risk factors are, on their own, not well suited to discriminating prognostic groups in a way that is useful for clinical decision-making [3],[4]. Discrimination is often summarized as the area under the receiver operating characteristic (ROC) curve, the AUC. The AUC is often a useful first step in evaluating a model or in comparing two diagnostic or prognostic models against each other. But like calibration, the AUC is insufficient to demonstrate that a model would improve decision-making [5]. The calculation of AUC assumes that sensitivity is of equal value to specificity, whereas typically the consequences of a false negative (such as a missed cancer) are dramatically different from those of a false positive (such as an unnecessary biopsy).
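The two notions can be made concrete with a small numerical sketch on synthetic data (everything below is illustrative and unrelated to the paper's models): calibration compares the observed event rate with the predicted risk within a risk band, and the AUC is simply the probability that a randomly chosen case was assigned a higher risk than a randomly chosen non-case.

```python
import random

random.seed(0)

# Synthetic, illustrative data: predicted risks, and 0/1 outcomes drawn so
# that the event rate matches the prediction (i.e., a well-calibrated model).
preds = [random.random() * 0.3 for _ in range(2000)]
events = [1 if random.random() < p else 0 for p in preds]

# Calibration: among people given roughly 10% risk, close to 10% should
# actually have the event.
band = [(p, y) for p, y in zip(preds, events) if 0.05 <= p < 0.15]
observed = sum(y for _, y in band) / len(band)
print(f"mean predicted risk in band: {sum(p for p, _ in band) / len(band):.3f}")
print(f"observed event rate in band: {observed:.3f}")

# Discrimination: the AUC equals the probability that a random case has a
# higher predicted risk than a random non-case (the rank interpretation).
cases = [p for p, y in zip(preds, events) if y == 1]
controls = [p for p, y in zip(preds, events) if y == 0]
wins = sum((c > d) + 0.5 * (c == d) for c in cases for d in controls)
auc = wins / (len(cases) * len(controls))
print(f"AUC: {auc:.3f}")
```

Note that this model is nearly perfectly calibrated by construction, yet its AUC is far from 1.0: the two properties are distinct, which is exactly why neither alone demonstrates clinical utility.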
One example is a classifier of aggressive prostate cancer that is associated with a clearly elevated relative risk of lethal cancer and has an AUC statistically significantly above 0.5, yet still has an unacceptable rate of false negatives, implying missed treatment opportunities, across the range of thresholds where its use would be reasonable [6]. The paper by Pfeiffer and colleagues raises the critical issue of how we should determine the clinical utility of a model: whether it changes decisions, and whether those decisions are good ones. This is an issue that touches a variety of different areas in medical prediction, including comparisons of models and the value of novel molecular markers. Recent years have seen numerous methodological developments, moving beyond a clear recommendation that clinical utility should be formally assessed [7] to actual statistical techniques for doing so [8],[9]. One of us (A. V.) developed a method for evaluating prediction models called decision curve analysis [10], a straightforward technique with readily available software (http://www.decisioncurveanalysis.org). Thus, there are now quantitative techniques available that can determine whether a model does more good than harm given reasonable assumptions about the consequences of false negatives compared to false positives. This takes us substantially further than the (unresolved) debate about whether model evaluation should prioritize calibration or discrimination [5],[11]–[14]. Use of novel decision analytic techniques can also avoid the sort of problems raised by statements such as “[w]ell-calibrated risk models, even those with modest discriminatory accuracy, have public health applications” [1]: it is difficult to know what counts as “modest” discrimination, and how much discrimination would have to improve to outweigh a given level of miscalibration.
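The core quantity such a decision analytic evaluation computes is the net benefit at a chosen threshold probability, as in decision curve analysis [10]. A minimal sketch on synthetic data (this is our own illustration, not the software linked above; the threshold odds weight false positives against true positives, encoding how the decision maker values an unnecessary intervention against a missed case):

```python
import random

random.seed(1)

# Synthetic, illustrative data (not from the paper).
preds = [random.random() * 0.4 for _ in range(2000)]
events = [1 if random.random() < p else 0 for p in preds]

def net_benefit(preds, events, threshold):
    """Net benefit of treating everyone whose predicted risk >= threshold.

    net benefit = TP/n - FP/n * pt/(1 - pt), where pt is the threshold
    probability: false positives are down-weighted by the odds at pt.
    """
    n = len(preds)
    tp = sum(1 for p, y in zip(preds, events) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(preds, events) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)

def treat_all(events, threshold):
    """Net benefit of ignoring the model and intervening on everyone."""
    prev = sum(events) / len(events)
    return prev - (1 - prev) * threshold / (1 - threshold)

# A decision curve plots net benefit over a range of reasonable threshold
# probabilities; "treat none" is the horizontal line at zero.
for pt in (0.05, 0.10, 0.20, 0.30):
    print(f"pt={pt:.2f}  model={net_benefit(preds, events, pt):+.3f}  "
          f"treat all={treat_all(events, pt):+.3f}")
```

A model is useful at a given threshold only if its net benefit exceeds both the treat-all and treat-none strategies there; unlike the AUC, this comparison builds the asymmetric consequences of false negatives and false positives directly into the evaluation.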
For instance, the discrimination estimated by Pfeiffer and colleagues, as judged by the AUC, would to many seem weak rather than modest. The paper by Pfeiffer and colleagues is one of many current papers illustrating the need for quantitative evaluation of the clinical value of prediction models; such evaluation would arm us to translate rapidly growing medical knowledge into sound decision-making to the benefit of patients. Hopefully, we can have models and model evaluations that illuminate the whole spectrum, from public health decisions for groups of people to the vision of individualized medicine with individually tailored treatments.
References:  14 in total

Review 1.  Limitations of the odds ratio in gauging the performance of a diagnostic, prognostic, or screening marker.

Authors:  Margaret Sullivan Pepe; Holly Janes; Gary Longton; Wendy Leisenring; Polly Newcomb
Journal:  Am J Epidemiol       Date:  2004-05-01       Impact factor: 4.897

2.  Interpreting diagnostic accuracy studies for patient care.

Authors:  Susan Mallett; Steve Halligan; Matthew Thompson; Gary S Collins; Douglas G Altman
Journal:  BMJ       Date:  2012-07-02

3.  Ratio measures in leading medical journals: structured review of accessibility of underlying absolute risks.

Authors:  Lisa M Schwartz; Steven Woloshin; Evan L Dvorin; H Gilbert Welch
Journal:  BMJ       Date:  2006-10-23

4.  Integrating the predictiveness of a marker with its performance as a classifier.

Authors:  Margaret S Pepe; Ziding Feng; Ying Huang; Gary Longton; Ross Prentice; Ian M Thompson; Yingye Zheng
Journal:  Am J Epidemiol       Date:  2007-11-02       Impact factor: 4.897

5.  Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond.

Authors:  Michael J Pencina; Ralph B D'Agostino; Ralph B D'Agostino; Ramachandran S Vasan
Journal:  Stat Med       Date:  2008-01-30       Impact factor: 2.373

6.  Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers.

Authors:  Michael J Pencina; Ralph B D'Agostino; Ewout W Steyerberg
Journal:  Stat Med       Date:  2010-11-05       Impact factor: 2.373

7.  Prostate-specific antigen levels as a predictor of lethal prostate cancer.

Authors:  Katja Fall; Hans Garmo; Ove Andrén; Anna Bill-Axelson; Jan Adolfsson; Hans-Olov Adami; Jan-Erik Johansson; Lars Holmberg
Journal:  J Natl Cancer Inst       Date:  2007-04-04       Impact factor: 13.506

8.  Use and misuse of the receiver operating characteristic curve in risk prediction.

Authors:  Nancy R Cook
Journal:  Circulation       Date:  2007-02-20       Impact factor: 29.690

9.  Decision curve analysis: a novel method for evaluating prediction models.

Authors:  Andrew J Vickers; Elena B Elkin
Journal:  Med Decis Making       Date:  2006 Nov-Dec       Impact factor: 2.583

10.  Assessing the performance of prediction models: a framework for traditional and novel measures.

Authors:  Ewout W Steyerberg; Andrew J Vickers; Nancy R Cook; Thomas Gerds; Mithat Gonen; Nancy Obuchowski; Michael J Pencina; Michael W Kattan
Journal:  Epidemiology       Date:  2010-01       Impact factor: 4.822

  16 in total

1.  BATCAVE: calling somatic mutations with a tumor- and site-specific prior.

Authors:  Brian K Mannakee; Ryan N Gutenkunst
Journal:  NAR Genom Bioinform       Date:  2020-02-06

2.  Disease Staging and Prognosis in Smokers Using Deep Learning in Chest Computed Tomography.

Authors:  Germán González; Samuel Y Ash; Gonzalo Vegas-Sánchez-Ferrero; Jorge Onieva Onieva; Farbod N Rahaghi; James C Ross; Alejandro Díaz; Raúl San José Estépar; George R Washko
Journal:  Am J Respir Crit Care Med       Date:  2018-01-15       Impact factor: 21.405

Review 3.  Reporting and Interpreting Decision Curve Analysis: A Guide for Investigators.

Authors:  Ben Van Calster; Laure Wynants; Jan F M Verbeek; Jan Y Verbakel; Evangelia Christodoulou; Andrew J Vickers; Monique J Roobol; Ewout W Steyerberg
Journal:  Eur Urol       Date:  2018-09-19       Impact factor: 20.096

4.  Prediction of Patient Hepatic Encephalopathy Risk with Freiburg Index of Post-TIPS Survival Score Following Transjugular Intrahepatic Portosystemic Shunts: A Retrospective Study.

Authors:  Weimin Cai; Beishi Zheng; Xinran Lin; Wei Wu; Chao Chen
Journal:  Int J Gen Med       Date:  2022-04-13

5.  Prognostic utility of serum CRP levels in combination with CURB-65 in patients with clinically suspected sepsis: a decision curve analysis.

Authors:  Shungo Yamamoto; Shin Yamazaki; Tsunehiro Shimizu; Taro Takeshima; Shingo Fukuma; Yosuke Yamamoto; Kentaro Tochitani; Yasuhiro Tsuchido; Koh Shinohara; Shunichi Fukuhara
Journal:  BMJ Open       Date:  2015-04-28       Impact factor: 2.692

6.  Simplified clinical prediction scores to target viral load testing in adults with suspected first line treatment failure in Phnom Penh, Cambodia.

Authors:  Johan van Griensven; Vichet Phan; Sopheak Thai; Olivier Koole; Lutgarde Lynen
Journal:  PLoS One       Date:  2014-02-04       Impact factor: 3.240

7.  Validation of a novel prediction model for early mortality in adult trauma patients in three public university hospitals in urban India.

Authors:  Martin Gerdin; Nobhojit Roy; Monty Khajanchi; Vineet Kumar; Li Felländer-Tsai; Max Petzold; Göran Tomson; Johan von Schreeb
Journal:  BMC Emerg Med       Date:  2016-02-22

8.  A simple, step-by-step guide to interpreting decision curve analysis.

Authors:  Andrew J Vickers; Ben van Calster; Ewout W Steyerberg
Journal:  Diagn Progn Res       Date:  2019-10-04

9.  Urinary Podocalyxin as a Biomarker to Diagnose Membranous Nephropathy.

Authors:  Takahiro Imaizumi; Masahiro Nakatochi; Shin'ichi Akiyama; Makoto Yamaguchi; Hiroyuki Kurosawa; Yoshiaki Hirayama; Takayuki Katsuno; Naotake Tsuboi; Masanori Hara; Shoichi Maruyama
Journal:  PLoS One       Date:  2016-09-26       Impact factor: 3.240

10.  Predictive models of diabetes complications: protocol for a scoping review.

Authors:  Ruth Ndjaboue; Imen Farhat; Carol-Ann Ferlatte; Gérard Ngueta; Daniel Guay; Sasha Delorme; Noah Ivers; Baiju R Shah; Sharon Straus; Catherine Yu; Holly O Witteman
Journal:  Syst Rev       Date:  2020-06-08
