OBJECTIVE: To investigate the behavior of predictive performance measures that are commonly used in external validation of prognostic models for outcome in intensive care units (ICUs).
STUDY DESIGN AND SETTING: Four prognostic models (the Simplified Acute Physiology Score II, the Acute Physiology and Chronic Health Evaluation II, and the Mortality Probability Models II) were evaluated in the Dutch National Intensive Care Evaluation registry database. For each model, discrimination (area under the receiver operating characteristic curve, AUC), accuracy (Brier score), and two calibration measures were assessed on data from 41,239 ICU admissions. This validation procedure was repeated with smaller subsamples randomly drawn from the database, and the results were compared with those obtained on the entire data set.
RESULTS: Differences in performance between the models were small. The AUC and Brier score showed large variation with small samples. Standard errors of AUC values were accurate, but the power to detect differences in performance was low. Calibration tests were extremely sensitive to sample size. Direct comparison of performance, without statistical analysis, was unreliable with either measure.
CONCLUSION: Substantial sample sizes are required for performance assessment and model comparison in external validation. Calibration statistics and significance tests should not be used in these settings. Instead, a simple customization method to repair lack-of-fit problems is recommended.
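The subsampling procedure described above can be illustrated with a minimal sketch. It is not the authors' code and does not use the registry data: the cohort is synthetic, and the function names (`simulate_cohort`, `subsample_spread`), the Beta(1, 4) risk distribution, the subsample sizes, and the number of repeats are all assumptions chosen for illustration. It computes the AUC (via the rank-sum formulation) and the Brier score on repeated random subsamples and reports how much both measures vary at each subsample size.

```python
import random
import statistics


def auc(y_true, y_prob):
    """AUC via the Mann-Whitney rank-sum statistic, with average ranks for ties."""
    pairs = sorted(zip(y_prob, y_true))
    ranks = [0.0] * len(pairs)
    i = 0
    while i < len(pairs):
        j = i
        while j + 1 < len(pairs) and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(y for _, y in pairs)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one event and one non-event")
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def brier(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def simulate_cohort(n, rng):
    """Hypothetical stand-in for a registry: risks from Beta(1, 4), outcomes drawn from them."""
    probs = [rng.betavariate(1, 4) for _ in range(n)]
    outcomes = [1 if rng.random() < p else 0 for p in probs]
    return outcomes, probs


def subsample_spread(y, p, size, repeats, rng):
    """Standard deviation of AUC and Brier score over random subsamples of a given size."""
    aucs, briers = [], []
    for _ in range(repeats):
        chosen = rng.sample(range(len(y)), size)  # draw without replacement
        ys = [y[i] for i in chosen]
        ps = [p[i] for i in chosen]
        if 0 < sum(ys) < len(ys):  # skip subsamples with only one outcome class
            aucs.append(auc(ys, ps))
            briers.append(brier(ys, ps))
    return statistics.stdev(aucs), statistics.stdev(briers)


if __name__ == "__main__":
    rng = random.Random(42)
    y, p = simulate_cohort(20000, rng)
    print(f"full cohort: AUC={auc(y, p):.3f}  Brier={brier(y, p):.3f}")
    for size in (250, 1000, 4000):
        sd_auc, sd_brier = subsample_spread(y, p, size, 100, rng)
        print(f"n={size:5d}: SD(AUC)={sd_auc:.4f}  SD(Brier)={sd_brier:.4f}")
```

In a sketch like this, the spread of both measures should shrink as the subsample grows, which is the pattern the abstract reports for small samples. Calibration measures are left out here because the abstract does not name the two that were used.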