L Wynants (1), W Bouwmeester (2), K G M Moons (3), M Moerbeek (4), D Timmerman (5), S Van Huffel (6), B Van Calster (7), Y Vergouwe (8).
1. KU Leuven Department of Electrical Engineering-ESAT, STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, Kasteelpark Arenberg 10, Box 2446, Leuven 3001, Belgium; KU Leuven iMinds Medical IT Department, Kasteelpark Arenberg 10, Box 2446, Leuven 3001, Belgium. Electronic address: Laure.wynants@esat.kuleuven.be.
2. Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Heidelberglaan 100, 3584 CX Utrecht, The Netherlands; Pharmerit B.V., Marten Meesweg 107, Rotterdam 3068 AV, The Netherlands.
3. Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Heidelberglaan 100, 3584 CX Utrecht, The Netherlands.
4. Department of Methodology and Statistics, Utrecht University, Padualaan 14, 3584 CH Utrecht, The Netherlands.
5. KU Leuven Department of Development and Regeneration, Herestraat 49 Box 7003, Leuven 3000, Belgium; Department of Obstetrics and Gynaecology, University Hospitals Leuven, Herestraat 49, 3000 Leuven, Belgium.
6. KU Leuven Department of Electrical Engineering-ESAT, STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, Kasteelpark Arenberg 10, Box 2446, Leuven 3001, Belgium; KU Leuven iMinds Medical IT Department, Kasteelpark Arenberg 10, Box 2446, Leuven 3001, Belgium.
7. KU Leuven Department of Development and Regeneration, Herestraat 49 Box 7003, Leuven 3000, Belgium; Center for Medical Decision Sciences, Department of Public Health, Erasmus Medical Center, Wytemaweg 80, 3015 CN Rotterdam, The Netherlands.
8. Center for Medical Decision Sciences, Department of Public Health, Erasmus Medical Center, Wytemaweg 80, 3015 CN Rotterdam, The Netherlands.
Abstract
OBJECTIVES: This study aims to investigate the influence of the amount of clustering [intraclass correlation (ICC) = 0%, 5%, or 20%], the number of events per variable (EPV) or candidate predictor (EPV = 5, 10, 20, or 50), and backward variable selection on the performance of prediction models.
STUDY DESIGN AND SETTING: Researchers frequently combine data from several centers to develop clinical prediction models. In our simulation study, we developed models from clustered training data using multilevel logistic regression and validated them in external data.
RESULTS: The amount of clustering was not meaningfully associated with the models' predictive performance. The median calibration slope of models built in samples with EPV = 5 and strong clustering (ICC = 20%) was 0.71. With EPV = 5 and ICC = 0%, it was 0.72. A higher EPV was associated with better performance: the calibration slope was 0.85 at EPV = 10 and ICC = 20% and 0.96 at EPV = 50 and ICC = 20%. Variable selection sometimes led to a substantial relative bias in the estimated predictor effects (up to 118% at EPV = 5), but this had little influence on the models' performance in our simulations.
CONCLUSION: We recommend at least 10 EPV to fit prediction models in clustered data using logistic regression. Up to 50 EPV may be needed when variable selection is performed.
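To make the two design quantities in the abstract concrete, the following is a minimal sketch, not the authors' simulation code. It shows how an EPV target translates into a required sample size, and how an ICC could be mapped to a random-intercept variance. The function names, the example numbers (8 candidate predictors, 25% event rate), and the latent-variable ICC definition (between-center variance divided by between-center variance plus π²/3) are assumptions for illustration; the abstract does not state which ICC definition was used.

```python
import math

# Residual variance of the standard logistic distribution, used in the
# latent-variable definition of the ICC (an assumption, not from the abstract).
LOGISTIC_RESIDUAL_VARIANCE = math.pi ** 2 / 3


def events_per_variable(n_events, n_candidate_predictors):
    """EPV: number of events divided by the number of candidate predictors."""
    return n_events / n_candidate_predictors


def sample_size_for_epv(target_epv, n_candidate_predictors, event_rate):
    """Smallest total sample size expected to reach the target EPV.

    event_rate is the expected proportion of events (hypothetical input).
    """
    if not 0 < event_rate <= 1:
        raise ValueError("event_rate must be in (0, 1]")
    events_needed = target_epv * n_candidate_predictors
    return math.ceil(events_needed / event_rate)


def random_intercept_variance(icc):
    """Between-center variance implied by an ICC on the latent logistic scale.

    Solves icc = var / (var + pi^2 / 3) for var.
    """
    if not 0 <= icc < 1:
        raise ValueError("icc must be in [0, 1)")
    return icc * LOGISTIC_RESIDUAL_VARIANCE / (1 - icc)


if __name__ == "__main__":
    # Hypothetical scenario: 8 candidate predictors, 25% event rate.
    for epv in (5, 10, 20, 50):
        n = sample_size_for_epv(epv, n_candidate_predictors=8, event_rate=0.25)
        print(f"EPV = {epv}: total sample size of at least {n}")
    for icc in (0.0, 0.05, 0.20):
        print(f"ICC = {icc:.2f}: random-intercept variance {random_intercept_variance(icc):.3f}")
```

Under these assumptions, the recommended minimum of 10 EPV scales linearly with the number of candidate predictors, so the count of candidates before any backward selection, not the number of predictors retained, drives the required sample size.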
Authors: Parambir S Dulai; Brigid S Boland; Siddharth Singh; Khadija Chaudrey; Jenna L Koliani-Pace; Gursimran Kochhar; Malav P Parikh; Eugenia Shmidt; Justin Hartke; Prianka Chilukuri; Joseph Meserve; Diana Whitehead; Robert Hirten; Adam C Winters; Leah G Katta; Farhad Peerani; Neeraj Narula; Keith Sultan; Arun Swaminath; Matthew Bohm; Dana Lukin; David Hudesman; John T Chang; Jesus Rivera-Nieves; Vipul Jairath; G Y Zou; Brian G Feagan; Bo Shen; Corey A Siegel; Edward V Loftus; Sunanda Kane; Bruce E Sands; Jean-Frederic Colombel; William J Sandborn; Karen Lasch; Charlie Cao Journal: Gastroenterology Date: 2018-05-30 Impact factor: 22.682
Authors: Berk Ustun; Lenard A Adler; Cynthia Rudin; Stephen V Faraone; Thomas J Spencer; Patricia Berglund; Michael J Gruber; Ronald C Kessler Journal: JAMA Psychiatry Date: 2017-05-01 Impact factor: 21.596
Authors: Jeffrey S Schachar; Hemikaa Devakumar; Laura Martin; Sara Farag; Eric A Hurtado; G Willy Davila Journal: Int Urogynecol J Date: 2018-03-19 Impact factor: 2.894
Authors: Dan J Stein; Elie G Karam; Victoria Shahly; Eric D Hill; Andrew King; Maria Petukhova; Lukoye Atwoli; Evelyn J Bromet; Silvia Florescu; Josep Maria Haro; Hristo Hinkov; Aimee Karam; María Elena Medina-Mora; Fernando Navarro-Mateu; Marina Piazza; Arieh Shalev; Yolanda Torres; Alan M Zaslavsky; Ronald C Kessler Journal: BMC Psychiatry Date: 2016-07-22 Impact factor: 3.630