Ben Van Calster (1,2), Ewout W Steyerberg (2), Gary S Collins (3), Tim Smits (4).
1. Department of Development and Regeneration, KU Leuven, Leuven, Belgium.
2. Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, the Netherlands.
3. Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford, UK.
4. Institute for Media Studies, KU Leuven, Leuven, Belgium.
Abstract
BACKGROUND: Despite regular criticisms of null hypothesis significance testing (NHST), a focus on testing persists, sometimes in the belief that it is needed to get published and sometimes encouraged by journal reviewers. This paper aims to demonstrate known key limitations of NHST using simple nontechnical illustrations. DESIGN: The first illustration is based on simulated data of 20 000 studies that compare two groups for an outcome event. The true effect size (difference in event rates) and sample size (20-100 per group) were varied. The second illustration used real data from a meta-analysis on alpha-blockers for the treatment of ureteric stones. RESULTS: The simulations demonstrated the large between-study variability in P-values (range between <.0001 and 1 for most simulation conditions). A focus on statistically significant effects (P < .05), notably in small to moderate samples, led to strongly overestimated effect sizes (up to 240%) and many false-positive conclusions, that is, statistically significant effects that were, in fact, true null effects. Effect sizes also exhibited strong between-study variability, but confidence intervals accounted for this: the interval width decreased with larger sample size, and the percentage of intervals that contained the true effect size was accurate across simulation conditions. Reducing the alpha level, as recently suggested, reduced false-positive conclusions but strongly increased the overestimation of significant effects (up to 320%). CONCLUSIONS: Researchers and journals should abandon statistical significance as a pivotal element in most scientific publications. Confidence intervals around effect sizes are more informative, but should not merely be reported to comply with journal requirements.
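The kind of simulation described under DESIGN can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the abstract does not name the statistical test or the interval method, so the pooled two-proportion z-test, the Wald 95% interval, the event rates (0.3 vs 0.5), the number of simulated studies, and the function names `simulate_study` and `summarize` are all assumptions made for the example.

```python
import math
import random


def two_sided_p(z):
    """Two-sided P-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_study(rng, n, p_control, p_treated):
    """Simulate one two-group study with n participants per group.

    Returns (estimated risk difference, P-value, CI lower, CI upper).
    Assumed methods: pooled z-test for the P-value, Wald 95% interval.
    """
    x0 = sum(rng.random() < p_control for _ in range(n))
    x1 = sum(rng.random() < p_treated for _ in range(n))
    r0, r1 = x0 / n, x1 / n
    diff = r1 - r0
    pooled = (x0 + x1) / (2 * n)
    se_test = math.sqrt(2 * pooled * (1 - pooled) / n)
    p_value = two_sided_p(diff / se_test) if se_test > 0 else 1.0
    se_ci = math.sqrt(r0 * (1 - r0) / n + r1 * (1 - r1) / n)
    return diff, p_value, diff - 1.96 * se_ci, diff + 1.96 * se_ci


def summarize(n, p_control, p_treated, n_studies=2000, alpha=0.05, seed=1):
    """Repeat the study many times and summarize what a focus on P < alpha does."""
    rng = random.Random(seed)
    true_diff = p_treated - p_control
    studies = [simulate_study(rng, n, p_control, p_treated) for _ in range(n_studies)]
    p_values = [p for _, p, _, _ in studies]
    significant = [d for d, p, _, _ in studies if p < alpha]
    covered = sum(lo <= true_diff <= hi for _, _, lo, hi in studies)
    return {
        "true_diff": true_diff,
        # Share of studies with P < alpha (power when true_diff != 0).
        "power": len(significant) / n_studies,
        # Mean estimate among significant studies only (any direction).
        "mean_significant": (
            sum(significant) / len(significant) if significant else float("nan")
        ),
        # Share of 95% intervals that contain the true difference.
        "coverage": covered / n_studies,
        "p_min": min(p_values),
        "p_max": max(p_values),
    }


if __name__ == "__main__":
    for alpha in (0.05, 0.005):
        print(alpha, summarize(n=50, p_control=0.3, p_treated=0.5, alpha=alpha))
```

Under these assumed settings, the mean estimate among significant studies exceeds the true difference, and it moves further from the truth when alpha is lowered, while the interval coverage stays close to its nominal level. The exact figures depend on the assumed rates, sample size, and seed, and are not meant to reproduce the percentages reported in the abstract.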