| Literature DB >> 35582935 |
Daniel Berner, Valentin Amrhein.
Abstract
A paradigm shift away from null hypothesis significance testing seems in progress. Based on simulations, we illustrate some of the underlying motivations. First, p-values vary strongly from study to study; hence, dichotomous inference using significance thresholds is usually unjustified. Second, 'statistically significant' results have overestimated effect sizes, a bias that declines with increasing statistical power. Third, 'statistically non-significant' results have underestimated effect sizes, and this bias gets stronger with higher statistical power. Fourth, the tested statistical hypotheses usually lack biological justification and are often uninformative. Despite these problems, a screen of 48 papers from the 2020 volume of the Journal of Evolutionary Biology exemplifies that significance testing is still used almost universally in evolutionary biology. All screened studies tested default null hypotheses of zero effect against the default significance threshold of p = 0.05; none presented a pre-specified alternative hypothesis, a pre-study power calculation, or the probability of 'false negatives' (beta error rate). The results sections of the papers presented 49 significance tests on average (median 23, range 0-390). Of 41 studies that contained verbal descriptions of a 'statistically non-significant' result, 26 (63%) falsely claimed the absence of an effect. We conclude that studies in ecology and evolutionary biology are mostly exploratory and descriptive. We should thus shift from claiming to 'test' specific hypotheses statistically to describing and discussing many hypotheses (possible true effect sizes) that are most compatible with our data, given our statistical model. We already have the means for doing so, because we routinely present compatibility ('confidence') intervals covering these hypotheses.
Keywords: compatibility interval; effect size; null hypothesis; p-value; scientific method; statistical inference
Year: 2022 PMID: 35582935 PMCID: PMC9322409 DOI: 10.1111/jeb.14009
Source DB: PubMed Journal: J Evol Biol ISSN: 1010-061X Impact factor: 2.516
FIGURE 1. P‐values and effect sizes in statistical hypothesis tests in relation to sample size. Shown are summary statistics based on null hypothesis significance tests of a simulated true correlation between a predictor and a response variable, for sample sizes ranging from eight to 100 in increments of two. The simulated true effect sizes are Pearson correlation coefficients that were chosen to be relatively large (r = 0.45) in (a) and smaller (r = 0.24) in (b). For each sample size, 10,000 replicate bivariate data sets were simulated and tested. In the upper panels, the blue bars represent the central 90% (light blue) and 50% (darker blue) of the p‐value distribution among the replicate tests, and the short horizontal lines are the medians. The dark blue bullet points indicate power, which is the fraction of tests that are ‘statistically significant’ (p < 0.05; significance threshold shown as black horizontal line). For the same set of simulations, the lower panels visualize the distribution of the effect size estimates, which are the correlation coefficients (r) observed in the samples. The effect sizes are presented separately for the subset of replicate tests that were ‘statistically significant’ (orange) or ‘statistically non‐significant’ (purple). The light and darker bars represent the central 90% and 50% of the effect sizes observed across the replicate tests, and the short horizontal lines are the medians. The true effect sizes underlying the simulations are shown as black horizontal lines. Note that in (b), the 90% interval for the significant tests with n = 8 extended to −0.75 but was truncated to facilitate presentation; with small sample size, the effect size distribution of significant tests was bimodal because a fraction of the significant tests had strong negative correlations. Overall, the simulations highlight that when the true effect size and/or the sample size is modest, p‐values are highly variable, and most ‘statistically significant’ effect estimates are biased upwards. Analogously, ‘statistically non‐significant’ effect estimates tend to underestimate the true effect size, and here the bias gets stronger with increasing sample size and/or when the true effect size is substantial.
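The Figure 1 simulation is straightforward to reproduce from the caption alone. The following is a minimal Python sketch, not the authors' code: bivariate normal data, the use of NumPy/SciPy, all names, and the reduced replicate count (the paper uses 10,000 per sample size) are our assumptions.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    def simulate(true_r, n, n_rep=10_000, alpha=0.05):
        """Simulate n_rep bivariate samples of size n with true correlation
        true_r; return observed r, p-values, and power (share of p < alpha)."""
        cov = [[1.0, true_r], [true_r, 1.0]]
        r_obs = np.empty(n_rep)
        p_obs = np.empty(n_rep)
        for i in range(n_rep):
            x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
            r_obs[i], p_obs[i] = stats.pearsonr(x, y)
        return r_obs, p_obs, (p_obs < alpha).mean()

    for true_r in (0.45, 0.24):                 # panels (a) and (b)
        for n in range(8, 101, 2):              # sample sizes 8..100, step 2
            r_obs, p_obs, power = simulate(true_r, n, n_rep=2_000)
            sig = p_obs < 0.05
            # median effect estimate among 'significant' vs 'non-significant' tests
            r_sig = np.median(r_obs[sig]) if sig.any() else float('nan')
            r_ns = np.median(r_obs[~sig]) if (~sig).any() else float('nan')
            print(f"r={true_r:.2f} n={n:3d} power={power:.2f} "
                  f"median r|sig={r_sig:+.2f} median r|ns={r_ns:+.2f}")

Even this rough sketch reproduces the caption's pattern: at small n, mostly inflated sample correlations reach p < 0.05, so the median 'significant' estimate sits well above the true r, while the median 'non-significant' estimate sits below it.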
FIGURE 2. Visualizing the range of values for the true effect size (or in other words, of hypotheses) that are most compatible with the observed data, given the statistical model, by means of compatibility curves. The two curves illustrate the most compatible values for the true Pearson correlation coefficients based on two exemplary simulated samples of n = 30 and n = 80, generated using the bivariate simulation model underlying Figure 1a. Unlike in real research, the true correlation coefficient is known to be r = 0.45 (dashed vertical line). The black horizontal line under the left curve shows the 95% compatibility (‘confidence’) interval based on the n = 30 sample. Here, one of the many values that are most compatible is a zero relationship (solid vertical line). Because zero is included, this interval would traditionally be called ‘statistically non‐significant’, although zero is clearly not the value most compatible with the data: zero is not at the highest point of the compatibility curve. One can imagine the compatibility curve as horizontally stacked compatibility intervals, with compatibility levels ranging from near zero to one; from the bottom, the lowest interval is approximately the 100%‐interval and the highest is the 0%‐interval. The peak of the curve is thus the shortest (0%) compatibility interval that is just one point, known as the point estimate. This point estimate, which is the observed effect size, is the correlation coefficient estimate that is most (100%) compatible with the sample data and the statistical model (but because many other hypotheses are also reasonably compatible, 100% compatibility does not imply truth). The curve was drawn by determining the stacked compatibility intervals non‐parametrically based on quantiles from a distribution obtained by bootstrapping the original samples and recalculating the correlation coefficient 100,000 times, but a similar curve would arise when stacking conventional parametric ‘confidence’ intervals (see Appendix S1). Another way to interpret the compatibility curve is that it indicates the p‐values one would obtain, given the sample data and the statistical model, when using a correlation coefficient given on the x‐axis as null hypothesis in a test. The 95% interval shown therefore covers correlation coefficients that have p > 0.05 and are thus most compatible with the data and the model. For more details on the interpretation of compatibility curves, see section 4.3 of this paper as well as Infanger and Schmidt‐Trucksäss (2019), Poole (1987), and Rafi and Greenland (2020).
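The bootstrap construction of the compatibility curve can likewise be sketched. The quantile-stacking idea follows the caption; everything else (NumPy, the vectorized correlation, the bivariate normal sample, the particular levels printed) is our assumption, not the paper's code.

    import numpy as np

    rng = np.random.default_rng(2)

    def boot_r(x, y, n_boot=100_000):
        """Resample (x, y) pairs with replacement and recompute Pearson's r
        for each bootstrap replicate."""
        idx = rng.integers(0, len(x), size=(n_boot, len(x)))
        xb = x[idx] - x[idx].mean(axis=1, keepdims=True)
        yb = y[idx] - y[idx].mean(axis=1, keepdims=True)
        return (xb * yb).sum(axis=1) / np.sqrt(
            (xb ** 2).sum(axis=1) * (yb ** 2).sum(axis=1))

    # one exemplary sample of n = 30 from the Figure 1a model (true r = 0.45)
    cov = [[1.0, 0.45], [0.45, 1.0]]
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=30).T
    r_boot = boot_r(x, y)

    # 'stacked' central compatibility intervals, from the point estimate (0%)
    # up to the conventional 95% interval
    for level in (0.00, 0.25, 0.50, 0.75, 0.95):
        lo, hi = np.quantile(r_boot, [(1 - level) / 2, (1 + level) / 2])
        print(f"{level:4.0%} interval: [{lo:+.2f}, {hi:+.2f}]")

Plotting the level against the interval endpoints traces out the curve: if zero lies inside the 95% interval, the result would traditionally be called 'non-significant', yet zero is just one of many compatible values, and rarely the most compatible one.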