Antonia Krefeld-Schwalb, Erich H Witte, Frank Zenker.
Abstract
In psychology as elsewhere, the main statistical inference strategy to establish empirical effects is null-hypothesis significance testing (NHST). The recent failure to replicate allegedly well-established NHST-results, however, implies that such results lack sufficient statistical power, and thus feature unacceptably high error-rates. Using data-simulation to estimate the error-rates of NHST-results, we advocate the research program strategy (RPS) as a superior methodology. RPS integrates Frequentist with Bayesian inference elements, and leads from a preliminary discovery against a (random) H0-hypothesis to a statistical H1-verification. Not only do RPS-results feature significantly lower error-rates than NHST-results, RPS also addresses key-deficits of a "pure" Frequentist and a standard Bayesian approach. In particular, RPS aggregates underpowered results safely. RPS therefore provides a tool to regain the trust the discipline had lost during the ongoing replicability-crisis.
Keywords: Bayes' theorem; Wald criterion; inferential statistics; likelihood; replication; research program strategy; t-test
Year: 2018 PMID: 29740363 PMCID: PMC5928294 DOI: 10.3389/fpsyg.2018.00460
Source DB: PubMed Journal: Front Psychol ISSN: 1664-1078
Figure 1. The six steps of the research program strategy (RPS).
The estimated minimum sample size for a two sample t-test as a function of test-power (1–β) and effect size δ, given α = 0.05.
| 1–β | δ = 0.01 | δ = 0.2 | δ = 0.5 | δ = 0.8 |
| 0.4 | 38,726 | 97 | 15 | 6 |
| 0.5 | 54,111 | 135 | 22 | 8 |
| 0.8 | 123,651 | 309 | 49 | 19 |
| 0.95 | 216,443 | 541 | 87 | 34 |
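The sample sizes in this table can be approximated with the textbook normal-approximation formula for a one-sided two-sample test, n = 2(z₁₋α + z₁₋β)² / δ² per group. The sketch below is an illustration only: the source does not state how its values were computed, and the function name `sample_size_per_group`, the one-sided test, and rounding to the nearest integer are all assumptions made here.

```python
from statistics import NormalDist


def sample_size_per_group(delta, power, alpha=0.05):
    """Approximate n per group for a one-sided two-sample test.

    Normal approximation: n = 2 * (z_{1-alpha} + z_{power})**2 / delta**2.
    Assumption: rounding to the nearest integer (not rounding up).
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)  # critical value of the test
    z_power = std_normal.inv_cdf(power)      # quantile matching the desired power
    return round(2 * (z_alpha + z_power) ** 2 / delta ** 2)


if __name__ == "__main__":
    for power in (0.4, 0.5, 0.8, 0.95):
        print(power, sample_size_per_group(0.2, power))
```

Under these assumptions, δ = 0.2 with (1–β) = 0.8 gives 309, in line with the tabled value. An exact calculation based on the noncentral t-distribution would yield slightly larger numbers.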
The proportion P of substantial discoveries, indicated by p-values below the significance level α = 0.05, as a function of the effect-size δ and test-power (1–β).
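| 1–β | P(p < α), δ = 0.01 | σ(p), δ = 0.01 | P(p < α), δ = 0.2 | σ(p), δ = 0.2 | P(p < α), δ = 0.5 | σ(p), δ = 0.5 | P(p < α), δ = 0.8 | σ(p), δ = 0.8 |
| 0.4 | 0.40 | 0.23 | 0.41 | 0.17 | 0.31 | 0.20 | 0.30 | 0.19 |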
| 0.5 | 0.56 | 0.14 | 0.42 | 0.17 | 0.49 | 0.16 | 0.39 | 0.17 |
| 0.8 | 0.84 | 0.07 | 0.85 | 0.09 | 0.76 | 0.10 | 0.78 | 0.10 |
| 0.95 | 0.95 | 0.03 | 0.98 | 0.04 | 0.98 | 0.02 | 0.95 | 0.03 |
P(p < α) = proportion of significant results; σ(p) = standard deviation of p-value.
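A proportion of significant results such as P(p < α) can be estimated by repeatedly simulating two groups and counting rejections of the H0. The following sketch is not the authors' simulation code. It assumes unit-variance normal data, a one-sided z-test with known variance in place of the t-test, and a fixed random seed; the function name `simulate_rejection_rate` and its parameters are hypothetical.

```python
import random
from statistics import NormalDist, mean


def simulate_rejection_rate(delta, n, alpha=0.05, runs=2000, seed=1):
    """Estimate how often a one-sided two-sample z-test rejects H0.

    Assumptions: both groups are normal with standard deviation 1,
    n observations per group, and the true mean difference is delta.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    se = (2 / n) ** 0.5  # standard error of the mean difference
    rejections = 0
    for _ in range(runs):
        treatment = [rng.gauss(delta, 1) for _ in range(n)]
        control = [rng.gauss(0, 1) for _ in range(n)]
        z = (mean(treatment) - mean(control)) / se
        if z > z_crit:
            rejections += 1
    return rejections / runs


if __name__ == "__main__":
    # True effect present: the rate should be close to the planned power.
    print(simulate_rejection_rate(delta=0.5, n=49))
    # H0 true: the rate should be close to alpha.
    print(simulate_rejection_rate(delta=0.0, n=49))
```

With δ = 0.5 and n = 49 per group, the estimate should land near 0.8; with δ = 0 it should land near α = 0.05, up to simulation noise.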
Figure 2. Illustration of true positives. Bar plots indicate the frequencies of likelihood ratios ( set in light gray, and in dark gray) that, respectively, fall above the criterion (two leftmost bars), between this criterion and three (two middle bars), and below three (two rightmost bars), as a function of induction quality of data, provided the H1 is true, under α = 0.05 [itself defined via d and (1–β), the latter here abbreviated as “pow”].
The proportion of preliminary verifications as per , given the empirical effect-size d lies outside the interval comprising 95% of expected values placed around the H1, where .
| 0.4 | 0.07 | 0.03 | 0.02 | 0.00 |
| 0.5 | 0.08 | 0.05 | 0.04 | 0.00 |
| 0.8 | 0.05 | 0.04 | 0.05 | 0.04 |
| 0.95 | 0.05 | 0.02 | 0.04 | 0.02 |
pdf, Probability density function; P.
The proportion of substantial verifications, after substantial discoveries and subsequent preliminary verifications were obtained, given the H0 had been substantially falsified.
| 0.4 | 0.09 | 0.05 | 0.04 | 0.03 |
| 0.5 | 0.20 | 0.14 | 0.20 | 0.15 |
| 0.8 | 0.53 | 0.47 | 0.53 | 0.56 |
| 0.95 | 0.67 | 0.75 | 0.73 | 0.79 |
Sample size for a t-test as a function of δ, given α = β = 0.01.
| δ = 0.2 | δ = 0.5 | δ = 0.8 |
| 1,082 | 173 | 68 |
The proportion of substantial discoveries (indicated by the p-value) as a function of δ, given α = β = 0.01.
| 1 | <0.001 | 0.98 | 0.004 | 0.99 | 0.002 |
The proportion of substantial falsifications and preliminary verifications, indicated by the respective LR, as a function of δ under α = β = 0.01.
| 0.2 | 0.5 | 0.8 | 0.2 | 0.5 | 0.8 |
| 0.97 | 0.97 | 0.98 | 0.90 | 0.95 | 0.92 |
The proportion of preliminary verifications as per , where the empirical effect size d, however, lies outside the area spanned by the 95%-interval of expected values centered on the H1, and where .
| 0.2 | 0.5 | 0.8 |
| 0.06 | 0.04 | 0.04 |
pdf, Probability density function; P.
The proportion of substantial verifications (subsequent to achieving substantial discoveries and preliminary verifications), given that the H0 was substantially falsified under α = β = 0.01.
| 0.99 | 0.86 | 0.86 | 0.91 |
The proportion of false positives, where the sample size, N, is obtained by a priori power analysis, given δ = 0.2 and where (1–β) = [0.4, 0.5, 0.8, 0.95].
| 97 | 0.04 | 0.29 |
| 135 | 0.03 | 0.28 |
| 309 | 0.03 | 0.29 |
| 541 | 0.05 | 0.29 |
The proportion of false substantial falsifications and false preliminary verifications using .
| 97 | 0.09 | 0.00 |
| 135 | 0.04 | 0.00 |
| 309 | 0.03 | 0.00 |
| 541 | 0.04 | 0.01 |
Figure 3. Illustration of false positives. Bar plots indicate the frequency of likelihood ratios ( in light gray and in dark gray) repeatedly falling above the criterion , between the criterion and three, and below three, as a function of the sample size, provided the H0 is true.
The proportion of false preliminary verifications using LR = 3.
| 97 | 0.10 |
| 135 | 0.09 |
| 309 | 0.07 |
| 541 | 0.00 |
The proportion of false substantial falsifications and false preliminary verifications, given one had obtained a preliminary discovery (as per the p-value and LR), after 10% of least hypothesis supporting data were removed.
| 87 | 0.36 | 0.24 | 0.19 |
| 121 | 0.40 | 0.25 | 0.18 |
| 287 | 0.80 | 0.58 | 0.47 |
| 487 | 0.95 | 0.87 | 0.80 |
The proportions of when adding the log(LR) of individually underpowered studies featuring (1–β) = [0.4, 0.5, 0.8].
| Preliminary verification | 0.71 | 0.73 | 0.75 | 0.74 |
| Substantial falsification | 0.99 | 1 | 1 | 1 |
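Adding the log(LR) of several studies amounts to multiplying their likelihood ratios, so individually weak results can jointly cross a threshold. The sketch below illustrates this under assumptions that are not taken from the source: each study reports an observed effect size d from two groups of n, its sampling distribution is normal with variance 2/n, the LR compares H1: d = δ against H0: d = 0, and the threshold is passed in by the caller. The names `log_lr` and `aggregate_log_lr` are hypothetical.

```python
import math


def log_lr(d_obs, delta, n):
    """Log likelihood ratio of H1 (d = delta) against H0 (d = 0).

    Assumption: d_obs is normally distributed with variance 2 / n,
    which gives the closed form (d_obs**2 - (d_obs - delta)**2) / (2 * var).
    """
    variance = 2 / n
    return (d_obs ** 2 - (d_obs - delta) ** 2) / (2 * variance)


def aggregate_log_lr(studies, delta, threshold):
    """Sum log(LR) over (d_obs, n) pairs and compare with log(threshold)."""
    total = sum(log_lr(d_obs, delta, n) for d_obs, n in studies)
    return total, total >= math.log(threshold)


if __name__ == "__main__":
    # Hypothetical observed effect sizes and per-group sample sizes.
    studies = [(0.15, 97), (0.25, 97), (0.22, 135)]
    # Illustrative threshold of 19, i.e. (1 - 0.05) / 0.05; an assumption.
    total, passed = aggregate_log_lr(studies, delta=0.2, threshold=19)
    print(round(total, 2), passed)
```

In this made-up example no single study reaches log(19) ≈ 2.94 on its own, while the summed log(LR) of about 3.56 does.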
The proportion of substantial falsifications and preliminary verifications, as indicated by the respective likelihood ratio (LR) meeting or exceeding the threshold .
| 0.4 | 0.12 | 0.15 | 0.11 | 0.10 | 0.25 | 0.17 | 0.22 | 0.05 |
| 0.5 | 0.26 | 0.15 | 0.19 | 0.14 | 0.43 | 0.23 | 0.20 | 0.11 |
| 0.8 | 0.67 | 0.60 | 0.60 | 0.54 | 0.50 | 0.57 | 0.59 | 0.52 |
| 0.95 | 0.95 | 0.89 | 0.83 | 0.83 | 0.74 | 0.76 | 0.81 | 0.78 |
LR, likelihood ratio; D, data.