Martijn J Schuemie, Patrick B Ryan, William DuMouchel, Marc A Suchard, David Madigan.
Abstract
Often the literature makes assertions of medical product effects on the basis of 'p < 0.05'. The underlying premise is that at this threshold there is only a 5% probability that the observed effect would be seen by chance when in reality there is no effect. In observational studies, much more than in randomized trials, bias and confounding may undermine this premise. To test this premise, we selected three exemplar drug safety studies from the literature, representing a case-control, a cohort, and a self-controlled case series design. We attempted to replicate these studies as closely as we could for the drugs studied in the original articles. Next, we applied the same three designs to sets of negative controls: drugs that are not believed to cause the outcome of interest. We observed how often p < 0.05 when the null hypothesis is true, and we fitted distributions to the effect estimates. Using these distributions, we computed calibrated p-values that reflect the probability of observing the effect estimate under the null hypothesis, taking both random and systematic error into account. An automated analysis of scientific literature was performed to evaluate the potential impact of such a calibration. Our experiment provides evidence that the majority of observational studies would declare statistical significance when no effect is present. Empirical calibration was found to reduce spurious results to the desired 5% level. Applying these adjustments to the literature suggests that at least 54% of findings with p < 0.05 are not actually statistically significant and should be reevaluated.
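The calibration described in the abstract can be sketched in Python: fit a normal "empirical null" to the log effect estimates of the negative controls, allowing for each estimate's own sampling error, then judge a new estimate against that fitted null rather than against N(0, se²). This is a simplified illustration of the idea, not the authors' exact implementation; the function names and the simulated demo data are hypothetical.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

def fit_null(log_rr, se):
    """Fit a normal empirical null N(mu, sigma^2) to negative-control log
    effect estimates by maximum likelihood, inflating each estimate's
    variance by its own squared standard error."""
    log_rr, se = np.asarray(log_rr), np.asarray(se)

    def neg_log_lik(params):
        mu, log_sigma = params
        total_sd = np.sqrt(np.exp(log_sigma) ** 2 + se ** 2)
        return -np.sum(stats.norm.logpdf(log_rr, mu, total_sd))

    res = minimize(neg_log_lik, x0=[0.0, 0.0])
    return res.x[0], float(np.exp(res.x[1]))

def calibrated_p(log_rr, se, mu, sigma):
    """Two-sided calibrated p-value of one estimate under the fitted null."""
    z = (log_rr - mu) / np.sqrt(sigma ** 2 + se ** 2)
    return 2 * stats.norm.sf(abs(z))

# Demo on simulated negative controls with true bias mean 0.2 and SD 0.1
rng = np.random.default_rng(42)
se = rng.uniform(0.1, 0.3, 200)
log_rr = rng.normal(0.2, 0.1, 200) + rng.normal(0.0, se)
mu_hat, sigma_hat = fit_null(log_rr, se)
```

An estimate whose log effect equals the fitted null mean yields a calibrated p-value of 1, as expected.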
Keywords: calibration; hypothesis testing; negative controls; observational studies
Year: 2013 PMID: 23900808 PMCID: PMC4285234 DOI: 10.1002/sim.5925
Source DB: PubMed Journal: Stat Med ISSN: 0277-6715 Impact factor: 2.373
Figure 1: Forest plots of negative controls. Lines show 95% confidence intervals. Orange indicates statistically significant estimates (two-sided p < 0.05), and blue indicates non-significant estimates.
Estimated mean and variance of the empirical null distribution for the three study designs.
| Design | Mean | Variance |
|---|---|---|
| Cohort | −0.05 | 0.54 |
| Case–control | 0.90 | 0.35 |
| SCCS | 0.79 | 0.28 |
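Taking the table values as the mean and variance of the fitted null on the log scale, a calibrated p-value can be computed for any new estimate. The sketch below uses the cohort row with a hypothetical estimate (relative risk 2.0, standard error 0.15) chosen for illustration; it is not from the paper, and adding the standard error to the null variance is a simplification.

```python
from math import erf, log, sqrt

def calibrated_p(log_rr, se, mu, sigma):
    """Two-sided p-value of log_rr under a null N(mu, sigma^2),
    widened by the estimate's own standard error."""
    z = abs(log_rr - mu) / sqrt(sigma ** 2 + se ** 2)
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))  # 2 * (1 - Phi(|z|))

# Cohort-design null from the table: mean -0.05, variance 0.54
mu, sigma = -0.05, sqrt(0.54)
# Hypothetical estimate: relative risk 2.0 with standard error 0.15
p_trad = calibrated_p(log(2.0), 0.15, 0.0, 0.0)  # traditional null N(0, se^2)
p_cal = calibrated_p(log(2.0), 0.15, mu, sigma)  # empirically calibrated
```

The estimate is highly significant under the traditional null but far from significant once the empirical null absorbs the systematic error, mirroring the paper's central finding.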
Figure 2: Calibration plots. Each subplot shows the fraction of negative controls with p < α, for different levels of α. Both traditional and calibrated p-values are shown. For the calibrated p-value, a leave-one-out design was used.
Figure 3: Traditional and calibrated significance testing. Estimates below the dashed line (gray area) have p < 0.05 using traditional p-value calculation. Estimates in the orange areas have p < 0.05 using the calibrated p-value calculation. Blue dots indicate negative controls, and the yellow diamond indicates the drugs of interest: isoniazid (A) and sertraline (B and C).
Figure 4: Effect estimates extracted from MEDLINE abstracts of observational studies using healthcare databases, by publication year. The number of estimates reaching statistical significance (p < 0.05) is estimated under four assumptions about the null distribution: no bias (traditional significance testing): mean = 0, SD = 0; small bias: mean = 0, SD = 0.25; medium bias: mean = 0.25, SD = 0.25; large bias: mean = 0.5, SD = 0.5.
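The four null-distribution scenarios in Figure 4 amount to recomputing significance for each extracted estimate under an assumed N(mean, SD²) null. The sketch below applies all four scenarios to simulated log relative risks standing in for the real MEDLINE-extracted data (the estimates and their standard errors are fabricated for illustration only).

```python
import numpy as np
from scipy import stats

def frac_significant(log_rr, se, mu, sigma, alpha=0.05):
    """Fraction of estimates with two-sided p < alpha under a null N(mu, sigma^2)."""
    log_rr, se = np.asarray(log_rr), np.asarray(se)
    z = np.abs(log_rr - mu) / np.sqrt(sigma ** 2 + se ** 2)
    return float(np.mean(2 * stats.norm.sf(z) < alpha))

# Simulated stand-in for the literature-extracted estimates (not the real data)
rng = np.random.default_rng(1)
log_rr = rng.normal(0.4, 0.6, 1000)
se = rng.uniform(0.05, 0.3, 1000)

# The four bias scenarios from Figure 4: (null mean, null SD)
scenarios = {"no bias": (0.0, 0.0), "small": (0.0, 0.25),
             "medium": (0.25, 0.25), "large": (0.5, 0.5)}
fracs = {name: frac_significant(log_rr, se, m, s)
         for name, (m, s) in scenarios.items()}
```

Widening or shifting the assumed null can only leave fewer findings significant, which is why the larger-bias scenarios in Figure 4 show fewer significant estimates.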