| Literature DB >> 25692012 |
Abstract
Ideally, any experienced investigator with the right tools should be able to reproduce a finding published in a peer-reviewed biomedical science journal. In fact, the reproducibility of a large percentage of published findings has been questioned. Undoubtedly, there are many reasons for this, but one reason may be that investigators fool themselves due to a poor understanding of statistical concepts. In particular, investigators often make these mistakes: (1) P-hacking. This is when you reanalyze a data set in many different ways, or perhaps reanalyze with additional replicates, until you get the result you want. (2) Overemphasis on P values rather than on the actual size of the observed effect. (3) Overuse of statistical hypothesis testing, and being seduced by the word "significant". (4) Overreliance on standard errors, which are often misunderstood.
Entities:
Year: 2014 PMID: 25692012 PMCID: PMC4317225 DOI: 10.1002/prp2.93
Source DB: PubMed Journal: Pharmacol Res Perspect ISSN: 2052-1707
Figure 1. The many forms of P-hacking. When you P-hack, the results cannot be interpreted at face value. Not shown in the figure is that after trying various forms of P-hacking without getting a small P value, you will eventually give up when you run out of time, funds, or curiosity.
Figure 2. The problem of ad hoc sample size selection. I simulated 10,000 experiments sampling data from a Gaussian distribution with means of 5.0 and standard deviations of 1.0, and comparing two samples with n = 5 each using an unpaired t-test. The first column shows the percentage of those experiments with a P value less than 0.05. Since both populations have the same mean, the null hypothesis is true and so (as expected) about 5.0% of the simulations have P values less than 0.05. For the experiments where the P value was higher than 0.05, I added five more values to each group. The middle column ("n = 5 + 5") shows the fraction of experiments where the P value was less than 0.05 either in the first analysis with n = 5 or after increasing the sample size to 10. For the third column, I added yet another 5 values to each group if the P value was greater than 0.05 for both of the first two analyses. Now 13% of the experiments (not 5%) have reached a P value less than 0.05. For the fourth column, I looked at all 10,000 of the simulated experiments with n = 15. As expected, very close to 5% of those experiments had P values less than 0.05. The higher fraction of "significant" findings in the n = 5 + 5 and n = 5 + 5 + 5 columns is due to the fact that I increased sample size only when the P value was high with smaller sample sizes. In many cases, when the P value was less than 0.05 with n = 5 the P value would have been higher than 0.05 with n = 10 or 15, but an experimenter seeing the small P value with the small sample size would not have increased sample size.
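The simulation described in the caption can be sketched as follows, a minimal version assuming the parameters stated there (Gaussian with mean 5.0 and SD 1.0 for both groups, unpaired t-test, sample size grown from 5 to 10 to 15 per group only when the current P value is above 0.05):

```python
# Sketch of the Figure 2 simulation. Both groups come from the SAME
# population, so every P < 0.05 here is a false positive. Drawing 15 values
# up front and analyzing growing prefixes is equivalent to adding 5 more
# values per group after each disappointing look at the data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 10_000
hits = {"n=5": 0, "n=5+5": 0, "n=5+5+5": 0, "n=15": 0}

for _ in range(n_experiments):
    a = rng.normal(5.0, 1.0, 15)
    b = rng.normal(5.0, 1.0, 15)
    p5 = stats.ttest_ind(a[:5], b[:5]).pvalue
    p10 = stats.ttest_ind(a[:10], b[:10]).pvalue
    p15 = stats.ttest_ind(a, b).pvalue
    if p5 < 0.05:                                # fixed n = 5
        hits["n=5"] += 1
    if p5 < 0.05 or p10 < 0.05:                  # one peek, then add 5 per group
        hits["n=5+5"] += 1
    if p5 < 0.05 or p10 < 0.05 or p15 < 0.05:    # two peeks
        hits["n=5+5+5"] += 1
    if p15 < 0.05:                               # fixed n = 15, no peeking
        hits["n=15"] += 1

for label, count in hits.items():
    print(f"{label}: {100 * count / n_experiments:.1f}% 'significant'")
```

The two fixed-n columns stay near the nominal 5%, while each peek-and-extend column inflates the false positive rate, matching the caption's point.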
Figure 3. The problem of hypothesizing after the results are known (HARKing). From http://www.explainxkcd.com/wiki/index.php/File:significant.png.
Figure 4. P values depend upon sample size. This graph shows P values computed by unpaired t tests comparing two sets of data. The means of the two samples are 10 and 12. The SD of each sample is 5.0. I computed a t test using various sample sizes plotted on the X axis. You can see that the P value depends on sample size. Note that both axes use a logarithmic scale.
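The curve in Figure 4 can be recomputed directly from the summary statistics in the caption (means 10 and 12, SD 5.0 in each group), without simulating raw data; the sample sizes below are arbitrary points along the X axis:

```python
# P value of an unpaired t test as a function of per-group sample size,
# holding the means (10 vs. 12) and SDs (5.0) fixed, as in Figure 4.
from scipy import stats

for n in (3, 10, 30, 100, 300, 1000):
    t, p = stats.ttest_ind_from_stats(mean1=10, std1=5.0, nobs1=n,
                                      mean2=12, std2=5.0, nobs2=n)
    print(f"n = {n:4d} per group -> P = {p:.4g}")
```

With the effect size fixed, the P value falls steadily as n grows: the same 2-unit difference is "not significant" with a handful of values per group and highly "significant" with hundreds.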
Identical P values with very different interpretations
| | Treatment 1 (mean ± SD) | Treatment 2 (mean ± SD) | Difference between means | P value | 95% CI of the difference between means |
|---|---|---|---|---|---|
| Experiment A | 1000 ± 100 | 990.0 ± 100 | 10 | 0.6 | −30 to 50 |
| Experiment B | 1000 ± 100 | 950.0 ± 100 | 50 | 0.6 | −177 to 277 |
| Experiment C | 100 ± 5.0 | 102 ± 5.0 | 2 | 0.001 | 0.8 to 3.2 |
| Experiment D | 100 ± 5.0 | 135 ± 5.0 | 35 | 0.001 | 24 to 46 |
Experiments A and B have identical P values, but the scientific conclusion is very different. The interpretation depends upon the scientific context, but in most fields Experiment A would be solid negative data proving that there either is no effect or that the effect is tiny. In contrast, Experiment B has such a wide confidence interval as to be consistent with nearly any hypothesis. Those data simply do not help answer your scientific question. Similarly, experiments C and D have identical P values, but should be interpreted differently. In most experimental contexts, experiment C demonstrates convincingly that while the difference is not zero, it is quite small. Experiment D provides convincing evidence that the effect is large.
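The table's P values and confidence intervals can be recomputed from the summary statistics alone. The table does not state the sample sizes, so the n = 3 per group below is an assumption chosen for illustration; it happens to reproduce Experiment B's row:

```python
# Unpaired t test and 95% CI of the difference, computed from summary
# statistics only. n = 3 per group is an ASSUMPTION (the table omits n).
import math
from scipy import stats

def summary_ttest(m1, sd1, m2, sd2, n):
    """P value and 95% CI of (m1 - m2) for two equal-sized groups."""
    t, p = stats.ttest_ind_from_stats(m1, sd1, n, m2, sd2, n)
    se = math.sqrt(sd1**2 / n + sd2**2 / n)   # SE of the difference
    tcrit = stats.t.ppf(0.975, 2 * n - 2)     # two-sided 95% critical value
    diff = m1 - m2
    return p, (diff - tcrit * se, diff + tcrit * se)

# Experiment B: 1000 ± 100 vs. 950 ± 100
p, (lo, hi) = summary_ttest(1000, 100, 950, 100, n=3)
print(f"P = {p:.1f}, 95% CI of the difference: {lo:.0f} to {hi:.0f}")
```

Running the same function with the other rows' summary statistics (and a plausible n) shows how a large difference with few, scattered values (Experiment B) and a tiny difference with precise data (Experiment C) can land on very different P values and intervals.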
The false discovery rate when P < 0.05.
| | P < 0.05 ("significant") | P ≥ 0.05 (not "significant") | Total |
|---|---|---|---|
| Really is an effect | 80 | 20 | 100 |
| No effect (null hypothesis true) | 45 | 855 | 900 |
| Total | 125 | 875 | 1000 |
This table tabulates the theoretical results of 1000 experiments where the prior probability that the null hypothesis is false is 10%, the sample size is large enough so that the power is 80%, and the significance level is the traditional 5%. In 100 of the experiments (10%), there really is an effect (the null hypothesis is false), and you will obtain a "statistically significant" result (P < 0.05) in 80 of these (because the power is 80%). In 900 experiments, the null hypothesis is true but you will obtain a statistically significant result in 45 of them (because the significance threshold is 5% and 5% of 900 is 45). In total, you will obtain 80 + 45 = 125 statistically significant results, but 45/125 = 36% of these will be false positives. The proportion of conclusions of "statistical significance" that are false discoveries or false positives depends on the context of the experiment, as expressed by the prior probability (here 10%).

If you do obtain a small P value and reject the null hypothesis, you will conclude that the values in the two groups were sampled from different distributions. As noted above, there may be a high chance that you made a false positive conclusion due to random sampling. But even if the conclusion is "true" from a statistical point of view and not a false positive due to random sampling, the effect may have occurred for a reason different than the one you hypothesized. When thinking about why an effect occurred, ignore the statistical calculations, and instead think about blinding, randomization, positive controls, negative controls, calibration, biases, and other aspects of experimental design.
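The arithmetic behind the table is simple enough to write down directly; this sketch uses the stated assumptions (prior probability 10%, power 80%, alpha 5%) and makes it easy to see how the false discovery rate changes with the prior:

```python
# False discovery rate among "statistically significant" results,
# recomputing the table: prior = 10%, power = 80%, alpha = 5%.
def false_discovery_rate(n_experiments, prior, power, alpha):
    real = n_experiments * prior           # experiments with a true effect (100)
    null = n_experiments - real            # null hypothesis true (900)
    true_pos = real * power                # true effects detected (80)
    false_pos = null * alpha               # false positives (45)
    return false_pos / (true_pos + false_pos)

fdr = false_discovery_rate(1000, prior=0.10, power=0.80, alpha=0.05)
print(f"False discovery rate: {fdr:.0%}")   # 45 / 125 = 36%
```

Lowering the prior (testing long-shot hypotheses) pushes the false discovery rate well above 36% even at the same power and significance threshold, which is the point the footnote makes about context.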
Figure 5. Standard error bars do not show variability and do a poor job of showing precision. The figure plots one data set six ways. The leftmost lane shows a scatter plot of every value, so is the most informative. The next lane shows a box-and-whisker plot showing the range of the data, the quartiles, and the median (whiskers can be plotted in various ways, and do not always show the range). The third lane plots the median and quartiles. This shows less detail, but still demonstrates that the distribution is a bit asymmetrical. The fourth lane plots the mean with error bars showing plus or minus one standard deviation. Note that these error bars are, by definition, symmetrical so give you no hint about the asymmetry of the data. The next two lanes are different from the others as they do not show scatter. Instead they show how precisely we know the population mean, accounting for scatter and sample size. The fifth lane shows the mean with error bars showing the 95% confidence interval of the mean. The sixth (rightmost) lane plots the mean plus or minus one standard error of the mean, which does not show variation and does a poor job of showing precision.
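The distinction the caption draws between scatter (SD) and precision (SEM and the 95% CI of the mean) is easy to demonstrate numerically. The data values below are made up for illustration:

```python
# SD vs. SEM vs. 95% CI of the mean for one (hypothetical) sample.
# The SD describes scatter among the values; the SEM (= SD / sqrt(n)) and
# the CI describe how precisely the population mean is known. Because the
# SEM is always smaller than the SD, SEM error bars understate scatter,
# and because the 95% CI is roughly twice the SEM, ±1 SEM bars also
# understate the uncertainty in the mean.
import math
from scipy import stats

data = [4.1, 5.3, 5.9, 6.2, 6.8, 7.0, 7.1, 7.5, 8.9, 11.6]  # hypothetical
n = len(data)
mean = sum(data) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))  # sample SD
sem = sd / math.sqrt(n)
tcrit = stats.t.ppf(0.975, n - 1)              # two-sided 95%, n - 1 df
ci = (mean - tcrit * sem, mean + tcrit * sem)

print(f"mean = {mean:.2f}, SD = {sd:.2f} (scatter)")
print(f"SEM = {sem:.2f} (always smaller than the SD)")
print(f"95% CI of the mean: {ci[0]:.2f} to {ci[1]:.2f} (wider than ±1 SEM)")
```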