Is the call to abandon p-values the red herring of the replicability crisis?

Victoria Savalei and Elizabeth Dunn


Keywords:  Bayes factors; confidence intervals (CIs); crisis of replicability; null hypothesis significance testing (NHST); p-values

Year:  2015        PMID: 25798124      PMCID: PMC4351564          DOI: 10.3389/fpsyg.2015.00245

Source DB:  PubMed          Journal:  Front Psychol        ISSN: 1664-1078



Introduction

In a recent article, Cumming (2014) called for two major changes to how psychologists conduct research. The first suggested change—encouraging transparency and replication—is clearly worthwhile, but we question the wisdom of the second: abandoning p-values in favor of reporting only confidence intervals (CIs) in all psychological research reports. This article has three goals. First, we correct the false impression created by Cumming that the debate about the usefulness of null hypothesis significance testing (NHST) has been won by its critics. Second, we take issue with the implied connection between the use of NHST and the current crisis of replicability in psychology. Third, while we agree with other critics of Cumming (2014) that hypothesis testing is an important part of science (Morey et al., 2014), we are skeptical that alternative hypothesis-testing frameworks, such as Bayes factors, are a solution to the replicability crisis. Poor methodological practices can compromise the validity of Bayesian and classical statistical analyses alike. When choosing between competing statistical approaches, we highlight the value of applying the same standards of evidence that psychologists demand when choosing between competing substantive hypotheses.

Has the NHST debate been settled?

Cumming (2014) claims that “very few defenses of NHST have been attempted” (p. 11). In a section titled “Defenses of NHST,” he summarizes a single book chapter by Schmidt and Hunter (1997), which in fact is not a defense but another critique, listing and “refuting” arguments for continued use of NHST. Thus, graduate students and others who are new to the field might understandably be left with the impression that the debate over NHST has been handily won by its critics, with little dissent. This impression is wrong. Indeed, the book that published Schmidt and Hunter's (1997) chapter (Harlow et al., 1997) included several defenses (e.g., Abelson, 1997; Mulaik et al., 1997), and many contributions with more nuanced and varied positions (e.g., Harris, 1997; Reichardt and Gollob, 1997). Defenses have also appeared in the field's leading peer-reviewed journals, including American Psychologist (Krueger, 2001, with commentaries) and APA's quantitative psychology journal Psychological Methods (Frick, 1996; Cortina and Dunlap, 1997; Nickerson, 2000). Nickerson (2000) provided a particularly careful and thoughtful review of the entire debate and concluded “that NHST is easily misunderstood and misused but that when applied with good judgment it can be an effective aid to the interpretation of experimental data” (abstract). Perhaps the most famous critique of the use of NHST in psychology (Cohen, 1994), published in the American Psychologist, has seen several defending commentaries (Baril and Cannon, 1995; Frick, 1995; Parker, 1995), plus a lengthier retort (Hagen, 1997). We do not believe that the debate about the appropriate use of NHST in psychology has been decisively settled. Further, the strong NHST-bashing rhetoric common on the “reformers” side of the debate may prevent many substantive researchers from feeling that they can voice legitimate reservations about abandoning the use of p-values.

Is the replicability crisis caused by NHST?

Cumming (2014) connects the current crisis in the field (e.g., Pashler and Wagenmakers, 2012) to “the severe flaws of null-hypothesis significance testing (NHST).” In our opinion, the reliance of psychologists on NHST is a red herring in the debates about the replicability crisis (see also Krueger, 2001). Cumming cites Ioannidis (2005) to draw the connection between NHST and the replicability crisis. Yet, Cumming does not explain how the fundamental problems articulated by Ioannidis (2005) could be resolved by abandoning NHST and focusing on CIs. Ioannidis (2005) described the intersecting problems that arise from running underpowered studies, conducting numerous statistical tests, and focusing only on the significant results. There is no evidence that replacing p-values with CIs will circumvent these problems. After all, p-values and CIs are based on the same information, and are thus equivalently susceptible to “hacking.” While Cumming warns that using CIs in the same way we use NHST (to reach a binary decision) would be a mistake and advocates not focusing on whether a CI includes zero, it is difficult to imagine researchers and editors ignoring this salient information. In fact, we feel that all claims about the superiority of one statistical technique over another in terms of facilitating correct interpretation and reasoning should be supported by evidence, as we would demand of any other claim made within our discipline. The only experimental study evaluating whether presenting data in terms of CIs reduces binary thinking relative to NHST did not find this to be the case (Hoekstra et al., 2012; see also Poitevineau and Lecoutre, 2001). Another purported advantage of abolishing p-values is that using CIs may make it easier to detect common patterns across studies (e.g., Schmidt, 1996). 
However, a recent experiment found that presenting the results of multiple studies in terms of CIs rather than in NHST form did not improve meta-analytic thinking (Coulson et al., 2010). It has also been argued that CIs might help improve research practices by making low power more salient, because power is directly related to the width of the confidence interval. There is some evidence that presenting data in terms of CIs rather than p-values makes people less vulnerable to interpreting non-significant results in under-powered studies as support for the null hypothesis (Fidler and Loftus, 2009; Hoekstra et al., 2012). Unfortunately, our reading of this research also suggests that using CIs pushed many participants in the opposite direction: they tended to interpret CIs that include zero as moderate evidence for the alternative hypothesis. It is worth debating which of these interpretations is more problematic, a judgment call that may depend on the nature of the research. Finally, existing data do not support the notion that CIs are more intuitive. Misinterpretations of the meaning of CIs are as widespread as misinterpretations of p-values (Belia et al., 2005; Hoekstra et al., 2014). Thus, abolishing p-values and replacing them with CIs is not a panacea. Successfully addressing the replicability crisis demands fundamental changes, such as running much larger studies (Button et al., 2013; Vankov et al., 2014), directly replicating past work (Nosek et al., 2012), publishing null results, avoiding questionable research practices that increase “researcher degrees of freedom” (Simmons et al., 2011; John et al., 2012), and practicing open science more broadly. To the extent that replacing p-values with CIs appears to be an easy, surface-level “solution” to the replicability crisis—while doing little to solve the problems that caused the crisis in the first place—this approach may distract attention from deeper, more effective changes.
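The point that p-values and CIs are based on the same information can be made concrete with a small sketch (a hypothetical z-test example, not a computation from the article): for a two-sided test of a zero null, the 95% CI excludes zero exactly when p < .05, so selecting on one statistic is selecting on the other.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1

def z_test_and_ci(mean, se, alpha=0.05):
    """Two-sided z test of H0: mu = 0 and the matching (1 - alpha) CI.
    Illustrative only: assumes a known standard error (z, not t)."""
    z = mean / se
    p = 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))
    crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha = .05
    return p, (mean - crit * se, mean + crit * se)

# Same information, two formats: the 95% CI excludes zero iff p < .05.
for mean in (0.1, 0.3, 0.5):
    p, (lo, hi) = z_test_and_ci(mean, se=0.2)
    assert (p < 0.05) == (not lo <= 0.0 <= hi)
```

Because both quantities are functions of the same estimate and standard error, any selective reporting that inflates the rate of significant p-values equally inflates the rate of CIs that exclude zero.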

Are Bayes factors the solution to the replicability crisis?

Bayes factors have gained some traction in psychology as an alternative hypothesis-testing framework (e.g., Rouder et al., 2009; Dienes, 2011; Kruschke, 2011). This approach may be logically superior in that Bayes factors directly address the relative evidence for the null hypothesis vs. the alternative. Another major advantage is that Bayes factors force researchers to articulate their hypotheses in terms of prior distributions on the effect sizes. A simple “H1: μ > 0” will no longer do the trick, and the answer to the question “Is my hypothesis supported by the data?” will depend on the exact form of that hypothesis. Decades ago, Meehl (1990) argued that such a development was needed to push the science of psychology forward. In the wake of the replicability crisis, some have argued that switching to Bayesian hypothesis testing can help remedy the bias against publishing non-significant results because, unlike NHST, Bayes factors allow researchers to establish support for the null (Dienes, 2014). More evidence is needed, however, that the switch to Bayes factors will have this effect. To the extent that the real source of publication bias is the pressure felt by journal editors to publish novel, striking findings, the rate of publication of null results will not increase, even if those null results are strongly supported by a Bayesian analysis. Further, when it comes to questionable research practices, one can “b-hack” just as one can “p-hack” (Sanborn and Hills, 2014; Simonsohn, 2014; Yu et al., 2014). In fact, Bayes factors and classic t-test values are directly related, given a set sample size and choice of prior (Rouder et al., 2009; Wetzels et al., 2011). Although some have argued that the options for “b-hacking” are more limited (e.g., Wagenmakers, 2007, in an online appendix; Dienes, 2014; Rouder, 2014), no statistical approach is immune to poor methodological practices. Furthermore, as pointed out by Simmons et al. (2011), using Bayes factors further increases “researcher degrees of freedom,” creating another potential questionable research practice (QRP), because researchers must select a prior—a subjective expectation about the most likely size of the effect—for their analyses. Although the choice of prior is often inconsequential (Rouder et al., 2009), different priors can lead to different conclusions. For example, in their critique of Bem's (2011) article on pre-cognition, Wagenmakers et al. (2011) devoted much space to the reanalysis of the data using Bayes factors, and less to pointing out the exploratory flexibility of many of Bem's (2011) analyses. Bem's response to this critique (Bem et al., 2011) was entirely about the Bayesian analyses—debating the choice of prior for psi. Given that the publication of Bem's (2011) article was one of the factors that spurred the current crisis, this statistical debate may have been a red herring, distracting researchers from the much deeper concerns about QRPs.
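The direct link between t-values and Bayes factors can be sketched with the BIC approximation described by Wagenmakers (2007); this is a rough stand-in for the JZS Bayes factors of Rouder et al. (2009), not a computation from the article. For a one-sample t test on n observations, BF01 ≈ √n · (1 + t²/(n − 1))^(−n/2), so for a fixed prior the Bayes factor is a deterministic function of t and n.

```python
import math

def bf01_from_t(t, n):
    """BIC-based approximation to the Bayes factor BF01 (evidence for the
    null relative to the alternative) for a one-sample t test with n
    observations, following Wagenmakers's (2007) unit-information shortcut:
    BF01 ~ exp((BIC_H1 - BIC_H0) / 2)."""
    return math.sqrt(n) * (1.0 + t * t / (n - 1.0)) ** (-n / 2.0)

# The same t statistic carries different evidential weight at different n:
print(bf01_from_t(2.5, 30))   # clearly below 1: evidence against the null
print(bf01_from_t(2.5, 300))  # closer to 1: nearly equivocal at larger n
print(bf01_from_t(0.0, 30))   # t = 0: evidence favoring the null
```

Because the mapping from t to the Bayes factor is monotone for fixed n and prior, selective analysis that pushes t past a threshold pushes the Bayes factor past the corresponding threshold as well, which is why “b-hacking” is possible wherever p-hacking is.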

Conclusion

We agree with Cumming (2014) that raw effect sizes and the associated CIs should routinely be reported. We also believe that Bayes factors represent an intriguing alternative to hypothesis testing via NHST. But at present, we lack empirical evidence that encouraging researchers to abandon p-values will fundamentally change the credibility and replicability of psychological research in practice. In the face of crisis, researchers should return to their core shared value: demanding rigorous empirical evidence before instituting major changes.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
References (26 in total)

1.  Null hypothesis significance testing: a review of an old and continuing controversy.

Authors:  R S Nickerson
Journal:  Psychol Methods       Date:  2000-06

2.  Interpretation of significance levels by psychological researchers: the .05 cliff effect may be overstated.

Authors:  J Poitevineau; B Lecoutre
Journal:  Psychon Bull Rev       Date:  2001-12

3.  A practical solution to the pervasive problems of p values.

Authors:  Eric-Jan Wagenmakers
Journal:  Psychon Bull Rev       Date:  2007-10

4.  The new statistics: why and how.

Authors:  Geoff Cumming
Journal:  Psychol Sci       Date:  2013-11-12

5.  Bayesian Assessment of Null Values Via Parameter Estimation and Model Comparison.

Authors:  John K Kruschke
Journal:  Perspect Psychol Sci       Date:  2011-05

6.  Editors' Introduction to the Special Section on Replicability in Psychological Science: A Crisis of Confidence?

Authors:  Harold Pashler; Eric-Jan Wagenmakers
Journal:  Perspect Psychol Sci       Date:  2012-11

7.  Power failure: why small sample size undermines the reliability of neuroscience.

Authors:  Katherine S Button; John P A Ioannidis; Claire Mokrysz; Brian A Nosek; Jonathan Flint; Emma S J Robinson; Marcus R Munafò
Journal:  Nat Rev Neurosci       Date:  2013-04-10

8.  Why hypothesis tests are essential for psychological science: a comment on Cumming (2014).

Authors:  Richard D Morey; Jeffrey N Rouder; Josine Verhagen; Eric-Jan Wagenmakers
Journal:  Psychol Sci       Date:  2014-03-06

9.  When decision heuristics and science collide.

Authors:  Erica C Yu; Amber M Sprenger; Rick P Thomas; Michael R Dougherty
Journal:  Psychon Bull Rev       Date:  2014-04

10.  Why most published research findings are false.

Authors:  John P A Ioannidis
Journal:  PLoS Med       Date:  2005-08-30
