
On reporting and interpreting statistical significance and p values in medical research.

Herman Aguinis1, Matt Vassar2, Cole Wayant3.   

Abstract

Keywords:  Education & training (see medical education & training); general medicine; health policy; statistics & research methods

Year:  2019        PMID: 31732498      PMCID: PMC8005799          DOI: 10.1136/bmjebm-2019-111264

Source DB:  PubMed          Journal:  BMJ Evid Based Med        ISSN: 2515-446X


Recent proposals to change the p value threshold from 0.05 to 0.005 or to retire statistical significance altogether have garnered much criticism and debate.1 2 As of the writing of our manuscript, the proposal to eliminate statistical significance testing, backed by over 800 signatories, achieved record-breaking status on Altmetrics, with an attention score exceeding 13 000 derived from 19 000 Twitter comments and 35 news stories. We appreciate the renewed enthusiasm for tackling important issues related to the analysis, reporting and interpretation of scientific research results. Our perspective, however, focuses on the current use and reporting of statistical significance and where we should go from here. We begin by saying that p values themselves are not flawed. Rather, the use, misuse or abuse of p values in ways antithetical to rigorous scientific pursuits is the flaw. If p values are a hammer, scientists are the hammer wielders. One would not discard the hammer if the wielder, when using the hammer, repeatedly missed the nail. Similarly, one would not discard the hammer if the wielder used the hammer in a way not suited to the hammer’s purpose, such as in an attempt to drive a screw. Rather, one would expect that the fault lies with the hammer-wielder and recommend ways to refine the hammer’s use. Thus, a focus on education and reform may be more helpful than the abandonment of statistical significance testing, which is a tool that can be used well, or misused and even abused. Similarly, in this perspective, we argue that abandoning statistical significance because scientists misuse p values does not address the underlying problems of statistical negligence. 
Similarly, it does not address the incorrect belief that statistical significance equates to clinical significance.3 The a priori significance level (ie, alpha, the type I error rate) and the precise observed probability values (ie, p) should be explicitly stated and justified in protocols and published reports of medical studies. We have examined current guidance on p value reporting in influential sources in medicine (table 1). Generally, this guidance supports reporting exact p values but fails to issue direction on specifying the a priori significance level. The 'conventional' a priori significance (ie, type I error) level in many scientific disciplines is 0.05—an arbitrary choice. Two issues arise when scientists arbitrarily default to an a priori significance level: results become misleading, and the relative seriousness of making a type I ('false-positive') or type II ('false-negative') error is ignored.
Table 1

Guidance on p value, alpha prespecification and effect size reporting from influential sources in medicine

New England Journal of Medicine [8]
On p value reporting: Unless one-sided tests are required by study design, such as in non-inferiority clinical trials, all reported p values should be two-sided. In general, p values larger than 0.01 should be reported to two decimal places, and those between 0.01 and 0.001 to three decimal places; p values smaller than 0.001 should be reported as p<0.001. Notable exceptions to this policy include p values arising from tests associated with stopping rules in clinical trials or from genome-wide association studies. P values adjusted for multiplicity should be reported when appropriate and labelled as such in the manuscript. In hierarchical testing procedures, p values should be reported only until the last comparison for which the p value was statistically significant. P values for the first non-significant comparison and for all comparisons thereafter should not be reported. When no method to adjust for multiplicity of inferences or controlling false discovery rate was specified in the protocol or SAP of a clinical trial, the report of all secondary and exploratory endpoints should be limited to point estimates of treatment effects with 95% CIs. In such cases, the Methods section should note that the widths of the intervals have not been adjusted for multiplicity and that the inferences drawn may not be reproducible. No p values should be reported for these analyses. Therefore, in most cases, no p values for interaction should be provided in the forest plots. Editors may request that p values be reported for comparisons of the frequency of adverse events among treatment groups, regardless of whether such comparisons were prespecified in the SAP. In manuscripts reporting observational studies without a prespecified method for error control, summary statistics should be limited to point estimates and 95% CIs; the Methods section should note that the widths of the intervals have not been adjusted for multiplicity and that the inferences drawn may not be reproducible. No p values should be reported for these analyses.
On alpha specification: When comparing outcomes in two or more groups in confirmatory analyses, investigators should use the testing procedures specified in the protocol and SAP to control the overall type I error—for example, Bonferroni adjustments or prespecified hierarchical procedures. For prespecified exploratory analyses, investigators should use methods for controlling the false discovery rate described in the SAP—for example, Benjamini-Hochberg procedures. If significance tests of safety outcomes (when not primary outcomes) are reported along with the treatment-specific estimates, no adjustment for multiplicity is necessary. Because information contained in the safety endpoints may signal problems within specific organ classes, the editors believe that type I error rates larger than 0.05 are acceptable. When appropriate, observational studies should use prespecified accepted methods for controlling family-wise error rate or false discovery rate when multiple tests are conducted.
On effect size reporting: Significance tests should be accompanied by CIs for estimated effect sizes, measures of association or other parameters of interest. The CIs should be adjusted to match any adjustment made to significance levels in the corresponding test.

Journal of the American Medical Association [9]
On p value reporting: Avoid solely reporting the results of statistical hypothesis testing, such as p values, which fail to convey important quantitative information. For most studies, p values should follow the reporting of comparisons of absolute numbers or rates and measures of uncertainty (eg, 0.8%, 95% CI −0.2% to 1.8%; p=0.13). P values should never be presented alone without the data that are being compared. If p values are reported, follow standard conventions for decimal places: for p values less than 0.001, report as 'p<0.001'; for p values between 0.001 and 0.01, report the value to the nearest thousandth; for p values greater than or equal to 0.01, report the value to the nearest hundredth; and for p values greater than 0.99, report as 'p>0.99'. For studies with exponentially small p values (eg, genetic association studies), p values may be reported with exponents (eg, p=1×10−5). In general, there is no need to present the values of test statistics (eg, F statistics or χ2 results) and df when reporting results.
On alpha specification: No guidance.
On effect size reporting: Meta-analyses should state the major outcomes that were pooled and include ORs or effect sizes.

The Lancet [10]
On p value reporting: P values should be given to two significant figures, unless p<0.0001.
On alpha specification: No guidance.
On effect size reporting: No guidance.

BMJ
On p value reporting: No guidance; refers readers to SAMPL. [11]
On alpha specification: No guidance.
On effect size reporting: No guidance; refers readers to SAMPL.

Annals of Internal Medicine [12]
On p value reporting: For p values between 0.001 and 0.20, please report the value to the nearest thousandth. For p values greater than 0.20, please report the value to the nearest hundredth. For p values less than 0.001, report as 'p<0.001'.
On alpha specification: No guidance.
On effect size reporting: Authors should report results for meaningful metrics rather than reporting raw results. For example, rather than reporting the log OR from a logistic regression, authors should transform coefficients into the appropriate measure of effect size, OR, relative risk or risk difference.

ICH Harmonised Tripartite Guideline: Statistical Principles for Clinical Trials E9 [13]
On p value reporting: When reporting the results of significance tests, precise p values (eg, p=0.034) should be reported rather than making exclusive reference to critical values.
On alpha specification: Conventionally, the probability of type I error is set at 5% or less or as dictated by any adjustments made necessary for multiplicity considerations; the precise choice may be influenced by the prior plausibility of the hypothesis under test and the desired impact of the results. Alternative values to the conventional levels of type I and type II errors may be acceptable or even preferable in some cases.
On effect size reporting: No guidance.

SAMPL guideline [11]
On p value reporting: Although not preferred to CIs, if desired, p values should be reported as equalities when possible and to one or two decimal places (eg, p=0.03 or 0.22), not as inequalities (eg, p<0.05). Do NOT report 'NS'; give the actual p value. The smallest p value that needs to be reported is p<0.001, save in studies of genetic associations.
On alpha specification: Report the alpha level (eg, 0.05) that defines statistical significance.
On effect size reporting: Likewise, p values are not sufficient for re-analysis. Needed instead are descriptive statistics for the variables being compared, including sample size of the groups involved, the estimate (or 'effect size') associated with the p value and a measure of precision for the estimate, usually a 95% CI.

CONSORT statement [14]
On p value reporting: Actual p values (eg, p=0.003) are strongly preferable to imprecise threshold reports, such as p<0.05.
On alpha specification: No guidance.
On effect size reporting: For each outcome, study results should be reported as a summary of the outcome in each group (eg, the number of participants with or without the event and the denominators, or the mean and SD of measurements), together with the contrast between the groups, known as the effect size.

SAP: statistical analysis plan; SAMPL: Statistical Analyses and Methods in the Published Literature; ICH: International Council for Harmonisation; CONSORT: CONsolidated Standards for Reporting Of Trials
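The decimal-place conventions collected in table 1 can be sketched in code. Below is a minimal, hypothetical illustration (the helper names are ours, and the z statistic is invented, not from any journal's tooling) of reporting an exact two-sided p value, formatted roughly in the JAMA style quoted above, alongside an explicitly stated a priori alpha.

```python
import math

def two_sided_p_from_z(z: float) -> float:
    """Exact two-sided p value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def format_p(p: float) -> str:
    """Format an exact p value roughly following the JAMA conventions in
    table 1: 'p<0.001' below 0.001, three decimals between 0.001 and 0.01,
    'p>0.99' above 0.99, and two decimals otherwise."""
    if p < 0.001:
        return "p<0.001"
    if p < 0.01:
        return f"p={p:.3f}"
    if p > 0.99:
        return "p>0.99"
    return f"p={p:.2f}"

alpha = 0.05        # a priori significance level: state it and justify it
z_observed = 1.83   # hypothetical test statistic
p = two_sided_p_from_z(z_observed)
print(f"{format_p(p)} (prespecified alpha = {alpha})")  # exact p, not 'NS'
```

Reporting the exact value (here roughly p=0.07) rather than 'NS' or 'p>0.05' is what the SAMPL entry in table 1 asks for.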

First, misleading results may fall on either side of the conventional 0.05 threshold, with scientists either rejecting or accepting the null hypothesis blindly—failing to consider sample size, measurement error and other factors that affect observed p values but are unrelated to the size of the effect in the population. Also, when considering the dichotomous interpretation of a truly continuous probability, Rosnow and Rosenthal4 sarcastically lamented that 'Surely, God loves the 0.06 nearly as much as the 0.05'. Second, the choice of an a priori significance level should be made in the context of the potential for type II error. When researchers arbitrarily default to a type I error rate of 0.05, it has been calculated that the corresponding type II error rate is approximately 60%, because statistical power (ie, the probability of correctly rejecting a false null hypothesis) is usually insufficient given small sample sizes and the pervasive and unavoidable use of less-than-perfectly reliable measures.5 6 In other words, while authors focus on whether their results show an acceptably small type I error rate, type II error—the probability of erroneously accepting the null hypothesis and incorrectly concluding that an effect is absent—looms large. Do authors, peer reviewers, editors and readers of studies that fail to reach statistical significance consider the probability that the results are falsely 'negative'? A second limitation in the current guidance is the inconsistency in mandating the reporting of effect sizes, which describe the strength of the relationship and/or the magnitude of the effect found.
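The power arithmetic behind that roughly 60% figure can be illustrated with a quick sketch. This is a normal-approximation calculation with a hypothetical effect size and sample size of our choosing, not the exact computation in the cited work.

```python
from statistics import NormalDist

def power_two_sample(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample test of a true
    standardised mean difference d, using the normal approximation
    and ignoring the negligible opposite rejection tail."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)             # critical value for two-sided alpha
    noncentrality = d * (n_per_group / 2) ** 0.5  # expected test statistic
    return z.cdf(noncentrality - z_crit)

# Hypothetical small trial: true effect d = 0.4, 32 patients per arm
power = power_two_sample(0.4, 32)
print(f"power ≈ {power:.2f}; type II error ≈ {1 - power:.2f}")
# With this (not unusual) design the type II error rate is near 64%,
# far larger than the type I rate the 0.05 threshold guards against.
```

Increasing the per-group sample size (or the reliability of the measures, which raises the observable d) is what shrinks the type II error, not the choice of threshold.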
The only information to be gleaned from p values is whether the observed data would be likely were the null hypothesis (that no effect exists) true. Therefore, a p value without an effect size is like peering into a pool of murky water: one cannot determine the depth, only that a pool likely exists. Consider interventions for improving medication adherence for patients with hypertension. A recent systematic review of medication adherence interventions found that the overall standardised mean difference for systolic blood pressure was 0.235—a 3 mm Hg difference.7 Translating mean differences to clinical differences assists in determining the practical value of the intervention. In this example, the clinician must consider whether a 3 mm Hg reduction in systolic blood pressure is clinically meaningful, weigh this reduction against the burden of enacting the intervention, and ask whether other interventions might yield a more clinically meaningful improvement. Some of the influential guidance (or the omission thereof) provided to authors in medicine (table 1) may serve to promote the poor statistical practices that readers work to mitigate. It is therefore our perspective that all guidance should not only require the reporting of effect sizes but also direct authors to interpret and report effect sizes in a meaningful way. For example, one may report the absolute difference between groups and the number needed to treat for a medical intervention. Readers may be incapable of determining the meaningfulness of a p value but are well equipped to interpret an absolute difference in effectiveness. Taken together, reporting (1) precise observed p values (rather than whether they are larger or smaller than arbitrary cutoffs), (2) effect sizes and (3) the practical importance of effect sizes (ie, their interpretation for clinical practice) would improve our understanding of the meaning of study findings.
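To make the effect-size reasoning concrete, here is a minimal sketch with entirely hypothetical trial numbers: with pooled SDs near 12.75 mm Hg, a 3 mm Hg difference reproduces the review's standardised mean difference of roughly 0.235, and invented event rates yield a number needed to treat.

```python
import math

def standardised_mean_difference(mean1, mean2, sd1, sd2, n1, n2):
    """Pooled-SD standardised mean difference (Cohen's d)."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

def number_needed_to_treat(control_event_rate, treatment_event_rate):
    """NNT = 1 / absolute risk reduction between the two groups."""
    return 1 / abs(control_event_rate - treatment_event_rate)

# Hypothetical adherence trial: systolic BP 140 vs 137 mm Hg (a 3 mm Hg drop)
d = standardised_mean_difference(140.0, 137.0, 13.0, 12.5, 150, 150)
# Hypothetical event rates: 20% uncontrolled BP vs 15% with the intervention
nnt = number_needed_to_treat(0.20, 0.15)
print(f"standardised mean difference ≈ {d:.3f}; NNT ≈ {nnt:.0f}")
```

Both outputs are the kind of absolute, clinically interpretable quantities the guidance above asks for; neither can be recovered from a p value alone.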
Let us not throw out the baby with the bathwater.
References:  5 in total

1.  Basic statistical reporting for articles published in biomedical journals: the "Statistical Analyses and Methods in the Published Literature" or the SAMPL Guidelines.

Authors:  Thomas A Lang; Douglas G Altman
Journal:  Int J Nurs Stud       Date:  2014-09-28

2.  Redefine statistical significance.

Authors:  Daniel J Benjamin; James O Berger; Magnus Johannesson; Brian A Nosek; E-J Wagenmakers; Richard Berk; Kenneth A Bollen; Björn Brembs; Lawrence Brown; Colin Camerer; David Cesarini; Christopher D Chambers; Merlise Clyde; Thomas D Cook; Paul De Boeck; Zoltan Dienes; Anna Dreber; Kenny Easwaran; Charles Efferson; Ernst Fehr; Fiona Fidler; Andy P Field; Malcolm Forster; Edward I George; Richard Gonzalez; Steven Goodman; Edwin Green; Donald P Green; Anthony G Greenwald; Jarrod D Hadfield; Larry V Hedges; Leonhard Held; Teck Hua Ho; Herbert Hoijtink; Daniel J Hruschka; Kosuke Imai; Guido Imbens; John P A Ioannidis; Minjeong Jeon; James Holland Jones; Michael Kirchler; David Laibson; John List; Roderick Little; Arthur Lupia; Edouard Machery; Scott E Maxwell; Michael McCarthy; Don A Moore; Stephen L Morgan; Marcus Munafó; Shinichi Nakagawa; Brendan Nyhan; Timothy H Parker; Luis Pericchi; Marco Perugini; Jeff Rouder; Judith Rousseau; Victoria Savalei; Felix D Schönbrodt; Thomas Sellke; Betsy Sinclair; Dustin Tingley; Trisha Van Zandt; Simine Vazire; Duncan J Watts; Christopher Winship; Robert L Wolpert; Yu Xie; Cristobal Young; Jonathan Zinman; Valen E Johnson
Journal:  Nat Hum Behav       Date:  2018-01

3.  Scientists rise up against statistical significance.

Authors:  Valentin Amrhein; Sander Greenland; Blake McShane
Journal:  Nature       Date:  2019-03

Review 4.  Blood pressure outcomes of medication adherence interventions: systematic review and meta-analysis.

Authors:  Vicki S Conn; Todd M Ruppar; Jo-Ana D Chase
Journal:  J Behav Med       Date:  2016-03-11

5.  CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials.

Authors:  Kenneth F Schulz; Douglas G Altman; David Moher
Journal:  BMJ       Date:  2010-03-23
