
To test or to estimate? P-values versus effect sizes.

Daniela Dunkler, Maria Haller, Rainer Oberbauer, Georg Heinze

Abstract

Most research in transplant medicine includes statistical analysis of observed data. Too often, authors rely solely on P-values derived from statistical tests to answer their research questions. A P-value smaller than 0.05 is typically used to declare "statistical significance" and hence to "prove" that, for example, an intervention has an effect on the outcome of interest. Especially in observational studies, such an approach is highly problematic and can lead to false conclusions. Instead, adequate estimates of the observed size of the effect should be reported, expressed, for example, as the risk difference, the relative risk, or the hazard ratio. These effect size measures must be accompanied by an estimate of their precision, such as a 95% confidence interval. Such a duo of effect size measure and confidence interval can then be used to answer the important question of clinical relevance.
© 2019 The Authors. Transplant International published by John Wiley & Sons Ltd on behalf of Steunstichting ESOT.


Keywords:  clinical significance; effect size measure; statistical inference; statistical significance; statistical tests

Year:  2019        PMID: 31560143      PMCID: PMC6972498          DOI: 10.1111/tri.13535

Source DB:  PubMed          Journal:  Transpl Int        ISSN: 0934-0874            Impact factor:   3.782


"The weather will change significantly in the next days (P = 0.04)." In a weather report, a statement like this would be inconceivable, yet such statements can be found in many scientific articles. The statement does not contain the information that a recipient is actually interested in: will it get warmer or colder? How much change in the weather do we have to expect? The P-value derived from a statistical test does not answer these questions. It leaves only a vague feeling that the weather may not stay the same. However, these questions can be answered satisfactorily by reporting an adequate measure of the size of the effect, in this example, the expected change in temperature.

Current practice in transplant research

In order to evaluate the current practice of reporting P‐values and effect sizes in transplant research, we reviewed all manuscripts published in Transplant International in 2018 in the category “clinical research” (Table 1). Among 68 summaries of retrospective studies, in 27 (40%) P‐values and measures of effect size were reported, while in 30 (44%) only P‐values were given.
Table 1

Review of all manuscripts published in Transplant International in the category “clinical research” in 2018


P‐values and effect size measures

In transplant research, as in any other scientific discipline, using a P-value as a measure of "difference" is entirely uninformative and can even be misleading 1, 2, 3, 4, 5, 6, 7, 8. (For definitions of the concepts of statistical testing and estimation, see Table 2.) As an example, consider a study comparing 5-year survival after kidney transplantation between two interventions or exposures. Generally, a P-value depends on two quantities: the observed difference between the groups and the sample size. With a sample size of 100 in each group, assuming 5-year survival probabilities of 85% and 90% in the two groups, and with complete follow-up, the P-value is 0.39. This statistically nonsignificant result with n = 100 could be falsely interpreted as evidence for lack of a difference between the groups, "proving" the null hypothesis that the intervention makes no difference in survival. Researchers more aware of the fallacies in interpreting P-values would, more cautiously and correctly, verbalize the result as "we could not find any evidence suggesting a difference between the groups." However, if 1000 patients per group were included in the study and the same survival probabilities were observed, one would compute a P-value of 0.0009 and conclude the opposite: strong evidence against the null hypothesis of no difference between the groups. Still, in both examples, the observed 5-year mortality risks in the two groups are 15% and 10%, and thus the relative risk is 1.5 (= 0.15/0.10), which is usually considered a strong effect. In the medical field, a relative risk of 1.5 for a hard outcome such as mortality or graft loss is almost never achievable with a single intervention. Therefore, prospective clinical trials are planned expecting much smaller differences in outcome, such as relative risks of 1.25.
In this example, the sample size alone determines the conclusion about the effect of the intervention, and it leads to contradictory conclusions if these are based only on the strict dichotomy of statistical significance versus nonsignificance; note that the 100 observations could simply be a subsample of the 1000.
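The two P-values in this example can be reproduced with a chi-squared test with continuity correction on the 2x2 tables of deaths and survivors. The following Python sketch uses only the standard library (the helper function and its name are ours, not from the article):

```python
import math

def yates_chi2_p(a, b, c, d):
    """Two-sided P-value from a 2x2 table [[a, b], [c, d]] using the
    chi-squared test with Yates continuity correction (1 degree of freedom)."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = 0.0
    for obs, row, col in ((a, row1, col1), (b, row1, col2),
                          (c, row2, col1), (d, row2, col2)):
        exp = row * col / n
        chi2 += (abs(obs - exp) - 0.5) ** 2 / exp
    # Survival function of chi-squared with 1 df: P(X > chi2) = erfc(sqrt(chi2/2))
    return math.erfc(math.sqrt(chi2 / 2))

# Deaths vs. survivors: 5-year mortality 15% vs. 10%
print(yates_chi2_p(15, 85, 10, 90))      # ~0.39   (n = 100 per group)
print(yates_chi2_p(150, 850, 100, 900))  # ~0.0009 (n = 1000 per group)
```

The identical observed risks yield opposite "significance" verdicts purely because the sample size changes, which is exactly the point of the example.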
Table 2

Simplified definitions of selected concepts of statistical testing and estimation

Some key ingredients to statistical testing

Null hypothesis: States that two different interventions (or exposures) lead to the same outcome, that is, that the effect size is 0 if expressed as a difference, or 1 if expressed as a ratio.
Alternative hypothesis: States that two different interventions lead to different outcomes, that is, that the effect size is not equal to 0 if expressed as a difference, or not equal to 1 if expressed as a ratio.
Statistical test: Depending on the research question (e.g., scale of the outcome), various statistical tests, like a t-test, are available. A test statistic measures the "distance" between the data and the null hypothesis. A test is only valid if its underlying assumptions are met. These assumptions do not only encompass direct assumptions of the test, like approximate normal distribution in the case of a t-test, but also assumptions about the conduct of the study, like random selection of subjects and treatment, or that no interim analyses were conducted.
P-value: To facilitate interpretation and comparison, a test statistic is usually transformed to a probability scale and expressed as a P-value. The P-value measures the compatibility of the observed data with the null hypothesis. Technically, it expresses the probability with which, given the null hypothesis was true, data with an effect size as extreme as, or more extreme than, the observed one can be obtained. The P-value cannot separate implausibility of the null hypothesis from implausibility of any of the assumptions: a small P-value gives evidence that the data are not compatible with the specified model, encompassing the null hypothesis and all assumptions. Hence, the P-value should be viewed as a continuous measure of compatibility of the data with the model, ranging from 0 (complete incompatibility) to 1 (complete compatibility) 1. Consequently, precise P-values should be presented (e.g., P = 0.07 and not P = NS or P > 0.05).

Key ingredients to estimation

Effect size estimate: Expresses the expected difference or ratio in the outcome between two interventions.
Confidence interval: Expresses the imprecision of an estimate of effect size that arises from a limited sample size. Technically, if the study could be repeated very often and the confidence level is set to 95%, then 95% of the confidence intervals computed on the study repetitions would cover the true effect size.
Clinical relevance: Based on the effect size estimate and confidence interval, in addition to subject matter knowledge and other published results, a researcher can finally answer the question of clinical relevance: "Are observed differences between the two study groups large enough to be of clinical significance?"

For methodologically correct definitions, we refer to Greenland et al. 2. For information on statistical testing, we refer to textbooks on statistics, for example, Agresti et al. 14.
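The repeated-sampling interpretation of the confidence interval given in Table 2 can be illustrated with a small simulation sketch (ours, not from the article; a Wald interval for a single proportion is used for simplicity, with a true mortality risk mirroring the running example):

```python
import math
import random

random.seed(1)  # reproducible sketch
true_p, n, reps = 0.10, 1000, 2000

covered = 0
for _ in range(reps):
    # Simulate one study: n patients, each dying with probability true_p
    deaths = sum(random.random() < true_p for _ in range(n))
    p_hat = deaths / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    # Does the 95% Wald interval of this repetition cover the true risk?
    covered += (p_hat - 1.96 * se) <= true_p <= (p_hat + 1.96 * se)

print(f"Empirical coverage: {covered / reps:.3f}")  # close to the nominal 95%
```

Across many hypothetical study repetitions, roughly 95% of the computed intervals contain the true effect, which is precisely what the confidence level promises.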

The null hypothesis implies that the effect is exactly zero. In observational studies, this is hardly ever the case, and even in prospective controlled trials, exact equality of outcomes is unlikely. If there is even a small difference, which will typically have no clinical relevance, a large enough sample will detect it and reject the null hypothesis (see most cardiovascular trials, where the sample size is usually several thousand). In reaction to this unfortunate paradox, it has been proposed to avoid the notions "statistically significant" and "statistical significance" in observational research completely 1. Statements on statistical significance for the primary outcome should be confined to randomized clinical trials, where the sample size is prespecified to detect a minimal clinically relevant difference. Generally, reporting the effect size is much more informative than a statement on statistical significance. In our example, one could report the mortality risk difference or the relative risk. With both sample sizes, the same conclusions would then be drawn: the absolute mortality risk difference is 5%, and the relative risk is 1.5 9.
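The point that both sample sizes yield identical effect sizes is a one-liner to verify (a sketch using the running example's numbers):

```python
# 5-year mortality in the running example: 15% vs. 10%
for n in (100, 1000):
    deaths1, deaths2 = round(0.15 * n), round(0.10 * n)
    p1, p2 = deaths1 / n, deaths2 / n
    # Risk difference and relative risk are sample-size invariant here
    print(f"n = {n}: risk difference = {p1 - p2:.0%}, relative risk = {p1 / p2:.2f}")
```

Unlike the P-value, which flipped from 0.39 to 0.0009, both effect size measures are identical for n = 100 and n = 1000.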

Quantifying uncertainty of estimates

To address the uncertainty that is attached to these estimates of effect size, they should always be accompanied by 95% confidence intervals. The intervals clearly reflect that a more precise estimate can be obtained with more data. In our example, the 95% confidence intervals for the mortality risk difference for sample sizes 100 and 1000 range from -4.1% to 14.1% and from 2.1% to 7.9%, respectively. These confidence intervals do not contradict each other. Rather, the confidence interval obtained with n = 100 entirely includes the interval for n = 1000; that is, what can already be concluded from n = 100 can be made more precise with the larger sample.
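The quoted intervals can be reproduced with the standard Wald formula for a difference of two proportions (a sketch; the helper name is ours):

```python
import math

def risk_diff_ci(events1, n1, events2, n2, z=1.96):
    """Risk difference with a Wald 95% confidence interval."""
    p1, p2 = events1 / n1, events2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, diff - z * se, diff + z * se

# Running example: 5-year mortality 15% vs. 10%
for n in (100, 1000):
    diff, lo, hi = risk_diff_ci(round(0.15 * n), n, round(0.10 * n), n)
    print(f"n = {n}: {diff:.0%} (95% CI {lo:.1%} to {hi:.1%})")
# n = 100:  5% (95% CI -4.1% to 14.1%)
# n = 1000: 5% (95% CI 2.1% to 7.9%)
```

The same point estimate is reported in both cases; only the width of the interval, that is, the precision, changes with the sample size.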

Emphasizing clinical relevance

Effect size measures and confidence intervals put the emphasis on aspects of clinical relevance (also denoted "clinical significance"), in contrast to the concept of statistical significance assessed by P-values. Clinical relevance is determined by whether a difference is "real and noticeable" for individuals 10. Effect size measures like the absolute or relative risk difference address this issue directly. A seeming drawback of the concept of clinical relevance is that its threshold must be determined on a case-by-case basis, as it depends on the values and preferences patients and healthcare professionals attach to the outcome under investigation, or on the existence of other effective treatments. Unlike statistical significance, clinical relevance is never determined by the data. Suppose that a mortality risk difference of 2.5% would constitute the minimal clinically relevant difference, corresponding to a number needed to treat of 40 (= 1/0.025). While the analysis with n = 1000 was adequately powered to detect a small difference in mortality risk and excluded parity, it cannot exclude a difference of less than 2.5%, as this value is covered by the confidence interval. Hence, such an analysis could not "prove" clinical relevance.
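The arithmetic behind this paragraph is brief enough to spell out (a sketch; the interval endpoints are taken from the running example above):

```python
# Minimal clinically relevant difference (MCRD) and number needed to treat
mcrd = 0.025
nnt = 1 / mcrd          # 40 patients treated to avoid one death
print(f"NNT = {nnt:.0f}")

# 95% CI for the mortality risk difference with n = 1000 per group
ci_lo, ci_hi = 0.021, 0.079
print(ci_lo > 0.0)      # True: parity (a difference of 0) is excluded
print(ci_lo >= mcrd)    # False: a difference below the MCRD is not excluded
```

Statistical significance (the interval excludes 0) and clinical relevance (the interval excludes everything below the MCRD) are thus two distinct questions, and here only the first is answered affirmatively.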

Some practical recommendations

In line with the statistical guidelines of this journal and currently effective reporting guidelines 11, we propose that any study that compares outcomes between two groups should report an appropriate estimate of effect size and provide a 95% confidence interval as a measure of precision. These measures should be interpreted regarding their clinical relevance and should be compared to reported effect size measures stemming from similar studies. Table 3 describes such effect size measures for group comparisons of outcomes typically investigated in transplantation research. An adequate effect size measure is selected depending on the scale of measurement of the outcome, that is, whether it is a continuous, binary, or time‐to‐event variable. For all presented effect size measures, 95% confidence intervals can be estimated. To conveniently obtain effect size measures with accompanying 95% confidence intervals for two proportions or two survival rates at a specific time point, our online calculator https://biometrician.shinyapps.io/effectsizeci/ can be used.
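As a sketch of what such a calculation involves, a confidence interval for the relative risk can be constructed on the log scale, which is the standard Wald approach (assumed here, not described in the article; the helper name is ours and the numbers reuse the running example):

```python
import math

def rr_with_ci(events1, n1, events2, n2, z=1.96):
    """Relative risk with a Wald 95% confidence interval on the log scale."""
    p1, p2 = events1 / n1, events2 / n2
    rr = p1 / p2
    # Standard error of log(RR)
    se = math.sqrt((1 - p1) / events1 + (1 - p2) / events2)
    lo = math.exp(math.log(rr) - z * se)
    hi = math.exp(math.log(rr) + z * se)
    return rr, lo, hi

# Running example: 5-year mortality 15% vs. 10%
print(rr_with_ci(15, 100, 10, 100))      # RR 1.5, wide interval (n = 100)
print(rr_with_ci(150, 1000, 100, 1000))  # RR 1.5, narrower interval (n = 1000)
```

With n = 100 the interval (about 0.71 to 3.18) includes 1, whereas with n = 1000 it narrows to about 1.18 to 1.90 and excludes 1, in line with the two P-values discussed earlier.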
Table 3

Some commonly used effect size measures to compare two interventions in transplantation research

Scale of the outcome | Examples | Effect size measure | Example of interpretation: "If intervention 1 is compared to intervention 2, ..."* | Statistical method for generalization
Continuous | Glomerular filtration rate, glucose level | Difference of means | "... the expected difference in glomerular filtration rate is 5 mL/min/1.73 m2." | General linear model (linear regression, ANCOVA)
Binary within a fixed, fully observable time frame | Complications during transplantation, delayed graft function | Risk difference, p1 − p2 | "... in 5% of all people the occurrence of a complication during transplantation could be avoided." | Risk prediction after logistic regression
 | | Relative risk (RR), p1/p2 | "... the probability of the occurrence of a complication during transplantation multiplies by 1.25." | Poisson regression (for rare outcome events)
 | | Odds ratio (OR), Odds1/Odds2, where Odds1 = p1/(1 − p1) | "... the odds of the occurrence of a complication during transplantation multiplies by 1.5." | Logistic regression
Binary within varying follow-up time | Incidence of acute rejection episodes | Incidence rate ratio | "... the expected number of acute rejection episodes per patient year multiplies by 1.15." | Poisson regression
Time-to-event | Patient survival, graft survival | Survival difference at t years, S2(t) − S1(t) | "... in 7% of all people graft loss within the first two years could be avoided." | Survival estimation after Cox regression
 | | Hazard ratio (HR) | "... the instantaneous mortality multiplies by 1.2." | Cox regression

The choice of effect size measure depends on the scale of the outcome. Examples of correct interpretations for comparison of two interventions are given. Statistical methods to generalize the analysis for adjustment for potential confounders or continuous exposure variables are presented. p1, p2, the observed event rates after intervention 1 or 2; S 1(t), S 2(t), the observed survival proportions at t years after interventions 1 and 2.

*For comparing exposures, change to “If exposed individuals are compared to unexposed individuals, …”

If more than two groups should be compared, separate effect size measures, each comparing one group with the reference group, can be computed. Typically, the reference group is chosen as the group receiving the standard therapy, or the group of patients with the most common characteristic.

Regression models can be used to obtain effect size measures that are generalized to continuous variables. Consider, for example, that we would like to express the excess mortality risk associated with a 10-year increase in donor age. A Cox regression model could be used to estimate the association of donor age with survival and might come to the conclusion that the hazard ratio per 10-year increase in donor age is 1.2. In this example, a hazard ratio of 1.2 per 10 years is preferred to a hazard ratio of 1.02 per year. Effect sizes for categorical variables should be reported unambiguously, for example, as a hazard ratio of 1.3 for males vs. females, not just as a hazard ratio of 1.3 for gender.

Regression models can also be used to adjust the effect size measure for potential confounders. A confounder is a variable that is associated with the risk factor (e.g., donor age) and causally related to the outcome. In our example, the estimated glomerular filtration rate of the donor could assume the role of a confounder. Adjustment for confounders is especially important in observational studies, where patients are not randomized and hence their characteristics likely have influenced the treatment decision 12. It should always be stated whether effect size estimates obtained from a regression model are unadjusted or adjusted; if they are adjusted, the adjustment variables should be stated.
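The per-year and per-decade hazard ratios in the donor-age example are related through the log hazard ratio that the Cox model actually estimates (a sketch; the specific numbers follow the text's example):

```python
# A Cox model estimates a log hazard ratio per unit of the covariate.
# Rescaling donor age from years to decades multiplies the log hazard
# ratio by 10, so the hazard ratios relate as HR_decade = HR_year ** 10.
hr_per_year = 1.02
hr_per_decade = hr_per_year ** 10
print(f"HR per 10-year increase in donor age: {hr_per_decade:.2f}")  # ~1.22, i.e., roughly 1.2
```

Reporting per decade rather than per year yields a number (about 1.2 instead of 1.02) whose clinical magnitude is easier to appreciate, even though both encode the same model coefficient.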
With continuous outcome variables, the scale of their measurement has to be reported. Generally, the International System of Units (SI units) should be preferred over other systems.

Detailed guidance on reporting of results from observational and randomized studies and other study types can be found on the website of the "Enhancing the Quality and Transparency of Health Research (EQUATOR) network" (http://www.equator-network.org, accessed 08 July 2019) 11, 13. The EQUATOR network website links to published reporting guidelines for any type of study in any medical discipline.

Summarizing, instead of solely relying on P-values to answer their research questions, authors are encouraged to present adequate effect size measures accompanied by 95% confidence intervals.

Authorship

DD, GH: wrote the paper. MH, RO: critically revised the paper.

Funding

The authors have declared no funding.

Conflicts of interest

The authors have declared no conflicts of interest.
References

1.  Confounding: what it is and how to deal with it.

Authors:  K J Jager; C Zoccali; A Macleod; F W Dekker
Journal:  Kidney Int       Date:  2007-10-31       Impact factor: 10.612

Review 2.  A dirty dozen: twelve p-value misconceptions.

Authors:  Steven Goodman
Journal:  Semin Hematol       Date:  2008-07       Impact factor: 3.851

3.  Null misinterpretation in statistical testing and its impact on health risk assessment.

Authors:  Sander Greenland
Journal:  Prev Med       Date:  2011-08-17       Impact factor: 4.018

4.  Scientists rise up against statistical significance.

Authors:  Valentin Amrhein; Sander Greenland; Blake McShane
Journal:  Nature       Date:  2019-03       Impact factor: 49.962

5.  Living with statistics in observational research.

Authors:  Sander Greenland; Charles Poole
Journal:  Epidemiology       Date:  2013-01       Impact factor: 4.822

6.  P Value: Significance Is Not All Black and White.

Authors:  Harini A Chakkera; Jesse D Schold; Bruce Kaplan
Journal:  Transplantation       Date:  2016-08       Impact factor: 4.939

7.  CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials.

Authors:  Kenneth F Schulz; Douglas G Altman; David Moher
Journal:  PLoS Med       Date:  2010-03-24       Impact factor: 11.069

8.  Relative risk versus absolute risk: one cannot be interpreted without the other.

Authors:  Marlies Noordzij; Merel van Diepen; Fergus C Caskey; Kitty J Jager
Journal:  Nephrol Dial Transplant       Date:  2017-04-01       Impact factor: 5.992

Review 9.  Strengthening the Reporting of Observational Studies in Epidemiology (STROBE): explanation and elaboration.

Authors:  Jan P Vandenbroucke; Erik von Elm; Douglas G Altman; Peter C Gøtzsche; Cynthia D Mulrow; Stuart J Pocock; Charles Poole; James J Schlesselman; Matthias Egger
Journal:  PLoS Med       Date:  2007-10-16       Impact factor: 11.069

10.  Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations.

Authors:  Sander Greenland; Stephen J Senn; Kenneth J Rothman; John B Carlin; Charles Poole; Steven N Goodman; Douglas G Altman
Journal:  Eur J Epidemiol       Date:  2016-05-21       Impact factor: 8.082
