Use and misuse of statistical significance in survival analyses.

Yoichi Furuya, Danushka K Wijesundara, Teresa Neeman, Dennis W Metzger.

Year: 2014 PMID: 24595371 PMCID: PMC3958799 DOI: 10.1128/mBio.00904-14

Source DB: PubMed Journal: mBio Impact factor: 7.867

EDITORIAL

Statistical significance, most often defined as a P value of <0.05, simply means that, if there were no true difference between groups, an observed quantitative difference at least this large would occur by chance <5% of the time; it does not necessarily imply biological significance. In the nonparametric analysis of survival data, the order of the events rather than the timing of the events is the basis for assessing differences between treatment groups. For example, if all mice in group 1 die before the mice in group 2, the results will be statistically significant regardless of whether the group 2 mice die 1 h later or 1 month later or not at all. Thus, in an experiment in which all deaths occur within a day or two but animals are monitored to determine precise survival times, group differences could be statistically significant but not biologically relevant.
With this in mind, it has become apparent that in many recent publications in various highly respected journals, animals were monitored for survival multiple times a day or even hourly to obtain statistically significant results. These experimental settings allowed statistically significant results to be obtained even when the differences in survival time were clearly of limited biological significance. For example, in many studies, deletion of a particular gene led to a 1-day decrease or increase in the median time to death with a P value as low as <0.001. Small differences accompanied by such low P values suggest that the deaths were clustered within treatment groups. This could happen if, for example, all mice in a treatment group were kept in the same cage or were monitored as a group.
To elaborate on this systematic problem, we generated three sets of hypothetical data that were designed to compare the efficacy of vaccination for protection against viral challenge. Each data set consisted of two experimental groups (mock-treated mice and vaccinated mice), and each group consisted of five mice (Fig. 1).
For the first set of data, both groups of mice were monitored once a day and all the mice died 10 days after viral challenge. This yielded a P value of 1, so the null hypothesis was not rejected and there was no indication of any efficacy of vaccination (Fig. 1A). For the second data set, the same experiment was conducted except that the mice were monitored on an hourly basis. In this case, a difference of 1 h in median time to death was observed between the two groups (Fig. 1B). In this example, all the mice died around the same time, but in the mock-treated group, all mice died before any of the mice in the vaccinated group died. This may have occurred because the vaccine actually afforded a 1-h increase in protection. Alternatively, it could be that all the mice in the mock-treated group were assessed before the vaccinated mice, so that there was a temporal delay in data gathering. In either case, the null hypothesis in this instance was clearly rejected (P = 0.0027), and the result could be used to argue that the vaccination was indeed effective in providing significant protection. Given the clustering of deaths observed, however, the results may have been due to nothing more than a cage effect, with no biological significance.
For the third data set, a different vaccine was tested and the mice were monitored daily (Fig. 1C). Although the P values are identical for the vaccination results in Fig. 1B and C, the vaccination protocol in Fig. 1C had a difference of 100% in survival rates between the mock-treated group and the vaccinated group, with at least 10 more days of protection than the results shown in Fig. 1B. The P values are identical because the log rank test considers only the order in which the animals die. Thus, from these hypothetical data sets, it is apparent that the use of statistical significance in survival analyses could be extremely deceptive. The biological effect, i.e., duration of protection, clearly needs to be assessed along with the statistical significance of the data.
A mismatch between statistical significance and biological significance could be a red flag for a poorly designed study that measures something other than treatment efficacy.
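The order-only behavior of the log rank test described above can be checked numerically. What follows is a minimal sketch in Python, not the GraphPad Prism procedure used for Fig. 1. The function name log_rank_p, the specific survival times (240 h and 241 h for hourly monitoring, censoring of survivors at day 20), and the convention of returning P = 1 when the variance is zero are all assumptions chosen to mimic the three hypothetical data sets.

```python
import math

def log_rank_p(times_a, died_a, times_b, died_b):
    """Two-sided log rank P value for two groups (1 degree of freedom).

    times_*: last observation time for each animal.
    died_*:  True if the animal died at that time, False if censored.
    """
    data = [(t, d, 0) for t, d in zip(times_a, died_a)]
    data += [(t, d, 1) for t, d in zip(times_b, died_b)]
    death_times = sorted({t for t, d, _ in data if d})

    observed_minus_expected = 0.0
    variance = 0.0
    for t in death_times:
        n_a = sum(1 for u, _, g in data if g == 0 and u >= t)  # at risk, group A
        n_b = sum(1 for u, _, g in data if g == 1 and u >= t)  # at risk, group B
        d_a = sum(1 for u, d, g in data if g == 0 and d and u == t)
        d_b = sum(1 for u, d, g in data if g == 1 and d and u == t)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += n_a * n_b * d * (n - d) / (n * n * (n - 1))

    if variance == 0:
        return 1.0  # assumed convention: groups cannot be distinguished
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2))

all_died = [True] * 5

# A: daily monitoring, every mouse dies on day 10 (times in days)
p_a = log_rank_p([10] * 5, all_died, [10] * 5, all_died)

# B: hourly monitoring, vaccinated mice die 1 h later (assumed hours)
p_b = log_rank_p([240] * 5, all_died, [241] * 5, all_died)

# B again, but vaccinated mice die 30 days later: same ordering of deaths
p_b_late = log_rank_p([240] * 5, all_died, [960] * 5, all_died)

# C: mock mice die on day 10, vaccinated mice all survive (censored, day 20)
p_c = log_rank_p([10] * 5, all_died, [20] * 5, [False] * 5)

print(p_a, round(p_b, 4), round(p_b_late, 4), round(p_c, 4))
```

Under these assumed inputs, the sketch returns 1 for the first data set and approximately 0.0027 for the other three comparisons. A 1-h delay, a 30-day delay, and complete survival are indistinguishable to the test because the ordering of deaths is the same in each case.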

FIG 1

Hypothetical data sets illustrating that statistical significance does not necessarily correlate with biological relevance. Two groups of mice (5 mice per group) were either mock treated or immunized with vaccine X and were challenged with a lethal dose of virus. Survival of the infected mice was monitored either daily (A) or hourly (B). A different vaccine, Y, was also tested for efficacy, with mice monitored daily for survival (C). P values were derived by the Kaplan-Meier log rank test using GraphPad Prism 4 software. Statistical analyses gave identical P values for panels B and C; however, it is clear that biologically significant protection was seen only in panel C. PBS, phosphate-buffered saline.

The intent of this article is to serve as a reminder that statistical and biological significance should never be used interchangeably in survival studies that attempt to predict protective efficacy. In our opinion, for acute infections that cause death in 7 to 10 days, a 3-day difference in survival is the minimum value that would warrant further development. On the other hand, a 3-day difference in survival would be biologically irrelevant for chronic infections such as tuberculosis, in which desired differences would be weeks, months, or even years. Nevertheless, we along with others (1, 2) appreciate that defining particular quantitative changes as biologically or clinically significant is subjective, context dependent, and sometimes obscure. We suggest that the biological outcome from the experiment be considered first and then statistics applied to determine if the results are likely to be due to chance. In this process, it should be remembered that a cutoff P value of 0.05 is relative; a P value of 0.1 indicates that, in the absence of a true effect, a result at least this extreme would occur by chance 10% of the time. Such a result could still reflect a biologically important effect.
It is hoped that these considerations will assist investigators in focusing more on the most promising outcomes in preclinical disease models and increase the impact of experimental results on development of effective cures or vaccines in humans.
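The suggestion to consider the biological outcome before the statistics can be sketched as a simple check of the gain in median survival against a prespecified minimum. The function names, the example survival times, and the assumption that every animal died (so that a plain median is valid, with no censoring) are hypothetical; the 3-day threshold is the value proposed above for acute infections that cause death in 7 to 10 days and would not suit chronic infections.

```python
from statistics import median

# Threshold from the text for acute infections (death in 7 to 10 days)
MIN_GAIN_ACUTE_DAYS = 3.0

def median_gain_days(control_days, treated_days):
    """Gain in median survival (days); assumes no censored animals."""
    return median(treated_days) - median(control_days)

def biologically_relevant(control_days, treated_days,
                          min_gain_days=MIN_GAIN_ACUTE_DAYS):
    """True if the median survival gain meets the prespecified minimum."""
    return median_gain_days(control_days, treated_days) >= min_gain_days

# Hypothetical data: mock-treated mice die on day 10
mock_days = [10.0] * 5

# Vaccinated mice die 1 h later (as in the hourly-monitoring scenario)
vaccinated_1h = [10.0 + 1.0 / 24.0] * 5

# Vaccinated mice die 4 days later (a hypothetical larger effect)
vaccinated_4d = [14.0] * 5

print(biologically_relevant(mock_days, vaccinated_1h))  # gain of 1 h
print(biologically_relevant(mock_days, vaccinated_4d))  # gain of 4 days
```

Only a comparison that passes such a check would then be tested for statistical significance; the threshold itself remains a subjective, context-dependent choice.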

1 in total

1. Comparative rating of consultation performance: a preliminary study and proposal for collaborative research.

Authors: I M Stanley; C A Webster; J Webster
Journal: J R Coll Gen Pract Date: 1985-08
