The role of statistical significance testing in public law and health risk assessment.

Tommaso Filippini¹, Silvio Roberto Vinceti².

Abstract

Following a fundamental statement made in 2016 by the American Statistical Association, and broad and consistent changes in data analysis and interpretation methodology in public health and other sciences, statistical significance/null hypothesis testing is being increasingly criticized and abandoned in the reporting and interpretation of the results of biomedical research. This shift in favor of a more comprehensive and non-dichotomous approach to the assessment of causal relationships may have a major impact on human health risk assessment. It is interesting to see, however, that authoritative opinions by the Supreme Court of the United States and European regulatory agencies have to some extent anticipated this tide of criticism of statistical significance testing, thus providing additional support for its demise. Current methodological evidence further warrants abandoning this approach in both the biomedical and public law contexts, in favor of a more comprehensive and flexible method of assessing the effects of toxicological exposure on human and environmental health. ©2022 Pacini Editore SRL, Pisa, Italy.

Keywords: Health risk assessment; Null hypothesis testing; Public law; Statistical significance

Year: 2022 PMID: 35647369 PMCID: PMC9121665 DOI: 10.15167/2421-4248/jpmh2022.63.1.2394

Source DB: PubMed Journal: J Prev Med Hyg ISSN: 1121-2233

Introduction

Few aspects of scientific methodology have had, and continue to have, as great an effect on causal inference and on the establishment of causal relations in science as statistical analysis and interpretation, and statistical significance testing in particular. This holds for toxicology and the biomedical sciences overall, but it also has major implications for psychological and economic research [1]. Statistical tests are becoming more complex and sophisticated, frequently relying on an advanced mathematical basis, and are widely employed in medicine and toxicology, among other sciences, to make inferences about causal relations and to inform the risk assessment of interventions such as drugs or of environmental chemicals. Among statistical tests, the most widely used is so-called “statistical significance testing”, which evaluates, by means of the p-value function, the compatibility of the data observed in a study or experiment with the null hypothesis, i.e. the hypothesis of no association between the chemical, or more generally the exposure of interest, and the study endpoints [1-4]. In particular, statistical significance testing relies on cut-points of the p-value function, e.g. p < 0.05 or p < 0.001, which are then used as reference values for null hypothesis testing, with the inevitable spread of a deleterious and erroneous dichotomous approach relying only on fixed thresholds [5]. Unfortunately, statistical significance testing has been the pillar and the tenet of risk assessment and biostatistics for decades, despite the unheeded complaints of several investigators and methodologists pointing out that the information it provides is ambiguous and confounded [5, 6].
In recent times, however, authoritative bodies and scientific communities have raised their voices against the use of p-values and statistical significance testing, calling for the demise of this approach in establishing causation and performing risk assessment [1, 7, 8]. The legal world, through pronouncements of the Supreme Court of the United States and scientific contributions by public law scholars, has likewise been advocating the same perspective, i.e., the dismissal of an approach exclusively reliant upon the existence of a dichotomous “statistical significance” in favor of a more flexible and comprehensive method based on a number of factors that include the overall statistical evidence but are not limited to it. Here we summarize the history of the use of statistical significance testing and its implications for toxicological risk assessment and for public law, anticipating that the latter will increasingly have to address these methodological issues, particularly when dealing with health risks.

Statistical significance & null hypothesis testing in public health

The statistical training of students and investigators in the biomedical field, including medicine and toxicology, and in other fields such as psychology and economics, has been greatly influenced by the famous British statistician Ronald Fisher, and more specifically by a small but extremely relevant piece of his intellectual contribution: the idea of using a single statistical test, and even more attractively a single figure, to decide whether or not results can be relied upon for causal inference [1]. Although the influential statistician was not the first to propose the use of p-values, he was the one who suggested a cut-point of 0.05, “to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach that level” [9]. In other words, Fisher proposed to start from the null hypothesis of there being no effect of the investigated “exposure,” to compute a p-value function, and to look at the intersection of that function with the effect size observed in the experiment: should the intersection fall below 0.05, the results could be considered “significant” (later taken to mean “statistically significant”). Although Fisher did not encourage disregarding results with a p-value higher than 0.05 and later tempered his position [10], his approach became the boundary line of most scientific inferences based on data analysis in the biomedical and psychological sciences. Results were “significant,” i.e., “true” and allowing rejection of the null hypothesis of no association, when the p-value was lower than 0.05, and could additionally be called “highly significant” when the p-value was < 0.001. By contrast, results exceeding this boundary line were generally dismissed, independently of the actual p-value, and the corresponding findings were deemed to be due to chance and not to reflect a causal relation.
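The dichotomy described above can be illustrated with a minimal sketch. The two studies are hypothetical, and the normal approximation and the helper names (`two_sided_p`, `fisher_label`) are assumptions made for illustration only; they are not taken from the works cited here. Two nearly identical estimates fall on opposite sides of the 0.05 line and therefore receive opposite labels.

```python
import math

def two_sided_p(estimate, std_error):
    """Two-sided p-value against a null of zero effect (normal approximation)."""
    z = abs(estimate / std_error)
    # For a standard normal Z, P(|Z| > z) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))

def fisher_label(p, cut_point=0.05):
    """Dichotomous label of the kind criticized in the text."""
    return "significant" if p < cut_point else "not significant"

# Hypothetical studies: same standard error, almost the same estimate.
study_a = two_sided_p(estimate=0.197, std_error=0.100)
study_b = two_sided_p(estimate=0.195, std_error=0.100)

for name, p in (("A", study_a), ("B", study_b)):
    print(f"study {name}: p = {p:.3f} -> {fisher_label(p)}")
```

Under these assumptions the two p-values differ only in the third decimal place, yet a fixed cut-point treats one study as evidence of an effect and the other as evidence of none.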
Unfortunately, such an approach was not accompanied by considerations such as the study sample size (which, if small, inherently increases the p-value for any observed association), the risk of bias of the study, the dose-response relation of the observed phenomena, the biological and temporal plausibility of the associations and, finally, their consistency across studies. All of these are elements of key relevance when assessing the relation of any cause or exposure to a putative effect, as originally suggested by Hill’s criteria in 1965 [11], and they remain relevant when evaluating causal relations in biomedical sciences, especially in public health and toxicology [1, 12]. In many scientific studies, and especially in risk assessments, this black and white approach led to the claim that only when p-values are below the 0.05 cut-point can we draw causal inferences and claim the existence of a causal relation between, for instance, a toxic chemical or a drug and a health endpoint. While many statisticians, methodologists, and even official agencies have long pointed out the extreme subjectivity and the serious pitfalls of an approach based on statistical significance and null hypothesis testing, it took almost one century to “officially” highlight these flaws and their most serious implications, for instance in toxicological risk assessment and in the establishment of causality in legal evaluations. While calls to consider the fallacious nature of Fisher’s claims on statistical significance and p-value cut-points had already been made [1, 13-15], it was only in 2016 that a statement by the American Statistical Association officially recognized and highlighted the problem [7].
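The dependence of the p-value on sample size noted above can be made concrete with a short sketch. The counts are hypothetical and the pooled two-proportion z-test is simply one common choice, assumed here for illustration; the function name `two_proportion_p` is not from any cited source. Both studies observe exactly the same risks (10% among the exposed, 5% among the unexposed) and differ only in size.

```python
import math

def two_proportion_p(events_1, n_1, events_2, n_2):
    """Two-sided p-value for a difference in risks (pooled z-test)."""
    risk_1, risk_2 = events_1 / n_1, events_2 / n_2
    pooled = (events_1 + events_2) / (n_1 + n_2)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = abs(risk_1 - risk_2) / std_error
    return math.erfc(z / math.sqrt(2))

# Hypothetical data: identical observed risks, different study sizes.
p_small = two_proportion_p(10, 100, 5, 100)      # 100 subjects per group
p_large = two_proportion_p(100, 1000, 50, 1000)  # 1000 subjects per group

print(f"small study: p = {p_small:.3f}")
print(f"large study: p = {p_large:.6f}")
```

With these assumed counts the small study would be labelled “not significant” and the large one “highly significant”, although the observed doubling of risk is the same in both.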
More recently, a seminal paper published in Nature [8], which received the support of a large number of scientists from many disciplines all over the world, has made clear that statistical significance testing and its use in drawing inferences is flawed and may seriously mislead the authors and the readers of scientific articles [2, 3, 16-19]. Along the same lines, an increasing number of editors of scientific journals in the fields of epidemiology and public health, medicine, and psychology have decided to ban or to discourage the reporting of results in terms of “statistical significance testing” [1, 20-23], while putting emphasis on other methodological aspects such as the magnitude and statistical precision of the estimates. Overall, there seems to be an overwhelming majority of methodologists now supporting the demise of statistical significance testing, thus precluding further use of the p-value as a tool to establish “causality” in a black and white manner in scientific research.

Recent trends in American public law on the use of statistical significance testing

Contrary to the wide and frequently uncritical propagation of statistical significance testing among scientists in the biomedical field, it is interesting to observe that the legal world has generally been more cautious in its use in scholarly inquiries, as well as in public law practice. This is arguably to the credit of the legal community’s long tradition of approaching with caution single “absolute” sources of certainty of any type (statistical significance testing undoubtedly and erroneously claiming to be one) and instead weighing the entire body of evidence for and against a specific thesis in a more balanced and nuanced way. A recent example of such a cautious and thoughtful approach, which has even become something of a paradigm, can be seen in the 2011 case Matrixx Initiatives, Inc. v. Siracusano [24], a seminal decision by the United States Supreme Court that has been widely commended and appreciated even beyond legal circles [25-29]. The case, involving the pharmaceutical company Matrixx Initiatives, centered on the question of “whether a plaintiff can state a claim for securities fraud based on a pharmaceutical company’s failure to disclose reports of adverse events associated with a product” if the reports did not contain statistically significant evidence that the adverse effects may be caused by the use of the product [24]. Delivered by Justice Sonia Sotomayor, the unanimous opinion (9-0) of the Court affirmed the judgment of the Court of Appeals for the Ninth Circuit, concluding that the “allegations, ‘taken collectively,’ give rise to a ‘cogent and compelling’ inference that Matrixx elected not to disclose the reports of adverse events not because [they were] meaningless but because it understood their likely effect on the market. ‘A reasonable person’ would deem the inference that Matrixx acted with deliberate recklessness (or even intent) ‘at least as compelling as any opposing inference one could draw from the facts alleged.’
We conclude, in agreement with the Court of Appeals, that respondents have adequately pleaded scienter. Whether respondents can ultimately prove their allegations and establish scienter is an altogether different question” [24]. The opinion contains several notable statements that directly address the core of the statistical issue at stake, and more generally the basic limitations of statistical significance testing. For instance, the Supreme Court stated that the “lack of statistically significant data does not mean that medical experts have no reliable basis for inferring a causal link between a drug and adverse events” and that “medical experts rely on other evidence to establish an inference of causation.” In addition, the Supreme Court emphasized that “medical professionals and researchers do not limit the data they consider to the results of randomized clinical trials or to statistically significant evidence.” Moreover, “the FDA similarly does not limit the evidence it considers for purposes of assessing causation and taking regulatory action to statistically significant data. In assessing the safety risk posed by a product, the FDA considers factors such as ‘strength of the association,’ ‘temporal relationship of product use and the event,’ ‘consistency of findings across available data sources,’ ‘evidence of a dose-response for the effect,’ ‘biologic plausibility,’ ‘seriousness of the event relative to the disease being treated,’ ‘potential to mitigate the risk in the population,’ ‘feasibility of further study using observational or controlled clinical study designs,’ and ‘degree of benefit the product provides, including availability of other therapies’”. The opinion also includes other statements supporting the conclusion that statistical significance is not required (and in some cases not achievable) to consider the possibility of a causal relation between an exposure and an adverse health effect.
Overall, the opinion represents an excellent example of correct handling of the concept of statistical significance, on the understanding that the lack of statistical significance cannot be used as a surrogate indicator of the absence of causal relations. This approach is highly relevant since it goes beyond the traditional approach based on the conventional p-value cut-points of 0.05/0.001, denying null hypothesis testing according to Fisher’s rule a key role in establishing (or refuting) proof of causation. Unsurprisingly, many scholars have expressed appreciation for this opinion, which shows how public law theory can take on board a correct approach to a highly specific and “sophisticated” statistical concept such as statistical significance/null hypothesis testing [25-29]. Indeed, the issues raised in this seminal ruling by the Supreme Court have long been known to public law scholarship, as comprehensively illustrated in a relevant paper by David Kaye published as early as 1986 in the Washington Law Review [30]. Most recently, the U.S. Supreme Court returned to the topic of statistical significance testing in Brnovich v. Democratic National Committee, argued in March 2021 and decided in July 2021 [31]. Rather than risk assessment and public health, the case dealt with election law and its impact on access to voting. The Democratic National Committee had filed suit against the State of Arizona’s election law since it allegedly “had an adverse and disparate effect on the State’s American Indian, Hispanic, and African-American citizens,” and had been enacted “with discriminatory intent.” For the purposes of this article, the interesting aspect lies in the statistical significance argument employed by Justice Elena Kagan in her dissenting opinion, where she affirms that Section 2 of the Voting Rights Act of 1965 “demands proof of a statistically significant racial disparity in electoral opportunities” to strike down election rules.
Adhering to the reasoning of the Circuit Court, which had voided the District Court’s initial dismissal of the suit, Kagan concludes that in the case at hand “Arizona’s policy creates a statistically significant disparity between minority and white voters.” However, the Court’s majority opinion, written by Justice Samuel Alito, rejected what is described as a “procrustean” interpretation of Section 2 of the Voting Rights Act. Citing the Federal Judicial Center’s Reference Manual on Scientific Evidence, Alito’s majority opinion recalls that “statistical significance may provide ‘evidence that something besides random error is at work,’ but does not necessarily determine causes.” Alito’s opinion finds fault with the “statistical manipulation” of emphasizing statistical differences taken out of their proper context: in that case, while it was factually true that minority voters were twice as likely as non-minority voters to have their vote discarded as an out-of-precinct ballot, the practical difference was so slight in absolute terms that the law could not be held discriminatory. As a final note, it should be emphasized that not only American public law but also European risk assessment institutions signaled, and to some extent anticipated, the shifting tide against the use and misuse of statistical significance testing. For instance, in 2011 the European Food Safety Authority (EFSA), the official body in charge of assessing the toxicity of food and food constituents, issued a relevant opinion defining how statistical significance testing should (and should not) be used in risk assessment [32]. The opinion represents a good example of the growing awareness, even before the 2016 American Statistical Association statement and the subsequent key scientific contributions, that the dichotomous approach entailed methodological pitfalls and that null hypothesis testing proved inadequate even in risk assessment, a field that generally requires a final yes/no outcome.
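The contrast drawn in the Brnovich majority opinion between a doubled relative chance and a slight absolute difference can be shown with simple arithmetic. The rates below are hypothetical placeholders chosen only to reproduce a two-fold ratio; they are not presented as the figures in the case record, and `compare_rates` is an illustrative helper.

```python
def compare_rates(rate_group_1, rate_group_2):
    """Return the rate ratio and the absolute rate difference."""
    relative = rate_group_1 / rate_group_2
    absolute = rate_group_1 - rate_group_2
    return relative, absolute

# Hypothetical proportions of ballots discarded in each group.
minority_rate = 0.010      # assumed: 1.0% of ballots
non_minority_rate = 0.005  # assumed: 0.5% of ballots

ratio, difference = compare_rates(minority_rate, non_minority_rate)
print(f"rate ratio: {ratio:.1f}")
print(f"absolute difference: {difference * 100:.1f} percentage points")
```

Under these assumed rates the ratio is two, while the absolute gap is half a percentage point, which is why the same data can be described both as a doubling and as a small difference.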
The EFSA opinion correctly highlighted the need always to report effect/risk estimates and their measures of statistical stability (such as confidence limits), and to pay attention to the real biological relevance of the effects even in the presence of small p-values and so-called statistically significant findings [32]. It is therefore not surprising that subsequent EFSA assessments and opinions have generally placed limited (if any) reliance on statistical significance testing, giving weight instead to the strength and the precision of the effect estimates, to dose-response relations, consistency across studies and study designs, quality of the studies and biological plausibility of the associations found in human studies. The convergence of legal and toxicological-epidemiologic approaches toward the rejection of statistical significance testing in risk assessment mirrors the evolution of scientific methodology and appears much better suited to account for all the complexities and uncertainties, but also the potential insights, that characterize toxicological risk assessment and its public law implications and litigation [33].
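Reporting an effect estimate together with its confidence limits, rather than a significance label, might look like the following sketch. The counts are hypothetical, and the large-sample interval computed on the log scale is one standard choice assumed here for illustration; the function name `risk_ratio_with_ci` is not taken from the EFSA opinion or any other cited source.

```python
import math

def risk_ratio_with_ci(events_exposed, n_exposed,
                       events_unexposed, n_unexposed, z=1.96):
    """Risk ratio with an approximate 95% confidence interval (log scale)."""
    rr = (events_exposed / n_exposed) / (events_unexposed / n_unexposed)
    # Large-sample standard error of log(RR).
    se_log_rr = math.sqrt(1 / events_exposed - 1 / n_exposed
                          + 1 / events_unexposed - 1 / n_unexposed)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

# Hypothetical cohort: 30/200 cases among the exposed, 20/200 among the unexposed.
rr, lower, upper = risk_ratio_with_ci(30, 200, 20, 200)
print(f"risk ratio {rr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

With these assumed counts the interval includes 1, so a significance test would report “no effect”, whereas the estimate and its limits show that the data are compatible with anything from a slightly reduced risk to a risk more than doubled.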

Conclusions

Implications of abandoning statistical significance testing in public law and health risk assessment

For the aforementioned methodological reasons, the approach taken by the U.S. Supreme Court in Matrixx v. Siracusano appears to be scientifically sound and to some extent even anticipated the methodological shift of several scientific communities, including the statistical one, which reflects the growing awareness in the public health community of the pitfalls of simply relying on a conventional black and white approach instead of a balanced assessment of the entire available evidence. Given the large and serious consequences of the erroneous approach to data synthesis and causal interpretation represented by statistical significance testing and conventional p-value cut-points, a complete demise of this simplistic approach appears fully justified in both public law and health risk assessment, in favor of a more challenging but methodologically correct method based on the comprehensive assessment of the strengths and limitations of all the available evidence, thus abandoning an unwarranted simplification devoid of scientific basis.

Acknowledgements

None.
