
Safety Assessments and Multiplicity Adjustment: Comments on a Recent Paper.

Hilko van der Voet

Year:  2018        PMID: 29455520      PMCID: PMC5843949          DOI: 10.1021/acs.jafc.7b03686

Source DB:  PubMed          Journal:  J Agric Food Chem        ISSN: 0021-8561            Impact factor:   5.279


Many end points representing possible non-target effects have to be evaluated in safety assessments that compare new and accepted products. This is known as the multiple-comparison or multiplicity problem. One popular method to adjust statistical testing procedures for multiplicity is the false discovery rate (FDR) method,[1] often implemented via adjustment of p values, as provided, for example, by the SAS procedure MULTTEST. FDR-adjusted p values are obtained by multiplying the raw p values by factors between 1 and m, where m is the number of tested hypotheses. Let $p_{(1)} \le \dots \le p_{(m)}$ be the ordered p values for m end points. Then, the FDR-adjusted p values according to a linear step-up algorithm are sequentially calculated as $\tilde{p}_{(m)} = p_{(m)}$; $\tilde{p}_{(j)} = \min\left(\tilde{p}_{(j+1)},\ (m/j)\,p_{(j)}\right)$, for $j = m - 1, \dots, 1$.

Recently, Hong et al.[2] published an evaluation of the European Food Safety Authority (EFSA) framework for safety assessment of genetically modified (GM) crops using a rat 90 day feeding study,[3] which is a compulsory part of the safety assessment according to current European Union (EU) legislation.[4] The appropriateness of these animal studies and the EFSA framework on how to conduct such studies are both under discussion. For example, the EU research project GRACE (http://www.grace-fp7.eu) has performed and evaluated four 90 day studies and one 1 year study, contributing to this discussion (see the study by Schmidt et al.[5] and references therein). Another currently ongoing EU research project is G-TwYST (https://www.g-twyst.eu), which is evaluating two 90 day studies and one combined chronic/carcinogenicity (2 year) study. Hong et al.[2] also assessed the appropriateness and applicability of the EFSA recommendations using a 90 day study and a battery of statistical approaches, including retrospective and prospective power analyses. This comment is not the place to give a full appraisal of all aspects of this discussion.
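The linear step-up adjustment defined at the start of this comment can be sketched in a few lines of Python. This is only an illustration of the recursion as stated there, not the SAS MULTTEST implementation; the function name `fdr_adjust` is hypothetical, and the input is assumed to be a list of raw p values between 0 and 1.

```python
def fdr_adjust(p_values):
    """Linear step-up (FDR) adjusted p values, returned in input order.

    Illustrative sketch only; `fdr_adjust` is a hypothetical name.
    """
    m = len(p_values)
    # Indices that sort the raw p values in ascending order: p(1) <= ... <= p(m).
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value (rank m) down to the smallest (rank 1).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        # At rank m the factor m/rank is 1, so the largest p value is unchanged;
        # below that, each adjusted value is capped by the one ranked above it.
        running_min = min(running_min, (m / rank) * p_values[i])
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print(fdr_adjust(raw))
```

In this small made-up example with m = 4, the smallest raw p value (0.005) is multiplied by the full factor m, while the largest (0.04) is left unchanged, which shows the range of factors between 1 and m mentioned in the text.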
The discussion here is restricted to just one element of the statistical approach used, which is the treatment of the multiplicity as a result of many end points. Hong et al. evaluated a very large number of end points and adjusted the p values of their tests according to the FDR method. The maximum number of end points for each of the sexes was m = 146; therefore, FDR-adjusted p values are between 1 and 146 times as large as the raw p values (FDR adjustment was performed for the set of all end points that were reported across sex and separately for the male- or female-specific comparisons and, thus, may have been lower in practice, but exact values are not given). The main result of Hong et al. regarding the comparisons between test and control groups is that “no treatment-related differences were observed”. This can be contrasted with the detailed comparisons in Appendix D of the paper, where 32 of the 816 95% confidence intervals for observed differences do not contain the value 0 and, therefore, indicate significant differences in an unadjusted test. Note that this rate of significant results (3.9%) is close to the 5% rate of false positives that is expected under a null hypothesis of equality for all end points and, therefore, in itself, is not a reason for concern about safety. However, the reported absence of any statistically significant difference should be seen as the direct consequence of using the FDR adjustment. Clearly, with no “discovery” at all in this set of results, the false discovery rate is zero by definition, and in this respect, the methodology can be said to have operated very effectively. In summary, FDR adjustment is not a minor detail but is a main factor that determines the test results.

I have two serious concerns about the methodology in this paper. First, the use of standard FDR correction or any other multiple-testing scheme makes no sense in food safety testing. It controls false discoveries and is therefore connected to difference testing, where the null hypothesis is equality of means and false positives are considered the error of the first kind: you want to have a small probability of erroneously reporting a difference. This is useful in studies that set out to find differences between groups, perhaps to find new explanations for biological phenomena or effective treatments. However, in the context of safety or equivalence testing, the purpose is to demonstrate safety with a chosen confidence level. Therefore, the statistical hypotheses are reversed: the null hypothesis is that some difference exists, and we want to show equivalence by rejecting such a null hypothesis (some possible approaches allowing for end points with widely different variations have been described[6−9]). In equivalence testing, the errors of the first kind are the false negatives rather than the false positives, to guarantee a small probability of erroneously reporting equivalence. Consequently, the commonly used methods for multiplicity correction, including FDR, address the wrong type of error and should not be used in safety assessments.

Second, contrary to the tests used to report results by Hong et al., the statistical power analyses in the same paper (both prospective and retrospective) do not use FDR adjustments. Therefore, the results of these power analyses cannot be interpreted as statements about the statistical power obtained using the FDR-adjusted tests. Clearly, the statistical power for any end point separately is much lower than stated (because the p values are adjusted upward). The potential danger of the paper is the message that its approach would be an appropriate procedure because (1) a high power of the difference tests for the proposed effect sizes seems to be attained and, at the same time, (2) not a single statistically significant difference is obtained. However, the statistical approaches followed for (1) and (2) (without and with FDR correction, respectively) are inconsistent. It is misleading to present FDR-adjusted test results together with power analyses that do not incorporate these adjustments.

As an additional point, Hong et al. also claim that FDR adjustments would be endorsed by EFSA. Whereas EFSA in its guidance[3] did acknowledge the multiplicity problem (“the issue of multiple testing [...] should be addressed”), it has not, however, endorsed FDR adjustment. Instead, EFSA[3] leaves the matter to the statistical analyst (“any methods used to adjust for multiplicity should also be clearly documented and referenced”). EFSA[10] already concluded on this: “FDR as usually applied (i.e. in a context of difference testing) is a property of the subset of endpoints for which a significant difference has been found. It does not address the endpoints for which no significance has been found and therefore FDR applied to difference testing does not seem sufficient as a measure in GMO risk assessment. It could be of interest to adapt the FDR concept for equivalence testing, i.e. for a situation where hypotheses are reversed, but the GMO Panel is not aware that this has yet been done.” By now, alternative methods for multiple or multivariate equivalence testing for safety evaluations have been proposed,[11−14] which are currently under debate.
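The second concern above, that per-end-point power is lower once p values are adjusted upward, can be made concrete with a rough calculation. The sketch below is not a reanalysis of Hong et al.: it assumes a simple two-sided z test, a hypothetical true difference of 2.8 standard errors (chosen so that unadjusted power is near 80%), and the strictest raw threshold implied by the step-up rule, alpha/m, which applies when the smallest p value would be the only discovery. The function name `two_sided_power` is likewise hypothetical.

```python
from statistics import NormalDist


def two_sided_power(effect_in_se, alpha):
    """Power of a two-sided z test at level `alpha` when the true
    difference equals `effect_in_se` standard errors (illustrative only)."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    # Probability that the test statistic falls beyond either critical value.
    return (1 - z.cdf(crit - effect_in_se)) + z.cdf(-crit - effect_in_se)


m = 146        # maximum number of end points per sex, as quoted in the text
alpha = 0.05
effect = 2.8   # hypothetical effect size, in standard errors

unadjusted = two_sided_power(effect, alpha)
# Assumption: the raw p value must fall below alpha / m, the threshold that
# the step-up rule applies to the smallest p value when nothing else is small.
adjusted = two_sided_power(effect, alpha / m)

print(f"power without adjustment: {unadjusted:.2f}")
print(f"power at threshold alpha/m: {adjusted:.2f}")
```

Under these assumptions, power drops from roughly 0.8 to roughly 0.2 for the same end point and the same true difference. The exact figures depend entirely on the assumed test and effect size; the point is only that a power analysis run at the unadjusted level does not describe the adjusted test.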
  7 in total

1.  Evaluation of a statistical equivalence test applied to microarray data.

Authors:  Jing Qiu; Xiangqin Cui
Journal:  J Biopharm Stat       Date:  2010-03       Impact factor: 1.051

2.  Simultaneous confidence regions for multivariate bioequivalence. (Review)

Authors:  Philip Pallmann; Thomas Jaki
Journal:  Stat Med       Date:  2017-08-30       Impact factor: 2.373

3.  Equivalence testing using existing reference data: An example with genetically modified and conventional crops in animal feeding studies.

Authors:  Hilko van der Voet; Paul W Goedhart; Kerstin Schmidt
Journal:  Food Chem Toxicol       Date:  2017-09-25       Impact factor: 6.023

4.  Safety assessment of plant varieties using transcriptomics profiling and a one-class classifier.

Authors:  Jeroen P van Dijk; Carla Souza de Mello; Marleen M Voorhuijzen; Ronald C B Hutten; Ana Carolina Maisonnave Arisi; Jeroen J Jansen; Lutgarde M C Buydens; Hilko van der Voet; Esther J Kok
Journal:  Regul Toxicol Pharmacol       Date:  2014-07-18       Impact factor: 3.271

5.  A statistical assessment of differences and equivalences between genetically modified and reference plant varieties.

Authors:  Hilko van der Voet; Joe N Perry; Billy Amzal; Claudia Paoletti
Journal:  BMC Biotechnol       Date:  2011-02-16       Impact factor: 2.563

6.  Safety Assessment of Food and Feed from GM Crops in Europe: Evaluating EFSA's Alternative Framework for the Rat 90-day Feeding Study.

Authors:  Bonnie Hong; Yingzhou Du; Pushkor Mukerji; Jason M Roper; Laura M Appenzeller
Journal:  J Agric Food Chem       Date:  2017-06-28       Impact factor: 5.279

7.  Variability of control data and relevance of observed group differences in five oral toxicity studies with genetically modified maize MON810 in rats.

Authors:  Kerstin Schmidt; Jörg Schmidtke; Paul Schmidt; Christian Kohl; Ralf Wilhelm; Joachim Schiemann; Hilko van der Voet; Pablo Steinberg
Journal:  Arch Toxicol       Date:  2016-10-11       Impact factor: 5.153

  1 in total

1.  Equivalence Testing Approaches in Genetically Modified Organism Risk Assessment.

Authors:  Hilko van der Voet; Claudia Paoletti
Journal:  J Agric Food Chem       Date:  2019-11-27       Impact factor: 5.279

