Many end points
representing possible non-target effects have to be evaluated in safety
assessments that compare new and accepted products. This is known
as the multiple-comparison or multiplicity problem. One popular method
to adjust statistical testing procedures for multiplicity is the false
discovery rate (FDR) method,[1] often implemented
via adjustment of p values, such as for example provided
in SAS procedure MULTTEST. FDR-adjusted p values
are obtained by multiplication of the raw p values
with factors between 1 and m, where m is the number of tested hypotheses. Let $p_{(1)} \le \dots \le p_{(m)}$ be the ordered p values for m end
points. Then, the FDR-adjusted p values according
to a linear step-up algorithm are sequentially calculated as $\tilde{p}_{(m)} = p_{(m)}$; $\tilde{p}_{(j)} = \min(\tilde{p}_{(j+1)}, (m/j)\,p_{(j)})$, for $j = m - 1, \dots, 1$.

Recently, Hong et al.[2] published an evaluation of the European Food
Safety Authority (EFSA) framework for safety assessment of genetically
modified (GM) crops using a rat 90 day feeding study,[3] which is a compulsory part of the safety assessment according
to current European Union (EU) legislation.[4] The appropriateness of these animal studies and the EFSA framework
on how to conduct such studies are both under discussion. For example,
the EU research project GRACE (http://www.grace-fp7.eu) has performed and evaluated four 90
day studies and one 1 year study, contributing to this discussion (see the
study by Schmidt et al.[5] and references
therein). Another currently ongoing EU research project is G-TwYST
(https://www.g-twyst.eu),
which is evaluating two 90 day studies and one combined chronic/carcinogenicity
(2 year) study. Hong et al.[2] also assessed
the appropriateness and applicability of the EFSA recommendations
using a 90 day study and a battery of statistical approaches, including
retrospective and prospective power analyses. This comment is not
the place to give a full appraisal of all aspects of this discussion.
The discussion here is restricted to just one element of the statistical
approach used, which is the treatment of the multiplicity as a result
of many end points. Hong et al. evaluated a very large number of end
points and adjusted the p values of their tests according
to the FDR method. The maximum number of end points for each of the
sexes was m = 146; therefore, FDR-adjusted p values are between 1 and 146 times as large as the
raw p values (FDR adjustment was performed for the
set of all end points that were reported across sex and separately
for the male- or female-specific comparisons; thus, m may have been
lower in practice, but exact values are not given). The main result
of Hong et al. regarding the comparisons between test and control
groups is that “no treatment-related differences were observed”.
This can be contrasted with the detailed comparisons in Appendix D
of the paper, where 32 out of 816 of the 95% confidence intervals
for observed differences do not contain the value 0 and, therefore,
indicate significant differences in an unadjusted test. Note that
this rate of significant results (3.9%) is close to the rate
of 5% false positives that is expected under a null hypothesis of
equality for all end points and, therefore, in itself, is not a reason
for concern about safety. However, the reported absence of any statistically
significant difference should be seen as the direct consequence of
using the FDR adjustment. Clearly, with no “discovery”
at all in this set of results, the false discovery rate is zero by
definition, and in this respect, the methodology can be said to have
operated very effectively. In summary, FDR adjustment is not a minor
detail but is a main factor that determines the test results.

I have two serious concerns about the methodology in this paper.

First, the use of standard FDR correction or any other multiple-testing
scheme makes no sense in food safety testing. It controls false discoveries
and is therefore connected to difference testing, where the null hypothesis
is equality of means and false positives are considered the error
of the first kind: you want to have a small probability of erroneously
reporting a difference. This is useful in studies that set out to
find differences between groups, perhaps to find new explanations
for biological phenomena or effective treatments. However, in the
context of safety or equivalence testing, the purpose is to demonstrate
safety with a chosen confidence level. Therefore, the statistical
hypotheses are reversed: the null hypothesis is that some difference
exists, and we want to show equivalence by rejecting such a null hypothesis
(some possible approaches allowing for end points with widely different
variations have been described[6−9]). In equivalence testing, the errors of the first
kind are the false negatives rather than the false positives, to guarantee
a small probability of erroneously reporting equivalence. Consequently,
the commonly used methods for multiplicity correction including FDR
are addressing the wrong type of error and should not be used in safety
assessments.

Second, contrary to the tests used to report results
by Hong et al., the statistical power analyses in the same paper (both
prospective and retrospective) do not use FDR adjustments. Therefore,
the results of these power analyses cannot be interpreted as statements
about the statistical power obtained using the FDR-adjusted tests.
Clearly, the statistical power for any end point separately is much
lower than stated (because the p values are adjusted
upward). The potential danger of the paper is the message that its
approach would be an appropriate procedure because (1) a high power
of the difference tests for the proposed effect sizes seems to be
attained and, at the same time, (2) not a single statistically significant
difference is obtained. However, the statistical approaches followed
for 1 and 2 (without and with FDR correction, respectively) are inconsistent.
It is misleading to present FDR-adjusted test results together with
power analyses that do not incorporate these adjustments.

As an additional point, Hong et al. also claim that FDR adjustments are
endorsed by EFSA. Whereas EFSA in its guidance[3] did acknowledge the multiplicity problem (“the issue
of multiple testing [...] should be addressed”), it has
not endorsed FDR adjustment. Instead, EFSA[3] leaves the matter to the statistical analyst
(“any methods used to adjust for multiplicity should also be
clearly documented and referenced”). EFSA[10] has already drawn a conclusion on this point: “FDR as usually applied
(i.e. in a context of difference testing) is a property of the subset
of endpoints for which a significant difference has been found. It
does not address the endpoints for which no significance has been
found and therefore FDR applied to difference testing does not seem
sufficient as a measure in GMO risk assessment. It could be of interest
to adapt the FDR concept for equivalence testing, i.e. for a situation
where hypotheses are reversed, but the GMO Panel is not aware that
this has yet been done.” By now, alternative methods for multiple
or multivariate equivalence testing for safety evaluations have been
proposed,[11−14] which are currently under debate.
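
As an illustration of the linear step-up adjustment defined at the beginning of this comment, a minimal Python sketch is given here. It follows the recursion stated there (the largest p value is left unchanged, and each smaller one is multiplied by m/j and capped by the adjusted value above it), a procedure commonly known as the Benjamini-Hochberg method. The function name `fdr_step_up` and the four raw p values are hypothetical and chosen only for illustration; they are not taken from Hong et al., and the sketch is not intended to reproduce the exact behavior of SAS procedure MULTTEST.

```python
from typing import List, Sequence


def fdr_step_up(p_values: Sequence[float]) -> List[float]:
    """Linear step-up FDR-adjusted p values, returned in the input order.

    Hypothetical helper written for illustration only.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices that sort the raw p values ascending: p_(1) <= ... <= p_(m).
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    # The largest p value is unchanged: adjusted p_(m) = p_(m).
    running_min = p_values[order[-1]]
    adjusted[order[-1]] = running_min
    # For j = m-1, ..., 1: adjusted p_(j) = min(adjusted p_(j+1), (m/j) * p_(j)).
    for j in range(m - 1, 0, -1):
        idx = order[j - 1]
        running_min = min(running_min, (m / j) * p_values[idx])
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p values for m = 4 end points (illustration only).
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = fdr_step_up(raw)
print([round(a, 4) for a in adjusted])  # [0.04, 0.0533, 0.0533, 0.2]
```

In this toy example, three of the four raw p values are below 0.05, but only one adjusted p value remains below 0.05. The multiplication factors range from 1 (largest p value) to m = 4 (smallest p value), which shows on a small scale how strongly the adjustment can determine the number of reported differences.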
Authors: Jeroen P van Dijk; Carla Souza de Mello; Marleen M Voorhuijzen; Ronald C B Hutten; Ana Carolina Maisonnave Arisi; Jeroen J Jansen; Lutgarde M C Buydens; Hilko van der Voet; Esther J Kok Journal: Regul Toxicol Pharmacol Date: 2014-07-18 Impact factor: 3.271
Authors: Bonnie Hong; Yingzhou Du; Pushkor Mukerji; Jason M Roper; Laura M Appenzeller Journal: J Agric Food Chem Date: 2017-06-28 Impact factor: 5.279
Authors: Kerstin Schmidt; Jörg Schmidtke; Paul Schmidt; Christian Kohl; Ralf Wilhelm; Joachim Schiemann; Hilko van der Voet; Pablo Steinberg Journal: Arch Toxicol Date: 2016-10-11 Impact factor: 5.153