
Beyond the traditional simulation design for evaluating type 1 error control: From the "theoretical" null to "empirical" null.

Ting Zhang1, Lei Sun1,2.   

Abstract

When evaluating a newly developed statistical test, an important step is to check its type 1 error (T1E) control using simulations. This is often achieved by the standard simulation design S0 under the so-called "theoretical" null of no association. In practice, whole-genome association analyses scan through a large number of genetic markers (Gs) for the ones associated with an outcome of interest (Y), where Y comes from an alternative while the majority of Gs are not associated with Y; these Y-G relationships are under the "empirical" null. This reality is better represented by two other simulation designs: design S1.1 simulates Y from an alternative model based on G, then evaluates its association with an independently generated G_new, while design S1.2 evaluates the association between permuted Y and G.
More than a decade ago, Efron (2004) noted the important distinction between the "theoretical" and "empirical" null in false discovery rate control. Using scale tests for variance heterogeneity, and direct univariate and multivariate interaction tests, as examples, here we show that not all null simulation designs are equal. In examining the accuracy of a likelihood ratio test, while simulation design S0 suggested that the method was accurate, designs S1.1 and S1.2 revealed its increased empirical T1E rate if applied in real data settings. The inflation becomes more severe at the tail and does not diminish as sample size increases. This is an important observation that calls for new practices in methods evaluation and T1E control interpretation.
© 2018 The Authors Genetic Epidemiology Published by Wiley Periodicals, Inc.

Keywords:  interaction; simulation; type 1 error; variance heterogeneity; whole-genome association scans

Year:  2018        PMID: 30478944      PMCID: PMC6518945          DOI: 10.1002/gepi.22172

Source DB:  PubMed          Journal:  Genet Epidemiol        ISSN: 0741-0395            Impact factor:   2.135


INTRODUCTION

Type 1 error (T1E) control evaluation using simulations is always the first step in understanding the performance of any newly developed statistical test. To formulate the problem more precisely, consider the current large‐scale genome‐wide association studies (GWAS) or next‐generation sequencing (NGS) studies of complex and heritable traits. These studies scan through millions or more genetic markers (Gs) across the genome for the ones associated with a trait of interest (Y), while accounting for environmental effects (E); without loss of generality we assume these genetic markers are single‐nucleotide polymorphisms (SNPs). Many association tests have been developed, and they often require the assumption of (approximately) normally distributed errors to maintain an accurate T1E, with some tests being more robust than others. For example, Bartlett's test for variance heterogeneity has been shown to have inflated T1E rates when the error term follows a t‐ or χ²‐distribution (Struchalin, Dehghan, Witteman, van Duijn, & Aulchenko, 2010), and the likelihood ratio test (LRT) is similarly sensitive to nonnormal errors (Cao, Wei, Bailey, Kauwe, & Maxwell, 2014), while Levene's test appears to be more robust (Soave & Sun, 2017; Soave et al., 2015). Let T be the association test statistic to be evaluated. The standard T1E simulation design, denoted as S0, first generates phenotype data Y0 under the "theoretical" null of no association, Y0 = e. It then independently generates genotype data G, applies a fitted model of Y0 on G to compute the test statistic T, derives the asymptotic null distribution of T denoted as F_T, obtains the corresponding p‐value for each simulated replicate, and finally estimates the empirical T1E rate of T. (For notational simplicity and without loss of generality, for the moment we omit the intercept and additional covariates Zs from the regression models.)
T is considered sound if the T1E rate is well controlled under the normality assumption e ~ N(0, σ²), and its robustness is then evaluated by assuming other distributional forms for e. Given well‐controlled T1E under normality, power is then studied by generating phenotype data under an alternative, Y1 = βG·G + e, where typically e ~ N(0, σ²). In that case, one applies the fitted model of Y1 on G to obtain the test statistic under the alternative, calculates the corresponding p‐value based on F_T, and finally estimates the empirical power of T. Combining the two, one would then expect that, in practice, T maintains good T1E control for a null SNP and has a certain amount of power for an alternative one. However, for a real GWAS, the relationship between the phenotype and a null SNP is under the "empirical" null, which we describe below. This inconsistency in T1E evaluation and interpretation is the focus of our study. In practice, a whole‐genome association scan receives an empirical Y that comes from an alternative, influenced by one or more Gs. Among the million or more SNPs to be analyzed, most are not associated with Y. However, the corresponding lack of phenotype–genotype association is not accurately reflected by the "theoretical" null simulation design S0 as described above. Now consider two alternative simulation designs to evaluate T1E control of the test statistic T. Design S1.1 first generates Y1 from an alternative, Y1 = βG·G + e. It then independently generates G_new for a new SNP, fits the model of Y1 on G_new, calculates T and its corresponding p‐value based on F_T, and finally estimates the T1E rate. Design S1.2 permutes the simulated Y1 and evaluates T1E from the association between the permuted Y1 and G. Clearly, the "empirical" null simulation designs S1.1 and S1.2 mimic real data far better than the "theoretical" null S0. Thus, an important question is whether the S1.1 and S1.2 designs lead to the same T1E conclusion for T as the S0 design.
In particular, even if the normality assumption is true in the generating model and T appears to be accurate based on the S0 evaluation, do we expect T to perform equally well when applied to real data? The answer depends on the sensitivity of the test statistic used. Efron (2004) brought up the discussion of the "theoretical" versus "empirical" null more than a decade ago. Focusing on controlling the false discovery rate, Efron (2004) outlined several possible sources of nonnormality, including unobserved covariates and hidden correlation, and he proposed an empirical Bayes approach to the problem. Here, we study the practical implications of T1E evaluation based on S0, the commonly used "theoretical" null simulation design, in the context of whole‐genome association scans. We show that while a test may appear to be accurate under S0 and assuming normality, it can have incorrect T1E rates under the "empirical" null of S1.1 or S1.2, also "assuming normality." The fundamental cause of the discrepancy is that, even if the error term e in the generating model of Y1 is normal, Y1 marginally may not be normal. Thus, in evaluating the null relationship between Y1 and G_new using the fitted model of Y1 on G_new (or between the permuted Y1 and G), the true null distribution of T may not be F_T, which was derived for T under the "theoretical" null. Essentially, the model is misspecified if the error term was assumed to be normal. Inference of the location parameter βG, the main effect of a SNP, is generally quite robust to model assumptions (Khan & Rayner, 2003). However, for emerging association tests that are designed to improve power by going beyond the first moment, the distinction between the "theoretical" and "empirical" null in T1E evaluation can be consequential. As a proof‐of‐principle, we focus on testing gene–environment (G × E) interaction effects; testing gene–gene (G × G) effects is similar. Such interaction effects are expected for complex traits, and we study three scenarios for testing G × E interaction.
The first scenario assumes that data on E are not available in practice. Thus, direct G × E interaction analysis is not possible. In that case, because the unmodeled interaction induces variance heterogeneity in Y when conditioning only on G, scale tests such as Levene's test, originally developed for model diagnostics, can be used to indirectly test for the interaction effect. We investigate several scale tests recently proposed for this purpose (Aschard, Zaitlen, Tamimi, Lindström, & Kraft, 2013; Cao et al., 2014; Paré, Cook, Ridker, & Chasman, 2010; Soave et al., 2015). It is worth noting that the causes of variance heterogeneity are multifaceted beyond potential interactions (Dudbridge & Fletcher, 2014; Sun, Elston, Morris, & Zhu, 2013; Wood et al., 2014). We show that, depending on the robustness of a test, the T1E conclusion may differ between the "theoretical" and "empirical" null. The second scenario assumes E is available for direct modeling of the G × E interaction effect. Previously, Voorman, Lumley, McKnight, and Rice (2011) and Rao and Province (2016) showed that the T1E of testing for an interaction effect in a whole‐genome scan can be more variable than testing a G or E main effect. The authors examined G × G analysis, where one G represents a fixed SNP and its interactions with all other SNPs are of primary interest. Statistically, this is similar to the G × E setting that we study here, because E does not vary between SNPs in a genome‐wide G × E interaction scan. Focusing on inflated or deflated genomic inflation factor λ (Devlin & Roeder, 1999), Rao and Province (2016) demonstrated a larger variation in λ (akin to a larger variation in T1E rates between different whole‐genome association scans) when testing the interaction effect as compared to the main effect under the "theoretical" null. They attributed this to dependency between the interaction test statistics, because the fixed SNP (or the fixed E in our setting) does not change between the tests.
They also noted that increasing sample size mitigates the problem. Here, we use this opportunity to revisit direct G × E interaction testing in a GWAS setting. We show that, while T1E rates are indeed more variable between simulation replicates (i.e., between genome‐wide interaction scans) under the conventional "theoretical" null, due to dependency between the tests as in Rao and Province (2016), the average T1E rate is correct regardless of the sample size. However, under the "empirical" null, a different picture emerges, as in the scale test setting. The third scenario extends the above to a multivariate setting, where the fixed E interacts with multiple different Gs, as in gene‐based interaction studies. We examine the sequence kernel association test (SKAT)‐type variance component test (Wu et al., 2011), together with the burden‐type sum test (Madsen & Browning, 2009) and the classical F‐test, which jointly evaluate multiple interaction effects. Generally speaking, departure from normality is of lesser concern when an outcome is influenced by multiple genetic and environmental factors (Falconer, 1960; Mackay, 2009). However, we show that the distinction between the "theoretical" and "empirical" null remains relevant for multivariate models. In Section 2, we first describe the three scenarios for interaction testing, including when E is missing, and when E is known and interacts with one or multiple Gs. For clarification, we call these interaction scenarios. Under each interaction scenario, we briefly review all the statistical tests to be investigated. We then describe the three simulation designs for evaluating T1E control, namely S0 for the "theoretical" null design, and S1.1 and S1.2 for the "empirical" null designs; we call these T1E simulation designs. The implementation of the different T1E simulation designs depends on the test of interest.
Thus, we also describe in detail how the data are generated for each of the three interaction scenarios, and for each of the three T1E simulation designs; we call these data‐generating models. In Section 3, we conduct simulation studies, provide the corresponding numerical results, and reveal existing problems in T1E evaluation based on the "theoretical" null simulation design S0. We show that (a) a T1E conclusion drawn from S0 can differ from that based on the two alternative "empirical" null simulation designs S1.1 and S1.2, (b) the T1E discrepancy can remain as sample size increases, and (c) the T1E issue can be more severe at the tail. The root cause of the issue demonstrated in this study is subtle model misspecification. A test might be shown to be accurate under the idealistic "theoretical" null S0. However, its true T1E behavior, when applied to real whole‐genome association scans, is only uncovered through the "empirical" null designs S1.1 and S1.2. We make additional remarks in Section 4.

METHODS AND MATERIALS

Three G × E interaction scenarios and corresponding statistical tests

For association analysis of a complex trait using a sample of size n, we first define Gi as the genotype data for individual i at each SNP under study. As is conventional, Gi denotes the number of copies of the minor allele, coded additively as 0, 1, and 2. Also as is conventional, Gi is assumed to come from a binomial distribution, Binomial(2, p), where p is the minor allele frequency (MAF). In the simplest case, we might assume the true generating model for the trait to be

Y = βG·G + βE·E + βGE·(G × E) + e,   (1)

where βG is the main effect of a causal SNP, βE is the environmental effect, and βGE is the gene–environment interaction effect. Note that the error term in the true data‐generating model is denoted as e and assumed to be normal, e ~ N(0, σ²). We wish to identify the SNP whose genotype G influences Y.
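As a minimal sketch of the generating model (1), the simulation below draws additively coded genotypes from Binomial(2, MAF), a binary exposure E, and normal errors. The helper name `simulate_phenotype` and the parameter values are illustrative, not taken from the paper.

```python
import numpy as np

def simulate_phenotype(n, maf=0.4, p_e=0.3, b_g=0.01, b_e=0.3, b_ge=0.2,
                       sigma=1.0, rng=None):
    """Simulate (Y, G, E) under generating model (1):
    Y = b_g*G + b_e*E + b_ge*(G*E) + e, with e ~ N(0, sigma^2)."""
    rng = np.random.default_rng() if rng is None else rng
    g = rng.binomial(2, maf, size=n)      # additive genotype coding 0/1/2
    e_env = rng.binomial(1, p_e, size=n)  # binary environmental exposure
    err = rng.normal(0.0, sigma, size=n)  # normal error term e
    y = b_g * g + b_e * e_env + b_ge * g * e_env + err
    return y, g, e_env

y, g, env = simulate_phenotype(1000, rng=np.random.default_rng(1))
```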

G × E interaction scenario 1: Single G, and E missing

Suppose information regarding E is not available; then the working or fitted model can only account for the main effect of G,

Y = βG·G + ε.   (2)

However, it is straightforward to show that the variances of Y stratified by the three genotype groups of G differ if βGE ≠ 0. That is, the error term ε in the fitted model (2) can behave quite differently from e, the error term in the generating model (1). Thus, when E is missing and direct interaction modeling is not feasible, scale tests for heteroscedasticity can be utilized to identify G associated with the variance of Y (Paré et al., 2010). A joint location‐scale testing framework can provide robustness against either the main or the interaction effect being absent, and it can improve power if both effects are present (Soave & Sun, 2017; Soave et al., 2015). Here we focus on studying the more sensitive scale test, because the power of the joint test depends on its individual components. Different scale tests have been studied in this context; chief among them are Levene's test (Levene, 1960), considered by Paré et al. (2010) and Soave et al. (2015), and the LRT, considered by Cao et al. (2014). Levene's test for variance heterogeneity between groups is an analysis of variance of the absolute deviation of each observation from its group mean or median. Under the null of variance homogeneity and assuming normality, the resulting test statistic follows an F distribution, and it is asymptotically χ² distributed with two degrees of freedom in our three‐genotype‐group case. Using the median instead of the mean to measure the spread within each group is more robust to nonnormality, particularly for t‐distributed or skewed data (Brown & Forsythe, 1974; Soave & Sun, 2017), and we use the median version of Levene's test in the remainder of the paper. The variance LRT considered by Cao et al. (2014), denoted LRTv, contrasts the null model of no variance difference across the genotype groups with the alternative model allowing group‐specific variances, and conducts the corresponding LRT. The resulting test statistic is asymptotically χ² distributed with two degrees of freedom under the null of homoscedasticity and normality.
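A hedged sketch of the two scale tests follows: SciPy's `levene` with `center='median'` gives the Brown–Forsythe (median) version, and `variance_lrt` is a simple LRT-style statistic contrasting pooled versus group-specific MLE variances; Cao et al.'s exact parameterization may differ, and the data below are illustrative.

```python
import numpy as np
from scipy import stats

def variance_lrt(y, g):
    """LRT-style test for variance heterogeneity of y across genotype
    groups g: -2 log-likelihood ratio of common-variance vs. group-specific
    variance models (group means fitted in both). Returns (stat, p)."""
    groups = [y[g == k] for k in np.unique(g)]
    n = len(y)
    resid = np.concatenate([grp - grp.mean() for grp in groups])
    s2_pooled = np.mean(resid ** 2)          # null: one common variance
    stat = n * np.log(s2_pooled) - sum(      # alt: per-group MLE variances
        len(grp) * np.log(np.mean((grp - grp.mean()) ** 2)) for grp in groups)
    df = len(groups) - 1
    return stat, stats.chi2.sf(stat, df)

rng = np.random.default_rng(7)
g = rng.binomial(2, 0.4, size=2000)
y = rng.normal(size=2000)                    # homoscedastic null data
lrt_stat, lrt_p = variance_lrt(y, g)
lev_stat, lev_p = stats.levene(*(y[g == k] for k in (0, 1, 2)),
                               center='median')  # median-based Levene
```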
For the purpose of comparison, we also examine the LRT (LRTm) and score test (Scorem) for testing the main effect, H0: βG = 0. Cao et al. (2014) pointed out that LRTv is sensitive to the normality assumption, but under normality they demonstrated that LRTv has good T1E control. However, they implicitly assumed that the test would work well as long as the error term in the phenotype‐generating model is normally distributed, regardless of the "theoretical" or "empirical" null. The work here shows why LRTv may have a T1E issue in practice. Indeed, Soave et al. (2015) applied a scale test to a GWAS of lung function measures in 1,409 individuals with cystic fibrosis. Despite the fact that the lung measures were approximately normally distributed and permuted before the variance association analysis, the histogram of GWAS p‐values clearly showed an increased T1E (Supporting Information Figure S2.G of Soave et al., 2015); the actual application was a joint location‐and‐scale test, but the T1E issue was due to the scale component.

G × E interaction scenario 2: Single G, and E known

When E is known and its data were collected, we can directly test for the G × E interaction effect by contrasting the following two fitted models:

Y = βG·G + βE·E + ε  versus  Y = βG·G + βE·E + βGE·(G × E) + ε,

where ε ~ N(0, σ²). The corresponding LRT and score tests are both asymptotically χ² distributed with one degree of freedom under the null of no interaction effect. That is, under the "theoretical" null that βGE = 0 in the true phenotype‐generating model (1), T is asymptotically χ²(1), where the test statistic T is either the LRT or the score statistic. This is an important note for our study here. We will show that if βGE ≠ 0 but the association test is conducted for G_new (a new SNP generated independently of Y) under the "empirical" null, continuing to use χ²(1) to obtain the p‐value can lead to a T1E result that is quite different from that obtained under the "theoretical" null.
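The nested-model contrast above can be sketched with ordinary least squares as follows (a minimal illustration of the 1-df interaction LRT; `interaction_lrt` and the parameter values are ours, not the paper's):

```python
import numpy as np
from scipy import stats

def interaction_lrt(y, g, e):
    """LRT for H0: beta_GE = 0, contrasting Y ~ G + E against
    Y ~ G + E + G*E; asymptotically chi-square(1) under the null."""
    n = len(y)
    x0 = np.column_stack([np.ones(n), g, e])
    x1 = np.column_stack([x0, g * e])
    rss0 = np.sum((y - x0 @ np.linalg.lstsq(x0, y, rcond=None)[0]) ** 2)
    rss1 = np.sum((y - x1 @ np.linalg.lstsq(x1, y, rcond=None)[0]) ** 2)
    stat = n * np.log(rss0 / rss1)   # -2 log likelihood ratio (normal errors)
    return stat, stats.chi2.sf(stat, df=1)

rng = np.random.default_rng(2)
g = rng.binomial(2, 0.4, size=1000)
e = rng.binomial(1, 0.3, size=1000)
y = 0.3 * e + rng.normal(size=1000)  # no interaction: the null holds
stat, p = interaction_lrt(y, g, e)
```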

G × E interaction scenario 3: Multiple Gs, and E known

A complex trait is influenced by multiple factors. For example, intelligence has been found to be associated with more than a hundred SNPs (Dadaev et al., 2018; Hill et al., 2018). Without loss of generality, the simple phenotype‐generating model (1) can be extended to include J SNPs,

Y = Σj βGj·Gj + βE·E + Σj βGEj·(Gj × E) + e,   (6)

where βGj is the main effect of SNP j, βE is the environmental effect, and βGEj is the interaction effect of Gj × E. To detect any of the interaction effects, the classical F‐test can be applied to the fitted model

Y = Σj βGj·Gj + βE·E + Σj βGEj·(Gj × E) + ε,   (7)

and test H0: βGE1 = … = βGEJ = 0 simultaneously. Under the "theoretical" null that all βGEj = 0 in the phenotype‐generating model (6), the test statistic is F distributed and is known to have good T1E control. However, it is not clear how the test would behave in practice under the "empirical" null. In that case, the set of G_new,j's under study do not interact with E to influence Y, but some βGj ≠ 0 and βGEj ≠ 0 in the true phenotype‐generating model (6). For rare variants with low MAFs, multivariate testing is common and different methods have been proposed, such as the burden‐type sum test (Li & Leal, 2008; Madsen & Browning, 2009; Price et al., 2010) and the SKAT‐type variance component test (Wu et al., 2011); see Derkach, Lawless, and Sun (2014) for a review. Although these tests were originally developed for main effects of rare variants, they can be applied to common variants (Lin, Lee, Christiani, & Lin, 2013) and used for interaction effects (Lin et al., 2016). Briefly, the burden interaction test first aggregates the allele counts across the J SNPs to obtain a single burden score. It then tests the interaction between the burden score and E using the corresponding fitted model. However, the burden test has a T1E issue even under the "theoretical" null; this was studied in Section 3 of Lin et al. (2016). For completeness of method evaluation, we include the burden test in our study of the "theoretical" versus "empirical" null. Extending the earlier SKAT work for main effects, Lin et al. (2016) then used it to study interaction effects.
Without going into the technical details, the main component of the SKAT interaction test is Q = Σj wj·Sj², where Sj is the score test statistic for the Gj × E interaction term in the fitted model (7), and the weight wj depends on the MAF of SNP j. We will study the SKAT‐type test, as well as the burden and F tests, for gene‐based interaction studies of both common and rare variants.
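A schematic Q-type statistic can be written as a weighted sum of squared per-SNP interaction scores. This is a sketch under our own simplifications (residual-based scores after adjusting for main effects; `q_statistic` and the weights are illustrative), not Lin et al.'s exact construction.

```python
import numpy as np

def q_statistic(y, G, e, weights):
    """Schematic SKAT-style statistic Q = sum_j w_j * S_j^2, with S_j a
    scaled score for the Gj x E interaction computed from residuals of
    the main-effects model Y ~ G + E."""
    n, J = G.shape
    x0 = np.column_stack([np.ones(n), G, e])           # main-effects model
    resid = y - x0 @ np.linalg.lstsq(x0, y, rcond=None)[0]
    scores = (G * e[:, None]).T @ resid / np.sqrt(n)   # per-interaction scores
    return float(np.sum(weights * scores ** 2))

rng = np.random.default_rng(5)
G = rng.binomial(2, 0.3, size=(500, 11))  # J = 11 SNPs
e = rng.binomial(1, 0.3, size=500)
y = 0.3 * e + rng.normal(size=500)        # no interaction effects
q = q_statistic(y, G, e, np.ones(11))     # flat weights for illustration
```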

Three T1E simulation designs: The “theoretical” null S0, and the “empirical” null S1.1 and S1.2

The three simulation designs for evaluating T1E can be conceptualized as follows; the exact implementation depends on the test to be evaluated. Thus, we describe below, in detail, how data are generated for the three T1E simulation designs (S0, S1.1, and S1.2) under the different interaction testing scenarios (E missing or not, and single or multiple Gs).

Data‐generating models for the three T1E simulation designs and under each of the three interaction scenarios

Scenario 1: Single G, and E missing

Consider the true phenotype‐generating model (1). The "theoretical" null simulation design S0 assumes βG = βE = βGE = 0; for simplicity but without loss of generality, we also assume a zero intercept. Thus, S0 simulates phenotype data using

Y0 = e,  e ~ N(0, σ²).

It then independently simulates genotype data G_new ~ Binomial(2, p) for a nonassociated SNP, where p is the MAF. Finally, because E is assumed missing in practice, S0 uses the fitted model

Y0 = βG·G_new + ε

to detect the variance heterogeneity present in ε, based on the Levene and LRTv tests as described in Section 2.1.1. The "empirical" null simulation design S1.1, however, first simulates phenotype data under an alternative. That is,

Y1 = βG·G + βE·E + βGE·(G × E) + e,  e ~ N(0, σ²).

Note that G is a truly associated SNP, G ~ Binomial(2, p), where its MAF does not have to be the same as that of G_new above. S1.1 then independently simulates genotype data G_new for a nonassociated SNP as above. Similarly, because E is assumed missing in practice, S1.1 uses the fitted model Y1 = βG·G_new + ε to conduct the Levene and LRTv variance tests. The "empirical" null simulation design S1.2 first permutes the generated Y1. Because the permuted Y1 is no longer associated with the Y‐generating G, it then uses the fitted model of the permuted Y1 on G to assess the T1E control of a test. In summary, the true phenotype‐generating models are Y0 = e or Y1 = βG·G + βE·E + βGE·(G × E) + e, where e ~ N(0, σ²) in both cases, and G_new is the genotype of a nonassociated SNP. The three T1E simulation designs estimate the T1E rate of a heteroscedasticity test using, respectively, the fitted models of Y0 on G_new (S0), Y1 on G_new (S1.1), and the permuted Y1 on G (S1.2).
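The three designs for scenario 1 can be coded side by side as a minimal sketch (illustrative effect sizes; `levene_p` is a hypothetical helper wrapping SciPy's median-based Levene test):

```python
import numpy as np
from scipy import stats

def levene_p(y, g):
    """Median-based Levene p-value for variance heterogeneity across g."""
    groups = [y[g == k] for k in np.unique(g)]
    return stats.levene(*groups, center='median').pvalue

rng = np.random.default_rng(3)
n, maf = 1000, 0.4

# S0 ("theoretical" null): Y0 = e, tested against an independent G_new.
y0 = rng.normal(size=n)
g_new = rng.binomial(2, maf, size=n)
p_s0 = levene_p(y0, g_new)

# S1.1 ("empirical" null): Y1 generated from a causal G (E missing),
# then tested against an independently generated G_new.
g = rng.binomial(2, maf, size=n)
e_env = rng.binomial(1, 0.3, size=n)
y1 = 0.01 * g + 0.3 * e_env + 0.5 * g * e_env + rng.normal(size=n)
p_s11 = levene_p(y1, rng.binomial(2, maf, size=n))

# S1.2 ("empirical" null): permute Y1, breaking its association with G.
p_s12 = levene_p(rng.permutation(y1), g)
```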

Scenario 2: Single G, and E known

In this case, the data‐generating model is the same as above, namely Y0 = βE·E + e for S0 and Y1 = βG·G + βE·E + βGE·(G × E) + e for S1.1 and S1.2. Because E is known in this second interaction scenario, the three T1E simulation designs estimate the T1E rate by testing H0: βGE = 0 using, respectively, the fitted models with G_new × E (S0 and S1.1) or with the jointly permuted (Y1, E) and G (S1.2). Note that the permutation must be performed jointly for Y1 and E. This maintains the Y–E relationship while breaking the Y–G association. For the "theoretical" null S0, we assumed no G main effect in the true generating model (similar to Model I of Rao & Province, 2016). Alternatively, we could include the G main effect (Model II of Rao & Province, 2016). In that case, the fitted model would also include the βG·G term, and its estimate is expected to be zero under the null. However, this difference in the "theoretical" null design regarding the main effect does not affect our study of the interaction effect. The work of Rao and Province (2016) studied the effect of dependency in G × G interaction analysis on T1E control. Similarly, we can assume E is fixed to represent the fact that, in a real genome‐wide G × E interaction scan, E does not change from SNP to SNP. However, as demonstrated below, this dependency is not the source of the T1E issue addressed here.
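The joint permutation for design S1.2 can be sketched as follows: applying one index vector to both Y and E keeps every (Y, E) pair intact while breaking the Y–G pairing. The helper name `permute_y_and_e` is ours.

```python
import numpy as np

def permute_y_and_e(y, e, rng):
    """Permute (Y, E) jointly with a single index vector, preserving the
    Y-E relationship while breaking the Y-G association (design S1.2)."""
    idx = rng.permutation(len(y))
    return y[idx], e[idx]

rng = np.random.default_rng(6)
e = rng.binomial(1, 0.3, size=1000).astype(float)
y = 0.5 * e + rng.normal(size=1000)
y_perm, e_perm = permute_y_and_e(y, e, rng)
```

Because the same index reorders both vectors, the empirical Y–E correlation is exactly unchanged; only the alignment with any genotype vector G is destroyed.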

Scenario 3: Multiple Gs, and E known

By now, it should be clear how the three T1E simulation designs are implemented in this setting. The true phenotype‐generating models are, as in Table 2, Y0 = Σj βGj·Gj + βE·E + e for S0 and Y1 = Σj βGj·Gj + βE·E + Σj βGEj·(Gj × E) + e for S1.1 and S1.2. For the F and SKAT‐type tests, the three T1E simulation designs estimate the T1E rate of jointly testing H0: βGE1 = … = βGEJ = 0, using, respectively, the fitted models with the G_new,j × E terms (S0 and S1.1) or the jointly permuted (Y1, E) with the original Gj's (S1.2). For the burden test based on the aggregated allele counts, the three T1E simulation designs estimate the T1E rate of testing the burden‐by‐E interaction using the corresponding fitted models.
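The joint F-test for the J interaction coefficients can be sketched as a comparison of nested least-squares fits (the function name `interaction_ftest` and the toy data are ours):

```python
import numpy as np
from scipy import stats

def interaction_ftest(y, G, e):
    """Classical F-test jointly testing all J interaction coefficients,
    comparing Y ~ G + E against Y ~ G + E + G*E."""
    n, J = G.shape
    x0 = np.column_stack([np.ones(n), G, e])
    x1 = np.column_stack([x0, G * e[:, None]])   # add the J Gj*E columns
    rss0 = np.sum((y - x0 @ np.linalg.lstsq(x0, y, rcond=None)[0]) ** 2)
    rss1 = np.sum((y - x1 @ np.linalg.lstsq(x1, y, rcond=None)[0]) ** 2)
    df1, df2 = J, n - x1.shape[1]
    f = ((rss0 - rss1) / df1) / (rss1 / df2)
    return f, stats.f.sf(f, df1, df2)

rng = np.random.default_rng(4)
G = rng.binomial(2, 0.4, size=(800, 5))
e = rng.binomial(1, 0.3, size=800)
y = G @ np.full(5, 0.1) + 0.3 * e + rng.normal(size=800)  # no interactions
fstat, pval = interaction_ftest(y, G, e)
```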

SIMULATION STUDIES

Simulation models and parameter values

For the indirect or direct G × E interaction study of a single SNP (i.e., interaction scenarios 1 and 2), Table 1 provides the details of the data‐generating models and parameter values used. To evaluate the Levene and LRTv tests for variance heterogeneity, besides the data‐generating model described in Section 2.3.1 and used by Aschard et al. (2013), we also considered the model adopted by Cao et al. (2014) for a more extensive comparison. Cao et al. (2014) used Model (4) to directly simulate variance homogeneity or heterogeneity in Y stratified by G. In contrast, Aschard et al. (2013) used Model (1) to indirectly simulate variance heterogeneity, which has a better genetic epidemiology interpretation, because the size of βGE corresponds to the power of scale tests under alternatives. For direct testing of the interaction effect, conveniently, the data‐generating model of Aschard et al. (2013) in Table 1 is conceptually the same as simulation model I of Rao and Province (2016), except that the second interacting factor is a fixed E here.
Table 1

Summary of the two data‐generating models for indirect and direct G × E interaction testing, and evaluating the "theoretical" null simulation design S0 versus the two "empirical" null simulation designs S1.1 and S1.2, as described in Sections 2.3.1 and 2.3.2

Variance heterogeneity introduced by σG² (Cao et al., 2014) versus by G × E (Aschard et al., 2013; or, directly test βGE, assuming E was available):

Null model for S0:
  Cao:     Y0 = e, e ~ N(0, σ²)
  Aschard: Y0 = βE·E + e, e ~ N(0, σ²)
Alternative models for S1.1 and S1.2:
  Cao:     Y1 = βG·G + e, e ~ N(0, σG²)
  Aschard: Y1 = βG·G + βE·E + βGE·(G × E) + e, e ~ N(0, σ²)
Parameters:
  Cao:     MAF = 0.4 for G and G_new; βG = 0.3; σ0² = 0.23, σ1² = 0.25, σ2² = 0.29
  Aschard: MAF = 0.4 for G and G_new; P(E = 1) = 0.3; βG = 0.01; βE = 0.3; βGE = 0.1, 0.2, …, 1; σ² = 1
Sample size: n = 10³ or 10⁴ (both models)
Nominal T1E level:
  Cao:     α = 0.05
  Aschard: α = 0.05, 0.01, 0.001, 10⁻⁵
Replications:
  Cao:     nrep.in = 10⁵; nrep.out = 100
  Aschard: nrep.in = 10⁵ (or 10⁷ for α = 10⁻⁵); nrep.out = 100

If E was available for direct G × E testing, the Aschard et al. (2013) model coincides with Model I of Rao and Province (2016), except that the second interacting variable is a fixed E here. The T1E rate is first estimated from nrep.in simulation replicates in an inner loop in which E is fixed (similar to one whole‐genome G × E interaction scan), then averaged over nrep.out simulation replicates in an outer loop in which E varies.

MAF: minor allele frequency.

To best mimic real data conditions, we also used a "double‐loop" simulation design. Consider the "theoretical" null for direct G × E interaction analysis. Within each of the nrep.out replicates (e.g., 100) in an outer simulation loop, we simulate a fixed E with P(E = 1) = 0.3. We then simulate nrep.in replicates (e.g., 10⁵) of (Y0, G_new), test H0: βGE = 0 based on the fitted model, and estimate the T1E rate using the nrep.in replicates. This is similar to one single whole‐genome G × E interaction scan. Finally, we average the T1E values over the nrep.out replicates to account for the sampling variation inherent in the simulation of an E for one single whole‐genome interaction scan. This can be done similarly for the "empirical" null. For each combination of parameter values in Table 1 that generates Y1 under an alternative, instead of studying power, we focused on evaluating T1E control, contrasting the proposed "empirical" null simulation designs (S1.1 and S1.2) with the previously considered "theoretical" null simulation design (S0), as described in Section 2. Table 2 provides the parameter values for gene‐based interaction analysis (i.e., interaction scenario 3). The number of SNPs was chosen to be J = 11 as in Lin et al. (2016), among which only six interact with E. That is, βGEj ≠ 0 only for some j, as detailed in Table 2. Two sets of MAF levels were considered, one representing gene‐based interaction studies of common variants and the other of rare variants.
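The double-loop design can be sketched as follows, with a fixed E per outer replicate (one "scan") and the T1E rate averaged over outer replicates. The replicate counts and effect sizes below are deliberately tiny for illustration; the paper uses nrep.in = 10⁵ and nrep.out = 100.

```python
import numpy as np
from scipy import stats

def t1e_double_loop(n=500, nrep_out=20, nrep_in=200, alpha=0.05, seed=0):
    """Double-loop T1E estimate for the direct interaction LRT under the
    'theoretical' null: E is fixed within each outer replicate, and the
    empirical T1E rate is averaged over outer replicates."""
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(nrep_out):
        e = rng.binomial(1, 0.3, size=n)           # fixed within this scan
        hits = 0
        for _ in range(nrep_in):
            y = 0.3 * e + rng.normal(size=n)       # null: beta_GE = 0
            g = rng.binomial(2, 0.4, size=n)
            x0 = np.column_stack([np.ones(n), g, e])
            x1 = np.column_stack([x0, g * e])
            rss0 = np.sum((y - x0 @ np.linalg.lstsq(x0, y, rcond=None)[0]) ** 2)
            rss1 = np.sum((y - x1 @ np.linalg.lstsq(x1, y, rcond=None)[0]) ** 2)
            stat = n * np.log(rss0 / rss1)
            hits += stats.chi2.sf(stat, 1) < alpha
        rates.append(hits / nrep_in)               # one scan's T1E rate
    return float(np.mean(rates))                   # average over scans

rate = t1e_double_loop()
```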
Table 2

Summary of the data‐generating models for direct multivariate G × E interaction testing, and evaluating the "theoretical" null simulation design S0 versus the two "empirical" null simulation designs S1.1 and S1.2, as described in Section 2.3.3

Null model for S0: Y0 = Σ_{j=1}^{J} βGj·Gj + βE·E + e, e ~ N(0, σ²)
Alternative model for S1.1 and S1.2: Y1 = Σ_{j=1}^{J} βGj·Gj + βE·E + Σ_{j=1}^{J} βGEj·(Gj × E) + e, e ~ N(0, σ²)
MAF for Gj and G_new,j:
  Large: (2.15, 2.58, 2.58, 4.16, 2.57, 2.61, 4.95, 2.58, 2.57, 2.58, 3.68) × 10⁻¹
  Small: (2.15, 2.58, 2.58, 4.16, 2.57, 2.61, 4.95, 2.58, 2.57, 2.58, 3.68) × 10⁻²
Parameters: J = 11; P(E = 1) = 0.3; βG = (0.218, 0, 0, 0.476, 0, 0, 0.151, 0.845, 0.0945, 0, 0.133); βE = 0.3; βGE = (0.1, 0.1, 0, 0, 0.1, 0.1, 0, 0.1, 0, 0.1, 0); σ² = 0.27
Sample size: n = 10³
Nominal T1E levels: α = 0.05, 0.01, 0.001
Replications: nrep.in = 10⁵; nrep.out = 100

The T1E rate is first estimated from nrep.in simulation replicates in an inner loop in which E is fixed (similar to one whole‐genome gene‐based G × E interaction scan), then averaged over nrep.out simulation replicates in an outer loop in which E varies.

MAF: minor allele frequency.


Simulation results

For each of the three G × E interaction scenarios, for each of the statistical tests under study, and for each of the three T1E simulation designs, we recorded the corresponding T1E rate. We bold the ones that fall outside the range α ± 2√(α(1 − α)/nrep.in), where α is the nominal T1E rate and nrep.in is the number of simulation replicates used to estimate the empirical T1E rate for each of the nrep.out outer replicates; each replicate in the outer loop represents a whole‐genome association scan. Because the T1E estimates are further averaged over the nrep.out replicates, this is a conservative interval. Results in Table 3 show that, while the location tests for phenotypic mean differences across genotype groups (LRTm and Scorem) are generally robust to the choice of "theoretical" null S0 versus "empirical" null S1.1 or S1.2, this is not the case for the LRTv test for variance heterogeneity; the empirical T1E rates of Levene's test were slightly deflated but not significantly. Different choices of the null simulation designs lead to different conclusions regarding the accuracy of LRTv. For example, simulation design S0 showed that LRTv has correct T1E control across the parameter values considered. However, designs S1.1 and S1.2 revealed its inflated empirical T1E rate, for example, 0.07 at the nominal level of 0.05 for some settings.
Table 3

Simulation results of interaction scenario 1: Single G, and E missing

Empirical T1E rates (×10⁻²), α = 5 × 10⁻², n = 10³:

βGE:             0.0     0.2     0.4     0.6     0.8     1.0
Location
  LRTm    S0     5.029   5.027   5.026   5.027   5.027   5.026
          S1.1   5.039   5.021   5.021   5.019   5.014   5.010
          S1.2   4.997   5.023   5.022   5.020   5.017   5.011
  Scorem  S0     5.002   4.998   4.997   4.998   4.998   4.997
          S1.1   5.014   4.992   4.993   4.991   4.985   4.981
          S1.2   4.974   4.994   4.993   4.990   4.988   4.983
Scale
  LRTv    S0     5.035   5.083   5.081   5.081   5.081   5.079
          S1.1   5.029   5.188   5.262   5.507   5.979   6.757
          S1.2   5.031   5.198   5.274   5.519   5.994   6.756
  Levene  S0     4.956   4.898   4.898   4.898   4.898   4.896
          S1.1   4.989   4.901   4.904   4.905   4.909   4.907
          S1.2   4.906   4.922   4.912   4.911   4.915   4.908

Empirical T1E rates of the LRTm and Scorem location tests for mean difference in Y across the three genotype groups, and of the LRTv and Levene's scale tests for variance difference in Y, based on the “theoretical” null design S0 and the alternative “empirical” null designs S1.1 and S1.2. The alternative data were generated using Aschard's genetic model as described in Table 1. Empirical T1E rates outside the acceptance range are bolded.

While the increased T1E rates under the S1.1 and S1.2 “empirical” null designs appeared to be mild and occurred only in extreme models (i.e., large G × E interaction effect), results in Table 4 demonstrate that the T1E issue of LRTv, revealed under the “empirical” null simulation designs S1.1 and S1.2, can be more severe at the tail. For example, at the nominal level α = 1 × 10⁻⁵, the empirical T1E rate of LRTv can be as high as 11.6 × 10⁻⁵. Because the genome‐wide significance level for GWAS is considerably more stringent (Dudbridge & Gusnanto, 2008), an inflation of false positive findings can be a real problem in practice. Further, results in Table 5 confirm that increasing the sample size (from n = 10³ to n = 10⁴) does not mitigate the discrepancy in the T1E conclusions drawn from the “theoretical” versus “empirical” null.
Table 4

Simulation results of interaction scenario 1: Single G, and E missing; effect of the nominal level α

n = 10³. (Rates are on the scale of the corresponding nominal α; e.g., 11.599 under α = 1 × 10⁻⁵ denotes 11.599 × 10⁻⁵.)

α:                5×10⁻²  1×10⁻²  1×10⁻³  1×10⁻⁵
Location
  LRTm     S0     5.016   0.999   0.985   0.988
           S1.1   5.009   1.007   0.988   0.990
           S1.2   5.011   1.009   1.033   1.013
  Scorem   S0     5.008   0.998   0.989   0.990
           S1.1   4.982   0.998   0.982   0.991
           S1.2   4.982   0.999   0.983   0.997
Scale
  LRTv     S0     5.009   1.002   1.024   1.033
           S1.1   6.923   1.636   2.059   11.599
           S1.2   6.920   1.639   2.042   11.624
  Levene   S0     4.955   0.961   0.938   0.971
           S1.1   4.964   0.978   0.932   0.952
           S1.2   4.962   0.970   0.953   0.958

Empirical T1E rates of the LRTm and Scorem location tests for mean difference in Y across the three genotype groups, and of the LRTv and Levene's scale tests for variance difference in Y, based on the “theoretical” null design S0 and the alternative “empirical” null designs S1.1 and S1.2. The alternative data were generated using Aschard's genetic model as described in Table 1, focusing on the extreme case of a large interaction effect. Empirical T1E rates outside the acceptance range are bolded.

Table 5

Simulation results of interaction scenario 1: Single G, and E missing; effect of sample size

α = 5 × 10⁻². (Empirical T1E rates ×10⁻².)

                  n = 10³   n = 10⁴
Location
  LRTm     S0     5.011     5.012
           S1.1   4.992     5.003
           S1.2   4.934     4.989
  Scorem   S0     5.011     5.012
           S1.1   4.993     5.003
           S1.2   4.934     4.989
Scale
  LRTv     S0     5.103     5.165
           S1.1   7.034     7.125
           S1.2   7.007     7.020
  Levene   S0     4.924     4.945
           S1.1   4.905     4.965
           S1.2   4.874     4.825

Empirical T1E rates of the LRTm and Scorem location tests for mean difference in Y across the three genotype groups, and of the LRTv and Levene's scale tests for variance difference in Y, based on the “theoretical” null design S0 and the alternative “empirical” null designs S1.1 and S1.2. The alternative data were generated using Cao's genetic model as described in Table 1, with two different sample sizes of n = 10³ and n = 10⁴. Empirical T1E rates outside the acceptance range are bolded.

The root cause of this discrepancy is unsuspected model misspecification. Under the S1.1 and S1.2 designs, Y is not marginally normally distributed, even though it was generated (conditional on the true G) using a normally distributed error term. Using LRTv as an example, Cao et al. (2014) showed that LRTv is accurate based on its asymptotic distribution, which is derived under the “theoretical” null S0 condition and assuming a normally distributed error. However, when LRTv is applied to the “empirical” null data, even if a normal error was used to generate Y conditional on the true G, the correct asymptotic distribution of LRTv under the “empirical” null is a weighted sum of independent χ²₁ random variables (Supporting Information Materials; Theorem 3.4.1(1) of Yanagihara, Tonda, & Matsumoto, 2005; Tonda & Wakaki, 2003; Gómez‐Sánchez‐Manzano et al., 2006; Eicker, 1967). Thus, assessing LRTv against its “theoretical” null distribution can have T1E issues if the data were generated under the “empirical” null S1.1 condition, as in real data, and similarly for S1.2.
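To see why the reference distribution matters at the tail, the following sketch compares the upper quantile of a weighted sum of independent χ²₁ variables against the corresponding χ² quantile. The two weights are hypothetical; in reality they depend on the unknown alternative model:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical weights for a two-term weighted sum of independent
# chi-square(1) variables (the true weights depend on the unknown
# alternative model that generated Y).
w = np.array([1.6, 0.9])
draws = rng.chisquare(df=1, size=(200_000, 2)) @ w

alpha = 0.01
cut_theo = stats.chi2.ppf(1 - alpha, df=2)        # "theoretical" null cutoff
cut_emp = float(np.quantile(draws, 1 - alpha))    # Monte Carlo "empirical" cutoff

print(f"chi2_2 cutoff: {cut_theo:.2f}")
print(f"weighted-sum cutoff: {cut_emp:.2f}")
print("T1E if the chi2_2 cutoff is used:", float(np.mean(draws > cut_theo)))
```

With weights whose sum exceeds the nominal degrees of freedom, the weighted-sum quantile sits further out in the tail, so testing against the χ² cutoff rejects more often than the nominal level.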
Figure 1 compares the asymptotic distribution (black solid curve) with the finite‐sample distribution (red dashed curve) of LRTv under the “empirical” null, as well as with the asymptotic distribution (blue dot‐dashed curve) of LRTv under the “theoretical” null. While the finite‐sample distribution approximates well the asymptotic distribution derived under the “empirical” null, it is clear that the distributions of LRTv differ between the “empirical” and “theoretical” null; the difference is more visible on the scale of the critical value for statistical significance (the vertical lines). The true significance threshold for LRTv under the “empirical” null is further out in the tail, as compared with the threshold under the “theoretical” null. Thus, applying LRTv to real GWAS or NGS data while using the “theoretical” significance threshold can lead to an inflated T1E rate. Deriving the correct distribution corresponding to the “empirical” null, unfortunately, requires knowledge of the alternative model, which is unknown in practice. Permutation‐based methods can provide reasonable estimates, which we discuss later.
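A permutation-based estimate of the “empirical” null critical value (the idea behind design S1.2) can be sketched as follows. Bartlett's test stands in here for a normality-sensitive variance test such as LRTv, and the model constants are illustrative only:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Alternative-model phenotype with an unobserved exposure E
# (illustrative constants, not the paper's settings).
n, maf, beta_ge = 1000, 0.3, 1.0
g = rng.binomial(2, maf, size=n)
e = rng.binomial(1, 0.5, size=n)
y = beta_ge * g * e + rng.normal(size=n)

def bartlett_stat(y, g):
    # Bartlett's variance-heterogeneity statistic across genotype groups;
    # like LRTv, its chi-square reference distribution assumes normality.
    groups = [y[g == k] for k in np.unique(g)]
    return stats.bartlett(*groups).statistic

# Design S1.2: permute Y against the fixed G to trace out the
# "empirical" null distribution of the statistic.
null_stats = np.array([bartlett_stat(rng.permutation(y), g)
                       for _ in range(2000)])

alpha = 0.05
print("permutation-based cutoff:", float(np.quantile(null_stats, 1 - alpha)))
print('"theoretical" chi2_2 cutoff:', float(stats.chi2.ppf(1 - alpha, df=2)))
```

How far the two cutoffs separate depends on how strongly the marginal distribution of Y departs from normality, which is exactly the point of Figure 1.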
Figure 1

Comparison of the asymptotic distribution (black solid) and finite‐sample distribution (red dashed) of LRTv under the “empirical” null, with the asymptotic distribution (blue dot‐dashed) of LRTv under the “theoretical” null. Vertical lines correspond to the quantile cutoffs for statistical significance.

In practice, it is routine (and recommended) to display and examine the empirical distribution of a trait under study. However, Supporting Information Figure S1 shows that even under the most extreme setting, the marginal histogram of Y appears approximately normal visually, unless a formal diagnostic test for normality is conducted. The slightly right‐skewed empirical distribution of Y is the result of mixing six conditional distributions of Y, each perfectly normally distributed conditional on the causal G and E. This is the key difference between the “theoretical” and “empirical” null simulation designs, regardless of the sample size. For a less extreme case, although both the histogram and Q–Q plot (Supporting Information Figure S2) suggest that a normal distribution is a good fit (passing the Shapiro–Wilk normality test), the T1E discrepancy between the “theoretical” and “empirical” null remains, although less severe, as shown in the corresponding column of Table 3 and in Supporting Information Figure S3 (middle row). Tables 3, 4, and 5 also provide T1E results for testing phenotypic mean (as opposed to variance) differences across the genotype groups. Although location testing for a main effect is generally quite robust to the assumption of normality, problems can arise when testing for interaction effects, beyond any apparent model misspecifications (Rao & Province, 2016). In direct testing for the interaction effect, Rao and Province (2016) used the classical “theoretical” null simulation design S0, with or without the main effect. Regardless, figures 1b,c of Rao and Province (2016) showed that the variation in the resulting genomic inflation factor λ was substantially larger when testing the interaction effect than when testing the main effect.
Their figures 1d,e also demonstrated that the variation diminishes as the sample size increases. However, we note that this observation was made before averaging across the 414 simulated interaction scans/datasets; each scan contained 20,000 SNPs from which a λ value was estimated. The results of Rao and Province (2016) are consistent with ours shown in Supporting Information Figure S6, which shows that scan‐specific estimated T1E rates are indeed variable (due to dependency between the tests) and become less so as the sample size increases; there were 100 G × E whole‐genome interaction scans, with nrep.in SNPs in each scan and Y being fixed within a scan. However, it is important to note that the average T1E rate across the nrep.out simulated scans better reflects the long‐run behavior of a method. Indeed, results in Table 6 show that the T1E rate of testing the interaction effect, estimated from (nrep.in × nrep.out) simulated replicates, is well controlled under the conventional “theoretical” null simulation design S0. However, this is not the case under the “empirical” null simulation designs S1.1 or S1.2. Similar to the scale test for variance heterogeneity, the discrepancy in T1E control between the two types of null simulation designs becomes more prominent at the tail and persists as the sample size increases (Table 6).
Table 6

Simulation results of interaction scenario 2: Single G, and E known

(Rates are on the scale of the corresponding nominal α.)

                    n = 10³                   n = 10⁴
α:                  5×10⁻²  1×10⁻²  1×10⁻³   5×10⁻²  1×10⁻²  1×10⁻³
LRTGE      S0       5.034   1.017   1.029    5.033   1.007   0.979
           S1.1     7.091   1.763   2.422    6.886   1.709   2.410
           S1.2     7.100   1.771   2.389    6.972   1.707   2.339
ScoreGE    S0       4.982   1.003   1.002    5.021   1.004   0.972
           S1.1     7.028   1.738   2.363    6.874   1.705   2.400
           S1.2     7.040   1.747   2.339    6.960   1.703   2.326

Empirical T1E rates of the LRTGE and ScoreGE tests, based on the “theoretical” null design S0 and the alternative “empirical” null designs S1.1 and S1.2. The alternative Y data were generated using Aschard's genetic model as described in Table 2 with β = 1, but E was assumed to be known in this case so that direct interaction testing was possible. Empirical T1E rates outside the acceptance range are bolded.

Table 7 contains T1E results for gene‐based interaction studies of both common and rare variants. When the MAFs are between 0.2 and 0.5 (common), we observed that the T1E rates of all tests considered differ significantly between the “theoretical” null and the “empirical” null. This result is consistent with that of interaction scenarios 1 and 2 above. Our simulation results also confirmed that the BurdenGE interaction test has a T1E issue even under the “theoretical” null, as previously noted in Lin et al. (2016). When the MAFs are reduced 10‐fold to be between 0.02 and 0.05 (rare), although SKATGE appears to be robust, the FGE and BurdenGE tests continue to display significant differences between the “theoretical” and “empirical” null simulation designs for evaluating their T1E control.
Table 7

Simulation results of interaction scenario 3: Multiple Gs, and E known

(Rates are on the scale of the corresponding nominal α.)

                    Small MAF (rare)          Large MAF (common)
α:                  5×10⁻²  1×10⁻²  1×10⁻³   5×10⁻²  1×10⁻²  1×10⁻³
FGE        S0       4.991   0.982   0.985    4.983   1.001   0.992
           S1.1     5.448   1.130   2.363    6.905   1.591   2.043
           S1.2     5.448   1.130   1.238    6.818   1.596   1.945
SKATGE     S0       4.904   0.949   0.895    4.934   0.976   0.917
           S1.1     5.038   1.074   1.059    6.460   1.411   1.640
           S1.2     5.034   1.065   1.110    6.378   1.394   1.651
BurdenGE   S0       6.169   3.751   15.049   13.709  4.280   7.440
           S1.1     8.432   2.213   3.144    5.666   1.232   1.410
           S1.2     8.408   2.208   3.148    5.712   1.237   1.420

Empirical T1E rates of the FGE, SKATGE, and BurdenGE tests of jointly testing for multiple interaction effects, based on the “theoretical” null design S0 and the alternative “empirical” null designs S1.1 and S1.2. Models and parameter values are given in Table 2. Empirical T1E rates outside the acceptance range are bolded.

MAF: minor allele frequency.


DISCUSSION

In this study, we highlight the importance of distinguishing the “theoretical” and “empirical” null distributions, first noted by Efron (2004), in a different application context. Starting with scale tests for variance heterogeneity, we showed through simulation studies that conclusions about the T1E control of a statistical test can differ depending on the choice of null simulation design. For example, the LRTv variance test appears to be accurate under the “theoretical” null design, but it can have inflated T1E under the “empirical” null design that better mimics real data conditions (Tables 3, 4, and 5, and Supporting Information Figure S3). Cao et al. (2014) pointed out the sensitivity and limitations of LRTv when the error term is not normally distributed. However, they implicitly assumed that the test would work well as long as the error term was normal, regardless of “theoretical” or “empirical” null. In our simulation study, although all the error terms used to generate the phenotype data were normal, the increased T1E rates under the “empirical” null are, fundamentally, attributable to the sensitivity of LRTv to subtle model misspecification: the marginal distribution of the empirical outcome data is in fact not normal (Supporting Information Figures S1 and S2). Thus, tests shown to be extremely sensitive to model assumptions are particularly vulnerable when applied to real data. Because empirical data are better represented by the “empirical” null than the “theoretical” null, evaluating T1E control using an “empirical” null design can expose the true behavior of a test when applied to real whole‐genome association scans. Conversely, power calculations for LRTv using the significance threshold of the “theoretical” null distribution can be too optimistic. Indeed, this study was motivated by the observation made in Supporting Information Table S7 of Soave et al.
(2015) that, “further analysis using permutation estimation of p‐values showed that power of the LRT under asymptotic analysis was greatly inflated.” The work here provides analytical insights on why this is the case. The asymptotic power was obtained using the distribution derived under the “theoretical” null while controlling T1E at the nominal level, as in Cao et al. (2014). The permutation‐based power was obtained by using the empirical quantile cutoff of the statistic applied to permutated phenotype data; this is equivalent to controlling T1E under the “empirical” null of S1.2. Under the “empirical” null, however, we showed that LRTv follows a different distribution, and the corresponding significance quantile cutoff is further out in the tail compared with the distribution under the “theoretical” null (Figure 1). This leads to the (correct) smaller permutation‐based power. The (incorrect) higher asymptotic‐based power is a result of an increased T1E rate when the data were generated under the “empirical” null condition but the test statistic was evaluated using the distribution derived under the “theoretical” null condition. The permutation‐based S1.2 null design can estimate the true distribution of a test statistic when applied to a real data set, and identify the correct significance threshold under the “empirical” null. But permutation must be carried out carefully in practice, for example, in the presence of sample correlation (Abney, 2015). In practice, investigators often rely on visual inspection of histograms of outcome data, as illustrated in Supporting Information Figures S1 and S2. We have noted that the departure from normality does not have to be severe to affect tests such as LRTv, as observed in a scan of lung function in cystic fibrosis subjects by Soave et al. (2015). In that case, the lung phenotype, called SaKnorm, is defined as the forced expiratory volume in one second, adjusted for sex, age, height, and mortality, and normalized.
In the sample analyzed, the distribution of the phenotype indeed appeared to be normal, but the application of LRTv to permutated SaKnorm showed inflated T1E even for common SNPs: Supporting Information Figure S2.H of Soave et al. (2015) for 454,764 common SNPs, Supporting Information Figure S2.I for the remaining 111,120 lower‐MAF SNPs, and Supporting Information Figure S2.G for all 565,884 GWAS SNPs. Furthermore, for data that appear to deviate from normal, such as those displayed in Supporting Information Figure S1, the T1E issue can persist even if investigators choose to perform standard normalizing transformations. For example, consider the phenotype data simulated based on Aschard's genetic model, as described in Table 1 (Supporting Information Figure S1). After square‐root or log transformations (Goh & Yap, 2009), although the empirical marginal distribution of the phenotype data improved as expected (Supporting Information Figure S4), the T1E inflation of LRTv in fact worsened under the “empirical” null simulation designs S1.1 and S1.2 (Supporting Information Figure S5). Voorman et al. (2011) showed that spurious false positives can occur in genome‐wide scans for G × E interactions, particularly in the presence of model misspecification. Rao and Province (2016) also presented inflated or deflated genomic inflation factors in an interaction scan in which one SNP is anchored, using the conventional “theoretical” null simulation design without any apparent model misspecifications. Based on our G × E simulation studies where Y was fixed within each scan, we note that the large variation in the λ estimate demonstrated by Rao and Province (2016) corresponds to the sampling variation inherent in estimating the T1E rate from nrep.in replicates (or SNPs) within each of the nrep.out replicates (or scans). This, however, does not translate into a T1E issue under the classical frequentist interpretation.
Results in Table 6 show that, under the “theoretical” null of S0, there is no T1E issue in G × E interaction studies even if Y was fixed within a genome‐wide scan. But, similar to the scale test of variance, the T1E results differ under the “empirical” S1.1 or S1.2 null simulation designs. Additional theoretical insights are provided in Section 3 of the Supporting Information Materials. Departure from normality is generally weakened under multivariate settings; however, the topic addressed here remains relevant. To demonstrate this, in addition to studying the interaction between E and the G of a single SNP, we also examined testing for interaction effects between E and multiple Gs, as in gene‐based interaction studies. We reached the same conclusion: T1E conclusions can differ between the “theoretical” and “empirical” null simulation designs. To conclude, although we only presented three examples (i.e., scale tests for variance heterogeneity, and direct tests for one or multiple interaction effects jointly), the findings here have important implications for future evaluation of T1E control and its interpretation. Newer tests often go beyond the first moment, such as the scale tests studied here, and they are increasingly complex and possibly more sensitive to subtle model misspecifications. The conventional “theoretical” null simulation design (S0) is unrealistic and can lead to misleading conclusions regarding T1E control, which in turn affects power studies. The alternative “empirical” null simulation designs (S1.1 and S1.2) can reveal the true behavior of a test when applied to real data.
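The subtle misspecification at the heart of these results — a marginal outcome that is a mixture of perfectly normal conditional distributions — can be illustrated as follows. All parameter values are hypothetical; the six cells come from crossing three genotype groups with a binary exposure:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Y is exactly normal within each of the 6 (G, E) cells, yet the
# marginal is a 6-component normal mixture (illustrative constants).
n, maf = 5000, 0.3
beta_g, beta_e, beta_ge = 0.2, 0.3, 1.0
g = rng.binomial(2, maf, size=n)        # 3 genotype groups
e = rng.binomial(1, 0.5, size=n)        # 2 exposure groups
y = beta_g * g + beta_e * e + beta_ge * g * e + rng.normal(size=n)

# The mixture of shifted normals induces a mild right skew that a
# histogram may not show but a formal diagnostic can flag.
print("marginal skewness:", round(float(stats.skew(y)), 3))
print("Shapiro-Wilk p-value:", float(stats.shapiro(y[:2000]).pvalue))
```

Regenerating Y from a single normal distribution (design S0) removes the mixture structure entirely, which is precisely why S0 can miss the T1E behavior that S1.1 and S1.2 expose.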
REFERENCES

1. Alkes L Price, Gregory V Kryukov, Paul I W de Bakker, Shaun M Purcell, Jeff Staples, Lee-Jen Wei, Shamil R Sunyaev. Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet, 2010.

2. Xiangqing Sun, Robert Elston, Nathan Morris, Xiaofeng Zhu. What is the significance of difference in phenotypic variability across SNP genotypes? Am J Hum Genet, 2013.

3. David Soave, Lei Sun. A generalized Levene's scale test for variance heterogeneity in the presence of sample correlation and group uncertainty. Biometrics, 2017.

4. Guillaume Paré, Nancy R Cook, Paul M Ridker, Daniel I Chasman. On the use of variance per genotype as a tool to identify quantitative trait interaction effects: a report from the Women's Genome Health Study. PLoS Genet, 2010.

5. Maksim V Struchalin, Abbas Dehghan, Jacqueline CM Witteman, Cornelia van Duijn, Yurii S Aulchenko. Variance heterogeneity analysis for detection of potentially interacting genetic loci: method and its limitations. BMC Genet, 2010.

6. Xinyi Lin, Seunggeun Lee, David C Christiani, Xihong Lin. Test for interactions between a genetic marker set and environment in generalized linear models. Biostatistics, 2013.

7. Hugues Aschard, Noah Zaitlen, Rulla M Tamimi, Sara Lindström, Peter Kraft. A nonparametric test to detect quantitative trait loci where the phenotypic distribution differs by genotypes. Genet Epidemiol, 2013.

8. Arend Voorman, Thomas Lumley, Barbara McKnight, Kenneth Rice. Behavior of QQ-plots and genomic control in studies of gene-environment interaction. PLoS One, 2011.

9. David Soave, Harriet Corvol, Naim Panjwani, Jiafen Gong, Weili Li, Pierre-Yves Boëlle, Peter R Durie, Andrew D Paterson, Johanna M Rommens, Lisa J Strug, Lei Sun. A Joint Location-Scale Test Improves Power to Detect Associated SNPs, Gene Sets, and Pathways. Am J Hum Genet, 2015.