Literature DB >> 34849749

Smooth-threshold multivariate genetic prediction incorporating gene-environment interactions.

Abstract

We propose a genetic prediction modeling approach for genome-wide association study (GWAS) data that can include not only marginal gene effects but also gene-environment (GxE) interaction effects-i.e., multiplicative effects of environmental factors with genes rather than merely additive effects of each. The proposed approach is a straightforward extension of our previous multiple regression-based method, STMGP (smooth-threshold multivariate genetic prediction), with the new feature being that genome-wide test statistics from a GxE interaction analysis are used to weight the corresponding variants. We develop a simple univariate regression approximation to the GxE interaction effect that allows a direct fit of the STMGP framework without modification. The sparse nature of our model automatically removes irrelevant predictors (including variants and GxE combinations), and the model is able to simultaneously incorporate multiple environmental variables. Simulation studies to evaluate the proposed method in comparison with other modeling approaches demonstrate its superior performance under the presence of GxE interaction effects. We illustrate the usefulness of our prediction model through application to real GWAS data from the Alzheimer's Disease Neuroimaging Initiative (ADNI).

Entities: Chemical

Keywords: genetic prediction; gene–environment interaction; smooth thresholding

Mesh：

Year: 2021 PMID： 34849749 PMCID： PMC8664495 DOI： 10.1093/g3journal/jkab278

Source DB: PubMed Journal: G3 (Bethesda) ISSN： 2160-1836 Impact factor: 3.154

Introduction

Although discovery of genetic risk factors for disease is an important goal of genome-wide association studies (GWAS), predicting disease development or related traits is an important task for applying GWAS results in precision medicine. Many researchers have explored algorithms for accurate genetic prediction based on GWAS data with a large number of single-nucleotide polymorphisms (SNPs) (Evans ; Purcell ; Yang ; Chatterjee ; de Los Campos ; Dudbridge 2013; Makowsky ; Maier ; Moser ; Vilhjálmsson ; Privé ), but no model has been found that performs universally well with all data, and performance is highly dependent on the data-generating mechanism (Cherlin ). Popular models are linear in the variants (or SNPs), such as Purcell’s gene score (Purcell ) and genomic best linear unbiased prediction (BLUP) (Yang ). As an alternative, we developed a statistical method for genetic prediction modeling called smooth-threshold multivariate genetic prediction (STMGP) (Ueki and Tamiya 2016), and Takahashi recently demonstrated that the performance of STMGP was superior to that of other genetic prediction methods for predicting status of depression with actual GWAS data. STMGP is a sparse modeling method based on a multiple linear regression model such as the lasso (Tibshirani 1996) or the elastic net (Zou and Hastie 2005), and it is able to account for the ultrahigh dimensionality of the situation by filtering variants based on the corresponding marginal-effect P-values calculated from univariate regressions arising from a genome-wide scan. Sparseness is achieved by ignoring irrelevant variants; the corresponding regression coefficient estimates are set to zero as a result of shrinkage based on the strength of the marginal effect through the smooth-threshold estimating equations developed by Ueki (2009). STMGP also automatically tunes the prediction model by a C-type model selection criterion (as with the Akaike information criterion (Akaike 1973)), where the tuning parameter corresponds to the cutoff or threshold value for the marginal P-values that determines which effects to filter. The proposed C-type criterion based on Stein’s unbiased risk estimation (SURE, Stein 1981; Ye 1998; Efron 2004) has a closed-form expression and is a computationally efficient alternative to cross-validation that is often used to choose a P-value cutoff in the genetic prediction context (Purcell ; Warren ). Recent advances in data platforms now make it possible to integrate feature variables other than variants, such as those associated with lifestyle, clinical variables, imaging, etc. The simplest integration is to enter everything as an additive term in a multiple linear regression model as implemented in Ueki and Tamiya (2016) and Takahashi . While such an additive modeling approach is simple and straightforward, there may be cases where other approaches are more appropriate. One example is gene–environment (GxE) interaction, which has received attention recently as one potential candidate to unveil the missing heritability problem (Maher 2008; Manolio ; Manolio 2013). With GxE interaction, the model to be estimated is no longer simply additive; rather, it involves terms that are multiplicative in the covariates. Many investigations have aimed at discovering genetic factors that contribute to GxE interactions in disease risk (Kraft ; Kooperberg and LeBlanc 2008; Hamza ; Ober and Vercelli 2011; Aschard ; Sung ; Kraft and Aschard 2015, 2015; Sung ; Gauderman ; Khoury 2017; McAllister ; Ritchie ; Moore ; Osazuwa-Peters ): the approach using GWAS data is sometimes called a genome-wide environment interaction study (GWEIS) (Meijsen ; Arnau-Soler ; Ueki ). The need for GxE interactions depends on the data and target traits, but as with variant discovery, it would be beneficial to have a model for genetic prediction also that can incorporate GxE interactions (Aschard 2016). However, currently the number of such studies is very limited, especially with respect to human disease prediction. To address this issue, we present a straightforward extension of our STMGP method to allow incorporation of GxE interaction effects for building a genetic prediction model using large-scale genome-wide SNP data in conjunction with environmental variables. The proposed method can incorporate multiple environmental variables. The STMGP method requires as input the marginal association P-values from univariate regression models for each individual variant. This requirement implies that GxE interaction can be fit directly in the STMGP framework if it is expressed in a univariate regression model. The standard univariate GxE interaction model for variant j in n samples is where . This model contains three terms: E, G, and . Here, y is the response variable, μ is the conditional mean of y, E is the environmental variable, G is the jth variant (), p is the number of all variants, ϵ is the error variable, and , and are the corresponding regression coefficients. In general, removing either E or G will change the regression coefficient estimate of the GxE interaction term (see Appendix for additional discussion). In this sense, the three terms—E, G, and —are considered one set, meaning that the GxE interaction effects cannot be represented by a univariate model. To overcome this issue, we propose a simple approximation by a univariate regression model (the rationale is given in the “Materials and Methods” section), in which is the centered value of E, i.e. with the sample mean of . In words, is simply removed from the standard model and is used instead of E. As a result of this approximation, a one-to-one correspondence is made between the regression coefficient and the single predictor variable . Thus, the STMGP method can now incorporate the GxE interaction directly.

Materials and Methods

We use vector and matrix notation. Let , , and (). We first briefly explain the STMGP framework (Ueki and Tamiya 2016), then we present our proposed approach.

STMGP framework

Consider the linear multiple regression model, , where , is the error vector, X is an nxp-dimensional design matrix, and β is the corresponding vector of p regression coefficients. In application to GWAS data without GxE interactions, we set . Note that p is much larger than n in typical GWAS data—i.e., . Sparse modeling in which some of the regression coefficients are set to zero is often used in GWAS (Hoggart ; Ayers and Cordell 2010; Abraham ; Lello ; Privé ). If disease-susceptibility SNPs show relatively large marginal signals, marginal association screening effectively reduces the dimensionality. The polygenic score, including the gene score method (Purcell ) and its multivariate generalization (Warren ), uses upper-ranked SNPs with marginal association as predictors to build the prediction model. The former uses independent SNPs after pruning on the basis of LD (linkage disequilibrium), which means that LD is not modeled. The STMGP method (Ueki and Tamiya 2016) is a variant of the multivariate gene score method (Warren ), which is essentially the multiple regression model for the upper-ranked SNPs, and it accounts for correlations among SNPs by not including LD-based pruning. Let denote a test statistic for marginal association that takes a nonnegative value. Examples of include the squared Pearson’s correlation and the F statistic. Let be a cutoff value for defining inclusion of SNPs. The cutoff value t corresponds to a quantile of the null distribution of , as in hypothesis testing. The linear multiple regression after marginal association screening uses X satisfying in the model. Without loss of generality, assume that a large value of indicates stronger marginal association. Multiple regression after marginal association screening can be expressed by where and indicates the complement set of . Note that the above procedure is similar to sure independence screening (Fan and Lv 2008), which uses predictor variables that are upper-ranked in marginal association analyses. The procedure (1) is feasible for data and is useful in building a predictive model. In view of the normal equations, it can be seen that in (1) satisfies, for , or, in vector form, where , where denotes the indicator function, , and I is the p-dimensional identity matrix. Obviously, for , and (2) reduces to , i.e., a sparse solution; for , and the above normal equations reduce to because . These are the normal equations for an ordinary least squares regression with design matrix . The resulting prediction process forms , which is discontinuous in y due to the thresholding induced by . The main innovative idea in STMGP is to replace the discontinuous thresholding in (2) with a smooth thresholding using the smooth-threshold estimating equations proposed by Ueki (2009). Following Ueki (2009), is replaced by an adaptive lasso smooth-thresholding function where is a tuning parameter. This smooth-thresholding function is chosen so as to be identical to the adaptive lasso estimator under the simplest least squares regression of (Ueki 2009). If (or ), , producing a zero-valued regression coefficient; if (or ), producing a nonzero regression coefficient. Therefore, the condition for a sparse solution with is the same as that with . Note that is monotonically decreasing in , so regression coefficients having large are penalized to a lesser extent than those having small . For a given screening cutoff value , which gives a SNP set , the estimates of the p regression coefficients are where is the cardinality of . The non-negative tuning parameters γ and τ are set to 1 and , respectively, following previous studies (Ueki and Tamiya 2016; Takahashi ), and is a small constant to avoid singularity of . The corresponding prediction of y is then , where is an adaptive lasso smooth-thresholding function defined as . Since if and only if , the screened set with is the same as that with . It can be seen that replaces the discontinuous screening process by a continuous function. As a result, turns out to be continuous in y, enabling stable model selection (Breiman 1996). According to Ueki (2009) and Ueki and Tamiya (2016), the regression coefficients for the screened set in (4) can equivalently be considered as the solution of the generalized ridge regression with loss , in which . The ridge weight for each predictor variable, W, represents the uncertainty of the marginal association screening. If the marginal association is very weak, and W is large, and the corresponding regression coefficient is strongly shrunken toward zero. If the marginal association is strong, and , and the corresponding regression coefficient is less penalized. Continuity due to the smooth thresholding also allows computation of a C-type model selection criterion using SURE. The C-type criterion enables a computationally efficient choice of optimal P-value cutoff from the perspective of model selection. Details are provided in the Supplementary Material of Ueki and Tamiya (2016). We now outline the STMGP algorithm for .

Outline of the STMGP algorithm

Step 1. Perform single-SNP association analysis for p SNPs with a univariate model for each SNP. Step 2. Retain SNPs whose single-SNP association P-value is less than . Step 3. Fix γ = 1 and , and select an optimal α from candidate values in by minimizing the C-type criterion: Step 4. Compute in (4) by using the selected α in Step 3. Here, denotes the predicted value for the ith subject at the P-value threshold α corresponding to the test statistic threshold t; is the maximum P-value in the search, which is set to make the expected number of screened SNPs to be on the order of n in practice; is an error variance estimate; and denotes the generalized degrees of freedom (Ye 1998; Efron 2004). The univariate model for the jth variant G () in Step 1 is Step 3 outputs estimates of regression coefficients, , for the intercept and each variant, which allows computation of the prediction model in an additive form. Some of the regression coefficients can be exactly zero (i.e., sparsity). The predicted value for a new individual who has variants can be calculated as . The above method assumes a linear regression model for a quantitative phenotype. For a binary phenotype, a logistic regression model is used.

Incorporating GxE interactions with univariate regression approximation

In what follows, we describe our procedure to incorporate GxE interactions into the STMGP framework. Consider the standard GxE interaction model for the jth variant G and an environmental variable E, where denotes the Hadamard product—i.e., the ith element of is given by . As seen in Steps 1 and 2 of the STMGP algorithm, because the STMGP framework requires input of multiple predictors that pass a marginal association P-value threshold from each univariate regression model, the above GxE interaction model does not directly fit the STMGP framework due to there being two regression coefficients— and —that associate with G. For example, if is highly significant but is not, it is uncertain whether we may include only G, because differs from the regression coefficient of G in the univariate regression model without interaction term . In contrast, if is highly significant but is not, then it is unclear whether we need only, for the same reason. Furthermore, including both and G might reduce predictive power by increasing the number of predictors included: in other words, the curse of dimensionality. We propose a simple approximation to the above GxE interaction model by using a univariate regression model to eliminate these complications. To this end, we assume independence between E and each G. Such assumption is sometimes made in the literature on GxE interaction (Chatterjee and Carroll 2005; Mukherjee and Chatterjee 2007), and it is reasonable for many real GWAS data as the majority of variants have small marginal effects on environmental factors. Our proposed method (the main result) is simply to use the following univariate regression model instead of (6): in which is the centered E as defined previously. In the Appendix we show that, under independence between G, the least squares estimate of the regression coefficient of in (6) is approximated by that of in (7). This implies a one-to-one correspondence between the effects of the regression coefficient of in (6) and that of the single predictor . As a consequence, the STMGP framework can be directly applied by setting the following design matrix with 2p predictors: If we have m environmental variables, , we may set which has predictors. To implement this proposal, we simply include an additional procedure into Steps 1 and 2 above. The following is the modification to include m environmental variables. Steps 1 and 2 of STMGP algorithm modified to incorporate GxE interactions with m environmental variables Step 1’: Perform single-SNP association analysis for each of the p SNPs with a univariate model for each variant, and perform SNPx interaction analysis for each of the p SNPs and with the model (7) (), where with the sample mean of E. Step 2’: Screen (retain) SNPs on the basis of single-SNP association P-values, and screen SNP–environmental variable pairs on the basis of SNPx interaction P-values () at . The above steps are easily performed with PLINK (Purcell ; Chang ), as follows. Prepare the centered environmental variable in a covariate file, say environment.cov. Then, the PLINK command option is --linear --covar environment.cov --interaction --parameters 1,2,3 --tests 1,3. It is also possible to include additional covariates. We have implemented the above algorithm in our STMGP package. We have also implemented a prediction model for binary traits with a logistic regression model based on the method developed in Ueki and Tamiya (2016).

Simulation study

To examine the performance of the proposed method, we conducted simulation studies based on real SNP-GWAS data analogous to those of Takahashi . We used an ADNI-GWAS dataset obtained from the publicly available Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership led by Principal Investigator Michael W. Weiner, MD. The goal of the ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography, other biological markers, and clinical and neuropsychological assessments can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD). For up-to-date information, see www.adni-info.org (Accessed: 2021 August 31). The ADNI is an ongoing, longitudinal study with the primary purpose being to explore the association of genetic and neuroimaging information with late-onset Alzheimer’s disease. The study investigators recruited subjects older than 65 years of age comprising about 400 subjects with MCI, about 200 subjects with AD, and about 200 healthy controls. Each subject was followed for at least 3 years. During the study period, the subjects were assessed with MRI measures and psychiatric evaluation to determine the diagnostic status at each time point. The ADNI-GWAS data were obtained from 818 DNA samples of ADNI1 participants by using the Illumina Human 610-Quad genotyping array (Shen ). The data initially included 620,901 SNPs. We included the apolipoprotein E (APOE) SNPs rs429358 and rs7412 in our analysis. We used data from 684 non-Hispanic Caucasian samples after we excluded one pair showing cryptic relatedness (revealed by the PLINK pairwise statistic being greater than 0.125) (Purcell ), and we excluded subjects whose reported sex did not match the sex inferred from X-chromosome SNPs. We then applied further quality control measures by excluding SNPs with missing genotype rate , Hardy–Weinberg equilibrium test P-value , and MAF ; the total number of remaining SNPs was 528,984, which is the value of P for this analysis. For the 684 individuals, given that the above real genotype data remain fixed, we artificially generated a quantitative trait, which was used as a target variable to be predicted. We also simulated two environmental variables (sex, E1, and years of education, E2) as follows. E1 was generated from a Bernoulli distribution with success probability 0.5. E2 was generated from a standard normal distribution. Both variables were standardized to have mean zero and variance 1 in the generated sample. First, we denote by p0 the number of causal variants for the main effects of genes, GxE1 effects, and GxE2 effects; note that the p0 variants of each type are not the same. The corresponding regression coefficients, (), were generated from pre-specified distributions. Specifically, the first p0 regression coefficients were generated independently and identically from a normal, NEG2 (normal–exponential–gamma with shape parameter 2), or Laplace distribution with mean zero and variance ; the second p0 regression coefficients were generated independently and identically from a normal, NEG2, or Laplace distribution with mean zero and variance ; the remaining p0 regression coefficients were generated independently and identically from a normal, NEG2, or Laplace distribution with mean zero and variance . Next, we randomly selected causal variants, , from among the p SNPs, . The first p0 variants () had a nonzero gene main effect, the second p0 variants () had a nonzero GxE interaction effect with E1, and the remaining p0 variants () had a nonzero GxE interaction effect with E2. Then, the conditional mean was set as in which , , and denote the corresponding terms standardized to have mean zero and variance one. Finally, a quantitative trait was generated as , where ϵ is an independently and identically distributed normal random variable with mean zero and variance . Note that and , and, similarly, and have mean zero and variance and , respectively. Also note that the three terms in and ϵ are mutually independent. Thus, y has mean zero and variance 1, and the triplet is referred to as heritability throughout this paper. We considered a total of eight scenarios for h2. First, we considered , and , where the first and second are scenarios with gene effect without GxE interactions, and the third and fourth are scenarios with GxE interactions only for E1. Then we considered four additional scenarios: , where the first and second are scenarios with GxE interactions both for E1 and E2, and the third and fourth are scenarios with GxE interactions only for E2. We used cross-validation to evaluate the prediction models. The data were randomly divided into two parts: 20% for training data and the remaining 80% for test data. The training dataset was used to build prediction models, and then the prediction accuracy of each model was evaluated on the basis of how well the simulated quantitative traits in the test dataset were predicted by the trained model. We used the prediction correlation coefficient (PCC) to measure the prediction accuracy. The above procedure was repeated 100 times. We note that the causal SNPs and true regression coefficients differed for each replicate. We also considered simulations for prediction of binary traits. A binary trait was generated by dichotomizing the quantitative trait on the basis of whether or not its value exceeded , in which is the standard normal quantile function. With a binary trait, the prediction accuracy of each model was evaluated by the area under the receiver operating characteristic curve (AUC).

Comparisons among prediction models

We compared the proposed extension of the STMGP method with other prediction models. We included the usual STMGP without GxE interaction as a competitor; specifically, the STMGP models compared were the STMGP without environmental variables, STMGP with environmental variable E1, STMGP with environmental variable E2, and STMGP with both environmental variables E1 and E2. We also compared the proposed STMGP extension with other prediction models based on genomic BLUP. Specifically, we considered the following four genomic BLUP models, where and are fixed effects, and and are random effects that are independently distributed as and , respectively. Similar BLUP models have been considered in previous studies (e Sousa ; Moore ). We constructed the prediction model by BLUP implemented in the BGEE package for R (Granato ) by using the BGEE function with options ite = 20000, burn = 1000, and thin = 3.

Application to prediction of real traits

We applied the proposed extension of the STMGP to the prediction of real traits. All variables were obtained from the ADNIMERGE package for R. We considered four cognitive scores as target traits for prediction: FAQ (Functional Assessment Questionnaire), CDRSB (Clinical Dementia Rating Sum of Boxes), MMSE (Mini-Mental State Examination), and ADAS11 [the 11-item ADAS-cog (Alzheimer’s Disease Assessment Scale-Cognitive Subscale)]. We used SEX and EDU (years of education) as environmental variables. We also considered two additional covariates, AGE and APOE4 genotype. The latter is a known risk allele for AD development. As with the above simulations, we evaluated prediction accuracy via cross-validation. First, we randomly divided the 684 individuals into five groups of roughly equal size. Then, one of the five groups was selected as the test set and the remaining groups were used as the training set. Consequently, by repeating this with each group in turn acting as the test set, we had five different test/training sample combinations (i.e., 5-fold cross-validation). For each of the five combinations, we built a prediction model based on the training set and predicted each trait value for the test set with the constructed prediction model. For each training set, we used 528,984 SNPs as predictors as in the above simulation studies. The prediction models we compared were STMGP with SEX as the environmental variable, STMGP with EDU as the environmental variable, and STMGP with SEX and EDU both as environmental variables. BLUP-based prediction models are (8)–(11). Since the target traits are cognitive scores, we additionally studied regression models including APOE4 genotype interaction without other variants; specifically, we considered the following models without GWAS data: Prediction accuracy was evaluated with PCC, which compares the predicted value with the actual trait in the test set.

Results

Simulation results

Results of the quantitative trait simulation are shown in Figures 1 and 2, and Supplementary Figure S1, where each cell exhibits mean PCC and the number of causal variants is , 1000, and 500, respectively.

Figure 1

Figure 2

Quantitative trait simulation with . Average predictive correlation coefficient (PCC) for eight models. See Figure 1 for explanation of scenarios (shown in rows).

Quantitative trait simulation with . Average predictive correlation coefficient (PCC) for eight models. For each scenario (shown in rows), high values are highlighted in red and low values in white. s: STMGP with E1 and E2 as covariates; sge1: STMGP with E1 and E2 as covariates and E1 as environmental variable for GxE interaction; sge2: STMGP with E1 and E2 as covariates and E2 as environmental variable for GxE interaction; sge12: STMGP with E1 and E2 as covariates, and E1 and E2 as environmental variables for GxE interaction; bg: BLUP with E1 and E2 as covariates; bge1: BLUP with E1 and E2 as covariates and E1 as environmental variable for GxE interaction; bge2: BLUP with E1 and E2 as covariates and E2 as environmental variable for GxE interaction; bge12: BLUP with E1 and E2 as covariates, and average of E1 and E2 as environmental variable for GxE interaction. Scenarios are denoted as _ dist, where dist means effect size distribution: Normal, NEG2, or Laplace. Quantitative trait simulation with . Average predictive correlation coefficient (PCC) for eight models. See Figure 1 for explanation of scenarios (shown in rows). The first and second scenarios for h2, and , are those with gene effects but no GxE interactions. From Figures 1 and 2, and Supplementary Figure S1, all methods showed a higher predictive power in the latter scenario than in the former scenario due to the larger heritability. The four STMGP methods resulted in comparable predictive power, implying that the inclusion of GxE interactions had virtually no effect on predictive power, which is a reasonable result because no GxE interaction effects were assumed in the data-generating model. The BLUP models had lower predictive power than the STMGP methods, which is also reasonable because only a small proportion of variants was assumed to be causal and the BLUP models do not carry out variable selection. Indeed, by comparing Figures 1 and 2 and Supplementary Figure S1, it can be seen that an increase in the number of causal variants made the difference between the STMGP and BLUP methods smaller. The difference in effect size distribution had a non-negligible impact on predictive power. While the BLUP methods assume a normal distribution, the STMGP methods do not rely on the effect size distribution, and the STMGP methods had much higher predictive power than the BLUP methods, in particular, when the effect size distribution was non-normal. The difference between the STMGP and BLUP methods was pronounced under the NEG2 distribution, which has the heaviest tails among the three effect-size distributions compared. A similar result was observed in the simulation studies of Takahashi . The third and fourth scenarios for h2, and , are those with GxE interactions only for E1. As in the scenarios for and , all prediction models gave higher predictive power in the latter scenario than in the former scenario. Unlike the scenarios with no GxE interactions and , the STMGP methods incorporating GxE interaction effects had higher predictive power than the STMGP method without GxE interactions. For example, in scenario under a normal effect-size distribution, the STMGP without GxE interaction produced mean PCC 0.36 (standard deviation 0.26), while the STMGP with GxE interaction on variable E1 resulted in mean PCC 0.41 (standard deviation 0.22). On the other hand, the STMGP with GxE interaction on variable E2 resulted in mean PCC 0.37 (standard deviation 0.26), which is comparable with STMGP without GxE interaction. This is reasonable since no GxE interaction effect on variable E2 was assumed. The STMGP with GxE interaction on both E1 and E2 gave mean PCC 0.41 (standard deviation 0.23), a predictive power comparable to that of STMGP with GxE interaction on variable E1. Total heritability and the difference in effect size distribution had a similar impact on predictive power in scenarios and . For and the larger heritability scenario, , or under the NEG2 distribution, STMGP with GxE interaction on variable E1 tended to produce higher predictive power than the BLUP methods, which is perhaps due to the fact that only a small proportion of variants was assumed to be causal. In the other cases among the third and fourth scenarios (any distribution with other than and , or and NEG2 with any heritability [ or ]), the STMGP methods did not always perform better than the BLUP methods. Results of the additional four scenarios are shown in Supplementary Figures S3–S5. The first and second scenarios for h2, and , are the scenarios with GxE interactions both for E1 and E2. Unlike the scenarios and , all three STMGP methods with GxE interaction had comparably higher predictive power than STGMP without GxE interaction. This is reasonable as GxE interaction was assumed for both variables, E1 and E2. The third and fourth scenarios for h2, and , are those with GxE interactions only for E2. The results were similar to those for and , in which the role of E2 was replaced by E1. Results of the binary trait simulation are shown in Figures 3 and 4, and Supplementary Figure S2, in which each cell exhibits the mean AUC. The results were consistent overall with the results of the quantitative trait simulation, but differences in predictive power between methods were smaller than with the quantitative trait simulation.

Figure 3

Binary trait simulation with . Average area under the ROC curve (AUC) is shown for eight models. For each scenario (in rows), high values are highlighted in red and low values in white. s: STMGP with E1 and E2 as covariates; sge1: STMGP with E1 and E2 as covariates and E1 as environmental variable for GxE interaction; sge2: STMGP with E1 and E2 as covariates and E2 as environmental variable for GxE interaction; sge12: STMGP with E1 and E2 as covariates, and E1 and E2 as environmental variables for GxE interaction; bg: BLUP with E1 and E2 as covariates; bge1: BLUP with E1 and E2 as covariates and E1 as environmental variable for GxE interaction; bge2: BLUP with E1 and E2 as covariates and E2 as environmental variable for GxE interaction; bge12: BLUP with E1 and E2 as covariates, and average of E1 and E2 as environmental variable for GxE interaction. Scenarios are denoted as _ dist, where dist means effect size distribution: Normal, NEG2, or Laplace.

Figure 4

Binary trait simulation with . Average area under the ROC curve (AUC) for eight models. See Figure 3 for explanation of scenarios (shown in rows).

Prediction of real quantitative trait

Results of predicting the four cognitive scores—FAQ, CDRSB, MMSE, and ADAS11—as target traits are shown in Table 1, which convey the five PCCs from 5-fold cross-validation. Generally, the prediction accuracy differed across the four traits. By comparing l0 with l, lge1, lge2, and lge12, which correspond to formulae (12)–(16), we see that inclusion of the APOE4 genotype (without genome-wide variants) gave much higher predictive power. However, the observed comparable prediction ability among models l, lge1, lge2, and lge12 implies that the inclusion of an interaction between APOE4 and either SEX or EDU did not impact predictive power. The BLUP methods, s, sge1, sge2, and sge12, resulted in performance that was comparable to those of l, lge1, lge2, and lge12, and did not show any extremely distinctive behavior. Similarly, the STMGP methods did not behave much differently from the other methods, but STMGP with a GxE interaction with EDU (sge2) tended to show slightly higher predictive power and improved upon the STMGP without GxE interaction. In particular, for prediction of FAQ, STMGP with a GxE interaction with EDU (sge2) gave the highest mean PCC (0.22; standard deviation 0.07) among the methods. However, the differences among models were small: for example, the second best mean PCC was 0.21 for l, lge1, lge2, bg, bge1, bge2, and the mean PCC for the STMGP without GxE interaction was is 0.20 with standard deviation 0.08. On the other hand, the STMGPs with GxE interaction with SEX (sge1) or with both SEX and EDU (sge12) produced lower or more variable prediction results.

Table 1

Results of predicting four quantitative traits, FAQ, CDRSB, MMSE, and ADAS11

Trait^a	Data^b	l0^c	l^d	lge1^e	lge2^f	lge12^g	s^h	sge1ⁱ	sge2^j	sge12^k	bg^l	bge1^m	bge2ⁿ	bge12^o
FAQ	CV 1	0.07	0.16	0.15	0.17	0.16	0.11	–0.01	0.15	0.05	0.14	0.13	0.13	0.12
	CV 2	0.17	0.35	0.33	0.36	0.34	0.26	0.24	0.32	0.31	0.32	0.35	0.33	0.33
	CV 3	0.19	0.15	0.15	0.16	0.16	0.19	0.13	0.21	0.15	0.17	0.15	0.18	0.17
	CV 4	0.01	0.26	0.26	0.27	0.27	0.31	0.18	0.24	0.19	0.23	0.28	0.25	0.23
	CV 5	0.08	0.16	0.16	0.10	0.09	0.15	0.14	0.17	0.15	0.17	0.14	0.17	0.15
	Mean	0.10	0.21	0.21	0.21	0.20	0.20	0.14	0.22	0.17	0.21	0.21	0.21	0.20
	SD	0.08	0.09	0.08	0.10	0.10	0.08	0.09	0.07	0.09	0.07	0.10	0.08	0.09
CDRSB	CV 1	0.07	0.13	0.13	0.12	0.12	0.21	0.18	0.22	0.17	0.12	0.13	0.10	0.11
	CV 2	0.16	0.38	0.37	0.36	0.35	0.33	0.28	0.33	0.30	0.34	0.36	0.34	0.33
	CV 3	0.22	0.26	0.26	0.26	0.26	0.28	0.26	0.26	0.25	0.25	0.25	0.26	0.27
	CV 4	0.10	0.37	0.37	0.37	0.37	0.44	0.36	0.41	0.31	0.36	0.39	0.37	0.36
	CV 5	0.19	0.27	0.26	0.25	0.22	0.27	0.25	0.28	0.27	0.27	0.25	0.27	0.27
	Mean	0.15	0.28	0.27	0.27	0.26	0.31	0.27	0.30	0.26	0.27	0.27	0.27	0.27
	SD	0.06	0.10	0.10	0.10	0.10	0.08	0.06	0.07	0.06	0.09	0.10	0.10	0.10
MMSE	CV 1	0.10	0.27	0.25	0.26	0.25	0.13	0.21	0.18	0.16	0.22	0.23	0.23	0.22
	CV 2	0.19	0.34	0.33	0.33	0.32	0.30	0.33	0.33	0.33	0.29	0.30	0.31	0.30
	CV 3	0.30	0.35	0.35	0.35	0.35	0.28	0.26	0.34	0.35	0.37	0.38	0.36	0.36
	CV 4	0.27	0.35	0.35	0.35	0.36	0.35	0.34	0.39	0.37	0.36	0.37	0.36	0.37
	CV 5	0.17	0.28	0.26	0.28	0.25	0.25	0.23	0.26	0.22	0.29	0.28	0.29	0.27
	Mean	0.21	0.32	0.31	0.31	0.31	0.26	0.27	0.30	0.29	0.31	0.31	0.31	0.30
	SD	0.08	0.04	0.05	0.04	0.05	0.08	0.06	0.08	0.09	0.06	0.06	0.05	0.06
ADAS11	CV 1	0.12	0.31	0.32	0.30	0.31	0.30	0.28	0.29	0.26	0.29	0.29	0.28	0.27
	CV 2	0.17	0.30	0.30	0.30	0.30	0.22	0.23	0.24	0.22	0.28	0.27	0.28	0.29
	CV 3	0.15	0.29	0.30	0.29	0.30	0.22	0.26	0.24	0.26	0.29	0.29	0.29	0.29
	CV 4	0.11	0.36	0.36	0.35	0.35	0.29	0.29	0.37	0.29	0.37	0.38	0.35	0.36
	CV 5	0.22	0.34	0.32	0.33	0.32	0.30	0.28	0.34	0.24	0.33	0.31	0.32	0.31
	Mean	0.15	0.32	0.32	0.31	0.32	0.27	0.27	0.30	0.25	0.31	0.31	0.30	0.30
	SD	0.04	0.03	0.03	0.03	0.02	0.04	0.02	0.06	0.03	0.04	0.04	0.03	0.03

Prediction of each target trait is evaluated by the prediction correlation coefficient (PCC) from 5-fold cross-validation.

Data used to calculate PCC (CV 1–CV 5 denote each cross-validated dataset from 5-fold cross-validation) for each model are shown in row together with mean and standard deviation (SD).

Linear regression with SEX and EDU as predictors.

Linear regression with SEX, EDU, AGE, and APOE4 as predictors.

Linear regression with SEX, EDU, AGE, APOE4, and APOE4xSEX as predictors.

Linear regression with SEX, EDU, AGE, APOE4, and APOE4xEDU as predictors.

Linear regression with SEX, EDU, AGE, APOE4, APOE4xSEX, and APOE4xEDU as predictors.

STMGP with SEX, EDU, AGE, and APOE4 as covariates.

STMGP with SEX, EDU, AGE, and APOE4 as covariates, and SEX as environmental variable for GxE interaction.

STMGP with SEX, EDU, AGE, and APOE4 as covariates, and EDU as environmental variable for GxE interaction.

STMGP with SEX, EDU, AGE, and APOE4 as covariates, and AGE and EDU as environmental variables for GxE interaction.

BLUP with SEX, EDU, AGE, and APOE4 as covariates.

BLUP with SEX, EDU, AGE, and APOE4 as covariates, and SEX as environmental variable for GxE interaction.

BLUP with SEX, EDU, AGE, and APOE4 as covariates, and EDU as environmental variable for GxE interaction.

BLUP with SEX, EDU, AGE, and APOE4 as covariates, and average of AGE and EDU as environmental variable for GxE interaction.

Results of predicting four quantitative traits, FAQ, CDRSB, MMSE, and ADAS11 Prediction of each target trait is evaluated by the prediction correlation coefficient (PCC) from 5-fold cross-validation. Data used to calculate PCC (CV 1–CV 5 denote each cross-validated dataset from 5-fold cross-validation) for each model are shown in row together with mean and standard deviation (SD). Linear regression with SEX and EDU as predictors. Linear regression with SEX, EDU, AGE, and APOE4 as predictors. Linear regression with SEX, EDU, AGE, APOE4, and APOE4xSEX as predictors. Linear regression with SEX, EDU, AGE, APOE4, and APOE4xEDU as predictors. Linear regression with SEX, EDU, AGE, APOE4, APOE4xSEX, and APOE4xEDU as predictors. STMGP with SEX, EDU, AGE, and APOE4 as covariates. STMGP with SEX, EDU, AGE, and APOE4 as covariates, and SEX as environmental variable for GxE interaction. STMGP with SEX, EDU, AGE, and APOE4 as covariates, and EDU as environmental variable for GxE interaction. STMGP with SEX, EDU, AGE, and APOE4 as covariates, and AGE and EDU as environmental variables for GxE interaction. BLUP with SEX, EDU, AGE, and APOE4 as covariates. BLUP with SEX, EDU, AGE, and APOE4 as covariates, and SEX as environmental variable for GxE interaction. BLUP with SEX, EDU, AGE, and APOE4 as covariates, and EDU as environmental variable for GxE interaction. BLUP with SEX, EDU, AGE, and APOE4 as covariates, and average of AGE and EDU as environmental variable for GxE interaction. The above results indicate the possibility that incorporating GxE interactions leads to improved predictive performance. Of course, whether the predictive performance is improved or not depends on the choice of environmental variable, which was also observed in the simulation studies. Finally, we checked the validity of the proposed univariate regression approximation in the real data application. Supplementary Figures S9–S16 show the accuracy of the proposed approximation, where each figure gives a scatter plot matrix of P-values associated with the GxE interaction term from models (6) and (7) with environmental variables either centered or not. Since centering of environmental variable E does not change the model (6), we only compared three P-values: model (6), model (7) with centered E, and model (7) with non-centered E. Among the figures, Supplementary Figures S9, S11, S13, and S15 show the P-values associated with GxE interaction for SEX as the environmental variable, and Supplementary Figures S10, S12, S14, and S16 show the P-values associated with GxE interaction for EDU as the environmental variable. In all figures, the P-values for the GxE interaction term in the approximate univariate regression (i.e., with no gene main effect) using a centered environmental variable were highly correlated (>0.99) with the P-values for the GxE interaction term in the interaction model having a gene main effect. On the other hand, with a non-centered environmental variable the same sets of P-values for the GxE interaction terms were either less correlated (correlation around 0.65 for SEX as E) or uncorrelated ( for EDU). These results confirm the validity of the proposed univariate regression approximation.

Discussion

In this article, we presented a procedure to incorporate GxE interaction effects into our previously developed genetic modeling approach, the STMGP method. Since the STMGP method relies on univariate regression to screen for high-dimensional predictors, we developed a univariate regression approximation to the GxE interaction model so that the STMGP framework can be directly applied without modification. The approximation is simply to use “centered” environmental variables and remove gene main effect terms from the standard GxE interaction regression model. Simulation studies and real data analysis showed that incorporating GxE interactions may improve the performance of the STMGP, but, as expected, its effectiveness depends to a great extent on the underlying genetic structure. An important point to note is that genome-wide GxE interaction analysis is more sensitive to model misspecification than marginal association analysis, as pointed out by Voorman , Almli , and Ueki . Since the model misspecification issue applies to all GxE interaction analyses, special care should be taken in modeling GxE interaction, such as selection of the environmental variable. We recommend using the check statistic proposed by Ueki ) before performing a GxE interaction analysis; this enables prediction of problematic behavior in the GxE interaction analysis without having to perform the actual genome-wide scan. Most of the existing genetic prediction models treat genetic data separately from non-genetic data. While the widely used additive models to combine genetic and non-genetic data are simple and easy to handle, there must be situations where non-additive models, such as models with GxE interactions, improve upon the additive models. However, studies have reported low power of GxE interaction analysis (Kraft ). Nevertheless, analogous to the relationship between an association study and prediction modeling, the goal is not to discover GxE interactions but to have a better prediction model. Low statistical power is not necessarily a severe issue in this context: GxE interactions, even if not genome-wide significant, may be useful in helping to improve predictive power.

Data availability

All data necessary to reproduce the conclusions are fully presented in the paper. The authors do not have ownership of the data used; the data obtained were collected and are owned by the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Researchers may request and access the data through the ADNI website (http://adni.loni.usc.edu/ (Accessed: 2021 August 31)). The authors had no special access privileges to use these data. A computer program for the method proposed in this paper is available from the R package stmgp (version 1.0.4). Supplementary material is available at G3 online. Click here for additional data file. Click here for additional data file. Click here for additional data file.

49 in total

1. GCTA: a tool for genome-wide complex trait analysis.

Authors: Jian Yang; S Hong Lee; Michael E Goddard; Peter M Visscher
Journal: Am J Hum Genet Date: 2010-12-17 Impact factor: 11.025

2. Discussion of "Sure Independence Screening for Ultra-High Dimensional Feature Space.

Authors: Hao Helen Zhang
Journal: J R Stat Soc Series B Stat Methodol Date: 2008-11 Impact factor: 4.488

3. Editorial: Emergence of Gene-Environment Interaction Analysis in Epidemiologic Research.

Authors: Muin J Khoury
Journal: Am J Epidemiol Date: 2017-10-01 Impact factor: 4.897

Review 4. Gene-environment interactions in human disease: nuisance or opportunity?

Authors: Carole Ober; Donata Vercelli
Journal: Trends Genet Date: 2011-01-07 Impact factor: 11.639

5. Genetic prediction of quantitative lipid traits: comparing shrinkage models to gene scores.

Authors: Helen Warren; Juan-Pablo Casas; Aroon Hingorani; Frank Dudbridge; John Whittaker
Journal: Genet Epidemiol Date: 2013-11-23 Impact factor: 2.135

6. Beyond missing heritability: prediction of complex traits.

Authors: Robert Makowsky; Nicholas M Pajewski; Yann C Klimentidis; Ana I Vazquez; Christine W Duarte; David B Allison; Gustavo de los Campos
Journal: PLoS Genet Date: 2011-04-28 Impact factor: 5.917

7. Second-generation PLINK: rising to the challenge of larger and richer datasets.

Authors: Christopher C Chang; Carson C Chow; Laurent Cam Tellier; Shashaank Vattikuti; Shaun M Purcell; James J Lee
Journal: Gigascience Date: 2015-02-25 Impact factor: 6.524

Review 8. Genetic analysis of quantitative phenotypes in AD and MCI: imaging, cognition and biomarkers.

Authors: Li Shen; Paul M Thompson; Steven G Potkin; Lars Bertram; Lindsay A Farrer; Tatiana M Foroud; Robert C Green; Xiaolan Hu; Matthew J Huentelman; Sungeun Kim; John S K Kauwe; Qingqin Li; Enchi Liu; Fabio Macciardi; Jason H Moore; Leanne Munsie; Kwangsik Nho; Vijay K Ramanan; Shannon L Risacher; David J Stone; Shanker Swaminathan; Arthur W Toga; Michael W Weiner; Andrew J Saykin
Journal: Brain Imaging Behav Date: 2014-06 Impact factor: 3.978

9. Phenotypic and genetic analysis of cognitive performance in Major Depressive Disorder in the Generation Scotland: Scottish Family Health Study.

Authors: Joeri J Meijsen; Archie Campbell; Caroline Hayward; David J Porteous; Ian J Deary; Riccardo E Marioni; Kristin K Nicodemus
Journal: Transl Psychiatry Date: 2018-03-13 Impact factor: 6.222

10. Current Challenges and New Opportunities for Gene-Environment Interaction Studies of Complex Diseases.

Authors: Kimberly McAllister; Leah E Mechanic; Christopher Amos; Hugues Aschard; Ian A Blair; Nilanjan Chatterjee; David Conti; W James Gauderman; Li Hsu; Carolyn M Hutter; Marta M Jankowska; Jacqueline Kerr; Peter Kraft; Stephen B Montgomery; Bhramar Mukherjee; George J Papanicolaou; Chirag J Patel; Marylyn D Ritchie; Beate R Ritz; Duncan C Thomas; Peng Wei; John S Witte
Journal: Am J Epidemiol Date: 2017-10-01 Impact factor: 5.363