Literature DB >> 28479867

Statistical Identification of Gene-gene Interactions Triggered By Nonlinear Environmental Modulation.

Xu Liu1, Honglang Wang1, Yuehua Cui1,2.   

Abstract

Complex diseases are often caused by the function of multiple genes, gene-gene (G×G) interactions as well as gene-environment (G×E) interactions. G×G and G×E interactions are ubiquitous in nature. Empirical evidences have shown that the effect of G×G interaction on disease risk could be largely modified by environmental changes. Such a G×G×E triple interaction could be a potential contributing factor to phenotypic plasticity. Although the role of environmental factors moderating genetic influences on disease risk has been broadly recognized, no statistical method has been developed to rigorously assess how environmental changes modify G×G interactions to affect disease risk. To address this issue, we developed a G×G×E triple interaction model in this work. We modeled the environmental modification effect via a varying-coefficient model where the structure of the varying effect is determined by data. Thus the model has the flexibility to assess nonlinear environmental moderation effect on G×G interaction. Simulation and real data analysis were conducted to show the utility of the method. Our approach provides a quantitative framework to assess triple interactions hypothesized in literature.

Entities:  

Keywords:  Associate study; Epistasis; Gene-gene-environment triple interaction; Nonlinear interaction; Varying coefficient model

Year:  2016        PMID: 28479867      PMCID: PMC5320540          DOI: 10.2174/1389202917666160726150417

Source DB:  PubMed          Journal:  Curr Genomics        ISSN: 1389-2029            Impact factor:   2.236


INTRODUCTION

Gene-gene (GxG) interactions are ubiquitous in nature. It is less likely that a disease status is resulted from the function of single genes working separately, but rather due to the complex interactions of genes functioning together [1]. Statistical methods for the identification of gene-gene interactions have been flourished in literature [2]. Some are focused on parametric models based on simple linear or generalized linear regression models, while others are based on nonparametric models such as the Multidimensional Reduction (MDR) method [3]. As for the unit of analysis, some are focused on single nucleotide polymorphism (SNP) interaction analysis. Others are focused on gene level interaction analysis (e.g., [4, 5]), and such knowledge-driven gene-based interaction analyses are attractive given that genes are the functional units in living organisms. Although the topic of GxG interaction has been studied for several decades, the field has not been advanced much until the recent wave of genome-wide association studies (GWAS). Recent GWAS have identified thousands of disease variants. However, such successes are undermined due to limited heritability explained by these variants in many complex diseases [6]. Some investigators suggest that G×G interaction may potentially contribute to the missing heritability of many complex diseases [7]. Although many statistical methods have been developed for the identification of GxG interaction [2, 8], due to the limitation of model assumptions as well as the underlying mechanism of gene action modes in different diseases, there are still large needs for the development of advanced methods with biological relevance and statistical flexibility. In addition to G×G interaction, gene-environment (G×E) interaction could also be a contributing factor for the missing heritability [9, 10]. G×E interaction, defined as how genotypes influence phenotypes differently under different environments [11], is also interpreted as genetic sensitivity to environmental stimulus. Vast amount of studies have reported the role of GxE interaction in many diseases, such as mental illness [12], Parkinson disease [13, 14], and type 2 diabetes [15]. The development of statistical methods has also been evolving, from the identification of linear interaction with traditional simple linear or logistic regression models to more flexible nonparametric methods for the identification of nonlinear moderation of environmental influences on genetic risk [16, 17]. Similar as the function of single genes, the influence of G×G interaction on disease risk could also be modified by environmental changes. In a study of rheumatoid arthritis (RA), Padyukov et al. [18] reported the role of smoking in affecting G×G interaction on developing rheumatoid factor (RF)-seropositive disease. The authors found that the relative risk of developing RF-seropositive disease is higher in smokers who carrying double shared epitope (SE) HLA-DR genes than that in those who carrying either SE gene. Zouachel et at. [19] recently showed that a three-way gene-gene-environment (G×G×E) interaction analysis explains the differences in chikungunya virus transmission by Ae. albopictus populations at different temperatures. These empirical evidences underline the importance of evaluating three-way G×G×E interactions on disease risk, and further dissect the mechanism in which how G×G interactions on disease risk are moderated by environmental changes. Recently, Hu et al. [20] proposed a three-way gene-gene-gene interaction model for pure gene epistasis analysis based on information theory. Such model can be applied to study three-way G×G×E interactions. However, the model can only be applied for discrete environment factors and it cannot estimate the interaction size (only a detection test). As discussed in [16], when continuous environmental factors are considered, a varying-coefficient nonparametric regression model shows its flexibility in capturing potential nonlinear G×E interactions. In this work, we extend our previous model on nonlinear G×E detection to study how environmental changes modify G×G interactions on disease risk. Simulation and real data analysis are conducted to show the utility of the model.

STATISTICAL METHODS

The Model

From a G×E perspective, we propose the following partial linear varying-coefficient model (PLVCM): where Y is the disease trait response; X represents a ρ-dimensional vector of covariates containing clinical covariates such as age, smoking and gender; G and G are two SNP variables; U is the environmental variable of interest;ε is an i.i.d. error term with mean 0 and finite variance; α is a ρ-dimensional unknown parameter vector; and β1(u), β2 and β3(u) are parameters of interest which are varying functions of variable U. We are interested in evaluating how the effects of each SNP variable as well as the interaction of the two are modified by the environmental variable U to affect the trait distribution of Y, in particular the effect of β3(U) which can be interpreted as the triple G×G×E interaction effect. In the model, we also allow the nonlinear marginal effect of U on Y, denoted as β0(U) which can be determined by the data. Our model is motivated by a recent genome-wide association study to identify genetic factors interacting with maternal uterine environments for birth weight [21]. As a fetus resides inside its mother's womb, intensive signalling and chemical exchanges between the two are expected. As a result, the effect of fetal genes on birth weight can be modified by maternal conditions such as mother's glucose level, BMI level and blood pressure. In addition to identify major G×E interactions in the context of the maternal-fetal unit, we are also interested in identifying how G×G interactions in fetal genome are modified by maternal conditions to control birth weight. The varying coefficient function has much flexibility to capture the underlying functional mechanism of triple interactions which can be linear or nonlinear that must be determined by the data.

The Estimation

Denote Fn as the space of B-spline basis function of order r (r≥2) [22] with the B-spline basis, , where J and N=N is the number of interior knots for a knot sequence in which N increases along with the sample size n. Then function , can be approximated by the following spline expansion. where . The B-spline coefficients λ can be estimated by Where Let , where and and . Denote which is an n x(p+4J matrix, and Y = (Ya,..,Yn)T, where X = (X1,...,Xn)T is an n x p matrix. Then the least squares estimators of can be obtained as It is easy to obtain the estimator of the nonparametric function by We use the Bayesian Information Criterion (BIC) to select the number of interior knots and the order of basis function based on in model . More specific, we minimize the following criterion where, and are the estimates based on model , with some preset interior knots and spline orders. The selected knots and orders are then fixed when estimating functions = 1, 2, 3 to save computational time.

Testing the Overall Genetic Effect

One merit of our model is that it can assess the interaction effect of environmental exposures with genes. This can be achieved by testing the nonparametric component = 1, 2, 3, which allows one to discover the dynamic changes of the interaction effects. We first consider the following hypothesis test to detect if there is any genetic effect of two SNP variables on Y, i.e., at least one is not equal zero (2.4) via a log-likelihood ratio test (LRT). Under H0, we can estimate by where . The log-likelihood function under H0 and H1 are, respectively, where are estimators of variance of response Y under H0 and H1, respectively. The LRT is defined as , which asymptotically follows a X2-distribution with 3J degrees of freedom. Fail to reject H0 indicates that the effects of G1 and G2 on Y are not significant. Otherwise, we pursue further tests to dissect the interaction mechanism.

Testing the Interaction Effect

If the null hypothesis in (2.4) is rejected, we can further test which component is significant by formulating the following hypotheses, Of particular interest is the test of where the effect reflects the triple G×G×E interaction. In the following, we show the derivation of the test by focusing on this component. Similar procedure applies to the test of the other two components, i.e., Denote . Under H03I, we can estimate by where . The estimates under is the same as (2.2). The log-likelihood function under is given by , where is the estimator of variance of response Y under . We can construct the LRT as ℒI, which has asymptotic χ2-distribution with Jn degrees of freedom. Rejecting indicates significant interaction effect of the two SNPs. However, whether the interaction effect is moderated by environmental variable U needs to be further assessed by statistical tests.

Assessing the Environmental Moderation Effect on G×G Interaction

If in test (2.5) is rejected, we can further test if the G× G interaction is modified by environmental variable U by testing , where c is a constant. The null hypothesis implies that the G× G interaction effect is not sensitive to the change of U, hence is not modified by U. To do this, we define a transformation matrix such that ([23] chapter 4). In practice, we can simply define . Thus, can be rewritten as where = and . Denote . Under H, we can estimate by where . The estimates under H is the same as (2.2). The log-likelihood function under H is given by , where is the estimator of variance of response Y under . We can construct the LRT as , which has asymptotic -distribution with J degrees of freedom . Rejecting indicates there exists G× G× E triple interaction.

MONTE CARLO SIMULATION

The finite sample performance of the proposed method was evaluated by simulation studies. Considering model (2.1), we generated the environmental variable U from a Uniform distribution U(0, 1) and two covariates X1, X2 from an independent Normal distribution N(0, 1). The SNP variable G, coded as (2, 1, 0) corresponding to genotypes (AA, Aa, aa), was simulated from a multinomial distribution with corresponding frequencies where the frequency of the minor allele is specified as p = (0.1, 0.3, 0.5). The error term was simulated from a normal distribution , where = 0.1, 0.5, 1.0. The testing performance was compared under different MAFs and different error distributions. For the varying coefficient functions, we set . We run 1000 simulation replicates each with sample size n = 500, 1000, 2000. The number of interior knots N and spline order r were selected by the BIC criterion. We first evaluated the performance for testing the functional coefficients in hypothesis (2.4), i.e., . The testing power was evaluated under a sequence of alternative models indexed by . When , the test results gave the false positive rates. Fig. () depicts the empirical sizes and power functions to detect the overall genetic effect under the different scenarios with triplets (n, p, ), where n = 500, 1000, 2000, p = 0.1, 0.3, 0.5 and = 0.1, 0.5, 1.0. As shown in Fig. (), the empirical type I errors under all scenarios are very close to the nominal level 0.05. As we expected, the power increases as the sample size and MAF increase under a fixed error variance. Fox fixed n and MAF, the power increases as the error variance decreases. The results indicate that our method can reasonably control the false positives and has appropriate power to detect genetic effect. For the performance of testing , Fig. () shows the empirical sizes and power functions under different scenarios. Similar as the previous simulation setup, the alternative hypothesis was index by . We observed similar trends as described for the overall genetic effect test’s results shown in Fig. (). The testing sizes are reasonably controlled. The results indicate relatively good performance of the method.

A CASE STUDY

We applied the proposed PLVCM model to a data set from the Gene Environment Association Studies initiative (GENEVA, http://www.genevastudy.org) funded by the trans-NIH Genes, Environment, and Health Initiative (GEI). Low and high birth weights are major causes of neonatal morbidity and mortality. They are also associated with increased risk of metabolic diseases in adult life. New born baby’s weight is determined by fetal genes, and also controlled by maternal uterine environmental conditions, leading to complicated gene-environment interactions. For this dataset, we aimed at identifying potential genes as well as gene-gene interactions and further explore the mechanisms in which their effects are modified by maternal environmental conditions. We focused on the Thai population with 1126 subjects genotyped with the Omni1_Quad_v1_0_B platform after removing potential outliers. We picked mother’s one hour OGTT glucose level (denoted as U) as the environmental moderator in our analysis and baby’s gender as the covariate (denoted as X). There are total 590,913 SNPs after filtering out SNPs with MAF < 0.05, missing rate < 0.05 and deviating from Hardy-Weinberg equilibrium (p-value< 0.001). We first fitted the SNP data with a simple liner model, i.e., and test H0: = 0 using the Plink software with centered birth weight. There are total 61 SNPs with p-value < 10-4. The left panel in Fig. () shows the QQ plot of the -log10(p-values). No significant deviation from the expected diagonal line was observed. The right panel in Fig. () shows the Manhattan plot of the signals. We then fitted the data with the following varying-coefficient model to allow the effect of G vary over U, and allow nonlinear marginal effect of U, i.e., then we test H0: = 0. There are total 59 SNPs remained significant with a p-value threshold 10-4. The left panel in Fig. () shows the QQ plot of the -log10 (p-values). Again, we did not observe significant inflation of the p-values. The right panel in Fig. () shows the Manhattan plot of the signals. Since doing a pairwise interaction search with nearly 600K SNPs is technically infeasible, we merged the two SNP sets obtained in previous tests and got 115 SNPs in total (the two sets share 5 SNPs in common). Then we applied the proposed triple interaction PLVCM model (2.1) to the selected 115 SNPs. This strategy is also statistically valid since we only focused on the interaction analysis for SNPs showing marginal significance. The results are summarized in Table . We assessed the overall genetic effect by testing . The corresponding p-values are denoted by . The p-values for testing are denoted by . The last column of the table is the p-value for testing . As a comparison, we also fitted the 115 SNPs with a linear interaction model, i.e., then test with the Plink software. The corresponding p-values are denoted by , respectively in the table. We reported the results in Table based on the interaction p-values, i.e., . The top panel shows the SNP pairs with when fitting the PLVCM model. We observed consistently smaller p-values for testing the overall genetic effect compared to the p-values fitted with the linear model . When fitting the linear interaction model, the interaction effect does not show significance at the 0.05 significance level. Further tests show that the interaction effects are not constant for these SNP pairs at the 0.05 level, which implies that the interactions are significantly modified by mother’s glucose level to affect baby’s birth weight. If we fit a linear interaction model, we could miss these interaction effects. The lower panel in Table shows the SNP pairs with when fitting the linear interaction model. There are total 7 SNP pairs showing significant interaction effects based on the test . The p-values for testing the overall genetic effect when fitting both models are quite similar to each other. However, the p-values for testing interaction effect with the PLVCM model are larger than the ones obtained when fitting the data with a linear interaction model. This is not surprise since we failed to reject the null hypothesis H0: = c. The large p-values suggest that the coefficient functions are all constant. We could miss these interaction effects if we only fit the data with the proposed PLVCM model [22]. To show the estimated varying-coefficient function for , we picked SNP rs6744005 in gene PLB1 and SNP rs969981 in gene CEP112 and plotted their interaction effect in Fig. (). The constant coefficient fitted with a linear model (4.8) is shown as the dash-dotted line in the figure. Clearly the estimated function is not a constant. The increasing pattern of the function indicates that the baby’s birth weight increases as the mother’s glucose level increases. In summary, we identified 5 SNP pairs in which their interactions are significantly modified by mother’s glucose level based on the proposed PLVCM model. Such triple G×G×E interaction effects could be missed by fitting a linear interaction model. On the other hand, if a G × G interaction effect is not sensitive to environmental changes, fitting the PLVCM model could lead to potential model mis-specification, hence losing power. Thus, a practical guidance when fitting the PLVCM model is to assess the functional form of the varying coefficient first. If the coefficient is a constant, then we fit the data with a constant coefficient model. Otherwise, one can fit the proposed varying coefficient model.

DISCUSSION

Identifying gene-gene and gene-environment interactions underlying complex disease traits has been one of the central foci in genetic association studies. Vast amount of empirical studies have supported the role of both types of interaction in understanding the etiology of human diseases. Our previous investigations on nonlinear gene-environment interaction studies [16, 17] have suggested the power of statistical methods in hunting for nonlinear environmental modification effect on genetic risk. Although empirical studies have suggested the role of environmental changes on the effect of gene-gene interactions [18, 19], there has been no rigorous statistical treatment to assess the role of gene-gene interaction on complex diseases triggered by environmental stimulus [23]. In this work, we proposed a triple G×G×E interaction model in which we allow for nonlinear modification effect of environmental changes on gene-gene interactions to affect a disease trait. The proposed PLVCM model has the flexibility to incorporate both parametric (linear part) and nonparametric (nonlinear part) interaction effect. The varying coefficient function is estimated through nonparametric B-spline techniques, hence has the flexibility to capture the underlying functional form, either linear or nonlinear which can be evaluated via statistical tests. Our model is biologically attractive in which it exhibits two important features: (1) It addresses a long-term question on how environmental exposures moderate G×G influences on disease risk; and (2) It has the flexibility to detect nonlinear interaction, thus more powerful when G×G effects are nonlinearly modified by environmental stimuli. In a typical genetic association study, there are usually large number of genetic variables (e.g., SNPs). When focusing on a gene set based analysis, it is important to fit multiple SNPs within a gene to a single interaction model and select important players moderated by environmental changes. In addition, the proposed model is implemented based on a quantitative trait in current work. The framework can be extended to a binary disease trait with a known link function. The parameter estimation will be a little different due to the nonlinear link function to be adopted. Such a generalized PLVCM has particular power to dissect gene-gene interaction effects triggered by nonlinear environmental moderation. These will be considered in our future investigations. The computational code for implementing the proposed method is available upon request.
Table 1

List of SNP pairs with <0.001 or <0.001.

SNP 1 SNP 2 P-value
SNP ID MAF Chr Gene Alleles SNP ID MAF Chr Gene Alleles pvcO pvcI pLO pLI pvcC
rs98248190.403-A/Grs13321600.209-C/A1.27E-8 1.96E-4 1.84E-80.7177.49E-5
rs13853310.443-G/Ars13321600.209-C/A5.22E-9 2.02E-4 3.99E-80.5927.85E-5
rs25487540.225-A/Grs109820000.399ZNF618A/G3.60E-10 3.18E-4 1.01E-40.4022.12E-4
rs67440050.242PLB1A/Crs9699810.1317CEP112A/G1.11E-7 3.25E-4 2.79E-70.7261.13E-4
rs110301060.3711STIM 1G/Ars101520640.3214-G/A7.07E-9 4.22E-4 4.11E-40.6561.62E-4
rs79031750.4810-G/Ars11449130.1414EML5G/A4.65E-91.14E-36.55E-9 9.92E-5 0.25
rs79031750.4810-G/Ars101444750.1314ZC3H14C/G5.23E-85.89E-36.65E-8 1.10E-4 0.61
rs25843640.108-A/Grs80278260.2915-G/A4.58E-93.12E-36.45E-9 2.24E-4 0.71
rs30082030.341-G/Ars9039570.138-A/C1.02E-81.69E-21.39E-8 3.11E-4 0.49
rs15098690.341-G/Ars9039570.138-A/C8.09E-91.56E-21.11E-8 4.06E-4 0.44
rs99897450.182THSD7BA/Grs15651900.3412ITPR2G/A1.10E-61.59E-21.26E-6 7.36E-4 0.36
rs20834290.431-C/Ars118005490.151DNM3G/A2.43E-93.74E-33.51E-9 8.98E-4 0.14
  19 in total

1.  The mystery of missing heritability: Genetic interactions create phantom heritability.

Authors:  Or Zuk; Eliana Hechter; Shamil R Sunyaev; Eric S Lander
Journal:  Proc Natl Acad Sci U S A       Date:  2012-01-05       Impact factor: 11.205

2.  Exploring gene-environment interactions in Parkinson's disease.

Authors:  Colin C McCulloch; Denise M Kay; Stewart A Factor; Ali Samii; John G Nutt; Donald S Higgins; Alida Griffith; John W Roberts; Berta C Leis; Jennifer S Montimurro; Cyrus P Zabetian; Haydeh Payami
Journal:  Hum Genet       Date:  2008-01-22       Impact factor: 4.132

3.  A novel method for identifying nonlinear gene-environment interactions in case-control association studies.

Authors:  Cen Wu; Yuehua Cui
Journal:  Hum Genet       Date:  2013-08-24       Impact factor: 4.132

4.  Three-way interactions between mosquito population, viral strain and temperature underlying chikungunya virus transmission potential.

Authors:  Karima Zouache; Albin Fontaine; Anubis Vega-Rua; Laurence Mousson; Jean-Michel Thiberge; Ricardo Lourenco-De-Oliveira; Valérie Caro; Louis Lambrechts; Anna-Bella Failloux
Journal:  Proc Biol Sci       Date:  2014-10-07       Impact factor: 5.349

Review 5.  Global and societal implications of the diabetes epidemic.

Authors:  P Zimmet; K G Alberti; J Shaw
Journal:  Nature       Date:  2001-12-13       Impact factor: 49.962

6.  A gene-environment interaction between smoking and shared epitope genes in HLA-DR provides a high risk of seropositive rheumatoid arthritis.

Authors:  Leonid Padyukov; Camilla Silva; Patrik Stolt; Lars Alfredsson; Lars Klareskog
Journal:  Arthritis Rheum       Date:  2004-10

Review 7.  Detecting gene-gene interactions that underlie human diseases.

Authors:  Heather J Cordell
Journal:  Nat Rev Genet       Date:  2009-06       Impact factor: 53.242

Review 8.  Gene-environment interactions in Parkinson's disease.

Authors:  Christopher A Ross; Wanli W Smith
Journal:  Parkinsonism Relat Disord       Date:  2007       Impact factor: 4.891

Review 9.  Gene-environment interactions in psychiatry: joining forces with neuroscience.

Authors:  Avshalom Caspi; Terrie E Moffitt
Journal:  Nat Rev Neurosci       Date:  2006-07       Impact factor: 34.870

10.  An information-gain approach to detecting three-way epistatic interactions in genetic association studies.

Authors:  Ting Hu; Yuanzhu Chen; Jeff W Kiralis; Ryan L Collins; Christian Wejse; Giorgio Sirugo; Scott M Williams; Jason H Moore
Journal:  J Am Med Inform Assoc       Date:  2013-02-08       Impact factor: 4.497

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.