
Epistasis analysis for quantitative traits by functional regression model.

Futao Zhang¹, Eric Boerwinkle², Momiao Xiong².

Abstract

The critical barrier in interaction analysis for rare variants is that most traditional statistical methods for testing interactions were originally designed for testing the interaction between common variants and are difficult to apply to rare variants because of their prohibitive computational time and poor ability. The great challenges for successful detection of interactions with next-generation sequencing (NGS) data are (1) lack of methods for interaction analysis with rare variants, (2) severe multiple testing, and (3) time-consuming computations. To meet these challenges, we shift the paradigm of interaction analysis between two loci to interaction analysis between two sets of loci or genomic regions and collectively test interactions between all possible pairs of SNPs within two genomic regions. In other words, we take a genome region as a basic unit of interaction analysis and use high-dimensional data reduction and functional data analysis techniques to develop a novel functional regression model to collectively test interactions between all possible pairs of single nucleotide polymorphisms (SNPs) within two genome regions. By intensive simulations, we demonstrate that the functional regression models for interaction analysis of the quantitative trait have the correct type 1 error rates and a much better ability to detect interactions than the current pairwise interaction analysis. The proposed method was applied to exome sequence data from the NHLBI's Exome Sequencing Project (ESP) and CHARGE-S study. We discovered 27 pairs of genes showing significant interactions after applying the Bonferroni correction (P-values < 4.58 × 10⁻¹⁰) in the ESP, and 11 were replicated in the CHARGE-S study.


Year: 2014 PMID： 24803592 PMCID： PMC4032862 DOI： 10.1101/gr.161760.113

Source DB: PubMed Journal: Genome Res ISSN： 1088-9051 Impact factor: 9.043

Epistasis is the primary factor in molecular evolution (Breen et al. 2012) and plays an important role in quantitative genetic analysis (Steen 2011). Epistasis is a phenomenon in which the effect of one genetic variant is masked or modified by one or more other genetic variants and is often defined as the departure from additive effects in a linear model (Fisher 1918). Many statistical methods, including regression-based methods, have been developed to detect epistasis in quantitative genetic analysis (Cordell 2009; Chen and Cui 2010; Bocianowski 2012). However, these methods were originally designed to detect epistasis for common variants (Steen 2011) and are difficult to apply to rare variants because of their high type 1 error rates and poor ability to detect interactions between rare variants. Next-generation sequencing (NGS) data raise two serious problems. The first problem is the curse of dimensionality of the data, and the second problem comes from the low frequencies of rare variants in the data. The recently reported average number of single nucleotide polymorphisms (SNPs) per kb in the 202 drug target genes sequenced in 12,514 European subjects is about 48 SNPs (Nelson et al. 2012). The total number of all possible pairs of SNPs across the genome for large sample sizes can reach as many as 10¹⁶. The dimension of whole-genome sequencing data is extremely high. The high dimension of the data for interaction analysis poses two great challenges. The first challenge is the requirement of a prohibitive amount of computational time. Suppose that 5000 pairwise tests can be finished in 1 sec (Steen 2011); then testing all possible pairwise interactions would take ∼65,956 yr to finish. The second challenge for genome-wide interaction analysis arises from the multiple statistical tests. The power of the statistics that exhaustively test all possible pairs of interaction will be severely hampered by extremely large numbers of multiple tests. 
The popular strategies for reducing the dimensionality of the data, the number of tests, and the time of computations, and for improving the power to detect interactions are feature extraction, which projects the original high-dimensional data to low-dimensional space (Guyon et al. 2006; Li et al. 2009); feature selection, which selects subsets of variables of interests (Guyon and Elisseeff 2003; Saeys et al. 2007); and possibly, approximately complete testing to reduce computational time (Prabhu and Pe’er 2012). Feature extraction in association studies with NGS data is often carried out by collapsing multiple variants into a single variable (Li and Leal 2008; Bansal et al. 2010; Luo et al. 2011; Wu et al. 2011). However, much important interaction information may be lost after the multiple variants are collapsed. The collapsing methods may lack the power to detect interactions between variants. To address the critical barrier in detection of gene–gene interactions with NGS data, we take a genome region (or gene) as a basic unit of interaction analysis and use all the information that can be accessed to collectively test interactions between all possible pairs of SNPs within two genome regions (or genes). This will shift the paradigm of interaction studies from pairwise interaction analysis to region–region (gene–gene) interaction analysis, in which we collectively test interactions between two sets of loci within genomic regions or genes. To effectively reduce the dimension of the data, unlike the recently proposed group association tests (Li and Leal 2008; Madsen and Browning 2009), which ignore differences in genetic effects between SNPs in different locations, we use genetic variant profiles that will recognize information contained in the physical location of the SNP as a major data form. The densely distributed genetic variants across the genomes in large samples can be viewed as realizations of a Poisson process (Joyce and Tavaré 1995). 
The densely typed genetic variants in a genomic region for each individual are so close that these genetic variant profiles can be treated as observed data taken from curves (Luo et al. 2012). Such genetic variant profiles are called functional data. Since standard multivariate statistical analyses often fail with functional data (Ferraty and Romain 2010), we formulate a test for the interaction between two genomic regions in quantitative trait analysis as a functional regression (FRG) model (Ramsay and Silverman 2005) with scalar response. In the FRG model, the genotype functions (genetic variant profiles) are defined as a function of the genomic position of the genetic variants rather than a set of discrete genotype values, and the quantitative trait is predicted by genotype functions with their interaction terms. We will show that the FRG with scalar response is a natural extension of the multivariate regression for interaction analysis. To evaluate its performance for interaction analysis, we use large-scale simulations to calculate the type I error rates of the FRG for testing the interaction between two genomic regions and to compare its power with pairwise interaction analysis and regression on principal components (PCs). To further evaluate its performance, the FRG for interaction analysis is applied to high-density lipoprotein (HDL) and exome sequence data from the NHLBI’s Exome Sequencing Project (ESP) and to whole-genome sequencing data from the CHARGE-S project.

Methods

Functional regression model for interaction analysis with a quantitative trait

Consider two genomic regions, $[a_1, b_1]$ and $[a_2, b_2]$. Let $y_i$ be the phenotypic value of a quantitative trait measured on the ith individual. Let $t$ and $s$ be genomic positions in the first and second genomic regions, respectively. Let $x_i(t)$ and $x_i(s)$ be the genotype functions of the ith individual in the regions $[a_1, b_1]$ and $[a_2, b_2]$, respectively. The genotype function of the ith individual is defined by the genotype (MM, Mm, or mm) of the SNP at the genomic position t, where M and m are the two alleles of that SNP. Recall that a regression model for interaction analysis is defined as

$$y_i = \mu + \sum_{j=1}^{k_1} x_{ij}\alpha_j + \sum_{l=1}^{k_2} z_{il}\beta_l + \sum_{j=1}^{k_1}\sum_{l=1}^{k_2} x_{ij} z_{il}\gamma_{jl} + \varepsilon_i,$$

where $\mu$ is an overall mean; $\alpha_j$ is the main genetic additive effect of the jth SNP in the first genomic region; $\beta_l$ is the main genetic additive effect of the lth SNP in the second genomic region; $\gamma_{jl}$ is an additive × additive interaction effect between the jth SNP in the first genomic region and the lth SNP in the second genomic region; $x_{ij}$ and $z_{il}$ are indicator variables for the genotypes at the jth SNP and the lth SNP, respectively; and $\varepsilon_i$ are independent and identically distributed normal variables with mean of zero and variance $\sigma^2$. Similar to the multiple regression models for interaction analysis with a quantitative trait, the FRG model for a quantitative trait can be defined as

$$y_i = \alpha_0 + \int_T \alpha(t)\,x_i(t)\,dt + \int_S \beta(s)\,x_i(s)\,ds + \int_T\!\int_S \gamma(t,s)\,x_i(t)\,x_i(s)\,dt\,ds + \varepsilon_i, \quad (2)$$

where $T = [a_1, b_1]$ and $S = [a_2, b_2]$; $\alpha_0$ is an overall mean; $\alpha(t)$ and $\beta(s)$ are genetic additive effects of two putative QTLs located at the genomic positions $t$ and $s$, respectively; $\gamma(t,s)$ is the interaction effect between two putative QTLs located at the genomic positions $t$ and $s$; $x_i(t)$ and $x_i(s)$ are genotype functions; and $\varepsilon_i$ are independent and identically distributed normal variables with mean of zero and variance $\sigma^2$. In theory, the genetic additive effect and interaction effect functions can be obtained by variational methods (Supplemental Note 1). The classical concept of genetic additive variance and interaction variance can be extended to the functional model (Supplemental Note 2). Below we take a numerical approach to estimate the genetic additive and interaction effect functions.

Estimation of interaction effects

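We assume that both phenotypes and genotype functions are centered. The genotype functions $x_i(t)$ and $x_i(s)$ are expanded in terms of the orthonormal basis functions as

$$x_i(t) = \sum_{j=1}^{\infty} \xi_{ij}\,\phi_j(t), \qquad x_i(s) = \sum_{k=1}^{\infty} \eta_{ik}\,\psi_k(s), \quad (3)$$

where $\phi_j(t)$ and $\psi_k(s)$ are sequences of the orthonormal basis functions. The expansion coefficients $\xi_{ij}$ and $\eta_{ik}$ are estimated by

$$\xi_{ij} = \int_T x_i(t)\,\phi_j(t)\,dt, \qquad \eta_{ik} = \int_S x_i(s)\,\psi_k(s)\,ds. \quad (4)$$

In practice, numerical methods for the integral will be used to calculate the expansion coefficients. Substituting Equation 3 into Equation 2, with the overall mean removed by centering, we obtain

$$y_i = \sum_{j} \xi_{ij}\alpha_j + \sum_{k} \eta_{ik}\beta_k + \sum_{j}\sum_{k} \xi_{ij}\eta_{ik}\gamma_{jk} + \varepsilon_i, \quad (5)$$

where $\alpha_j = \int_T \alpha(t)\phi_j(t)\,dt$, $\beta_k = \int_S \beta(s)\psi_k(s)\,ds$, and $\gamma_{jk} = \int_T\!\int_S \gamma(t,s)\phi_j(t)\psi_k(s)\,dt\,ds$. The parameters $\alpha_j$, $\beta_k$, and $\gamma_{jk}$ are referred to as genetic additive and as additive × additive interaction effect scores. These scores can also be viewed as the expansion coefficients of the genetic effect functions with respect to orthonormal basis functions:

$$\alpha(t) = \sum_{j} \alpha_j\phi_j(t), \qquad \beta(s) = \sum_{k} \beta_k\psi_k(s), \qquad \gamma(t,s) = \sum_{j}\sum_{k} \gamma_{jk}\,\phi_j(t)\,\psi_k(s). \quad (6)$$

Let $Y = [y_1, \ldots, y_n]^T$, $\xi = (\xi_{ij})_{n \times J}$, $\eta = (\eta_{ik})_{n \times K}$, $\Gamma = (\xi_{ij}\eta_{ik})_{n \times JK}$, $\alpha = [\alpha_1, \ldots, \alpha_J]^T$, $\beta = [\beta_1, \ldots, \beta_K]^T$, $\gamma = [\gamma_{11}, \ldots, \gamma_{JK}]^T$, and $\varepsilon = [\varepsilon_1, \ldots, \varepsilon_n]^T$, where the values J and K are chosen such that genotype function expansions can account for 80% of total genetic variation in the first and second genes, respectively. If we use the above notations, Equation 5 can be reduced to

$$Y = \xi\alpha + \eta\beta + \Gamma\gamma + \varepsilon = Wb + \varepsilon, \quad (7)$$

where $W = [\xi \;\; \eta \;\; \Gamma]$ and $b = [\alpha^T, \beta^T, \gamma^T]^T$. Therefore, the interaction models with integrals are transformed to the traditional multivariate regression models (Equation 7) for interaction analysis. The standard least square estimator of b is given by

$$\hat{b} = (W^T W)^{-1} W^T Y, \quad (8)$$

and its variance is

$$\mathrm{Var}(\hat{b}) = \hat{\sigma}^2 (W^T W)^{-1}, \quad (9)$$

where $\hat{\sigma}^2$ is the residual variance estimated from the fitted model. Substituting the estimated genetic effect scores $\hat{\alpha}_j$, $\hat{\beta}_k$, and $\hat{\gamma}_{jk}$ into Equation 6 yields the estimated genetic additive effect and additive × additive interaction effect functions $\hat{\alpha}(t)$, $\hat{\beta}(s)$, and $\hat{\gamma}(t,s)$. If basis functions for expansion of genotype functions are functional PCs or eigenfunctions (Ash and Gardner 1975), then we can estimate the genetic additive and additive × additive interaction variances in Equation 2 (Supplemental Note 3).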
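The estimation steps described above can be sketched numerically. The following Python sketch is an illustration under stated assumptions, not the authors' implementation: it stands in for the functional PCs with ordinary principal components of the discretized genotype matrix (so variant positions and spacing are ignored), keeps the smallest number of components that reaches 80% of the variance, builds the design matrix of main-effect and product terms, and fits it by least squares. The function names (`fpc_scores`, `design_matrix`, `fit_ols`) and the residual degrees of freedom used for the variance estimate are choices made for this example.

```python
import numpy as np

def fpc_scores(G, var_explained=0.80):
    """Project centered genotype profiles onto leading principal components.

    G is an (n_individuals, n_variants) genotype matrix for one region.
    Returns an (n, J) score matrix, where J is the smallest number of
    components whose variance share reaches `var_explained`.
    Assumption: plain PCA on the discretized profiles replaces functional PCA.
    """
    Gc = G - G.mean(axis=0)                        # center each variant
    _, s, Vt = np.linalg.svd(Gc, full_matrices=False)
    frac = np.cumsum(s ** 2) / np.sum(s ** 2)      # cumulative variance share
    J = int(np.searchsorted(frac, var_explained) + 1)
    return Gc @ Vt[:J].T                           # expansion coefficients

def design_matrix(xi, eta):
    """Stack main-effect scores and all J*K products of scores column-wise."""
    n = xi.shape[0]
    Gamma = (xi[:, :, None] * eta[:, None, :]).reshape(n, -1)
    return np.hstack([xi, eta, Gamma])

def fit_ols(W, y):
    """Least-squares estimate of the effect scores and their covariance."""
    yc = y - y.mean()                              # the trait is centered
    b, *_ = np.linalg.lstsq(W, yc, rcond=None)
    resid = yc - W @ b
    sigma2 = resid @ resid / (W.shape[0] - W.shape[1])   # assumed n - p divisor
    cov = sigma2 * np.linalg.pinv(W.T @ W)
    return b, cov
```

With two genotype matrices `G1` and `G2` and a trait vector `y`, the calls would be `xi, eta = fpc_scores(G1), fpc_scores(G2)` followed by `b, cov = fit_ols(design_matrix(xi, eta), y)`. The number of columns grows as J + K + JK, so the sample size must comfortably exceed that count for the fit to be stable.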

Test statistics

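An essential problem in genetic interaction studies of the quantitative trait is to test the interaction between two genomic regions (or genes). Formally, we investigate the problem of testing the following hypothesis:

$$H_0: \gamma(t,s) = 0 \ \text{for all } t \in [a_1, b_1],\ s \in [a_2, b_2] \quad \text{against} \quad H_a: \gamma(t,s) \neq 0,$$

which is equivalent to testing the hypothesis

$$H_0: \gamma = 0 \quad \text{against} \quad H_a: \gamma \neq 0,$$

where $\gamma$ is defined in Equation 7. Let $\Lambda$ be the matrix corresponding to the parameter $\gamma$ of the variance matrix $\mathrm{Var}(\hat{b})$ in Equation 9. Define the test statistic for testing the interaction between the two genomic regions $[a_1, b_1]$ and $[a_2, b_2]$ as

$$T_I = \hat{\gamma}^T \Lambda^{-1} \hat{\gamma}.$$

Then, under the null hypothesis $H_0: \gamma = 0$, $T_I$ is asymptotically distributed as a central $\chi^2_{(JK)}$ distribution if $JK$ components are taken in the expansion of the interaction effect function in Equation 6.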
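The test statistic described above is a quadratic form in the estimated interaction scores. The sketch below shows one way to compute it from two score matrices and a trait vector; it is a minimal illustration, not the authors' code. The function name `interaction_statistic`, the argument names, and the use of n minus the number of columns as the residual divisor are assumptions made for this example.

```python
import numpy as np

def interaction_statistic(xi, eta, y):
    """Quadratic-form statistic for the interaction block of the regression.

    xi:  (n, J) expansion coefficients for the first region
    eta: (n, K) expansion coefficients for the second region
    y:   (n,) quantitative trait
    Returns (statistic, degrees_of_freedom) with degrees_of_freedom = J * K.
    """
    n, J = xi.shape
    K = eta.shape[1]
    # Interaction columns: every product of a first-region and second-region score
    Gamma = (xi[:, :, None] * eta[:, None, :]).reshape(n, J * K)
    W = np.hstack([xi, eta, Gamma])
    yc = y - y.mean()
    b, *_ = np.linalg.lstsq(W, yc, rcond=None)
    resid = yc - W @ b
    sigma2 = resid @ resid / (n - W.shape[1])      # assumed residual divisor
    cov = sigma2 * np.linalg.inv(W.T @ W)          # covariance of all estimates
    gamma_hat = b[J + K:]                          # interaction effect scores
    Lam = cov[J + K:, J + K:]                      # their covariance block
    stat = float(gamma_hat @ np.linalg.solve(Lam, gamma_hat))
    return stat, J * K
```

A P-value can then be read from the upper tail of a chi-square distribution with the returned degrees of freedom, for example with `scipy.stats.chi2.sf(stat, df)` if SciPy is available.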

Results

Null distribution of test statistics

In the previous section, we showed that the test statistics are asymptotically distributed as a central χ² distribution. To examine the validity of this statement, we performed a series of simulation studies to compare their empirical levels with the nominal ones. The type I error rates for rare variants and for both common and rare variants were calculated. We assumed three models: model 1 (without marginal effects), model 2 (with marginal effect of one gene), and model 3 (with marginal effects of two genes to generate a phenotype) (Supplemental Note 4). We generated a population with 1 million individuals by resampling from 3212 individuals with variants in eight genes selected from the NHLBI’s ESP, where the description of the eight genes is summarized in Supplemental Table S1. To examine whether the presence of linkage disequilibrium (LD) between SNPs will seriously affect the type 1 error rates, we included some genes with linked variants. The number of sampled individuals ranged from 500 to 5000, and 5000 simulations were repeated. Table 1 and Supplemental Table S2 summarize the average type I error rates of the test statistics for testing the interaction between two genes with rare variants and mixed common and rare variants over all possible pairs of eight genes (28 pairs of genes), respectively, at the nominal levels α = 0.05, α = 0.01, and α = 0.001. These tables show that the type I error rates of the test statistics for testing interactions between two genes with or without marginal effects are not appreciably different from the nominal α levels.
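The comparison of empirical and nominal levels can be illustrated with a small simulation. The sketch below is a toy version under stated assumptions, not a reproduction of the study: the expansion coefficients are independent Gaussian draws rather than scores derived from resampled ESP genotypes, two components are used per region (so the reference distribution has 4 degrees of freedom), and the phenotype has marginal effects of both regions but no interaction, in the spirit of model 3. The names `wald_interaction_stat` and `empirical_type1`, the effect sizes, and the sample sizes are hypothetical.

```python
import numpy as np

# Upper 5% point of the chi-square distribution with 4 degrees of freedom
CHI2_CRIT_DF4_05 = 9.488

def wald_interaction_stat(xi, eta, y):
    """Quadratic-form statistic for the J*K interaction coefficients."""
    n, J = xi.shape
    K = eta.shape[1]
    Gamma = (xi[:, :, None] * eta[:, None, :]).reshape(n, J * K)
    W = np.hstack([xi, eta, Gamma])
    yc = y - y.mean()
    b, *_ = np.linalg.lstsq(W, yc, rcond=None)
    resid = yc - W @ b
    cov = (resid @ resid / (n - W.shape[1])) * np.linalg.inv(W.T @ W)
    g = b[J + K:]
    return float(g @ np.linalg.solve(cov[J + K:, J + K:], g))

def empirical_type1(n=300, n_sim=500, seed=0):
    """Fraction of null data sets rejected at the nominal 0.05 level."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sim):
        xi = rng.normal(size=(n, 2)); xi -= xi.mean(axis=0)
        eta = rng.normal(size=(n, 2)); eta -= eta.mean(axis=0)
        # Null phenotype: marginal effects of both regions, no interaction
        y = 0.5 * xi[:, 0] + 0.5 * eta[:, 0] + rng.normal(size=n)
        rejections += wald_interaction_stat(xi, eta, y) > CHI2_CRIT_DF4_05
    return rejections / n_sim
```

If the asymptotic approximation holds for these settings, `empirical_type1()` should return a value close to 0.05; larger departures would be expected for small samples or many components.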

Table 1.

Average type 1 error rates of the statistics for testing interaction between two genes with rare variants

To study the impact of the LD between SNPs, we present Supplemental Table S3. Supplemental Table S3 summarizes the type 1 error rates of the FRG for testing interactions between genes: GBP3 and KANK4. The LD map of genes GBP3 and KANK4 is shown in Supplemental Figure 1. Supplemental Table S2 demonstrates that the presence of LD between genes being tested did not have a significant impact on the type 1 error rates. The impact of the lengths of the genes and sequencing error (on the type 1 error rates) will be limited (Supplemental Note 3).

Power evaluation

To evaluate the performance of the FRG for testing the interaction between two genes or genomic regions for a quantitative trait, simulated data were used to estimate their power to detect a true interaction. A true quantitative genetic model is given as follows. Consider $H$ pairs of quantitative trait loci (QTLs) from two genes (genomic regions). Let $Q_{h_1}$ and $q_{h_1}$ be two alleles at the first QTL, and $Q_{h_2}$ and $q_{h_2}$ be two alleles at the second QTL, for the hth pair of QTLs. Let $u_{ijkl}$ be the genotypes of the uth individual with $i, j = Q_{h_1}, q_{h_1}$ and $k, l = Q_{h_2}, q_{h_2}$, and $g_{u_{ijkl}}$ be its genotypic value. The following multiple linear regression is used as a genetic model for a quantitative trait:

$$y_u = \sum_{h=1}^{H} g^{h}_{u_{ijkl}} + \varepsilon_u,$$

where $g^{h}_{u_{ijkl}}$ is a genotypic value of the hth pair of QTLs, and $\varepsilon_u$ is distributed as a standard normal distribution $N(0, 1)$. Four models of interactions are considered: (1) Dominant OR Dominant, (2) Dominant AND Dominant, (3) Recessive OR Recessive, and (4) Threshold model (Supplemental Table S4). The Recessive AND Recessive model is excluded due to infrequency of that condition with rare variants. The risk parameter $r$ varies from zero to one. We generated 1 million individuals by resampling from 3212 individuals of European origin with variants in the two genes IQGAP3 and ACTN2 selected from the ESP data set. We randomly selected 20% of the variants as causal variants. A total of 2000 individuals for the four interaction models were sampled from the populations. A total of 1000 simulations were repeated for the power calculation.

The power of the proposed method is compared with the regression on PCs. For SNP genotypes in each genomic region, PC analysis (PCA) was performed. The number of PCs for each individual that can explain 80% of the total genetic variation in the genomic region will be selected as the variable. Specifically, the PC scores of the ith individual in the first and second genomic regions are denoted by $x_{ij}$ and $z_{ik}$, respectively. The regression model for detection of interaction is then given by

$$y_i = \mu + \sum_{j=1}^{J} x_{ij}\alpha_j + \sum_{k=1}^{K} z_{ik}\beta_k + \sum_{j=1}^{J}\sum_{k=1}^{K} x_{ij} z_{ik}\gamma_{jk} + \varepsilon_i.$$

The power of the proposed method is also compared with the traditional point-wise interaction test, which takes the following model:

$$y_i = \mu + x_{i1}\alpha + x_{i2}\beta + x_{i1}x_{i2}\gamma + \varepsilon_i,$$

where $x_{i1}$ and $x_{i2}$ are the genotype indicator variables of the two SNPs being tested. For a pair of genes, we assume that the first gene has $k_1$ SNPs and the second gene has $k_2$ SNPs, and then the total number of all possible pairs is $k = k_1 k_2$. For each pair of SNPs, we calculate a statistic $T_j$ for testing pairwise interaction ($j = 1, \ldots, k$). Finally, the maximum of the $T_j$, $T_{\max} = \max(T_1, T_2, \ldots, T_k)$, is computed. By permuting the phenotypic values $y_i$ 1000 times, we can find the distribution of $T_{\max}$; i.e., we have 1000 values of $T_{\max}$. From this empirical distribution, we can find the P-value of $T_{\max}$, which can be used to calculate the power of testing for interaction between two genes (genomic regions) by pairwise tests.

We first study the power of statistics for testing interactions between two genomic regions with all rare variants where 20% of the rare variants were chosen as causal variants. Figure 1, A through D, plots the power curves of three statistics, namely FRG, regression on PCs, and pairwise interaction tests (with permutations used to adjust for multiple testing), for testing interactions between two genomic regions (genes) that consist of only rare variants for a quantitative trait under the Dominant OR Dominant, Dominant AND Dominant, Recessive OR Recessive, and Threshold models, respectively. These power curves are a function of the risk parameter $r$ at the significance level α = 0.05. From these figures we observed several remarkable features. First, under all four interaction models, the test based on the FRG model was the most effective, followed by the regression on PCA. The pairwise test, in which we tested the interaction between all possible pairs of SNPs in two genomic regions (genes), was the least effective. Second, the pairwise test had almost no power to detect interaction between two genomic regions (genes). Third, the effectiveness of the FRG-based test was substantially better than that of the pairwise tests. 
Fourth, the difference in power between the FRG and regression on PCA increases when the complexity of the interaction models increases.
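The permutation-adjusted pairwise test used as a comparison can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the per-pair statistic is taken to be the squared t-statistic of the product term in a four-parameter linear regression, pairs whose product column carries no information are skipped, and the P-value uses the common (1 + count) / (1 + permutations) convention. The function names `max_pairwise_stat` and `permutation_pvalue` are hypothetical.

```python
import numpy as np

def max_pairwise_stat(G1, G2, y):
    """Largest squared t-statistic of the product term over all SNP pairs."""
    n = len(y)
    best = 0.0
    for a in G1.T:                                  # SNPs in the first gene
        for b in G2.T:                              # SNPs in the second gene
            X = np.column_stack([np.ones(n), a, b, a * b])
            coef, *_ = np.linalg.lstsq(X, y, rcond=None)
            resid = y - X @ coef
            sigma2 = resid @ resid / (n - 4)
            # pinv tolerates degenerate designs, e.g. an all-zero product column
            var_g = sigma2 * np.linalg.pinv(X.T @ X)[3, 3]
            if var_g > 0:
                best = max(best, coef[3] ** 2 / var_g)
    return best

def permutation_pvalue(G1, G2, y, n_perm=1000, seed=0):
    """P-value of the maximum statistic from permuted phenotypes."""
    rng = np.random.default_rng(seed)
    observed = max_pairwise_stat(G1, G2, y)
    null = np.array([max_pairwise_stat(G1, G2, rng.permutation(y))
                     for _ in range(n_perm)])
    return (1 + np.sum(null >= observed)) / (1 + n_perm)
```

Because every permutation repeats all pairwise regressions, the cost scales with the number of SNP pairs times the number of permutations, which is consistent with the computational burden the text attributes to pairwise testing.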

Figure 1.

Power curves of three statistics (the FRG, the regression on PCA, and pairwise interaction tests, with permutations used to adjust for multiple testing) for testing the interaction between two genomic regions for a quantitative trait at the significance level α = 0.05. (A through D) Regions that consist of rare variants; power as a function of the relative risk parameter r, assuming sample sizes of 2000, under (A) the Dominant OR Dominant model, (B) the Dominant AND Dominant model, (C) the Recessive OR Recessive model, and (D) the Threshold model. (E) Regions that consist of rare variants; power as a function of the sample size under the Dominant OR Dominant model, assuming the relative risk parameter r = 0.1. (F) Regions with both common and rare variants, where 10% of the common variants and 10% of the rare variants were chosen as causal variants; power as a function of the relative risk parameter r under the Dominant OR Dominant model, assuming sample sizes of 2000.

To investigate the impact of sample size on the power, we plotted Figure 1E and Supplemental Figures 2 through 4, showing the power of three statistics for testing the interaction between two genomic regions (or genes) with only rare variants as a function of sample size under the four interaction models, assuming that 20% of the rare variants are risk variants, with the risk parameter r set separately for the Dominant OR Dominant and Recessive OR Recessive models and for the Dominant AND Dominant and Threshold models. We observed similar power patterns of the three statistics under the four interaction models as those previously discussed. When sample sizes reach 10,000, the FRG model can be highly effective, but the effectiveness of the pairwise interaction test was still low even when the sample sizes increased to 10,000. The FRG can also be applied when both common and rare variants are present. 
Figure 1F plotted the power curves of three statistics for testing interactions between two genomic regions (or genes) with both common and rare variants where 10% of the common variants and 10% of the rare variants were chosen as causal variants under the Dominant OR Dominant interaction model. Again, the FRG was the most effective among the three statistics. The power patterns of the tests for the interactions under the other three interaction models were similar. To limit the length of this publication, the investigation of the power of the tests in other scenarios is presented in Supplemental Note 5.

Application to real data examples

To further evaluate its performance, the FRG for testing interaction was applied to data from the NHLBI’s ESP Project. The trait we considered was HDL. A total of 2225 individuals of European origin from 15 different cohorts in the ESP Project with no missing HDL phenotype value were included in the analysis. No evidence of cohort- and/or phenotype-specific effects or of other systematic biases was found (Tennessen et al. 2012). Exomes from related individuals were excluded from further analysis. The logarithm of HDL was taken as a trait value. The total number of genes tested for interactions, which included both common and rare variants, was 18,498. The remaining annotated human genes that did not contain any SNPs in our data set were excluded from the analysis. The P-value threshold for declaring significant interactions after applying the Bonferroni correction for multiple tests was 4.75 × 10⁻¹⁰. To examine the behavior of the FRG, we plotted QQ plots of the test (Fig. 2). The QQ plots showed that the false-positive rate of the FRG for detection of interaction is controlled to some degree.

Figure 2.

(A) QQ plot for the ESP data set. (B) QQ plot for the CHARGE-S data set.

In total, 27 pairs of genes showed significant evidence of interaction with P-values < 4.58 × 10⁻¹⁰, which were calculated using the FRG model and logarithm transformation of the HDL. The results are summarized in Table 2, where P-values for testing interactions between genes by regression on PCA and the minimum of P-values for testing all possible pairs of SNPs between two genes using a standard regression model are also listed. Since complex traits in genetic studies often have non-normal distributions, we also used the rank-based inverse normal transformation (INT) to transform HDL (Beasley et al. 2009). The P-values obtained using the INT of HDL are included in Table 2. These 27 pairs of genes were derived from 35 genes. An additional 130 pairs of genes with P-values < 9.87 × 10⁻⁹ are listed in Supplemental Table 5.

Table 4.

P-values of 11 pairs of genes that were significantly interacted in the ESP and CHARGE-S studies

Table 2.

P-values of 27 pairs of significantly interacted genes identified by FRG

Several remarkable features from these results were observed. First, we frequently observed pairwise interactions between rare and rare variants (65.56%) and between rare and common variants (34.44%). Less often observed were significant pairwise interactions between common and common variants with P-values for testing interactions < 1.0 × 10⁻⁶ in Table 2 and Supplemental Table 5, where variants with MAF < 0.05 are defined as rare variants, and variants with MAF ≥ 0.05 are defined as common variants. Second, pairs of SNPs between two genes jointly have significant interaction effects, but individually, each pair of SNPs makes a mild contribution to the interaction effects, as shown in Table 3 and Supplemental Table 6. There were a total of 684 pairs of SNPs between genes KCNK5 and PRDM13. Table 3 lists 35 pairs of SNPs with P-values < 0.0497. None of the 35 pairs of SNPs showed strong evidence of interaction. However, a number of pairs of SNPs between the genes KCNK5 and PRDM13 collectively demonstrated significant interaction. We observed similar interaction patterns in Supplemental Table 6, where eight pairs of SNPs between the genes BHMT2 and BMF with P-values < 0.045 are listed. Third, the FRG often yielded a much smaller P-value for detecting interaction than either regression on the PCA or the minimum of the P-values of the pairwise tests. Fourth, some investigators suggest that in genome-wide interaction analysis, only genes with large or mild marginal genetic effects should be tested for interaction. However, we observed that genes may not show even mild marginal association and yet demonstrate significant evidence of interaction (data not shown). Fifth, computational time for gene-based interaction analysis is much less than that for pairwise tests. In Table 2, we tested a total of 27 pairs of genes and 9696 pairs of SNPs within them. 
The computational times for the FRG method for testing 27 pairs of gene interactions and by pairwise test for testing 9696 pairs of SNP interactions were 2.18 sec and 91.91 sec, respectively. The computer configuration is as follows: CPU, Intel Core i7-3770 CPU at 3.4 GHz; memory (RAM), 16 GB. The interaction analysis by FRG on the entire set of genes was carried out on the cluster with 10 nodes, with each node having 24 cores (Intel Xeon CPU X5690 at 3.47 GHz). The running time for FRG on the entire set of genes was 18.3 h. Sixth, although interacting genes did not form large connected networks, we did observe some small interacted networks (Fig. 3). We observed three hub genes: TBC1D3B, SNTB1, and PRDM13. TBC1D3B had significant interactions with 12 genes (P-values < 4.20 × 10⁻¹⁰) and interactions with 14 genes (P-values ranging from 9.10 × 10⁻⁹ to 5.10 × 10⁻¹⁰). SNTB1 strongly interacted with two genes and had modest interactions with another 26 genes (P-values varying from 9.19 × 10⁻⁹ to 9.10 × 10⁻¹⁰). PRDM13 strongly interacted with five genes (P-values < 4.58 × 10⁻¹⁰) and had modest interactions with another 10 genes (P-values varying from 9.83 × 10⁻⁹ to 1.55 × 10⁻⁹) (Table 2; Supplemental Table 5). SNTB1 is a peripheral membrane protein. It is reported that SNTB1 plays an essential role in regulating vascular tone and blood pressure (Lyssand et al. 2008). The multiple copies of TBC1D3B are located within a cluster of chemokine genes and might be a hominoid oncoprotein (Hodzic et al. 2006). We also observed modest interactions between SNTB1 and LDLR (P-value < 4.76 × 10⁻⁷) and between SNTB1 and LIPC (P-value < 7.85 × 10⁻⁶). LDLR and LIPC were reported to influence lipid levels in genome-wide association studies (GWAS) (Aulchenko et al. 2009). PRDM13 is involved in transcriptional regulation (Chang et al. 2013). Point mutation in its strongly interacted gene KCNK5 causes early-onset of autosomal dominant hypertension (Charmandari et al. 2012). 
Two interacted subnetworks with PRDM13 and TBC1D3B as hub genes were connected via a direct interaction between the two hub genes or via their interactions with WASF2 and EFNA3 (Fig. 3). EFNA3 is a key regulator of embryogenesis and is expressed in human atherosclerotic plaque (Sakamoto et al. 2011). It was reported that EFNA3 was a potential target of microRNA 210 as a novel therapy for treatment of ischemic heart disease (Hu et al. 2010).

Table 3.

P-values of 35 pairs of SNPs between genes KCNK5 and PRDM13 for testing interaction

Figure 3.

Networks of 27 pairs of genes showing significant evidence of interactions and genes showing mild interactions in Supplemental Table S5.

To further evaluate the performance of the FRG for interaction analysis, we investigated whether the 27 pairs of interacted genes (Table 2) in the ESP can be replicated in the CHARGE-S studies, which generated low-coverage, whole-genome sequencing data of 955 individuals from the ARIC (Atherosclerosis Risk in Communities), Framingham, and CHS (Cardiovascular Health Study) longitudinal cohorts after quality control with rich phenotypes including HDL cholesterol levels. A total of 25 pairs of genes in Table 2 were sequenced in CHARGE-S (SNTB1 was not sequenced in CHARGE-S). Since we carried out 25 tests, the P-value for declaring replication after the Bonferroni correction for multiple tests was 0.002. We observed that 11 of the 25 pairs of significantly interacted genes (involving 14 genes) in the ESP project were replicated in the CHARGE-S study (Table 4). To further evaluate the performance of the FRG, we also considered a scenario where INT transformation of the HDL was taken as a trait value. The P-values for testing interactions between 10 pairs of genes selected from Table 2 using INT transformation of the HDL as a trait are included in Supplemental Note 6 (Supplemental Table S10). It is interesting to note that a subnetwork including six interactions with hub gene PRDM13 and four interactions with hub gene TBC1D3B (Fig. 4) was replicated in the CHARGE-S study. This again showed that PRDM13 and TBC1D3B may make a large contribution to HDL-level variation.
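The replication threshold quoted above follows from dividing a family-wise level by the number of tests. The short sketch below reproduces that arithmetic; the family-wise level of 0.05 is an assumption (it is the value implied by 0.002 with 25 tests, not one stated explicitly for the replication step), and the function name `bonferroni_threshold` is hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests

# 25 gene pairs were tested for replication in CHARGE-S;
# alpha = 0.05 is the assumed family-wise level.
replication_threshold = bonferroni_threshold(0.05, 25)
```

Under these assumptions `replication_threshold` equals 0.002, matching the value used to declare replication.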

Figure 4.

Nine interactions (pink color) between genes (green color) which form a subnetwork were replicated in the NHLBI’s ESP and CHARGE-S studies.


Discussion

The current paradigm of pairwise interaction analysis was originally designed for testing the interaction for common variants and cannot be applied to genome-wide interaction analysis with rare variants due to its poor ability to detect interaction between rare variants, and rare and common variants, its prohibitive computational time, and the extremely large number of tests being conducted. To address these central themes and critical barriers in interaction analysis, we shift the paradigm of interaction analysis from the pairwise test to the collective group test, where we take a genome region (or gene) as a basic unit of interaction analysis and collectively test the interaction between all possible pairs of SNPs within two genome regions (or genes) and use FRG to develop a novel statistical framework for testing the interaction between two genomic regions (or genes). Using large simulations and real data analysis, we demonstrate the merits and limitations of the proposed new paradigm of interaction analysis. The new approach uses all genetic information in the genome region to collectively test interaction between multiple SNPs within the regions. In the FRG approach to interaction analysis, we first expand the genotype function in a genomic region (gene) in terms of orthonormal basis functions. Genetic information across all variants in the genomic region, including all single variant variation and their linkage disequilibrium, is compressed into expansion coefficients. We use the compressed genetic information to globally test interaction between two genomic regions (genes). Therefore, the FRG for interaction analysis overcomes limitations inherent in pairwise interaction tests. By large simulations and real data analysis, we showed that the proposed FRG substantially increased the power and dramatically reduced the computational burden. 
In real data analysis, we also demonstrate that pairs of SNPs between two genes can jointly have significant interaction effects even though each pair of SNPs individually makes only a mild contribution to the interaction effects. Pairwise interaction analysis is designed to test interactions between common variants and is difficult to use for testing interactions between rare and rare variants or between rare and common variants. There is an increasing need for statistics that can test interactions across the entire allelic spectrum of variants. The FRG can efficiently test the interaction between rare and rare, rare and common, and common and common variants.
The essential problems in performing genome-wide interaction analysis in practice are the power of the test statistics, the feasibility of the computations, and efficient methods for P-value correction for multiple tests. Because the widely used pairwise tests for interaction lack power and are computationally intensive, exploration of genome-wide gene–gene interactions has been limited (Ay 2002; Costanzo et al. 2010). Many geneticists question the universal presence of significant gene–gene interactions. Very few genome-wide interaction analyses with NGS data, and very few significant interactions between rare and rare variants or between rare and common variants, have been reported. To our knowledge, we are among the first to conduct genome-wide interaction analysis with exome sequencing data.
From the genome-wide interaction analysis of HDL using the NHLBI’s Exome Sequencing data, we make several important observations. The majority of the significantly interacting genes showed no marginal association. Surprisingly, in the 157 top pairs of interacting genes, the P-values for testing the marginal association of genes by the functional linear model ranged from 0.9933 to 0.00017. This strongly suggests that testing interactions only for genes with strong or mild marginal association will miss the majority of the interactions.
An interesting question is which types of variants, rare or common, are more often present in interactions. Our limited results showed that a large proportion of interactions were due to interaction between rare and rare variants and between rare and common variants, whereas fewer of the significant pairwise interactions arose from interaction between common and common variants.
Whether interactions are most often present in isolation or whether interacting genes form networks is an open question. Our results indicated that interacting genes formed small interacting networks and that hub genes were present in the networks. These hub genes might be essential for interaction, which in turn may lead to important biological functions causing phenotype variation. We also identified large networks by examining interactions between loci associated with serum lipid levels in recent GWAS, although interactions between genes in these networks were mild. We suspect that the genes in such a network may jointly contribute to the phenotype variation. Our preliminary results also showed that interactions can be replicated in two independent studies and that interactions with hub genes were more easily replicated.
It is well known that population stratification or cryptic relatedness may create artifactual LD, which in turn leads to spurious interaction. In the presence of population structure and cryptic relatedness, a mixed FRG is generally needed to remove their impact on the tests. A detailed investigation is beyond the scope of this article.
NGS techniques generate extremely high-dimensional genomic data. The transition from low-dimensional data to extremely high-dimensional data demands changes in the concept of interaction and in quantitative trait models. Functional data analysis and the concept of group tests provide a powerful tool for interaction analysis. However, the results presented in this article should be considered preliminary. The number of basis functions in the expansion of the genotype function influences the performance of the FRG for interaction analysis. In practice, we select the number of basis functions that explains 80%–90% of the genetic variation. Gene–gene interaction is an important but complex concept. Although functional data analysis and taking genomic regions as the unit of analysis can largely reduce the dimension of the data for interaction analysis, genome-wide gene–gene interaction analysis still requires intensive computation and still poses great challenges. The main purpose of this article is to stimulate discussion about the optimal strategies for genome-wide interaction analysis. We hope that our results will greatly increase confidence in applying the FRG to genome-wide gene–gene interaction analysis.
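The rule of choosing the number of basis functions that explains 80%–90% of the genetic variation can be sketched as follows. This is only an illustration under stated assumptions: it takes the basis to be principal components of the centered genotype matrix, and the function name `n_components_for_variance` and the default threshold of 0.85 are hypothetical choices, not taken from the authors' software.

```python
import numpy as np


def n_components_for_variance(genotypes, threshold=0.85):
    """Smallest number of principal components whose cumulative share of
    genotype variance reaches `threshold` (a fraction between 0 and 1)."""
    x = np.asarray(genotypes, dtype=float)
    x = x - x.mean(axis=0)                    # center each variant
    s = np.linalg.svd(x, compute_uv=False)    # singular values, descending
    total = np.sum(s ** 2)
    if total == 0.0:
        raise ValueError("genotype matrix has no variation")
    explained = np.cumsum(s ** 2) / total     # cumulative variance explained
    # First index whose cumulative share is >= threshold, as a count.
    k = int(np.searchsorted(explained, threshold)) + 1
    return min(k, len(s)), explained


# Simulated genotypes for one region (hypothetical data).
rng = np.random.default_rng(0)
geno = rng.binomial(2, 0.1, size=(200, 30))
k, explained = n_components_for_variance(geno, threshold=0.85)
print(k, round(float(explained[k - 1]), 3))
```

A higher threshold keeps more components, which retains more genetic information but adds interaction terms to the test, since the number of interaction terms is the product of the component counts of the two regions.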

Data access

Software for implementing the proposed methods can be downloaded from Bioconductor (http://www.bioconductor.org/packages/2.14/bioc/html/FRGEpistasis.html) and our website http://www.sph.uth.tmc.edu/hgc/faculty/xiong/index.htm.

27 in total

Review 1. A review of feature selection techniques in bioinformatics.

Authors: Yvan Saeys; Iñaki Inza; Pedro Larrañaga
Journal: Bioinformatics Date: 2007-08-24 Impact factor: 6.937

2. Evolution and functional impact of rare coding variation from deep sequencing of human exomes.

Authors: Jacob A Tennessen; Abigail W Bigham; Timothy D O'Connor; Wenqing Fu; Eimear E Kenny; Simon Gravel; Sean McGee; Ron Do; Xiaoming Liu; Goo Jun; Hyun Min Kang; Daniel Jordan; Suzanne M Leal; Stacey Gabriel; Mark J Rieder; Goncalo Abecasis; David Altshuler; Deborah A Nickerson; Eric Boerwinkle; Shamil Sunyaev; Carlos D Bustamante; Michael J Bamshad; Joshua M Akey
Journal: Science Date: 2012-05-17 Impact factor: 47.728

3. The distribution of rare alleles.

Authors: P Joyce; S Tavaré
Journal: J Math Biol Date: 1995 Impact factor: 2.259

4. MicroRNA-210 as a novel therapy for treatment of ischemic heart disease.

Authors: Shijun Hu; Mei Huang; Zongjin Li; Fangjun Jia; Zhumur Ghosh; Maarten A Lijkwan; Pasquale Fasanaro; Ning Sun; Xi Wang; Fabio Martelli; Robert C Robbins; Joseph C Wu
Journal: Circulation Date: 2010-09-14 Impact factor: 29.690

5. TBC1D3, a hominoid oncoprotein, is encoded by a cluster of paralogues located on chromosome 17q12.

Authors: Didier Hodzic; Chen Kong; Marisa J Wainszelbaum; Audra J Charron; Xiong Su; Philip D Stahl
Journal: Genomics Date: 2006-07-24 Impact factor: 5.736

6. Quantitative trait locus analysis for next-generation sequencing with the functional linear models.

Authors: Li Luo; Yun Zhu; Momiao Xiong
Journal: J Med Genet Date: 2012-08 Impact factor: 6.318

7. The genetic landscape of a cell.

Authors: Michael Costanzo; Anastasia Baryshnikova; Jeremy Bellay; Yungil Kim; Eric D Spear; Carolyn S Sevier; Huiming Ding; Judice L Y Koh; Kiana Toufighi; Sara Mostafavi; Jeany Prinz; Robert P St Onge; Benjamin VanderSluis; Taras Makhnevych; Franco J Vizeacoumar; Solmaz Alizadeh; Sondra Bahr; Renee L Brost; Yiqun Chen; Murat Cokol; Raamesh Deshpande; Zhijian Li; Zhen-Yuan Lin; Wendy Liang; Michaela Marback; Jadine Paw; Bryan-Joseph San Luis; Ermira Shuteriqi; Amy Hin Yan Tong; Nydia van Dyk; Iain M Wallace; Joseph A Whitney; Matthew T Weirauch; Guoqing Zhong; Hongwei Zhu; Walid A Houry; Michael Brudno; Sasan Ragibizadeh; Balázs Papp; Csaba Pál; Frederick P Roth; Guri Giaever; Corey Nislow; Olga G Troyanskaya; Howard Bussey; Gary D Bader; Anne-Claude Gingras; Quaid D Morris; Philip M Kim; Chris A Kaiser; Chad L Myers; Brenda J Andrews; Charles Boone
Journal: Science Date: 2010-01-22 Impact factor: 47.728

8. Prdm13 mediates the balance of inhibitory and excitatory neurons in somatosensory circuits.

Authors: Joshua C Chang; David M Meredith; Paul R Mayer; Mark D Borromeo; Helen C Lai; Yi-Hung Ou; Jane E Johnson
Journal: Dev Cell Date: 2013-04-29 Impact factor: 12.270

9. Rank-based inverse normal transformations are increasingly used, but are they merited?

Authors: T Mark Beasley; Stephen Erickson; David B Allison
Journal: Behav Genet Date: 2009-06-14 Impact factor: 2.805

10. A groupwise association test for rare mutations using a weighted sum statistic.

Authors: Bo Eskerod Madsen; Sharon R Browning
Journal: PLoS Genet Date: 2009-02-13 Impact factor: 5.917
