
Simple F test reveals gene-gene interactions in case-control studies.

Guanjie Chen, Ao Yuan, Jie Zhou, Amy R Bentley, Adebowale Adeyemo, Charles N Rotimi.

Abstract

Missing heritability is still a challenge for Genome Wide Association Studies (GWAS). Gene-gene interactions may partially explain this residual genetic influence and contribute broadly to complex disease. To analyze gene-gene interactions in case-control studies of complex disease, we propose a simple, non-parametric method that utilizes the F-statistic. This approach consists of three steps. First, we examine the joint distribution of a pair of SNPs in cases and controls separately. Second, an F-test is used to evaluate the ratio of dependence in cases to that of controls. Finally, results are adjusted for multiple tests. This method was used to evaluate gene-gene interactions that are associated with risk of Type 2 Diabetes among African Americans in the Howard University Family Study. We identified 18 gene-gene interactions (P < 0.0001). Compared with the commonly used logistic regression method, we demonstrate that the F-ratio test is an efficient approach to measuring gene-gene interactions, especially for studies with limited sample size.


Keywords: F-ratio test; g-g interactions; power

Year: 2012 PMID: 22837643 PMCID: PMC3399554 DOI: 10.4137/BBI.S9867

Source DB: PubMed Journal: Bioinform Biol Insights ISSN: 1177-9322

Introduction

Both genetic and environmental risk factors play critical roles in the development of human diseases. Understanding the etiology of complex diseases, such as Type 2 diabetes (T2D), is proving to be a challenging task.1–4 Partly responsible for this difficulty is the current inability to systematically account for genetic effects that manifest solely or partially in interaction with other genes.5 Many studies6–8 suggest that gene-gene interactions may play an important role in disease etiology. As such, the development of statistical tools to detect these genetic effects has received considerable attention. One of the most commonly-used methods for identifying gene-gene interactions is logistic regression, which models the relationship between genotypes and qualitative clinical outcomes.9–15 Although convenient in application and efficient in inference when the model represents the true relationship in the population, there are a few limitations to this method that should be considered. First, a major challenge of parametric methods, like the logistic model, is the robustness and reliability of the modeling. It is known, for example,16,17 that when a given model does not represent the true relationships in the population being evaluated, bias will be introduced. This is a serious issue in practice when researchers are not sure of the validity of the underlying parametric model. Model justification, except for very simple cases, is a daunting task, especially with multi-dimensional data. 
Second, the number of possible interaction terms grows exponentially with the addition of each main effect, so logistic regression is limited with regard to interaction data involving many simultaneous factors.18,19 Third, parametric approaches have less power for detecting interactions than for detecting independent main effects, necessitating large sample sizes.20 Finally, interpreting the parameter estimates for interaction terms resulting from this type of analysis is not straightforward.21 In contrast, non-parametric approaches, although generally requiring larger sample sizes than parametric methods, are robust and reliable and have been successfully used in genetic analysis. Non-parametric methods are generally more complicated in formulation and computation than parametric methods, due to the non-parametric modeling of the data distribution. However, it is usually simpler to construct the test statistic and compute results for hypothesis testing using non-parametric methods, as accurate asymptotic results can be applied without concern over robustness. Here we present a non-parametric, model-free approach to detect gene-gene interactions in case-control studies. When both case and control SNP frequencies are in Hardy-Weinberg equilibrium (HWE), the test statistic follows a standard F-distribution by asymptotic approximation; when the SNPs are not in HWE, the test statistic approximately follows a non-central F-distribution. The corresponding P-value and its power under the alternative can easily be computed via simulation. We demonstrate this method in an analysis of T2D in the Howard University Family Study (HUFS).22

The Method

Let SNP1 and SNP2 be trait-related loci, with genotypes coded as 0, 1, and 2. Let $(x_{1k}, x_{2k})$, $k = 1, \ldots, n$, be the genotypes at SNP1 and SNP2 among cases, and let $(y_{1k}, y_{2k})$ be the corresponding genotypes among controls. To investigate whether a SNP-by-SNP interaction influences the outcome of interest, we determine whether the joint frequency of these SNPs differs by case status. Let $H_0$ be the hypothesis that the two SNPs are independent. Statistically, this can be tested by constructing two 3 × 3 contingency tables. For the cases, the $(i, j)$ cell holds the count $n_{ij}^{(1)}$ of pairs $(x_{1k}, x_{2k})$ that take the value $(i, j)$ $(i, j = 0, 1, 2)$; for the controls, the corresponding counts are $n_{ij}^{(2)}$. Let $n_{i}^{(1)} = \sum_{j=0}^{2} n_{ij}^{(1)}$ and $n_{i}^{(2)} = \sum_{j=0}^{2} n_{ij}^{(2)}$ $(i = 0, 1, 2)$ be the marginal counts, and let $\chi_1^2$ and $\chi_2^2$ be the chi-square statistics computed from the case and control tables, respectively.

Under $H_0$, both the case and control cell counts will be in Hardy-Weinberg equilibrium, so $\chi_1^2$ and $\chi_2^2$ are asymptotically independent chi-square variables with three degrees of freedom. Asymptotically, therefore, the ratio

$$T = \chi_1^2 / \chi_2^2$$

follows an F distribution with (3, 3) degrees of freedom. If $H_0$ is true, this statistic will be close to 1; if $H_0$ is not true, it will deviate significantly from 1. The relevant P-value for a specific level of α (typically α = 0.05, 0.03, 0.02, or 0.01) can be determined using an $F_{3,3}$ table. To quantify the magnitude of the interaction, we may define $r = 2T/(1 + T) - 1$. Note that $-1 \le r \le 1$: $r = 0$ corresponds to no interaction, $r = -1$ is the maximum negative correlation, and $r = 1$ is the maximum positive correlation.

Note that spurious interactions may occur as a result of SNPs being in linkage disequilibrium (LD) with each other. While LD could first be tested among controls, this step is not necessary with this method. In the absence of an interaction, LD should not differ between cases and controls and, as the test statistic is the ratio of cases to controls, LD should not affect the results.

Deviation from Hardy-Weinberg equilibrium is possible for reasons other than linkage to the trait. In this situation, $\chi_1^2$ asymptotically follows a non-central chi-square distribution with 3 degrees of freedom and non-centrality parameter $n\delta_1$, where $\delta_1$ is a function of $p_i$, the frequency of SNP type $i$ $(i = 0, 1, 2)$, and $p_{ij}$, the frequency of joint SNP type $(i, j)$ $(i, j = 0, 1, 2)$, for the cases. Similarly, $\chi_2^2$ asymptotically follows a non-central chi-square distribution with 3 degrees of freedom and non-centrality parameter $n\delta_2$, where $\delta_2$ is the same function of $q_i$, the frequency of SNP type $i$, and $q_{ij}$, the frequency of joint SNP type $(i, j)$, for the controls. So asymptotically, $T = \chi_1^2 / \chi_2^2$ follows an F distribution with degrees of freedom (3, 3) and non-centrality parameters $n\delta_1$ and $n\delta_2$. Under $H_0$, $p_i = q_i$ and $p_{ij} = q_{ij}$ $(i, j = 0, 1, 2)$, so $\delta_1 = \delta_2$ and the ratio will be close to 1. If $H_0$ is not true, typically $\delta_1 > \delta_2$, and the ratio will tend to deviate from 1 significantly.

For given data, $n$, and $(\delta_1, \delta_2)$, the P-value of the observed ratio and the power of the level-α test can be computed via simulation. Specifically, under $H_0$, for each given $\delta_1 = \delta_2 = \delta$, the P-value of the observed statistic $T$ is computed as follows. Choose a large $m$ (typically $m = 100{,}000$) and, for $j = 1, \ldots, m$, do the following. Sample $X_{j,k}$ and $Y_{j,k}$ $(k = 1, 2, 3)$ independently from $N((n\delta/3)^{1/2}, 1)$. Let

$$Z_j = (X_{j,1}^2 + X_{j,2}^2 + X_{j,3}^2) / (Y_{j,1}^2 + Y_{j,2}^2 + Y_{j,3}^2);$$

then $Z_j$ is a sample from $F_{3,3}(n\delta, n\delta)$. Let $V_j = I(Z_j > T)$, where $I(\cdot)$ is the indicator function, i.e., $V_j$ takes the value 1 if $Z_j > T$ and 0 otherwise. Then $P(\delta) = \sum_{j=1}^{m} V_j / m$ is the simulated P-value at $\delta$ of the observed $T$. Let $Z_{(1)} \le Z_{(2)} \le \cdots \le Z_{(m)}$ be the ordered values of the $Z_j$. Let $r = [(1 - \alpha)m]$, the largest integer not exceeding $(1 - \alpha)m$; the $(1 - \alpha)$-th quantile of the $F_{3,3}(n\delta, n\delta)$ distribution is then simulated as $Q(1 - \alpha, \delta) = Z_{(r)}$. The P-value can be tabulated for a list of different δ values, for example δ = 0.1, 0.2, …

Similarly, for given $\delta_1 > \delta_2$, the power of the level-α test is simulated as follows. For $j = 1, \ldots, m$, sample $X_{j,k}$ $(k = 1, 2, 3)$ independently from $N((n\delta_1/3)^{1/2}, 1)$ and $Y_{j,k}$ $(k = 1, 2, 3)$ independently from $N((n\delta_2/3)^{1/2}, 1)$. Let $Z_j = (X_{j,1}^2 + X_{j,2}^2 + X_{j,3}^2) / (Y_{j,1}^2 + Y_{j,2}^2 + Y_{j,3}^2)$; then $Z_j$ is a sample from $F_{3,3}(n\delta_1, n\delta_2)$. Let $V_j = I(Z_j > Q(1 - \alpha, \delta_2))$; then $P(\delta_1, \delta_2) = \sum_{j=1}^{m} V_j / m$ is the simulated power at $(\delta_1, \delta_2)$, where $Q(1 - \alpha, \delta_2)$ is the quantile computed as above. For a given level α, let $F_{3,3}(1 - \alpha)$ be the $(1 - \alpha)$-th quantile of the $F_{3,3}$ distribution; the rejection rule for $H_0$ is $T > F_{3,3}(1 - \alpha)$, and the power when the true data are generated with $\delta > 0$ is $\beta(\delta) = P(T > F_{3,3}(1 - \alpha) \mid \delta)$. The power at a given level of α can be tabulated for a list of different $(\delta_1, \delta_2)$ and $n$ values, for example $(\delta_1, \delta_2) = (0.1, 0), (0.2, 0), \ldots, (1, 0)$ and $n = 30, 50, 100, 150, 200, \ldots$

When one (or both) of the minor alleles for the SNP pair being tested has a small frequency, the rare homozygote SNP type will have an extremely small frequency in the contingency table. In this case, the asymptotic approximation of the F-distribution for the $T$ statistic is not justified. Let $n_0$ be the smallest observed frequency in either the case or the control contingency table. As a rule of thumb, when $n_0 < 10$, the sample size is not large enough for the asymptotic approximation to be valid. In this case, the 'exact' P-value (under the null) of the observed statistic $T$ can be computed by the standard exact method.

Departures from Hardy-Weinberg equilibrium among controls were assessed by comparing the observed genotype frequencies to the expected frequencies using the exact test. Odds ratios and 95% confidence intervals for single-locus associations were obtained using unconditional logistic regression. As a basis for comparison, logistic regression models were also fitted to evaluate the gene-gene interactions. Models included each SNP individually as well as a SNP × SNP product term.
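The ratio statistic, the interaction measure r, and the simulated P-value described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the authors' SAS implementation: it assumes that each per-table statistic is the ordinary Pearson chi-square for independence, and the function names and example counts are hypothetical.

```python
import math
import random


def pearson_chi2(table):
    """Pearson chi-square for independence in a table of counts (assumed form)."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def f_ratio(case_table, control_table):
    """T = chi2(cases) / chi2(controls); raises ZeroDivisionError if the
    control table shows exactly zero dependence."""
    return pearson_chi2(case_table) / pearson_chi2(control_table)


def interaction_r(t):
    """Interaction magnitude r = 2T/(1 + T) - 1 from the text."""
    return 2 * t / (1 + t) - 1


def simulate_p_value(t_obs, n, delta, m=100_000, seed=0):
    """Share of simulated F(3,3; n*delta, n*delta) draws that exceed t_obs."""
    rng = random.Random(seed)
    mu = math.sqrt(n * delta / 3)
    exceed = 0
    for _ in range(m):
        num = sum(rng.gauss(mu, 1) ** 2 for _ in range(3))
        den = sum(rng.gauss(mu, 1) ** 2 for _ in range(3))
        if num / den > t_obs:
            exceed += 1
    return exceed / m


# Hypothetical 3 x 3 genotype tables (rows: SNP1 = 0, 1, 2; columns: SNP2 = 0, 1, 2).
case_table = [[40, 30, 10], [30, 45, 15], [10, 15, 26]]
control_table = [[50, 40, 12], [40, 35, 10], [12, 10, 4]]

t_obs = f_ratio(case_table, control_table)
print("T =", t_obs, "r =", interaction_r(t_obs))
print("simulated P-value (delta = 0):", simulate_p_value(t_obs, n=200, delta=0.0, m=20_000))
```

With delta = 0 the simulation draws from the central F distribution with (3, 3) degrees of freedom, so the result should agree with a tabulated F value up to Monte Carlo error; a larger m tightens that error at the cost of run time.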
The FDR method was used to adjust for multiple testing,23 although, if all the tests are independent, a Bonferroni correction may also be used.24 The analysis software is written in SAS and can be provided upon request to chengu@mail.nih.gov.
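As an illustration of the multiple-testing adjustment, the sketch below computes FDR-adjusted q-values. It assumes the Benjamini-Hochberg step-up procedure, which is a common choice but is not confirmed by the text as the exact variant used; the function name and the example P-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values in the original order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Hypothetical P-values from four interaction tests.
example_p = [0.01, 0.04, 0.03, 0.005]
example_q = benjamini_hochberg(example_p)
print(example_q)
```

An interaction is then declared significant when its q-value falls below the chosen FDR level.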

Data Analysis

We applied our method to T2D using the Howard University Family Study (HUFS) data.22 Briefly, the HUFS is a population-based family study of African Americans in the Washington, D.C. metropolitan area. The major objective of the HUFS was to enroll and examine a randomly ascertained sample of African American families, along with a set of unrelated individuals, for the study of the genetic and environmental bases of common complex diseases including hypertension, obesity, diabetes, and associated phenotypes. A total of 1082 unrelated individuals had both phenotype and genotype (Affymetrix 6.0) data. Of these, 221 individuals were classified as having T2D (defined as fasting plasma glucose concentration > 126 mg/dL, report of a doctor’s diagnosis of T2D, or report of current T2D treatment). Based on previous publications,25,26 19 T2D candidate gene regions (Table 1) were selected for analysis. Of note, the issue of locus interaction is independent of the consideration of main effects: loci that strongly interact may or may not be associated individually with the trait. Thus, the SNPs included in our analysis were not first limited to those with a main effect on T2D. Within these regions, 608 SNPs passed quality control filters: call rate ≥ 95%, minor allele frequency (MAF) > 0.05, and Hardy-Weinberg equilibrium (HWE) P-value > 0.01. After LD pruning with a window size of 50 and an R² threshold of 0.3 between two loci, 298 SNPs not in LD with each other, located in the 19 candidate T2D gene regions, were used for analysis (Table 1).
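The per-SNP quality control filters listed above can be expressed as a simple predicate. The sketch below is illustrative only: the function names are hypothetical, the HWE P-value is taken as a precomputed input (for example from an exact test), and LD pruning is not shown.

```python
def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Minor allele frequency from the three genotype counts of one SNP."""
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    freq_b = (n_ab + 2 * n_bb) / total_alleles
    return min(freq_b, 1 - freq_b)


def passes_qc(call_rate, maf, hwe_p,
              min_call_rate=0.95, min_maf=0.05, min_hwe_p=0.01):
    """Apply the thresholds stated in the text: call rate >= 95%,
    MAF > 0.05, and HWE P-value > 0.01."""
    return call_rate >= min_call_rate and maf > min_maf and hwe_p > min_hwe_p


# Hypothetical SNP: 50 AA, 40 AB, and 10 BB genotypes with a 97% call rate.
maf = minor_allele_frequency(50, 40, 10)
print(maf, passes_qc(call_rate=0.97, maf=maf, hwe_p=0.5))
```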

Table 1

The list of candidate genes that were analyzed.

Genes	Location	No. of SNPs	Order*
GCKR	2p23	4	1–4
BCL11A	2p16.1	9	5–13
IRS1	2q36	6	14–19
PPARG	3p25	15	20–34
WFS1	4p16.1	8	35–42
KLF14	7q32.3	1	43–43
TP53INP1	8q22	3	44–46
TCF7L2	10q25.3	30	47–76
KCNQ1	11p15.5	71	77–147
KCNJ11	11p15.1	3	148–150
CENTD2/ARAP1	11q13.4	3	151–153
MTNR1B	11q21	4	154–157
HMGA2	12q15	15	158–172
IGF1	12q23.2	8	173–180
HNF1A	12q24.2	3	181–183
ZFAND6	15q25.1	7	184–190
PRC1	15q26.1	6	191–196
FTO	16q12.2	78	197–274
HNF1B	17q21.3	24	275–298

Note: *The order represents the position of the SNP in Figures 1 and 2.

For reference, logistic regression analysis of each of the loci without interaction was conducted (all results with P < 0.01 are presented in Table 2). After correction for multiple tests, no SNP reached the threshold for statistical significance (Bonferroni significance level P < 1.7 × 10⁻⁴). The threshold for statistical significance for the gene-gene interactions evaluated by the F-ratio method was set at P < 0.001 (corresponding to an FDR q-value of 0.027); at this level of statistical significance, the dependence between the two loci among cases was over 141 times higher than among controls. Eighteen significant gene-gene interactions were discovered (the top 7 are presented in Table 3). For comparison, logistic regression was also used to evaluate gene-gene interactions in the same data. To illustrate the overall similarity of these approaches, heat maps were created showing the statistical significance of the interaction for each pair of SNPs evaluated using the F-ratio (Fig. 1) and logistic regression (Fig. 2) analyses. Similar patterns were observed with both methods; at the same level of statistical significance (P = 0.05), there was a concordance rate of 94.09% between the two methods. The generally lower P-values observed with logistic regression are presumed to reflect the fact that logistic regression models are already adjusted for the main effect of each of the SNPs, while the F-ratio method is not. Figure 3 displays the power of the F-ratio method for a variety of δ values (a measure of the deviation from HWE between two SNPs), sample sizes, and α levels. At α = 0.05, δ1 = 0.4, and a sample size of n = 100, the F-ratio method reaches a power of 0.80. The strong power that can be achieved at this moderate value of δ with fewer than 200 individuals suggests the practicality of using this method when sample size is limited.
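The power values summarized in Figure 3 follow the simulation recipe given in the Method section: first simulate the (1 − α) quantile Q(1 − α, δ2) under δ1 = δ2, then count how often draws from F3,3(nδ1, nδ2) exceed it. The Python sketch below is a hedged re-implementation of that recipe, not the authors' SAS code; the function names are hypothetical, and results carry Monte Carlo error.

```python
import math
import random


def simulate_f_ratios(n, delta_num, delta_den, m, rng):
    """Draw m ratios of non-central chi-square(3) variables with
    non-centrality n*delta_num (numerator) and n*delta_den (denominator)."""
    mu_num = math.sqrt(n * delta_num / 3)
    mu_den = math.sqrt(n * delta_den / 3)
    ratios = []
    for _ in range(m):
        num = sum(rng.gauss(mu_num, 1) ** 2 for _ in range(3))
        den = sum(rng.gauss(mu_den, 1) ** 2 for _ in range(3))
        ratios.append(num / den)
    return ratios


def simulated_quantile(n, delta, alpha, m, rng):
    """Q(1 - alpha, delta): the r-th order statistic, r = floor((1 - alpha) * m)."""
    z = sorted(simulate_f_ratios(n, delta, delta, m, rng))
    r = max(1, math.floor((1 - alpha) * m))
    return z[r - 1]  # order statistics are 1-based in the text


def simulated_power(n, delta1, delta2, alpha=0.05, m=100_000, seed=0):
    """Share of F(3,3; n*delta1, n*delta2) draws above Q(1 - alpha, delta2)."""
    rng = random.Random(seed)
    q = simulated_quantile(n, delta2, alpha, m, rng)
    z = simulate_f_ratios(n, delta1, delta2, m, rng)
    return sum(1 for value in z if value > q) / m


# Setting quoted in the text: alpha = 0.05, delta1 = 0.4 in cases,
# delta2 = 0 in controls, n = 100 (a smaller m keeps the run short).
print(simulated_power(n=100, delta1=0.4, delta2=0.0, alpha=0.05, m=20_000))
```

Looping this function over a grid of δ1 values and sample sizes gives curves that can be compared with the panels of Figure 3.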

Table 2

Significant results for single locus association of 298 SNPs in 19 genes.

SNPs	Genes	Odds ratio	95% C.I.	P-values
rs10956932	TP53INP1	1.62	1.27–2.05	0.00008
rs8053888	FTO	0.66	0.11–0.53	0.00025
rs12573128	TCF7L2	0.67	0.53–0.84	0.00070
rs231901	KCNQ1	0.49	0.21–0.75	0.00092
rs9806929	FTO	0.60	0.42–0.84	0.00284
rs11649763	HNF1B	0.44	0.25–0.77	0.00403
rs7069007	TCF7L2	1.56	1.13–2.15	0.00738
rs5742652	IGF1	2.08	1.19–3.62	0.00981

Table 3

Top results of gene-gene interactions from 298 SNPs in 19 genes.

Locus (name of genes)	Locus (name of genes)	P-value
rs10519280 (ZFAND6)	rs12149010 (FTO)	2.62 × 10⁻⁶
rs5742652 (IGF1)	rs7205617 (FTO)	7.71 × 10⁻⁶
rs17130192 (TCF7L2)	rs12425829 (HMGA2)	8.27 × 10⁻⁶
rs17130192 (TCF7L2)	rs11111262 (IGF1)	8.57 × 10⁻⁶
rs17130192 (TCF7L2)	rs17636091 (PRC1)	1.10 × 10⁻⁵
rs2272046 (HMGA2)	rs17636091 (PRC1)	1.35 × 10⁻⁵
rs2272046 (HMGA2)	rs6824720 (WFS1)	1.45 × 10⁻⁵

Figure 1

Gene-gene interactions among 298 SNPs distributed in 19 genes using our simple F-ratio non-parametric method.

Note: Colors from dark blue to red represent P-value from 0.00 to 1.00.

Figure 2

Gene-gene interactions among 298 SNPs distributed in 19 genes using logistic regression.

Note: Colors from dark blue to red represent P-value from 0.00 to 1.00.

Figure 3

Power of the simple F-ratio test.

Notes: x-axis: δ values in cases (δ = 0 in controls). y-axis: power. Panels from left to right represent α of 0.01, 0.03, and 0.05. The color of the lines indicates sample size: black (n = 30), red (n = 50), green (n = 100), purple (n = 150), and blue (n = 200).

Discussion

We present a new method for evaluating gene-gene interactions that uses the F-ratio test. Using this method, 18 gene-gene interactions were found to influence risk of Type 2 diabetes among African Americans in the Howard University Family Study. As each of the genes investigated is a candidate gene, its individual role in disease risk is presumed. Identifying the specific mechanisms by which these genes would be expected to interact is beyond the scope of this work, but the top results suggest that some of the effect of genes involved in insulin sensitivity (such as ZFAND6 and IGF1) is mediated through obesity (FTO),26,27 a reasonable hypothesis. In comparison with logistic regression, the F-ratio test was shown to be an efficient method with minimal potential bias and good power to detect moderate gene-gene interactions even with relatively small sample sizes. An exhaustive search of all pairwise locus interactions in genome-wide data is time consuming. Given 500,000 to 1,000,000 SNPs in 5,000 individuals, computation time may be several weeks or even months.21 Although the F-ratio method does not decrease the number of tests, it reduces CPU time per test from 0.04 seconds (logistic regression) to 0.01 seconds (F-ratio method) in the same computing environment. The results of the gene-gene interaction analysis were corrected using the FDR method. As SNPs in LD were excluded from the analysis in order to increase efficiency, a Bonferroni correction could have been used28 (correcting for [(number of loci)² − (number of loci)]/2 tests). However, a Bonferroni correction would be overly conservative; the existence of marginal effects negates the multiple testing cost.24
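To make the multiple-testing burden concrete, the short sketch below evaluates the pairwise test count given in the Discussion for the 298 SNPs analyzed here, together with the corresponding Bonferroni thresholds at a family-wise α of 0.05. The function names are hypothetical, and α = 0.05 is an assumption that is consistent with the single-locus threshold of 1.7 × 10⁻⁴ reported in the results.

```python
def n_pairwise_tests(n_loci):
    """Number of distinct locus pairs: (n_loci^2 - n_loci) / 2."""
    return (n_loci ** 2 - n_loci) // 2


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under a Bonferroni correction."""
    return alpha / n_tests


n_snps = 298
pairs = n_pairwise_tests(n_snps)
print("pairwise tests:", pairs)
print("single-locus threshold:", bonferroni_threshold(0.05, n_snps))
print("pairwise threshold:", bonferroni_threshold(0.05, pairs))
```

For 298 SNPs this gives 44,253 pairwise tests, so a Bonferroni threshold for the interaction scan would be roughly two orders of magnitude stricter than the single-locus one.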

Conclusion

The F-ratio test was used as a non-parametric method for comparing the relationship between trait-associated loci in cases to that in controls. A different pattern of joint genotype frequencies in cases compared to controls indicates an interaction between these loci that affects case status. This method represents a novel technique to identify the combination of polymorphisms associated with the risk of common complex diseases. This method overcomes some limitations of logistic regression modeling for detection and characterization of gene-gene interactions. The F-ratio method performed well in Type 2 Diabetes case-control data, identifying 18 gene-gene interactions. This F-ratio test is a useful statistical tool for the analysis of gene-gene interactions and represents a significant contribution in the context of the heritability that remains unexplained by single locus association studies.

21 in total

1. Detecting gene-gene interactions using affected sib pair analysis with covariates.

Authors: Peter Holmans
Journal: Hum Hered Date: 2002 Impact factor: 0.444

Review 2. Genetic associations in large versus small studies: an empirical assessment.

Authors: John P A Ioannidis; Thomas A Trikalinos; Evangelia E Ntzani; Despina G Contopoulos-Ioannidis
Journal: Lancet Date: 2003-02-15 Impact factor: 79.321

Review 3. Candidate genes for type 2 diabetes.

Authors: Hemang Parikh; Leif Groop
Journal: Rev Endocr Metab Disord Date: 2004-05 Impact factor: 6.514

4. Genome-wide strategies for detecting multiple loci that influence complex diseases.

Authors: Jonathan Marchini; Peter Donnelly; Lon R Cardon
Journal: Nat Genet Date: 2005-03-27 Impact factor: 38.330

5. Polymorphisms of CYP1A1 and GSTM1 influence the in vivo function of CYP1A2.

Authors: S MacLeod; R Sinha; F F Kadlubar; N P Lang
Journal: Mutat Res Date: 1997-05-12 Impact factor: 2.433

6. Upregulation of rat Ccnd1 gene by exendin-4 in pancreatic beta cell line INS-1: interaction of early growth response-1 with cis-regulatory element.

Authors: J-H Kang; M-J Kim; S-H Ko; I-K Jeong; K-H Koh; D-J Rhie; S-H Yoon; S-J Hahn; M-S Kim; Y-H Jo
Journal: Diabetologia Date: 2006-03-18 Impact factor: 10.122