Literature DB >> 22303409

Modeling haplotype-haplotype interactions in case-control genetic association studies.

Li Zhang1, Ruitao Liu, Zhong Wang, Daniel A Culver, Rongling Wu.   

Abstract

Haplotype analysis has been increasingly used to study the genetic basis of human diseases, but models for characterizing genetic interactions between haplotypes from different chromosomal regions have not been well developed in the current literature. In this article, we describe a statistical model for testing haplotype-haplotype interactions for human diseases with a case-control genetic association design. The model is formulated on a contingency table in which cases and controls are typed for the same set of molecular markers. By integrating well-established quantitative genetic principles, the model is equipped with a capacity to characterize physiologically meaningful epistasis arising from interactions between haplotypes from different chromosomal regions. The model allows the partition of epistasis into different components due to additive × additive, additive × dominance, dominance × additive, and dominance × dominance interactions. We derive the EM algorithm to estimate and test the effects of each of these components on differences in the pattern of genetic variation between cases and controls and, therefore, examine their role in the pathogenesis of human diseases. The method was further extended to investigate gene-environment interactions expressed at the haplotype level. The statistical properties of the models were investigated through simulation studies and its usefulness and utilization validated by analyzing the genetic association of sarcoidosis from a human genetics project.

Entities:  

Keywords:  EM algorithm; epistasis; haplotype; linkage disequilibrium; risk haplotype

Year:  2012        PMID: 22303409      PMCID: PMC3260479          DOI: 10.3389/fgene.2012.00002

Source DB:  PubMed          Journal:  Front Genet        ISSN: 1664-8021            Impact factor:   4.599


Introduction

One important aspect of the genetic complexity of human diseases is that the effect of one gene depends on the expression of one or more other genes, regulated by environmental and developmental signals (Martin et al., 2002; Gabutero et al., 2007). Such dependence, called epistasis, is thought to pervade biological kingdoms and play a pivotal role in determining the genetic architecture of complex traits (Wu et al., 2004; Shao et al., 2008). Because of increasing recognition of the importance of epistasis, there has been an explosive interest over the past 5 years in modeling epistasis and estimating its effects on complex traits and diseases (Marchini et al., 2005; Purcell et al., 2007; An et al., 2009; Lambrechts, 2010; Wang et al., 2010; Wray and Goddard, 2010; Wu et al., 2010). With the continuous reduction of cost and time to generate high-throughput genotyping data, the analysis and detection of epistatic effects will become a routine procedure for genetic association studies, particularly in genome-wide association studies. Most existing work of epistatic modeling focuses on the genetic interaction between individual genes and its different components, additive × additive, additive × dominance, dominance × additive, and dominance × dominance components, originally defined by Fisher (1918). Early geneticists incorporated Fisher’s theory into an experimental design by which gene-gene epistasis and its individual components can be tested and estimated using statistical models (Cockerham, 1954; Mather and Jinks, 1982). Epistatic modeling has been implemented in genetic mapping derived from an experimental cross or natural population, allowing the characterization of epistatic effects on any complex trait (Kao and Zeng, 2002; Wu et al., 2004, 2006). More recently, several researchers have incorporated epistasis into commonly used case-control genetic association studies (Zhang and Liu, 2007; Gayan et al., 2008) to accommodate the recent rapid development of genome-wide association studies. Most of these studies focus on the statistical definition of epistasis, i.e., the deviation from additivity of individual loci due to a non-linear relationship between multilocus genotypes and phenotypic variation in a population. This so-called statistical epistasis is a population concept depending on allele frequencies. In fact, epistasis has its own physiological meaning, in which the mutual dependence of multiple genes results from physiological interactions among biomolecules within gene regulatory networks and biochemical pathways in an individual (Cheverud and Routman, 1995; Moore and Williams, 2005). In general, statistical epistasis can better be used in evolutionary genetics and plant breeding for complex traits (Hallauer et al., 2010), aimed to study the change of population means. Physiological epistasis is relevant to the medical genetics of complex diseases, interested in genotypic values of every individual. Liu et al. (2011) incorporated physiological epistasis into the contingency table of a case-control design by embedding Fisher’s epistatic definition (see also North et al., 2005). They have proved a favorable statistical property of epistatic test and showed increased power to detect epistasis using a simple χ2 test. Wang et al. (2010) derived a general model for analyzing epistasis of any number of genes in a case-control study, showing the importance of high-order epistasis in genetic control of human diseases. With the availability of HapMap data (The International HapMap Consortium, 2003, 2005), a growing body of evidence shows that haplotypes, i.e., a sequence of alleles for single nucleotide polymorphisms (SNPs) on the same chromosomal region, may impact on a complex disease or trait in a different way from what individual alleles do (Judson et al., 2000; Bader, 2001; Rha et al., 2007). Computer simulation also demonstrates that haplotype analysis involving multiple SNPs may be more powerful than single SNP analysis (Collins et al., 1997; Akey et al., 2001; Morris and Kaplan, 2002; Zaykin et al., 2002). Before current array technologies can directly genotype haplotypes, we will always need a computational algorithm to estimate haplotype effects and variation based on a statistical mixture model (Liu et al., 2004; Huang et al., 2007). The motivation of this article is to develop a general procedure for testing epistasis expressed at the haplotype level and its genetic components in genetic association studies. Lin and Wu (2006) developed a model for testing epistatic effects of haplotypes derived from the same genome on quantitative traits. This model was extended to study two-way interactions between different haplotypes expressed at the host genome and tumor genome in cancer research (Li and Wu, 2009) and three-way interactions among different haplotypes from the viral genome, transmitter genome, and recipient genome in infectious disease studies (Li et al., 2009). In this article, we combine these well developed haplotype-haplotype interaction models into a case-control design to test and characterize how different components of epistasis exert genetic effects on a complex disease. We extend this new procedure to allow a genome-wide search for the distribution and magnitude of epistatic interactions and gene-environment interactions expressed at the haplotype level through the genome. Unlike likelihood-based approaches proposed by Lake et al. (2003) and Lin and Zeng (2006), our proposed method is based on stratified contingency tables. By analyzing a real data set from the Cleveland Clinic Sarcoidosis project, the new model identified significant haplotype-expressed epistasis for sarcoidosis. The statistical behavior of the new model was tested and validated through simulation studies.

Methods

Study design

We assume that a natural human population is at Hardy-Weinberg equilibrium (HWE), from which two groups of samples were drawn at random. The first group includes cases who display a disease, and the second group includes controls with no disease. It is assumed that cases and controls are matched for potentially important covariates such as age, sex, ethnicity, geographical location, environmental factors, etc. All the cases and controls are genotyped for a panel of SNPs from different haplotype blocks. Let us first consider two SNPs A and B from block 1 and two SNPs C and D from block 2, with the capital letters A, B, C, and D standing for major alleles and small letters a, b, c, and d for minor alleles at the corresponding SNPs. The four SNPs form a total of 81 genotypes. Let j1, j2, k1, and k2 denote the genotypes at SNPs A, B, C, and D, respectively, at each of which there are three genotypes denoted by 0, 1, and 2. These values stand for the homozygote of the minor allele, the heterozygote of the minor and major alleles, and the homozygote of the major allele, respectively; for example, j1 = 0 for aa, 1 for Aa, and 2 for AA at SNP A. Let and denote the numbers of the observations of four-SNP genotypes j1 at SNP A and j2 at SNP B from block 1 and k1 at SNP C and k2 at SNP D from block 2 in cases and controls, respectively. From the observations of four-SNP genotypes from two blocks, we calculate the marginal observations of two-SNP genotypes from each block, which are expressed as A model will be developed to detect and test the effects on disease of haplotypes from each block and the interactions between different haplotypes from two blocks in the case-control study.

Detecting risk haplotype

Estimation of haplotype frequencies

Let us first consider two SNPs from block 1. In a recent book by Wu and Lin (2008), the EM algorithm with a closed form is given to estimate haplotype frequencies from unphased genotypic data. In our case, the complete data are the diplotype (haplotype) and disease status, but the observed data are genotypes and disease status, thus the missing data is the connection from genotypes to diplotypes (haplotypes). If subject i is a double heterozygote, A1A2B1B2, it may be one and only one of the two possible diplotypes. We use to denote the proportion of diplotype AB|ab in the total amount of double heterozygote AaBb in cases and controls, respectively. Note AB|ab and Ab|aB denote two underlying diplotypes for genotype AaBb, where the vertical lines separate two chromosomes each derived from a different parent. Under HWE, the frequency of a diplotype is expressed as the product of frequencies of the two haplotypes that constitute the diplotype. Thus, we have P = p and P = p for cases and Q = q and Q = q for controls, where p, p, p, and p are the haplotype frequencies for cases and q, q, q, and q are the haplotype frequencies for controls. The EM algorithm is implemented to separate these two diplotypes from the double heterozygote genotypes. In the E step, the proportion of one diplotype accounting for the double heterozygote in cases and controls are calculated using equation (1). In the M step, haplotype frequencies are estimated respectively, using the following formulas: where and (j1, j2 = 0, 1, 2). The E and M steps are iterated between equations (1) and (2), leading to the maximum likelihood estimates of haplotype frequencies in cases and controls, respectively. Four possible haplotypes, AB, Ab, aB, and ab of these two SNPs may perform differently in association with the disease. Yet, for simplicity, we assume that one haplotype is distinct from the remaining three. Lin and Wu (2006) called this distinct haplotype the risk haplotype (denoted by R1) and the remaining the non-risk haplotype (denoted by R0). Wu et al. (2007) extended this assumption to handle any number of risk haplotypes and implemented a model selection criterion to select an optimal risk haplotype(s) from a given data set. The risk and non-risk haplotypes combine randomly to form three composite diplotypes, R1R1, R1R0, and R0R0. If these composite diplotypes function differently, it suggests that two SNPs A and B affect a disease in a unit of haplotypes. Without loss of generality, we first assume that AB is the risk haplotype and then the composite diplotypes for each genotype are formed as in Table 1, where observations in cases and controls are also given. Table 2 tabulates the counts of composite diplotypes in cases and controls, respectively, after the genotypes of the same composite diplotypes are added together.
Table 1

Frequencies of 2-SNP genotypes at block 1 calculated from genotypic observations for 2 SNPs at block 1 in both cases and controls. The composite diplotypes (CD) of each genotype are also given assuming that AB is the risk haplotype.

GenotypeAABBAABbAAbbAaBBAaBb
AabbaaBBaaBbaabb
CDR1R1R1R0R0R0R1R0R1R0R0R0R0R0R0R0R0R0R0R0
Casesm22··m21··m20··m12··m11··m10··m02··m01··m00··
Controlsn22··n21··n20··n12··n11··n10··n02··n01··n00··
Table 2

Frequencies of composite diplotypes for 2 SNPs at block 1 in both cases and controls assuming that .

Disease statusR1R1 (r = 2)R1R0 (r = 1)R0R0 (r = 0)
Casesm2 = m22··m1 = m21·· + m12·· + ψ11m11··m0 = m20·· + (1 − ψ11)m11·· + m10·· + m02·· + m01·· + m00··
Controlsn2 = n22··n1 = n21·· + n12·· + ψ10n11··n0 = n20·· + (1 − ψ11)n11·· + n10·· + n02·· + n01·· + n00·
Frequencies of 2-SNP genotypes at block 1 calculated from genotypic observations for 2 SNPs at block 1 in both cases and controls. The composite diplotypes (CD) of each genotype are also given assuming that AB is the risk haplotype. Frequencies of composite diplotypes for 2 SNPs at block 1 in both cases and controls assuming that .

Test and estimation of haplotype effects

The overall haplotype effect due to risk haplotype AB can be tested by analyzing a 2 × 3 contingency table (Table 2) using a simple χ2 approach with df = 2. Let π denote the proportion of cases carrying a given composite diplotype r (r = 2 for R1R1, 1 for R1R0, and 0 for R0R0). The expectation for this proportion is related to the penetrance of that composite diplotype, with the extent being dependent on the number of cases and controls and the population prevalence of the disease (North et al., 2005). We apply a logistic model to test and estimate the haplotype effect, where μ denotes the genetic value of a composite diplotype. Following quantitative genetics principle (Mather and Jinks, 1982), the genetic value of a composite diplotype is expressed as μ2 = μ + a for R1R1, μ1 = μ + d for R1R0, and μ0 = μ − a for R0R0 where μ, a, and d are the overall mean, the additive effect and the dominance effect of the risk haplotype, respectively. Thus, the model (3) can be parameterized in terms of additive and dominance effects at each SNP and their interactions. By simple algebra, we have where denotes the disease odds ratio of composite diplotype r1 vs. composite diplotype r2 (r1, r2 = 2, 1, 0). Let â and are the maximum likelihood estimates (MLEs) of a and d, respectively. Note that â and asymptotically follow a normal distribution, Appendix A gives detailed derivations. By equation (5), we are not only be able to test the existence of additive (H0:a = 0) and dominance effects (H0:d = 0) by a z-test separately, but also be able to obtain their estimates. In practice, a true risk haplotype is unknown. We may assign each of all possible haplotypes as a risk haplotype and then calculate test statistics in each case. An optimal risk haplotype is chosen, if it corresponds to the smallest false positive report probability (FPRP; Wacholder et al., 2004), expressed as As can be seen, the FPRP is not only determined by the observed p-value, but also depends on both the prior probability that the association between the genetic variant and disease is real and the statistical power of the test. It reflects the probability of no true association between a genetic variant and disease given that a significant association is found. Here, we assume that all haplotypes have an equal chance to be the risk haplotype, thus choosing the smallest FPRP values is equivalent to choosing the smallest value of ratio between p-value and the power of the test. A similar procedure can be used to detect the risk haplotype and estimate haplotype effects for two SNPs from block 2.

Testing epistasis effects between two blocks

The interactions of haplotypes between two different blocks are modeled as follows. Without loss of generality, we assume that AB and CD are the risk haplotypes (R1) and (S1) for block 1 and 2, respectively, whereas the rest are the non-risk haplotypes, denoted as R0 (Ab, aB, ab) for block 1 and S0 (Cd, cD, cd) for block 2. Three composite diplotypes from one block are combined to those from the second block to form nine across-block composite diplotypes with observations tabulated in Table 3. Table A1 in Appendix gives the detailed counts only for cases, with a similar table can be obtained for controls. In these tables, the relative proportions of a particular diplotype in the total amount of a double heterozygote are expressed as
Table 3

Frequencies of across-block composite diplotypes in both cases and controls, assuming that .

Disease statusComposite diplotype
Total
R1R1R1R1R1R1R1R0R1R0R1R0R0R0R0R0R0R0
S1S1S1S0S0S0S1S1S1S0S0S0S1S1S1S0S0S0
(22)(21)(20)(12)(11)(10)(02)(01)(00)
CasesM22M21M20M12M11M10M02M01M00M
ControlsN22N21N20N12N11N10N02N01N00N
Table A1

Frequencies of across-block composite diplotypes calculated from across-block genotypes in cases, assuming that .

S1S1S1S0S0S0
R1R1M22 = m2222M21 = m2221 + m2212 + ψ21m2211M20 = m2220 + (1 − ψ21)m2211 + m2210 + m2202 + m2201 + m2200
R1R0M12 = m2122 + m1222 + ψ11n1122M11 = m2121 + m2112 + ψ21m2111 + m1221 + m1212 + i21m1211 + ψ11(m1121 + m1112 + ψ21m1111)M10 = m2120 + (1 − ψ21)m2111 + m2110 + m2102 + m2101 + m2100 + m1220 + (1 − ψ21)m1211 + m1210 + m1202 + m1201 + m1200 + ψ11[m1120 + (1 − ψ21)m1111 + m1110 + m1102 + m1101 + m1101]
R0R0M02 = m2022 + (1 − ψ11)m1122 + m1022 + m0222 + m0122 + m0022M01 = m2021 + m2012 + ψ21m2011 + (1 − i11)[m1121 + m1112 + ψ21m1111] + m1021 + m1012 + ψ21m1011 + m0221 + m0212 + ψ21m0211 + m0121 + m0112 + ψ21m0111 + m0021 + m0012 + ψ21m0011M00 = m2020 + (1 − ψ21)m2011 + m2010 + m2002 + m2001 + m2000 + (1 − ψ11)(m1120 + (1 − ψ21)m1111 + m1110) + m1102 + m1101 + m1100 + m1020 + (1 − ψ21)m1011 + m1010 + m1002 + m1001 + m1000 + m0220 + (1 − ψ21)m0211 + m0210 + m0202 + m0201 + m0200 + m0120 + (1 − ψ21)m0111 + m0110 + m0102 + m0101 + m0100 + m0020 + (1 − ψ21)m0011 + m0010 + m0002 + m0001 + m0000
Frequencies of across-block composite diplotypes in both cases and controls, assuming that . The frequency of a diplotype is expressed as the product of haplotype frequencies under HWE. Haplotype frequencies can be estimated by implementing the EM algorithm proposed in the previous section for each block. However, there are two cases here: (1) the two blocks are independent, and (2) the two blocks are not independent with linkage disequilibria of different orders. Wu and Lin (2008) provided the EM algorithm to estimate haplotype frequencies in each of these two cases. The overall effect of two haplotypes from two blocks can be tested using a 2 × 9 contingency table (Table 3) based on a simple χ2 test with df = 8. The estimation of haplotype effects including interactions is based on a logistic model where π is the proportion of subjects carrying a given across-block composite diplotype rs (r = 2 for R1R1, 1 for R1R0, and 0 for R0R0 at block 1 and s = 2 for S1S1, 1 for S1S0, and 0 for S0S0 at block 2) who are cases rather than controls. Mather and Jinks’ (1982) model is used to describe the genotypic values of a across-block composite diplotype, μrs’s, as described in Table 4.
Table 4

The genotypic values of across-block composite diplotypes in terms of additive and dominant effects at each block and interactions between two blocks.

S1S1S1S0S0S0
R1R1μ22 = μ + a1 + a2 + iaaμ21 = μ + a1 + d2 + iadμ20 = μ + a1 − a2 + iaa
R1R0μ12 = μ + d1 + a2 + idaμ11 = μ + d1 + d2 + iddμ10 = μ + d1 − a2 + ida
R0R0μ02 = μ − a1 + a2 − iaaμ01 = μ − a1 + d2 − iadμ00 = μ − a1 − a2 + iaa
The genotypic values of across-block composite diplotypes in terms of additive and dominant effects at each block and interactions between two blocks. Let (r1 ≤ r2, s1 ≤ s2 = 2, 1, 0) denote the disease odds ratio of composite diplotype r1s1 vs. r2s2. By simple algebra, we have μ = 1/4(μ22 + μ20 + μ02 + μ00) which is the overall mean (which is, as a nuisance parameter, not one of interest); which is the additive genetic effect of haplotypes from block 1; which is the additive genetic effect of haplotypes from block 2; which is the dominance genetic effect of haplotypes from block 1; which is the dominance genetic effect of haplotypes from block 2; Which is the additive × additive genetic effect between the two blocks; which is the additive × dominance genetic effect between the blocks; which is the dominance × additive genetic effect between the two blocks; and which is the dominance × dominance genetic effect between the two blocks. We have proved that each of the genetic effects above has an asymptotical normality property (see Appendix B for a detail). Therefore, testing each of the effects can be based on a z-test. In practice, a true risk haplotype combination from two blocks is determined using a combinatory approach. This approach assigns any possible haplotype as a risk haplotype for each block. Thus, there are a total of 16 combinations for risk haplotypes from the two blocks. Similarly as in the test of the haplotype effect in one block, an optimal combination of risk haplotype is one that yields the smallest FPRP values, which is equivalent to the smallest ratio of p-value versus power for each single hypothesis test. Extension of the proposed method to multiple SNPs (>2) in each block is straightforward (see Wu et al., 2007), which could be easily accomplished by sliding windows. As shown in the following section with 7 SNPs in one gene and 11 SNPs from the second gene, we consider a sliding window of SNP 1–2 in gene 1 with SNPs 1–2, SNPs 2–3, …, SNPs 10–11 in gene 2 sequentially and, then do the same thing for SNP 2–3, …, SNPs 6–7 in gene 1.

Incorporating gene-environment interactions

It has been widely recognized that environment affects the expression of genetic effects including epistasis (Lukens and Doebley, 1999; Willett and Burton, 2003). Here, we incorporate environment effects into our epistasis model to explore additive × environment and additive × additive × environment interactions in a case-control study. Consider a 1/0 binary environment (E) factor like gender or smoking status. The environment is implemented into a stratified contingency table (Table A2 in Appendix).
Table A2

The genotypic values of 9 across-block composite diplotypes stratified by a binary environmental factor (.

Environ. factorComposite diplotypeS1S1 (s = 2)S1S0 (s = 1)S0S0 (s = 0)
E = 0R1R1 (r = 2)μ0,22 = μ + a1 + a2 − ξμ0,21 = μ + a1 + d2 − ξμ0,20 = μ + a1 + a2 − ξ
+iaaia1ξia2ξiaaξ+iadia1ξid2ξiadξiaaia1ξ+ia2ξ+iaaξ
R1R0 (r = 1)μ0,12 = μ + d1 + a2 − ξμ0,11 = μ + d1 + d2 − ξμ0,10 = μ + d1 + a2 − ξ
+idaia2ξid1ξidaξ+iddid1ξid2ξiddξ+idaia2ξid1ξidaξ
R0R0 (r = 0)μ0,02 = μ − a1 + a2 − ξμ0,02 = μ − a1 + a2 − ξμ0,00 = μ − a1a2 − ξ
iaa+ia1ξia2ξ+iaaξiad+ia1ξid2ξ+iadξ+iaa+ia1ξ+ia2ξiaaξ
E = 1R1R1 (r = 2)μ1,22 = μ + a1 + a2 + ξμ1,21 = μ + a1 + d2 + ξμ1,20 = μ + a1a2 + ξ
+iaa+ia1ξ+ia2ξ+iaaξ+iad+ia1ξ+id2ξ+iadξiaa+ia1ξia2ξiaaξ
R1R0 (r = 1)μ1,12 = μ + d1 + a2 + ξμ1,11 = μ + d1 + d2 + ξμ1,10 = μ + d1a2 + ξ
+ida+ia2ξ+id1ξ+idaξ+idd+id1ξ+id2ξ+iddξidaia2ξ+id1ξidaξ
R0R0 (r = 0)μ1,02 = μ − a1 + a2 + ξμ1,01 = μ − a1 + d2 + ξμ1,00 = μ − a1a2 + ξ
iaaia1ξ+ia2ξiaaξiadia1ξ+id2ξiadξ+iaaia1ξia2ξ+iaaξ
The environment-specific genotypic value is expressed as As shown in Table A2 in Appendix, the genotypic values can be dissolved into different components including additive (a1, a2) and dominant effects (d1, d2) at each block, the environmental effect ξ, epistatic interactions between two blocks (i, i, i, i), gene-environment interactions expressed by the additive and dominant effects at each block and gene-environment interactions expressed by epistasis (i, i, i, i). Similarly, each of these parameters can be written in a linear combination of μ’s, which is equivalent to a linear combination of logarithm of disease odds ratios (see Appendix C). It can be proved that asymptotic normality property with gene-environment interactions still holds (results not shown).

Results

Sarcoidosis data

Sarcoidosis is a disease of unknown cause that leads to inflammation (swelling) in the lymph nodes, lungs, liver, eyes, skin, or other tissues. Like many other complex disease, sarcoidosis is thought to involve a genetic component (Grunewald, 2008), but genetic risk factors for sarcoidosis have rarely been assessed in the context of gene-gene interactions. In the sarcoidosis study conducted in the Cleveland Clinic, five candidate genes were measured for both Caucasian and African Americans, which were chosen based on gene expression data. Specifically, 89 race-specific tagging SNPs for the coding and promoter regions of the PPAR-gamma, MMP-7, MMP-9, MMP-12, and ADAMDec1 genes, were genotyped using the SNPlex platform. Since the patients were all from Northeast Ohio, we assume that there is no population admixture in the study population. To demonstrate how our method works in a real data analysis, we first analyzed a cohort of African Americans without implementing the environment, including 284 sarcoidosis subjects and 240 healthy volunteers recruited at the Cleveland Clinic. Of all SNPs, 82 passed quality control. We tested the additive, dominant, and epistasis between different haplotypes from each pair of five genes using sliding windows of size = 2. Because of multiple comparisons involved, we need to adjust for the significance level detected using false discovery rates (FDR, Benjamini and Hochberg, 1995) across all pairs of SNPs between the two genes studied for interaction. We found several significant dominant effects due to haplotypes from MMP9 after the adjustment (Table 5). Figure 1 illustrates the map of additive × dominant epistatic tests between haplotypes from genes MMP7 and MMP9 using sliding widows. It was found that genomic region from SNP rs880197 to rs11568819 within MMP7 displays significant additive × dominant epistatic (i) effects with genomic region from SNP rs3918249 to rs3918254 within MMP9 (p = 0.0085 − 0.0004). After the adjustment for the significance level by FDR, i is significant at p = 0.102 − 0.018 (Table 5).
Table 5

Estimation and significance test (corrected for multiple comparisons using FDR) of genetic effects on sarcoidosis risk by haplotypes composed of SNPs from MMP7 and MMP9.

Effect[MMP7 SNPs][MMP9 SNPs]Risk HapRisk Hap freq.
Est(SD)pFDR
MMP7
MMP9
CaseControlCaseControl
d2[rs11568819–rs11568818][rs2250889–rs17577]CA-CG0.5960.5220.6780.6811.21(0.35)0.00060.036
[rs11568818–rs17098318][rs4810482–rs3918241]GA-CT0.3640.4100.4600.519−0.75(0.27)0.00590.089
[rs11568818–rs17098318][rs2250889–rs17577]AG-CG0.5970.5210.6780.6810.94(0.32)0.00350.089
[rs17098318–rs880197][rs4810482–rs3918241]AA-CT0.3650.4220.4600.519−0.76(0.27)0.00450.089
iad[rs11568818–rs17098318][rs3918249–rs3918253]AA-TT0.3650.4220.2950.234−1.00(0.35)0.00400.080
[rs11568818–rs17098318][rs3918253–rs3918254]AA-CC0.3650.4220.6480.697−1.17(0.33)0.00040.018
[rs17098318-rs880197][rs3918249–rs3918253]GA-TT0.3640.4100.2950.234−0.96(0.35)0.00560.084
[rs17098318–rs880197][rs3918253–rs3918254]GA-CC0.3640.4100.6480.697−1.14(0.33)0.00060.018
[rs11568819–rs11568818][rs3918253–rs3918254]CA-CC0.5960.5220.6480.6970.81(0.31)0.00850.102

Risk haplotypes in the two genes and their frequencies in cases and controls are given.

Figure 1

Pairwise significance tests of additive × dominant epistasis between 7 SNPs at MMP7 and 11 SNPs at MMP9 in a genetic association study of sarcoidosis risk.

Pairwise significance tests of additive × dominant epistasis between 7 SNPs at MMP7 and 11 SNPs at MMP9 in a genetic association study of sarcoidosis risk. Estimation and significance test (corrected for multiple comparisons using FDR) of genetic effects on sarcoidosis risk by haplotypes composed of SNPs from MMP7 and MMP9. Risk haplotypes in the two genes and their frequencies in cases and controls are given. A significant i effect between haplotypes from a pair of SNPs rs880197 and rs17098318 (MMP7) and a pair of SNPs rs3918249 and rs3918253 (MMP9) operates through risk haplotypes AA and TT. The negative i value suggests that composite diplotype AA|AA at MMP7 combines composite diplotype are the rest of haplotypes, CC, CT, and TC) at MMP9 to reduce the sarcoidosis risk. The same pair from MMP7 also exerts a significant i effect with a pair of SNPs rs3918253 and rs3918254 within MMP9 in a similar way. Like pair rs880197 and rs17098318, haplotypes from pair rs17098138 and rs11568818 within MMP7 interacts with haplotypes from pairs rs3918249 and rs3918253 as well as pair rs3918253 and rs3918255 to reduce sarcoidosis risk to a similar extent. A pair of SNPs rs11568819 and rs11568819 (CA) within MMP7, displayed a significant additive × dominant epistatic effect with a pair of SNPs rs3918253 and rs3918254 (CC) within MMP9. The risk haplotypes detected are CA for MMP7 and CC for MMP9 and the additive × dominant effect produced by these two risk haplotypes was estimated as i = 0.81. The significant positive value of i suggests that the homozygous composite diplotype at gene MMP7 CA|CA combines with the heterozygous composite diplotype at MMP9 are the rest of haplotypes, CT, TC, and TT) to increase the risk of sarcoidosis. A conventional method, such as in SNPassoc of R, was used to analyze SNP-SNP interactions of the same data although only an overall interaction effect can be tested. It was found that SNP rs11568819 at MMP7 has a significant interaction with SNP rs3918253 at MMP9 (p = 0.003). This finding is basically consistent with those from our method, although the latter can be explained in a physiologically meaningful way. In the second analysis, we explored how the environment influences the expression of additive, dominant, and epistatic effects on sarcoidosis risk. We consider smoking status as an environmental factor. Data analysis based on Table 5 suggests that smoking status displays a significant interaction (p = 0.0005) with the dominant effect triggered by a haplotype of SNPs rs9650409 and rs12334642 (ADAMDEC1), with the estimated interaction value and SE = 0.47. However, this result should be interpreted with caution because 30% of the subjects have no smoking information and, thus, were excluded from our analysis.

Simulation study

We performed simulation studies to investigate the power and Type I error rate of epistatic detection by our model. Two independent haplotype blocks were simulated, each composed of two SNPs. A natural population at HWE with the prevalence of disease = 0.35% was first generated, from which 250 cases and 250 controls (or 500 cases and 500 controls) were randomly selected. The first two scenarios of simulation were used to evaluate the power and precision of the estimation, and the last one is to assess the Type I error rate under the null model of no interactions: • Scenario 1: mimicking SNPs rs11568819rs11568818 at MMP7 and SNPs rs3918253rs3918254 at MMP9 with a significant epistatic effect of i; • Scenario 2: mimicking SNPs rs11568818rs17098318 at MMP7 and SNPs rs4810482rs3918241 at MMP9 with significant genetic effects of d2 and i. • Scenario 3: mimicking SNPs rs11568819rs11568818 at MMP7 and SNPs rs3918253rs3918254 at MMP9 without any significant epistatic effects, i.e., i = i = i = i = 0. Haplotype frequencies, risk haplotypes, and other genetic effects used for the simulation were their MLEs from the real data at the corresponding SNPs. Simulation results based on 1000 simulations are summarized in Tables A3– A5 in Appendix.
Table A3

Estimation of genetic parameters and power of epistatic detection from a simulated data for two independent blocks under Scenario 1 for all 1000 simulations.

ParameterTrue250 Cases/250 controls
500 Cases/500 controls
EstMSEPowerEstMSEPower
a1−0.125−0.1230.061−0.1200.028
a2−0.275−0.3010.062−0.2930.029
d1−0.4850.4840.115−0.4720.057
d2−0.363−0.3920.174−0.3740.092
iaa−0.125−0.1230.062−0.1200.028
iad0.8100.8270.1100.7190.8100.0500.961
ida−0.506−0.4810.117−0.4980.055
idd0.3010.3210.1970.2960.103
S1,1S2,1*0.0870.122
S1,1S2,2*0.0650.074
S1,2S2,1*0.1290.214
S1,2S2,2*0.0500.056
S1,1**0.0810.115
S1,2**0.8520.996
S2,1**0.7560.969
S2,2**0.2440.430

.

.

Table A5

Type I error rate for all epistatic detection of two independent blocks under Scenario 3 for 1000 simulations.

ParameterTrue250 Cases/250 controls
500 Cases/500 controls
EstMSEType I errorEstMSEType I error
iaa0−0.1270.0700.049−0.1270.0440.050
iad0−0.0080.0980.0540.0040.0470.048
ida00.0190.1020.0430.0070.0550.053
idd00.0370.1880.0450.0050.0930.049
S1,1S2,1*0.0600.050
S1,1S2,2*0.0640.043
S1,2S2,1*0.0600.050
S1,2S2,2*0.0480.049
S1,1**0.0500.044
S1,2**0.1560.235
S2,1**0.0460.098
S2,2**0.0620.081

.

.

In general, our method provides reasonably precise estimates for all genetic effects parameters with small biases and MSEs (mean squared errors) even when a moderate sample size is used. However, if the study is aimed to estimate additive effects of haplotypes, 250 cases and 250 controls should be adequately enough to get convincing results. If we are interested in the estimation of dominant effects and their interactions with other effects, 500 cases and 500 controls are recommended. We also analyzed the power of our model based on 1000 simulations at the significance level of 0.05. In Scenario 1, the power of detecting the additive × dominant effect is 0.72 for 250 cases and 250 controls and increases to 0.96 for doubled sample sizes. To compare with existing models, we use SNPassoc, R, to estimate the power of interaction detection, which is much lower than our method (Table A3 in Appendix). In Scenario 2 with one more significant effect, similar results of power were detected, although the comparison between our method and existing approach is more sharply contrast relative to Scenario 1. Scenario 3 shows that the Type I error rates are comparative (around 0.05) comparing with the conventional SNP-SNP interaction method (around 0.2) and, as expected, the Type I error rate does not depend on sample sizes (Table A5 in Appendix).

Discussion

By dissecting epistasis into its different components according to traditional quantitative genetic theory (Mather and Jinks, 1982), we develop a new model for detecting haplotype-haplotype interactions in case-control studies via population-based sampling. The model was founded on two bodies of increasingly proven evidence that the genetic variation of complex traits arises from haplotype diversity and effects (Judson et al., 2000; Bader, 2001; Rha et al., 2007) and genetic interactions (Marchini et al., 2005; North et al., 2005; Purcell et al., 2007; An et al., 2009; Lambrechts, 2010; Wang et al., 2010; Wray and Goddard, 2010; Wu et al., 2010). The advantage of the new model is not only a simple combination of the power of haplotype analysis and epistatic modeling, but also lies in its capacity to provide more genetic meaningful explanations and conclusions for a given data set. Liu et al. (2011) provided an elegant proof for the asymptotic property of epistatic detection at the individual gene level from a contingency table. This property greatly enhances the computing efficiency of epistatic testing throughout the genome and also displays greater power to detect epistasis for additive × dominant, dominant × additive, and dominant × dominant components, compared with conventional log-logistic models. In this article, the statistical property proven at the gene level is found to hold for epistatic identification at the haplotype level. This consistency greatly facilitates the practical usefulness of our new model to characterize genetic interactions that occur between haplotypes derived from chromosomal segments. When this property is used to study genome-genome interactions (Li and Wu, 2009; Li et al., 2009), we will be able to dictate a comprehensive picture of the genetic architecture of complex diseases and pathogenesis. Although our derivation focuses on two genome blocks in a genome, we are currently working to expand this model into a genome-wide association study in which all possible pairs of haplotype interactions can be detected and tested. Li et al. (2011) have used a lasso model to detect significant genetic variants from high-dimension genetic data through penalized regression. A similar line of model selection can be implemented into our case-control design aimed to detect significant haplotype-haplotype interactions. The determination of an optimal length of haplotype may be an issue. Yet, our model is flexible to include any number of SNPs for haplotype analysis. As shown in Li et al. (2006), an increasing number of disequilibrium parameters expressed at different orders will need to be modeled when multiple SNPs are included, which may prohibit computation. A cluster of paralleled machines can be deployed to accelerate analysis and discrimination processes of haplotype-haplotype interactions. Our model was built on a two-SNP haplotype assumption. The idea presented can be readily extended to include an arbitrary number of SNPs with no technical difficulty. For L1 SNPs in gene one, there are haplotype. But we can still assume one risk haplotype that is combined with  − 1 non-risk haplotypes to generate three composite diplotypes. When gene two with L2 SNPs is also considered, we will still have nine composite diplotypes. However, the selection of an optimal combination of risk haplotypes at the two genes will be made from 2( haplotype combinations. It should be pointed out that there will be lower numbers in some cells of the contingency table when long haplotype are modeled. To make our normality assumption a case in this situation, we need to determine the maximum length of haplotypes which can be reasonably used to identify the risk haplotype for a given sample size.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Disease status (D)Composite diplotype
Total
R1R1 (r = 2)R1R0 (r = 1)R0R0 (r = 0)
Casesm2m1m0m
Controlsn2n1n0n
Table 5

Estimation and significance test (corrected for multiple comparisons using FDR) of genetic effects on sarcoidosis risk by haplotypes composed of SNPs from MMP7 and MMP9.

Disease statusComposite diplotype
Total
R1R1R1R1R1R1R1R0R1R0R1R0R0R0R0R0R0R0
S1S1S1S0S0S0S1S1S1S0S0S0S1S1S1S0S0S0
(22)(21)(20)(12)(11)(10)(02)(01)(00)
CasesM22M21M20M12M11M10M02M01M00M
ControlsN22N21N20N12N11N10N02N01N00N
Table A4

Estimation of genetic parameters and power of epistatic detection from a simulated data for two independent blocks under Scenario 2 for all 1000 simulations.

ParameterTrue250 Cases/250 controls
500 Cases/500 controls
EstMSEPowerEstMSEPower
a10.3410.3570.0530.3530.024
a2−0.713−0.7350.051−0.7250.024
d1−0.459−0.4930.117−0.4670.051
d2−0.746−0.7840.1010.731−0.7490.0440.950
iaa0.3410.3570.0500.3530.024
iad−0.210−0.2300.089−0.2310.045
ida−0.499−0.5000.104−0.4990.050
idd1.0381.0890.2020.7341.0530.0830.955
S1,1S2,1*0.0500.040
S1,1S2,2*0.0530.057
S1,2S2,1*0.0420.046
S1,2S2,2*0.0520.049
S1,1**0.3650.600
S1,2**0.4380.715
S2,1**0.9951.000
S2,2**0.6960.929

.

.

  39 in total

1.  Haplotypes vs single marker linkage disequilibrium tests: what do we gain?

Authors:  J Akey; L Jin; M Xiong
Journal:  Eur J Hum Genet       Date:  2001-04       Impact factor: 4.246

2.  A haplotype map of the human genome.

Authors: 
Journal:  Nature       Date:  2005-10-27       Impact factor: 49.962

3.  Application of logistic regression to case-control association studies involving two causative loci.

Authors:  Bernard V North; David Curtis; Pak C Sham
Journal:  Hum Hered       Date:  2005-04-18       Impact factor: 0.444

4.  Identification of quantitative trait nucleotides that regulate cancer growth: a simulation approach.

Authors:  Hongying Li; Bong-Rae Kim; Rongling Wu
Journal:  J Theor Biol       Date:  2006-05-02       Impact factor: 2.691

5.  A statistical model for genetic mapping of viral infection by integrating epidemiological behavior.

Authors:  Yao Li; Arthur Berg; Myron N Chang; Ping Du; Kwangmi Ahn; David Mauger; Duanping Liao; Rongling Wu
Journal:  Stat Appl Genet Mol Biol       Date:  2009-09-09

6.  Asymptotic distribution for epistatic tests in case-control studies.

Authors:  Tian Liu; A Thalamuthu; J J Liu; C Chen; Zhong Wang; Rongling Wu
Journal:  Genomics       Date:  2011-05-15       Impact factor: 5.736

7.  Variations on a theme: cataloging human DNA sequence variation.

Authors:  F S Collins; M S Guyer; A Charkravarti
Journal:  Science       Date:  1997-11-28       Impact factor: 47.728

8.  Interaction between allelic variation in IL12B and CCR5 affects the development of AIDS: IL12B/CCR5 interaction and HIV/AIDS.

Authors:  Elwyn Gabutero; Corey Moore; Simon Mallal; Graeme Stewart; Peter Williamson
Journal:  AIDS       Date:  2007-01-02       Impact factor: 4.177

9.  The challenge of detecting epistasis (G x G interactions): Genetic Analysis Workshop 16.

Authors:  Ping An; Odity Mukherjee; Pritam Chanda; Li Yao; Corinne D Engelman; Chien-Hsun Huang; Tian Zheng; Ilija P Kovac; Marie-Pierre Dubé; Xueying Liang; Jia Li; Mariza de Andrade; Robert Culverhouse; Doerthe Malzahn; Alisa K Manning; Geraldine M Clarke; Jeesun Jung; Michael A Province
Journal:  Genet Epidemiol       Date:  2009       Impact factor: 2.135

10.  A general quantitative genetic model for haplotyping a complex trait in humans.

Authors:  Song Wu; Jie Yang; Chenguang Wang; Rongling Wu
Journal:  Curr Genomics       Date:  2007-08       Impact factor: 2.236

View more
  7 in total

1.  A model for family-based case-control studies of genetic imprinting and epistasis.

Authors:  Xin Li; Yihan Sui; Tian Liu; Jianxin Wang; Yongci Li; Zhenwu Lin; John Hegarty; Walter A Koltun; Zuoheng Wang; Rongling Wu
Journal:  Brief Bioinform       Date:  2013-07-24       Impact factor: 11.622

2.  A case-control design for testing and estimating epigenetic effects on complex diseases.

Authors:  Yihan Sui; Weimiao Wu; Zhong Wang; Jianxin Wang; Zuoheng Wang; Rongling Wu
Journal:  Brief Bioinform       Date:  2013-01-17       Impact factor: 11.622

3.  Detecting maternal-fetal genotype interactions associated with conotruncal heart defects: a haplotype-based analysis with penalized logistic regression.

Authors:  Ming Li; Stephen W Erickson; Charlotte A Hobbs; Jingyun Li; Xinyu Tang; Todd G Nick; Stewart L Macleod; Mario A Cleves
Journal:  Genet Epidemiol       Date:  2014-03-02       Impact factor: 2.135

4.  A genetic association study detects haplotypes associated with obstructive heart defects.

Authors:  Ming Li; Mario A Cleves; Himel Mallick; Stephen W Erickson; Xinyu Tang; Todd G Nick; Stewart L Macleod; Charlotte A Hobbs
Journal:  Hum Genet       Date:  2014-06-04       Impact factor: 4.132

Review 5.  Cervical Carcinogenesis and Immune Response Gene Polymorphisms: A Review.

Authors:  Akash M Mehta; Merel Mooij; Ivan Branković; Sander Ouburg; Servaas A Morré; Ekaterina S Jordanova
Journal:  J Immunol Res       Date:  2017-02-09       Impact factor: 4.818

6.  A quantitative genetic and epigenetic model of complex traits.

Authors:  Zhong Wang; Zuoheng Wang; Jianxin Wang; Yihan Sui; Jian Zhang; Duanping Liao; Rongling Wu
Journal:  BMC Bioinformatics       Date:  2012-10-26       Impact factor: 3.169

7.  A Low Resolution Epistasis Mapping Approach To Identify Chromosome Arm Interactions in Allohexaploid Wheat.

Authors:  Nicholas Santantonio; Jean-Luc Jannink; Mark Sorrells
Journal:  G3 (Bethesda)       Date:  2019-03-07       Impact factor: 3.154

  7 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.