Literature DB >> 19384427

Mapping Nucleotide Sequences that Encode Complex Binary Disease Traits with HapMap.

Yuehua Cui¹, Wenjiang Fu, Kelian Sun, Roberto Romero, Rongling Wu.

Abstract

Detecting the patterns of DNA sequence variants across the human genome is a crucial step for unraveling the genetic basis of complex human diseases. The human HapMap constructed by single nucleotide polymorphisms (SNPs) provides efficient sequence variation information that can speed up the discovery of genes related to common diseases. In this article, we present a generalized linear model for identifying specific nucleotide variants that encode complex human diseases. A novel approach is derived to group haplotypes to form composite diplotypes, which largely reduces the model degrees of freedom for an association test and hence increases the power when multiple SNP markers are involved. An efficient two-stage estimation procedure based on the expectation-maximization (EM) algorithm is derived to estimate parameters. Non-genetic environmental or clinical risk factors can also be fitted into the model. Computer simulations show that our model has reasonable power and type I error rate with appropriate sample size. It is also suggested through simulations that a balanced design with approximately equal number of cases and controls should be preferred to maintain small estimation bias and reasonable testing power. To illustrate the utility, we apply the method to a genetic association study of large for gestational age (LGA) neonates. The model provides a powerful tool for elucidating the genetic basis of complex binary diseases.

Entities: Chemical Disease Gene Species

Keywords: EM algorithm; Nucleotide sequence; complex disease; haplotype.; logistic regression

Year: 2007 PMID： 19384427 PMCID： PMC2652402 DOI： 10.2174/138920207782446188

Source DB: PubMed Journal: Curr Genomics ISSN： 1389-2029 Impact factor: 2.236

INTRODUCTION

Single nucleotide polymorphisms (SNPs) are the most common genomic variations. Detecting the patterns of DNA sequence variants across the human genome, particularly the patterns of haplotypes, is a crucial step for unravelling the genetic basis of complex human diseases. With the growing density of SNP data produced by the human HapMap project [1,2], association study has been received increasing attention in the most recent years. It provides a more efficient and powerful way for disease gene discovery than traditional linkage methods [3]. The population-based case-control study is a classical method for genetic association mapping and has been widely applied to disease gene mapping with SNP data collected from unrelated individuals. The case-control design has substantial practical advantages over a family-based design given the fact that it is often difficult to collect DNA samples from relatives of affected individuals, especially for late-onset diseases. In case-control studies, disease-gene association is usually tested by focusing on one single SNP at a time using a simple χ2 test by comparing SNP allele frequencies between cases and controls [4]. The χ2 test is a detection test rather than an estimation test since it does not provide estimates of genetic effects. An alternative approach is to apply the logistic regression which can test association and estimate genetic effects while adjusting for other covariates effects. It is well known that many human diseases are complex, which potentially involve multiple disease loci jointly functioning to give rise to an affected individual. In general, the disease status is a result of additive or multiplicative effects of many disease predisposing alleles each having a relatively small effect [5]. Methods that test each locus separately, hence, is inefficient to detect the disease-gene association. Moreover, a significant SNP allele identified by a single SNP test may not be the causal mutation for the disease, but rather shows a significant association due to linkage disequilibrium with a causal mutation [6]. A more natural approach would be to understand the genetic basis of disease status by analyzing a group of SNPs simultaneously through haplotype analysis. The advantage of haplotype inference on disease gene mapping over a single-locus approach has been shown in several studies [7-9]. Biological evidences also confirmed the importance of haplotype analysis. For example, studies showed that the alignment of multiple functional alleles along a chromosome might have great effects on a disease status, where alleles in cis position (as a halpotype) within a gene can function jointly to make a “super allele” with a large effect on disease phenotypes [10]. These statistical and biological evidences underscore the importance of haplotype association mapping. Precise haplotype inference relies on complete haplotype information available for an individual. When linkage phase is ambiguous (i.e., more than one heterozygote sites), however, direct analysis by assuming known haplotypes is infeasible. A number of statistical approaches have been proposed to estimate haplotypes in unrelated individuals (e.g. [11,12]). With estimated haplotype frequencies, association can be detected by a comparison of haplotype frequencies between affected and unaffected individuals [13]. Again, this is a detection test, and hence, does not provide inference on specific haplotype effects. Others considered haplotype effects by including possible haplotypes constructed for each individual as independent variables in a generalized linear regression model setting [10,14-18], and hence ignored the interactions of haplotypes inherited from both parents. Moreover, when there are many haplotypes fitted in the model, these approaches could be suffered from potential power loss with large number of degree of freedoms. More recently, Liu et al. [19] proposed a statistical approach for identifying the distinction of haplotypes and estimating haplotype effects on a quantitatively inherited trait based on the structural and organizational patterns of nucleotide sequences in the human genome [20,21]. This approach allows the characterization of DNA sequence variants that encode quantitative variation, rather than of coarse chromosomal segments as detected by conventional linkage mapping. To generalize this approach to dichotomous disease trait, in this article, we propose a statistical mapping approach based on the information provided by HapMap project to test disease-gene association adjusting for the effects of clinical risk factors. We construct a weighted prospective likelihood function with weights modelled as a function of relative diplotype frequencies. For an individual with unknown phase, the disease trait density function is modelled as a mixture distribution with mixture proportion modeled as a function of haplotype frequencies. To reduce the model degrees of freedom for an association test, we regroup haplotypes to form three composite diplotypes regardless the number of SNP loci involved. By hypothesizing one particular haplotype as the risk haplotype, we can do a systematic model selection and hypothesis test to detect DNA sequence variants, called binary trait nucleotides (BTNs), associated with the phenotypic variation of a binary disease trait. BTNs identified by this approach are biologically more meaningful than traditional mapping approaches aimed to detect quantitative trait loci [22-23]. We develop a two-stage estimation procedure to estimate parameters. Model selection criterion such as AIC is used to select the risk haplotype. Our model is developed in the maximum likelihood context and implemented with the EM and Newton-Raphson algorithm. It allows for adjustment of nongenetic covariates, such as environmental and clinical risk factors, which may provide critical information for detecting disease-gene association. Extensive simulation studies are performed to investigate the statistical behaviors of the model. Specifically, we evaluate the effect of sample size, gene action modes and sampling design on the precision of parameter estimation, testing power and type I error rate. A real example of a study of large for gestational age (LGA) neonates is applied to show the application of the model, in which significant BTNs are detected in association with LGA.

METHODS

Definitions and Notations

Binary trait nucleotides (BTNs) are defined as DNA sequence variants where there exists a distinct haplotype, termed as “risk” haplotype, associated with a binary disease trait. The biological foundation of the current BTN mapping approach is built upon the haplotypes constructed with haplotype tagging SNPs (htSNPs) located within each haplotype block. Due to strong linkage disequilibrium (LD) and low haplotype diversity within each block, a small fraction of htSNPs could explain a large portion of haplotype diversity [20,21,25]. These representative htSNPs greatly facilitate genetic association study with reduced cost and improved statistical testing power. A number of algorithms has been developed for the identification of htSNPs [26-28]. Assume a sample of n unrelated individuals collected from a population with n1 affected (cases) and n2 unaffected (controls). In this sample, one or more candidate genes are selected based on prior knowledge. A number of SNPs are then genotyped for each candidate gene. In the current study, our interest is to search for the pattern of BTNs that are associated with a complex disease. To demonstrate the idea of BTN mapping, we first begin with a simple model containing only two htSNPs (2-SNP BTN model). A generalization for multiple SNPs is given later. Consider two htSNPs within a haplotype block that cosegregate with the linkage disequilibrium D in the population. Each SNP contains two alleles denoted as 1 or 2. Let and and be the frequencies of alleles 1 and 2 respectively at SNP 1, and and be the frequencies of alleles 1 and 2 respectively at SNP 2. for k = 1,2. Here we use the superscript number for SNP index and the subscript for allele index within a SNP. Random combination of these two SNPs form 4 possible haplotypes denoted as [11], [12], [21] and [22]. Their haplotype frequencies are expressed as where r,r denote the alleles of the two SNPs, respectively, and Once haplotype frequencies are estimated, allelic frequencies and LD can be obtained by solving Equation (1). Random combination of the four maternal and paternal haplotypes forms nine observable genotypes (G ) denoted as The double heterozygotic genotype 12/12 contains two possible distinct diplotypes [11][22] and [12][21], and hence is phase ambiguous. The other eight genotypes are phase-known. Each diplotype contains two distinct haplotypes. Totally, there are 10 distinct phase-known diplotypes expressed as formed by two SNPs. Let and denote the diplotype and genotype frequencies, respectively, and let denote the number of observations of the above nine genotypes, where r = 1 or 2 ( j = 1,2). We use upper case P to denote the diplotype frequency and lower case p to denote the haplotype frequency. Assuming HWE, then ten diplotype frequencies can be calculated as a function of the corresponding haplotype frequencies, i.e., A complete list of the genotype and diplotype configurations as well as their frequencies is given in Table . Without loss of generality, we assume that a disease predisposing BTN containing haplotype [11] is associated with the disease phenotype. Such a distinct haplotype [11] is called the “risk” haplotype. Individuals carrying this specific haplotype may potentially have high or low risk to develop a disease with a risk level depending on the composition of the diplotype structure one carries on. All the other three haplotypes are called non-risk haplotypes. To distinguish the risk and non-risk haplotypes, we denote all the non-risk haplotypes as . Random combination of these risk and non-risk haplotypes leads to three groupings which are called composite diplotypes (g) expressed as [11][11], [11] and (Table ). The regrouping method is biologically intuitive and statistically efficient. By formulating the composite diplotype, the additive and dominant effects of a risk haplotype can be estimated. Also, we could greatly reduce the number of parameters in the regression model. For example, when there are m SNPs considered, there could be 2m haplotype parameters need to be estimated for a full haplotype regression model and 2(2+1) parameters need to be estimated for a full diplotype model. When m is large, this could cause over-fitting problems. Moreover, large number of degree of freedom could decrease the power for an association test. With our formulation, there are always three composite diplotypes regardless of large number of SNPs.

Multiple Logistic Regression Model

Let y denote a measured disease trait which takes two values, 1 or 0, corresponding to affected or control respectively. Let X denote a matrix of numerical codes corresponding to the composite diplotype, g, including the intercept as the first column, and let Xenote a matrix of measured clinical risk factors. Assuming that all these covariates influence the mean of the trait and not the scale, so that their effects can be summarized by a function of linear predictors where α contain regression parameters for the intercept and the genetic effects of composite diplotypes on a disease trait; γ contain the effects of clinical risk factors; X = (X, X) and β = (α, γ) Given a binary disease response, we can apply the logit model which corresponds to the natural logit link function with the form with the logistic distribution function A logistic regression model has been broadly applied to the modelling of binary data [29,30]. Given covariates value x, the probability distribution of a disease status Y = y for an individual i can be expressed as where y takes value 1 or 0, x0 is one for all i, the independent variables x1 and x2 are defined as and and variables refers to the p clinical non-genetic covariates of interest. With the coding mechanism defined in (4) and (5), α1 and α2 can be considered as the additive and dominant genetic effect of a risk haplotype [31], α0 is the intercept and is the non-genetic covariate effect. If either parameter estimate, α1 or α2, is positive, the risk haplotype [11] triggers a positive effect to increase a disease risk. The effect of BTNs is considered as pure additive, dominant or recessive if the ratio of the dominant over additive effect (α2 / α1) is 0, 1 or -1 respectively, and is considered as semi-dominant or over-dominant if the absolute value of this ratio is less than 1, or greater than 1 respectively. We call parameters contained in β = (α, γ ) quantitative parameters to distinguish them with the population parameters defined in Eq. (1). We can further partition the logistic function defined in (3) into three distinct logistic regression functions corresponding to different composite diplotype groups as follows for composite diplotype [11][11], and for composite diplotype [11] , and for composite diplotype . We define these three distinct logistic functions as the diplotype functions corresponding to different diplotypes illustrated in Table .

Likelihood Function and Parameter Estimation

The logistic regression model links the interpatient variation in a disease trait (y) with the observed SNP genotypes (G). Our goal is to detect DNA sequence variants or BTNs underlying a disease trait. As shown in Table , most genotypes have one to one relationship with their diplotypes except the one with genotype denoted as 12/12. This double heterozygote can be partitioned into two possible diplotypes, [11][22] and [12][12] with relative frequencies ø and 1 - ø, respectively. Let p(g | G) denote the relative frequency for a diplotype g consistent with the observed genotype Gi. The relative frequencies for all 10 possible diplotypes are given in Table . For individuals with known phase, ( p(g | G) takes value one. The individual contribution to the likelihood is given by where D denotes all possible diplotypes that are consistent with the observed marker genotype. For an individual with known phase, L(β) = π (y | x). For an individual with genotype 12/12, its likelihood contribution follows a mixture distribution with the form where the mixture proportion represents the relative frequency of subject i whose diplotype is [11][22], and and are the logistic regression functions defined in model (7) and (8), respectively. Assuming independence among individuals, the joint prospective likelihood function can be expressed as Noted that the likelihood formulation in (10) is different from the one proposed by Lake et al. [15] in which the likelihood function is given as a weighted sum with weights modeled as a function of haplotype frequencies rather than relative diplotype frequencies. For a 2-SNP model, the log-likelihood function of the observed data can be further partitioned as The maximum likelihood estimate contained in β can be obtained by solving the score equation: . A computational algorithm based on the Expectation-Maximization (EM) algorithm [32] can be formulated to find , with the Newton-Raphson algorithm embedded in the M-step (See Appendix for detailed derivations). Standard model diagnostic approaches such as goodness-of-fit test can be applied to check the model fitting [33]. We array this set of quantitative parameters which include genetic and nongenetic parameters based on Model (3) as Ω = (β) = (α, γ ). The above algorithm is implemented assuming that ø is known. In reality, we do not know ø and it needs to be estimated from the data. To estimate ø, we need to estimate the four haplotype frequencies which is arrayed as . Once we estimate Ω p, ø can be estimated by plugging in the MLE of Ω p. The four haplotype frequencies can be estimated based on the nine observed genotypes for two SNPs (Table ). Assuming HWE, the log-likelihood function of the unknown haplotype frequencies given observed genotypes can be written as a multinomial distribution Again, we have a missing data problem since the two distinct diplotypes for genotype 12/12 can not be observed explicitly. This problem can be solved by applying the EM algorithm (See [19] for a detailed EM procedure). With the estimated haplotype frequencies, we can also solve Equation (1) to obtain the estimates of the SNP allele frequencies and the LD parameter. The estimated ø, denoted as , is then plugged into the likelihood function (11) to obtain the parameter estimation contained in Ω . Since we estimate parameters contained in Ω and Ω separately, this estimation procedure is also called a two-stage estimation procedure. Noted that the parameters contained in Ω do not heavily rely on the estimated haplotype frequencies, especially when the double heterozygous rate is low. Thus, the estimation procedure is quite robust to departure from HWE. Both EM algorithms for estimating Ω and Ω converge very fast.

Hypothesis Tests

To detect the association between a disease and BTNs and fully dissect the genetic effects of BTNs, a series of hypotheses can be conducted. The existence of significant BTNs on a complex disease trait can be tested based on the following hypotheses A general approach is to use the likelihood ratio test, where the test statistic is calculated by comparing the likelihood values under the alternative hypothesis H to the null hypothesis H for the significance of BTNs using where the parameters with tilde and hat denote the MLEs of unknown parameters under H0 and H1, respectively. Assuming fixed ø in the likelihood function (11), the regularity conditions for asymptotic χ2 distribution of LR1 hold as long as the number of observations n11/12 + n12/11 and n11/22 + n12/22 + n22/11 + n22/12 + n22/22 are of comparable size with n12/12. So the LR1 asymptotically follows a χ2 distribution with two degrees of freedom [34]. Upon rejection of H in the above test, we can further test whether the BTNs exert a significant additive haplotype effect or dominant haplotype effect on a disease trait by simply formulating for testing additive effect and for testing dominant effect. Again, the likelihood ratio test can be applied which is asymptotically χ2 -distributed with one degree of freedom. We can also test the allelic association between two SNPs by testing the LD between them with hypotheses: The log-likelihood ratio test statistic (LR2) can be similarly calculated as The LR2 is considered to asymptotically follow a χ2 distribution with one degree of freedom. The MLEs of allelic frequencies under H0 can be estimated using the EM algorithm described above, but with the constraint p11p22 = p12p21. The effect of non-genetic covariates on a disease trait can also be tested in a similar way using likelihood ratio test. Since the association test (12) is conducted after adjusting for the effects of clinical risk factors, it is more informative than the retrospective likelihood approaches (e.g. [14,16]) which do not adjust for the effects of clinical risk factors.

Risk Haplotype Selection and Statistical Inferences

The above model is developed by assuming that haplotype [11] is the risk haplotype. In reality, we have no prior information on which genetic component triggers a potential effect on a disease trait. We adopt the theoretical information criterion approach to select the risk haplotype. Among a pool of criteria, the Akaike's information criteria (AIC) has been widely used in a variety of fields for model selection [35]. For a 2-SNP model, there are 4 possible haplotype structures. By assuming each one of the haplotypes as the risk haplotype, we can calculate the AIC information one at a time for each hypothesized risk haplotype as where s refers to the sth haplotype and p refers to the number of parameters by taking the sth haplotype as the risk haplotype. The one which achieves the minimum AIC value is then subject to statistical test based on test (12). Significant BTNs are detected to be associated with a disease if a significant risk haplotype exists. When there are multiple haplotype blocks involved, corrections for multiple testing using false discovery rate (FDR) approach is required [36]. A number of statistical inferences can be formulated based on the current BTN model. If significant BTNs are detected, one might be interested in quantifying the disease odds or odds ratio. The disease odds can be calculated for individuals carrying different haplotype structures and are exposed to different clinical conditions. For example, the odds of a disease for an individual carrying composite diplotype [11][11] can be calculated as Thus the exponential of the parameters gives rise to a factorial contribution to the odds not only subject to clinical exposure but also to diplotype structure. Even though individuals are exposed to the same clinical condition, the chance to be affected varies depending on the diplotype structures they carry on. For example, the odds ratio of a disease for an individual carrying composite diplotype [11][11] and after controlling for other covariates can be computed as Using delta method, the confidence interval of the odds ratio can be obtained [33]. Note that the intercept α0 does not represent population prevalence for a case-control sample.

Multilocus BTN Model

The idea of BTN mapping based on a two-SNP model can be extended to include an arbitrary number of SNPs whose sequence variants are associated with the disease variation. Consider K(K ≥ 3) htSNPs within a haplotype block constructed from a number of bi-allelic loci. Each of these K htSNPs contains two alleles denoted by , with allele frequencies denoted by for the k th htSNP. The coding form indicates that alleles with the same value of rk are located on the same chromosome. One of the key issues for the multi-SNPs model is to clearly formulate the haplotype and diplotype structures across the K multilocus htSNPs. There are totally 2k possible haplotypes can be formed by the random combination of these K htSNPs. A general form of these haplotypes is expressed as with corresponding haplotype frequencies denoted by . These K htSNPs form 3kobservable multilocus zygotic genotypes expressed as with corresponding genotype frequency and observation expressed as and respectively. The random combination of haplotypes derived from maternal and paternal parents generates distinct diplotypes expressed as with corresponding diplotype frequency expressed as assuming HWE. The composite diplotype can be formulated in a similar way as illustrated in the 2-SNP model. As illustrated in the 2-SNP BTN model, the number of multilocus diplotype is generally greater than the number of genotypes when there are two or more heterozygotes present. For example, for a 3-SNP model, the genotype could form two different diplotypes expressed as while the genotype could form four different diplotypes. If we assume that is the risk haplotype, the three composite diplotypes can be formulated as . The multilocus haplotype frequency can be formulated as a function of allele frequencies and LD parameters of different orders [37]. For example, a haplotype frequency, denoted as , can be decomposed into the following components: where D 's are the linkage disequilibria of different orders among particular htSNPs. The MLEs of quantitative parameters can be estimated by formulating the likelihood function similar to the 2-SNP model. The EM algorithm can be employed to estimate the MLEs of haplotype frequencies, and the quantitative parameters. The AIC-based model selection procedure can be adopted to select the risk hapltoype.

RESULTS

Simulation Study

We perform a series of Monte Carlo simulations to investigate the statistical behavior of the proposed BTN mapping approach. The simulation is designed to evaluate the model performance considering the effects of sample sizes (n = 100,200 and 500), gene action mode (additive, dominant, and recessive), and sampling design on the precision of parameter estimations, type I error rates as well as the power to detect the association. Assuming that one haplotype is distinct from the other ones, haplotype frequencies are calculated based on the given allele frequencies and LD parameter as listed in Table . Then distinct diplotypes are simulated according to a multinomial distribution with a probability for each diplotype calculated from their corresponding haplotype frequencies assuming HWE. A disease status is simulated from a bernoulli distribution with a probability of success defined in model (3) with yi = 1. For simplicity, we only consider one covariate in the model and it is simulated from a standard normal distribution. The given values for population and quantitative parameters are listed in Table . The data simulated with this distinct BTN structure are subject to statistical analysis. In each simulation scenario, 1000 Monte Carlo repetitions are performed. For each Monte Carlo sample, the EM algorithm is used to obtain the MLE's of the haplotype frequencies, allele frequencies and LD parameter as well as the quantitative parameters which include the genetic and non-genetic covariates effects. The MLEs for all parameters are listed in Table and their square root of the mean squared errors (RMSEs) are given in the parenthesis. The proportion of cases for the simulated data is about 40-50% on average under the three gene action modes. As expected, the true association can only be detected with the hypothesized risk haplotype, and all the parameters can be accurately estimated only under the correct haplotype distinction. Overall, our model provides reasonable parameter estimation under different simulation scenarios. All population parameters including the allele frequencies and LD parameter can be well estimated with high precision. The precision depends only on sample size and is not affected by gene action modes. Large sample size always leads to low bias and high precision (Table ), which infers the consistency of the parameter estimation. With the estimated haplotype frequencies, we carry out the second stage estimation to estimate the quantitative parameters. As can be seen from Table , the estimation precision of quantitative parameters depends not only on sample size, but also on gene action modes. In general, trends hold across different simulations are evident. First, the accuracy and precision of all parameter estimation increase as the sample size increases. Small sample size (n = 100) results in poor parameter estimation and the precision is dramatically improved when sample size is increased from 100 to 200. Second, as expected, the additive effect can be better estimated than the dominant effect in all simulations. For instance, the RMSE for additive effect is 30% smaller than the dominant effect under the dominant model with sample size 200. Third, the nongenetic covariate effect is not sensitive to the gene action mode, whereas the genetic parameters act differently under different gene action modes. Type I error evaluation is summarized in Table at the 0.05 nominal level with sample sizes ranging from 100 to 500. It can be seen that the estimated type I error rates are not appreciably different from the nominal level 0.05. For power analysis, we consider three disease models: additive, dominant, and recessive as given in Table . The testing power is defined as the percentage of simulations in which the true association is detected. For each simulation case, we run 1000 replicates. The results show that the power of an association test statistic depends on a number of parameters, such as the sample size and the gene action modes. As expected, the testing power increases as the sample size increases. For example, assuming additive model, the power increases from 64% to 93.3% when the sample size increases from 100 to 200. Simulation results also show that the test power is sensitive to gene action mode for small sample size (n = 100). When sample size increases to 200, we observe dramatic power improvement and the difference among different gene action modes is no longer remarkable. Our association test is conducted based on a prospective likelihood setting assuming a random sample from a population. To evaluate the effect of sampling design on parameter estimation and testing power, we simulate samples through changing the intercept α0 by holding other parameters unchanged as given in Table . The value of intercept determines the proportion of cases in a sample. Fig. () plots the effects of case proportions on the type I error rate with sample size 100 and 200. It can be seen that the type I error rates are inflated when the proportion of cases is away from 50%. As long as the case proportion is kept within the 30%-70% range, the false positive rate can be appropriately controlled. Figs. () and () plot the effect of case proportions on the testing power and the absolute averaged bias of parameter estimation under three disease models out of 1000 simulations. Clearly, the testing power and estimation biases are affected by case proportions. The power is greatly reduced and the parameter estimation is severely biased when the case proportion is far away from 50% for small sample size (say 100). As sample size increases to 200, the power is significantly increased and the bias is dramatically reduced. Higher sample size (500) leads to more dramatic improvement (data not shown). However, to achieve desired power and small bias, we still need to maintain a balanced case-control sample. When sample size is small (say less than 100), maintaining such a balance is even more crucial. To test the performance of multi-SNP model, a simulation study assuming 3 htSNPs in a haplotype block is performed. The simulation design is similar to the 2-SNP model. The results are summarized in Table . In general, we observe similar trends for both population and quantitative parameters as in the 2-SNP model. As compared to the 2-SNP model, a slightly higher testing power is observed compared to the 2-SNP model with 100 sample size, especially under the dominant gene action mode. When sample size increases to 200 or 500, the difference is not remarkable. Similar results are observed for case proportion effect as in the 2-SNP model and hence is omitted.

A Case Study

We apply our model to a genetic association study of LGA neonates. LGA may lead to complications for both newborns and mothers. Studies showed that LGA is associated with increased risk of infant mortality [38], and may further lead to development of overweight for a baby in later stage of life [39, 40]. Risk of mothers of LGA neonates includes prolonged labor [41], risk of postpartum bleeding and genital tract injury [42]. Increasing proportion of LGA infants born has been reported in recent years [43, 44], but the etiology of LGA remains largely unknown. It has been increasingly recognized that complication of pregnancy and delivery is a complex trait determined by multiple environmental and genetic factors [45], few genetic association studies have been reported in literature on the relationship between genetic factors and LGA. To understand the genetic basis of LGA, a number of candidate genes have been genotyped for SNPs. Here we only use one of them to demonstrate the model implementation. Our goal is to study which genetic factors in mother are associated with the LGA neonates. The data set contains 552 unrelated maternal individuals with 117 cases and 435 controls in ages ranging from 13 to 45 years old mothers recruited at the Sotero del Rio Hospital, in Puente Alto, Chile. Each of these subjects was genotyped for SNP markers within the candidate gene apolipoprotein C-III (APOC3) located at chromosome 11q23. There are total 6 positions showing polymorphisms, three at the intron 1 region for SNPs 633938761, 633938806, and 633938845, one at exon 3 region for SNP 633938988, one at intron 3 region for SNP 633939053 and one at exon 4 region for SNP 633939147. We use Haploview software to construct the haploype block [46]. The haplotype block is defined using the confidence intervals definition [21]. Fig. () shows that there are two haplotype blocks with block I containing two SNPs 633938806 and 633938845 and block II containing three SNPs 633938988, 633939053 and 633939147. SNP 633938761 does not belong to any blocks. No SNPs are significant using the single SNP x2 test implemented in Haploview. Also, no haplotypes are significant in block II and one haplotype (TC) is significant in block I using the haplotype test implemented in Haploview. We apply our newly developed method to analyze this data set. We fit the 2-SNP model for SNPs in block I and the 3-SNP model for SNPs in block II. Results show that only SNPs in block I are significantly associated with LGA. In the following, we only focus our analysis on SNPs in block I. The two SNPs in block I form four haplotypes designated as TG, TC, CG, and CC. The two SNPs are in linkage disequilibrium, which suggests the importance of considering haplotype effect on the association study, rather than based on a single SNP. Five clinical risk factors are included in the model, maternal age (MA), maternal weight (MW), number of preterm deliveries (PTD), baby sex (BS), and maternal body mass index (MBMI). Our aim is to detect haplotype variants within this candidate gene which are associated with LGA under a variety of environmental conditions. By assuming that one haplotype is different from the rest of the haplotypes, we performed a systematic test for the four haplotypes. The results are summarized in Table . The MLEs of the haplotype frequencies, allele frequencies and the LD parameter are given. These two SNPs are strongly associated with each other ( ). The estimated allele frequencies are 0.7455 for allele T in SNP 633938806 and 0.3806 for allele C in SNP 633938845. The heterozygote rate for the two SNPs are 40% and 48%, respectively. The smallest AIC value is observed for haplotype TC which also shows significance based on hypotheses test (12) (p-value=0.024). All the other three haplotypes do not show evidence of significance. The MLEs of the quantitative parameters and their standard errors in the parenthesis are listed in Table . The likelihood ratio test shows that both additive and dominant effect for haplotype TC are significant at the 0.01 level.Among the five non-genetic covariates, variables MA and MW are significant at the 0.05 level and variable BS is significant at the 0.01 level, which indicate that both maternal age and weight could be potential risk factors for LGA. Since the additive effect is negative, this indicates that this risk haplotype TC triggers a negative effect on LGA, i.e., individuals who carry composite diplotypes have higher risk to develop LGA than individuals carrying composite diplotype [TC][TC] with odds ratio by holding constant for other risk factors. We also calculated the odds ratio of developing LGA for individuals giving birth to different sex of baby. For example, the risk for women carrying the same composite diplotypes to develop LGA would be 1.73 times higher if they deliver baby boy compared to those who deliver baby girl. The risk to develop LGA for a 40-year old mother would be 1.7 times higher than a 25-year old mother after adjusting for other covariates effects. Holding constant for other covariates, the risk to develop LGA will be increased by 1.54 times for every 22 pounds weight gain.

DISCUSSION

The study of common diseases can be broadly divided into two categories: family-based linkage studies across the entire genome, and population-based association studies of individual candidate genes [2]. While accumulative evidences have shown that linkage methods have lower power than population-based association methods [47,48], a more efficient way to study the genetic architecture of a complex disease is at the population level. Meanwhile, a number of studies have shown that haplotype-based association study is more powerful than single SNP analysis, especially when multiple disease-susceptibility variants occur within the same gene [11,49]. Therefore, hunting for specific DNA sequence patterns that are associated with the variation of a disease would provide efficient information in understanding the disease etiology. The model presented here, aimed to detect the association between DNA sequence variants and a binary disease trait, thus prides a timely tool toward better understanding of the genetic architecture of a complex binary disease trait. In this article, we make an attempt to study genetic association by utilizing the abundant sequence variation information developed by the HapMap project. The model, called BTN mapping, is derived on the basis of multilocus haplotype analysis using a finite number of htSNPs within a haplotype block assuming that the candidate gene is not imprinted. Both simulation studies and a real example show that our BTN mapping approach can detect haplotype association underlying a disease trait with high power and, hence displays a number of merits. First, our approach can characterize the association of DNA sequence variants predisposing to a disease. Traditional disease mapping approaches such as binary trait locus (BTL) mapping, attempt to identify loci called BTLs that are linked with known markers [22-24]. The specific DNA sequence structure for the detected loci remains unknown. As opposed to this traditional “indirect” approach, our model can directly materialize DNA sequences underlying a disease trait, and therefore represents a “direct” approach. It should be noted that our approach is limited by knowledge about the complete functional sequence variants information in candidate regions. With the release of more SNPs by HapMap, our model will become more useful in search for causal variants throughout the whole genome. Second, our approach is likelihood-based and is computationally fast. Regular model selection criteria such as the AIC criterion can be applied to determine the risk haplotype structures on a disease trait. The developed model also allows for nongenetic covariates effect which might provide potential information in genetic association study. It has been shown that risk of complex diseases such as cancers may be determined by both genetic and environmental factors [50]. Incorporating the non-genetic factors should provide more meaningful results in association tests, and hence should be more preferred. Third, the proposed regrouping approach could potentially increase testing power by reducing the degree of freedom for an association test. In general, there are two ways to increase the power of an association test: developing appropriate statistical forms for an association test or reduce the degrees of freedom [51]. A common limitation for existing haplotype-based analyses is that the test statistic often involves a large number of degrees of freedom. When large number of SNP markers are involved to construct haplotypes, the test could suffer from severe power loss as the number of degrees of freedom could be large. Through regrouping haplotypes, the BTM mapping approach has better control of the degrees of freedom because the number of degrees of freedom for an association test remain the same regardless of the number of SNPs fitted in the model. Our simulation studies confirmed that even for small sample size ( < 200), when balanced sampling scheme is maintained, one can still obtain appropriate power. Finally, the model is robust and flexible to different genetic and experimental settings. The results from simulation studies indicate that the association between haplotype and disease phenotype can be well detected under different gene action modes with modest sample size. It is worthy to note that the proportion of cases shows a great impact on testing power and parameter estimation by various simulations. Results indicate that a nearly balanced sampling design provides optimal power and low estimation biases, especially for small sample size. Therefore, a balanced case-control design should always be preferred when recruiting samples in a case-control study. The effect of allele frequencies on testing power and parameter estimation is also investigated (data not shown). Our results indicate that testing power is not affected by allele frequency as long as the proportion of individuals carry double risk haplotypes is not terribly small. However, we do observe inflated variances for the genetic parameters when sample size or allele frequency is small, which might be due to multicollinearity among the genetic covariates given the categorical nature of variables [15]. One possible solution is to apply a penalized regression model in which the log-likelihood function is penalized by a penalty term. Possible choices for the penalty term are Lasso type penalty [52] or ridge penalty [53]. When the likelihood function is penalized, however, the usual likelihood ratio test can not be applied. Efficient and robust model selection approaches need to be developed to solve this problem. Our model is developed based on tag SNPs selected within a haplotype block. The purpose of using htSNPs is to control the number of SNPs used to construct a haplotype, which in turn, minimizes the computation burden for haplotype construction using EM algorithm. We may lose potential information when the efficiency of tag SNP selection is low. With more and more studies focused on tag SNP selection, more robust tagging approaches will improve our mapping efficiency. It should also be noted that the assumption by using tag SNPs can be relaxed in which all SNPs identified within a haplotype block can be used for analysis, especially when the SNPs size within a block is limited. However, corrections for multiple testings are necessary when the number of tested blocks are large. Our mapping approach considers binary disease traits and is a generalization of the model proposed by Liu et al. [19] which considers continuous disease traits. The specific utility of our model to a real example from a genetic association study leads to the successful detection of BTNs genotyped within the APOC3 candidate gene associated with LGA. Although our simulation studies and the example were illustrated based on the 2-SNP and 3-SNP models, our BTN mapping model has been developed to allow for the detection of BTN structures involving any number of SNPs. The model can also be easily extended to model the interaction between genetic factors and environments. It is also possible that haplotypes in one block interact with haplotypes in other blocks. A further extension of the model can be applied to model the effects of BTN-BTN interactions on a disease trait.

Table 1.

Possible Diplotype and Composite Diplotype Configurations of Nine Genotypes at Two SNPs and their Haplotype Composition Frequencies

Genotype	Diplotype			Composite Diplotype		No. of Observation
Genotype	Configuration	Frequency	Relative Frequency	Symbol	Diplotype Function	No. of Observation
11/11	[11][11]	P_[11][11] = p112	1	[11][11]	π₂	n_11/11
11/12	[11][12]	P_[11][12] = 2p₁₁p₁₂	1	[11]11¯	π₁	n_11/12
11/22	[12][12]	P_[12][12] = p122	1	11¯11¯	π₀	n_11/22
12/11	[11][21]	P_[11][21] = 2p₁₁p₂₁	1	[11]11¯	π₁	n_12/11
12/12	11221221	P1122=2p11p22P1221=2p12p21	ϕ1−ϕ	1111¯11¯11¯	π11−π1	n_12/12
12/22	[12][22]	P_[12][22] = 2p₁₂p₂₂	1	11¯11¯	π₀	n_12/22
22/11	[21][21]	P_[21][21] = p212	1	11¯11¯	π₀	n_22/11
22/12	[21][22]	P_[21][22] = 2p₂₁p₂₂	1	11¯11¯	π₀	n_22/12
22/22	[22][22]	P_[22][22] = p222	1	11¯11¯	π₀	n_22/22

where p11 , p12 , p21 and p22 are the frequencies for haplotype [11], 12, 21, and 22, respectively. The relative frequency refers to the probability that a specific diplotype is observed. For unambiguous genotype (phase known), the relative frequency is 1. For the double heterozygotic genotype 12/12, the probability of observing diplotype [11][22] is ø , and observing diplotype [12][12] is 1- ø .

Table 2.

The Type I Error Estimated from 1000 Simulation Replicates Under the 2 and 3-SNP Models with Nominal Level 0.05

n	2-SNP model	3-SNP Model
100	0.073	0.06
200	0.056	0.055
300	0.058	0.054
400	0.045	0.049
500	0.047	0.048

Table 3.

The Mean MLEs with their Square Root Mean Square Errors (RMSEs) (in Parentheses) of Population and Quantitative Parameters of the BTNs Estimated from 1000 Simulation Replicates Under the 2-SNP Model

n	α₀ = 0.5	α₁ = 1	α₂	γ = 1.5	p11=0.7	p12=0.7	D = 0.02	Power
Additive			α ₂= 0
100	0.582	1.082	-0.067	1.618	0.698	0.701	0.021	64
	(0.695)	(0.686)	(0.818)	(0.403)	(0.033)	(0.032)	(0.021)
200	0.505	1.051	0.015	1.565	0.701	0.699	0.02	93.3
	(0.278)	(0.314)	(0.407)	(0.269)	(0.024)	(0.024)	(0.016)
500	0.510	1.007	-0.006	1.521	0.7	0.7	0.02	100
(0.176)	(0.188)	(0.252)	(0.160)	(0.015)	(0.014)	(0.009)
Dominant			α₂ = 1
100	0.586	1.089	1.008	1.643	0.698	0.701	0.021	74.8
	(0.718)	(0.710)	(0.887)	(0.468)	(0.033)	(0.033)	(0.021)
200	0.523	1.038	1.013	1.557	0.701	0.7	0.02	96.4
	(0.279)	(0.308)	(0.423)	(0.284)	(0.023)	(0.023)	(0.015)
500	0.506	1.009	1.002	1.518	0.7	0.7	0.02	100
	(0.169)	(0.191)	(0.265)	(0.166)	(0.015)	(0.014)	(0.009)
Recessive			α₂= -1
100	0.581	1.098	-1.119	1.622	0.699	0.699	0.019	88
	(0.789)	(0.798)	(0.919)	(0.418)	(0.032)	(0.034)	0.022
200	0.519	1.040	-1.036	1.572	0.7	0.699	0.02	97.6
	(0.287)	(0.316)	(0.421)	(0.269)	(0.023)	(0.023)	0.015
500	0.504	1.016	-1.013	1.530	0.7	0.699	0.02	100
	(0.186)	(0.182)	(0.261)	(0.164)	(0.015)	(0.015)	(0.009)

, and D are the allelic frequencies of alleles and at two SNPs and their linkage disequilibrium, respectively. α0 is the intercept, and α1 and α2 are the additive and dominant effects respectively by assuming that haplotype is different from the rest haplotypes. γ is the covariate effect. The first row contains the given values of all parameters. Power is calculated as the percentages of all simulations in which the true disease-gene association is detected.

Table 4.

The Mean MLEs with their Square Root Mean Square Errors (RMSEs) (in Parentheses) of Population and Genetic Parameters of the BTNs Estimated from 1000 Simulation Replicates Under the 3-SNP Model

n	α₀	α₁	α₂	γ	p11	p12	p13	D₁₂	D₁₃	D₂₃	D₁₂₃	Power
True*	0.5	1.0		1.5	0.7	0.7	0.7	0.04	0.025	0.025	0.02
Additive			α₂ = 0
100	0.622	1.156	-0.117	1.614	0.698	0.699	0.699	0.04	0.025	0.024	0.02	68.8
	(0.998)	0.992)	(1.076)	(0.405)	(0.033)	(0.033)	(0.033)	(0.020)	(0.021)	(0.021)	(0.013)
200	0.542	1.038	-0.028	1.547	0.699	0.700	0.699	0.04	0.026	0.024	0.02	93.1
	(0.307)	(0.311)	(0.413)	(0.245)	(0.023)	(0.023)	(0.024)	(0.014)	(0.015)	(0.015)	(0.009)
500	0.504	1.013	-0.004	1.522	0.700	0.700	0.699	0.04	0.025	0.024	0.02	100
	0.179)	0.186)	(0.256)	(0.159)	(0.015)	(0.015)	(0.016)	(0.009)	(0.009)	(0.009)	(0.006)
Dominant			α₂ = 1
100	0.677	1.201	0.975	1.654	0.700	0.699	0.699	0.04	0.025	0.024	0.02	83.2
	(1.276)	(1.302)	(1.566)	(0.480)	(0.033)	(0.033)	(0.033)	(0.020)	(0.021)	(0.020)	(0.013)
200	0.544	1.036	0.997	1.547	0.699	0.700	0.699	0.04	0.026	0.024	0.02	99.2
	(0.305)	(0.318)	(0.459)	(0.274)	(0.023)	(0.023)	(0.024)	(0.014)	(0.015)	(0.015)	(0.009)
500	0.506	1.012	1.013	1.521	0.700	0.700	0.699	0.04	0.025	0.024	0.02	100
	(0.178)	(0.190)	(0.281)	(0.169)	(0.015)	(0.015)	(0.015)	(0.009)	(0.009)	(0.009)	(0.006)
Recessive			α₂ = -1
100	0.637	1.18	-1.151	1.631	0.702	0.701	0.702	0.04	0.025	0.024	0.02	89.3
	(1.128)	(1.15)	(1.217)	(0.429)	(0.033)	(0.033)	(0.032)	(0.021)	(0.021)	(0.021)	(0.013)
200	0.52	1.032	-1.025	1.552	0.699	0.7	0.7	0.04	0.026	0.024	0.02	97.8
	(0.299)	(0.296)	(0.438)	(0.262)	(0.024)	(0.023)	(0.023)	(0.014)	(0.015)	(0.015)	(0.009)
500	0.502	1.01	-1.01	1.521	0.700	0.700	0.699	0.04	0.025	0.024	0.02	100
	(0.183)	(0.193)	(0.262)	(0.153)	(0.015)	(0.015)	(0.015)	(0.009)	(0.009)	(0.009)	(0.006)

, and are the allelic frequencies of alleles , and at three SNPs, and D12 , D23 , D13 , and D123 are their linkage disequilibrium, respectively by assuming that haplotype is different from the rest haplotypes. See Table 3 for explanations of other parameters.

Table 5.

The Maximum Likelihood Estimates (MLEs) of the Population and Quantitative Parameters for Significant BTNs Associated with LGA Detected within APOC3 Gene. The Standard Errors of the Quantitative Parameters are Given in the Parenthesis

		AIC	LR₁	P - value
Risk haplotype	[TC]	558.2	7.429	0.024
	[TG]	565.29	0.336	0.845
	[CG]	-	-	-
	[CC]	562.68	2.948	0.229
Population parameters
Haplotype frequencies	pˆTG	0.6196
	pˆTC	0.1259
	pˆCG	0
	pˆCC	0.2545
Allele frequencies and LD	pˆT1	0.7455
	pˆC2	0.3804
	Dˆ¯	-0.1577
Quantitative parameters
Intercept	αˆ0	-3.235(0.851)
Additive effect	αˆ1	-0.606(0.547) [**]
Dominant effect	αˆ2	-0.092(0.612) [**]
MA	γˆ1	0.036(0.016) [*]
MW	γˆ2	0.043(0.022)[*]
PTD	γˆ3	-0.328(0.281)
BS	γˆ4	-0.547(0.215)[**]
MBMI	γˆ5	-0.076(0.056)

LR1 is the likelihood ratio test statistic based on hypothesis (12). The risk haplotype detected on the basis of the AIC value and LR test is indicated in boldface.** and * refer to significance at the 0.01 and 0.05 level, respectively.

MA=maternal age; MW=maternal weight; PTD=number of preterm deliveries; BS=baby sex; MBMI=maternal body mass index.

44 in total

1. A dynamic programming algorithm for haplotype block partitioning.

Authors: Kui Zhang; Minghua Deng; Ting Chen; Michael S Waterman; Fengzhu Sun
Journal: Proc Natl Acad Sci U S A Date: 2002-05-28 Impact factor: 11.205

2. Haplotypes vs single marker linkage disequilibrium tests: what do we gain?

Authors: J Akey; L Jin; M Xiong
Journal: Eur J Hum Genet Date: 2001-04 Impact factor: 4.246

3. Inference on haplotype effects in case-control studies using unphased genotype data.

Authors: Michael P Epstein; Glen A Satten
Journal: Am J Hum Genet Date: 2003-11-20 Impact factor: 11.025

4. A haplotype map of the human genome.

Authors:
Journal: Nature Date: 2005-10-27 Impact factor: 49.962

5. A haplotype-based algorithm for multilocus linkage disequilibrium mapping of quantitative trait loci with epistasis.

Authors: Xiang-Yang Lou; George Casella; Ramon C Littell; Mark C K Yang; Julie A Johnson; Rongling Wu
Journal: Genetics Date: 2003-04 Impact factor: 4.562

Review 6. Cancer risk factors for selecting cohorts for large-scale chemoprevention trials.

Authors: P Greenwald
Journal: J Cell Biochem Suppl Date: 1996

Review 7. Searching for genetic determinants in the new millennium.

Authors: N J Risch
Journal: Nature Date: 2000-06-15 Impact factor: 49.962

Review 8. Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease.

Authors: David Botstein; Neil Risch
Journal: Nat Genet Date: 2003-03 Impact factor: 38.330

Review 9. Apolipoprotein E and Alzheimer disease.

Authors: W J Strittmatter; A D Roses
Journal: Proc Natl Acad Sci U S A Date: 1995-05-23 Impact factor: 11.205

10. Fetal macrosomia--maternal risks and fetal outcome.

Authors: A A Meshari; S De Silva; I Rahman
Journal: Int J Gynaecol Obstet Date: 1990-07 Impact factor: 3.561

8 in total

Review 1. Statistics and bioinformatics in nutritional sciences: analysis of complex data in the era of systems biology.

Authors: Wenjiang J Fu; Arnold J Stromberg; Kert Viele; Raymond J Carroll; Guoyao Wu
Journal: J Nutr Biochem Date: 2010-03-16 Impact factor: 6.048

2. A genetic association study of maternal and fetal candidate genes that predispose to preterm prelabor rupture of membranes (PROM).

Authors: Roberto Romero; Lara A Friel; Digna R Velez Edwards; Juan Pedro Kusanovic; Sonia S Hassan; Shali Mazaki-Tovi; Edi Vaisbuch; Chong Jai Kim; Offer Erez; Tinnakorn Chaiworapongsa; Brad D Pearce; Jacquelaine Bartlett; Benjamin A Salisbury; Madan Kumar Anant; Gerald F Vovis; Min Seob Lee; Ricardo Gomez; Ernesto Behnke; Enrique Oyarzun; Gerard Tromp; Scott M Williams; Ramkumar Menon
Journal: Am J Obstet Gynecol Date: 2010-07-31 Impact factor: 8.661

3. Detecting maternal-fetal genotype interactions associated with conotruncal heart defects: a haplotype-based analysis with penalized logistic regression.

Authors: Ming Li; Stephen W Erickson; Charlotte A Hobbs; Jingyun Li; Xinyu Tang; Todd G Nick; Stewart L Macleod; Mario A Cleves
Journal: Genet Epidemiol Date: 2014-03-02 Impact factor: 2.135