Literature DB >> 27468297

Analysis of Variance Components for Genetic Markers with Unphased Genotypes.

Abstract

An ANOVA type general multi-allele (GMA) model was proposed in Wang (2014) on analysis of variance components for quantitative trait loci or genetic markers with phased or unphased genotypes. In this study, by applying the GMA model, we further examine estimation of the genetic variance components for genetic markers with unphased genotypes based on a random sample from a study population. In one locus and two loci cases, we first derive the least square estimates (LSE) of model parameters in fitting the GMA model. Then we construct estimators of the genetic variance components for one marker locus in a Hardy-Weinberg disequilibrium population and two marker loci in an equilibrium population. Meanwhile, we explore the difference between the classical general linear model (GLM) and GMA based approaches in association analysis of genetic markers with quantitative traits. We show that the GMA model can retain the same partition on the genetic variance components as the traditional Fisher's ANOVA model, while the GLM cannot. We clarify that the standard F-statistics based on the partial reductions in sums of squares from GLM for testing the fixed allelic effects could be inadequate for testing the existence of the variance component when allelic interactions are present. We point out that the GMA model can reduce the confounding between the allelic effects and allelic interactions at least for independent alleles. As a result, the GMA model could be more beneficial than GLM for detecting allelic interactions.

Entities: Chemical Disease Gene Mutation Species

Keywords: Fisher's ANOVA model; allelic interactions; analysis of variance; general linear model; general multi-allelic model; genetic variance components; least square estimates; orthogonal partition

Year: 2016 PMID： 27468297 PMCID： PMC4942470 DOI： 10.3389/fgene.2016.00123

Source DB: PubMed Journal: Front Genet ISSN： 1664-8021 Impact factor: 4.599

1. Introduction

Typically, there are two different ways in assessing the statistical association of a categorical variable with a continuous outcome. We can either make a direct comparison of the group means among groups defined by the categorical variable or assess the variation that the categorical variable may contribute to the total variance of the continuous outcome. The former approach is usually conducted via fitting a general linear model (GLM) with or without an adjustment for other covariates. The latter approach is referred as the analysis of variance (ANOVA), which examines a quantitative outcome variable by partitioning its total variance into variance components attributable to different sources of variation. The original ANOVA model was proposed initially by Fisher (1918) and formalized later in Fisher (1925). The traditional ANOVA approach on estimation of variance components was mainly based on Henderson's method I-III by equating the observed sums of squares to their expected values (Henderson, 1953). Currently, for predictors with observed group levels, their variance components can be estimated via ANOVA tables, which are mostly based on the sequential (Type I) sums of squares or partial reductions in (Type III) sums of squares via fitting GLM (Searle, 1971; Searle et al., 1992). For predictors with observed or unobserved group levels, their variance components could also be evaluated via maximum likelihood estimation (MLE) or restricted maximum likelihood (REML) (see Searle et al., 1992). In genetic association studies, the standard GLM or ANOVA approaches may not be directly applicable to genetic markers with unphased genotypes. In humans, a marker genotype is a combination of one paternal and one maternal allele. But the phase information (i.e., the origin of parental alleles) is often missing from most of the current genotype typing technologies. Appropriate modification on the classical GLM or ANOVA methods is therefore needed in order to overcome this unknown phase problem. In Wang (2011), we introduced several coding schemes for unphased marker genotypes in constructing the dummy-variable based GLM. In Zeng et al. (2005) and Wang (2014), some revised Fisher's ANOVA models were also proposed for ANOVA analysis of quantitative trait loci or genetic markers with unphased genotypes, which were referred as the general biallelic (G2A) or general multi-allelic (GMA) models. The estimation of genetic variance components based on the G2A and GMA models were also briefly explored. In this study, by applying the GMA model, we further examine estimation of genetic variance components for genetic markers with unphased genotypes from a random sample of individuals in a study population. First, for a single locus GMA model, we derive the least square estimates (LSE) of model parameters in fitting the GMA model based on the genotypic group means and allele frequency estimates. Than we construct estimators of the variance components for one marker in Hardy-Weinberg disequilibrium (HWD). Next, we consider a fully parameterized two-locus GMA model. We derive the LSE of model parameters in fitting the GMA model and develop estimators of the variance components for two genetic markers in an equilibrium population. In both one locus and two loci cases, we also explore the difference between the GLM and GMA in association analysis of genetic markers. We show that the GMA models can provide the same partition on the genetic variance components as the original Fisher's ANOVA models but the GLM cannot. We clarify that the F-statistics based on the partial reductions in sums of squares from the standard ANOVA table for association testing of the fixed allelic effects in a GLM could be inadequate for testing the existence of variance components when higher order fixed allelic interactions are present. In addition, we point out that the GMA model can reduce the confounding between allelic effects and allelic interactions at least when the inheritance of alleles are independent. As the result, the GMA model could be more beneficial than GLM for detecting allelic interactions. Finally, a simulation example is presented to show the difference between GLM and GMA models on partitioning variance components. The performance in model selection between using GLM and GMA models is also examined.

2. Models and results

Consider a random sample of size N from a study population. Let y, i = 1, … , N, be their observed phenotypic values for a quantitative trait Y, and g, i = 1, … , N, be their observed unphased genotypes at certain genetic marker loci. The quantitative trait Y is assumed to be affected by both genetic and environmental effects. Let G denote the true (unobservable) genotypic value, which could be affected by many genetic factors. If we ignore the genetic by environmental interactions, the relationship between the quantitative trait and marker genotypes can be modeled via a GLM where z is a vector of the adjusted environmental covariates with fixed effects β, E(G|g) represents the expected genotypic value of the i-th individual given his/her observed marker genotypes g, and ϵ is a model residual error contributed by other environmental and genetic factors that cannot be captured by the covariates z and marker genotypes g. We also assume that ϵ, i = 1, … , N, are independent and identically distributed (i.i.d) with E(ϵ) = 0 and V(ϵ) = Vϵ. Besides, {ϵ, i = 1, … , N} are independent of {g, i = 1, … , N}. Then, the total phenotypic variance V = V(E(G|g)) + E(V(G|g)) = V(E(G|g)) + Vϵ. In the rest of the paper, we focus on comparing the GLM and GMA modeling on the expected genotypic values E(G|g) and assessing the genetic variance components contributed by the allelic effects and allelic interactions at the marker loci to the expected genotypic variance V(E(G|g)).

2.1. One-locus models

Consider a single marker locus with multiple alleles A1, …, A (m ≥ 2). Let p, j = 1, … , m, be the allele frequencies (), and p = P(AA), j, k = 1, … , m (j ≤ k), be the genotype frequencies (), in a study population. For a random sample of size N from the study population, suppose that there are n individuals carrying genotypes AA for j ≤ k, j, k = 1, … , m (). For notation simplicity, we also let p = p and n = n for k > j, j, k = 1, … , m. Then, , for j = 1, … , m. Let , which is the total number of alleles A carried by the sampled individuals (j = 1, … , m). Assume that the observed marker genotypes g, i = 1, … , N, are i.i.d. and follow a multinomial distribution. Then the MLE of the genotype frequencies are given by , and the MLE of the allele frequencies are given by , for j = 1, … , m. Also, let y,i = 1, … , n be the observed phenotypic values of the individuals carrying genotypes AA. We define the observed genotypic group means for j, k = 1, … , m, the allele averaged means for j = 1, … , m, and the grand mean . In addition, we define the allele weighted means for j = 1, … , m, and the weighted overall mean . Note that in a classical two-way factorial ANOVA model, each individual can receive one and only one level assignment from each of the two factors. At a marker locus, however, an individuals can carry a homozygous genotype “AA” (j = 1, … , m) with two A alleles indistinguishable. In general, the allele weighted means may not be the same as the allele averaged means ȳ, and the weighted overall mean may also differ from the grand mean ȳ···. To assess the allelic effects on the expected genotypic values (or phenotypic values), let us first take a brief review of Fisher's one-locus ANOVA model (see Kempthorne, 1957; Weir and Cockerham, 1977). Since the marker genotypes are unphased, we usually assume that the paternal and maternal gametes share the same set of alleles, have the same allele frequencies, and contribute the same allelic effects at the marker locus. The fact that each allele contributes the same genetic effect regardless of its parental origins implies that the expected genotypic values G = E(G|g = AA) should satisfy the symmetric property: G = G, for j ≠ k. So there are totally m(m + 1)/2 possible distinctive expected genotypic values G, j, k = 1, …, m. By treating the paternal and maternal gametes as two independent risk factors, the traditional Fisher's one-locus ANOVA model for the expected genotypic values G can be written as where α is referred as the average additive effect of the paternal or maternal allele A (j = 1, …, m), and δ the average allelic interaction between two parental alleles A and A (j, k = 1, …, m). It is known that not all the parameters in the above model are estimable due to an over-parameterization of the model on the expected genotypic values. Typically, the following constraints can be added on the model parameters (see Kempthorne, 1957). From the symmetric property of G, we also assume that δ = δ, for j ≠ k. With these constraints, if we further incorporate model (2) into GLM (1) and ignore the adjusted covariates, the LSE of parameters are given by , for j = 1, … , m, and for j, k = 1, … , m. It has been known that, under Hardy-Weinberg equilibrium (HWE), the above model (2) can also provide an orthogonal partition of the expected genotypic variance as V(E(G|g)) = V + V, where and are the so-called additive and dominance variance components. Weir and Cockerham (1977) also explored model (2) on partitioning the expected genotypic variance in HWD. In practice, as pointed out in Wang (2014), the symmetric property of δ's and the irregular constraints (3) could make it difficult to fit model (2) using standard statistical software especially when the adjusted covariates are involved. Besides, the random variables that are responsible for the additive and dominance variance components are not explicitly defined in model (2). The expected genotypic values can also be modeled via a classical dummy-variable based GLM. As shown in Wang (2011), we can introduce the following indicator variables to describe the inheritance of the two parental alleles for each individual for j = 1, …, m. Though z1 and z2 cannot be defined on unphased heterozygous genotypes, we can define the following genotype coding variables for unphased genotypes for j = 1, …, m, and for j ≠ k, j, k = 1, …, m. Here, denotes any other allele rather than A. By choosing A as a reference allele, we can construct the following GLM for i = 1, …, N, where a is usually referred as the fixed additive allelic effect of the paternal or maternal allele A, and d the fixed allelic interaction between two parental alleles A and A, with respect to the reference allele A. In terms of the expected genotypic values, we can show that μ0 = G, a = G − G and d = (G − G) − (G − G), for j = 1, …, m − 1 and k = j, …, m − 1. Model (4) provides a full re-parameterization of the m(m + 1)/2 expected genotypic values. Suppose that there are no empty genotype groups for the observed random sample; i.e., n > 0 for any j, k = 1, … , m. When we incorporate model (4) into GLM (1) and ignore the adjusted covariates, the LSE of the expected genotypic values are given by: Ĝ = ȳ, for j, k = 1, … , m − 1 and j < k. Therefore, the LSE of parameters in model (4) can be easily derived in terms of the observed genotypic group means as and â = ȳ − ȳ for j = 1, … , m − 1, and for j, k = 1, … , m − 1 and j ≤ k. Note that these LSE could be sensitive to phenotypic outliers especially for small genotypic groups. For example, a few individuals may have genotype “AA” (j ≠ m) for a rare allele “A.” If the genotypic group mean ȳ is much larger than ȳ, the LSE â will be large. By choosing a common allele “A” as the reference, we could improve the accuracy of LSE. Through incorporating model (4) into a GLM, we could also perform hypothesis tests on its model parameters via contrasts. In analysis of genetic variance components, however, it has been known that w's are often correlated with v's as random variables over the random individuals even when the inheritance of paternal and maternal alleles are independent (Wang and Zeng, 2009). As the result, model (4) leads to a different partition of the expected genotypic variance V(E(G|g)) from the one defined in Fisher's ANOVA model (2). To further dissect the confounding between the additive and dominance effects in model (4), we can make mean corrections on the indicator variables {z1} and {z2}, and introduce the following mean-corrected index variables (see Wang, 2014) Then, we define the following modified genotype coding variables for j = 1, … , m, and for j ≠ k, j, k = 1, …, m. Here, denotes an allele which is different from both A and A. Note that the modified genotype coding variables , , and are well defined on unphased genotypes, although the mean-corrected index variables x1, x2 cannot be defined on unphased heterozygous genotypes. By choosing A as a reference allele, we can construct the following GMA model (Wang, 2014) for i = 1, …, N. Still, in model (5), we refer the parameters as the average additive allelic effects and (j ≤ k) as the average allelic interactions, with respect to the reference allele A. Both the GMA model (5) and the GLM (4) can provide a full parameterization of the expected genotypic values G. Comparing to model (4), one major advantage of model (5) is that it can retain the same partition on the genetic variance components as the one from Fisher's ANOVA model (2). In fact, under the constraints (3) plus the symmetric property of δ, model (2) can be re-written as for i = 1, …, N. Model (5) is a simplified version of model (6) by further replacing the redundant variables , and by , for j = 1, … , m − 1, and . We can see that both models (5) and (2) share exactly the same additive and dominance variance components. They become equivalent when we take μ* = μ, , and , for j, k = 1, … , m − 1 and j ≤ k. Note that model (5) does not contain redundant parameters. Therefore, it does not require constraints on its model parameters. Besides, the random variables that are responsible for the additive and dominance variance components are explicitly defined in model (5). In practice, similar to enforcing the constraints (3) on model (2), we can create the variables and by replacing the allele frequencies p's by their MLE 's. Then, by incorporating model (5) into GLM (1), we can treat these modified genotype coding variables as regular fixed covariates and fit the model using the ordinary LS approach. The hypothesis of H0:V = 0 for existence of the additive variance component can also be tested via testing for the average additive allelic effects. The GLM (4) and GMA model (5) can be transformed easily from one to the other. From the relationship between their genotype coding variables, we can show that and for j ≤ k, j, k = 1, … , m − 1. Here, for notation simplicity, we define d = d, for j < k. To fit model (5), instead of solving its normal equations directly, we can derive the LSE of its model parameters from the LSE of model (4) as the following. and for j ≤ k, j, k = 1, … , m − 1. Note that the LSE of the average additive allelic effects are calculated from allele weighted means, which could be more robust to phenotypic outliers than the LSE of fixed additive allelic effects. It is also interesting to see that both the fixed additive allelic effects a's and the fixed allelic interactions d's may affect the average additive allelic effects 's, though the fixed allelic interactions keep the same as the average allelic interactions. In general, the hypothesis test of H0 : a = 0, j = 1, … , m − 1, for the fixed additive effects in a GLM (4) is not equivalent to the hypothesis test of , for the average additive effects in an equivalent GMA model (5). Therefore, the standard F-statistics for testing the fixed additive effects in model (4) could be inadequate for testing the existence of the additive variance component V when significant fixed (or average) allelic interactions are present. In ANOVA, we treat {x1, j = 1, … , m − 1} and {x2, k = 1, … , m − 1} as random variables over the sampled individuals. Model (5) provides an approximation to the expected genotypic values E(G|g) by a linear combination of the random variables x1, j = 1, … , m − 1 and x2, k = 1, … , m − 1, plus their cross-product terms for allelic interactions. To model the genotypic distributions, let be a vector that describes the unphased genotypic categories for i = 1, … , N. As usual, we assume that the marker genotypes are i.i.d. from a multinomial distribution of Mult(1, {p, j, k = 1, … , m, j ≤ k}). Under this assumption, we can show that the vectors g1 = (z11(g), … , z1(g)), i = 1, … , N, of paternal allele indicator variables are i.i.d. from Mult(1, {p1, … , p}); and the vectors g2 = (z21(g), … , z2(g)), i = 1, … , N, of maternal allele indicator variables are also i.i.d. from Mult(1, {p1, … , p}). When the inheritance of paternal and maternal alleles are independent, the study population is under HWE and the mean corrected index variables {x1(g), j = 1, … , m} are independent of {x2(g), j = 1, … , m} for any randomly sampled individual i. In HWD, however, the mean corrected index variables {x1, j = 1, … , m} could be correlated with {x2, j = 1, … , m}. Let us define for j ≠ k, j, k = 1, … , m. Then, {D, j, k = 1, … , m} measure the departures from HWE, which satisfy . From the observed genotypes, we have MLE for j = 1, … , m, and for j, k = 1, … , m and j ≠ k. It is more convenient to use GMA model (5) rather than GLM (4) for estimation of genetic variance components. In model (5), let and represent the additive and dominance components, respectively. In general, we can partition the expected genotypic variance as V(E(G|g)) = V + V + 2Cov(A, D). In Wang (2014), we have derived formulas for estimating the variance components V, V and covariance component Cov(A, D) based on the parameters in model (5). By plugging in LSE of the parameters, we obtain estimators of the variance components V, V and the covariance component Cov(A, D) as shown in Appendix A in Supplementary Material. It should be pointed out that these estimators of the variance and covariance components are different from the traditional ANOVA estimators when the marker genotypes are in HWD. Unlike the ANOVA estimators of variance components which could be negative when data are unbalanced, our estimators of the variance components are guaranteed to be non-negative. Similar to the ANOVA estimators, when the model residuals are normally distributed, these estimators become MLE of the variance and covariance components and likely possess the asymptotic normality property (see Searle, 1995). Meanwhile, with variations from both the genotypic group means and the MLE estimates of allele frequencies, these estimators are only asymptotically unbiased, while the original ANOVA estimators are unbiased. Under HWE, we would expect an orthogonal partition of the expected genotypic variance as V(E(G|g)) = V + V. By taking for j, k = 1, … , m in Appendix A in Supplementary Material, we obtain the following estimators If we further incorporate model (5) into GLM (1) and ignore the adjusted covariates, the estimator of residual variance is given by . The estimator of total phenotypic variance is given by . We can show that . Note that when for j, k = 1, … , m, the allele weighted means and the allele averaged means become the same; i.e., for j = 1, … , m. Besides, the weighted overall mean is the same as the grand mean ŷ···. However, due to possible sampling variations from the sampled individuals, under HWE we could still have some , j, k = 1, … , m, which may lead to a slight deviation from the orthogonal partition of the expected genotypic variance. When for j, k = 1, … , m, another nice feature of the GMA model (5) is that the LSE of its main effects will keep the same in a reduced GMA model when we ignore the allelic interactions. Let us represent the full GMA model (5) in a matrix form with the design matrix , where 1 is a N by 1 vector with all its elements being 1, and , which correspond to the main effects and allelic interactions , respectively. We can show that , which is a block diagonal matrix when , for j, k = 1, … , m (see proof in Appendix B in Supplementary Material). Therefore, the LSE and , which do not depend on . In other words, ignoring the average allelic interactions in model (5) does not affect the LSE of the intercept and the average additive effects in this case. The GLM (4) does not have such a property. In a reduced GLM (4) without the fixed allelic interactions, the LSE of its additive effects are for j = 1, … , m − 1, while in a full GLM (4) the LSE become â = ȳ − ȳ, for j = 1, … , m − 1. We can also estimate the genetic variance components with an adjustment for other covariates. First, by incorporating the GMA model (5) into GLM (1), we can fit GLM (1) using the ordinary LS approach. Next, based on the fitted model, we can calculate the fitted additive and dominance components and , for each individual i = 1, … , N. Then, we can estimate V and V as the sample variances of {Â, i = 1, … , N} and , respectively. Meanwhile, the covariance component Cov(A, D) can be estimated as the sample covariance between {Â, i = 1, … , N} and .

2.2. Two-locus models

We consider an extension of the previous one-locus models to two loci. Assume that marker 1 has alleles A11, …, A1 (m1 ≥ 2) with p11, … , p1 being the allele frequencies, and marker 2 has alleles A21, …, A2 (m2 ≥ 2) with p21, … , p2 being the allele frequencies. Let p = P(A1A1, A2A2) denote the joint genotype frequencies at the two marker loci in a study population. For a random sample of size N from the study population, let n be the number of individuals who carry unphased genotypes A1A1 (j ≤ k) at locus 1, and A2A2 (r ≤ s) at locus 2, for j, k = 1, … , m1 and r, s = 1, … , m2 (). Then, the MLE of genotype frequencies . Let y,i = 1, … , n, be the observed phenotypic values of individuals carrying the joint genotypes (A1A1, A2A2). We define the observed genotypic group means , for j, k = 1, … , m1 and r, s = 1, …, m2. Without distinguishing the origin of parental alleles, we assume that n = n for k > j, j, k = 1, … , m1, and n = n for r > s, r, s = 1, … , m2. Let for j, k = 1, … , m1, and for r, s = 1, … , m2. We have the MLE of allele frequencies for j = 1, … , m1 at locus 1, and for r = 1, … , m2 at locus 2. For notation simplicity, we define ȳ = ȳ for k > j (j, k = 1, … , m1) and ȳ = ȳ for r > s (r, s = 1, … , m2). In addition, we define the weighted genotypic group means , , , , , and , for j, k = 1, … , m1 and r, s = 1, … , m2. The weighted overall mean . Let G = E(G(g)|g = A1A1, A2A2) be the expected genotypic value of individuals with the joint genotypes (A1A1, A2A2), for j, k = 1, …, m1 and r, s = 1, …, m2. Without distinguishing the origin of parental alleles, we assume that {G} satisfy the symmetric properties: G = G = G = G, for j < k and r < s. In general, there are totally m1m2(m1 + 1)(m2 + 1)/4 possible distinctive expected genotypic values. By treating the paternal and maternal gametes as two independent risk factors, the Fisher's two-locus ANOVA model for the expected genotypic values G can be written as (see Kempthorne, 1957) for j, k = 1, …, m1 and r, s = 1, …, m2. The parameters α1 and δ1 are referred as the average additive and dominance effects at locus 1, α2 and δ2 the average additive and dominance effects at locus 2, (αα) the additive by additive interactions, (αδ) the additive by dominance interactions, (δα) the dominance by additive interactions, and (δδ) the dominance by dominance interactions. Still, due to an over-parameterization of the model for the expected genotypic values, not all the model parameters are estimable. From the symmetric property of the expected genotypic values, we usually assume that δ1 = δ1, δ2 = δ2, (αδ) = (αδ), (δα) = (δα), (δδ) = (δδ) = (δδ). In addition, the following constraints need to be added on the model parameters (Gallais, 1974; Weir and Cockerham, 1977). In other words, a weighted sum of the average genetic effects of a genetic component is zero over any index. It has been known that model (7) under constraints (8) can provide an orthogonal partition on the genetic variance components when the inheritance of four paternal and maternal alleles at the two loci are independent (Kempthorne, 1957; Weir and Cockerham, 1977). Still, as pointed out in Wang (2014), it is difficult to estimate the parameters in model (7) under the complicated constraints (8) when other adjusted covariates are involved. Besides, the random variables that constitute the random sources of the genetic variance components are not explicitly defined in model (7). To model the expected genotypic values through a GLM, we can re-list all the sampled individuals by an index variable k = 1, … , N. Similar to the one-locus case, we introduce the indicator variables and for inheritance of the paternal and maternal alleles at locus 1, and and for inheritance of the paternal and maternal alleles at locus 2. Then, for phase-unknown genotypes, we define the genotype coding variables w1, v1 for j, k = 1, … , m1 (j ≤ k) at locus 1, and w2, v2 for r, s = 1, … , m2 (r ≤ s) at locus 2, in the same way as we did in the one-locus case. By including the within locus additive and dominance effects as well as the locus-by-locus interactions (i.e., epistases), a fully parameterized two-locus GLM model can be written as (Wang, 2011), for i = 1, …, N. A nice feature of the above GLM is that we can easily establish the relationship between its model parameters and the expected genotypic values by starting from the lowest order parameter μ0 = G. Suppose that there are no empty genotypes; i.e., n > 0 for any j, k = 1, … , m1 and r, s = 1, … , m2. When we incorporate the above model (9) into a GLM (1) and ignore the adjusted covariates, the LSE of parameters in model (9) can be derived as shown in Appendix C in Supplementary Material. Similar to the one-locus GLM, the LSE of these fixed allelic effects are highly dependent upon the genotypic group means, which could be sensitive to phenotypic outliers in small genotypic groups. More importantly, model (9) cannot provide the same partition of the expected genotypic variance V(E(G|g)) as the one defined in the original Fisher's ANOVA model (7). To construct a two-locus GMA model for analysis of genetic variance components, we can introduce the mean-corrected index variables for j, k = 1, … , m1 at locus 1, and for r, s = 1, … , m2 at locus 2. Then we define the modified genotype coding variables , for j, k = 1, … , m1 at locus 1, and , for r, s = 1, … , m2 at locus 2 in the same way as we did in the one-locus case. By including the within locus additive and dominance effects as well as the locus-by-locus interactions (i.e., epistases), we can build a fully parameterized two-locus GMA model as the following. for i = 1, …, N. Similar to the one-locus case, the original Fisher's ANOVA model (7) under constraints (8) and the symmetric properties of the dominance effects can be re-written as Model (10) is actually a simplified version of model (11) by further removing the redundant parameters. The two models (10) and (11) are equivalent when we take for j, k = 1, … , m1 − 1 (j ≤ k) and r, s = 1, … , m2 − 1 (r ≤ s). Therefore, both model (10) and (11) share exactly the same genetic components as the ones defined in the original Fisher's two-locus ANOVA model (7). In model (10), let , , , , , , and , which represent the additive and dominance at locus 1, additive and dominance at locus 2, additive by additive, additive by dominance, dominance by additive and dominance by dominance genetic components, respectively. Each genetic component consists of a set of average allelic effects (additive or interaction) contributed by alleles from the same locus or the same set of loci, and at each locus at most two alleles could be involved. We refer the number of alleles involved in each average allelic effect of a genetic component as the order of the genetic component, and the number of non-redundant average allelic effects involved in a genetic component as the degrees of freedom (df) of the genetic component. Note that genetic components of the same order may have different df. For example, the additive by additive component A1A2 has its order being 2 and df = (m1 − 1)(m2 − 1), while the dominance component D1 at locus 1 has its order being 2 and . Both the GLM (9) and GMA model (10) provide a re-parameterization of the expected genotypic values. From the relationship between their genotype coding variables, these two models can be transformed equivalently from one to the other. We can represent the parameters in GMA model (10) in terms of the parameters in its equivalent GLM (9) as shown in Appendix D in Supplementary Material. It is interesting to see that an average allelic effect can be affected by not only its corresponding fixed allelic effect but also other higher-order fixed allelic effects. As a result, the hypothesis test of an average allelic effect in the GMA model (10) is not equivalent to the hypothesis test of the corresponding fixed allelic effect in an equivalent GLM (9) when higher order fixed allelic effects are present. From Appendices C, D in Supplementary Material, we can also derive the LSE of parameters for the GMA model (10) as the following and , for j, k = 1, …, m1 − 1, j ≤ k; and r, s = 1, …, m2 − 1, r ≤ s. Similar to the one-locus GMA model, the LSE of these average allelic effects depend on the weighted genotypic group means. Except the LSE of the highest order effects for dominance by dominance interactions, the LSE of other lower order average allelic effects could be more robust to phenotypic outliers in small genotypic groups than the LSE of their corresponding fixed allelic effects. When the two markers are unlinked and in HWE, the inheritance of four paternal and maternal alleles at the two loci are independent. Therefore, the four sets of indicator variables , , , and are independent with each other, although the variables within each set could still be correlated. In this case, all the genetic components are independent of each other, and the GMA model (10) can provide an orthogonal partition of the expected genotypic variance. In Wang (2014), we have derived formulas for computing the genetic variance components based on the model parameters in a general multi-locus GMA model. For the two-locus GMA model (10), by plugging in the LSE of its model parameters, we obtain estimators of the genetic variance components as shown in Appendix E in Supplementary Material. It should be pointed out that Weir and Cockerham (1977) also derived estimates of the genetic variance components based on the model parameters in Fisher's two-locus ANOVA model (7). But they did not construct the estimators of genetic variance components in terms of the weighted genotypic group means. Note that the orthogonal partition on the genetic variance components implies that the genetic components constitute independent random sources contributing to the expected genotypic variance. Therefore, we could estimate and test for each genetic component separately. Bonferroni criterion can also be applied to correct for the association testing of multiple genetic components. When the inheritance of paternal and maternal alleles at the two loci are dependent, non-zero covariances among different genetic components may present. The dependency among inheritance of the paternal and maternal alleles at the two markers can be measured by D = p/(2 − 1{)(2 − 1{)−pppp, for j, k = 1, … , m1 and r, s = 1, … , m2. Let . Similar to the one-locus case, when for j, k = 1, … , m1 and r, s = 1, … , m2, we can show that the LSE of the average additive, dominance, additive by additive, additive by dominance, dominance by additive and dominance by dominance effects in the full model (10) can keep consistent when some components are excluded from the model. In general, we can always incorporate the two-locus GMA model (10) into a regression model (1). By treating the modified genotype coding variables , , , and as regular fixed covariates, we can fit the regression model using the ordinary LS approach. Based on the fitted model, we can calculate the fitted genetic components for each individual i = 1, … , N. Then the genetic variance components and their possible covariances can be estimated as the sample variances and covariances from the fitted values of these genetic components. Similar to the one-locus case, these estimators of the variance and covariance components are different from the traditional ANOVA estimators when the two marker are linked or their genotypes are in HWD. As the estimators of variance components coming from the sample variances, they are guaranteed to be non-negative. When the model residuals are normally distributed, these estimators of the variance and covariance components become MLE and likely possess the asymptotic normality property. Besides, these variance and covariance estimators are likely asymptotically unbiased.

2.3. A simulation example

Consider two biallelic marker loci with allele frequencies p1 = 0.4 and q1 = 0.6 for alleles “1” and “0” at locus 1, and p2 = 0.2 and q2 = 0.8 for alleles “1” and “0” at locus 2. Assume that the two marker loci are in linkage and gametic equilibria. The expected genotypic values at the two marker loci are simulated based on a two-locus GLM model (9) with μ0 = 10 and a11 = a21 = d111 = d211 = (aa)11 = 1 and (ad)1, 11 = (da)11, 1 = (dd)11, 11 = 0. From Appendix D in Supplementary Material, we can show that this GLM is equivalent to a GMA model (10) with μ* = 11.72 and , , , , , , , and . Based on the true expected genotypic values and the genotypic distribution, we also know that the total expected genotypic variance, which is V = 3.09, has an orthogonal partition which consists of eight variance components contributed by . Besides, these eight components contribute 50.74, 1.87, 41.53, 0.83, 5.03, 0.00, 0.00, 0.00% of the total expected genotypic variance V. For each random sample of size n, we first generate genotypes of individuals independently based on the genotypic distribution. Then, based on the genotypes, we create dummy variables w1 = w11, v1 = v111, w2 = w21, v2 = v211, ww = w11*w21, wv = w11*v211, vw = v111*w21, vv = v111*v211 and mean-corrected index variables . The phenotypic values is a sum of the genotypic values and the residuals with the residuals being independent of the genotypes. To generate phenotypic values, we assume that the residuals ϵ, i = 1, … , n, are i.i.d. and normally distributed with the residual variance being Vϵ = 17.51, which account for 85% of the total phenotypic variance (i.e., the broad sense heritability is about 15%). First, we show that the GLM and GMA models provide different partitions of the expected genotypic variance. To minimize the sampling variation, we simulate a random sample with n = 100,000. We fit both GLM and GMA models to the random sample using SAS “proc glm” (SAS Institute, INC.). For the fitted GLM, we have LSE , â11 = 1.01, â21 = 1.05, , , , , and , which are close to the true values. From the fitted GLM, we calculate estimates of the variance components contributed by w1, v1, w2, v2, ww, wv, vw, vv as 0.4894, 0.1274, 0.3533, 0.0345, 0.4153, 0.00, 0.00, 0.00, respectively. The summation of these variance components is 1.4199, which is about 46% of the total expected genotypic variance estimate V = 3.0896. In other words, in the GLM based partition of the expected genotypic variance, about 54% is contributed by the covariance components due to the collinearity among the variables w1, v1, w2, v2, ww, wv, vw, vv. For the fitted GMA, we have LSE , , , , , , , and , which are also close to the true values. From the fitted GMA model, we compute estimates of the variance components contributed by , vw*, vv* as 1.5321, 0.0534, 1.3079, 0.0244, 0.1462, 0.00, 0.00, 0.00, respectively. The summation of these variance components is 3.064, which is about 99% of the total expected genotypic variance estimate V = 3.0896. This indicates that the GMA model leads to a partition of the expected genotypic variance, which is almost orthogonal with only about 1% coming from the covariance components. Note also that these estimates of the variance components contributed by , vw*, vv* account for approximately 49.59, 1.73, 42.33, 0.79, 4.73, 0.00, 0.00, 0.00% of the expected genotypic variance estimate V = 3.0896, which are close to the true proportions 50.74, 1.87, 41.53, 0.83, 5.03, 0.00, 0.00, 0.00% of the variables , vw*, vv* contributed to the expected genotypic variance. We also look at the difference between the partition of the expected genotypic variance and the Type III sums of squares for this random sample. Both the GLM and GMA models have the same regression sum of squares SSR = 307788.89 and mean square error MSE = 17.52. In the fitted GLM, the Type III sums of squares for the eight variables w1, v1, w2, v2, ww, wv, vw, vv are 13415.25, 3475.46, 8472.60, 840.54, 4149.25, 0.24, 1.04, and 0.67, respectively. The summation of these Type III sums of squares is 30355.05, which is less than 10% of the total SSR. In the fitted GMA, the Type III sums of squares for the eight variables are 153184.54, 5340.61, 130751.17, 2438.75, 14620.38, 2.92, 0.44, and 0.67, respectively. The summation of these Type III sums of squares is 306339.50, which is about 99.5% of the total SSR. This indicates that these Type III sums of squares also provide an orthogonal partition of the SSR. Or, in other words, the hypothesis tests of the eight genetic components are approximately orthogonal. Next, we examine the performance in model selection between using the GLM and GMA models. We consider varied sample sizes of n = 500, 1000, 2000, and 5000. Under each simulation scenario, we run 1000 simulation. For each simulation sample, we first apply one commonly used method—the forward stepwise selection for model selection on {w1, v1, w2, v2, ww, wv, vw, vv} and , separately. We run “proc glmselect” in SAS with the criterion p=0.05 for both entry and stay in the model. For the GMA model selection on index variables , as these variables are independent in the underlying true model, we can rank them in the order of , and vv* according to their variance contributions from the largest to the smallest. Intuitively, we expect that the selected models would include the higher ranked variables more often than the lower ranked ones. We mainly focus on the top five variables and classify the selected models into 10 categories: I = all the five variables are selected; II = are selected but not ; III = are selected but not ; IV= are selected but not ; V= are selected but not ; VI = are selected but not ; VII = are selected but not ; VIII = ww* is selected but are missed; IX= is selected but miss , or is selected but miss ; X= are missed. For the GLM model selection on the dummy variables {w1, v1, w2, v2, ww, wv, vw, vv}, we also define the similar 10 categories I-VIII based on the variables w1, v1, w2, v2, ww, wv, vw, vv. Under each simulation scenario, the counts of selected GLM and GMA model types for stepwise selection are listed in Table 1.

Table 1

Counts of selected GLM and GMA model types for “Stepwise Selection”.

Model types	GLM				GMA
Model types	n = 500	n = 1000	n = 2000	n = 5000	n = 500	n = 1000	n = 2000	n = 5000
I	0	15	196	605	13	71	285	737
II	3	132	401	342	118	294	433	239
III	0	27	70	33	43	98	99	22
IV	89	266	176	9	370	361	160	2
V	288	213	71	11	456	176	23	0
VI	247	140	47	0	0	0	0	0
VII	82	77	9	0	0	0	0	0
VIII	265	129	30	0	0	0	0	0
IX	26	1	0	0	0	0	0	0
X	0	0	0	0	0	0	0	0
Total	1000	1000	1000	1000	1000	1000	1000	1000

Counts of selected GLM and GMA model types for “Stepwise Selection”. For the stepwise selection, the GMA model shows a clear advantage over the GLM on choosing the Type I (the true model), Type I+II, Type I+II+III, Type I+II+III+IV, or Type I-V models. Meanwhile, the GLM tends to miss one of the main fixed effects (Type VI, VII, or VIII) when the sample size ≤ 2000, even though a Type VI, VII, or VIII GLM could imply an equivalent GMA model of Type IV. Overall, as the sample size increases, the model selection improves using either GLM or GMA models. But the selected GLM have a bigger variety, while the selected GMA models are more predictable. It has been known that the conventional stepwise model selection has problems such as the stepwise F-statistics may no longer follow the F-distributions and the aggravated collinearity among the selected variables (Harrell, 2001). Several regularized regression methods have been developed to deal with the model selection problem especially for high-dimensional data. Here we also apply one of the regularized regression methods - the adaptive LASSO, using “proc glmselect” in SAS. We adopt the Bayesian information criterion (BIC) for model selections. The standardization of the variables is also made prior to the model selection. Note that the standardization of the variables {w1, v1, w2, v2, ww, wv, vw, vv} or can only affect the scales of the regression coefficients but does not change the correlation structure among each set of variables, while making mean-corrections on the indicator variables and can lead to different correlation structures between the dummy variables {w1, v1, w2, v2, ww, wv, vw, vv} and the index variables . The counts of selected GLM and GMA model types for adaptive LASSO are listed in Table 2.

Table 2

Counts of selected GLM and GMA model types for “Adaptive LASSO”.

Model types	GLM				GMA
Model types	n = 500	n = 1000	n = 2000	n = 5000	n = 500	n = 1000	n = 2000	n = 5000
I	1	18	91	363	4	17	58	367
II	17	87	246	372	33	89	226	346
III	8	18	41	19	17	20	38	13
IV	87	180	226	153	194	336	371	197
V	271	227	125	25	602	432	237	55
VI	184	212	150	46	0	0	0	0
VII	39	34	13	0	0	0	0	0
VIII	227	145	71	10	0	0	0	0
IX	61	17	6	1	8	0	0	0
X	105	62	31	11	142	106	70	22
Total	1000	1000	1000	1000	1000	1000	1000	1000

Counts of selected GLM and GMA model types for “Adaptive LASSO”. For the adaptive LASSO, it appears that the GMA model performs slightly better on choosing Type I (the true model), Type I+II, Type I+II+III, Type I+II+III+IV, or Type I-V models when sample size=500. However, this advantage diminishes for larger sample sizes. Still, the selected GLM appear to have a bigger variety, while the selected GMA models are more predictable. On the other hand, it seems that the selected GMA models could miss all the three top variables more likely than the GLM models for w1, w2, ww. A brief comparison between Tables 1, 2 also reveals that the stepwise selection has a better performance than the adaptive LASSO under our simulation setting. But this may not be a fair comparison as the selected variables in our simulation setting is very limited. Besides, except BIC, many other criteria are available for model selections in “proc glmselect.” Further exploration on those criteria is needed.

3. Discussion

In this study, we applied the GMA models on analysis of variance components for genetic markers with unphased genotypes. We pointed out that the traditional Fisher's ANOVA model does not explicitly specify the random variables that contribute to the various genetic variance components. The constraints on their model parameters also complicate the model fitting. Meanwhile, the classical dummy-variable based GLM does not provide the same partition on the genetic variance components as the original Fisher's ANOVA model. The hypothesis tests of the fixed allelic effects in a GLM based on the partial (Type III) reduction in sums of squares could also be inadequate for testing the existence of variance components when allelic interactions are present. Alternatively, the GMA model can retain the same partition on the genetic variance components as the traditional Fisher's ANOVA model. Similar to the classical GLM, the GMA model does not require constraints on its model parameters and can be fitted using the ordinary least square approach. As the result, the GMA model allows us to estimate and test for the genetic variance components more conveniently than the classical GLM. The classical GLM is appealing in genetic association studies due to its simplicity in interpretation of the model parameters, which often represent certain comparisons of the expected genotypic values in different genotype groups. However, the GLM-based approach faces challenges in dealing with allelic interactions because the fixed lower order allelic effects are often confounded with the higher order allelic interactions in a GLM even when the paternal and maternal alleles are independently inherited. In order to test for a particular fixed allelic interaction based on the classical GLM, we need to include all its lower order effects in the model to make this allelic interaction interpretable. Besides, ignoring certain higher order interactions in the model could also affect the definition of this particular allelic interaction due to their potential confounding, as pointed out in Zeng et al. (2005). On the other hand, analysis of genetic variance components using ANOVA type models such as GMA provides an alternative way on assessing the allelic effects and interactions. A nice feature of the GMA model is that the genetic components are independent of each other when multiple genetic markers are unlinked and in equilibrium populations. Therefore, at least in equilibrium populations, each genetic component can be treated as a random source of variation and tested individually regardless of its lower or higher order genetic components. A statistically significant higher-order genetic variance component implies a variation contributed by varied allele types from a set of loci, which could be more perceivable than a significant fixed allelic interaction for claiming allelic interactions. As shown in Wang (2014), the extension of GMA to multiple loci is straightforward. The GMA model could also be applied to other phenotypic outcomes. For example, under the generalized linear mixed model framework, we can consider the genetic components contributing to a g-transformed (g - a link function) expected outcomes. For survival outcomes, we can examine the genetic components contributing to the hazard functions as well. However, it should be pointed out that the estimation of genetic variance components for genetic markers relies on the genotypic distribution of the markers in a study population. When a sample does not come from the simple random sampling, an adjustment for the sampling strategy is needed in order to make appropriate statistical inference in the general population. Currently, the genome-wide association studies (GWAS) often adopt the case-control design. When the case and control sampling rates are known, we could assess the variance components of genetic markers based on their odds ratio estimates. A thorough exploration on applying the GMA type models to a case-control based GWAS could be a research topic in the future.

Author contributions

TW planned the study, conducted the derivation, and wrote the manuscript.

Conflict of interest statement

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The reviewer ZY and handling editor declared their shared affiliation, and the handling editor states that the process nevertheless met the standards of a fair and objective review.

5 in total

Analysis of Variance Components for Genetic Markers with Unphased Genotypes.

1. Introduction

2. Models and results

2.1. One-locus models

2.2. Two-locus models

2.3. A simulation example

3. Discussion

Author contributions

Conflict of interest statement

1. Modeling quantitative trait Loci and interpretation of models.

2. Covariances between arbitrary relatives with linkage and epistasis in the case of linkage disequilibrium.

3. On coding genotypes for genetic markers with multiple alleles in genetic association study of quantitative traits.

4. Contribution of genetic effects to genetic variance components with epistasis and linkage disequilibrium.

5. A revised Fisher model on analysis of quantitative trait loci with multiple alleles.