A certain invariance property of BLUE in a whole-genome regression context.

Daniel Gianola, Rohan L Fernando, Dorian J Garrick.

Abstract

A curious result from mixed linear models applied to genome-wide association studies was expanded. In particular, a model in which one or more markers are considered as fixed but are allowed to contribute to the covariance structure by treating such markers as random as well was examined. The best linear unbiased estimator of marker effects is invariant with respect to whether those markers are employed in constructing a genomic relationship matrix or are ignored, provided marker effects are uncorrelated with those not being tested. Also, the implications of regarding some marker effects as fixed when, in fact, these possess a non-trivial covariance structure with those declared as random were examined.


Year: 2019; PMID: 30614572; PMCID: PMC6850311; DOI: 10.1111/jbg.12378

Source DB: PubMed; Journal: J Anim Breed Genet; ISSN: 0931-2668; Impact factor: 2.380

INTRODUCTION

In genome-wide association studies (GWAS; e.g., Manolio et al., 2009), an objective is to find statistical connections between molecular markers and genomic regions affecting some complex trait. A linear regression model including marker genotype codes as covariates is used, and the simplest version fits a single-marker regression via ordinary least squares (OLS). If aggregation or clustering due to familial or molecular similarity exists in the data, a better estimation approach is generalized least squares (GLS), as it accommodates a more general covariance structure than OLS (Aulchenko, de Koning, & Haley, 2007; Gianola, Fariello, Naya, & Schön, 2016; Hoffman, 2013; Kang et al., 2008; Listgarten et al., 2012; Yang, Zaitlen, Goddard, Visscher, & Price, 2014). One such structure results from declaring all or a subset of marker effects as random variables, for example, assuming that $\boldsymbol{\beta} \sim (\mathbf{0}, \mathbf{I}\sigma^2_\beta)$, with all marker effects in the set taken as independently and identically distributed. A random effects specification induces marker-based measures of similarity among individuals called molecular "relationship" or "kinship" matrices ($\mathbf{G}$). The marker(s) evaluated for association is (are) treated as fixed effect(s), and a test of nullity of effects on a trait is based on well-established procedures. Should the marker(s) being tested be included or excluded when building $\mathbf{G}$? A priori, if a marker is declared random it cannot be fixed, and vice versa. Including the contribution of a marker to $\mathbf{G}$ while declaring it as fixed constitutes a form of "double counting." When the number of markers is very large and a single-marker regression is used, the impact on $\mathbf{G}$ of including or removing the marker is tiny. Listgarten et al. (2012) suggest that markers being tested should be removed from $\mathbf{G}$, followed by a concomitant re-estimation of necessary variance components at each instance of testing.
This approach is computationally taxing, especially when the number of markers is huge, as is the case with DNA sequence data. In many situations, it may be reasonable to assume that variance component estimates are affected only mildly by including or excluding the tested marker in the relationship matrix $\mathbf{G}$. For many complex traits in animal and plant breeding, each of the numerous markers in a chip has a small effect on both mean and variance of the data distribution. Gianola et al. (2016) showed that the best linear unbiased estimator (BLUE, also GLS) of the fixed effect of a marker (or sets of markers) examined in GWAS is invariant with respect to whether or not the marker(s) tested for association is (are) included in the construction of $\mathbf{G}$, provided that variance components are assumed constant. This short communication expands on the preceding, as follows. First, we provide an expression that gives the variance–covariance matrix of the BLUE of each of the marker effects being tested using a simple adjustment. Second, it is shown that the best linear unbiased predictor (BLUP) of effects treated both as fixed and random is exactly zero, provided that no covariance exists between such effects and other marker effects treated as random in the model. Third, it is shown that if such covariance is not null, the fixed effects of a set of markers affect phenotypes through direct and indirect paths, over and above the impact of linkage disequilibrium captured by columns of the matrix of genotype codes.

MODEL

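The linear regression model (assume that nuisance location effects have been eliminated somehow) used in GWAS is often posed as

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}, \tag{1}$$

where $\mathbf{y}$ is an $n \times 1$ vector of phenotypes, $\mathbf{X}$ is an $n \times p$ matrix of marker genotype codes, $\boldsymbol{\beta}$ is a $p \times 1$ vector of allelic substitution effects and $\mathbf{e} \sim (\mathbf{0}, \mathbf{I}\sigma^2_e)$ is a residual vector, where $\sigma^2_e$ is the variance of the distribution of model residuals. Let $\mathbf{X} = [\mathbf{X}_1 \; \mathbf{X}_2]$, where $\mathbf{X}_1$ is $n \times p_1$ (without loss of generality assume that $\mathbf{X}_1$ has full column rank), and $\mathbf{X}_2$ is $n \times p_2$, where $p_2$ may be much larger than $n$. An equivalent representation of (1) is

$$\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \mathbf{e}, \tag{2}$$

and consider two alternative covariance structures for the phenotypes. The first structure results from treating $\boldsymbol{\beta}_1$ as a fixed vector and assuming $\boldsymbol{\beta}_2 \sim (\mathbf{0}, \mathbf{I}\sigma^2_\beta)$:

$$\mathbf{V}_2 = \mathbf{X}_2\mathbf{X}_2'\sigma^2_\beta + \mathbf{I}\sigma^2_e. \tag{3}$$

The GLS estimate of the fixed effect under $\mathbf{V}_2$ is

$$\hat{\boldsymbol{\beta}}_1 = (\mathbf{X}_1'\mathbf{V}_2^{-1}\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{V}_2^{-1}\mathbf{y}. \tag{4}$$

The second covariance structure stems from treating $\boldsymbol{\beta}$ as random, with the assumption $\boldsymbol{\beta} \sim (\mathbf{0}, \mathbf{I}\sigma^2_\beta)$ assigned to all marker effects, but then $\boldsymbol{\beta}_1$ is estimated as if it were fixed. Here, the phenotypic covariance matrix is

$$\mathbf{V} = \mathbf{X}\mathbf{X}'\sigma^2_\beta + \mathbf{I}\sigma^2_e, \tag{5}$$

with the GLS estimator of $\boldsymbol{\beta}_1$, the fixed effect corresponding to $\mathbf{X}_1$, being

$$\tilde{\boldsymbol{\beta}}_1 = (\mathbf{X}_1'\mathbf{V}^{-1}\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{V}^{-1}\mathbf{y}. \tag{6}$$

Note that $\mathbf{V} = \mathbf{V}_2 + \mathbf{X}_1\mathbf{X}_1'\sigma^2_\beta$. Arrays $\mathbf{V}$ and $\mathbf{V}_2$ can be referred to as "similarity" matrices, as in Listgarten et al. (2012).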
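The two covariance structures can be made concrete with a small numerical sketch. Everything below is illustrative and assumed for the example rather than taken from the paper: the sizes `n`, `p1`, `p2`, the variance values, the simulated genotype codes and the helper `gls` are hypothetical. The block builds the similarity matrix that excludes the tested markers and the one that includes them, checks that they differ only by the contribution of the tested markers, and computes the GLS estimate of the tested marker effects under the first structure.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p1, p2 = 50, 2, 200              # hypothetical sizes; p2 > n as in the text
sigma2_b, sigma2_e = 0.05, 1.0      # variance components, assumed known

# Simulated genotype codes (0, 1, 2), centred by column; purely illustrative
X = rng.integers(0, 3, size=(n, p1 + p2)).astype(float)
X -= X.mean(axis=0)
X1, X2 = X[:, :p1], X[:, p1:]

beta = rng.normal(0.0, np.sqrt(sigma2_b), size=p1 + p2)
y = X @ beta + rng.normal(0.0, np.sqrt(sigma2_e), size=n)

# First structure: only the markers in X2 contribute to similarity
V2 = sigma2_b * (X2 @ X2.T) + sigma2_e * np.eye(n)
# Second structure: all markers contribute, including those in X1
V = sigma2_b * (X @ X.T) + sigma2_e * np.eye(n)


def gls(X_fixed, cov, y_obs):
    """GLS estimate (X' cov^-1 X)^-1 X' cov^-1 y, without forming cov^-1."""
    cov_inv_X = np.linalg.solve(cov, X_fixed)
    cov_inv_y = np.linalg.solve(cov, y_obs)
    return np.linalg.solve(X_fixed.T @ cov_inv_X, X_fixed.T @ cov_inv_y)


beta1_hat = gls(X1, V2, y)          # GLS estimate under the first structure
# The two structures differ only by the contribution of the tested markers
decomposition_ok = np.allclose(V, V2 + sigma2_b * (X1 @ X1.T))
print(beta1_hat, decomposition_ok)
```

Solving linear systems instead of inverting the covariance matrix is a design choice made here for numerical stability; it does not change the estimator.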

INVARIANCE PROPERTIES

Best linear unbiased estimation

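Gianola et al. (2016) showed that (4) and (6) are identical; the proof is presented in more detail here. To show this, we employ a model representation where one or more loci with genotypes in $\mathbf{X}_1$ are regarded as possessing both fixed and random effects:

$$\mathbf{y} = \mathbf{X}_1\mathbf{b}_1 + \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \mathbf{e}, \tag{7}$$

where $\mathbf{b}_1$ are the fixed effects of markers in $\mathbf{X}_1$, with $\boldsymbol{\beta}_1 \sim (\mathbf{0}, \mathbf{I}\sigma^2_\beta)$ and $\boldsymbol{\beta}_2 \sim (\mathbf{0}, \mathbf{I}\sigma^2_\beta)$ as before, so that the phenotypic covariance matrix is $\mathbf{V}$. Using the Sherman–Morrison–Woodbury identity (Seber & Lee, 2003) on $\mathbf{V}_2 = \mathbf{V} - \mathbf{X}_1\mathbf{X}_1'\sigma^2_\beta$,

$$\mathbf{V}_2^{-1} = \mathbf{V}^{-1} + \mathbf{V}^{-1}\mathbf{X}_1\left(\mathbf{I}\sigma^{-2}_\beta - \mathbf{X}_1'\mathbf{V}^{-1}\mathbf{X}_1\right)^{-1}\mathbf{X}_1'\mathbf{V}^{-1}. \tag{8}$$

Using (8), and writing $\mathbf{D} = \mathbf{X}_1'\mathbf{V}^{-1}\mathbf{X}_1$ for brevity, $\mathbf{X}_1'\mathbf{V}_2^{-1}\mathbf{X}_1$ can be written as:

$$\mathbf{X}_1'\mathbf{V}_2^{-1}\mathbf{X}_1 = \mathbf{D} + \mathbf{D}\left(\mathbf{I}\sigma^{-2}_\beta - \mathbf{D}\right)^{-1}\mathbf{D} = \sigma^{-2}_\beta\,\mathbf{D}\left(\mathbf{I}\sigma^{-2}_\beta - \mathbf{D}\right)^{-1}. \tag{9}$$

Now, using (9),

$$\left(\mathbf{X}_1'\mathbf{V}_2^{-1}\mathbf{X}_1\right)^{-1} = \sigma^{2}_\beta\,\mathbf{D}^{-1}\left(\mathbf{I}\sigma^{-2}_\beta - \mathbf{D}\right), \tag{10}$$

and $\mathbf{X}_1'\mathbf{V}_2^{-1}\mathbf{y}$ is

$$\mathbf{X}_1'\mathbf{V}_2^{-1}\mathbf{y} = \left[\mathbf{I} + \mathbf{D}\left(\mathbf{I}\sigma^{-2}_\beta - \mathbf{D}\right)^{-1}\right]\mathbf{X}_1'\mathbf{V}^{-1}\mathbf{y} = \sigma^{-2}_\beta\left(\mathbf{I}\sigma^{-2}_\beta - \mathbf{D}\right)^{-1}\mathbf{X}_1'\mathbf{V}^{-1}\mathbf{y}. \tag{11}$$

From (10) and (11), the GLS estimator given in (4) is formed as

$$\hat{\boldsymbol{\beta}}_1 = \left(\mathbf{X}_1'\mathbf{V}_2^{-1}\mathbf{X}_1\right)^{-1}\mathbf{X}_1'\mathbf{V}_2^{-1}\mathbf{y} = \left(\mathbf{X}_1'\mathbf{V}^{-1}\mathbf{X}_1\right)^{-1}\mathbf{X}_1'\mathbf{V}^{-1}\mathbf{y} = \tilde{\boldsymbol{\beta}}_1, \tag{12}$$

where $\tilde{\boldsymbol{\beta}}_1$, also given in (6), is the GLS estimator of $\mathbf{b}_1$ resulting from (7). The preceding shows that the estimator obtained under covariance structure $\mathbf{V}_2$ is identical to the estimator resulting from structure $\mathbf{V}$. The practical implication is that the same similarity matrix, $\mathbf{V}$, can be used for conducting GWAS with either single markers or sets of markers using linear regression models, provided that $\sigma^2_\beta$ is assumed known and kept constant (as well as the residual variance) throughout.

Note that the sampling variance–covariance matrix of the estimates of the fixed effects must be taken under $\mathbf{V}_2$, that is, $\mathrm{Var}(\hat{\boldsymbol{\beta}}_1) = (\mathbf{X}_1'\mathbf{V}_2^{-1}\mathbf{X}_1)^{-1}$. Since $\mathbf{X}_1$ typically has one or a few columns in GWAS, advantage can be taken of (8) for computing $\mathbf{V}_2^{-1}$, as $\mathbf{V}^{-1}$ is obtained only once, whereas $\mathbf{V}_2$ changes with the set of markers included in $\mathbf{X}_1$ and, therefore, in $\mathbf{X}_2$. Further, note from (9) that

$$\left(\mathbf{X}_1'\mathbf{V}_2^{-1}\mathbf{X}_1\right)^{-1} = \left(\mathbf{X}_1'\mathbf{V}^{-1}\mathbf{X}_1\right)^{-1} - \mathbf{I}\sigma^2_\beta, \tag{13}$$

so $\mathbf{V}$ can be used throughout. The preceding representation illustrates the over-statement of uncertainty and "loss of power" incurred by use of $(\mathbf{X}_1'\mathbf{V}^{-1}\mathbf{X}_1)^{-1}$ instead of (13). Yang et al. (2014) present a related discussion and recommend that markers in close linkage disequilibrium with the target marker(s) be removed when building $\mathbf{V}$. Their approach requires re-estimation of variance components at every instance of testing. Our results do not apply under such a strategy, as the variance–covariance structure would be expected to change over markers tested (Listgarten et al., 2012; Yang et al., 2014).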
An alternative could be to include the marker tested and some neighbours in close linkage disequilibrium in $\mathbf{X}_1$, provided that $p_1 < n$ and that no rank deficiency accrues, and then use all markers when building $\mathbf{V}$. A disadvantage of the alternative is the potentially strong collinearity in the set of markers in $\mathbf{X}_1$, producing unstable estimates with large sampling variances.

A caveat must be mentioned. In a genomic best linear unbiased prediction setting (e.g., Legarra, 2016; Van Raden, 2008), a similarity matrix under $\mathbf{V}$ can be constructed as

$$\mathbf{V} = \mathbf{G}\sigma^2_g + \mathbf{I}\sigma^2_e, \tag{14}$$

where $\sigma^2_g$ is the "genomic variance" captured by all available markers; $\mathbf{G}$ is known as the genomic relationship matrix (Van Raden, 2008). Accordingly,

$$\mathbf{V}_2 = \mathbf{G}_2\sigma^2_{g_2} + \mathbf{I}\sigma^2_e, \tag{15}$$

where $\sigma^2_{g_2}$ is the genomic variance marked by the variants included in $\mathbf{X}_2$. Clearly, two maximum-likelihood analyses of variance components, one with a set of markers fixed and removed from the covariance structure, and the other with all markers contributing to similarity, will produce different estimates and interpretations of genomic variance. Estimates of $\sigma^2_g$ must always be interpreted with great care. In a standard random effects model, the variance among effects of levels of a random factor represents a population parameter, with maximum-likelihood estimates of variance components interpreted accordingly. For example, if the random factor is the effect of a paternal half-sib family (a situation known as a "sire" model in animal breeding), the variance among sires, $\sigma^2_s$ say, has the same interpretation irrespective of whether the number of families is 10 or 10,000. However, in a marker-based model with $p > n$, the meaning and estimates of $\sigma^2_\beta$ depend crucially on $p$, as the variance component then acts as a regularization parameter. In the $p > n$ situation, it is typically the case that estimates of $\sigma^2_\beta$ decrease as $p$ increases, and the rate of decrease in $\sigma^2_\beta$ is critical for interpretation of estimates of marker effects in that setting (Gianola, 2013; León-Novelo & Casella, 2012).
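The invariance of the GLS estimator and the variance adjustment discussed in this subsection are algebraic identities, so they can be checked numerically on arbitrary data. The following is a minimal sketch under assumed inputs: the sizes, variance values and simulated genotypes are hypothetical, and the names `C`, `D` and `k` are shorthand introduced for the example. It verifies that the estimator is the same whether or not the tested markers enter the similarity matrix, that the inverse of the reduced covariance matrix can be recovered from the inverse of the full one by inverting only a small matrix, and that the correct sampling variance equals the full-matrix variance minus the marker-effect variance on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p1, p2 = 40, 3, 150              # hypothetical sizes
sigma2_b, sigma2_e = 0.1, 1.0       # variance components, assumed known
k = 1.0 / sigma2_b

X = rng.integers(0, 3, size=(n, p1 + p2)).astype(float)
X -= X.mean(axis=0)
X1, X2 = X[:, :p1], X[:, p1:]
y = rng.normal(size=n)              # identities are algebraic, so any y will do

V2 = sigma2_b * (X2 @ X2.T) + sigma2_e * np.eye(n)   # tested markers excluded
V = V2 + sigma2_b * (X1 @ X1.T)                      # tested markers included

V_inv = np.linalg.inv(V)            # computed once, reused for every test
V2_inv = np.linalg.inv(V2)          # direct inverse, for comparison only
D = X1.T @ V_inv @ X1
C = X1.T @ V2_inv @ X1

# Woodbury update: only a p1 x p1 system is solved per tested marker set
V2_inv_from_V = V_inv + V_inv @ X1 @ np.linalg.solve(
    k * np.eye(p1) - D, X1.T @ V_inv
)

beta1_hat = np.linalg.solve(C, X1.T @ V2_inv @ y)    # GLS under V2
beta1_tilde = np.linalg.solve(D, X1.T @ V_inv @ y)   # GLS under V

var_correct = np.linalg.inv(C)                       # variance under V2
var_naive = np.linalg.inv(D)                         # overstates uncertainty
var_adjusted = var_naive - sigma2_b * np.eye(p1)     # simple adjustment

print(np.allclose(beta1_hat, beta1_tilde))
print(np.allclose(V2_inv, V2_inv_from_V))
print(np.allclose(var_correct, var_adjusted))
```

In a genome scan, the practical gain is that the large inverse is formed once, while each tested marker set only requires operations on a matrix with as many rows as markers being tested.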

Best linear unbiased prediction

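The best linear unbiased predictor of $\boldsymbol{\beta}_1$ in model (7) is

$$\mathrm{BLUP}(\boldsymbol{\beta}_1) = \sigma^2_\beta\,\mathbf{X}_1'\mathbf{V}^{-1}\left(\mathbf{y} - \mathbf{X}_1\hat{\mathbf{b}}_1\right) = \mathbf{0}, \tag{16}$$

with $\hat{\mathbf{b}}_1$ calculated as in (4) or (6). The previous result follows because $\mathbf{X}_1'\mathbf{V}^{-1}\mathbf{X}_1\hat{\mathbf{b}}_1 = \mathbf{X}_1'\mathbf{V}^{-1}\mathbf{y}$ are the GLS equations, implying that $\mathbf{X}_1'\mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}_1\hat{\mathbf{b}}_1) = \mathbf{0}$. Henderson's mixed model equations (MME) can be employed to verify result (16). For model (7) the MME are as follows:

$$\begin{bmatrix} \mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_2 \\ \mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_1 + \mathbf{I}\lambda & \mathbf{X}_1'\mathbf{X}_2 \\ \mathbf{X}_2'\mathbf{X}_1 & \mathbf{X}_2'\mathbf{X}_1 & \mathbf{X}_2'\mathbf{X}_2 + \mathbf{I}\lambda \end{bmatrix} \begin{bmatrix} \hat{\mathbf{b}}_1 \\ \hat{\boldsymbol{\beta}}_1 \\ \hat{\boldsymbol{\beta}}_2 \end{bmatrix} = \begin{bmatrix} \mathbf{X}_1'\mathbf{y} \\ \mathbf{X}_1'\mathbf{y} \\ \mathbf{X}_2'\mathbf{y} \end{bmatrix}, \tag{17}$$

where $\lambda = \sigma^2_e/\sigma^2_\beta$. Subtracting the equations for $\mathbf{b}_1$ from the equations for $\boldsymbol{\beta}_1$ gives $\lambda\hat{\boldsymbol{\beta}}_1 = \mathbf{0}$, implying that $\hat{\boldsymbol{\beta}}_1 = \mathbf{0}$, which verifies (16). Thus, solutions for $\mathbf{b}_1$ and $\boldsymbol{\beta}_2$ from (17) are identical to those from the MME corresponding to a model where $\mathbf{X}_1$ is excluded from forming a similarity matrix. Such a model is

$$\mathbf{y} = \mathbf{X}_1\mathbf{b}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \mathbf{e}. \tag{18}$$

The result is easy to verify empirically (a reviewer pointed out that it is probably well known by scientists working in genetic evaluation computations) but, to our knowledge, has not been reported in the literature. A "mechanistic" explanation for (16) is that BLUP of random effects with null means depends on $\mathbf{y}$ through error contrasts that have a null mean vector. So, if a locus is included in a model as a fixed effect, the error contrasts used for BLUP do not possess any information on the effects at such locus. Thus, if any factor (e.g., a marker locus) is included in the model both as fixed and random, the BLUP of the random effect will depend entirely on the "prior" (Bayesian view), and, as shown above, it will be identically equal to 0.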
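The empirical verification mentioned above can be sketched by solving the mixed model equations directly. The block below is an illustration under assumed inputs (hypothetical sizes, variance values and simulated genotypes; the names `b1`, `beta1` and `beta2` are chosen for the example). It fits the tested markers both as fixed and as random, and compares the solutions with those from a model in which the tested markers are fixed only.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p1, p2 = 30, 2, 60               # hypothetical sizes
sigma2_b, sigma2_e = 0.2, 1.0       # variance components, assumed known
lam = sigma2_e / sigma2_b           # variance ratio added to random-effect blocks

X = rng.integers(0, 3, size=(n, p1 + p2)).astype(float)
X -= X.mean(axis=0)
X1, X2 = X[:, :p1], X[:, p1:]
y = rng.normal(size=n)

# Full model: X1 enters twice, once as fixed (b1) and once as random (beta1)
W_full = np.hstack([X1, X1, X2])
penalty_full = np.concatenate([np.zeros(p1), np.full(p1 + p2, lam)])
lhs_full = W_full.T @ W_full + np.diag(penalty_full)
sol_full = np.linalg.solve(lhs_full, W_full.T @ y)
b1 = sol_full[:p1]                  # fixed effects of tested markers
beta1 = sol_full[p1:2 * p1]         # BLUP of their random counterpart
beta2 = sol_full[2 * p1:]           # BLUP of remaining marker effects

# Reduced model: X1 fixed only, X2 random
W_red = np.hstack([X1, X2])
penalty_red = np.concatenate([np.zeros(p1), np.full(p2, lam)])
lhs_red = W_red.T @ W_red + np.diag(penalty_red)
sol_red = np.linalg.solve(lhs_red, W_red.T @ y)

print(np.allclose(beta1, 0.0))              # random copy is predicted as zero
print(np.allclose(b1, sol_red[:p1]))        # same fixed-effect solutions
print(np.allclose(beta2, sol_red[p1:]))     # same predictions for X2 markers
```

Although the duplicated columns make the cross-product matrix singular, the variance ratio added to the random-effect blocks keeps the coefficient matrix positive definite as long as the tested-marker columns have full rank, so a direct solve is adequate here.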

Interdependent sets of marker effects

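All elements of $\boldsymbol{\beta}$ were assumed independent and identically distributed, but the result holds for more complex dependency structures. Markers included in $\mathbf{X}_1$ may be in linkage disequilibrium with those in $\mathbf{X}_2$, and the phenomenon is encoded by correlations between columns of $\mathbf{X}$. Further, models have been suggested that include dependencies among marker effects (e.g., Gianola, Pérez-Enciso, & Toro, 2003). Suppose that $\boldsymbol{\beta}_1$ and $\boldsymbol{\beta}_2$ are independently distributed, with $\boldsymbol{\beta}_1 \sim (\mathbf{0}, \mathbf{B}_{11})$ and $\boldsymbol{\beta}_2 \sim (\mathbf{0}, \mathbf{B}_{22})$, where $\mathbf{B}_{11}$ and $\mathbf{B}_{22}$ are non-singular. Here, the MME for the situation in which the effect of $\mathbf{X}_1$ is treated as both fixed and random take the form

$$\begin{bmatrix} \mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_2 \\ \mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_1 + \mathbf{B}_{11}^{-1}\sigma^2_e & \mathbf{X}_1'\mathbf{X}_2 \\ \mathbf{X}_2'\mathbf{X}_1 & \mathbf{X}_2'\mathbf{X}_1 & \mathbf{X}_2'\mathbf{X}_2 + \mathbf{B}_{22}^{-1}\sigma^2_e \end{bmatrix} \begin{bmatrix} \hat{\mathbf{b}}_1 \\ \hat{\boldsymbol{\beta}}_1 \\ \hat{\boldsymbol{\beta}}_2 \end{bmatrix} = \begin{bmatrix} \mathbf{X}_1'\mathbf{y} \\ \mathbf{X}_1'\mathbf{y} \\ \mathbf{X}_2'\mathbf{y} \end{bmatrix}. \tag{19}$$

Subtracting the $\mathbf{b}_1$ from the $\boldsymbol{\beta}_1$ equations gives $\hat{\boldsymbol{\beta}}_1 = \mathbf{0}$, implying that the MME reduce to

$$\begin{bmatrix} \mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_2 \\ \mathbf{X}_2'\mathbf{X}_1 & \mathbf{X}_2'\mathbf{X}_2 + \mathbf{B}_{22}^{-1}\sigma^2_e \end{bmatrix} \begin{bmatrix} \hat{\mathbf{b}}_1 \\ \hat{\boldsymbol{\beta}}_2 \end{bmatrix} = \begin{bmatrix} \mathbf{X}_1'\mathbf{y} \\ \mathbf{X}_2'\mathbf{y} \end{bmatrix}, \tag{20}$$

which provide $\hat{\mathbf{b}}_1$ and $\hat{\boldsymbol{\beta}}_2$ under a model where $\mathbf{b}_1$ is fixed and $\boldsymbol{\beta}_2$ is random. Note that, within a block, marker effects can be correlated or uncorrelated.

Allow now for a covariance structure between the two sets of random effects, and let

$$\mathrm{Var}\begin{bmatrix} \boldsymbol{\beta}_1 \\ \boldsymbol{\beta}_2 \end{bmatrix} = \mathbf{B} = \begin{bmatrix} \mathbf{B}_{11} & \mathbf{B}_{12} \\ \mathbf{B}_{21} & \mathbf{B}_{22} \end{bmatrix}, \qquad \mathbf{B}^{-1} = \begin{bmatrix} \mathbf{B}^{11} & \mathbf{B}^{12} \\ \mathbf{B}^{21} & \mathbf{B}^{22} \end{bmatrix}.$$

The mixed model equations where the effect of $\mathbf{X}_1$ is treated as both fixed and random become

$$\begin{bmatrix} \mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_2 \\ \mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_1 + \mathbf{B}^{11}\sigma^2_e & \mathbf{X}_1'\mathbf{X}_2 + \mathbf{B}^{12}\sigma^2_e \\ \mathbf{X}_2'\mathbf{X}_1 & \mathbf{X}_2'\mathbf{X}_1 + \mathbf{B}^{21}\sigma^2_e & \mathbf{X}_2'\mathbf{X}_2 + \mathbf{B}^{22}\sigma^2_e \end{bmatrix} \begin{bmatrix} \hat{\mathbf{b}}_1 \\ \hat{\boldsymbol{\beta}}_1 \\ \hat{\boldsymbol{\beta}}_2 \end{bmatrix} = \begin{bmatrix} \mathbf{X}_1'\mathbf{y} \\ \mathbf{X}_1'\mathbf{y} \\ \mathbf{X}_2'\mathbf{y} \end{bmatrix}. \tag{21}$$

Subtracting the $\mathbf{b}_1$ from the $\boldsymbol{\beta}_1$ equations produces $\mathbf{B}^{11}\hat{\boldsymbol{\beta}}_1 + \mathbf{B}^{12}\hat{\boldsymbol{\beta}}_2 = \mathbf{0}$, that is, $\hat{\boldsymbol{\beta}}_1 = \mathbf{B}_{12}\mathbf{B}_{22}^{-1}\hat{\boldsymbol{\beta}}_2$, which is not null unless $\mathbf{B}_{12} = \mathbf{0}$, contradicting the model assumption. Hence, when $\mathbf{B}_{12} \neq \mathbf{0}$, use of $\mathbf{V}$ or $\mathbf{V}_2$ produces distinct sets of generalized least-squares solutions, so the result for the independence case does not hold here.

If two random vectors are not independent, fixing the value of one such vector (Listgarten et al., 2012, call this "conditioning") must alter the distribution of the other vector. Under multivariate normality, one can write $\boldsymbol{\beta}_2 = \mathbf{B}_{21}\mathbf{B}_{11}^{-1}\boldsymbol{\beta}_1 + \boldsymbol{\delta}$, where $\boldsymbol{\delta} \sim N(\mathbf{0}, \mathbf{B}_{22.1})$ with $\mathbf{B}_{22.1} = \mathbf{B}_{22} - \mathbf{B}_{21}\mathbf{B}_{11}^{-1}\mathbf{B}_{12}$. The model under fixed $\boldsymbol{\beta}_1$ becomes

$$\mathbf{y} = \left(\mathbf{X}_1 + \mathbf{X}_2\mathbf{B}_{21}\mathbf{B}_{11}^{-1}\right)\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\delta} + \mathbf{e} = \mathbf{X}_1^{*}\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\delta} + \mathbf{e}, \tag{22}$$

where $\mathbf{X}_1^{*} = \mathbf{X}_1 + \mathbf{X}_2\mathbf{B}_{21}\mathbf{B}_{11}^{-1}$. Under this specification, the phenotypic covariance matrix is $\mathbf{V}_\delta = \mathbf{X}_2\mathbf{B}_{22.1}\mathbf{X}_2' + \mathbf{I}\sigma^2_e$, and $\hat{\boldsymbol{\beta}}_1$ should be computed as $\hat{\boldsymbol{\beta}}_1 = (\mathbf{X}_1^{*\prime}\mathbf{V}_\delta^{-1}\mathbf{X}_1^{*})^{-1}\mathbf{X}_1^{*\prime}\mathbf{V}_\delta^{-1}\mathbf{y}$. Likewise,

$$\hat{\boldsymbol{\delta}} = \mathbf{B}_{22.1}\mathbf{X}_2'\mathbf{V}_\delta^{-1}\left(\mathbf{y} - \mathbf{X}_1^{*}\hat{\boldsymbol{\beta}}_1\right) \tag{23}$$

will predict the effect of markers in $\mathbf{X}_2$ on phenotypes, conditionally on the effects of markers in $\mathbf{X}_1$, that is, in the absence of genetic variation at loci in marker set 1. Note in (22) that $\mathbf{X}_1^{*}\boldsymbol{\beta}_1 = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\mathbf{B}_{21}\mathbf{B}_{11}^{-1}\boldsymbol{\beta}_1$, so that the "total signal" on the trait contributed by $\boldsymbol{\beta}_1$ is decomposed into a "direct" component and an indirect contribution mediated through the covariance between $\boldsymbol{\beta}_1$ and $\boldsymbol{\beta}_2$. This sort of phenomenon is well known in structural equation modelling and path analysis (Wright, 1921).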
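The breakdown of invariance under between-set covariance can also be illustrated numerically. The sketch below rests on several assumptions made only for the example: a hypothetical covariance matrix among marker effects that decays with distance between marker positions, simulated genotypes, and arbitrary sizes. It solves the mixed model equations with the tested markers fitted as both fixed and random, checks that the predicted random effects of the tested markers are no longer zero but tied to those of the remaining markers, shows that the two similarity matrices now give different GLS solutions, and splits the fitted signal of the conditional model into direct and indirect parts.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p1, p2 = 30, 2, 60               # hypothetical sizes
p = p1 + p2
sigma2_e = 1.0

X = rng.integers(0, 3, size=(n, p)).astype(float)
X -= X.mean(axis=0)
X1, X2 = X[:, :p1], X[:, p1:]
y = rng.normal(size=n)

# Assumed covariance among marker effects: 0.1 * 0.5^|i - j| (illustrative)
idx = np.arange(p)
B = 0.1 * 0.5 ** np.abs(idx[:, None] - idx[None, :])
B11, B12 = B[:p1, :p1], B[:p1, p1:]
B21, B22 = B[p1:, :p1], B[p1:, p1:]

# MME with X1 fitted as fixed (b1) and as random (beta1), correlated with beta2
W = np.hstack([X1, X])
lhs = W.T @ W
lhs[p1:, p1:] += sigma2_e * np.linalg.inv(B)
sol = np.linalg.solve(lhs, W.T @ y)
b1, beta1, beta2 = sol[:p1], sol[p1:2 * p1], sol[2 * p1:]
beta1_implied = B12 @ np.linalg.solve(B22, beta2)   # from subtracting equations


def gls(X_fixed, cov, y_obs):
    """GLS estimate (X' cov^-1 X)^-1 X' cov^-1 y."""
    return np.linalg.solve(
        X_fixed.T @ np.linalg.solve(cov, X_fixed),
        X_fixed.T @ np.linalg.solve(cov, y_obs),
    )


V = X @ B @ X.T + sigma2_e * np.eye(n)        # all markers in the covariance
V2 = X2 @ B22 @ X2.T + sigma2_e * np.eye(n)   # tested markers left out
b1_V, b1_V2 = gls(X1, V, y), gls(X1, V2, y)

# Conditional model: effects of set 2 given the effects of set 1
X1_star = X1 + X2 @ B21 @ np.linalg.inv(B11)
B22_1 = B22 - B21 @ np.linalg.solve(B11, B12)
V_delta = X2 @ B22_1 @ X2.T + sigma2_e * np.eye(n)
beta1_cond = gls(X1_star, V_delta, y)
direct = X1 @ beta1_cond                                  # direct path
indirect = X2 @ B21 @ np.linalg.solve(B11, beta1_cond)    # path through set 2

print(np.allclose(beta1, beta1_implied), np.allclose(b1, b1_V))
print(np.max(np.abs(b1_V - b1_V2)))
```

The last printed value is the largest absolute difference between the two GLS solutions; it is expected to be clearly non-zero whenever the between-set covariance is non-null.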

CONCLUSION

When conducting a GWAS with either a single marker or a set of markers treated as fixed, it is unnecessary to reconstruct the phenotypic variance–covariance matrix at each specific instance of testing, provided that a BLUP model is used and that marker effects in sets regarded as both fixed and random are independent across sets but not necessarily within sets. BLUE is invariant with respect to whether the genotypes of markers being tested in GWAS are employed in the construction of a genetic similarity matrix. Likewise, BLUP of the effects of the sets treated as random is invariant as well. However, the variance–covariance matrix of the GLS estimator and the prediction error‐covariance matrix must be taken with respect to the assumptions made in the model employed for analysis. If marker effects in the two subsets of genotypes have a between‐set non‐trivial dependency structure, the GWAS model requires modification. The results presented in this paper, shown first by Gianola et al. (2016) just for GLS (BLUE), are seemingly unrecognized in the GWAS literature (e.g., Chen, Steibel, & Tempelman, 2017). An additional and probably useful result reported here is that represented by Equation (13): the variance of the estimate of any of the marker effects tested in GWAS can be obtained via a simple adjustment of the variance obtained with all markers entering into the similarity matrix.

REFERENCES

1. Improved linear mixed models for genome-wide association studies.

Authors: Jennifer Listgarten; Christoph Lippert; Carl M Kadie; Robert I Davidson; Eleazar Eskin; David Heckerman
Journal: Nat Methods Date: 2012-05-30 Impact factor: 28.547

2. Comparing estimates of genetic variance across different relationship models.

Authors: Andres Legarra
Journal: Theor Popul Biol Date: 2015-09-02 Impact factor: 1.570

3. Genomewide rapid association using mixed model and regression: a fast and simple method for genomewide pedigree-based quantitative trait loci association analysis.

Authors: Yurii S Aulchenko; Dirk-Jan de Koning; Chris Haley
Journal: Genetics Date: 2007-07-29 Impact factor: 4.562

4. Efficient methods to compute genomic predictions.

Authors: P M VanRaden
Journal: J Dairy Sci Date: 2008-11 Impact factor: 4.034

5. Efficient control of population structure in model organism association mapping.

Authors: Hyun Min Kang; Noah A Zaitlen; Claire M Wade; Andrew Kirby; David Heckerman; Mark J Daly; Eleazar Eskin
Journal: Genetics Date: 2008-03 Impact factor: 4.562

6. Genome-Wide Association Analyses Based on Broadly Different Specifications for Prior Distributions, Genomic Windows, and Estimation Methods.

Authors: Chunyu Chen; Juan P Steibel; Robert J Tempelman
Journal: Genetics Date: 2017-06-21 Impact factor: 4.562

7. Priors in whole-genome regression: the bayesian alphabet returns.

Authors: Daniel Gianola
Journal: Genetics Date: 2013-05-01 Impact factor: 4.562

8. Finding the missing heritability of complex diseases (review).

Authors: Teri A Manolio; Francis S Collins; Nancy J Cox; David B Goldstein; Lucia A Hindorff; David J Hunter; Mark I McCarthy; Erin M Ramos; Lon R Cardon; Aravinda Chakravarti; Judy H Cho; Alan E Guttmacher; Augustine Kong; Leonid Kruglyak; Elaine Mardis; Charles N Rotimi; Montgomery Slatkin; David Valle; Alice S Whittemore; Michael Boehnke; Andrew G Clark; Evan E Eichler; Greg Gibson; Jonathan L Haines; Trudy F C Mackay; Steven A McCarroll; Peter M Visscher
Journal: Nature Date: 2009-10-08 Impact factor: 49.962