Literature DB >> 31383770

Best Prediction of the Additive Genomic Variance in Random-Effects Models.

Nicholas Schreck¹, Hans-Peter Piepho², Martin Schlather^3,4.

Abstract

The additive genomic variance in linear models with random marker effects can be defined as a random variable that is in accordance with classical quantitative genetics theory. Common approaches to estimate the genomic variance in random-effects linear models based on genomic marker data can be regarded as estimating the unconditional (or prior) expectation of this random additive genomic variance, and result in a negligence of the contribution of linkage disequilibrium (LD). We introduce a novel best prediction (BP) approach for the additive genomic variance in both the current and the base population in the framework of genomic prediction using the genomic best linear unbiased prediction (gBLUP) method. The resulting best predictor is the conditional (or posterior) expectation of the additive genomic variance when using the additional information given by the phenotypic data, and is structurally in accordance with the genomic equivalent of the classical additive genetic variance in random-effects models. In particular, the best predictor includes the contribution of (marker) LD to the additive genomic variance and possibly fully eliminates the missing contribution of LD that is caused by the assumptions of statistical frameworks such as the random-effects model. We derive an empirical best predictor (eBP) and compare its performance with common approaches to estimate the additive genomic variance in random-effects models on commonly used genomic datasets.

Entities: Disease Species

Keywords: BLUP; best prediction; genetic variance; genomic variance; quantitative genetics; random-effects models; whole-genome regression

Mesh：

Year: 2019 PMID： 31383770 PMCID： PMC6781909 DOI： 10.1534/genetics.119.302324

Source DB: PubMed Journal: Genetics ISSN： 0016-6731 Impact factor: 4.562

THE additive genetic variance is defined as the variance of the breeding value (BV) and is the most important determinant of the response of a population to selection (Falconer and Mackay 1996). The additive variance can be estimated from observations made on the population and is a principal component of the (narrow-sense) heritability, which is one of the main quantities of interest in many genetic studies (Falconer and Mackay 1996). The heritability is eminent, among other things, for the prediction of the response to selection in the breeder’s equation (Piepho and Moehring 2007; Hill 2010). Although nonadditive genetic variation exists, most of the genetic variation is additive, such that it is usually sufficient to investigate the additive genetic variance (Hill ). More specifically, epistasis is only important on the gene level but not for the genetic variance (Hill ), and Zhu show that, for human complex traits, dominance variation contributes little. Nevertheless, linkage disequilibrium (LD) is an important factor, especially when departing from random mating and Hardy-Weinberg equilibrium, which is often the case in animal breeding (Hill ; L. Dempfle, personal communication). The additive genomic variance is defined as the variance of a trait that can be explained by a linear regression on a set of markers (de los Campos ). Many authors have been chasing what is sometimes coined “missing heritability” (Maher 2008), which means that only a fraction of the “true” genetic variance can be captured by regression on influential markers. Initially, researchers have used genome-wide association studies (GWAS) in order to find quantitative trait loci (QTL) by single-marker fixed-effect regression combined with variable selection. After having added the estimated corresponding genomic variances of the single statistically significant loci, they asserted that they could only account for a fraction of the “true” genetic variance. For instance, Maher (2008) found that only 5% instead of the widely accepted heritability estimate of 80% of human height could be explained. Golan state that the “true” genetic variance is generally underestimated when applying variable selection, e.g., GWAS, to genomic datasets which are typically characterized by their high dimensionality, where the number of variables (markers) p is much larger than the number of observations n. It is well known that a lot of traits are influenced by many genes, and that at least some loci with tiny effects are missed when using variable selection or even single-marker regression models. Consequently, Bernardo (1994) decided to fit all [restriction fragment length polymorphism (RFLP-)] markers in maize jointly using genomic best linear unbiased prediction (gBLUP), where he assumes the marker effect vector to be random. In animal breeding, Meuwissen used Bayesian approaches (BayesA and BayesB) to fit all markers jointly in order to predict BVs. Then, Yang estimated the genomic variance in an approach that they termed genome-wide complex trait analysis genomic restricted maximum likelihood (GCTA-GREML) (Yang ). They showed that quantifying the combined effect of all single-nucleotide polymorphisms (SNPs) explains a larger part of the heritability than only using certain variants quantified by GWAS methods. They illustrate their results on a dataset on human height by pointing out that they could explain a heritability, also termed “chip heritability” (Zhou ), of ∼45%. They concluded that the main reason for the remaining missing heritability was incomplete LD of causal variants with the genotyped SNPs, which refers to the general difference between the genetic variance and the genomic variance (Powell ; de los Campos ). However, the GCTA-GREML approach can be biased upwards as well as downward (Wolc ; de los Campos ; Fernando ; Lehermeier ). Recently, there has been a general discussion whether estimators for the genomic variance account for LD between markers, which is defined as the covariance between the marker genotypes (Bulmer 1971). Some authors argue that estimators similar to GCTA-GREML lack the contribution of LD (Krishna Kumar ,b; Lehermeier ), whereas others (Yang ) resolutely disagree. More specifically, Krishna Kumar ,b) state that, in GCTA-GREML, the contributions of the p markers to the phenotypic values are assumed to be independent normally distributed random variables with equal variances. Thus, they claim that the random contribution made by each marker is not correlated with the random contributions made by any other marker, which leads to a negligence of the contribution of LD to the additive genomic variance. In a study on the model plant Arabidopsis thaliana (1001 Genomes Consortium 2016), Lehermeier use Bayesian ridge regression (BRR) to relate the phenotype flowering time to the genomic data. They use an estimator (termed M2) based on the posterior distribution of the marker effects obtained by Markov Chain Monte Carlo (MCMC) methods and show that this estimator explains a larger proportion of the phenotypic variance than the estimator, termed M1, based on gBLUP (VanRaden 2008; Yang , 2011). Lehermeier argue that the reason for the better performance of the Bayesian estimator for the additive genomic variance [already mentioned in Sorensen , Fernando and Garrick (2013), Zhou , Fernando ] is the explicit inclusion of LD. We show that the additive genomic variance in linear models with random marker effects (RME) can be defined as a random variable. Based on this premise, we propose a novel predictor of the additive genomic variance and place existing estimators in a joint framework permitting comparison with the new predictor. We contribute to the solution of many of the above mentioned controversies by reviewing common approaches to estimate the additive genomic variance, e.g., GCTA-GREML, and show that they estimate the unconditional (or prior) expectation of the random additive genomic variance. Combined with the assumptions on the unconditional distribution of the marker effects in the gBLUP-method, this leads to an insufficient adaptation to the data and a negligence of the contribution of LD. We introduce a novel best prediction approach for the additive genomic variance in both the current and the base population, i.e., we use the conditional (or posterior) expectation of the random additive genomic variance given the additional information by the phenotypic values for an improved adaptation to the data. We decompose the best predictor into the GCTA-GREML estimator and a function for the contribution of marker LD, which determines whether GCTA-GREML is biased up- or downward. The best predictor is structurally in accordance with the genomic equivalent of the additive genetic variance from classical quantitative genetics, i.e., it explicitly includes the contribution of LD. We propose an empirical best predictor (eBP), and illustrate our theoretical results on several commonly used genomic datasets.

Materials and Methods

Linear models

The connection of the n-vector y of phenotypic values and the n-vector g of genomic values is given bywhere μ denotes a fixed intercept, is a n-row-vector containing 1s, denotes environmental deviations, and is the identity matrix of dimension n. The statistical model that is probably most popular in genomic applications is the random-effects model (REM), in which the genomic values are assumed to be normally distributedwhere G is a variance-covariance matrix and is the variance component of the vector of genomic values g. In the following, we assume that the genome is mapped with markers and we denote by X the design matrix coding the genotypes of the markers. In this framework, the -matrix G introduced in Equation 2 is called the genomic relationship matrix (GRM), and is most commonly defined aswhere is the idempotent -matrix used for column-wise mean-centering, and , where denotes the frequency of the minor allele at marker j (VanRaden 2008). Then, the genomic values g can be separated into the coded genotypes of the single markers X and their corresponding p-vector β of marker effects such thatholds in distribution, where for . The p individual components of the marker effect vector β are drawn at random from a common fixed normal distribution for each marker genotype:with positive variance component . Model (1) is called linear equivalent model (Henderson 1984) to the “standard” additive linear regression modelThe centering matrix P in Equations 4 and 6 is mandatory for the equivalence of models (1) and (6), and to be able to consistently apply model (6). The restriction of the column-means of the marker genotypes in model (6) to be 0 guarantees that the sample mean of the genomic values in Equation 4 equals 0 . This ensures the uniqueness of the definition of the vector g of genomic values in Equation 4. Otherwise, a different coding of the marker genotypes leads to different genomic values g. Inferences on quantities based on genomic data in model (1) and (6) are often performed with the gBLUP method (Bernardo 1994). Model (6) allows for marker-specific investigations and inferences on the genomic contribution to the phenotypic values, whereas estimation of parameters in model (1) has computational advantages. For additional information on the gBLUP-method, we refer to Appendix Genomic Best Linear Unbiased Prediction. Model (6) is a realization of n draws from the underlying data-generating process of the (mean-centered) marker genotypes (Bühlmann and van de Geer 2011). This distribution, as well as the corresponding genomic values in Equation 4, relate to the current population of individuals. When we are interested in the genomic values in the corresponding consistent base population, we should take the relationship (correlation) between the individuals into account (Powell ; Legarra 2016). In what follows, we assume a given relationship matrix R. Instead of the genomic values g or we investigate the uncorrelated genomic values defined byThese are realizations of n draws from the underlying data-generating process of marker genotypes in the base population. The sample mean of the genomic values in the base population, , is usually different from 0.

Definitions of the genomic variance

The additive genetic variance (in the base population also called genic variance) with respect to a trait is defined as the theoretical variance of the genetic value based on the respective QTLs (Falconer and Mackay 1996). The genomic variance has been described as the variance of the genomic values that can be explained by a linear regression on a set of markers (de los Campos ). This does not, however, imply a unique formal definition of the genomic variance. Here, we give an overview of different approaches to define a genomic variance in the framework of the linear models (1) and (6). Without further assumptions on the nature of the genome, we can define the sample varianceof the mean-centered n-vector of genomic values (see Equation 4) in the current population (Ould Estaghvirou ). Here, defines the sample variance-covariance matrix of the marker genotypes in the current population. In the base population, we define the sample varianceof the uncorrelated genomic values (see Equation 7). Here, defines the sample variance-covariance matrix of the marker genotypes in the base population. Alternatively, we can define the theoretical variance of the genomic values directly in the REM. The linear model is generated by drawing from the data-generating process of the marker genotypes (representative individual), and the model assumptions in the REM dictate that marker effects are random variables. This gives rise to three different scenarios for the variance of the genomic values in the REM (marker genotypes random, marker effects random, or both random). The additive genomic variance of a randomly sampled (representative) individual (Gianola ; de los Campos ; Fernando ) equalswhere denotes the variance-covariance matrix of the marker genotypes. The variance of a randomly sampled (representative) individual with RME is given byThis is not the additive genomic variance (Gianola ; de los Campos ). The variance of a randomly sampled trait averaged over individuals with fixed genotypes X equalsand does not equal the additive genomic variance. We derive the equalities in Equations 10–12 in more detail in the Appendix Theoretical Variances of the Genomic Values in the REM. These quantities refer to the genotypes in the current population. We can apply the same definitions in the base population by considering the data-generating process of the genotypes in the base population (exchange X by ). In Table 1, we give an overview of the different possibilities to define the variance of the genomic values in the REM.

Table 1

Overview of different definitions of the variance of the genomic values in the current population and their expression in the random-effects model

Variance of genomic values
Sample variance	Theoretical variance
sg2	Var(Xβ)	tr(Cov(Xβ\|X))	Var(Xβ\|β)
β⊤Σ^Xβ	σβ2tr(ΣX)	nσβ2tr(Σ^X)	β⊤ΣXβ

Analogous quantities for the base population can be obtained by exchanging X by . The sample variance and the theoretical variance define the sample and theoretical version of the additive genomic variance. In actual applications, we have to replace in Equation 10 by its estimator . Consequently, the sample variance (Equation 8) as well as the theoretical variance (Equation 10) effectively represent the additive genomic variance, i.e., the genomic equivalent of the additive genetic variance (Bulmer 1971; Falconer and Mackay 1996), in the current population. In the following, we do not explicitly distinguish between the sample or the theoretical version of the variance, and will speak only of the additive genomic variance. From now on, we focus on the estimation of the additive genomic variance in the general formwhich is a non-negative quadratic form of the genomic values. By specifying the positive semidefinite and symmetric -matrix B, we determine whether the genomic variance refers to the current population (see Equation 8), or the base population (see Equation 9). Because the randomness of the marker genotypes is not explicitly necessary to derive Equation 13, we can easily express all results in terms of the genomic values g defined in the equivalent model. In the framework of the REM, the marker effects β in model (6) and the genomic values g in model (1) are random variables. Consequently, the additive genomic variance in Equation 13 is also a random variable, and, as such, has to be predicted in an optimal way before finally being estimated [in the same way that the RME β are predicted by the BLUP and then estimated by the eBLUP (Kackar and Harville 1984; Jiang 1999)]. First, we will show that estimators for the unconditional expectation of Equation 13, like GCTA-GREML, are of the form of Equations 11 and 12, and, therefore, do not estimate the additive genomic variance. Then, we introduce the (frequentist) best predictor for the additive genomic variance and show that this approach maintains the structure of the additive genomic variance (Equation 13)—the genomic equivalent of the additive genetic variance.

The expectation of the additive genomic variance

The expectation of the random variable in Equation 13 minimizes the quadratic formwith respect to all real numbers a, i.e., is the best approximation of in the absence of additional information (van der Vaart 2007). The unconditional (or prior) expectation of equalsbecause of the properties of the trace. For the additive genomic variance in the current population, , we choose in Equation 14 and obtainin model (6) orin the equivalent model (1). Unconditional expectations of the form V for the additive genomic variance are considered in, for example, Ould Estaghvirou . For the additive genomic variance in the base population, , we choose in Equation 14 and obtainin model (6) orin the equivalent model. Often (VanRaden 2008; Yang , 2011; Speed ; Vinkhuyzen ; Legarra 2016), the matrix R used for the transformation to the base population is assumed to be the GRM G defined in Equation 3. Then, the unconditional expectation of the additive genomic variance simplifies toand the variance component from the gBLUP-method is considered as the additive genomic variance in the base population. We have shown, however, that is only the unconditional expectation of the additive genomic variance. We recommend caution when using the simplification in Equation 19 because the GRM G is in general singular (because P is singular), and therefore is not well defined. We emphasize that only the diagonal elements of the sample variance-covariance matrix ( or ) of the marker genotypes influence the unconditional expectations V, and of the additive genomic variance. The model assumptions in the REM dictate the matrix to be diagonal, which leads to a negligence of the off-diagonal elements of or in Equation 14. The covariances (LD) between the marker genotypes are not included, and V, and are of the same form as in Equation 11 and in Equation 12. This implies that the unconditional expectation of the random additive genomic variance is structurally not fully in accordance with the (random) additive genomic variance in Equation 13. Explicit formulae for the estimation of the unconditional expectations V, and will be given in the Appendix Estimation of the Additive Genomic Variance in the REM.

Best prediction of the additive genomic variance

The unconditional (prior) expectation of in Equation 14 is strongly influenced by the model assumption on the marginal distribution of the marker effects, and does not use additional information given by the phenotypic values y in model Equations 1 and 6. Estimation procedures for the unconditional expectation [which include e.g., restricted maximum likelihood (REML)], however, make use of the data y to a certain extent. By contrast, the conditional (posterior) expectation, given the phenotypic values y, makes best use of the information provided by y even before the final estimation is performed. Generally, the conditional (or posterior) expectation of a random variable Z (in our case ) given the knowledge of the random vector Y is defined as the “best prediction” (Searle ; van der Vaart 2007) of the random variable Z. The best predictoris the unique function that minimizes the mean square error of predictionover all functions in Y, i.e., the conditional expectation is the projection (closest element in a given set of functions) of Z onto the linear space of all functions in Y (Searle ; van der Vaart 2007). The best predictor in Equation 20 is, by definition, an unbiased predictor for the random variable Z and maximizes the correlation , i.e., we can replace the target random variable Z by the best predictor defined in Equation 20 in an optimal way (Searle ). Instead of inferring the unobservable target random variable, we conduct inferences on the best predictor. Because the best predictor has realized in a given dataset , it is estimable (Searle ). In the following, we introduce a novel approach of considering the frequentist best predictor instead of the unconditional expectation for the random additive genomic variance in Equation 13. We proceed according to Equation 20, and definefor the given phenotypic values y, because X is constant, and, therefore, independent of y. The matrix of conditional second moments of the marker effects β is usually nondiagonal (contrary to ), and can be expressed asusing the BLUP of the random vector β, and the variance-covariance matrix of β given the data y. Then, the best predictor equalswhere the last equality holds because of the connectionof the BLUPs and the conditional variance-covariance matricesin models (1) and (6), see also Appendix Genomic Best Linear Unbiased Prediction. For the best predictor of the additive genomic variance in the current population we set in Equation 21, and obtainin model (6) orin the terminology of the equivalent model (1). For the best predictor of the additive genomic variance in the base population we set in Equation 21, and obtainin model (6) orin the terminology of the equivalent model (1). We note that Equations 24 and 25 have structural similarities with the iterative maximum-likelihood equation for the variance component when using the expectation-maximization algorithm (Sorensen and Gianola 2002). We emphasize that the best predictor of the additive genomic variance in the current population (W), as well as in the base population includes the contribution of all elements of the sample variance-covariance matrix of marker genotypes ( or ), and, hence, comprises LD information, contrary to the unconditional expectations V, and of the additive genomic variance from the previous section. Explicit formulae for the eBPs of the additive genomic variance, as well as a formula for (approximate approach using the GRM G for transformation to the base population), will be given in the Appendix Estimation of the Additive Genomic Variance in the REM. We compare the use of the unconditional expectation and the best predictor for the prediction of the random additive genomic variance in the REM in Table A.1 and Table A.2 in the Appendix.

Table A.1

Overview of prediction approaches for the random additive genomic variance in the random-effects model with the gBLUP-method

	Unconditional expectation	Best prediction
General Formula	E[sg,B2]=1n−1σβ2tr(X⊤PBPX)	BP(sg,B2)=1n−1tr(X⊤PBPX[μβ\|yμβ\|y⊤+Σβ\|y])
Current population	V=σβ2tr(Σ^X)	W=tr(Σ^X[μβ\|yμβ\|y⊤+Σβ\|y])
Base population	V=σβ2tr(Σ^X)	W=tr(Σ^X[μβ\|yμβ\|y⊤+Σβ\|y])
Features	• Best approximation of the additive genomic variance in the absence of information	• Best approximation of the additive genomic variance using additional information given by phenotypic values (optimal adaptation to the data by application of the conditional or posterior expectation)
	• No inclusion of LD	• Explicit inclusion of LD
		• Orthogonal decomposition of the phenotypic variance in the current population (unique definition of the heritability)
		• Genomic equivalent of the additive genetic variance

X, matrix of marker genotypes; P, matrix for column-wise mean-centering; B, positive semidefinite matrix; variance component of the marker effects β; BLUP of β; conditional covariance matrix of β given the phenotypic data y; sample variance-covariance matrix of the marker genotypes in the current population; sample variance-covariance matrix of the marker genotypes in the base population.

Table A.2

Overview of prediction approaches for the random additive genomic variance in the random-effects model with the gBLUP-method in the equivalent version of the linear model

	Unconditional expectation	Best prediction
General formula	E[sg,B2]=1n−1σg2tr(BG)	BP(sg,B2)=1n−1tr(B[μg\|yμg\|y⊤+Σg\|y])
Current population	V=1n−1σg2tr(G)	W=1n−1tr([μg\|yμg\|y⊤+Σg\|y])
Base population	V*=1n−1σg2tr(PR−0.5GR−0.5)	W*=1n−1tr(PR−0.5[μg\|yμg\|y⊤+Σg\|y]R−0.5)
Features	• Best approximation of the additive genomic variance in the absence of information	• Best approximation of the additive genomic variance using additional information given by phenotypic values (optimal adaptation to the data by application of the conditional or posterior expectation)
	• No inclusion of LD	• Explicit inclusion of marker LD
	• Transformation with GRM: σg2 replaces V*	• Orthogonal decomposition of the phenotypic variance in the current population (unique definition of the heritability)
		• Genomic equivalent of the additive genetic variance

G, genomic relationship matrix; P, matrix for column-wise mean-centering; B, positive semidefinite matrix; variance component of the genomic values g; the BLUP of g; conditional covariance matrix of g given the phenotypic data y; R, relationship matrix.

Statistical analysis (genomic data)

For an illustration of the theoretical results of the previous sections, we used the mice dataset that comes with the R-package BGLR (Pérez and de los Campos 2014). The data originally stem from an experiment by Valdar ,b) in a mouse population. The dataset contains the matrix X with values in of polymorphic marker genotypes that were measured in mice. The trait (n-vector y) under consideration was body length (BL). The relationship of the mice is recorded in the pedigree matrix R, and is used for the transformation to the base population. Additionally, we used the publicly available historical wheat dataset that also comes with the R-package “BGLR” (Pérez and de los Campos 2014). The data originally stems from CIMMYT’s Global Wheat Program and consists of lines of wheat, where the trait under consideration was average grain yield. The phenotypes are divided up into four basic target sets of environments designated as Wheat I, Wheat II, Wheat III, and Wheat IV, of which we considered the only first. The dataset contains the matrix of marker genotypes for markers as well as a relationship matrix. Moreover, we analyzed a population of fully sequenced Arabidopsis lines for which phenotypes and genotypes are publicly available by the effort of the Arabidopsis 1001 Genomes project (1001 Genomes Consortium 2016). The lines represent natural inbred lines and we examined the same trait, namely flowering time at 10° (FT10), and the same SNP-markers that were used in Lehermeier . For these data, no relationship matrix was available. We conducted all calculations with the free software R (R Development Core Team 2017). For each dataset, we used the gBLUP-method in the equivalent version (computational advantages) implemented in the R-package “sommer” (Covarrubias-Pazaran 2017) to fit a REM. We used REML to obtain estimates ( and ) for the variance components. We also obtained estimates of the best predictor of the genomic effects and the their variance-covariance matrix . We used this outcome for the estimation of the unconditional expectation V and the BP W of the additive genomic variance in the current population, as well as the unconditional expectation and the BP for the additive genomic variance in the base population (except for the Arabidopsis dataset, where no relationship matrix was available). Although the GRM is not invertible, we will show in the Appendix Estimation of the Additive Genomic Variance in the REM how to use the GRM for a transformation to the base population, and to calculate the corresponding unconditional expectation and the BP for the additive genomic variance in the base population. Detailed information about the calculations and the programming code is provided in the Supplemental Material, File S1.

Data availability

The authors affirm that all data necessary for confirming the conclusions of this article are represented fully within the manuscript and the supplemental material that has been uploaded to figshare. File S1 contains a detailed description of the estimation of the genomic variances for the gBLUP-method as well as the corresponding R-code and its output. Supplemental material available at Figshare: https://doi.org/10.25386/genetics.8114117.

Results

In the first section of Table 2, we present the estimation results for the unconditional expectation V and the best predictor W for the additive genomic variance in the current population. In the mice and wheat datasets, exceeds , whereas for the Arabidopsis data, the eBP is about double the size of the unconditional expectation.

Table 2

Estimation results for the unconditional expectation V and the best predictor W for the additive genomic variance in the current population for the mice, wheat, and Arabidopsis datasets

Genomic variance/heritability	Population	Mice	Wheat	Arabidopsis
V^(=h^V2)^a	Current	0.3737749	0.6039708	0.47333803
V^+σ^ε2^b		1.0754963	1.1449704	0.54832029
h∼V2:=V^/(V^+σ^ε2)^c		0.3475371	0.5274990	0.86325098
W^(=h^W2)^a		0.2982787	0.4590001	0.92501779
W^+σ^ε2^b		1.0000002	0.9999998	1.00000005
h∼W2:=W^/(W^+σ^ε2)^c		0.2982787	0.4590002	0.92501774
V^*	Base	0.3704021	3.0621134	−
W^*		0.3089758	2.0095836	−
V^s*		0.3639248	1.3158006	0.80762011
W^s*		0.3577692	1.2300300	1.30240520

Heritability with respect to phenotypic sample variance which has been scaled to 1.

Alternative definition of the phenotypic variance that depends on the estimate of the genomic variance.

Alternative definition of the heritability that depends on the alternative definition of the phenotypic variance.

We also present the corresponding heritabilities with respect to the sample variance of the phenotypic values and with respect to the sum of the additive genomic and residual variance In addition, we depict the estimation results for the unconditional expectation ( when using the GRM for the transformation) and the best predictor ( when using the GRM for the transformation) for the additive genomic variance in the base population. Heritability with respect to phenotypic sample variance which has been scaled to 1. Alternative definition of the phenotypic variance that depends on the estimate of the genomic variance. Alternative definition of the heritability that depends on the alternative definition of the phenotypic variance. The sample variance of the phenotypic values has been scaled to 1. The sum of and the residual variance is larger than the phenotypic variance for the mice and wheat data but smaller for the Arabidopsis data. Technically, it is possible to define the heritability in two ways, namely with respect to the phenotypic variance, and with respect to the sum of the additive genomic variance and the residual variance. The sum of the eBP and the residual variance, however, equals the scaled phenotypic variance of 1 remarkably exactly for all datasets considered. In the second section of Table 2, we first present the estimation results for the unconditional expectation and the best predictor for the additive genomic variance in the base population using the given relationship matrices for the transformation. For the mice data, and are similar to their analogs and in the current population. For the wheat data, however, the estimated unconditional expectation and eBP in the base population are around five times larger than those in the current population, and exceed the sample phenotypic variance in the current population. By this approach, it is not possible to define a heritability in the base population because both the estimate of the residual variance and the phenotypic sample variance refer to the current population. The estimation results for the unconditional expectation and the best predictor for the additive genomic variance in the base population using the GRM for the transformation differ from those using the given relationship matrices by a considerable amount. In the mice and wheat data, is larger than , whereas, for the Arabidopsis data the eBP exceeds the estimated unconditional expectation. This conforms to the behavior of V and W in the current population.

Discussion

We have shown that commonly used estimators for the additive genomic variance in the REM with genomic marker data are based on the unconditional expectation of the random additive genomic variance. We have introduced a novel best prediction approach for the random additive genomic variance in both the current and the base population. In the following, we discuss several important implications.

Current and base population

Common ways of estimating the additive genomic variance focus on the base population. These approaches are independent of the actual current population, and, consequently, valid even if the generations change. If one aims at the response of a population to selection, however, it might be more meaningful to estimate the additive genomic variance in the actual given population. This implies that the estimation of the genomic variance has to be conducted again when the individuals change. A formal definition of the heritability is best possible in the current population, where the phenotypic and residual variance are estimable. We have preferred to use given relationship matrices for the transformation of the genomic values to the base population. In the case that such a matrix is not available, we have shown how to use genomic relationship matrices for the transformation, although a formal inversion of GRMs is, in general, not possible. In Table 2, we illustrate that we can decompose the sample phenotypic variance into the sum of the eBP of the additive genomic variance in the current population and into the estimated residual variance. This is due to the orthogonal projection property of the conditional expectation, which gives the best approximation of the random additive genomic variance. This enables a unique definition of the heritability in the current population. It is never possible, however, to transfer the residual variance to the base population. Consequently, a definition of the heritability in the base population is not straightforward.

The gBLUP-method and the Bayesian approach

The frequentist gBLUP-method can also be set up in the context of Bayesian regression models (with prior distribution for the effect vector as defined in Equation 5, and uninformative priors for the variance components). Lehermeier considered the additive genomic variance in the current population (see Equation 8), and used BRR to estimatewhere denotes MCMC samples from the posterior distribution of β. In that approach, Lehermeier estimated the posterior mean of the additive genomic variance in the current population. This approach is the Bayesian equivalent of the (frequentist) empirical version of the best predictor of in Equation 13 in the current population. does not describe the genomic variance in the base population, and should not directly be compared with approaches introduced e.g., in Yang , 2011). Analogously to the best predictor (see Equation 24), for the genomic variance in the base population, one can consideras the posterior mean of the genomic variance in the base population in Bayesian regression models. The frequentist gBLUP-method provides a more formal approach to the prediction of the random additive genomic variance in linear models with random effects than the Bayesian approach. It enables the derivation of explicit formulas for the predictors (unconditional expectation and best predictor) of the random additive genomic variance using the standard output of the gBLUP-method, which goes hand-in-hand with a fast implementation of the empirical version of the predictors. The connection between the BLUP and its covariance for the RME with the additive genomic variance are clearly visible. This enables us, for instance, to derive the decomposition of the best predictor of the random additive genomic variance into the unconditional expectation and a function for the marker LD in the following section.

Influence of LD

In the section Definitions of the genomic variance, we saw that the (random) additive genomic variance equalsin the current population, andin the base population. We emphasize that the variance-covariance matrix of the marker genotypes (marker LD) plays a decisive part in the determination of the additive genomic variances in both the current and the base population. The variances and are structurally in accordance with the classical additive genetic variance (Bulmer 1971; Falconer and Mackay 1996), which is caused by the genotypes, whereas the genotypic effects are fixed. In the REM, however, the marker effects are random, with unconditional expectation 0 and unconditional diagonal variance-covariance matrix with equal variances . As a consequence, the unconditional expectation of the additive genomic variancein the current population andin the base population contain only the variances of the marker genotypes in the corresponding population. In addition, the unconditional expectation resembles both the variance of a randomly sampled trait for a randomly sampled individual and the variance of a randomly sampled trait for individual with fixed genotypes, see Table 1 for an overview. We show in the Appendix Estimation of the Additive Genomic Variance in the REM that we can partition the best predictor of the random additive genomic variance in the following way:wherein the current population, andin the base population . The best predictor, therefore, consists of the unconditional expectation of the additive genomic variance (no contribution of LD) and a function that explicitly contains the weighted contribution of marker LD. This function determines whether estimators like GCTA-GREML (unconditional expectation of the random genomic variance in the base population) are biased upwards or downward, i.e., it determines the direction and the magnitude of the bias of GCTA-GREML (this method is based on the assumption that the function L constantly equals 0). In addition, we notice that this bias depends not only on the sign of the covariance between the marker genotypes, but both on the sign and the magnitude of the weighted covariances. When best (conditional) estimators are considered, the inclusion of LD is intrinsic; in that case, discussions about whether or not LD should be included or even optimally included become immaterial. We emphasize that, contrary to the unconditional expectation, the best predictor maintains the structure of the additive genomic variance and , because the function L can be decomposed into the weighted sample variances and covariances of the marker genotypes. Instead of the marker effects, the components of the matrix , which is typically nonzero and nondiagonal, take the part of the weighting factors of the elements of and The best predictor maintains the structure of the additive genomic variance in both the current and the base population and, thus, conforms to the classical genetic variance (Bulmer 1971; Falconer and Mackay 1996). The difference between the estimators V and W ( and and ) is given by the estimated and can be obtained from Table 2. We notice that the weighted contribution of marker LD is large and positive in the case of the Arabidopsis data, whereas in the mice and wheat data, the weighted contribution of marker LD is slightly negative. To sum up, the application of the unconditional expectation of the additive genomic variance combined with the model assumptions on the marker effects in random effect models cause, at least partially, the missing contribution of LD to the estimated additive genomic variance. This goes hand-in-hand with the critique expressed in Krishna Kumar ,b). It is, however, less important when estimating the additive genomic variance in the base population, where the individuals are uncorrelated and less LD persists (although the marker genotypes need not be uncorrelated). The best prediction approach eliminates the problem of the missing contribution of LD to the additive genomic variance that is caused by mathematical modeling (e.g., the assumptions in the random-effects model).

Concluding remarks

The variability in the marker genotypes causes the variability in the genomic values in a population of individuals. Hence, the additive genomic variance is induced by the variance of the marker genotypes. In the application of the random-effects model, the marker effects are assumed to be random, which introduces another source of variability, namely a statistical one, in the genomic values. We have shown that commonly used estimators use the unconditional expectation to handle this statistical randomness. However, we recommend the use of the best prediction approach (conditional expectation) that uses the additional information given by the phenotypic data in an optimal way, minimizes the mean square error of prediction, includes the contribution of LD, and maintains the structure of the genomic equivalent of the classical additive genetic variance.

29 in total

1. Inferring the trajectory of genetic variance in the course of artificial selection.

Authors: D Sorensen; R Fernando; D Gianola
Journal: Genet Res Date: 2001-02 Impact factor: 1.588

2. Comparing estimates of genetic variance across different relationship models.

Authors: Andres Legarra
Journal: Theor Popul Biol Date: 2015-09-02 Impact factor: 1.570

3. Genome-wide genetic association of complex traits in heterogeneous stock mice.

Authors: William Valdar; Leah C Solberg; Dominique Gauguier; Stephanie Burnett; Paul Klenerman; William O Cookson; Martin S Taylor; J Nicholas P Rawlins; Richard Mott; Jonathan Flint
Journal: Nat Genet Date: 2006-07-09 Impact factor: 38.330

Review 4. Additive genetic variability and the Bayesian alphabet.

Authors: Daniel Gianola; Gustavo de los Campos; William G Hill; Eduardo Manfredi; Rohan Fernando
Journal: Genetics Date: 2009-07-20 Impact factor: 4.562

5. Dominance genetic variation contributes little to the missing heritability for human complex traits.

Authors: Zhihong Zhu; Andrew Bakshi; Anna A E Vinkhuyzen; Gibran Hemani; Sang Hong Lee; Ilja M Nolte; Jana V van Vliet-Ostaptchouk; Harold Snieder; Tonu Esko; Lili Milani; Reedik Mägi; Andres Metspalu; William G Hill; Bruce S Weir; Michael E Goddard; Peter M Visscher; Jian Yang
Journal: Am J Hum Genet Date: 2015-02-12 Impact factor: 11.025

6. Reply to Yang et al.: GCTA produces unreliable heritability estimates.

Authors: Siddharth Krishna Kumar; Marcus W Feldman; David H Rehkopf; Shripad Tuljapurkar
Journal: Proc Natl Acad Sci U S A Date: 2016-07-25 Impact factor: 11.205

7. GCTA-GREML accounts for linkage disequilibrium when estimating genetic variance from genome-wide SNPs.

Authors: Jian Yang; S Hong Lee; Naomi R Wray; Michael E Goddard; Peter M Visscher
Journal: Proc Natl Acad Sci U S A Date: 2016-07-25 Impact factor: 11.205

Review 8. Estimation and partition of heritability in human populations using whole-genome analysis methods.

Authors: Anna A E Vinkhuyzen; Naomi R Wray; Jian Yang; Michael E Goddard; Peter M Visscher
Journal: Annu Rev Genet Date: 2013-08-22 Impact factor: 16.830

9. Polygenic modeling with bayesian sparse linear mixed models.

Authors: Xiang Zhou; Peter Carbonetto; Matthew Stephens
Journal: PLoS Genet Date: 2013-02-07 Impact factor: 5.917

10. 1,135 Genomes Reveal the Global Pattern of Polymorphism in Arabidopsis thaliana.

Authors:
Journal: Cell Date: 2016-06-09 Impact factor: 41.582

3 in total

1. Accounting for Group-Specific Allele Effects and Admixture in Genomic Predictions: Theory and Experimental Evaluation in Maize.

Authors: Simon Rio; Laurence Moreau; Alain Charcosset; Tristan Mary-Huard
Journal: Genetics Date: 2020-07-17 Impact factor: 4.562

2. Partitioning gene-level contributions to complex-trait heritability by allele frequency identifies disease-relevant genes.

Authors: Kathryn S Burch; Kangcheng Hou; Yi Ding; Yifei Wang; Steven Gazal; Huwenbo Shi; Bogdan Pasaniuc
Journal: Am J Hum Genet Date: 2022-03-09 Impact factor: 11.043

3. Temporal and genomic analysis of additive genetic variance in breeding programmes.

Authors: Letícia A de C Lara; Ivan Pocrnic; Thiago de P Oliveira; R Chris Gaynor; Gregor Gorjanc
Journal: Heredity (Edinb) Date: 2021-12-15 Impact factor: 3.821

3 in total