Literature DB >> 32186692

Estimation of dynamic SNP-heritability with Bayesian Gaussian process models.

Arttu Arjas¹, Andreas Hauptmann^1,2, Mikko J Sillanpää^1,3.

Abstract

MOTIVATION: Improved DNA technology has made it practical to estimate single-nucleotide polymorphism (SNP)-heritability among distantly related individuals with unknown relationships. For growth- and development-related traits, it is meaningful to base SNP-heritability estimation on longitudinal data due to the time-dependency of the process. However, only few statistical methods have been developed so far for estimating dynamic SNP-heritability and quantifying its full uncertainty.
RESULTS: We introduce a completely tuning-free Bayesian Gaussian process (GP)-based approach for estimating dynamic variance components and heritability as their function. For parameter estimation, we use a modern Markov Chain Monte Carlo method which allows full uncertainty quantification. Several datasets are analysed and our results clearly illustrate that the 95% credible intervals of the proposed joint estimation method (which 'borrows strength' from adjacent time points) are significantly narrower than of a two-stage baseline method that first estimates the variance components at each time point independently and then performs smoothing. We compare the method with a random regression model using MTG2 and BLUPF90 software and quantitative measures indicate superior performance of our method. Results are presented for simulated and real data with up to 1000 time points. Finally, we demonstrate scalability of the proposed method for simulated data with tens of thousands of individuals.
AVAILABILITY AND IMPLEMENTATION: The C++ implementation dynBGP and simulated data are available in GitHub: https://github.com/aarjas/dynBGP. The programmes can be run in R. Real datasets are available in QTL archive: https://phenome.jax.org/centers/QTLA. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Chemical Disease Gene Species

Year: 2020 PMID： 32186692 PMCID： PMC7672693 DOI： 10.1093/bioinformatics/btaa199

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

Heritability, the proportion of phenotypic variation attributable to genetic factors, is a fundamental parameter in population and quantitative genetics (Visscher ). The aim in narrow-sense heritability estimation is to separate the variance of the trait into additive genetic and environmental variance components such that their sum equals the total variance of the trait. Generally, heritability is a population-specific parameter which can be estimated either using (i) linear mixed model (LMM) techniques (Henderson, 1984) or with (ii) multi-locus association (MLA) approaches (Sillanpää, 2011). In LMMs, information of the (additive) genetic relationships between the individuals in the population must be available, which can be determined either from known pedigree or from genomic data [single-nucleotide polymorphism (SNP)], while in MLA models, heritability can be estimated from genomic data. In LMMs, the trait variation is assumed to be controlled by a high number of small effect genes (polygenic), whereas MLA models assume that there are only few major genes behind the trait variation. We note, that in case heritability is estimated from genomic data, it is called SNP-heritability, and we will from now on refer by heritability specifically to narrow-sense SNP-heritability. It is well known that heritability of a trait may depend on measurement time, age, environmental conditions (like temperature) or size (Stinchcombe ). In all such cases, it is natural to consider and estimate heritability as a dynamic function. In particular, functional variation in heritability is motivated by the fact that environmental changes may affect the environmental variance, and genes that control traits can activate or deactivate (Bryois ), which then may affect the genetic variance. In general, this kind of dynamic modelling of biological processes is a fast-growing field thanks to modern data collection techniques, see e.g. Moore and Li and Sillanpää (2015). However, the number of statistical methods and associated (easy-to-use and publicly available) software packages to estimate dynamic heritability is still limited. To discuss this further, let us first put available methods into context. For approaches in the multivariate LMM framework we have the following options. A simple approach: considers all time points as dependent traits and estimates their trait-specific genetic variances. This multi-trait model estimates variances jointly but does not apply any smoothing over time points (Henderson and Quaas, 1976; Lee and van der Werf, 2016). Smoothing at phenotype level (Fig. 1C): fits a linear or non-linear function over phenotypic time points, and then estimates latent-trait heritability influencing each parameter of that function using univariate or multivariate LMMs (e.g. Canaza-Cayo ).

Fig. 1.

(Top panel) Continuous smoothing over time points can be induced into one of three possible layers in the LMM hierarchy: either in the variance component layer (A), genetic value layer (B) or phenotype layer (C). (Bottom panel) Our suggested LMM structure, where continuous smoothing over time points is induced on the genetic- and residual-variance components

Smoothing at breeding value level (Fig. 1B): fits a linear or non-linear function over time in genomic breeding values (and possibly residual effects, i.e. permanent environmental effects). A common approach referred to as random regression (Campbell ; Schaeffer, 2016; http://animalbiosciences.uoguelph.ca/%7Elrs/BOOKS/rrmbook.pdf), also requires estimation of the residual covariance matrix if the permanent environmental effects are not smoothed. Smoothing at variance component level (Fig. 1A): if the residual variance is also smoothed, the residual covariance matrix can be left out of the model. Such an example is given by a spline-based method He , 2017) which is restricted to only twin data. (Top panel) Continuous smoothing over time points can be induced into one of three possible layers in the LMM hierarchy: either in the variance component layer (A), genetic value layer (B) or phenotype layer (C). (Bottom panel) Our suggested LMM structure, where continuous smoothing over time points is induced on the genetic- and residual-variance components In comparison, for MLA approaches, we have the two possibilities. Smoothing at phenotype level: fits a linear or non-linear function over phenotypes, and then estimates latent-trait heritability for trend parameters (e.g. intercepts, slopes), by either using univariate or multivariate methods (see e.g. Gee ; Heuven and Janss, 2010; Li ; Sillanpää ). Smoothing at quantitative trait locus (QTL) effect level: fits a linear or non-linear function over time to QTL effects. In particular, these methods have been developed for QTL mapping of function-valued traits, but can also be applied for SNP-heritability estimation. Examples are varying-coefficient models, such as Li and Sillanpää (2013) and Vanhatalo . Considering the residual covariance structure in these models is recommended. Dynamic modelling of biological phenomena can be beneficial for a number of reasons. For instance, by improving the precision of the estimation (Li and Sillanpää, 2013). This is due to the simple fact that, as with any statistical modelling, a larger dataset leads to increased precision, and naturally the amount of data increases with the number of measurement points. More importantly, if the nature of a phenomenon is dynamic, it is only rational to model the phenomenon dynamically and exploiting information content over time. Finally, there is a heated debate around heritability estimation as to why the current analysis methods are leading to a noticeable gap (known as missing heritability) between heritability estimates from SNP and pedigree/twin data (see e.g. Eichler ; Gibson, 2012; Young, 2019). One important contributor to this is the time-dependent nature of heritability, which our method aims to address. One way to induce smoothing in the model (to model dynamic phenomena) is given by Gaussian processes (GPs). These are random processes whose degree of smoothness can be controlled by varying a set of parameters. In particular, they are appealing because of their analytical properties: with certain conjugacy structure the end result is given as an easy-to-calculate formula (Rasmussen and Williams, 2006). Furthermore, GPs have been successfully utilized in genetics before, e.g. by Vanhatalo , where associations between molecular markers and function-valued traits were studied. Also, the idea of using GPs in modelling dynamic biological processes is not new. For example, Pletcher and Geyer (1999) and Jaffrézic and Pletcher (2000) suggested that the genetic breeding values and environmental terms of the phenotype in LMM could be viewed as GPs. They also considered multivariate phenotypes by modelling the covariance function of a process through a parametric representation, which reduces the number of parameters in the model. In fact, this approach differs from ours, because we view the variance components of the LMM as GPs, which is similar to He , 2017), where variance processes were modelled with splines. In this article, we specifically consider the Bayesian framework of statistical modelling. This means that the statistical inference is not only based on the measurement data and the statistical model but also on prior assumptions and information about the subject (Gelman ). For example, a common assumption in many research fields about dynamic processes is given by their smoothness, i.e. values at neighbouring time points are expected to be closer than at time points further apart. To estimate the parameters in Bayesian models, it is common to use Markov Chain Monte Carlo (MCMC) simulation methods (Robert and Casella, 2009), but since GP models usually have a high number of highly correlated parameters, efficient MCMC sampling is difficult. Thus, in addition to the traditional Metropolis–Hastings (MH) algorithm, we use the state-of-the-art method of elliptical slice sampling (Murray ), which has shown to perform well in GP models. In addition to dynamic heritability estimation, the aim of this article is to illustrate the difference in the uncertainty of the estimates of joint and independent modelling of dependent traits. If a trait is measured longitudinally over time, the different measurements can be thought as individual traits. If there is dependence between traits, it makes sense to model them jointly. This dependence can be viewed as increased sample size because the value of one trait affects the value of the other. This article is organized as follows: In Section 2, we present our proposed model for dynamic heritability estimation based on GPs. In addition, we introduce a two-stage method that is used as a baseline. In Section 3, we evaluate our method with two simulated and two real datasets and compare with a random regression model (RRM) implemented with MTG2 (Lee and van der Werf, 2016) and BLUPF90 (Misztal ) software and ACEt R-package for estimating dynamic heritability (He ). In Section 4, we examine the obtained results, followed by a discussion in Section 5.

2 Materials and methods

We will address the dynamic estimation of heritability by two methods, a joint estimation and a separated two-stage approach which serves to illustrate why joint modelling is beneficial. Specifically, the two-stage method first estimates the posterior means of the variance components and their credible intervals at every time point individually and subsequently combines the obtained estimates into smooth curves. All models in this study are extensions of the basic LMM, defined as (Henderson, 1984). where is a measurement vector, is a matrix of fixed effects, contains the regression coefficients associated with them and N is the number of individuals. The matrix (in this case ) connects random effects with correct individuals. Here, is the genetic variance and is the genomic relationship matrix (defined in detail below). Lastly, is the error term. Since we do not have any fixed effects, we can express the centred measurement vector as . We note that the overall covariance matrix of can be written as . Narrow-sense heritability is then defined as .

2.1 Genomic relationship matrix

To separate the genetic and environmental variance components, knowledge of the genetic relations between individuals is needed. A genomic relationship matrix can be estimated from molecular marker data. VanRaden (2008) describes the procedure of constructing such a matrix. First, let be an N × M genotype matrix where N is the number of individuals and M is the number of markers. The elements in are coded as for homozygote, heterozygote and the other homozygote, respectively. Second, let p be the allele frequency of the second allele at locus i. Then, we construct a matrix with the same dimensions as and set the ith column of to . Let . Finally, we can construct a genomic relationship matrix as: To reduce the estimation error, we used a shrinkage estimated version of this matrix, defined in detail by Endelman and Jannink (2012) and implemented in the R-package rrBLUP (Endelman, 2011). We note that the estimated genomic relationship matrix is positive-semidefinite and hence might not be invertible. However, for our model formulation, this will not be a problem. This is due to the fact that we always add positive values to the diagonal of the matrix; see matrix K above.

2.2 Joint model over time points

To extend the model (1) to the case of longitudinal data, we can use the Kronecker product denoted by ⊗. The extended model defines a probability distribution for a data vector that can be written as where contains all breeding values and contains all error terms from individual time points consecutively and T is the number of time points. The overall covariance matrix for can be written as . The variance component vectors can now be estimated with a Bayesian approach by setting suitable priors for them. It is worth noting that this model structure implies that all covariances across time (within and between individuals) are ignored. For the rapid performance of the algorithm, this assumption is crucial. It is in the priors of the variance component vectors where we assume dependence over time (see Fig. 1, bottom panel). This is a fundamental difference compared with for example Pletcher and Geyer (1999) who model the dependence on the level of breeding values. The idea behind our assumption is that the dependence of neighbouring trait values induces dependence on the variance level. The priors are based on assumptions about the qualitative features of the variance component functions, namely their continuity and smoothness. Due to the aforementioned reasons, we assume that the variance values at neighbouring time points are on average closer than at time points further apart. To formulate such assumptions mathematically, one can use GPs, fully defined by a mean function and a covariance function (Rasmussen and Williams, 2006). We write and , where and . Logarithms are used to allow negative values for the processes and the zero mean can be achieved approximately by suitable translation of the data. For C and C, we use the Matérn covariance function, given by where is the magnitude parameter that controls the overall variance of the stochastic process, λ is the length scale that governs how fast the covariance drops with respect to the distance of time points, denotes the gamma function and the modified Bessel function of the second kind. The value of ν controls the mean-square differentiability of the process, which affects its smoothness. We note that for , there exists a simple form for the covariance function. In this study, we fix , which means that a sample process is once mean-square differentiable and we obtain This means the process is rather smooth, but it is also allowed to change rapidly (Rasmussen and Williams, 2006). A common problem with GPs is determining values for the hyperparameters and λ. In the Bayesian framework, one can set up separate priors for each and estimate them together with the other unknowns using MCMC. However, it is known that identifiability issues arise when simultaneously estimating both hyperparameters (Zhang, 2004). Thus, it is common to fix one of them. In our case, we fix and scale the data so that the mean of the overall variance over time points equals 2. This means the mean of the environmental and genetic variance over time points is 1 if they are equal. Because our GP model is non-parametric, we have to discretize the time axis and define a grid of points where we estimate the variance components. For simplicity, we estimate the variance components at the same locations the measurements were taken. This means that the covariance function becomes a covariance matrix . Then we have for each element in the matrix that . We note that a covariance matrix defined this way is dense and computationally heavy to operate with. Hence, we use a sparse approximation for the inverse of the covariance matrix (Roininen ), discussed in more detail in the Supplementary Material. The model can be written in hierarchical form as By Bayes’ formula, the unnormalized posterior density can be expressed as where refers to the multinormal density. In practice, with this parameterization, we observed poor mixing of the hyperparameter chains. Hence, we employed whitening (Monterrubio-Gómez ; Murray and Adams, 2010; Yu and Meng, 2011), which breaks the dependencies of the variance component processes and corresponding hyperparameters under the prior. We first note that the logarithmized variance component processes can be expressed as and , where and are both standard multivariate normally distributed. The matrix square roots are obtained straightforwardly through the approximations (see Supplementary Material). Instead of and , we now sample and . The reparametrization corresponds to the following posterior density where The hyperparameters and are set by following Monterrubio-Gómez . In particular, it is based on the idea that the length scale is identifiable between the smallest and largest distance between two time points, say . Hence we want to place most of its prior probability mass in that interval. By using the quantile function of a standard normal distribution, we can assign ∼95% of the prior mass between the interval by solving the following system of equations: We emphasize that this model specification leaves no tuning parameters to fix prior to estimation which eliminates the need for preliminary analyses and consequently saves plenty of time.

2.2.1 Parameter estimation in the joint model

To generate dependent MCMC samples from the posterior distribution of parameters and , we use the elliptical slice sampling method (Murray ). It is a rejection-free sampling algorithm and we noticed that it does perform better for high-dimensional data (more than 100 time points) than the block-update of MH algorithm. The problem with elliptical slice sampling is that the likelihood must be evaluated multiple times in a single MCMC iteration, making it quite a bit slower than MH. For the length scales, we use MH with Gaussian random walk proposals, i.e. the proposal value is sampled from the distribution , where is the current value of the parameter. The variance is adapted following Section 3 in Roberts and Rosenthal (2009) to achieve an acceptance rate of 0.44 which is considered optimal in certain settings. The four parameter subsets () are sampled in alternatingly while keeping the other values fixed. The computationally most intensive part of the algorithm is the evaluation of the likelihood function, due to the inversion and determinant computation of an NT × NT covariance matrix. Fortunately, the form of the matrix allows us to utilize some basic linear algebra to make the computations feasible. We start by examining the structure of matrix and note the block diagonal structure where the blocks consist of time point-specific covariance matrices: This means, we can express the overall log-likelihood function of the parameters as a sum of log-likelihood functions of each time point Similarly to Kang), we can now use eigen decomposition to decorrelate the measurements using a linear transformation to speed up the computations. The covariance matrix of the measurements at time t can be decomposed as , where is the eigenvector matrix of and is a diagonal matrix with and ξ being the eigenvalues of . By the orthogonality of , we have that , implicating that the elements in the transformed measurement vector are independent of each other. This reduces the log-likelihood calculation into a sum. Most importantly, since does not depend on the variance components, the transformation has to be done only once. Setting , the resulting log-likelihood can be written as The pseudocode for the algorithm can be found in the Supplementary Algorithm S1.

2.3 Two-stage method

To illustrate the benefits of the joint model, we compare it to a baseline approach. The method consists of two stages: (i) the estimation of posterior means of the variance components and their 95% credible intervals at each time point separately and (ii) combining individual estimates together by smoothing the obtained estimates from stage one. The estimation in stage one is based on the same model as in the joint estimation but for one time point where is the centred measurement vector at time t, and . As in the joint method, the data are scaled such that the overall variance over time points is 2. The covariance matrix of is and the model can be written as We note that this is a special case of the joint model presented earlier, see Equation (3). Posterior means and 95% credible intervals of the variance components at each time point are obtained from this analysis used in the next stage. Smoothing of the variance component curves over time is based on the model where contains either logarithmized posterior means, lower 95% credible interval limits or upper 95% credible interval limits estimated in stage one, is a smooth process and is an error term. is again the Matérn covariance function, defined in Equation (5). Here, is scaled to have mean zero and variance one. After discretization, the model can be written as follows: where and λ are assumed to be a priori independent. The variance parameter in the error term is given an uninformative prior and the parameters and are fixed similarly as in joint estimation method following Monterrubio-Gómez . The process is analytically integrated out of the model, but we can reconstruct it at the measurement points after the estimation of the hyperparameters (Rasmussen and Williams, 2006) by noting that

2.3.1 Parameter estimation in the two-stage method

The parameters and in stage one are estimated using a random walk MH algorithm with the same adaptation as in the joint estimation method. They are sampled at MCMC iteration i + 1 from the distributions . The proposals are accepted by MH (a single parameter at a time) conditionally on the other parameter fixed to its latest value. The parameters of the smoothing model in phase two are estimated similarly as in stage one with the same adaptation. They are sampled at iteration from the distributions . The proposals are accepted by MH (a single parameter at a time) conditionally on the other parameter fixed to its latest value. The pseudocode for both stages can be found in the Supplementary Algorithms S2 and S3.

2.4 Computational considerations

Since the log-likelihood in the joint estimation method can be expressed as a sum (Equation 9), its computation can be parallelized. We parallelized all log-likelihood calculations, gaining additional speedups. The computation times were between 24 min for the mouse activity data and 31 min for the Arabidopsis data with 300 000 MCMC iterations. The workstation used for the simulations had an AMD Ryzen Threadripper 2950X 3.5 GHz processor with 16 cores and 32 GB of RAM. The method is implemented with C++ integrated with R using the Rcpp library (Eddelbuettel and François, 2011) and Eigen (Guennebaud ) (http://eigen.tuxfamily.org) for linear algebra. The programme is available at https://github.com/aarjas/dynBGP.

3 Example analyses

To test the methods described earlier, we used four different datasets with a large number of time points. Two were simulated and the two others were real datasets from https://phenome.jax.org/centers/QTLA.

3.1 Simulated dataset

The simulated data consist of measurements from individuals from time points. First, a relationship matrix was created by generating an with independent standard normally distributed elements. The relationship matrix was then computed as . A small number was added to the diagonal to make the matrix positive definite. To simulate realistic data, the longitudinal dependencies need be taken into account as well. This was done by computing a GP matrix where the row and column intersection was set to . The genetic and environmental components of the data were simulated from the distributions , respectively. Finally, the components were scaled with the corresponding variances at each time point and summed. A correct relationship matrix was assumed known in the analysis stage. To demonstrate scalability of the approach, we created larger datasets with up to 50 000 individuals and 1000 time points (see Supplementary Fig. S5).

3.2 Arabidopsis thaliana dataset

The second dataset contains A.thaliana root tip angle measurements (Moore ). The population consists of recombinant inbred lines of Arabidopsis seeds with 234 markers. The seeds were placed on Petri dishes that were held in front of a camera and rotated 90° so that the roots grew parallel to the ground. The root tip angle was measured every 2 min for 8 h resulting in time points. Moore describe the measurement process in more detail. We used a version of the data where the phenotype values were averages of multiple individuals representing the same line.

3.3 Mouse activity dataset

The third dataset consists of mouse activity measurements (Xiong ). The activity of mouses was monitored over a period of days. One day was divided into 6-min intervals and an active state probability of a given mouse in each interval was calculated based on the days of data. This results in time points. Since the outcome is a probability, normality assumption is not realistic. Hence, we transformed the phenotypic data using logit-transformation following the procedure described by Li and Sillanpää (2013). The data also have measurements of 251 markers.

3.4 Comparison with ACEt

We also compared our joint estimation method with the freely available ACEt R-package for estimating dynamic heritability (He , 2017). The method is restricted only for twin data and it uses splines in defining the different dynamic variance components. The ACEt model also includes a common environmental effect as a third variance component, which our model lacks. For uncertainty quantification, ACEt uses delta-method or bootstrap. For the method comparison, we used a simulated twin dataset that comes with the ACEt R-package. It consists of 100 twin pairs, half of who are monozygotic and half of who are dizygotic. The individuals have equispaced measurements of an artificial trait over 50 years. In twin models, the covariances in the relationship matrix between a monozygotic twin pair and a dizygotic twin pair are assumed to be 1 and 0.5, respectively, while the diagonal values are all 1. Between-pair covariance is assumed to be 0.

3.5 Completing missing genotype data by imputation

There were no missing phenotype values in any of the datasets. The mouse activity dataset had and the Arabidopsis dataset missing marker values. All missing values were imputed once before calculating the genomic relationship matrix with the mean value of the given marker over individuals.

4 Results

The results from the Bayesian analysis of simulated data are presented in Figure 2 for both methods. The posterior estimated variance component functions generally follow the real functions that were used to generate the data well. Perhaps the most interesting aspect about the results is the difference in the uncertainty of the estimates between the methods. The 95% credible intervals provided by the joint model are far narrower than those of the two-stage method. This is because the joint model makes use of the whole dataset, whereas the two-stage method uses the data of each time point independently.

Fig. 2.

The dynamic variance components and SNP-heritability estimated from the simulated dataset with the two different Bayesian methods. The posterior mean curves are drawn with solid lines and 95% credible intervals with dashed lines The estimated posterior mean curves and their 95% credible intervals for variance components and SNP-heritability calculated from the Arabidopsis seed data are presented for the joint model and two-stage method in Figure 3. Again, one can clearly see much wider 95% credible intervals surrounding the posterior mean curves obtained from the two-stage method. Moore have also estimated the dynamic heritability from this dataset using ANOVA. They calculated the variances within and between genetically distinct lines separately at every point in time. The curve has the same shape but the actual values are somewhat smaller than ours, peaking at 0.25, while our estimate peaks at about 0.4. This offset can be explained by different averaging of used data and in particular, their estimation lacks uncertainty estimates. Vanhatalo have analysed the same data as well and are likely using the same averaging as we do. We note that their obtained heritability estimates coincide very well with ours. However, they also lack uncertainty estimates. Interestingly, similar results were not expected, since the two analysis models differ significantly in structure and especially in terms of assumptions on the underlying genetics.

Fig. 3.

The dynamic variance components and SNP-heritability estimated from the A.thaliana dataset with the two different Bayesian methods. Posterior mean curves are drawn with solid lines and 95% credible intervals with dashed lines The posterior estimates from the mouse activity data are presented in Figure 4. This kind of data seems to be challenging for the GP models. This is mainly due to the fact that here the variance processes have very rough features along with smooth areas. Thus, the amount of smoothing needed is very different at different time points. Especially, the heritability process estimated with the two-stage method does not perform so well and produces highly oscillating features. The same data were analysed in Vanhatalo as well. In this case, their heritability estimates look quite different to ours. In particular, our methods have smoothed most rough edges, whereas their estimate has preserved them. Nevertheless, the overall shape is similar. These differences here are likely the consequence of using very different models.

Fig. 4.

The dynamic variance components and SNP-heritability estimated from the mouse activity dataset with the two different Bayesian methods. Posterior mean curves are drawn with solid lines and 95% credible intervals with dashed lines We also compared the joint model with a RRM [implemented in e.g. BLUPF90 (Misztal ), MCMCglmm (Hadfield, 2010) and MTG2 (Lee and van der Werf, 2016)] which is a well-established method for analyzing dynamic biological phenomena. We simulated 10 different datasets with the method from Section 3.1 and fitted a Legendre polynomial of degree five for both the genomic breeding values and permanent environmental effects. The model assumes heterogenous residual variances over time. A more precise model definition can be found in the Supplementary Material. We computed the mean squared error (MSE) of the estimated genetic and environmental variances as well as heritabilities, where , with being the ground truth at time the estimate at time . We used posterior mean as the point estimate for the joint method. The computed errors can be found in the Supplementary Table S1. This analysis was performed using AI-REML method in the software MTG2 (Lee and van der Werf, 2016) and it took on average 14 min. The MSE of the joint method was on average 17% smaller than of the RRM. We note that the MSE varies with the chosen degree of the Legendre polynomials and for this computation; we chose the best performing combinations with a reasonable computation time. Additionally, we fitted a Bayesian version of the same model using BLUPF90 family of programmes (Misztal ) and GIBBS2F90 in particular. This allowed us to quantify the uncertainty in the variance components estimated with the RRM. The results of the analysis and further details can be found in the Supplementary Figure S7. The 95% credible intervals obtained from the RRM are slightly wider than of the joint model. This might be due to the differences in model structures (cf. Fig. 1). The computation time for the Bayesian RRM for the simulated dataset was over 24 h, whereas for the joint method it was <30 min with 300 000 MCMC-iterations in both. Furthermore, we would like to mention that the RRM implemented in both MTG2 and GIBBS2F90 did not converge on our real dataset examples where the number of individuals is small but the number of time points high. Yet, the joint method manages to do so, i.e. the MCMC sampler converges. It is not completely clear to us why this is the case. One possibility might be the difference in assumptions: In RRMs the assumptions about the parametric shape of the breeding values and environmental effects induce some shape for the variance components. In the joint method, the prior assumptions concern only the variances. This direct modelling strategy might simplify the estimation process. Finally, the comparison of our joint model with the ACEt model for analyzing twin data is presented in the Supplementary Figure S4. The results obtained by both methods are consistent, even though ACEt considers also common environment in its model. In our method, the genetic variance component seems to have absorbed the common environmental variation, resulting in slightly higher heritability estimate than the one given by ACEt. The uncertainty estimates differ a bit between the methods, which is likely due to the common environmental component. Moreover, obtained uncertainty limits are not fully comparable because the frequentist ACEt method uses delta-method and our Bayesian method uses MCMC sampling for generating the limits. Performance of the MCMC-algorithm for the joint model can be evaluated from traceplots of the parameters (Supplementary Figs S1–S3). All of the parameter chains seem to have converged, although there is some oscillation on the variance component parameters with the lowest effective sample size.

5 Discussion and conclusions

We presented a new Bayesian method for estimating dynamic narrow-sense heritability, based on LMMs and GP priors. The method uses data from all time points at once, making it possible for the time points to ‘borrow strength’ from one another through the prior covariance structure. This property makes the resulting posterior distributions narrower compared with the second method where the variance components are estimated separately for each time point and smoothed afterwards. Another benefit of the method presented due to non-parametric smoothing using GPs is that it can handle very general functional shapes. The presented estimation method bears some similarities with random regression as both are based on LMMs, but there are essential differences that we would like to point out. Most notably, in our proposed method the smoothing is based on priors that are set for the variance component vectors, while in random regression the smoothness of the variances is induced through the assumptions made about the functional shape of the breeding values and environmental effects. Additionally, our model assumes independence of different traits (measurements made at different points in time) and ties them together with the priors. In multivariate LMMs, traits are assumed to be dependent and random regression attempts to reduce the size of the estimated covariance matrix by reparametrization. A benefit of our proposed method is the ability to quantify the uncertainty in the variance component estimates. This is also possible in Bayesian RRMs using MCMC but in frequentistic RRMs additional steps such as application of delta-method or bootstrap are needed to produce the respective confidence intervals. In addition, our method is completely free of tuning whereas in random regression the degree of the polynomial or alternatively knot points in splines must be chosen prior to estimation. In practice, we noticed that the estimation in MTG2 is really fast for a few traits but it slows down quickly as the number of traits increases. In fact, trying to run the algorithm with 1000 individuals and 100 traits causes an insufficient virtual memory error. In contrast, our algorithm still worked well with 50 000 individuals and 100 time points within a reasonable time (∼20 h). Additionally, high number of traits is not an issue either as demonstrated in the Supplementary Figure S5. We like to emphasize that the computation times grow almost linearly and hence our algorithm exhibits excellent scalability. Based on our results, we believe that random regression performs best when there is a high number of individuals and low number of time points which was not the case in our real data examples. In MTG2, the time complexity is cubic with respect to the number of time points (Lee and van der Werf, 2016), and we believe this is also the case with other RRM implementations. It is also noted by Schaeffer (2016) that Legendre polynomials might cause artefacts near the boundaries of the covariate domain and hence GPs are favoured. A limitation of our method is the inability to model gene-covariate interactions with possibly only one measurement per individual (cf. Moore ; Ni ). Theoretically, to perform such analysis, the covariate would have to be split up into discrete groups with each group containing suitably large population (cf. Robinson ). This would also result in each group having their own relationship matrix. This is not supported by the algorithm at the moment, however. Furthermore, the method presented here can also be extended for further analysis. For example, given the posterior mean estimates at time , the conditional distribution of genomic breeding values (Rasmussen and Williams, 2006), where The posterior means of genomic breeding values are not smooth functions over time (cf. Campbell ). After computing the breeding values, the SNP effects can be estimated for association mapping purposes by the back-transformation formula , where is the allele frequency of the other allele in marker defined in Section 2.1 (Bernal Rubio ). This kind of longitudinal analysis was recently performed by Campbell who first estimated the breeding values with a RRM and then used the back-transformation to solve for the SNP effects. Results were compared with an alternative single time point analysis and the authors found that the dynamic analysis recovered more significant associations. The presented model can also be extended by adding fixed environmental effects that affect the phenotype values. This could further reduce the estimation error. Another strategy is to apply a two-stage pre-correction similar to He . This means that a linear regression model is first fitted to estimate the fixed effect coefficients and the residuals of that model are then used as phenotype values to estimate the variance components in our LMM. Further extension would be to consider non-Gaussian longitudinal phenotypes, e.g. binary or count data. In case of binary data, an extra latent-trait layer can be added to the model (Albert and Chib, 1993; Felsenstein, 2005; Kärkkäinen and Sillanpää, 2013). In latent-trait modelling, the binary phenotype can be modelled by considering an underlying hypothetical normally distributed latent-trait variable which gives rise to the binary trait at the observed layer. If the latent-trait variable is smaller than the pre-determined threshold, binary trait at the observed layer obtains the value zero and otherwise it obtains the value one. One future extension is also to consider multiple longitudinal quantitative traits simultaneously. Some models and methods have been presented for this purpose—to explain the variation of more than one trait simultaneously over time (Oliveira ; Sung ). To conclude, we presented a new tuning-free method for estimating dynamic heritability using a Bayesian LMM and GP priors. To estimate the parameters in the model, we use MCMC which makes the uncertainty quantification straightforward. Our results clearly illustrate that joint modelling of the data of all time points reduces the uncertainty in the estimates compared with independent modelling. Click here for additional data file.

34 in total

1. The genetic analysis of age-dependent traits: modeling the character process.

Authors: S D Pletcher; C J Geyer
Journal: Genetics Date: 1999-10 Impact factor: 4.562

Review 2. Heritability in the genomics era--concepts and misconceptions.

Authors: Peter M Visscher; William G Hill; Naomi R Wray
Journal: Nat Rev Genet Date: 2008-03-04 Impact factor: 53.242

3. A Bayesian nonparametric approach for mapping dynamic quantitative traits.

Authors: Zitong Li; Mikko J Sillanpää
Journal: Genetics Date: 2013-06-14 Impact factor: 4.562

4. On statistical methods for estimating heritability in wild populations.

Authors: Mikko J Sillanpää
Journal: Mol Ecol Date: 2011-02-17 Impact factor: 6.185

5. Genotype-covariate interaction effects and the heritability of adult body mass index.

Authors: Matthew R Robinson; Geoffrey English; Gerhard Moser; Luke R Lloyd-Jones; Marcus A Triplett; Zhihong Zhu; Ilja M Nolte; Jana V van Vliet-Ostaptchouk; Harold Snieder; Tonu Esko; Lili Milani; Reedik Mägi; Andres Metspalu; Patrik K E Magnusson; Nancy L Pedersen; Erik Ingelsson; Magnus Johannesson; Jian Yang; David Cesarini; Peter M Visscher
Journal: Nat Genet Date: 2017-07-10 Impact factor: 38.330

6. Bayesian multi-QTL mapping for growth curve parameters.

Authors: Henri C M Heuven; Luc L G Janss
Journal: BMC Proc Date: 2010-03-31

7. A linear mixed-model approach to study multivariate gene-environment interactions.

Authors: Rachel Moore; Francesco Paolo Casale; Marc Jan Bonder; Danilo Horta; Lude Franke; Inês Barroso; Oliver Stegle
Journal: Nat Genet Date: 2018-11-26 Impact factor: 38.330

8. Utilizing random regression models for genomic prediction of a longitudinal trait derived from high-throughput phenotyping.

Authors: Malachy Campbell; Harkamal Walia; Gota Morota
Journal: Plant Direct Date: 2018-09-10