Literature DB >> 29748199

GWAS by GBLUP: Single and Multimarker EMMAX and Bayes Factors, with an Example in Detection of a Major Gene for Horse Gait.

Andres Legarra¹, Anne Ricard^2,3, Luis Varona^4,5.

Abstract

Bayesian models for genomic prediction and association mapping are being increasingly used in genetics analysis of quantitative traits. Given a point estimate of variance components, the popular methods SNP-BLUP and GBLUP result in joint estimates of the effect of all markers on the analyzed trait; single and multiple marker frequentist tests (EMMAX) can be constructed from these estimates. Indeed, BLUP methods can be seen simultaneously as Bayesian or frequentist methods. So far there is no formal method to produce Bayesian statistics from GBLUP. Here we show that the Bayes Factor, a commonly admitted statistical procedure, can be computed as the ratio of two normal densities: the first, of the estimate of the marker effect over its posterior standard deviation; the second of the null hypothesis (a value of 0 over the prior standard deviation). We extend the BF to pool evidence from several markers and of several traits. A real data set that we analyze, with ours and existing methods, analyzes 630 horses genotyped for 41711 polymorphic SNPs for the trait "outcome of the qualification test" (which addresses gait, or ambling, of horses) for which a known major gene exists. In the horse data, single marker EMMAX shows a significant effect at the right place at Bonferroni level. The BF points to the same location although with low numerical values. The strength of evidence combining information from several consecutive markers increases using the BF and decreases using EMMAX, which comes from a fundamental difference in the Bayesian and frequentist schools of hypothesis testing. We conclude that our BF method complements frequentist EMMAX analyses because it provides a better pooling of evidence across markers, although its use for primary detection is unclear due to the lack of defined rejection thresholds.

Entities: Chemical Disease Gene Species

Keywords: Bayesian regression; GWAS; GenPred; Genomic Selection; QTL; Shared Data Resources; association analysis; single marker regression

Mesh：

Substances：
Genetic Markers

Year: 2018 PMID： 29748199 PMCID： PMC6027892 DOI： 10.1534/g3.118.200336

Source DB: PubMed Journal: G3 (Bethesda) ISSN： 2160-1836 Impact factor: 3.154

Bayesian models including simultaneously all marker effects are becoming very popular for GWAS analysis (Habier ; Wang , 2016; Moser ). The most frequently used prior for marker effects is the normal distribution, known as RRBLUP or SNP-BLUP (Habier ; VanRaden 2008), which is equivalent to GBLUP (VanRaden 2008), also known in the human literature as GCTA analysis (Yang ). GBLUP is simple and can be generalized to marker data missing in a large fraction of individuals in the so-called Single Step methods (Aguilar ; Christensen 2012), and also to multiple traits, or complex models (random regression, genotype by environment, etc.). Because of the equivalence of GBLUP and SNP-BLUP, it is straightforward to obtain from Single Step methods estimates of marker effects for complex traits like, e.g., multiple trait maternal effects (Lourenco ) or genotype-environment models (Jarquín ). Therefore, GWAS can be done exploiting results of GBLUP (Wang ; Dikmen ; Casiró ). Most of these works (e.g., (Wang ; Dikmen ) do not report classical statistics neither p-values, whereas standard GWAS by fixed regression “one marker at a time” (e.g., EMMAX (Kennedy ; Kang ; Teyssèdre )) yields a normal test, i.e., dividing the estimate of the marker effect by its standard error of the estimate, with associated p-values. Remarkably, Gualdrón-Duarte et al. (2014) and Bernal-Rubio proved that in (SS)GBLUP or SNP-BLUP, dividing the estimate of the marker effect by its standard error is mathematically equivalent to fixed regression EMMAX, even if markers are estimated as random effects in GBLUP and as fixed effects in EMMAX. In addition, Chen generalized the single marker EMMAX test to a multiple marker test that considers simultaneous sets of markers. In this test, signals from neighboring markers are pooled to create a single p-value measuring strength of association. This paper has two objectives. The first one is to show that, in addition to previous frequentist tests (single marker and multiple marker EMMAX) it is possible to obtain from GBLUP analysis single marker and multiple marker Bayes Factors (BF) as strength of evidence for the presence or absence of a QTL. In short, the BF is the ratio of probabilities of the data given two competing models (Kass and Raftery 1995) and has been often used in QTL mapping (Heath 1997; Varona ; Wakefield 2009, 2012; Varona 2010; Habier ; Legarra ). The BF empirically seems to provide a consistent procedure across traits and species (Legarra ). In current Markov Chain MonteCarlo implementations, computation of the BF require indicator variables for “null” or “non null” effects of markers (Habier ; Legarra ) and does not include the extensively used GBLUP and SNP-BLUP. In this work, we show how the BF can be easily computed from results of SNP-BLUP or GBLUP, for evidence of a single loci or a set of loci (possibly contiguous). The resulting BF considers correctly both the estimated effect and its incertitude, at one or several loci. The second objective is to illustrate properties of these two procedures (single and multiple marker EMMAX and BF), plus a Bayesian multiple marker regression (BayesCPi), by analysis of a challenging small horse real data set with presence of a known, yet barely significant, major gene (DMRT3) for gait.

Material and Methods

Distributions of marker effect estimates

The methods use the prior (before observing the data) and posterior (estimates and associated errors) distributions of marker effects assuming a priori multivariate normality (i.e., SNP-BLUP or GBLUP). Most theory can be found in (Gualdrón Duarte ; Bernal Rubio ; Chen ) and we include it in the Appendix for completion. We will assume throughout that variance components are known; this is a frequent assumption that allows obtaining of closed-form estimators. In particular, variance components can be estimated beforehand (e.g., by REML), or (making strong assumptions) they can be borrowed from pedigree analysis. In either case, a point estimate is used “as if” it was exact, which results in optimistic results. The main notation that we need is a vector of marker effects normally distributed with a priori mean 0 and variance , and their prediction error variance .

EMMAX tests of association from GBLUP

This section is a reminder of (Gualdrón Duarte ; Bernal Rubio ; Chen ) and we include it here for completeness.

Single marker:

The single marker EMMAX procedure is a normal test obtained, in our notation, with the statisticwhere is the marker estimate of the locus under consideration, obtained from a single SNP-BLUP evaluation (or an equivalent model), and where is the frequentist distribution (over conceptual repeated sampling of ) of the SNP-BLUP estimator of the effect . Somewhat surprisingly, the numerical value of t is the same as if a was fit as a “fixed regression” GWAS and therefore the distribution of t under the null is (Bernal Rubio ). For instance, assume that is the a priori variance of marker effects. Output of the SNP-BLUP gives an estimate of the marker effect with a standard deviation of the posterior distribution . With these numbers, the frequentist . Thus, which yields a p-value of 0.006.

Multiple marker:

Consider a subset of n markers (possibly consecutive), starting at marker i. The statistic is a quadratic form , where is the frequentist covariance of these marker effects. Chen proved that under multivariate normality the quadratic form x follows a chi-square distribution of n degrees of freedom, which yields p-values for the multiple marker EMMAX. Alternatively, derivation of the Hotelling-t squared test, that tests whether a set of correlated sample means are simultaneously different from zero, yields the same result. The previous normal test for the single marker EMMAX is also equivalent to the chi-squared test. Matrix takes into account uncertainty and collinearity of marker estimates. For instance, consider two markers with effects (similar effects) with (estimates of effects are negatively correlated because of linkage disequilibrium) and . The quadratic form has value with p-value 0.27. The evidence given by the p-value lowers because the two effects are correlated.

Bayes Factors from GBLUP

In this section we include our original derivations. There are two competing models in the BF: that the marker i with effect “has some effect” (1H: ) or «has 0 effect» (H0: ), and the BF can be written asThe BF measures whether the data is more probable under either of the hypothesis. This can be written, alternatively, aswhere is the effect of the marker. Typically, this involves a complex MCMC integration. In the particular case of multivariate normality with known variances, Varona , Varona (2010) showed that the expression (1) is equal towhere is the density of a priori evaluated at , and is the density of a posteriori evaluated at . Computation of BF using (2) is straightforward because is a normal density. In particular, is the density of knowing that there is an estimate with a certain a posteriori variance (e.g., different for each data set). In algebraic form this iswhere is the density of in the normal distribution with mean and variance . Consider the same example as before: , , . The BF is thus, in R code: dnorm(0,0,sqrt(0.2))/dnorm(0,0.5,0.05) which is 20.76 in the log10 scale. According to Kass & Raftery (1995) this is « Very Strong » evidence. Evidence from several consecutive markers in a segment can be pooled together using the BF. Expression (3) is generalized to several SNP markers (markers from i to n) as:where MVN is the density of a multivariate normal distribution and is the posterior (co)variance matrix between the marker estimates. Posterior covariance matrix , which is a submatrix of , takes into account colinearity between markers caused by LD. In this case, the BF tests whether a set of markers are all simultaneously 0, against the alternative that some of them (if not all) are different from zero. Consider the same example as before: two markers with effects , , . The BF can be computed in R as dmvnorm(c(0,0),mean = c(0,0),sigma = diag(0.2,2))/dmvnorm(ahat,mean = c(0,0),sigma = Caa) yielding a BF of 1.65 in the log10 scale, lower than the single marker analysis. In a way, this reflects that there is a confusion of marker effects.

Multiple trait:

Above methods can be easily extended to the multiple trait case. Multiple trait genomic predictions can be done from Bayesian regressions, SNP-BLUP or (Single Step) GBLUP (Tsuruta ; Jia and Jannink 2012; Maier ). Then, the EMMAX tests or the BF for several traits (and possibly markers) simultaneously is very similar to the “Several markers” case considering joint estimates of marker effects for the n traits, the a priori covariance among marker effects for the n traits , and the a posteriori covariance matrix of marker effect estimates . Vector can include either one or several markers. Typically is a function of , the genetic covariance among traits.

Data:

We used a horse real data set to explore and illustrate the properties of the procedures. We also did a limited number of simulations but we chose not to present them as this was extensively done in (Chen ). A single base polymorphism at the gene DMRT3 in chromosome 23 has a strong effect on horse ambling gaits (Andersson ). In French trotters, a SNP marker (marker BIEC2-620109 on chromosome 23 at position 22967656 bp) in strong disequilibrium with this polymorphism has a strong effect in qualification at the race (Ricard 2015; Brard and Ricard 2015). In this work we reanalyzed the same data set, which contains 630 horses and 41711 polymorphic SNP markers. The trait was “outcome of the qualification test”, with a heritability of 0.56. The major gene was not discovered in this data set, and therefore there is no bias due to discovery. We tried the following methods for GWAS: Bayes factors with the mixture model BayesCPi: (Habier ) fixing a priori that only 0.1% of the markers have an effect (see (Legarra ) for a full description). This method provides BFs, although our implementation only considers single markers. Single marker and multiple marker EMMAX tests: as presented in this work, computed via MCMC, up to segments of 100 consecutive markers. Bayes factors from SNP-BLUP: as presented in this work, computed via MCMC, up to segments of 100 consecutive markers. EMMAX was fitted using blupf90 (Misztal ) and homemade scripts, whereas the other used our software GS3 (available at https://github.com/alegarra/gs3), using “OPTION Bayes Factor”. After completion of the analysis, we produced Manhattan plots based on BF and the other statistics; for EMMAX we used Bonferroni corrections to claim genowide significance; for Bayesian procedures, we did not address thresholds for declaring detection; this point will be addressed in the discussion.

Data availability

The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article. Supplemental material available at Figshare: https://doi.org/10.25387/g3.6241928.

Results

Figure 1 shows results from the single marker association test (EMMAX), the Bayesian multi-marker mixture model GWAS (BayesCPi) and the single marker BF. All three methods point to the SNP (BIEC2-620109 at position 22967656 bp) closest and most associated to the causal gene, and the four significant markers in the EMMAX single marker regression are in LD with each other. This reproduces the results in Ricard (2015). The EMMAX yields significant p-values at the Bonferroni level. Concerning BF, a threshold of 150 (2.17 in the log10 scale) has been suggested (Legarra ), and the BayesCPi analysis in Figure 2 does reach this threshold, but this is not the case in the single marker BF using GBLUP.

Figure 1

Results (from top to bottom) of single marker regression EMMAX, Bayes Factor for BayesCPi, and Bayes Factor for SNP-BLUP. Bonferroni rejection threshold in EMMAX is 5.9.

Figure 2

Bayes factor profiles (top) and p-values (bottom) for qualification test in French trotters, chromosome 23. The location of the causal mutation marked with a red vertical bar. Bonferroni rejection threshold is 5.9.

Results (from top to bottom) of single marker regression EMMAX, Bayes Factor for BayesCPi, and Bayes Factor for SNP-BLUP. Bonferroni rejection threshold in EMMAX is 5.9. Bayes factor profiles (top) and p-values (bottom) for qualification test in French trotters, chromosome 23. The location of the causal mutation marked with a red vertical bar. Bonferroni rejection threshold is 5.9. In both analyses (BayesCPi and single marker BF), a large number of markers fall below the threshold of 0 in log10(BF), in other words, BF < 1. This means that for those regions the hypothesis that these markers have an effect is less likely that the hypothesis that they do not have an effect. Figure 2 shows that evidence of the causal gene increases when using BF across consecutive markers. Systematically, the same location (BIEC2-620109) is spotted. It can be seen that the strength of evidence increase dramatically with increasing consecutive numbers, reaching the suggested “suggestive” threshold of BF > 3 (Kass and Raftery 1995), but not the much higher threshold of 150 suggested (Legarra ). On the other hand, evidence from EMMAX does actually decrease, becomes non-significant, and, moreover, the highest peak deviates from the true location. This is at first sight a rather surprising result that will be discussed later.

Discussion

The standard test for GWAS by association analysis is the single marker association analysis (e.g., Kruglyak 1999). Association analysis can account for genetic relationships (Kennedy ), population structure (Kang ) and also to a part of individuals not genotyped (Legarra and Vitezica 2015). An alternative is to fit multiple marker simultaneously in the form of Bayesian regression (e.g., Fernando and Garrick 2013). Legarra did not see qualitative differences of Bayesian regressions and association analysis over five data sets and species, and concluded that the interest of Bayesian procedures is to complement regular association analysis. Anyway, Bayesian regression is of interest for three reasons: first, the Bayesian analysis has interesting properties of automatically accounting for multiple test, structure, unbiasedness, false discovery rate and power (Wakefield 2009; Fernando and Garrick 2013); second, genomic evaluation routinely generates marker estimates and these may be used for GWAS; third, complex models used in genetic evaluation can be considered, for instance multiple trait disease all-or-none traits (Parker Gaddis ). The analysis that we propose can be seen as an approximation to a mixture analysis such as BayesCPi. For a given marker, we ask the question: “is this marker worth being included in the model?” whereas we pretend that all the other markers are included in the model. Implicitly, the prior is of a normal distribution with a known variance for loci not being tested and a mixture of a point mass at zero and a normal distribution for the locus being tested. In a mixture model (BayesCPi and similar ones), all markers are scrutinized simultaneously, and the strength of evidence compared against the probability value that a marker should be included in the model (usually labeled as π). This is probably why the actual numbers for the BF are so different across both methods. We stress that the SNP-BLUP or GBLUP estimation is run only once, and its results are used to construct BFs for different groups of markers (consecutive or not), if desired. This BF combining information from several markers is quite different from estimating the effect of segments of alleles forming haplotypes, where a haplotype can be seen as a multiallelic marker, and where a different complete estimation must be run for each segment length. In Bayesian regression models there is a lack of unique criterion to define “relevance” of the association and of corresponding well-defined thresholds; see (Legarra ) for a description.The numerical values depend strongly on the assumed prior for marker effects (as can be seen in Figure 1). Thus, two researchers fitting, say, BayesCPi and BayesA may obtain different results. The most popular procedure for genomic evaluation and Bayesian regression is SNP-BLUP or its equivalent GBLUP, both of which assume multivariate normality of marker effects. Most often, a reasonable assumption (point estimate) on the variance of marker effects exists, by a transformation of previous estimates of genetic variance (obtained by pedigree analysis or, using the same data set, by genomic REML or similar methods). Using this point estimate underestimates noise linked to estimation of variance components. Here, we present for the first time a closed-form method to estimate BFs for association analysis based on GBLUP results, and we advocate its use. The statistical properties of the BF have been extensively discussed in the statistics literature, but for mapping causal variants it has two very few relevant properties: the BF can show evidence against and for the null hypothesis, and as data cumulates, the Bayes Factor favors the true hypothesis. Our results from real data sets show that all methods point to the right marker (the one in stronger LD with the unobserved, but known, QTL). Classical regression analysis is significant and BayesCPi yields a “strong” BF signal. However, the BF observed from SNP-BLUP is 1.07 for the truly associated marker, which is very small support. Evidence from BF increases when we extend the BF to gather evidence from several markers. A multi – SNP test captures the divergence of the posterior distribution from the 0 vector, and takes into account the posterior dependencies, due to LD, between marker estimates. This is similar to the idea of using the amount of variance explained by each genomic segment (Pérez-Enciso and Varona 2000; Hayes ; Nagamine ; Fernando and Garrick 2013). The inconvenience of these methods is mostly computational: they require to do either Restricted Maximum Likelihood (Nagamine ) or MCMC (Hayes ; Fernando and Garrick 2013) to estimate variance components, and that only the Restricted Maximum Likelihood estimation has an associated statistical test (Likelihood ratio test), for which consensual threshold exist (such as 0.05 genome-wide corrected by Bonferroni) whereas the MCMC methods use ad hoc thresholds that are less consensual. Our proposal does not require MCMC or Restricted Maximum Likelihood, but establishing a threshold for the BF is still ambiguous. An approximate method pools information from estimates of marker effects (Wang ), but this does not consider not the error in the estimation of marker effects, neither their a posteriori correlation in presence of LD. Our proposal is exact, given a point estimate of variance components but does not necessarily require Restricted Maximum Likelihood or MCMC. The interpretation of the BF in this study is as follows. There are two models, in the first (null) model all markers have 0 effect, whereas in the second (alternative) model at least one of the markers has an effect. In other words, the BF is a contrast between the region contributing, or not, to the genetic variance. When markers’ evidence is pooled across contiguous markers, the evidence for either of the two competing models increases. Strangely, in our study including more markers in multiple marker EMMAX does not reinforce evidence, contrary to the BF. This is contrary to results of Chen . The reason is possibly due to the not-too-strong linkage disequilibrium in our data set, for which p-values do not cumulate information across multiple markers. It would seem that, in our data set, it is more difficult to disprove several null hypotheses (null hypothesis in EMMAX: all markers are zero) than to prove an alternative hypothesis (alternative hypothesis in BF: some marker is different from zero).

Conclusions

We present a Bayesian method (the BF) that complements existing EMMAX methods for QTL detection using marker estimates from SNP-BLUP or (SS)GBLUP from a commonly accepted prior (multivariate normality combined with prior estimates of the genetic variance) and commonly accepted, and used, methods (SNP-BLUP and SSGBLUP). Computations are reasonable and pooling information from several markers is straightforward. Based on our real data set, single marker EMMAX is better to claim significance, whereas multiple marker BF gives a better perspective of influence of LD on the result. This is likely to be data dependent.

40 in total

1. Should we use the single nucleotide polymorphism linked to in genomic evaluation of French trotter?

Authors: S Brard; A Ricard
Journal: J Anim Sci Date: 2015-10 Impact factor: 3.159

2. Estimation of effects of single genes on quantitative traits.

Authors: B W Kennedy; M Quinton; J A van Arendonk
Journal: J Anim Sci Date: 1992-07 Impact factor: 3.159

3. Efficient methods to compute genomic predictions.

Authors: P M VanRaden
Journal: J Dairy Sci Date: 2008-11 Impact factor: 4.034

4. Multiple-trait genomic evaluation of linear type traits using genomic and phenotypic data in US Holsteins.

Authors: S Tsuruta; I Misztal; I Aguilar; T J Lawlor
Journal: J Dairy Sci Date: 2011-08 Impact factor: 4.034

5. Markov chain Monte Carlo segregation and linkage analysis for oligogenic models.

Authors: S C Heath
Journal: Am J Hum Genet Date: 1997-09 Impact factor: 11.025

6. Genomic selection for producer-recorded health event data in US dairy cattle.

Authors: K L Parker Gaddis; J B Cole; J S Clay; C Maltecca
Journal: J Dairy Sci Date: 2014-03-05 Impact factor: 4.034

7. Common SNPs explain a large proportion of the heritability for human height.

Authors: Jian Yang; Beben Benyamin; Brian P McEvoy; Scott Gordon; Anjali K Henders; Dale R Nyholt; Pamela A Madden; Andrew C Heath; Nicholas G Martin; Grant W Montgomery; Michael E Goddard; Peter M Visscher
Journal: Nat Genet Date: 2010-06-20 Impact factor: 38.330

8. Genetic evaluation using single-step genomic best linear unbiased predictor in American Angus.

Authors: D A L Lourenco; S Tsuruta; B O Fragomeni; Y Masuda; I Aguilar; A Legarra; J K Bertrand; T S Amen; L Wang; D W Moser; I Misztal
Journal: J Anim Sci Date: 2015-06 Impact factor: 3.159

9. Genetic architecture of complex traits and accuracy of genomic prediction: coat colour, milk-fat percentage, and type in Holstein cattle as contrasting model traits.

Authors: Ben J Hayes; Jennie Pryce; Amanda J Chamberlain; Phil J Bowman; Mike E Goddard
Journal: PLoS Genet Date: 2010-09-23 Impact factor: 5.917

10. A comparison of methods for whole-genome QTL mapping using dense markers in four livestock species.

Authors: Andres Legarra; Pascal Croiseau; Marie Pierre Sanchez; Simon Teyssèdre; Guillaume Sallé; Sophie Allais; Sébastien Fritz; Carole Rénée Moreno; Anne Ricard; Jean-Michel Elsen
Journal: Genet Sel Evol Date: 2015-02-12 Impact factor: 4.297

5 in total

Review 1. Application of Bayesian genomic prediction methods to genome-wide association analyses.

Authors: Anna Wolc; Jack C M Dekkers
Journal: Genet Sel Evol Date: 2022-05-13 Impact factor: 5.100

2. Semi-parametric empirical Bayes factor for genome-wide association studies.

Authors: Junji Morisawa; Takahiro Otani; Jo Nishino; Ryo Emoto; Kunihiko Takahashi; Shigeyuki Matsui
Journal: Eur J Hum Genet Date: 2021-01-25 Impact factor: 5.351

3. Emerging issues in genomic selection.

Authors: Ignacy Misztal; Ignacio Aguilar; Daniela Lourenco; Li Ma; Juan Pedro Steibel; Miguel Toro
Journal: J Anim Sci Date: 2021-06-01 Impact factor: 3.159

4. Genome-wide haplotype-based association analysis of key traits of plant lodging and architecture of maize identifies major determinants for leaf angle: hapLA4.

Authors: Carlos Maldonado; Freddy Mora; Carlos A Scapim; Marlon Coan
Journal: PLoS One Date: 2019-03-06 Impact factor: 3.240

5. Genetic Variance of Metabolomic Features and Their Relationship With Malting Quality Traits in Spring Barley.

Authors: Xiangyu Guo; Pernille Sarup; Jens Due Jensen; Jihad Orabi; Nanna Hellum Kristensen; Frans A A Mulder; Ahmed Jahoor; Just Jensen
Journal: Front Plant Sci Date: 2020-10-19 Impact factor: 5.753

5 in total