Literature DB >> 33020191

A Multiple-Trait Bayesian Variable Selection Regression Method for Integrating Phenotypic Causal Networks in Genome-Wide Association Studies.

Zigui Wang¹, Deborah Chapman¹, Gota Morota², Hao Cheng³.

Abstract

Bayesian regression methods that incorporate different mixture priors for marker effects are used in multi-trait genomic prediction. These methods can also be extended to genome-wide association studies (GWAS). In multiple-trait GWAS, incorporating the underlying causal structures among traits is essential for comprehensively understanding the relationship between genotypes and traits of interest. Therefore, we develop a GWAS methodology, SEM-Bayesian alphabet, which, by applying the structural equation model (SEM), can be used to incorporate causal structures into multi-trait Bayesian regression methods. SEM-Bayesian alphabet provides a more comprehensive understanding of the genotype-phenotype mapping than multi-trait GWAS by performing GWAS based on indirect, direct and overall marker effects. The superior performance of SEM-Bayesian alphabet was demonstrated by comparing its GWAS results with other similar multi-trait GWAS methods on real and simulated data. The software tool JWAS offers open-source routines to perform these analyses.

Entities: Chemical Disease Species

Keywords: Bayesian Regression; GWAS; GenPred; Genomic Prediction; Shared data resources; Structural Equation Models; Variable Selection

Year: 2020 PMID： 33020191 PMCID： PMC7718731 DOI： 10.1534/g3.120.401618

Source DB: PubMed Journal: G3 (Bethesda) ISSN： 2160-1836 Impact factor: 3.154

Genome-wide association studies (GWAS) are widely used to identify associations between single nucleotide polymorphisms (SNPs) and phenotypes (Ozaki ; Visscher ; McCarthy ; Cantor ). GWAS have successfully mapped quantitative trait loci (QTL) associated with traits of interest, e.g., meat quality and quantity in livestock (Sharma ), crop yields in plants (Liu and Yan 2018), and diseases in humans (Visscher ). GWAS are typically based on using linear mixed models to fit one SNP at a time to a single trait (Hackinger and Zeggini 2017). While this allows for a relatively simple statistical model, the interwoven nature of gene expression translates to many traits being correlated with each other (Sodini ). These correlations can be utilized in multi-trait linear mixed models for GWAS to reduce false positives and increase the statistical power for association mapping (O’Reilly ; Korte ). Conventional multi-trait linear mixed models do not consider the causal relationships between traits. To address this issue, researchers have proposed refining the multi-trait methods with structural equation models (SEM) introduced by Wright (1934) that consider the causal relationship among traits. A model that incorporates causal structures should better reflect underlying genetic mechanisms. Gianola and Sorensen (2004) used SEM to extend conventional multi-trait linear mixed models to accommodate for recursive and simultaneous relationships among traits, which allows traits to be explanatory variables for other traits. Recently, Momen , 2019) proposed the SEM-based GWAS (SEM-GWAS) methodology by applying SEM to linear mixed models for GWAS. They showed that while conventional GWAS methodology only provides overall SNP effects, SEM-GWAS can capture the complex causal relationships among traits and further decompose the overall SNP effects into direct and indirect effects. The SEM-GWAS method proposed by Momen , 2019) is based on linear mixed models with a fixed substitution effect for the tested SNP and a random effect with covariances defined by a “genomic relationship matrix” computed from genotypes (VanRaden 2008) to account for genetic relatedness. Markers are usually implicitly assumed to affect all traits when the ”genomic relationship matrix” is constructed in multi-trait analysis. However, this assumption is not biologically meaningful, especially in multi-trait analyses involving many traits. Cheng proposed a general class of multi-trait Bayesian variable selection regression methods that use a broad range of mixture priors, e.g., multi-trait BayesC, where each locus can affect any combination of traits, which allows us to more closely model the true biological mechanisms, e.g., pleiotropy (Cheng ). The primary goal of this current research is to develop a multi-trait Bayesian regression GWAS method that more closely resembles the underlying biological mechanisms including pleiotropy and causal structure among traits. In this paper, we develop and implement a new GWAS method called SEM-Bayesian alphabet, which integrates SEM to the multi-trait Bayesian variable selection methods, to incorporate the underlying biological mechanism. The term “Bayesian alphabet” denotes a collection of Bayesian regression models that differ in the priors adopted for marker effects (Gianola 2013). In this paper, we use SEM-BayesC, a Bayesian variable selection method, to show the utility of the SEM-Bayesian alphabet. The performance of our proposed method is studied using real and simulated data.

Materials And Methods

Multi-trait Bayesian regression model using mixture priors

Assuming that individuals have all traits measured with a general mean as the only fixed effect, we write the multi-trait model for individual i from n genotyped individuals as:where is the vector of phenotypes of t traits for individual i, is a vector of overall means for t traits, p is the number of genotyped loci, is the genotype covariate at locus j for individual i (coded as 0,1,2), is the vector of marker effects of t traits for locus j, and is the vector of residuals of t traits for individual i. The fixed effects are assigned flat priors. The residuals, , are a priori assumed to be independently and identically distributed multivariate normal vectors with null mean and covariance matrix , which is assumed to have an inverse Wishart prior distribution, . Allowing each locus to affect any combination of traits, in a multiple-trait Bayesian variable selection method, e.g., multi-trait BayesC (Cheng ), the vector of marker effects at locus j can be written as , where is a diagonal matrix whose diagonal element is , where is the indicator variable indicating whether the marker effect of locus j for trait k is zero or not, and is a priori assumed to be independently and identically distributed multivariate normal vectors with null mean and covariance matrix , which is assumed to have an inverse Wishart prior distribution, . Given that a locus can have an effect on any combination of traits, we use numeric labels to represent all possible combinations for , in which case the prior distribution for is:where is the prior probability that the vector corresponds to the vector labeled “i” and . We assume the prior for is a uniform distribution.

Structural Equation Model

The linear SEM is composed of two parts: the measurement equation analyzing the relationship between the observable variables and latent variables, and the structural equation capturing the connections among latent variables (Anderson and Gerbing 1988). These two equations can be written as:where is the vector of observable variables for individual i, is a vector of endogenous latent variables, is a vector with exogenous latent variables, and are the matrix of unknown coefficients in structural equation, is a matrix of unknown structural coefficients, and are and q × 1 vectors of residuals. The details of parameter estimation are discussed in Song and Lee (2012). In our study, no latent variables are assumed and the sole observable variables are phenotypes. Thus only the causal relationship among observable variables, i.e., phenotypes, are fitted in the SEM model (also known as path analysis (Wright 1921)) as:where and are defined as above, represents everything that is not explained by , and is an matrix of structural coefficients representing the causal structure recovered from the Inductive Causation (IC) algorithm as described in the next section. To illustrate, we assume that the phenotypes of three traits for each individual (i.e., and for traits 1, 2, and 3) have the following causal relationship:where causal coefficient represents that a 1-unit increase in trait i results in a unit increase in trait j. Given the causal structure above, the can be written as:

Searching causal structure

As described above, fitting the SEM requires the causal structure among all traits to be known before analysis. To explore the wide-range of possible underlying causal structures, we use the method from Valente to discern the causal structure based on the posterior distribution of the residual covariance matrix. The reason we do not directly apply this method to phenotype data are that the covariance among phenotypes is likely confounded by genetic effects. The process of inferring causal structure is composed of three steps: Fit the multi-trait Bayesian model and obtain the posterior distribution of the residual covariance matrix. Follow Valente to derive the conditional independence relationship among traits based on the posterior distribution of the residual covariance matrix. In detail, we derive the residual partial correlation p(, |h), where h is a set of traits, to test whether trait is conditionally independent from . The highest posterior density (HPD) interval of 0.9 was used to make statistical decisions. If HPD interval of p(, |h) contains zero, and are regarded as conditionally independent on h. Apply the IC algorithm (Pearl 2009) as described in the Appendix to the conditional independence relationship from step 2 to obtain the causal structure.

SEM-Bayes

Assume in equation (1) and follow assumptions in multi-trait BayesC, the SEM-Bayes model can be written as:Move from the right side to the left side of equation (3), and define , where is a identity matrix and is a matrix of structural coefficients based on the discerned causal structure, the model becomes:To guarantee that the structural coefficient is identifiable, we assume that the residuals for each trait of individual i are independent with each other, which means the residual covariance matrix is diagonal (Wu ; Momen . The vector of all non-zero elements in , e.g., , is assumed to have a prior distribution:where 1 is a vector of ones, is the identity matrix, and is a known mean for all elements in . is a tuning parameter to adjust the sharpness degree of the prior (Gianola and Sorensen 2004). In this paper, we set and . The priors for the remaining parameters are the same as in the section Multi-trait Bayesian regression model using mixture priors. Gibbs samplers are used to draw samples for all parameters. The full conditional distribution to draw samples for is shown below. The derivations of the full conditional distributions of the remaining parameters of interest for Gibbs samplers are in Cheng . Full conditional distribution of We follow Gianola and Sorensen (2004) to obtain the full conditional distribution of , with the difference between our derivation and Gianola and Sorensen (2004) being that we specify the causal structure with positions of parameters in the . Let denote all parameters except in the SEM-BayesC and use the causal structure as an example, the left hand side of equation (4), , can be written as:The conditional posterior distribution of can be written as:Setting , equation (5) can be written as:Following the derivation in Gianola and Sorensen (2004) and the fact that in a recursive system, the full conditional distribution of iswhere

Decomposition of SNP effects

In SEM-BayesC, the marker effect for locus j, , is considered as the vector of direct marker effect of t traits. The indirect effect of locus j of t traits can be calculated as . The overall effect of locus j on t traits is computed as or , which is the summation of both direct and indirect effect of locus j. For example, given a causal structure , the direct effect for locus j on three traits is , and the indirect effect for locus j on three traits is calculated as and the overall effect of locus j on trait k is .

Inference of association based on genomic windows

Markers in a genomic window are usually highly correlated, indicating that any single marker may not show a strong association with the trait even though a causal variant exists in the window. In this paper, we make an inference of association based on genomic windows, because multiple markers inside a genomic window may jointly capture the signal from the causal variant (Fernando and Garrick 2013; Fernando ). To make an inference of association based on genomic windows, posterior distribution for the proportion of the genetic variance explained by markers in genomic window w, , is estimated from MCMC samples of overall, direct, and indirect marker effects as follows. For one MCMC sample of all marker effects on one trait, let , , and denote direct, indirect, and overall effects of all markers respectively. The genetic value that is attributed to genomic window w is calculated as:where is a matrix of marker covariates in window w and , , and are the MCMC samples of direct, indirect, and overall marker effects for SNPs in window w. Then the variance explained by the genomic window w is defined as:Similarly, the total genetic variance is calculated as:The proportion of the genetic variance explained by direct, indirect, and overall marker effects in the genomic window w is calculated as:Given the MCMC samples of , the window posterior probability of association (WPPA) is calculated as the proportion of MCMC samples of that exceed a specific value T (Fernando and Garrick 2013; Chen ; Lloyd-Jones ). In this paper, associations are tested for non-overlapping windows of 100 SNPs, and genomic windows that explain over of the total genetic variance were deemed to be of potential interest (i.e., , where N is the total number of windows).

Data analysis

Real data:

The Rice Diversity Panel with 413 Oryza sativa individual accessions was used in the analysis. Three traits were considered, including plant height (PH), flowering time in Arkansas (FTA), and panicle number per plant (PN) in our GWAS. After removing the records with missing data for these three traits and genotype with minor allele frequency , 370 individuals with 33,519 SNPs genotyped were included in our analysis. The phenotypic and genotypic data were publicly available for download from http://www.ricediversity.org/. It has been shown that using a threshold of WPPA to declare a significant genomic window restricts the proportion of false positives (PFP) to (Fernando . A previous GWAS (Zhao identified significantly associated SNPs in chromosome 6 for flowering time in Arkansas (FTA) using the same dataset. A threshold of WPPA = 0.8 and p-value = 5 in our GWAS analysis resulted in similarly significant signals. This result suggests that a WPPA of 0.8 and p-value = 5 are reasonable for declaring a significant genomic window.

Simulated data:

To compare SEM-BayesC with SEM-GWAS of Momen , we simulated data based on real genotypes from the Rice Diversity Panel. The simulation scenarios in Chen were applied to simulate different genetic architectures. The QTL effects were generated from unit-gamma distribution (scale = 1) with three different shape parameters (γ): fewer QTL with large effects (), fewer QTL with small or large effects and many QTL with intermediate effects (), and the intermediate case (). In addition to the distribution of QTL effects, the number of QTL () may play an important role in GWAS, thus three numbers of QTL () combined with the three shape parameters were used to create 9 scenarios. For each scenario, 50 replicated populations were simulated with QTL positions randomly sampled across the genome. Trait 1 was assumed to have a causal effect on trait 2 with causal structural coefficient . The QTL effects were simulated under two scenarios (Figure 1): the QTL have direct effect on both trait 1 and trait 2 in scenario 1, where QTL only have direct effect on trait 1 in scenario 2. Half of the QTL were simulated following scenario 1, while the remaining followed scenario 2. Phenotypes for two traits were generated based on heritability of 0.5. In the simulated data analysis, the causal structure was assumed known to exclude the bias caused by searching causal structures. The structural coefficients were assumed known in SEM-GWAS.

Figure 1

: the direct effect on trait ; : the direct effect on trait ; : the indirect effect on trait . The graph shows the simulated two QTL effect simulation scenarios. Scenario 1 represents that the QTL has direct effect on both trait 1 and trait 2, whereas scenario 2 represents that the QTL only have direct effect on trait 1. We have implemented SEM-Bayesian alphabet in JWAS (Cheng ), an open-source, publicly available package for single-trait and multi-trait whole-genome analyses written in the freely available Julia language. More details can be found at https://reworkhow.github.io/JWAS.jl/latest/.

Results

Simulated data result

The performances of SEM-BayesC and SEM-GWAS were compared based on the AUC (Area under Receiver Operating Characteristic). Inference of association on genomic windows for SEM-GWAS was based on the minimum p-value (Begum ), i.e., genomic windows containing at least one significant variant are declared as significant windows. To exclude the irrelevant AUC with low levels of specificity, only the partial area under the curve up until the false positive rate of 5% (pAUC5) (Chen ; Ma ) was calculated. For the convenience of comparison, all pAUC5 measurements were rescaled such that the pAUC5 of the random classifier is equal to 1. The R package ROCR (Sing ) was used to obtain the pAUC5; the paired t-tests (p-value < 0.1) were used for comparing both the pAUC5 mean across all scenarios (overall mean comparison) and the pAUC5 mean for each level of and shape parameters γ (marginal mean comparison). The GWAS results based on overall, direct and indirect effect were shown in Table 1. For the overall effect result on trait 1, there is no significant difference between SEM-BayesC and SEM-GWAS in both overall mean comparison and marginal mean comparison. The direct effect on trait 1 is the same as the overall effect on trait 1 since the direct effect and overall effect on trait 1 are equal based on the causal structure. For the overall effect on trait 2, the pAUC5 mean of SEM-BayesC is significantly higher than that of SEM-GWAS in the overall mean comparison, and some marginal mean comparisons (e.g., and ). For the direct effect on trait 2, though higher overall mean of pAUC5 is usually observed in SEM-BayesC, there is no significant difference (p-value < 0.1) between SEM-BayesC and SEM-GWAS in both overall mean comparison and marginal mean comparison. For the indirect effect result on trait 2, similar to the overall effect result on trait 2, the pAUC5 mean of SEM-BayesC is significantly higher than that of SEM-GWAS in the overall mean comparison, and some marginal mean comparisons (e.g., and ).

Table 1

Overall and marginal mean of rescaled pAUC5 of SEM-BayesC and SEM-GWAS based on overall,direct and indirect effect

		Overall effect		Direct effect		Indirect effect
	factors	SEMBayesCП	SEM-GWAS	SEMBayesCП	SEM-GWAS	SEMBayesCП	SEM-GWAS
trait 1	n_QTL
	30	4.99	4.66	4.99	4.66	NA	NA
	90	2.00	1.91	2.00	1.91	NA	NA
	300	1.38	1.42	1.38	1.42	NA	NA
	shape (γ)
	0.18	2.71	2.52	2.71	2.52	NA	NA
	1.48	2.93	2.79	2.93	2.79	NA	NA
	3.00	2.72	2.69	2.72	2.69	NA	NA
	overall	2.79	2.66	2.79	2.66	NA	NA
trait 2	n_QTL
	30	5.49^‡	4.74^†	4.22	3.91	4.88^‡	3.15^†
	90	2.09	2.08	1.87	1.87	1.91	1.72
	300	1.34	1.38	1.30	1.33	1.28	1.30
	shape (γ)
	0.18	3.10^‡	2.80^†	2.48	2.36	2.79^‡	2.29^†
	1.48	3.03	2.84	2.56	2.38	2.63	2.46
	3.00	2.84	2.60	2.36	2.38	2.64	2.43
	overall	2.99^‡	2.74^†	2.46	2.37	2.69^‡	2.39^†

For both trait 1 and trait 2, comparisons between methods are made for different number of QTL (n) and shape parameters of QTL effects (γ). Estimations are based on 450 simulated data sets including nine scenarios discussed in Simulated data. For each effect, in each row, the values with different symbols have significantly different (p-value <0.1) pAUC5. NA represented pAUC5 was not available because the indirect effect on trait 1 does not exist based on the causal structure. The overall effect result on trait 1 is the same as the direct effect result on trait 1 because the overall effect on trait 1 equals the direct effect on trait 1 based on the causal structure.

Real data result

Causal structure and structural coefficients:

The causal structure among three traits is inferred by the IC algorithm from the estimated posterior distribution of the residual covariance matrix in the multi-trait BayesC model. Figure 2 shows three potential phenotypic causal structures among traits PH (), FTA (), and PN () recovered for the 0.9 HPD interval. The causal structure matrices for IC1 (), IC2 (), and IC3 () are:These three causal structures are fitted in the SEM-BayesC model, and samples from posterior distributions for coefficients in these causal structures are obtained. The 90% credible intervals for structural coefficients in IC1, IC2, and IC3 are shown in Table 2. It is worth noting that the causal structures in Figure 2 provides the same set of marginal and conditional independencies, extra biological knowledge is required to further infer the causal structure. In this paper, the SEM-BayesC model is proposed to incorporate (known) underlying causal structure among traits for GWAS. Thus, for the simplicity of our presentation, the causal structure is assumed known in Real Data Result section, e.g., IC1 is used to demonstrate the performance of SEM-BayesC.

Figure 2

Table 2

The 90% credible interval for causal structural coefficients in the three causal structures

λ	IC1	IC2	IC3
λ_PH→FTA	(0.30, 0.52)	NA	(0.30, 0.55)
λ_PH→PN	(-0.21, -0.02)	(-0.20, -0.01)	NA
λ_FTA→PH	NA	(0.22, 0.44)	NA
λ_PN→PH	NA	NA	(-0.27, -0.05)

PH, FTA, and PN represent traits plant height, flowering time in Arkansas, panicle number per plant, respectively. λ→ represents the causal effects of trait a on trait b. NA denotes structural coefficients those do not exist in the causal structure.

Causal structures among plant height (PH), flowering time in Arkansas (FTA), and panicle number per plant (PN) inferred from the IC algorithms. The edges connecting two traits represent non-null partial correlations as indicated by 0.9 HPD interval. The arrows represent the direction of causal effects. PH, FTA, and PN represent traits plant height, flowering time in Arkansas, panicle number per plant, respectively. λ→ represents the causal effects of trait a on trait b. NA denotes structural coefficients those do not exist in the causal structure.

Decomposition of SNP effects:

Direct, indirect, and overall SNP effects for all markers are estimated from SEM-BayesC and SEM-GWAS. In SEM-BayesC, direct SNP effects are assigned mixture priors, where each locus can affect any combinations of traits directly; samples from posterior distributions of indirect effects are obtained using joint samples from posterior distributions of and direct SNP effects . In IC1, for trait PH, the overall SNP effect is equal to the direct SNP effect, because there is no intermediate trait. For trait FTA, the overall SNP effect is composed of direct SNP effect and indirect SNP effect transmitted from PH. So the overall SNP effect for FTA is given by summing the direct SNP effect and indirect SNP effect. Similarly, for trait PN, the overall SNP effect is obtained by summing the direct SNP effect and indirect SNP effect transmitted from PH. The results of GWAS from SEM-BayesC and SEM-GWAS incorporating causal structure IC1 are shown in Figure 3. Significant signals are found only for trait FTA. SEM-BayesC adopts a threshold of WPPA = 0.8 to declare a significant genomic window and SEM-GWAS adopts a threshold of p-value = . The overall SNP effects are partitioned into direct and indirect effects, and GWAS are performed for the direct, indirect, and overall SNP effects separately for trait FTA. In Figure 3, the blue points, pink points and green points represents the significant genomic windows located in chromosome 1, chromosome 5 and chromosome 6. Window A contains SNPs from “id1000759” to “id1001229”; window B contains SNPs from “id1023967” to “id1024499”; window C contains SNPs from “id5013234” to “id5013920”; window D contains SNPs from “id6005814” to “id6006470”.

Figure 3

GWAS results based on overall, direct and indirect SNP effects from SEM-BayesC and SEM-GWAS incorporating IC1 causal structure for the trait flowering time at Arkansas (FTA). The horizontal dash line represents the threshold 0.8 or -log(). X-axis represents the location of genomic windows along the 12 chromosomes; Y-axis represents window posterior probability of association (WPPA) for SEM-BayesC and negative logarithm of the p-value (-log(p)) for SEM-GWAS. Colored points represent genomic windows with WPPA 0.8 or p-value . The blue points, pink points and green points represents the significant genomic windows located in chromosome 1, chromosome 5 and chromosome 6. For the overall effects, in the SEM-GWAS, window D achieved -log(p-value) 14.21; in the SEM-BayesC, window C achieved WPPA 0.90, window D achieved WPPA 0.88, and window A achieved WPPA 0.82. For the direct effect, in the SEM-GWAS, window D achieved -log(p-value) 12.24; in the SEM-BayesC, window C achieved WPPA 0.90, window D achieved WPPA 0.86, and window A achieved WPPA 0.80. For the indirect effect, in the SEM-GWAS, window B achieved -log(p-value) 14.65; in the SEM-BayesC, although no window is identified as significant in SEM-BayesC, a peak was observed at window B with WPPA 0.52. Further, for all three effects, the results from SEM-BayesC and SEM-GWAS are correlated (the correlation between the WPPA from SEM-BayesC and -log(p-value) from the SEM-GWAS is higher than 0.5). The correlation of indirect effect results from these two methods results achieved 0.70. Also, for both SEM-BayesC and SEM-GWAS, the overall effect is more correlated with direct effect rather than indirect effect. The magnitudes for overall, direct, and indirect SNP effect in SEM-BayesC are also shown in Figure 5. Though most large overall SNP effects consist of a large direct SNP effect and a relatively small indirect SNP effect, the indirect effect of some SNPs play an important role, e.g., the overall effect of SNP “id1024159”, as shown in Figure 5, consists of a large indirect SNP effect and relatively small direct effect.

Figure 5

Magnitude of direct, indirect and overall SNP effect from SEM-BayesC incorporating IC1 causal structure for the trait flowering time at Arkansas (FTA). X-axis represents the location of SNPs along the 12 chromosomes. Y-axis represents the magnitude of the marker effects. The blue points represents the SNP ”id1024159”. For SNP “id1024159”, the overall effect is consists of a small direct SNP effect and a relatively large indirect SNP effect.

Genetic pleiotropy in SEM-BayesC

As shown in Figure 4, the posterior distribution of the parameter is obtained, and markers show different levels of pleiotropy for direct SNP effects. In SEM-BayesC, each SNP can have direct effects on any combination of traits, and the parameter is used to estimate the proportion of SNPs having different levels of pleiotropy. Indirect SNP effects on one trait are transmitted from direct SNP effects on other intermediate traits. For example, in causal structure IC1, the indirect SNP effects on FTA is transmitted from direct SNP effects on the trait PH. A SNP having no direct effect on FTA may have overall effects on FTA if its direct effect on trait PH is non-zero. The proportions of markers affecting different combinations of traits through overall SNP effects can also be estimated. This result is shown in Figure 4, and different probabilities are observed for some cases between overall and direct SNP effects. More SNPs have effects on all traits simultaneously when overall SNP effects are considered compared to direct SNP effects (case 1). If only overall SNP effects are considered, some cases having non-zero probabilities for indirect SNP effects are hidden by the causal relationships among traits (cases 2-4). The same patterns for direct and overall SNP effects are observed in cases 5-8 because there is no causal relationship between trait FTA and PN.

Figure 4

Estimated proportion of markers affecting combinations of traits, , from SEM-BayesC incorporating IC1 causal structure. For each scenario, it is estimated for both direct SNP effects (left) and overall SNP effects (right).

Discussion

The complex causal relationships among multiple traits are usually not considered in conventional multi-trait GWAS. Here we propose the SEM-Bayesian alphabet method to incorporate pre-inferred causal structures among multiple traits into multi-trait Bayesian regression methods. SEM-Bayesian alphabet accounts for causal structures among traits, and has the potential advantage of estimating causal effects, providing genomic window-based inference, as well as providing a comprehensive understanding of the underlying biological mechanism.

GWAS

To show the potential utility of SEM-Bayesian alphabet, simulated data were used to compare SEM-BayesC with SEM-GWAS. A wide variety of potential genomic architectures were constructed by the combination of different levels of skewness of gamma distribution for QTL effects (γ) and different numbers of QTL() (Chen . BayesC was also performed to estimate overall SNP effects on the same datasets, and similar results as those in the SEM-BayesC were obtained (not shown in this paper). This is reasonable since the SEM-Bayesian alphabet model can be reduced to a model similar to Bayesian regression by reparameterization, indicating that the joint likelihood functions of SEM-Bayesian alphabet and Bayesian regression are similar. Compared to Bayesian regression, the SEM-Bayesian alphabet provides a more comprehensive understanding of the underlying biological mechanism by decomposing overall SNP effects into direct and indirect SNP effects (Figures 3, 4 and 5). Magnitude of direct, indirect and overall SNP effect from SEM-BayesC incorporating IC1 causal structure for the trait flowering time at Arkansas (FTA). X-axis represents the location of SNPs along the 12 chromosomes. Y-axis represents the magnitude of the marker effects. The blue points represents the SNP ”id1024159”. For SNP “id1024159”, the overall effect is consists of a small direct SNP effect and a relatively large indirect SNP effect. The comparison between SEM-BayesC and SEM-GWAS using simulated data were shown in Table 1. As shown in our results, SEM-BayesC has relatively the same or higher pAUC5 than SEM-GWAS in all simulation scenarios. In some scenarios, SEM-BayesC has significantly higher pAUC5 than SEM-GWAS. For example, when one trait is affected by few QTL of large effects (e.g., n = 30,), SEM-BayesC has significantly higher pAUC5 than SEM-GWAS to infer indirect and overall effects. Though significant difference is not observed for direct effect, higher overall mean of pAUC5 is usually observed in SEM-BayesC.

Causal structure

The causal structure is assumed to be known in SEM-Bayesian alphabet, and it is usually discerned by three types of algorithms: the constraint-based algorithm, the score-based algorithms, and the hybrid algorithms. The IC algorithm (Pearl 2009; Valente ) used in this paper is a typical constraint-based algorithm, which is based on conditional independence tests. The score-based algorithms apply the heuristic optimization techniques, which set an initial graph structure and assign an initial goodness-of-fit score to it, and then maximize the goodness-of-fit score to obtain the most possible causal structure. The hybrid algorithm is a hybrid of both the constraint-based and the score-based algorithms. It utilizes conditional independence tests to reduce the space of candidate causal structures, and uses network scores to identify the optimal structure among them (Scutari 2014). The causal structures inferred from these algorithms may be different. Note that different evaluation criteria may also result in different outcome causal structures. For example, in this paper, if we choose 0.99 instead of 0.9 HPD interval to search for causal structures, there will be no edge between the traits PH and PN. In some previous analysis (Mi ; Momen , 2019), the indirect SNP effect of locus j of t traits is obtained by multiplying the estimated , , and estimated direct SNP effects, , as . This is similar to using posterior means of causal structural coefficients and direct SNP effects for calculation of the indirect SNP effects. In our method, indirect SNP effects are estimated using joint samples from posterior distributions of and . We compared these two approaches for indirect SNP effect estimation on real rice data, and found that the indirect effects estimated from these two approaches are slightly different. The SEM-BayesC approach should be used in indirect SNP effect estimation due to the fact that and may be highly dependent.

Conclusion

SEM-Bayesian alphabet provides more interpretation into biological mechanisms than Bayesian regression methods by decomposing the overall SNP effects into direct and indirect SNP effects. In SEM-Bayesian alphabet, posterior distributions of the overall, direct, and indirect SNP effects, as well as causal structure coefficients, are obtained, which are used to make inferences about these parameters. Compared to the typical GWAS method incorporating causal structure among multiple traits, such as SEM-GWAS, SEM-Bayesian alphabet obtains the posterior distributions for the proportion of variance attributed to a genomic region to detect causal loci (i.e., the use of WPPA). The level of gene pleiotropy, e.g., proportion of markers affecting different combinations of traits as shown in Figure 4, can also be further dissected into direct and indirect SNP effects. Also, with estimating structural coefficients, SEM-Bayesian alphabet still has relatively same or greater pAUC5 than SEM-GWAS in all scenarios of simulated data. In summary, SEM-Bayesian alphabet offers a more comprehensive understanding of the underlying biological mechanisms including pleiotropy and causal relationships among traits than conventional GWAS, as well as has a potential advantage in the GWAS inference than other GWAS considering complex causal effect among multiple traits.

27 in total

1. ROCR: visualizing classifier performance in R.

Authors: Tobias Sing; Oliver Sander; Niko Beerenwinkel; Thomas Lengauer
Journal: Bioinformatics Date: 2005-08-11 Impact factor: 6.937

Review 2. Genome-wide association studies for complex traits: consensus, uncertainty and challenges.

Authors: Mark I McCarthy; Gonçalo R Abecasis; Lon R Cardon; David B Goldstein; Julian Little; John P A Ioannidis; Joel N Hirschhorn
Journal: Nat Rev Genet Date: 2008-05 Impact factor: 53.242

3. Genome-Wide Association Analyses Based on Broadly Different Specifications for Prior Distributions, Genomic Windows, and Estimation Methods.

Authors: Chunyu Chen; Juan P Steibel; Robert J Tempelman
Journal: Genetics Date: 2017-06-21 Impact factor: 4.562

4. Priors in whole-genome regression: the bayesian alphabet returns.

Authors: Daniel Gianola
Journal: Genetics Date: 2013-05-01 Impact factor: 4.562

5. Genomic Prediction from Multiple-Trait Bayesian Regression Methods Using Mixture Priors.

Authors: Hao Cheng; Kadir Kizilkaya; Jian Zeng; Dorian Garrick; Rohan Fernando
Journal: Genetics Date: 2018-03-07 Impact factor: 4.562

6. Functional SNPs in the lymphotoxin-alpha gene that are associated with susceptibility to myocardial infarction.

Authors: Kouichi Ozaki; Yozo Ohnishi; Aritoshi Iida; Akihiko Sekine; Ryo Yamada; Tatsuhiko Tsunoda; Hiroshi Sato; Hideyuki Sato; Masatsugu Hori; Yusuke Nakamura; Toshihiro Tanaka
Journal: Nat Genet Date: 2002-11-11 Impact factor: 38.330

Review 7. 10 Years of GWAS Discovery: Biology, Function, and Translation.

Authors: Peter M Visscher; Naomi R Wray; Qian Zhang; Pamela Sklar; Mark I McCarthy; Matthew A Brown; Jian Yang
Journal: Am J Hum Genet Date: 2017-07-06 Impact factor: 11.025

Review 8. Statistical methods to detect pleiotropy in human complex traits.

Authors: Sophie Hackinger; Eleftheria Zeggini
Journal: Open Biol Date: 2017-11 Impact factor: 6.411

9. Comparison of Genotypic and Phenotypic Correlations: Cheverud's Conjecture in Humans.

Authors: Sebastian M Sodini; Kathryn E Kemper; Naomi R Wray; Maciej Trzaskowski
Journal: Genetics Date: 2018-05-08 Impact factor: 4.562

10. Including Phenotypic Causal Networks in Genome-Wide Association Studies Using Mixed Effects Structural Equation Models.

Authors: Mehdi Momen; Ahmad Ayatollahi Mehrgardi; Mahmoud Amiri Roudbar; Andreas Kranis; Renan Mercuri Pinto; Bruno D Valente; Gota Morota; Guilherme J M Rosa; Daniel Gianola
Journal: Front Genet Date: 2018-10-09 Impact factor: 4.599

3 in total