Daniel E Runcie, Sayan Mukherjee.
Abstract
Quantitative genetic studies that model complex, multivariate phenotypes are important for both evolutionary prediction and artificial selection. For example, changes in gene expression can provide insight into developmental and physiological mechanisms that link genotype and phenotype. However, classical analytical techniques are poorly suited to quantitative genetic studies of gene expression where the number of traits assayed per individual can reach many thousand. Here, we derive a Bayesian genetic sparse factor model for estimating the genetic covariance matrix (G-matrix) of high-dimensional traits, such as gene expression, in a mixed-effects model. The key idea of our model is that we need consider only G-matrices that are biologically plausible. An organism's entire phenotype is the result of processes that are modular and have limited complexity. This implies that the G-matrix will be highly structured. In particular, we assume that a limited number of intermediate traits (or factors, e.g., variations in development or physiology) control the variation in the high-dimensional phenotype, and that each of these intermediate traits is sparse, affecting only a few observed traits. The advantages of this approach are twofold. First, sparse factors are interpretable and provide biological insight into mechanisms underlying the genetic architecture. Second, enforcing sparsity helps prevent sampling errors from swamping out the true signal in high-dimensional data. We demonstrate the advantages of our model on simulated data and in an analysis of a published Drosophila melanogaster gene expression data set.
Keywords: Bayesian inference; G matrix; animal model; factor model; sparsity
Year: 2013 PMID: 23636737 PMCID: PMC3697978 DOI: 10.1534/genetics.113.151217
Source DB: PubMed Journal: Genetics ISSN: 0016-6731 Impact factor: 4.562
Simulation parameters
| Scenario | a | b | c | d | e | f | g | h | i | j |
|---|---|---|---|---|---|---|---|---|---|---|
| Parameter varied | No. factors | No. factors | No. factors | Residual type | Residual type | No. traits | No. traits | Sample size | Sample size | Sample size |
| No. traits | 100 | 100 | 100 | 100 | 100 | 20 | 1000 | 100 | 100 | 100 |
| Residual type | SF | SF | SF | F | Wishart | SF | SF | SF | SF | SF |
| No. factors | 10 | 25 | 50 | 10 | 5 | 10 | 10 | 10 | 10 | 10 |
| Factor h² > 0 (no. factors) | 0.5 (5) | 0.5 (15) | 0.5 (30) | 0.5 (5) | 1.0 (5) | 0.5 (5) | 0.5 (5) | 0.9–0.1 (5) | 0.9–0.1 (5) | 0.9–0.1 (5) |
| Factor h² = 0 (no. factors) | 0.0 (5) | 0.0 (10) | 0.0 (20) | 0.0 (5) | | 0.0 (5) | 0.0 (5) | 0.0 (5) | 0.0 (5) | 0.0 (5) |
| Sample size: no. sires | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 100 | 500 |
| Sample size: no. offspring/sire | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 5 | 10 | 10 |
Eight simulations were designed to demonstrate the capabilities of BSFG. Scenarios a–c test genetic and residual covariance matrices composed of different numbers of factors. Scenarios d–e test residual covariance matrices that are not sparse. Scenarios f–g test different numbers of traits. Scenarios h–j test different sample sizes. All simulations followed a paternal half-sib breeding design. Each simulation was run 10 times.
Sparse factor model for R. Each simulated factor loading (λ) had a 75–97% chance of equaling zero.
Factor model for R. Residual factors (those with h² = 0) were not sparse (λ ≠ 0).
R was simulated from a Wishart distribution with p + 1 degrees of freedom and inverse scale matrix . Five additional factors were each assigned a heritability of 1.0.
In each column, factors are divided between those with h² > 0 and those with h² = 0. The number in parentheses gives the number of factors with the given heritability.
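The sparse-factor scenarios in the table can be illustrated with a minimal sketch, assuming each factor has unit variance that is split into a genetic share (h²) and a residual share (1 − h²). The function name, the single zero-probability `p_zero = 0.9` (one value inside the 75–97% range quoted above), and the trait-specific residual variance of 0.2 are illustrative assumptions, not the authors' simulation code.

```python
import numpy as np

def simulate_sparse_factor_covariances(p=100, k=10, n_genetic=5, h2=0.5,
                                       p_zero=0.9, seed=0):
    """Build G and R from k shared sparse factors (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    # Each loading is zero with probability p_zero, otherwise standard normal.
    loadings = rng.normal(size=(p, k)) * (rng.random((p, k)) > p_zero)
    # The first n_genetic factors are heritable; the remaining ones have h2 = 0.
    factor_h2 = np.zeros(k)
    factor_h2[:n_genetic] = h2
    # Assumption: unit factor variance, split into h2 (genetic) and 1 - h2 (residual).
    G = loadings @ np.diag(factor_h2) @ loadings.T
    R = loadings @ np.diag(1.0 - factor_h2) @ loadings.T
    # Assumption: a small trait-specific residual variance keeps R positive definite.
    R += np.diag(np.full(p, 0.2))
    return loadings, factor_h2, G, R

loadings, factor_h2, G, R = simulate_sparse_factor_covariances()
print(np.linalg.matrix_rank(G))  # at most n_genetic, since only heritable factors enter G
```

With the defaults this mirrors the shape of scenario a (100 traits, 10 factors, 5 of them heritable with h² = 0.5); the rank of G is bounded by the number of heritable factors, which is the structure the model exploits.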
Figure 1. BSFG recovers the dominant subspace of high-dimensional G-matrices. Each subplot shows the distribution of Krzanowski’s statistics ($\sum \lambda_{S_i}$; Krzanowski 1979; Blows et al.) calculated for posterior mean estimates of G across a related set of scenarios. Plotted values are $k - \sum \lambda_{S_i}$ so that statistics are comparable across scenarios with different subspace dimensions. On this scale, identical subspaces have a value of zero and values increase as the subspaces diverge. The value of k used in each scenario is listed inside each box plot. The difference from zero roughly corresponds to the number of eigenvectors of the true subspace missing from the estimated subspace. Different parameters were varied in each set of simulations as listed below each box. (A) Increasing numbers of simulated factors. (B) Different types of R matrices. SF, a sparse-factor form for R. F, a (nonsparse) factor form for R. Wishart, R was sampled from a Wishart distribution. (C) Different numbers of traits. (D) Different numbers of sampled individuals. Note that in scenarios h–j, factor h² values ranged from 0.0 to 0.9. Complete parameter sets for each simulation are given in Table 1.
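The subspace comparison in this caption can be sketched as follows. This is a minimal illustration, assuming the standard form of Krzanowski's statistic: with A and B holding the top k eigenvectors of the true and estimated G, the statistic is the sum of eigenvalues of S = AᵀBBᵀA, and the plotted quantity is k minus that sum. The function names are hypothetical.

```python
import numpy as np

def top_eigvecs(M, k):
    """Return the k eigenvectors of symmetric M with the largest eigenvalues."""
    vals, vecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]      # indices of the k largest
    return vecs[:, order]

def krzanowski_distance(G_true, G_est, k):
    """k minus the sum of eigenvalues of S = A^T B B^T A (0 = identical subspaces)."""
    A = top_eigvecs(G_true, k)
    B = top_eigvecs(G_est, k)
    S = A.T @ B @ B.T @ A
    return k - float(np.sum(np.linalg.eigvalsh(S)))

# Two matrices whose leading two-dimensional subspaces are orthogonal.
G1 = np.diag([3.0, 2.0, 0.1, 0.05])
G2 = np.diag([0.1, 0.05, 3.0, 2.0])
d_same = krzanowski_distance(G1, G1, k=2)   # approximately 0
d_orth = krzanowski_distance(G1, G2, k=2)   # approximately 2
```

Each eigenvalue of S lies between 0 and 1, so the distance ranges from 0 (shared subspace) to k (fully orthogonal subspaces), which is why it reads roughly as a count of missing eigenvectors.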
Number of large factors recovered in each scenario
| Parameter varied | Scenario | Expected | Median | Range |
|---|---|---|---|---|
| No. factors | a | 10 | 10 | (10,10) |
| No. factors | b | 25 | 25 | (23,25) |
| No. factors | c | 50 | 49 | (48,50) |
| Residual type | d | 10 | 10 | (10,10) |
| Residual type | e | NA | 56 | (44,66) |
| No. traits | f | 10 | 9 | (8,11) |
| No. traits | g | 10 | 10 | (10,10) |
| Sample size | h | 10 | 10 | (10,10) |
| Sample size | i | 10 | 10 | (10,10) |
| Sample size | j | 10 | 10 | (10,10) |
Each scenario was simulated 10 times. Factor magnitude was calculated as the L2-norm of the factor loadings, divided by the total phenotypic variance across all traits. Factors explaining >0.1% of total phenotypic variance were considered large.
In scenario e, the residual matrix did not have a factor form.
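The "large factor" count described in the table notes can be sketched as below. This is a hedged illustration: it assumes the magnitude is the squared L2-norm of each loading column (the variance a unit-variance factor contributes) divided by the trace of the phenotypic covariance matrix P. The function name and the use of the squared norm are assumptions made for the example.

```python
import numpy as np

def count_large_factors(loadings, P, threshold=0.001):
    """Count factors whose share of total phenotypic variance exceeds threshold."""
    total_var = np.trace(P)                  # total phenotypic variance across traits
    # Assumption: squared L2-norm of each column = variance explained by that factor.
    magnitude = np.sum(loadings ** 2, axis=0) / total_var
    return int(np.sum(magnitude > threshold)), magnitude

# Toy example: one substantial factor and one negligible factor.
loadings = np.array([[1.0, 0.01],
                     [1.0, 0.00]])
P = 10.0 * np.eye(2)
n_large, magnitude = count_large_factors(loadings, P)
```

Here the first factor explains 2/20 = 10% of the total variance and the second explains far less than 0.1%, so only the first is counted as large.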
Figure 2. BSFG successfully fits trait loadings on latent factors. The estimated factors were matched to the true latent traits in each simulation by calculating the vector angle between the trait loadings of each true factor and the most similar estimated factor (column of Λ). The median error angle across factors was calculated for each simulation. Box plots show the distribution of median error angles by scenario. Two identical vectors have an angle of zero. Completely orthogonal vectors have an angle of 90°. (A) Increasing numbers of simulated factors. (B) Different types of R matrices. Angles are shown only for the genetically variable factors in scenarios d and e (factors 1–5, see Methods). (C) Different numbers of traits. (D) Different numbers of sampled individuals.
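The matching step in this caption can be sketched as follows: for each true factor, take the smallest angle to any estimated loading column, then summarize with the median. The sketch assumes that the sign of a factor is arbitrary (so the absolute cosine is used) and that no loading column is all zeros; both are assumptions of this example, and the function name is hypothetical.

```python
import numpy as np

def error_angles(true_loadings, est_loadings):
    """Angle (degrees) between each true factor and its most similar estimated factor."""
    def unit_columns(M):
        return M / np.linalg.norm(M, axis=0, keepdims=True)
    # Assumption: factor sign is not identified, so compare absolute cosines.
    cos = np.abs(unit_columns(true_loadings).T @ unit_columns(est_loadings))
    best = np.clip(cos.max(axis=1), 0.0, 1.0)   # guard against rounding above 1
    return np.degrees(np.arccos(best))

true_L = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
# Estimated columns: a flipped, rescaled copy of factor 2, a rescaled factor 1, and noise.
est_L = np.array([[0.0, 3.0, 0.0], [-2.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
angles = error_angles(true_L, est_L)
median_error = float(np.median(angles))
```

Because the angle ignores column order, scale, and sign, a perfectly recovered factor scores 0° even if the sampler returns it permuted or reflected.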
Figure 3. BSFG accurately estimates the heritability of latent traits. Distributions of factor h² estimates for scenarios h–j. These scenarios differed in the number of individuals sampled. Ten latent traits with h² values between 0.0 and 0.9 were generated in each simulation. After fitting our factor model to each simulated data set, the estimated factors were matched to the true latent traits based on the trait-loading vector angles. Each box plot shows the distribution of h² estimates for each simulated factor across 10 simulations. Note that the trait loadings for each factor differed in each simulation; only the h² values remained the same. Thin horizontal lines in each column show the simulated h² values. Colors correspond to the scenario, and solid boxes/circles are used for factors with h² > 0.0.
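Figure 4. BSFG estimates of individual trait heritability are accurate. The heritability of each individual trait was calculated as $h_i^2 = G_{ii}/P_{ii}$, the ratio of the trait's genetic variance to its phenotypic variance. The root mean squared error of the estimates across the p traits, $\mathrm{RMSE} = \sqrt{\frac{1}{p}\sum_{i=1}^{p}(\hat{h}_i^2 - h_i^2)^2}$, was calculated for each simulation. Box plots show the distribution of RMSE values for each scenario. (A) Increasing numbers of simulated factors. (B) Different types of R matrices. (C) Different numbers of traits. (D) Different numbers of sampled individuals.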
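The per-trait heritability and its RMSE can be sketched as below. This is a minimal illustration, assuming the phenotypic covariance is P = G + R, so that each trait's heritability is its diagonal entry of G divided by the corresponding diagonal entry of G + R. The function names are hypothetical.

```python
import numpy as np

def trait_heritability(G, R):
    """Per-trait heritability: genetic variance over phenotypic variance."""
    g = np.diag(G)
    # Assumption: phenotypic variance of each trait is the sum of genetic and residual variance.
    return g / (g + np.diag(R))

def rmse(estimated, true):
    """Root mean squared error between two vectors of heritabilities."""
    diff = np.asarray(estimated) - np.asarray(true)
    return float(np.sqrt(np.mean(diff ** 2)))

G = np.diag([1.0, 3.0])
R = np.diag([1.0, 1.0])
h2_true = trait_heritability(G, R)          # 1/2 and 3/4
error = rmse([0.6, 0.75], h2_true)          # one trait off by 0.1, one exact
```

Only the diagonals of G and R enter this calculation, so it measures how well trait variances are partitioned rather than how well covariances are recovered.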
Figure 5. Among-line covariance of gene expression and competitive fitness in Drosophila is modular. (A–C) Genetic (among-line) architecture of 414 gene expression traits measured in adult flies of 40 wild-caught lines (Ayroles et al.). (A) Posterior mean broad-sense heritabilities (H²) of the 414 genes. (B) Heat map of posterior mean genetic correlations among these genes. (C) Posterior mean estimates and 95% highest posterior density (HPD) intervals for genetic correlations between each gene and competitive fitness. For comparison, see Ayroles et al. (Figure 7a). (D–F) Latent trait structure underlying gene expression covariances. (D) Posterior mean H² for each estimated latent trait. (E) Heat map of posterior mean Λ matrix showing gene loadings on each latent trait. (F) Posterior mean estimates and 95% HPD intervals for genetic correlations between each latent trait and competitive fitness. The right axis of E groups genes into modules inferred using modulated modularity clustering (Ayroles et al.; Stone and Ayroles 2009).