Malachy T Campbell1, Haixiao Hu1, Trevor H Yeats1, Melanie Caffe-Treml2, Lucía Gutiérrez3, Kevin P Smith4, Mark E Sorrells1, Michael A Gore1, Jean-Luc Jannink1,5.
Abstract
Oat (Avena sativa L.) seed is a rich resource of beneficial lipids, soluble fiber, protein, and antioxidants, and is considered a healthful food for humans. Little is known regarding the genetic controllers of variation for these compounds in oat seed. We characterized natural variation in the mature seed metabolome using untargeted metabolomics on 367 diverse lines and leveraged this information to improve prediction for seed quality traits. We used a latent factor approach to define unobserved variables that may drive covariance among metabolites. One hundred latent factors were identified, of which 21% were enriched for compounds associated with lipid metabolism. Through a combination of whole-genome regression and association mapping, we show that latent factors that generate covariance for many metabolites tend to have a complex genetic architecture. Nonetheless, we recovered significant associations for 23% of the latent factors. These associations were used to inform a multi-kernel genomic prediction model, which was used to predict seed lipid and protein traits in two independent studies. Predictions for 8 of the 12 traits were significantly improved compared to genomic best linear unbiased prediction when this prediction model was informed using associations from lipid-enriched factors. This study provides new insights into variation in the oat seed metabolome and provides genomic resources for breeders to improve selection for health-promoting seed quality traits. More broadly, we outline an approach to distill high-dimensional "omics" data to a set of biologically meaningful variables and translate inferences on these data into improved breeding decisions.
Keywords: GWAS; GenPred; factor analysis; genomic prediction; metabolomics; shared data resource
Year: 2021 PMID: 33789350 PMCID: PMC8045723 DOI: 10.1093/genetics/iyaa043
Source DB: PubMed Journal: Genetics ISSN: 0016-6731 Impact factor: 4.562
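The abstract describes a multi-kernel genomic prediction model in which one kernel is built from markers associated with latent factors. This record does not say how the kernels were computed, so the following is only a minimal sketch under stated assumptions: genotypes are coded as 0/1/2 allele dosages, each kernel is a VanRaden-style relationship matrix, and the function names (`allele_freqs`, `relationship_kernel`, `split_kernels`) are hypothetical.

```python
def allele_freqs(genotypes):
    """Allele frequency per marker from a lines x markers table of 0/1/2 dosages."""
    n = len(genotypes)
    m = len(genotypes[0])
    return [sum(row[j] for row in genotypes) / (2.0 * n) for j in range(m)]


def relationship_kernel(genotypes, marker_idx):
    """VanRaden-style kernel W W' / (2 * sum p(1 - p)) over the chosen markers.

    Assumption: this scaling is used for illustration only; the study may have
    built its kernels differently.
    """
    p = allele_freqs(genotypes)
    scale = 2.0 * sum(p[j] * (1.0 - p[j]) for j in marker_idx)
    if scale == 0.0:
        raise ValueError("selected markers are all monomorphic")
    # Center each marker column by twice its allele frequency.
    centered = [[row[j] - 2.0 * p[j] for j in marker_idx] for row in genotypes]
    n = len(genotypes)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[k])) / scale for k in range(n)]
        for i in range(n)
    ]


def split_kernels(genotypes, associated):
    """Return (informed kernel, remainder kernel) for a partition of the markers."""
    assoc = sorted(set(associated))
    assoc_set = set(assoc)
    rest = [j for j in range(len(genotypes[0])) if j not in assoc_set]
    return relationship_kernel(genotypes, assoc), relationship_kernel(genotypes, rest)
```

In a multi-kernel model, each kernel would get its own variance component, so markers tied to lipid-enriched factors can carry a different share of the genetic variance than the rest of the genome.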
Figure 1. Principal component (PC) analysis of genotypic and metabolomic data. The first four PCs of genotypic data are shown in panels (A and B), while the first four PCs of the metabolomic data are shown in panels (C and D). Subpopulations that were defined based on k-means clustering of SNP marker data are indicated by different colored points. PVE, percent variance explained.
Empirical Bayes matrix factorization model selection
| EBNM Appr. | No. Fact. | LL | PVE | R² (adj.) | r | RMSE |
|---|---|---|---|---|---|---|
| | 102 | −581716.3 | 59.41 | 0.438 | 0.322 | 1.451 |
| | 106 | −583809.9 | 59.36 | 0.429 | 0.514 | 0.978 |
| | 100 | −584317.2 | 58.82 | 0.434 | 0.520 | 0.970 |
Each model was fit using deregressed BLUPs for 1668 metabolites. Ad. Shr.: adaptive shrinkage family of densities described by Stephens (2016). Cross-validation (CV) was based on a threefold orthogonal CV described by Wang and Stephens (2018) and Owen and Wang (2016) with 10 independent resamplings. Point Nor.: point-normal family of densities, a normal distribution with a point mass at zero; LL: log-likelihood; PVE: percent variance explained; R²: adjusted R²; r: Pearson’s correlation between predicted and observed values for observations in the testing set; RMSE: root mean square error.
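The threefold orthogonal CV mentioned in the note can be pictured as holding out blocks of the data matrix so that every row and every column contributes to each hold-out set. The sketch below is one reading of that design, not the cited authors' code: it assumes rows and columns are each randomly split into k groups and that fold f holds out the cells whose column group equals the row group shifted by f. The function name `orthogonal_cv_masks` is hypothetical.

```python
import random


def orthogonal_cv_masks(n_rows, n_cols, k=3, seed=0):
    """Return k sets of (row, col) cells; each set is one hold-out fold.

    Assumption: rows and columns are shuffled, dealt into k groups, and fold f
    holds out every cell where (row_group + f) % k == col_group.
    """
    rng = random.Random(seed)
    rows = list(range(n_rows))
    cols = list(range(n_cols))
    rng.shuffle(rows)
    rng.shuffle(cols)
    row_group = {r: i % k for i, r in enumerate(rows)}
    col_group = {c: i % k for i, c in enumerate(cols)}
    masks = []
    for fold in range(k):
        held_out = {
            (r, c)
            for r in range(n_rows)
            for c in range(n_cols)
            if (row_group[r] + fold) % k == col_group[c]
        }
        masks.append(held_out)
    return masks
```

Under this layout the k folds are disjoint and together cover the whole matrix, so each cell is predicted exactly once per resampling; repeating with different seeds would give independent resamplings.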
Figure 2. Functional enrichment among latent factors. Number of latent factors enriched (FDR < 0.05) for functional categories at the super-class level (A) and class level (B). Percentage of variance explained for each factor by a given functional category (C). Each point represents a functional class that was significantly enriched for one or more factors with the size of the point being proportional to the percentage of variance explained by that class for a given factor. Only factors and classes that showed significant enrichment (q < 0.05) at the super-class level are pictured. Colors differentiate between the class and subclass levels of the taxonomic hierarchy.
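The caption reports enrichment at an FDR (q-value) threshold of 0.05 but does not name the correction procedure. As an illustration only, the sketch below assumes the widely used Benjamini-Hochberg adjustment; `bh_qvalues` is a hypothetical helper name.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order.

    Assumption: BH is used here for illustration; the caption does not state
    which FDR procedure produced its q-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is monotone in rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q
```

A factor-by-category test would then count as enriched when its q-value falls below 0.05.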
Figure 3. Relationships between polygenicity, density, and heritability. (A) Association between polygenicity and density ranks after accounting for heritability (h²). Each variable was ranked from smallest to largest, and the ranks for polygenicity and density were each regressed on ranks for h². The scatter plot depicts the relationship between the residuals (Resid.) for each of these models. Colored points indicate factors that were enriched for lipids (Lip. Enr.), and different shapes indicate whether the factor was used to inform the lipid-enriched kernel for genomic prediction (Gen. Pred.). (B) Pairwise relationships between the ranks for each variable.
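The procedure behind panel (A) can be sketched as follows: rank each variable, regress the polygenicity ranks and the density ranks on the heritability ranks, and compare the two sets of residuals. This is a minimal sketch assuming simple least-squares regression with an intercept and average ranks for ties; the function names are hypothetical.

```python
def ranks(values):
    """Ranks from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            out[order[k]] = average
        pos = end + 1
    return out


def residuals(y, x):
    """Residuals from a least-squares regression of y on x with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    return [b - (mean_y + slope * (a - mean_x)) for a, b in zip(x, y)]


def rank_residual_pairs(polygenicity, density, h2):
    """Residual ranks of polygenicity and density after removing the h2 rank trend."""
    h2_ranks = ranks(h2)
    return residuals(ranks(polygenicity), h2_ranks), residuals(ranks(density), h2_ranks)
```

Plotting one residual vector against the other shows whether factors that load on many metabolites are more polygenic than their heritability alone would suggest.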
Factors capturing covariance between many metabolites with simple genetic architectures
| Factor | | | |
|---|---|---|---|
| | | 0.621 | 0.08 |
| | | 0.369 | 0.29 |
| | | 0.413 | 0.19 |
| | | 0.247 | 0.06 |
Polygenicity estimates were based on posterior means. The proportion of variance captured by significant GWAS associations and the density of factor loadings are also given for each factor.
Figure 4. Genomic prediction for fatty acid compounds. Prediction accuracy was assessed using fivefold cross-validation with 50 resampling runs. (A) The distribution of Pearson’s correlation (r) coefficients between observed phenotypes and genetic values for each fatty acid compound. Panels (B and C) show the percent difference (% diff.) in prediction accuracy for the multi-kernel (MK) approach relative to genomic BLUP (gBLUP). The suffixes “-all” and “-lip” indicate models where the biologically informed kernel was constructed from markers associated with any latent factor or lipid-enriched factors, respectively. Three hundred thirty lines used in this study were also used for factor analysis of metabolomic data.
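The evaluation scheme in this caption (fivefold cross-validation repeated 50 times, accuracy as Pearson's r, and percent difference relative to gBLUP) can be sketched as below. The caption does not define the percent difference, so the formula here, 100 × (r_model − r_gBLUP) / r_gBLUP, is an assumption, and all function names are hypothetical.

```python
import random
from math import sqrt


def repeated_kfold(n, k=5, repeats=50, seed=1):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV over n lines."""
    rng = random.Random(seed)
    for rep in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        for fold in range(k):
            test = idx[fold::k]  # every k-th shuffled line forms one fold
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            yield rep, fold, train, test


def pearson_r(x, y):
    """Pearson's correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def percent_diff(r_model, r_baseline):
    """Percent change in accuracy relative to the baseline (assumed definition)."""
    return 100.0 * (r_model - r_baseline) / r_baseline
```

For example, a multi-kernel accuracy of 0.55 against a gBLUP accuracy of 0.50 would give a percent difference of about 10 under this definition.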
Figure 5. Genomic prediction for lipid and protein content measured via NIRS. Prediction accuracy was assessed using fivefold cross-validation with 50 resampling runs. (A) The distribution of Pearson’s correlation (r) coefficients between observed phenotypes and genetic values for each trait. (B) The percent difference (% diff.) in prediction accuracy for the multi-kernel (MK) and BayesB approaches relative to genomic BLUP (gBLUP). The suffixes “-all” and “-lip” indicate models where the biologically informed kernel was constructed from markers associated with any latent factor or lipid-enriched factors, respectively. Three hundred thirty lines used in this study were also used for factor analysis of metabolomic data.