Giovanny Covarrubias-Pazaran
Abstract
Most traits of agronomic importance are quantitative in nature, and genetic markers have been used for decades to dissect such traits. Recently, genomic selection has earned attention as next-generation sequencing technologies became feasible for major and minor crops. Mixed models have become a key tool for fitting genomic selection models, but most current genomic selection software can only include a single variance component other than the error, making hybrid prediction using additive, dominance and epistatic effects infeasible for species displaying heterotic effects. Moreover, likelihood-based software for fitting mixed models with multiple random effects that allows the user to specify the variance-covariance structure of random effects has not been fully exploited. A new open-source R package called sommer is presented to facilitate the use of mixed models for genomic selection and hybrid prediction purposes using more than one variance component and allowing specification of covariance structures. The use of sommer for genomic prediction is demonstrated through several examples using maize and wheat genotypic and phenotypic data. At its core, the program contains three algorithms for estimating variance components: Average Information (AI), Expectation-Maximization (EM) and Efficient Mixed Model Association (EMMA). Kernels for calculating the additive, dominance and epistatic relationship matrices are included, along with other useful functions for genomic analysis. Results from sommer were comparable to other software, but the analysis was faster than Bayesian counterparts by a margin of hours to days. In addition, the ability to deal with missing data, combined with greater flexibility and speed than other REML-based software, was achieved by putting together some of the most efficient algorithms to fit models in a user-friendly environment such as R.
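As a worked illustration of what "more than one variance component" means here, a mixed model with additive, dominance and epistatic random effects can be written in the standard form below. The notation is an assumption for illustration and is not taken from the article: $y$ is the phenotype vector, $X$ and $Z$ are incidence matrices, $A$, $D$ and $E$ are the additive, dominance and epistatic relationship matrices, and each $\sigma^2$ is one variance component.

$$
y = X\beta + Z u_A + Z u_D + Z u_E + \varepsilon,
\qquad
u_A \sim N(0, \sigma^2_A A),\;
u_D \sim N(0, \sigma^2_D D),\;
u_E \sim N(0, \sigma^2_E E),\;
\varepsilon \sim N(0, \sigma^2_e I)
$$

$$
\mathrm{Var}(y) = \sigma^2_A\, Z A Z^{\top} + \sigma^2_D\, Z D Z^{\top} + \sigma^2_E\, Z E Z^{\top} + \sigma^2_e I
$$

Software limited to a single variance component other than the error can estimate only one of $\sigma^2_A$, $\sigma^2_D$, $\sigma^2_E$ alongside $\sigma^2_e$.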
Year: 2016 PMID: 27271781 PMCID: PMC4894563 DOI: 10.1371/journal.pone.0156744
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Examples of genome-assisted prediction performed using sommer.
In the first row of the figure, genomic prediction of general performance (cross performance prediction) is summarized; in the 1st square, the model to predict crosses with a single additive kernel (wheat example) is depicted. In the 2nd square the prediction including additive, dominance and epistatic kernels is shown (single cross hybrid example). In the second row, genomic prediction of specific performance (within cross performance prediction) is shown; in the 1st square, prediction within a biparental cross is shown using a single additive kernel for species displaying small or null heterotic effects whereas in the 2nd square prediction within biparental populations is shown using additive, dominance and epistatic kernels (examples are included in the package).
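To make the idea of an additive kernel concrete, the following is a minimal sketch of one widely used construction of an additive relationship matrix from markers coded -1/0/1 (a VanRaden-style cross-product of centered markers). The function name `additive_kernel` and the toy marker matrix are hypothetical, and the source does not confirm that sommer computes its kernel this way.

```python
def additive_kernel(markers):
    """Additive relationship matrix from markers coded -1/0/1.

    markers: list of rows, one row per individual, one column per marker.
    Assumes no missing genotypes (an illustrative simplification).
    """
    n = len(markers)
    m = len(markers[0])
    # Frequency of the allele counted by +1: p = (column mean + 1) / 2.
    freqs = [(sum(row[j] for row in markers) / n + 1.0) / 2.0 for j in range(m)]
    # Center each marker column by its expected value 2p - 1.
    centered = [[row[j] - (2.0 * freqs[j] - 1.0) for j in range(m)]
                for row in markers]
    # Scale by 2 * sum(p * (1 - p)) so the matrix is on a relationship scale.
    scale = 2.0 * sum(p * (1.0 - p) for p in freqs)
    if scale == 0.0:
        raise ValueError("all markers are monomorphic; kernel is undefined")
    return [[sum(a * b for a, b in zip(ri, rk)) / scale for rk in centered]
            for ri in centered]


# Toy example: 4 individuals genotyped at 3 markers (hypothetical data).
toy_markers = [[1, 0, -1], [-1, 0, 1], [0, 1, 0], [0, -1, 0]]
G = additive_kernel(toy_markers)
```

The resulting matrix is symmetric and would play the role of the additive kernel in a model with a single additive variance component; dominance and epistatic kernels need their own marker codings and are not sketched here.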
Cross-validation of prediction accuracies using sommer (5-fold) for wheat and maize populations.
| Trait | Wheat† Env1 | Wheat† Env2 | Wheat† Env3 | Wheat† Env4 | Wheat† h2 | Maize A(1) | Maize A(2) | Maize A-D(3) | Maize h2 |
|---|---|---|---|---|---|---|---|---|---|
| Accuracy grain yield | 0.51±0.09 | 0.48±0.10 | 0.38±0.10 | 0.46±0.09 | 0.21 | 0.18±0.14 | 0.21±0.16 | 0.37±0.16 | 0.18 |
| Accuracy plant height | ‡ | ‡ | ‡ | ‡ | ‡ | 0.41±0.12 | 0.43±0.13 | 0.68±0.06 | 0.62 |
Prediction accuracies for grain yield were obtained for each of the 4 mega environments available for the 599 lines of wheat. Prediction accuracies for grain yield and plant height were obtained for a maize population consisting of 100 hybrids tested in 4 locations using only additive (GCA) effects with a single variance component for both parents [A(1)], one variance component for each parent (GCA1 and GCA2) [A(2)], and additive (GCA) and dominance (SCA) effects [A-D(3)].
† Accuracy values are averages over 500 runs of a 5-fold cross validation.
‡ Trait not evaluated in wheat.
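The repeated 5-fold cross-validation behind these accuracies can be sketched as follows. This is a minimal illustration under stated assumptions, not the article's procedure: accuracy is taken to be the Pearson correlation between predicted and observed values in each held-out fold, a simple one-covariate least-squares line stands in for the genomic prediction model, and all function names are hypothetical. Whether the ± values in the table are standard deviations over runs is also an assumption.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation; an undefined correlation is treated as 0.0."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(x, y):
    """Least-squares slope and intercept (stand-in for a genomic model)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((v - mean_x) ** 2 for v in x)
    sxy = sum((v - mean_x) * (w - mean_y) for v, w in zip(x, y))
    slope = sxy / sxx if sxx != 0.0 else 0.0
    return slope, mean_y - slope * mean_x


def cv_accuracy(x, y, k, rng):
    """Mean over folds of cor(predicted, observed) for one k-fold run."""
    accuracies = []
    for test in k_fold_indices(len(y), k, rng):
        held_out = set(test)
        train = [i for i in range(len(y)) if i not in held_out]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        predicted = [intercept + slope * x[i] for i in test]
        accuracies.append(pearson(predicted, [y[i] for i in test]))
    return sum(accuracies) / k


def repeated_cv_accuracy(x, y, k=5, runs=20, seed=0):
    """Mean and sample standard deviation of accuracy over repeated runs."""
    rng = random.Random(seed)
    accs = [cv_accuracy(x, y, k, rng) for _ in range(runs)]
    mean = sum(accs) / runs
    sd = math.sqrt(sum((a - mean) ** 2 for a in accs) / (runs - 1))
    return mean, sd
```

Each run reshuffles the individuals into new folds, so averaging over many runs (500 in the table footnote) reduces the dependence of the reported accuracy on any single partition.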
Fig 2. Best linear unbiased prediction (BLUP) comparisons for general and specific combining ability effects (GCA and SCA) using sommer versus ASReml.
BLUPs for GCA and SCA related to grain yield were computed in a corn population with 400 individuals evaluated in 4 locations using sommer and ASReml. The sommer estimates are shown on the y axis and are similar to the ASReml estimates shown on the x axis.
Comparison of sommer versus the most common mixed model software available for genomic selection.
| Feature | SAS | ASReml | lme4† | rrBLUP | MCMCglmm§ | BGLR§ | regress | EMMREML | |
|---|---|---|---|---|---|---|---|---|---|
| Open source | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||
| Ability to specify var-cov structures for random effects | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Estimation of more than one variance component | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Basic expertise | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||
| Platform independent | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Ability to specify var-cov structures for residual structures | ✓ | ✓ | ✓ | ||||||
| Use of sparse methods | ✓ | ✓ | ✓ | ||||||
| Handles missing data | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
Advantages and disadvantages of each software with check marks indicating whether they possess the stated feature.
† The pedigreemm package which is an extension of lme4 allows the user to introduce pedigrees, but it does not allow the user to provide the variance-covariance matrices directly. Examples available in the pedigreemm package were run using sommer obtaining similar results, but sommer ran 4 times faster than pedigreemm. Examples are included in sommer documentation.
‡ Information not available.
§ These packages are based on Bayesian methods, which require the user to have a more advanced statistical background to choose the number of iterations and the burn-in length and to analyze trace plots; the feature was therefore defined as ‘Basic expertise’.
Fig 3. Prediction accuracy results using sommer in corn hybrids with a 5-fold cross validation.
Cross validation results for plant height in a corn population of 400 individuals evaluated in 4 locations are shown in blue, whereas grain yield cross validation results for the same population are shown in red. Each dot represents the average of one run of a 5-fold cross validation.
Time comparison among different software for the densest genomic models tested in the study.
| No. Var. Comp. | Time | sommer | ASReml§ | rrBLUP | regress | BGLR | MCMCglmm | EMMREML¶ |
|---|---|---|---|---|---|---|---|---|
| One Var. Component, N = 5000 | User | 232.34 | 438.43 | 765.27 | 1679.88 | 1104.16 | 529527.09 | 610.85 |
| One Var. Component, N = 5000 | System | 7.69 | 3.59 | 0.94 | 2.79 | 181.89 | 3715.65 | 0.18 |
| One Var. Component, N = 5000 | Elapsed | 240.04 | 442.73 | 766.43 | 1683.94 | 1291.71 | 533556.1 | 611.10 |
| Three Var. Components, N = 10585 | User | 352.71 | 6860.85 | ‡ | 1858.99 | 11712.25 | > 1058886† | 1130.92 |
| Three Var. Components, N = 10585 | System | 59.7 | 6.35 | ‡ | 3.53 | 5610.17 | > 7431† | 4.89 |
| Three Var. Components, N = 10585 | Elapsed | 425.96 | 6873.60 | ‡ | 1868.25 | 17364.27 | > 1067112† | 1136.63 |
Time consumption for a GBLUP model with a single variance component (additive) with 5,000 individuals and 10,000 markers is shown in the first block of rows. A GBLUP model with 3 variance components (additive, dominance, epistasis) with 10,585 hybrids to be predicted, genotyped with 35,432 SNPs, is displayed in the second block of rows. The two models represent the biggest population sizes used in the study to highlight the differences when big data is encountered.
‡ No more than one variance component other than the error can be estimated.
† Work stopped after 12 days running the model.
§ In both models ASReml returned a warning message of abnormal termination.
¶ Although EMMREML can handle multiple random effects, it does not return the values of the variance components and cannot handle missing data.
# Using the average information with eigen decomposition proposed by Lee et al. [22].
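The User / System / Elapsed layout of the table matches the three figures reported by timers such as R's `system.time()`, although the source does not state how the times were collected. As a rough sketch of how such a breakdown can be measured, the helper below records CPU time spent in user code, CPU time spent in the operating system, and wall-clock time around a single call. The names `timed` and `toy_workload` are hypothetical, and the workload is only a stand-in for fitting a model.

```python
import os
import time


def timed(func, *args, **kwargs):
    """Run func once and return (result, timings in seconds).

    'user' and 'system' are CPU times of the current process taken from
    os.times(); 'elapsed' is wall-clock time from time.perf_counter().
    """
    cpu_before = os.times()
    wall_before = time.perf_counter()
    result = func(*args, **kwargs)
    wall_after = time.perf_counter()
    cpu_after = os.times()
    timings = {
        "user": cpu_after.user - cpu_before.user,
        "system": cpu_after.system - cpu_before.system,
        "elapsed": wall_after - wall_before,
    }
    return result, timings


def toy_workload(n):
    """Hypothetical stand-in for a model fit: sum of squares below n."""
    return sum(i * i for i in range(n))


result, timings = timed(toy_workload, 100_000)
```

Elapsed time can exceed user plus system time when a process waits on input/output or memory, which is one reason benchmarks of this kind usually report all three figures rather than elapsed time alone.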
Fig 4. Time performance for different algorithms.
In A), the colored lines represent the likelihood-based algorithms tested for a single random effect, with population size (N) ranging from 500 to 5000 in steps of 500. In B), the colored lines represent the algorithms able to deal with multiple random effects, for a model with 3 random effects (GCA1, GCA2, and SCA) and population sizes (N) from 1000 to 8000 individuals.