Osval A Montesinos-López, Abelardo Montesinos-López, Bernabe Cano-Paez, Carlos Moisés Hernández-Suárez, Pedro C Santana-Mancilla, José Crossa.
Abstract
Genomic selection (GS) has changed the way plant breeders select genotypes. GS uses phenotypic and genotypic information to train a statistical machine learning model, which is then used to predict phenotypic (or breeding) values of new lines for which only genotypic information is available. Many statistical machine learning methods have been proposed for this task. Multi-trait (MT) genomic prediction models take advantage of correlated traits to improve prediction accuracy, and so some multivariate statistical machine learning methods are popular for GS. In this paper, we compare the prediction performance of three MT methods: the MT genomic best linear unbiased predictor (GBLUP), the MT partial least squares (PLS) and the MT random forest (RF) methods. Benchmarking was performed with six real datasets. We found that the three methods produce similar results, but with predictors that include genotype (G) and environment (E), that is, E + G, the MT GBLUP achieved superior performance, whereas with predictors E + G + genotype × environment (GE) and G + GE, random forest achieved the best results. We also found that the best predictions were achieved with the predictors E + G and E + G + GE. We also provide the R code for implementing these three statistical machine learning methods in the sparse kernel method (SKM) library, which offers options not only for single-trait prediction with various statistical machine learning methods but also for MT prediction, which can help to better capture the complex patterns that are common in genomic selection datasets.
Keywords: genomic selection; multi-environment; multi-trait; plant breeding; statistical machine learning
Year: 2022 PMID: 36011405 PMCID: PMC9407886 DOI: 10.3390/genes13081494
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.141
Figure 1. Prediction performance for each environment and across environments (Global) of dataset 1 (Indica) in terms of normalized root mean square error (NRMSE) under four predictors (G, genotypic information; E + G, environment plus genotypic information; E + G + GE, environment plus genotypic plus genotype by environment interaction information; and G + GE, genotypic plus genotype by environment interaction information) under the sevenfold cross-validation (CV) scheme.
Variance components (variance) and heritability estimates for dataset 1 (Indica) for each trait. CV denotes coefficient of variation, and Locs denotes the average number of locations.
| Trait | Component | Variance | Heritability | CV | Locs |
|---|---|---|---|---|---|
| GY | Loc:Hybrid | 394520.80 | 0.47 | 0.11 | 3 |
| GY | Hybrid | 361259.05 | 0.47 | 0.11 | 3 |
| GY | Loc | 496020.35 | 0.47 | 0.11 | 3 |
| GY | Residual | 336143.27 | 0.47 | 0.11 | 3 |
| PHR | Loc:Hybrid | 2.33 | 0.69 | 0.05 | 3 |
| PHR | Hybrid | 3.74 | 0.69 | 0.05 | 3 |
| PHR | Loc | 0.05 | 0.69 | 0.05 | 3 |
| PHR | Residual | 2.65 | 0.69 | 0.05 | 3 |
| GC | Loc:Hybrid | 1.73 | 0.54 | 0.63 | 3 |
| GC | Hybrid | 1.48 | 0.54 | 0.63 | 3 |
| GC | Loc | 0.06 | 0.54 | 0.63 | 3 |
| GC | Residual | 1.96 | 0.54 | 0.63 | 3 |
| PH | Loc:Hybrid | 1.96 | 0.76 | 0.06 | 3 |
| PH | Hybrid | 9.66 | 0.76 | 0.06 | 3 |
| PH | Loc | 4.81 | 0.76 | 0.06 | 3 |
| PH | Residual | 2.58 | 0.76 | 0.06 | 3 |
Prediction performance for each environment and across environments (Global) of dataset 1 (Indica) in terms of normalized root mean square error (NRMSE) and relative efficiency (RE) under four predictors (G, genotypic information; E + G, environment plus genotypic information; E + G + GE, environment plus genotypic plus genotype by environment interaction information; and G + GE, genotypic plus genotype by environment interaction information) under sevenfold cross-validation. NRMSE_GBLUP, NRMSE_PLS and NRMSE_RF denote the NRMSE under the GBLUP, PLS and random forest models, respectively. RE_PLS and RE_RF denote the RE calculated with the NRMSE of the PLS and random forest models, respectively. RE was calculated by dividing the NRMSE of the GBLUP model by the NRMSE of the PLS or random forest model; that is, the GBLUP model was considered the reference model. Because a lower NRMSE is better, RE values below 1 favor the GBLUP model.
| Data | Predictor | Env | NRMSE_GBLUP | NRMSE_PLS | NRMSE_RF | RE_PLS | RE_RF |
|---|---|---|---|---|---|---|---|
| Indica | G | 2010 | 0.892 | 0.981 | 0.907 | 0.909 | 0.984 |
| Indica | G | 2011 | 1.040 | 1.079 | 1.046 | 0.964 | 0.994 |
| Indica | G | 2012 | 0.917 | 1.001 | 0.966 | 0.916 | 0.949 |
| Indica | G | Global | 0.880 | 0.948 | 0.900 | 0.928 | 0.978 |
| Indica | E + G | 2010 | 0.876 | 0.969 | 0.853 | 0.904 | 1.027 |
| Indica | E + G | 2011 | 0.924 | 0.961 | 0.900 | 0.962 | 1.027 |
| Indica | E + G | 2012 | 0.839 | 0.901 | 0.836 | 0.931 | 1.004 |
| Indica | E + G | Global | 0.817 | 0.884 | 0.810 | 0.925 | 1.009 |
| Indica | E + G + GE | 2010 | 0.861 | 0.959 | 0.869 | 0.898 | 0.991 |
| Indica | E + G + GE | 2011 | 0.901 | 0.964 | 0.918 | 0.934 | 0.982 |
| Indica | E + G + GE | 2012 | 0.840 | 0.890 | 0.849 | 0.944 | 0.990 |
| Indica | E + G + GE | Global | 0.808 | 0.880 | 0.827 | 0.918 | 0.976 |
| Indica | G + GE | 2010 | 0.874 | 0.957 | 0.877 | 0.913 | 0.996 |
| Indica | G + GE | 2011 | 0.910 | 1.017 | 0.926 | 0.895 | 0.983 |
| Indica | G + GE | 2012 | 0.851 | 0.914 | 0.859 | 0.931 | 0.990 |
| Indica | G + GE | Global | 0.816 | 0.900 | 0.833 | 0.907 | 0.980 |
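
The RE columns can be recomputed from the NRMSE columns. The following is a minimal Python sketch, not the authors' code: it applies the ratio described in the table caption (NRMSE of GBLUP divided by NRMSE of the compared model) to the Global rows of the Indica table. The names `indica_global` and `relative_efficiency` are hypothetical and introduced only for illustration. Because the inputs are the rounded values printed in the table, the recomputed ratios can differ from the published RE values in the third decimal place.

```python
# NRMSE values copied from the "Global" rows of the Indica table.
indica_global = {
    "G":          {"GBLUP": 0.880, "PLS": 0.948, "RF": 0.900},
    "E + G":      {"GBLUP": 0.817, "PLS": 0.884, "RF": 0.810},
    "E + G + GE": {"GBLUP": 0.808, "PLS": 0.880, "RF": 0.827},
    "G + GE":     {"GBLUP": 0.816, "PLS": 0.900, "RF": 0.833},
}


def relative_efficiency(nrmse_reference, nrmse_other):
    """RE = NRMSE of the reference model (GBLUP) / NRMSE of the compared model."""
    return nrmse_reference / nrmse_other


# Recompute RE_PLS and RE_RF for every predictor, rounded to three decimals.
re_table = {
    predictor: {
        model: round(relative_efficiency(row["GBLUP"], row[model]), 3)
        for model in ("PLS", "RF")
    }
    for predictor, row in indica_global.items()
}

for predictor, values in re_table.items():
    print(predictor, values)
```

An RE above 1 means the compared model has a lower NRMSE than GBLUP; in these Global rows that happens only for random forest under E + G.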
Figure 2. Prediction performance for each environment and across environments (Global) of dataset 2 (Japonica) in terms of normalized root mean square error (NRMSE) under four predictors (G, genotypic information; E + G, environment plus genotypic information; E + G + GE, environment plus genotypic plus genotype by environment interaction information; and G + GE, genotypic plus genotype by environment interaction information) under the sevenfold cross-validation (CV) scheme.
Variance components (variance) and heritability estimates for dataset 2 (Japonica) for each trait. CV denotes the coefficient of variation, and Locs denotes the average number of locations.
| Trait | Component | Variance | Heritability | CV | Locs |
|---|---|---|---|---|---|
| GY | Loc:Hybrid | 186065.908 | 0.29 | 0.16 | 3.60 |
| GY | Hybrid | 257287.998 | 0.29 | 0.16 | 3.60 |
| GY | Loc | 1860782.427 | 0.29 | 0.16 | 3.60 |
| GY | Residual | 272836.420 | 0.29 | 0.16 | 3.60 |
| PHR | Loc:Hybrid | 0.0001 | 0.46 | 0.07 | 3.60 |
| PHR | Hybrid | 0.0004 | 0.46 | 0.07 | 3.60 |
| PHR | Loc | 0.0012 | 0.46 | 0.07 | 3.60 |
| PHR | Residual | 0.0003 | 0.46 | 0.07 | 3.60 |
| GC | Loc:Hybrid | 0.000 | 0.25 | 0.82 | 3.60 |
| GC | Hybrid | 0.001 | 0.25 | 0.82 | 3.60 |
| GC | Loc | 0.006 | 0.25 | 0.82 | 3.60 |
| GC | Residual | 0.001 | 0.25 | 0.82 | 3.60 |
| PH | Loc:Hybrid | 0.002 | 0.62 | 0.10 | 3.60 |
| PH | Hybrid | 20.528 | 0.62 | 0.10 | 3.60 |
| PH | Loc | 35.950 | 0.62 | 0.10 | 3.60 |
| PH | Residual | 8.576 | 0.62 | 0.10 | 3.60 |
Prediction performance for each environment and across environments (Global) of dataset 2 (Japonica) in terms of normalized root mean square error (NRMSE) and relative efficiency (RE) under four predictors (G, genotypic information; E + G, environment plus genotypic information; E + G + GE, environment plus genotypic plus genotype by environment interaction information; and G + GE, genotypic plus genotype by environment interaction information) under sevenfold cross-validation. NRMSE_GBLUP, NRMSE_PLS and NRMSE_RF denote the NRMSE under the GBLUP, PLS and random forest models, respectively. RE_PLS and RE_RF denote the RE calculated with the NRMSE of the PLS and random forest models, respectively. RE was calculated by dividing the NRMSE of the GBLUP model by the NRMSE of the PLS or random forest model; that is, the GBLUP model was considered the reference model.
| Data | Predictor | Env | NRMSE_GBLUP | NRMSE_PLS | NRMSE_RF | RE_PLS | RE_RF |
|---|---|---|---|---|---|---|---|
| Japonica | G | 2009 | 2.469 | 2.511 | 2.793 | 0.983 | 0.884 |
| Japonica | G | 2010 | 2.269 | 2.263 | 2.387 | 1.003 | 0.951 |
| Japonica | G | 2011 | 1.350 | 1.349 | 1.445 | 1.000 | 0.934 |
| Japonica | G | 2012 | 2.054 | 1.943 | 2.124 | 1.057 | 0.967 |
| Japonica | G | 2013 | 1.263 | 1.348 | 1.356 | 0.937 | 0.932 |
| Japonica | G | Global | 0.957 | 0.983 | 1.056 | 0.973 | 0.906 |
| Japonica | E + G | 2009 | 0.975 | 1.040 | 0.998 | 0.938 | 0.977 |
| Japonica | E + G | 2010 | 0.927 | 0.972 | 0.834 | 0.954 | 1.112 |
| Japonica | E + G | 2011 | 0.790 | 0.914 | 0.851 | 0.864 | 0.928 |
| Japonica | E + G | 2012 | 0.842 | 0.931 | 0.918 | 0.904 | 0.916 |
| Japonica | E + G | 2013 | 0.778 | 0.969 | 0.906 | 0.803 | 0.859 |
| Japonica | E + G | Global | 0.465 | 0.564 | 0.542 | 0.823 | 0.857 |
| Japonica | E + G + GE | 2009 | 1.008 | 1.069 | 0.985 | 0.943 | 1.024 |
| Japonica | E + G + GE | 2010 | 0.819 | 1.008 | 0.872 | 0.812 | 0.939 |
| Japonica | E + G + GE | 2011 | 0.796 | 0.924 | 0.858 | 0.862 | 0.928 |
| Japonica | E + G + GE | 2012 | 0.872 | 0.949 | 0.945 | 0.919 | 0.922 |
| Japonica | E + G + GE | 2013 | 0.782 | 0.978 | 0.919 | 0.800 | 0.851 |
| Japonica | E + G + GE | Global | 0.478 | 0.578 | 0.557 | 0.828 | 0.859 |
| Japonica | G + GE | 2009 | 1.064 | 1.581 | 0.992 | 0.673 | 1.072 |
| Japonica | G + GE | 2010 | 0.848 | 1.794 | 0.877 | 0.473 | 0.968 |
| Japonica | G + GE | 2011 | 0.791 | 1.032 | 0.859 | 0.767 | 0.921 |
| Japonica | G + GE | 2012 | 0.867 | 1.283 | 0.940 | 0.675 | 0.922 |
| Japonica | G + GE | 2013 | 0.766 | 1.120 | 0.920 | 0.684 | 0.833 |
| Japonica | G + GE | Global | 0.478 | 0.720 | 0.556 | 0.663 | 0.859 |
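
One way to read the Japonica table is to find, for each predictor, the model with the lowest Global NRMSE, and for each model, the predictor with the lowest Global NRMSE. The Python sketch below does this with values copied from the Global rows of the table. It is an illustrative summary of the printed numbers, not part of the authors' analysis, and the names `japonica_global`, `best_model` and `best_predictor` are hypothetical.

```python
# NRMSE values copied from the "Global" rows of the Japonica table.
japonica_global = {
    "G":          {"GBLUP": 0.957, "PLS": 0.983, "RF": 1.056},
    "E + G":      {"GBLUP": 0.465, "PLS": 0.564, "RF": 0.542},
    "E + G + GE": {"GBLUP": 0.478, "PLS": 0.578, "RF": 0.557},
    "G + GE":     {"GBLUP": 0.478, "PLS": 0.720, "RF": 0.556},
}

models = ("GBLUP", "PLS", "RF")

# For each predictor, the model with the lowest (best) Global NRMSE.
best_model = {
    predictor: min(row, key=row.get)
    for predictor, row in japonica_global.items()
}

# For each model, the predictor with the lowest (best) Global NRMSE.
best_predictor = {
    model: min(japonica_global, key=lambda p: japonica_global[p][model])
    for model in models
}

print(best_model)
print(best_predictor)
```

For this dataset's Global rows, GBLUP has the lowest NRMSE under every predictor, and every model reaches its lowest NRMSE under E + G.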