Mohsin Ali1, Yong Zhang1, Awais Rasheed2,3, Jiankang Wang1, Luyan Zhang1.
Abstract
Genomic selection (GS) is a strategy to predict the genetic merits of individuals using genome-wide markers. However, GS prediction accuracy is affected by many factors, including the missing rate and minor allele frequency (MAF) of genotypic data, GS models, trait features, etc. In this study, we used one wheat population to investigate the prediction accuracies of various GS models on yield and yield-related traits under various quality control (QC) scenarios, missing genotype imputation, and genome-wide association studies (GWAS)-derived markers. Missing rate and MAF of single nucleotide polymorphism (SNP) markers were the two major factors in QC. Five missing rate levels (0%, 20%, 40%, 60%, and 80%) and three MAF levels (0%, 5%, and 10%) were considered, and five-fold cross-validation was used to estimate the prediction accuracy. The results indicated that a moderate missing rate level (20% to 40%) and MAF (5%) threshold provided better prediction accuracy. Under this QC scenario, prediction accuracies were further calculated for imputed and GWAS-derived markers. It was observed that the accuracies of the six traits were related to their heritability and genetic architecture, as well as the GS prediction model. Moore-Penrose generalized inverse (GenInv), ridge regression (RidgeReg), and random forest (RForest) resulted in higher prediction accuracies than other GS models across traits. Imputation of missing genotypic data had a marginal effect on prediction accuracy, while GWAS-derived markers improved the prediction accuracy in most cases. These results demonstrate that QC on missing rate and MAF had a positive impact on the predictability of GS models. We failed to identify one single combination of QC scenarios that could outperform the others for all traits and GS models. However, the balance between marker number and marker quality is important for the deployment of GS in wheat breeding. GWAS is able to select the markers that are most related to traits, and therefore can be used to improve the prediction accuracy of GS.
Keywords: genomic selection; minor allele frequency; missing data; wheat
Year: 2020 PMID: 32079240 PMCID: PMC7073225 DOI: 10.3390/ijms21041342
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Pearson’s correlation matrix among yield and yield-related traits based on their best linear unbiased estimates (BLUE). The upper corner shows the correlation coefficients, with significance levels indicated by asterisks: “*” and “***” correspond to p-values of 0.05 and 0.001, respectively. The lower corner contains bivariate scatter plots with fitted lines. The plots on the diagonal show the phenotypic distribution of each trait based on BLUE values. GY, grain yield; SN, spike number per square meter; TKW, thousand-kernel weight; SL, spike length; HD, heading days; PH, plant height.
Variance components and heritability of yield and yield-related traits in 166 wheat accessions.
| Trait 1 | Genotype (%) | Environment (%) | G by E Interaction 2 (%) | Random Error (%) | Heritability 3 (Plot Level) | Heritability 3 (Genotypic Mean Level) |
|---|---|---|---|---|---|---|
| GY | 12.12 | 43.00 | 39.39 | 5.50 | 0.42 | 0.85 |
| SN | 34.32 | 24.11 | 36.19 | 5.39 | 0.69 | 0.92 |
| TKW | 41.38 | 27.68 | 23.76 | 7.18 | 0.77 | 0.97 |
| SL | 42.94 | 8.24 | 34.71 | 14.12 | 0.67 | 0.96 |
| HD | 12.64 | 79.29 | 7.26 | 0.81 | 0.81 | 0.97 |
| PH | 60.19 | 11.97 | 23.09 | 4.75 | 0.85 | 0.98 |
1 GY, grain yield; SN, spike number per square meter; TKW, thousand-kernel weight; SL, spike length; HD, heading days; PH, plant height. 2 G by E; genotype-by-environment. 3 Heritability was estimated from analysis of variance across environments.
Number of markers used for genomic predictions under five missing rate levels (i.e., 0%, <20%, <40%, <60%, and <80%) and three minor allele frequency (MAF) levels (i.e., >0%, >5%, and >10%).
| Missing Rate (%) | MAF >0% 1 | MAF >5% | MAF >10% |
|---|---|---|---|
| 0 2 | 1442 | 259 | 181 |
| 20 | 8674 | 5343 | 4368 |
| 40 | 9851 | 5513 | 4494 |
| 60 | 10818 | 5635 | 4596 |
| 80 | 11997 | 5725 | 4675 |
1 MAF greater than zero. In other words, this QC scenario actually only removed non-polymorphism markers in the population, and therefore the remaining markers were polymorphic after this control. 2 Markers contained no missing values.
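The marker counts in the table above result from filtering SNPs on missing rate and MAF. A minimal sketch of such a filter is given below. It assumes genotypes are coded as allele dosages (0/1/2) with `np.nan` for missing calls; the function name `qc_filter`, the coding, and the use of `<=` for the missing-rate bound (so that a level of 0 keeps markers with no missing values) are illustrative choices, not details taken from the study.

```python
import numpy as np

def qc_filter(genotypes, max_missing=0.40, min_maf=0.05):
    """Return a boolean mask of markers (columns) that pass QC.

    genotypes: lines x markers array of allele dosages 0/1/2 (assumed
    coding), with np.nan marking missing calls.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    n_lines = genotypes.shape[0]

    # Fraction of lines with a missing call, per marker.
    n_called = (~np.isnan(genotypes)).sum(axis=0)
    missing_rate = 1.0 - n_called / n_lines

    # Allele frequency from called genotypes only (two alleles per line).
    dosage_sum = np.nansum(genotypes, axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        freq = dosage_sum / (2.0 * n_called)
    maf = np.minimum(freq, 1.0 - freq)
    maf = np.nan_to_num(maf, nan=0.0)  # markers with no calls get MAF 0

    # MAF > 0 alone removes only monomorphic markers.
    return (missing_rate <= max_missing) & (maf > min_maf)

# Toy data: 4 lines x 4 markers.
G = np.array([
    [0, 2, 0, np.nan],
    [0, 0, 2, np.nan],
    [0, 2, np.nan, 2],
    [0, 0, 2, 0],
], dtype=float)
mask = qc_filter(G, max_missing=0.40, min_maf=0.05)
G_qc = G[:, mask]  # keeps the second and third markers
```

In the toy data, the first marker is monomorphic and the fourth has a 50% missing rate, so both are removed at the 40% / 5% thresholds.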
Figure 2. Genomic prediction accuracy of seven genomic selection (GS) models for yield and yield-related traits with five missing rates (columns) and three minor allele frequencies (MAFs) (rows). 0% MAF represents markers with MAF greater than 0; GY, grain yield; SN, spike number per square meter; TKW, thousand-kernel weight; SL, spike length; HD, heading days; PH, plant height.
Prediction accuracy with marker quality control (QC) or non-QC for six traits and seven genomic selection (GS) models. The QC to keep markers with missing rate levels <40% and minor allele frequency (MAF) >5% was used as an example.
| Trait 1 | Scenario | BLUP 2 | GBLUP | GenInv | LASSO | RForest | RidgeReg | RRBLUP | Mean |
|---|---|---|---|---|---|---|---|---|---|
| GY | QC | 0.531 | 0.458 | 0.503 | 0.495 | 0.522 | | | |
| | Non-QC 4 | 0.545 | 0.461 | 0.506 | 0.454 | 0.535 | 0.506 | 0.545 | 0.507 |
| | | 0.3687 | 0.3892 | 0.3926 | 0.0001 | 0.3688 | 0.4662 | 0.3994 | 0.248 |
| SN | QC | 0.491 | 0.484 | 0.383 | 0.488 | 0.480 | | | |
| | Non-QC | 0.462 | 0.383 | 0.335 | 0.444 | 0.434 | 0.335 | 0.463 | 0.408 |
| | | 0.2447 | 0.0096 | 0.0002 | 0.0816 | 0.018 | 0.0002 | 0.2671 | 0.011 |
| TKW | QC | 0.600 | 0.499 | 0.603 | 0.601 | 0.601 | | | |
| | Non-QC | 0.672 | 0.598 | 0.619 | 0.594 | 0.638 | 0.619 | 0.672 | 0.630 |
| | | 0.0282 | 0.4463 | 0.19 | 0.0189 | 0.1809 | 0.2122 | 0.0298 | 0.116 |
| SL | QC | 0.373 | 0.315 | 0.373 | 0.362 | 0.380 | | | |
| | Non-QC | 0.307 | 0.367 | 0.358 | 0.146 | 0.284 | 0.358 | 0.296 | 0.302 |
| | | 0.1595 | 0.2805 | 0.1629 | 0.0041 | 0.0645 | 0.1575 | 0.1721 | 0.020 |
| HD | QC | 0.355 | 0.394 | 0.264 | 0.266 | 0.350 | | | |
| | Non-QC | 0.326 | 0.276 | 0.262 | 0.161 | 0.340 | 0.262 | 0.272 | 0.271 |
| | | 0.2912 | 0.0265 | 0.0065 | 0.0412 | 0.471 | 0.0063 | 0.4991 | 0.017 |
| PH | QC | 0.570 | 0.593 | 0.502 | 0.576 | 0.572 | | | |
| | Non-QC | 0.616 | 0.543 | 0.574 | 0.369 | 0.558 | 0.574 | 0.615 | 0.550 |
| | | 0.1438 | 0.2283 | 0.2445 | 0.0057 | 0.3201 | 0.2807 | 0.1372 | 0.269 |
| Mean | QC | 0.489 | 0.486 | 0.512 | 0.425 | 0.494 | 0.510 | 0.473 | |
| | Non-QC | 0.488 | 0.438 | 0.442 | 0.461 | 0.465 | 0.442 | 0.477 | |
1 GY, grain yield; SN, spike number per square meter; TKW, thousand-kernel weight; SL, spike length; HD, heading days; PH, plant height. 2 BLUP, best linear unbiased prediction; GBLUP, genomic-BLUP; GenInv, Moore–Penrose generalized inverse; LASSO, least absolute shrinkage and selection operator; RForest, random forest; RidgeReg, ridge regression; and RRBLUP, ridge regression-BLUP. 3 Values in parenthesis indicate standard errors of the estimated parameter. 4 Non-QC indicates that all polymorphic markers were used. 5 The models with the top three prediction accuracies under QC are bolded for each trait.
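The accuracies in the table above come from five-fold cross-validation. The sketch below shows how such an estimate could be computed for a ridge regression model. It is an illustration under stated assumptions, not the study's pipeline: accuracy is taken as the Pearson correlation between predicted and observed values averaged over folds, the penalty `lam` is fixed rather than tuned, and the names `ridge_fit_predict` and `cv_accuracy` are hypothetical.

```python
import numpy as np

def ridge_fit_predict(X_train, y_train, X_test, lam=1.0):
    """Fit ridge regression on centered data and predict the test lines."""
    x_mean = X_train.mean(axis=0)
    y_mean = y_train.mean()
    Xc = X_train - x_mean
    # Dual form: beta = Xc^T (Xc Xc^T + lam I)^-1 (y - y_mean).
    # It solves an n x n system, convenient when markers outnumber lines.
    K = Xc @ Xc.T
    alpha = np.linalg.solve(K + lam * np.eye(K.shape[0]), y_train - y_mean)
    beta = Xc.T @ alpha
    return y_mean + (X_test - x_mean) @ beta

def cv_accuracy(X, y, k=5, lam=1.0, seed=0):
    """Mean Pearson correlation between predictions and observations
    over k cross-validation folds (assumed accuracy definition)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    accuracies = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        pred = ridge_fit_predict(X[train_idx], y[train_idx], X[test_idx], lam)
        accuracies.append(np.corrcoef(pred, y[test_idx])[0, 1])
    return float(np.mean(accuracies))

# Simulated example: 100 lines, 50 markers, additive effects plus noise.
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(100, 50)).astype(float)
effects = rng.normal(size=50)
y = X @ effects + rng.normal(scale=0.1, size=100)
accuracy = cv_accuracy(X, y, k=5)
```

Running `cv_accuracy` once on the QC-filtered marker matrix and once on all polymorphic markers would give the kind of QC versus non-QC comparison reported in the table.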
Prediction accuracies of yield and yield-related traits for the seven GS models using imputed markers. The QC to keep markers with missing rate levels <40% and minor allele frequency (MAF) >5% was used as an example.
| Trait 1 | BLUP 2 | GBLUP | GenInv | LASSO | RForest | RidgeReg | RRBLUP | Mean |
|---|---|---|---|---|---|---|---|---|
| GY | 0.517 | 0.491 | 0.520 | 0.537 | ||||
| SN | 0.488 | 0.477 | 0.418 | 0.481 | 0.496 | |||
| TKW | 0.600 | 0.560 | 0.586 | 0.602 | 0.607 | |||
| SL | 0.370 | 0.370 | 0.394 | 0.392 | 0.423 | |||
| HD | 0.305 | 0.377 | 0.312 | 0.256 | 0.345 | |||
| PH | 0.549 | 0.517 | 0.450 | 0.547 | 0.538 | |||
| Mean | 0.472 | 0.480 | 0.521 | 0.440 | 0.509 | 0.524 | 0.466 | |
1 GY, grain yield; SN, spike number per square meter; TKW, thousand-kernel weight; SL, spike length; HD, heading days; PH, plant height; 2 BLUP, best linear unbiased prediction; GBLUP, genomic-BLUP; GenInv, Moore–Penrose generalized inverse; LASSO, least absolute shrinkage and selection operator; RForest, random forest; RidgeReg, ridge regression; and RRBLUP, ridge regression-BLUP. 3 Values in parenthesis indicate standard errors of the estimated parameter. 4 The models with the top three prediction accuracies with and without markers imputation are bolded for each trait.
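The imputation method is not described in this record. As a simple baseline for illustration only, the sketch below replaces each missing genotype with the mean dosage of its marker. The dosage coding (0/1/2 with `np.nan` for missing calls) and the function name `impute_marker_mean` are assumptions; dedicated imputation software generally uses more informative methods.

```python
import numpy as np

def impute_marker_mean(genotypes):
    """Replace missing calls (np.nan) with the per-marker mean dosage.

    Markers with no calls at all stay np.nan, so missing-rate QC should
    be applied before imputation.
    """
    G = np.array(genotypes, dtype=float)  # work on a copy
    marker_mean = np.nanmean(G, axis=0)
    rows, cols = np.where(np.isnan(G))
    G[rows, cols] = marker_mean[cols]
    return G

G_missing = np.array([
    [0.0, np.nan],
    [2.0, 2.0],
    [np.nan, 0.0],
])
G_imputed = impute_marker_mean(G_missing)
# Both markers have a mean dosage of 1.0, so each gap is filled with 1.0.
```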
The number of significant markers detected by genome-wide association studies (GWAS) under the imputed and non-imputed scenarios. Threshold of −log10 P was set at 1.
| Trait 1 | GWAS QTLs (Imputed 2) | GWAS QTLs (Non-imputed) |
|---|---|---|
| GY | 525 | 514 |
| SN | 537 | 576 |
| TKW | 519 | 553 |
| SL | 520 | 509 |
| HD | 508 | 506 |
| PH | 497 | 522 |
| Total | 3106 | 3080 |
1 GY, grain yield; SN, spike number per square meter; TKW, thousand-kernel weight; SL, spike length; HD, heading days; PH, plant height. 2 Markers were imputed before GWAS.
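A threshold of −log10 P = 1 corresponds to P = 0.1, a permissive cutoff that retains several hundred markers per trait. The sketch below shows the selection step only. The p-values are assumed to come from a GWAS run elsewhere; the marker names, the function name `select_gwas_markers`, and the inclusive comparison (`>=`) are illustrative assumptions.

```python
import math

def select_gwas_markers(p_values, min_neg_log10_p=1.0):
    """Return marker IDs whose -log10(p) meets the threshold.

    p_values: mapping of marker ID -> GWAS p-value, with 0 < p <= 1.
    """
    selected = []
    for marker, p in p_values.items():
        if not 0.0 < p <= 1.0:
            raise ValueError(f"invalid p-value for {marker}: {p}")
        if -math.log10(p) >= min_neg_log10_p:
            selected.append(marker)
    return selected

# Hypothetical p-values for four markers.
example_p = {"snp_a": 0.05, "snp_b": 0.5, "snp_c": 0.2, "snp_d": 0.001}
kept = select_gwas_markers(example_p, min_neg_log10_p=1.0)
# snp_a and snp_d pass; snp_b and snp_c do not.
```

The selected marker IDs can then be used to subset the genotype matrix before fitting a GS model.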
Prediction accuracy for yield and yield-related traits using significant markers detected by genome-wide association studies (GWAS). Both imputed and non-imputed scenarios were considered. The QC to keep markers with missing rate levels <40% and minor allele frequency (MAF) >5% was used as an example.
| Trait 1 | Imputation 2 | BLUP 3 | GBLUP | GenInv | LASSO | RForest | RidgeReg | RRBLUP | Mean |
|---|---|---|---|---|---|---|---|---|---|
| GY | Yes | 0.887 | 0.721 | 0.730 | 0.887 | 0.847 | | | |
| | No | 0.712 | 0.746 | 0.727 | 0.785 | | | | |
| SN | Yes | 0.859 | 0.673 | 0.741 | 0.850 | | | | |
| | No | 0.857 | 0.767 | 0.685 | 0.833 | | | | |
| TKW | Yes | 0.848 | 0.737 | 0.763 | 0.938 | 0.873 | | | |
| | No | 0.79 | 0.763 | 0.733 | 0.880 | 0.843 | | | |
| SL | Yes | 0.861 | 0.938 | 0.628 | 0.674 | 0.843 | | | |
| | No | 0.773 | 0.669 | 0.613 | 0.840 | 0.785 | | | |
| HD | Yes | 0.792 | 0.846 | 0.660 | 0.648 | 0.793 | | | |
| | No | 0.818 | 0.728 | 0.653 | 0.826 | 0.800 | | | |
| PH | Yes | 0.756 | 0.621 | 0.712 | 0.798 | | | | |
| | No | 0.800 | 0.746 | 0.664 | 0.810 | 0.803 | | | |
| Mean | Yes | 0.913 | 0.836 | 0.895 | 0.673 | 0.711 | 0.895 | 0.913 | |
| | No | 0.792 | 0.737 | 0.679 | 0.836 | | | | |
1 GY, grain yield; SN, spike number per square meter; TKW, thousand-kernel weight; SL, spike length; HD, heading days; PH, plant height. 2 “Yes” indicates that genotypic data was imputed for missing values and then used for GWAS and GS analysis. 3 BLUP, best linear unbiased prediction; GBLUP, genomic-BLUP; GenInv, Moore–Penrose generalized inverse; LASSO, least absolute shrinkage and selection operator; RForest, random forest; RidgeReg, ridge regression; and RRBLUP, ridge regression-BLUP. 4 The models with the top three prediction accuracies with and without markers imputation are bolded for each trait. 5 Values in parenthesis indicate standard errors of the estimated parameter.