Marco Lopez-Cruz, Gustavo de Los Campos.
Abstract
Genomic prediction uses DNA sequences and phenotypes to predict genetic values. In homogeneous populations, theory indicates that the accuracy of genomic prediction increases with sample size. However, differences in allele frequencies and linkage disequilibrium patterns can lead to heterogeneity in SNP effects. In this context, calibrating genomic predictions using a large, potentially heterogeneous, training data set may not lead to optimal prediction accuracy. Some studies have tried to address this trade-off between sample size and homogeneity using training set optimization algorithms; however, this approach assumes that a single training data set is optimal for all individuals in the prediction set. Here, we propose an approach that identifies, for each individual in the prediction set, a subset of the training data (i.e., a set of support points) from which predictions are derived. The methodology that we propose is a sparse selection index (SSI) that integrates selection index methodology with sparsity-inducing techniques commonly used for high-dimensional regression. The sparsity of the resulting index is controlled by a regularization parameter (λ); the G-Best Linear Unbiased Predictor (G-BLUP), the prediction method most commonly used in plant and animal breeding, appears as the special case λ = 0. In this study, we present the methodology and demonstrate (using two wheat data sets with phenotypes collected in 10 different environments) that the SSI can achieve significant (anywhere between 5 and 10%) gains in prediction accuracy relative to the G-BLUP.
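The idea of a per-individual sparse index can be illustrated with a small sketch. It is not the authors' implementation: it assumes the index weights for one prediction individual minimize an L1-penalized quadratic, ½ b'(G + θI)b − c'b + λ‖b‖₁, where G is the genomic relationship matrix among training individuals, c holds the relationships between the training individuals and the prediction individual, and θ is a ratio of residual to genetic variance. The marker data are simulated, and the function names (`soft_threshold`, `ssi_weights`) and the values of `theta` and `lam` are hypothetical.

```python
import numpy as np

def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values within [-lam, lam] become 0."""
    return np.sign(x) * max(abs(x) - lam, 0.0)

def ssi_weights(P, c, lam, n_iter=500, tol=1e-10):
    """Coordinate descent for 0.5*b'Pb - c'b + lam*||b||_1 (assumed objective)."""
    b = np.zeros_like(c, dtype=float)
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(len(c)):
            # Partial residual for coordinate j (excludes its own contribution).
            r = c[j] - P[j] @ b + P[j, j] * b[j]
            new = soft_threshold(r, lam) / P[j, j]
            max_change = max(max_change, abs(new - b[j]))
            b[j] = new
        if max_change < tol:
            break
    return b

# Simulated, centered marker matrix (assumption: illustration only).
rng = np.random.default_rng(0)
n, p = 60, 200
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
G = X @ X.T / p                      # genomic relationship matrix

trn, tst = np.arange(50), np.arange(50, 60)
theta = 1.0                          # assumed variance ratio
P = G[np.ix_(trn, trn)] + theta * np.eye(len(trn))
c = G[trn, tst[0]]                   # relationships with one prediction individual

b_gblup = ssi_weights(P, c, lam=0.0)     # no penalty: dense weights
b_sparse = ssi_weights(P, c, lam=0.05)   # penalty: some weights set to zero
n_support = np.count_nonzero(b_sparse)   # support points for this individual
```

With `lam=0.0` the weights agree with the direct solution `np.linalg.solve(P, c)`, which is the sense in which the unpenalized index uses all training individuals. Given a centered training phenotype vector `y`, the predicted value for the individual would be `b_sparse @ y`.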
Keywords: GenPred; genomic prediction; penalized regression; prediction accuracy; selection index; shared data resources
Year: 2021 PMID: 33748861 PMCID: PMC8128408 DOI: 10.1093/genetics/iyab030
Source DB: PubMed Journal: Genetics ISSN: 0016-6731 Impact factor: 4.562
Figure 1. Prediction accuracy for grain yield (average across 100 trn–tst partitions) of the SSI versus the (average) number of support points of the SSIs. The G-BLUP (blue rightmost point) is a special case of an SSI when λ = 0. Each panel represents one environment within a data set. (A) Wheat-large data set. B2I, bed planting + 2 irrigations; B5I, bed planting + 5 irrigations; MEL, flat planting + 5 irrigations; LHT, late planting date; EHT, early planting date; DRB, bed planting + drip irrigation. (B) Wheat-small data set. ME, mega-environment. Vertical bars represent a 95% confidence interval for the average.
Figure 2. Prediction accuracy for grain yield of the optimal SSI versus that of the G-BLUP. Each point represents one trn–tst partition (a total of 100 partitions were implemented); the point shape and color represent environments. (A) Wheat-large data set. B2I, bed planting + 2 irrigations; B5I, bed planting + 5 irrigations; MEL, flat planting + 5 irrigations; LHT, late planting date; EHT, early planting date; DRB, bed planting + drip irrigation. (B) Wheat-small data set. ME, mega-environment. The value of λ in the SSI was estimated using 10 fivefold CVs conducted within the training data. In parentheses, by the legend, P is the proportion of times the SSI was better than the G-BLUP.
Prediction accuracy for grain yield (average across 100 partitions) achieved by sparse selection indices (SSIs) and by the G-BLUP (standard SI), by data set and environmental condition
| Environment | n_tst | n_trn | Method | λ | n_sup | Accuracy (SD) | P |
|---|---|---|---|---|---|---|---|
| Wheat-large | | | | | | | |
| B2I | 1,120 | 2,612 | G-BLUP | 0.0000 | 2,612 | 0.617 (0.031) | 0.97 |
| | | | SSI | 0.0135 | 434 | 0.648 (0.031) | |
| B5I | 8,842 | 20,631 | G-BLUP | 0.0000 | 20,631 | 0.555 (0.010) | 1.00 |
| | | | SSI | 0.0107 | 1,470 | 0.609 (0.009) | |
| MEL | 1,321 | 3,082 | G-BLUP | 0.0000 | 3,082 | 0.600 (0.045) | 0.99 |
| | | | SSI | 0.0131 | 524 | 0.661 (0.046) | |
| LHT | 1,322 | 3,082 | G-BLUP | 0.0000 | 3,082 | 0.669 (0.024) | 0.99 |
| | | | SSI | 0.0168 | 380 | 0.709 (0.025) | |
| DRB | 1,129 | 2,634 | G-BLUP | 0.0000 | 2,634 | 0.629 (0.035) | 0.98 |
| | | | SSI | 0.0322 | 136 | 0.675 (0.037) | |
| EHT | 612 | 1,428 | G-BLUP | 0.0000 | 1,428 | 0.614 (0.049) | 0.94 |
| | | | SSI | 0.0301 | 178 | 0.649 (0.047) | |
| Wheat-small | | | | | | | |
| ME1 | 180 | 419 | G-BLUP | 0.0000 | 419 | 0.721 (0.070) | 0.87 |
| | | | SSI | 0.0413 | 78 | 0.760 (0.067) | |
| ME2 | 180 | 419 | G-BLUP | 0.0000 | 419 | 0.702 (0.087) | 0.41 |
| | | | SSI | 0.0123 | 254 | 0.692 (0.085) | |
| ME3 | 180 | 419 | G-BLUP | 0.0000 | 419 | 0.585 (0.101) | 0.53 |
| | | | SSI | 0.0613 | 84 | 0.586 (0.093) | |
| ME4 | 180 | 419 | G-BLUP | 0.0000 | 419 | 0.663 (0.082) | 0.87 |
| | | | SSI | 0.0617 | 54 | 0.714 (0.075) | |
B2I, bed planting + 2 irrigations; B5I, bed planting + 5 irrigations; MEL, flat planting + 5 irrigations; LHT, late planting date; EHT, early planting date; DRB, bed planting + drip irrigation; ME, mega-environment; n_tst and n_trn, sizes of the testing and training data sets, respectively; SD, standard deviation across trn–tst partitions.
λ: optimal value of the regularization parameter (average across partitions), estimated by cross-validating the training set.
n_sup: average number of support points in the SSIs. The G-BLUP corresponds to an SSI with λ = 0 and n_sup = n_trn.
P: proportion of times (out of the 100 partitions) that the SSI outperformed the G-BLUP in prediction accuracy.
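As a check on the size of the reported gains, the relative improvement of the SSI over the G-BLUP can be computed directly from the mean accuracies in the table above. The sketch below uses only those rounded table values; the helper name `relative_gain` is hypothetical. On these numbers the gains are roughly 5 to 10% in every wheat-large environment and in ME1 and ME4, while ME2 shows a small loss and ME3 is essentially unchanged, in line with their lower P values (0.41 and 0.53).

```python
# Mean accuracies copied from the table above: (G-BLUP, SSI) per environment.
accuracy = {
    "B2I": (0.617, 0.648),
    "B5I": (0.555, 0.609),
    "MEL": (0.600, 0.661),
    "LHT": (0.669, 0.709),
    "DRB": (0.629, 0.675),
    "EHT": (0.614, 0.649),
    "ME1": (0.721, 0.760),
    "ME2": (0.702, 0.692),
    "ME3": (0.585, 0.586),
    "ME4": (0.663, 0.714),
}

def relative_gain(gblup, ssi):
    """Percent change in accuracy of the SSI relative to the G-BLUP."""
    return 100.0 * (ssi - gblup) / gblup

gains = {env: relative_gain(*acc) for env, acc in accuracy.items()}

for env, gain in gains.items():
    print(f"{env}: {gain:+.1f}%")
```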
Figure 3. Distribution of the number of training support points (n_sup) in the optimal SSI for grain yield (results obtained over 100 trn–tst partitions; n_trn, size of the training data set), by environmental condition. B2I, bed planting + 2 irrigations; B5I, bed planting + 5 irrigations; MEL, flat planting + 5 irrigations; LHT, late planting date; EHT, early planting date; DRB, bed planting + drip irrigation. Wheat-large data set.
Figure 4. Coordinates on the first two principal components for prediction points (yellow) and the corresponding support points (green). Gray points represent genotypes that did not contribute to the prediction of the genetic value of grain yield of the genotype in yellow. All panels show solutions for the EHT environment (early planting date), wheat-large data set.
Figure 5. (A) Weights of a standard SI (G-BLUP) and of the optimal SSI for grain yield versus the genomic relationship. (B) Proportion of weights in the SSI that were zero (nonactive) and nonzero (support points). EHT environment (early planting date), wheat-large data set.