Guillaume P Ramstein, Michael D Casler.
Abstract
Genomic prediction is a useful tool to accelerate genetic gain in selection using DNA marker information. However, this technology typically relies on standard prediction procedures, such as genomic BLUP, that are not designed to accommodate population heterogeneity resulting from differences in marker effects across populations. In this study, we assayed different prediction procedures to capture marker-by-population interactions in genomic prediction models. Prediction procedures included genomic BLUP and two kernel-based extensions of genomic BLUP which explicitly accounted for population heterogeneity. To model population heterogeneity, dissemblance between populations was either depicted by a unique coefficient (as previously reported), or a more flexible function of genetic distance between populations (proposed herein). Models under investigation were applied in a diverse switchgrass sample under two validation schemes: whole-sample calibration, where all individuals except selection candidates are included in the calibration set, and cross-population calibration, where the target population is entirely excluded from the calibration set. First, we showed that using fixed effects, from principal components or putative population groups, appeared detrimental to prediction accuracy, especially in cross-population calibration. Then we showed that modeling population heterogeneity by our proposed procedure resulted in highly significant improvements in model fit. In such cases, gains in accuracy were often positive. These results suggest that population heterogeneity may be parsimoniously captured by kernel methods. However, in cases where improvement in model fit by our proposed procedure is null-to-moderate, ignoring heterogeneity should probably be preferred due to the robustness and simplicity of the standard genomic BLUP model.
Keywords: GenPred; Genomic Prediction; Panicum virgatum; Shared Data Resources; kernel functions; marker-by-population interaction; population heterogeneity
Year: 2019 PMID: 30651285 PMCID: PMC6404615 DOI: 10.1534/g3.118.200969
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Description of populations and trait measurements
| Pop. | Size | Loc. | Trait | Years | Mean | Range |
|---|---|---|---|---|---|---|
|  | 106 | NY | PH | 2009 2011 | 142.9 | 95.8 - 205.2 |
|  |  |  | HD | 2009 2010 2011 | 547.1 | 422.9 - 810.4 |
|  |  |  | St | 2010 2011 | 5.6 | 1.0 - 8.9 |
|  | 37 | NY | PH | 2009 2011 | 209.7 | 130.7 - 240.1 |
|  |  |  | HD | 2009 2010 2011 | 841.3 | 711.2 - 1075.6 |
|  |  |  | St | 2010 2011 | 7.1 | 5.0 - 9.7 |
|  | 110 | WI | PH | 2012 2013 | 185.6 | 133.9 - 239.9 |
|  |  |  | HD | 2012 2013 2014 | 806.3 | 652.1 - 979.7 |
|  |  |  | St | 2013 | 6.2 | 2.7 - 8.9 |
|  | 135 | NY | PH | 2009 2011 | 155.5 | 93.7 - 207.7 |
|  |  |  | HD | 2009 2010 2011 | 534.3 | 345.4 - 904.0 |
|  |  |  | St | 2010 2011 | 5.4 | 1.6 - 8.2 |
|  | 136 | WI | PH | 2012 2013 | 163.8 | 127.9 - 204.6 |
|  |  |  | HD | 2013 2014 | 527.6 | 405.7 - 692.4 |
|  |  |  | St | 2013 | 5.7 | 2.0 - 8.2 |
|  | 97 | NY | PH | 2009 2011 | 168.2 | 101.0 - 225.2 |
|  |  |  | HD | 2009 2010 2011 | 530.4 | 408.5 - 734.7 |
|  |  |  | St | 2010 2011 | 5.6 | 1.7 - 8.0 |
|  | 129 | NY | PH | 2009 2011 | 165.2 | 124.7 - 224.7 |
|  |  |  | HD | 2009 2010 2011 | 608.0 | 429.2 - 823.1 |
|  |  |  | St | 2010 2011 | 3.5 | 0.7 - 7.2 |
|  | 10 | NY | PH | 2009 2011 | 175.4 | 138.7 - 190.8 |
|  |  |  | HD | 2009 2010 2011 | 716.6 | 569.9 - 859.1 |
|  |  |  | St | 2010 2011 | 5.8 | 4.0 - 7.5 |
Population (Pop.): WS4U-C2 is a collection of upland ecotypes; Liberty-C2 is a cross between upland and lowland ecotypes; other populations are designated by ecotype (U: upland; L: lowland), ploidy level (4X: tetraploid; 8X: octoploid) and geographical origin (S: South; W: West; N: North; E: East). Location (Loc.): location of phenotypic trials, Arlington (WI, USA) or Ithaca (NY, USA). Trait: plant height (PH), heading date (HD) or standability (St). Mean and range refer to the means y’s as described in Material and Methods. Units for mean and range are centimeters, growing degree days and scores on a 0-10 scale, for PH, HD and St, respectively.
Figure 1. Population structure in the sample. (A) Admixture plot of the whole sample, with colors designating the seven inferred population clusters, which roughly matched populations, with the exception of U8X-S, which displayed strong admixture; (B) Principal component analysis (PCA) plot of the whole sample of 760 individuals, with colors designating the eight populations.
Figure 2. Inferred graphs of relationships, conditional on population structure. Each graph represents the relationships as depicted by the graphical LASSO applied to the whole sample of individuals. The parameter λ represents the degree of regularization on conditional relationships, fitted by restricted maximum likelihood for each trait, in a GBLUP model based on regularized relationships (Appendix 1). Nodes (individuals) were positioned using the force-directed placement algorithm of Fruchterman and Reingold (1991), as implemented in function ggnet (R package GGally), so aggregation of nodes reflects connectedness.
Average prediction accuracy by mean structure
| a) Target included in CS | b) Target population omitted from CS | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Trait | Population | ||||||||
| WS4U-C2 | 0.163 (0.121) | 0.163 (0.121) | 0.164 (0.123) | 0.230 (0.144) | 0.217 (0.146) • | 0.225 (0.142) | |||
| Liberty-C2 | 0.476 (0.189) | 0.477 (0.189) | 0.469 (0.191) | 0.025 (0.208) | 0.030 (0.202) | −0.048 (0.196) • | |||
| U4X-N | 0.526 (0.149) | 0.526 (0.143) | 0.525 (0.136) | 0.258 (0.166) • | 0.247 (0.164) * | 0.067 (0.175) * | |||
| L4X-NE | 0.767 (0.074) | 0.766 (0.074) • | 0.766 (0.074) • | 0.403 (0.179) | 0.391 (0.184) * | 0.355 (0.189) * | |||
| WS4U-C2 | 0.272 (0.185) | 0.276 (0.186) • | 0.282 (0.173) | 0.122 (0.166) | 0.132 (0.167) • | 0.129 (0.140) | |||
| Liberty-C2 | 0.532 (0.145) | 0.536 (0.146) • | 0.516 (0.158) | 0.125 (0.185) | 0.127 (0.185) | 0.080 (0.204) • | |||
| U4X-N | 0.693 (0.103) | 0.689 (0.114) | 0.694 (0.110) | 0.438 (0.163) | 0.406 (0.191) • | 0.388 (0.167) • | |||
| L4X-NE | 0.828 (0.074) | 0.826 (0.074) * | 0.828 (0.074) • | 0.338 (0.222) * | 0.377 (0.222) * | 0.229 (0.206) * | |||
| WS4U-C2 | 0.070 (0.208) | 0.074 (0.209) | 0.078 (0.208) | −0.046 (0.193) | −0.047 (0.191) | −0.051 (0.192) | |||
| Liberty-C2 | 0.110 (0.248) | 0.114 (0.248) | 0.098 (0.250) * | 0.145 (0.231) | 0.153 (0.177) | 0.104 (0.150) • | |||
| U4X-N | 0.265 (0.169) | 0.264 (0.167) | 0.270 (0.172) | −0.067 (0.209) • | −0.000 (0.219) • | 0.020 (0.207) | |||
| L4X-NE | 0.589 (0.127) | 0.588 (0.128) | 0.589 (0.126) | 0.090 (0.219) | −0.330 (0.172) * | 0.096 (0.218) • | |||
In parentheses: standard deviation across cross-validation replicates. Validation scheme: (a) whole-sample calibration, where individuals in the target population, except the selection candidates, are included in the calibration set; (b) cross-population calibration, where all individuals from the target population are omitted from the calibration set. Trait: plant height (PH), heading date (HD) or standability (St). Population: population used as target for prediction. Prediction accuracies are averaged over 20 cross-validation replicates. Models differ by mean structure (fixed-effect specification), under the same prediction procedure (GBLUP: whole-sample model). Intercept: only an intercept; PCA: intercept and effects of first four PCs; Panel: effect of panels (AP, association panel; BP, breeding panel); Group: effects of putative population groups (WS4U-C2, Liberty-C2, U4X-N, U8X-W+U8X-S, U8X-E, L4X-NE and L4X-S). Comparisons to Intercept: •: P ≤ 0.05 in unadjusted (naïve) t-test (liberal); *: P ≤ 0.05 in t-test corrected for overlap in testing sets as in Nadeau and Bengio (2003) (conservative). Underlined values correspond to the highest prediction accuracy for each validation scheme, trait and population.
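The two significance markers in this footnote differ only in how the variance of the mean accuracy difference is estimated. As a minimal sketch, assuming the usual form of the Nadeau and Bengio (2003) corrected resampled t-statistic, the naive and corrected statistics can be computed as below. The function name, the replicate differences and the calibration/testing set sizes are hypothetical and are not taken from the study.

```python
import math
from statistics import mean, variance


def resampled_t_statistics(diffs, n_train, n_test):
    """Return (naive_t, corrected_t) for per-replicate accuracy differences.

    diffs   : accuracy of a model minus accuracy of the reference model,
              one value per cross-validation replicate
    n_train : size of the calibration set (assumed constant over replicates)
    n_test  : size of the testing set (assumed constant over replicates)
    """
    j = len(diffs)
    d_bar = mean(diffs)
    s2 = variance(diffs)  # sample variance, denominator j - 1
    # Naive t-test: treats replicates as independent.
    naive_t = d_bar / math.sqrt(s2 / j)
    # Corrected test: inflates the variance by n_test / n_train to account
    # for the dependence between replicates drawn from the same sample.
    corrected_t = d_bar / math.sqrt((1.0 / j + n_test / n_train) * s2)
    return naive_t, corrected_t


# Hypothetical differences over 20 replicates (illustrative values only).
diffs = [0.01 * ((k % 5) - 1) for k in range(20)]
naive_t, corrected_t = resampled_t_statistics(diffs, n_train=80, n_test=20)
print(naive_t, corrected_t)
```

Both statistics would be compared to a t distribution with J − 1 degrees of freedom (J = 20 replicates here). The corrected statistic is always smaller in magnitude, which is why the footnote calls it conservative.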
Average prediction accuracy by prediction procedure
| a) Target included in CS | b) Target population omitted from CS | |||||||
|---|---|---|---|---|---|---|---|---|
| Trait | Population | |||||||
| WS4U-C2 | 0.115 (0.123) • | 0.163 (0.121) | 0.133 (0.124) | 0.213 (0.137) • | −0.074 (0.214) * | |||
| Liberty-C2 | 0.467 (0.186) | 0.476 (0.189) | 0.470 (0.186) | 0.025 (0.208) | 0.025 (0.209) | |||
| U4X-N | 0.526 (0.149) | 0.486 (0.160) • | 0.525 (0.149) | 0.253 (0.160) • | 0.265 (0.168) | |||
| L4X-NE | 0.767 (0.074) | 0.767 (0.074) • | 0.762 (0.076) • | 0.403 (0.179) | 0.153 (0.188) * | |||
| WS4U-C2 | 0.272 (0.185) | 0.273 (0.159) | 0.254 (0.178) | 0.122 (0.166) | 0.094 (0.150) | |||
| Liberty-C2 | 0.532 (0.145) | 0.516 (0.152) | 0.524 (0.153) | 0.125 (0.185) | 0.137 (0.191) | |||
| U4X-N | 0.694 (0.103) | 0.693 (0.110) | 0.703 (0.100) * | 0.447 (0.179) | 0.297 (0.163) * | |||
| L4X-NE | 0.828 (0.074) | 0.832 (0.073) * | 0.835 (0.068) | 0.400 (0.212) | 0.352 (0.212) * | |||
| WS4U-C2 | 0.070 (0.208) | 0.067 (0.213) | 0.075 (0.198) | −0.046 (0.193) | −0.015 (0.201) • | |||
| Liberty-C2 | 0.055 (0.234) • | 0.105 (0.251) • | 0.102 (0.252) | 0.164 (0.185) | 0.161 (0.190) | |||
| U4X-N | 0.265 (0.169) | 0.255 (0.174) | 0.266 (0.166) | 0.042 (0.211) | −0.040 (0.204) • | |||
| L4X-NE | 0.589 (0.127) | 0.590 (0.127) | 0.591 (0.129) | 0.090 (0.219) | 0.122 (0.219) • | |||
In parentheses: standard deviation across cross-validation replicates. Validation scheme: (a) whole-sample calibration, where individuals in the target population, except the selection candidates, are included in the calibration set; (b) cross-population calibration, where all individuals from the target population are omitted from the calibration set. Trait: plant height (PH), heading date (HD) or standability (St). Population: population used as target for prediction. Prediction accuracies are averaged over 20 cross-validation replicates. Models differ by prediction procedure, under the same mean structure (Intercept: intercept-only model). GBLUP: whole-sample model; GBLUP-Target: GBLUP model where the CS includes only the individuals from the same population as the TS; MPM: multi-population model with among-population correlations based on admixture coefficients (MPM-Mixture) or PC distances (MPM-Matérn). Comparisons to GBLUP: •: P ≤ 0.05 in unadjusted (naïve) t-test (liberal); *: P ≤ 0.05 in t-test corrected for overlap in testing sets as in Nadeau and Bengio (2003) (conservative). Underlined values correspond to the highest prediction accuracy for each validation scheme, trait and population.
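The calibration sets described in this footnote can be illustrated with a small sketch. It assumes that selection candidates are drawn at random from the target population. The population names, individual IDs, testing fraction and function name are hypothetical, chosen only to show how the three calibration sets (whole-sample, cross-population and GBLUP-Target) differ.

```python
import random


def build_sets(populations, target, scheme, test_fraction=0.2, seed=0):
    """Return (calibration_set, testing_set) for one cross-validation replicate.

    populations : dict mapping population name -> list of individual IDs
    target      : name of the target population
    scheme      : "whole-sample", "cross-population" or "target-only"
    """
    rng = random.Random(seed)
    target_ids = list(populations[target])
    rng.shuffle(target_ids)
    n_test = max(1, round(test_fraction * len(target_ids)))
    testing = target_ids[:n_test]          # selection candidates
    rest_of_target = target_ids[n_test:]   # remaining target individuals
    others = [i for pop, ids in populations.items() if pop != target for i in ids]

    if scheme == "whole-sample":        # everything except the candidates
        calibration = others + rest_of_target
    elif scheme == "cross-population":  # target population entirely omitted
        calibration = others
    elif scheme == "target-only":       # GBLUP-Target: same population only
        calibration = rest_of_target
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return calibration, testing


# Hypothetical toy sample with three populations.
populations = {
    "popA": [f"A{k}" for k in range(10)],
    "popB": [f"B{k}" for k in range(10)],
    "popC": [f"C{k}" for k in range(5)],
}
cs_whole, ts = build_sets(populations, "popA", "whole-sample")
cs_cross, _ = build_sets(populations, "popA", "cross-population")
cs_target, _ = build_sets(populations, "popA", "target-only")
print(len(cs_whole), len(cs_cross), len(cs_target), len(ts))
```

Using the same seed for all three calls keeps the testing set identical, so accuracies are compared on the same selection candidates.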
Multi-population model fit: parameter estimates, likelihood-ratio test statistic and p-value, by trait and procedure
| Trait | Procedure | Parameter estimate | LRT statistic | LRT p-value |
|---|---|---|---|---|
|  | MPM-Mixture |  | 0.00 (0.00-0.31) | 1.0 (0.58-1.00) |
|  | MPM-Matérn |  | 7.39 (2.44-10.66) | 0.025 (0.0049-0.29) |
|  | MPM-Mixture |  | 15.89 (9.22-20.15) | 6.7×10⁻⁵ (7.2×10⁻⁶-0.0024) |
|  | MPM-Matérn |  | 42.76 (28.57-40.25) | 5.2×10⁻¹⁰ (1.8×10⁻⁹-6.2×10⁻⁷) |
|  | MPM-Mixture |  | 5.72 (5.21-7.99) | 0.017 (0.0047-0.023) |
|  | MPM-Matérn |  | 7.59 (7.72-9.94) | 0.022 (0.0069-0.021) |
In parentheses: range of values obtained when each of the four target populations was omitted in turn, in cross-population calibration. Trait: plant height (PH), heading date (HD) or standability (St). MPM: multi-population model with among-population correlations based on admixture coefficients (MPM-Mixture; ρ: mixture parameter) or PC distances (MPM-Matérn; ν: shape parameter; $h/D_{\max}$, with $h$ the scale parameter and $D_{\max}$ the maximum distance observed over pairs of individuals). LRT (likelihood-ratio test) statistic: $-2\ln(L_0/L_1)$, where $L_0$ and $L_1$ are the restricted maximum likelihoods of GBLUP and one of the MPM models, respectively; p-values were obtained from a χ² distribution with one (MPM-Mixture) or two (MPM-Matérn) degrees of freedom.
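The link between the LRT statistics and the p-values in this table can be checked with a short sketch. It relies on the closed-form upper tail of the χ² distribution for one and two degrees of freedom, so no statistical library is needed. The function names are illustrative, and pairing each statistic with one or two degrees of freedom is an assumption based on the footnote above.

```python
import math


def lrt_statistic(loglik_null, loglik_alt):
    """LRT statistic from restricted log-likelihoods of the null (GBLUP)
    and alternative (MPM) models."""
    return 2.0 * (loglik_alt - loglik_null)


def lrt_p_value(statistic, df):
    """Upper-tail probability of a chi-square distribution (df = 1 or 2)."""
    if df == 1:
        # P(Z^2 > x) = erfc(sqrt(x / 2)) for a standard normal Z
        return math.erfc(math.sqrt(statistic / 2.0))
    if df == 2:
        # chi-square with 2 df is exponential with mean 2
        return math.exp(-statistic / 2.0)
    raise ValueError("only df = 1 or 2 are handled in this sketch")


# Statistics quoted from the table; df assignment is an assumption.
print(lrt_p_value(15.89, df=1))  # close to the reported 6.7e-5
print(lrt_p_value(7.39, df=2))   # close to the reported 0.025
```

Because the reported statistics are rounded, the recomputed p-values agree with the table only to about two significant digits.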
Figure 3. Shape of the inferred correlation functions in MPM-Matérn. Validation scheme: (A) whole-sample calibration, where individuals in the target population (except the selection candidates) are included in the calibration set; (B) cross-population calibration, where all individuals from the target population are omitted from the calibration set. In (A), dashed curves depict correlation functions inferred in cross-validation replicates (where a part of the target population is included in the testing set), while solid curves depict correlation functions inferred in the whole sample. In (B), solid curves depict correlation functions inferred while omitting one of the four target populations in the calibration set. Correlations are functions of $D_{ij}/D_{\max}$, where $D_{ij}$ is the Euclidean distance between population-structure PCs for any pair of individuals $(i,j)$ and $D_{\max}$ is the maximum of $D_{ij}$ over the whole sample.
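To make the shape of such correlation functions concrete, the sketch below evaluates a Matérn correlation on distances scaled by their maximum. It is an illustration under stated assumptions, not the authors' implementation: it uses one common parameterization of the Matérn function, it is restricted to the shape values ν = 0.5, 1.5 and 2.5 (which have closed forms), and the PC coordinates, the scale `h` and all names are hypothetical.

```python
import math


def matern_correlation(r, h, nu):
    """Matérn correlation at scaled distance r >= 0.

    Only the closed-form cases nu in {0.5, 1.5, 2.5} are handled here;
    other shape values would require modified Bessel functions.
    """
    z = math.sqrt(2.0 * nu) * r / h
    if nu == 0.5:
        poly = 1.0                      # exponential correlation
    elif nu == 1.5:
        poly = 1.0 + z
    elif nu == 2.5:
        poly = 1.0 + z + z * z / 3.0
    else:
        raise ValueError("this sketch only supports nu = 0.5, 1.5 or 2.5")
    return poly * math.exp(-z)


# Hypothetical population-structure PC coordinates for four individuals.
pcs = {
    "ind1": (0.0, 0.0),
    "ind2": (0.1, 0.0),
    "ind3": (1.0, 1.0),
    "ind4": (3.0, 4.0),
}
names = sorted(pcs)
dist = {(i, j): math.dist(pcs[i], pcs[j]) for i in names for j in names}
d_max = max(dist.values())  # maximum distance over all pairs

# Correlation as a function of the scaled distance D_ij / D_max.
corr = {pair: matern_correlation(d / d_max, h=0.5, nu=1.5)
        for pair, d in dist.items()}
print(corr[("ind1", "ind2")], corr[("ind1", "ind4")])
```

Scaling by the maximum distance keeps the argument in [0, 1], so scale parameters fitted on samples with different distance ranges can be compared.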