Yusheng Zhao, Patrick Thorwarth, Yong Jiang, Norman Philipp, Albert W Schulthess, Mario Gils, Philipp H G Boeven, C Friedrich H Longin, Johannes Schacht, Erhard Ebmeyer, Viktor Korzun, Vilson Mirdita, Jost Dörnte, Ulrike Avenhaus, Ralf Horbach, Hilmar Cöster, Josef Holzapfel, Ludwig Ramgraber, Simon Kühnle, Pierrick Varenne, Anne Starke, Friederike Schürmann, Sebastian Beier, Uwe Scholz, Fang Liu, Renate H Schmidt, Jochen C Reif.
Abstract
The potential of big data to support businesses has been demonstrated in financial services, manufacturing, and telecommunications. Here, we report on efforts to enter a new data era in plant breeding by collecting genomic and phenotypic information from 12,858 wheat genotypes representing 6575 single-cross hybrids and 6283 inbred lines that were evaluated in six experimental series for yield in field trials encompassing ~125,000 plots. Integrating data resulted in twofold higher prediction ability compared with cases in which hybrid performance was predicted across individual experimental series. Our results suggest that combining data across breeding programs is a particularly appropriate strategy to exploit the potential of big data for predictive plant breeding. This paradigm shift can contribute to increasing yield and resilience, which is needed to feed the growing world population.
Year: 2021 PMID: 34117061 PMCID: PMC8195483 DOI: 10.1126/sciadv.abf9106
Source DB: PubMed Journal: Sci Adv ISSN: 2375-2548 Impact factor: 14.136
Fig. 1. Population genomic analyses of parental lines grouped into six experimental series.
(A) Principal coordinate analysis of the inbred lines based on the Rogers’ distance matrix. Percentages in parentheses refer to the proportion of genotypic variance explained by the first and second principal coordinates (PCs). (B) Neighbor-joining tree based on the results of FST statistics for the six experimental series (Exp.). (C) Distribution of Rogers’ distances for inbred lines within and across experimental series. In each histogram plot, the range of Rogers’ distances is displayed on the x axis; on the y axis the percentage of line pairs is provided. (D) Persistence of the LD phase between the six experimental series.
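The Rogers' distances in panels (A) and (C) can be illustrated with a minimal sketch. It assumes biallelic markers coded as 0, 1, or 2 copies of one allele, which the caption does not state, and the function names `rogers_distance` and `distance_matrix` are hypothetical. It shows the textbook definition only and is not the authors' pipeline.

```python
from math import sqrt


def rogers_distance(line_a, line_b):
    """Rogers' distance between two lines scored at biallelic markers.

    Assumed coding (not from the source): 0, 1, or 2 copies of one allele.
    """
    if len(line_a) != len(line_b) or len(line_a) == 0:
        raise ValueError("both lines need the same, non-zero number of markers")
    total = 0.0
    for a, b in zip(line_a, line_b):
        p, q = a / 2.0, b / 2.0  # frequency of the counted allele in each line
        # Sum squared frequency differences over both alleles, halve, take the root.
        total += sqrt(0.5 * ((p - q) ** 2 + ((1.0 - p) - (1.0 - q)) ** 2))
    return total / len(line_a)  # average over markers


def distance_matrix(lines):
    """Symmetric matrix of pairwise Rogers' distances."""
    n = len(lines)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = rogers_distance(lines[i], lines[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Toy data (invented): three fully homozygous lines at four markers.
toy_lines = [[0, 2, 2, 0], [0, 2, 0, 2], [0, 2, 2, 0]]
toy_matrix = distance_matrix(toy_lines)
print(toy_matrix[0][1])  # 0.5: the two lines differ at two of four markers
```

For fully homozygous inbred lines the distance reduces to the proportion of markers at which two lines differ. A matrix of this kind is the usual input to a principal coordinate analysis.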
Fig. 2. Grain yield performance assessed in multienvironmental field trials.
(A) Broad-sense heritability values for hybrids and lines within experimental series are shown as bars and across experimental series as vertical lines. Light and dark gray refer to hybrids and lines, respectively. (B) Assessing a potential bias in grain yield estimates triggered by merging nonorthogonal phenotypic data across experimental series. Grain yield was estimated on the basis of the combined phenotypic data of all but one of the overlapping genotypes. For this genotype, grain yield was then estimated separately for experimental series (Exp.) I or VI and a combined set of experimental series II, III, IV, and V. Repeating this procedure for all overlapping genotypes resulted in two sets of estimates. The correlations between these estimates are plotted. ***P < 0.001. (C) Distribution of best linear unbiased estimates for grain yield (Mg ha⁻¹) of the genotypes included in the six experimental series.
Fig. 3. Prediction abilities and the effective population size within and across the six experimental series.
(A) Prediction ability within experimental series (Exp.) was estimated for related or unrelated training populations by using a chessboard-like cross-validation in experimental series I, II, and III or by fivefold cross-validation based on random sampling of genotypes (random scenario) in experimental series IV, V, and VI. (B) Prediction abilities across different experimental series. For each of the training populations shown on the x axis, the prediction abilities for the different test populations are displayed as colored bars. (C) Increase in prediction ability combining incremental data across experimental series. The lengths of the colored boxes in each bar represent the proportions of the genotypes of the different experimental series used as training sets. (D) Effective population size within and across the experimental series. The different experimental series are color coded according to the key in (A).
Fig. 4. Relationship between prediction ability and effective population size (Ne) in experimental series VI.
(A) Biplot of observed prediction ability and the ratio of sample population size (N) versus the effective population size (Ne) in 500 subsamples ranging from N = 100 to N = 3100 drawn randomly out of experimental series VI. (B) Association between observed and estimated prediction accuracies in 500 subsamples drawn randomly out of experimental series VI. (C) Projection of prediction ability for population size N from 200 to 20,000 and Ne from 2 to 200; the red line corresponds to the square root of heritability, which represents the upper limit of the prediction ability.
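The projection in panel (C) can be pictured with a rough sketch built on a widely used deterministic approximation, in which accuracy is sqrt(N h² / (N h² + Me)) and prediction ability is accuracy times sqrt(h²). The caption does not say which estimator the authors used, so this formula, the approximation Me = 2 Ne L / ln(4 Ne L) for the number of independent chromosome segments, the default genome length, and the heritability value are all assumptions made for illustration. The function names are hypothetical.

```python
from math import log, sqrt


def independent_segments(ne, genome_morgans):
    """Assumed approximation for the number of independent segments (Me)."""
    x = 4.0 * ne * genome_morgans
    if x <= 1.0:
        raise ValueError("4 * Ne * L must exceed 1 for this approximation")
    return 2.0 * ne * genome_morgans / log(x)


def expected_prediction_ability(n, ne, h2, genome_morgans=30.0):
    """Expected correlation between predicted and observed phenotypes.

    n: training population size, ne: effective population size,
    h2: heritability. genome_morgans is a placeholder value, not a
    figure taken from the source.
    """
    me = independent_segments(ne, genome_morgans)
    accuracy = sqrt(n * h2 / (n * h2 + me))  # correlation with true genetic value
    return accuracy * sqrt(h2)  # bounded above by sqrt(h2)


# Illustrative grid over the ranges named in the caption; h2 = 0.5 is invented.
h2 = 0.5
for ne in (2, 20, 200):
    row = [round(expected_prediction_ability(n, ne, h2), 3) for n in (200, 2000, 20000)]
    print(ne, row)
```

Under these assumptions the expected ability rises with N, falls with Ne, and approaches but never reaches sqrt(h²), which matches the upper limit drawn as the red line.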
Fig. 5. Optimized field designs to reduce genotype-by-environment interaction effects exemplified on the basis of yield trials of experimental series II in 12 environments.
(A) In scenario I, all lines and hybrids are tested in a subset of three environments (Env). (B) In scenario II, a core of 10% of the lines and hybrids is sampled and tested in all 12 environments together with 11 check varieties (yellow color). The remaining 90% of lines and hybrids are divided into six groups of equal size and tested in two environments. (C) In scenario III, the lines and hybrids are divided into 10 subgroups, each of which is tested in only three environments, with the restriction that two environments overlap with those of the next group. All 12 environments are linked with 11 check varieties (yellow color). (D) Correlation between grain yield estimates for the data of the subsets of scenario I, II, and III and those for all 12 environments.
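The allocation rules of scenarios II and III can be sketched as follows. The caption gives the group counts and the number of environments per group, but not which environments each group receives. The sketch therefore assumes that scenario II assigns disjoint consecutive pairs of environments and that scenario III uses a window of three environments sliding by one, which is one layout where neighbouring groups share two environments and 10 groups cover all 12. Random assignment of genotypes to groups, the function names, and the `checks` argument are likewise hypothetical.

```python
import random


def scenario_two(genotypes, checks=(), n_env=12, core_fraction=0.10,
                 n_groups=6, seed=0):
    """Core set in every environment; the rest in one block of environments."""
    shuffled = list(genotypes)
    random.Random(seed).shuffle(shuffled)  # assumed: random group membership
    n_core = round(core_fraction * len(shuffled))
    envs_per_group = n_env // n_groups  # 12 environments / 6 groups = 2 each
    plan = {c: list(range(n_env)) for c in checks}  # checks link all environments
    for g in shuffled[:n_core]:
        plan[g] = list(range(n_env))
    for i, g in enumerate(shuffled[n_core:]):
        k = i % n_groups  # round-robin keeps group sizes equal
        plan[g] = list(range(k * envs_per_group, (k + 1) * envs_per_group))
    return plan


def scenario_three(genotypes, checks=(), n_env=12, n_groups=10,
                   envs_per_group=3, seed=0):
    """Sliding windows: group k gets environments k, k+1, ..., k+envs_per_group-1."""
    if n_groups + envs_per_group - 1 != n_env:
        raise ValueError("sliding windows must cover the environments exactly")
    shuffled = list(genotypes)
    random.Random(seed).shuffle(shuffled)
    plan = {c: list(range(n_env)) for c in checks}
    for i, g in enumerate(shuffled):
        k = i % n_groups
        plan[g] = list(range(k, k + envs_per_group))
    return plan


# Toy usage with invented genotype names (environments are numbered from 0).
names = ["G%d" % i for i in range(100)]
plan_two = scenario_two(names)
plan_three = scenario_three(names)
print(sorted(len(envs) for envs in plan_two.values())[:3])
print(plan_three["G0"])
```

With 100 toy genotypes, scenario II places 10 core genotypes in all 12 environments and 15 further genotypes in each pair of environments, so every environment carries 25 entries plus any checks.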