Andrés J Cortés, Felipe López-Hernández.
Abstract
Warming and drought are reducing global crop production, with the potential to substantially worsen global malnutrition. As with the green revolution in the last century, plant genetics may offer concrete opportunities to increase yield and crop adaptability. However, the pace of this threat calls for new strategies to meet global food demand. In this review, we highlight major recent 'big data' developments from both empirical and theoretical genomics that may speed up the identification, conservation, and breeding of exotic and elite crop varieties with the potential to feed humans. We first emphasize the major bottlenecks to capturing and utilizing novel sources of variation in abiotic stress (i.e., heat and drought) tolerance. We argue that the adaptation of crop wild relatives to dry environments could be informative about how plant phenotypes may react to a drier climate, because natural selection has already tested more options than humans ever will. Because isolated pockets of cryptic diversity may still persist in remote semi-arid regions, we encourage new habitat-based population-guided collections for genebanks. We then discuss how to systematically study abiotic stress tolerance in these collections of crop wild relatives and landraces using geo-referencing and extensive environmental data. Once the genes that underlie adaptive tolerance traits are uncovered, natural variation can potentially be introgressed into elite cultivars. However, unlocking adaptive genetic variation hidden in related wild species and early landraces remains a major challenge for complex traits that, like abiotic stress tolerance, are polygenic (i.e., regulated by many low-effect genes). Therefore, we close by surveying modern analytical approaches that may help overcome this issue.
Concretely, genomic prediction, machine learning, and multi-trait gene editing all offer innovative alternatives to speed up more accurate pre-breeding and breeding efforts to increase crop adaptability and yield, while matching future global food demands in the face of increased heat and drought. For these 'big data' approaches to succeed, we advocate a trans-disciplinary approach with open-source data and long-term funding. The recent developments and perspectives discussed throughout this review ultimately aim to contribute to increased crop adaptability and yield in the face of heat waves and drought events.
Keywords: abiotic stress tolerance; ex situ conservation; genebanks; genetic adaptation; genome-wide selection scans (GWSS); genome–environment associations (GEA); genomic prediction (GP); germplasm collections; machine learning (ML)
Year: 2021 PMID: 34065368 PMCID: PMC8161384 DOI: 10.3390/genes12050783
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. A roadmap of trans-disciplinary approaches aiming at harnessing genebank utilization for climate change research in the face of heat and water scarcity. Compiling (a) previous characterizations and (b) geo-referencing-derived climate data/indices of available genetic resources in genebanks is a starting point to (c) assess the extent of abiotic stress tolerance among existing accessions, and the need for (d) new habitat-based population-guided collections targeting isolated pockets of cryptic diversity in dry and semi-arid regions. Planning question-oriented collecting trips of crop wild relatives and hidden landraces across contrasting environments/agro-ecologies is needed now more than ever, despite a century of gathering and preserving plant diversity in genebanks. Coupling ex situ agro-ecological screenings with (e) ongoing in situ genebank characterizations for morphological and genetic variation is essential to define (c) putative tolerant reference collections, while understanding the (f) heritability (h²) of adaptive traits and their genetic architecture (i.e., underlying genes) via genome-wide selection scans (GWSS), genome–environment associations (GEA), and genome-wide association studies (GWAS). Since identifying these novel sources of heat and drought tolerance demands merging heterogeneous datasets, (g) machine learning (ML, in red letters) promises to speed up genebank characterization. The distinction that clustering (Table 1) and ML (Figure 2 and Table 2) strategies can provide between abiotic stress tolerant and susceptible accessions is essential to (h) transfer useful genetic variation from wild crop donors and early landraces into elite cultivated lines, either by designing (i) genomic-assisted breeding programs such as genomic prediction (GP) and inter-specific marker- and genomic-assisted backcrossing (MAB and GABC) schemes, or by envisioning (j) multi-trait gene editing strategies (e.g., CRISPR-Cas9).
Once (k) abiotic stress tolerant varieties are validated across different environments, (l) legal inscription, seed multiplication, a seed delivery system to farmers' associations, and (m) follow-up given the regional needs, market demands, and adoption potential are necessary downstream validation steps. These heterogeneous datasets are also likely to be fed into ML, and in turn to feed back new needs beyond heat and drought tolerance, such as other types of resistance and nutritional quality. For ML to succeed in speeding up the breeding of heat- and drought-tolerant crops, there must be long-term funding to generate and maintain an assortment of datasets at each step, which in turn need to be publicly available through open access repositories from various geographic locations. Red boxes highlight different reservoirs of wild and cultivated diversity within the Cartesian space, gray boxes are mixed datasets built around these collections, and connectors are methodological approaches.
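As a reminder of the quantity referred to in step (f) of Figure 1, heritability is conventionally written as a ratio of variance components. The expression below assumes the narrow-sense definition, which the caption does not state explicitly; the symbols $\sigma^2_A$, $\sigma^2_P$, and $\sigma^2_E$ are standard quantitative genetics notation rather than notation taken from the review.

```latex
% Narrow-sense heritability (assumed definition):
% additive genetic variance over total phenotypic variance.
h^2 = \frac{\sigma^2_A}{\sigma^2_P},
\qquad
\sigma^2_P = \sigma^2_A + \sigma^2_{\text{non-additive}} + \sigma^2_E
```

Under this reading, a trait with a low $h^2$ leaves little additive variation for selection to act on, which is one reason the caption pairs heritability estimates with scans for the underlying genes.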
Table 1. Non-parametric and parametric classification approaches that can assist clustering efforts to differentiate between abiotic stress tolerant and susceptible germplasm accessions. Habitat types and local adaptation to heat and drought stresses can be inferred using climate variables and physiological indices from each accession's geo-referencing (Figure 1b) because crop wild relatives and landraces have occupied local niches (e.g., arid vs. wet regions) long enough to be shaped by natural selection. Predicted thermal tolerance and water use efficiency, together with other data types (Figure 1e,f), can then be merged (Figure 1g) in order to identify and unlock novel sources of heat and drought tolerance. The trained classification may also speed up the utilization of these tolerant variants by genomic-assisted breeding techniques (Figure 1i). ML approaches (Table 2) are also capable of including further data types for more cohesive multi-dimensional predictions (e.g., Figure 1m).
| Approach | Method | Description of the Method | R Package/Tool | Method's Reference | Example (Accessions × Markers) |
|---|---|---|---|---|---|
| | K-means | Each observation belongs to the cluster with the nearest mean. It minimizes the distance between points labeled to be in a cluster and a point designated as the center (mean) | | [ | Maize: 2022 × 65,995 [; Ryegrass: 1757 × 1,005,590 [ |
| | Partitioning Around Medoids (PAM) | It minimizes the distance between points labeled to be in a cluster and a point designated as the center (medoid) of that cluster. PAM chooses data points as centers (medoids) and works with a generalization of the Manhattan norm to define the distance between data points | | [ | Maize: 260 × 11,296,689 [ |
| | Clustering large applications (CLARA) | It extracts multiple sample sets from the dataset and uses the best cluster as output. It uses PAM for each sample | | [ | 90 × 5000 [ |
| | Hierarchical clustering (Hclust) | It is a method of cluster analysis that seeks to build a hierarchy of clusters | | [ | Barley: 1816 × 1416 & Wheat: 478 × 219 [; Oat: 131 × 3567 [ |
| | DIANA (Divisive analysis) | It first places all objects in a cluster and then subdivides them into smaller clusters until the desired number of clusters is obtained | | [ | These algorithms were systematically compared, and included K-means, PAM, CLARA, Hclust, DIANA, and AGNES [ |
| | Agglomerative Nesting (AGNES) | It initially takes each object as a cluster, afterwards the clusters are merged step by step according to certain criteria, using a single-link method | | [ | |
| | AWclust | The first step of AWclust is to construct the ASD matrix between all pairs of individuals in the sample. The second step is to apply hierarchical clustering to infer clusters of individuals from the ASD matrix using Ward's minimum variance algorithm | | [ | Olive: 94 × 8088 [; Pepper: 222 × 32,950 [ |
| | TESS3 | Geography is one of the most important determinants of genetic variation in natural populations. Using genotypic and geographic data, | | [ | These algorithms have been widely used and compared among them [ |
| | fastSTRUCTURE | STRUCTURE uses the core Bayesian principle of comparing likelihoods. Prior information about study samples can be supplied to further shape the unsupervised clustering | | [ | |
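To make the first row of Table 1 concrete, the following is a minimal sketch of K-means applied to environmental indices derived from geo-referencing. It is an illustration only, not the procedure used in any of the cited studies: the function name `kmeans`, the two rescaled indices (a heat index and an aridity index), and all accession values are hypothetical, and the centers are initialized deterministically from the first `k` points for reproducibility, whereas practical implementations typically use random restarts.

```python
from math import dist
from statistics import fmean


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm with deterministic start (first k points as centers)."""
    centers = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each observation joins the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centers[c])) for p in points
        ]
        # Update step: each center moves to the mean of its current members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centers.append(tuple(fmean(col) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_labels == labels and new_centers == centers:
            break  # converged: nothing changed in this iteration
        labels, centers = new_labels, new_centers
    return labels, centers


# Hypothetical accessions: (heat index, aridity index), both rescaled to [0, 1].
accession_indices = [
    (0.90, 0.85),  # collected in a hot, dry habitat
    (0.20, 0.10),  # collected in a cool, wet habitat
    (0.80, 0.90),
    (0.10, 0.20),
    (0.85, 0.80),
    (0.15, 0.15),
]

labels, centers = kmeans(accession_indices, k=2)
print(labels)   # cluster membership per accession
print(centers)  # cluster means in index space
```

Clusters obtained this way only describe the habitat of origin; whether a "hot, dry" cluster actually holds tolerant accessions still has to be checked against phenotypic or genomic evidence, as the caption of Table 1 indicates.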
Table 2. Machine learning (ML) predictive tools validated within a GP framework that can be extended to assist clustering efforts to differentiate between abiotic stress tolerant and susceptible germplasm accessions. These ML algorithms could be trained to distinguish habitat types and local adaptation to heat and drought stresses by looking into in situ climate variables and physiological indices from each accession's geo-referencing (Figure 1b). This is possible because crop wild relatives and landraces have occupied local niches (e.g., hot vs. cold regions) for enough time to be shaped by natural selection. Predicted heat and drought tolerance can further harness other data types (Figure 1e,f) in order to identify and unlock novel sources of heat and drought tolerance (Figure 1g). The ML-trained classification may also speed up the utilization of tolerant variants by genomic-assisted breeding techniques (Figure 1i and Figure 2g). The table is sorted by species name and by ML approach.
| ML Approach(es) | Species | Accessions × Genetic Markers | Reference |
|---|---|---|---|
| RF | Barley | 911 × 2146 SNP | Heslot et al., 2012 [ |
| ANN | Bean | 80 × 384 SNP | Rosado et al., 2020 [ |
| SVM | Black tea | 255 × 1421 DArT SNP | Koech et al., 2020 [ |
| RF | Chickpea | 315 × 1568 DArT SNP | Roorkiwal et al., 2016 [ |
| DT, Bagging, Boosting, RF, ANN | Coffee | 245 × 74 AFLP, 58 SSR, 4 RAPD, and 2 primers | Sousa et al., 2021 [ |
| RF | Coffee | 96 × 38,106 SNP | Ferrão et al., 2019 [ |
| SVM | Hybrid Rice | 575 × 116,482 SNP | Xu et al., 2018 [ |
| ANN | Maize | 300 × 55,000 SNP | González-Camacho et al., 2012 [ |
| DL | Maize | 148,452 × 19,465 SNP | Khaki & Wang, 2019 [ |
| DL | Maize | ~300 × ~1000 SNP | Rachmatia et al., 2017 [ |
| KNN | Maize | 198 × 75 SSR | Maenhout et al., 2007 [ |
| MLP, PNN | Maize | ~300 × 46,374 SNP | González-Camacho et al., 2016 [ |
| RBFNN, ANN | Maize | ~300 × 46,374 SNP | González-Camacho et al., 2012 [ |
| RF | Maize | 240 × 29,619 SNP | Shikha et al., 2017 [ |
| RF | Maize | 240 × 56,110 SNP | Shikha et al., 2017 [ |
| RF, SVM, ANN, Boosting | Maize | 391 × 332,178 SNP | Azodi et al., 2019 [ |
| SVM | Maize | 4328 × 564,692 SNP | Zhao et al., 2020 [ |
| SVM, RF | Maize | 113 × 47,458 SNP | Li et al., 2020 [ |
| ZAP-RF | Maize | 115 × 1635 SNP | Montesinos-López et al., 2021 [ |
| DL | Maize | 309 × 158,281 SNP | Montesinos-López et al., 2018 [ |
| RF, SVM | Mice | 1884 × 9917 SNP | Neves et al., 2012 [ |
| SVM | Pea | 105 × 7521 SNP | Annicchiarico et al., 2017 [ |
| RF, Boosting, KNN | Perennial ryegrass | 86 × 1670 SNP | Grinberg et al., 2016 [ |
| RF, GBM, KNN | Perennial ryegrass | 86 × 1670 SNP | Grinberg et al., 2016 [ |
| Bagging, RF, SVM | Rice | 363 × 73,147 SNP | Banerjee et al., 2020 [ |
| RF | Rice | 110 × 3071 SNP | Onogi et al., 2015 [ |
| SVM, Boosting | Simulated Dataset | 3226 × 10,031 | Ogutu et al., 2011 [ |
| DL | Strawberry | 1358 × 9908 SNP | Zingaretti et al., 2020 [ |
| ANN | Wheat | 599 × 1279 SNP | Gianola et al., 2011 [ |
| ANN | Wheat | 306 × 1717 SNP | Pérez-Rodríguez et al., 2012 [ |
| DL | Wheat | ~500 × 15,744 SNP | Crossa et al., 2019 [ |
| DL | Wheat | 237 × 27,957 SNP | Guo et al., 2020 [ |
| DL | Wheat | 2000 × 33,709 DArT SNP | Ma et al., 2017 [ |
| GBM, RF, SVM | Wheat | 254 × 33,516 SNP | Grinberg et al., 2020 [ |
| MLP, PNN | Wheat | ~300 × 1717 DArT SNP | González-Camacho et al., 2016 [ |
| RF | Wheat | 254 × 41,371 SNP | Poland et al., 2012 [ |
| RF, KNN | Wheat | 273 × 5054 SNP | Arruda et al., 2015 [ |
| DL, SVM | Wheat | 3486 × 2038 SNP | Montesinos-López et al., 2019 [ |
ML tool abbreviations are as follows: adaptive boosting (AdaBoost), artificial neural networks (ANN), decision tree (DT), deep learning (DL), extreme gradient boosting (XGBoost), gradient boosting machines (GBM), k-nearest neighbors (KNN), multilayer perceptron neural network (MLP), probabilistic neural network (PNN), radial basis function neural network (RBFNN), random forest (RF), support vector machines (SVM), and zero altered Poisson random forest (ZAP-RF). ML-coupled genomic prediction initiatives explicitly related to abiotic stress tolerance are marked with Ψ under reference.
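As a small illustration of how one of the simpler approaches in Table 2 could separate tolerant from susceptible accessions, the sketch below implements a k-nearest neighbors (KNN) classifier on SNP genotypes. Everything in it is assumed for illustration: the function name `knn_predict`, the 0/1/2 allele-count coding, the Manhattan distance between genotypes, and the toy genotypes and labels, none of which come from the studies listed in the table.

```python
from collections import Counter


def knn_predict(train_genotypes, train_labels, query, k=3):
    """Classify one genotype by majority vote among its k nearest neighbors.

    Genotypes are lists of allele counts (0, 1, or 2) per SNP; the distance
    is the Manhattan distance, i.e., the number of allele differences.
    """
    distances = sorted(
        (sum(abs(a - b) for a, b in zip(genotype, query)), label)
        for genotype, label in zip(train_genotypes, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training accessions genotyped at four SNPs.
train_genotypes = [
    [2, 2, 1, 0],
    [2, 1, 2, 0],
    [2, 2, 2, 1],
    [0, 0, 1, 2],
    [0, 1, 0, 2],
    [1, 0, 0, 2],
]
train_labels = [
    "tolerant", "tolerant", "tolerant",
    "susceptible", "susceptible", "susceptible",
]

# Two uncharacterized accessions to classify.
call_a = knn_predict(train_genotypes, train_labels, [2, 2, 1, 1])
call_b = knn_predict(train_genotypes, train_labels, [0, 0, 0, 2])
print(call_a, call_b)
```

With real marker panels of thousands to hundreds of thousands of SNPs, as in Table 2, raw distances become dominated by neutral markers and population structure, so some form of marker selection or dimensionality reduction would usually come first.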
Figure 2. A pipeline for machine learning (ML) applications capable of predicting abiotic stress tolerant and susceptible germplasm accessions. First, a subset of the germplasm collection is (a) characterized genomically, phenotypically (whenever possible), and environmentally (i.e., abiotic stress adaptation indices based on geo-referencing). This subset is then partitioned into (b) training and (c) testing populations. The training population is used to calibrate (d) ML models that use genomic information to predict genomic estimated adaptive values (GEAVs, a ranking analogous to the polygenic score (PGS) and genomic estimated breeding value (GEBV) from the quantitative genomics literature, e.g., [102,136]). The computer screen depicts a hypothetical hidden neural network (HNN) algorithm, which is one among many potential ML tools; the repertoire includes several regression, classification, and deep learning models, thoughtfully reviewed this year by Sebestyén et al. [137] and Tong and Nikoloski [138]. Meanwhile, the testing population is used to compute the (e) unbiased predictive ability of the model by comparing the GEAVs with the recorded environmental (or phenotypic) abiotic stress tolerant/susceptible indices. Broadly speaking, calibrated and validated ML models can serve two main purposes when applied to germplasm collections. First, (f) they could enhance our knowledge of the genomic architecture (i.e., genetic basis) of abiotic stress tolerance via ML-based genome-wide association studies (GWAS), and of the genomic landscape of adaptation via ML-based genome-wide selection scans (GWSS) and genome–environment associations (GEA). Second, (g) calibrated and validated ML models can be applied to a (h) query population, such as extended germplasm samples for which environmental-based indices or phenotyping are not viable, informing GEAVs and (i) abiotic stress tolerance across a wider genepool.
Clusters of abiotic stress tolerance and susceptibility based on phenotypic information and/or environmental-based indices can be built using traditional classification tools such as the ones listed in Table 1, or may also leverage ML prediction approaches (Table 2).
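The steps of Figure 2 can be sketched end to end on simulated data. The example below is a hedged illustration, not the authors' pipeline: it swaps the neural network shown in the figure for a ridge regression fitted by gradient descent, simulates genotypes and an environmental tolerance index with Python's `random` module, and measures predictive ability as the Pearson correlation between predicted GEAVs and the recorded index in the testing population. All names (`fit_ridge`, `predict`, `true_effects`, the sample sizes, and the hyperparameters) are assumptions made for the sketch.

```python
import random
from math import sqrt
from statistics import fmean


def pearson(x, y):
    """Pearson correlation, used here as the predictive ability (step e)."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def fit_ridge(X, y, lam=0.01, lr=0.05, n_iter=1000):
    """Ridge regression on centered markers, fitted by batch gradient descent."""
    n, m = len(X), len(X[0])
    col_means = [fmean(col) for col in zip(*X)]
    Xc = [[v - mu for v, mu in zip(row, col_means)] for row in X]
    intercept = fmean(y)
    beta = [0.0] * m
    for _ in range(n_iter):
        residuals = [
            yi - intercept - sum(b * v for b, v in zip(beta, row))
            for row, yi in zip(Xc, y)
        ]
        for j in range(m):
            grad = -2.0 / n * sum(r * row[j] for r, row in zip(residuals, Xc))
            beta[j] -= lr * (grad + 2.0 * lam * beta[j])
    return intercept, beta, col_means


def predict(model, genotype):
    """Genomic estimated adaptive value (GEAV) for one genotype."""
    intercept, beta, col_means = model
    return intercept + sum(b * (g - mu) for b, g, mu in zip(beta, genotype, col_means))


rng = random.Random(42)
n_accessions, n_markers = 80, 20
# (a) Simulated characterization: 0/1/2 genotypes and a tolerance index
# driven by five markers plus noise (all values hypothetical).
true_effects = [1.0, -0.8, 0.6, 0.5, -0.4] + [0.0] * (n_markers - 5)
genotypes = [[rng.choice((0, 1, 2)) for _ in range(n_markers)] for _ in range(n_accessions)]
index = [sum(e * g for e, g in zip(true_effects, row)) + rng.gauss(0, 0.2) for row in genotypes]

# (b, c) Partition into training and testing populations.
X_train, y_train = genotypes[:60], index[:60]
X_test, y_test = genotypes[60:], index[60:]

# (d) Calibrate the model; (e) predictive ability on the testing population.
model = fit_ridge(X_train, y_train)
predictive_ability = pearson([predict(model, g) for g in X_test], y_test)
print(f"predictive ability: {predictive_ability:.2f}")

# (g, h) Apply the validated model to a query population lacking indices.
query = [[rng.choice((0, 1, 2)) for _ in range(n_markers)] for _ in range(5)]
geavs = [predict(model, g) for g in query]
ranking = sorted(range(len(geavs)), key=lambda i: geavs[i], reverse=True)
print("query accessions ranked by GEAV:", ranking)
```

The testing population is held out entirely from calibration so that the correlation is not inflated by overfitting; in practice this split would be repeated (e.g., by cross-validation) and would need to account for relatedness between training and testing accessions.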