Iulian Gabur1,2, Danut Petru Simioniuc2, Rod J Snowdon1, Dan Cristea3.
Abstract
Large plant breeding populations are traditionally a source of novel allelic diversity and are at the core of selection efforts for elite material. Finding rare diversity requires a deep understanding of biological interactions between the genetic makeup of one genotype and its environmental conditions. Most modern breeding programs still rely on linear regression models to solve this problem, generalizing the complex genotype by phenotype interactions through manually constructed linear features. However, the identification of positive alleles vs. background can be addressed using deep learning approaches that have the capacity to learn complex nonlinear functions for the inputs. Machine learning (ML) is an artificial intelligence (AI) approach involving a range of algorithms to learn from input data sets and predict outcomes in other related samples. This paper describes a variety of techniques that include supervised and unsupervised ML algorithms to improve our understanding of nonlinear interactions from plant breeding data sets. Feature selection (FS) methods are combined with linear and nonlinear predictors and compared to traditional prediction methods used in plant breeding. Recent advances in ML allowed the construction of complex models that have the capacity to better differentiate between positive alleles and the genetic background. Using real plant breeding program data, we show that ML methods have the ability to outperform current approaches, increase prediction accuracies, decrease the computing time drastically, and improve the detection of important alleles involved in qualitative or quantitative traits.
Keywords: feature selection; genomic selection; linear models; machine learning; oilseed rape; wheat
Year: 2022 PMID: 35669178 PMCID: PMC9164111 DOI: 10.3389/frai.2022.876578
Source DB: PubMed Journal: Front Artif Intell ISSN: 2624-8212
Figure 1. Machine learning (ML) workflow that includes feature selection (FS) methods [principal component analysis (PCA)], nonlinear dimensionality reduction [random forest (RF)], and random selection, applied to 80% of the input data set; prediction accuracies were evaluated on the remaining 20%, the validation populations (VPs). Features obtained from the FS filter methods were combined with ridge regression best linear unbiased prediction (rrBLUP), least absolute shrinkage and selection operator (LASSO) regression, gradient boosting machines (GBM), artificial neural networks (ANN), and RF predictors.
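The 80/20 split and the random-selection filter named in the Figure 1 caption can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names `split_train_validation` and `random_marker_subset`, the seeds, and the shuffling strategy are assumptions. Only the 80/20 proportion, the idea of a random SNP subset, and the panel sizes (191 wheat cultivars, 8,630 markers, 1,000-marker subsets) come from the source.

```python
import random


def split_train_validation(sample_ids, train_fraction=0.8, seed=0):
    """Shuffle sample IDs and split them into training and validation sets."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    n_train = int(round(train_fraction * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]


def random_marker_subset(n_markers, subset_size, seed=0):
    """Pick a random subset of marker (SNP) column indices, sorted for readability."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_markers), subset_size))


# Hypothetical usage with the wheat panel sizes reported in the source.
train_ids, validation_ids = split_train_validation(range(191))
selected = random_marker_subset(n_markers=8630, subset_size=1000)
print(len(train_ids), len(validation_ids), len(selected))
```

A predictor would then be fitted on the selected marker columns of the training samples and scored on the validation samples; that step is left out here because it depends on the chosen learner.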
Figure 2. Boxplots for (A) hybrid yield and (B) days to flower predictions of the B. napus data set obtained on 10-fold cross-validation (CV) sets, using Pearson correlations, with rrBLUP, LASSO, GBM, ANN, and RF, with 100 and 1,000 SNP subsets selected with different filter methods and the entire SNP data set (colored in gray). Filter methods: principal component analysis 100 PCA (red), 1,000 RF (green), 1,000 RS (blue), and 14,718 (gray, no FS using the total number of markers after QC).
Median accuracy of cross-validation (CV) results using the Pearson correlation coefficient, over all pairs of feature scores in the 10 outer training sets, obtained with different filter methods for feature selection (FS): hybrid seed yield (SY) and days to flower for the B. napus panel, and grain yield under three management practices (LoN NoF, HiN HiF, and HiN NoF, where LoN denotes low nitrogen inputs and NoF no fungicides) for the 191 wheat cultivars.
| Trait | FS method | Predictor | No. of SNPs | Median accuracy (Pearson r) |
|---|---|---|---|---|
| Hybrid yield | - | rrBLUP | 14,718 | 0.3674 |
| | | LASSO | 14,718 | 0.3566 |
| | | GBM | 14,718 | 0.3019 |
| | | ANN | 14,718 | |
| | | RF | 14,718 | 0.3603 |
| | PCA | rrBLUP | 100 | 0.2964 |
| | | LASSO | 100 | 0.2954 |
| | | GBM | 100 | 0.2878 |
| | | ANN | 100 | 0.2810 |
| | | RF | 100 | 0.3401 |
| | RF | rrBLUP | 1,000 | |
| | | LASSO | 1,000 | 0.3089 |
| | | GBM | 1,000 | 0.3478 |
| | | ANN | 1,000 | 0.3349 |
| | | RF | 1,000 | 0.3525 |
| | random | rrBLUP | 1,000 | 0.0047 |
| | | LASSO | 1,000 | −0.0021 |
| | | GBM | 1,000 | 0.0596 |
| | | ANN | 1,000 | −0.0303 |
| | | RF | 1,000 | 0.0244 |
| Days to flower | - | rrBLUP | 14,718 | |
| | | LASSO | 14,718 | 0.8089 |
| | | GBM | 14,718 | 0.7932 |
| | | ANN | 14,718 | |
| | | RF | 14,718 | 0.7658 |
| | PCA | rrBLUP | 100 | 0.6795 |
| | | LASSO | 100 | 0.6791 |
| | | GBM | 100 | 0.7894 |
| | | ANN | 100 | 0.7388 |
| | | RF | 100 | 0.7600 |
| | RF | rrBLUP | 1,000 | 0.6810 |
| | | LASSO | 1,000 | 0.7202 |
| | | GBM | 1,000 | 0.7437 |
| | | ANN | 1,000 | 0.7184 |
| | | RF | 1,000 | 0.7056 |
| | random | rrBLUP | 1,000 | 0.0048 |
| | | LASSO | 1,000 | 0.0022 |
| | | GBM | 1,000 | 0.0597 |
| | | ANN | 1,000 | 0.0303 |
| | | RF | 1,000 | 0.0245 |
| Grain yield (LoN.NoF) | - | rrBLUP | 8,630 | 0.5588 |
| | | LASSO | 8,630 | 0.5640 |
| | | GBM | 8,630 | |
| | | ANN | 8,630 | 0.5496 |
| | | RF | 8,630 | 0.6018 |
| | PCA | rrBLUP | 100 | 0.6849 |
| | | LASSO | 100 | 0.6865 |
| | | GBM | 100 | 0.5274 |
| | | ANN | 100 | |
| | | RF | 100 | 0.5530 |
| | RF | rrBLUP | 1,000 | 0.6686 |
| | | LASSO | 1,000 | 0.5725 |
| | | GBM | 1,000 | 0.5922 |
| | | ANN | 1,000 | 0.5987 |
| | | RF | 1,000 | 0.6112 |
| | random | rrBLUP | 1,000 | 0.0015 |
| | | LASSO | 1,000 | 0.0803 |
| | | GBM | 1,000 | 0.0046 |
| | | ANN | 1,000 | 0.0230 |
| | | RF | 1,000 | 0.0299 |
| Grain yield (HiN.HiF) | - | rrBLUP | 8,630 | 0.8003 |
| | | LASSO | 8,630 | 0.7279 |
| | | GBM | 8,630 | 0.7280 |
| | | ANN | 8,630 | 0.7829 |
| | | RF | 8,630 | 0.7289 |
| | PCA | rrBLUP | 100 | 0.7812 |
| | | LASSO | 100 | 0.7820 |
| | | GBM | 100 | 0.7423 |
| | | ANN | 100 | 0.7788 |
| | | RF | 100 | 0.7314 |
| | RF | rrBLUP | 1,000 | 0.8030 |
| | | LASSO | 1,000 | |
| | | GBM | 1,000 | 0.7361 |
| | | ANN | 1,000 | |
| | | RF | 1,000 | 0.7579 |
| | random | rrBLUP | 1,000 | 0.0007 |
| | | LASSO | 1,000 | 0.0050 |
| | | GBM | 1,000 | 0.0381 |
| | | ANN | 1,000 | 0.0537 |
| | | RF | 1,000 | 0.0021 |
| Grain yield (HiN.NoF) | - | rrBLUP | 8,630 | |
| | | LASSO | 8,630 | 0.5047 |
| | | GBM | 8,630 | 0.5226 |
| | | ANN | 8,630 | 0.4482 |
| | | RF | 8,630 | 0.4482 |
| | PCA | rrBLUP | 100 | 0.5472 |
| | | LASSO | 100 | 0.5474 |
| | | GBM | 100 | 0.4421 |
| | | ANN | 100 | 0.5150 |
| | | RF | 100 | 0.4317 |
| | RF | rrBLUP | 1,000 | 0.5362 |
| | | LASSO | 1,000 | |
| | | GBM | 1,000 | 0.4844 |
| | | ANN | 1,000 | 0.4850 |
| | | RF | 1,000 | 0.5349 |
| | random | rrBLUP | 1,000 | 0.0231 |
| | | LASSO | 1,000 | 0.0050 |
| | | GBM | 1,000 | 0.0014 |
| | | ANN | 1,000 | 0.0803 |
| | | RF | 1,000 | 0.0109 |
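The accuracy measure in the table above, a median of per-fold Pearson correlations over 10 cross-validation folds, can be illustrated with a small stdlib-only sketch. It is not the authors' pipeline: the helper names (`pearson`, `k_fold_indices`, `median_cv_accuracy`) and the synthetic data are assumptions, and the sketch takes the predictions as given, assuming each one came from a model that did not see that sample during training.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k roughly equal folds."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def median_cv_accuracy(observed, predicted, k=10, seed=0):
    """Median over folds of the Pearson correlation of observed vs. predicted."""
    folds = k_fold_indices(len(observed), k, seed)
    per_fold = [
        pearson([observed[i] for i in fold], [predicted[i] for i in fold])
        for fold in folds
    ]
    return statistics.median(per_fold)


# Synthetic data only: "predicted" is the observed value plus Gaussian noise.
rng = random.Random(1)
observed = [rng.gauss(0, 1) for _ in range(200)]
predicted = [v + rng.gauss(0, 0.5) for v in observed]
accuracy = median_cv_accuracy(observed, predicted)
print(round(accuracy, 3))
```

Reporting the median rather than the mean makes the summary less sensitive to a single fold with an unusually poor correlation, which matters when folds are as small as they are in a 191-cultivar panel.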
Figure 3. Boxplots for wheat (A) grain yield LoN/NoF, (B) grain yield HiN/HiF, and (C) grain yield HiN/NoF predictions of the Triticum aestivum 191 commercial cultivars obtained in 10-fold cross-validation (CV) sets, using Pearson's correlations, with rrBLUP, LASSO, GBM, ANN, and RF, with 100 and 1,000 SNP subsets selected with different filtering methods and the entire SNP data set (colored in gray). Filtering methods: principal component analysis 100 PCA (red), 1,000 RF (green), 1,000 RS (blue), and 14,718 (gray, no FS using the total number of markers after QC).
Processing time for the filter methods, principal component analysis (PCA)-based data reduction and random forest (RF)-based FS, combined with linear and nonlinear learners: ridge regression best linear unbiased prediction (rrBLUP), least absolute shrinkage and selection operator (LASSO), gradient boosting machines (GBM), artificial neural networks (ANN), and RF.
| Trait | FS method | Predictors | Processing time (speed-up relative to no FS) |
|---|---|---|---|
| Hybrid yield | - | rrBLUP, LASSO, GBM, ANN, RF | 1,200 |
| | PCA | rrBLUP, LASSO, GBM, ANN, RF | 6.7 (200×) |
| | RF | rrBLUP, LASSO, GBM, ANN, RF | 139.7 (10×) |
| | random | rrBLUP, LASSO, GBM, ANN, RF | 39.3 (30×) |
| Days to flower | - | rrBLUP, LASSO, GBM, ANN, RF | 749.9 |
| | PCA | rrBLUP, LASSO, GBM, ANN, RF | 7.4 (100×) |
| | RF | rrBLUP, LASSO, GBM, ANN, RF | 135.7 (6×) |
| | random | rrBLUP, LASSO, GBM, ANN, RF | 78.7 (10×) |
| Grain yield LoN/NoF | - | rrBLUP, LASSO, GBM, ANN, RF | 103.2 |
| | PCA | rrBLUP, LASSO, GBM, ANN, RF | 3.1 (33×) |
| | RF | rrBLUP, LASSO, GBM, ANN, RF | 15.2 (7×) |
| | random | rrBLUP, LASSO, GBM, ANN, RF | 19.6 (5×) |
| Grain yield HiN/HiF | - | rrBLUP, LASSO, GBM, ANN, RF | 97.5 |
| | PCA | rrBLUP, LASSO, GBM, ANN, RF | 3.2 (32×) |
| | RF | rrBLUP, LASSO, GBM, ANN, RF | 15.3 (6×) |
| | random | rrBLUP, LASSO, GBM, ANN, RF | 19.7 (5×) |
| Grain yield HiN/NoF | - | rrBLUP, LASSO, GBM, ANN, RF | 97.4 |
| | PCA | rrBLUP, LASSO, GBM, ANN, RF | 3.1 (31×) |
| | RF | rrBLUP, LASSO, GBM, ANN, RF | 15.2 (6×) |
| | random | rrBLUP, LASSO, GBM, ANN, RF | 21.2 (5×) |
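The factors in parentheses in the table above appear to be the processing time without FS divided by the time with FS, rounded. That reading is an inference from the numbers rather than something the table states, and the short check below only reproduces the arithmetic for the two B. napus traits. The dictionary names are hypothetical, and the times are copied from the table in whatever unit the source used.

```python
# Processing times copied from the table (unit as reported in the source).
baseline_time = {"Hybrid yield": 1200.0, "Days to flower": 749.9}
filtered_time = {
    ("Hybrid yield", "PCA"): 6.7,
    ("Hybrid yield", "RF"): 139.7,
    ("Hybrid yield", "random"): 39.3,
    ("Days to flower", "PCA"): 7.4,
    ("Days to flower", "RF"): 135.7,
    ("Days to flower", "random"): 78.7,
}

# Assumed definition: speed-up = time without FS / time with FS.
speedup = {
    key: baseline_time[key[0]] / time for key, time in filtered_time.items()
}

for (trait, method), factor in speedup.items():
    print(f"{trait:15s} {method:7s} {factor:6.1f}x")
```

Under this definition the PCA filter on hybrid yield gives roughly 179, consistent with the reported 200× once rounded coarsely, and the RF filter gives roughly 8.6 against the reported 10×.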