L Ornella1, P Pérez2, E Tapia1, J M González-Camacho2, J Burgueño3, X Zhang3, S Singh3, F S Vicente3, D Bonnett3, S Dreisigacker3, R Singh3, N Long4, J Crossa3.
Abstract
Pearson's correlation coefficient (ρ) is the most commonly reported metric of the success of prediction in genomic selection (GS). However, in real breeding ρ may not be very useful for assessing the quality of the regression in the tails of the distribution, where individuals are chosen for selection. This research used 14 maize and 16 wheat data sets with different trait-environment combinations. Six different models were evaluated by means of a cross-validation scheme (50 random partitions each, with 90% of the individuals in the training set and 10% in the testing set). The predictive accuracy of these algorithms for selecting individuals belonging to the best α=10, 15, 20, 25, 30, 35, 40% of the distribution was estimated using Cohen's kappa coefficient (κ) and an ad hoc measure, which we call relative efficiency (RE), which indicates the expected genetic gain due to selection when individuals are selected based on GS exclusively. We put special emphasis on the analysis for α=15%, because it is a percentile commonly used in plant breeding programmes (for example, at CIMMYT). We also used ρ as a criterion for overall success. The algorithms used were: Bayesian LASSO (BL), Ridge Regression (RR), Reproducing Kernel Hilbert Spaces (RKHS), Random Forest Regression (RFR), and Support Vector Regression (SVR) with linear (lin) and Gaussian (rbf) kernels. The performance of regression methods for selecting the best individuals was compared with that of three supervised classification algorithms: Random Forest Classification (RFC) and Support Vector Classification (SVC) with linear (lin) and Gaussian (rbf) kernels. Classification methods were evaluated using the same cross-validation scheme but with the response vector of the original training sets dichotomised using a given threshold.
For α=15%, SVC-lin presented the highest κ coefficients in 13 of the 14 maize data sets, with best values ranging from 0.131 to 0.722 (statistically significant in 9 data sets) and the best RE in the same 13 data sets, with values ranging from 0.393 to 0.948 (statistically significant in 12 data sets). RR produced the best mean for both κ and RE in one data set (0.148 and 0.381, respectively). Regarding the wheat data sets, SVC-lin presented the best κ in 12 of the 16 data sets, with outcomes ranging from 0.280 to 0.580 (statistically significant in 4 data sets) and the best RE in 9 data sets, ranging from 0.484 to 0.821 (statistically significant in 5 data sets). SVC-rbf (0.235), RR (0.265) and RKHS (0.422) gave the best κ in one data set each, while RKHS and BL tied for the last one (0.234). Finally, BL presented the best RE in two data sets (0.738 and 0.750), RFR (0.636) and SVC-rbf (0.617) in one each, and RKHS in the remaining three (0.502, 0.458 and 0.586). The difference between the performance of SVC-lin and that of the rest of the models was less pronounced at higher percentiles of the distribution. The behaviour of regression and classification algorithms varied markedly when selection was done at different thresholds; that is, κ and RE for each algorithm depended strongly on the selection percentile. Based on these results, we propose classification methods as a promising alternative for GS in plant breeding.
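The selection-oriented evaluation described above can be made concrete with a short sketch. The helper names (`select_best`, `relative_efficiency`) are illustrative, and the exact form of RE is an assumption based on the abstract's wording: here RE is taken as the realised genetic gain from selecting the top α fraction by prediction, relative to the gain from selecting the true top α fraction.

```python
import numpy as np

def select_best(y, alpha=0.15, larger_is_better=True):
    """Indices of the best alpha fraction of individuals by trait value y."""
    k = max(1, int(round(alpha * len(y))))
    order = np.argsort(y)                       # ascending
    return order[-k:] if larger_is_better else order[:k]

def relative_efficiency(y_true, y_pred, alpha=0.15):
    """Assumed form of the paper's ad hoc RE: gain from selecting on
    genomic predictions exclusively, relative to the maximum gain from
    selecting on the observed phenotypes themselves."""
    sel_pred = select_best(y_pred, alpha)       # chosen based on GS
    sel_true = select_best(y_true, alpha)       # ideal choice
    mu = y_true.mean()
    gain_pred = y_true[sel_pred].mean() - mu    # realised genetic gain
    gain_best = y_true[sel_true].mean() - mu    # maximum attainable gain
    return gain_pred / gain_best
```

For the classification methods, the training response would be dichotomised at the same percentile, e.g. `y >= np.quantile(y, 1 - alpha)`, before fitting.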
Year: 2014 PMID: 24424163 PMCID: PMC4023444 DOI: 10.1038/hdy.2013.144
Source DB: PubMed Journal: Heredity (Edinb) ISSN: 0018-067X Impact factor: 3.821
Information on the maize and wheat data sets used in this study
| Data set | Crop | Trait—environment | No. of individuals | Markers |
| GY-WW | Maize | Yield—well watered | 242 | 46 374 SNPs |
| GY-SS | Maize | Yield—drought stressed | 242 | 46 374 SNPs |
| MLF-WW | Maize | Male flowering time—well watered | 258 | 46 374 SNPs |
| MLF-SS | Maize | Male flowering time—drought stressed | 258 | 46 374 SNPs |
| FLF-WW | Maize | Female flowering time—well watered | 258 | 46 374 SNPs |
| FLF-SS | Maize | Female flowering time—drought stressed | 258 | 46 374 SNPs |
| ASI-WW | Maize | Anthesis silking interval—well watered | 258 | 46 374 SNPs |
| ASI-SS | Maize | Anthesis silking interval—drought stressed | 258 | 46 374 SNPs |
| GLS-1 | Maize | Grey leaf spot | 272 | 46 374 SNPs |
| GLS-2 | Maize | Grey leaf spot | 280 | 46 374 SNPs |
| GLS-3 | Maize | Grey leaf spot | 278 | 46 374 SNPs |
| GLS-4 | Maize | Grey leaf spot | 261 | 46 374 SNPs |
| GLS-5 | Maize | Grey leaf spot | 279 | 46 374 SNPs |
| GLS-6 | Maize | Grey leaf spot | 281 | 46 374 SNPs |
| KBIRD-Srm | Wheat | Stem rust—main season | 90 | 1355 DArT |
| KBIRD-Sro | Wheat | Stem rust—off season | 90 | 1355 DArT |
| KNYANGUMI-Srm | Wheat | Stem rust—main season | 176 | 1355 DArT |
| KNYANGUMI-Sro | Wheat | Stem rust—off season | 191 | 1355 DArT |
| F6PAVON-Srm | Wheat | Stem rust—main season | 176 | 1355 DArT |
| F6PAVON-Sro | Wheat | Stem rust—off season | 180 | 1355 DArT |
| JUCHI-Ken | Wheat | Yellow rust—Kenya | 176 | 1355 DArT |
| KBIRD-Ken | Wheat | Yellow rust—Kenya | 191 | 1355 DArT |
| KBIRD-tol | Wheat | Yellow rust—Mexico | 176 | 1355 DArT |
| KNYANGUMI-tol | Wheat | Yellow rust—Mexico | 180 | 1355 DArT |
| F6PAVON-Ken | Wheat | Yellow rust—Kenya | 147 | 1355 DArT |
| F6PAVON-tol | Wheat | Yellow rust—Mexico | 180 | 1355 DArT |
| GY-1 | Wheat | Yield—E1, low rainfall and irrigated | 599 | 1279 DArT |
| GY-2 | Wheat | Yield—high rainfall | 599 | 1279 DArT |
| GY-3 | Wheat | Yield—low rainfall and high temperature | 599 | 1279 DArT |
| GY-4 | Wheat | Yield—low humidity and hot | 599 | 1279 DArT |
Abbreviations: DArT, Diversity Arrays Technology marker; SNP, single-nucleotide polymorphism.
Figure 1 Confusion matrix for a two-class problem. o_A and o_B are the numbers of observed cases for classes A and B, respectively; m_A = n_AA + n_BA and m_B = n_AB + n_BB are the numbers of predicted cases for classes A and B, respectively; n_AA and n_BB are the numbers of individuals correctly predicted for each class; n_AB is the number of individuals in class A predicted as B, whereas n_BA is the number of individuals in class B predicted as A; n is the total number of cases in the experiment.
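Given the counts of the two-class confusion matrix in Figure 1, Cohen's κ compares observed agreement with the agreement expected by chance. A minimal sketch (function name hypothetical):

```python
def cohen_kappa(n_AA, n_AB, n_BA, n_BB):
    """Cohen's kappa for the two-class confusion matrix of Figure 1.

    n_AA, n_BB: correctly predicted counts for classes A and B;
    n_AB: class A predicted as B; n_BA: class B predicted as A.
    """
    n = n_AA + n_AB + n_BA + n_BB            # total cases
    o_A, o_B = n_AA + n_AB, n_BA + n_BB      # observed class sizes
    m_A, m_B = n_AA + n_BA, n_AB + n_BB      # predicted class sizes
    p_o = (n_AA + n_BB) / n                  # observed agreement
    p_e = (o_A * m_A + o_B * m_B) / n**2     # chance agreement
    return (p_o - p_e) / (1 - p_e)
```

κ=1 indicates perfect agreement and κ=0 agreement no better than chance, which is why it is a stricter criterion than raw accuracy when the selected class is a small fraction (α) of the population.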
Cohen's kappa coefficient for 6 regression and 3 classification methods for genomic selection applied to 14 maize data sets and across trait–environment combinations when selecting the best 15% of individuals
| GLS-1 | 0.249 | 0.190 | 0.243 | 0.249 | 0.196 | 0.196 | 0.243 | 0.272 | |
| GLS-2 | 0.329 | 0.329 | 0.318 | 0.323 | 0.364 | 0.376 | 0.329 | 0.318 | |
| GLS-3 | 0.399 | 0.446 | 0.411 | 0.393 | 0.417 | 0.434 | 0.405 | 0.323 | |
| GLS-4 | 0.368 | 0.338 | 0.356 | 0.344 | 0.380 | 0.338 | 0.315 | 0.250 | |
| GLS-5 | 0.102 | 0.084 | 0.084 | 0.143 | 0.154 | 0.154 | 0.102 | 0.084 | |
| GLS-6 | 0.178 | 0.154 | 0.166 | 0.137 | 0.148 | 0.160 | 0.125 | 0.148 | |
| GY-SS | 0.202 | 0.208 | 0.244 | 0.232 | 0.208 | 0.256 | 0.238 | 0.190 | |
| GY-WW | 0.370 | 0.376 | 0.382 | 0.394 | 0.364 | 0.340 | 0.334 | 0.394 | |
| MLF-WW | 0.580 | 0.592 | 0.586 | 0.557 | 0.580 | 0.586 | 0.575 | 0.468 | |
| MLF-SS | 0.580 | 0.586 | 0.610 | 0.545 | 0.580 | 0.610 | 0.569 | 0.510 | |
| FLF-WW | 0.557 | 0.586 | 0.580 | 0.539 | 0.580 | 0.586 | 0.610 | 0.445 | |
| FLF-SS | 0.569 | 0.610 | 0.616 | 0.504 | 0.610 | 0.598 | 0.480 | 0.421 | |
| ASI-WW | 0.072 | 0.066 | 0.078 | 0.001 | 0.066 | 0.078 | 0.096 | 0.078 | |
| ASI-SS | 0.049 | 0.090 | 0.073 | 0.007 | 0.060 | 0.107 | 0.120 | 0.155 | |
Abbreviations: BL, Bayesian LASSO; NS, not significant; RFR, Random Forest Regression; RKHS, Reproducing Kernel Hilbert Spaces; RR, Ridge Regression; SVR, Support Vector Regression with radial basis function (rbf) or linear (lin) kernels; RFC, Random Forest Classification; SVC, Support Vector Classification with radial basis function (rbf) or linear (lin) kernels.
Results presented are the average of 50 random partitions (the proportion of individuals in the training-testing data sets is 9:1).
For each data set the highest value is underlined.
*, ** Differences are significant at the 0.05 and 0.01 probability levels, respectively.
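The cross-validation scheme used for all tables (50 random partitions with a 9:1 training-testing split) can be sketched as follows; the function name and seed are illustrative assumptions:

```python
import numpy as np

def random_partitions(n, n_splits=50, test_frac=0.10, seed=0):
    """Yield (train, test) index arrays: n_splits random partitions
    with test_frac of the individuals held out each time."""
    rng = np.random.default_rng(seed)
    n_test = max(1, int(round(test_frac * n)))
    for _ in range(n_splits):
        perm = rng.permutation(n)
        yield perm[n_test:], perm[:n_test]   # 90% train, 10% test
```

Each metric reported in the tables is then the mean over the 50 held-out test sets.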
Relative efficiency of 6 regression and 3 classification methods for genomic selection applied to 14 maize data sets and across trait–environment combinations when selecting the best 15% of individuals
| GLS-1 | 0.358 | 0.300 | 0.341 | 0.354 | 0.278 | 0.296 | 0.331 | 0.336 | |
| GLS-2 | 0.589 | 0.527 | 0.551 | 0.583 | 0.601 | 0.585 | 0.539 | 0.506 | |
| GLS-3 | 0.702 | 0.730 | 0.706 | 0.700 | 0.707 | 0.721 | 0.693 | 0.600 | |
| GLS-4 | 0.611 | 0.570 | 0.600 | 0.602 | 0.580 | 0.534 | 0.473 | 0.572 | |
| GLS-5 | 0.283 | 0.287 | 0.260 | 0.325 | 0.331 | 0.354 | 0.264 | 0.237 | |
| GLS-6 | 0.423 | 0.376 | 0.424 | 0.382 | 0.393 | 0.315 | 0.367 | 0.381 | |
| GY-SS | 0.356 | 0.286 | 0.346 | 0.432 | 0.328 | 0.348 | 0.415 | 0.332 | |
| GY-WW | 0.591 | 0.585 | 0.603 | 0.616 | 0.585 | 0.561 | 0.534 | 0.588 | |
| MLF-WW | 0.848 | 0.888 | 0.863 | 0.800 | 0.847 | 0.886 | 0.856 | 0.687 | |
| MLF-SS | 0.847 | 0.871 | 0.890 | 0.803 | 0.856 | 0.882 | 0.803 | 0.741 | |
| FLF-WW | 0.822 | 0.877 | 0.856 | 0.775 | 0.845 | 0.878 | 0.867 | 0.692 | |
| FLF-SS | 0.832 | 0.885 | 0.874 | 0.757 | 0.866 | 0.872 | 0.738 | 0.673 | |
| ASI-WW | 0.182 | 0.173 | 0.312 | 0.103 | 0.210 | 0.263 | 0.226 | 0.238 | |
| ASI-SS | 0.134 | 0.165 | 0.124 | 0.109 | 0.143 | 0.116 | 0.262 | 0.304 | |
Abbreviations: BL, Bayesian LASSO; NS, not significant; RFR, Random Forest Regression; RKHS, Reproducing Kernel Hilbert Spaces; RR, Ridge Regression; SVR, Support Vector Regression with radial basis function (rbf) or linear (lin) kernels; RFC, Random Forest Classification; SVC, Support Vector Classification with radial basis function (rbf) or linear (lin) kernels.
Results presented are the average of 50 random partitions (the proportion of individuals in the training-testing data sets is 9:1).
For each data set the highest value is underlined.
*, ** Differences are significant at the 0.05 and 0.01 probability levels, respectively.
Cohen's kappa coefficient of 6 regression and 3 classification methods for genomic selection applied to 16 wheat data sets and across trait–environment combinations when selecting the best 15% of individuals
| KBIRD-Srm | 0.100 | 0.303 | 0.280 | 0.145 | 0.258 | 0.213 | 0.145 | 0.235 | |
| KBIRD-Sro | 0.322 | 0.392 | 0.344 | 0.425 | 0.344 | 0.322 | 0.235 | 0.168 | |
| KNYANGUMI-Srm | 0.168 | 0.184 | 0.208 | 0.160 | 0.184 | 0.224 | 0.056 | 0.232 | |
| KNYANGUMI-Sro | 0.438 | 0.406 | 0.485 | 0.414 | 0.398 | 0.470 | 0.327 | ||
| F6PAVON-Srm | 0.256 | 0.264 | 0.248 | 0.296 | 0.240 | 0.176 | 0.360 | 0.360 | |
| F6PAVON-Sro | 0.288 | 0.384 | 0.352 | 0.320 | 0.400 | 0.312 | 0.320 | 0.264 | |
| JUCHI-Ken | 0.258 | 0.258 | 0.258 | 0.078 | 0.168 | 0.213 | 0.145 | 0.235 | |
| KBIRD-Ken | 0.033 | 0.078 | 0.033 | 0.078 | 0.033 | 0.010 | 0.078 | 0.213 | |
| KBIRD-tol | 0.370 | 0.348 | 0.325 | 0.303 | 0.235 | 0.258 | 0.280 | 0.303 | |
| KNYANGUMI-tol | 0.034 | 0.058 | 0.058 | 0.066 | 0.050 | 0.018 | 0.074 | 0.288 | |
| F6PAVON-Ken | 0.158 | 0.158 | 0.112 | 0.146 | 0.181 | 0.054 | 0.181 | 0.287 | |
| F6PAVON-tol | 0.384 | 0.320 | 0.312 | 0.368 | 0.296 | 0.240 | 0.360 | 0.392 | |
| GY-1 | 0.229 | 0.142 | 0.135 | 0.234 | 0.210 | 0.119 | 0.231 | 0.271 | |
| GY-2 | 0.250 | 0.239 | 0.205 | 0.239 | 0.142 | 0.137 | 0.244 | 0.179 | |
| GY-3 | 0.229 | 0.158 | 0.224 | 0.166 | 0.161 | 0.216 | 0.150 | ||
| GY-4 | 0.346 | 0.344 | 0.401 | 0.383 | 0.208 | 0.286 | 0.346 | 0.297 | |
Abbreviations: BL, Bayesian LASSO; NS, not significant; RFR, Random Forest Regression; RKHS, Reproducing Kernel Hilbert Spaces; RR, Ridge Regression; SVR, Support Vector Regression with radial basis function (rbf) or linear (lin) kernels; RFC, Random Forest Classification; SVC, Support Vector Classification with radial basis function (rbf) or linear (lin) kernels.
Results presented are the average of 50 random partitions (the proportion of individuals in the training-testing data sets is 9:1).
For each data set the highest value is underlined.
*, ** Differences are significant at the 0.05 and 0.01 probability levels, respectively.
Relative efficiency of 6 regression and 3 classification methods for genomic selection applied to 16 wheat data sets and across trait–environment combinations when selecting the best 15% of individuals
| KBIRD-Srm | 0.284 | 0.600 | 0.549 | 0.558 | 0.517 | 0.544 | 0.326 | 0.118 | |
| KBIRD-Sro | 0.623 | 0.758 | 0.740 | 0.810 | 0.690 | 0.702 | 0.702 | 0.470 | |
| KNYANGUMI-Srm | 0.506 | 0.596 | 0.588 | 0.667 | 0.601 | 0.640 | 0.305 | 0.504 | |
| KNYANGUMI-Sro | 0.654 | 0.632 | 0.676 | 0.652 | 0.614 | 0.672 | 0.483 | 0.732 | |
| F6PAVON-Srm | 0.580 | 0.607 | 0.589 | 0.636 | 0.564 | 0.498 | 0.628 | 0.570 | |
| F6PAVON-Sro | 0.612 | 0.690 | 0.685 | 0.717 | 0.637 | 0.689 | 0.488 | 0.736 | |
| JUCHI-Ken | 0.496 | 0.500 | 0.497 | 0.170 | 0.426 | 0.475 | 0.244 | 0.255 | |
| KBIRD-Ken | 0.078 | 0.265 | 0.226 | 0.390 | 0.357 | 0.146 | 0.158 | 0.189 | |
| KBIRD-tol | 0.463 | 0.515 | 0.483 | 0.440 | 0.483 | 0.507 | 0.443 | 0.619 | |
| KNYANGUMI-tol | 0.086 | 0.204 | 0.204 | 0.315 | 0.230 | 0.119 | 0.157 | 0.343 | |
| F6PAVON-Ken | 0.317 | 0.209 | 0.177 | 0.267 | 0.285 | 0.175 | 0.304 | 0.264 | |
| F6PAVON-tol | 0.577 | 0.566 | 0.564 | 0.547 | 0.549 | 0.499 | 0.509 | 0.477 | |
| GY-1 | 0.530 | 0.380 | 0.377 | 0.519 | 0.491 | 0.215 | 0.533 | 0.479 | |
| GY-2 | 0.459 | 0.497 | 0.391 | 0.482 | 0.335 | 0.401 | 0.446 | 0.324 | |
| GY-3 | 0.449 | 0.441 | 0.366 | 0.448 | 0.266 | 0.367 | 0.226 | 0.338 | |
| GY-4 | 0.480 | 0.472 | 0.558 | 0.540 | 0.354 | 0.451 | 0.471 | 0.363 | |
Abbreviations: BL, Bayesian LASSO; NS, not significant; RFR, Random Forest Regression; RKHS, Reproducing Kernel Hilbert Spaces; RR, Ridge Regression; SVR, Support Vector Regression with radial basis function (rbf) or linear (lin) kernels; RFC, Random Forest Classification; SVC, Support Vector Classification with radial basis function (rbf) or linear (lin) kernels.
Results presented are the average of 50 random partitions (the proportion of individuals in the training-testing data sets is 9:1).
For each data set the highest value is underlined.
*, ** Differences are significant at the 0.05 and 0.01 probability levels, respectively.
Figure 2Scatter plot of Pearson's correlation coefficient vs Cohen's kappa coefficient (a) and Pearson's correlation vs RE (b) for the 6 regression methods evaluated on 14 maize data sets using a percentile value of 15%. ASI data sets were excluded from the regression analysis (ovals).
Figure 3Scatter plot of Pearson's correlation coefficient vs Cohen's kappa coefficient (a) and Pearson's correlation coefficient vs RE (b) for the 6 regression methods evaluated on 16 wheat data sets using a percentile value of 15%.
Figure 4Bar plot of Cohen's kappa coefficient (a) and RE (b) for the best regression method (green) and the best classification method (grey) evaluated on 14 maize data sets using a percentile α=15% (light colour) and 30% (dark colour).
Figure 5Bar plot of Cohen's kappa coefficient (a) and RE (b) for the best regression method (green) and the best classification method (grey) evaluated on 16 wheat data sets using a percentile α=15% (light colour) and 30% (dark colour).