Literature DB >> 33748861

Optimal breeding-value prediction using a sparse selection index.

Marco Lopez-Cruz¹, Gustavo de Los Campos^2,3,4.

Abstract

Genomic prediction uses DNA sequences and phenotypes to predict genetic values. In homogeneous populations, theory indicates that the accuracy of genomic prediction increases with sample size. However, differences in allele frequencies and linkage disequilibrium patterns can lead to heterogeneity in SNP effects. In this context, calibrating genomic predictions using a large, potentially heterogeneous, training data set may not lead to optimal prediction accuracy. Some studies tried to address this sample size/homogeneity trade-off using training set optimization algorithms; however, this approach assumes that a single training data set is optimum for all individuals in the prediction set. Here, we propose an approach that identifies, for each individual in the prediction set, a subset from the training data (i.e., a set of support points) from which predictions are derived. The methodology that we propose is a sparse selection index (SSI) that integrates selection index methodology with sparsity-inducing techniques commonly used for high-dimensional regression. The sparsity of the resulting index is controlled by a regularization parameter (λ); the G-Best Linear Unbiased Predictor (G-BLUP) (the prediction method most commonly used in plant and animal breeding) appears as a special case which happens when λ = 0. In this study, we present the methodology and demonstrate (using two wheat data sets with phenotypes collected in 10 different environments) that the SSI can achieve significant (anywhere between 5 and 10%) gains in prediction accuracy relative to the G-BLUP.

Entities: Chemical

Keywords: GenPred; genomic prediction; penalized regression; prediction accuracy; selection index; shared data resources

Mesh：

Year: 2021 PMID： 33748861 PMCID： PMC8128408 DOI： 10.1093/genetics/iyab030

Source DB: PubMed Journal: Genetics ISSN： 0016-6731 Impact factor: 4.562

Introduction

Selection decisions in plant and animal breeding rely on the predicted genetic merit of selection candidates. Early prediction methods were based either on phenotypes measured in the candidates of selection or on progeny testing (e.g., Lush 1935). These methods were later extended into selection indices (Smith 1936; Hazel 1943) that can use information from various sources of correlated data, including secondary traits measured on the same individual, measurements of the same phenotype collected from relatives, and combinations thereof (Lush 1948). Henderson (1950) further extended the methodology by developing mixed-models that can include fixed and random effects. The Best Linear Unbiased Predictor (BLUP) predicts breeding values by borrowing (i.e., averaging) information from multiple sources of correlated data. Pedigrees often trace back a limited number of generations and often define “families.” In this context, borrowing of information spans within the scope of each family. However, this is not the case in genomic-BLUP (G-BLUP; VanRaden 2008) because genomic relationships are not sparse as pedigree-derived relationships. In the last two decades, genomic models (aka, genomic selection, GS; Meuwissen ) have become the method of choice for breeding value prediction. GS models predict breeding values using genome-wide markers and rely in the multi-locus linkage disequilibrium (LD) between SNPs and quantitative trait loci (QTL). However, it is also well-established that family relationships and population structure contribute to the accuracy of genomic prediction (Habier ). In a Genomic relationship matrix (VanRaden 2007) all individuals are related to some extent; therefore, every training data point contributes to the prediction of each individual in the testing set. Genomic prediction models were originally developed with reference to a homogeneous population in which marker effects are assumed to be the same across subgroups of the data. However, several factors, including imperfect LD between markers and QTL and nonadditive effects coupled with population structure and admixture, can make marker effects vary across subgroups in the sample (Pritchard and Donnelly 2001; de los Campos ). All these factors can make the genomic relationships derived from markers inaccurate predictors of the genomic relationships realized at causal loci (de los Campos ). Therefore, the accuracy of G-BLUP may be suboptimal when the training data consists of heterogeneous groups (e.g., multiple families or multiple strains or breeds) or even multi-generation data in which LD patterns may vary across distant generations. Several authors have recognized the need to model heterogeneous SNP effects in the context of multi-breed (e.g., Hayes ) and structured (e.g., de los Campos ) data. Most of the existing methods model group-specific effects using either multivariate Gaussian models (e.g., Olson ; Schulz-Streeck ; Lehermeier ) or interaction models (e.g., de los Campos ; Veturi ). However, these approaches can be difficult to use in the presence of cryptic genetic-heterogeneity patterns where no clear groups can be discerned. Another line of research seeks to identify an optimal training set for a given prediction set. These optimal training sets often consist of individuals that are closely related to the individuals in the prediction set, i.e., the candidates of selection (Rincent ; Akdemir ; Isidro ; Pszczola and Calus 2016; Akdemir and Isidro-Sanchez 2019). However, these methods assume that a single training set is optimal for all the individuals in the prediction set which is not necessarily the case. Therefore, in this study, we focus on developing a genomic prediction method that identifies, for each individual in a prediction set an optimal training set (i.e., a set of support points). Our approach achieves this goal by integrating sparsity (by adding an L1-penalty) into a selection index (SI) problem, we refer to the method as a sparse selection index (SSI).

Materials and Methods

A standard selection index () predicts the breeding value of an individual () using a linear combination of the training phenotypes (): . Here, phenotypes are assumed to be centered and corrected by nongenetic effects (e.g., experiment and block effects), and is a vector of weights that are obtained as the solution to the following optimization problem: The right-hand side of the above problem expands to . Assuming that genetic () and nongenetic effects () are independent, each with mean zero and (co)variance matrices and , we have that , where is a genetic variance parameter, is the phenotypic (co)variance matrix of the training phenotypes, and is a vector containing the genetic relationships between the subject of the prediction set and each of the subjects in the training data. Because does not depend on , the aforementioned optimization problem can be reduced to where is the ratio of the error to the genetic variance, which can be expressed in terms of the heritability, . The solution to the above problem can be shown to be The vector can be shown to be the row of the Hat matrix of the BLUPs of the genetic values of the individuals in the prediction set (see Supplementary File S1 in the Supplementary Material for a proof). Therefore, depending on whether is a pedigree- or genomic-derived relationship matrix, the standard SI is equivalent to a pedigree- (Henderson 1963) or genomic-BLUP, respectively. When is a pedigree-based relationship matrix, the off-diagonal entries corresponding to pairs of subjects not connected through the pedigree are equal to zero. In that case, some of the entries of can also be equal to zero which implies that the corresponding predicted breeding value () draws information from a subset of the training data. However, when is a genomic relationship typically none of the off-diagonals are equal to zero; therefore, none of the entries of will be exactly equal to zero. This implies that all the observations in the training set contribute to some extent to predict the breeding values of all the individuals in the prediction set.

Sparse selection index (SSI) Methodology

As noted earlier, there are several reasons (e.g., imperfect LD, effect heterogeneity) why borrowing of information between distantly related individuals may have a detrimental effect on prediction accuracy. Therefore, to achieve sparsity (and possibly differential shrinkage on the ) we considered adding an L1-penalty to the objective function in Equation (1); therefore, The above optimization problem does not have a closed-form solution; however, solutions can be obtained using a Coordinate Descent algorithm very similar to the one used to solve LASSO problems (see Lopez-Cruz ). Specifically, in Equation (3), the relationships between the prediction point and the training genotypes () enters in the optimization problem in the same way the right-hand-side term enters in least-square and LASSO problems for linear models of the form . On the other hand, the term , which accounts for relationships among training genotypes, enters in Equation (3) in the same way that enters in least-square and LASSO problems. The regularization parameter controls how sparse will be; this parameter is also expected to affect the accuracy of the index. Therefore, an optimal value of can be found by maximizing the accuracy of the resulting index.

Data

We used two wheat breeding data sets to evaluate and to compare the prediction performance of standard and sparse selection indices. The first data set (Wheat-large) is a multi-generation wheat breeding data set of a very large sample size (). The second one is (Wheat-small) is a small, structured data (see Supplementary Figure S1). The Wheat-large data set is from CIMMYT’s Global Wheat Program and it includes phenotype data from 58,798 wheat lines that were evaluated during 5 years (2009–2013) at the CIMMYT’s experimental station in Ciudad Obregon, Mexico. Lines were evaluated under six environmental conditions (B2I, B5I, MEL, LHT, DRB, and EHT) representing a combination of planting system (bed vs. flat, the later referred to as melgas), number of irrigations (2, 5 irrigations or drip irrigation), and sowing date (optimum, late or early planting). Each year, grain yield trials were established in an -lattice design with three replicates into incomplete blocks. Moisture-standardized grain yield () was measured at each plot. We used mixed-effects models with a (“fixed”) intercept and the random effects of the trial, block (within trial), and replicate (within trial) to derive least-square means by line and environmental condition. Separate mixed models were fitted to data from each of the simulated environments. The average grain yield in this data set varied from 2.72 to 7.12 (see Supplementary Figure S2B for boxplots of grain yield) and the heritability of single-plot records varied between 0.23 and 0.57 (see Supplementary Table S1 for a summary of the data). Only a subset of 29,484 genotypes was genotyped using a GBS (Genotyping-by-sequencing) technology that yielded 42,706 SNPs. We removed SNPs with more than 70% of missing values and those with minor allele frequency lower than 5%. After applying these filters, we retained 9,045 SNPs. The remaining genotypes that were missing were imputed with the sample mean of genotypes at the corresponding loci. The data set has been previously described and analyzed by Pérez-Rodríguez . The Wheat-small data set is also from CIMMYT’s Global Wheat Program and it is comprised of grain yield and genotype data for 599 historical inbred lines derived along 25 years. Lines were evaluated in the Elite Spring Wheat Yield Trials (ESWYT) that were grouped into four different mega-environments (ME1, …, ME4). The available phenotypic values are least-square means from two replicates. The average grain yield in this data set ranged from 3.23 to 5.14 (see Supplementary Figure S2A for boxplots of grain yield data) with heritability estimates for the least-square means ranging between 0.43 and 0.50 (see Supplementary Table S2). Each of the lines was genotyped for 1,279 diversity array technology (DArT) markers. The data set is available with the BGLR R-package (Perez and de los Campos 2014) and has been described and analyzed by previous authors (e.g., de los Campos ; Crossa ).

Analyses

For each data set, we computed a genomic relationship matrix using (centered and standardized) marker information, , as , where is the number of markers and is the matrix of centered and standardized markers obtained by subtracting from each marker entry, , the mean of each column, , followed by scaling by the standard deviation of the column, . The resulting matrix has an average of the diagonal elements equal to 1. To quantify the prediction accuracy of each of the indices, we divided each data set into training (trn) and testing (tst) sets by randomly assigning 30% (70%) of the data points to testing (training). Predictions were derived by first using Equation (2), (for the standard SI) and Equation (3), (for the SSI), with representing the genomic matrix of the training data points (i.e., with dimensions , where ), and being the vector containing the genomic relationships between the data-point of the testing set, with each of the individuals assigned to the training set (i.e., the dimensions of are ). This was repeated for each individual in the testing set (, where ). Subsequently, predictions for each individual were obtained using (for the standard SI) and (for the SSI) where is a vector with the adjusted-centered phenotypes of the training set. The implementation of the SI requires heritability estimates. We derived those by fitting a G-BLUP model of the form with and . Separate models were fitted to grain yield within each environment in each data set within the training set. We then used the variance parameters estimates to derive for grain yield. Prediction accuracy () was measured as the correlation between the phenotype and the index, divided by the square-root of the heritability of the trait (grain yield), (Dekkers 2007). We applied all methods to the same training-testing partitions (trn–tst partitions) and report, from these analyses, the average prediction accuracy and the proportion of times that one method was better than the other. For the SSI, we estimated the accuracy over a grid of values of the regularization parameter ( where . Here is the minimum value of that yields an SSI with no active predictors (i.e., with all coefficients equal to zero), and gives the weights of the standard SI. Following Friedman , we used a grid of values evenly spaced in the logarithm scale with a total of 100 values. Thus, for each value of in the grid, we had an estimate of the resulting accuracy of the SSI. This was used to profile the accuracy as a function of the regularization parameter and also to choose an optimal value of . To determine an optimal value of we implemented a calibration analysis using data from the training data only. Specifically, for each training set, we conducted an internal cross-validation (CV) as follows: (i) The training data set was partitioned into k subsets. (ii) SSIs were derived over a grid of values of using data from k-1 folds for training and the data in the fold as testing (i.e., for estimation of accuracy, see the previous paragraph). (iii) The resulting curves profiling accuracy () by values of were used to identify the value of () that maximized accuracy. (iv) Finally, we used all the data from the training set to derive and evaluated the accuracy of the resulting index in the left-out data from the testing set.

Software

All the analyses were performed in the R environment-language (R Core Team 2019) version 3.5. Genomic relationships were derived using the getG() function of the BGData R-package (Grueneberg and de los Campos 2019). The heritability of the trait was estimated using the rrBLUP R-package (Endelman 2011). Sparse SIs were derived using the SSI() function from the SFSI R-package that implements the Coordinate Descent algorithm described in Lopez-Cruz . The package is aided by ggplot2 (Hadley 2016) and parallel (R Core Team 2019) packages to visualize results and to speed computation. This package is available through the GitHub repository at https://github.com/MarcooLopez/SFSI. Scripts illustrating the use of this package using the Wheat-small data set are presented in the Supplementary Material, Supplementary File S2. Training set optimization via subset selection (presented in the Section Discussion) was implemented in the STPGA R-package (Akdemir ).

Data availability

Both phenotypic and marker data for the Wheat-large data set can be downloaded from CIMMYT’s repository at http://genomics.cimmyt.org/wheat_50k/PG/ (accessed March 6th, 2021). The Wheat-small data set can be downloaded from the BGLR R-package by calling “data(wheat).” Supplementary File S1 contains a proof on the equivalence between the standard SI and the BLUP. Code showing how to perform all analyses is provided in Supplementary File S2. All supplementary figures and tables are contained in Supplementary File S3. All supplementary files are available at figshare: https://doi.org/10.25386/genetics.14098952.

Results

Sparsity improves prediction accuracy

We assessed the effect of sparsity on the accuracy, by fitting the SSI for 100 values of (; the value produces the coefficients of the standard SI or G-BLUP. The results (averaged over 100 trn–tst partitions) are shown in Figure 1. The number of support points (i.e., the number of training data points contributing to the prediction) was, as expected, inversely proportional to ; therefore, to facilitate interpretation, the x-axis of Figure 1 is displayed as the average (across genotypes in the testing set) number of support points, which is more meaningful than the values. The accuracy of the G-BLUP is also shown at the rightmost side of the plot whose number of support points is equal to the size of the training data set. Intermediate values of led to sparse indices that, in most cases, achieved higher prediction accuracy than that of the G-BLUP (shaded “belly” area in Figure 1). The maximum accuracy in the environment EHT (see Figure 1A) was obtained with a penalization that leads to a sparse index with an average of 120 support points (. This predictive set of individuals represents around 8% of the total training set () available for prediction.

Figure 1

Prediction accuracy for grain yield (average across 100 trn–tst partitions) of the SSI versus the (average) number of support points of the SSIs. The G-BLUP (blue rightmost point) is a special case of an SSI when . Each panel represents one environment within data set. (A) Wheat-large data set. B2I, bed planting + 2 irrigations; B5I, bed planting + 5 irrigations; MEL, flat planting + 5 irrigations; LHT, late planting date; EHT, early planting date; DRB, bed planting + drip irrigation. (B) Wheat-small data set. ME, mega-environment. Vertical bars represent a 95% confidence interval for the average. For the small data set (Figure 1B), the same “belly” pattern can be observed in all environments, except for ME2. This case shows that the SSI does not always outperform the G-BLUP; however, the SSI achieves the prediction accuracy of the G-BLUP with a smaller support set ( out of ).

Using an internal CV to achieve optimal sparsity

The results in Figure 1 suggest that one can find a value of that leads to an index with a predictive performance as least as high (and in most cases higher) as the G-BLUP. However, to obtain an unbiased estimate of the maximum accuracy that the SSI can achieve, one should not use data from the testing set to select the optimal value of . Therefore, we repeated the analyses described above, this time performing the grid search for an optimal value of by implementing 10 fivefold CVs within each training data set. This CV was used to choose an optimal value of . Then, we solved the SSI using and all the training genotypes, and evaluated the accuracy of in a testing set that was not used to choose . This was repeated for 100 trn–tst partitions. Figure 2 shows the accuracy of versus that of the G-BLUP, each point in the plot represents a trn–tst partition. In the Wheat-large data set, the optimal SSI outperformed the G-BLUP in 94% of the cases (Table 1). For this data set, the SSI offered accuracy gains ranging from 5% (in the environment B2I) to 10% (in the environments B5I and MEL) in the correlation metric. Similar patterns were observed with the Wheat-small data set. In environments ME1 and ME4 the SSI outperformed the G-BLUP in more than 80% of the trn–tst partitions (Table 1), with gains in accuracy ranging from 5% to 8%. However, in ME2 and ME3, there were no significant gains in accuracy (see Table 1).

Figure 2

Table 1

Prediction accuracy for grain yield (average across 100 partitions) achieved by sparse selection indices (SSIs) and by the G-BLUP (standard SI), by data set and environmental condition

Environment	n _tst	n _trn	Method	λcv a	n_sup b	Accuracy (SD)	P c
Wheat-large
B2I	1,120	2,612	G-BLUP	0.0000	2,612	0.617 (0.031)	0.97
B2I	1,120	2,612	SSI	0.0135	434	0.648 (0.031)	0.97
B5I	8,842	20,631	G-BLUP	0.0000	20,631	0.555 (0.010)	1.00
B5I	8,842	20,631	SSI	0.0107	1,470	0.609 (0.009)	1.00
MEL	1,321	3,082	G-BLUP	0.0000	3,082	0.600 (0.045)	0.99
MEL	1,321	3,082	SSI	0.0131	524	0.661 (0.046)	0.99
LHT	1,322	3,082	G-BLUP	0.0000	3,082	0.669 (0.024)	0.99
LHT	1,322	3,082	SSI	0.0168	380	0.709 (0.025)	0.99
DRB	1,129	2,634	G-BLUP	0.0000	2,634	0.629 (0.035)	0.98
DRB	1,129	2,634	SSI	0.0322	136	0.675 (0.037)	0.98
EHT	612	1,428	G-BLUP	0.0000	1,428	0.614 (0.049)	0.94
EHT	612	1,428	SSI	0.0301	178	0.649 (0.047)	0.94
Wheat-small
ME1	180	419	G-BLUP	0.0000	419	0.721 (0.070)	0.87
ME1	180	419	SSI	0.0413	78	0.760 (0.067)	0.87
ME2	180	419	G-BLUP	0.0000	419	0.702 (0.087)	0.41
ME2	180	419	SSI	0.0123	254	0.692 (0.085)	0.41
ME3	180	419	G-BLUP	0.0000	419	0.585 (0.101)	0.53
ME3	180	419	SSI	0.0613	84	0.586 (0.093)	0.53
ME4	180	419	G-BLUP	0.0000	419	0.663 (0.082)	0.87
ME4	180	419	SSI	0.0617	54	0.714 (0.075)	0.87

B2I, bed planting + 2 irrigations; B5I, bed planting + 5 irrigations; MEL, flat planting + 5 irrigations; LHT, late planting date; EHT, early planting date; DRB, bed planting + drip irrigation; ME, mega-environment; and , size of the testing and training data sets, respectively; SD, standard deviation across trn–tst partitions.

Optimal value of (average across partitions) estimated by cross-validating the training set.

Average number of support points in the SSIs. G-BLUP model corresponds to an SSI with and .

P: proportion of times (out of the 100 partitions) that the SSI outperformed the G-BLUP in prediction accuracy.

Prediction accuracy for grain yield of the optimal SSI versus that of the G-BLUP. Each point represents a trn–tst partitions (a total of 100 partitions were implemented), the point shape and color represent environments. (A) Wheat-large data set. B2I, bed planting + 2 irrigations; B5I, bed planting + 5 irrigations; MEL, flat planting + 5 irrigations; LHT, late planting date; EHT, early planting date; DRB, bed planting + drip irrigation. (B) Wheat-small data set. ME, mega-environment. The value of in the SSI was estimated using 10 fivefold CVs conducted within the training data. In parenthesis, by the legend, P is the proportion of times the SSI was better than the G-BLUP. Prediction accuracy for grain yield (average across 100 partitions) achieved by sparse selection indices (SSIs) and by the G-BLUP (standard SI), by data set and environmental condition B2I, bed planting + 2 irrigations; B5I, bed planting + 5 irrigations; MEL, flat planting + 5 irrigations; LHT, late planting date; EHT, early planting date; DRB, bed planting + drip irrigation; ME, mega-environment; and , size of the testing and training data sets, respectively; SD, standard deviation across trn–tst partitions. Optimal value of (average across partitions) estimated by cross-validating the training set. Average number of support points in the SSIs. G-BLUP model corresponds to an SSI with and . P: proportion of times (out of the 100 partitions) that the SSI outperformed the G-BLUP in prediction accuracy.

Sparse selection indices build subject-specific training sets

For each individual in the prediction set, an SSI yields a set of support points in the training set consisting of the subjects with a nonzero entry in . Figure 3 shows the distribution (across 100 trn–tst partitions) of the number of support points () for for each of the environments of the Wheat-large data set. At , ranges from 30 to . In 3 of the environments (B2l, MEL, and LHT), the average number of support points was , that is 15-20% of the size of the training set. In environment B5l, the proportion of active training support points was ∼5–10%. On the other hand, in environment EHT predictions relied on an average of (out of 1,428) individuals from training (Figure 3). Similar patterns were also observed in the Wheat-small data (Supplementary Figure S3). For instance, testing phenotypes from environment ME1 were optimally predicted using, on average, (out of 419); however, the relative sparsity ( was smaller in the Wheat-large data set (5-17%) than in the Wheat-small data set (12-60%).

Figure 3

Distribution of the number of training support points () in the optimal SSI for grain yield (results obtained over 100 trn–tst partitions; , size of the training data set), by environmental condition. B2I, bed planting + 2 irrigations; B5I, bed planting + 5 irrigations; MEL, flat planting + 5 irrigations; LHT, late planting date; EHT, early planting date; DRB, bed planting + drip irrigation. Wheat-large data set. Figure 4 shows (for selected testing genotypes) the coordinates on the 1st and 2nd PC of both the prediction point (yellow circle) and the training genotypes. Active training genotypes are represented in a green circle, and those nonactive (i.e., with zero weight in the index) are represented in gray. In some cases, the support set includes training genotypes that are nearby (according to the coordinates on the top 2 PCs) the prediction point. However, in other cases, the support set spanned outside clusters that could be defined by top PCs (a similar plot for the Wheat-small data set is presented in Supplementary Figure S4). We note that the plots in Figure 4 (and Supplementary Figure S4) use coordinates that are based on two PCs that together explain 10.3% (and 16.3%) of the variance in genotypes. Thus, it is still possible that some points that appear distant in the top 2 PCs coordinates may still have a sizable genomic relationship. This could happen if, for instance, two lines share one ancestor but have the other ancestors originating from divergent populations. Thus, in the next section, we study in more detail the link between genomic relationships and the weights on the SSI.

Figure 4

First two principal components coordinates for prediction points (yellow) and the corresponding support points (green). Gray points represent genotypes that did not contribute to the prediction of the genetic value of grain yield of the genotype in yellow. All panels represent solutions for the environment. EHT, early planting date, wheat-large data set.

Genomic relationships and weights in standard and sparse selection indices

Figure 5A shows the coefficients of the G-BLUP and the SSI (i.e., the ’s derived from Equations (2) and (3), respectively) versus the genomic relationship (, the entry of ). In Figure 5A, the ’s were derived for one trn–tst partition with fixed heritability and chosen by CV conducted within the training set, for environment EHT from the Wheat-large data set. The weights used by the G-BLUP are, as expected, all different from zero and are positively associated with the genomic relationships (i.e., on average, training genotypes closely related to genotypes in the prediction set receive higher weight on the index). However, the points do not fall over a perfect line because the weight given to each of the training points depends not only on the relationship between the training point and the prediction point but also on the relationships among training genotypes. On the other hand, as expected, the SSI zero-outs most of the weights. The SSI seems to zero-out most of the weights that are in the top left and lower-right quadrants (i.e., points that had a negative (positive) relationship and in the G-BLUP got positive (negative) weight, compare both plots in Figure 5A).

Figure 5

(A) Weights () of a standard SI (G-BLUP) and the optimal SSI for grain yield versus the genomic relationship (). (B) Proportion of weights in the SSI that were zero (nonactive) and nonzero (support points); environment. EHT, early planting date, wheat-large data set. Figure 5B shows the proportion of coefficients that are zeroed-out by level of genomic relationship. Most of the coefficients corresponding to training genotypes with relationships with prediction points between −0.1 and 0.1 are zeroed-out; the proportion of coefficients that are zeroed decreases rapidly as increases; however, the decrease seems to be faster for the Wheat-large data set than for the Wheat- small (Supplementary Figure S6). Interestingly, the proportion of coefficients zeroed-out also decreases for “negative” genomic relationships, suggesting that the SSI does not only use a “local” support set; instead, the SSI seems to use informative support points. The linear kernel used here can have negative off-diagonal values (i.e., , mostly between pairs of genotypes from different clusters); these negative covariances are, in the context of the additive model fitted here, informative and thus some of the training points with negative covariance with the prediction genotypes can become active in the SSI. However, note that the size of the coefficients corresponding to points with negative genomic relationships is considerably smaller than the size of the coefficients for points with positive relationships with the prediction genotypes. The patterns observed in other environments of the Wheat-large data set and the four environments of the Wheat-small data set were conceptually similar to the ones presented in Figure 5 (see Supplementary Figures S5 and S6).

Discussion

Sample size has been recognized as one of the main factors limiting prediction accuracy in genomic prediction (Lorenzana and Bernardo 2009; de los Campos ; Habier ). In unstructured populations, SNP effects can be assumed to be homogeneous, in this context, genomic prediction accuracy is expected to increase with sample size (e.g., Daetwyler ; de los Campos ). However, this is not necessarily the case in structured and admixed populations, in multi-family data (e.g., data from bi-parental families), or in multi-generation data. In those cases, differences in allele frequencies and LD-patterns may make SNP effects heterogenous across subgroups in the sample. In that context, a larger training data set may not translate into a higher prediction accuracy. This phenomenon has been recognized in both plant and animal breeding, as well as in complex traits prediction in humans. For example, using data from a broiler breeding population, Wolc showed that using training sets that included data from many previous generations led to slightly lower prediction accuracy than the one achieved when models were trained with data from just the last 3 generations. Likewise, Hayes showed that the prediction accuracy for Holstein cattle was not improved by adding to the training set data from Jersey cattle. In plant breeding, using data from bi-parental families, Jacobson reported that within family prediction accuracy could be increased by training models using only data from families that share at least one of the parents. Finally, in the context of human data, de los Campos ) noted that the accuracy of SNP-derived genomic relationships could be very low for distantly related individuals. Thus, combining family data with large volumes of data from distantly related individuals may not improve (or may even reduce) prediction accuracy relative to models trained with family data only (Makowsky ).

Trade-offs between sample size and effect homogeneity

When data originates from heterogeneous sources there may be trade-offs between sample size and the possibility of having a homogenous data set in which SNP effects can be conceived as homogenous within the training data, and between training and testing sets. The recognition that in genomic prediction “bigger is not always better” led to the development of several models and model-training strategies aiming to improve prediction accuracy. One line of research models effect heterogeneity using group-specific effects (e.g., Veturi ; Rio ). This approach is useful when individuals cluster in a few (e.g., 2 or 3) well-defined clusters; however, the approach becomes less useful and more difficult to apply when data are characterized by either a large number of groups (e.g., bi-parental families) or when groups overlap in cryptic manners (e.g., admixed populations or partially overlapping-multi-generation data).

Training-set optimization techniques: one-size may not fit all

Another line of research seeks to identify an “optimal training set” by either selecting data from individuals that are closely related to the prediction set (e.g., Jacobson ; Lorenz and Smith 2015; Wolc ) or by using more sophisticated optimization algorithms (e.g., Rincent ; Akdemir ). Given an available training set, optimization procedures aim at identifying a subset that is optimal for all the genotypes in the prediction set. We applied the training set optimization methodology described in Rincent to the same trn–tst partitions that we used to evaluate the predictive performance of the SSI (see Section Materials and Methods for details of the trn–tst partitions, and Figure 1 and Table 1 for results obtained using an SSI). For each data set and environment, we evaluated the prediction accuracy achieved by the G-BLUP using the entire data set, and using smaller sets chosen either at random or by optimizing the training set using the CDmean criteria (Rincent ) as implemented in the STPGA R-package (Akdemir ). Across data sets and environments, the optimized training sets did not produce higher prediction accuracy than the one achieved when using the entire training data (Supplementary Figure S7). Furthermore, in all cases, the SSI outperformed the G-BLUP calibrated with the entire data set and all the G-BLUP models calibrated using smaller (optimized) training sets (Supplementary Figure S7). The above-results highlight the challenge of selecting a training set that is optimal for all the genotypes in a prediction set. Our approach tackles this problem by identifying an optimal training set for each testing genotype. Because different sparse indices are obtained for each individual in the prediction set, almost all individuals in the training set end up contributing to the index of one or more testing genotypes.

Sparsity of the SSI

When the training data consists of disconnected families, pedigree BLUP equations can also be sparse. However, this is not the case of the G-BLUP because genomic relationship matrices are dense. The SSI brings back sparsity into genomic prediction. The level of sparsity is largely controlled by the penalization parameter (). This parameter can be tuned using CV within the training data. As with any other parameter, the value of that maximizes accuracy may change slightly between trn–tst partitions; however, in our experience, using a few (e.g., 10) trn–tst partitions are enough to obtain an accurate estimate of the value of the regularization parameter that maximizes accuracy.

SSI and k-nearest neighbor

As noted, an SSI identifies, for each individual in the prediction set, a network of genotypes in the training data set (see Figure 4 and Supplementary Figure S4) that contribute to the prediction. At first glance, this appears similar to the approach used in a -nearest neighbor (KNN) regression (Cover and Hart 1967). In KNN, the genetically closest individuals (neighbors) predict each selection candidate, and predictions are derived using an average of the phenotypes in the neighborhood. There are important differences between the KNN and the SSI. First, the KNN uses only marginal similarities/distances between a prediction point and the points in the training data to identify a “neighborhood.” However, the SSI also considers the correlations (i.e., redundancies between training genotypes which are described by off-diagonal matrices of the matrix). As a consequence, the optimal support set of the SSI may zero-out coefficients for close relatives of prediction genotypes, and may include active coefficients for some distantly related individuals (see Figure 4 and Supplementary Figure S4). Second, while in the standard KNN predictions are simply the arithmetic mean of the phenotypes in the neighborhood, in the SSI each training point contributes differently with weights (the ) that reflect both the correlation of the training point with the prediction point as well as correlations among points in the training set.

SSI and sparse genomic relationship matrices

At first glance, it may seem that an SSI could be obtained using G-BLUP equations by simply zeroing-out small off-diagonal coefficients of the matrix. However, this approach would be different and will not necessarily yield a sparse index. First, as noted before, simply zeroing-out small coefficients of a matrix does not consider the fact that training genotypes are also related. (The results in Figure 5 show that the SSI also zero-out weights for training genotypes with sizable genomic relationships with testing genotypes.) Second, zeroing-out coefficients of the matrix does not guarantee that the inverse of (and therefore, the G-BLUP equations) will be sparse (making sparse the inverse of , as in Graphical-LASSO, Friedman , may be more effective). Third, zeroing-out off-diagonals coefficients of that are close or below zero ignores the fact that genotypes with negative genomic relationships with testing genotypes may be informative, simply because negative (co)variances are informative (this can also be seen in Figure 5, where some points with the negative genomic relationship are active in some SSIs).

Borrowing of information in the SSI

The fact that some points with negative genomic relationships contribute to the prediction equations of SSIs (see Figure 5) may be counter-intuitive, and may be considered undesirable. This happens simply because negative (co)variances are informative. However, the weights for training genotypes with negative genomic relationships with testing genotypes, if not zero, are small in absolute value. In other words, the SSI draws information primarily from closely related individuals. If one would like to obtain an SSI that is “strictly local” (i.e., that only training genotypes are closely related to testing genotypes) one would need to use a kernel that specify nonnegative prior (co)variances (e.g., a Gaussian kernel or additive-by-additive structure #).

SSI and elastic-net penalty

BLUP methods are equivalent to L2-penalized regressions. In BLUP, shrinkage is controlled by the noise and signal variances (, see Equation (2). We added to the optimization problem an L1-penality; thus, the SSI uses both L1 and L2 (which is intrinsically built in the SI) penalties. Therefore, the SSI can be seen as being a type of Elastic-Net (Zou and Hastie 2005) regression. However, in the SSI the weight on the L2-penalty is only determined by the ratio of variance components () which may or may not be an optimal choice from a prediction perspective (particularly if the underlying assumptions of the BLUP method, e.g., homogeneity of effects, do not hold). Therefore, to add flexibility to the SSI we considered explicitly adding L1- and L2-penalties, and searching for an optimal combination, using CV, of the relative weights of the penalization parameters of the Elastic-Net ( and ) optimization problem: To avoid too-much penalization, we decreased the weight of the initial L2-penalty to 0.5. We found that this practice could increase prediction accuracy by a small factor (2–3.5%, see Supplementary Table S3 for the Wheat-large data set) relative to the original SSI method (Equation 3). However, this practice did not provide any additional advantage over the original SSI in the Wheat-small data (see Supplementary Table S4).

The effect of sample size on the relative performance of the SSI

Sample size, SNP density, and genetic structure are features that may affect the ability of the SSI to achieve a higher prediction accuracy than the G-BLUP. To shed light on the effect of sample size we repeated the analysis presented before with varying sample size within each data set and environment (i.e., holding the structure and the number of SNPs constant). The results (Supplementary Figure S8) showed that the difference in prediction performance of the SSI and that of the G-BLUP increased with sample size. Clearly, the use of an SSI is more appealing when either there is a strong structure and the sample size is very large. Such conditions offer opportunities for the SSI to identify optimal support points for each genotype in the prediction set. Finally, we note that the derivation of the weights of the SSI depends on the trait only through the heritability (see Equation 3). Therefore, one could imagine deriving the weights of the SSI for each candidate of selection and then using these weights to predict breeding values for a range of traits with comparable heritability.

Conclusion

We presented a novel prediction method that combines in a single framework, selection index methodology with sparsity-inducing methods. The resulting SSI identifies optimal training sets for each genotype in a prediction set. The method can be useful for multiple applications, including the use in genomic prediction of data from structured populations, bi-parental families, and the analyses of multi-generation data sets. The superiority of the SSI relative to a standard G-BLUP is clearer with large sample size.

31 in total

Optimal breeding-value prediction using a sparse selection index.

Introduction

Materials and Methods

Sparse selection index (SSI) Methodology

Data

Analyses

Software

Data availability

Results

Sparsity improves prediction accuracy

Using an internal CV to achieve optimal sparsity

Sparse selection indices build subject-specific training sets

Genomic relationships and weights in standard and sparse selection indices

Discussion

Trade-offs between sample size and effect homogeneity

Training-set optimization techniques: one-size may not fit all

Sparsity of the SSI

SSI and k-nearest neighbor

SSI and sparse genomic relationship matrices

Borrowing of information in the SSI

SSI and elastic-net penalty

The effect of sample size on the relative performance of the SSI

Conclusion

1. Prediction of total genetic value using genome-wide dense marker maps.

2. The Genetic Basis for Constructing Selection Indexes.

3. Sparse inverse covariance estimation with the graphical lasso.

4. The impact of genetic relationship information on genome-assisted breeding values.

5. Predicting quantitative traits with regression models for dense molecular markers and pedigree.

Review 6. Whole-genome regression and prediction methods applied to plant and animal breeding.

7. Optimization of genomic selection training populations with a genetic algorithm.

8. Training set optimization under population structure in genomic selection.

9. Accuracy of predicting the genetic risk of disease using a genome-wide approach.

10. Incorporating Genetic Heterogeneity in Whole-Genome Regressions Using Interactions.

1. Building a Calibration Set for Genomic Prediction, Characteristics to Be Considered, and Optimization Approaches.

2. Training Set Construction for Genomic Prediction in Auto-Tetraploids: An Example in Potato.

3. Sparse testing using genomic prediction improves selection for breeding targets in elite spring wheat.

4. Multi-generation genomic prediction of maize yield using parametric and non-parametric sparse selection indices.