Literature DB >> 32371453

Improving Prediction Accuracy Using Multi-allelic Haplotype Prediction and Training Population Optimization in Wheat.

Ahmad H Sallam¹, Emily Conley², Dzianis Prakapenka³, Yang Da³, James A Anderson⁴.

Abstract

The use of haplotypes may improve the accuracy of genomic prediction over single SNPs because haplotypes can better capture linkage disequilibrium and genomic similarity in different lines and may capture local high-order allelic interactions. Additionally, prediction accuracy could be improved by portraying population structure in the calibration set. A set of 383 advanced lines and cultivars that represent the diversity of the University of Minnesota wheat breeding program was phenotyped for yield, test weight, and protein content and genotyped using the Illumina 90K SNP Assay. Population structure was confirmed using single SNPs. Haplotype blocks of 5, 10, 15, and 20 adjacent markers were constructed for all chromosomes. A multi-allelic haplotype prediction algorithm was implemented and compared with single SNPs using both k-fold cross validation and stratified sampling optimization. After confirming population structure, the stratified sampling improved the predictive ability compared with k-fold cross validation for yield and protein content, but reduced the predictive ability for test weight. In all cases, haplotype predictions outperformed single SNPs. Haplotypes of 15 adjacent markers showed the best improvement in accuracy for all traits; however, this was more pronounced in yield and protein content. The combined use of haplotypes of 15 adjacent markers and training population optimization significantly improved the predictive ability for yield and protein content by 14.3 (four percentage points) and 16.8% (seven percentage points), respectively, compared with using single SNPs and k-fold cross validation. These results emphasize the effectiveness of using haplotypes in genomic selection to increase genetic gain in self-fertilized crops.

Entities: CellLine Chemical Disease Gene Species

Keywords: GenPred; Shared data resources; genomic selection; haplotype prediction; plant breeding; quantitative trait loci; training population optimization; wheat

Mesh：

Year: 2020 PMID： 32371453 PMCID： PMC7341132 DOI： 10.1534/g3.120.401165

Source DB: PubMed Journal: G3 (Bethesda) ISSN： 2160-1836 Impact factor: 3.154

Genomic selection is an important breeding approach for improving quantitative traits. It was advocated as a marker-assisted selection approach that uses high density SNP genotypes for estimating genomic breeding values (Meuwissen ). Genomic selection relies on linkage disequilibrium (LD) between SNP markers and quantitative trait loci (QTL), where the LD among markers is used as a verification for the association between markers and QTL. Several genomic prediction models were proposed including RR-BLUP, Bayes A, Bayes B, Bayes Cπ, Bayes LASSO, and Reproducing Kernel Hilbert Space RKHS (Meuwissen ; de Los Campos ; Kizilkaya ; Lorenz ). These prediction methods vary in the assumed genetic effects or/and variance associated with markers. Factors affecting the accuracy of genomic prediction include trait architecture, marker density and LD, training population size, and population structure (Daetwyler ; Asoro ; Heffner ; Lorenz ; Lorenz ; Sallam ; Zhang ). Since the development of genomic selection, it has been applied in both animal (Garrick 2011; Rexroad ) and plant breeding programs (Lian ; Sallam ; Crossa ) resulting in reshaping the breeding approaches by omitting the step of phenotyping the whole population. Rather, a smaller set of a calibration population is phenotyped and genotyped to train a prediction model for estimating breeding values of the selection candidates. The composition of the calibration population is of paramount importance because it determines the efficiency of selecting the best performing individuals. Several studies investigated methods to construct a calibration population for improving the accuracy of genomic prediction including stratified sampling, CDmean optimization, prediction error variance (PEV), and Gmean (Rincent ; Akdemir ; Isidro ; Lorenz and Smith 2015). These methods varied in their improvement of prediction accuracy for traits with different genetic architectures (Isidro ; Tiede and Smith 2018). One of the important factors determining prediction accuracy is population structure, which can result in variability in allele frequencies and the degree of the genetic relationship between subpopulations/clusters, leading to changes in the accuracy of prediction. The effect of population structure on the accuracy of genomic prediction was observed in both animals (Hayes ; Saatchi ) and plants (Asoro ; Technow ), resulting in a general recommendation of constructing a mixed calibration population that includes individuals from all clusters for improving the accuracy of prediction. To cope with structured populations that are developed from parents with different breeding histories, the stratified sampling approach was proposed by sampling a representative sample from each cluster and this approach showed improvement in the prediction accuracy for several quantitative traits (Isidro ). Unlike PEV and CDmean optimization, the stratified sampling approach is not dependent on trait heritability; thereby, it is expected to perform more consistently across different traits with variable genetic architecture (Rincent ). Current methods of genomic selection mostly use single SNP markers to predict the genetic merits of individuals. However, haplotypes may have several advantages over single markers for genomic selection. Phased marker haplotypes can better estimate identity-by-descent and haplotype effects (Meuwissen and Goddard 2000; Hess ). Additionally, haplotypes increase the LD between the group of phased markers and QTL, explaining greater levels of QTL variance (Hayes ). The goal of using genomic selection in plant breeding is to improve the rate of genetic gain under conditions of reduced resources available for evaluating the calibration population. We tested the effect of population structure and using haplotypes on the accuracy of genomic prediction in a wheat population that represents the genetic diversity of the University of Minnesota spring wheat breeding program. The goals of this study were to (1) investigate the effect of population structure on prediction accuracy for yield, test weight, and protein content in a hard red spring wheat population, (2) compare stratified sampling optimization with k-fold cross validation for the prediction of the three traits, and (3) compare the prediction accuracy of single markers to four different multi-allelic haplotype blocks with different sizes.

Materials and Methods

Phenotypic data

The Minnesota wheat genomic selection (MN-WGS) panel is composed of 383 breeding lines that represent the genetic diversity of the University of Minnesota spring wheat breeding program and includes 93 parents and 290 derived lines from 177 unique crosses represented in their pedigrees (Conley ). Parents included lines from the spring wheat breeding programs of the University of Minnesota, North Dakota State University, South Dakota State University, AgriPro, WestBred, and CIMMYT. The MN-WGS panel was evaluated together for agronomic traits in two trials in 2013 at St. Paul and Crookston, MN using standard agronomic practices. Plot sizes were 2.6 square meters in St. Paul and 3.4 square meters in Crookston. No fungicides were applied in either location. Lines were planted in a Type II modified augmented field design with 32 blocks. Linkert (Anderson ) was used as the primary check with LCS Albany (PI 658002), Briggs (Devkota ), Prosper (Mergoum ), and Vantage (PI 653518) as secondary checks. Linkert was repeated once in all of the 32 blocks. The population was phenotyped for grain yield, test weight, and protein content. Yield was determined after harvesting plots with a Wintersteiger small plot combine then weighing the grain to express data as kg/ha. The test weight was measured as the weight of seeds that completely fill a quarter pint (118.3 Milliliter) and the resulting data were expressed as kg/hL. Near infrared reflectance spectroscopy (NIR) was used to determine protein content in the harvested grains (Inframatic 9500, Perten Instruments, Sweden).

Phenotypic data analysis

Correction for spatial field variability for yield, test weight, and protein content was done using a moving grid adjustment (Technow 2015; R-package mvngGrAd, R development core team 2017). After setting the field in rows and columns, a moving mean was calculated using a surrounding grid of a particular size. This moving mean was used subsequently as a covariate to calculate the adjusted phenotypes. A moving average window of eight plots was used to determine the phenotypic performance of the line in the center. For all traits, the entire set of lines were used to correct for variance in trial means using the MIXED procedure in SAS 9.4 (Sallam ; SAS Institute 2013). In all experiments, genetic and residual variances were calculated using the MIXED procedure in SAS. Broad-sense heritability was estimated using the equation , where is genetic variance, is the variance of random residuals, and n is the number of trials.

Genotyping and linkage disequilibrium

Leaf tissues were harvested from the 383 breeding lines at the three leaf stage. DNA extraction was performed using the BioSprint 96 DNA Plant Kit according to the manufacturer’s instructions (Qiagen 2016). The panel was genotyped using the 90K Illumina Infinium iSelect Assay. Clustering was performed using Illumina’s Genome Studio Polyploid Clustering Module v1.0 using the procedure described by Wang , followed by manual curation to correct inaccurately clustered loci. Markers were filtered for MAF < 0.05 and missing data > 0.10 resulting in 16,697 SNP markers. From this marker set 14,086 SNP markers had map positions based on a consensus wheat map developed from six independent double haploid mapping populations (Wang ). Missing marker data were imputed using LD-kNNi, which imputes missing marker genotypes based on the k-nearest neighbor imputation method (Money ). To characterize the level of LD in the MN-WGS panel, the adjacent marker LD was estimated as r for the 21 wheat chromosomes in TASSEL (Bradbury ). The genomic additive relationship matrix was estimated among all lines in rrBLUP package of R using all markers (Endelman and Jannink 2012). The genomic additive relationship matrix was estimated as:where: Z = M – P, M being the individuals by SNP loci marker matrix and P the frequencies of alleles expressed as 2(p – 0.5) with p representing the allele frequency of marker i (VanRaden 2008).

Constructing haplotype blocks

The high density SNP marker genotypes were used to construct haplotype blocks after ordering markers based on the consensus map positions for all 21 chromosomes (Wang ). We generated four different haplotype blocks, each with a fixed number of 5 adjacent markers (Haploblock-5), 10 (Haploblock-10), 15 (Haploblock-15), and 20 (Haploblock-20) for each chromosome. Haplotype alleles for each haplotype block were numbered using a custom script in R (R Development Core Team 2017).

Assessment of population structure and training population optimization

A cluster analysis was performed by generating a pairwise distance matrix estimated as 1 - IBS (identity-by-state) probability in TASSEL using SNP marker data for all lines in the MN-WGS panel. Using the distance matrix, K-means clustering was performed using the Hartigan-Wong algorithm implemented in R (R Development Core Team 2017). Based on prior knowledge of pedigree information, three clusters were assumed in K-means clustering. Principal component analysis (PCA) was performed in R using SNP marker data for all lines in the MN-WGS panel to visually identify clusters assigned by the K-means clustering (R Development Core Team 2017). Using the genomic additive relationship matrix, the average genetic relationships were calculated for lines within a cluster (A) and lines between clusters (A). To investigate the effect of population structure on genomic prediction in the MN-WGS panel, using single SNP markers only, the three clusters identified by K-means clustering were used in evaluating the predictive ability by combining two clusters for predicting the performance of the third cluster and repeating this step iteratively for all clusters. The predictive ability was calculated as the correlation between phenotypic values of individuals in the validation population and the estimated genomic predictions of those individuals (Legarra ). To evaluate genomic prediction accuracy, k-fold cross validation was implemented so each individual appeared once in the validation population. We used both single SNP markers and haplotype blocks for the assessment of the predictive ability. The population was randomly shuffled followed by using k-fold cross validation by dividing the MN-WGS panel into four groups. One of those groups were excluded to estimate marker/haplotype effects using the three remaining groups to define 75% (288 individuals) of the population as a random calibration population. The k-fold cross validation was repeated four times iteratively for each of the four randomly assigned groups. These previous k-fold cross validation steps were replicated four times. In addition to k-fold cross validation, a training population optimization procedure using stratified sampling was evaluated. For the stratified sampling procedure, clusters identified by K-means clustering were used as a criterion for selecting the calibration population. A stratified sampling genomic prediction procedure was performed by constructing a calibration population through randomly sampling 75% of lines from each of the three clusters. Therefore, the sample size from each cluster was proportional to the size of the cluster, and a total of 75% (288) of lines in the MN-WGS panel were used as a calibration set to predict the remaining 25% (94). The stratified sampling prediction approach was replicated sixteen times using both single SNP markers and the four haplotype block sizes. Each predictive ability value was transformed using Fisher Z. The test statistics were calculated as , where r is the predictive ability and n is the number of tests (Bobko 2001). The test statistic follows a tN - 2 distribution (Bobko 2001). A paired t-test was used for the assessment of statistical significance between single markers and each haplotype block size for the same constructed calibration populations.

Genomic prediction models

For genomic best linear unbiased prediction (GBLUP) using single markers, the mixed model with SNP additive effects (or average effects of gene substitution) based on the partition of genotypic values (Da ) was used:where µ = population mean, 1 = column vector of 1’s, n = number of lines, α = column vector of marker additive effects, m = number of SNPs, = model matrix of α with elements of , , and for a marker genotype, = frequency of allele k of a SNP (k = 1,2), and = GBLUP of additive values of the n lines. Assumptions for the first and second moments are: , , and , where = variance of SNP additive effects, = residual variance, = identity matrix, and = identity matrix. The GBLUP of additive values, and genomic restricted maximum likelihood (GREML) estimates were calculated using the GVCBLUP computer package (Wang ; https://animalgene.umn.edu). For haplotype analysis, a multi-allelic haplotype model that treats each haplotype block as a ‘locus’ and each haplotype within the haplotype block as an allele (Da 2015) was used. The multi-allelic haplotype prediction was modeled as:where µ = population mean, 1 = column vector of 1’s, n = number of lines, = column vector of haplotype additive effects, = number haplotype additive effects (or average effects of gene substitution), = model matrix of with elements of , , and for a haplotype genotype, = frequency of a haplotype in a haplotype block, and = GBLUP of additive values of the n lines. Assumptions for the first and second moments are: , , and , where = variance of haplotype additive effects, = residual variance, = identity matrix, and = identity matrix. The GBLUP of additive values were calculated using the GVCHAP computer package (Prakapenka ; https://animalgene.umn.edu).

Data availability

Genotypic and raw phenotypic data for this study are available at figshare portal. The link to the genotypic data (https://figshare.com/articles/Conley_MNWGSpanel_cM_hmp_txt/10031867). The link for the raw phenotypic data (https://figshare.com/articles/Pheno_MN-WGS/10032326). Supplementary tables are available at figshare (https://figshare.com/articles/Supplemental_Tables_for_MN-WGS_panel/10031891). Table S1 includes the average adjacent marker LD estimated as (r) for the 21 wheat chromosomes in the MN-WGS panel. Table S2 includes number of haplotype blocks for each chromosome, distance covered by haplotype blocks, maximum number, and average number of haplotype alleles in fixed length haplotypes of 5, 10, 15, and 20 adjacent markers. Table S3 includes the predictive ability for yield, test weight, and protein content using single markers, haplotype blocks of 5 adjacent markers (Haploblock-5), haplotype blocks of 10 adjacent markers (Haploblock-10), haplotype blocks of 15 adjacent markers (Haploblock-15), and haplotype blocks of 20 adjacent markers (Haploblock-20).

Results

Phenotypic and genotypic data analysis

The MN-WGS panel was evaluated in two balanced trials in Minnesota for grain yield, test weight, and grain protein content. Correction for spatial field variability and trial effects was performed to improve estimates of phenotypic values of individuals. Significant differences were observed among lines for yield, test weight, and protein content. Estimated genetic variance, residual variance, and the broad-sense heritability for each trait are shown in Table 1. Heritability estimates were 0.28 for yield, 0.67 for test weight, and 0.68 for protein content (Table 1). After quality control filtering, 14,086 markers with map positions were used in the study. Marker density varied among chromosomes and ranged from 73 for chromosome 4D to 1,488 for chromosome 2B (Table S1). Extensive levels of LD, estimated as r, were observed for all chromosomes that varied between 0.45 for chromosome 7A to 0.69 for chromosome 3B (Table S1). The average adjacent marker LD across all chromosomes was 0.57. K-means clustering identified three different clusters and the number of lines for each cluster were 175 for cluster 1, 89 for cluster 2, and 118 for cluster 3 (Table 2). For the PCA, the majority of individuals in the MN-WGS panel were located in their respective clusters identified by K-means clustering (Figure 1). The first principal component (PC1) explained 10.0% of the variability whereas the second principal component (PC2) explained 8.3% of the variability in the MN-WGS panel (Figure 1). The genomic additive relationship matrix agreed with the results of the K-means clustering in identifying three clusters, each including genetically related individuals (Figure 2). Table 2 displays the average additive genetic relationship between individuals in different (A) clusters and individuals within (A) clusters. Cluster 1 had the highest A and lowest A compared to the other two clusters (Table 2). On the other hand, cluster 2 had the lowest A and highest A (Table 2). The average yield for the three clusters were 5556, 5617, and 5583 kg/ha for cluster 1, cluster 2, and cluster 3; respectively. No significant difference was observed for yield across the three clusters. The average test weight for the three clusters were 79.1, 77.8, and 78.7 kg/hL for cluster 1, cluster 2, and cluster 3; respectively. No significant difference was observed for test weight across the three clusters. The average protein content for the three clusters were 14.4, 14.0, and 14.4% for cluster 1, cluster 2, and cluster 3; respectively. No significant difference was observed for protein content across the three clusters.

Table 1

Estimated genetic variance (), residual variance (), and broad-sense heritability (H) for yield, test weight, and protein content in the Minnesota wheat genomic selection panel

Trait	σg2	σe2	H
Yield (kg/ha)	33737	168275	0.29
Test weight (kg/hL)	1.20	1.19	0.67
Protein (%)	0.28	0.27	0.68

Table 2

Genetic relationship between (A) and within (A) clusters and average predictive ability, when using the cluster in two training populations to predict another cluster, for yield, test weight, and protein based on single markers

				Predictive ability
Clusters	Number of individuals	A_{ij between}	A_{ij within}	Yield	Test weight	Protein	Ave. across traits
Cluster 1	176	−0.12 ± 0.001	0.13 ± 0.002	0.32	0.38	0.28	0.33
Cluster 2	89	−0.16 ± 0.001	0.51 ± 0.005	0.29	0.34	0.19	0.27
Cluster 3	118	−0.14 ± 0.001	0.28 ± 0.003	0.28	0.31	0.23	0.27
Average for each trait				0.30	0.34	0.23

Figure 1

Figure 2

Heatmap for the additive genetic relationship matrix displaying genetic relatedness among lines in the MN-WGS panel with the corresponding clusters identified using K-means clustering.

Population stratification of the Minnesota wheat genomic selection (MN-WGS) panel of 383 wheat lines inferred from K-means clustering in which three clusters were identified and visualized on principal component analysis. Cluster 1 is shown in blue, cluster 2 in red, and cluster 3 in green. Heatmap for the additive genetic relationship matrix displaying genetic relatedness among lines in the MN-WGS panel with the corresponding clusters identified using K-means clustering.

Haplotype block construction

Haplotype blocks of 5, 10, 15, and 20 adjacent markers were generated for all chromosomes, with variable number of haplotype alleles identified for each haplotype locus. We will refer to haplotype blocks of 5, 10, 15, and 20 as Haploblock-5, Haploblock-10, Haploblock-15, Haploblock-20; respectively. With the increase of haplotype lengths (Haploblock-5, Haploblock-10, Haploblock-15, to Haploblock-20), lower number of haplotype blocks were generated across the genome with higher numbers of haplotype alleles per haplotype blocks (Table S2). For Haploblock-5, a total of 2,810 haplotype blocks were identified across all chromosomes with up to 29 haplotype alleles per haplotype block (Table S2). On average, across the 21 wheat chromosomes, each Haploblock-5 covered 2.2 cM (Table S2). For Haploblock-10, 1,400 haplotype blocks were identified across all chromosomes with up to 105 haplotype alleles per haplotype block (Table S2). On average across the 21 wheat chromosomes, each Haploblock-10 covered 4.7 cM (Table S2). For Haploblock-15, 930 haplotype blocks were identified across all chromosomes with up to 151 haplotype alleles per haplotype block (Table S2). On average across the 21 wheat chromosomes, each Haploblock-15 covered 7.8 cM (Table S2). For Haploblock-20, 691 haplotype blocks were identified across all chromosomes with up to 259 haplotype alleles per haplotype block (Table S2). On average across the 21 wheat chromosomes, each Haploblock-20 covered 9.6 cM (Table S2). The average number of haplotype alleles across all chromosomes were 3, 4, 6, and 10 for Haploblock-5, Haploblock-10, Haploblock-15, and Haploblock-20; respectively (Table S2). The four different haplotype block sizes were used in genomic prediction using both k-fold cross validation and stratified sampling optimization.

Training population scenarios and comparing between single and haplotype prediction

Generally, the predictive ability was lower for yield compared to test weight and protein content in both k-fold cross validation and stratified sampling. To investigate the effect of population structure on the predictive ability, using single SNP markers, the identified clusters were used as training populations by combining two clusters for predicting the third cluster for yield, test weight, and protein content. The size of the formed training populations varied depending on the clusters size (Table 2). The predictive abilities when including a cluster in the training populations are presented in Table 2. When including cluster 1 in the training population in two cases (cluster 1 + cluster 2 and cluster 1 + cluster 3), the predictive abilities were higher across all traits (Table 2). The average predictive abilities for cluster 2 were similar to cluster 3 across all traits and both were lower than cluster 1 (Table 2). After confirming the effect of population structure on the predictive ability, a training population optimization method was used to design a calibration population by sampling a representative sample from each cluster. In general, the stratified sampling resulted in an increase in the predictive ability compared with k-fold cross validation for yield and protein content; while decreasing the predictive ability for test weight using all marker prediction scenarios (Figure 3; Table S3). Four different haplotype block sizes (Haploblock-5, Haploblock-10, Haploblock-15, and Haploblock-20) were used to assess the effectiveness of haplotypes compared with single markers in genomic prediction. All four haplotype blocks improved the predictive ability in both k-fold cross validation and stratified sampling in yield and protein content compared with single markers (Figure 3; Table S3). For k-fold cross validation in yield, Haploblock-5, Haploblock-10, Haploblock-15, and Haploblock-20 resulted in average increases of 6.3, 2.9, 5.3, and 2.2% in the predictive ability over single markers (Figure 3; Table S3). With the use of stratified sampling, Haploblock-5, Haploblock-10, Haploblock-15, and Haploblock-20 resulted in significant average increases of 6.8, 5.5, 9.4, and 5.2% in the predictive ability over single markers (Figure 3; Table S3). For k-fold cross validation in protein content, Haploblock-5, Haploblock-10, Haploblock-15, and Haploblock-20 resulted in significant average increases of 3.4, 4.6, 6.7 and 6.9% in the predictive ability over single markers (Figure 3; Table S3). With the use of stratified sampling, Haploblock-5, Haploblock-10, Haploblock-15, and Haploblock-20 resulted in significant average increases of 2.7, 4.3, 6.0, and 5.8% in the predictive ability for protein content over single markers (Figure 3; Table S3). For test weight, the increase of predictive ability of haplotypes compared with single marker prediction was significant only when using Haploblock-15 and Haploblock-20 with stratified sampling optimization (Figure 3; Table S3).

Figure 3

The predictive ability for yield, test weight, and protein content using single markers, haplotype blocks of 5 adjacent markers (Haploblock-5), haplotype blocks of 10 adjacent markers (Haploblock-10), haplotype blocks of 15 adjacent markers (Haploblock-15), and haplotype blocks of 20 adjacent markers (Haploblock-20). The two validation methods used are k-fold cross validation and the stratified sampling optimization. A star over the error bar indicates a significant difference in the predictive ability between the haplotype and single markers for the same validation method.

Discussion

Most genomic selection investigations rely on using single markers for predicting breeding values of individuals. In our breeding experiment, multi-allelic haplotype prediction performed better than single markers for all three traits investigated. For both single markers and haplotype predictions, the predictive ability for yield was lower than protein content and test weight due to a lower heritability estimate for yield compared with other two traits. Traits with low heritability estimates tend to be highly quantitative, controlled by many loci with smaller effects, and have much environmental noise; which can result in lower prediction accuracies (Bernardo and Yu 2007; Daetwyler ; Sallam ). Daetwyler demonstrated that increasing heritability will improve the accuracy of the prediction for both GBLUP and Bayes B. In Norwegian dairy cattle, a strong relationship was observed between prediction accuracy and trait heritability (Luan ). Similar results were observed in wheat and barley as high heritability traits such as heading date, height, and test weight had higher prediction accuracies compared with low heritability traits such as yield (Heffner ; Sallam ). With low heritability traits, a larger number of phenotypic records are needed to better estimate marker effects for improving prediction accuracy (Hayes ; Luan ). K-means clustering identified three clusters with variable sizes. The cluster analysis revealed the pedigree structure in MN-WGS panel. The three clusters had similar performances for the three traits. The five most frequent parents each appeared in pedigrees at least 45 times (data not shown). One of those parents was not part of the MN-WGS panel. For example, Sabin (Anderson ), assigned in cluster 2, is a parent to 91 individuals, 87 of which are included in cluster 2. RB07 (Anderson ), assigned in cluster 3, is a parent to 74 individuals, 71 of which are included in cluster 3. MN02072-7, assigned in cluster 1, is a parent to 66 individuals, 56 of which are included in cluster 1. MN01333-A-2 is a parent to 53 individuals, 41 of which are included in cluster 1. Finally, Glenn (Mergoum ), assigned in cluster 1, is a parent to 45 individuals, 32 of which are included in cluster 1. These results indicate that clustering in the population is determined by pedigree stratification. Cluster 2 had the highest A and by searching through pedigree information, we found that all lines in this cluster are half-sibs, sharing Sabin as a common parent. The two training populations that included cluster 2 (cluster 1 + cluster 2 and cluster 2 + cluster 3) had an average predictive ability that is lower than cluster 1, whose included three parents: MN02072-7, Blade (PVP no. 200800075), Faller (Mergoum ), and Glenn (Mergoum ) and high frequency of their progenies. Blade, Faller, and Glenn are wheat cultivars and used as parents in the MN-WGS panel but with more progeny for those parents in cluster 1 (48 progeny lines). Several direct progenies of these three parents were also included in cluster 2 (16) and cluster 3 (20), which may explain the highest A for cluster 1. The high genetic relationship of cluster 1 with the other two clusters resulted in a higher prediction accuracy of single marker prediction when including cluster 1 in the training population. These results may not be applicable to other breeding situations with different levels of population structure due to admixture or pedigree stratification (Toosi ; Asoro ). Our findings are in agreement with a genomic selection study in angus beef cattle as one of the five clusters identified using K-means clustering showed lower genetic relationship to other clusters, resulting in the smallest prediction accuracy across all traits when including this cluster in the training population (Saatchi ). It has been proven that increasing the genetic relationship between the training and validation populations will improve the accuracy of genomic prediction (Habier ; Lorenz ; Lorenz and Smith 2015). The size of the training population is another factor that affects the accuracy of genomic prediction. Training populations including cluster 1 had a larger size compared with the other two clusters and that may contribute to the higher prediction accuracy of cluster 1. Despite the fact that training populations including cluster 3 were larger than cluster 2, both resulted in similar prediction accuracies across all traits. In breeding populations, the change of prediction accuracy due to the increase of training population size is dependent on the genetic relationship (Albrecht ) and breeding history (Sallam ); therefore, careful selection of the training population is needed for a successful implementation of genomic selection (Lorenz ; Lorenz and Smith 2015). When implementing genomic selection in a breeding program, it is important to consider the best genotypes to be included in the calibration population. Additionally, implementing an efficient marker prediction approach is required to maximize the accuracy of genomic prediction. An effective genomic selection strategy in plant breeding programs is able to design a smaller training population for the purpose of generating genotypic and phenotypic data, which can improve resource allocation (Lorenz 2013; Endelman ). Several methods were proposed to optimize calibration population design including CDmean (Rincent ), PEV (Akdemir ), stratified sampling (Isidro ), and Gmean (Lorenz and Smith 2015). The stratified sampling approach outperformed CDmean optimization in structured populations across several traits with different genetic architectures (Isidro ). Compared with k-fold cross validation, stratified sampling increased the predictive ability in yield by 4.4, 4.9, 7.0, 8.5 and 7.4% for single markers, Haploblock-5, Haploblock-10, Haploblock-15, and Haploblock-20; respectively. For protein content, the stratified sampling increased the predictive ability by 10.2, 9.5, 9.8, 9.4, and 9.0% for single markers, Haploblock-5, Haploblock-10, Haploblock-15, and Haploblock-20; respectively. Stratified sampling reduced the predictive ability for test weight across all the marker/haplotype genotyping scenarios. The reason for the reduction in the test weight predictive ability when implementing stratified sampling is unknown. Similar results were observed in barley when stratified sampling resulted in a reduction of prediction accuracy in yield across selection cycles while increasing the prediction accuracy for Fusarium toxin accumulation (Deoxynivalenol) (Tiede and Smith 2018). There is no absolute training population optimization method that could be applied to all traits with variable genetic architecture (Isidro ; Tiede and Smith 2018). This may be partly due to the correlation between the traits and population structure, which can affect the accuracy of prediction (Sallam ). However, other optimization methods performed similarly or more consistently in traits with different genetic architecture and in populations with limited population stratification (Isidro ; Tiede and Smith 2018). The current study accentuates the improvement of prediction accuracy based on haplotype blocks vs. single marker genomic prediction in a self-fertilized crop species. The four different haplotype sizes significantly improved the accuracy of prediction compared with single marker prediction for yield, test weight, and protein content. It is expected for haplotypes to improve the prediction accuracy over single marker prediction due to the increased LD between haplotypes and causal genetic variants, the effectiveness of capturing genetic relationship using haplotype information, and the ability of haplotype blocks to capture short-range epistatic interactions of nearby genetic variants (Clark 2004; Hayes ; Hess ; Jiang ). Haplotypes may better capture the genomic similarity between lines because LD patterns in each block are considered. A relationship was observed between the length of the haplotype and the accuracy of prediction in animal studies using both simulated (Calus ; Villumsen ) and empirical data (Hayes ; Hess ). The increase of haplotype length is expected to capture LD between markers in blocks with QTL; thereby increasing the accuracy of prediction. However, this may also increase the number of haplotype allelic classes, which may reduce the accuracy of prediction due to smaller sample sizes representing these classes (Villumsen ; Da 2015; Hess ). In our study, the longer haplotype blocks in Haploblock-20 resulted in a large increase for the number of haplotype allelic classes, on average, compared with other haplotype sizes, leading to no improvement in accuracy over Haploblock-15. Villumsen found that the ideal haplotype size could be determined based on the LD level and marker density in a population. In the current study, extensive levels of LD were observed in the four haplotype block sizes with average LD of 0.568, 0.569, 0.572, and 0.570 for Haploblock-5, Haploblock-10, Haploblock-15, and Haploblock-20; respectively. The high levels of LD are a consequence of the selfing nature of wheat that results in extension of LD over long distance. In a simulation study in animals, an adjacent marker LD of 0.20 was sufficient for the use genomic prediction (Calus ). LD is an important component for driving the accuracy of genomic prediction as the prediction accuracy increases at a similar pattern to the increase of LD (Solberg ). It is clearly evident that the prediction accuracy is reduced at lower LD levels (Solberg ; Calus ). To assess the accuracy of genomic prediction in a rice diversity panel using single markers, LD levels between 0.49 and 0.64 resulted in higher accuracies with reductions in the accuracy of prediction at lower LD levels for three different traits (Ben Hassen ). Thus, monitoring LD level while constructing haplotype blocks is a safe approach to ensure improvement of genomic predictions. This is because partial linkage between QTL and a group of markers may reduce QTL variance explained by haplotypes, thereby lowering the prediction accuracy (Villumsen ).

Utility of haplotype prediction in plant breeding

We evaluated the implementation of a multi-allelic haplotype genomic prediction model in wheat to assess the changes of the predictive ability compared with single markers. Several methods were proposed for constructing haplotype blocks including fixed-length haplotypes and variable-length haplotypes that are based on haplotype identity-by-descent (IBD) and LD-based haplotypes (Calus ; Cuyabano ; Hess ). In current study, four fixed numbers (5, 10, 15, and 20) of adjacent markers were used to construct four different haplotype block sizes that resulted in improvement over single marker prediction for traits with different genetic architectures. With the implementation of haplotype prediction in conjunction with a training population optimization approach such as stratified sampling, the prediction accuracy improved substantially. Using Haploblock-15 and implementing stratified sampling for training population optimization, the predictive ability was improved significantly by 14.3 (four percentage points) and 16.8% (seven percentage points) for yield and protein content, respectively, compared with single markers and random k-fold cross validation. Improvement of prediction accuracy can change the ranking of top performing individuals in the selection candidate population, thereby increasing genetic gain.

40 in total

1. Prediction of total genetic value using genome-wide dense marker maps.

Authors: T H Meuwissen; B J Hayes; M E Goddard
Journal: Genetics Date: 2001-04 Impact factor: 4.562

2. Accuracy of genomic selection using different methods to define haplotypes.

Authors: M P L Calus; T H E Meuwissen; A P W de Roos; R F Veerkamp
Journal: Genetics Date: 2008-01 Impact factor: 4.562

3. Genomic selection using different marker types and densities.

Authors: T R Solberg; A K Sonesson; J A Woolliams; T H E Meuwissen
Journal: J Anim Sci Date: 2008-04-11 Impact factor: 3.159

4. Establishment and Optimization of Genomic Selection to Accelerate the Domestication and Improvement of Intermediate Wheatgrass.

Authors: Xiaofei Zhang; Ahmad Sallam; Liangliang Gao; Traci Kantarski; Jesse Poland; Lee R DeHaan; Donald L Wyse; James A Anderson
Journal: Plant Genome Date: 2016-03 Impact factor: 4.089

5. Genome-based prediction of maize hybrid performance across genetic groups, testers, locations, and years.

Authors: Theresa Albrecht; Hans-Jürgen Auinger; Valentin Wimmer; Joseph O Ogutu; Carsten Knaak; Milena Ouzunova; Hans-Peter Piepho; Chris-Carolin Schön
Journal: Theor Appl Genet Date: 2014-04-11 Impact factor: 5.699

Review 6. Genomic Selection in Plant Breeding: Methods, Models, and Perspectives.

Authors: José Crossa; Paulino Pérez-Rodríguez; Jaime Cuevas; Osval Montesinos-López; Diego Jarquín; Gustavo de Los Campos; Juan Burgueño; Juan M González-Camacho; Sergio Pérez-Elizalde; Yoseph Beyene; Susanne Dreisigacker; Ravi Singh; Xuecai Zhang; Manje Gowda; Manish Roorkiwal; Jessica Rutkoski; Rajeev K Varshney
Journal: Trends Plant Sci Date: 2017-09-28 Impact factor: 18.313

7. Accuracies of genomic breeding values in American Angus beef cattle using K-means clustering for cross-validation.

Authors: Mahdi Saatchi; Mathew C McClure; Stephanie D McKay; Megan M Rolf; JaeWoo Kim; Jared E Decker; Tasia M Taxis; Richard H Chapple; Holly R Ramey; Sally L Northcutt; Stewart Bauck; Brent Woodward; Jack C M Dekkers; Rohan L Fernando; Robert D Schnabel; Dorian J Garrick; Jeremy F Taylor
Journal: Genet Sel Evol Date: 2011-11-28 Impact factor: 4.297

8. Resource allocation for maximizing prediction accuracy and genetic gain of genomic selection in plant breeding: a simulation experiment.

Authors: Aaron J Lorenz
Journal: G3 (Bethesda) Date: 2013-03-01 Impact factor: 3.154

9. Fixed-length haplotypes can improve genomic prediction accuracy in an admixed dairy cattle population.

Authors: Melanie Hess; Tom Druet; Andrew Hess; Dorian Garrick
Journal: Genet Sel Evol Date: 2017-07-03 Impact factor: 4.297

10. Characterization of polyploid wheat genomic diversity using a high-density 90,000 single nucleotide polymorphism array.

Authors: Shichen Wang; Debbie Wong; Kerrie Forrest; Alexandra Allen; Shiaoman Chao; Bevan E Huang; Marco Maccaferri; Silvio Salvi; Sara G Milner; Luigi Cattivelli; Anna M Mastrangelo; Alex Whan; Stuart Stephen; Gary Barker; Ralf Wieseke; Joerg Plieske; Morten Lillemo; Diane Mather; Rudi Appels; Rudy Dolferus; Gina Brown-Guedira; Abraham Korol; Alina R Akhunova; Catherine Feuillet; Jerome Salse; Michele Morgante; Curtis Pozniak; Ming-Cheng Luo; Jan Dvorak; Matthew Morell; Jorge Dubcovsky; Martin Ganal; Roberto Tuberosa; Cindy Lawley; Ivan Mikoulitch; Colin Cavanagh; Keith J Edwards; Matthew Hayden; Eduard Akhunov
Journal: Plant Biotechnol J Date: 2014-03-20 Impact factor: 9.803

5 in total

1. Prediction ability of genome-wide markers in Pinus taeda L. within and between population is affected by relatedness to the training population and trait genetic architecture.

Authors: Edwin Lauer; James Holland; Fikret Isik
Journal: G3 (Bethesda) Date: 2022-02-04 Impact factor: 3.542

2. Haplotype- and SNP-Based GWAS for Growth and Wood Quality Traits in Eucalyptus cladocalyx Trees under Arid Conditions.

Authors: Camilo E Valenzuela; Paulina Ballesta; Sunny Ahmar; Sajid Fiaz; Parviz Heidari; Carlos Maldonado; Freddy Mora-Poblete
Journal: Plants (Basel) Date: 2021-01-13

3. Haplotype Analysis of Genomic Prediction Using Structural and Functional Genomic Information for Seven Human Phenotypes.

Authors: Zuoxiang Liang; Cheng Tan; Dzianis Prakapenka; Li Ma; Yang Da
Journal: Front Genet Date: 2020-11-26 Impact factor: 4.599

Review 4. Features and applications of haplotypes in crop breeding.

Authors: Javaid Akhter Bhat; Deyue Yu; Abhishek Bohra; Showkat Ahmad Ganie; Rajeev K Varshney
Journal: Commun Biol Date: 2021-11-04

5. Can Cross-Country Genomic Predictions Be a Reasonable Strategy to Support Germplasm Exchange? - A Case Study With Hydrogen Cyanide in Cassava.

Authors: Lívia Gomes Torres; Eder Jorge de Oliveira; Alex C Ogbonna; Guillaume J Bauchet; Lukas A Mueller; Camila Ferreira Azevedo; Fabyano Fonseca E Silva; Guilherme Ferreira Simiqueli; Marcos Deon Vilela de Resende
Journal: Front Plant Sci Date: 2021-12-08 Impact factor: 5.753

5 in total