Accuracy of estimation of genomic breeding values in pigs using low-density genotypes and imputation.

Yvonne M Badke, Ronald O Bates, Catherine W Ernst, Justin Fix, Juan P Steibel.

Abstract

Genomic selection has the potential to increase genetic progress. Genotype imputation of high-density single-nucleotide polymorphism (SNP) genotypes can improve the cost efficiency of genomic breeding value (GEBV) prediction for pig breeding. Consequently, the objectives of this work were to: (1) estimate accuracy of genomic evaluation and GEBV for three traits in a Yorkshire population and (2) quantify the loss of accuracy of genomic evaluation and GEBV when genotypes were imputed under two scenarios: a high-cost, high-accuracy scenario in which only selection candidates were imputed from a low-density platform and a low-cost, low-accuracy scenario in which all animals were imputed using a small reference panel of haplotypes. Phenotypes and genotypes obtained with the PorcineSNP60 BeadChip were available for 983 Yorkshire boars. Genotypes of selection candidates were masked and imputed using tagSNP in the GeneSeek Genomic Profiler (10K). Imputation was performed with BEAGLE using 128 or 1800 haplotypes as reference panels. GEBV were obtained through an animal-centric ridge regression model using de-regressed breeding values as response variables. Accuracy of genomic evaluation was estimated as the correlation between estimated breeding values and GEBV in a 10-fold cross validation design. Accuracy of genomic evaluation using observed genotypes was high for all traits (0.65-0.68). Using genotypes imputed from a large reference panel (accuracy: R(2) = 0.95) for genomic evaluation did not significantly decrease accuracy, whereas a scenario with genotypes imputed from a small reference panel (R(2) = 0.88) did show a significant decrease in accuracy. Genomic evaluation based on imputed genotypes in selection candidates can be implemented at a fraction of the cost of a genomic evaluation using observed genotypes and still yield virtually the same accuracy. 
On the other hand, using a very small reference panel of haplotypes to impute training animals and candidates for selection results in lower accuracy of genomic evaluation.

Keywords: GenPred; genomic selection; genotype imputation; shared data resources; swine

Year: 2014 PMID: 24531728 PMCID: PMC4059235 DOI: 10.1534/g3.114.010504

Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154

Genetic improvement through breeding for lean growth, reproductive performance, meat quality, and health traits is an important tool in the pig-breeding industry to assure its continued competitiveness and success. Traditional estimated breeding values (EBVs) derived from pedigree information have resulted in continuous genetic improvement but have several limitations (Dekkers ). Notably, some important phenotypes are difficult and expensive to observe, impairing estimation of accurate EBV. The use of genomic breeding values (GEBVs), estimated using a large number of genetic markers across the genome, is expected to overcome a number of those limitations (Meuwissen ; Dekkers ) and allow for the selection of animals at a young age, thereby shortening generation intervals (Hayes ; Vanraden ; Wiggans ). Several papers have reported the progress and success of genomic selection in dairy cattle (Hayes ; VanRaden ; Wiggans ), and it is expected to be equally useful in pigs (Tribout ). High-density genotypes in pigs can be obtained from the PorcineSNP60 BeadChip (Illumina, San Diego, CA) containing roughly 62K single-nucleotide polymorphisms (SNPs) (Ramos ). First implementations of genomic prediction in pigs included evaluations for total number of pigs born in a litter and percent stillborn (Cleveland ). The results of this study indicated that GEBV in pigs can reach accuracies comparable with those observed in dairy cattle if the training population is large enough (Cleveland ). In addition, several strategies to increase cost efficiency through the use of low-density genotypes have been explored, but the accuracy of GEBV was reasonable only for certain traits, likely due to differences in the genetic architecture of the traits (Cleveland ). However, when genotypes were imputed with high accuracy, results for genomic evaluation were promising for several traits in a commercial pig population (Cleveland and Hickey 2013). 
A question that was not investigated in those papers and that we want to answer in this study is how different imputation scenarios (of varying cost and accuracy) translate into accuracy of genomic predictions. The posed question is important because the relatively high genotyping cost per animal currently limits the widespread commercial use of high-density genotypes for genomic selection purposes in pigs. One strategy to improve the cost efficiency of genotyping schemes is the use of genotype imputation for a portion of the population. In the interest of cost efficiency, it is likely that selection candidates will not be genotyped using a high-density array such as the PorcineSNP60 but rather will be genotyped on a low-density array like the recently released GeneSeek Genomic Profiler for Porcine LD (GGP-Porcine: GeneSeek Inc., a Neogen Co., Lincoln, NE), a subset of the PorcineSNP60 containing roughly 10K SNP. We showed (Badke ) that genotypes in pigs can be imputed from the GGP-Porcine to the PorcineSNP60 with accuracy of R2 = 0.88 using linkage disequilibrium (LD)-based imputation algorithms with a small reference panel of haplotypes (N = 128 haplotypes). We also showed that imputation accuracy can be further improved by adding animals to the reference panel (Badke ), or in case of a pedigreed population, by exploiting Mendelian segregation and population-wide LD (Huang ; Gualdrón Duarte ). In this paper, we use genotypes imputed based on population wide LD, offering a strategy that can be applied universally in any population, for which a suitable reference panel can be assembled. Our objective was to estimate the accuracy of genomic evaluation using observed or imputed genotypes. 
Moreover, we consider two contrasting imputation scenarios: (a) a higher-cost and high-accuracy scenario in which high-density genotypes from training animals and from a reference panel are used to impute genotypes in candidates for selection and (b) a low-cost and lower-accuracy scenario in which a small reference panel of high-density haplotypes is used to impute genotypes in training animals and candidates for selection.

Materials and Methods

Materials

Animals and genotypes:

Data used in this study were collected from 983 Yorkshire sires. A pedigree of 4092 individuals spanning 22 generations and including all 983 sires and their registered ancestors was available from the National Swine Registry (NSR). Of 983 genotyped sires, 575 had their sire genotyped as well, 341 had a grand sire, and 597 animals had at least one half sib among the 983 animals. The number of full sibs was much lower, and only 110 sires had a full sib genotyped. Details on these quantities can be found in Supporting Information, Figure S1. High-density genotypes for these animals were obtained from samples provided by the NSR. Genotyping was performed at a commercial laboratory (GeneSeek) using the Illumina PorcineSNP60 BeadChip. The same dataset was previously used to assess the effect of genotype imputation (Badke ) and is publicly available at: https://www.msu.edu/~steibelj/JP_files/imputation.html. Animal protocols were approved by the Michigan State University All University Committee on Animal Use and Care (AUF# 03/09-046-00). Genotyping rate of at least 90% of both animals and SNP and a minor allele frequency (MAF) of at least 5% were required for genotypes to be included in the analysis, leaving a total of 41,248 markers in 983 animals. SNPs that were not assigned to an autosomal position in map build 10.2 were excluded from the analysis. It was our goal to estimate the GEBV of male offspring of a sire and since sires will not pass an X chromosome to their male offspring, these SNP do not contribute to the sons’ GEBV (VanRaden ). In addition to genotypes for 983 Yorkshire sires, a set of 128 Yorkshire haplotypes was available as a reference panel for genotype imputation from a previous study (Badke ). These haplotypes are also freely available at https://www.msu.edu/~steibelj/JP_files/LD_estimate.html, and details on the design and phasing can be found in Badke .
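The quality-control thresholds described above (call rate of at least 90% for animals and SNP, MAF of at least 5%) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the coding of missing genotypes as None, and the order of the filters (animals first, then SNP) are assumptions made for illustration only.

```python
def call_rate(values):
    """Fraction of non-missing entries; missing genotypes are coded as None (assumption)."""
    return sum(v is not None for v in values) / len(values)


def minor_allele_freq(column):
    """Minor allele frequency from allele counts (0, 1, 2), ignoring missing calls."""
    observed = [g for g in column if g is not None]
    p = sum(observed) / (2 * len(observed))
    return min(p, 1 - p)


def qc_filter(genotypes, min_call=0.90, min_maf=0.05):
    """Return (kept animal indices, kept SNP indices) for an animals x SNP matrix.

    Animals are filtered before SNP; this ordering is an assumption.
    """
    animals = [i for i, row in enumerate(genotypes) if call_rate(row) >= min_call]
    snps = []
    for j in range(len(genotypes[0])):
        column = [genotypes[i][j] for i in animals]
        # The call-rate check runs first, so minor_allele_freq never sees an all-missing column.
        if call_rate(column) >= min_call and minor_allele_freq(column) >= min_maf:
            snps.append(j)
    return animals, snps


# Toy data: 4 animals x 4 SNP (hypothetical values).
toy = [
    [0, 1, 2, 0],
    [1, 1, 2, 0],
    [2, 0, 2, 0],
    [None, None, None, 0],  # call rate 0.25, so this animal is dropped
]
kept_animals, kept_snps = qc_filter(toy)
print(kept_animals, kept_snps)
```

In the toy data the last two SNP are monomorphic among the retained animals and are therefore removed by the MAF filter.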

Phenotypes:

For every animal and their parents, EBVs and accuracies were obtained for three traits from NSR through their traditional genetic evaluation. These traits were: backfat thickness (BF), number of days to 250 lb (D250), and loin muscle area (LEA). Descriptive statistics of EBV and accuracies are presented in Table 1. All code and data used in this paper have been assembled into an R package, accessible at: http://tinyurl.com/MSURGEBV.

Table 1

Descriptive statistics of EBVs

	BF	D250	LEA
Mean EBV	−0.03	4.57	0.61
Mean r²_EBV^a	0.74	0.67	0.75
N^b	965	936	938
h²	0.45	0.26	0.47

EBV, estimated breeding values; BF, backfat thickness; D250, number of days to 250 lb; LEA, loin muscle area.

^a Average reliability of EBV.

^b Number of animals with usable EBV.

Methods

De-regression of breeding values:

De-regressed breeding values (dEBVs) were used as response variables throughout the analysis. We computed individual animal dEBVs and their weights (w) with the parent average removed by following the procedure outlined by Garrick . We discarded records with a negative weight. The weight of an animal will only be below 0 if the unknown information content on this particular animal and its offspring is below 0, such that there is no individual information observed. This would be the case in a young animal, where all observed information came from ancestors and parents of this animal. To avoid double counting, these animals were eliminated from the analysis because they did not contribute individual information. After de-regression and filtering a total of 965, 936, and 938 animals remained for the traits BF, D250, and LEA, respectively.

Estimation of genomic relationship matrix:

The genomic relationship matrix was estimated from observed or imputed high-density (~41K) SNP genotypes. Genotypes were expressed as allelic dosage, which is the number of copies of the minor allele, such that genotypes were entered into a marker matrix W as a decimal number in the interval [0, 2]. We obtained matrix Z by subtracting twice the frequency of the minor allele ($p_j$) from the columns of W (VanRaden 2008). The genomic relationship matrix was then calculated as:

$$G = \frac{ZZ'}{2\sum_{j} p_j (1 - p_j)} \qquad (1)$$

where $2\sum_{j} p_j (1 - p_j)$ is a normalizing constant (Wang ) that sums expected variances across markers and scales G toward the numerator relationship matrix (VanRaden 2008). The allele frequency $p_j$ was obtained using all available animals (N = 983). Average relatedness between animals was obtained from the row/column vectors of G. We quantified relatedness in this study as the average of the top 10 relationships observed within the G matrix (rel10). The choice of the top 10, as opposed to another number, is arbitrary but driven by the fact that each animal had a very limited number of close and distant relatives in the training set (Figure S1). Moreover, other studies have used this measure and proposed its inclusion in future work on genomic selection to promote comparability (Daetwyler ).
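A minimal sketch of the construction described above, assuming the VanRaden (2008) form G = ZZ'/(2 Σ p(1 − p)) with allele frequencies taken from the same animals. The function names and the toy dosage matrix are hypothetical, and a production implementation would use a linear algebra library rather than nested lists.

```python
def genomic_relationship(W):
    """G from an animals x SNP dosage matrix W (values in [0, 2])."""
    n, m = len(W), len(W[0])
    # Allele frequency per SNP, estimated from all animals in W.
    p = [sum(W[i][j] for i in range(n)) / (2 * n) for j in range(m)]
    # Center each column by twice the allele frequency.
    Z = [[W[i][j] - 2 * p[j] for j in range(m)] for i in range(n)]
    # Normalizing constant: sum of expected marker variances.
    k = 2 * sum(pj * (1 - pj) for pj in p)
    return [[sum(Z[i][s] * Z[j][s] for s in range(m)) / k for j in range(n)]
            for i in range(n)]


def top_relatedness(G, i, others, top=10):
    """Mean of the `top` largest relationships between animal i and `others`."""
    rels = sorted((G[i][j] for j in others if j != i), reverse=True)
    best = rels[:top]
    return sum(best) / len(best)


# Toy dosage matrix: 4 animals x 3 SNP (hypothetical values).
W = [
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 1],
    [1, 2, 1],
]
G = genomic_relationship(W)
rel_top2 = top_relatedness(G, 0, range(4), top=2)
print(G[0][0], rel_top2)
```

Because the columns of Z are centered with frequencies from the same animals, every row of G sums to zero in this sketch, which is a quick sanity check on the centering step.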

Implementation of prediction model:

Using the genomic relationship matrix from equation (1), an animal-centric model for genomic evaluations can be written as:

$$y = 1\mu + a + e \qquad (2)$$

where y is the vector of dEBV, μ is the overall mean, a is the vector of n animal effects, $a \sim N(0, G\sigma^2_a)$, and e is a vector of random residuals, $e \sim N(0, R\sigma^2_e)$. The variance of the dEBV is $\mathrm{Var}(y) = G\sigma^2_a + R\sigma^2_e$, where R is a diagonal matrix with diagonal elements $1/w_i$, the inverse of the weights of the dEBV (VanRaden ). Equivalently, the information in G can also be included in the incidence matrix of the animal effects a as follows (Vazquez ):

$$y = 1\mu + Ca^* + e \qquad (3)$$

where C is the Cholesky factor of G, such that G = CC′, μ is the overall mean, a* is the vector of animal effects with $a^* \sim N(0, I\sigma^2_a)$, noticing that a = Ca*, and e is a vector of residual effects such that $e \sim N(0, R\sigma^2_e)$. The variance terms for models (2) and (3) are equal, such that the two models are in fact equivalent if variance components are assumed known. Likewise, when estimating the parameters under these two models, we found virtually identical results, but model (3) was computationally more efficient, resulting in a twofold reduction in compute time (results not shown). The BLR package (Pérez ) in R (R Development Core Team 2011) was used to fit the mixed model equations. Model parameters $\sigma^2_a$ and $\sigma^2_e$ were sampled from their corresponding full conditional distributions using a Gibbs sampler. Prior distributions were elicited based on equations presented by Pérez . The prior distributions of $\sigma^2_a$ and $\sigma^2_e$ were inverse χ² distributions with degrees of freedom df and scale S. To ensure proper priors with finite expectations, we set df = 3. The scale parameters were obtained as a function of the df and assumed values of the genetic variance ($V_a$) and error variance ($V_e$) (Pérez ); the scaling term that depends on the average inbreeding coefficient was set equal to 1 in this case, assuming no inbreeding. Heritability was assumed to be h² = 0.5, such that after the value for $V_e$ was arbitrarily set to 0.4, $V_a$ was obtained as $V_a = h^2 V_e / (1 - h^2) = 0.4$.
The Gibbs sampler implemented in BLR (Pérez ) was used to obtain a total of 100,000 samples, 10,000 of which were discarded as burn-in. The reported estimates of , , animal effects (a*), and GEBV were based on the posterior means of the remaining 90,000 iterations. We assessed convergence of the Markov chain Monte Carlo method as well as sensitivity to priors to ensure robustness of estimates to priors (results not shown).
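The equivalence between the two parameterizations rests on G = CC′ and a = Ca*. The sketch below illustrates this with a hand-written Cholesky factorization on a small, hypothetical positive definite G. It is not the BLR implementation; note also that a G centered with allele frequencies from the same animals can be singular, in which case a small value may need to be added to the diagonal before factorization (an assumption, not a step stated in the text).

```python
import math


def cholesky(G):
    """Lower-triangular C with G = C C' (G must be symmetric positive definite)."""
    n = len(G)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(C[i][k] * C[j][k] for k in range(j))
            if i == j:
                C[i][j] = math.sqrt(G[i][i] - s)
            else:
                C[i][j] = (G[i][j] - s) / C[j][j]
    return C


def matvec(M, v):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


# Hypothetical relationship matrix for three animals.
G = [
    [1.00, 0.50, 0.25],
    [0.50, 1.00, 0.50],
    [0.25, 0.50, 1.00],
]
C = cholesky(G)

# Rebuild G from its factor to confirm G = C C'.
n = len(G)
G_rebuilt = [[sum(C[i][k] * C[j][k] for k in range(n)) for j in range(n)] for i in range(n)]
max_error = max(abs(G[i][j] - G_rebuilt[i][j]) for i in range(n) for j in range(n))

# Effects on the a* scale map back to the animal scale through a = C a*.
a_star = [1.0, -1.0, 0.5]
a = matvec(C, a_star)
print(max_error, a)
```

Since Var(Ca*) = C I C′ σ²_a = G σ²_a, effects fitted on the a* scale carry the same covariance structure as the animal effects in the first parameterization.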

Genomic prediction under cross-validation

Accuracy of genomic evaluation was estimated in a 10-fold cross-validation design. Approximately 10% of the animals were randomly assigned to a validation panel (V) in which predictions would be made, whereas the remaining 90% were used as the training panel (T) to estimate the parameters necessary for prediction. A total of 10 separate datasets were created such that each animal would be used for validation once. Across cross-validation datasets we fit model (3) to the training animals; we refer to that subset by adding a subindex T:

$$y_T = 1\mu + C_{TT} a^*_T + e_T \qquad (4)$$

to estimate the BLUP of $a^*_T$ (VanRaden ). The matrices G and C are partitioned into block structure such that

$$G = \begin{bmatrix} G_{TT} & G_{TV} \\ G_{VT} & G_{VV} \end{bmatrix}, \qquad C = \begin{bmatrix} C_{TT} & 0 \\ C_{VT} & C_{VV} \end{bmatrix}$$

The relation between the BLUP for a based on model (2) and the BLUP for a* based on model (3) can be expressed as:

$$\hat{a} = C\hat{a}^*$$

The GEBVs of training animals in model (2) were computed as:

$$\hat{a}_T = C_{TT}\hat{a}^*_T$$

Subsequently, the GEBVs of the validation animals were estimated from $\hat{a}_T$ using the following equation:

$$\hat{a}_V = G_{VT} G_{TT}^{-1} \hat{a}_T$$

where all estimates are obtained using model (4), which is equivalent to applying model (3) to the training animals.
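The cross-validation layout and the prediction step for validation animals can be sketched as follows. This is a minimal illustration, not the authors' R code: it assumes the standard form in which validation GEBV are obtained as G_VT G_TT⁻¹ times the training GEBV, and the function names, the random seed, and the toy values are hypothetical.

```python
import random


def assign_folds(n_animals, k=10, seed=1):
    """Randomly split animal indices into k folds so each animal is validated once."""
    idx = list(range(n_animals))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for q in range(c, n + 1):
                M[r][q] -= f * M[c][q]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][q] * x[q] for q in range(r + 1, n))) / M[r][r]
    return x


def predict_validation(G, train, valid, a_hat_train):
    """GEBV of validation animals: G_VT G_TT^{-1} a_hat_T (assumed form)."""
    G_TT = [[G[i][j] for j in train] for i in train]
    x = solve(G_TT, a_hat_train)  # x = G_TT^{-1} a_hat_T
    return [sum(G[v][t] * x[pos] for pos, t in enumerate(train)) for v in valid]


folds = assign_folds(965, k=10)

# Hypothetical relationship matrix: animals 0 and 1 train, animal 2 is validated.
G = [
    [1.00, 0.50, 0.25],
    [0.50, 1.00, 0.50],
    [0.25, 0.50, 1.00],
]
gebv_valid = predict_validation(G, train=[0, 1], valid=[2], a_hat_train=[1.0, 2.0])
print([len(f) for f in folds], gebv_valid)
```

With 965 animals, the folds contain 96 or 97 animals each, and every animal appears in exactly one validation panel.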

Estimation of accuracy:

Accuracy of genomic evaluation is the correlation between the estimated GEBV and the true breeding values (TBVs) (Hayes ). However, the TBVs are unknown. Consequently, the accuracy of genomic evaluation has to be approximated using the available information. Hayes proposed to express the correlation between GEBV and TBV as a function of the correlation between GEBV and EBV:

$$r_{GEBV,TBV} = \frac{r_{GEBV,EBV}}{\sqrt{r^2_{EBV}}}$$

where $r^2_{EBV}$ is the estimated reliability of the EBV. VanRaden replaced $r^2_{EBV}$ with the arithmetic mean of the reliability of the EBV. Daetwyler proposed to report a simple Pearson correlation coefficient between GEBV and EBV to allow for comparability of results across studies. We estimate accuracy of genomic evaluation as the Pearson correlation coefficient between GEBV and EBV ($r_{EBV,GEBV}$) and, to facilitate such comparison, the Pearson correlation coefficient adjusted for the average accuracy of the EBV ($r_{EBV,GEBV}/\bar{r}_{EBV}$). Accuracies of individual GEBV were obtained analogous to the accuracy of EBV in an animal model (Goddard ) through inversion of the mixed model equations (Mrode 2005; VanRaden 2008; VanRaden ; Strandén and Garrick 2009; Clark ). The accuracy of the GEBV of validation animal i under model (2) can be expressed as (Mrode 2005; Strandén and Garrick 2009; Clark ):

$$r_{GEBV_i} = \sqrt{\frac{g_{iT}\left(G_{TT} + R\,\sigma^2_e/\sigma^2_a\right)^{-1} g_{Ti}}{g_{ii}}} \qquad (10)$$

where $g_{iT}$ is the row of $G_{VT}$ for validation animal i, $g_{Ti}$ is its transpose, and $g_{ii}$ is the corresponding diagonal element of $G_{VV}$. This equation and its derivation can be found in Strandén and Garrick (2009) and VanRaden (2008), and it was used to estimate the accuracy of individual GEBV for validation animals.
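The two population-level accuracy measures can be sketched as below: a Pearson correlation between EBV and GEBV, and the same correlation divided by the mean EBV accuracy. The function names and the toy vectors are hypothetical; the numeric check uses the BF values reported in Table 2 (0.6810 and 0.8510).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def adjusted_accuracy(r_ebv_gebv, mean_ebv_accuracy):
    """Correlation between EBV and GEBV scaled by the average accuracy of the EBV."""
    return r_ebv_gebv / mean_ebv_accuracy


# Hypothetical EBV and GEBV for four validation animals.
ebv = [1.0, 2.0, 3.0, 4.0]
gebv = [2.1, 3.9, 6.2, 7.8]
r_toy = pearson(ebv, gebv)

# BF, scenario 1: values taken from Table 2.
r_adj_bf = adjusted_accuracy(0.6810, 0.8510)
print(round(r_toy, 4), round(r_adj_bf, 2))
```

The ratio computed from the table averages is about 0.80, close to the tabulated 0.7998; a small difference is expected if the adjustment is applied within each cross-validation dataset before averaging, which is an assumption about how the table was produced.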

Genotype imputation:

LD-based genotype imputation was performed with BEAGLE version 3.3.1 (Browning and Browning 2009). We used the standard settings for BEAGLE: 10 iterations of the phasing algorithm, drawing four samples per iteration. Previous results from our group (Badke ) and other studies (Hayes ) showed negligible improvement in imputation accuracy as a result of an increase in iterations or samples per iteration. The SNP on the 10K chip [6890 SNP after filtering for minor allele frequency (MAF) and missing rate] were used as tagSNP to impute the 60K SNP (41,248 after filtering). We implemented two separate imputation experiments that differ in the size of the high-density reference panel used for imputation: (1) a reference panel of 128 Yorkshire haplotypes or (2) a reference panel combining the 128 Yorkshire haplotypes with the haplotypes of all animals that are part of the training panel (~1700 additional haplotypes) in the respective cross-validation dataset. To assess the effect of genotype imputation on genomic prediction we considered the following four scenarios: (1) the reference scenario in which genomic evaluation was based on observed genotypes in training and validation animals, (2) genomic evaluation based on observed genotypes in the training animals and genotypes imputed from a large reference panel (~1800 haplotypes) in the validation animals, (3) genomic evaluation based on observed genotypes in the training animals and genotypes imputed from a small reference panel (128 haplotypes) in the validation animals, and (4) genomic evaluation based on imputed genotypes in training and validation animals using a small (128 haplotypes) but representative reference panel for imputation. All genotype imputation and subsequent estimation of imputation accuracy was implemented using the R package impute.R (Badke ). 
To compare average accuracy of genomic evaluation across these four scenarios, we fitted a linear model with the average accuracy of genomic evaluation as response variable and the genotype imputation scenario as independent variable, adding the effect of the random cross-validation dataset in which accuracy of genomic evaluation was estimated as a random blocking factor.
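As an illustration of how imputation accuracy can be scored per animal, the sketch below computes R² as the squared correlation between observed and imputed allelic dosage across SNP. It is a stand-alone example with hypothetical names and toy dosages; it does not reproduce the impute.R package or call BEAGLE.

```python
import math


def imputation_r2(observed, imputed):
    """Squared Pearson correlation between observed and imputed allelic dosage."""
    n = len(observed)
    mo, mi = sum(observed) / n, sum(imputed) / n
    cov = sum((o - mo) * (i - mi) for o, i in zip(observed, imputed))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_i = sum((i - mi) ** 2 for i in imputed)
    return cov * cov / (var_o * var_i)


# One animal, six masked SNP (hypothetical dosages in [0, 2]).
observed = [0, 1, 2, 1, 0, 2]
imputed = [0.1, 0.9, 1.8, 1.2, 0.0, 2.0]

r2_animal = imputation_r2(observed, imputed)
r2_perfect = imputation_r2(observed, [float(g) for g in observed])
print(round(r2_animal, 3), r2_perfect)
```

Averaging this quantity over animals gives a panel-level summary comparable in spirit to the R² values reported for the small and large reference panels.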

Results

Accuracy of genomic evaluation and GEBV using observed genotypes

When genotypes were observed in both training and prediction animals, the accuracy of genomic evaluation, measured as the weighted mean of the Pearson correlation coefficient between EBV and predicted GEBV across 10 cross-validation datasets, was 0.68, 0.66, and 0.65 for BF, D250, and LEA, respectively (Table 2). When the measure of accuracy was adjusted for the average reliability of the EBV of the training animals, the observed accuracy of genomic evaluation was 0.80, 0.82, and 0.76 for BF, D250, and LEA, respectively (Table 2).

Table 2

Estimates of accuracy for genomic evaluation and individual GEBV across imputation scenarios

Trait	Scenario^a	Imputation accuracy^b	r_EBV,GEBV^c	Mean r_EBV^d	r_EBV,GEBV / mean r_EBV	Mean r_GEBV	HPD^e
BF	1	(1, 1)	0.6810¹	0.8510	0.7998	0.6852	[0.5395, 0.8211]
	2	(1, 0.95)	0.6795¹		0.7981	0.6861	[0.5467, 0.8164]
	3	(0.88, 0.88)	0.6598²		0.7749	0.7014	[0.5727, 0.8267]
	4^f	(1, 1)	0.7210		0.8405	0.8560	[0.8174, 0.8768]
D250	1	(1, 1)	0.6603¹	0.8020	0.8229	0.6575	[0.5073, 0.7948]
	2	(1, 0.95)	0.6555¹,²		0.8170	0.6585	[0.5187, 0.7962]
	3	(0.88, 0.88)	0.6463²		0.8054	0.6750	[0.5345, 0.7985]
	4^f	(1, 1)	0.5354		0.6550	0.8438	[0.8048, 0.8704]
LEA	1	(1, 1)	0.6516¹	0.8529	0.7639	0.6859	[0.5386, 0.8325]
	2	(1, 0.95)	0.6491¹		0.7610	0.6868	[0.5377, 0.8214]
	3	(0.88, 0.88)	0.6364²		0.7461	0.7040	[0.5667, 0.8330]
	4^f	(1, 1)	0.7165		0.8201	0.8549	[0.8223, 0.8787]

GEBV, genomic breeding value; EBV, estimated breeding values; HPD, highest posterior density; BF, backfat thickness; D250, number of days to 250 lb; LEA, loin muscle area.

^a Scenarios 1: all observed genotypes, 2: genotypes in prediction animals imputed with large reference haplotype panel (~1800), 3: genotypes in prediction animals imputed with small haplotype reference panel (128), and 4: validation animals with at least one close relative in the reference panel.

^b Accuracy of genotype imputation R² for training and validation animals: (R²_T, R²_V).

^c Tukey honest significant difference post-hoc comparison of accuracy of genomic evaluation across imputation scenarios.

^d Average accuracy of EBV in the validation panel.

^e 95% HPD interval of GEBV accuracy across validation animals.

^f Scenario with young animals in the validation panel that almost all have at least one close relative in the training panel.

¹,² Means with different superscript differ significantly according to Tukey post-hoc tests with α = 0.05.

We observed a significant difference between the estimates of accuracy of genomic evaluation across 10 randomly assigned cross-validation datasets for three traits (Table 3). That variation across cross-validation datasets was partially explained by a significant effect of the average EBV accuracy of validation animals on accuracy of genomic evaluation (Table 3) in three traits and a significant effect of top 10 relatedness on accuracy of genomic evaluation in D250. In general, D250 had slightly lower average EBV accuracy due to an increased frequency of EBV with intermediate accuracy (r close to 0.6, Figure S2). As expected, this resulted in slightly lower correlation of EBV and GEBV because the ‘true value’ (EBV) is subject to more uncertainty. Another source of difference of accuracy of genomic evaluation across cross-validation datasets could be the population structure. This would be revealed through differences in estimated variance components. We did not expect differences in variance components estimated from randomly assigned validation datasets. We confirmed this assumption by studying the distribution of estimated heritability and included the obtained results in Figure S3. We observed that the posterior distributions of heritabilities did not change across folds. Conversely, in the presence of population structure, the relationships of animals of different cross-validation datasets will change (depending on who else is in the training set), and we expect that to affect the estimate of heritability.

Table 3

Significance of variables affecting accuracy of genomic evaluation

	Dataset^a		rel10^b		Mean r_EBV^c	
Trait	F^d	P	F^e	P	F^e	P
BF	258	< 0.001	2.83	0.1013	11.73	0.0016**
D250	229	< 0.001	5.18	0.0291*	7.238	0.0109*
LEA	311	< 0.001	2.06	0.1605	3.430	0.0725

EBV, estimated breeding values; BF, backfat thickness; D250, number of days to 250 lb; LEA, loin muscle area.

^a Accuracy of genomic evaluation was estimated for a total of 10 randomly assigned datasets of the cross-validation, such that we could assess whether accuracy of genomic evaluation was significantly different across these 10 datasets.

^b Accuracy of genomic evaluation by average of the top 10 genomic relationship estimates of animals in the validation set.

^c Accuracy of genomic evaluation by average accuracy of EBV of validation animals by cross-validation dataset.

^d df = c(9, 27).

^e df = c(1, 35).

*P < 0.05, **P < 0.01.

The average accuracy of the genomic evaluation and the assessment of the accuracy of individual GEBV using equation 10 are equally important in a practical implementation of genomic selection. Average accuracy of individual GEBV was 0.69, 0.66, and 0.69 for BF, D250, and LEA, respectively, with a 95% highest posterior density interval ranging from roughly 0.51 to 0.80 across all traits (Table 2). As can be seen in Figure 1, the accuracy of GEBV ($r_{GEBV}$) and the accuracy of EBV ($r_{EBV}$) are not linearly related. The accuracy of EBV was higher than the estimated accuracy of GEBV for most animals in three traits, especially when $r_{EBV}$ > 0.8. For a few animals with $r_{EBV}$ between 0.4 and 0.8, the accuracy of GEBV was higher than their respective EBV accuracy. Hypothetically, individual differences in $r_{GEBV}$ can be explained by the presence or absence of relatives of the predicted animal in the training set (Clark ; Pérez-Cabal ). We investigated this assertion in two ways: (1) by computing average $r_{GEBV}$ for animals with different numbers of relatives in the training panel and (2) by regressing $r_{GEBV}$ on the average top 10 relatedness in the genomic relationship matrix. Following Pérez-Cabal , we defined close relatives as sires and full sibs and distant relatives as maternal grand sires and half sibs.

We found that increasing the number of close relatives from one to four in the training panel increased average $r_{GEBV}$ by about 0.1 (Figure 2) across the three traits in this study (from an average of $\bar{r}_{GEBV}$ = 0.63 to $\bar{r}_{GEBV}$ = 0.73 regardless of the trait considered). The presence of distant relatives in the training set also resulted in an increase of $r_{GEBV}$ of similar magnitude when comparing individuals without any distant relative to individuals with at least five distant relatives in the training set (Figure 2). A similar relationship was observed when comparing $r_{GEBV}$ with the average relationship to the 10 most-related individuals in the training set. We observed an almost linear increase in $r_{GEBV}$ as top 10 relatedness increased (Figure 2), which was statistically significant (P < 0.01).

To further investigate the effect of relatedness between training and validation animals, we selected the youngest 87 animals (approximately 10% of the population), which included 82 animals with at least a sire or a grand sire in the training panel. We repeated genomic evaluation with this validation panel and estimated the accuracy of GEBV. As expected, average accuracy of GEBV for this validation panel was higher than the average observed across the cross-validation datasets with 0.72 for BF and LEA, and 0.54 for D250. However, when looking at the range of accuracies observed for all 10 cross-validation datasets, these values do not exceed the maximum accuracy observed. One interesting finding was that estimates of individual accuracy, or accuracy of GEBV predicted through the genomic relationship matrix, were much larger than the observed accuracy of genomic evaluation in all three traits (Table 2). Goddard proposed to use this measure of accuracy of individual GEBV when using them for selection but also to screen for animals whose GEBV could be expected to be highly accurate. Our results show that while it is true that individuals with close relatives in the training panel will have on average more accurate GEBV, the individual accuracies obtained from the G matrix would be overestimated.

Figure 1

Accuracy of GEBV by observed accuracy of EBV for (A) BF, (B) D250, and (C) LEA: r_GEBV in relation to the animal's r_EBV, with the 1-1 line (green line) and a loess smoother (red line), which is a local weighted mean of r_GEBV. GEBV, genomic breeding value; EBV, estimated breeding value; BF, backfat thickness; D250, number of days to 250 lb; LEA, loin muscle area.

Figure 2

Accuracy of GEBV by average top 10 relatedness between the individual and training panel for (A) BF, (B) D250, and (C) LEA: r_GEBV in relation to the animal's rel10, with a loess smoother (red line), which is a local weighted mean of r_GEBV. GEBV, genomic breeding value; BF, backfat thickness; D250, number of days to 250 lb; LEA, loin muscle area.

Effect of genotype imputation on accuracy of genomic evaluation and GEBV

Accuracy of imputation (R2) for each animal was measured as the squared correlation between the observed and imputed allelic dosage across all SNP (Badke ). Average accuracy of imputation was R2 = 0.88 for the scenario using a small (128) haplotype reference panel, and it increased to R2 = 0.95, when a larger reference panel (~ 1800 haplotypes) was used. In our previous study (Badke ), we found that increasing the size of the reference panel led to an improved imputation, especially of SNP that appear difficult to impute, such as SNP with low (<0.1) MAF and those located in the chromosomal extremes. These results were repeated in this study (Figure S4). For BF we found that the average accuracy of genomic evaluation under scenario 2 (r, = 0.68), where genotypes in the validation animals were imputed with high accuracy (R2 = 0.95), was not significantly different from the accuracy (r, = 0.68) estimated in the reference scenario, where all genotypes were observed. However, average accuracy of genomic evaluation was significantly lower (r, = 0.66), when genotypes were imputed in both training and validation with lower accuracy (R2 = 0.88 using a small reference panel of haplotypes (scenario 3). For D250, there was no significant difference in accuracy of genomic evaluation between the reference scenario (r, = 0.66) and the scenario where genotypes were imputed in the validation animals (Table 2). However, when genotypes were imputed in both training and validation (scenario 3), the accuracy of genomic selection was significantly lower (r, = 0.65). For LEA there was also no difference in accuracy of genomic evaluation between the reference scenario (r, = 0.65) and scenario 2 (r, = 0.65). There was a significant decrease in accuracy of genomic evaluation when genotypes were imputed with lower accuracy (R2 = 0.88) in scenario 3 (r, = 0.63). 
To assess the effect of genotype imputation on the results of a genomic evaluation, we compared the top 5% sires (n = 46), ranked by their estimated GEBV across imputation scenarios. Again, scenario 1 was used as a reference scenario to compare how many of the top 5% ranked animals were also top ranked under the imputation scenarios. The proportion of top 5% ranked sires that were conserved when genotypes were imputed in validation animals with high accuracy (scenario 2) was 0.96 for BF and 0.98 for D250 and LEA. When genotypes were imputed with low accuracy in training and validation, the proportion of top 5% sires conserved in comparison with the reference design showed a small decrease compared with the design with only validation animals imputed for BF (0.88) and for D250 (0.89), and a more substantial decrease for LEA (0.81). Accuracy of individual GEBV is estimated using the genomic relatedness between training and validation animals. Using genotypes imputed with high accuracy (R2 = 0.95) the estimated r remained constant in all traits, compared with estimates obtained from observed genotypes. Accuracy of imputation was correlated with r (Figure S5). However, this does not imply that high imputation accuracy caused an increase in r. Another possibility is that genotypes from animals with relatives in the reference panel will be imputed with high accuracy and their GEBV will also be predicted more accurately. We believe that this was the case for our population because the correlation between GEBV and EBV did not differ significantly when imputation was used (Table 2, compare scenario 1 and 2). Moreover, when genotypes were imputed with less accuracy (R2 = 0.88), the observed accuracy of GEBV was increased even with respect to the reference scenario (Table 2, compare scenario 3 to 1 and 2). This result is counterintuitive, and we investigated the reason for this increase. 
Examining the estimation procedure for r, we found that the increase was due to smaller estimates of the diagonal elements of the genomic relationship matrix between the validation elements (G) in the scenario with all imputed genotypes. This arises because animals imputed from a small reference panel look genetically more similar than they really are (they are all imputed toward the haplotype frequencies in the small panel). Those diagonal elements of G were used to scale values of r (equation 10), and smaller values in the denominator resulted in the larger estimates of r we saw for animals in scenario 3. When unscaled values of r were compared, individual accuracy was higher in the reference scenario for all animals.
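The shrinkage of the diagonal of G can be illustrated with a small simulation. The sketch below is not the study's code and does not reproduce equation 10. It assumes the common scaling G = ZZ'/(2Σp(1−p)), with Z the allelic dosages centered at 2p, mimics low-accuracy imputation by uniformly pulling dosages toward 2p with a hypothetical factor `shrink`, and holds allele frequencies fixed.

```python
import numpy as np

def genomic_relationship(dosages, p):
    """G = Z Z' / (2 * sum(p * (1 - p))), with Z = dosages - 2p (assumed scaling)."""
    Z = dosages - 2.0 * p
    return Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))

rng = np.random.default_rng(1)
n_animals, n_snp = 50, 2000
p = rng.uniform(0.1, 0.9, size=n_snp)  # allele frequencies (simulated)
observed = rng.binomial(2, p, size=(n_animals, n_snp)).astype(float)

# Crude stand-in for imputation from a small panel: every dosage is pulled
# toward its expected value 2p, which reduces the variance of the dosages.
shrink = 0.8  # hypothetical shrinkage factor
imputed = 2.0 * p + shrink * (observed - 2.0 * p)

G_obs = genomic_relationship(observed, p)
G_imp = genomic_relationship(imputed, p)

print("mean diagonal, observed:", np.diag(G_obs).mean().round(3))
print("mean diagonal, imputed: ", np.diag(G_imp).mean().round(3))
```

Under this uniform shrinkage every diagonal element is reduced by the factor `shrink**2`, so any accuracy measure that divides by those diagonal elements is inflated accordingly. Real imputation error is not uniform across SNP or animals, so the effect in practice would be less regular.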

Discussion

The size of the training population used to train the prediction equation in this study was small compared with previous genomic evaluations published in swine (Cleveland , 2012), and especially compared with studies applying genomic evaluation in European (Dassonneville ) or US dairy cattle (Weigel ; Wiggans ). Observed accuracy of genomic evaluation in this study was in good agreement with previously published results for genomic evaluation in pigs, which assessed five unspecified commercial traits with comparable heritability (Cleveland ), and with earlier results for two reproductive traits (Cleveland ). Accuracy of genomic evaluation was high across three traits (BF: r = 0.6810; D250: r = 0.6603; LEA: r = 0.6516). In addition, we report accuracy adjusted for the fact that the Pearson correlation between EBV and GEBV will underestimate the true quantity of interest (Luan ). Assessing the variation in accuracy of genomic evaluation across datasets of the cross-validation, we found that the of the validation animals and their relatedness to the training animals were significantly associated with the average accuracy of genomic evaluation. Higher accuracy of genomic evaluation for prediction animals with close relatives in the training population (Habier ; Clark ) and within closely related populations with relatively small effective population size has been previously reported (Daetwyler ). Accuracy of genomic evaluation in this study was high despite the limited number of animals available for training and the inclusion of animals with relatively low EBV accuracy. Furthermore, we obtained accurate genomic predictions using an equivalent model fitting the genomic relationship matrix instead of a marker-based matrix (Hayes ), thereby greatly reducing the computational load.
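The equivalence between a marker-based ridge regression and a model fitted on an animal-by-animal relationship matrix can be checked numerically. The sketch below is a simplified illustration, not the model fitted in the study: it omits fixed effects and weights, uses an unscaled K = ZZ', simulated data, and a hypothetical ridge parameter `lam`. It relies on the identity Z(Z'Z + λI)⁻¹Z'y = ZZ'(ZZ' + λI)⁻¹y.

```python
import numpy as np

rng = np.random.default_rng(2)
n_animals, n_snp = 40, 500
p = rng.uniform(0.1, 0.9, size=n_snp)
# Centered genotype matrix Z (simulated)
Z = rng.binomial(2, p, size=(n_animals, n_snp)).astype(float) - 2.0 * p
y = rng.normal(size=n_animals)  # stand-in response, e.g. de-regressed breeding values
lam = 50.0                      # hypothetical ridge parameter

# Marker-based ridge regression: solves an (n_snp x n_snp) system
beta = np.linalg.solve(Z.T @ Z + lam * np.eye(n_snp), Z.T @ y)
g_marker = Z @ beta

# Animal-centric form: solves an (n_animals x n_animals) system
K = Z @ Z.T
g_animal = K @ np.linalg.solve(K + lam * np.eye(n_animals), y)

print("max abs difference:", np.abs(g_marker - g_animal).max())
```

With many more SNP than animals, the animal-centric system is much smaller, which is the source of the reduced computational load.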
We expect that genomic evaluation in this population and other US swine populations with comparable population structure and LD (Badke ) will be feasible for commercial implementation, and that its accuracy could be further increased through the inclusion of additional training animals with highly accurate EBV. Besides assessing the accuracy of genomic evaluation, we also reported accuracies for individual GEBV. The accuracy of GEBV is important because it can influence selection decisions. Moreover, as proposed by Goddard , r can also be approximated prior to the implementation of genomic evaluation and used to inform the design of genomic selection in a population. The main difference between r and r(EBV, GEBV) is that r(EBV, GEBV) is indicative of the average accuracy of GEBV in a population, whereas r gives a measure of accuracy of each individually estimated GEBV. As expected, we observed that accuracy of GEBV increased with increased relatedness between the animal and the training panel. An interesting finding was that under a low-accuracy imputation scenario, r was overestimated compared with r(EBV, GEBV). We traced this back to the diagonal elements of the genomic relationship matrix and attributed it to an artifact of the imputation using a small reference panel. Several previous studies in other populations and simulation experiments also showed the importance of relatedness for the prediction of accurate GEBV (Habier ; Clark ), especially when the training population was small (Wientjes ), as was the case in our study. In addition, we observed that accuracy of GEBV was higher than accuracy of EBV for only a few animals, most of which had low accuracy of EBV. This finding is consistent with previous reports that implementation of genomic evaluation would be most beneficial for young animals with little information of their own and consequently low accuracy of traditional EBV (VanRaden 2008).
Genotype imputation is an efficient tool to decrease the cost of obtaining high-density genotypes for selection candidates. One of the goals of this study was to quantify the loss in accuracy of genomic evaluation if GEBV were estimated from imputed rather than observed genotypes in selection candidates. Comparing accuracy of genomic evaluation across three scenarios of genotype imputation, we found that for three traits there was no significant loss of accuracy of genomic prediction if genotypes in validation animals were imputed with high accuracy (R² = 0.95) instead of observed. However, accuracy of genomic evaluation decreased in comparison with the reference scenario when genotypes were imputed with lower overall accuracy (R² = 0.88) in both training and prediction animals. Previously published results support that, although it is not feasible to implement genomic prediction based on low-density genotypes (Habier ; Cleveland ), the accuracy of genomic evaluation remains sufficient for practical implementation when genotypes in selection candidates are accurately imputed to high density (Weigel ; Cleveland and Hickey 2013). In addition, several studies support that an increase in imputation accuracy will generate genomic evaluations with nearly identical or even higher accuracy compared with that obtained from observed genotypes (Dassonneville ; Wiggans ; Cleveland and Hickey 2013), because the cost efficiency of low-density genotypes allows a much larger proportion of the population to be included in the genomic evaluation procedure (Wiggans ).
In conclusion, an implementation of genomic selection based on observed genotypes for training of the prediction equation and GEBV predictions obtained from genotypes imputed with high accuracy appears to be a promising approach to provide the swine breeding industry with a cost-efficient procedure to obtain GEBV for animals at a young age. A recent study assessing the accuracy of genomic evaluation using high-density genotypes and various imputation schemes in a commercial pig population further supports these findings (Cleveland and Hickey 2013). We found that accuracy of individual GEBV was a linear function of the relatedness between a validation animal and the respective training set. As has been previously shown in the literature, animals that are highly related to the training population will have higher r (Habier ; Clark ). As shown in the last scenario, however, when all selection candidates had at least one close relative in the training population, r overestimated the accuracy observed for the genomic evaluation (r(EBV, GEBV)). Although this measure certainly has value for ranking animals according to how trustworthy their estimated GEBV are, it is likely overestimated for candidates with close relatives. The other case in which we observed overestimated individual accuracy of GEBV (r) pertains to the last of the imputation scenarios, where genotypes were imputed in training and prediction animals. Specifically, when genotypes were imputed in training and prediction animals with lower accuracy, the average r was larger than the accuracy of genomic evaluation, which we found was an artifact of lower estimates of the diagonal elements of the G matrix. This was caused by a decrease in the variance of the allelic dosage of imputed genotypes due to the relatively small number of reference haplotypes available.
When the variance of imputed allelic dosages was decreased, the deviation from the expected value estimated from MAF (2p) also decreased, causing overall smaller estimates of Z and of the resulting diagonal elements of the G matrix. This increase in the homogeneity of allelic dosages in the imputed genotypes caused the observed inflation in accuracy of estimated GEBV, so whenever GEBV are obtained from imputed genotypes the estimated accuracy of GEBV should be used with caution. The average GEBV accuracy notably exceeded the expected accuracy of genomic evaluation in that scenario. In conclusion, we found that results for the accuracy of GEBV further support the notion that genomic evaluation using high-density genotypes imputed with high accuracy for selection candidates is a feasible method to implement a cost-efficient design for genomic selection in swine. When genotypes were imputed with lower accuracy in training and prediction animals, the accuracy of genomic evaluation was significantly decreased, and estimates of accuracy of GEBV were inflated. From our results, we can affirm that starting a genomic evaluation using low-density genotypes and a small panel of high-density haplotypes will result in reduced accuracy of evaluation. Conversely, once an evaluation is established with a large number of animals genotyped using a high-density platform, the addition of more animals genotyped at low density is promising. Further research is needed to study the effect of adding those imputed animals to the training population in further model retraining. As mentioned previously, all code and data used in this paper have been made available through an R package, accessible at: http://tinyurl.com/MSURGEBV.

References (33 in total)

1. Increased accuracy of artificial selection by using the realized relationship matrix.

Authors: B J Hayes; P M Visscher; M E Goddard
Journal: Genet Res (Camb) Date: 2009-02 Impact factor: 1.588

2. Efficient methods to compute genomic predictions.

Authors: P M VanRaden
Journal: J Dairy Sci Date: 2008-11 Impact factor: 4.034

3. Practical implementation of cost-effective genomic selection in commercial pig breeding using imputation.

Authors: M A Cleveland; J M Hickey
Journal: J Anim Sci Date: 2013-06-04 Impact factor: 3.159

4. Effect of imputing markers from a low-density chip on the reliability of genomic breeding values in Holstein populations.

Authors: R Dassonneville; R F Brøndum; T Druet; S Fritz; F Guillaume; B Guldbrandtsen; M S Lund; V Ducrocq; G Su
Journal: J Dairy Sci Date: 2011-07 Impact factor: 4.034

5. The effect of linkage disequilibrium and family relationships on the reliability of genomic prediction.

Authors: Yvonne C J Wientjes; Roel F Veerkamp; Mario P L Calus
Journal: Genetics Date: 2012-12-24 Impact factor: 4.562

6. The impact of genetic relationship information on genomic breeding values in German Holstein cattle.

Authors: David Habier; Jens Tetens; Franz-Reinhold Seefried; Peter Lichtner; Georg Thaller
Journal: Genet Sel Evol Date: 2010-02-19 Impact factor: 4.297

7. The accuracy of Genomic Selection in Norwegian red cattle assessed by cross-validation.

Authors: Tu Luan; John A Woolliams; Sigbjørn Lien; Matthew Kent; Morten Svendsen; Theo H E Meuwissen
Journal: Genetics Date: 2009-08-24 Impact factor: 4.562

Review 8. Invited review: Genomic selection in dairy cattle: progress and challenges.

Authors: B J Hayes; P J Bowman; A J Chamberlain; M E Goddard
Journal: J Dairy Sci Date: 2009-02 Impact factor: 4.034

9. Genomic evaluations with many more genotypes.

Authors: Paul M VanRaden; Jeffrey R O'Connell; George R Wiggans; Kent A Weigel
Journal: Genet Sel Evol Date: 2011-03-02 Impact factor: 4.297

10. A common dataset for genomic analysis of livestock populations.

Authors: Matthew A Cleveland; John M Hickey; Selma Forni
Journal: G3 (Bethesda) Date: 2012-04-01 Impact factor: 3.154
