Using pooled data for genomic prediction in a bivariate framework with missing data.

Johnna L Baller¹, Stephen D Kachman², Larry A Kuehn³, Matthew L Spangler¹.

Abstract

Pooling samples to derive group genotypes can enable the economically efficient use of commercial animals within genetic evaluations. To test a multivariate framework for genetic evaluations using pooled data, simulation was used to mimic a beef cattle population including two moderately heritable traits with varying genetic correlations, genotypes and pedigree data. There were 15 generations (n = 32,000; random selection and mating), and the last generation was subjected to genotyping through pooling. Missing records were induced in two ways: (a) sequential culling and (b) random missing records. Gaps in genotyping were also explored whereby genotyping occurred through generation 13 or 14. Pools of 1, 20, 50 and 100 animals were constructed randomly or by minimizing phenotypic variation. The EBV was estimated using a bivariate single-step genomic best linear unbiased prediction model. Pools of 20 animals constructed by minimizing phenotypic variation generally led to accuracies that were not different than using individual progeny data. Gaps in genotyping led to significantly different EBV accuracies (p < .05) for sires and dams born in the generation nearest the pools. Pooling of any size generally led to larger accuracies than no information from generation 15 regardless of the way missing records arose, the percentage of records available or the genetic correlation. Pooling to aid in the use of commercial data in genetic evaluations can be utilized in multivariate cases with varying relationships between the traits and in the presence of systematic and randomly missing phenotypes.

Keywords: DNA pooling; beef cattle; bivariate models; genomic prediction

Year: 2022 PMID: 35698863 PMCID: PMC9544112 DOI: 10.1111/jbg.12727

Source DB: PubMed Journal: J Anim Breed Genet ISSN: 0931-2668 Impact factor: 3.271

INTRODUCTION

Most of the data included in beef cattle genetic evaluations in the US are recorded within the nucleus (seedstock) segment; however, often economically relevant traits (ERT) are only observed at the commercial level. Records (phenotypes) are routinely collected at the commercial level but the pedigree relationships needed to connect these records to seedstock animals are often missing due to the lack of recording, group mating or the information does not follow the animals as they move through the industry (Bell et al., 2017). These relationships could be estimated using genomics but all commercial animals with a phenotype would need to be individually genotyped. This level of genotyping would not be economical. Nevertheless, the inclusion of commercial data has enormous potential to increase the response to selection for traits that are economically important to the beef industry including feedlot performance, reproductive longevity, disease resistance and carcass merit. An optimal solution would be to collect the true ERT from commercial herds and estimate relationships between commercial animals and seedstock animals in an economical manner for use in routine genetic evaluations. Genome‐wide association studies (GWAS) in conjunction with pooling have been shown to reduce the cost of genotyping (Sham et al., 2002) by grouping together animals with similar observations and then genotyping a pooled DNA sample from those groups (Darvasi & Soller, 1994). Many studies have used pooled DNA for GWAS to identify quantitative trait loci (QTL) in humans (e.g. general cognitive ability in children (Fisher et al., 1999) and colorectal and prostate cancer in a Polish population (Gaj et al., 2012)) and livestock (e.g. low reproductive cattle with the presence of SNP mapped to the Y chromosome (McDaneld et al., 2012), fertility in Holstein cattle (Huang et al., 2010) and somatic cell score in Valdostana Red Pied cattle (Strillacci et al., 2014)). 
Pooling has also been investigated for its utility in genetic prediction. Work has been done with simulation: for example, Sonesson et al. (2010) simulated an aquaculture population, whereas Alexandre et al. (2019) and Baller et al. (2020) simulated cattle populations. Pooled data in prediction have also seen use in real data sets: for example, Henshall et al. (2012) and Reverter et al. (2016) used Brahman Tropical composite cattle, Bell et al. (2017) used Merino sheep and Alexandre et al. (2020) used in silico Angus data. Most research has focused on the usefulness of pooling on a single trait. Alexandre et al. (2019) extended this concept to two traits, where pools were constructed on one trait or a combination of two traits using genomic best linear unbiased prediction (GBLUP) and genomic EBV (GEBV) was estimated with univariate models. Choosing animals to pool together in practice might best be facilitated at random, perhaps in part to ensure similar environmental effects or simply for ease of implementation. However, using real data and in silico, there are examples where pools have been constructed attempting to minimize phenotypic variation (Alexandre et al., 2020; Bell et al., 2017; Henshall et al., 2012; Reverter et al., 2016). Differences in pool construction and the impact on genomic prediction have been reported in simulation studies involving one trait (Baller et al., 2020) and two traits (Alexandre et al., 2019), both of which concluded that minimizing phenotypic variation within the pools led to the highest accuracies compared with other pool construction strategies. To our knowledge, previous studies have not attempted to quantify how pooling separately on the traits affects the EBV accuracy of each trait, nor have they combined all information from the two traits in a bivariate model. The objectives of this study were to evaluate factors that could impact the usefulness of pooling data for genetic prediction in a bivariate context. 
Consequently, the factors of pooling size, pooling strategy, generational gaps of genotyping, genetic correlation between two traits, how missing values arise, and the percentage of available records were evaluated within a single‐step GBLUP framework to determine how these factors impact EBV accuracy.

MATERIALS AND METHODS

Animal care and use committee approval was not required for this research as all data were simulated.

Simulation

Five replicates of a simulation mimicking a purebred beef cattle population were carried out using Geno‐Diver (Howard et al., 2017). Following Baller et al. (2019, 2020), each replicate contained a different founder genome comprising 29 chromosomes, each with a length of 87 Mb, which was determined as the average length of chromosomes using the NCBI Bos taurus 2009 assembly. Markers that represented a 50K SNP panel were randomly distributed across the genome; the location of 1,724 markers per chromosome and the quantitative trait loci (QTL) were drawn randomly from a uniform distribution with the parameters of 0 and the length of the chromosome. It was assumed the QTL occurred once per 3 Mb, resulting in 29 QTL per chromosome. Expanding on the simulations of Baller et al. (2019, 2020), two traits were simulated, each with a heritability of 0.4 resulting from phenotypic, additive and dominance variances set to 1, 0.4 and 0, respectively. Three different genetic correlations between the phenotypes were simulated for each of the five replicates representing low genetic correlation (0.1), moderate genetic correlation (0.4) and high genetic correlation (0.7). The QTL effects were generated by sampling from three independent gamma distributions, then the samples were combined to generate the additive effects of Trait 1 and 2 (Howard et al., 2018). The founder genomes were generated by the Markovian Coalescent Simulator (MaCS) program (Chen et al., 2009). Following Baller et al. (2019, 2020), founder genomes were generated to contain a large amount of short‐range LD, and the effective population size of the founder generation was set to 70. Founder animals consisted of 100 sires and 2,000 dams; they were randomly mated for five generations, with random replacement, to establish the pedigree. 
An additional 10 generations were simulated where animals were mated randomly with the caveat that animals with a relationship of 0.125 or greater were not mated together. The last 10 generations were randomly selected, with replacement rates of 0.4 and 0.2 for sires and dams, respectively. Animals were also culled when they had been in the population as a parent for 12 generations. Each mating resulted in one progeny; thus, each sire had 20 progeny per generation while each dam only had 1. The final population consisted of a total of 15 generations (n = 32,000).
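
The marker and QTL layout described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not Geno‐Diver's implementation: the function name `simulate_chromosome`, the seed and the use of Python's `random` module are assumptions made for illustration only.

```python
import random

CHROMOSOMES = 29             # chromosomes per founder genome
CHR_LENGTH_BP = 87_000_000   # 87 Mb per chromosome
MARKERS_PER_CHR = 1_724      # markers per chromosome (50K-like panel)
QTL_SPACING_BP = 3_000_000   # one QTL per 3 Mb

def simulate_chromosome(rng):
    """Draw sorted marker and QTL positions from Uniform(0, chromosome length)."""
    n_qtl = CHR_LENGTH_BP // QTL_SPACING_BP  # 87 Mb / 3 Mb = 29 QTL
    markers = sorted(rng.uniform(0, CHR_LENGTH_BP) for _ in range(MARKERS_PER_CHR))
    qtl = sorted(rng.uniform(0, CHR_LENGTH_BP) for _ in range(n_qtl))
    return markers, qtl

rng = random.Random(2022)  # arbitrary seed (assumption)
genome = [simulate_chromosome(rng) for _ in range(CHROMOSOMES)]

total_markers = sum(len(markers) for markers, _ in genome)
total_qtl = sum(len(qtl) for _, qtl in genome)
print(total_markers, total_qtl)  # 49996 841
```

The marker total (29 × 1,724 = 49,996) is consistent with the description of a 50K panel.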

Missing records

In industry, missing records can manifest in many ways, two of which were simulated in this study—sequential culling and randomly missing records. Missing records were simulated across the whole population, not just the last generation where pooling occurred. Selection occurs at various points in an animal’s lifetime. Some animals are culled based on a previously recorded trait(s) and do not have the opportunity to express traits later in life. To simulate this process, all individuals had an observable Trait 1 phenotype. The animals with the highest 75%, 50% or 25% Trait 1 phenotype had an observable Trait 2 phenotype recorded. Missing records can also occur randomly simply due to missed observations in the field. To simulate this scenario, three different percentages were considered—100%, 90% or 80% of records were available (0%, 10% or 20% of records were missing, respectively). The randomly missing records were determined for each trait independently, but with the same percentage of missing records—leading to 100% of Trait 1 and 100% of Trait 2 available, 90% of Trait 1 and 90% of Trait 2 available, or 80% of Trait 1 and 80% of Trait 2 available. Even though animals were randomly chosen, the same random animals were chosen within each replicate for consistency of comparison; for example, the same 80% of animals were chosen to have records retained within each replicate. Independently, the same 90% of animals were chosen to have records retained within each replicate.
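
The two missing-record mechanisms can be sketched as boolean masks over the animals. This is a minimal illustration under stated assumptions: the function names, the seed, the standard normal phenotypes and the choice to keep an exact (rather than expected) fraction of records are all hypothetical and are not taken from the study's software.

```python
import random

def sequential_culling(trait1, keep_fraction):
    """Mask that is True where Trait 2 is observed.

    Only the animals with the highest Trait 1 phenotypes keep a Trait 2 record.
    """
    n = len(trait1)
    n_keep = round(n * keep_fraction)
    ranked = sorted(range(n), key=lambda i: trait1[i], reverse=True)
    observed = [False] * n
    for i in ranked[:n_keep]:
        observed[i] = True
    return observed

def random_missing(n, keep_fraction, rng):
    """Mask with an exact fraction of records retained at random (assumption)."""
    keep = set(rng.sample(range(n), round(n * keep_fraction)))
    return [i in keep for i in range(n)]

rng = random.Random(1)  # fixed seed so the same animals are chosen on every run
trait1 = [rng.gauss(0, 1) for _ in range(2000)]

# Sequential culling: all Trait 1 records kept, top 25% on Trait 1 keep Trait 2.
trait2_observed = sequential_culling(trait1, 0.25)

# Randomly missing: masks drawn independently per trait, same percentage.
t1_random = random_missing(2000, 0.90, rng)
t2_random = random_missing(2000, 0.90, rng)
```

Fixing the seed mirrors the design choice described above, where the same animals were retained within each replicate for consistency of comparison.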

Pooling

The individuals born in generation 15 (n = 2,000) were assigned to pools. Two sets of pools were independently constructed: the first set was constructed based on Trait 1 records, and the second set was based on Trait 2 records. Baller et al. (2020) recommended pool sizes of 2, 10, 20 or 50 while Kuehn et al. (2018) recommended pool sizes of 20 as a minimum. Consequently, pool sizes of 20, 50 and 100 were simulated to illustrate a gradient from a recommended minimum to larger values. In the case where there were no missing records, pool sizes of 20, 50 and 100 individuals resulted in 200 pools (100 based on Trait 1 and 100 based on Trait 2), 80 pools or 40 pools, respectively. In the case where there were missing records for a trait, the number of pools based on that trait would be proportionally less. Pool assignments were determined in two different ways: (a) randomly or (b) minimizing the phenotypic variation within a pool. Random pools were formed by randomly assigning individuals to a pool based on Trait 1 and to a pool based on Trait 2. For example, for a pool size of 20 and no missing records, an animal would be randomly assigned to two pools, one pool from the 100 pools based on Trait 1 and one pool from the 100 pools based on Trait 2. To construct pools to minimize phenotypic variation within pools, individuals with records for Trait 1 were first ranked based on their phenotypic record for Trait 1 and then grouped together depending on the pool size. This process was then repeated for individuals with a record for Trait 2. For example, with a pool size of 20 and no missing records, the animals with the smallest 20 phenotypes for Trait 1 were included in Pool 1 and the smallest 20 phenotypes for Trait 2 were included in Pool 101. Pools based on Trait 1 had a phenotypic record for Trait 1 and a missing record for Trait 2 and vice versa. 
Individuals could only be included in one pool per trait per scenario, where the scenario is defined as a combination of missing record strategy, pooling strategy, percentage of missing records and generation in which genotyping stopped but could be found in two pools if both traits were recorded. Pool size was consistent within each scenario. The phenotypic record for a pool based on a trait was the average phenotype for that trait of the individuals contributing to that pool. Genotypes of the pools were average genotype calls across all SNP of the individuals that made up the pool, and ranged from 0 to 2, as described by Baller et al. (2020). It was assumed all genotypes were known without error and there was also no error introduced by pool formation leading to no additional residual error due to the process of pooling DNA samples or genotyping. Pedigree ties between the commercial and seedstock animals are known to exist, but they are often not recorded. Thus, following Baller et al. (2020), the pedigree of the animals in generation 15 was assumed unknown. The only ties between the pooled commercial animals and the seedstock population were estimated by genomic relationships. Missing records for animals in generation 15 followed the same scenarios as with the earlier generations: sequential culling and randomly missing records. To provide a comparison of extreme cases, scenarios were considered where animals from generation 15 entered the evaluation individually (pool size of 1) and when the animals from generation 15 did not enter the evaluation at all (No gen 15). For pool size of 1, each animal in generation 15 had an opportunity to have an individual record for each trait dependent on whether or not their phenotypes were used for pooling and to have their individual genotype entered into the evaluation. 
For the case of missing records, some animals were not pooled at all; for consistency of comparing across scenarios, only the individuals that did appear in a pool were considered for a pool size of 1. In this case, the genotype calls of these individuals were entered into the evaluation as the traditional “0,” “1” or “2.”
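
The pooling steps described above (assignment to pools, pool phenotype as the mean record, pool genotype as the mean allele count) can be sketched as follows. This is an illustrative outline only; the function names, the toy genotypes and the decision to drop animals that do not fill a complete pool are assumptions, not details confirmed by the study.

```python
import random

def make_pools(phenotypes, pool_size, strategy, rng=None):
    """Assign animals with a record for one trait to pools of a fixed size.

    phenotypes: dict mapping animal id -> phenotype for the pooling trait.
    strategy: "random" or "minimize" (rank on phenotype, then cut into groups).
    """
    ids = list(phenotypes)
    if strategy == "minimize":
        ids.sort(key=lambda a: phenotypes[a])
    elif strategy == "random":
        (rng or random).shuffle(ids)
    else:
        raise ValueError("strategy must be 'random' or 'minimize'")
    n_full = len(ids) // pool_size  # leftover animals are dropped (assumption)
    return [ids[k * pool_size:(k + 1) * pool_size] for k in range(n_full)]

def pool_phenotype(pool, phenotypes):
    """Pool record: average phenotype of the contributing animals."""
    return sum(phenotypes[a] for a in pool) / len(pool)

def pool_genotype(pool, genotypes):
    """Pool genotype: average allele count per SNP, between 0 and 2."""
    n_snp = len(genotypes[pool[0]])
    return [sum(genotypes[a][s] for a in pool) / len(pool) for s in range(n_snp)]

rng = random.Random(15)
n_animals, n_snp = 2000, 10  # toy SNP count for illustration
trait1 = {a: rng.gauss(0, 1) for a in range(n_animals)}
trait2 = {a: rng.gauss(0, 1) for a in range(n_animals)}
genotypes = {a: [rng.choice((0, 1, 2)) for _ in range(n_snp)] for a in range(n_animals)}

# Two independent sets of pools: one built on Trait 1, one on Trait 2.
pools_t1 = make_pools(trait1, 20, "minimize")
pools_t2 = make_pools(trait2, 20, "minimize")
pool_means_t1 = [pool_phenotype(p, trait1) for p in pools_t1]
pool_geno_t1 = [pool_genotype(p, genotypes) for p in pools_t1]
print(len(pools_t1) + len(pools_t2))  # 200
```

With 2,000 animals, no missing records and a pool size of 20, each trait yields 100 pools, so every animal appears in exactly two pools, one per trait.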

Missing generation of genotypes

All parents were assumed to be genotyped even if they did not have a recorded phenotype because of randomly missing records. As with Baller et al. (2020), generational gaps in genotyping were induced between the seedstock and commercial animals because the cost of genotyping in real populations can be prohibitive. Therefore, the genotypes of animals above the pooled individuals were masked. Two scenarios were considered: (a) animals up to and including those born in generation 13 were genotyped (Gen13) and (b) animals up to and including those born in generation 14 were genotyped (Gen14). Baller et al. (2020) explored additional scenarios where more generations had genotypes masked, but they led to similar results as Gen13. All animals in generations 6–14 were included in the pedigree regardless of the genotyping scenario. Additionally, founder animals may be missing or were not genotyped. Therefore, only animals in generations 0–5 that appeared in a three‐generation pedigree of the pooled animals were included in the pedigree and it was assumed these animals were not genotyped. All other animals in generations 0–5 were excluded from the analysis.

Analysis

A bivariate animal model utilizing single‐step GBLUP was used to estimate EBV. Single‐step GBLUP combines genomic and pedigree information into one kinship matrix called H (Aguilar et al., 2010; Christensen & Lund, 2010). The model used when only individual observations were available (pool sizes of 1 and when generation 15 did not enter the evaluation) was $\mathbf{y}_i = \mathbf{X}_i\mathbf{b}_i + \mathbf{Z}_i\mathbf{u}_i + \mathbf{e}_i$, where $\mathbf{y}_i$ is a vector of individual phenotypic observations for the ith trait; $\mathbf{X}_i$ was a known incidence matrix relating the observations to the fixed effects for the ith trait; $\mathbf{b}_i$ was a vector of fixed effects for the ith trait; $\mathbf{Z}_i$ was a known incidence matrix relating observations to the random additive genetic effects for the ith trait; $\mathbf{u}_i$ was a vector of random additive genetic effects for the ith trait; and $\mathbf{e}_i$ was a vector of random residuals for the ith trait. The only fixed effect included in the model for either trait was the intercept. It was assumed that $\operatorname{var}\begin{bmatrix}\mathbf{u}_1\\ \mathbf{u}_2\end{bmatrix} = \mathbf{G} \otimes \mathbf{H}$ and $\operatorname{var}(\mathbf{e}_i) = \mathbf{I}\sigma^2_{e_i}$ with $\mathbf{R} = \operatorname{diag}(\sigma^2_{e_1}, \sigma^2_{e_2})$, where G is a 2 × 2 matrix containing the variance components for the additive effects and R is a diagonal matrix containing the variances for the residual effects. The details of the construction of the inverse of the kinship matrix H ($\mathbf{H}^{-1}$) were described previously by Baller et al. (2020). The underlying model introduced by Baller et al. (2020) was extended to a bivariate case. However, it was assumed the individual genotypes, pedigrees and phenotypes of animals in generation 15 were unknown, but the individual phenotypes of Traits 1 and 2 contributed to the pool means (i.e. individual data were unobserved, but pool means were observed). 
Thus, the final prediction model used was $\mathbf{y}_i^{*} = \mathbf{X}_i^{*}\mathbf{b}_i + \mathbf{Z}_i^{*}\mathbf{u}_i^{*} + \mathbf{e}_i^{*}$, where $\mathbf{y}_i^{*}$ is a vector of individual and pooled phenotypic observations for the ith trait; $\mathbf{X}_i^{*}$ was a known incidence matrix relating the individual and pooled observations to the fixed effects for the ith trait; $\mathbf{b}_i$ was the same vector of fixed effects for the ith trait as above (containing only the intercept); $\mathbf{Z}_i^{*}$ was a known incidence matrix relating individual and pooled observations to the random additive genetic effects for the ith trait; $\mathbf{u}_i^{*}$ was a vector of random additive genetic effects for the ith trait for both individuals and pools; and $\mathbf{e}_i^{*}$ was a vector of random residuals for individuals and pools based on the ith trait. It was assumed that $\operatorname{var}\begin{bmatrix}\mathbf{u}_1^{*}\\ \mathbf{u}_2^{*}\end{bmatrix} = \mathbf{G} \otimes \mathbf{H}_p$ with $\mathbf{R} = \operatorname{diag}(\sigma^2_{e_1}, \sigma^2_{e_2})$, where again G is a 2 × 2 matrix containing the variance components for the additive effects, $\mathbf{H}_p$ is a kinship matrix relating individual animals and pools of animals, and R is a diagonal matrix containing the variances for the residual effects. Because the phenotypes in $\mathbf{y}_i^{*}$ are heterogeneous in information content (the phenotypes for animals in generations 0–14 are individual phenotypes, whereas the phenotypes for pools are averages of animals from generation 15), the variance of the residuals is $\operatorname{var}(e_{ij}^{*}) = \sigma^2_{e_i}/n_j$, where $\sigma^2_{e_i}$ is the residual variance for the ith trait and $n_j$ is 1 for an individual record and the pool size for a pooled record. For simplicity, the variance structure for the residuals used in the model assumes that animals are randomly assigned to pools. When pools were formed to minimize the phenotypic variance the assumption of random assignment does not hold, but the variance structure is one that would be used in practice. The inverse of $\mathbf{H}_p$ was constructed the same as H except that the allelic frequencies were estimated from individuals and pools. Pool constructions and the computation of inverses of H and $\mathbf{H}_p$ were carried out in R (R Core Team, 2017). Breeding values were estimated in the ASReml v4.1 software (Gilmour et al., 2015) using the preconditioned conjugate gradients (PCG) method. 
The accuracy of EBV for sires and dams was estimated as the correlation between the true breeding values (TBV) and the EBV. The accuracies were estimated separately for sires and dams, the generation in which they were born (11, 12, 13 or 14), and for each trait (Trait 1 and Trait 2). The accuracy of the pools was estimated as the correlation between the average TBV of the animals that made up the pool and the EBV. An observation (EBV accuracy of a sire or dam born within a particular generation, replicate, missing record strategy, pooling strategy, percentage of missing records and generation in which genotyping stopped; together considered a final simulated set) was deemed an outlier if it was identified in both an interquartile range (IQR) test within a replicate and an IQR test within a pool size. The IQR test identifies an observation as an outlier if the observation is either more than $Q_3 + 1.5 \times \mathrm{IQR}$ or less than $Q_1 - 1.5 \times \mathrm{IQR}$, where $Q_1$ and $Q_3$ are the first and third quartiles, respectively. All data from a final simulated set with at least one outlier were excluded from the analysis. In the presence of outliers, medians are more robust than means; thus, final plotted accuracies are median values across the five replicates. However, to determine the significance of effects on the EBV accuracy, Analysis of Variance tests were performed after excluding all observations from a final simulated set with at least one outlier with the following model: $y_{ijklmn} = \mu + g_i + s_j + p_k + w_l + a_{m(l)} + (\text{two‐way interactions}) + r_n + e_{ijklmn}$, where $y_{ijklmn}$ was the EBV accuracy of sires/dams born in generations 11, 12, 13 or 14 or pools for Trait 1 or Trait 2 with outliers removed; $\mu$ was the overall mean; $g_i$ was the effect of the generational gap; $s_j$ was the effect of pooling strategy; $p_k$ was the effect of pool size; $w_l$ was the effect of the way missing values arise; $a_{m(l)}$ was the effect of percentage of available records nested within the way missing values arise; $r_n$ was the random effect of replicate; and $e_{ijklmn}$ was the random residual. The model was restricted to only two‐way interactions. 
It was assumed that $r_n$ and $e_{ijklmn}$ were distributed normally with a mean of zero and variance of $\sigma^2_r$ and $\sigma^2_e$, respectively. Significance was determined at $\alpha$ = .05.
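
The accuracy measure and the two-part outlier rule can be sketched in a few lines. This is a hedged illustration rather than the analysis code used in the study: the function names are hypothetical, the 1.5 × IQR fences follow the conventional Tukey rule, and the quartile definition (Python's "inclusive" method) is an assumption because quartile conventions differ between software packages.

```python
import statistics

def accuracy(tbv, ebv):
    """Pearson correlation between true and estimated breeding values."""
    mean_t, mean_e = statistics.fmean(tbv), statistics.fmean(ebv)
    cov = sum((t - mean_t) * (e - mean_e) for t, e in zip(tbv, ebv))
    ss_t = sum((t - mean_t) ** 2 for t in tbv)
    ss_e = sum((e - mean_e) ** 2 for e in ebv)
    return cov / (ss_t * ss_e) ** 0.5

def iqr_flags(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]

def is_outlier(flag_within_replicate, flag_within_pool_size):
    """An observation is an outlier only if both IQR tests flag it."""
    return flag_within_replicate and flag_within_pool_size

# Toy accuracies (made-up numbers) for one grouping; the last one is extreme.
accuracies = [0.61, 0.63, 0.62, 0.64, 0.20]
flags = iqr_flags(accuracies)
```

Requiring both tests to agree makes the rule conservative: an accuracy that is unusual within its replicate but typical for its pool size is retained.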

Expectations of pooled genomic relationships

Baller et al. (2020) assumed individuals were only included in one pool, but with the extensions provided in this research, individuals can now be included in more than one pool: a pool based on its Trait 1 phenotype and a separate pool based on its Trait 2 phenotype. Because of this modification, a slight generalization in the expectations of the pooled genomic relationships between the pools presented by Baller et al. (2020) is needed to account for the possibility of shared individuals among pools. Let the matrix $\mathbf{G}_{ind}$ represent the relationships between individuals in generation 15. Similarly, let $\mathbf{G}_{pool}$ represent the relationships between the pools. The expected genomic relationship matrix $\mathbf{G}_{pool}$ is a function of $\mathbf{G}_{ind}$ and follows: $g_{pool,kk'} = \frac{1}{n_k n_{k'}} \mathbf{1}_k' \mathbf{G}_{ind,kk'} \mathbf{1}_{k'}$, where $g_{pool,kk'}$ is the kk' element of $\mathbf{G}_{pool}$ corresponding to pools k and k', $\mathbf{G}_{ind,kk'}$ is the kk' submatrix of $\mathbf{G}_{ind}$ corresponding to individuals in pools k and k', $n_k$ and $n_{k'}$ are the numbers of individuals in pools k and k', and $\mathbf{1}_k$ and $\mathbf{1}_{k'}$ are indicator vectors for pools k and k' with elements 1 if the individual is in the pool and 0 if the individual is not in the pool. Assume all individuals in generation 15 are unrelated. From the expectations above it can be seen that for pools of $n$ individuals, the diagonal elements of $\mathbf{G}_{pool}$ are equal to $1/n$ and the off‐diagonals of $\mathbf{G}_{pool}$ are proportional to $m/n^2$, where m is the number of individuals in common between two pools. Thus, the off‐diagonals of $\mathbf{G}_{pool}$ between pools that were based on the same trait are expected to be zero as they share no common individuals but are expected to be proportional to $1/n^2$ if one animal is in common between pools based on different traits, proportional to $2/n^2$ if two animals are in common, and so on. If the individuals in generation 15 are related, as is the case in this simulation and likely with real data, the diagonal elements of $\mathbf{G}_{pool}$ are expected to be greater than $1/n$ and the off‐diagonal elements of $\mathbf{G}_{pool}$ between pools based on different traits will be greater than $m/n^2$ as the individuals in the pools become more related.
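
The expectations for unrelated individuals can be verified numerically by averaging the individual relationships over every pair of animals drawn from two pools. The sketch below is illustrative only; the function name, the toy pool size of 4 and the identity relationship matrix (unrelated, non-inbred animals) are assumptions chosen to keep the arithmetic easy to follow.

```python
def pooled_relationship(g_ind, pool_k, pool_l):
    """Average of individual relationships over all pairs (i in k, j in l).

    Equivalent to (1 / (n_k * n_l)) * 1_k' G 1_l with 0/1 indicator vectors.
    """
    total = sum(g_ind[i][j] for i in pool_k for j in pool_l)
    return total / (len(pool_k) * len(pool_l))

# Ten unrelated animals: the individual relationship matrix is the identity.
n_animals = 10
g_ind = [[1.0 if i == j else 0.0 for j in range(n_animals)] for i in range(n_animals)]

pool_a = [0, 1, 2, 3]   # pool built on Trait 1
pool_b = [4, 5, 6, 7]   # another Trait 1 pool, no shared animals
pool_c = [0, 1, 8, 9]   # pool built on Trait 2, shares m = 2 animals with pool_a

print(pooled_relationship(g_ind, pool_a, pool_a))  # 0.25  (1/n)
print(pooled_relationship(g_ind, pool_a, pool_b))  # 0.0   (no animals in common)
print(pooled_relationship(g_ind, pool_a, pool_c))  # 0.125 (m/n^2 = 2/16)
```

Replacing the identity matrix with a matrix that contains non-zero off-diagonal relationships raises all three values, which is the behaviour described for related animals.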

RESULTS AND DISCUSSION

Figure 1 depicts the correlation between the average phenotype and average TBV of the pools. Regardless of genetic correlation, the way in which missing values arise, the percentage of available records or the trait considered, pool sizes of 20, 50 and 100 led to larger correlations of average phenotype and TBV compared with pool sizes of 1; this agrees with Baller et al. (2020). Previously, Baller et al. (2020) observed that pools constructed randomly led to approximately similar correlations between average phenotype and TBV regardless of pool size. This was not observed in the current study: no identifiable pattern with regard to pool size emerged with random pooling. However, the range of correlations between average phenotype and TBV was larger for sequential culling than for random missing records.

FIGURE 1

Correlation of average phenotype and average true breeding value (TBV) in pools. Pools resulting from different genetic correlations, how missing records occur (random missing = missing records occur randomly; sequential culling = missing records occur because of sequential culling), pooling strategies (random = randomly allocated to pools; minimize = minimize phenotypic variation within pools), percentage of available records (80% = 80% of Trait 1 and Trait 2 records are available; 100% = 100% of Trait 1 and Trait 2 records are available; 25% = 100% of Trait 1 records and 25% of Trait 2 records are available) and pool sizes

The average relationships within a pool and across pools were approximately equal regardless of pool size. The comparison across pools was only considered within the trait the pools were designed for. Regardless of how missing values arise, the average relationships within a pool and between pools were approximately the same for Traits 1 and 2 when pools were formed to minimize phenotypic variation. However, when pools were formed randomly, the average relationships of Trait 2 were typically higher than those of Trait 1, both within and across pools. The difference between the average relationships of pools based on Trait 1 and 2 becomes larger as the percentage of available records becomes smaller. The average relationships within pools and across pools within the trait the pools were designed for were lower than those observed by Baller et al. (2020). This result could be an artefact of selection: Baller et al. (2020) simulated a population whereby selective replacement based on EBV was practiced whereas the current simulation employed random selection. When considering the average relationships of individuals pooled across traits, it is important to note again that the same individuals were used for pooling across all pool sizes and pooling strategies. 
Additionally, within the way missing records arise and the percentage of individuals available, the individuals were always the same for consistency. Regardless of genetic correlation, the average relationship of individuals between pools based on Traits 1 and 2 increased as the percentage of records available increased when missing records arose randomly. This increase was due to more animals being included for both traits with more records as it was very unlikely the same animals would randomly have missing records for both traits. The average relationship of individuals between pools based on Traits 1 and 2 also increased as the percentage of records available increased with sequential culling and a genetic correlation of 0.7. This increase in relationship is expected as it is more likely related animals were retained during sequential culling when the genetic correlation is high. With a genetic correlation of 0.4 and sequential culling, the relationships between pools based on different traits were approximately the same regardless of the percentage of records available, except for when 25% of Trait 2 records were available, which led to lower average relationships. With a genetic correlation of 0.1, sequential culling and across all percentages of available records, the relationships between pools based on different traits were approximately equal.

EBV accuracies of sires and dams

Figures 2 and 3 depict the median EBV accuracies of sires born in generation 14 for sequential culling and randomly missing records, respectively, depending on genetic correlation, pooling strategy, percentage of missing records and when genotyping stopped at generation 14. Results of dams are not shown as they follow the same patterns as the sires. Although the same patterns are present with the sires and dams, two key differences do exist. First, the median EBV accuracies of dams were numerically lower than those of the sires. Additionally, the difference between EBV accuracy when pool sizes of 1 were used and when generation 15 did not enter the evaluation at all was smaller for dams than sires. Both of these were due to the fact that dams only had one progeny per generation while sires had 20.

FIGURE 2

Use of sequential culling leading to estimated breeding value (EBV) accuracies of sires (estimated as the correlation between true breeding value [TBV] and EBV). Presented sires born in generation 14 with accuracies resulting from different genetic correlations, pooling strategies (random = randomly allocated to pools; minimize = minimize phenotypic variation within pools), percent of available records (25% = 100% of Trait 1 records and 25% of Trait 2 records are available; 50% = 100% of Trait 1 records and 50% of Trait 2 records are available; 75% = 100% of Trait 1 records and 75% of Trait 2 records are available; 100% = 100% of Trait 1 and Trait 2 records are available) and pool sizes with ranges in accuracy along the x‐axis

FIGURE 3

Use of randomly missing records leading to estimated breeding value (EBV) accuracies of sires (estimated as the correlation between true breeding value [TBV] and EBV). Presented sires born in generation 14 with accuracies resulting from different genetic correlations, pooling strategies (random = randomly allocated to pools; minimize = minimize phenotypic variation within pools), percent of available records (80% = 80% of Trait 1 and Trait 2 records are available; 90% = 90% of Trait 1 and Trait 2 records are available; 100% = 100% of Trait 1 and Trait 2 records are available) and pool sizes with ranges in accuracy along the x‐axis

Generational gap of genotyping

For sires and dams born in generation 14, the EBV accuracies of both traits were lower when genotyping stopped at generation 13 than when genotyping occurred through generation 14 by 0.140 and 0.136 for sires and dams, respectively. Large decreases in EBV accuracy were not found in sires or dams born in generations 13 or earlier dependent on when genotyping stopped because the animals born in these generations were always genotyped (results not shown). Baller et al. (2020) also noted that EBV accuracies of sires and dams by the generation of birth were highest when the genotyping occurred through or past the generation considered. Therefore, larger EBV accuracies are a result of connectedness arising from genomic relationships rather than pedigree relationships (Baller et al., 2020). Using single‐step GBLUP in a simulated data set, the accuracy of GEBV increased as more genotyped individuals were used (Lourenco et al., 2015).

Pooling strategy and size

When pools were constructed randomly, the EBV accuracy resulting from any pool size or when generation 15 did not enter the evaluation was significantly lower than that from a pool size of 1. When pools were constructed to minimize phenotypic variation, more interesting comparisons were apparent. Ideally, for pooling to be an acceptable approach to include commercial data into evaluations, EBV accuracies of pools would be significantly different than those from when generation 15 did not enter the evaluation and not different from a pool size of 1. This result occurred for sires born in generation 14 for Trait 1 across all pool sizes and was also true for dams born in generation 14 only when pool sizes were of size 20 for Trait 1. For Trait 2, this result occurred for sires born in generations 13 and 14. Significant differences in pool size were likely different for Trait 1 compared with Trait 2 because missing records, especially for sequential culling, were induced for Trait 2. Differences between sires and dams regarding significant differences in pool sizes were likely due to the amount of information available due to the number of progeny each sex had. A less optimal situation would be where the EBV accuracies as a result of pooling were still significantly higher than when generation 15 did not enter the evaluation but also significantly lower than pool sizes of 1. This occurred with pool sizes of 20, 50 and 100 for sires born in generation 13 for Trait 1 and pool sizes of 50 and 100 for sires born in generation 14 for Trait 2. These comparisons may be statistically significant; however, numerically, the largest pairwise difference was 0.03 as they were averaged over generation in which genotyping stopped, genetic correlation, the way in which missing records arose, and the percentage of missing records nested within how the missing records arose (data not shown). 
Thus, with such a small numeric difference in accuracy, the reduced cost of pooling may still make it much more economical than individual genotyping.

Reverter et al. (2016) used pooling in Brahman cattle for pregnancy and lactation status with GBLUP. Cattle were pooled based on the results of a pregnancy test, in pools of 15–28 individuals. GEBV for fertility were estimated for bulls that were not sires of the pooled cattle. Bell et al. (2017) used pooling in Merino sheep with dag scores, also using GBLUP to estimate GEBV. The sheep were pooled by sex and dag score category, with pool sizes of 33 to 40 individuals. Neither Bell et al. (2017) nor Reverter et al. (2016) compared the accuracies of GEBV from pooled data with a baseline of GEBV from individual data, so it is not known whether the loss of prediction accuracy due to pooling was significant, which warrants validation of pooling with simulation. Previously, Baller et al. (2020) constructed pools to uniformly maximize phenotypic variation within pools, but this strategy gave results comparable to random allocation to pools and did not improve EBV accuracy beyond that obtained by minimizing phenotypic variation within pools. Baller et al. (2020) concluded that when pools were constructed by minimizing phenotypic variation, pool sizes of 2, 10, 20 or 50 did not lead to EBV accuracies different from those obtained when individual progeny data were used.

In a simulation of two traits, Alexandre et al. (2019) investigated pooling strategies based on Trait 1, Trait 2, a combination of both, or random allocation to estimate GEBV. In contrast to the current study, pools were not reformed for individual traits, nor was a bivariate model used.
Accuracies of GEBV of sires, estimated as the correlation of GEBV and TBV within a trait, were greatest when pools were constructed on the trait itself and lowest when pools were constructed randomly. Alexandre et al. (2020) investigated pooling in silico using Angus data and three traits. The genomic EBV was again calculated using univariate models. Accuracy of GEBV was calculated as the correlation between the sire’s GEBV with pooled progeny data and the sire’s GEBV using individual progeny data. Pooling strategies employed by Alexandre et al. (2020) were (a) random pooling and (b) pooling by phenotype, which is equivalent to minimizing phenotypic variation within pools in the current study. Not all three traits were recorded on every animal, which hindered the calculation of GEBV accuracy for one trait when the pools were constructed based on another trait. Regardless, they also found that pooling by trait led to larger GEBV accuracies than pooling randomly. Alexandre et al. (2019) suggested pool sizes of 10 as a compromise between the loss in GEBV accuracy and the cost savings of pooling; Alexandre et al. (2020) suggested this could be extended to pool sizes greater than 10. Pool sizes of 1, 2, 5, 10, 15, 20 and 25 were investigated; even pool sizes of 25 did not lead to unreasonable losses of GEBV accuracies compared with individual data. In a study investigating the efficiency of estimated genomic relationships of pools to the animals that make up the pools and to other potentially related individuals, Kuehn et al. (2018) suggested pools of at least 20 to lessen pool construction error.

Table 1 contains the least‐squares EBV accuracy means by the percentage of records available nested within how the missing records arose. As expected, the accuracy of Trait 1 EBV for sires and dams was not impacted by sequential culling given all animals had a Trait 1 phenotype recorded.
However, sequential culling impacted Trait 2 EBV accuracy, as all pairwise comparisons of percentage of missing records within how the missing records arose were significant. When records were randomly missing, pairwise comparisons of percentage of missing records within how the missing records arose were significant, meaning that as the percentage of available records increased, so did the EBV accuracies. Even though these comparisons were statistically significant, the numerical increase in EBV accuracy was small, typically only 0.01 from 80% to 90% available records or from 90% to 100% available records. It is important to note that these least‐squares means were averaged over pool sizes, pooling strategy, genetic correlation and the generation in which genotyping stopped. Overall, as more records were available, the EBV accuracies of the traits increased.
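The two ways in which missing records arise can be sketched as boolean availability masks. This is only an illustration under stated assumptions, not the authors' simulation code: the function names are hypothetical, random missingness is assumed to act independently on each trait, and sequential culling is assumed to be truncation on the Trait 1 phenotype, so that every animal keeps its Trait 1 record and only the top-ranked fraction keeps a Trait 2 record.

```python
import numpy as np


def random_missing(n_animals, pct_available, rng):
    """Return (trait1_mask, trait2_mask); True means the record is available.

    Assumption: each trait loses records independently and uniformly at random.
    """
    n_keep = int(round(n_animals * pct_available))
    masks = []
    for _ in range(2):
        mask = np.zeros(n_animals, dtype=bool)
        mask[rng.choice(n_animals, size=n_keep, replace=False)] = True
        masks.append(mask)
    return masks[0], masks[1]


def sequential_culling(trait1_phenotypes, pct_trait2_available):
    """Return (trait1_mask, trait2_mask) under culling on Trait 1.

    Assumption: animals are ranked on the Trait 1 phenotype and only the
    highest-ranked fraction survives to have Trait 2 recorded.
    """
    n = len(trait1_phenotypes)
    n_keep = int(round(n * pct_trait2_available))
    order = np.argsort(trait1_phenotypes)[::-1]  # best Trait 1 phenotype first
    trait1_mask = np.ones(n, dtype=bool)         # Trait 1 is always recorded
    trait2_mask = np.zeros(n, dtype=bool)
    trait2_mask[order[:n_keep]] = True
    return trait1_mask, trait2_mask


rng = np.random.default_rng(1)
y1 = rng.normal(size=1000)                       # hypothetical Trait 1 phenotypes
m1_random, m2_random = random_missing(1000, 0.80, rng)
m1_cull, m2_cull = sequential_culling(y1, 0.25)
```

Under these assumptions the "25%" scenario leaves every Trait 1 record and one quarter of the Trait 2 records, matching the scenario definitions given in the Table 1 footnotes.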

TABLE 1

Least‐squares mean estimates of EBV accuracies due to the percent of missing records nested within how the missing records arose

Missing records ^†	Percent available ^‡	Trait 1 ^§				Trait 2 ^¶
		Sire		Dam		Sire		Dam
		14 ^††	13 ^‡‡	14	13	14	13	14	13
Random missing	80%	0.84^a	0.93^a	0.82^a	0.90^a	0.84^a	0.93^a	0.82^a	0.90^a
	90%	0.85^b	0.93^a	0.83^b	0.90^b	0.84^a	0.94^ab	0.83^b	0.91^b
	100%	0.86^b	0.94^b	0.84^c	0.91^c	0.85^b	0.94^b	0.84^c	0.91^c
Sequential culling	25%	0.85^a	0.94^a	0.84^a	0.91^a	0.75^a	0.84^a	0.73^a	0.81^a
	50%	0.85^a	0.94^a	0.84^ab	0.91^a	0.80^b	0.90^b	0.79^b	0.87^b
	75%	0.85^a	0.94^a	0.84^ab	0.91^a	0.83^c	0.93^c	0.82^c	0.90^c
	100%	0.86^a	0.94^a	0.84^b	0.91^a	0.85^d	0.94^d	0.84^d	0.91^d
Std. error		0.007	0.004	0.005	0.001	0.005	0.016	0.006	0.005

Note: ^a,b,c,d Within a column and missing record scenario, least‐squares means with the same letter are not significantly different at the .05 level.

^† Random missing = missing records occur randomly; sequential culling = missing records occur because of sequential culling.

^‡ 80% = 80% of Trait 1 and Trait 2 records are available; 90% = 90% of Trait 1 and Trait 2 records are available; 100% = 100% of Trait 1 and Trait 2 records are available; 25% = 100% of Trait 1 records and 25% of Trait 2 records are available; 50% = 100% of Trait 1 records and 50% of Trait 2 records are available; 75% = 100% of Trait 1 records and 75% of Trait 2 records are available.

^§ EBV accuracy of Trait 1.

^¶ EBV accuracy of Trait 2.

^†† Sires or dams born in generation 14.

^‡‡ Sires or dams born in generation 13.

Guo et al. (2014) studied the difference in the reliabilities of GEBV, measured as the squared correlation between GEBV and TBV, of two traits using all available data or assuming 90% of the EBV for the first trait was not used for genomic selection or 90% of the EBV for the second trait was not used for genomic selection. The GEBV was estimated using GBLUP where the response variables were traditional EBV. The first trait had a heritability of 0.3 while the second trait had a heritability of 0.05 and the genetic correlation was 0.5. When there were missing records for the first trait, the reliability of GEBV decreased by 0.258 as compared to when both traits were recorded on all animals. When there were missing records for the second trait, the reliability of GEBV decreased by 0.171 as compared to when both traits were recorded on all animals.

The interactions of pool size and pooling strategy with the percentage of missing records nested within how the missing records arose were not significant.
This result signifies that the impact of pool size and pooling strategy is not dependent on the percentage of missing records nested within how the missing records arose, rather they are consistent across those investigated herein. The interaction of the generation in which genotyping stopped and the percentage of missing records nested within how the missing records arose was significant for EBV accuracies of Trait 2 for sires born in generation 14 and also for the EBV accuracies of both traits for dams born in generation 14 (data not shown). The largest numerical differences resulted from comparisons made between whether genotyping stopped at generation 13 or 14, which is not surprising given the significant effect of missing records on EBV accuracy. Furthermore, the only sources of progeny information for parental animals born in generation 14 were pooled data whereas earlier generations (i.e. generation 13) benefited from offspring with individual records in addition to descendants contained within the pools. Regardless of how the missing values arose or the percentage of available records, when pools were constructed in order to minimize phenotypic variation, pools of any size generally led to larger accuracies than when data from generation 15 did not enter the evaluation. These are encouraging results suggesting that missing values do not affect the usefulness of pooling.

Genetic correlation

The interactions of pool size and pooling strategy with genetic correlation were not significant. This result again signifies that the impact of pool size and pooling strategy is not dependent on genetic correlation; rather, it is consistent across the genetic correlations investigated herein. The interaction of the generation in which genotyping stopped and the genetic correlation between the two traits was significant for sires and dams born in generation 14 for both traits. Again, the largest numerical differences arose from comparisons of when genotyping stopped at generations 13 and 14. The interaction between the genetic correlation and the way in which the missing records arose was significant for some trait, sire/dam and generation of birth combinations. Although this interaction was statistically significant, numerically the differences were not large, usually ranging from 0.01 to 0.03 (data not shown). The largest difference (0.05) was observed for the EBV accuracy of Trait 2 for sires born in generation 13 under sequential culling when comparing genetic correlations of 0.4 and 0.7.

Jia and Jannink (2012) used simulation to investigate the effect of genetic correlation on the prediction accuracy of two traits with multi‐trait genomic selection. One trait had a heritability of 0.1 while the other had a heritability of 0.8. As the genetic correlation increased, the prediction accuracy of the lowly heritable trait increased; however, the highly heritable trait saw no increase in prediction accuracy even as the genetic correlation increased from 0.1 to 0.9. In the current study, the effect of genetic correlation on EBV accuracy did not lead to large numerical differences given the moderate heritability of the traits.
Across all genetic correlations, generations of birth of the sires and dams, and both traits, the EBV accuracy consistently decreased by 0.01 when the percentage of records available decreased randomly from 100% to 90% and then again from 90% to 80%. Thus, randomly missing records did not have a large impact on EBV accuracy across the studied genetic correlations. Additionally, the accuracy of Trait 1 EBV for sires and dams was negligibly impacted by sequential culling; the differences in EBV accuracy were generally in the range of 0.01 regardless of the percentage of animals culled and the genetic correlation. The difference in Trait 2 EBV accuracy between no culling and 25% of Trait 2 records available was smallest (0.06) for sires born in generation 14 with a genetic correlation of 0.7. All other differences in EBV accuracy for sires and dams across the genetic correlations were approximately 0.12. In general, the EBV accuracies of Trait 2 under sequential culling increased as the percentage of culled data decreased, regardless of genetic correlation. Consequently, as more records were available due to less sequential culling, the EBV accuracies of Trait 2 approached the EBV accuracies of Trait 1.
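The pattern reported by Jia and Jannink (2012), in which a lowly heritable trait gains from a correlated trait while a highly heritable trait does not, can be illustrated with a textbook selection index built from an animal's own two phenotypes. This is a deliberately simplified sketch and not the genomic model of either study: the function name is hypothetical, phenotypic variances are standardized to 1, and the environmental correlation is assumed to be zero unless stated otherwise.

```python
import numpy as np


def index_accuracy(h2_1, h2_2, r_g, target, r_e=0.0):
    """Accuracy of an index of own phenotypes on both traits for one breeding value.

    target = 0 predicts the Trait 1 breeding value, target = 1 the Trait 2 one.
    Assumptions: unit phenotypic variances, no genotype-environment covariance.
    """
    h2 = np.array([h2_1, h2_2])
    cov_g = r_g * np.sqrt(h2_1 * h2_2)                # genetic covariance
    cov_e = r_e * np.sqrt((1 - h2_1) * (1 - h2_2))    # environmental covariance
    G = np.array([[h2_1, cov_g], [cov_g, h2_2]])      # genetic (co)variances
    P = np.array([[1.0, cov_g + cov_e],
                  [cov_g + cov_e, 1.0]])              # phenotypic (co)variances
    c = G[:, target]                                  # cov(phenotypes, target BV)
    # Accuracy = sqrt(c' P^-1 c / var(target BV))
    return float(np.sqrt(c @ np.linalg.solve(P, c) / h2[target]))


for r_g in (0.1, 0.5, 0.9):
    low = index_accuracy(0.1, 0.8, r_g, target=0)
    high = index_accuracy(0.1, 0.8, r_g, target=1)
    print(f"r_g={r_g}: low-h2 trait {low:.3f}, high-h2 trait {high:.3f}")
```

With heritabilities of 0.1 and 0.8, the index accuracy for the lowly heritable trait rises substantially as the genetic correlation increases, whereas the accuracy for the highly heritable trait barely moves. When both traits are moderately heritable, as in the current study, neither trait depends as strongly on information from the other.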

EBV accuracy of pools

Even though pools were constructed by trait, all pools received EBV for both traits. Figure 4 depicts the median EBV accuracies of the pools that were determined by Trait 1 and Figure 5 depicts the median EBV accuracies of the pools that were determined by Trait 2. Significant interactions were quite varied depending on if observing the trait in which the pools were made or the correlated trait. For example, when considering pools for Trait 1 and the EBV accuracy of Trait 1, significant interactions only included pool size by pooling strategy and genetic correlation by the percentage of available records nested within how the missing records arose. However, when considering pools for Trait 1 and the EBV accuracy of Trait 2, nearly all possible interactions were significant. When considering pools for Trait 2 and the EBV accuracy of either trait, nearly all interactions involving pool size and pooling strategy were significant.
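Pool EBV accuracy in this study is estimated as the correlation between the average TBV of the individuals within a pool and the EBV of that pool. A minimal sketch of that calculation follows; the function name, the integer pool labels and the toy values are assumptions made for illustration only.

```python
import numpy as np


def pool_ebv_accuracy(tbv, pool_ids, pool_ebv):
    """Correlation between each pool's mean TBV and that pool's EBV.

    tbv      : true breeding value of every animal
    pool_ids : integer pool label (0 .. n_pools - 1) of every animal
    pool_ebv : EBV of every pool, indexed by pool label
    """
    tbv = np.asarray(tbv, dtype=float)
    pool_ids = np.asarray(pool_ids)
    pool_ebv = np.asarray(pool_ebv, dtype=float)
    n_pools = len(pool_ebv)
    counts = np.bincount(pool_ids, minlength=n_pools)
    mean_tbv = np.bincount(pool_ids, weights=tbv, minlength=n_pools) / counts
    return float(np.corrcoef(mean_tbv, pool_ebv)[0, 1])


# Toy example: six animals in three pools of two (values are made up)
tbv = [0.2, 0.4, -0.1, 0.1, 1.0, 1.2]
pool_ids = [0, 0, 1, 1, 2, 2]
pool_ebv = [0.25, 0.05, 0.9]
acc = pool_ebv_accuracy(tbv, pool_ids, pool_ebv)
```

Because every pool receives an EBV for both traits, the same function can be applied to the trait on which the pools were formed and to the correlated trait by passing the corresponding TBV and pool EBV.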

FIGURE 4

Trait 1 pools' estimated breeding value (EBV) accuracies (estimated as the correlation between the average true breeding value [TBV] of the individuals within the pool and EBV of the pool). Pools resulting from different genetic correlations, how missing records occur (random missing = missing records occur randomly; sequential culling = missing records occur because of sequential culling), pooling strategies (random = randomly allocated to pools; minimize = minimize phenotypic variation within pools), percent of available records (80% = 80% of Trait 1 and Trait 2 records are available; 90% = 90% of Trait 1 and Trait 2 records are available; 100% = 100% of Trait 1 and Trait 2 records are available), individuals up to and including those born in generation 14 were genotyped (Gen14) and pool sizes with ranges in accuracy along the x‐axis

FIGURE 5

Trait 2 pools' estimated breeding value (EBV) accuracies (estimated as the correlation between the average true breeding value [TBV] of the individuals within the pool and predicted EBV of the pool). Pools resulting from different genetic correlations, how missing records occur (random missing = missing records occur randomly; sequential culling = missing records occur because of sequential culling), pooling strategies (random = randomly allocated to pools; minimize = minimize phenotypic variation within pools), percent of available records (80% = 80% of Trait 1 and Trait 2 records are available; 90% = 90% of Trait 1 and Trait 2 records are available; 100% = 100% of Trait 1 and Trait 2 records are available), individuals up to and including those born in generation 14 were genotyped (Gen14) and pool sizes with ranges in accuracy along the x‐axis

A few conclusions can be drawn about the EBV accuracies of the pools.
As long as the pools were constructed to minimize phenotypic variation, the EBV accuracy of the pools was generally highest for pool sizes of 100 and lowest for pool sizes of 1 for the trait for which the pools were made. This is consistent with Baller et al. (2020). When the genetic correlation between the traits was high (0.7), the same pattern was true for the correlated trait. In fact, the EBV accuracy was almost as high for the correlated trait as for the trait the pools were made for. As the genetic correlation decreased to 0.4, the EBV accuracy of the correlated trait began to decrease, especially compared with the EBV accuracy of the trait the pools were made for (data not shown). The EBV accuracy of any pool size was generally larger than that of a pool size of 1. When considering the genetic correlation of 0.1, the EBV accuracies of pools for the alternate trait resulting from any pool size were approximately the same. When considering sequential culling and a genetic correlation of 0.1, the EBV accuracies of the correlated trait resulting from pools of 100, 50 and 20 were less than the accuracy from a pool size of 1. When considering pools formed randomly, the EBV accuracies of pools generally increased as pool size decreased, which is also consistent with Baller et al. (2020). This is expected given that when pools are formed randomly and pool size increases, the variation among pools decreases. This pattern was observed for both traits regardless of which trait the pools were made for.
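The contrast between the two pooling strategies can be sketched by comparing the variation among pool mean phenotypes. The code below is an illustration under stated assumptions, not the authors' procedure: function names are hypothetical, phenotypes are simulated as standard normal values, and "minimize" is implemented by sorting animals on phenotype and filling pools with consecutive animals, which is one simple way to keep within-pool variation small.

```python
import numpy as np


def make_pools(phenotypes, pool_size, strategy, rng):
    """Assign every animal an integer pool label.

    strategy = "random"   : animals are allocated to pools at random
    strategy = "minimize" : animals are sorted on phenotype so that each
                            pool holds phenotypically similar animals (assumed)
    """
    n = len(phenotypes)
    if n % pool_size != 0:
        raise ValueError("number of animals must be a multiple of pool_size")
    if strategy == "random":
        order = rng.permutation(n)
    elif strategy == "minimize":
        order = np.argsort(phenotypes)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    pool_ids = np.empty(n, dtype=int)
    pool_ids[order] = np.arange(n) // pool_size
    return pool_ids


def variance_among_pool_means(phenotypes, pool_ids):
    """Sample variance of the pool mean phenotypes."""
    sums = np.bincount(pool_ids, weights=phenotypes)
    counts = np.bincount(pool_ids)
    return float(np.var(sums / counts, ddof=1))


rng = np.random.default_rng(7)
y = rng.normal(size=10_000)          # hypothetical phenotypes for one trait
results = {}
for strategy in ("random", "minimize"):
    for size in (1, 20, 50, 100):    # pool sizes used in the study
        ids = make_pools(y, size, strategy, rng)
        results[(strategy, size)] = variance_among_pool_means(y, ids)
```

With random allocation the variance among pool means shrinks roughly in proportion to 1/pool size, so larger random pools carry less contrast between pools. With phenotype-sorted pools most of the phenotypic variance remains among pools even at a size of 100.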

CONCLUSIONS

The results presented herein demonstrate the usefulness of pooled data in genetic evaluations that employ a bivariate model using single‐step GBLUP across a range of genetic correlations and scenarios in which missing values can arise. Similar to the univariate case, when pools were constructed to minimize phenotypic variation, pool sizes of at least 20 could be used to attain EBV accuracies not significantly different from those attained from individual data. Larger pool sizes (50 and 100) also led to improvement of EBV accuracies for sires born in the generation directly before pooling was initiated. There were no significant interactions of pool size or pooling strategy with either percentage of missing records nested within how the missing records arose or genetic correlation, suggesting the robustness of pooling recommendations in the bivariate case or when missing values are present. When considering pooling by minimizing phenotypic variation and a genetic correlation of 0.7, the EBV accuracy of pools was almost as high for the correlated trait as for the trait the pools were made for. As the genetic correlation decreased, the EBV accuracy of the correlated trait decreased, especially compared with the EBV accuracy of the trait the pools were made for. The results herein provide encouraging conclusions that as long as pools are made to minimize phenotypic variation, pooling can be used across a variety of genetic correlations and ways in which missing values arise to garner the use of commercial ERT within genetic evaluations.

CONFLICT OF INTEREST

The authors declare no conflicts of interest to objectively present this research. Mention of a trade name, proprietary product or specific equipment does not constitute a guarantee or warranty by the USDA and does not imply approval to the exclusion of other products that may be suitable. USDA is an equal opportunity provider and employer.

23 in total

1. DNA pooling identifies QTLs on chromosome 4 for general cognitive ability in children.

Authors: P J Fisher; D Turic; N M Williams; P McGuffin; P Asherson; D Ball; I Craig; T Eley; L Hill; K Chorney; M J Chorney; C P Benbow; D Lubinski; R Plomin; M J Owen
Journal: Hum Mol Genet Date: 1999-05 Impact factor: 6.150

Review 2. DNA Pooling: a tool for large-scale association studies.

Authors: Pak Sham; Joel S Bader; Ian Craig; Michael O'Donovan; Michael Owen
Journal: Nat Rev Genet Date: 2002-11 Impact factor: 53.242

3. Y are you not pregnant: identification of Y chromosome segments in female cattle with decreased reproductive efficiency.

Authors: T G McDaneld; L A Kuehn; M G Thomas; W M Snelling; T S Sonstegard; L K Matukumalli; T P L Smith; E J Pollak; J W Keele
Journal: J Anim Sci Date: 2012-03-09 Impact factor: 3.159

4. The impact of clustering methods for cross-validation, choice of phenotypes, and genotyping strategies on the accuracy of genomic predictions.

Authors: Johnna L Baller; Jeremy T Howard; Stephen D Kachman; Matthew L Spangler
Journal: J Anim Sci Date: 2019-04-03 Impact factor: 3.159

5. Geno-Diver: A combined coalescence and forward-in-time simulator for populations undergoing selection for complex traits.

Authors: J T Howard; F Tiezzi; J E Pryce; C Maltecca
Journal: J Anim Breed Genet Date: 2017-05-02 Impact factor: 2.380

6. Pooled genotyping strategies for the rapid construction of genomic reference populations.

Authors: Pâmela A Alexandre; Laercio R Porto-Neto; Emre Karaman; Sigrid A Lehnert; Antonio Reverter
Journal: J Anim Sci Date: 2019-12-17 Impact factor: 3.159

7. Selective DNA pooling for determination of linkage between a molecular marker and a quantitative trait locus.

Authors: A Darvasi; M Soller
Journal: Genetics Date: 1994-12 Impact factor: 4.562

8. Genetic evaluation using single-step genomic best linear unbiased predictor in American Angus.

Authors: D A L Lourenco; S Tsuruta; B O Fragomeni; Y Masuda; I Aguilar; A Legarra; J K Bertrand; T S Amen; L Wang; D W Moser; I Misztal
Journal: J Anim Sci Date: 2015-06 Impact factor: 3.159

9. Pooled sample-based GWAS: a cost-effective alternative for identifying colorectal and prostate cancer risk variants in the Polish population.

Authors: Pawel Gaj; Natalia Maryan; Ewa E Hennig; Joanna K Ledwon; Agnieszka Paziewska; Aneta Majewska; Jakub Karczmarski; Monika Nesteruk; Jan Wolski; Artur A Antoniewicz; Krzysztof Przytulski; Andrzej Rutkowski; Alexander Teumer; Georg Homuth; Teresa Starzyńska; Jaroslaw Regula; Jerzy Ostrowski
Journal: PLoS One Date: 2012-04-19 Impact factor: 3.240

10. Estimating the genetic merit of sires by using pooled DNA from progeny of undetermined pedigree.

Authors: Amy M Bell; John M Henshall; Laercio R Porto-Neto; Sonja Dominik; Russell McCulloch; James Kijas; Sigrid A Lehnert
Journal: Genet Sel Evol Date: 2017-02-28 Impact factor: 4.297

1 in total

1. Using pooled data for genomic prediction in a bivariate framework with missing data.

Authors: Johnna L Baller; Stephen D Kachman; Larry A Kuehn; Matthew L Spangler
Journal: J Anim Breed Genet Date: 2022-06-14 Impact factor: 3.271
