
The genetic connectedness calculated from genomic information and its effect on the accuracy of genomic prediction.

Suo-Yu Zhang¹, Babatunde Shittu Olasege¹, Deng-Ying Liu¹, Qi-Shan Wang¹,², Yu-Chun Pan¹,², Pei-Pei Ma¹.

Abstract

The reliability of genetic evaluation across management units (e.g., flocks and herds) depends on the magnitude of connectedness among these units. Traditionally, pedigree-based methods have been used to evaluate genetic connectedness in China. However, these methods have yielded limited results because the pedigree data lack accuracy and integrity. Therefore, it is necessary to ascertain genetic connectedness using genomic information (i.e., genome-based genetic connectedness). Moreover, the effects of various levels of genome-based genetic connectedness on the accuracy of genomic prediction remain poorly understood. A simulation study was performed to evaluate genome-based genetic connectedness across herds by applying the prediction error variance of difference (PEVD), the coefficient of determination (CD) and the prediction error correlation (r). Genomic estimated breeding values (GEBV) were predicted using a GBLUP model from a single and a joint reference population. Overall, CD and r increased continuously, with a corresponding decrease in PEVD, as the number of common sires increased from 0 to 19 regardless of heritability level, indicating increasing genetic connectedness between herds. Higher heritability tended to give stronger genetic connectedness. Compared to pedigree information, genomic relatedness inferred from genomic information increased the estimates of genetic connectedness across herds. Genomic prediction using the joint versus the single reference population increased the accuracy of genomic prediction by 25%, and traits with lower heritability benefited more. Moreover, the largest benefit was observed when the number of common sires was 0, and the gain in accuracy decreased as the number of common sires increased. We confirmed that genome-based genetic connectedness enhanced the estimates of genetic connectedness across management units. Additionally, using the combined reference population substantially increased the accuracy of genomic prediction. However, care should be taken when combining reference data for closely related populations, which may give less reliable prediction results.


Year: 2018 PMID: 30063724 PMCID: PMC6067733 DOI: 10.1371/journal.pone.0201400

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

The reliability of genetic evaluations across management units (e.g., flocks and herds) depends on the magnitude of connectedness among these units. Comparisons of estimated breeding values (EBVs) tend to be biased when connectedness across units is poor [1]. The lower the connectedness across units, the larger the bias and hence the lower the accuracy of comparisons of EBVs across units. It was reported that a few highly selected sires in dairy cattle populations generally create strong genetic links owing to the wide use of artificial insemination (AI) [2]. However, this is not the case in sheep, beef cattle or pig populations, where AI is less used, leading to poor or no genetic connectedness across management units. Therefore, it is necessary to estimate connectedness among management units in these species before conducting genetic evaluation across units. Traditionally, genetic connectedness has been calculated with pedigree-based methods [1-4]. However, the integrity and accuracy of the pedigree information used in China cannot be guaranteed, which in turn may lead to low or unreasonable estimates of genetic connectedness across pig nucleus farms in China [3, 5, 6]. The lack of extensive and reliable pedigree information is a general problem in developing countries [7], particularly in China, where the sources of pigs are extremely complex (e.g., pigs introduced from Denmark, the United States, Canada and France). Therefore, the actual genetic connectedness among Chinese pig farms might not be fully reflected by pedigree information because of inconsistent pedigree recording systems between China and the foreign countries [2]. Moreover, Yu et al. [8] confirmed that genomic relatedness inferred from genomic information (i.e., single nucleotide polymorphisms, SNPs) increased the estimates of genetic connectedness across different management units, compared with pedigree information.
Given the above, genetic connectedness can be ascertained through genomic information, which is a plausible way to obtain more accurate estimates of genetic connectedness across pig farms in China and to enhance the genetic improvement of Chinese pigs. Recently, connectedness statistics have been used in genomic selection [9] to optimize the design of the reference population [10, 11]. However, the effect of the enhanced genetic connectedness estimated from genomic relatedness on the accuracy of genomic prediction still needs to be investigated, as noted by Yu et al. [8]. In this study, we simulated populations that mimic existing pig populations in China, with the aim of measuring genetic connectedness across management units (i.e., populations) using genomic information and of investigating the effect of various levels of genetic connectedness across herds on the accuracy of genomic prediction.

Materials and methods

Simulation

A simulation scheme presented by E.C. Akanno [12] to mimic pig breeding programs in developing countries was adopted in our study to mimic the situation in China. The software QMSim [13] was used to simulate the genomic data, and the whole simulation process was repeated nine times. QMSim was designed to simulate a broad range of genetic architectures and population structures in livestock, and it can reliably simulate large-scale genotyping datasets and multiple livestock pedigrees. Populations are simulated in two steps: 1) a historical population is created to establish mutation-drift equilibrium, and 2) a recent population, which can be very complex, is simulated. A wide range of parameters (e.g., number of chromosomes, QTL and markers, crossover interference and location of QTL and markers) is available to simulate an appropriate genome. The simulator is efficient in time and memory [13].

Population structure

The populations were generated in three steps. In the first step, 1000 generations with a gradual decrease in population size from 5000 to 1050 were simulated, and then the population size was further decreased from 1050 to 200 over the following 1000 generations, in order to create initial linkage disequilibrium (LD) and establish mutation-drift equilibrium in the historical population (HP). In the second step, an expanded population (EP) was simulated by randomly choosing 100 founder males and 100 founder females from the last generation of the HP. To expand the population, six generations were simulated assuming 10 offspring per dam under random mating. In the third step, three recent populations (RP) (i.e., Herd1, Herd2 and Herd3) were simulated, each founded by 20 males and 400 females. This size represents the median group size for pig nucleus farms in China. Herd1 was composed of the top 20 males and top 400 females from the last generation of the EP, ranked on their own phenotypic values. To ensure that Herd1 had no connection with Herd2, Herd2 was simulated by selecting the last 20 males and the last 400 females from the EP. It is well recognized that genetic connectedness among pig herds in China is generally established through the use of common sires (i.e., sires with progeny in multiple herds or sires born in one herd with progeny in another herd) or through the transfer of seedstock from one herd to another [3]. Therefore, to mimic the genetic connectedness created by common sires, the 400 founder females of Herd3 all came from the first generation of Herd2, while the 20 founder males of Herd3 came from Herd1 and Herd2. If the number of founder males of Herd1 used as common sires is n (0 ≤ n ≤ 19), then the number of remaining males taken from Herd2 is 20 − n. Increasing n increased the genetic connectedness between Herd1 and Herd3.
Moreover, the RP parameters used in this study closely mimicked a real Chinese pig production system, with selection for high EBV and culling for low EBV, and a replacement rate of 100% for sires and 40% for dams. The best linear unbiased prediction (BLUP) method was used to estimate breeding values using Henderson's mixed model theory [14] for an animal model. Three traits corresponding to number born alive, average daily gain and backfat were mimicked, with heritabilities and phenotypic variances taken from a previous study by Akanno et al. [15]. Considering computing time and memory requirements, only two generations of each RP were simulated. Herd1 and Herd3 each had 2020 individuals, made up of 420 founders and 800 progeny from each of the first and second generations. Details of the parameters used to generate the genomic data are given in Table 1, and the simulation steps are described in Fig 1.
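The founder composition of Herd3 and the head count per herd can be sketched as follows. This is a minimal illustration, not the QMSim setup used in the study: the animal IDs, the helper name `herd3_founders` and the random choice of which sires are taken from each herd are assumptions, since the text only fixes the numbers n and 20 − n.

```python
import random

def herd3_founders(herd1_sires, herd2_sires, herd2_gen1_dams, n, seed=0):
    """Assemble Herd3 founders: n common sires from Herd1, 20 - n sires from Herd2.

    Which individual sires are picked is an assumption (random sampling);
    the source only states how many come from each herd.
    """
    if not 0 <= n <= 19:
        raise ValueError("n must satisfy 0 <= n <= 19")
    rng = random.Random(seed)
    common = rng.sample(herd1_sires, n)          # sires shared with Herd1
    rest = rng.sample(herd2_sires, 20 - n)       # remaining sires from Herd2
    return common + rest, list(herd2_gen1_dams)  # all 400 dams come from Herd2

# Hypothetical IDs, for illustration only
herd1_sires = [f"H1_M{i}" for i in range(20)]
herd2_sires = [f"H2_M{i}" for i in range(20)]
herd2_dams = [f"H2_G1_F{i}" for i in range(400)]

sires, dams = herd3_founders(herd1_sires, herd2_sires, herd2_dams, n=5)

# Head count per herd: 420 founders plus 400 dams x 2 offspring
# in each of the 2 simulated generations.
progeny_per_generation = 400 * 2
herd_size = (20 + 400) + 2 * progeny_per_generation
```

With these numbers, `herd_size` reproduces the 2020 individuals per herd stated in the text.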

Table 1

Parameters of the simulation process.

Population structure	Parameters
Step1: Historical population (HP)
Number of generations (size)–phase 1	1000 (1050)
Number of generations (size)–phase 2	1000 (200)
Step2: Expanded population (EP)
Number of males from HP	100
Number of females from HP	100
Number of generations	6
Number of offspring per dam	10
Step3: Recent populations (RP)
Number of males from EP	20
Number of females from EP	400
Number of offspring per dam	2
Ratio of male	0.5
Number of generations	2
Replacement ratio for males	100%
Replacement ratio for females	40%
Selection /culling	EBV
Breeding value estimation method	BLUP
Traits
Number born alive	h² = 0.08, σp² = 7.73
Average daily gain, g/d	h² = 0.28, σp² = 10361.20
Backfat, mm	h² = 0.63, σp² = 20.88
Genome
Number of chromosomes	18
Genome length per chromosome	100 cM
Number of markers per chromosome	3300
Number of QTL per chromosome	25
Minor allele frequency (MAF)	≥ 0.05
Mutation rate of marker locus	2.5 × 10^-3
Mutation rate of QTL locus	2.5 × 10^-5

EBV: estimated breeding value; BLUP: best linear unbiased prediction; h²: heritability; σp²: phenotypic variance; QTL: quantitative trait loci.
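The heritabilities and phenotypic variances in Table 1 imply the variance components that enter the mixed model equations. The sketch below derives them under the assumption of a purely additive model in which the phenotypic variance is the sum of the additive genetic and residual variances; the dictionary keys and the function name are illustrative only.

```python
# h2 and phenotypic variance per trait, as listed in Table 1
traits = {
    "number_born_alive": (0.08, 7.73),
    "average_daily_gain": (0.28, 10361.20),
    "backfat": (0.63, 20.88),
}

def variance_components(h2, sigma_p2):
    """Return (additive variance, residual variance, residual/additive ratio).

    Assumes sigma_p2 = sigma_g2 + sigma_e2 (additive effects plus residual only).
    """
    sigma_g2 = h2 * sigma_p2
    sigma_e2 = sigma_p2 - sigma_g2
    return sigma_g2, sigma_e2, sigma_e2 / sigma_g2

components = {name: variance_components(h2, vp) for name, (h2, vp) in traits.items()}
```

Under this assumption the variance ratio equals (1 − h²)/h², so it depends only on the heritability and is largest for the low-heritability trait.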

Fig 1

A sketch map of simulation process.

Note: Ne: effective population size; LD: linkage disequilibrium.


Genome

The genome parameters were consistent with a previous study [16]. To create a realistic pig genome size, each chromosome was simulated with an average length of 100 cM [17]. The marker density approximated that of the 60 K SNP chip currently available [18]. The parameters shown in Table 1 were used to simulate the genome.
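As a quick check of the genome-wide totals implied by the per-chromosome values in Table 1, the following sketch multiplies them out; the variable names are illustrative.

```python
# Per-chromosome genome parameters from Table 1
chromosomes = 18
markers_per_chromosome = 3300
qtl_per_chromosome = 25
length_cm_per_chromosome = 100

total_markers = chromosomes * markers_per_chromosome      # genome-wide SNP markers
total_qtl = chromosomes * qtl_per_chromosome              # genome-wide QTL
genome_length_cm = chromosomes * length_cm_per_chromosome # total map length in cM
```

The marker total of 59,400 is what the text rounds to a 60 K SNP chip.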

Genetic connectedness criteria

We used the prediction error variance of differences (PEVD), the generalized coefficient of determination (CD) and the prediction error correlation (r), defined below, to investigate genetic connectedness between Herd1 and Herd3. The prediction error variances (PEV) were obtained from Henderson's mixed model equations (MME) [14], and the PEV of the ith individual is given by

$$\mathrm{PEV}(\hat{g}_i) = \mathrm{Var}(\hat{g}_i - g_i) = D^{22}_{ii}\,\sigma^2_e$$

where $D^{22}_{ii}$ is the ith diagonal element of $D^{22}$, the block of the inverse of the MME coefficient matrix (D) corresponding to genetic values, and $\sigma^2_e$ is the residual variance. A detailed description of the genetic connectedness criteria was provided by Yu et al. [8].

PEVD is the average PEV of all pairwise EBV differences between individuals across management units [2], and for a pair of individuals it is calculated as

$$\mathrm{PEVD}(\hat{g}_i - \hat{g}_j) = \mathrm{PEV}(\hat{g}_i) + \mathrm{PEV}(\hat{g}_j) - 2\,\mathrm{PEC}_{ij} = \left(D^{22}_{ii} + D^{22}_{jj} - 2D^{22}_{ij}\right)\sigma^2_e$$

where $\hat{g}_i$ and $\hat{g}_j$ represent the genetic values of individual i and individual j, respectively, and $\mathrm{PEC}_{ij}$ is the prediction error covariance (PEC), given by the off-diagonal element of the PEV matrix. PEVD is used as a criterion of genetic connectedness because poorly connected individuals have a higher prediction error than strongly connected ones. In this study, a scaled PEVD was used for further analysis, following Kuehn's suggestion [19]. A smaller PEVD indicates stronger connectedness.

CD, the generalized coefficient of determination [20], is calculated as

$$\mathrm{CD}_{ij} = 1 - \lambda\,\frac{D^{22}_{ii} + D^{22}_{jj} - 2D^{22}_{ij}}{R_{ii} + R_{jj} - 2R_{ij}}$$

where $\lambda = \sigma^2_e/\sigma^2_g$ is the ratio of the residual variance to the additive genetic variance, $D^{22}_{ii}$, $D^{22}_{jj}$ and $D^{22}_{ij}$ are as defined above, and R is a relationship matrix that measures the relationship between individuals (defined below). This statistic ranges from 0 to 1, with larger values representing stronger connectedness.

The r between the genetic values of individuals from different management units is derived as [4]

$$r_{ij} = \frac{\mathrm{PEC}_{ij}}{\sqrt{\mathrm{PEV}(\hat{g}_i)\,\mathrm{PEV}(\hat{g}_j)}}$$

Similar to CD, the statistic r ranged from 0 to 1, and a larger r indicated stronger connectedness across management groups.
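To make the three criteria concrete, the sketch below computes them for one pair of animals from the inverse of the MME coefficient matrix. It is an illustrative sketch under stated assumptions, not the authors' code: the model is y = Xb + Zg + e with herd as the only fixed effect, the toy design (four animals, two herds) and the function name `connectedness` are hypothetical, and the PEVD is left unscaled.

```python
import numpy as np

def connectedness(X, Z, R, sigma_g2, sigma_e2, i, j):
    """Return (PEVD, CD, r) for animals i and j under y = Xb + Zg + e.

    R is the relationship matrix among animals; PEVD is not scaled here.
    """
    lam = sigma_e2 / sigma_g2
    # Mixed model equation coefficient matrix
    top = np.hstack([X.T @ X, X.T @ Z])
    bottom = np.hstack([Z.T @ X, Z.T @ Z + lam * np.linalg.inv(R)])
    D = np.linalg.inv(np.vstack([top, bottom]))
    p = X.shape[1]
    D22 = D[p:, p:]                 # block of the inverse for genetic values
    pev = D22 * sigma_e2            # diagonals are PEV, off-diagonals are PEC
    pevd = pev[i, i] + pev[j, j] - 2.0 * pev[i, j]
    cd = 1.0 - lam * (D22[i, i] + D22[j, j] - 2.0 * D22[i, j]) / (
        R[i, i] + R[j, j] - 2.0 * R[i, j]
    )
    r = pev[i, j] / np.sqrt(pev[i, i] * pev[j, j])
    return pevd, cd, r

# Toy design: animals 0-1 in herd A, animals 2-3 in herd B, one record each
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
Z = np.eye(4)

R_unrelated = np.eye(4)             # no relationships across herds
R_related = np.eye(4)
R_related[0, 2] = R_related[2, 0] = 0.25   # animals 0 and 2 as half-sibs

pevd_u, cd_u, r_u = connectedness(X, Z, R_unrelated, 0.28, 0.72, 0, 2)
pevd_c, cd_c, r_c = connectedness(X, Z, R_related, 0.28, 0.72, 0, 2)
```

In the unrelated toy case the genetic-value block of the inverse is block-diagonal by herd, so the prediction error correlation between animals 0 and 2 is numerically zero. Averaging the pairwise values over all cross-herd pairs would give herd-level statistics of the kind described above.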

Relationship matrix

Connectedness is determined in the BLUP framework using the genetic relationship matrix. Information about the covariance structure among individuals is required to compute the three genetic connectedness criteria stated above [8]. In this study, the four relationship matrices (R) measuring the relationship among individuals are the same as in the previous study by Yu et al. [8] and are defined below.

Firstly, R = A, the usual numerator relationship matrix. When genetic evaluation is under an animal model, connectedness arises through A [2]. A is calculated directly from the known pedigree and expresses the probability that alleles are inherited from a common ancestor, i.e., that they are identical by descent (IBD). The off-diagonal elements are twice the coefficients of kinship and are equivalent to the numerators of Wright's correlation coefficients [21].

Secondly, R = G, the basic genomic relationship matrix, constructed according to method 1 described by VanRaden [22], i.e.,

$$G = \frac{M M^{T}}{2\sum_{i} p_i (1 - p_i)}$$

where the elements in column i of M are $0 - 2p_i$, $1 - 2p_i$ and $2 - 2p_i$ for genotypes A1A1, A1A2 and A2A2, respectively, and $p_i$ is the frequency of allele A2 at locus i, calculated from the available marker data.

Thirdly, because negative values are generated in this scenario, R = $G_{0.5}$, which supposes that the allele frequencies in the base population are unknown and uses 0.5 for all $p_i$. The matrix constructed in this way did not create any negative values for the simulated data.

Fourthly, when comparing marker-based with pedigree-based relationship matrices, the genomic relationship matrix needs to be scaled for the genetic connectedness criteria to be interpretable. A reasonable rescaling makes the genomic elements range between 0 and 2, which are the minimum and maximum values of A, respectively. Therefore, to render G on the same scale as A, a scaled G matrix ($G_s$) was created, and the scaled genomic relationship between the ith and jth individuals was given by

$$G^{s}_{ij} = \frac{\left(G_{ij} - G_{\min}\right)\left(G^{s}_{\max} - G^{s}_{\min}\right)}{G_{\max} - G_{\min}} + G^{s}_{\min}$$

where $G^{s}_{ij}$ is an element of $G_s$ and $G_{ij}$ is the corresponding element of G. $G^{s}_{\max} = 2$ and $G^{s}_{\min} = 0$ are the maximum and minimum values that elements of the scaled matrix are allowed to take, respectively, while $G_{\max}$ and $G_{\min}$ are the maximum and minimum elements of G. By construction, $G_s$ does not contain any negative values.

Finally, to simulate a more realistic scenario in which not all individuals in the population are genotyped, the H matrix (i.e., a relationship matrix combining pedigree and genomic information) was given by [23-25]

$$H = \begin{bmatrix} G_{\omega} & G_{\omega} A_{11}^{-1} A_{12} \\ A_{12}^{T} A_{11}^{-1} G_{\omega} & A_{22} + A_{12}^{T} A_{11}^{-1} \left(G_{\omega} - A_{11}\right) A_{11}^{-1} A_{12} \end{bmatrix}$$

where $A_{11}$, $A_{22}$ and $A_{12}$ are submatrices of A representing relationships among genotyped individuals, among non-genotyped individuals, and between genotyped and non-genotyped individuals, respectively, and the superscript T denotes the transpose of a matrix. The matrix $G_{\omega}$ describes the relationships among genotyped individuals and is defined as

$$G_{\omega} = (1 - \omega)\,G + \omega A_{11}$$

where ω represents the fraction of the genetic variance not captured by markers, and G is one of G, $G_{0.5}$ and $G_s$ defined above. In this study, individuals from generations 0–1 (N = 2440) were treated as non-genotyped, while individuals from generation 2 (N = 1600) were genotyped. This mimics a real scenario in which individuals from more recent generations are likely to be genotyped, with a relatively small sample size compared with individuals from earlier generations.
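The genomic matrices described in this section (VanRaden's method 1 with observed allele frequencies, the variant with all frequencies fixed at 0.5, the matrix rescaled to the range 0 to 2, and the combined pedigree-genomic H matrix) can be sketched in NumPy as below. This is a sketch under assumptions rather than the authors' implementation: genotypes are coded as 0/1/2 copies of one allele, the function names are hypothetical, the ordering convention for H (genotyped animals flagged by a boolean mask) is one common formulation, and the value ω = 0.05 is a placeholder because the source does not state the value used.

```python
import numpy as np

def vanraden_g(M, p=None):
    """VanRaden method 1 from an (animals x markers) matrix of 0/1/2 allele counts.

    p=None uses observed allele frequencies; a float (e.g. 0.5) fixes all loci.
    """
    M = np.asarray(M, dtype=float)
    if p is None:
        freq = M.mean(axis=0) / 2.0
    else:
        freq = np.full(M.shape[1], float(p))
    W = M - 2.0 * freq                      # elements 0-2p, 1-2p, 2-2p
    return W @ W.T / (2.0 * np.sum(freq * (1.0 - freq)))

def scale_to_pedigree_range(G, lo=0.0, hi=2.0):
    """Min-max rescale the elements of G to [lo, hi], the range of A."""
    return (G - G.min()) * (hi - lo) / (G.max() - G.min()) + lo

def blend(G, A_gg, omega):
    """Weighted genomic matrix: (1 - omega) * G + omega * A_gg."""
    return (1.0 - omega) * G + omega * A_gg

def h_matrix(A, G_w, genotyped):
    """Combined matrix H, with animals kept in the same order as in A."""
    A = np.asarray(A, dtype=float)
    genotyped = np.asarray(genotyped, dtype=bool)
    g = np.flatnonzero(genotyped)
    n = np.flatnonzero(~genotyped)
    A_gg, A_gn, A_nn = A[np.ix_(g, g)], A[np.ix_(g, n)], A[np.ix_(n, n)]
    P = np.linalg.solve(A_gg, A_gn)         # A_gg^{-1} A_gn
    H = np.empty_like(A)
    H[np.ix_(g, g)] = G_w
    H[np.ix_(g, n)] = G_w @ P
    H[np.ix_(n, g)] = P.T @ G_w
    H[np.ix_(n, n)] = A_nn + P.T @ (G_w - A_gg) @ P
    return H

# Toy genotypes: 6 animals x 50 markers (random, for illustration only)
rng = np.random.default_rng(1)
M = rng.integers(0, 3, size=(6, 50))
G = vanraden_g(M)
G05 = vanraden_g(M, p=0.5)
Gs = scale_to_pedigree_range(G)

# Toy pedigree: two unrelated parents (0, 1) and two genotyped full sibs (2, 3)
A = np.array([[1.0, 0.0, 0.5, 0.5],
              [0.0, 1.0, 0.5, 0.5],
              [0.5, 0.5, 1.0, 0.5],
              [0.5, 0.5, 0.5, 1.0]])
genotyped = np.array([False, False, True, True])
G_w = blend(vanraden_g(M[:2], p=0.5), A[2:, 2:], omega=0.05)  # omega is a placeholder
H = h_matrix(A, G_w, genotyped)
```

A useful sanity check on the H construction is that substituting the pedigree block for the genomic block returns A unchanged, which is what the formula implies when genomic and pedigree relationships agree.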

Population structures of the simulated populations

Principal component analysis (PCA) was used to investigate the population structure of Herd1 and Herd3. PCA was performed using PLINK software [26], and the PC plots were drawn with the ggplot2 package [27].
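For readers who want to see what such an analysis does to a genotype matrix, the sketch below computes principal component scores with a plain singular value decomposition in NumPy. It is not PLINK's procedure, and the two toy "herds" with different allele frequencies (0.2 and 0.8) are invented purely to show that the first component separates differentiated groups.

```python
import numpy as np

def genotype_pcs(M, n_pcs=2):
    """Principal component scores of an (animals x markers) 0/1/2 genotype matrix."""
    M = np.asarray(M, dtype=float)
    centered = M - M.mean(axis=0)                  # center each marker
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    scores = U[:, :n_pcs] * S[:n_pcs]              # animal coordinates on the PCs
    explained = (S ** 2) / np.sum(S ** 2)          # share of variance per PC
    return scores, explained[:n_pcs]

# Toy data: two groups of 30 animals with contrasting allele frequencies
rng = np.random.default_rng(0)
herd_a = rng.binomial(2, 0.2, size=(30, 200))
herd_b = rng.binomial(2, 0.8, size=(30, 200))
scores, explained = genotype_pcs(np.vstack([herd_a, herd_b]))
```

Plotting the first two columns of `scores`, colored by herd, gives the kind of PC plot described above.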

Prediction of genomic breeding values

To investigate the impact of various levels of genetic connectedness inferred from genomic information on the accuracy of genomic prediction, genomic breeding values were predicted using GBLUP with the different genomic relationship matrices (the basic G, the matrix with all allele frequencies set to 0.5, and the scaled matrix) defined above. In addition, we examined the predictive ability of the two other relationship matrices (i.e., A and H) to better understand the possible effects of genomic connectedness on genomic prediction. The model was the same as the GBLUP model shown below, except that the genomic relationship matrix was replaced by A or H when predicting the (G)EBV. The basic GBLUP model [22, 28] was defined as

$$y = \mathbf{1}\mu + Zg + \varepsilon$$

where y is the vector of simulated phenotypes, μ is the population mean, 1 is a vector of ones, g is the vector of breeding values, ε is the vector of residuals and Z is the design matrix relating phenotypes to breeding values. It was assumed that $g \sim N(0, G\sigma^2_g)$ and $\varepsilon \sim N(0, I\sigma^2_e)$, where G is the genomic relationship matrix, $\sigma^2_g$ is the additive genetic variance, I is the identity matrix and $\sigma^2_e$ is the residual variance.
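A compact way to see how GEBV follow from this model is the sketch below, which solves it with known variance components. It is an illustration under assumptions, not the software used in the study: the function name `gblup`, the toy relationship matrix and the phenotypes are invented, and the variance components are treated as known rather than estimated. The solution is written in terms of the phenotypic covariance matrix, so animals without records (the validation animals) receive predictions through their relationships with the reference animals.

```python
import numpy as np

def gblup(y, Z, G, sigma_g2, sigma_e2):
    """BLUP of breeding values for y = 1*mu + Z g + e with known variances.

    Z has one row per record and one column per animal, so animals without
    records simply have all-zero columns.
    """
    n = len(y)
    ones = np.ones(n)
    V = sigma_g2 * Z @ G @ Z.T + sigma_e2 * np.eye(n)   # Var(y)
    Vinv_y = np.linalg.solve(V, y)
    Vinv_1 = np.linalg.solve(V, ones)
    mu = (ones @ Vinv_y) / (ones @ Vinv_1)              # GLS estimate of the mean
    gebv = sigma_g2 * G @ Z.T @ np.linalg.solve(V, y - mu * ones)
    return mu, gebv

# Toy example: 8 animals, the first 5 with phenotypes (reference),
# the last 3 without (validation). All values are made up.
rng = np.random.default_rng(2)
W = rng.normal(size=(8, 20))
G_toy = W @ W.T / 20.0                  # stand-in for a genomic relationship matrix
Z = np.eye(8)[:5]                       # records belong to animals 0..4
y = 10.0 + rng.normal(size=5)
sigma_g2, sigma_e2 = 0.28, 0.72

mu_hat, gebv = gblup(y, Z, G_toy, sigma_g2, sigma_e2)
```

Replacing `G_toy` with a pedigree-based or combined relationship matrix gives the corresponding pedigree or single-step predictions within the same function.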

Reference and validation data

The Herd1 data were divided into reference and validation data by generation. The reference population consisted of 1220 individuals, comprising 420 founders and 800 progeny from the first generation. The validation population comprised 800 individuals from the second generation. To avoid inflating the accuracy of genomic prediction, the 1220 individuals from the founders and the first generation of Herd3 were included in the joint reference population. The accuracy of genomic prediction was estimated as the correlation between the predicted genomic estimated breeding values (GEBV) and the true breeding values of the animals in the validation set.
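The split by generation and the accuracy measure can be sketched as follows. The generation labels reproduce the counts given above; the true breeding values and GEBV are random placeholders used only to show the calculation, and the function name is hypothetical.

```python
import numpy as np

def prediction_accuracy(gebv, tbv, validation_mask):
    """Pearson correlation between GEBV and true breeding values in the validation set."""
    return np.corrcoef(gebv[validation_mask], tbv[validation_mask])[0, 1]

# Herd1: 420 founders (generation 0), 800 progeny in each of generations 1 and 2
generation = np.repeat([0, 1, 2], [420, 800, 800])
reference = generation < 2          # founders and first generation
validation = generation == 2        # second generation

single_reference_size = int(reference.sum())
joint_reference_size = single_reference_size + 1220   # plus Herd3 generations 0-1

# Placeholder values, for illustration only
rng = np.random.default_rng(3)
tbv = rng.normal(size=generation.size)
gebv = tbv + rng.normal(size=generation.size)          # noisy stand-in predictions
accuracy = prediction_accuracy(gebv, tbv, validation)
```

The joint reference size of 2440 matches the number of generation 0–1 individuals across Herd1 and Herd3 reported for the non-genotyped group in the H matrix.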

Results

The genetic connectedness criteria between Herd1 and Herd3 for varying numbers of common sires at heritabilities of 0.08, 0.28 and 0.63 are presented in Fig 2, Fig 3 and Fig 4, respectively. PEVD, CD and r gave similar results.