
Accuracy of genotype imputation based on reference population size and marker density in Hanwoo cattle.

DooHo Lee¹, Yeongkuk Kim¹, Yoonji Chung¹, Dongjae Lee¹, Dongwon Seo¹, Tae Jeong Choi², Dajeong Lim³, Duhak Yoon⁴, Seung Hwan Lee¹.

Abstract

The cattle genome sequence has been completed, and commercial single nucleotide polymorphism (SNP) chip panels have since been developed for the animal genome industry. To increase the statistical power for detecting quantitative trait loci (QTL), a large number of animals should be genotyped. However, genotyping many animals with a high-density chip increases the genotyping cost. Therefore, statistical inference of genotypes by imputation (from a low-density chip to a high-density chip) is useful in the animal industry. The purpose of this study was to investigate the effect of reference population size and marker density on imputation accuracy and to suggest an appropriate reference population size for imputation in Hanwoo cattle. A total of 3,821 Hanwoo cattle were divided into reference and validation populations. The reference sets consisted of 50k (38,916) marker data with different population sizes (500, 1,000, 1,500, 2,000, and 3,600). The validation population (889 animals in total) was split into four validation sets with different marker densities (5k [5,000], 10k [10,000], and 15k [15,000]). Imputation accuracy was calculated by direct comparison of the true and imputed genotypes. With the lowest marker density (5k) in the validation set, the imputation accuracy ranged from 0.793 to 0.929 depending on the reference population size. With the highest marker density (15k), the imputation accuracy ranged from 0.904 to 0.967 depending on the reference population size. Moreover, the reference population size should be more than 1,000 to obtain at least 88% imputation accuracy in Hanwoo cattle. © Copyright 2021 Korean Society of Animal Science and Technology.

Keywords: Hanwoo cattle; Imputation; Marker density; Reference population size; Single nucleotide polymorphism (SNP)

Year: 2021 PMID: 34957440 PMCID: PMC8672260 DOI: 10.5187/jast.2021.e117

Source DB: PubMed Journal: J Anim Sci Technol ISSN: 2055-0391

INTRODUCTION

The complete cattle genome has been sequenced, and Illumina (San Diego, CA, USA) and Affymetrix (Santa Clara, CA, USA) have developed commercial single nucleotide polymorphism (SNP) chips based on array technology [1]. The development of SNP panels has enabled many studies, such as genome-wide association studies (GWAS) and best linear unbiased prediction (BLUP) studies [2]. Many genetic markers associated with objective breeding traits have been identified for marker-assisted selection [3]. Using a high-density SNP panel in a GWAS increases the probability of finding quantitative trait locus (QTL) regions [4]. High-density SNP panels also increase the accuracy of genomic breeding value estimation using genomic BLUP [5-8]. However, genotyping all animals in a population is very difficult because of the cost of high-density chips. In addition, SNP panels from different platforms, which may differ in density or chip version, are not completely compatible. Imputation from low- to high-density data is an alternative [9]. Genotype imputation refers to the statistical inference of genotypes and includes family-based and population-based methods. Family-based methods require sufficient pedigree information to compare reference and test groups, so they are difficult to apply when pedigree information is missing or pedigree depth is insufficient [10,11]. Population-based methods predict the missing genotypes of animals genotyped at low density by referring to a reference population genotyped at high density. These methods use a haplotype library and haplotype clustering to find the most appropriate haplotype and genotype [12-15].
Many factors affect the imputation accuracy of population-based methods, including the reference population size, the relationship between animals in the reference and test populations, the minor allele frequency of the SNP to be imputed, the proportion of missing genotypes on the low- and high-density panels, marker density, population structure, and the level of linkage disequilibrium (LD) [16]. Generally, family-based methods aim to identify the animals to be sequenced, while population-based methods aim to impute the genotypes of unrelated individuals. Many studies have examined ways to increase imputation accuracy using population-based imputation software, such as fastPHASE [17], Beagle [18], Minimac [19], and findhap.f90 [20]. However, no studies have examined imputation accuracy in Hanwoo cattle according to the reference population size and marker density. Hanwoo is a native taurine cattle breed in Korea that has been bred as a draft animal for 5,000 years. Over time, Hanwoo cattle have been bred for meat production and have become very popular despite high prices because of their marbling, tenderness, juiciness, and unique flavor [21]. Therefore, this study investigated the efficacy of imputing high-density genotypes from a low-density chip according to the reference population size and marker density, and proposes an appropriate reference population size for high-quality imputation in Hanwoo cattle.

MATERIALS AND METHODS

Genotype data

The data set (50K genotypes) used in this study was provided by a previous research project (BioGreen21, Hanwoo Research Institutes of National Institute of Animal Science, RDA) and a current research project (Bridge Project of NIAS, RDA). To investigate imputation accuracy, 3,821 animals were randomly selected from the population.

Quality control

Genotype data were converted using GenomeStudio (Illumina) ver. 2.0 software with a genotyping module to fit the analysis software format: Illumina data file (.bsc) to genotype file format (.ped). We removed SNPs on unknown chromosomes and sex chromosomes before the next steps. Quality control was performed using plink1.9 software [22]. The raw data had a 95.55% genotyping rate, so phasing with inference of the missing genotypes was performed as a pre-imputation step so that the data could serve as a reference population. SNP data were subjected to strict quality control to minimize the impact of genotyping error on imputation accuracy, using the following thresholds: minor allele frequency (0.01), genotyping call rate (0.9), missing individuals (0.1), and Hardy-Weinberg equilibrium test p-value (0.0001). After quality control, a total of 38,933 SNPs were used for analysis. The number of SNPs on each chromosome before and after quality control is given in Table 1.

Table 1.

Number of SNPs on each chromosome before and after quality control

Chromosome	No. SNP before QC	No. SNP after QC	No. SNP removed
1	3,133	2,567	566
2	2,553	2,037	516
3	2,279	1,905	374
4	2,357	1,901	456
5	2,050	1,611	439
6	2,373	1,980	393
7	2,141	1,725	416
8	2,181	1,786	395
9	1,899	1,541	358
10	1,977	1,618	359
11	2,058	1,666	392
12	1,599	1,280	319
13	1,666	1,370	296
14	1,687	1,386	301
15	1,583	1,255	328
16	1,542	1,222	320
17	1,442	1,189	253
18	1,249	1,001	248
19	1,274	1,047	227
20	1,408	1,144	264
21	1,313	1,048	265
22	1,194	957	237
23	976	817	159
24	1,209	990	219
25	905	747	158
26	1,012	810	202
27	895	727	168
28	889	730	159
29	966	796	170
30	1,044	80	964
Total	48,854	38,933	9,921

SNP, single nucleotide polymorphism; QC, quality control.
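The per-SNP thresholds quoted in the quality control section can be illustrated with a minimal Python sketch. It is not the plink1.9 implementation: the function name `snp_passes_qc` and the 0/1/2 genotype coding are hypothetical, the Hardy-Weinberg check is a simple 1-df chi-square approximation rather than the test plink applies, and the per-individual missingness filter (0.1) is left out.

```python
import math

# Thresholds quoted in the text (MAF 0.01, SNP call rate 0.9, HWE p-value 0.0001).
MAF_MIN, CALL_RATE_MIN, HWE_P_MIN = 0.01, 0.90, 1e-4

def snp_passes_qc(genotypes):
    """genotypes: list of 0/1/2 allele counts for one SNP, None for a missing call."""
    called = [g for g in genotypes if g is not None]
    if not called or len(called) / len(genotypes) < CALL_RATE_MIN:
        return False
    n = len(called)
    p = sum(called) / (2 * n)            # frequency of the counted allele
    if min(p, 1 - p) < MAF_MIN:          # minor allele frequency filter
        return False
    # Chi-square goodness of fit against Hardy-Weinberg proportions (1 df).
    expected = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2]
    observed = [called.count(k) for k in (0, 1, 2)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))   # chi-square survival function, 1 df
    return p_value >= HWE_P_MIN
```

A SNP with genotype counts 25/50/25 passes, whereas a monomorphic SNP, a SNP with 20% missing calls, or a SNP with no heterozygotes at allele frequency 0.5 is rejected.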

Imputation scenarios

Imputation scenarios were set up based on population size and marker density: the population size factor varies the size of the reference population, and the marker density factor varies the marker density of the test population. The test population was selected as the animals with the latest birth years (the youngest animals) in the data set. It consists of 889 animals, which were divided into four validation sets. Five reference populations were constructed. First, sets of 500, 1,000, 1,500, and 2,000 animals were selected from the individuals not included in the test population. When increasing the reference population size, the first 500 animals were randomly selected from the data set using R ver. 3.6 [23], and each larger set was formed by adding another 500 animals from the remaining individuals. In addition, for the largest scenario, the 2,000 animals and the test animals not included in the validation set being imputed were combined into a reference group labeled over 2,000 (3,600). The scenario sets are described in Table 2 and Fig. 1. Three low-density SNP panels were also created to vary the validation marker density. Each panel consists of evenly spaced SNPs (5k, 10k, or 15k) selected from the 50k Illumina chip; the number and average spacing of SNPs on each chromosome for these panels are given in Table 3. The test population data were analyzed using the generated low-density SNP panel information. A schematic diagram of the imputation scenarios is shown in Fig. 2.
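The scenario construction described above can be sketched in Python under some simplifying assumptions. The helper names `evenly_spaced` and `nested_reference_sets` are hypothetical, markers are thinned by index on a single ordered list rather than by physical position within each chromosome, the pool of 2,932 non-test animals is simply 3,821 minus 889, and the authors used R rather than Python for the sampling.

```python
import random

def evenly_spaced(indices, k):
    """Pick k roughly evenly spaced items from an ordered list of marker indices."""
    step = len(indices) / k
    return [indices[int(i * step)] for i in range(k)]

def nested_reference_sets(animal_ids, sizes, seed=0):
    """Shuffle once, then take growing prefixes so each set contains the previous one."""
    rng = random.Random(seed)
    pool = list(animal_ids)
    rng.shuffle(pool)
    return {n: pool[:n] for n in sizes}

markers = list(range(38933))  # marker order on the 50k panel after QC
panels = {name: evenly_spaced(markers, k)
          for name, k in [("5k", 5000), ("10k", 10000), ("15k", 15000)]}
refs = nested_reference_sets(range(2932), [500, 1000, 1500, 2000])

# Reference sets x validation sets x marker densities.
n_runs = 5 * 4 * 3
```

Taking prefixes of one shuffled list mirrors the description that each larger reference set adds 500 animals to the previous one, so differences between scenarios reflect added animals rather than a completely new sample.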

Table 2.

Summary of imputation scenarios using different reference population sizes and validation data sets

Scenario set	Reference SNP data	Reference set N	Reference set proportion	Validation 1 N	Validation 1 proportion	Validation 2 N	Validation 2 proportion	Validation 3 N	Validation 3 proportion	Validation 4 N	Validation 4 proportion	Total animal
Ref 500	50k	500	0.69	222	0.31	223	0.31	223	0.31	221	0.31	721–723
Ref 1,000	50k	1,000	0.82	222	0.18	223	0.18	223	0.18	221	0.18	1,221–1,223
Ref 1,500	50k	1,500	0.87	222	0.13	223	0.13	223	0.13	221	0.13	1,721–1,723
Ref 2,000	50k	2,000	0.90	222	0.10	223	0.10	223	0.10	221	0.10	2,221–2,223
Ref 2,000+	50k	3,600	0.94	222	0.06	223	0.06	223	0.06	221	0.06	3,821

Total animal, the total number of reference population and validation animals; Ref, reference population; SNP, single nucleotide polymorphism.

Fig. 1.

Organizing or analyzing data.

The raw data consisted of a total of 3,821 individuals born between 1989 and 2013. First, the youngest individuals in the dataset were selected as the validation set. Then, the others were configured as a reference group. The reference group was constructed by changing the group size and the validation group by changing the number of markers. Val, validation.

Table 3.

Number of SNPs and average distance between SNPs on each chromosome according to marker density

Chr	Length (Mb)	5k No. SNP	5k average distance (Mb)	10k No. SNP	10k average distance (Mb)	15k No. SNP	15k average distance (Mb)	50k No. SNP	50k average distance (Mb)
1	158.1	330	0.48	660	0.24	990	0.16	2,567	0.06
2	136.7	262	0.52	524	0.26	786	0.17	2,037	0.07
3	121.1	245	0.49	490	0.25	735	0.16	1,905	0.06
4	120.5	245	0.49	490	0.24	735	0.16	1,901	0.06
5	121.1	208	0.58	415	0.29	623	0.19	1,611	0.08
6	119.0	256	0.46	512	0.23	768	0.15	1,980	0.06
7	112.4	222	0.50	444	0.25	666	0.17	1,725	0.07
8	113.0	230	0.49	460	0.24	689	0.16	1,786	0.06
9	105.0	198	0.52	396	0.26	594	0.17	1,541	0.07
10	103.1	208	0.49	416	0.25	624	0.16	1,618	0.06
11	107.1	214	0.50	428	0.25	643	0.17	1,666	0.06
12	90.9	164	0.55	328	0.28	493	0.18	1,280	0.07
13	83.9	176	0.48	352	0.24	528	0.16	1,370	0.06
14	83.1	178	0.47	356	0.23	534	0.16	1,386	0.06
15	84.2	162	0.52	324	0.26	485	0.17	1,255	0.07
16	81.2	158	0.51	316	0.25	473	0.17	1,222	0.07
17	74.8	153	0.49	306	0.24	459	0.16	1,189	0.06
18	65.2	128	0.51	256	0.25	384	0.17	1,001	0.07
19	63.5	135	0.46	270	0.23	405	0.15	1,047	0.06
20	71.5	147	0.48	294	0.24	442	0.16	1,144	0.06
21	71.1	136	0.51	272	0.26	407	0.17	1,048	0.07
22	61.1	124	0.49	248	0.24	371	0.16	957	0.06
23	52.2	105	0.50	210	0.25	315	0.17	817	0.06
24	62.1	127	0.49	254	0.24	381	0.16	990	0.06
25	42.7	96	0.44	192	0.22	288	0.15	747	0.06
26	51.0	104	0.49	208	0.24	312	0.16	810	0.06
27	45.3	94	0.48	188	0.24	282	0.16	727	0.06
28	46.2	94	0.49	188	0.25	281	0.16	730	0.06
29	51.1	103	0.48	206	0.24	309	0.16	796	0.06

SNP, single nucleotide polymorphism.

Fig. 2.

Imputation scenarios.

This figure shows reference population 1 and validation set 1 as an example; imputation was performed at each marker density (5k, 10k, and 15k). Because there are 5 reference sets, 4 validation sets, and 3 marker densities, the imputation process was run 60 times (5 × 4 × 3). Ref, reference; Val, validation.

Linkage disequilibrium

We analyzed the LD level, which is one of the factors affecting imputation accuracy. Because imputation uses haplotype information, imputation accuracy decreases if the LD levels of the reference and test populations differ. The LD value (r²) between SNPs within 1 Mb of each other was measured using plink1.9 software; that is, the maximum distance between marker pairs was 1 Mb, and the average r² value was estimated for each autosome. The following formula was used for LD estimation [24]:

$$r^2 = \frac{(P_{A_1B_1} - P_{A_1}P_{B_1})^2}{P_{A_1}P_{A_2}P_{B_1}P_{B_2}}$$

where $A_1$, $A_2$, $B_1$, and $B_2$ are the alleles of SNP A and SNP B, $P_{A_1}$, $P_{A_2}$, $P_{B_1}$, and $P_{B_2}$ are the corresponding allele frequencies, and $P_{A_1B_1}$ is the frequency of haplotype $A_1B_1$. We examined whether the average LD levels of the reference and test populations used in all scenarios showed a similar pattern with respect to the distance between SNPs. The LD pattern in the test population was also investigated according to marker density (5k, 10k, 15k, and 50k).
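As a small worked example of the LD measure, the following Python sketch computes r² for one SNP pair from the quantities defined in the text, using the standard definition r² = D² / (P_A1 · P_A2 · P_B1 · P_B2) with D = P_A1B1 − P_A1 · P_B1. The function name `ld_r2` is hypothetical, and the sketch assumes haplotype frequencies are already known, whereas plink estimates them from unphased genotypes.

```python
def ld_r2(p_a1b1, p_a1, p_b1):
    """r^2 between two biallelic SNPs from one haplotype and two allele frequencies."""
    p_a2, p_b2 = 1.0 - p_a1, 1.0 - p_b1
    d = p_a1b1 - p_a1 * p_b1              # disequilibrium coefficient D
    return d * d / (p_a1 * p_a2 * p_b1 * p_b2)

complete_ld = ld_r2(0.5, 0.5, 0.5)    # A1 always travels with B1
equilibrium = ld_r2(0.25, 0.5, 0.5)   # haplotype frequency equals product of allele frequencies
```

With allele frequencies of 0.5 at both SNPs, a haplotype frequency of 0.5 gives r² = 1 and a haplotype frequency of 0.25 gives r² = 0.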

Genotype imputation

Imputation of the low-density (5k, 10k, and 15k) data sets to high-density (50k) genotypes was performed with Beagle ver. 3.3 [25]. Beagle is a population-based method and does not require pedigree information. It clusters haplotypes at each marker using a localized haplotype cluster model and then uses a hidden Markov model (HMM) to find the most probable haplotypes given the known genotypes of each individual [26]. Therefore, collecting haplotype information and imputing ungenotyped SNPs in the reference population is important for imputing validation data from low to high density. Imputation was performed for each chromosome by pairing the reference data set with the validation data set in every scenario. After imputation, the genotypes were recoded for accuracy comparison: AA, AB, and BB were changed to 0, 1, and 2, respectively. Imputation accuracy was defined as the proportion of imputed genotypes that matched the raw (true) genotypes in a direct comparison. We also examined how imputation accuracy changes with minor allele frequency, reference population size, and marker density.
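The accuracy measure described above (recoding AA, AB, and BB to 0, 1, and 2 and taking the proportion of matching genotypes) might be computed along the following lines. This is a sketch, not the authors' script: the function name `imputation_accuracy` is hypothetical, and restricting the comparison to markers that were masked in the low-density panel is an assumption, since the text does not say whether retained markers were included.

```python
CODES = {"AA": 0, "AB": 1, "BB": 2}

def imputation_accuracy(true_calls, imputed_calls, masked_idx):
    """Proportion of masked markers whose imputed genotype equals the true genotype."""
    hits = sum(CODES[true_calls[i]] == CODES[imputed_calls[i]] for i in masked_idx)
    return hits / len(masked_idx)

# One animal, five markers; markers 1-3 were absent from the low-density panel.
true_calls    = ["AA", "AB", "BB", "AB", "AA"]
imputed_calls = ["AA", "AB", "AB", "AB", "AA"]
acc = imputation_accuracy(true_calls, imputed_calls, masked_idx=[1, 2, 3])
```

Here two of the three masked markers are imputed correctly, so the accuracy for this animal is 2/3.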

RESULTS

We investigated the LD pattern of all of our scenario sets (Fig. 3). Fig. 3A shows the LD pattern of the four validation sets according to marker density (5k, 10k, and 15k). In all validation sets, the LD estimates showed a very similar tendency: the level of LD decreased as the distance between SNPs increased. Within each validation set, the LD pattern differed slightly with marker density, but the difference was less than 0.01. As there was no difference among the validation sets, the validation sets were considered free from bias. Fig. 3B shows the LD pattern of the reference sets according to population size. The LD levels of the reference sets used in all scenarios were similar whether the reference population size was 500, 1,000, 1,500, 2,000, or 3,600. Because each reference set is larger than a validation set, the LD level hardly changed with reference population size. In addition, the LD level difference among the reference groups was smaller than that among the validation sets. Comparing the validation sets with the reference population sets showed that their LD levels follow similar patterns. The distance between SNPs was divided into intervals of 0–20, 20–50, 50–100, 100–200, and 200–1,000 kb. Table 4 gives the number of SNP pairs in the reference and validation sets, with the average r² values and standard deviations. As the distance interval widened, the number of SNP pairs increased; there were about 750 pairs at 0–20 kb and about 500,000 at 200–1,000 kb. The average r² (SD) was 0.27–0.28 (0.32–0.33) at 0–20 kb and 0.02 (0.04) at 200–1,000 kb; it decreased rapidly over the first 200 kb and slowly thereafter.

Fig. 3.

Interval means of linkage disequilibrium (r²) between marker pairs by marker distance in the test sets (A) and reference sets (B).

(A) The four validation data sets at marker densities of 5k, 10k, 15k, and 50k used for imputation. (B) The five reference data sets of 500, 1,000, 1,500, 2,000, and over 2,000 (3,600) animals. The over-2,000 reference sets also include the other validation data.

Table 4.

Linkage disequilibrium (r²) information in all scenario data sets

Data type	0–20 kb No. SNP pairs	0–20 kb average r² (SD)	20–50 kb No. SNP pairs	20–50 kb average r² (SD)	50–100 kb No. SNP pairs	50–100 kb average r² (SD)	100–200 kb No. SNP pairs	100–200 kb average r² (SD)	200–1,000 kb No. SNP pairs	200–1,000 kb average r² (SD)
Val 1	750	0.28 (0.33)	22,732	0.17 (0.25)	33,014	0.10 (0.17)	64,732	0.06 (0.11)	500,518	0.02 (0.04)
Val 2	751	0.28 (0.33)	22,722	0.17 (0.25)	33,002	0.10 (0.17)	64,713	0.06 (0.11)	500,429	0.02 (0.04)
Val 3	751	0.28 (0.33)	22,731	0.17 (0.25)	33,010	0.10 (0.17)	64,732	0.06 (0.11)	500,464	0.02 (0.04)
Val 4	752	0.28 (0.33)	22,727	0.17 (0.25)	33,006	0.10 (0.17)	64,704	0.06 (0.11)	500,397	0.02 (0.04)
Ref 500	744	0.27 (0.32)	22,572	0.17 (0.24)	32,793	0.09 (0.16)	64,243	0.05 (0.10)	496,752	0.02 (0.04)
Ref 1,000	749	0.27 (0.32)	22,604	0.17 (0.24)	32,814	0.09 (0.16)	64,299	0.05 (0.10)	497,219	0.02 (0.04)
Ref 1,500	749	0.27 (0.32)	22,618	0.17 (0.24)	32,827	0.09 (0.16)	64,374	0.05 (0.10)	497,641	0.02 (0.04)
Ref 2,000	750	0.27 (0.32)	22,646	0.17 (0.24)	32,877	0.09 (0.16)	64,488	0.05 (0.10)	498,467	0.02 (0.04)
Ref 2,000 + (val 1)	752	0.27 (0.32)	22,739	0.17 (0.24)	33,025	0.09 (0.16)	64,758	0.05 (0.10)	500,701	0.02 (0.04)
Ref 2,000 + (val 2)	752	0.27 (0.32)	22,739	0.17 (0.24)	33,025	0.09 (0.16)	64,758	0.05 (0.10)	500,701	0.02 (0.04)
Ref 2,000 + (val 3)	752	0.27 (0.32)	22,739	0.17 (0.24)	33,025	0.09 (0.16)	64,758	0.05 (0.10)	500,701	0.02 (0.04)
Ref 2,000 + (val 4)	752	0.27 (0.32)	22,739	0.17 (0.24)	33,025	0.09 (0.16)	64,758	0.05 (0.10)	500,701	0.02 (0.04)

SNP, single nucleotide polymorphism; Val, validation; Ref, reference population.
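The interval summary in Table 4 (pair count and average r² per distance interval) can be reproduced from a list of pairwise results with a short Python sketch. The function name `mean_r2_by_distance` and the input format of (distance in kb, r²) tuples are hypothetical, and treating each interval as open on the left and closed on the right is an assumption, because the text does not state how boundary distances were assigned.

```python
BINS_KB = [(0, 20), (20, 50), (50, 100), (100, 200), (200, 1000)]

def mean_r2_by_distance(pairs):
    """pairs: iterable of (distance_kb, r2). Returns {bin: (n_pairs, mean_r2)}."""
    acc = {b: [0, 0.0] for b in BINS_KB}
    for dist, r2 in pairs:
        for lo, hi in BINS_KB:
            if lo < dist <= hi:           # assumed convention: (lo, hi]
                acc[(lo, hi)][0] += 1
                acc[(lo, hi)][1] += r2
                break                     # pairs beyond 1,000 kb fall in no bin
    return {b: (n, s / n if n else float("nan")) for b, (n, s) in acc.items()}

summary = mean_r2_by_distance([(10, 0.3), (15, 0.1), (150, 0.06), (1500, 0.5)])
```

In this toy input the 0–20 kb interval holds two pairs with a mean r² of 0.2, and the pair at 1,500 kb is ignored, in line with the 1 Mb window used in the LD analysis.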

Imputation accuracy

We assessed the genotype imputation accuracy according to SNP panel density and reference population size. The accuracy of each scenario is reported as the average over the four validation sets. The lowest imputation accuracy was 79%, with 5k marker density and a reference population of 500, and the highest was 97%, with 15k marker density and a reference population over 2,000 (3,600). When we assessed the accuracy of each validation set, the maximum difference among the sets was about 4%, at a reference population size of 1,000 and 5k marker density. In the other scenarios, the validation sets had similar imputation accuracies. Fig. 4 plots the accuracy of each chromosome by marker density for a reference population of 1,500 (the imputation accuracies for each chromosome for reference population sizes of 500, 1,000, 2,000, and over 2,000 are presented in Figs. 5–8); chromosome 21 shows the greatest variability in imputation accuracy. Fig. 9 plots the imputation accuracy of each SNP across all autosomes for marker densities of 5k (A), 10k (B), and 15k (C), from top to bottom, and confirms the presence of several regions with poor imputation quality. With 0.75 as the threshold, 1,275 SNPs were identified as substandard at 5k, 151 at 10k, and 65 at 15k.

Fig. 4.

Average imputation accuracy of each chromosome at different marker densities with a reference population size of 1,500.

The average accuracy of each chromosome is shown in a different color for each marker density of the test data set: green, yellow, and red represent 5k, 10k, and 15k, respectively.

Fig. 5.

Average imputation accuracy of each chromosome at different marker densities with a reference population size of 500.

Fig. 6.

Average imputation accuracy of each chromosome at different marker densities with a reference population size of 1,000.

Fig. 7.

Average imputation accuracy of each chromosome at different marker densities with a reference population size of 2,000.

Fig. 8.

Average imputation accuracy of each chromosome at different marker densities with a reference population size over 2,000.

Fig. 9.

Average imputation accuracy of each SNP at different marker densities with a reference population size of 1,500.

Each SNP's imputation accuracy was plotted genome-wide to find regions with low imputation accuracy. Each chromosome has a different color, and the poorly imputed regions lie at the ends of chromosomes. As the marker density increases from (A) 5k to (B) 10k and (C) 15k, top to bottom, the overall imputation accuracy also increases. At 5k, 1,275 SNPs were identified as substandard, compared with 151 at 10k and 65 at 15k. The brown horizontal line marks the threshold of 0.75. SNP, single nucleotide polymorphism.

Imputation accuracy by marker density

This study investigated the effect of marker density on imputation accuracy. Three low-density (5k, 10k, and 15k) data sets were used as marker data for validation and imputed to high density (50k; 38,933 SNPs). With the 5k marker data set, the imputation accuracy was 0.793–0.929, a 13.6% accuracy difference according to the reference population size. In comparison, with the 15k marker data set, the imputation accuracy was 0.904–0.967, a 6.3% accuracy difference according to the reference population size (Table 5). This shows that the higher the marker density of the validation set, the greater the imputation accuracy; moreover, the difference in imputation accuracy according to reference population size is much greater at low density than at high density. The difference in imputation accuracy between 5k and 10k was also larger than that between 10k and 15k. Imputation efficiency increased with the marker density of the validation set: imputation took comparatively long when the marker density of the validation set was low, although time costs are not reported in this study.

Table 5.

Average imputation accuracy of validation data sets

Density	No. Ref	Val 1	Val 2	Val 3	Val 4	Average	SD
5k	500	0.796	0.789	0.797	0.792	0.793	0.004
	1,000	0.862	0.938	0.862	0.861	0.881	0.038
	1,500	0.886	0.883	0.888	0.886	0.886	0.002
	2,000	0.899	0.897	0.902	0.900	0.900	0.002
	2,000+	0.929	0.926	0.931	0.929	0.929	0.002
10k	500	0.864	0.859	0.864	0.861	0.862	0.002
	1,000	0.909	0.907	0.911	0.910	0.909	0.002
	1,500	0.927	0.924	0.928	0.927	0.926	0.001
	2,000	0.936	0.934	0.938	0.937	0.936	0.002
	2,000+	0.955	0.953	0.956	0.955	0.955	0.001
15k	500	0.907	0.904	0.908	0.906	0.906	0.002
	1,000	0.938	0.936	0.939	0.938	0.937	0.001
	1,500	0.949	0.948	0.951	0.950	0.949	0.001
	2,000	0.955	0.954	0.957	0.956	0.956	0.001
	2,000+	0.969	0.967	0.969	0.969	0.969	0.001

Ref, reference population; Val, validation; SD, standard deviation.
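The accuracy differences discussed in the text can be recomputed from the Average column of Table 5. The Python sketch below does this for a fixed marker density (spread across reference population sizes) and for a fixed reference population size (spread across marker densities); the dictionary layout, the helper name `spread`, and the use of 3600 as the key for the 2,000+ row are illustrative choices, not part of the original analysis.

```python
# Average imputation accuracy from Table 5: AVG[density][reference size].
AVG = {
    "5k":  {500: 0.793, 1000: 0.881, 1500: 0.886, 2000: 0.900, 3600: 0.929},
    "10k": {500: 0.862, 1000: 0.909, 1500: 0.926, 2000: 0.936, 3600: 0.955},
    "15k": {500: 0.906, 1000: 0.937, 1500: 0.949, 2000: 0.956, 3600: 0.969},
}

def spread(values):
    """Difference between the highest and lowest accuracy, rounded to 3 decimals."""
    values = list(values)
    return round(max(values) - min(values), 3)

# Spread across reference sizes at each density, and across densities at each size.
by_density = {d: spread(row.values()) for d, row in AVG.items()}
by_ref_size = {n: spread(AVG[d][n] for d in AVG) for n in AVG["5k"]}
```

From these averages the spread across reference sizes is 0.136 at 5k and 0.063 at 15k, and the spread across marker densities is 0.113 for 500 reference animals and 0.040 for the largest reference set.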

Imputation accuracy by reference population size

Five reference populations were examined: 500, 1,000, 1,500, 2,000, and 3,600. When selecting the animals for the reference population, we used random sampling based on birth year; the relatedness of the animals was not considered. Fig. 10 plots the average imputation accuracy according to reference population size and test data marker density. For the smallest reference population (500), the imputation accuracy was 0.793–0.906, differing by 11.3% according to marker density. For the largest reference population (3,600), the imputation accuracy was 0.929–0.969, differing by 4% according to marker density (Table 5). These results show that the larger the reference population, the higher the imputation accuracy. Moreover, the difference in imputation accuracy according to marker density is much larger with small reference populations. When the reference population in Hanwoo cattle exceeds 1,000, the average imputation accuracy exceeds 88%, even using 5k SNP data (Fig. 10). The imputation efficiency increased with reference population size, and imputation took longer when the reference population was small.

Fig. 10.

Average imputation accuracy according to reference population size and validation data marker density.

The average imputation accuracy over the validation sets was calculated for each reference population size and is displayed separately for each marker density. Gray, yellow, and blue represent 5k, 10k, and 15k, respectively.

Imputation accuracy by minor allele frequency

To investigate the effect of minor allele frequency (MAF) on imputation accuracy, all SNPs were grouped by MAF from 0 to 0.5 in increments of 0.005, giving 100 groups in every scenario. The imputation accuracy of each MAF group was compared across population sizes and marker densities. Fig. 11 shows the imputation accuracy for the five reference populations at the three marker densities imputed to 50k, according to MAF. Imputation accuracy was negatively related to MAF: it decreased as the MAF increased. With the 5k marker data in the validation set, the 0.005 and 0.5 groups had accuracies of 98.3%–99.4% and 69.4%–88.9%, respectively, depending on the size of the reference group, a difference in accuracy of 10.5%–28.9%. With the 15k marker data, the 0.005 and 0.5 groups had accuracies of 99.3%–99.7% and 85.2%–94.9%, respectively, depending on the reference population size, a difference of 4.7%–14.1%. Therefore, as the marker density or reference population size increases, the loss of imputation accuracy at higher MAF becomes smaller. The scenarios separated clearly at an imputation accuracy threshold of 85%: with 15k marker density, the accuracy exceeded the threshold in all scenarios.
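The grouping of SNPs into 100 MAF classes of width 0.005 and the averaging of accuracy within each class can be sketched in Python as follows. The function names `maf_bin` and `accuracy_by_maf` are hypothetical, as is the input format of (MAF, accuracy) tuples; the handling of values that fall exactly on a class boundary is an assumption, since the text does not specify it.

```python
BIN_WIDTH = 0.005
N_BINS = 100  # 0.5 / 0.005

def maf_bin(maf):
    """0-based class index; MAF = 0.5 is placed in the last class."""
    return min(N_BINS - 1, int(maf / BIN_WIDTH))

def accuracy_by_maf(snps):
    """snps: iterable of (maf, accuracy). Returns {class index: mean accuracy}."""
    sums = {}
    for maf, acc in snps:
        k = maf_bin(maf)
        total, n = sums.get(k, (0.0, 0))
        sums[k] = (total + acc, n + 1)
    return {k: total / n for k, (total, n) in sorted(sums.items())}

per_class = accuracy_by_maf([(0.012, 0.9), (0.013, 0.8), (0.492, 0.7)])
```

In this toy input the two rare SNPs share class 2 with a mean accuracy of 0.85, and the common SNP falls in class 98.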

Fig. 11.

Average imputation accuracy according to marker density, minor allele frequency, and reference population size.

The minor allele frequency range from 0 to 0.5 was divided into 100 groups of width 0.005, and the average accuracy of the SNPs in each group was calculated for (A) 5k, (B) 10k, and (C) 15k. SNP, single nucleotide polymorphism.

DISCUSSION

In this study, we used the Hanwoo genotype data set from the BioGreen 21 data of the National Institute of Animal Science, RDA, to generate scenario data for genotype imputation. The validation data were the youngest 889 animals, which were divided into four validation sets. The other animals were used as a reference group to perform imputation using Beagle 3.3, a population-based method. Imputation accuracy was examined by direct comparison of the true and imputed genotypes. We investigated LD in Hanwoo cattle because the population LD level affects imputation accuracy. Uemoto et al. (2015) found that Japanese Black cattle had an LD (r²) of 0.1 at 200 kb between SNP pairs [27]. Using 16 Holstein breeds, Hoze et al. (2013) reported an LD (r²) of 0.2 at 100 kb between SNP pairs [28]. High LD means that the association between SNP markers is also high, which increases the probability of correctly inferring closely located SNPs during imputation. Hanwoo cattle have a lower LD value than other cattle breeds and therefore require a larger reference population to achieve high imputation accuracy. An imputation accuracy of 95% was obtained in Japanese Black cattle with reference populations greater than 400 [27]. In comparison, 90% imputation accuracy was obtained in Holstein cattle with reference populations greater than 300, and 95% imputation accuracy in Fleckvieh cattle with reference populations greater than 400 [29]. In Hanwoo cattle, the imputation accuracy was 88% at low density (5k) for reference populations greater than 1,000, and, as reported in Holsteins, long chromosomes had greater imputation accuracy than short ones [30]. Imputation accuracy is also influenced by the marker density of the validation data. In dairy cattle, the imputation accuracy with a reference population of 2,406 was 72%, 82%, 91%, 93%, and 97% at marker densities of 384, 768, 1,536, 2,480, and 6,177, respectively [31].
In this study, comparing the three low-density panels (5k, 10k, and 15k), the spread in accuracy was as large as 13.6%, depending on the marker density. Imputation accuracy needs to be assessed according to reference population size because, as the population size increases, the amount of haplotype data and the explanatory power of each haplotype increase, and the imputation error rate decreases. We therefore assessed imputation accuracy for five reference population sizes to determine the effect of reference population size on imputation accuracy in Hanwoo cattle. When imputation was performed with a reference population over 2,000 (3,600), the accuracy was 93% at the lowest density (5k), which is lower than in other breeds. The minor allele frequency also negatively affects imputation accuracy. Because imputation fills in missing values by statistical inference, a correct genotype is obtained by chance more often at a low minor allele frequency [27,32]. However, as the marker density or reference population size increases, the loss of imputation accuracy at higher minor allele frequencies becomes smaller. In conclusion, the difference in imputation accuracy across reference population sizes was 6.3%–13.6%, depending on marker density, and the difference across marker densities was 4%–11.3%, depending on reference population size. In Hanwoo cattle, a reference population of at least 1,000 is needed to obtain more than 88% imputation accuracy.

26 in total

1. Prediction of total genetic value using genome-wide dense marker maps.

Authors: T H Meuwissen; B J Hayes; M E Goddard
Journal: Genetics Date: 2001-04 Impact factor: 4.562

2. Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels.

Authors: M Erbe; B J Hayes; L K Matukumalli; S Goswami; P J Bowman; C M Reich; B A Mason; M E Goddard
Journal: J Dairy Sci Date: 2012-07 Impact factor: 4.034

3. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase.

Authors: Paul Scheet; Matthew Stephens
Journal: Am J Hum Genet Date: 2006-02-17 Impact factor: 11.025

4. Genomic selection using low-density marker panels.

Authors: D Habier; R L Fernando; J C M Dekkers
Journal: Genetics Date: 2009-03-18 Impact factor: 4.562

5. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering.

Authors: Sharon R Browning; Brian L Browning
Journal: Am J Hum Genet Date: 2007-09-21 Impact factor: 11.025

6. Accuracy of high-density genotype imputation in Japanese Black cattle.

Authors: Y Uemoto; S Sasaki; Y Sugimoto; T Watanabe
Journal: Anim Genet Date: 2015-07-07 Impact factor: 3.169

7. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing.

Authors: Bryan Howie; Christian Fuchsberger; Matthew Stephens; Jonathan Marchini; Gonçalo R Abecasis
Journal: Nat Genet Date: 2012-07-22 Impact factor: 38.330

8. Accuracy of genomic prediction using low-density marker panels.

Authors: Z Zhang; X Ding; J Liu; Q Zhang; D-J de Koning
Journal: J Dairy Sci Date: 2011-07 Impact factor: 4.034

9. False positive findings during genome-wide association studies with imputation: influence of allele frequency and imputation accuracy.

Authors: Zhihui Zhang; Xiangjun Xiao; Wen Zhou; Dakai Zhu; Christopher I Amos
Journal: Hum Mol Genet Date: 2021-12-17 Impact factor: 5.121

10. Impact of genotype imputation on the performance of GBLUP and Bayesian methods for genomic prediction.

Authors: Liuhong Chen; Changxi Li; Mehdi Sargolzaei; Flavio Schenkel
Journal: PLoS One Date: 2014-07-15 Impact factor: 3.240