S A Durward-Akhurst, R J Schaefer, B Grantham, W K Carey, J R Mickelson, M E McCue.
Abstract
Genetic variation is a key contributor to health and disease. Understanding the link between an individual's genotype and the corresponding phenotype is a major goal of medical genetics. Whole genome sequencing (WGS) within and across populations enables highly efficient variant discovery and elucidation of the molecular nature of virtually all genetic variation. Here, we report the largest catalog of genetic variation for the horse, a species of importance as a model for human athletic and performance-related traits, using WGS of 534 horses. We show the extent of agreement between two commonly used variant callers. In data from ten target breeds that represent major breed clusters in the domestic horse, we demonstrate the distribution of variants and their allele frequencies across breeds, and identify variants that are unique to a single breed. We investigate variants with no homozygotes that may be potential embryonic lethal variants, as well as variants present in all individuals that likely represent regions of the genome with errors, poor annotation or where the reference genome carries a variant. Finally, we show regions of the genome that have higher or lower levels of genetic variation compared to the genome average. This catalog can be used for variant prioritization for important equine diseases and traits, and to provide key information about regions of the genome where the assembly and/or annotation need to be improved.
Keywords: breed differences; equine; genetic variation; genetics; variant discovery; whole genome sequence
Year: 2021 PMID: 34925451 PMCID: PMC8676274 DOI: 10.3389/fgene.2021.758366
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Median, mean, and range of summary statistics of the mapping pipeline derived from WGS data from 534 horses.
| | Median | Mean | Range |
|---|---|---|---|
| Read length (bp) | 99.6 | 114.5 | 73.9–234.2 |
| Uniquely mapped paired end reads | 240,563,458 | 290,638,968 | 17,030,804–1,536,494,934 |
| Depth of coverage (X) | 9.2 | 11.5 | 1.4–46.7 |
Number of variants, TsTv ratio, and HetNRhom ratio from WGS of 534 horses for each variant caller [GATK Haplotype Caller (HC) and BCFtools (BT)] and for the union and intersection of the two variant callers.
| Variant Caller | Variants | SNPs | INDELs | MA sites | MA SNP sites | TsTv ratio | HetNRhom ratio |
|---|---|---|---|---|---|---|---|
| HC | 42,900,494 | 38,205,867 | 4,694,627 | 2,974,935 | 2,127,391 | 1.87 | 2.48 |
| BT | 33,395,275 | 30,642,613 | 2,752,662 | 783,956 | 295,703 | 2.08 | 2.18 |
| Union | 45,154,996 | 39,810,450 | 5,344,546 | 3,298,397 | 2,178,474 | 1.54 | 2.17 |
| Intersect | 31,140,769 | 29,038,030 | 2,102,379 | 1,547,737 | 990,223 | 1.94 | 2.24 |
MA, multiallelic; TsTv, transition/transversion; HetNRhom, heterozygous-non-reference homozygous.
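As a minimal sketch of how the two quality ratios reported in the table above are defined, the following Python snippet computes a TsTv ratio from reference/alternate allele pairs and a HetNRhom ratio from genotype calls. The function names, the toy inputs, and the genotype encoding (0 for the reference allele, any other integer for an alternate allele) are illustrative assumptions and are not taken from the study's pipeline.

```python
# Transitions stay within a base class (purine<->purine or
# pyrimidine<->pyrimidine); transversions switch class.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """Return True for A<->G or C<->T substitutions (assumes ref != alt)."""
    pair = {ref.upper(), alt.upper()}
    return pair <= PURINES or pair <= PYRIMIDINES


def tstv_ratio(snps):
    """Number of transitions divided by number of transversions."""
    ts = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ts
    if tv == 0:
        raise ValueError("no transversions; ratio is undefined")
    return ts / tv


def het_nrhom_ratio(genotypes):
    """Heterozygous calls divided by non-reference homozygous calls.

    Each genotype is an (allele1, allele2) tuple; 0 is the reference allele.
    """
    het = sum(1 for a, b in genotypes if a != b)
    nrhom = sum(1 for a, b in genotypes if a == b and a != 0)
    if nrhom == 0:
        raise ValueError("no non-reference homozygotes; ratio is undefined")
    return het / nrhom


# Toy data (hypothetical, for illustration only).
snps = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "A"), ("T", "G"), ("C", "T")]
genotypes = [(0, 1), (1, 1), (0, 0), (0, 1), (1, 1), (0, 1)]

print(tstv_ratio(snps))            # 4 transitions / 2 transversions
print(het_nrhom_ratio(genotypes))  # 3 heterozygous / 2 non-reference homozygous
```

Reference-homozygous calls such as `(0, 0)` contribute to neither the numerator nor the denominator of the HetNRhom ratio in this sketch.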
FIGURE 1. Distribution of the average number of variants identified for each breed by DOC quantiles (Q1 = 1.43–7.14 X, Q2 = 7.15–9.16 X, Q3 = 9.17–14.6 X, Q4 = 14.7–46.7 X). The colored lines represent the 10 target horse breeds [Arabian, Belgian, Clydesdale, Icelandic horse, Morgan horse, Quarter Horse (QH), Shetland, Standardbred (STB), Thoroughbred (TB), and Welsh Pony (WP)] and the remaining horse breeds (Other).
FIGURE 2. Missingness by individual (A), by depth of coverage (B) and by chromosome (C). The 10 target breeds and other breeds are represented in colors shown in the figure legend in Figures 2A,B.
Estimated marginal mean of the transition to transversion (TsTv) and heterozygous to non-reference homozygous (hetNRhom) ratios accounting for depth of coverage.
| Breed | TsTv ratio | hetNRhom ratio |
|---|---|---|
| Arabian | 1.93 | 1.53 |
| Belgian | 1.94 | 2.02 |
| Clydesdale | 1.93 | 1.68 |
| Icelandic | 1.94 | 1.89 |
| Morgan | 1.94 | 2.26 |
| Quarter Horse | 1.93 | 2.26 |
| Shetland | 1.95 | 2.42 |
| Standardbred | 1.93 | 2.03 |
| Thoroughbred | 1.92 | 3.19 |
| Welsh Pony | 1.94 | 2.27 |
FIGURE 3. Minor allele frequency distribution of the variants.
Estimated marginal means (EMMEANs) for the number of variants within breeds accounting for depth of coverage (DOC), with standard error (SE) and 95% confidence intervals; breed and DOC were the predictor variables.
| Breed | Variants included | EMMEAN | SE | Lower confidence interval | Upper confidence interval |
|---|---|---|---|---|---|
| Average | All | 5,580,202 | 33,642 | 5,514,039 | 5,646,364 |
| Arabian | All | 5,587,952 | 54,820 | 5,480,322 | 5,695,582 |
| Belgian | All | 6,100,544 | 67,559 | 5,967,903 | 6,233,186 |
| Clydesdale | All | 5,977,299 | 69,379 | 5,841,084 | 6,113,515 |
| Icelandic | All | 6,093,155 | 72,659 | 5,950,500 | 6,235,810 |
| Morgan | All | 5,602,287 | 67,541 | 5,469,681 | 5,734,893 |
| Quarter Horse | All | 5,473,401 | 39,427 | 5,395,992 | 5,550,810 |
| Shetland | All | 5,645,668 | 44,377 | 5,558,541 | 5,732,794 |
| Standardbred | All | 5,623,603 | 44,981 | 5,535,289 | 5,711,917 |
| Thoroughbred | All | 5,000,516 | 43,061 | 4,915,972 | 5,085,060 |
| Welsh Pony | All | 5,836,729 | 67,893 | 5,703,431 | 5,970,028 |
| Average | Homozygous | 1,805,127 | 10,931 | 1,783,628 | 1,826,626 |
| Arabian | Homozygous | 1,812,878 | 54,820 | 1,705,248 | 1,920,508 |
| Belgian | Homozygous | 2,325,470 | 67,559 | 2,192,828 | 2,458,111 |
| Clydesdale | Homozygous | 2,202,225 | 69,379 | 2,066,010 | 2,338,440 |
| Icelandic | Homozygous | 2,318,081 | 72,659 | 2,175,426 | 2,460,736 |
| Morgan | Homozygous | 1,827,212 | 67,541 | 1,694,607 | 1,959,818 |
| Quarter Horse | Homozygous | 1,698,327 | 39,427 | 1,620,918 | 1,775,735 |
| Shetland | Homozygous | 1,870,593 | 44,377 | 1,783,466 | 1,957,720 |
| Standardbred | Homozygous | 1,848,529 | 44,981 | 1,760,215 | 1,936,843 |
| Thoroughbred | Homozygous | 1,225,441 | 43,061 | 1,140,897 | 1,309,985 |
| Welsh Pony | Homozygous | 2,061,655 | 67,893 | 1,928,356 | 2,194,953 |
The top half of the table provides the EMMEAN for the total number of variants per individual (All) and the bottom half of the table provides the total number of homozygous variants present in each individual.
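The relationship between the EMMEAN, SE, and confidence limits in the table can be checked with a short sketch. It assumes symmetric intervals of the form EMMEAN ± c × SE, which is an assumption here because the record does not state how the limits were computed. The function name `implied_multiplier` is hypothetical, and the three rows are copied from the table.

```python
def implied_multiplier(lower, upper, se):
    """Half-width of a symmetric interval divided by its standard error."""
    return (upper - lower) / (2 * se)


# (EMMEAN, SE, lower limit, upper limit) for three rows of the table.
rows = {
    "Average (All)": (5_580_202, 33_642, 5_514_039, 5_646_364),
    "Thoroughbred (All)": (5_000_516, 43_061, 4_915_972, 5_085_060),
    "Average (Homozygous)": (1_805_127, 10_931, 1_783_628, 1_826_626),
}

for label, (emmean, se, lower, upper) in rows.items():
    c = implied_multiplier(lower, upper, se)
    # For a symmetric interval the midpoint should equal the EMMEAN
    # up to rounding of the published values.
    offset = (lower + upper) / 2 - emmean
    print(f"{label}: c = {c:.3f}, midpoint - EMMEAN = {offset:.1f}")
```

For these rows the implied multiplier is close to 1.96, as would be expected for a 95% interval, and each interval midpoint matches the EMMEAN to within rounding.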
FIGURE 4. Percentage of coding variants for each type of variant called by SnpEff for low variation regions (orange) and high variation regions (teal).
Impact of variants identified in high and low variation regions.
| | High | Moderate | Low | Modifier |
|---|---|---|---|---|
| | 2,061 | 39,260 | 48,234 | 2,535,827 |
| | 303 | 6,293 | 11,521 | 602,660 |