Amanda Warr, Christelle Robert, David Hume, Alan L Archibald, Nader Deeb, Mick Watson.
Abstract
Many applications of high-throughput sequencing rely on the availability of an accurate reference genome. Variant calling often produces large data sets that cannot realistically be validated and that may contain large numbers of false positives. Errors in the reference assembly increase the number of false positives. While resources are available to aid in the filtering of variants from human data, for other species these do not yet exist, and strict filtering techniques must be employed that are more likely to exclude true positives. This work assesses the accuracy of the pig reference genome (Sscrofa10.2) using whole genome sequencing (WGS) reads from the Duroc sow on whose genome the assembly was based. Indicators of structural variation, including high regional coverage, unexpected insert sizes, improper pairing and homozygous variants, were used to identify low quality (LQ) regions of the assembly. Low coverage (LC) regions were also identified and analyzed separately. The LQ regions covered 13.85% of the genome, the LC regions covered 26.6% of the genome, and combined (LQLC) they covered 33.07% of the genome. Over half of dbSNP variants were located in the LQLC regions. Of the copy number variable regions identified in a previous study, 86.3% were located in the LQLC regions. The regions were also enriched for gene predictions from RNA-seq data, with 42.98% falling in the LQLC regions. Excluding variants in the LQ, LC, or LQLC regions from future analyses will help reduce the number of false-positive variant calls. Researchers using WGS data should be aware that the current pig reference genome does not give an accurate representation of the copy number of alleles in the original Duroc sow's genome.
Keywords: copy number variable regions; draft assemblies; false positives; misassembly; structural variation
Year: 2015 PMID: 26640477 PMCID: PMC4662242 DOI: 10.3389/fgene.2015.00338
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
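The abstract recommends excluding variants that fall in the LQ, LC, or combined LQLC regions. A minimal sketch of such a filter follows, assuming the regions are available as BED-style `(chrom, start, end)` tuples with 0-based, half-open coordinates. The function names, the coordinate convention, and the toy intervals are illustrative assumptions, not the authors' pipeline or data.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def build_mask(regions):
    """Group (chrom, start, end) regions by chromosome and merge them."""
    by_chrom = {}
    for chrom, start, end in regions:
        by_chrom.setdefault(chrom, []).append((start, end))
    return {chrom: merge_intervals(iv) for chrom, iv in by_chrom.items()}


def in_mask(mask, chrom, pos):
    """Return True if the 0-based position lies inside a masked interval."""
    intervals = mask.get(chrom, [])
    # Index of the last interval whose start is <= pos.
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]


# Hypothetical toy regions; real LQ and LC intervals would be loaded from files.
lq_regions = [("chr1", 100, 200), ("chr1", 150, 300)]
lc_regions = [("chr1", 1000, 2000), ("chr2", 0, 50)]

# The combined LQLC mask is the union of the LQ and LC regions.
lqlc_mask = build_mask(lq_regions + lc_regions)

variants = [("chr1", 99), ("chr1", 100), ("chr1", 299),
            ("chr1", 300), ("chr2", 10), ("chr3", 5)]
kept = [v for v in variants if not in_mask(lqlc_mask, *v)]
print(kept)
```

Merging the intervals first keeps each lookup to a single binary search per variant, which matters when tens of millions of variants are tested against hundreds of thousands of regions.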
Table summarizing the regions identified by different parameters measured.
| Region type | No. of features | Mean feature size (bp) | Percentage of genome |
|---|---|---|---|
| High coverage | 60,281 | 1,202 | 2.6 |
| Small insert | 82,097 | 1,363 | 3.99 |
| Large insert | 31,833 | 1,343 | 1.52 |
| Improperly paired | 77,785 | 1,786 | 4.95 |
| Homozygous variants | 245,972 | 256 | 2.25 |
| Low quality (LQ) | 409,905 | 949 | 13.85 |
| Low coverage (LC) | 119,251 | 6,275 | 26.6 |
| Combined (LQLC) | 337,276 | 2,753 | 33.07 |
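The three numeric columns of this table can be cross-checked against each other. If the mean feature sizes are in base pairs (an assumption here), then number of features × mean size ÷ fraction of genome should give roughly the same assembly length for every row. The sketch below performs that check; the last row is taken to be the combined LQLC set because its 33.07% matches the figure quoted in the abstract.

```python
# (label, number of features, mean feature size, percentage of genome)
rows = [
    ("High coverage", 60_281, 1_202, 2.6),
    ("Small insert", 82_097, 1_363, 3.99),
    ("Large insert", 31_833, 1_343, 1.52),
    ("Improperly paired", 77_785, 1_786, 4.95),
    ("Homozygous variants", 245_972, 256, 2.25),
    ("Low quality (LQ)", 409_905, 949, 13.85),
    ("Low coverage (LC)", 119_251, 6_275, 26.6),
    ("Combined (LQLC)", 337_276, 2_753, 33.07),
]


def implied_genome_length(n_features, mean_size_bp, pct_of_genome):
    """Total assembly length implied by one row, assuming sizes are in bp."""
    return n_features * mean_size_bp / (pct_of_genome / 100.0)


implied = {
    label: implied_genome_length(n, size, pct)
    for label, n, size, pct in rows
}

for label, length in implied.items():
    print(f"{label:22s} {length / 1e9:.2f} Gb")
```

Every row implies an assembly of about 2.8 Gb, so the columns are mutually consistent. The five individual indicators sum to 15.31% of the genome, more than the 13.85% reported for LQ, which is what one would expect if some indicator regions overlap.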
Table summarizing the proportion of called variants in publicly available data that fall in the abnormal regions identified in the current study.
| Dataset | Total | LQ | LC | Combined (LQLC) |
|---|---|---|---|---|
| % of genome | – | 13.85% | 26.6% | 33.07% |
| % of coding region | – | 13.89% | 17.72% | 26.37% |
| dbSNP variants | 52,634,111 | 19,121,760 (36.33%) | 15,483,445 (29.42%) | 27,009,232 (51.3%) |
| CNVRs | 3,118 | 1,081 (34.66%) | 1,706 (54.71%) | 2,692 (86.3%) |
| RNA-seq genes (intersecting bases) | 41,788,900 | 11,155,280 (26.69%) | 11,360,980 (27.19%) | 17,959,798 (42.98%) |
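To make the enrichment claim concrete, the percentages in this table can be recomputed from the raw counts and compared with the share of the genome each region set covers. The sketch below does this with a simple ratio (percentage of features in a region set divided by percentage of genome covered). That ratio is an illustrative measure chosen here, not a statistic reported by the authors, and a value above 1 only indicates more features than a uniform spread across the genome would give.

```python
# Share of the genome covered by each region set (percent), from the table.
genome_pct = {"LQ": 13.85, "LC": 26.6, "LQLC": 33.07}

# Raw counts from the table.
datasets = {
    "dbSNP variants": {"total": 52_634_111, "LQ": 19_121_760,
                       "LC": 15_483_445, "LQLC": 27_009_232},
    "CNVRs": {"total": 3_118, "LQ": 1_081, "LC": 1_706, "LQLC": 2_692},
    "RNA-seq genes (bases)": {"total": 41_788_900, "LQ": 11_155_280,
                              "LC": 11_360_980, "LQLC": 17_959_798},
}


def percent(part, total):
    """Percentage of `total` represented by `part`."""
    return 100.0 * part / total


# (dataset, region) -> (percent of features in region, ratio to genome share)
summary = {}
for name, counts in datasets.items():
    for region, g_pct in genome_pct.items():
        pct = percent(counts[region], counts["total"])
        summary[(name, region)] = (pct, pct / g_pct)

for (name, region), (pct, ratio) in summary.items():
    print(f"{name:24s} {region:5s} {pct:6.2f}%  ratio {ratio:.2f}")
```

For the combined LQLC regions the ratio comes out at roughly 1.55 for dbSNP variants, 2.6 for CNVRs, and 1.3 for RNA-seq gene bases, in line with the abstract's statement that these feature classes are over-represented in the flagged regions.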