Christopher R Cabanski, Matthew D Wilkerson, Matthew Soloway, Joel S Parker, Jinze Liu, Jan F Prins, J S Marron, Charles M Perou, D Neil Hayes.
Abstract
Identifying variants using high-throughput sequencing data is currently a challenge because true biological variants can be indistinguishable from technical artifacts. One source of technical artifact results from incorrectly aligning experimentally observed sequences to their true genomic origin ('mismapping') and inferring differences in mismapped sequences to be true variants. We developed BlackOPs, an open-source tool that simulates experimental RNA-seq and DNA whole exome sequences derived from the reference genome, aligns these sequences by custom parameters, detects variants and outputs a blacklist of positions and alleles caused by mismapping. Blacklists contain thousands of artifact variants that are indistinguishable from true variants and, for a given sample, are expected to be almost completely false positives. We show that these blacklist positions are specific to the alignment algorithm and read length used, and BlackOPs allows users to generate a blacklist specific to their experimental setup. We queried the dbSNP and COSMIC variant databases and found numerous variants indistinguishable from mapping errors. We demonstrate how filtering against blacklist positions reduces the number of potential false variants using an RNA-seq glioblastoma cell line data set. In summary, accounting for mapping-caused variants tuned to experimental setups reduces false positives and, therefore, improves genome characterization by high-throughput sequencing.
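The filtering step described in the abstract can be illustrated with a minimal Python sketch. It is not BlackOPs code: the tab-separated blacklist layout (chromosome, 1-based position, alternate allele) and the names `Variant`, `load_blacklist` and `filter_variants` are assumptions made here for illustration only. The sketch shows the general idea of dropping a called variant when both its position and its allele match a blacklist entry.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Variant(NamedTuple):
    """A called single-nucleotide variant (hypothetical record type)."""
    chrom: str
    pos: int   # 1-based position
    ref: str
    alt: str


BlacklistKey = Tuple[str, int, str]  # (chrom, pos, alt allele)


def load_blacklist(lines: Iterable[str]) -> Set[BlacklistKey]:
    """Parse 'chrom<TAB>pos<TAB>alt' lines; this layout is an assumption."""
    keys: Set[BlacklistKey] = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        chrom, pos, alt = line.split("\t")[:3]
        keys.add((chrom, int(pos), alt.upper()))
    return keys


def filter_variants(
    variants: Iterable[Variant], blacklist: Set[BlacklistKey]
) -> Tuple[List[Variant], List[Variant]]:
    """Split variants into (kept, removed) by position-and-allele match."""
    kept: List[Variant] = []
    removed: List[Variant] = []
    for v in variants:
        if (v.chrom, v.pos, v.alt.upper()) in blacklist:
            removed.append(v)
        else:
            kept.append(v)
    return kept, removed


# Toy data, invented for this example only.
blacklist = load_blacklist(["# chrom\tpos\talt", "chr1\t12345\tT", "chr2\t500\tG"])
calls = [
    Variant("chr1", 12345, "C", "T"),  # position and allele both blacklisted
    Variant("chr1", 12345, "C", "A"),  # same position, different allele
    Variant("chr2", 501, "A", "G"),    # different position
]
kept, removed = filter_variants(calls, blacklist)
```

Matching on the allele as well as the position follows the abstract's description of a blacklist of "positions and alleles". Because the abstract reports that blacklist positions depend on the aligner and read length, such a filter would presumably be paired with a blacklist generated for the same experimental setup.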
Year: 2013 PMID: 23935067 PMCID: PMC3799449 DOI: 10.1093/nar/gkt692
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. The proportion of unmapped, multimapped and uniquely mismapped reads for (A) all eight SE data sets and (B) all eight PE data sets. The remaining reads were correctly mapped.
Figure 2. (A) The number of mismatch positions (SNDs) covered by at least one non-reference base for the eight SE data sets, where the number of exon positions is shaded black. The overlap of SNDs across the four SE data sets aligned with (B) MapSplice and (C) TopHat, showing that these positions are highly dependent on read length. (D) Total number of called variants, where the number of exon positions is shaded black. Although MapSplice has a larger number of SNDs, TopHat has more called variants.
Figure 3. (A) The number of exon SNDs reported in dbSNP for the eight SE data sets. Each shaded bar represents a different version of dbSNP (131, 132 and 135). (B) The number of SNDs reported in COSMIC.
Figure 4. (A) The proportion of unmapped, multimapped and uniquely mismapped reads for all eight SE data sets after manually inserting known SNPs from dbSNP. (B) The number of SNDs covered by at least one non-reference base, where the number of exon positions is shaded black.
Figure 5. The proportion of multimapped and uniquely mismapped reads for the four DNA-WES data sets aligned with MapSplice that match the reference sequence (left) and have known SNPs from dbSNP manually inserted (right). None of the DNA-WES data sets have any unmapped reads.