Tony Kuo, Martin C Frith, Jun Sese, Paul Horton.
Abstract
BACKGROUND: Reliable detection of genome variations, especially insertions and deletions (indels), from single-sample DNA sequencing data remains challenging, partly because of the inherent uncertainty involved in aligning sequencing reads to the reference genome. In practice, a variety of ad hoc quality filtering methods are employed to produce more reliable lists of putative variants, but the resulting lists typically still include numerous false positives. It would therefore be desirable to rigorously evaluate the degree to which each putative variant is supported by the data. Unfortunately, users who wish to do this, e.g. to prioritize validation experiments, have been faced with limited options.
Keywords: Generative probabilistic models; Genomic variants; Next generation sequencing data analysis; Variant calling; Variant quality score
Year: 2018; PMID: 29697369; PMCID: PMC5918433; DOI: 10.1186/s12920-018-0342-1
Source DB: PubMed; Journal: BMC Med Genomics; ISSN: 1755-8794; Impact factor: 3.063
Fig. 1. The model aims to handle the various uncertainties inherent in the variant calling process in order to calculate and compare the probability of the data given the candidate variant sequence with the probability of the data given the reference sequence. Put simply, the more uncertain the data (g represents other possible sources of read r), the more uncertain the explanation.
Fig. 2. A high-level overview of the EAGLE workflow. EAGLE requires read alignments in BAM format, a set of candidate variants in VCF format, and the reference genome as a FASTA file. Preprocessing in this study refers to the preprocessing steps described in the GATK best practices.
Fig. 3. Precision vs. recall for simulated reads based on NS12911 at ∼30× coverage, shown for indels (top) and SNPs (bottom). Solid lines represent EAGLE’s likelihood. Dotted lines represent the caller’s quality score. Recall levels are shown in increments of 50 variant calls, with the maximum level based on the number of variants in the NS12911 benchmark set. The variant calls were ranked by our model’s marginal posterior probability or by each caller’s quality score, respectively. Precision is the fraction of high-ranking variants that are correct, plotted over a wide range of thresholds.
Fig. 4. Precision vs. recall for NA12878 using benchmarks from exome sequencing GIAB, for indels (a, c) and SNPs (b, d). Plots (a) and (b) show the full precision vs. recall for each method. Solid lines represent EAGLE’s likelihood. Dotted lines represent the caller’s quality score. Recall levels are shown in increments of 50 variant calls, with the maximum level based on the number of variants in the GIAB benchmark set. The variant calls were ranked by our model’s marginal posterior probability or by each caller’s quality score, respectively. Precision is the fraction of high-ranking variants that are correct, plotted over a range of thresholds. Plots (c) and (d) show the best precision at a given recall among all methods with EAGLE versus among all methods without EAGLE, for indels and SNPs respectively.
Fig. 5. Precision vs. recall for NA12878 using benchmarks from whole genome sequencing Illumina Platinum Genome, for indels (a, c) and SNPs (b, d). Plots (a) and (b) show the precision vs. recall for each method. Solid lines represent EAGLE’s likelihood. Dotted lines represent the caller’s quality score. Recall levels are shown in increments of 50 variant calls, with the maximum level based on the number of variants in the GIAB benchmark set. The variant calls were ranked by our model’s marginal posterior probability or by each caller’s quality score, respectively. Precision is the fraction of high-ranking variants that are correct, plotted over a wide range of thresholds. Plots (c) and (d) show the best precision at a given recall among all methods with EAGLE versus among all methods without EAGLE, for indels and SNPs respectively.
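The ranking procedure behind the precision vs. recall curves in the captions above can be illustrated with a short sketch. This is not the authors' evaluation code. It assumes that a "recall level" means the number of benchmark variants recovered so far, and that precision at that level is the fraction of calls ranked at or above it that appear in the benchmark set. The function name `precision_at_recall_levels` and the `(chrom, pos, ref, alt)` variant key are hypothetical choices made for illustration.

```python
from typing import Hashable, Iterable, List, Set, Tuple


def precision_at_recall_levels(
    scored_calls: Iterable[Tuple[Hashable, float]],
    truth: Set[Hashable],
    step: int = 50,
) -> List[Tuple[int, float]]:
    """Return (true positives recovered, precision) pairs.

    scored_calls: unique (variant_key, score) pairs; a higher score means
        a more confident call (e.g. a posterior probability or a caller's
        quality score). This input layout is an assumption.
    truth: variant keys in the benchmark set.
    step: recall increment, 50 in the figure captions.
    """
    # Rank calls from most to least confident.
    ranked = sorted(scored_calls, key=lambda call: call[1], reverse=True)

    curve = []
    true_positives = 0
    for n_called, (variant, _score) in enumerate(ranked, start=1):
        if variant not in truth:
            continue  # a false positive only lowers later precision values
        true_positives += 1
        # Record a point at every `step` recovered variants, and once more
        # if the whole benchmark set has been recovered.
        if true_positives % step == 0 or true_positives == len(truth):
            curve.append((true_positives, true_positives / n_called))
    return curve
```

Running the same function once with a model-based score and once with a caller's quality score, on the same candidate calls, yields two curves that can be compared at matching recall levels.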