Nathan D Olson, Steven P Lund, Rebecca E Colman, Jeffrey T Foster, Jason W Sahl, James M Schupp, Paul Keim, Jayne B Morrow, Marc L Salit, Justin M Zook.
Abstract
Innovations in sequencing technologies have allowed biologists to make incredible advances in understanding biological systems. As experience grows, researchers increasingly recognize that analyzing the wealth of data provided by these new sequencing platforms requires careful attention to detail for robust results. Thus far, much of the scientific community's focus in bacterial genomics has been on evaluating genome assembly algorithms and rigorously validating assembly program performance. Missing, however, is a focus on critical evaluation of variant callers for these genomes. Variant calling is essential for comparative genomics as it yields insights into nucleotide-level organismal differences. Variant calling is a multistep process with a host of potential error sources that may lead to incorrect variant calls. Identifying and resolving these incorrect calls is critical for bacterial genomics to advance. The goal of this review is to provide guidance on validating algorithms and pipelines used in variant calling for bacterial genomics. First, we will provide an overview of the variant calling procedures and the potential sources of error associated with the methods. We will then identify appropriate datasets for use in evaluating algorithms and describe statistical methods for evaluating algorithm performance. As variant calling moves from basic research to the applied setting, standardized methods for performance evaluation and reporting are required; it is our hope that this review provides the groundwork for the development of these standards.
Keywords: indel; next-generation sequencing; performance metrics; single nucleotide variants; variant calling
Year: 2015 PMID: 26217378 PMCID: PMC4493402 DOI: 10.3389/fgene.2015.00235
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. SNP calling workflow diagram. Horizontal boxes represent steps in the workflow, and arrows to the left indicate the steps that are challenged with reference genomic DNA and sequence data.
FIGURE 2. Cause-effect diagram indicating the sources of error associated with different steps in the variant calling measurement process. Note that the SNP calling is performed using one of two methods, either read mapping or de novo assembly.
Definitions of common performance metrics used in evaluating variant callers.
| Metric | Definition (optimal value) | Synonyms |
| --- | --- | --- |
| Accuracy | Ratio of correct calls to total calls and variants (1) | |
| Specificity | Non-variants not called as variants relative to the total non-variants (1) | |
| Sensitivity | True variants called relative to all variants (1) | Recall, true positive rate (TPR), positive call rate |
| Precision | True variants called relative to total calls (1) | Positive predictive value (PPV) |
| False positive rate | Non-variants called relative to the total non-variants (0) | |
FIGURE 3. Contingency table. True Positive, False Positive, False Negative, and True Negative are defined by the relationship between variants called by the SNP calling algorithm and known differences between the reference genome and the analyzed sample.
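The metric definitions in the table can be written directly in terms of the four contingency table counts. The following is a minimal sketch rather than the authors' code: the names `ContingencyTable` and `performance_metrics` and the example counts are hypothetical, and accuracy is read here in its usual sense of correct calls (true positives plus true negatives) over all four cells, which is an interpretation of the table's wording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContingencyTable:
    """Counts for one call set (hypothetical container, not from the source)."""
    tp: int  # true variants that were called
    fp: int  # non-variant positions called as variants
    fn: int  # true variants that were not called
    tn: int  # non-variant positions that were not called


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN instead of raising when a metric is undefined.
    return numerator / denominator if denominator else float("nan")


def performance_metrics(t: ContingencyTable) -> dict:
    """Compute the five metrics from the table of definitions."""
    total = t.tp + t.fp + t.fn + t.tn
    return {
        "accuracy": _ratio(t.tp + t.tn, total),
        "specificity": _ratio(t.tn, t.tn + t.fp),
        "sensitivity": _ratio(t.tp, t.tp + t.fn),
        "precision": _ratio(t.tp, t.tp + t.fp),
        "false_positive_rate": _ratio(t.fp, t.fp + t.tn),
    }


# Made-up counts, for illustration only.
example = ContingencyTable(tp=90, fp=10, fn=10, tn=890)
metrics = performance_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Specificity and false positive rate share a denominator and sum to 1, so reporting either one is sufficient.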
FIGURE 4. Comparison of two variant calling algorithms using two of the performance metrics defined in the performance metrics table. Methods A (red) and B (teal) indicate two different variant calling methods. Left: A smoothing function (generalized additive model) was used to summarize the contingency table metrics across the considered quality score interval. Red and teal lines are smoothing functions, and the gray area represents the 95% confidence interval. The vertical dashed line indicates the quality score cutoff (Q = 3195) for the static tables. Right: Boxplots are used to summarize the performance metrics calculated from the contingency table values for the replicate datasets at the defined cutoff value.
FIGURE 5. Scatter plot showing the relationship between two performance metrics for variant call sets. The individual data points are based on metrics calculated from static contingency tables. The error bars represent the 95% confidence interval for each performance metric.
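The caption does not state how the 95% confidence intervals were computed. As one possible approach, a metric that is a proportion of counts (such as sensitivity or precision) can be given a Wilson score interval. The sketch below is an assumption for illustration, not the method used for the figure; the function name `wilson_interval` and the example counts are hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.959963984540054):
    """Wilson score interval for a binomial proportion.

    The default z is the standard normal quantile for a two-sided 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p_hat = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p_hat + z2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials)) / denom
    )
    return center - half_width, center + half_width


# Example: sensitivity with 90 of 100 true variants called (made-up counts).
low, high = wilson_interval(90, 100)
print(f"sensitivity = 0.90, 95% CI = ({low:.3f}, {high:.3f})")
```

The Wilson interval stays within [0, 1] and behaves reasonably when a metric is close to 0 or 1, which is common for well-performing variant callers.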
FIGURE 6. Curves represent the relationship between the two performance metrics, in this case true positive and false positive rate. Replicate samples are plotted individually to indicate the robustness in the relative performance of the two methods.
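A curve of true positive rate against false positive rate can be traced by recomputing both metrics at each candidate quality score cutoff. The sketch below assumes that each called position carries a quality score and a truth label, and that the numbers of true variants and non-variant positions are known from the reference; the function name `roc_points` and the toy data are hypothetical and do not come from the source.

```python
def roc_points(calls, n_true_variants, n_non_variants):
    """Return (cutoff, false_positive_rate, true_positive_rate) tuples.

    calls: iterable of (quality, is_true_variant) pairs, one per called position.
    A call is kept at a given cutoff when its quality is >= the cutoff.
    """
    calls = list(calls)
    points = []
    # Visit cutoffs from strictest to most permissive.
    for cutoff in sorted({quality for quality, _ in calls}, reverse=True):
        kept = [is_true for quality, is_true in calls if quality >= cutoff]
        tp = sum(kept)          # kept calls that are true variants
        fp = len(kept) - tp     # kept calls that are not true variants
        points.append((cutoff, fp / n_non_variants, tp / n_true_variants))
    return points


# Toy call set: five calls, four true variants in the sample (one never called),
# and 100 non-variant positions.
toy_calls = [(50, True), (40, True), (30, False), (20, True), (10, False)]
curve = roc_points(toy_calls, n_true_variants=4, n_non_variants=100)
for cutoff, fpr, tpr in curve:
    print(f"Q >= {cutoff}: FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

Running this once per replicate and plotting each resulting curve separately mirrors the way the figure presents replicate samples individually.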