Kerensa McElroy, Osvaldo Zagordi, Rowena Bull, Fabio Luciani, Niko Beerenwinkel.
Abstract
BACKGROUND: Deep sequencing is a powerful tool for assessing viral genetic diversity. Such experiments harness the high coverage afforded by next-generation sequencing protocols by treating sequencing reads as a population sample. Distinguishing true single nucleotide variants (SNVs) from sequencing errors remains challenging, however. Current protocols are characterised by high false positive rates, with results requiring time-consuming manual checking.
Year: 2013 PMID: 23879730 PMCID: PMC3848937 DOI: 10.1186/1471-2164-14-501
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Limit of SNV detection. Shown is the maximal error rate ϵ at which SNVs of frequency f are detectable if d sites are analysed simultaneously. In a simple model of SNV calling (see Methods), this bound is ϵ < 1/[1 + ((1 − f)/f)^(1/d)] ≈ f^(1/d), giving rise to the lines of slope 1/d for two haplotypes at Hamming distance d in the double logarithmic plot.
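To make the Figure 1 bound concrete, the following is a minimal sketch that evaluates it numerically. It assumes the bound has the form ϵ < 1/[1 + ((1 − f)/f)^(1/d)], i.e. with the exponent 1/d that produces lines of slope 1/d on a double logarithmic plot. The function name `max_error_rate` is hypothetical and is not taken from the paper or its software.

```python
def max_error_rate(f: float, d: int) -> float:
    """Assumed detection bound: largest per-site error rate at which a
    variant of frequency f is detectable when d sites are analysed jointly.

    Assumption: bound = 1 / (1 + ((1 - f) / f) ** (1 / d)).
    """
    if not 0.0 < f < 1.0:
        raise ValueError("f must lie strictly between 0 and 1")
    if d < 1:
        raise ValueError("d must be a positive integer")
    return 1.0 / (1.0 + ((1.0 - f) / f) ** (1.0 / d))


if __name__ == "__main__":
    # Compare the bound with its small-f approximation f ** (1 / d).
    for d in (1, 2, 5):
        for f in (0.001, 0.01, 0.1):
            bound = max_error_rate(f, d)
            approx = f ** (1.0 / d)
            print(f"d={d} f={f:<6} bound={bound:.4f} approx={approx:.4f}")
```

For d = 1 the expression reduces to f itself, so a single site only supports variants whose frequency exceeds the error rate. Larger d raises the tolerable error rate for the same f.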
Estimated true and intended clone frequencies, HCV
| Clone | Intended (%) | Estimated true (%) | Estimated true (%) |
| --- | --- | --- | --- |
| 023-180,609-2 | 25.0 | 37.1 | 39.2 |
| 023-180,609-1 | 25.0 | 25.9 | 24.7 |
| 023-180,609-6 | 25.0 | 25.0 | 24.3 |
| 023-180,609-5 | 25.0 | 12.0 | 12.0 |
Estimated true and intended clone frequencies, HIV
| Clone | Intended (%) | Estimated true (%) |
| --- | --- | --- |
| 07-56,951 | 25 | 37.6 |
| 07-54,825 | 6.3 | 33.8 |
| 08-04,134 | 3.1 | 12.2 |
| 08-59,712 | 12.5 | 9.1 |
| 07-56,681 | 50 | 5.6 |
| 08-02,659 | 0.2 | 0.8 |
| 08-01,315 | 1.6 | 0.4 |
| 08-04,512 | 0.1 | 0.2 |
| 08-55,163 | 0.8 | 0.2 |
| 08-57,881 | 0.4 | 0.2 |
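SNV calling accuracy
| Dataset | Strand bias test | ShoRAH TP | ShoRAH FP | ShoRAH FN | ShoRAH recall | ShoRAH precision | VarScan TP | VarScan FP | VarScan FN | VarScan recall | VarScan precision | LoFreq TP | LoFreq FP | LoFreq FN | LoFreq recall | LoFreq precision |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |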
| HCV1 | Raw | 38 | 29 | 0 | 1.000 | 0.567 | 38 | 206 | 0 | 1.000 | 0.156 | 37 | 1 | 1 | 0.974 | 0.974 |
| | Fisher’s exact | 32 | 6 | 6 | 0.842 | 0.842 | 36 | 179 | 2 | 0.947 | 0.167 | 29 | 0 | 9 | 0.763 | 1.000 |
| | Bin. (σ = 0) | 37 | 7 | 1 | 0.974 | 0.841 | 36 | 108 | 2 | 0.947 | 0.250 | | | | | |
| | B-bin., σ = 0.0004 | 37 | 7 | 1 | 0.974 | 0.841 | 37 | 108 | 1 | 0.974 | 0.255 | | | | | |
| | B-bin., σ = 0.0014 | 37 | 8 | 1 | 0.974 | 0.822 | 37 | 108 | 1 | 0.974 | 0.255 | | | | | |
| | B-bin., σ = 0.0111 | 38 | 8 | 0 | 1.000 | 0.826 | 38 | 108 | 0 | 1.000 | 0.261 | | | | | |
| HCV2 | Raw | 38 | 50 | 0 | 1.000 | 0.432 | 38 | 826 | 0 | 1.000 | 0.044 | 37 | 1 | 1 | 0.974 | 0.974 |
| | Fisher’s exact | 33 | 9 | 5 | 0.868 | 0.785 | 36 | 766 | 2 | 0.947 | 0.045 | 29 | 0 | 9 | 0.763 | 1.000 |
| | Bin. (σ = 0) | 34 | 8 | 4 | 0.895 | 0.810 | 35 | 577 | 3 | 0.921 | 0.057 | | | | | |
| | B-bin., σ = 0.0004 | 36 | 8 | 2 | 0.947 | 0.818 | 36 | 577 | 2 | 0.947 | 0.059 | | | | | |
| | B-bin., σ = 0.0014 | 36 | 8 | 2 | 0.947 | 0.818 | 36 | 577 | 2 | 0.947 | 0.059 | | | | | |
| | B-bin., σ = 0.0111 | 38 | 13 | 0 | 1.000 | 0.745 | 38 | 589 | 0 | 1.000 | 0.061 | | | | | |
| HIV | Raw | 153 | 2 | 35 | 0.814 | 0.987 | 175 | 732 | 13 | 0.931 | 0.193 | 121 | 0 | 67 | 0.644 | 1.000 |
| | Fisher’s exact | 87 | 0 | 101 | 0.462 | 1.000 | 125 | 671 | 63 | 0.665 | 0.157 | 41 | 0 | 147 | 0.218 | 1.000 |
| | Bin. (σ = 0) | 88 | 0 | 100 | 0.468 | 1.000 | 113 | 473 | 75 | 0.601 | 0.193 | | | | | |
| | B-bin., σ = 0.0004 | 101 | 1 | 87 | 0.537 | 0.990 | 121 | 474 | 67 | 0.644 | 0.203 | | | | | |
| | B-bin., σ = 0.0014 | 126 | 1 | 62 | 0.670 | 0.992 | 135 | 475 | 53 | 0.718 | 0.221 | | | | | |
| | B-bin., σ = 0.0111 | 151 | 2 | 37 | 0.803 | 0.987 | 162 | 490 | 26 | 0.862 | 0.248 | | | | | |
SNV calling statistics for ShoRAH, VarScan, and LoFreq. For ShoRAH and VarScan, SNV calls without the strand bias test (Raw) and using different values of the beta-binomial dispersion parameter σ in the strand bias test are given, with σ = 0 corresponding to a binomial forward read distribution. For LoFreq, the results of applying our strand bias test are absent as this software does not report forward and reverse strand counts. For all SNV calling methods, the results of applying Fisher’s exact test to the raw output are also given (this was possible for LoFreq as it reports a strand bias Fisher’s exact test p-value for each variant). Reported statistics include true positives (TP), i.e., genomic sites with a variant matching a known true variant, false positives (FP), i.e., genomic sites with a variant that is not a known true variant, and false negatives (FN), i.e., known variants which are not identified by the relevant SNV calling method. Individual genomic sites may contribute to both true positives and false positives. Recall (TP/(TP + FN)) and precision (TP/(TP + FP)) are also reported.
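The recall and precision definitions given in this caption can be checked directly against the table rows. The snippet below is a small sketch of that arithmetic; the function names are illustrative only, and the example counts are the first row of the SNV calling table (TP = 38, FP = 29, FN = 0).

```python
def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN), as defined in the table caption."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP), as defined in the table caption."""
    return tp / (tp + fp)


if __name__ == "__main__":
    # Counts taken from the first data row of the SNV calling table.
    tp, fp, fn = 38, 29, 0
    print(f"recall    = {recall(tp, fn):.3f}")     # 1.000
    print(f"precision = {precision(tp, fp):.3f}")  # 0.567
```

Because TP + FN is the number of known true variants, it stays constant within a dataset (38 in this example), which gives a quick consistency check on any single row.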
Figure 2. SNV recall and false positives by frequency. Detailed analysis of SNV calling for variants with population frequency less than 15%. Both raw results and results subjected to a strand bias test with σ = 0.0111 are given. In cases where cross hairs fill a symbol of the same colour, the strand bias test had no effect on results. (a) Recall of true SNVs for HIV data, by frequency (calculated using known haplotype frequencies; see Table 2). Note: for SNVs with population frequency greater than 15%, recall was perfect for all methods. For this reason, recall plots for HCV1 and HCV2 are also omitted, as they are straight lines with perfect recall = 1. (b) HIV, (c) HCV run 1, and (d) HCV run 2, absolute false positive SNV counts by predicted frequency for each method, binned in intervals of 1%. A double logarithmic scale is used for plots (b), (c), and (d); thus data points where no false positives are recorded are omitted. For all ShoRAH runs, no false positives were observed with population frequency greater than 15%. Some false positives with population frequency greater than 15% occurred for the VarScan runs; however, close inspection revealed the majority of these to be the result of a VarScan bug when calling SNVs for genomic positions with two or more variants.
Figure 3. Precision-recall curves for various strand bias tests. Precision versus recall, for strand bias tests performed on raw SNV calls from (a) ShoRAH applied to HCV run 1, (b) VarScan applied to HCV run 1, (c) ShoRAH applied to HCV run 2, and (d) VarScan applied to HCV run 2. In all cases, strand bias tests based on an underlying beta-binomial forward read distribution exhibited greater or equal precision for the same recall value, when compared to either a strand bias test with a binomial forward read distribution, or a Fisher’s exact test of strand bias.
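As a rough illustration of how a precision-recall curve of this kind can be traced, the sketch below sweeps a significance threshold over strand bias p-values and recomputes precision and recall at each step. This is an assumed procedure rather than the authors' code: the function name, the rule that a call is kept when its p-value is at least the threshold, and the toy p-values are all hypothetical.

```python
from typing import List, Tuple


def precision_recall_points(
    calls: List[Tuple[float, bool]], n_true: int
) -> List[Tuple[float, float, float]]:
    """Return (threshold, recall, precision) for each distinct p-value.

    calls  : (strand_bias_p_value, is_true_variant) for every raw SNV call
    n_true : number of known true variants in the dataset (TP + FN)

    Assumption: a call survives the strand bias test at threshold alpha
    when its p-value is >= alpha (no significant strand bias).
    """
    points = []
    for alpha in sorted({p for p, _ in calls}):
        kept = [is_true for p, is_true in calls if p >= alpha]
        tp = sum(kept)
        fp = len(kept) - tp
        points.append((alpha, tp / n_true, tp / (tp + fp)))
    return points


if __name__ == "__main__":
    # Toy data, invented for illustration: two true variants with little
    # strand bias and two false calls with stronger strand bias.
    toy_calls = [(0.9, True), (0.5, True), (0.2, False), (0.01, False)]
    for alpha, rec, prec in precision_recall_points(toy_calls, n_true=2):
        print(f"alpha={alpha:<5} recall={rec:.2f} precision={prec:.2f}")
```

Raising the threshold removes more strand-biased calls, which tends to trade recall for precision. Comparing the resulting curves across tests is what the figure panels summarise.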