Luca Ferretti, Chandana Tennakoon, Adrian Silesian, Graham Freimanis and Paolo Ribeca.
Abstract
Current high-throughput sequencing technologies can generate sequence data and provide information on the genetic composition of samples at very high coverage. Deep sequencing approaches enable the detection of rare variants in heterogeneous samples, such as viral quasi-species, but also have the undesired effect of amplifying sequencing errors and artefacts. Distinguishing real variants from such noise is not straightforward. Variant callers that can handle pooled samples can struggle at extremely high read depths, while at lower depths sensitivity is often sacrificed for specificity. In this paper, we propose SiNPle (Simplified Inference of Novel Polymorphisms from Large coveragE), a fast and effective software tool for variant calling. SiNPle is based on a simplified Bayesian approach to compute the posterior probability that a variant is not generated by sequencing errors or PCR artefacts. The Bayesian model takes into consideration individual base qualities as well as their distribution, the baseline error rates during both the sequencing and the PCR stages, the prior distribution of variant frequencies and their strandedness. Our approach leads to an approximate but extremely fast computation of posterior probabilities even for very high coverage data, since the expression for the posterior distribution is a simple analytical formula in terms of summary statistics for the variants appearing at each site in the genome. These statistics can be used to filter putative SNPs and indels according to the required level of sensitivity. We tested SiNPle on several simulated and real-life viral datasets to show that it is faster and more sensitive than existing methods. The source code for SiNPle is freely available to download and compile, or as a Conda/Bioconda package.
Keywords: heterogeneous populations; Bayesian modelling; low-frequency variants; next generation sequencing
Year: 2019 PMID: 31349684 PMCID: PMC6722845 DOI: 10.3390/genes10080561
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
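The kind of posterior test described in the abstract can be illustrated with a deliberately simplified toy model. The sketch below is not the SiNPle formula: it ignores base qualities, PCR errors and strandedness, and it assumes a single per-base error rate, a uniform prior on the variant frequency, and a fixed prior probability that a site carries a real variant. The function names and default parameter values are hypothetical and chosen only for illustration.

```python
import math


def log_binom_pmf(k, n, p):
    """Log of the Binomial(n, p) probability of observing k successes."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)


def posterior_real_variant(k, n, error_rate=0.01, prior_real=1e-3):
    """Toy posterior probability that k variant reads out of n are not all errors.

    Assumptions (illustrative only, not taken from the paper):
      * under the error hypothesis, k ~ Binomial(n, error_rate);
      * under the real-variant hypothesis, the frequency is uniform on [0, 1],
        so the marginal likelihood of any k is 1 / (n + 1);
      * prior_real is the prior probability that a site holds a real variant.
    """
    log_error = math.log1p(-prior_real) + log_binom_pmf(k, n, error_rate)
    log_real = math.log(prior_real) - math.log(n + 1)
    diff = log_error - log_real
    if diff > 700:  # avoid overflow in exp(); the posterior is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


if __name__ == "__main__":
    # At coverage 5000 with a 1% error rate, about 50 error reads are expected.
    for k in (50, 100, 150):
        print(k, posterior_real_variant(k, 5000))
```

Because the result depends only on the counts `k` and `n`, the cost per site does not grow with coverage, which is the property the abstract attributes to working with summary statistics.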
Figure 1. (Left) Number of variants as a function of the number of CirSeq reads supporting them. (Right) Schematic depiction of the procedure used to determine the sets of true and false positive variants from CirSeq data.
Number of false and true positive variants returned by the selection procedure of Figure 1 as a function of the coverage threshold.
| Coverage Threshold | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| True positives | 1125 | 233 | 96 | 59 | 35 |
| False positives | 10,158 | 4739 | 2386 | 1302 | 739 |
Figure 2. Comparison of variants discovered on simulated poliovirus data by SiNPle (Simplified Inference of Novel Polymorphisms from Large coveragE), LoFreq and VarScan2. We simulated coverages of 500 and 5000, and for each coverage value three different levels of error: 0.6%, 1%, and 5%.
Figure 3. Receiver operating characteristic (ROC) curves for the specificity and sensitivity of variants predicted by SiNPle, LoFreq, VarScan2 (both standard and sensitive mode), and a control method based on the random selection of variable positions on the genome irrespective of whether they are true variants or not (see Section 2), from an input of 100,000 reads randomly extracted from a poliovirus dataset sequenced with CirSeq [19]. CirSeq allows the experimental discovery and validation of very low-frequency variants present at population level. The curves were obtained by setting different thresholds on the lists of true and false positives identified by calling consensus on CirSeq reads (see Section 2). All three methods were run in both a default and a sensitive mode (see Section 2 for their definitions).
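As an illustration of how ROC points of this kind can be derived, the following minimal sketch sweeps a threshold over caller scores assigned to a reference set of true variants and a reference set of false variants. The scores and the function name `roc_points` are hypothetical; the paper's actual procedure is the one defined in its Section 2.

```python
def roc_points(true_scores, false_scores):
    """Return (false positive rate, sensitivity) pairs for every score threshold.

    true_scores:  caller scores for variants in the reference true set
    false_scores: caller scores for variants in the reference false set
    A variant counts as "called" when its score is >= the threshold.
    Specificity is 1 minus the false positive rate.
    """
    thresholds = sorted(set(true_scores) | set(false_scores), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing is called
    for t in thresholds:
        sensitivity = sum(s >= t for s in true_scores) / len(true_scores)
        fpr = sum(s >= t for s in false_scores) / len(false_scores)
        points.append((fpr, sensitivity))
    return points


if __name__ == "__main__":
    # Hypothetical posterior-like scores, for illustration only.
    true_scores = [0.9, 0.8, 0.4]
    false_scores = [0.7, 0.3, 0.2, 0.1]
    for fpr, sens in roc_points(true_scores, false_scores):
        print(f"specificity={1 - fpr:.2f} sensitivity={sens:.2f}")
```

A control method that selects positions at random would be expected to give points close to the diagonal, where sensitivity equals the false positive rate.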
Variants called by SiNPle, LoFreq and VarScan2 on 100,000 randomly sampled CirSeq reads. The table also shows the number of consensus reads from CirSeq that validate the second most frequent genotype. The variants in the table are the list of true positives thresholded at coverage 5 (see Section 2) and hence they are all likely to be real. We chose coverage 5 in order to obtain a table of reasonable size; other values produce similar results (see Figure 3).
| Position | Genotype | Supporting reads | LoFreq (default) | LoFreq (sensitive) | SiNPle (default) | VarScan2 (default) | VarScan2 (sensitive) |
|---|---|---|---|---|---|---|---|
| 2133 | | 172 | ✓ | ✓ | ✓ | ✓ | ✓ |
| 4348 | | 42 | ✓ | ✓ | ✓ | ✓ | |
| 6952 | | 15 | ✓ | ✓ | | | |
| 2456 | | 13 | ✓ | ✓ | | | |
| 4357 | | 13 | ✓ | ✓ | | | |
| 1867 | | 11 | ✓ | ✓ | | | |
| 2547 | | 10 | ✓ | ✓ | ✓ | ✓ | |
| 4104 | | 10 | ✓ | ✓ | ✓ | ✓ | |
| 1870 | | 8 | ✓ | ✓ | | | |
| 3255 | | 8 | ✓ | ✓ | ✓ | | |
| 4994 | | 8 | ✓ | ✓ | | | |
| 5091 | | 8 | ✓ | | | | |
| 5233 | | 8 | ✓ | ✓ | | | |
| 6937 | | 8 | ✓ | ✓ | | | |
| 2009 | | 7 | ✓ | ✓ | | | |
| 3950 | | 7 | ✓ | ✓ | | | |
| 4100 | | 7 | ✓ | ✓ | ✓ | ✓ | |
| 6224 | | 7 | ✓ | ✓ | | | |
| 3978 | | 6 | ✓ | ✓ | | | |
| 4203 | | 6 | ✓ | ✓ | | | |
| 5143 | | 6 | ✓ | ✓ | | | |
| 5192 | | 6 | ✓ | ✓ | | | |
| 6802 | | 6 | ✓ | ✓ | | | |
| 7029 | | 6 | ✓ | ✓ | | | |
| 2088 | | 5 | ✓ | ✓ | | | |
| 2684 | | 5 | ✓ | ✓ | | | |
| 3690 | | 5 | ✓ | ✓ | | | |
| 4356 | | 5 | ✓ | | | | |
| 6234 | | 5 | ✓ | ✓ | | | |
| 6413 | | 5 | ✓ | ✓ | | | |
| 6477 | | 5 | ✓ | | | | |
| 6508 | | 5 | ✓ | ✓ | | | |
| 7080 | | 5 | ✓ | ✓ | | | |
| 7320 | | 5 | ✓ | ✓ | | | |
Figure 4. Comparison of variants discovered on IBV and HIV datasets using SiNPle, LoFreq and VarScan2 in their default modes.
Time taken in seconds to process the IBV, HIV and poliovirus datasets on a single core using default settings. The time is the average of three runs, and figures for SiNPle and VarScan2 include the time spent on generating the pileup with samtools pileup. For LoFreq, pileup generation is not a factor as LoFreq works directly with BAM files.
| Dataset | Reads | Pileup (s) | SiNPle (s) | VarScan2 (s) | LoFreq (s) |
|---|---|---|---|---|---|
| HIV | 74,996 | 20.5 | 23.4 | 43.2 | 72.3 |
| Poliovirus | 100,000 | 34.8 | 36.6 | 45.7 | 51.2 |
| IBV | 3,778,012 | 690 | 755 | 910 | 7,324 |
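Since the SiNPle and VarScan2 figures include pileup generation, it can help to separate the two contributions. The short sketch below uses only the values from the table above; the dictionary layout and the helper names `calling_only` and `speedup` are illustrative choices, not part of any of the tools.

```python
# Times in seconds, copied from the table above.
timings = {
    "HIV": {"pileup": 20.5, "SiNPle": 23.4, "VarScan2": 43.2, "LoFreq": 72.3},
    "Poliovirus": {"pileup": 34.8, "SiNPle": 36.6, "VarScan2": 45.7, "LoFreq": 51.2},
    "IBV": {"pileup": 690.0, "SiNPle": 755.0, "VarScan2": 910.0, "LoFreq": 7324.0},
}


def calling_only(row, tool):
    """Time spent by a pileup-based tool after subtracting pileup generation."""
    return row[tool] - row["pileup"]


def speedup(row, slow, fast):
    """Ratio of total run times (values above 1 mean `fast` is quicker)."""
    return row[slow] / row[fast]


if __name__ == "__main__":
    for name, row in timings.items():
        print(
            f"{name}: SiNPle calling {calling_only(row, 'SiNPle'):.1f} s, "
            f"VarScan2 calling {calling_only(row, 'VarScan2'):.1f} s, "
            f"LoFreq/SiNPle total-time ratio {speedup(row, 'LoFreq', 'SiNPle'):.2f}"
        )
```

On the IBV dataset, for example, SiNPle's total run time is dominated by the 690 s pileup step, and the LoFreq to SiNPle total-time ratio is roughly 9.7.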