Andreas Wilm, Pauline Poh Kim Aw, Denis Bertrand, Grace Hui Ting Yeo, Swee Hoe Ong, Chang Hua Wong, Chiea Chuen Khor, Rosemary Petric, Martin Lloyd Hibberd, Niranjan Nagarajan.
Abstract
The study of cell-population heterogeneity in a range of biological systems, from viruses to bacterial isolates to tumor samples, has been transformed by recent advances in sequencing throughput. While the high-coverage afforded can be used, in principle, to identify very rare variants in a population, existing ad hoc approaches frequently fail to distinguish true variants from sequencing errors. We report a method (LoFreq) that models sequencing run-specific error rates to accurately call variants occurring in <0.05% of a population. Using simulated and real datasets (viral, bacterial and human), we show that LoFreq has near-perfect specificity, with significantly improved sensitivity compared with existing methods and can efficiently analyze deep Illumina sequencing datasets without resorting to approximations or heuristics. We also present experimental validation for LoFreq on two different platforms (Fluidigm and Sequenom) and its application to call rare somatic variants from exome sequencing datasets for gastric cancer. Source code and executables for LoFreq are freely available at http://sourceforge.net/projects/lofreq/.
Year: 2012 PMID: 23066108 PMCID: PMC3526318 DOI: 10.1093/nar/gks918
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Distribution of clinical dengue virus sequencing datasets
| Serotype | Drug | Placebo | Total |
|---|---|---|---|
| DENV1 | 8 (19) | 11 (22) | 19 (41) |
| DENV2 | 5 (11) | 2 (4) | 7 (15) |
| DENV3 | 2 (5) | 2 (4) | 4 (9) |
| Total | 15 (35) | 15 (30) | 30 (65) |
The samples analyzed here were collected as part of a drug-trial study for the nucleoside-analog Balapiravir (31). Numbers in parentheses report the total number of samples sequenced, while un-parenthesized numbers report the number of pairs (a pre- and a post-dose sample) that were sequenced.
Performance of variant callers as a function of coverage
| Coverage | Metric | Goto | Wright | Breseq* | SNVer | LoFreq |
|---|---|---|---|---|---|---|
| 50× | Sensitivity | 61 | 71 | 56 | 58 | 60 |
| | PPV | 100 | 50 | 100 | 100 | 100 |
| 100× | Sensitivity | 64 | 76 | 59 | 62 | 64 |
| | PPV | 100 | 33 | 100 | 100 | 100 |
| 500× | Sensitivity | 66 | 90 | 66 | 67 | 73 |
| | PPV | 100 | 9 | 100 | 100 | 100 |
| 1000× | Sensitivity | 67 | 95 | 68 | 70 | 77 |
| | PPV | 100 | 5 | 100 | 100 | 100 |
| 5000× | Sensitivity | 67 | 100 | 76 | 74 | 87 |
| | PPV | 100 | 1 | 100 | 100 | 100 |
| 10 000× | Sensitivity | 67 | 100 | 78 | 77 | 94 |
| | PPV | 100 | 2 | 100 | 100 | 100 |
Sensitivity and PPV are reported as an average of 10 replicates. Sensitivity was measured as the fraction of true SNVs that were correctly called and PPV was measured as the fraction of SNV calls that were correct. In all cases, standard deviation was <2%. We present results for Breseq’s stand-alone variant caller (indicated with Breseq*) in this comparison as the Breseq pipeline unexpectedly performed poorly on this dataset.
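The two metrics defined above can be illustrated with a minimal Python sketch. It assumes each SNV is represented as a hashable key such as a (position, alternate base) tuple; the function name and the example data are hypothetical and are not taken from the study.

```python
def sensitivity_and_ppv(true_snvs, called_snvs):
    """Return (sensitivity, PPV) as fractions in [0, 1].

    sensitivity = correctly called true SNVs / all true SNVs
    PPV         = correct SNV calls / all SNV calls
    """
    true_snvs = set(true_snvs)
    called_snvs = set(called_snvs)
    true_positives = true_snvs & called_snvs

    # Guard against empty inputs, where the ratio is undefined.
    sensitivity = len(true_positives) / len(true_snvs) if true_snvs else float("nan")
    ppv = len(true_positives) / len(called_snvs) if called_snvs else float("nan")
    return sensitivity, ppv


# Hypothetical example: SNVs keyed by (position, alternate base).
truth = {(101, "A"), (250, "G"), (512, "T"), (900, "C")}
calls = {(101, "A"), (250, "G"), (700, "T")}

sens, ppv = sensitivity_and_ppv(truth, calls)
print(f"sensitivity = {sens:.2f}, PPV = {ppv:.2f}")
```

In this toy case two of the four true SNVs are recovered and two of the three calls are correct. The table reports the same quantities as percentages, averaged over 10 replicates.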
Figure 1. In silico and experimental validation. (a) Sensitivity as a function of SNV frequency for LoFreq, SNVer and Breseq on a simulated viral population (see ‘Materials and Methods’ section). (b) Venn diagram showing the overlap of SNV predictions on the simulated population. (c) Detection limits for LoFreq and SNVer as a function of sequencing quality and coverage. Note that SNVer results are unaffected by varying quality values. (d) Validation results for rare variants on a Fluidigm Digital Array. Standard deviations are shown as boxes with error bars. Note that three assays failed (reporting a nonsensical frequency of 50%) and are not shown here.
Reproducibility and robustness of variant callers
| Method | Reproducibility (%) | Robustness (%) | Average number of SNVs |
|---|---|---|---|
| Breseq | 90.6 | 90.6 | 40.3 |
| SNVer | 99.4 | 97.1 | 27.7 |
| LoFreq | 95.7 | 96.5 | 57.5 |
Results were computed from dengue virus sequencing data for six TSV01 DENV2 replicates (see ‘Materials and Methods’ section). Reproducibility was computed as the percentage of SNVs in the replicate datasets that were seen in another replicate and robustness was computed as the percentage of SNVs in the replicates that were seen in the pooled dataset (obtained by combining the replicates; reproducible SNVs were included in the pooled calls).
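One possible reading of these two definitions is sketched below in Python. It rests on several assumptions: each replicate's calls are given as a set of SNV keys, percentages are computed over all (replicate, SNV) pairs rather than averaged per replicate, and the pooled call set is supplied as-is (the sketch does not add reproducible SNVs to it). The function name and the example data are hypothetical.

```python
def reproducibility_and_robustness(replicate_calls, pooled_calls):
    """Return (reproducibility, robustness) as percentages.

    reproducibility: share of replicate SNVs also seen in another replicate
    robustness:      share of replicate SNVs also seen in the pooled calls
    """
    replicates = [set(calls) for calls in replicate_calls]
    pooled = set(pooled_calls)

    total = sum(len(calls) for calls in replicates)
    if total == 0:
        return float("nan"), float("nan")

    seen_elsewhere = 0
    seen_in_pool = 0
    for i, calls in enumerate(replicates):
        # Union of the calls made in every other replicate.
        others = set().union(*(r for j, r in enumerate(replicates) if j != i))
        seen_elsewhere += len(calls & others)
        seen_in_pool += len(calls & pooled)

    return 100 * seen_elsewhere / total, 100 * seen_in_pool / total


# Hypothetical example: SNVs identified by genome position only.
replicates = [{1, 2, 3}, {2, 3, 4}, {3, 5}]
pooled = {1, 2, 3, 4}

repro, robust = reproducibility_and_robustness(replicates, pooled)
print(f"reproducibility = {repro:.1f}%, robustness = {robust:.1f}%")
```

Here 5 of the 8 replicate calls recur in another replicate and 7 of the 8 appear in the pooled set. A per-replicate average would be an equally plausible reading of the caption and could give slightly different values.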
Figure 2. SNV calling in the presence of tumor sample heterogeneity. Germline and somatic variant frequencies for paired tumor-normal exome sequencing datasets from a custom samtools-based pipeline (32) are compared here with those from LoFreq (see ‘Materials and Methods’ section). As shown, while germline variants are consistently distributed around 50% (as expected for heterozygous variants), somatic variants are shifted to lower frequencies, likely due to contamination in the tumor sample from normal stromal tissue. Note that while samtools-based somatic calls appear ‘clipped’ at lower frequencies, LoFreq calls are symmetrically distributed as expected.
Figure 3. Mutational hotspots and cold-spots in the dengue virus genome. Circos plots (56) of mutational hotspots and cold-spots derived from clinical (a) DENV1 and (b) DENV2 samples. Outer ring: gene annotation; inner ring: average coverage (log10-scaled). The inner bars mark mutational hotspots (red) and cold-spots (blue), which were derived from intra-host variations called by LoFreq (see ‘Materials and Methods’ section). Height of hotspots indicates how often the hotspot was found (sqrt(count)), whereas the height of cold-spots is fixed. The cold-spot in prM is shared between both serotypes. The last hotspot window in NS1 for the DENV2 samples was only found in pre-dose samples (Table 3) and disappears at later time points.
Figure 4. Structural view of hot and cold-spots in the dengue virus genome. (a) Surface representation of dengue virus NS5 methyltransferase (PDB accession number 1R6A). The nucleoside-analog ribavirin 5′-triphosphate (RTP) is shown in blue and the by-product of S-adenosyl-l-methionine (SAM) after the transfer of a methyl group, S-adenosyl-l-homocysteine (SAH), is in red, both in ball-and-stick representation. Cold-spots are colored in violet. The first group of cold-spots consists of contiguous residues which completely enclose the binding site for SAM. SAM molecules serve as a methyl donor in the reaction catalyzed by the NS5 methyltransferase, which results in the capping of viral mRNAs. The second group of cold-spots corresponds to the carboxyl end of the NS5 methyltransferase, which acts as the linker region that connects the domain to the NS5 polymerase domain. (b) Surface representation of dengue virus NS5 RNA-dependent RNA polymerase (PDB accession number 2J7W). The GDD catalytic triad is colored in red whereas the cold-spots identified from SNV analysis are colored in violet. Cold-spots include the dengue virus NS5 RNA-dependent RNA polymerase GDD catalytic triad and also parts of the template tunnel through which the viral RNA substrate enters and exits during replication.