Nicholas J Hathaway, Christian M Parobek, Jonathan J Juliano, Jeffrey A Bailey.
Abstract
PCR amplicon deep sequencing continues to transform the investigation of genetic diversity in viral, bacterial, and eukaryotic populations. In eukaryotic populations such as Plasmodium falciparum infections, it is important to discriminate sequences differing by a single nucleotide polymorphism. In bacterial populations, single-base resolution can provide improved resolution towards species and strains. Here, we introduce the SeekDeep suite built around the qluster algorithm, which is capable of accurately building de novo clusters representing true, biological local haplotypes differing by just a single base. It outperforms current software, particularly at low frequencies and at low input read depths, whether resolving single-base differences or traditional OTUs. SeekDeep is open source and works with all major sequencing technologies, making it broadly useful in a wide variety of applications of amplicon deep sequencing to extract accurate and maximal biologic information.
Year: 2018 PMID: 29202193 PMCID: PMC5829576 DOI: 10.1093/nar/gkx1201
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. SeekDeep Overview. The depicted SeekDeep pipeline was designed to handle diverse experimental and computational workflows. In general, input sequence data is organized as one or more groups of samples that can represent natural populations, different experimental conditions, or any other defined classification. The pipeline is modular, allowing for substitute or additional processing at any step as well as access to the underlying data. The goal of SeekDeep is to perform initial processing and clustering along with exploration of the results and quality control. Extraction is done by extractor, which demultiplexes on sample barcodes (depicted here as colored squares at the beginning of sequences) and/or multiple primers if either is still present in the input data. Next, sequences are clustered at the sample level by qluster, based on either presets for specific sequencing technologies or user-defined parameters, to provide the requisite level of resolution (see Supplementary Figure S2 for how these errors are characterized). Finally, the haplotypes generated by qluster are analyzed by processClusters, which takes replicate comparisons into account (if available) and then compares sample haplotypes to generate population-level haplotypes and statistics. Final results can be viewed with popClusteringViewer in an interactive HTML viewer. For more specific downstream analyses, data can be output in multiple formats.
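The extraction step described in the caption can be illustrated with a minimal Python sketch. This is not SeekDeep's extractor: the function name `demultiplex`, the exact-prefix matching rule, and the toy barcodes, primer, and reads are all assumptions made for illustration only. It shows only the general idea of assigning reads to samples by a leading barcode and trimming the barcode and primer.

```python
from collections import defaultdict


def demultiplex(reads, barcodes, primer):
    """Assign reads to samples by exact barcode prefix, then trim barcode + primer.

    Hypothetical helper for illustration; a real extractor would also
    tolerate sequencing errors in the barcode and primer.
    """
    by_sample = defaultdict(list)
    for read in reads:
        for barcode, sample in barcodes.items():
            prefix = barcode + primer
            if read.startswith(prefix):
                # Keep only the amplicon sequence after barcode and primer.
                by_sample[sample].append(read[len(prefix):])
                break  # a read belongs to at most one sample
    return dict(by_sample)


# Toy input: the last read carries an unknown barcode and is discarded.
reads = ["ACGTTTGACCAGT", "TGCATTGACCTTA", "ACGTTTGACCAGT", "GGGGTTGACCAGT"]
barcodes = {"ACGT": "sample1", "TGCA": "sample2"}
result = demultiplex(reads, barcodes, primer="TTGAC")
```

Here `result` groups the trimmed reads under `sample1` and `sample2`, which is the form of input a per-sample clustering step would then consume.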
In vitro control datasets
| Dataset amplicon | Technology | Read depth (a) | Sample number | Replicate (b) | Read length | Region length | Unique haplotype number | Range of haplotype base differences (% identity) (c) |
|---|---|---|---|---|---|---|---|---|
| PfTRAP | 454 | 812–987 | 1 | 2x | 345 | 345 | 5 | 1 (99.7%)–7 (97.9%) |
| PfAMA1 | Ion Torrent | 1323–1712 | 2 | 2x | 494 | 494 | 5 | 2 (99.1%)–12 (94.9%) |
| PfCSP | Ion Torrent | 1054–6403 | 4 | 2x | 319 | 319 | 4 | 2 (99.3%)–9 (97.2%) |
| Various (d) | Illumina MiSeq | 614–4497 | 28 | None | 2 × 250 | 330–403 | 2–4 | 1 (99.7%)–17 (95.3%) |
| Microbiome 16S-V1 | Illumina MiSeq | 584 575–899 804 | 1 | 3x | 2 × 250 | 280 | 47 | 1 (99.6%)–101 (63.9%) |
| EBV | Illumina MiSeq | 342–1350 | 6 | 2x | 2 × 250 | 372 | 2 | 20 (92.6%) |
| HIV | Illumina MiSeq | 10 000 | 20 | 2x | 2 × 250 | 206 | 5 | 2 (99%)–5 (97.5%) |

(a) Read depth equals the number of stitched read pairs, with the minimum and maximum observed depths given in the case of multiple samples and replicates.
(b) 2x = two independent PCRs; 3x = three independent PCRs; none = no replicate (single PCR).
(c) The number of differences is followed by the corresponding percent identity; a range is shown when there are more than two unique haplotypes.
(d) Summary of the 28 targets; see Supplementary Table S1 for details of each target.
Figure 2. Haplotype recovery of simulated minor haplotypes differing by a single base. (A) Recovery of the haplotype differing by a single base from a major haplotype in the mixture described by Supplementary Figure S4A and B. (B) Recovery of the two minor haplotypes that are one-off from each other in the mixture described by Supplementary Figure S4C and D. For both panels, the y-axis represents the percent of simulations in which the haplotype differing by a single base was detected and the x-axis represents the simulated expected abundance of the minor haplotype. Data is broken down by read depth (rows) and sequencing technology (columns), and bars are colored by program. Grey boxes at low abundances represent combinations where the depth is not sufficient for reads to be observed for the minor haplotypes. For each minor haplotype abundance, there are 20 simulations from which DADA2, MED and UNOISE haplotype recovery was calculated as the percent of simulations in which the minor haplotype was detected. To best emulate real-world situations in which a user would use SeekDeep to analyze replicates, we used paired simulations with the requirement that SeekDeep detect haplotypes in both simulations.
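The recovery metric in this caption can be sketched in a few lines of Python. The function names and the toy detection outcomes below are hypothetical, and the rule that a combination is unobservable when fewer than one minor-haplotype read is expected is an assumption used to illustrate the grey boxes, not a threshold stated in the source.

```python
def recovery_percent(detected):
    """Percent of simulations in which the minor haplotype was detected."""
    return 100.0 * sum(detected) / len(detected)


def paired_recovery_percent(detected_a, detected_b):
    """Percent of paired simulations with detection in BOTH members of the pair."""
    both = [a and b for a, b in zip(detected_a, detected_b)]
    return recovery_percent(both)


def expected_minor_reads(depth, abundance_percent):
    """Expected number of reads from a haplotype at the given percent abundance."""
    return depth * abundance_percent / 100.0


# Toy outcomes for 20 simulations (illustrative values, not data from the paper).
single = [True] * 15 + [False] * 5
single_recovery = recovery_percent(single)

# Paired simulations: detection counts only if both members detect the haplotype.
rep_a = [True] * 20
rep_b = [True] * 10 + [False] * 10
paired_recovery = paired_recovery_percent(rep_a, rep_b)

# Assumed observability rule: fewer than one expected read cannot be observed.
observable = expected_minor_reads(depth=100, abundance_percent=0.5) >= 1
```

The paired requirement is stricter than single-run detection, so a program evaluated this way can only score equal to or lower than it would on either member of the pair alone.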
Figure 3. In vitro Ion Torrent and 454 mixtures performance. (A) The mean haplotype recovery for in vitro pyrosequencing samples with bars showing standard error. (B) Predicted abundance (y-axis) estimated by the various programs is plotted against the expected abundance (x-axis). Deviation from the line of identity represents the error and is summarized by the correlation coefficient. (C) False haplotypes are shown on a jitterplot to demonstrate their relative abundances and numbers (see Supplementary Table S3 for exact counts). Results are shown per program and also by the effect of utilizing or not utilizing replicates (haplotypes are only accepted if they appear in both replicates).
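Two ideas from this caption, accepting a haplotype only if it appears in both replicates and summarizing predicted versus expected abundance with a correlation coefficient, can be sketched as follows. The function names, the choice to average abundances across replicates, the use of Pearson's coefficient, and the toy abundances are all assumptions for illustration; the source does not specify them.

```python
import math


def shared_haplotypes(rep1, rep2):
    """Keep haplotypes present in both replicates.

    Abundances are averaged across replicates (an assumption of this sketch).
    """
    return {h: (rep1[h] + rep2[h]) / 2 for h in rep1 if h in rep2}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Toy replicates: each contains one spurious haplotype absent from the other.
rep1 = {"hapA": 0.60, "hapB": 0.38, "err1": 0.02}
rep2 = {"hapA": 0.62, "hapB": 0.36, "err2": 0.02}
accepted = shared_haplotypes(rep1, rep2)

# Predicted abundances that scale exactly with expected ones correlate perfectly.
r = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

In this toy case the replicate filter drops `err1` and `err2` because each occurs in only one replicate, which is the mechanism by which replicates reduce false haplotypes.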
Figure 4. In vitro Illumina P. falciparum performance. (A) The mean haplotype recovery for P. falciparum in vitro Illumina datasets with bars showing standard error. (B) Predicted abundance (y-axis) estimated by the various programs plotted against the expected abundance (x-axis). Deviation from the line of identity represents the error and is summarized by the correlation coefficient. (C) False haplotypes are shown on a jitterplot to demonstrate their relative abundances and numbers (see Supplementary Table S4 for exact counts). No replicates were available for this dataset.