Felix Francis, Michael D Dumas, Scott B Davis, Randall J Wisser.
Abstract
BACKGROUND: Targeted resequencing with high-throughput sequencing (HTS) platforms can be used to efficiently interrogate the genomes of large numbers of individuals. A critical issue for research and applications using HTS data, especially from long-read platforms, is error in base calling arising from technological limits and bioinformatic algorithms. We found that the community standard long amplicon analysis (LAA) module from Pacific Biosciences is prone to substantial bioinformatic errors that raise concerns about findings based on this pipeline, prompting the need for a new method.
Keywords: Divide and conquer; Long-range PCR; PacBio amplicon analysis; Resequencing; Sequence error; Target enrichment
Year: 2018 PMID: 30126356 PMCID: PMC6102811 DOI: 10.1186/s12859-018-2293-0
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Graphical representation of the C3S-LAA process and pipeline. a Raw reads comprised of multiple subreads are depicted for three different amplicons [green, fuchsia and blue boxes; different shades of color portray variable subread sequence quality (darker shading portrays higher quality)]. Subreads are separated by a shared adapter sequence (grey boxes). The higher quality CCS read for each raw read is used to cluster the corresponding raw reads into CCS-based cluster groups. Error correction is performed per CCS-based cluster, producing high-quality consensus sequences, followed by assembly of any overlapping consensus sequences. b A single run parameters file is used by all components of the pipeline. The grey highlighted rectangles represent the two main steps of C3S-LAA. (i) Using the CCS reads generated by the SMRT analysis reads of insert protocol, C3S clusters the raw reads according to each barcode-primer pair combination, producing files of read identifiers to whitelist the corresponding raw reads. (ii) Raw read clusters are passed to Quiver to generate amplicon-specific consensus sequences, which are then passed to Minimus for sequence assembly. Rectangles with folded corners represent single files or multiple files (depicted as stacks of files) and those with rounded edges represent scripts and tools. Arrows indicate output files that are generated. Connecting lines with dots at one end depict input files, with the dot corresponding to the source data for the connected script or tool.
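The clustering step in panel b(i) can be illustrated with a minimal Python sketch. It is not the C3S implementation: it assumes exact string matching, symmetric barcodes on both ends of the amplicon, and toy sequences. The names `assign_cluster`, `build_whitelists`, `barcodes` and `primer_pairs` are hypothetical and chosen only for this example.

```python
from collections import defaultdict

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def assign_cluster(ccs_seq, barcodes, primer_pairs):
    """Return (barcode_id, amplicon_id) for a CCS read, or None.

    Assumption: a read belongs to a cluster when, in either orientation,
    it starts with barcode + forward primer and ends with the reverse
    complement of the reverse primer followed by that of the barcode.
    """
    seq = ccs_seq.upper()
    for bc_id, bc in barcodes.items():
        for amp_id, (fwd, rev) in primer_pairs.items():
            head = bc + fwd
            tail = revcomp(rev) + revcomp(bc)
            for oriented in (seq, revcomp(seq)):
                if oriented.startswith(head) and oriented.endswith(tail):
                    return (bc_id, amp_id)
    return None


def build_whitelists(ccs_reads, barcodes, primer_pairs):
    """Group read identifiers by barcode-primer pair combination."""
    whitelists = defaultdict(list)
    for read_id, seq in ccs_reads:
        key = assign_cluster(seq, barcodes, primer_pairs)
        if key is not None:
            whitelists[key].append(read_id)
    return dict(whitelists)


# Toy inputs (hypothetical sequences, not from the study).
barcodes = {"bc1": "AACG", "bc2": "TTGG"}
primer_pairs = {"ampA": ("GATTACA", "TTCAGG")}


def make_read(bc, fwd, rev, insert="AAAACCCC"):
    """Build a toy amplicon read flanked by primers and barcodes."""
    return bc + fwd + insert + revcomp(rev) + revcomp(bc)


ccs_reads = [
    ("r1", make_read("AACG", "GATTACA", "TTCAGG")),
    ("r2", revcomp(make_read("TTGG", "GATTACA", "TTCAGG"))),  # opposite strand
    ("r3", "GGGGGGGG"),  # matches no barcode-primer pair
]

whitelists = build_whitelists(ccs_reads, barcodes, primer_pairs)
for key, ids in sorted(whitelists.items()):
    print(key, ids)
```

Each resulting list of read identifiers plays the role of a whitelist file in the figure: only the raw reads named in it would be passed on to consensus calling for that cluster. A practical version would need error-tolerant matching rather than exact prefixes and suffixes.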
Comparison of LAA and C3S-LAA consensus sequences for B73 amplicons
| Library type (a) | Method | Number of consensus sequences | Complete match (b) (100% identity) | Truncated match (100% identity) | Partial match (<100% identity) |
|---|---|---|---|---|---|
| Single | LAA | 14 | 7 | 1 | 6 |
| Single | C3S-LAA | 9 | 9 | 0 | 0 |
| Multiplex | LAA | 8 | 4 | 1 | 3 |
| Multiplex | C3S-LAA | 6 | 5 | 0 | 1 |
(a) The single library had nine expected consensus sequences, whereas the multiplex library had six expected consensus sequences.
(b) For the multiplex sample library, the B73 v3 assembly contained a gap relative to one of the five amplicon sequences, leading to one C3S-LAA sequence having a partial match. This gap was filled in the latest B73 v4 release.
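The comparison in the table can be summarized with a short calculation. The sketch below uses only the counts shown above and the expected numbers of consensus sequences from the footnote; the function name `summarize` and the field names are hypothetical.

```python
# Rows: (library, method, total, complete, truncated, partial), from the table.
rows = [
    ("Single", "LAA", 14, 7, 1, 6),
    ("Single", "C3S-LAA", 9, 9, 0, 0),
    ("Multiplex", "LAA", 8, 4, 1, 3),
    ("Multiplex", "C3S-LAA", 6, 5, 0, 1),
]
# Expected number of consensus sequences per library, from the footnote.
expected = {"Single": 9, "Multiplex": 6}


def summarize(rows, expected):
    """Fraction of complete matches and excess sequences per row."""
    summary = {}
    for library, method, total, complete, truncated, partial in rows:
        if complete + truncated + partial != total:
            raise ValueError(f"counts do not add up for {library} {method}")
        summary[(library, method)] = {
            "complete_fraction": complete / total,
            "excess": total - expected[library],
        }
    return summary


summary = summarize(rows, expected)
for (library, method), stats in summary.items():
    print(f"{library:9s} {method:8s} "
          f"complete={stats['complete_fraction']:.1%} excess={stats['excess']}")
```

On these counts, LAA yields complete matches for half of its consensus sequences in both libraries and reports more sequences than expected, while C3S-LAA reports exactly the expected number in both.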
Fig. 2 Sequence accuracy as a function of subread depth. a Accuracy of consensus sequences and b accuracy of assembly sequences. Data from all amplicons were pooled to evaluate consensus calling accuracy as a function of the depth of coverage of SMRT raw reads. The vertical line shows the minimum read depth of the consensus sequences used for assemblies.
Fig. 3 Total number of accurate bootstrap assemblies per CCS sample size. At each level of the CCS read depth sample (1 to 40), the figure shows the total number of bootstrapped assemblies that were 100% identical to the reference sequence. This was determined for the four target regions (25 bootstrap assemblies at each of 4 loci, giving rise to a maximum of 100 on the x-axis) formed from the consensus sequences among the eight overlapping amplicons.
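The counting scheme behind this figure can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: reads are drawn with replacement, the read identifiers are synthetic, and `toy_is_accurate` is a hypothetical stand-in for the real step of building an assembly from the sampled reads and checking it for 100% identity to the reference.

```python
import random


def count_accurate(read_ids, depths, n_boot, is_accurate, rng):
    """Count accurate bootstrap assemblies at each sampling depth.

    is_accurate is a caller-supplied function that takes a list of sampled
    read identifiers and returns True if the resulting assembly matches
    the reference exactly.
    """
    counts = {}
    for depth in depths:
        hits = 0
        for _ in range(n_boot):
            # Assumption: sampling with replacement.
            sample = rng.choices(read_ids, k=depth)
            if is_accurate(sample):
                hits += 1
        counts[depth] = hits
    return counts


def toy_is_accurate(sample):
    """Placeholder criterion only: at least three distinct reads sampled."""
    return len(set(sample)) >= 3


rng = random.Random(0)
# Four hypothetical loci with 50 synthetic read identifiers each.
loci = {
    f"locus_{i}": [f"locus_{i}_read_{j}" for j in range(50)]
    for i in range(1, 5)
}
depths = range(1, 41)

# Sum the per-locus counts: 25 bootstraps x 4 loci gives at most 100 per depth.
totals = {depth: 0 for depth in depths}
for reads in loci.values():
    per_locus = count_accurate(reads, depths, 25, toy_is_accurate, rng)
    for depth, hits in per_locus.items():
        totals[depth] += hits

for depth in (1, 2, 5, 40):
    print(depth, totals[depth])
```

Replacing `toy_is_accurate` with a function that runs consensus calling and assembly on the sampled reads and compares the result to the reference would reproduce the structure of the analysis, with one total per depth bounded by 100.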
Fig. 4 Sequence alignment highlighting a recurring insertion error in some bootstrap samples. The alignment corresponds to the consensus sequence for a part of the amplicon from a locus_6_7045710_7052049 (Query) and b locus_1_25390617_25396540 (Query), on maize chromosomes 6 and 1, respectively, compared to the B73 v3 reference sequence (Sbjct).
The number of consensus sequences generated from the multiplex library, following barcode demultiplexing
| Sample (a) | Barcode ID | LAA consensus | C3S-LAA consensus |
|---|---|---|---|
| B73 | 32 | 8 | 6 |
| CML277 | 35 | 6 | 6 |
| Hp301 | 31 | 6 | 6 |
| Mo17 | 20 | 7 | 6 |
| P39 | 2 | 7 | 6 |
| Tx303 | 4 | 7 | 6 |
| N/A | 8 | 7 | 0 |
| N/A | 23 | 5 | 0 |
| N/A | 49 | 1 | 0 |
| N/A | 82 | 2 | 0 |
| N/A | 85 | 1 | 0 |
| N/A | 91 | 6 | 0 |
| N/A | 92 | 3 | 0 |
(a) No samples were associated with the N/A barcode.
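One way to read this table is to total the consensus sequences reported for barcodes that had no sample in the library. The sketch below does this from the counts shown above; the function name `spurious_total` is hypothetical.

```python
# Rows: (sample, barcode_id, laa_consensus, c3s_laa_consensus), from the table.
rows = [
    ("B73", 32, 8, 6),
    ("CML277", 35, 6, 6),
    ("Hp301", 31, 6, 6),
    ("Mo17", 20, 7, 6),
    ("P39", 2, 7, 6),
    ("Tx303", 4, 7, 6),
    ("N/A", 8, 7, 0),
    ("N/A", 23, 5, 0),
    ("N/A", 49, 1, 0),
    ("N/A", 82, 2, 0),
    ("N/A", 85, 1, 0),
    ("N/A", 91, 6, 0),
    ("N/A", 92, 3, 0),
]


def spurious_total(rows, column):
    """Sum consensus sequences reported for barcodes without a sample."""
    return sum(row[column] for row in rows if row[0] == "N/A")


laa_spurious = spurious_total(rows, 2)
c3s_spurious = spurious_total(rows, 3)
unassigned_barcodes = [row[1] for row in rows if row[0] == "N/A" and row[2] > 0]

print("LAA sequences on unassigned barcodes:", laa_spurious)
print("C3S-LAA sequences on unassigned barcodes:", c3s_spurious)
print("Unassigned barcodes with LAA output:", unassigned_barcodes)
```

On these counts, LAA reports 25 consensus sequences across seven barcodes that were not used in the library, whereas C3S-LAA reports none.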