Ulrich Schlecht, Janine Mok, Carolina Dallett, Jan Berka.
Abstract
Single molecule sequencing (SMS) platforms enable base sequences to be read directly from individual strands of DNA in real time. Though capable of long read lengths, SMS platforms currently suffer from low throughput compared to competing short-read sequencing technologies. Here, we present a novel strategy for sequencing library preparation, dubbed ConcatSeq, which increases the throughput of SMS platforms by generating long concatenated templates from pools of short DNA molecules. We demonstrate the adaptation of this technique to two target enrichment workflows commonly used for oncology applications, and show its feasibility using PacBio single molecule real-time (SMRT) technology. Our approach is capable of increasing the sequencing throughput of the PacBio RSII platform by more than five-fold, while maintaining the ability to correctly call allele frequencies of known single nucleotide variants. ConcatSeq provides a versatile new sample preparation tool for long-read sequencing technologies.
Year: 2017 PMID: 28701704 PMCID: PMC5507877 DOI: 10.1038/s41598-017-05503-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Gibson Assembly concatenates short DNA amplicons into long concatemers. (a) Schematic of the ConcatSeq sample preparation workflow for PacBio sequencing. A target amplicon is generated by PCR using primers flanked with spacer sequences (gray portions). These products are then used for a second round of PCR in which, in two separate reactions, two different primer sets incorporate complementary Gibson Assembly-compatible adapters into the target molecules (blue and yellow portions). Portions depicted in blue are the reverse complement of those depicted in yellow. The two amplicon pools are then mixed at equimolar quantities and incubated with an enzyme master mix that carries out the Gibson Assembly to generate a pool of concatemers of various lengths and randomized composition. Concatenation is followed by library preparation, which attaches PacBio-specific hairpin adapters via A-tailed ligation, and by subsequent sequencing on a PacBio RSII instrument. (b) Bioanalyzer DNA7500 gel image showing ladder [L], the non-concatenated amplicon pool [N], and the concatenated sample [C]. The banding pattern in the concatenated sample indicates depletion of monomers and accumulation of n-mers of higher degree. The expected size of n-mers created this way corresponds to (n * x) − ((n−1) * y), where x is the size of a monomer and y is the size of the adapter. In the example shown here, the monomer is 247 bp, so the expected sizes for 2- to 7-mers are 464, 681, 898, 1115, 1332, and 1549 bp (these values correspond to y = 30 bp). (c) Histogram of the length of PacBio circular consensus sequences (CCS reads) of the concatenated sample. Peaks coincide with the expected sizes of n-mers. For clarity, the histogram was truncated at 3 kb.
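The n-mer size formula in panel (b) can be checked with a few lines of Python. This is a minimal sketch, not code from the study: the function and constant names are illustrative, and the adapter size of 30 bp is an assumption inferred from the sizes listed in the caption rather than a value the caption states.

```python
def expected_nmer_size(n: int, monomer_bp: int, adapter_bp: int) -> int:
    """Expected length of an n-mer: (n * x) - ((n - 1) * y).

    Each of the n - 1 junctions shares one adapter between two
    neighbouring monomers, so the adapter is counted once per junction.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return n * monomer_bp - (n - 1) * adapter_bp


MONOMER_BP = 247  # x, monomer size given in the caption
ADAPTER_BP = 30   # y, assumption: inferred from the listed n-mer sizes

sizes = [expected_nmer_size(n, MONOMER_BP, ADAPTER_BP) for n in range(2, 8)]
print(sizes)  # [464, 681, 898, 1115, 1332, 1549]
```

The printed list matches the 2- to 7-mer sizes given in the caption, which is what supports the 30 bp assumption.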
Figure 2. ConcatSeq can increase sequencing throughput by more than five-fold. (a) Schematic of the architecture of one example read (read number 5 in SRR5168830), depicting the types and orientations of the sequence features identified: fragments and adapters in forward and reverse complement orientation, and spacers. (b) Histogram depicting the frequency of fragments in each size bin (10 bp interval) after deconcatenation of sequencing reads using the adapter scanning approach. +: fragments with exactly the expected size (187 bp) or very close to it (180–190 bp); ++: fragments that are shorter than 10 bp; +++: fragments that are slightly larger than the expected size (>190 bp). (c) Barplot depicting the number of fragments [Frag] that aligned to the reference and of adapters [Adap] that were identified by the adapter scanning approach in forward [fw] and reverse complement [rc] orientation in all reads. (d) Scatterplot depicting the relationship between read length and the number of fragments identified in that read. Red dots indicate cases in which the read was significantly longer than the length expected from the number of fragments identified in it. (e) Histogram depicting the frequency of the number of fragments identified per read across all reads. For clarity, the histogram was truncated at 30 fragments per read.
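The caption does not describe how the adapter scanning approach is implemented, so the following is only a rough illustration of the idea: locate adapters in either orientation and treat the sequence between them as fragments. It assumes exact string matching, which real PacBio reads would rarely allow; an error-tolerant search would be needed in practice. The adapter sequence, the toy read, and all function names are hypothetical.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def deconcatenate(read: str, adapter: str, min_len: int = 2) -> list[str]:
    """Split a read at exact adapter matches in forward or rc orientation.

    Pieces shorter than min_len are dropped; the default of 2 mirrors the
    table footnote that excludes fragments only 1 bp long.
    """
    pattern = re.compile(
        f"{re.escape(adapter)}|{re.escape(reverse_complement(adapter))}"
    )
    return [piece for piece in pattern.split(read) if len(piece) >= min_len]


# Hypothetical adapter and toy read, for illustration only.
ADAPTER = "GATTACAGGC"
toy_read = "AAAACCCC" + ADAPTER + "TTTTGGGGAA" + reverse_complement(ADAPTER) + "A"

fragments = deconcatenate(toy_read, ADAPTER)
print(fragments)       # ['AAAACCCC', 'TTTTGGGGAA']
print(len(fragments))  # 2
```

Counting the fragments returned per read gives the per-read quantity plotted in panels (d) and (e).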
Overview of PacBio sequencing runs.
| Figure(s) | DNA source | # of reads | # of fragments (a) | degree of concatenation (b) | # of aligned fragments | on-target rate (c) | SRA Acc. (d) |
|---|---|---|---|---|---|---|---|
| | NRAS (exon 3) | 14,739 | 83,678 | 5.68 | 82,008 | 98.0% | SRR5168830 |
| | Cancer panel (NC) | 15,143 | 15,143 | 1 | 14,700 | 97.1% | SRR5168831 |
| | Cancer panel (C-1) | 18,561 | 98,250 | 5.29 | 94,892 | 96.6% | SRR5168832 |
| | Cancer panel (C-2) | 26,601 | 134,146 | 5.04 | 128,971 | 96.1% | SRR5168833 |
| | Cancer panel (C-3) | 20,686 | 108,078 | 5.22 | 104,562 | 96.7% | SRR5168834 |
| | EGFR locus | 52,341 | 231,801 | 4.43 | 224,595 | 96.9% | SRR5168838 |
| Supp. Figure | LMW DNA ladder | 48,183 | 181,901 | 3.78 | 148,300 | 81.5% | SRR5168841 |
‘# of’ stands for ‘number of’;
NC: non-concatenated pool; C-1,2,3: concatenated pool, replicates 1,2,3;
(a) This excludes all fragments that are only 1 bp long;
(b) this is the ratio of the # of fragments to the # of total reads;
(c) this is the ratio of the # of aligned fragments to the # of fragments;
(d) this is the accession number for sequence data deposited in the SRA (Sequence Read Archive).
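The two derived columns can be recomputed from the raw counts in the table. The sketch below is illustrative, with made-up function names; it takes the on-target rate to be aligned fragments divided by fragments, which is the reading that reproduces the tabulated percentages.

```python
# (reads, fragments, aligned fragments) per SRA accession, from the table.
RUNS = {
    "SRR5168830": (14_739, 83_678, 82_008),     # NRAS (exon 3)
    "SRR5168831": (15_143, 15_143, 14_700),     # Cancer panel (NC)
    "SRR5168832": (18_561, 98_250, 94_892),     # Cancer panel (C-1)
    "SRR5168833": (26_601, 134_146, 128_971),   # Cancer panel (C-2)
    "SRR5168834": (20_686, 108_078, 104_562),   # Cancer panel (C-3)
    "SRR5168838": (52_341, 231_801, 224_595),   # EGFR locus
    "SRR5168841": (48_183, 181_901, 148_300),   # LMW DNA ladder
}


def degree_of_concatenation(reads: int, fragments: int) -> float:
    """Average number of fragments recovered per read."""
    return fragments / reads


def on_target_rate(fragments: int, aligned: int) -> float:
    """Fraction of fragments that aligned to the reference."""
    return aligned / fragments


for accession, (reads, fragments, aligned) in RUNS.items():
    degree = degree_of_concatenation(reads, fragments)
    rate = on_target_rate(fragments, aligned)
    print(f"{accession}: degree={degree:.2f}, on-target={rate:.1%}")
```

For the NRAS run this gives a degree of concatenation of 5.68 and an on-target rate of 98.0%, in agreement with the table.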
Figure 3. ConcatSeq correctly identifies single-nucleotide variants (SNVs) in an oncology amplicon panel. (a) Schematic depicting the bioinformatics analysis pipeline describing all steps and tools used, starting with PacBio raw reads (subreads) to determining allele frequencies (AFs) of known SNVs in HD701. (b) Scatterplot showing comparison of AFs identified in replicates of concatenation samples plotted against the expected frequencies. Average values from three independent experiments are shown. Error bars indicate standard deviation from the three measurements. (c) Scatterplot showing a comparison of AFs identified in replicates of concatenation samples plotted against frequencies found in the non-concatenation control sample. Average values from three independent experiments are shown. Error bars indicate standard deviation from the three measurements. (d) Barplot comparing amplicon coverage in non-concatenated and three replicates of concatenation samples. Frequencies were calculated by dividing the number of fragments that aligned to each of the 20 amplicons by the total number of aligned fragments. Pearson’s r was calculated for every replicate independently, and the lowest of the three correlation coefficients is indicated in the plot.
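The coverage comparison in panel (d) amounts to normalizing per-amplicon counts and correlating each replicate with the control. The sketch below shows that calculation under stated assumptions: the counts are invented toy values for four amplicons (the study used 20), the function names are hypothetical, and Pearson's r is written out by hand rather than taken from the authors' tooling.

```python
from math import sqrt


def amplicon_frequencies(counts: list[int]) -> list[float]:
    """Divide each amplicon's aligned-fragment count by the total."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no aligned fragments")
    return [count / total for count in counts]


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy counts, NOT data from the study.
control_counts = [120, 80, 200, 100]
replicate_counts = [
    [600, 400, 1000, 500],
    [610, 390, 980, 520],
    [590, 420, 1010, 480],
]

control_freqs = amplicon_frequencies(control_counts)
r_values = [
    pearson_r(control_freqs, amplicon_frequencies(rep))
    for rep in replicate_counts
]
lowest_r = min(r_values)  # the caption reports the lowest of the three
print(f"lowest r = {lowest_r:.4f}")
```

Reporting the minimum rather than the mean is the conservative choice, since it bounds how far any single replicate departs from the control.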
Figure 4. Adaptation of ConcatSeq to an alternative target enrichment workflow. (a) Schematic of Roche Nimblegen’s SeqCap workflow and its adaptation to ConcatSeq. Short DNA fragments, for example cell-free DNA, are prepared for adapter ligation by end-repair and A-tailing (ERAT). During adapter ligation, two types of ConcatSeq adapters are used (blue and yellow portions) instead of the Y-shaped adapters normally used in this step. The resulting library pool is then hybridized to biotinylated (green dots) hybrid capture probes (gray bars), and enriched targets are subsequently amplified in a post-capture PCR. The amplified material is then concatenated and processed as described in Fig. 1a. (b) Bioanalyzer DNA7500 gel image showing ladder [L], the non-concatenated amplicon pool (a pool of 4 amplicons from the EGFR locus) [N], the non-concatenated amplicon pool with adapters [A], and the concatenated sample [C]. As expected, a shift of about 60 bp was observed between lanes [N] and [A], indicating successful ligation of ConcatSeq adapters to the original pool of amplicons. The banding pattern in lane [C] indicates depletion of monomers and accumulation of n-mers of higher degree. (c) Histogram depicting frequencies of fragment lengths after deconcatenation of EGFR-concatemer reads. The size of the vast majority of fragments coincides with the expected EGFR amplicon length (220 bp).