James A Stapleton, Jeongwoon Kim, John P Hamilton, Ming Wu, Luiz C Irber, Rohan Maddamsetti, Bryan Briney, Linsey Newton, Dennis R Burton, C Titus Brown, Christina Chan, C Robin Buell, Timothy A Whitehead.
Abstract
Next-generation DNA sequencing has revolutionized the study of biology. However, the short read lengths of the dominant instruments complicate assembly of complex genomes and haplotype phasing of mixtures of similar sequences. Here we demonstrate a method to reconstruct the sequences of individual nucleic acid molecules up to 11.6 kilobases in length from short (150-bp) reads. We show that our method can construct 99.97%-accurate synthetic reads from bacterial, plant, and animal genomic samples, full-length mRNA sequences from human cancer cell lines, and individual HIV env gene variants from a mixture. The preparation of multiple samples can be multiplexed into a single tube, further reducing effort and cost relative to competing approaches. Our approach generates sequencing libraries in three days from less than one microgram of DNA in a single-tube format without custom equipment or specialized expertise.
Year: 2016 PMID: 26789840 PMCID: PMC4720449 DOI: 10.1371/journal.pone.0147229
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. A method for assembling synthetic long reads.
(a) Schematic of the approach. A supplemental barcode-pairing protocol (grey box) resolves the two distinct barcodes affixed to each original target molecule. (b) Reads associated with two distinct barcodes are shown aligned to the E. coli MG1655 reference genome. Barcode pairing merges the groups (bottom), increasing and evening the coverage and allowing assembly of the full 10-kb target sequence. (c) Length histogram of synthetic reads assembled from E. coli MG1655 genomic reads (minimum length 1 kb). The N50 length of the synthetic reads is 6.0 kb, and the longest synthetic read is 11.6 kb. (d) Mismatch rates of synthetic reads from the E. coli MG1655 dataset as a function of relative position along the synthetic read. (e) Length histogram of synthetic long reads assembled from Gelsemium sempervirens genomic reads (minimum length 1.5 kb). The N50 length of the synthetic reads is 4.3 kb. (f) An additional multiplexing index region (grey square) allows adapter-ligated samples to be mixed and processed in a single tube. Genomic DNA from twenty-four experimentally evolved strains of E. coli was separately ligated to adapters and amplified, then mixed into a single tube for the remaining steps of the protocol. E. coli genome coverage and N50 length are plotted for synthetic reads from each strain. Circle size indicates the number of short reads demultiplexed to a given strain.
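The group merging described in panel (b) can be pictured as a union of read groups whose barcodes are known to mark the same original molecule. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `merge_barcode_groups`, the `(barcode, sequence)` input format, and the toy barcodes are all hypothetical, and the barcode pairs are assumed to have been resolved already.

```python
from collections import defaultdict


def merge_barcode_groups(reads, barcode_pairs):
    """Group reads by barcode, merging groups linked by a barcode pair.

    reads: iterable of (barcode, sequence) tuples (hypothetical format).
    barcode_pairs: iterable of (barcode_a, barcode_b) tuples, assumed to
        come from a separate barcode-pairing step.
    Returns a dict mapping one representative barcode to its reads.
    """
    parent = {}

    def find(barcode):
        # Union-find lookup with path halving.
        parent.setdefault(barcode, barcode)
        while parent[barcode] != barcode:
            parent[barcode] = parent[parent[barcode]]
            barcode = parent[barcode]
        return barcode

    for a, b in barcode_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the lexicographically smaller barcode as representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = defaultdict(list)
    for barcode, sequence in reads:
        groups[find(barcode)].append(sequence)
    return dict(groups)
```

Each merged group would then be assembled on its own; pooling the reads from both barcodes is what raises and evens out the coverage across the target molecule.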
Genome assembly statistics for G. sempervirens.
| | Shotgun contigs | Synthetic read scaffolds |
|---|---|---|
| Contig/scaffold N50 size (bp) | 19,656 | 29,078 |
| Total assembly size (bp) | 215,038,998 | 218,719,799 |
| No. of contigs/scaffolds | 25,276 | 18,106 |
| Maximum length (bp) | 197,779 | 365,589 |
Small contigs (< 1 kb) were filtered out.
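For readers unfamiliar with the N50 statistic reported in the table and in Fig 1, it is the length L such that sequences of length L or longer account for at least half of the total assembled bases. A minimal sketch of that standard definition follows; the function name `n50` and the `min_length` parameter (mirroring the 1-kb filter noted above) are illustrative assumptions, not the authors' code.

```python
def n50(lengths, min_length=0):
    """Return the N50 of a collection of sequence lengths.

    Lengths below min_length are discarded first, mirroring a
    small-contig filter. Returns 0 if nothing remains.
    """
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    if not kept:
        return 0
    half_total = sum(kept) / 2
    running = 0
    for length in kept:
        running += length
        if running >= half_total:
            # First length at which the cumulative sum reaches half the total.
            return length
```

For example, `n50([2, 3, 4, 5, 6])` walks down from the longest sequence (6, then 5) and stops once the running sum reaches half of the 20 total bases.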
Fig 2. Individual assembly of full-length mRNA sequences.
(a) Length distribution of synthetic long reads (minimum length 500 bp) from HCT116 mRNA. (b) Length distribution of synthetic long reads (minimum length 500 bp) from HepG2 mRNA. (c) Box plots showing the number of splice junctions spanned by short reads and synthetic long reads. The axis is broken between 5–10 junctions spanned and the scale changed; a version with a standard axis is presented as S12 Fig. Inset: 97% of the junctions identified in the synthetic reads are known, providing validation for the method.
Fig 3. Individual assembly of full-length env genes from a mixture of two variants.
(a) The length distribution of the synthetic long reads (minimum length 1 kb) shows assembly of full-length 3-kb env gene sequences. (b) 1,173 synthetic reads between 1.5 and 3.2 kb in length were aligned to each of the two original env sequences (env1 and env2). The alignment match rates are shown as a heatmap, with each synthetic read represented by a thin horizontal line. The majority of the synthetic reads align with low error to exactly one of the two original sequences, indicating high accuracy and a low rate of chimera formation. Chimeric reads would be expected to match both original sequences at intermediate accuracies. (c) Scatter plot showing the mismatch rates of each synthetic read against the two known env sequences. Synthetic reads (open circles to emphasize extensive overlap) cluster into two distinct groups along the axes (near-zero mismatch rate). Even the sixteen reads that do not fall on the clusters are distant from three manually created mock chimeras (crosses), indicating a low frequency of chimera formation.
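The reasoning behind panels (b) and (c) is that a faithful synthetic read has a near-zero mismatch rate against exactly one variant, while a chimera has intermediate mismatch rates against both. The sketch below illustrates that logic only. It assumes the read has already been aligned to each reference so the sequences can be compared position by position, and the names `mismatch_rate` and `classify_read` and the `low` and `high` thresholds are hypothetical values chosen for illustration, not taken from the paper.

```python
def mismatch_rate(read, reference):
    """Fraction of mismatched positions between two pre-aligned,
    equal-length sequences (a simplification of a real alignment)."""
    if not read or len(read) != len(reference):
        raise ValueError("sequences must be non-empty and of equal length")
    mismatches = sum(1 for a, b in zip(read, reference) if a != b)
    return mismatches / len(read)


def classify_read(rate_env1, rate_env2, low=0.01, high=0.05):
    """Label a read from its mismatch rates against env1 and env2.

    low and high are hypothetical cutoffs, not values from the paper.
    """
    if rate_env1 <= low and rate_env2 >= high:
        return "env1"
    if rate_env2 <= low and rate_env1 >= high:
        return "env2"
    if rate_env1 > low and rate_env2 > low:
        # Matches neither variant cleanly: consistent with a chimera.
        return "candidate_chimera"
    return "ambiguous"
```

A mock chimera built from the first half of one variant and the second half of the other lands away from both axes of the scatter plot, which is the region this sketch labels `candidate_chimera`.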
Fig 4. Simulated haplotype phasing by correlation of unique sequences within barcode-defined groups.
Short unique sequences were identified at each end of the two variants (Env1_1 and Env1_2 from variant 1, Env2_1 and Env2_2 from variant 2). Each barcode-defined group of short reads was searched for the four sequences. A high number of counts of occurrences of a unique sequence from near the 5’ end of one env variant (Env1_1, Env2_1) in a barcode-defined group of short reads is a strong predictor of a high number of occurrences of a second unique sequence from the 3’ end of the same variant (Env1_2, Env2_2) in the same group, and also a strong predictor of a low number of occurrences of the unique sequence from the 3’ end of the other variant. Therefore, the haplotype across these two loci in a given barcoded individual can be phased regardless of the length or identity of the intervening sequence.
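The counting step described in this caption can be sketched as follows. This is an illustration under stated assumptions rather than the authors' analysis: the marker sequences passed in are placeholders, the functions `count_markers` and `phase_group` and the `min_count` cutoff are hypothetical, and only exact forward-strand matches are counted (reverse complements and sequencing errors are ignored).

```python
def count_markers(group_reads, markers):
    """Count exact occurrences of each marker sequence across the short
    reads of one barcode-defined group.

    group_reads: list of read sequences sharing a barcode.
    markers: dict mapping a marker name (e.g. "Env1_1") to its sequence.
    """
    return {
        name: sum(read.count(seq) for read in group_reads)
        for name, seq in markers.items()
    }


def phase_group(counts, min_count=2):
    """Assign a barcode group to a variant when both of that variant's
    end markers are seen and the other variant's are not.

    min_count is a hypothetical cutoff chosen for illustration.
    """
    has_v1 = counts["Env1_1"] >= min_count and counts["Env1_2"] >= min_count
    has_v2 = counts["Env2_1"] >= min_count and counts["Env2_2"] >= min_count
    if has_v1 and not has_v2:
        return "variant1"
    if has_v2 and not has_v1:
        return "variant2"
    return "unresolved"
```

Because the decision uses only the counts at the two marker loci, it does not depend on what lies between them, which is the point made in the final sentence of the caption.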