Lewis Z Hong, Shuzhen Hong, Han Teng Wong, Pauline P K Aw, Yan Cheng, Andreas Wilm, Paola F de Sessions, Seng Gee Lim, Niranjan Nagarajan, Martin L Hibberd, Stephen R Quake, William F Burkholder.
Abstract
We present Barcode-directed Assembly for Extra-long Sequences (BAsE-Seq), a method for obtaining long haplotypes, over 3 kb in length, using a short-read sequencer. BAsE-Seq relies on transposing a template-specific barcode onto random segments of the template molecule and assembling the barcoded short reads into complete haplotypes. We applied BAsE-Seq to mixed clones of hepatitis B virus and accurately identified haplotypes occurring at frequencies greater than or equal to 0.4%, with >99.9% specificity. Applying BAsE-Seq to a clinical sample, we obtained over 9,000 viral haplotypes, which provided an unprecedented view of hepatitis B virus population structure during chronic infection. BAsE-Seq is readily applicable for monitoring quasispecies evolution in viral diseases.
Year: 2014 PMID: 25406369 PMCID: PMC4269956 DOI: 10.1186/PREACCEPT-6768001251451949
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. Outline of BAsE-Seq methodology. (a) The goal of library preparation is to attach unique barcodes to full-length HBV genomes, and then juxtapose the assigned barcode to random overlapping fragments of the viral genome. A unique barcode is first assigned to each HBV genome using PCR. The two barcode assignment primers contain HBV-specific sequences on their 3′ ends, universal sequences (green) on their 5′ ends, and one of the primers also contains a random barcode (blue). Subsequently, barcode-tagged genomes are clonally amplified by PCR using primers that anneal to Uni-A and Uni-B and that add a biotin label (Bio) to the barcode-proximal end. The barcode-distal end is digested with exonuclease to obtain a broad size distribution of nested deletion fragments. Barcode-containing fragments are purified using Dynabeads, and intramolecular ligation of these fragments yields a library of circular molecules in which different regions of each HBV genome are juxtaposed to its assigned barcode. The circularized molecules are used as a template for random fragmentation and adapter tagging following the Nextera protocol. During PCR enrichment, a set of primers is used to incorporate Illumina-specific paired-end adapters and enrich for barcode-tagged molecules during sequencing. (b) Bioinformatics workflow. Barcode-containing read pairs are used to obtain a 'bulk consensus' genome by iterative alignment of read pairs against a GenBank sequence. Aligned read pairs are de-multiplexed into individual genomes based on barcode identity. Consensus base calls are extracted to obtain 'individual consensus' genomes and SNVs are identified in each genome to construct haplotypes.
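The bioinformatics workflow in panel (b) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the read representation `(barcode, start, sequence)`, the function names, the ungapped-alignment assumption, the `?` symbol for uncovered positions, and the assumption that duplicate reads have already been removed are all hypothetical choices made for illustration. The period notation for consensus bases follows the haplotype tables in this record.

```python
from collections import Counter, defaultdict

MIN_DEPTH = 4  # assumed minimum number of reads needed to call a base


def demultiplex(reads):
    """Group aligned reads by barcode.

    Each read is a hypothetical (barcode, start, sequence) tuple, where
    start is the 0-based alignment position on the bulk consensus.
    """
    by_barcode = defaultdict(list)
    for barcode, start, seq in reads:
        by_barcode[barcode].append((start, seq))
    return by_barcode


def individual_consensus(aligned, genome_length, min_depth=MIN_DEPTH):
    """Majority base at each position; None where depth is below min_depth."""
    pileup = [Counter() for _ in range(genome_length)]
    for start, seq in aligned:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < genome_length:
                pileup[pos][base] += 1
    consensus = []
    for counts in pileup:
        depth = sum(counts.values())
        if depth >= min_depth:
            consensus.append(counts.most_common(1)[0][0])
        else:
            consensus.append(None)
    return consensus


def haplotype(consensus, bulk, snv_positions):
    """Encode a genome at the given SNV positions.

    '.' marks the bulk consensus base, a letter marks a variant base,
    and '?' (an assumption of this sketch) marks an uncovered position.
    """
    symbols = []
    for pos in snv_positions:
        base = consensus[pos]
        if base is None:
            symbols.append("?")
        elif base == bulk[pos]:
            symbols.append(".")
        else:
            symbols.append(base)
    return "".join(symbols)
```

Calling `demultiplex` first and then `individual_consensus` once per barcode keeps each template molecule separate, which is what allows variants on the same genome to be linked into one haplotype.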
Summary statistics from BAsE-Seq and Deep-Seq of hepatitis B virus
| | BAsE-Seq | | | | | | Deep-Seq |
|---|---|---|---|---|---|---|---|
| Library | Lib_1:9 | | Lib_1:99 | | S7.1 | | S7.1 |
| Read pairs | 14,352,128 | | 17,083,497 | | 42,997,995 | | 8,197,770 |
| Type of sample | Mixed clone | | Mixed clone | | Internal standard | Patient | Patient |
| Pass-filter read pairsᵃ | 6,751,411 (47%) | | 8,816,934 (52%) | | 545,960 (1%) | 26,066,408 (61%) | 6,351,796 (77%) |
| Concordantly alignedᵇ | 6,027,421 (89%) | | 8,150,721 (92%) | | 496,356 (91%) | 23,366,358 (90%) | 4,261,572 (67%) |
| High quality genomes | 2,390ᶜ | | 3,673ᶜ | | 345ᵈ | 12,444ᵈ | |
| Type of analysis | Bulk | Individual | Bulk | Individual | Individual | Individual | |
| Median per-base coverage depth | 333,677 | 86 | 470,036 | 63 | 38 | 45 | 131,492 |
| True SNVs detected | 17/17 | 17/17 | 15/17 | 17/17 | | 68 | |
| SNVs detected | | | | | | | 308 |
| Errors detected | 522 | 218 | 328 | 257 | 11 | | |
| Highest per-base error | 1.91% | 0.202% | 2.14% | 0.231% | 0.69% | | |
| Overall error | 0.0524% | 0.00674% | 0.0324% | 0.00541% | 0.00214% | | |
Each BAsE-Seq library was analyzed using the 'bulk' approach, i.e., without de-multiplexing by barcode identity, or the 'individual' approach, i.e., sequence reads associated with unique barcodes were analyzed separately. True SNVs in S7.1 were identified by BAsE-Seq as those that occurred at >0.69% frequency.
ᵃ Read pairs after removal of adaptor and/or universal sequences. For BAsE-Seq libraries, this includes only read pairs that carry a barcode.
ᵇ Both reads in a pair were aligned in the expected orientation.
ᶜ ≥4 unique reads per base position across ≥85% of the genome.
ᵈ ≥4 unique reads per base position across ≥50% of the genome.
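The genome quality criteria in the footnotes and the frequency cutoff for true SNVs in S7.1 are both simple threshold rules, sketched below in Python. The function names and input shapes are assumptions made for illustration; the thresholds (at least 4 unique reads per base across at least 85% or 50% of the genome, and an SNV frequency above 0.69%) are the ones stated in this record.

```python
def is_high_quality(depths, min_reads=4, min_fraction=0.85):
    """Return True if enough of the genome is covered by enough reads.

    depths: per-base counts of unique reads for one barcoded genome.
    min_fraction: 0.85 for the mixed-clone libraries, 0.50 for S7.1.
    """
    covered = sum(1 for depth in depths if depth >= min_reads)
    return covered / len(depths) >= min_fraction


def error_free_snvs(snv_frequencies, cutoff):
    """Keep SNVs whose frequency is strictly above the cutoff.

    snv_frequencies: hypothetical mapping of position -> frequency.
    cutoff: for S7.1, 0.0069 (0.69%), which matches the highest
    per-base error reported in the table above.
    """
    return {pos: freq for pos, freq in snv_frequencies.items() if freq > cutoff}
```

For example, `error_free_snvs({100: 0.0069, 200: 0.012}, cutoff=0.0069)` keeps only position 200, because a frequency equal to the cutoff cannot be distinguished from the worst observed error.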
Figure 2. SNVs in BAsE-Seq and Deep-Seq libraries. (a-d) SNVs in BAsE-Seq libraries Lib_1:9 and Lib_1:99 were identified as true SNVs (red diamonds) or errors (blue dots) using the 'bulk' approach (a,c) or the 'individual' approach (b,d). The frequency of each SNV (y-axis) is plotted against base position in the consensus sequence (x-axis). Additional information is also provided in Tables 1 and 3. (e,f) SNVs from S7.1 were identified using Deep-Seq and BAsE-Seq. The BAsE-Seq library contained an internal standard that was used to calculate the error-free frequency cutoff for the library; hence, only error-free SNVs are shown in the BAsE-Seq analysis of S7.1. (g) The frequency of SNVs detected in the BAsE-Seq library (y-axis) is plotted against the frequency of SNVs detected in the Deep-Seq library (x-axis). All 68 error-free SNVs identified by BAsE-Seq were also identified by Deep-Seq (Pearson correlation coefficient = 0.94).
Comparison of haplotypes observed over a 367 bp region in S7.1
| | SNV 1 | SNV 2 | SNV 3 | SNV 4 | SNV 5 | Clones (Sanger) | BAsE-Seq |
|---|---|---|---|---|---|---|---|
| Haplotype | . | A | . | . | . | 10 (50%) | 1,588 (62%) |
| | . | . | T | C | . | 2 (10%) | 428 (17%) |
| | G | . | T | C | . | 5 (25%) | 403 (16%) |
| | . | . | T | C | A | 0 | 65 (3%) |
| | G | . | T | C | A | 0 | 27 (1%) |
| | . | . | . | . | . | 0 | 24 (1%) |
| | G | A | T | C | . | 0 | 14 (0.5%) |
| | . | A | . | . | A | 1 (5%) | 2 (0.08%) |
| | . | . | T | . | . | 0 | 2 (0.08%) |
| | G | . | . | C | . | 0 | 2 (0.08%) |
| | . | A | T | C | . | 2 (10%) | 0 |
| Unique haplotypes | | | | | | 5 | 10 |
| Total | | | | | | 20 | 2,555 |
ᵃ SNVs identified by all three methods: Sanger sequencing of clones, Deep-Seq and BAsE-Seq.
ᵇ Base call in the bulk consensus genome.
ᶜ A period on the haplotype indicates that the position carries the consensus base. SNVs are represented by the identity of the variant base.
Haplotypes identified by BAsE-Seq in Lib_1:9 and Lib_1:99
| Haplotypeᵃ | Lib_1:9 genomes | Lib_1:9 frequencyᵇ | Lib_1:9 bulk allele frequencyᶜ | Lib_1:99 genomes | Lib_1:99 frequencyᵇ | Lib_1:99 bulk allele frequencyᶜ |
|---|---|---|---|---|---|---|
| ........................................................ | 2,118 | 88.6% | 85.7% | 3,656 | 99.5% | 99.3% |
| ACACTAAATTTAAACAG | 270 | 11.3% | 14.3% | 14 | 0.4% | 0.7% |
| ....C................................................ | 1 | 0.04% | 0% | 1 | 0.03% | 0% |
| ACACTAAATTTAAA...AG | 1 | 0.04% | 0% | 0 | 0% | 0% |
| ....................A................................ | 0 | 0% | 0% | 2 | 0.05% | 0% |
| Total | 2,390 | 100% | 100% | 3,673 | 100% | 100% |
ᵃ Haplotypes observed in individual genomes. The major allele (Clone-2) is represented by a period and the minor allele (Clone-1) is represented by the base at that position (Table S2 in Additional file 2).
ᵇ Haplotype frequencies observed in individual genomes.
ᶜ Average allele frequency of SNPs from 'bulk' analysis.
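The haplotype frequencies in the table above are counts of individual genomes divided by the total number of high quality genomes. A minimal Python sketch of that tally follows; the function name and the short haplotype labels (`"major"`, `"minor"`, and so on) are hypothetical stand-ins for the full haplotype strings.

```python
from collections import Counter


def haplotype_frequencies(haplotypes):
    """Return {haplotype: (count, frequency)} for a list of per-genome haplotypes."""
    counts = Counter(haplotypes)
    total = sum(counts.values())
    return {hap: (n, n / total) for hap, n in counts.items()}


# Lib_1:9 genome counts from the table, with placeholder haplotype labels.
lib_1_9 = ["major"] * 2118 + ["minor"] * 270 + ["variant_1"] + ["variant_2"]
frequencies = haplotype_frequencies(lib_1_9)
minor_count, minor_freq = frequencies["minor"]
print(f"{minor_count} genomes, {minor_freq:.1%}")
```

With the Lib_1:9 counts, the minor haplotype accounts for 270 of 2,390 genomes, which rounds to the 11.3% reported in the table.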
Figure 3. Phylogenetic analysis of intra-host viral quasispecies. A phylogenetic analysis of HBV haplotypes identified by BAsE-Seq identified six distinct clades (numbered 1 to 6) in S7.1. The black scale bar represents the expected number of substitutions per site and the blue scale bar represents the frequency at which a particular haplotype was identified in the sample. Amino acid changes that are found in ≥70% of clade members are listed within each clade. Amino acid changes that are unique to each clade are listed with an asterisk. Five out of six clades contain at least one amino acid change (red) that is likely to confer the ability to escape immune detection.