Abstract
Single-molecule, real-time (SMRT) sequencing developed by Pacific Biosciences (PacBio) offers longer read lengths than second-generation sequencing (SGS) technologies, making it well suited to unsolved problems in genome, transcriptome, and epigenetics research. Highly contiguous de novo assemblies built from PacBio reads can close gaps in current reference assemblies and characterize structural variation (SV) in personal genomes. Longer reads can span extended repetitive regions and detect mutations, many of which are associated with diseases. Moreover, because it can sequence full-length transcripts or long fragments, PacBio transcriptome sequencing is advantageous for identifying gene isoforms and facilitates reliable discovery of novel genes and of novel isoforms of annotated genes. Additionally, PacBio's sequencing technique provides information that is useful for the direct detection of base modifications, such as methylation. Beyond using PacBio sequencing alone, many hybrid sequencing strategies have been developed that combine more accurate short reads with PacBio long reads. In general, hybrid strategies are more affordable and scalable than PacBio sequencing alone, especially for small laboratories. The advent of PacBio sequencing has made available much information that could not be obtained via SGS alone.
Keywords: De novo assembly; Gene isoform detection; Hybrid sequencing; Methylation; Third-generation sequencing
Year: 2015 PMID: 26542840 PMCID: PMC4678779 DOI: 10.1016/j.gpb.2015.08.002
Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691
Figure 1. SMRTbell template
Hairpin adaptors (green) are ligated to both ends of a double-stranded DNA molecule (yellow and purple), forming a closed circle. The polymerase (gray) is anchored to the bottom of a ZMW and incorporates bases into the read strand (orange). The image is adapted from [2] with permission from the Oxford University Press.
Figure 2. A single SMRT cell
Each SMRT cell contains 150,000 ZMWs. Approximately 35,000–75,000 of these wells produce a read in a run lasting 0.5–4 h, resulting in 0.5–1 Gb of sequence. The image is adapted with permission from Pacific Biosciences [3]. ZMW, zero-mode waveguide.
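The figures in this caption can be cross-checked with a little arithmetic. The sketch below is only illustrative: it computes the fraction of ZMWs that produce a read and the mean read length implied by the stated yield. Pairing the low yield with the low read count (and high with high) is an assumption made here, not something the caption states, and the function names are hypothetical.

```python
# Figures quoted in the Figure 2 caption.
TOTAL_ZMWS = 150_000
PRODUCTIVE_ZMWS = (35_000, 75_000)  # wells that yield a read (low, high)
YIELD_BASES = (0.5e9, 1.0e9)        # sequence per SMRT cell in bases (low, high)


def productive_fraction(productive: int, total: int = TOTAL_ZMWS) -> float:
    """Fraction of the ZMWs on a cell that produce a read."""
    return productive / total


def implied_mean_read_length(yield_bases: float, reads: int) -> float:
    """Mean read length implied by spreading a total yield over `reads` reads."""
    return yield_bases / reads


if __name__ == "__main__":
    # Assumption: low yield goes with low read count, high with high.
    for reads, bases in zip(PRODUCTIVE_ZMWS, YIELD_BASES):
        frac = productive_fraction(reads)
        mean_len = implied_mean_read_length(bases, reads)
        print(f"{reads} reads: {frac:.0%} of ZMWs, ~{mean_len:,.0f} bp per read")
```

Under that pairing, roughly a quarter to a half of the wells are productive, and the implied mean read length falls in the 10–15 kb range reported for the P6-C4 chemistry.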
Figure 3. Sequencing via light pulses
A. A SMRTbell (gray) diffuses into a ZMW, and the adaptor binds to a polymerase immobilized at the bottom. B. Each of the four nucleotides is labeled with a different fluorescent dye (red, yellow, green, and blue for G, C, T, and A, respectively) so that they have distinct emission spectra. As a nucleotide is held in the detection volume by the polymerase, a light pulse is produced that identifies the base. (1) A fluorescently labeled nucleotide associates with the template in the active site of the polymerase. (2) The fluorescence output of the color corresponding to the incorporated base (yellow for base C in this example) is elevated. (3) The dye-linker-pyrophosphate product is cleaved from the nucleotide and diffuses out of the ZMW, ending the fluorescence pulse. (4) The polymerase translocates to the next position. (5) The next nucleotide associates with the template in the active site of the polymerase, initiating the next fluorescence pulse, which corresponds to base A here. The figure is adapted from [4] with permission from The American Association for the Advancement of Science.
Figure 4. PacBio RS II read length distribution using P6-C4 chemistry
Data are based on a 20 kb size-selected E. coli library using a 4-h movie. Each SMRT cell produces 0.5–1 billion bases. The P6-C4 chemistry is currently the most advanced sequencing chemistry offered by PacBio. The figure is adapted with permission from Pacific Biosciences [8].
Performance comparison of sequencing platforms of various generations
| Sequencing platform | Generation | Read length (bp) | Single-pass error rate (%) | No. of reads per run | Time per run | Cost per million bases (USD) |
| --- | --- | --- | --- | --- | --- | --- |
| Sanger ABI 3730xl | 1st | 600–1000 | 0.001 | 96 | 0.5–3 h | 500 |
| Ion Torrent | 2nd | 200 | 1 | 8.2 × 10^7 | 2–4 h | 0.1 |
| 454 (Roche) GS FLX+ | 2nd | 700 | 1 | 1 × 10^6 | 23 h | 8.57 |
| Illumina HiSeq 2500 (High Output) | 2nd | 2 × 125 | 0.1 | 8 × 10^9 (paired) | 7–60 h | 0.03 |
| Illumina HiSeq 2500 (Rapid Run) | 2nd | 2 × 250 | 0.1 | 1.2 × 10^9 (paired) | 1–6 days | 0.04 |
| SOLiD 5500xl | 2nd | 2 × 60 | 5 | 8 × 10^8 | 6 days | 0.11 |
| PacBio RS II: P6-C4 | 3rd | 1.0–1.5 × 10^4 on average | 13 | 3.5–7.5 × 10^4 | 0.5–4 h | 0.40–0.80 |
| Oxford Nanopore MinION | 3rd | 2–5 × 10^3 on average | 38 | 1.1–4.7 × 10^4 | 50 h | 6.44–17.90 |
De novo genome assemblies using hybrid sequencing or PacBio sequencing alone
| Sequencing method | Assembly tools | No. of SMRT cells | Coverage | No. of contigs | Achievements |
| --- | --- | --- | --- | --- | --- |
| PacBio | HGAP | 2 | 179× | 1 | 21 fewer contigs than using SGS; no collapsed repeat regions (≥4 using SGS) |
| PacBio | HGAP, Celera, minimus2, SeqMan | 26 | 320× | 1 | 6 fewer contigs than with Illumina; 100% coverage (Illumina: 90.59%); resolved 187 ambiguous nucleotides in Illumina assembly; unambiguously assigned small differences in two >25 kb inverted repeats |
| PacBio | PBcR, MHAP, Celera, Quiver | 1 | 85× | 1 | 4.6 CPU hours for genome assembly (10× improvement over BLASR) |
| PacBio | PBcR, MHAP, Celera | 12 | 117× | 21 | 27 CPU hours for genome assembly (8× improvement over BLASR); improved current reference of telomeres |
| PacBio | PBcR, MHAP, Celera | 46 | 144× | 38 | 1896 CPU hours for genome assembly |
| PacBio | PBcR, MHAP, Celera, Quiver | 42 | 121× | 132 | 1060 CPU hours for genome assembly (593× improvement over BLASR); improved current reference of telomeres |
| PacBio | PBcR, MHAP, Celera | 275 | 54× | 3434 | 262,240 CPU hours for genome assembly; potentially closed 51 gaps in GRCh38; assembled MHC in 2 contigs (60 contigs with Illumina); reconstructed repetitive heterochromatic sequences in telomeres |
| PacBio | BLASR, Celera, Quiver | 243 | 41× | N/A (local assembly) | Closed 50 gaps and extended into 40 additional gaps in GRCh37; added over 1 Mb of novel sequence to the genome; identified 26,079 indels at least 50 bp in length; cataloged 47,238 SV breakpoints |
| Hybrid | PBcR, Celera | 3 | 5.5× PacBio + 15.4× 454 = 3.83× corrected | 15,328 | 1st assembly of >1 Gb parrot genome; N50 = 93,069 |
| Hybrid | BLASR, Bambus, AHA | 195 | 200× PacBio + 28× Illumina + 22× 454 | 2 | No N’s in contigs; 99.99% consensus accuracy; N50 = 3.01 Mb |
| PacBio | HGAP, Quiver, PGAP | 8 per strain | 446.5× average among strains | 1 per strain | 1 complete contig for each of 8 strains; methylation analysis associated motifs with genotypes of virulence factors |
Note: N50, the contig length for which half of all bases are in contigs of this length or greater; MHC, major histocompatibility complex; SV, structural variation.
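The N50 definition in the note can be made concrete with a short function. This is a minimal sketch of the standard calculation, not code from any of the assemblers listed in the table. The function name `n50` and the example contig lengths are invented for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of all bases.
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Compare 2 * running against total to avoid fractional halves.
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths (in kb), for illustration only.
    print(n50([80, 70, 50, 40, 30, 20]))
```

For the example lengths, the total is 290, so the threshold of 145 is crossed at the second-longest contig (80 + 70 = 150), and the N50 is 70.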
Figure 5. Detection of methylated bases using PacBio sequencing
PacBio sequencing can detect modified bases, including m6A (also known as 6mA), by analyzing variation in the time between base incorporations in the read strand. The figure is adapted with permission from Pacific Biosciences [72]. a.u. stands for arbitrary unit.
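The kinetic idea in this caption can be illustrated with a toy comparison. The sketch below is an assumption-laden simplification, not PacBio's analysis software: it supposes that per-position inter-incorporation times are available for the sample and for an unmodified control, and it flags positions where the sample's mean time is much longer. The function names, the use of a control, and the threshold of 2.0 are all hypothetical choices made for illustration.

```python
from statistics import mean


def kinetic_ratios(sample_times, control_times):
    """Per-position ratio of mean inter-incorporation time, sample vs. control.

    Each argument is a list with one entry per template position; each entry
    is a list of observed times (arbitrary units) at that position.
    """
    if len(sample_times) != len(control_times):
        raise ValueError("sample and control must cover the same positions")
    return [mean(s) / mean(c) for s, c in zip(sample_times, control_times)]


def flag_candidates(ratios, threshold=2.0):
    """Indices of positions whose ratio meets an (arbitrary) threshold."""
    return [i for i, r in enumerate(ratios) if r >= threshold]


if __name__ == "__main__":
    # Made-up observations for three template positions.
    sample = [[0.5, 0.6, 0.4], [2.0, 2.4, 2.2], [0.5, 0.5, 0.5]]
    control = [[0.5, 0.5, 0.5], [0.5, 0.6, 0.4], [0.5, 0.5, 0.5]]
    ratios = kinetic_ratios(sample, control)
    print(flag_candidates(ratios))
```

In this made-up data only the middle position shows a markedly slower incorporation in the sample, so it is the only candidate flagged. A real analysis would also need a statistical test and sufficient coverage at each position.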
Summary of PacBio sequencing applications and main achievements
| Advantage | Closes gaps and completes genomes due to longer reads | Identifies full-length transcript isoforms without need for a reference genome | Detects modifications by monitoring kinetic variation |
| | Identifies non-SNP SVs | Detects novel isoforms and fusion events | Detects epigenetic motifs in low coverage settings and with mixed genomes |
| Achievements | Produced highly-contiguous assemblies of bacterial and eukaryotic genomes | Identified previously-unannotated human intron structures | Discovered new m6A and m4C MTases and methylation patterns in 6 bacteria |
| | Discovered STRs and mutations associated with | Characterized alternative splicing events involved in the formation of blood cellular components | Detected m6A and m5C residues in |
| | Discovered | Identified novel isoforms in hESC transcriptome using hybrid sequencing | Identified virulence factor genotype-dependent motifs in |
| | Characterized SVs in a personal diploid human genome | Quantified personal transcriptome, including novel isoforms, splice sites, and SNVs | Detected intercellular heterogeneity in genome DNA modifications in |
| Refs. | | | |
Note: STR, short tandem repeat; FMR1, fragile X mental retardation 1; CDKN2A, cyclin-dependent kinase inhibitor 2A; SV, structural variation; SNV, single nucleotide variation; MTase, methyltransferase; hESC, human embryonic stem cell.