Hengyun Lu, Francesca Giordano, Zemin Ning.
Abstract
The revolution in genome sequencing has continued beyond the success of second-generation sequencing (SGS) technology. Third-generation sequencing (TGS) technology, led by Pacific Biosciences (PacBio), is progressing rapidly, moving from a technology once capable only of providing data for small genome analysis or targeted screening to one that promises high-quality de novo assembly and structural variation detection for human-sized genomes. In 2014, the MinION, the first commercial sequencer using nanopore technology, was released by Oxford Nanopore Technologies (ONT). The MinION identifies DNA bases by measuring the changes in electrical conductivity generated as DNA strands pass through a biological pore. Its portability, affordability, and speed of data production make it suitable for real-time applications, and the release of this long-read sequencer has therefore generated much excitement and interest in the genomics community. While de novo genome assemblies can be produced cheaply from SGS data, assembly continuity is often relatively poor because short reads have limited ability to resolve long repeats. Assembly quality can be greatly improved by using TGS long reads, since longer reads can span repetitive regions more easily, despite their higher base-level error rates. The potential of nanopore sequencing has been demonstrated by various studies in genome surveillance at locations where rapid and reliable sequencing is needed but resources are limited.
Keywords: De novo assembly; Molecular clinical diagnostics; Oxford nanopore MinION device; Structural variations; Third-generation sequencing
Year: 2016 PMID: 27646134 PMCID: PMC5093776 DOI: 10.1016/j.gpb.2016.05.004
Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691
Figure 1. The MinION sequencing device
DNA sequencing is performed by adding the sample to the flowcell. When DNA molecules pass through or near the nanopore, there will be a change in the magnitude of the current in the nanopore, which is measured by a sensor. The data streams are passed to the ASIC and MinKNOW, the software that generates the signal-level data. ASIC, application-specific integrated circuit.
Figure 2. Flow of the Oxford Nanopore sequencing process
A. Library preparation schematic for the genomic DNA sequencing kit (SQK-MAP-003). B. The 512 channels of a flowcell, colored by activity level (the most active channels are in red). C. “Squiggle plots” of fluctuating electrical signals, which can be translated into DNA bases. D. Decoding of 5-mers from event information and alignment of 1D and 2D base calls.
Figure 3. Time course of base throughput and read length distribution
A. The throughput of bases with time generated using Poretools [23]. B. Distribution of read length using data from one flowcell run [13].
List of analysis tools developed for Oxford Nanopore data
| Tool | Description |
| --- | --- |
| BWA | Fast nanopore data tuned alignment tool |
| GraphMap | Mapper for long and error-prone reads |
| LAST | Nanopore tuned alignment tool |
| LINKS | Software tool for long read scaffolding |
| marginAlign | Tools to align nanopore reads to a reference |
| minoTour | Real time analysis tools |
| nanoCORR | Error-correction tool for nanopore sequence data |
| NanoOK | Software for nanopore data, quality and error profiles |
| Nanopolish | Nanopore analysis and genome assembly software |
| nanopore | Variant-detection tool for nanopore sequence data |
| Nanocorrect | Error-correction tool for nanopore sequence data |
| npReader | Real-time conversion and analysis of nanopore reads |
| poRe | Tool for analyzing and visualizing nanopore data |
| PoreSeq | Error-correction and variant-calling software |
| Poretools | Nanopore sequence analysis and visualization software |
| SSPACE-LongRead | Genome scaffolding tool |
| SMIS | Genome scaffolding tool |
List of assemblers for Oxford Nanopore MinION long reads
| Assembler | Assembly approach | Error correction |
| --- | --- | --- |
| LQS | DALIGNER, Celera OLC | Nanocorrect, Nanopolish |
| PBcR | HGAP or BLASR, Celera OLC | PBcR |
| Canu | MHAP, Celera OLC | Canu |
| Falcon | String graph, Celera OLC | Falcon |
| Miniasm | OLC | None |
| ra-integrate | OLC | None |
| ALLPATHS-LG | de Bruijn graph | ALLPATHS-LG |
| SPAdes | de Bruijn graph | SPAdes |
Note: LQS, Loman, Quick and Simpson; PBcR, PacBio Corrected Reads; HGAP, hierarchical genome-assembly process; BLASR, basic local alignment with successive refinement; MHAP, MinHash alignment process; OLC, overlap-layout-consensus.
Datasets of MinION, PacBio, and MiSeq used for assembly comparison
| Platform | Dataset | Total bases | No. of reads | Mean read length (bp) | Max read length (bp) | N50 (bp) | Coverage (×) | Identity (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MinION | 48× reads | 225,086,343 | 30,364 | 7413 | 45,588 | 8931 | 48.5 | 91.41 |
| | 20× reads | 92,524,457 | 12,555 | 7370 | 45,588 | 8867 | 19.9 | 91.03 |
| PacBio | 25× reads | 116,263,784 | 13,124 | 8859 | 42,279 | 14,159 | 25.0 | 91.64 |
| | 20× reads | 90,335,723 | 10,154 | 8897 | 42,279 | 14,046 | 19.5 | 91.54 |
| MiSeq | 1263× reads | 5,862,970,443 | 25,758,933 | 228 | 302 | 300 | 1263 | >99.99 |
Note: The 20× read sets are subsets of the 48× MinION and 25× PacBio read sets, respectively. Matching identities for the three datasets were estimated using the reference assembly.
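The summary figures in the dataset table can be recomputed from a list of read lengths. The sketch below is illustrative only: it assumes the numeric columns are total bases, number of reads, mean length, maximum length, N50, and coverage, and it assumes a reference genome size of 4,641,652 bp (E. coli K-12 MG1655), which this section does not state. The function names `n50` and `read_stats` are hypothetical.

```python
# Assumed genome size (E. coli K-12 MG1655); not stated in the source text.
ASSUMED_GENOME_SIZE = 4_641_652


def n50(lengths):
    """Return the length L such that reads of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("no read lengths given")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


def read_stats(lengths, genome_size=ASSUMED_GENOME_SIZE):
    """Summarize a read set: totals, length statistics, and fold coverage."""
    if not lengths:
        raise ValueError("no read lengths given")
    total = sum(lengths)
    return {
        "total_bases": total,
        "num_reads": len(lengths),
        "mean_length": total / len(lengths),
        "max_length": max(lengths),
        "n50": n50(lengths),
        # Coverage is taken here as total sequenced bases divided by genome size.
        "coverage": total / genome_size,
    }


if __name__ == "__main__":
    toy_lengths = [2, 3, 4, 5, 6]
    print(read_stats(toy_lengths, genome_size=10))
```

Under the assumed genome size, dividing the 225,086,343 bases of the larger MinION set by 4,641,652 gives roughly 48.5×, which agrees with the coverage listed for that dataset.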
Assembly assessment using different tools for three types of reads at different levels of coverage
| 48× MinION reads | LQS | 4,636,840 | 2 | 4,624,206 | 1677 | 41,646 | 99.06 | 8662 |
| | Falcon | 4,562,630 | 1 | 4,562,630 | 2606 | 102,872 | 97.70 | 16.2 |
| | Canu | 4,574,127 | 1 | 4,574,127 | 760 | 73,623 | 98.39 | 20.2 |
| | PBcR | 4,567,749 | 1 | 4,567,749 | 982 | 80,325 | 98.22 | 46.4 |
| | Miniasm | 4,544,438 | 1 | 4,544,438 | 173,356 | 330,040 | 88.90 | 0.05 |
| | SPAdes | 4,638,974 | 1 | 4,638,974 | 255 | 31 | 99.99 | 19.1 |
| 20× MinION reads | Falcon | 4,137,874 | 23 | 2,693,898 | 4608 | 107,781 | 97.29 | 1.6 |
| | Canu | 4,546,300 | 1 | 4,546,300 | 2141 | 94,233 | 97.90 | 6.8 |
| | PBcR | 4,398,520 | 35 | 1,021,865 | 1824 | 88,269 | 97.95 | 10.0 |
| | Miniasm | 4,526,673 | 17 | 2,921,034 | 173,127 | 326,625 | 88.89 | 0.01 |
| | SPAdes | 4,639,128 | 3 | 3,108,521 | 254 | 31 | 99.99 | 24 |
| 48× MinION reads with Nanopolish | LQS | 4,685,134 | 2 | 4,672,162 | 903 | 18,464 | 99.58 | 375 |
| | Miniasm | 4,666,535 | 1 | 4,666,535 | 16,153 | 51,954 | 98.54 | 2540 |
| | Canu | 4,654,817 | 1 | 4,654,817 | 1155 | 18,730 | 99.57 | 434 |
| 25× PacBio reads | Falcon | 4,621,993 | 2 | 4,196,046 | 457 | 10,898 | 99.75 | 1.4 |
| | Canu | 4,663,990 | 1 | 4,663,990 | 20 | 3903 | 99.91 | 2.8 |
| | PBcR | 4,638,751 | 2 | 3,746,511 | 48 | 3368 | 99.93 | 7.2 |
| | Miniasm | 4,830,837 | 1 | 4,830,837 | 81,535 | 432,590 | 89.33 | 0.02 |
| | SPAdes | 4,638,975 | 1 | 4,638,975 | 254 | 30 | 99.99 | 19 |
| 20× PacBio reads | Falcon | 4,036,562 | 47 | 145,816 | 459 | 12,422 | 99.68 | 0.8 |
| | Canu | 4,637,297 | 1 | 4,637,297 | 90 | 7736 | 99.83 | 2.6 |
| | PBcR | 4,456,708 | 32 | 239,659 | 114 | 6260 | 99.86 | 5.8 |
| | Miniasm | 4,777,699 | 9 | 917,934 | 75,726 | 419,198 | 89.46 | 0.01 |
| | SPAdes | 4,638,975 | 1 | 4,638,975 | 254 | 30 | 99.99 | 18.9 |
Note: The SPAdes assemblies were obtained using both long and short reads, while the other assemblies used only one type of long read, either MinION or PacBio. Taking the 48× MinION dataset as an example, the SPAdes assembly was obtained using the 48× MinION reads together with the MiSeq reads, while the Falcon assembly was generated from the MinION reads only.
Figure 4. Counts of 5-mers from the reference and the assembly before and after polishing using event data
A. Counts of 5-mer homopolymers before event polishing. B. Counts of 5-mer homopolymers after polishing with Nanopolish using event data.
Under-represented and over-represented 5-mers of the E. coli data
| Category | 5-mer (Templ) | Ref | Templ | Diff | 5-mer (Compl) | Ref | Compl | Diff | 5-mer (2D) | Ref | 2D | Diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Under-represented | AAAAA | 0.247 | 0.086 | −0.161 | CGCCA | 0.288 | 0.092 | −0.196 | TTTTT | 0.251 | 0.047 | −0.204 |
| | TTTTT | 0.251 | 0.093 | −0.158 | AAAAA | 0.247 | 0.055 | −0.192 | AAAAA | 0.247 | 0.058 | −0.189 |
| | CGCTG | 0.258 | 0.104 | −0.155 | TTTTT | 0.251 | 0.065 | −0.186 | CAAAA | 0.170 | 0.111 | −0.058 |
| | GCTGG | 0.279 | 0.148 | −0.132 | CACCA | 0.184 | 0.054 | −0.130 | AAAAT | 0.195 | 0.138 | −0.057 |
| | CGCCA | 0.288 | 0.168 | −0.120 | CCAGC | 0.288 | 0.162 | −0.126 | AAAAG | 0.132 | 0.081 | −0.051 |
| | CCAGC | 0.288 | 0.180 | −0.108 | CGCTG | 0.258 | 0.135 | −0.123 | CGCCA | 0.288 | 0.239 | −0.049 |
| | GCCAG | 0.280 | 0.173 | −0.107 | GCCAG | 0.280 | 0.157 | −0.122 | TAAAA | 0.145 | 0.097 | −0.048 |
| | CTGGC | 0.278 | 0.178 | −0.100 | CAGCA | 0.262 | 0.140 | −0.122 | TGGTG | 0.185 | 0.138 | −0.048 |
| | CAGCA | 0.262 | 0.168 | −0.095 | CTGGC | 0.278 | 0.159 | −0.119 | CGCTG | 0.258 | 0.213 | −0.046 |
| | CGGCA | 0.222 | 0.129 | −0.093 | TGGCG | 0.275 | 0.163 | −0.112 | GCCAG | 0.280 | 0.238 | −0.042 |
| Over-represented | ACCCC | 0.040 | 0.136 | 0.096 | ACCCC | 0.040 | 0.143 | 0.103 | CAAAT | 0.105 | 0.164 | 0.059 |
| | CCCCG | 0.055 | 0.149 | 0.093 | CCCCG | 0.055 | 0.134 | 0.079 | GGGGT | 0.039 | 0.074 | 0.035 |
| | CCCCC | 0.033 | 0.122 | 0.089 | CCCCA | 0.064 | 0.128 | 0.065 | CCCAA | 0.047 | 0.080 | 0.033 |
| | CCCCA | 0.064 | 0.138 | 0.075 | CCTAG | 0.003 | 0.066 | 0.063 | TGAAT | 0.121 | 0.154 | 0.033 |
| | CCTAG | 0.003 | 0.075 | 0.072 | CTGAG | 0.050 | 0.112 | 0.063 | GAAGG | 0.094 | 0.127 | 0.033 |
| | GCCCC | 0.062 | 0.131 | 0.069 | TACCC | 0.073 | 0.136 | 0.062 | CGGGG | 0.054 | 0.087 | 0.032 |
| | CTCCC | 0.039 | 0.107 | 0.067 | CCTAA | 0.026 | 0.087 | 0.061 | ACCGT | 0.123 | 0.155 | 0.032 |
| | TCTAC | 0.048 | 0.113 | 0.065 | GACCC | 0.040 | 0.100 | 0.060 | CGTGA | 0.102 | 0.134 | 0.032 |
| | TCCCC | 0.056 | 0.121 | 0.065 | TCCCC | 0.056 | 0.115 | 0.059 | GAAGC | 0.124 | 0.156 | 0.032 |
| | TACCC | 0.073 | 0.138 | 0.064 | TCCTA | 0.013 | 0.071 | 0.058 | AGGCA | 0.093 | 0.124 | 0.031 |
Note: Poretools was used to extract FASTA sequences for template, complement, and 2D reads from the FAST5 files. The 5-mer counts of the reads of the E. coli data [13] and of the reference assembly were calculated separately using the oligonucleotide frequency function of the R package Biostrings. The frequency of each 5-mer per 100 bp of read sequence was calculated, and the differences between reads and Ref are shown. Ref, reference assembly; Templ, template read; Compl, complement read.
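The calculation described in the note can be sketched in a few lines. This is an illustrative Python version, not the R/Biostrings code the authors used: it counts overlapping 5-mers, normalizes the counts per 100 bp of input sequence, and subtracts the reference frequency from the read frequency. The exact normalization (dividing by total bases rather than by the number of 5-mer windows) is an assumption, and the function names are hypothetical.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def kmer_freq_per_100bp(seqs, k=5):
    """Count overlapping k-mers in the sequences and scale to occurrences per 100 bp."""
    counts = Counter()
    total_bases = 0
    for seq in seqs:
        seq = seq.upper()
        total_bases += len(seq)
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows containing ambiguous bases such as N.
            if set(kmer) <= VALID_BASES:
                counts[kmer] += 1
    if total_bases == 0:
        return {}
    # Assumed normalization: k-mer count per 100 bases of input sequence.
    return {kmer: 100.0 * n / total_bases for kmer, n in counts.items()}


def representation_diff(read_freq, ref_freq):
    """Read frequency minus reference frequency; negative values mean under-represented."""
    kmers = set(read_freq) | set(ref_freq)
    return {kmer: read_freq.get(kmer, 0.0) - ref_freq.get(kmer, 0.0) for kmer in kmers}


if __name__ == "__main__":
    ref = kmer_freq_per_100bp(["AAAAAAAAAA"])    # toy "reference"
    reads = kmer_freq_per_100bp(["AAAAACCCCC"])  # toy "reads"
    diff = representation_diff(reads, ref)
    # List the most under-represented k-mers first.
    for kmer, d in sorted(diff.items(), key=lambda item: item[1]):
        print(kmer, round(d, 3))
```

Sorting the differences in ascending order lists the most under-represented 5-mers first, and sorting in descending order lists the most over-represented ones.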