Shangqian Xie1, Amy Wing-Sze Leung2, Zhenxian Zheng2, Dake Zhang3, Chuanle Xiao4, Ruibang Luo2, Ming Luo5, Shoudong Zhang6,7.
Abstract
The Human Genome Project opened an era of (epi)genomic research and provided a platform for the development of new sequencing technologies. During and after the project, several sequencing technologies have continued to dominate the nucleic acid sequencing market. Currently, Illumina (short-read), PacBio (long-read), and Oxford Nanopore (long-read) are the most popular sequencing technologies. Unlike PacBio or the popular short-read sequencers that preceded it, which, as examples of second-generation or so-called Next-Generation Sequencing platforms, need to synthesize a strand while sequencing, nanopore technology directly sequences native DNA and RNA molecules. Nanopore sequencing therefore avoids converting mRNA into cDNA, which not only allows the sequencing of extremely long native DNA and full-length RNA molecules but also documents modifications that have been made to those native DNA or RNA bases. In this review of direct DNA sequencing and direct RNA sequencing using Oxford Nanopore technology, we focus on their development and application achievements and discuss their challenges and future perspectives. We also address the problems researchers may encounter when applying these approaches to their research topics, and how to resolve them.
Keywords: base modification; base-calling; direct DNA sequencing; direct RNA sequencing; long-read sequencing; nanopore sequencing; tools and algorithms
Year: 2021 PMID: 34901902 PMCID: PMC8640597 DOI: 10.1016/j.xinn.2021.100153
Source DB: PubMed Journal: Innovation (Camb) ISSN: 2666-6758
Figure 1. Schematic field application of ONT direct DNA sequencing
CsgG is the pore protein used in current commercial Oxford Nanopore flow cells. The DNA/RNA hybrid is produced by reverse transcription: the first-strand cDNA pairs with the template RNA, which prevents secondary structure from forming in the native RNA and thus facilitates direct RNA sequencing, although the cDNA strand itself does not enter the pore for sequencing. The motor protein controls the speed at which nucleic acid molecules pass through the nanopore and is normally used to slow them down.
Tools and algorithms developed for basecalling
| Tool | Description | Algorithm | Advantages | Rate (bp/s) | Disadvantages | Link | Reference (PMID) |
|---|---|---|---|---|---|---|---|
| Chiron | Basecalling | deep learning | no segmentation | 2000 | not suitable for large genomes | | |
| Causalcall | Basecalling | temporal convolutional network | directly identifies base sequences of varying lengths | 7000 | base deletions | | |
| URnano | Basecalling | deep neural networks | models sequential dependencies for a one-dimensional segmentation task | 3600 | segmentation | | |
| DeepNano | Basecalling | deep recurrent neural networks | open-source | 1250 | not suitable for large genomes | | |
| BasecRAWller | Basecalling | unidirectional recurrent neural networks | 1) streaming basecalling, 2) tunable ratio of insertions to deletions, and 3) potential for streaming detection of modified bases | 200 | covalently modified bases not detected | | |
| Halcyon | Basecalling | convolutional neural network and recurrent neural network | no segmentation and semantic correspondence | 250 | decreased speed | | |
| CMOS | Nanopore sensors | complementary metal-oxide-semiconductor | - | - | - | - | |
| FPGA | Nanopore sensors | field-programmable gate array | - | - | - | - | |
| Nanocall | Basecalling | hidden Markov model | offline, free and private | 700 | does not currently integrate '2D' reads | | |
| Guppy | Basecalling | taxon-specific dataset and neural network model | reduction of errors in methylation motifs and no segmentation | 120,000 | requires a custom model (a larger neural network and/or training data from the same species) for best accuracy | | |
| PoreOver | Basecalling | CTC-trained neural network and hidden Markov models | compatible with multiple nanopore basecallers | 450 | does not currently integrate '2D' reads | | |
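To put the rate column in perspective, the sketch below converts the reported throughputs into an approximate wall-clock time for basecalling a dataset. Only the rates are taken from the table; the genome size (3.1 Gbp), the 30x coverage, and the function name `basecalling_hours` are illustrative assumptions, and the estimate assumes a constant single-process rate with no parallelism or I/O overhead.

```python
# Rates in bp/s as listed in the basecalling table above.
RATES_BP_PER_S = {
    "Chiron": 2000,
    "Causalcall": 7000,
    "URnano": 3600,
    "DeepNano": 1250,
    "BasecRAWller": 200,
    "Halcyon": 250,
    "Nanocall": 700,
    "Guppy": 120_000,
    "PoreOver": 450,
}


def basecalling_hours(total_bases: int, rate_bp_per_s: float) -> float:
    """Return the hours needed to basecall total_bases at a constant rate."""
    if rate_bp_per_s <= 0:
        raise ValueError("rate must be positive")
    return total_bases / rate_bp_per_s / 3600


# Assumption for illustration only: a 3.1 Gbp genome sequenced at 30x coverage.
GENOME_SIZE_BP = 3_100_000_000
COVERAGE = 30
total_bases = GENOME_SIZE_BP * COVERAGE

# Print the tools from fastest to slowest with their estimated runtime.
for tool, rate in sorted(RATES_BP_PER_S.items(), key=lambda kv: -kv[1]):
    print(f"{tool:>12}: {basecalling_hours(total_bases, rate):12.1f} h")
```

Because the time scales inversely with the rate, the differences between tools in the table translate directly into proportional differences in runtime for a dataset of a given size.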
Figure 2. Two genome assembly strategies for ONT
Short solid strips represent nanopore reads (red, error-prone reads; green, corrected reads). Long hollow strips represent contigs (red, low-reliability contigs; green, high-reliability contigs). Black boxes in both solid and hollow strips indicate errors.
Figure 3. NECAT two-step progressive correction and assembly
(A and B) Error correction of nanopore reads, and (C) assembly of nanopore reads. Pale yellow strips represent raw, uncorrected reads; pink strips represent reads with corrected LERS; green strips represent reads with corrected RERS; red strips represent error-prone reads that failed to be corrected. Purple strips represent contigs. Black boxes in strips indicate errors, and white rectangles in strips mark high-error-rate regions. A black rectangle in a pink strip means that the high-error-rate region is shielded from correction during the first correction step. Dotted lines mark the overlapping-error-rate threshold used to select supporting reads. The pale yellow box between two purple strips is a contig bridge selected from raw reads. LR denotes the target long read to be corrected.
Figure 4. Schematic flow chart of small variant and SV calling with ONT DDS data
The “data preprocessing” box shows the commonly used bioinformatics tools for preprocessing ONT sequencing data. The “small variant calling” and “structural variant calling” boxes show the essential steps and their critical parameters. The “functional analysis” box shows applications of the detected variants. SV, structural variation; DB, database; CLISIG, clinical significance.
Figure 5. The workflow of methylation identification for ONT DDS and DRS data
Raw electrolytic current signal files (FAST5) are decoded into sequence information and electrolytic current signal information. The indexed current signals and the sequence information are mapped to the reference genome, and established detection models are then used to detect methylation sites in the genome.
Tools developed for analysis of Nanopore sequencing data
| Tool | Description | Algorithm | Advantages | Rate | Disadvantages | Link | Reference (PMID) |
|---|---|---|---|---|---|---|---|
| LAST | Alignment | adaptive seeds | Adaptive seeds are matches that are chosen based on their rareness, instead of using fixed-length matches | - | the running time increases linearly with sequence length and short DNA reads | ||
| Minimap2 | Alignment | split-read alignment | DNA or long mRNA, higher accuracy, faster, and full length of reads | - | not suitable for chimeric alignments | ||
| GraphMap | Alignment | candidate alignments and fast graph traversal | long reads with speed, high sensitivity | - | large-memory | ||
| UNCALLED | Alignment | Ferragina-Manzini index | mapping during sequencing and the leftmost mapping | - | not full length | ||
| tailfindr | poly(A) | measures poly(A) tail length | - | - | - | ||
| NaS | Assembly | Illumina hybrid | entirely and with no error | - | Not suitable for large genomes | ||
| LQS | Assembly | multiple-alignment corrected | corrected by a multiple-alignment and 99.5% nucleotide identity | - | Not suitable for large genomes | ||
| Canu | Assembly | | halves depth-of-coverage requirements, improves assembly continuity and reduces runtime on large genomes | - | accuracy depends on signal-level polishing | ||
| Miniasm | Assembly | No correction | orders of magnitude faster | - | error rate is as high as that of raw reads | ||
| Nanopolish | Variant caller/Methylation detection | Hidden Markov Model | calculate an improved consensus sequence for a draft genome assembly, detect base modifications, call SNPs and indels | - | signal-level analysis | ||
| Clairvoyante | Variant caller/SV caller | convolutional neural network | SV calling, small variants and genotype | - | higher sequencing depth | ||
| Clair | Variant caller | Deep neural network | faster and complex variants with multiple alternative alleles | - | accuracy depends on pileup data and greater computational demands | ||
| NanoSV | SV caller | split- and gapped-aligned reads | genotyping | - | non-detectable inversion, complex repeat regions and segmental duplications | ||
| Picky | SV caller | seed-and-extend process and split-read | micro-insertions and phased SV | - | high specificity | ||
| NanoVar | SV caller | artificial neural network | low-depth (8X) | - | the alignment profile of each read requires re-training | ||
| SENSV | SV caller | Deep neural network | low-depth | - | balanced translocation missed | ||
| CAMPHOR | SV caller | SV breakpoints | polymorphic SVs and somatic SVs | - | removes indels in short repeats, average read length of 5 kbp, and cannot detect indels < 100 bp | ||
| NanoMod | Methylation detection | signal intensities | raw signal data and 5mC | - | two pair sample reads | ||
| DeepSignal | Methylation detection | deep learning | 6mA/5mC, lower coverage, and predict methylation states | - | train DeepSignal to detect more types of base modification | ||
| mCaller | Methylation detection | neural network | 6mA and detect known or confirm suspected methyltransferase target motifs | - | only bacteria genome | ||
| DeepMod | Methylation detection | recurrent neural network | 6mA/5mC, strand-sensitive and has single-base resolution | - | cannot detect other types of modifications or other motifs, not suitable for RNA, influenced by neighboring bases, relies on the alignment tool to find the correct reference positions of bases | ||
| MINES | Methylation detection | random forest | m6A sites within DRACH motifs | - | lost small difference modification sites and not suitable for DNA | ||
| Nanom6A | Methylation detection | XGBoost model | m6A at single-base resolution and quantified abundance of m6A sites | - | not suitable for DNA | ||
| FLAIR | Isoform detection | correct and realign | assessing 3′ poly(A) tail length, base modifications, and transcript haplotypes | - | combined short Illumina reads | ||
| TrackCluster | Isoform detection | read tracks | read classification, a transcript isoform with numerous exons, stage-specific or cell-specific expression of isoforms | - | not suitable for large genomes | ||
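MINES, listed in the table above, reports m6A sites within DRACH motifs. As a minimal sketch of what that restriction means, the snippet below scans an RNA sequence for DRACH sites using the standard IUPAC ambiguity codes (D = A/G/U, R = A/G, H = A/C/U). The function name `find_drach_sites` and the example sequence are hypothetical, and this is plain motif matching, not the random-forest classification that MINES itself performs.

```python
import re
from typing import List, Tuple

# DRACH in IUPAC terms: D = A/G/U, R = A/G, then A, C, and H = A/C/U.
# The pattern is wrapped in a lookahead so overlapping motifs are all reported.
DRACH = re.compile(r"(?=([AGU][AG]AC[ACU]))")


def find_drach_sites(rna: str) -> List[Tuple[int, str]]:
    """Return (0-based index of the central A, matched motif) for each hit."""
    seq = rna.upper().replace("T", "U")  # also accept DNA-style input
    return [(m.start() + 2, m.group(1)) for m in DRACH.finditer(seq)]


# Hypothetical example sequence, for illustration only.
example = "CCGGACUAAGACAUU"
for position, motif in find_drach_sites(example):
    print(position, motif)
```

The reported position is the central adenosine of the motif, which is the candidate m6A site; any downstream classifier would then decide whether that site is actually modified.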
Figure 6. Flowchart for ONT DRS of NAD-capped RNAs
The 5′ and 3′ labels indicate the direction of the NAD-capped RNA. The NAD structure is shown attached to the 5′ end of the NAD-capped RNA. After ADPRC catalysis, an azide moiety is linked to NAD by replacing nicotinamide. The azide-containing NAD-capped RNA can then undergo SPAAC, in which a synthetic RNA adapter carrying DBCO at its 3′ end is conjugated to the azide-functionalized NAD-capped RNA. The tagged NAD-capped RNA can then be used for ONT DRS library preparation, sequencing, and analysis. ADPRC, ADP-ribosyl cyclase; NAD, nicotinamide adenine dinucleotide; SPAAC, strain-promoted azide-alkyne cycloaddition; tagRNA, synthetic adaptor RNA to tag NAD-capped RNA; ONT DRS, Oxford Nanopore Technologies Direct RNA Sequencing.