Shanika L Amarasinghe, Shian Su, Xueyi Dong, Luke Zappia, Matthew E Ritchie, Quentin Gouil.
Abstract
Long-read technologies are overcoming early limitations in accuracy and throughput, broadening their application domains in genomics. Dedicated analysis tools that take into account the characteristics of long-read data are thus required, but the fast pace of development of such tools can be overwhelming. To assist in the design and analysis of long-read sequencing projects, we review the current landscape of available tools and present an online interactive database, long-read-tools.org, to facilitate their browsing. We further focus on the principles of error correction, base modification detection, and long-read transcriptomics analysis and highlight the challenges that remain.
Keywords: Data analysis; Long-read sequencing; Oxford Nanopore; PacBio
Year: 2020 PMID: 32033565 PMCID: PMC7006217 DOI: 10.1186/s13059-020-1935-5
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Overview of long-read analysis tools and pipelines. a Release of tools identified from various sources and milestones of long-read sequencing. b Functional categories. c Typical long-read analysis pipelines for SMRT and nanopore data. Six main stages are identified through the presented workflow (i.e. basecalling, quality control, read error correction, assembly/alignment, assembly refinement, and downstream analyses). The green-coloured boxes represent processes common to both short-read and long-read analyses. The orange-coloured boxes represent the processes unique to long-read analyses. Unfilled boxes represent optional steps. Commonly used tools for each step in long-read analysis are within brackets. Italics signify tools developed by PacBio or ONT, and non-italics signify tools developed by external parties. Arrows represent the direction of the workflow.
Fig. 2 Paradigms of error correction (a) and polishing (b). Errors in long reads and assembly are denoted by red crosses. Non-hybrid methods only require long reads, while hybrid methods additionally require accurate short reads (purple).
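The non-hybrid paradigm in Fig. 2 can be illustrated with a deliberately simplified sketch: when several long reads cover the same locus, random errors in individual reads tend to be outvoted by the other reads. The example below assumes the reads are already aligned and of equal length, and uses a plain column-wise majority vote. The function name `majority_consensus` and the toy reads are hypothetical and chosen for illustration only; actual correction and polishing tools rely on more elaborate alignment and consensus models.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Column-wise majority vote over equal-length, pre-aligned reads.

    Assumption: alignment has already been done, so position i in every
    read refers to the same locus. Real tools do not make this assumption.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be pre-aligned to the same length")

    consensus = []
    for column in zip(*aligned_reads):
        # Keep the most frequent base observed in this column.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Toy reads (made up): two of the five carry a single substitution error.
reads = [
    "ACGTTGCA",
    "ACGATGCA",  # substitution at index 3
    "ACGTTGCA",
    "ACCTTGCA",  # substitution at index 2
    "ACGTTGCA",
]
consensus = majority_consensus(reads)
print(consensus)
```

A hybrid method would follow the same voting idea but draw the votes from accurate short reads aligned to the long read, rather than from other long reads.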
Fig. 3 Methods to detect base modifications in long-read sequencing. Base modifications can be inferred from their effect on the current intensity (nanopore) and inter-pulse duration (IPD, SMRT). Strategies to call base modifications in nanopore sequencing and the corresponding tools are further depicted.
Tools and strategies to detect base modifications in nanopore data (HMM hidden Markov model, HDP hierarchical Dirichlet process, CNN convolutional neural network, LSTM long short-term memory, RNN recurrent neural network, SVM support vector machine)
| Tool | Base modifications | Strategy | Reference |
|---|---|---|---|
| Guppy | 5mCpG, 5mC (Dcm), 6mA (Dam) | Basecall | [ |
| Taiyaki | – | Basecall | [ |
| RepNano | BrdU | Basecall | [ |
| D-NAscent | BrdU | HMM | [ |
| Nanopolish | 5mCpG | HMM | [ |
| Megalodon | 6mA, 5mCpG | HMM | [ |
| signalAlign | 6mA, 5mC, 5hmC | HMM-HDP | [ |
| DeepSignal | 6mA (Dam), 5mCpG | Neural network (CNN + classifier) | [ |
| DeepMod | 6mA, 5mCpG | Neural network (LSTM-RNN) | [ |
| mCaller | 6mA, 5mCpG | Neural network classifier | [ |
| Tombo | 6mA (DNA), 5mC (RNA, DNA), de novo | Statistical test | [ |
| NanoMod | de novo | Statistical test | [ |
| EpiNano | m6A (RNA) | SVM | [ |
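The "Statistical test" strategy listed in the table can be sketched in a minimal form: at a given reference position, current levels from a native sample are compared with those from an unmodified control, and a large difference suggests a modification. The example below uses Welch's t statistic as a stand-in test; this is an assumption made for illustration and is not claimed to be the test used by Tombo or NanoMod. The function name `welch_t` and all current values are made up.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    var_a = variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = variance(sample_b)
    standard_error = sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Synthetic per-read current levels (arbitrary units) at two positions.
control = [75.2, 76.1, 74.8, 75.9, 75.5]            # unmodified control
native_shifted = [82.1, 83.4, 81.9, 84.0, 82.7]     # position with a shift
native_unshifted = [75.4, 75.0, 76.2, 75.7, 74.9]   # position without a shift

t_shifted = welch_t(native_shifted, control)
t_unshifted = welch_t(native_unshifted, control)
print(f"shifted position:   t = {t_shifted:.2f}")
print(f"unshifted position: t = {t_unshifted:.2f}")
```

Because the comparison does not depend on a model trained for a particular modification, this kind of approach fits the "de novo" entries in the table, at the cost of requiring a matched control sample.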
Fig. 4 Types of transcriptomic analyses and their steps. The choice of sequencing protocol amongst the six available workflows affects the type, characteristics, and quantity of data generated. Only direct RNA sequencing allows epitranscriptomic studies, but SMRT direct RNA sequencing is a custom technique that is not fully supported. The remaining non-exclusive applications are isoform detection, quantification, and differential analysis. The dashed lines in arrows represent upstream processes to transcriptomics.
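As a rough illustration of the quantification step named in Fig. 4, the sketch below counts reads per isoform and scales the counts to counts per million (CPM). It assumes each read has already been assigned to exactly one isoform, which is a simplification; the function name `counts_per_million`, the read identifiers, and the isoform labels are hypothetical.

```python
from collections import Counter


def counts_per_million(read_assignments):
    """Convert a read -> isoform mapping into per-isoform CPM values."""
    counts = Counter(read_assignments.values())
    total = sum(counts.values())
    # An empty input yields an empty result, so no division by zero occurs.
    return {isoform: n * 1_000_000 / total for isoform, n in counts.items()}


# Made-up assignments: four reads spread over three isoforms.
assignments = {
    "read1": "geneA.1",
    "read2": "geneA.1",
    "read3": "geneA.2",
    "read4": "geneB.1",
}
cpm = counts_per_million(assignments)
for isoform, value in sorted(cpm.items()):
    print(isoform, value)
```

Values of this kind would then feed into differential analysis between conditions, the last of the applications listed in the caption.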