
Opportunities and challenges in long-read sequencing data analysis.

Shanika L Amarasinghe^1,2, Shian Su^1,2, Xueyi Dong^1,2, Luke Zappia^3,4, Matthew E Ritchie^1,2,5, Quentin Gouil^6,7.

Abstract

Long-read technologies are overcoming early limitations in accuracy and throughput, broadening their application domains in genomics. Dedicated analysis tools that take into account the characteristics of long-read data are thus required, but the fast pace of development of such tools can be overwhelming. To assist in the design and analysis of long-read sequencing projects, we review the current landscape of available tools and present an online interactive database, long-read-tools.org, to facilitate their browsing. We further focus on the principles of error correction, base modification detection, and long-read transcriptomics analysis and highlight the challenges that remain.


Keywords: Data analysis; Long-read sequencing; Oxford Nanopore; PacBio

Year: 2020 PMID: 32033565 PMCID: PMC7006217 DOI: 10.1186/s13059-020-1935-5

Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583

Introduction

Long-read sequencing, or third-generation sequencing, offers a number of advantages over short-read sequencing [1, 2]. While short-read sequencers such as Illumina’s NovaSeq, HiSeq, NextSeq, and MiSeq instruments [3-5]; BGI’s MGISEQ and BGISEQ models [6]; or Thermo Fisher’s Ion Torrent sequencers [7, 8] produce reads of up to 600 bases, long-read sequencing technologies routinely generate reads in excess of 10 kb [1]. Short-read sequencing is cost-effective, accurate, and supported by a wide range of analysis tools and pipelines [9]. However, natural nucleic acid polymers span eight orders of magnitude in length, and sequencing them in short amplified fragments complicates the task of reconstructing and counting the original molecules. Long reads can thus improve de novo assembly, mapping certainty, transcript isoform identification, and detection of structural variants. Furthermore, long-read sequencing of native molecules, both DNA and RNA, eliminates amplification bias while preserving base modifications [10]. These capabilities, together with continuing progress in accuracy, throughput, and cost reduction, have begun to make long-read sequencing an option for a broad range of applications in genomics for model and non-model organisms [2, 11]. Two technologies currently dominate the long-read sequencing space: Pacific Biosciences’ (PacBio) single-molecule real-time (SMRT) sequencing and Oxford Nanopore Technologies’ (ONT) nanopore sequencing. We henceforth refer to these simply as SMRT and nanopore sequencing. SMRT and nanopore sequencing technologies were commercially released in 2011 and 2014, respectively, and since then have become suitable for an increasing number of applications. The data that these platforms produce differ qualitatively from second-generation sequencing, thus necessitating tailored analysis tools. 
Given the broadening interest in long-read sequencing and the fast-paced development of applications and tools, the current review aims to provide a description of the guiding principles of long-read data analysis, a survey of the available tools for different tasks as well as a discussion of the areas in long-read analysis that require improvements. We also introduce the complementary open-source catalogue of long-read analysis tools: long-read-tools.org. The long-read-tools.org database allows users to search and filter tools based on various parameters such as technology or application.

The state of long-read sequencing and data analysis

Nanopore and SMRT long-read sequencing technologies rely on very distinct principles. Nanopore sequencers (MinION, GridION, and PromethION) measure the ionic current fluctuations when single-stranded nucleic acids pass through biological nanopores [12, 13]. Different nucleotides confer different resistances to the stretch of nucleic acid within the pore; therefore, the sequence of bases can be inferred from the specific patterns of current variation. SMRT sequencers (RSII, Sequel, and Sequel II) detect fluorescence events that correspond to the addition of one specific nucleotide by a polymerase tethered to the bottom of a tiny well [14, 15]. Read length in SMRT sequencing is limited by the longevity of the polymerase. A faster polymerase for the Sequel sequencer, introduced with chemistry v3 in 2018, increased the average polymerase read length to 30 kb. The library insert sizes amenable to SMRT sequencing range from 250 bp to 50 kbp. Nanopore sequencing provides the longest read lengths, from 500 bp to the current record of 2.3 Mb [16], with 10–30-kb genomic libraries being common. Read length in nanopore sequencing is mostly limited by the ability to deliver very high-molecular weight DNA to the pore and the negative impact this has on run yield [17]. The basecalling accuracy of reads produced by both these technologies has dramatically increased in the recent past, and the raw base-called error rate is claimed to have been reduced to < 1% for SMRT sequencers [18] and < 5% for nanopore sequencers [17]. While nanopore and SMRT are true long-read sequencing technologies and the focus of this review, there are also synthetic long-read sequencing approaches. These include linked reads, proximity ligation strategies, and optical mapping [19-28], which can be employed in synergy with true long reads.
With the potential for accurately assembling and re-assembling genomes [17, 29–32], methylomes [33, 34], variants [18], isoforms [35, 36], haplotypes [37-39], or species [40, 41], tools to analyse the sequencing data provided by long-read sequencing platforms are being actively developed, especially since 2011 (Fig. 1a).

Fig. 1

Overview of long-read analysis tools and pipelines. a Release of tools identified from various sources and milestones of long-read sequencing. b Functional categories. c Typical long-read analysis pipelines for SMRT and nanopore data. Six main stages are identified through the presented workflow (i.e. basecalling, quality control, read error correction, assembly/alignment, assembly refinement, and downstream analyses). The green-coloured boxes represent processes common to both short-read and long-read analyses. The orange-coloured boxes represent the processes unique to long-read analyses. Unfilled boxes represent optional steps. Commonly used tools for each step in long-read analysis are within brackets. Italics signify tools developed by either PacBio or ONT companies, and non-italics signify tools developed by external parties. Arrows represent the direction of the workflow.

A search through publications, preprints, online repositories, and social media identified 354 long-read analysis tools. The majority of these tools are developed for nanopore read analyses (262) while there are 170 tools developed to analyse SMRT data (Fig. 1a). We categorised them into 31 groups based on their functionality (Fig. 1b). This identified trends in the evolution of research interests: likely due to the modest initial throughput of long-read sequencing technologies, the majority of tools were tested on non-human data; tools for de novo assembly, error correction, and polishing categories have received the most attention, while transcriptome analysis is still in early stages of development (Fig. 1b). We present an overview of the analysis pipelines for nanopore and SMRT data and highlight popular tools (Fig. 1c). We do not attempt to provide a comprehensive review of tool performance for all long-read applications; dedicated benchmark studies are irreplaceable, and we refer our readers to those when possible.
Instead, we present the principles and potential pitfalls of long-read data analysis with a focus on some of the main types of downstream analyses: structural variant calling, error correction, detection of base modifications, and transcriptomics.

Basecalling

The first step in any long-read analysis is basecalling, or the conversion from raw data to nucleic acid sequences (Fig. 1c). This step receives greater attention for long reads than for short reads, where it is more standardised and usually performed using proprietary software. Nanopore basecalling is itself more complex than SMRT basecalling, and more options are available: of the 26 tools related to basecalling that we identified, 23 relate to nanopore sequencing. During SMRT sequencing, successions of fluorescence flashes are recorded as a movie. Because the template is circular, the polymerase may go over both strands of the DNA fragment multiple times. SMRT basecalling starts with segmenting the fluorescence trace into pulses and converting the pulses into bases, resulting in a continuous long read (also called polymerase read). This read is then split into subreads, where each subread corresponds to 1 pass over the library insert, without the linker sequences. Subreads are stored as an unaligned BAM file. From aligning these subreads together, an accurate circular consensus sequence (CCS) for the insert is derived [42]. SMRT basecallers are chiefly developed internally and require training specific to the chemistry version used. The current basecalling workflow is ccs [43]. Nanopore raw data are current intensity values measured at 4 kHz saved in fast5 format, built on HDF5. Basecalling of nanopore reads is an area of active research, where algorithms are quickly evolving (neural networks have supplanted HMMs, and various neural network structures are being tested [44]) as are the chemistries for which they are trained. ONT makes available a production basecaller (Guppy, currently) as well as development versions (Flappie, Scrappie, Taiyaki, Runnie, and Bonito) [45]. Generally, the production basecaller provides the best accuracy and most stable performance and is suitable for most users [46].
Development basecallers can be used to test features, for example, homopolymer accuracy, variant detection, or base modification detection, but they are not necessarily optimised for speed or overall accuracy. In time, improvements make their way into the production basecaller. For example, Scrappie currently maps homopolymers explicitly [47]. Independent basecallers with different network structures are also available, most prominently Chiron [48]. These have been reviewed and their performance evaluated elsewhere [13, 46, 49]. The ability to train one’s own basecalling model opens the possibility to improve basecalling performance by tailoring the model to the sample’s characteristics [46]. As a corollary, users have to keep in mind that the effective accuracy of the basecaller on their data set may be lower than the advertised accuracy. For example, ONT’s basecallers are currently trained on a mixture of human, yeast, and bacterial DNA; their performance on plant DNA where non-CG methylation is abundant may be lower [50]. As the very regular updates to the production Guppy basecaller testify, basecalling remains an active area of development.
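
The idea of deriving a consensus from several passes over the same insert can be illustrated with a toy example. The sketch below is not how ccs works internally: it assumes the passes have already been aligned to a common length (with `-` marking gaps) and takes a simple column-wise majority vote, whereas production tools use alignment and probabilistic models. The function name and the example sequences are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_passes):
    """Column-wise majority vote over equal-length, pre-aligned passes.

    '-' marks a gap; columns where the gap wins are dropped from the output.
    """
    if not aligned_passes:
        return ""
    length = len(aligned_passes[0])
    if any(len(p) != length for p in aligned_passes):
        raise ValueError("passes must be aligned to the same length")

    consensus = []
    for column in zip(*aligned_passes):
        # Most frequent symbol in this alignment column.
        base, _ = Counter(column).most_common(1)[0]
        if base != "-":
            consensus.append(base)
    return "".join(consensus)


# Five hypothetical passes, each carrying at most one random error.
passes = [
    "ACGTTAGC-A",
    "ACGATAGC-A",  # substitution
    "ACGTTAGCTA",  # insertion
    "AC-TTAGC-A",  # deletion
    "ACGTTCGC-A",  # substitution
]
print(majority_consensus(passes))
```

Because each error here occurs in only one pass, the vote removes all of them; errors shared across passes would survive.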

Errors, correction, and polishing

Both SMRT and nanopore technologies provide lower per read accuracy than short-read sequencing. In the case of SMRT, the circular consensus sequence quality is heavily dependent on the number of times the fragment is read—the depth of sequencing of the individual SMRTbell molecule (Fig. 1c)—a function of the length of the original fragment and longevity of the polymerase. With the Sequel v2 chemistry introduced in 2017, fragments longer than 10 kbp were typically only read once and had a single-pass accuracy of 85–87% [51]. The late 2018 v3 chemistry increases the longevity of the polymerase (from 20 to 30 kb for long fragments). An estimated four passes are required to provide a CCS with Q20 (99% accuracy) and nine passes for Q30 (99.9% accuracy) [18]. If the errors were non-random, increasing the sequencing depth would not be sufficient to remove them. However, the randomness of sequencing errors in subreads, consisting of more indels than mismatches [52-54], suggests that consensus approaches can be used so that the final outputs (e.g. CCS, assembly, variant calls) should be free of systematic biases. Still, CCS reads retain errors and exhibit a bias for indels in homopolymers [18]. On the other hand, the quality of nanopore reads is independent of the length of the DNA fragment. Read quality depends on achieving optimal translocation speed (the rate of ratcheting base by base) of the nucleic acid through the pore, which typically decreases in the late stages of sequencing runs, negatively affecting the quality [55]. Contrary to SMRT sequencing, a nanopore sequencing library is made of linear fragments that are read only once. In the most common, 1D sequencing protocol, each strand of the dsDNA fragment is read independently, and this single-pass accuracy is the final accuracy for the fragment. 
By contrast, the 1D2 protocol is designed to sequence the complementary strand in immediate succession of up to 75% of fragments, which allows the calculation of a more accurate consensus sequence for the library insert. To date, the median single-pass accuracy of 1D sequencing across a run can reach 95% (manufacturer’s numbers [56]). Release 6 of the human genomic DNA NA12878 reference data set reports 91% median accuracy [17]. 1D2 sequencing can achieve a median consensus accuracy of 98% [56]. An accurate consensus can also be derived from linear fragments if the same sequence is present multiple times: the concept of circularisation followed by rolling circle amplification for generating nanopore libraries is similar to SMRT sequencing, and subreads can be used to determine a high-quality consensus [57-59]. ONT is developing a similar linear consensus sequencing strategy based on isothermal polymerisation rather than circularisation [56]. Indels and substitutions are frequent in nanopore data, partly randomly but not uniformly distributed. Low-complexity stretches are difficult to resolve with the current (R9) pores and basecallers [56], as are homopolymer sequences. Measured current is a function of the particular k-mer residing in the pore, and because translocation of homopolymers does not change the sequence of nucleotides within the pore, it results in a constant signal that makes determining homopolymer length difficult. A new generation of pores (R10) was designed to increase the accuracy over homopolymers [56]. Certain k-mers may differ in how distinct a signal they produce, which can also be a source of systematic bias. Sequence quality is of course intimately linked to the basecaller used and the data that has been used to train it. Read accuracy can be improved by training the basecaller on data that is similar to the sample of interest [46]. 
ONT regularly release chemistry and software updates that improve read quality: 4 pore versions were introduced in the last 3 years (R9.4, R9.4.1, R9.5.1, R10.0), and in 2019 alone, there were 12 Guppy releases. PacBio similarly updates hardware, chemistry, and software: the last 3 years have seen the release of 1 instrument (Sequel II), 4 chemistries (Sequel v2 and v3; Sequel II v1 and v2), and 4 versions of the SMRT-LINK analysis suite. Although current long-read accuracy is generally sufficient to uniquely determine the genomic origin of the read, certain applications require high base-level accuracy, including de novo assembly, variant calling, or defining intron-exon boundaries [54]. Two groups of methods to error correct long-reads can be employed: methods that only use long reads (non-hybrid) and methods that leverage the accuracy of additional short-read data (hybrid) (Fig. 2). Zhang et al. recently reviewed and benchmarked 15 of these long-read error correction methods [60], while Fu et al. focused on 10 hybrid error correction tools [61]. Lima et al. benchmarked 11 error correction tools specifically for nanopore cDNA reads [62].
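
The Q20 and Q30 thresholds quoted above for CCS reads follow the standard Phred scale, in which a quality score Q corresponds to an error probability of 10^(-Q/10). As a minimal sketch (the function names are illustrative, not taken from any particular tool), the conversion in both directions is:

```python
import math


def phred_to_accuracy(q):
    """Accuracy (1 - error probability) implied by a Phred quality score."""
    return 1.0 - 10 ** (-q / 10.0)


def accuracy_to_phred(accuracy):
    """Phred quality score implied by a per-base accuracy in [0, 1)."""
    error = 1.0 - accuracy
    if error <= 0.0:
        raise ValueError("accuracy must be below 1.0")
    return -10.0 * math.log10(error)


for q in (20, 30):
    print(f"Q{q}: accuracy {phred_to_accuracy(q):.4f}")

for acc in (0.85, 0.87, 0.95):
    print(f"accuracy {acc:.2f}: Q{accuracy_to_phred(acc):.1f}")
```

On this scale, the 85–87% single-pass accuracy mentioned above corresponds to roughly Q8 to Q9, which illustrates how much the consensus step contributes.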

Fig. 2

Paradigms of error correction (a) and polishing (b). Errors in long reads and assembly are denoted by red crosses. Non-hybrid methods only require long reads, while hybrid methods additionally require accurate short reads (purple).

In non-hybrid methods, all reads are first aligned to each other and a consensus is used to correct individual reads (Fig. 2a). These corrected reads can then be taken forward for assembly or other applications. Alternatively, because genomes only contain a small subset of all possible k-mers, rare k-mers in a noisy long-read data set are likely to represent sequencing errors. Filtering out these rare k-mers, as the wtdbg2 assembler does [63], effectively prevents errors from being introduced in the assembly (Fig. 2a). Hybrid error correction methods can be further classified according to how the short reads are used. In alignment-based methods, the short reads are directly aligned to the long reads, to generate corrected long reads (Fig. 2a). In assembly-based methods, the short reads are first used to build a de Bruijn graph or assembly. Long reads are then corrected by aligning to the assembly or by traversing the de Bruijn graph (Fig. 2a). Assembly-based methods tend to outperform alignment-based methods in correction quality and speed, and FMLRC [64] was found to perform best in the two benchmark studies [60, 61]. After assembly, the process of removing remaining errors from contigs (rather than raw reads) is called ‘polishing’. One strategy is to use SMRT subreads through Arrow [65] or nanopore current traces through Nanopolish [66], to improve the accuracy of the consensus (Fig. 2b). For nanopore data, polishing while also taking into account the base modifications (as implemented for instance in Nanopolish [66]) further improves the accuracy of an assembly [46]. Alternatively, polishing can be done with the help of short reads using Pilon [67], Racon [68], or others, often in multiple rounds [50, 69, 70] (Fig. 2b).
The rationale for iterative hybrid polishing is that as errors are corrected, previously ambiguously mapped short reads can be mapped more accurately. While certain pipelines repeat polishing until convergence (or oscillatory behaviour, where the same positions are changed back and forth between each round), too many iterations can decrease the quality of the assembly, as measured by the BUSCO score [71]. To increase scalability, ntEdit foregoes alignment in favour of comparing the draft assembly’s k-mers to a thresholded Bloom filter built from the sequencing reads [72] (Fig. 2b). Despite continuous improvements in the accuracy of long reads, error correction remains indispensable in many applications. We identified 62 tools that are able to carry out error correction. There is no silver bullet, and correcting an assembly requires patience and careful work, often combining multiple tools (e.g. Racon, Pilon, and Nanopolish [50]). Adding to the difficulty of the absence of an authoritative error correction pipeline, certain tools do not scale well for deep sequencing or large genomes [50]. Furthermore, most tools are designed with haploid assemblies in mind. Allelic variation, repeats, or gene families may not be correctly handled.
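
The rare k-mer principle mentioned in this section can be sketched in a few lines. This is a toy illustration only and does not reproduce wtdbg2's implementation: it counts every k-mer across a set of reads and flags those seen fewer than `min_count` times as likely errors. The function names, the value of k, and the threshold are all assumptions chosen for the example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def rare_kmers(reads, k, min_count):
    """Return the k-mers seen fewer than min_count times (likely errors)."""
    counts = count_kmers(reads, k)
    return {kmer for kmer, n in counts.items() if n < min_count}


# Three identical reads plus one read with a single substitution (A -> T).
reads = ["ACGTACGTGA", "ACGTACGTGA", "ACGTACGTGA", "ACGTTCGTGA"]
print(sorted(rare_kmers(reads, k=4, min_count=2)))
```

A single substitution creates up to k novel k-mers, all of which overlap the erroneous base, so the rare k-mers cluster around the error position.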

Detecting structural variation

While short reads perform well for the identification of single nucleotide variants (SNVs) and small insertion and deletions (indels), they are not well suited to the detection of larger sequence changes [73]. Collectively referred to as structural variants (SVs), insertions, deletions, duplications, inversions, or translocations that affect ≥ 50 bp [74] are more amenable to long-read sequencing [75, 76] (Fig 1c). Because of these past technical limitations, structural variants have historically been under-studied despite being an important source of diversity between genomes and relevant for human health [77, 78]. The ability of long reads to span repeated elements or repetitive regions provides unique anchors that facilitate de novo assembly and SV calling [73]. Even relatively short (5 kb) SMRT reads can identify structural variants in the human genome that were previously missed by short-read technologies [79]. Obtaining deep coverage of mammalian-sized genomes with long reads remains costly; however, modest coverage may be sufficient: 8.6 × SMRT sequencing [14] and 15–17 × nanopore sequencing [80, 81] have been shown to be effective in detecting pathogenic variants in humans. Heterozygosity or mosaicism naturally increase the coverage requirements. Evaluating the performance of long-read SV callers is complicated by the fact that benchmark data sets may be missing SVs in their annotation [73, 77], especially when it comes only from short reads. Therefore, validation of new variants has to be performed via other methods. Developing robust benchmarks is an ongoing effort [82], as is devising solutions to visualise complex, phased variants for critical assessment [82, 83]. For further details on structural variant calling from long-read data, we refer the reader to two recent reviews: Mahmoud et al. [73] and Ho et al. [77].
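
One reason long reads suit SV detection is that a single alignment can span an entire insertion or deletion. As a minimal sketch, assuming alignments are described by SAM-style CIGAR strings and using the ≥ 50 bp size threshold given above, large indels within one read can be listed as follows. Dedicated SV callers use considerably more evidence (for example, split alignments and support across many reads); the function name and the example CIGAR are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
MIN_SV_LENGTH = 50  # size threshold for structural variants used in the text


def large_indels(cigar, ref_start, min_length=MIN_SV_LENGTH):
    """List (type, reference_position, length) for indels >= min_length.

    ref_start is the reference coordinate of the first aligned base.
    """
    events = []
    ref_pos = ref_start
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op in "ID" and length >= min_length:
            events.append(("INS" if op == "I" else "DEL", ref_pos, length))
        # Only these operations consume reference bases.
        if op in "MDN=X":
            ref_pos += length
    return events


print(large_indels("500M120D300M3I200M75I100M", ref_start=1000))
```

The 3-bp insertion in the example is ignored because it falls below the size threshold and would be treated as a small indel.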

Detecting base modifications

In addition to the canonical A, T, C, and G bases, DNA can contain modified bases that vary in nature and frequency across organisms and tissues. N-6-methyladenine (6mA), 4-methylcytosine (4mC), and 5-methylcytosine (5mC) are frequent in bacteria. 5mC is the most common base modification in eukaryotes, while its oxidised derivatives 5-hydroxymethylcytosine (5hmC), 5-formylcytosine (5fC), and 5-carboxycytosine (5caC) are detected in certain mammalian cell types but have yet to be deeply characterised [84-88]. Still, more base modifications that result from DNA damage occur at a low frequency [87]. The nucleotides that compose RNA are even more varied. Over 150 modified bases have been documented to date [89, 90]. These modifications also have functional roles, for example, in mRNA stability [91], transcriptional repression [92], and translational efficiency [93]. However, most RNA modifications remain ill-characterised due to technological limitations [94]. Aside from the modifications to standard bases, base analogues may also be introduced to nucleic acids, such as the thymidine analogue BrdU which is used to track genomic replication [95]. Mapping of nucleic acid modifications has traditionally relied on specific chemical treatment (e.g. bisulfite conversion that changes unmethylated cytosines to uracils [96]) or immunoprecipitation followed by sequencing [97]. The ability of the long-read platforms to sequence native nucleic acids provides the opportunity to determine the presence of many more modifications, at base resolution in single molecules, and without specialised chemistries that can be damaging to the DNA [98]. Long reads thus allow the phasing of base modifications along individual nucleic acids, as well as their phasing with genetic variants, opening up opportunities in exploring epigenetic heterogeneity [34, 99]. 
Long reads also enable the analysis of base modifications in repetitive regions of the genome (centromeres or transposons), where short reads cannot be mapped uniquely. In SMRT sequencing, base modifications in DNA or RNA [100, 101] are inferred from the delay between fluorescence pulses, referred to as interpulse duration (IPD) [98] (Fig. 3). Base modifications impact the speed at which the polymerase progresses, at the site of modification and/or downstream. Comparison with the signal from an in silico or non-modified reference (e.g. amplified DNA) suggests the presence of modified bases [102, 103]. It is notably possible to detect 6mA, 4mC, 5mC, and 5hmC DNA modifications, although with different sensitivities. Reliable calling of 6mA and 4mC requires 25 × coverage per strand, whereas 250 × is required for 5mC and 5hmC, which have subtler impacts on polymerase kinetics [102]. Such high coverage is not realistic for large genomes and does not allow single-molecule epigenetic analysis. Coverage requirements can be reduced by conjugating a glucose moiety to 5hmC, which gives a stronger IPD signal during SMRT sequencing [102, 103]. Polymerase dynamics and base modifications can be analysed directly via the SMRT Portal, or for more advanced analyses with R-kinetics, kineticsTools or basemods [104]. SMALR [99] is dedicated to the detection of base modifications in single SMRT reads.
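
The comparison of native and control kinetics described above can be illustrated with a simple ratio of mean IPDs per position. This is a sketch under strong assumptions: the IPD values are made up, the fixed ratio threshold is hypothetical, and real tools apply statistical models rather than a hard cut-off.

```python
from statistics import mean


def ipd_ratios(native_ipds, control_ipds):
    """Per-position ratio of mean native IPD to mean control IPD.

    Each argument is a list with one entry per position; each entry is the
    list of IPD values observed at that position across reads.
    """
    return [mean(native) / mean(control)
            for native, control in zip(native_ipds, control_ipds)]


def candidate_sites(ratios, threshold=2.0):
    """Positions whose IPD ratio meets or exceeds a (hypothetical) threshold."""
    return [i for i, ratio in enumerate(ratios) if ratio >= threshold]


# Hypothetical IPDs (seconds) at three positions; position 1 is slowed down.
native = [[0.5, 0.6, 0.4], [1.8, 2.2, 2.0], [0.5, 0.5, 0.5]]
control = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.4, 0.6]]

ratios = ipd_ratios(native, control)
print(candidate_sites(ratios))
```

The need to average over many observations per position is what drives the per-strand coverage requirements quoted above, especially for modifications with subtle kinetic effects.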

Fig. 3

Methods to detect base modifications in long-read sequencing. Base modifications can be inferred from their effect on the current intensity (nanopore) and inter-pulse duration (IPD, SMRT). Strategies to call base modifications in nanopore sequencing and the corresponding tools are further depicted.

In nanopore sequencing, modified RNA or DNA bases affect the flow of the current through the pore differently than non-modified bases, resulting in signal shifts (Fig. 3). These shifts can be identified post-basecalling and post-alignment with three distinct methods: (a) without prior knowledge about the modification (de novo) by comparing to an in silico reference [105], or a control, non-modified sample (typically amplified DNA) [105, 106]; (b) using a pre-trained model [66, 107, 108] (Fig. 3, Table 1); and (c) directly by a basecaller using an extended alphabet [45, 109].

Table 1

Tool	Base modifications	Strategy	Reference
Guppy	5mCpG, 5mC (Dcm), 6mA (Dam)	Basecall	[45]
Taiyaki	–	Basecall	[45]
RepNano	BrdU	Basecall	[109]
D-NAscent	BrdU	HMM	[95]
Nanopolish	5mCpG	HMM	[66]
Megalodon	6mA, 5mCpG	HMM	[45]
signalAlign	6mA, 5mC, 5hmC	HMM-HDP	[107]
DeepSignal	6mA (Dam), 5mCpG	Neural network (CNN + classifier)	[110]
DeepMod	6mA, 5mCpG	Neural network (LSTM-RNN)	[111]
mCaller	6mA, 5mCpG	Neural network classifier	[108]
Tombo	6mA (DNA), 5mC (RNA, DNA), de novo	Statistical test	[105]
NanoMod	de novo	Statistical test	[106]
EpiNano	m6A (RNA)	SVM	[112]

Tools and strategies to detect base modifications in nanopore data (HMM hidden Markov model, HDP hierarchical Dirichlet process, CNN convolutional neural network, LSTM long short-term memory, RNN recurrent neural network, SVM support vector machine).

De novo approaches, as implemented by Tombo [105] or NanoMod [106], allow the discovery of modifications and modified motifs by statistically testing the deviation of the observed signal relative to a reference. However, these methods suffer from a high false discovery rate and are not reliable at the single-molecule level. The comparison to a control sample rather than an in silico reference increases the accuracy of detection, but requires the sequencing of twice as many samples as well as high coverage to ensure that genomic segments are covered by both control and test sample reads. De novo calling of base modifications is limited to highlighting regions of the genomes that may contain modified bases, without being able to reveal the precise base or the nature of the modification. Pre-trained models interrogate specific sites and classify the data as supporting a modified or unmodified base. Nanopolish [66] detects 5mC with a hidden Markov model, which in signalAlign [107] is combined with a hierarchical Dirichlet process, to determine the most likely k-mer (modified or unmodified). D-NAscent [95] utilises an approach similar to Nanopolish to detect BrdU incorporation, while EpiNano [112] uses support vector machines (SVMs) to detect RNA m6A. Recent methods use neural network classifiers to detect 6mA and 5mC (mCaller [108], DeepSignal [110], DeepMod [111]). The accuracy of these methods is upwards of 80% but varies between modifications and motifs. Appropriate training data is crucial and currently a limiting factor. Models trained exclusively on samples with fully methylated or unmethylated CpGs will not perform optimally on biological samples with a mixture of CpG and mCpGs, or 5mC in other sequence contexts [66, 105].
Low specificity is particularly problematic for low abundance marks. m6A is present at 0.05% in mRNA [113, 114]; therefore, a method testing all adenosines in the transcriptome with sensitivity and specificity of 90% at the single-molecule, single-base level would result in an unacceptable false discovery rate of 98%. Direct basecalling of modified bases is a recent addition to ONT’s basecaller Guppy, currently limited to 5mC in the CpG context. A development basecaller, Taiyaki [45], can be trained for specific organisms or base modifications. RepNano can basecall BrdU in addition to the four canonical DNA bases [109]. Two major bottlenecks in the creation of modification-ready basecallers are the need for appropriate training data and the combinatorial complexity of adding bases to the basecalling alphabet. There is also a lack of tools for the downstream analysis of base modifications: most tools output a probability that a certain base is modified, while traditional differential methylation algorithms expect binary counts of methylated and unmethylated bases.
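
The effect of low abundance on the false discovery rate can be checked with a short calculation. The sketch below uses a deliberately simple model, assuming that every site is tested independently and that the modification is present at a fixed prevalence; the function name is illustrative.

```python
def false_discovery_rate(prevalence, sensitivity, specificity):
    """Fraction of positive calls that are false, for a given prevalence."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * (1.0 - specificity)
    return false_positives / (true_positives + false_positives)


# Inputs quoted in the text: 0.05% prevalence, 90% sensitivity and specificity.
fdr = false_discovery_rate(prevalence=0.0005, sensitivity=0.90, specificity=0.90)
print(f"FDR at 0.05% prevalence: {fdr:.1%}")

# The result is sensitive to the assumed prevalence.
for prevalence in (0.0005, 0.002, 0.01, 0.1):
    rate = false_discovery_rate(prevalence, 0.90, 0.90)
    print(f"prevalence {prevalence:.2%}: FDR {rate:.1%}")
```

Under this model, the inputs quoted above give a false discovery rate above 98%, and the exact value depends strongly on the assumed prevalence. Only when the mark becomes much more abundant, or specificity much higher, does the rate fall to a usable level.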

Analysing long-read transcriptomes

Alternative splicing is a major mechanism increasing the complexity of gene expression in eukaryotes [115, 116]. Practically all multi-exon genes in humans are alternatively spliced [117, 118], with variations between tissues and between individuals [119]. However, fragmented short reads cannot fully assemble nor accurately quantify the expressed isoforms, especially at complex loci [120, 121]. Long-read sequencing provides a solution by ideally sequencing full-length transcripts. Recent studies that used bulk, single-cell, or targeted long-read sequencing suggest that our best transcript annotations are still missing vast numbers of relevant isoforms [122-126]. As noted above, sequencing native RNA further provides the opportunity to better characterise RNA modifications or other characteristics such as poly-A tail length. Despite its many promises, analysis of long-read transcriptomes remains challenging. Few of the existing tools for short-read RNA-seq analysis are able to appropriately deal with the high error rate of long reads, necessitating the development of dedicated tools and extensive benchmarks. Although the field of long-read transcriptomics is recent, it is rapidly expanding: we tallied 36 tools related to long-read transcriptome analysis (Fig. 1b). Most long-read isoform detection tools work by clustering aligned and error-corrected reads into groups and collapsing these into isoforms, but the detailed implementations differ between tools (Fig. 4). PacBio’s Iso-Seq3 [127, 128] is the most mature pipeline for long-read transcriptome analysis, allowing the assembly of full-length transcripts. It performs pre-processing for SMRT reads, de novo discovery of isoforms by hierarchical clustering and iterative merging, and polishing. Cupcake [129] provides scripts for downstream analysis such as collapsing redundant isoforms and merging Iso-Seq runs from different batches, giving abundance information as well as performing junction analysis.
In the absence of a reference genome, Iso-Seq can assemble a transcriptome, but transcripts from related genes may be merged [130] as a trade-off for correcting reads with a high error rate. Furthermore, the library preparation for Iso-Seq usually requires size fractionation, which makes absolute and relative quantification difficult. The per-read cost remains high, making well-replicated differential expression study designs prohibitively expensive.

Fig. 4

Types of transcriptomic analyses and their steps. The choice of sequencing protocol amongst the six available workflows affects the type, characteristics, and quantity of data generated. Only direct RNA sequencing allows epitranscriptomic studies, but SMRT direct RNA sequencing is a custom technique that is not fully supported. The remaining non-exclusive applications are isoform detection, quantification, and differential analysis. The dashed lines in arrows represent upstream processes to transcriptomics.

Alternative isoform detection pipelines such as IsoCon [130], SQANTI [131], and TALON [132] attempt to mitigate the erroneous merging of similar transcripts of the Iso-Seq pipeline. IsoCon and SQANTI specifically work with SMRT data while TALON is a technology-independent approach. IsoCon uses the full-length transcripts from Iso-Seq to perform clustering and partial error correction and identify candidate transcripts without losing potential true variants within each cluster. SQANTI generates quality control reports for SMRT Iso-Seq data and detects and removes potential artefacts. TALON, on the other hand, relies heavily on the GENCODE annotation. Since both IsoCon and TALON focus on the human genome, they may not perform equally well with genomes from non-model organisms. A number of alternative isoform annotation pipelines for SMRT and/or nanopore data have recently emerged, such as FLAIR [133], Tama [134], IDP [122], TAPIS [135], Mandalorion Episode II [36, 57], and Pinfish [136]. Some of them use short reads to improve exon junction annotation. However, their accuracy has not yet been extensively tested. In addition to high error rates, potential coverage biases are currently not explicitly taken into account by long-read transcriptomic tools. In ONT’s direct RNA sequencing protocol, transcripts are sequenced from the 3′ to the 5′ end; therefore, any fragmentation during the library prep, or pore blocking, results in truncated reads.
In our experience, it is common to see a coverage bias towards the 3′ end of transcripts, which can affect isoform characterisation and quantification. Methods that sequence cDNA will also show these coverage biases due to fragmentation and pore-blocking (for nanopore data), compounded by the non-processivity of the reverse transcriptase [124], which is more likely to stall when it encounters RNA modifications [137]. Finally, the length-dependent or sequence-dependent biases introduced by protocols that rely on PCR are currently neither well characterised nor accounted for.

To quantify the abundance of transcripts or genes, several methods can be used (Fig. 4). Salmon’s [138] quasi-mapping mode quantifies reads directly against a reference index, and its alignment-based mode instead works with aligned sequences. The Wub package [139] also provides a script for read counting. The featureCounts [140] function from the Subread package [141, 142] supports long-read gene level counting. The FLAIR [133] pipeline provides wrappers for quantifying FLAIR isoform usage across samples using minimap2 or Salmon. Of course, for accurate transcript-level quantification, these methods rely on a complete and accurate isoform annotation; this is currently the difficult step.

Two types of differential analyses can be run: gene level or transcript level (Fig. 4). Transcript-level analyses may be further focused on differential transcript usage (DTU), where the gene may overall be expressed at the same level between two conditions, but the relative proportions of isoforms may vary. The popular tools for short-read differential gene expression analysis, such as limma [143], edgeR [144, 145], and DESeq2 [146], can also be used for long-read differential isoform or gene expression analyses. DRIMSeq [147] can perform differential isoform usage analysis using the Dirichlet-multinomial model.
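To make the notion of differential transcript usage concrete, the following minimal Python sketch computes isoform proportions for a single gene in two conditions. The isoform names and counts are hypothetical values chosen for illustration, and the code only computes the proportions themselves; it is not a statistical test and does not reproduce what DRIMSeq or any other tool does to model variability across replicates.

```python
from typing import Dict


def isoform_proportions(counts: Dict[str, float]) -> Dict[str, float]:
    """Return each isoform's share of the gene's total read count."""
    total = sum(counts.values())
    if total == 0:
        # No reads for this gene: report zero usage for every isoform.
        return {isoform: 0.0 for isoform in counts}
    return {isoform: count / total for isoform, count in counts.items()}


# Hypothetical read counts for one gene with two isoforms (illustrative only).
control = {"isoform_A": 80, "isoform_B": 20}
treated = {"isoform_A": 30, "isoform_B": 70}

# Gene-level totals are identical, so a gene-level analysis sees no change.
gene_total_control = sum(control.values())
gene_total_treated = sum(treated.values())

# The isoform proportions, however, differ between the two conditions.
usage_control = isoform_proportions(control)
usage_treated = isoform_proportions(treated)
usage_shift = {
    isoform: usage_treated[isoform] - usage_control[isoform]
    for isoform in control
}

for isoform, shift in usage_shift.items():
    print(f"{isoform}: change in usage = {shift:+.2f}")
```

In this toy case the gene total is unchanged while the usage of each isoform shifts by half of the gene's output, which is the situation that DTU analyses aim to detect.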
One difference between short- and long-read counts is that for the latter, counts per million (cpm) are effectively transcripts per million (tpm), whereas for short reads (and random fragmentation protocols), transcript length influences the number of reads, and therefore, cpms need scaling by transcript length to obtain tpms. The biological interpretation of differential isoform expression strongly depends on the classification of the isoforms, for example, whether the isoforms code for the same or different proteins or whether premature stop codons make them subject to nonsense-mediated decay. This is currently not well integrated into the analyses.
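The relationship between cpm and tpm described above can be illustrated with a small worked example in Python. This is a sketch under idealised assumptions that are ours rather than the source's: each long read is taken to represent one full-length molecule, short-read fragment counts are taken to be exactly proportional to transcript length, and the transcript names, lengths, and counts are hypothetical.

```python
from typing import Dict


def cpm(counts: Dict[str, float]) -> Dict[str, float]:
    """Counts per million: scale counts so that they sum to one million."""
    total = sum(counts.values())
    return {tx: count / total * 1e6 for tx, count in counts.items()}


def tpm_from_fragment_counts(
    counts: Dict[str, float], lengths_kb: Dict[str, float]
) -> Dict[str, float]:
    """Transcripts per million from fragment counts.

    Counts are first divided by transcript length (in kb), then scaled
    so that the length-normalised rates sum to one million.
    """
    rates = {tx: counts[tx] / lengths_kb[tx] for tx in counts}
    total = sum(rates.values())
    return {tx: rate / total * 1e6 for tx, rate in rates.items()}


# Hypothetical sample: two transcripts present in equal molecule numbers.
lengths_kb = {"short_tx": 1.0, "long_tx": 4.0}

# Idealised long reads: one read per molecule, regardless of length.
long_read_counts = {"short_tx": 100, "long_tx": 100}

# Idealised short reads: fragments scale with transcript length.
short_read_counts = {"short_tx": 100, "long_tx": 400}

long_cpm = cpm(long_read_counts)
short_cpm = cpm(short_read_counts)
short_tpm = tpm_from_fragment_counts(short_read_counts, lengths_kb)

print("long-read cpm: ", long_cpm)
print("short-read cpm:", short_cpm)
print("short-read tpm:", short_tpm)
```

Under these assumptions the long-read cpm values already reflect the equal molecule numbers, whereas the short-read cpm values favour the longer transcript until they are scaled by length. Real data deviate from this idealisation, for example through the coverage biases and truncated reads discussed earlier.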

Combining long reads, synthetic long reads, and short reads

Assemblies based solely on long reads generally produce highly complete and contiguous genomes [148-150]; however, there are many situations where short reads or reads generated from synthetic long-read technology further improve the results [151-153]. Different technologies can intervene at different scales: short reads ensure base-level accuracy, high-quality 5–15-kb SMRT reads generate good contigs, while ultra-long (100 kb+) nanopore reads, optical mapping or Hi-C improve scaffolding of the contigs into chromosomes [11, 17, 154–157]. Combining all of these technologies in a single genomic project would be costly. Instead, combinations of subsets are frequent, in particular, nanopore/SMRT with short-read sequencing [50, 152, 153, 158], although other combinations can be useful. Nanopore assembly of wild strains of Drosophila melanogaster supported by scaffolds generated from Hi-C corrected two misalignments of contigs in the reference assembly [154]. Optical maps helped resolve misassembly of SMRT-based chromosome level contigs of three plant relatives of Arabidopsis thaliana, where unrelated parts of the genome were erroneously linked [155]. For structural variation or base modification detection, obtaining orthogonal support from SMRT and nanopore data is valuable to confirm discoveries and limit false positives [77, 108, 159]. Although both technologies experience difficulty around homopolymers, the error profiles of SMRT and nanopore sequencing are not identical, so combining them can draw on their respective strengths. Certain tools such as Unicycler [160] integrate long- and short-read data to produce hybrid assemblies, while other tools have been presented as pipelines to achieve this purpose (e.g. Canu, Pilon, and Racon in the ont-assembly-polish pipeline [45]). Still, combining tools and data types remains a challenge, usually requiring intensive manual integration.

long-read-tools.org: a catalogue of long-read sequencing data analysis tools

The growing interest in the potential of long reads in various areas of biology is reflected by the exponential development of tools over the last decade (Fig. 1a). There are open-source static catalogues (e.g. github.com/B-UMMI/long-read-catalog), custom pipelines developed by individual labs for specific purposes (e.g. search results from GitHub), and others that attempt to generalise them for a wider research community [46]. Being able to easily identify what tools exist (or do not exist) is crucial to plan and perform best-practice analyses, build comprehensive benchmarks, and guide the development of new software. For this purpose, we introduce long-read-tools.org, a timely database that comprehensively collates tools used for long-read data analysis. Users can interactively search tools categorised by technology and intended type of analysis. In addition to true long-read sequencing technologies (SMRT and nanopore), we include synthetic long-read strategies (10X linked reads, Hi-C, and Bionano optical mapping). The fast-paced evolution of long-read sequencing technologies and tools also means that certain tools become obsolete. We include them in our database for completeness but indicate when they have been superseded or are no longer maintained. long-read-tools.org is an open-source project under the MIT License, whose code is available through GitHub [161]. We encourage researchers to contribute new database entries of relevant tools and improvements to the database, either directly via the GitHub repository or through the submission form on the database webpage.

Discussion

At the time of writing, for about USD 1500, one can obtain around 30 Gbases of ≥ 99% accurate SMRT CCS (1 Sequel II 8M SMRT cell) or 50–150 Gbases of noisier but potentially longer nanopore reads (1 PromethION flow cell). While long-read sequencing was initially perhaps most useful for assembly of small (bacterial) genomes, the recent increases in throughput and accuracy enable a broader range of applications. The actual biological polymers that carry genetic information can now be sequenced in their full length or at least in fragments of tens to hundreds of kilobases, giving us a more complete picture of genomes (e.g. telomere-to-telomere assemblies, structural variants, phased variations, epigenetics, metagenomics) and transcriptomes (e.g. isoform diversity and quantity, epitranscriptomics, polyadenylation). These advances are underpinned by an expanding collection of tools that explicitly take into account the characteristics of long reads, in particular, their error rate, to efficiently and accurately perform tasks such as preprocessing, error correction, alignment, assembly, base modification detection, quantification, and species identification. We have collated these tools in the long-read-tools.org database.

The proliferation of long-read analysis tools revealed by our census makes a compelling case for complementary efforts in benchmarking. Essential to this process is the generation of publicly available benchmark data sets where the ground truth is known and whose characteristics are as close as possible to those of real biological data sets. Simulations, artificial nucleic acids such as synthetic transcripts or in vitro-methylated DNA, resequencing, and validation endeavours will all contribute to establishing a ground truth against which an array of tools can be benchmarked. In spite of the rapid iteration of technologies, chemistries, and data formats, these benchmarks will encourage the emergence of best practices.

A recurrent challenge in long-read data analysis is scalability. For instance, in genome assembly, Canu [69] produces excellent assemblies for small genomes but takes too long to run for large genomes. Fast processing is crucial to enable parameter optimisation in applications that are not yet routine. The recently released wtdbg2 [63], TULIP [70], Shasta [162], Peregrine [163], Flye [164], and Ra [165] assemblers are orders of magnitude faster and are quickly being adopted. Similarly, for mapping long reads, minimap2’s speed, in addition to its accuracy, has contributed to its fast and wide adoption. Nanopolish [66] is popular both for assembly correction and base modification detection; however, it is slow on large data sets. The refactoring of its call-methylation function in the f5c tool greatly facilitates work with large genomes or data sets [166].

Beyond data processing speed, scalability is also impacted by data generation, storage, and integration. Nanopore sequencing presents the fastest turnaround time. Once DNA is extracted, sequencing is underway in a matter of minutes to hours, and the PromethION sequencer provides adjustable high throughput with individually addressable parallel flow cells. All other library preparation procedures are more labour intensive; sequencing may have to await pooling to fill a run, and flow cells need to be run in succession rather than in parallel. The raw nanopore data are, however, extremely voluminous (about 20 bytes per base), leading to substantial IT costs for large projects. SMRT movies are not saved for later re-basecalling, and the sequence and kinetic information takes up a smaller 3.5 bytes per base. Furthermore, hybrid methods incorporating strengths from other technologies such as optical mapping (Bionano, OpGen) and Hi-C add to the cost and analytical complexity of genomic projects. For these, manual data integration is a significant bottleneck, but the rewards are worth the effort.

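The storage figures quoted above translate into a simple back-of-the-envelope calculation, sketched here in Python. The bytes-per-base values are the approximate ones given in the text; the 100 Gbase project size is a hypothetical figure chosen for illustration, and the estimate uses decimal units (1 GB = 10^9 bytes) and ignores compression, indexing, and downstream files.

```python
# Approximate per-base storage footprints quoted in the text.
NANOPORE_RAW_BYTES_PER_BASE = 20.0
SMRT_BYTES_PER_BASE = 3.5


def storage_gigabytes(gigabases: float, bytes_per_base: float) -> float:
    """Estimate storage in gigabytes for a given sequencing yield.

    One Gbase is 1e9 bases and one GB is taken as 1e9 bytes, so the
    two factors cancel and GB = Gbases * bytes per base.
    """
    return gigabases * bytes_per_base


# Hypothetical project yield (an assumption, not a figure from the text).
project_gigabases = 100.0

nanopore_gb = storage_gigabytes(project_gigabases, NANOPORE_RAW_BYTES_PER_BASE)
smrt_gb = storage_gigabytes(project_gigabases, SMRT_BYTES_PER_BASE)

print(f"Nanopore raw data: ~{nanopore_gb:.0f} GB")
print(f"SMRT sequence and kinetics: ~{smrt_gb:.0f} GB")
```

For the same hypothetical yield, the raw nanopore data would occupy several times more space than the SMRT sequence and kinetic information, which is the source of the IT costs mentioned above.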
Despite increasing accuracy of both SMRT and nanopore sequencing platforms, error correction remains an important step in long-read analysis pipelines. Published assemblies that omit careful error correction are likely to predict many spurious truncated proteins [167]. Hybrid error correction, leveraging the accuracy of short reads, still outperforms long-read-only correction [60]. Modern short-read sequencing protocols require small input amounts (some even scale down to single cells), so sample amount is usually not a barrier to combining short- and long-read sequencing. Removing the need for short reads and for high coverage, via improvements in non-hybrid error correction tools and/or long-read sequencing accuracy, would reduce the cost, length, and complexity of genomic projects.

The much anticipated advances in epigenetics/epitranscriptomics promised by long-read sequencing are still in development. Many modifications, including 5mC, do not influence the SMRT polymerase’s dynamics sufficiently to be detected at a useful sensitivity (5mC requires 250× coverage). In this case, software improvements are unlikely to yield significant gains, and improvements in sequencing chemistries are probably required [168]. Nanopore sequencing appears more amenable to the detection of a wide array of base modifications (to date: 5mCG, BrdU, 6mA), but the lack of ground truth data to train models and the combinatorial complexity of introducing multiple alternative bases are hindering progress towards a goal of seamless basecalling from an extended alphabet of canonical and non-canonical bases. Downstream analyses, in particular, differential methylation, exploiting the phasing of base modifications, as well as visualisation, suffer from a dearth of tools.

The field of long-read transcriptomics is equally in its infancy. To date, the Iso-Seq pipeline has been used to build catalogues of transcripts in a range of species [128, 169, 170].
Nanopore read-based transcriptomes are more recent [10, 171–173], and work is still needed to understand the characteristics of these data (e.g. coverage bias, sequence biases, reproducibility). Certain isoform assembly pipelines predict a large number of unannotated isoforms requiring validation and classification. Even accounting for artefacts and transcriptional noise, these early studies reveal an unexpectedly large diversity in isoforms. Benchmark data and studies will be required in addition to atlas-type sequencing efforts to generate high-quality transcript annotations that are more comprehensive than the current ones. Long reads theoretically confer huge advantages over short reads for transcript-level differential expression; however, the low level of replication and modest read counts obtained from long-read transcriptomic experiments are currently limiting. Until throughput increases and price decreases sufficiently, hybrid approaches that use long reads to define the isoforms expressed in the samples and short reads to get enough counts for well-powered differential expression may be successful, but these do not yet exist.

Long-read sequencing technologies have already opened exciting avenues in genomics. Taking on the challenge of obtaining phased, accurate, and complete (including base modifications) genomes and transcriptomes that can be compared will require continued efforts in developing and benchmarking tools.

Additional file 1: Review history.
