Ashley Byrne, Charles Cole, Roger Volden, Christopher Vollmers.
Abstract
Long-read sequencing holds great potential for transcriptome analysis because it offers researchers an affordable method to annotate the transcriptomes of non-model organisms. This, in turn, will greatly benefit future work on less-researched organisms like unicellular eukaryotes that cannot rely on large consortia to generate these transcriptome annotations. However, to realize this potential, several remaining molecular and computational challenges will have to be overcome. In this review, we have outlined the limitations of short-read sequencing technology and how long-read sequencing technology overcomes these limitations. We have also highlighted the unique challenges still present for long-read sequencing technology and provided some suggestions on how to overcome these challenges going forward. This article is part of a discussion meeting issue 'Single cell ecology'.
Keywords: Oxford Nanopore Technologies; Pacific Biosciences; long-read sequencing; transcriptome analysis
Year: 2019 PMID: 31587638 PMCID: PMC6792442 DOI: 10.1098/rstb.2019.0097
Source DB: PubMed Journal: Philos Trans R Soc Lond B Biol Sci ISSN: 0962-8436 Impact factor: 6.237
Sequencing technology characteristics. (Read number per dollar is hard to establish considering different pricing structures and instrument costs. Here, we assume a laboratory would use sequencing cores for Illumina and PacBio sequencing while performing ONT MinION sequencing themselves.)
| technology | read number per $1k | read accuracy (%) | consensus accuracy |
|---|---|---|---|
| Illumina NextSeq | ∼2 × 10⁸ | 99.9 | N/A |
| Pacific Biosciences (PacBio) Sequel | ∼4 × 10⁵ | 89 | >99% |
| Oxford Nanopore Technologies (ONT) MinION | ∼5 × 10⁶ | 88 | >97.5%ᵃ |

ᵃ Consensus accuracy using our R2C2 approach as published [9,10].
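
To make the table's trade-off concrete, the following minimal sketch converts the listed figures into a cost per million reads and an expected number of erroneous bases per raw read. The throughput and accuracy values are copied from the table; the 1,500 nt read length and all function and variable names are illustrative assumptions, and errors are treated as uniformly distributed along the read, which is a simplification.

```python
# Values copied from the table above (approximate, per $1k of sequencing).
PLATFORMS = {
    "Illumina NextSeq": {"reads_per_1k_usd": 2e8, "read_accuracy": 0.999},
    "PacBio Sequel": {"reads_per_1k_usd": 4e5, "read_accuracy": 0.89},
    "ONT MinION": {"reads_per_1k_usd": 5e6, "read_accuracy": 0.88},
}


def cost_per_million_reads(reads_per_1k_usd):
    """Approximate cost in USD of one million reads."""
    return 1000.0 / reads_per_1k_usd * 1e6


def expected_errors(read_accuracy, length_nt):
    """Expected number of erroneous bases in a raw read of the given length,
    assuming errors are spread uniformly along the read (a simplification)."""
    return (1.0 - read_accuracy) * length_nt


if __name__ == "__main__":
    assumed_length_nt = 1500  # assumption: a hypothetical full-length cDNA read
    for name, p in PLATFORMS.items():
        cost = cost_per_million_reads(p["reads_per_1k_usd"])
        errors = expected_errors(p["read_accuracy"], assumed_length_nt)
        print(f"{name}: ~${cost:,.0f} per million reads, "
              f"~{errors:.0f} expected errors per {assumed_length_nt} nt")
```

The per-read error estimate is only meaningful for the long-read platforms, because Illumina reads are far shorter than the assumed length.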
Figure 1. Fundamental difference between short- and long-read sequencing of transcripts. Short RNA-seq reads only capture small fragments of transcripts. RNA-seq data, therefore, lacks unambiguous isoform data, leading to the inference of many erroneous isoforms. Long-read full-length cDNA data captures transcripts end-to-end, making isoform inference unambiguous.
Figure 2. Long-read transcriptome sequencing approaches do not cover long transcripts. Swarmplots of length distributions of 1000 randomly sampled PacBio [9], ONT dRNA and cDNA [28] reads covering the GM12878 (human lymphoblast cell line) transcriptome. These distributions are not representative of the length distribution of the human transcriptome as annotated by GENCODE. *Although the PacBio dataset shown is the most recent one on GM12878 we could find, it is several years old and might not be fully representative of current platform performance.
Figure 3. Error-prone reads pose analysis challenge. Representative alignments of ONT cDNA [28] reads. Thirty read alignments (grey) to the first two exons of the CD19 gene (dark blue) are shown. Read alignments contain many insertions (orange), mismatches (red) and deletions (thin line) within exons. These errors complicate the detection of exact transcript sequences and exact positions of splice sites, TSSs and polyA sites.
Figure 4. Analysis challenges of long-read full-length sequencing. A simplified schematic shows the steps required to extract information from long-read sequencing data. Each read has to be aligned, ideally in an allele-aware manner, to the genome it originated from. Read alignments then have to be analysed to identify RNA modifications as well as new isoform features that are missing in the current transcriptome annotation. For each allele, reads then have to be grouped into isoforms, which allows isoform identification and quantification. For real datasets, all these steps have to take into account the often substantial rates of sequencing errors and incomplete reads in long-read sequencing, which complicate every step of the analysis.
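
The grouping and quantification steps described in the Figure 4 caption can be illustrated with a toy sketch. It is not the method used in the review: it assumes each aligned read has already been reduced to its chain of intron coordinates, uses a hypothetical `tolerance` parameter to absorb small splice-site shifts caused by sequencing errors, groups reads greedily, and ignores alleles, TSSs, polyA sites and incomplete reads. All names are illustrative.

```python
def junctions_match(chain_a, chain_b, tolerance):
    """True if two intron chains have the same number of introns and every
    intron start/end differs by at most `tolerance` bases."""
    if len(chain_a) != len(chain_b):
        return False
    return all(
        abs(a_start - b_start) <= tolerance and abs(a_end - b_end) <= tolerance
        for (a_start, a_end), (b_start, b_end) in zip(chain_a, chain_b)
    )


def group_reads_into_isoforms(reads, tolerance=5):
    """Greedily group reads whose intron chains agree within `tolerance`.

    `reads` maps a read ID to a list of (intron_start, intron_end) tuples.
    Returns a list of (representative_chain, [read_ids]) pairs, where the
    representative is the chain of the first read placed in the group.
    """
    groups = []
    for read_id, chain in reads.items():
        for representative, members in groups:
            if junctions_match(representative, chain, tolerance):
                members.append(read_id)
                break
        else:
            # No existing group matched, so this read starts a new isoform.
            groups.append((chain, [read_id]))
    return groups


def quantify(groups):
    """Return (chain, read_count, fraction_of_reads) for each isoform group."""
    total = sum(len(members) for _, members in groups)
    return [(chain, len(members), len(members) / total)
            for chain, members in groups]


if __name__ == "__main__":
    # Made-up coordinates for illustration only.
    toy_reads = {
        "r1": [(100, 200), (300, 400)],
        "r2": [(102, 199), (300, 403)],  # same isoform as r1, noisy junctions
        "r3": [(100, 200)],              # different number of introns
        "r4": [(100, 200), (350, 400)],  # alternative splice site
    }
    for chain, count, fraction in quantify(group_reads_into_isoforms(toy_reads)):
        print(chain, count, f"{fraction:.2f}")
```

A fixed tolerance trades sensitivity for robustness: set too large, it merges genuinely distinct isoforms with nearby alternative splice sites; set too small, it splits one isoform into several because of alignment noise.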