Krešimir Križanovic, Amina Echchiki, Julien Roux, Mile Šikic.
Abstract
Motivation: High-throughput sequencing has transformed the study of gene expression levels through RNA-seq, a technique that is now routinely used by various fields, such as genetic research or diagnostics. The advent of third-generation sequencing technologies providing significantly longer reads opens up new possibilities. However, the high error rates common to these technologies set new bioinformatics challenges for the gapped alignment of reads to their genomic origin. In this study, we have explored how currently available RNA-seq splice-aware alignment tools cope with increased read lengths and error rates. All tested tools were initially developed for short NGS reads, but some have claimed support for long Pacific Biosciences (PacBio) or even Oxford Nanopore Technologies (ONT) MinION reads.
Year: 2018 PMID: 29069314 PMCID: PMC6192213 DOI: 10.1093/bioinformatics/btx668
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Test dataset statistics
| Data set | Type | Organism | Technology | Size | No. genes | No. reads | % AS genes |
|---|---|---|---|---|---|---|---|
| A | Real | | Illumina | 1 GB | NA | 4,000,000 | NA |
| B | Synthetic | | Long read low error | 1.4 GB | 7,000 | 410,000 | 10 |
| 1 | Synthetic | | PacBio ROI | 400 MB | 6,000 | 185,000 | 0 |
| 2 | Synthetic | | PacBio ROI | 1.4 GB | 7,000 | 412,000 | 10 |
| 3 | Synthetic | | PacBio ROI | 200 MB | 1,520 | 84,000 | 60 |
| 4 | Synthetic | | ONT R9 2D | 1.4 GB | 7,000 | 342,000 | 10 |
| 5 | Real | | PacBio ROI | 1 GB | NA | 192,000 | NA |
| 6 | Real | | PacBio ROI error-corrected | 500 MB | NA | 192,000 | NA |
| 7 | Real | | PacBio Subreads | 1 GB | NA | 243,000 | NA |
| 8 | Real | | ONT R9 2D | 120 MB | NA | 40,000 | NA |
Percentage of reads aligned over all aligners and datasets
| Data set | No. reads | Tophat2 (%) | Hisat2 (%) | STAR (%) | BBMap (%) | GMAP (%) |
|---|---|---|---|---|---|---|
| A | 4M | 85.2 | 94.8 | 96.8 | 96.7 | |
| B | 410K | 0 | 0 | 84.9 | 97.3 | |
| 1 | 185K | 0.7 | 6.77 | 48.9 | 89.2 | |
| 2 | 412K | 0 | 0 | 33.3 | 84.5 | |
| 3 | 84K | 0 | 0 | 32.3 | 64.3 | |
| 4 | 342K | 0 | 0 | 5.5 | 43.0 | |
| 5 | 192K | 0 | 0 | 46.1 | 74.5 | |
| 6 | 192K | 0 | 0.4 | 67.2 | 82.8 | |
| 7 | 243K | 0 | 0 | 0.1 | 72.8 | |
| 8 | 40K | 0 | 0 | 16.7 | 88.0 | |
Note: Bold values present the best scoring result for a particular measured value.
Fig. 1. Evaluation of synthetic datasets
Aligner evaluation on synthetic datasets
| Dataset | Metric | STAR (%) | BBMap (%) | GMAP (%) |
|---|---|---|---|---|
| 1 | Aligned | 48.9 | 89.2 | |
| | Match rate | 92.5 | 92.3 | |
| | Correct | 22.1 | 41.8 | |
| | Hit all | 46.5 | 84.3 | |
| | Hit one | 47.1 | 85.4 | |
| | Split reads | 1.89 | 3.3 | |
| | Correct, split | 0.55 | 0.95 | |
| | Split hit all | 1.2 | 2.05 | |
| | Split hit one | 1.8 | 3.1 | |
| 2 | Aligned | 33.3 | 84.5 | |
| | Match rate | 89.9 | 92.0 | |
| | Correct | 10.4 | 24.9 | |
| | Hit all | 27.7 | 54.4 | |
| | Hit one | 30.7 | 78.4 | |
| | Split reads | 23.9 | 64.8 | |
| | Correct, split | 6.3 | 14.2 | |
| | Split hit all | 19.3 | 36.7 | |
| | Split hit one | 22.3 | 60.7 | |
| 3 | Aligned | 32.3 | 64.3 | |
| | Match rate | 86.2 | 91.8 | |
| | Correct | 11.4 | 15.3 | |
| | Hit all | 27.5 | 26.8 | |
| | Hit one | 30.5 | 61.2 | |
| | Split reads | 23.1 | 46.0 | |
| | Correct, split | 7.5 | 4.3 | |
| | Split hit all | 19.4 | 10.2 | |
| | Split hit one | 22.4 | 44.5 | |
| 4 | Aligned | 5.5 | 43.0 | |
| | Match rate | 89.6 | 88.4 | |
| | Correct | 1.2 | 7.9 | |
| | Hit all | 5.0 | 26.8 | |
| | Hit one | 5.3 | 42.1 | |
| | Split reads | 3.2 | 34.2 | |
| | Correct, split | 0.5 | 4.1 | |
| | Split hit all | 2.9 | 18.7 | |
| | Split hit one | 3.2 | 33.8 | |
Note: Bold values present the best scoring result for a particular measured value.
Aligner evaluation on real datasets
| Dataset | Metric | STAR | BBMap | GMAP |
|---|---|---|---|---|
| 5 | Aligned (%) | 46.1 | 74.5 | |
| | Match rate (%) | 71 | 88 | |
| | No. expressed genes | 8884 | 9536 | |
| | Exon hit (%) | 45.7 | 73.4 | |
| | Contiguous alignment (%) | 33.1 | 48.4 | |
| 6 | Aligned (%) | 67.2 | 82.8 | |
| | Match rate (%) | 72 | 92 | |
| | No. expressed genes | 8515 | 9724 | |
| | Exon hit (%) | 65.1 | 81.8 | |
| | Contiguous alignment (%) | 35.0 | 55.6 | |
| 7 | Aligned (%) | 0.1 | 72.8 | |
| | Match rate (%) | 81 | 68 | |
| | No. expressed genes | 183 | 9013 | |
| | Exon hit (%) | 0.1 | 72.4 | |
| | Contiguous alignment (%) | 0.0 | 35.7 | |
| 8 | Aligned (%) | 16.8 | 88.0 | |
| | Match rate (%) | 67 | 81 | |
| | No. expressed genes | 2344 | 6578 | |
| | Exon hit (%) | 11.0 | 62.3 | |
| | Contiguous alignment (%) | 4.8 | 26.8 | |
Note: Bold values present the best scoring result for a particular measured value.
Fig. 2. Aligned read percentage violin plots for GMAP and STAR