Abstract
MOTIVATION: In recent years, massively parallel cDNA sequencing (RNA-Seq) technologies have become a powerful tool for measuring expression at high resolution and for detecting low-abundance transcripts with high sensitivity. However, analyzing RNA-Seq data requires a large computational effort. The most fundamental and critical step is aligning each sequence fragment against the reference genome. Various de novo spliced RNA aligners have been developed in recent years. Although these aligners can handle spliced alignment and detect splice junctions, some challenges remain. With advances in sequencing technologies and the ongoing collection of sequencing data in the ENCODE project, more efficient alignment algorithms are in high demand. Most read mappers follow the conventional seed-and-extend strategy to deal with inexact matches in sequence alignment. However, the extension step is much more time-consuming than the seeding step.
Year: 2018 PMID: 28968831 PMCID: PMC5860201 DOI: 10.1093/bioinformatics/btx558
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. The mapping idea of DART. The mapping can be divided into simple and normal pairs to deal with exact matches and mismatches separately.
Fig. 2. (A) Gaps between simple pairs in an exonic read. (B) Gaps between simple pairs in a spanned read.
Fig. 3. Identification of the splice junction between two simple pairs A and B. The simple pair A should be shrunk by two nucleotides to match the splice site ‘GT/AG’ in the splice junction.
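The boundary adjustment described in the Fig. 3 caption can be illustrated with a minimal Python sketch. It is not DART's implementation: the function name `shift_to_canonical`, the `max_shift` limit, the toy sequences, and the restriction to shrinking pair A only (shifting the junction leftwards) are all assumptions made for illustration. Here `donor` is the genome index just past the end of simple pair A, and `acceptor` is the genome index where simple pair B begins.

```python
def shift_to_canonical(genome, donor, acceptor, max_shift=5):
    """Find the smallest left shift that yields a GT...AG intron.

    Hypothetical helper: returns (shift, donor, acceptor) or None.
    The intron is genome[donor:acceptor] after the shift.
    """
    for s in range(max_shift + 1):
        d, a = donor - s, acceptor - s
        if d < 0:
            break
        # The s bases taken from the end of A must also match the genome
        # immediately before B; otherwise the read no longer aligns exactly.
        if genome[d:donor] != genome[a:acceptor]:
            break
        # Canonical splice site: intron starts with GT and ends with AG.
        if genome[d:d + 2] == "GT" and genome[a - 2:a] == "AG":
            return s, d, a
    return None


# Toy example (assumed sequences): exon 2 starts with "GT", so an exact
# match for pair A runs two bases into the intron.
exon1, intron, exon2 = "ACGTTCCA", "GTAAGCTTTCAG", "GTCCATGG"
genome = exon1 + intron + exon2
observed_donor = len(exon1) + 2               # A over-extended by 2 nt
observed_acceptor = len(exon1) + len(intron) + 2
result = shift_to_canonical(genome, observed_donor, observed_acceptor)
print(result)
```

In this toy case the exact-match extension of A runs two bases into the intron, so the junction has to move left by two nucleotides before the intron boundaries read GT and AG.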
Performance comparison of DART and other selected aligners on the simulated datasets
| Synthetic datasets | Aligner | Sensitivity | Accuracy | Recall | SeqIdy | Reported SJ | True SJ | SJ accuracy | Runtime |
|---|---|---|---|---|---|---|---|---|---|
| SimRead_76 | DART | 0.991 | 0.989 | 0.957 | 0.999 | 99761 | 96700 | 0.969 | 71 |
| | STAR | 0.978 | 0.981 | 0.958 | 0.996 | 108202 | 101163 | 0.935 | 129 |
| | TopHat2 | 0.852 | 0.961 | 0.853 | 0.998 | 102230 | 93850 | 0.918 | 6172 |
| | Subread | 0.965 | 0.988 | 0.929 | 0.998 | 99033 | 95469 | 0.964 | 2610 |
| | MapSplice2 | 0.962 | 0.976 | 0.940 | 0.997 | 101230 | 97895 | 0.967 | 3602 |
| | HISAT2 | 0.911 | 0.977 | 0.889 | 0.999 | 100589 | 96922 | 0.964 | 353 |
| SimRead_101 | DART | 0.992 | 0.988 | 0.965 | 0.997 | 105584 | 102162 | 0.968 | 95 |
| | STAR | 0.977 | 0.982 | 0.958 | 0.996 | 112674 | 105459 | 0.936 | 154 |
| | TopHat2 | 0.809 | 0.967 | 0.809 | 0.999 | 109153 | 99501 | 0.912 | 10357 |
| | Subread | 0.955 | 0.987 | 0.925 | 0.998 | 105269 | 101136 | 0.961 | 2346 |
| | MapSplice2 | 0.979 | 0.980 | 0.960 | 0.998 | 110219 | 104434 | 0.948 | 4736 |
| | HISAT2 | 0.898 | 0.979 | 0.879 | 0.998 | 104309 | 100633 | 0.965 | 384 |
| SimRead_151 | DART | 0.996 | 0.989 | 0.971 | 0.994 | 112614 | 108771 | 0.966 | 146 |
| | STAR | 0.969 | 0.984 | 0.953 | 0.995 | 117793 | 110832 | 0.941 | 208 |
| | TopHat2 | 0.720 | 0.974 | 0.718 | 0.999 | 114134 | 104970 | 0.920 | 20055 |
| | Subread | 0.928 | 0.987 | 0.901 | 0.998 | 111156 | 106542 | 0.958 | 2394 |
| | MapSplice2 | 0.994 | 0.979 | 0.973 | 0.997 | 110676 | 106594 | 0.963 | 6032 |
| | HISAT2 | 0.871 | 0.981 | 0.854 | 0.997 | 107932 | 104315 | 0.966 | 464 |
| SimRead_251 | DART | 0.997 | 0.988 | 0.971 | 0.989 | 115680 | 111487 | 0.964 | 264 |
| | STAR | 0.939 | 0.982 | 0.921 | 0.995 | 118922 | 112132 | 0.943 | 359 |
| | TopHat2 | 0.606 | 0.973 | 0.601 | 0.999 | 117547 | 107358 | 0.913 | 26523 |
| | Subread | 0.893 | 0.983 | 0.863 | 0.997 | 114634 | 109503 | 0.955 | 4170 |
| | MapSplice2 | 0.998 | 0.967 | 0.964 | 0.997 | 111967 | 107395 | 0.959 | 7920 |
| | HISAT2 | 0.829 | 0.978 | 0.811 | 0.995 | 108670 | 105086 | 0.967 | 635 |
Note: Each dataset contains around 40 million reads with different lengths. The simulation was based on known transcripts from the entire human genome.
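The SJ accuracy column appears to equal True SJ divided by Reported SJ. This relationship is inferred from the tabulated values rather than stated in this record, so the short Python check below should be read as an assumption; the function name `sj_accuracy` is hypothetical. It reproduces three SimRead_76 rows.

```python
def sj_accuracy(reported_sj, true_sj):
    """Fraction of reported splice junctions that are true (assumed definition)."""
    return true_sj / reported_sj


# (Reported SJ, True SJ, tabulated SJ accuracy) for SimRead_76
simread_76 = {
    "DART": (99761, 96700, 0.969),
    "STAR": (108202, 101163, 0.935),
    "HISAT2": (100589, 96922, 0.964),
}

recomputed = {
    aligner: round(sj_accuracy(reported, true), 3)
    for aligner, (reported, true, _) in simread_76.items()
}
for aligner, value in recomputed.items():
    print(f"{aligner}: recomputed {value}, tabulated {simread_76[aligner][2]}")
```

The same ratio also matches the real-dataset table, for example 150260 / 236920 ≈ 0.634 for DART on SRR3351428.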
Performance comparison of DART and other selected aligners on the real datasets
| Real datasets | Aligner | Sensitivity | SeqIdy | Reported SJ | True SJ | SJ accuracy | Runtime |
|---|---|---|---|---|---|---|---|
| SRR3351428 (100bp) | DART | 0.975 | 0.999 | 236920 | 150260 | 0.634 | 244 |
| | STAR | 0.922 | 0.996 | 270788 | 152192 | 0.562 | 270 |
| | TopHat2 | 0.844 | 0.998 | 217011 | 146077 | 0.673 | 22464 |
| | Subread | 0.858 | 0.998 | 221700 | 146518 | 0.661 | 3312 |
| | MapSplice2 | 0.966 | 0.996 | 240918 | 149255 | 0.620 | 67446 |
| | HISAT2 | 0.883 | 0.998 | 149592 | 129379 | 0.865 | 404 |
| ERR1518881 (101bp) | DART | 0.874 | 0.997 | 243515 | 154851 | 0.636 | 369 |
| | STAR | 0.841 | 0.987 | 259194 | 157026 | 0.606 | 371 |
| | TopHat2 | 0.640 | 0.995 | 220662 | 150126 | 0.680 | 21185 |
| | Subread | 0.759 | 0.992 | 229369 | 151325 | 0.660 | 4008 |
| | MapSplice2 | 0.893 | 0.988 | 221275 | 150496 | 0.680 | 15021 |
| | HISAT2 | 0.756 | 0.993 | 162639 | 135460 | 0.833 | 480 |
| SRR3439468 (151bp) | DART | 0.930 | 0.996 | 197235 | 129148 | 0.655 | 481 |
| | STAR | 0.841 | 0.992 | 206157 | 129008 | 0.626 | 594 |
| | TopHat2 | NA | NA | NA | NA | NA | NA |
| | Subread | NA | NA | NA | NA | NA | NA |
| | MapSplice2 | 0.930 | 0.990 | 158581 | 113879 | 0.718 | 49320 |
| | HISAT2 | 0.482 | 0.994 | 131892 | 105180 | 0.797 | 1306 |
| SRR3439488 (151bp) | DART | 0.899 | 0.995 | 142410 | 112562 | 0.790 | 427 |
| | STAR | 0.775 | 0.990 | 148672 | 113102 | 0.761 | 813 |
| | TopHat2 | NA | NA | NA | NA | NA | NA |
| | Subread | NA | NA | NA | NA | NA | NA |
| | MapSplice2 | 0.851 | 0.989 | 151771 | 107025 | 0.705 | 36240 |
| | HISAT2 | 0.657 | 0.994 | 120311 | 100198 | 0.833 | 703 |
Comparison of the peak RAM usage
| Aligner | Memory usage (GB) |
|---|---|
| DART | 12.0 |
| STAR | 30.0 |
| TopHat2 | 4.5 |
| Subread | 10.0 |
| MapSplice2 | 6.3 |
| HISAT2 | 5.6 |
The scaling of runtime of DART on the dataset SimRead_76
| Dataset | Threads | Runtime |
|---|---|---|
| SimRead_76 | 1 | 937 |
| | 2 | 464 |
| | 4 | 240 |
| | 8 | 126 |
| | 16 | 71 |
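To make the scaling easier to read, the following Python sketch derives speedup and parallel efficiency from the runtimes above. These derived quantities are not reported in this record; they use the standard definitions (speedup = single-thread runtime / runtime with n threads, efficiency = speedup / n), and the variable names are chosen for illustration.

```python
# Threads -> runtime for SimRead_76, copied from the table above.
runtimes = {1: 937, 2: 464, 4: 240, 8: 126, 16: 71}

baseline = runtimes[1]
speedup = {n: baseline / t for n, t in runtimes.items()}
efficiency = {n: speedup[n] / n for n in runtimes}

for n in sorted(runtimes):
    print(f"{n:>2} threads: speedup {speedup[n]:.2f}x, efficiency {efficiency[n]:.2f}")
```

Under these definitions, 16 threads give roughly a 13.2-fold speedup over one thread, which corresponds to a parallel efficiency of about 0.82.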