Luca Denti, Yuri Pirola, Marco Previtali, Tamara Ceccato, Gianluca Della Vedova, Raffaella Rizzi, Paola Bonizzoni.
Abstract
MOTIVATION: Recent advances in high-throughput RNA-Seq technologies make it possible to produce massive datasets. When a study focuses on only a handful of genes, most reads are irrelevant and degrade the performance of the tools used to analyze the data. Removing irrelevant reads from the input dataset improves efficiency without compromising the results of the study.
Year: 2021 PMID: 32926128 PMCID: PMC8088329 DOI: 10.1093/bioinformatics/btaa779
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Relation between the Bloom filter BF, the bit vector P, and the vector I. To retrieve the identifiers of the genes containing a k-mer e (gacttg in the figure), we compute its image h through H and, if BF[h] is the v-th 1 of BF, the positions of the (v-1)-th and the v-th 1 of P, denoted as p1 and p2, respectively, can be found via rank and select operations. The interval of I from p1 + 1 to p2 stores the set of the indices of the genes containing the k-mer e.
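The lookup described in the caption of Fig. 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's implementation: rank and select are naive linear scans rather than succinct data structures, positions are 0-based, the interval of I is assumed to start right after the previous 1 of P, and the hash `H` is replaced by a hypothetical lookup table (`TOY_HASH`). The toy contents of BF, P and I are invented for the example.

```python
from typing import Callable, List, Sequence


def rank1(bits: Sequence[int], i: int) -> int:
    """Count the 1s in bits[0..i], inclusive (naive O(n) rank)."""
    return sum(bits[: i + 1])


def select1(bits: Sequence[int], v: int) -> int:
    """Return the 0-based position of the v-th 1 in bits, v >= 1 (naive O(n) select)."""
    if v < 1:
        raise ValueError("v must be at least 1")
    seen = 0
    for pos, bit in enumerate(bits):
        seen += bit
        if seen == v:
            return pos
    raise ValueError(f"bit vector holds fewer than {v} ones")


def genes_for_kmer(
    kmer: str,
    H: Callable[[str], int],
    BF: Sequence[int],
    P: Sequence[int],
    I: Sequence[int],
) -> List[int]:
    """Return the gene indices stored for kmer, or [] if its bit in BF is 0."""
    h = H(kmer)
    if BF[h] == 0:
        return []
    v = rank1(BF, h)                          # BF[h] is the v-th 1 of BF
    p2 = select1(P, v)                        # position of the v-th 1 of P
    p1 = select1(P, v - 1) if v > 1 else -1   # previous 1 of P, or "before the start"
    return list(I[p1 + 1 : p2 + 1])           # interval p1 + 1 .. p2, inclusive


# Toy data (all values are assumptions made for illustration).
TOY_HASH = {"gacttg": 5, "aaaaaa": 2, "cccccc": 0}  # hypothetical stand-in for H
BF = [0, 0, 1, 0, 0, 1, 0, 0]   # 1s at positions 2 and 5
P = [0, 1, 0, 0, 1]             # a 1 marks the last gene index of each group in I
I = [0, 3, 1, 2, 3]             # groups: [0, 3] and [1, 2, 3]

result = genes_for_kmer("gacttg", TOY_HASH.__getitem__, BF, P, I)
```

Here `gacttg` hashes to position 5, which holds the second 1 of BF, so the second group of I is returned. Because BF is a Bloom filter, a k-mer that occurs in no gene can still land on a set bit, in which case a lookup of this kind returns gene indices that belong to a different k-mer.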
Fig. 2. Accuracy results: exploratory analysis. Accuracy is shown in terms of average precision and average recall obtained across the 10 performed runs. Lines connect data points with
Accuracy and efficiency results — varying gene panel sizes
| | Shark (multiple mode) | | | | Shark (single mode) | | | | RapMap | | | | Puffaligner | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gene panel size | P | R | Time (s) | RAM (GB) | P | R | Time (s) | RAM (GB) | P | R | Time (s) | RAM (GB) | P | R | Time (s) | RAM (GB) |
| 100 | 19.6 | 99.7 | 27 | 1.41 | 21.9 | 99.7 | 33 | 1.41 | 21.5 | 99.1 | 51 | 0.20 | 3.4 | 99.3 | 47 | 0.22 |
| 250 | 45.7 | 99.7 | 35 | 1.43 | 50.3 | 99.2 | 36 | 1.43 | 52.1 | 99.0 | 63 | 0.28 | 6.6 | 99.3 | 63 | 0.27 |
| 500 | 54.2 | 99.7 | 58 | 1.46 | 60.4 | 98.1 | 58 | 1.46 | 60.8 | 99.1 | 79 | 0.48 | 7.1 | 99.4 | 90 | 0.36 |
| 1000 | 58.6 | 99.6 | 103 | 1.57 | 66.6 | 98.4 | 96 | 1.57 | 65.6 | 98.9 | 110 | 0.87 | 9.1 | 99.3 | 120 | 0.43 |
| 2500 | 60.7 | 99.7 | 241 | 1.86 | 73.7 | 91.7 | 248 | 1.86 | 66.9 | 98.9 | 174 | 1.81 | 13.6 | 99.3 | 184 | 0.70 |
| 5000 | 65.2 | 99.6 | 441 | 2.47 | 84.7 | 85.0 | 492 | 2.47 | 71.9 | 98.8 | 325 | 3.79 | 26.2 | 99.3 | 277 | 1.00 |
| 10 000 | 68.9 | 99.6 | 898 | 3.38 | 99.9 | 75.1 | 934 | 3.38 | 75.5 | 98.7 | 526 | 6.37 | 64.7 | 99.3 | 431 | 1.55 |
Note: Accuracy is shown in terms of precision (P) and recall (R), while efficiency in terms of running time (Time, in seconds) and peak memory usage (RAM, in GB).
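As a reminder of what the P and R columns measure, the following minimal sketch computes precision and recall for a read filter from two sets of read identifiers. It uses the standard definitions; how the ground-truth set of relevant reads was built for the table is not stated in this excerpt, and the function name and the toy read identifiers are assumptions made for illustration.

```python
from typing import Set, Tuple


def precision_recall(selected: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision and recall of a read filter.

    selected: reads kept by the filter
    relevant: reads that truly originate from the genes of interest
    """
    true_positives = len(selected & relevant)
    # An empty denominator is reported as 0.0 here; this is a convention, not a definition.
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Toy example with invented read identifiers.
relevant_reads = {"r1", "r2", "r3", "r4"}
kept_reads = {"r1", "r2", "r3", "r7", "r8", "r9"}
p, r = precision_recall(kept_reads, relevant_reads)
```

In this toy case three of the six kept reads are relevant (precision 0.5) and three of the four relevant reads are kept (recall 0.75). For a filtering step, high recall matters most, since a discarded relevant read cannot be recovered downstream, whereas a kept irrelevant read only costs time.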
Accuracy and efficiency of the three pipelines for differential analysis of alternative splicing on the original samples compared with those obtained on the samples filtered by Shark
| | RT-PCR events | | Time | RAM |
|---|---|---|---|---|
| Pipeline | All | | (min) | (GB) |
| rMATS | 78 | 63 | 328 | 33.9 |
| Shark + rMATS | 78 | 63 | 154 | 33.9 |
| SplAdder | 56 | — | 915 | 33.9 |
| Shark + SplAdder | 56 | — | 351 | 33.9 |
| SUPPA2 | 66 | 44 | 117 | 1.7 |
| Shark + SUPPA2 | 66 | 51 | 42 | 1.7 |
Note: Accuracy is evaluated in terms of the number of RT-PCR validated events detected by each pipeline (over a total of 83 RT-PCR validated events). Efficiency is evaluated in terms of running time and maximum memory usage.
Accuracy and efficiency of the STAR-based pipelines for differential analysis of alternative splicing on the original samples compared with those obtained on the samples filtered by Shark
| | RT-PCR events | | Time | RAM |
|---|---|---|---|---|
| Pipeline | All | | (min) | (GB) |
| rMATS | 78 | 63 | 632 | 15.7 |
| Shark + rMATS | 78 | 63 | 138 | 15.7 |
| SplAdder | 56 | — | 1220 | 15.7 |
| Shark + SplAdder | 56 | — | 326 | 15.7 |
Note: The results were obtained with --genomeSAsparseD=8, a parameter that affects the sparsity of the index built and used by STAR. Accuracy is evaluated in terms of the number of RT-PCR validated events detected by each pipeline (over a total of 83 RT-PCR validated events). Efficiency is evaluated in terms of running time and maximum memory usage.
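The running times in the two pipeline tables above can be turned into speedup factors with simple arithmetic. The sketch below copies the minute values from those tables and divides the time on the original samples by the time on the Shark-filtered samples. Whether the "Shark +" times include the filtering step itself is not stated in this excerpt, so the ratios should be read with that caveat; the variable names are illustrative.

```python
# (time on original samples, time on Shark-filtered samples), in minutes,
# copied from the two tables above.
default_pipelines = {
    "rMATS": (328, 154),
    "SplAdder": (915, 351),
    "SUPPA2": (117, 42),
}
star_pipelines = {
    "rMATS": (632, 138),
    "SplAdder": (1220, 326),
}


def speedups(times):
    """Map each pipeline to (original time) / (filtered time), rounded to 2 decimals."""
    return {name: round(full / filtered, 2) for name, (full, filtered) in times.items()}


default_speedups = speedups(default_pipelines)
star_speedups = speedups(star_pipelines)
```

With these values the ratios range from about 2.1 to 2.8 in the first table and from about 3.7 to 4.6 in the STAR-based table, while the reported peak memory is unchanged in every row.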
Fig. 3. Comparison of differential alternative splicing events mapping to one of the 48 selected genes as predicted by KisSplice on the full dataset (left oval) and on the dataset filtered by Shark (right oval). Inner ovals represent the events predicted with