Silvia Liu, Wei-Hsiang Tsai, Ying Ding, Rui Chen, Zhou Fang, Zhiguang Huo, SungHwan Kim, Tianzhou Ma, Ting-Yu Chang, Nolan Michael Priedigkeit, Adrian V Lee, Jianhua Luo, Hsei-Wei Wang, I-Fang Chung, George C Tseng.
Abstract
BACKGROUND: Fusion transcripts are formed by either fusion genes (DNA level) or trans-splicing events (RNA level). They have been recognized as a promising tool for diagnosing, subtyping and treating cancers. RNA-seq has become a precise and efficient standard for genome-wide screening of such aberration events. Many fusion transcript detection algorithms have been developed for paired-end RNA-seq data but their performance has not been comprehensively evaluated to guide practitioners. In this paper, we evaluated 15 popular algorithms by their precision and recall trade-off, accuracy of supporting reads and computational cost. We further combine top-performing methods for improved ensemble detection.
Year: 2015 PMID: 26582927 PMCID: PMC4797269 DOI: 10.1093/nar/gkv1234
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Figures to explain terminology. (A) Intact exon (IE) type and broken exon (BE) type fusion transcripts; (B) spanning read, split read and anchor length; (C) short and long insert size of DNA fragment for sequencing.
F-measure for three representative synthetic data sets and three real data sets. Type-1A: 100 bp reads at 100X coverage for type-1A synthetic data; Type-1B: 100 bp reads at 100X coverage for type-1B synthetic data; Type-3B: 50 bp reads for type-3B synthetic data (mean F-measure of the 5 control samples); Breast cancer: 4 pooled samples from the breast cancer data sets; Melanoma: 6 pooled samples from the melanoma data sets; Prostate cancer: 5 pooled samples from the prostate cancer data sets.
| Tools | Type-1A | Type-1B | Type-3B | Breast cancer | Melanoma | Prostate cancer | Sum of synthetic data | Sum of real data | Sum of all data |
|---|---|---|---|---|---|---|---|---|---|
| SOAPfuse | 0.882 | 0.883 | 0.850 | 0.421 | 0.169 | 0.148 | 2.615* | 0.738 | 3.353* |
| FusionCatcher | 0.777 | 0.791 | 0.750 | 0.405 | 0.300 | 0.209 | 2.318* | 0.914* | 3.232* |
| JAFFA | 0.693 | 0.672 | 0.702 | 0.543 | 0.267 | 0.006 | 2.067 | 0.816 | 2.883* |
| EricScript | 0.779 | 0.804 | 0.752 | 0.291 | 0.074 | 0.006 | 2.335* | 0.371 | 2.706 |
| chimerascan | 0.737 | 0.706 | 0.689 | 0.267 | 0.049 | 0.010 | 2.132 | 0.326 | 2.458 |
| PRADA | 0.545 | 0.543 | 0.540 | 0.469 | 0.334 | 0 | 1.628 | 0.803 | 2.431 |
| deFuse | 0.630 | 0.854 | 0.561 | 0.235 | 0.095 | - | 2.045 | 0.330 | 2.375 |
| FusionMap | 0.684 | 0.711 | 0.606 | 0.075 | 0.041 | 0.004 | 2.001 | 0.120 | 2.121 |
| TopHat-Fusion | 0.488 | 0.557 | 0.539 | 0.300 | 0.200 | 0 | 1.584 | 0.500 | 2.084 |
| MapSplice | 0.488 | 0.500 | 0.504 | 0.400 | 0.182 | 0 | 1.492 | 0.582 | 2.074 |
| BreakFusion | 0.707 | 0.569 | 0.454 | 0.016 | 0.004 | 0 | 1.730 | 0.020 | 1.750 |
| SnowShoes-FTD | 0.039 | 0.039 | 0.039 | 0.639 | 0.500 | 0.435 | 0.117 | 1.574* | 1.691 |
| FusionQ | 0.651 | 0.479 | 0.349 | 0.017 | - | - | 1.479 | 0.017 | 1.496 |
| FusionHunter | - | - | - | 0.520 | 0.421 | - | - | 0.941* | 0.941 |
| ShortFuse | - | - | - | 0.543 | 0.291 | - | - | 0.834 | 0.834 |
The symbol * marks the top tools.
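The table entries can be reproduced in outline with a short sketch. It assumes the conventional balanced F-measure (the harmonic mean of precision and recall), which the text above does not define explicitly. The function names are illustrative, and the summed values are copied from the SOAPfuse row.

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean); assumed definition, not stated in the text."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f_from_counts(true_positives, total_detections, total_true):
    """F-measure from raw counts: detections by a tool versus the known truth."""
    precision = true_positives / total_detections if total_detections else 0.0
    recall = true_positives / total_true if total_true else 0.0
    return f_measure(precision, recall)


# Row sums as used in the last three columns (SOAPfuse values from the table).
soapfuse_synthetic = [0.882, 0.883, 0.850]  # Type-1A, Type-1B, Type-3B
soapfuse_real = [0.421, 0.169, 0.148]       # breast cancer, melanoma, prostate cancer

sum_synthetic = sum(soapfuse_synthetic)
sum_real = sum(soapfuse_real)
sum_all = sum_synthetic + sum_real

print(f"{sum_synthetic:.3f} {sum_real:.3f} {sum_all:.3f}")
```

Entries shown as "-" in the table contribute nothing to a row sum, which is why a tool such as FusionHunter has a total equal to its real-data sum.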
Figure 2. Fusion transcript detection results for synthetic data sets with 100 bp read lengths. (A–C): The y-axis bars show the number of true positives detected, of which IE-type and BE-type fusions are shown as solid and slashed rectangles, respectively. The total number of fusion detections is shown on top of the bars. (A) Result for type-1A synthetic data (100 bp read length), (B) result for type-1B synthetic data (100 bp read length) and (C) result for type-2, type-3A and type-3B synthetic data (lung sample 50 bp read length). (D) Precision-recall plot for type-1A synthetic data (100 bp read length and 100X). (E) Precision-recall plot for type-1B synthetic data (100 bp read length and 100X). (F) Precision-recall plot for type-3B synthetic data (lung sample 50 bp read length and 100X).
Figure 3. Illustration of alignment performance and similarity across tools for type-1A synthetic data with 100 bp read length and 100X. (A–C): Number of true positives (y-axis) with detected supporting reads greater than the threshold on the x-axis. (D–F): Multi-dimensional scaling (MDS) plots to demonstrate pairwise similarity of detection results from 15 tools and the underlying truth. (A) and (D): Results for all 150 true fusion transcripts. (B) and (E): Results for only IE-type fusion transcripts. (C) and (F): Results for only BE-type fusion transcripts.
Figure 4. Fusion transcript detection results for three real data sets. Figures are similar to Figure 2. (A) and (D): Breast cancer data set; (B) and (E): Melanoma data set; (C) and (F): Prostate cancer data set.
Figure 5. Computational cost comparison. The bar plots (y-axis) show the log-scaled computational time (min). Dashed lines project from the largest data set with linear computing time decrease by coverage and can be used to determine linear, super-linear (bars for smaller coverages fall below the line) or sub-linear (bars for smaller coverages exceed the line) computing load. (A) Evaluation using type-1A synthetic data for read length 100 bp at 50X, 100X and 200X. (B) Evaluation using prostate cancer 171T sample.
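The linear projection described in the Figure 5 caption can be sketched as follows. The timings, the `tolerance` band and the function name are hypothetical and serve only to illustrate the comparison; they are not measurements from the study.

```python
def classify_scaling(times_by_coverage, tolerance=0.1):
    """Compare observed run times with a linear projection from the largest coverage.

    times_by_coverage maps coverage (e.g. 50, 100, 200) to run time in minutes.
    tolerance is an assumed relative band within which a bar counts as "linear".
    """
    largest = max(times_by_coverage)
    reference_time = times_by_coverage[largest]
    labels = {}
    for coverage, observed in times_by_coverage.items():
        if coverage == largest:
            continue
        # Time expected if run time shrank in proportion to coverage.
        projected = reference_time * coverage / largest
        ratio = observed / projected
        if ratio < 1 - tolerance:
            labels[coverage] = "super-linear"  # bar falls below the line
        elif ratio > 1 + tolerance:
            labels[coverage] = "sub-linear"    # bar exceeds the line
        else:
            labels[coverage] = "linear"
    return labels


# Hypothetical timings (minutes) at 50X, 100X and 200X coverage.
print(classify_scaling({50: 10.0, 100: 50.0, 200: 100.0}))
```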
Figure 6. Illustration of the meta-caller workflow.
Figure 7. Precision-recall curves of top-3 performing tools and meta-caller. (A–C): Type-1A, type-1B and type-3B (lung sample) synthetic data with 100X coverage and 100, 100 and 50 bp read length, respectively. (D–F): Three real data sets: breast cancer, melanoma and prostate cancer.
Figure 8. Precision-recall curves of top-3 performing tools and meta-caller (with majority vote=2) on validation data.
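A majority-vote combination of the kind named in the Figure 8 caption might look like the sketch below. It is a simplified illustration, not the workflow of Figure 6, whose details are not given in this text. It assumes each tool reports fusions as (5' gene, 3' gene) pairs, takes the three tools starred in the "Sum of all data" column as the top three, and uses placeholder gene names.

```python
from collections import Counter


def majority_vote(calls_by_tool, min_votes=2):
    """Keep fusions reported by at least min_votes tools; returns {fusion: votes}."""
    votes = Counter()
    for calls in calls_by_tool.values():
        # set() ensures each tool votes at most once per fusion.
        votes.update(set(calls))
    return {fusion: n for fusion, n in votes.items() if n >= min_votes}


# Hypothetical calls; gene names are placeholders, not results from the study.
calls = {
    "SOAPfuse": [("GENE_A", "GENE_B"), ("GENE_C", "GENE_D")],
    "FusionCatcher": [("GENE_A", "GENE_B"), ("GENE_E", "GENE_F")],
    "JAFFA": [("GENE_A", "GENE_B"), ("GENE_C", "GENE_D"), ("GENE_G", "GENE_H")],
}

consensus = majority_vote(calls, min_votes=2)
for fusion, n in sorted(consensus.items()):
    print(fusion, n)
```

Raising `min_votes` to 3 would keep only fusions on which all three tools agree, trading recall for precision.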