Nadia M Davidson, Ying Chen, Teresa Sadras, Georgina L Ryland, Piers Blombery, Paul G Ekert, Jonathan Göke, Alicia Oshlack.
Abstract
In cancer, fusions are important diagnostic markers and targets for therapy. Long-read transcriptome sequencing allows the discovery of fusions with their full-length isoform structure. However, due to higher sequencing error rates, fusion finding algorithms designed for short reads do not work. Here we present JAFFAL, to identify fusions from long-read transcriptome sequencing. We validate JAFFAL using simulations, cell lines, and patient data from Nanopore and PacBio. We apply JAFFAL to single-cell data and find fusions spanning three genes demonstrating transcripts detected from complex rearrangements. JAFFAL is available at https://github.com/Oshlack/JAFFA/wiki.
Keywords: Fusions; Long reads; Nanopore; PacBio; RNA sequencing; Translocations
Year: 2022 PMID: 34991664 PMCID: PMC8739696 DOI: 10.1186/s13059-021-02588-5
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 JAFFAL pipeline steps for fusion detection. Reads are aligned to the reference transcriptome; reads split across different genes are identified as candidate fusion reads and subsequently aligned to the reference genome for confirmation. Reads are then clustered into breakpoint positions, which are ranked and reported (see text for details)
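The split-read idea in the caption above can be illustrated with a minimal sketch. This is not JAFFAL's implementation: the `Hit` record, the `max_overlap` threshold, and ranking by read count are all assumptions made for illustration, and the genome re-alignment step is omitted.

```python
from collections import Counter, namedtuple

# Hypothetical record: one alignment of part of a read to a transcript of `gene`.
# read_start/read_end are coordinates on the read, not on the reference.
Hit = namedtuple("Hit", "read gene read_start read_end")


def candidate_fusion_reads(hits, max_overlap=15):
    """Return {read: (first_gene, second_gene)} for reads whose consecutive
    alignments hit different genes and overlap by at most `max_overlap`
    bases on the read (threshold is an assumed value)."""
    by_read = {}
    for hit in hits:
        by_read.setdefault(hit.read, []).append(hit)

    candidates = {}
    for read, read_hits in by_read.items():
        read_hits.sort(key=lambda h: h.read_start)
        for a, b in zip(read_hits, read_hits[1:]):
            overlap = a.read_end - b.read_start
            if a.gene != b.gene and overlap <= max_overlap:
                candidates[read] = (a.gene, b.gene)
                break
    return candidates


def rank_by_support(candidates):
    """Group candidate reads by gene pair and sort by supporting read count."""
    return Counter(candidates.values()).most_common()


hits = [
    Hit("r1", "BCR", 0, 500), Hit("r1", "ABL1", 495, 1200),
    Hit("r2", "BCR", 0, 400), Hit("r2", "ABL1", 400, 900),
    Hit("r3", "GAPDH", 0, 800),  # single-gene read, not a candidate
]
ranked = rank_by_support(candidate_fusion_reads(hits))
```

Here `ranked` lists each gene pair with its number of supporting reads; a real pipeline would additionally confirm candidates against the genome and cluster them by breakpoint position, as the caption describes.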
The number of fusion genes and breakpoints called in the non-cancer cell line NA12878 from ONT direct RNA and amplified cDNA. Most calls are presumed to be false positives. The number of fusions in the highest rank category for each tool is shown in bold. We hypothesize that most of the multi-read fusions reported by LongGF applied to the cDNA dataset (173) are chimeras introduced during library preparation. JAFFAL ranks these events as Low Confidence. The number of breakpoints for LongGF is not shown as it only reports one breakpoint per fusion gene by default
| Tool | Category | Direct RNA: fusion genes | Direct RNA: breakpoints | Direct RNA: read support, median (range) | cDNA: fusion genes | cDNA: breakpoints | cDNA: read support, median (range) |
|---|---|---|---|---|---|---|---|
| Total reads processed | | 14,971,421 | | | 25,418,307 | | |
| JAFFAL | High confidence | | 4 | 4.5 (2–14) | | 8 | 6 (2–24) |
| JAFFAL | Low confidence | 5 | 7 | 2 (2–11) | 94 | 121 | 2 (2–49) |
| JAFFAL | Potential trans-splicing | 344 | 344 | 1 (1–1) | 412 | 412 | 1 (1–1) |
| LongGF | > 1 read support | | | 2 (2–14) | 173 | | 2 (2–522) |
| LongGF | = 1 read support | 713 | | 1 (1–1) | 386 | | 1 (1–1) |
Fig. 2 Fusion finding sensitivity on simulated ONT data with background. A The fraction of simulated fusions detected (y-axis) by JAFFAL across a range of fusion coverage levels (x-axis). Read identity levels are shown in different colors (red-purple). B The fraction of simulated fusions detected (y-axis) by JAFFAL and LongGF for sequence identity levels of 75–95%
The number of previously validated fusions rediscovered across seven long-read sequencing datasets by JAFFAL and LongGF. The total number of fusion genes reported by each tool, including those not previously validated, is indicated in parentheses
| Tool | Category | PacBio HCT-116 | PacBio SK-BR-3 | PacBio MCF-7 | ONT HCT-116 | ONT A549 | ONT K562 | ONT MCF-7 |
|---|---|---|---|---|---|---|---|---|
| Reads | | 156,632 | 3,070,545 | 2,389,856 | 44,416,838 | 31,393,964 | 36,751,242 | 34,654,115 |
| Previously validated fusions | | 3 | 30 | 53 | 3 | 2 | 6 | 53 |
| JAFFAL: previously validated fusions rediscovered (all fusions) | High confidence | 1 (1) | 13 (20) | 26 (73) | 3 (49) | 2 (21) | 2 (17) | 29 (69) |
| JAFFAL: previously validated fusions rediscovered (all fusions) | Low confidence | 0 (1) | 0 (5) | 1 (112) | 0 (81) | 0 (40) | 0 (31) | 1 (29) |
| JAFFAL: previously validated fusions rediscovered (all fusions) | Potential trans-splicing | 0 (21) | 1 (201) | 9 (435) | 0 (2343) | 0 (1206) | 0 (615) | 2 (819) |
| LongGF: previously validated fusions rediscovered (all fusions) | > 1 read support | 1 (2) | 10 (20) | 22 (292) | 2 (307) | 2 (224) | 2 (168) | 24 (220) |
| LongGF: previously validated fusions rediscovered (all fusions) | = 1 read support | 0 (113) | 1 (2537) | 6 (1800) | 0 (1321) | 0 (1922) | 0 (2267) | 4 (2172) |
Fig. 3 Comparison of JAFFAL and LongGF on cancer cell line sequencing. Shown are ROC-style curves of the ranking of previously validated fusions against other reported fusions for A MCF-7, HCT-116, A549, and K562 cell lines sequenced with ONT and B MCF-7, HCT-116, and SK-BR-3 cell lines sequenced with PacBio. C For MCF-7 only, high confidence fusions from JAFFAL (crosses) are compared against three short-read Illumina replicates (squares) across three sequencing depths (colors). D The overlap between fusions called by JAFFAL (high and low confidence) and LongGF (> 1 read support) on MCF-7
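A ROC-style curve of this kind can be sketched as follows. The caption does not specify the exact axes, so this example assumes cumulative counts: walking down a tool's ranked list, each previously validated fusion moves the curve up and each other reported fusion moves it right. The function name and the example fusion labels are hypothetical.

```python
def roc_style_curve(ranked_fusions, validated):
    """Return a list of (other_count, validated_count) points, one per
    position in `ranked_fusions` (best-ranked first), starting at (0, 0)."""
    points = [(0, 0)]
    n_validated = n_other = 0
    for fusion in ranked_fusions:
        if fusion in validated:
            n_validated += 1
        else:
            n_other += 1
        points.append((n_other, n_validated))
    return points


# Toy ranking with made-up labels: two of four calls are previously validated.
curve = roc_style_curve(["F1", "F2", "F3", "F4"], validated={"F1", "F3"})
```

A tool that ranks validated fusions ahead of other calls produces a curve that rises steeply before moving right.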
Fig. 4 Detection of fusions in single-cell ONT sequencing of five cell lines. A t-SNE plot generated from short-read gene expression. Color indicates the CCLE cell line in which the detected fusion is known to occur. Gray indicates a cell with no detected CCLE fusion. B For each of the 15 fusions detected by JAFFAL, the number of cells identified in each of the five clusters is shown. Fusion labels are colored according to the CCLE cell line they were previously identified in. Black indicates a novel fusion. C JAFFAL identified BMPR2-TYW5 and TYW5-ALS2CR11 in the H838 cell line as belonging to the same transcript and forming the three-gene fusion BMPR2-TYW5-ALS2CR11 identified in 15 reads (two different isoforms). Expressed exons in the fusion transcript are shown in blue, red, and green, with color indicating the gene of origin. Red bars show the position of translocations seen in short-read whole-genome sequencing of H838 in CCLE. The breakpoint within ALS2CR11 falls within its third final exon, and this exon appears to be spliced out. The six isoforms we identified for BMPR2-TYW5-ALS2CR11 and the number of long reads supporting each are also shown. The locations of the PCR forward and reverse primers that validated the translocation between BMPR2 and ALS2CR11 are shown in black (bottom)
Average and range (in parentheses) of run-time and memory consumed on nine benchmarking datasets by JAFFAL and LongGF
| Tool | Step | Run-time (hours) | Memory consumption (GB) |
|---|---|---|---|
| JAFFAL | JAFFAL (4 threads) | 2.6 (0.08–5.9) | 20.0 (19.8–21.1) |
| LongGF | Genome mapping (4 threads) | 9.5 (0.1–21.2) | 22.6 (20.2–24.7) |
| LongGF | LongGF (1 thread) | 0.4 (0.01–1.1) | 6.4 (0.8–13.3) |