Fatih Karaoglanoglu, Cedric Chauve, Faraz Hach.
Abstract
BACKGROUND: The advent of next-generation sequencing technologies has empowered a wide variety of transcriptomics studies. A widely studied topic is gene fusion, which is observed in many cancer types and suspected of having oncogenic properties. Gene fusions result from structural genomic events that bring two genes into close proximity and produce a fused transcript. They differ from fusion transcripts created during or after transcription, which are chimeric transcripts known as read-through and trans-splicing transcripts. Gene fusion discovery with short reads is a well-studied problem, and many methods have been developed, but their sensitivity is limited by the technology, especially the short read length. Advances in long-read sequencing technologies allow the generation of long transcriptomic reads at low cost. Transcriptomic long-read sequencing presents unique opportunities to overcome the shortcomings of short-read technologies for gene fusion detection while introducing new challenges.
Keywords: Dynamic programming; Gene fusion detection; Long-read sequencing; Transcriptomics
Year: 2022 PMID: 35164688 PMCID: PMC8842519 DOI: 10.1186/s12864-022-08339-5
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Genion pipeline. The preprocessing step produces alignment intervals for each input read. Paftools [30] is used to convert the mappings from SAM to PAF format. Mapped parts of the reads are masked and mapped again. The outputs of the mapping steps are merged, converted to a set of segments, and annotated. The Chimeric Read Identification step chains the annotated segments of each read and identifies the gene content. The Chimeric Cluster Characterization step uses single-gene aligned chains to calculate gene expression, while multi-gene aligned chains are clustered. Each cluster is statistically tested, and FDR correction is applied to the calculated p-values. For each cluster, FiN and ff-igf scores are calculated. Clusters are characterized, ranked according to these scores, and reported
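The caption states that FDR correction is applied to the per-cluster p-values but does not name the procedure. As an illustration only, the sketch below assumes the Benjamini-Hochberg procedure; the function name `benjamini_hochberg` and the example p-values are hypothetical and are not taken from Genion.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: BH is used here only as a common FDR procedure; the source
    does not say which correction Genion applies.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical p-values for four candidate clusters.
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A cluster would then pass the test when its adjusted p-value falls below a chosen FDR threshold; the threshold itself is not given in this record.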
Fig. 2 FiN and ff-igf values for each call made by Genion. a Ground truth of the simulated dataset, b Genion calls on the simulated dataset and c Genion calls on the MCF-7 PacBio dataset. Simulated gene fusions, read-throughs and random pairings are colored blue, purple and red, respectively. Calls are colored according to the simulated ground truth in (a) and according to Genion predictions in (b) and (c) [44]. PASS:GF, PASS:RT and FAIL:RP represent chimeras called as gene fusions, read-throughs and random pairings in (b) and (c). Chimeras filtered due to overlaps, homology or low support are not included in this figure
Fig. 3 Probability mass functions (pmf) of a EIF3D:MYH9 and b RGS17:TBL1XR1 chimeras derived from the hypergeometric distribution. The red line shows the probability of observing each number of random pairings for the candidate; the blue line marks the number of chimeric reads found by Genion. The p-value of the candidate is the area under the pmf beyond the blue line
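The p-value described in the Fig. 3 caption is an upper-tail probability of a hypergeometric distribution. The sketch below shows that calculation with the Python standard library. The parameter names (`population`, `successes`, `draws`) are generic placeholders, because this record does not say how Genion parameterizes the distribution, and counting the observed value itself in the tail is also an assumption.

```python
from math import comb


def hypergeom_pmf(k, population, successes, draws):
    """P(X = k) when drawing `draws` items without replacement from a
    population of size `population` that contains `successes` successes."""
    lowest = max(0, draws - (population - successes))
    highest = min(successes, draws)
    if k < lowest or k > highest:
        return 0.0
    return (comb(successes, k)
            * comb(population - successes, draws - k)
            / comb(population, draws))


def upper_tail_p_value(observed, population, successes, draws):
    """P(X >= observed): pmf mass at and beyond the observed count.

    Assumption: the tail is inclusive of the observed count.
    """
    highest = min(successes, draws)
    return sum(hypergeom_pmf(k, population, successes, draws)
               for k in range(observed, highest + 1))


if __name__ == "__main__":
    # Toy numbers, not taken from the paper.
    print(upper_tail_p_value(5, population=10, successes=5, draws=5))
```

A small tail probability means the observed number of chimeric reads is unlikely to arise from random pairing alone.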
Clustering statistics for Genion and LongGF and number of fusion reads recovered by each tool
| Tool | Reads | Fusion ARI | All ARI |
|---|---|---|---|
| Genion | 2,521 | 0.901 | 0.925 |
| LongGF | 872 | 0.183 | NA |
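The ARI columns above report the adjusted Rand index, which compares a predicted clustering against a reference clustering and corrects for chance agreement. The sketch below implements the standard pair-counting formula. How the reads are labeled for the "Fusion ARI" and "All ARI" columns is not described in this record, so the label lists in the example are purely hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # Zero or one item: the clusterings trivially agree.
    # Pair counts within contingency cells, true clusters, predicted clusters.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # Degenerate case, e.g. every item in one cluster.
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    # Hypothetical cluster labels for four reads.
    print(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2]))
```

An ARI of 1 indicates identical clusterings up to relabeling, while values near 0 indicate agreement no better than chance.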
Simulated gene fusions called by Genion and LongGF
| Gene1 | Gene2 | Genion | LongGF | Gene1 | Gene2 | Genion | LongGF |
|---|---|---|---|---|---|---|---|
| TCF3 | PBX1 | KMT2A | MLLT3 | ||||
| JAZF1 | SUZ12 | BCR | ABL1 | ||||
| DNAJB1 | PRKACA | TMPRSS2 | ERG | ✗ | |||
| KIAA1549 | BRAF | CCDC6 | RET | ||||
| NAB2 | STAT6 | ✗ | ✗ | CBFA2T3 | GLIS2 | ||
| EWSR1 | FLI1 | PML | RARA | ||||
| SS18 | SSX1 | RUNX1 | RUNX1T1 | ||||
| COL1A1 | PDGFB | CRTC1 | MAML2 |
Genion and LongGF gene fusion calls on MCF-7 cell line data released by PacBio. This set of gene fusions has been validated either experimentally (EXP) or by short-read sequencing (SRS). # Reads is the number of reads from the data release. Ranks give the order in which each tool reported the fusion. Chimeras found by Genion but identified as random pairings are given the rank RP. FiN and ff-igf are the scores computed by Genion
| Gene1 | Gene2 | # Reads | Validation | LongGF Rank | Genion Rank | FiN | ff-igf | SV Overlap |
|---|---|---|---|---|---|---|---|---|
| BCAS4 | BCAS3 | 1183 | EXP | 1 | 1 | 5.15582 | 8070.86 | translocation |
| RPS6KB1 | VMP1 | 349 | EXP | 2 | 2 | 0.25847 | 2046.15 | deletion |
| SYTL2 | PICALM | 117 | SRS | 4 | 3 | 0.45365 | 857.222 | ✗ |
| RPS6KB1 | DIAPH3 | 101 | SRS | 5 | 4 | 0.31655 | 626.54 | ✗ |
| SLC25A24 | NBPF6 | 41 | SRS | 6 | 5 | 1.80952 | 472.95 | inversion |
| PAPOLA | AK7 | 37 | SRS | 7 | 6 | 0.20536 | 369.71 | ✗ |
| ESR1 | CCDC170 | 24 | SRS | 10 | 8 | 0.52564 | 308.07 | ✗ |
| TXLNG | SYAP1 | 27 | EXP | 12 | 9 | 0.11986 | 275.28 | ✗ |
| TBL1XR1 | RGS17 | 15 | SRS | 34 | 12 | 0.66667 | 153.69 | translocation |
| MYO6 | SENP6 | 14 | SRS | 35 | 13 | 0.10683 | 147.50 | ✗ |
| POP1 | MATN2 | 7 | SRS | | RP | 0.15790 | 28.27 | ✗ |
| MYH9 | EIF3D | 11 | SRS | 14 | RP | 0.04000 | 140.16 | ✗ |
| FOXA1 | TTC6 | 26 | SRS | | | 4.10526 | 303.47 | inversion |
| RSBN1 | AP4B1 | 7 | SRS | 19 | | 0.01100 | 8.00 | ✗ |
| ZNF217 | SULF2 | 16 | SRS | | RP | 0.00758 | 27.14 | deletion |
| ARFGEF2 | SULF2 | 23 | SRS | 15 | RP | 0.03104 | 113.28 | inversion |
Time and memory used by Genion and LongGF during mapping and fusion calling steps on PacBio sequencing of MCF-7 breast cancer cell line
| Tool/Mapper | Mapping: Threads # | Mapping: Time (mm:ss) | Mapping: Peak Memory (GB) | Fusion Calling: Threads # | Fusion Calling: Time (mm:ss) | Fusion Calling: Peak Memory (GB) |
|---|---|---|---|---|---|---|
| LongGF/minimap2 | 48 | 16:47 | 44.94 | 1 | 05:28 | 1.34 |
| Genion/deSALT | 48 | 24:07 | 37.25 | 1 | 14:36 | 0.85 |