Pora Kim1, Hua Tan1, Jiajia Liu1,2, Mengyuan Yang1,3, Xiaobo Zhou1,4,5.
Abstract
Identifying the molecular mechanisms related to genomic breakage is an important goal of cancer mechanism studies. Among the diverse locations of structural variants, fusion genes, which have breakpoints in gene bodies and are typically identified from the split reads of RNA-seq data, provide a focused structural variant resource for studying genomic breakage together with its expression and potential pathogenic impacts. In this study, we developed FusionAI, which uses deep learning to predict gene fusion breakpoints from DNA sequence and lets us identify the fusion breakage code and its genomic context. FusionAI leverages known fusion breakpoints to provide a prediction model of fusion genes from the primary genomic sequence, thereby helping researchers select fusion genes more accurately and better understand genomic breakage.
Keywords: artificial intelligence applications; computational bioinformatics; genetics; genomics
Year: 2021 PMID: 34646994 PMCID: PMC8501764 DOI: 10.1016/j.isci.2021.103164
Source DB: PubMed Journal: iScience ISSN: 2589-0042
Figure 1. Overview of FusionAI
(A) Investigation of the breakpoints of 48K fusion genes (FGs) from FusionGDB identified breakpoint (BP) locations across the human genome.
(B) Making training and test datasets of fusion-positive and -negative breakpoints.
(C) Diagram of fusion gene breakpoints classification by FusionAI.
(D) Effect of the size of the input sequence context on the accuracy.
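
Panels (B) to (D) describe classifying fusion-positive and fusion-negative breakpoints from the surrounding DNA sequence. As a rough illustration of how a sequence window around a breakpoint could be turned into numeric input for such a classifier, the following is a minimal sketch using one-hot encoding. The function names, the N-padding rule, and the flank size are assumptions made for illustration and are not taken from the FusionAI code.

```python
# Hypothetical helpers: names and padding behaviour are illustrative assumptions,
# not the FusionAI implementation.
BASES = "ACGT"


def breakpoint_window(chrom_seq, breakpoint, flank):
    """Return `flank` bases on each side of a 0-based breakpoint.

    Positions that fall outside the sequence are padded with 'N'.
    """
    start = breakpoint - flank
    end = breakpoint + flank
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - len(chrom_seq))
    core = chrom_seq[max(0, start):min(len(chrom_seq), end)]
    return left_pad + core + right_pad


def one_hot_encode(seq):
    """Encode a DNA string as a list of 4-element rows in A, C, G, T order.

    Any other symbol (for example 'N') becomes an all-zero row.
    """
    index = {base: i for i, base in enumerate(BASES)}
    encoded = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        col = index.get(base)
        if col is not None:
            row[col] = 1
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    toy_sequence = "ACGTACGT"  # toy data, not a real genomic sequence
    window = breakpoint_window(toy_sequence, breakpoint=2, flank=3)
    for base, row in zip(window, one_hot_encode(window)):
        print(base, row)
```

The size of the window is the parameter varied in panel (D); in this sketch it is controlled by `flank`.
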
Figure 2. Performance of FusionAI
(A) Comparison of FusionAI to other methods for fusion gene prediction, including 38,000 TCGA fusion genes from the training and test datasets. From left to right, the plots show the number of true positives and the sensitivity.
(B) Comparison of the number of true-positive fusions among ∼2200 validated fusion genes in ∼530 cancer cell-lines, and among 862 Sanger sequence-based fusion genes from ChiTaRS 3.1 that have fusion breakpoints at the exon junction position.
(C) Comparison of predicting fusion genes in three cancer cell-lines (H2228, K562, and MCF7).
(D) Identification of validated fusion genes in three cell-lines.
Figure 3. Feature importance score for understanding genomic breakage
(A) Distribution of feature importance (FI) scores across the 20 kb sequences of six representative fusion gene breakpoints.
(B) Logistic regression result of FusionAI prediction.
(C) Classification between fusion-positive and -negative from FusionAI and FI scores.
(D) (i) Distribution of overlaps between top 1% FI scored regions and 44 different types of human genomic features in both positive and negative data. (ii) Distribution of overlaps between all regions and 44 different types of human genomic features.
Figure 4. High feature importance scored regions
(A) Distribution of the distance between the high FI scored regions and the exon junctional breakpoints.
(B) Enriched biological processes in the genes that overlap with high FI scored regions, per genomic feature category.
Figure 5. Consensus motif sequences in the high FI scored FG-positive regions and enriched biological processes
(A) DNA sequence motifs identified in the fusion-positive breakpoint areas of multiple groups: all fusion-positives, intra-chromosomal fusion-positive events, kinase fusion genes with dimerization and kinase domains, and transcription factor fusion genes.
(B) Distribution of the GC-rich motif across the 20 kb sequence in the isoforms of the TMPRSS2-ERG fusion gene.
(C) Transcription factor fusion genes that have GC-rich motifs.
(D) Enriched biological processes of those genes that have GC-rich motifs.
Comparison of FusionAI scores among the fusion genes in pan-cancer and healthy tissues
| Dataset | Description | Sequencing type | # jj fusion genes | # jj fusion genes (FusionAI score >0.5) | Percentage (%) |
|---|---|---|---|---|---|
| TCGA | Fusion genes in training data | RNA-seq | 18,210 | 18,207 | 99.98 |
| TCGA | Fusion genes in test data | RNA-seq | 7,759 | 7,383 | 95.15 |
| Klijn et al. | Fusion genes from cancer cell-lines | RNA-seq | 2,162 | 2,066 | 95.56 |
| ChiTaRS 3 | Fusion transcripts | Sanger sequencing | 862 | 807 | 93.62 |
| GTEx | Fusion genes common in cancer | RNA-seq | 646 | 537 | 83.13 |
| GTEx | Fusion genes not in cancer | RNA-seq | 925 | 634 | 68.54 |
| gnomAD | WGS-based predicted fusions | Whole genome sequencing | 923 | 46 | 4.98 |
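
The percentage column is the share of fusion genes in each dataset with a FusionAI score above 0.5. The sketch below recomputes that column from the two count columns of the table; the variable and function names are illustrative only.

```python
# Counts copied from the table: (dataset label, total fusion genes, genes scoring > 0.5).
ROWS = [
    ("TCGA, training data", 18210, 18207),
    ("TCGA, test data", 7759, 7383),
    ("Cancer cell-lines", 2162, 2066),
    ("Sanger-sequenced fusion transcripts", 862, 807),
    ("GTEx, common in cancer", 646, 537),
    ("GTEx, not in cancer", 925, 634),
    ("WGS-based predicted fusions", 923, 46),
]


def percent_above_threshold(total, passing):
    """Percentage of fusion genes whose score exceeds the threshold."""
    return 100.0 * passing / total


if __name__ == "__main__":
    for label, total, passing in ROWS:
        print(f"{label}: {percent_above_threshold(total, passing):.2f}%")
```
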
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| Open reading frame annotation of known fusion genes | ||
| Fusion gene breakpoint information of TCGA fusion genes | ||
| Fusion gene breakpoint information of cancer cell-lines | N/A | |
| Fusion gene breakpoint information of Sanger sequencing | ||
| Fusion gene breakpoint information of GTEx cohorts | N/A | |
| Fusion gene breakpoint information of gnomAD cohorts | ||
| Simulation RNA-seq data of all validation sets | This study | |
| Fastq files for RNA-seq of K562 | Sequence Read Archive (SRA) in NCBI | Sequence Read Archive accession: SRR521460 |
| Fastq files for RNA-seq of MCF7 | Sequence Read Archive (SRA) in NCBI | Sequence Read Archive accession: SRR064286 |
| Fastq files for RNA-seq of H2228 | Sequence Read Archive (SRA) in NCBI | Sequence Read Archive accession: DRR016705.1s |
| Virus integration site information | ||
| RepeatMasker | ||
| MicroSatellite DataBase (MSDB) | ||
| Structural variant breakpoint information of gnomAD | ||
| Chromatin state calls using a 15-state model | N/A | |
| Locations of CpG islands, methylation, and promoters | ||
| Replication timing-specific peak regions | ||
| Common TAD boundaries of five human cell-lines | N/A | |
| TRRUST2.0 | ||
| ENCODE Transcription Factor Targets | ENCODE Project | |
| FusionAI software | This study | |
| FusionAI model training | This study | |
| STAR-Fusion | ||
| Arriba | ||