Zhen Shen1, Yan Ling Shao1, Wei Liu1, Qinhu Zhang2,3, Lin Yuan4.
Abstract
BACKGROUND: Circular RNAs (circRNAs) play critical roles in gene expression regulation and disease development. Understanding how circRNA formation is regulated can help reveal the role of circRNAs in these biological processes. Back-splicing is important for circRNA formation, so predicting back-splicing sites helps explain how circRNAs form. Several methods have been proposed for back-splicing site prediction or circRNA-related prediction tasks, but their performance was constrained by a limited ability to learn and use features.
Keywords: Back-splicing sites prediction; Batch normalization; CircRNA; Convolutional neural networks; Deep learning
Year: 2022 PMID: 35962324 PMCID: PMC9373444 DOI: 10.1186/s12864-022-08820-1
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 4.547
Model parameter
| Layer | Parameter |
|---|---|
| Conv1 | Kernel number: 256, Kernel size: 10, Padding mode: Valid, Stride window: 1 |
| Conv2 | Kernel number: 128, Kernel size: 20, Padding mode: Same, Stride window: 1 |
| Drop1 | 0.2, 0.5 |
| MP | (5, 5) |
| Drop2 | 0.2, 0.5 |
Conv denotes a convolution layer, Drop a dropout layer, and MP a max pooling layer
Comparison of CircCNN and other baseline models in cross-validation
| Model | Human AUC | Human ACC | Mouse AUC | Mouse ACC | Fruit Fly AUC | Fruit Fly ACC |
|---|---|---|---|---|---|---|
| Model① | 0.8614 | 0.8019 | 0.8347 | 0.7669 | 0.8518 | 0.7755 |
| Model② | 0.7245 | 0.6744 | 0.7715 | 0.7054 | 0.7716 | 0.703 |
| Model③ | 0.8393 | 0.7793 | 0.82 | 0.7525 | 0.8415 | 0.7593 |
| Model④ | 0.8334 | 0.762 | 0.7915 | 0.7188 | 0.8231 | 0.7398 |
| Model⑤ | 0.7117 | 0.6647 | 0.7242 | 0.6637 | 0.743 | 0.676 |
| DeepCirCode | 0.8827 | 0.8232 | 0.8391 | 0.7653 | 0.8611 | 0.7796 |
| CircCNN (CVLD) | 0.9026 | 0.8348 | 0.8431 | 0.7572 | 0.8704 | 0.7807 |
| CircCNN | ||||||
CVLD indicates that the cross-validation strategy used to train CircCNN is the same as that of DeepCirCode
Fig. 1 Best performance of CircCNN and other baseline models
CircCNN with BN outperforms other modified CircCNN variants
| Species | Model | AUC | ACC | MCC | Sens | Spec |
|---|---|---|---|---|---|---|
| Human | CircCNN (No BN) | 0.8963 | 0.8317 | 0.6613 | 0.8871 | |
| Human | CircCNN (BN → Dropout) | 0.8979 | 0.8325 | 0.6632 | 0.8891 | 0.7655 |
| Human | CircCNN | | | | | 0.7562 |
| Mouse | CircCNN (No BN) | 0.8401 | 0.7629 | 0.5275 | 0.8003 | |
| Mouse | CircCNN (BN → Dropout) | 0.8407 | 0.7623 | 0.5263 | 0.8001 | 0.7246 |
| Mouse | CircCNN | | | | | 0.6797 |
| Fruit Fly | CircCNN (No BN) | 0.858 | 0.773 | 0.5483 | 0.8058 | |
| Fruit Fly | CircCNN (BN → Dropout) | 0.86 | 0.7753 | 0.5527 | 0.8165 | 0.7341 |
| Fruit Fly | CircCNN | | | | | 0.7365 |
Each number is the average value of the metric for the model in cross-validation
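The column headers AUC, ACC, MCC, Sens and Spec are not defined on this page. The sketch below assumes the standard definitions for binary classification; the function names and the example counts and scores are hypothetical and are not taken from the paper.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Return ACC, MCC, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total
    sens = tp / (tp + fn) if (tp + fn) else 0.0   # true positive rate
    spec = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "MCC": mcc, "Sens": sens, "Spec": spec}

def auc(labels, scores):
    """Rank-based AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical counts and scores, for illustration only
m = binary_metrics(tp=89, fp=24, tn=76, fn=11)
a = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(m, a)
```

With these definitions a model can gain sensitivity while losing specificity, which is why the table reports both alongside ACC and MCC.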
Motifs found by CircCNN in three species that match known motifs in three databases, as reported by TOMTOM
| Species | Input | FilterID | Motif found by CircCNN | Known motif in database | Known motif sequence | Gene Annotation | E-value |
|---|---|---|---|---|---|---|---|
| Human | Input1 | filter36 | UCUCUUUUUG | RNCMPT00012 | CUUUUUU | CPEB2 | 0.0205 |
| Human | Input2 | filter18 | CCAUUUUCUU | RNCMPT00269 | ACUUUCU | PTBP1 | 0.0133 |
| Mouse | Input1 | filter0 | | | | | |
| Mouse | Input1 | filter41 | ACAAUUCCCG | RNCMPT00239 | CCUUUCCC | PCBP1 | 0.0498 |
| Mouse | Input2 | filter160 | UGUAUGAGGA | RNCMPT00051 | GUGUGUG | RBM38 | 0.0673 |
| Mouse | Input2 | filter160 | UGUAUGAGGA | RNCMPT00062 | UAAAAGG | KHDRBS1 | 0.0972 |
| Fly | Input1 | filter32 | GUUGGGUUUA | RNCMPT00120 | UUUAGUU | FNE | 0.0536 |
| Fly | Input2 | filter46 | UAAUAAACUU | RNCMPT00142 | AUAAUAA | QKR58E-1 | 0.0377 |
Fig. 2 Sequence logos of the matched motifs in three species. From top to bottom, the motif logos of the three species are shown; the two sides of the red line show the motifs of the input1 module and the input2 module, respectively. The gene name is shown above each motif logo
Association between motif, gene and disease
| FilterID | Motif found by CircCNN | Known motif in database | Known motif sequence | Gene Annotation | Disease |
|---|---|---|---|---|---|
| filter188 | UAUCUUUUUA | RNCMPT00025 | AUUUUUU | HNRNPC | Breast Cancer |
| filter16 | AUUUAUUUUA | RNCMPT00032 | UUAUUUU | HUR | Gastric Cancer |
| filter169 | UAGACACACA | RNCMPT00027 | ACACACA | HNRNPL | Prostate Cancer |
| filter209 | AACAAACAGG | RNCMPT00047 | ACUAACA | QKI | Lung Cancer |
| filter28 | UUUUUUCCGA | RNCMPT00165 | UUUUUUC | TIA1 | Colorectal Cancer |
| filter162 | GACCCAUCCA | RNCMPT00026 | CCAACCC | HNRNPK | Gastric Cancer |
| filter34 | AGACUUUUUC | RNCMPT00268 | CUUUUCU | PTBP1 | Pancreatic Cancer |
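To see how closely a discovered 10-mer resembles its matched known motif, the consensus strings in the table can be compared directly. The sketch below is a simple ungapped string comparison, not the TOMTOM procedure (which compares position weight matrices and reports E-values); the function name `best_ungapped_match` is hypothetical.

```python
def best_ungapped_match(found, known):
    """Slide `known` along `found`; return (offset, matches) for the best placement."""
    if len(known) > len(found):
        raise ValueError("known motif must not be longer than the found motif")
    best = (0, -1)
    for offset in range(len(found) - len(known) + 1):
        window = found[offset:offset + len(known)]
        matches = sum(a == b for a, b in zip(window, known))
        if matches > best[1]:          # keep the first placement with the most matches
            best = (offset, matches)
    return best

# Consensus strings copied from the table above
pairs = {
    "filter169": ("UAGACACACA", "ACACACA"),   # HNRNPL
    "filter28": ("UUUUUUCCGA", "UUUUUUC"),    # TIA1
    "filter188": ("UAUCUUUUUA", "AUUUUUU"),   # HNRNPC
}
for name, (found, known) in pairs.items():
    offset, matches = best_ungapped_match(found, known)
    print(f"{name}: offset={offset}, matches={matches}/{len(known)}")
```

For filter169 and filter28 the known motif sequence occurs verbatim inside the discovered motif, while filter188 differs from its known motif at one position in the best placement.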
Fig. 3 Distributions of RNA motifs found by CircCNN in the positive and negative samples. The two red-bordered squares mark the exon-enriched motif and its distribution; the purple-bordered squares mark the intron-enriched motif and its distributions. In each motif distribution plot, the red line marks the splice acceptor site or splice donor site, and the blue and orange lines represent positive and negative samples, respectively. In plots A and B, the intron lies to the left of the red line and the exon to the right. In plot C, the exon lies to the left of the red line and the intron to the right
Several RNA motifs shared between human, mouse, and fruit fly
| Input | FilterID | Human motif | Mouse motif | Fruit Fly motif |
|---|---|---|---|---|
| Input1 | filter105 | UAAUUAAGAA | AAGAUAAGUC | UAAGAGAGAU |
| Input1 | filter118 | ACUUUCUCAC | UGUUCCCUAC | UCUGUCUCAU |
| Input1 | filter167 | CCCUGGAUUA | CCAUUCAUCU | GUCAGUUUUA |
| Input1 | filter206 | AGUCUAUCUC | UGUUAAUGAC | UGUGACUGUC |
| Input2 | filter120 | AAAAAUUCCA | GAUGUCUCCA | AUAAACGUCA |
Fig. 4 Sequence logos of several RNA motifs shared among the three species. Here, three filters are intron-enriched, exon-enriched and exon-enriched, respectively, across the three species. Filter206 (Input1) is an exon-enriched motif in human and an intron-enriched motif in mouse and fruit fly. Filter120 (Input2) is an intron-enriched motif in human and an exon-enriched motif in mouse and fruit fly
Fig. 5 Workflow of CircCNN
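The workflow starts from two sequence windows, which the layer output table lists as input_1 and input_2 with 100 positions and 4 channels each. A minimal sketch of how such an input could be built by one-hot encoding is given below. The channel order A, C, G, U, the all-zero row for unknown bases, and the function name `one_hot` are assumptions; the source does not state how sequences are encoded.

```python
ALPHABET = "ACGU"  # assumed channel order; not stated in the source

def one_hot(seq, alphabet=ALPHABET):
    """Encode a sequence as a list of 4-element rows; unknown bases become all zeros."""
    seq = seq.upper().replace("T", "U")   # accept DNA-style input as well
    index = {base: i for i, base in enumerate(alphabet)}
    rows = []
    for base in seq:
        row = [0] * len(alphabet)
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows

window = "ACGU" * 25          # hypothetical 100-nt window around a splice site
encoded = one_hot(window)
print(len(encoded), len(encoded[0]))   # rows x channels
```

Stacking such matrices over a batch would give an array of shape (batch, 100, 4), matching the "None" batch dimension used in the table.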
The data output shape of each layer in CircCNN
| Type | SA Input: Layer | SA Input: Output Shape | SD Input: Layer | SD Input: Output Shape |
|---|---|---|---|---|
| Input Layer | input_1 | (None, 100, 4) | input_2 | (None, 100, 4) |
| Conv1D | conv1 | (None, 89, 256) | conv3 | (None, 89, 256) |
| Dropout | dropout_1 | (None, 89, 256) | dropout_4 | (None, 89, 256) |
| Conv1D | conv2 | (None, 45, 128) | conv4 | (None, 45, 128) |
| Dropout | dropout_2 | (None, 45, 128) | dropout_5 | (None, 45, 128) |
| MaxPooling1D | max_pooling1d_1 | (None, 9, 128) | max_pooling1d_2 | (None, 9, 128) |
| Dropout | dropout_3 | (None, 9, 128) | dropout_6 | (None, 9, 128) |
| Flatten | flatten_1 | (None, 1152) | flatten_2 | (None, 1152) |
| Concatenate | cvout | (None, 2304) | | |
| Batch Normalization | batch_normalization_1 | (None, 2304) | | |
“None” represents batch size
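The output lengths in this table can be checked with the usual length formulas for 1D convolution and pooling. The sketch below assumes Keras-style "valid" and "same" padding and non-padded pooling; the function names are hypothetical. Under those assumptions, the reported lengths 89, 45 and 9 are reproduced by a first kernel of size 12 and a stride of 2 in the second convolution, whereas the kernel size 10 and stride 1 listed in the model parameter table would give 91, 91 and 18. Which setting the trained model used cannot be determined from this page.

```python
import math

def conv1d_len(length, kernel, stride=1, padding="valid"):
    """Output length of a 1D convolution under Keras-style padding rules."""
    if padding == "valid":
        return (length - kernel) // stride + 1
    if padding == "same":
        return math.ceil(length / stride)
    raise ValueError(f"unknown padding mode: {padding}")

def pool1d_len(length, pool, stride=None):
    """Output length of non-padded 1D max pooling (stride defaults to pool size)."""
    stride = pool if stride is None else stride
    return (length - pool) // stride + 1

def branch_lengths(length, k1, k2, s2, pool):
    """Lengths after conv1 (valid, stride 1), conv2 (same, stride s2) and pooling."""
    l1 = conv1d_len(length, k1, stride=1, padding="valid")
    l2 = conv1d_len(l1, k2, stride=s2, padding="same")
    l3 = pool1d_len(l2, pool)
    return l1, l2, l3

# Settings that reproduce the reported lengths (an inference, not stated in the source)
reported = branch_lengths(100, k1=12, k2=20, s2=2, pool=5)
# Settings as listed in the model parameter table
listed = branch_lengths(100, k1=10, k2=20, s2=1, pool=5)

flat = reported[2] * 128      # 128 kernels in the second convolution
concat = 2 * flat             # the SA and SD branches are concatenated
print(reported, flat, concat)
print(listed)
```

The flattened size 9 × 128 = 1152 per branch and the concatenated size 2304 agree with the Flatten and Concatenate rows of the table.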
Details about experimental data
| Class | Source: Back-splicing sites | Source: circRNA datasets |
|---|---|---|
| Human | GRCh37, GTF | circRNADb (Ref45), circBase (Ref46) |
| Mouse | GRCm38, GTF | Ref47 |
| Fruit Fly | BDGP5.4, GTF | Ref48 |