Yunhe Liu, Qiqing Fu, Xueqing Peng, Chaoyu Zhu, Gang Liu, Lei Liu.
Abstract
Circular RNA (circRNA) is a distinct class of long non-coding RNA (lncRNA) with a closed circular structure, and it has specific roles in transcriptional regulation and multiple other biological processes. Distinguishing circRNA from other lncRNA is necessary for related research. In this study, we designed an attention-based multi-instance learning (MIL) network architecture that takes the raw sequence as input, learns the sparse features of RNA sequences, and performs the circRNA identification task. The model outperformed state-of-the-art models. Moreover, after validating the effectiveness of the attention mechanism on a handwritten digit dataset, we obtained the key sequence loci underlying circRNA recognition from the corresponding attention scores. Motif enrichment analysis then identified some of the key motifs for circRNA formation. In conclusion, we designed a deep learning network architecture suited to learning gene sequences with sparse features and applied it to the circRNA identification task; the model also shows strong representation capability in indicating some key loci.
Keywords: MIL architecture; circRNA; deep learning; non-coding RNA; sequence motif
Year: 2021 PMID: 34946967 PMCID: PMC8701965 DOI: 10.3390/genes12122018
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. Illustration of instance extraction from the full RNA sequence.
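Instance extraction of the kind Figure 1 depicts can be sketched as a sliding window over the sequence, producing a "bag" of fixed-length instances. This is a minimal sketch under stated assumptions: the function name `extract_instances`, the `window` and `stride` defaults, and the padding with `N` are all hypothetical, since the settings actually used are not given in this record.

```python
def extract_instances(sequence, window=70, stride=35):
    """Split an RNA sequence into overlapping fixed-length instances (a bag).

    window and stride are hypothetical values chosen for illustration only.
    Returns a list of (start_position, subsequence) tuples.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    # Normalise case and map RNA 'U' to 'T' so one alphabet is used throughout.
    seq = sequence.upper().replace("U", "T")
    if len(seq) <= window:
        # A short sequence yields a single instance, right-padded with 'N'.
        return [(0, seq.ljust(window, "N"))]
    instances = [
        (start, seq[start:start + window])
        for start in range(0, len(seq) - window + 1, stride)
    ]
    # Add one final window flush with the 3' end so the tail is not dropped.
    last_start = len(seq) - window
    if instances[-1][0] != last_start:
        instances.append((last_start, seq[last_start:]))
    return instances
```

Keeping the start position alongside each instance makes it possible to map per-instance attention scores back to loci on the original transcript.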
Figure 2. Illustration of the attention-based deep encoder MIL model structure (Circ-ATTEN-MIL).
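The core idea of attention-based MIL pooling can be illustrated in a few lines: each encoded instance receives a score, the scores are normalised with a softmax, and the bag representation is the attention-weighted sum of the instances. The sketch below uses one common formulation (a tanh hidden layer followed by a linear scoring vector); it is an assumption for illustration and not necessarily the exact layer used in Circ-ATTEN-MIL. The names `attention_pool`, `V`, and `w` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(instances, V, w):
    """Attention-weighted pooling of a bag of instance feature vectors.

    instances: K feature vectors of length D (encoder outputs, assumed given).
    V: L x D weight matrix and w: length-L vector (hypothetical learned weights).
    Returns (bag_vector, attention_weights).
    """
    scores = []
    for h in instances:
        # Hidden projection tanh(V h), then scalar score w . tanh(V h).
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * zi for wi, zi in zip(w, hidden)))
    weights = softmax(scores)
    dim = len(instances[0])
    # Bag representation: weighted sum of the instance vectors.
    bag = [sum(a * h[d] for a, h in zip(weights, instances)) for d in range(dim)]
    return bag, weights
```

Because the weights sum to one over the bag, they can be read as the relative contribution of each instance, which is what makes the attention scores usable for locating key loci.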
Figure 3. Preparation of the handwritten digit dataset for input to Circ-ATTEN-MIL.
Figure 4. Illustration of the extraction of high-attention sequence splices.
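One plausible way to turn per-instance attention scores into high-attention sequence splices is to keep the instances whose score passes a cutoff and merge windows that overlap or touch. The sketch below assumes exactly that; the function name `high_attention_splices`, the fixed `threshold`, and the merging rule are hypothetical, because the selection rule actually used is not stated in this record.

```python
def high_attention_splices(sequence, starts, weights, window, threshold):
    """Merge high-attention instances into contiguous sequence splices.

    starts: start position of each instance on the sequence.
    weights: attention weight of each instance (same order as starts).
    threshold: hypothetical cutoff; instances with weight >= threshold are kept.
    Returns a list of (start, end, subsequence) with end exclusive.
    """
    selected = sorted(s for s, a in zip(starts, weights) if a >= threshold)
    spans = []
    for start in selected:
        end = min(start + window, len(sequence))
        if spans and start <= spans[-1][1]:
            # Overlapping or adjacent window: extend the current splice.
            spans[-1][1] = max(spans[-1][1], end)
        else:
            spans.append([start, end])
    return [(s, e, sequence[s:e]) for s, e in spans]
```

The resulting splices are the kind of input a motif enrichment tool would take, with the coordinates retained so that enriched motifs can be traced back to their transcript positions.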
Figure 5. Comparison of simple sequence features between the two sequence sets: (a) sequence length distribution; (b) sequence composition.
Figure 6. Training process (a) and ROC curve (b).
Evaluation results for the classification task.
| Dataset | Accuracy | Sensitivity | Specificity | Precision | MCC | F1 |
|---|---|---|---|---|---|---|
| Train | 0.9552 | 0.9662 | 0.9547 | 0.9713 | 0.9194 | 0.9687 |
| Validation | 0.9333 | 0.9485 | 0.9092 | 0.9433 | 0.8291 | 0.9459 |
| Test | 0.9284 | 0.9396 | 0.9039 | 0.9393 | 0.8435 | 0.9394 |
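The six metrics in the table above follow from the four confusion-matrix counts by their standard definitions. The sketch below computes them; the function names are hypothetical and the counts used in any call are illustrative, since the underlying confusion matrices are not given in this record.

```python
import math

def _ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def f1_score(precision, sensitivity):
    """F1 is the harmonic mean of precision and sensitivity (recall)."""
    return _ratio(2 * precision * sensitivity, precision + sensitivity)

def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = _ratio(tp, tp + fn)   # true positive rate
    specificity = _ratio(tn, tn + fp)   # true negative rate
    precision = _ratio(tp, tp + fp)
    # Matthews correlation coefficient, defined as 0 when any margin is empty.
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
        "f1": f1_score(precision, sensitivity),
    }
```

As a consistency check on the reported values, `f1_score(0.9393, 0.9396)` evaluates to approximately 0.939, in line with the F1 listed in the Test row.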
Comparison with other models.
| Model | ACC | MCC | F1 Score |
|---|---|---|---|
| PredcircRNA | 0.8056 | 0.6113 | 0.8108 |
| ACNN-BLSTM | 0.8942 | 0.7756 | 0.9149 |
| Circ-ATTEN-MIL | 0.9284 | 0.8435 | 0.9394 |
| CircDeep (fusion) | 0.9327 | 0.8536 | 0.9304 |
| Fusion model | 0.9434 | 0.8796 | 0.9546 |
Figure 7. Attention scores for identifying the determining digits. (a) Single determinant (model 1); (b) multiple identical determinants (model 1); (c) multiple different determinants (model 2). (Left panel: attention score bars; right panel: correctly identified, wrongly identified, and missed events.)
Figure 8. Distribution of high-attention sequences. (a) Length distribution (upper) and distribution of the number of attention sequences per transcript (lower); (b) extraction of attention sequences for motif enrichment; (c) density distribution of attention loci across all sequences.
Motifs enriched from the sequences.
| Motif sequence | E-value | Predicted function (uppercase in the sequence: target loci) |
|---|---|---|
| tTGTTATACGAGGGATC | 2.3 × 10⁻⁹ | KR superfamily (autonomous structural domains): Kringle domains are believed to play a role in binding mediators. (Source: NCBI) |
| ggactcttcatgacAGGC | 2.1 × 10⁻³ | SPI1 target sequence: may bind RNA and modulate splicing of immature RNA (by similarity). (Source: JASPAR) |
| yssmccTCCWGGYCC | 6.7 × 10⁴ | PLN02915 superfamily: catalytic subunit. (Source: NCBI) Contains ETS1 target sequence. (Source: JASPAR) |
| tCCAAGAAACAAAAT | 2.0 × 10⁵ | The actual alignment was detected with superfamily member pfam01267. (Source: NCBI) |
| gaagatcaggtcttaATTA | 1.1 × 10⁶ | Contains multiple estrogen receptor (ESR1; ESR2) and estrogen-related receptor (ESRRA) target sequences. (Source: JASPAR and GeneCardsSuite) |
| CGGCCCCGGGG | 2.1 × 10⁸ | TFAP2E target sequence: may bind to the consensus sequence 5’-GCCNNNGGC-3’. (Source: JASPAR) |
| agtgacaGCAGTTAT | 2.8 × 10⁷ | Contains multiple homeobox-related factor (A6, A4, B6, C10, C8, D8, B9, B8, B6, B3, A5, A7, A9, B4, C4, A6) target sequences. (Source: JASPAR and GeneCardsSuite) |