Lei Deng, Youzhi Liu, Yechuan Shi, Wenhao Zhang, Chun Yang, Hui Liu.
Abstract
BACKGROUND: RNA-binding proteins (RBPs) play a vital role in post-transcriptional processes in all eukaryotes, such as splicing regulation, mRNA transport, and modulation of mRNA translation and decay. Identifying RBP binding sites is a crucial step in understanding the biological mechanisms of post-transcriptional gene regulation. However, determining RBP binding sites on a large scale is challenging because biochemical assays are costly. Many studies have therefore used machine learning methods to predict binding sites. In particular, deep learning is increasingly used in bioinformatics because it can learn generalized representations from DNA and protein sequences.
Keywords: Bidirectional long short term memory network; Binding sites; Convolutional neural network; Deep learning; Distributed representation; RNA-binding proteins; k-mer
Year: 2020 PMID: 33334313 PMCID: PMC7745412 DOI: 10.1186/s12864-020-07239-w
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Computational methods for RBP binding preference prediction
| Method | Sequence | Structure |
|---|---|---|
| RNAcontext | Yes | Yes |
| GraphProt | Yes | Yes |
| iONMF | Yes | Yes |
| Oli | Yes | Yes |
| RNAcommender | Yes | Yes |
| RCK | Yes | Yes |
| DeepBind | Yes | No |
| Deepnet-rbp | Yes | No |
| DanQ | Yes | No |
| iDeepS | Yes | Yes |
| iDeepV | Yes | No |
| Pysster | Yes | Yes |
| DLPRB | Yes | Yes |
“Yes” and “No” indicate whether the computational method uses sequence or structure information to predict binding sites
Fig. 1 Illustrative flowchart of the DeepRKE learning framework. First, we use RNAShapes to predict the RNA secondary structure from the primary sequence. Second, a word embedding algorithm is used to learn distributed representations of 3-mers from the primary sequences and secondary structures. Third, the learned distributed representations are fed into two CNNs (one for the RNA sequence and the other for the secondary structure) to transform the sequence and structure features, which are in turn passed to a CNN module and a bidirectional LSTM layer followed by two fully connected layers. Finally, we use a sigmoid classifier to predict the probability that the input is an RBP binding site
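The 3-mer step in the flowchart can be illustrated with a minimal sketch. It assumes overlapping 3-mers taken with a stride of 1 and an index of 0 reserved for unknown 3-mers; neither choice is stated in the caption, and the function names are hypothetical rather than taken from DeepRKE. The same functions could be applied to a structure string by passing a different alphabet.

```python
from itertools import product


def build_kmer_vocab(alphabet, k=3):
    """Map every possible k-mer over `alphabet` to an integer index.

    Index 0 is reserved for k-mers containing unexpected symbols
    (an assumption of this sketch).
    """
    return {"".join(p): i + 1 for i, p in enumerate(product(alphabet, repeat=k))}


def kmer_tokens(seq, k=3, stride=1):
    """Split a sequence into k-mers with a sliding window (stride is assumed)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def encode(seq, vocab, k=3, stride=1):
    """Turn a sequence into the list of k-mer indices an embedding layer would consume."""
    return [vocab.get(tok, 0) for tok in kmer_tokens(seq, k, stride)]


rna_vocab = build_kmer_vocab("ACGU")   # 4^3 = 64 possible 3-mers
tokens = kmer_tokens("AUGGCUA")        # overlapping 3-mers of a toy sequence
ids = encode("AUGGCUA", rna_vocab)     # integer indices for those 3-mers
```

Each index would then select one row of a learned embedding matrix, so a sequence of length L becomes L - 2 embedding vectors before it reaches the CNN.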
Performance comparison between DeepRKE, GraphProt, deepnet-rbp, DeepBind and iDeepV on RBP-24 dataset
| RBP | #positives | #negatives | GraphProt | deepnet-rbp | DeepBind | iDeepV | DeepRKE |
|---|---|---|---|---|---|---|---|
| ALKBH5 PAR-CLIP | 1213 | 1197 | 0.680 | 0.714 | 0.668 | 0.643 | |
| C17ORF85 PAR-CLIP | 1860 | 1849 | 0.800 | 0.820 | 0.755 | 0.740 | |
| C22ORF28 PAR-CLIP | 9369 | 9136 | 0.751 | 0.792 | 0.809 | 0.823 | |
| CAPRIN1 PAR-CLIP | 8140 | 7901 | 0.855 | 0.834 | 0.824 | 0.869 | |
| Ago2 HITS-CLIP | 48,095 | 44,251 | 0.765 | 0.809 | 0.879 | 0.886 | |
| ELAVL1 HITS-CLIP | 8595 | 8436 | 0.955 | 0.966 | 0.966 | 0.978 | |
| SFRS1 HITS-CLIP | 19,438 | 17,195 | 0.898 | 0.931 | 0.929 | 0.905 | |
| HNRNPC iCLIP | 21,472 | 19,794 | 0.952 | 0.962 | 0.978 | ||
| TDP43 iCLIP | 92,031 | 75,079 | 0.874 | 0.876 | 0.930 | 0.935 | |
| TIA1 iCLIP | 18,049 | 16,135 | 0.861 | 0.891 | 0.929 | 0.941 | |
| TIAL1 iCLIP | 42,332 | 36,652 | 0.833 | 0.870 | 0.922 | 0.929 | |
| Ago1-4 PAR-CLIP | 36,902 | 31,310 | 0.895 | 0.881 | 0.919 | 0.925 | |
| ELAVL1 PAR-CLIP (B) | 9464 | 9283 | 0.935 | 0.961 | 0.961 | 0.962 | |
| ELAVL1 PAR-CLIP (A) | 27,275 | 23,974 | 0.959 | 0.966 | 0.972 | 0.973 | |
| EWSR1 PAR-CLIP | 16,292 | 14,720 | 0.935 | 0.966 | 0.969 | 0.962 | |
| FUS PAR-CLIP | 34,581 | 31,480 | 0.968 | 0.980 | 0.983 | 0.976 | |
| ELAVL1 PAR-CLIP (C) | 125,202 | 113,686 | 0.991 | 0.994 | 0.989 | 0.990 | |
| IGF2BP1-3 PAR-CLIP | 8539 | 6838 | 0.889 | 0.879 | 0.939 | 0.923 | |
| MOV10 PAR-CLIP | 13,793 | 12,987 | 0.863 | 0.854 | 0.899 | 0.896 | |
| PUM2 PAR-CLIP | 9116 | 8227 | 0.954 | 0.964 | 0.965 | 0.965 | |
| QKI PAR-CLIP | 10,276 | 9142 | 0.957 | 0.973 | 0.965 | 0.975 | |
| TAF15 PAR-CLIP | 7298 | 6606 | 0.970 | 0.983 | 0.978 | 0.978 | |
| PTB HITS-CLIP | 44,574 | 43,700 | 0.937 | 0.944 | 0.936 | 0.953 | |
| ZC3H7B PAR-CLIP | 20,962 | 20,018 | 0.820 | 0.796 | 0.875 | 0.883 | 0.914 |
| Mean AUC | | | 0.887 | 0.902 | 0.917 | 0.913 | 0.934 |
Note: boldface marks the best result for each experiment
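For readers who want to see how the per-experiment AUC values and the mean row relate, here is a minimal sketch. It assumes AUC means the area under the ROC curve, computed as the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half. The `auroc` function and the toy labels and scores are illustrative only; the `graphprot` list is copied from the GraphProt column of the table above.

```python
from statistics import mean


def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5).

    Quadratic in the number of examples, so suitable for illustration only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical data): 3 of the 4 positive/negative pairs are ranked correctly.
toy_auc = auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])

# GraphProt column of the table, in row order; its mean reproduces the "Mean AUC" entry.
graphprot = [0.680, 0.800, 0.751, 0.855, 0.765, 0.955, 0.898, 0.952,
             0.874, 0.861, 0.833, 0.895, 0.935, 0.959, 0.935, 0.968,
             0.991, 0.889, 0.863, 0.954, 0.957, 0.970, 0.937, 0.820]
graphprot_mean = round(mean(graphprot), 3)
```

The mean row is an unweighted average over the 24 experiments, so small datasets such as ALKBH5 count as much as large ones such as ELAVL1 PAR-CLIP (C).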
Fig. 2 Performance comparison between DeepRKE, iDeepV, iDeepS, DeepBind and GraphProt on the RBP-31 dataset. All methods are run on the same training and independent test sets across 31 sets of RBPs (x-axis)
Fig. 3 Performance comparison of the models with or without distributed representations of sequences and secondary structure profiles. Performance was evaluated in terms of AUROC on the RBP-24 and RBP-31 datasets. DeepRKE is our proposed model; DeepRKE- omits the RNA secondary structure; and DeepRKE-- omits both the RNA secondary structure and the distributed representation of the sequence, using one-hot encoding instead. a-b Performance comparison between DeepRKE and DeepRKE- on the two datasets. c-d Performance comparison between DeepRKE- and DeepRKE-- on the two datasets. e-f Performance comparison between models with only a CNN layer and models with CNN+BiLSTM layers on the two datasets
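The difference between the two input encodings compared in panels c-d can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the embedding table is filled with random numbers as a stand-in for vectors that would be learned by a word embedding algorithm, and the dimension of 8 and all names are hypothetical.

```python
import random
from itertools import product

NUCLEOTIDES = "ACGU"


def one_hot(seq):
    """Encode each nucleotide as a 4-dimensional indicator vector."""
    return [[1.0 if base == n else 0.0 for n in NUCLEOTIDES] for base in seq]


def make_placeholder_embeddings(k=3, dim=8, seed=0):
    """Random vectors standing in for learned k-mer embeddings (placeholder only)."""
    rng = random.Random(seed)
    return {"".join(p): [rng.uniform(-1, 1) for _ in range(dim)]
            for p in product(NUCLEOTIDES, repeat=k)}


def embed_kmers(seq, table, k=3):
    """Look up one dense vector per overlapping k-mer."""
    return [table[seq[i:i + k]] for i in range(len(seq) - k + 1)]


seq = "AUGGCUA"
sparse = one_hot(seq)                                     # 7 positions x 4 channels
dense = embed_kmers(seq, make_placeholder_embeddings())   # 5 3-mers x 8 dimensions
```

One-hot vectors treat every nucleotide as equally distant from every other, whereas a learned embedding can place 3-mers that occur in similar contexts close together, which is the property the comparison in the figure is probing.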