Cheng Peng, Siyu Han, Hui Zhang, Ying Li.
Abstract
Non-coding RNAs (ncRNAs) play crucial roles in multiple fundamental biological processes, such as post-transcriptional gene regulation, and are implicated in many complex human diseases. Most ncRNAs function by interacting with corresponding RNA-binding proteins, so research on ncRNA–protein interaction is key to understanding ncRNA function. However, the experimental techniques for identifying RNA–protein interactions (RPIs) are still expensive and time-consuming. Because of the complex molecular mechanism of ncRNA–protein interaction and the lack of conservation of ncRNAs, especially long ncRNAs (lncRNAs), predicting ncRNA–protein interaction remains a challenge. Deep learning-based models have become the state of the art in a range of biological sequence analysis problems owing to their strong feature-learning ability. In this study, we propose RPITER, a hierarchical deep learning framework for predicting RNA–protein interaction. For sequence coding, we improved the conjoint triad feature (CTF) coding method by adding more primary sequence information and sequence structure information. For model design, RPITER employs two basic neural network architectures: the convolutional neural network (CNN) and the stacked auto-encoder (SAE). Comprehensive experiments were performed on five benchmark datasets from the PDB and NPInter databases to analyze and compare the performance of different sequence coding methods and prediction models. We found that the CNN and SAE architectures have powerful fitting abilities for the k-mer features of RNA and protein sequences. The improved CTF coding method showed a performance gain over the original CTF method. Moreover, RPITER performed well in predicting RPIs and could outperform most previous methods. On five widely used RPI datasets, RPI369, RPI488, RPI1807, RPI2241 and NPInter, RPITER obtained AUC values of 0.821, 0.911, 0.990, 0.957 and 0.985, respectively. RPITER could be a complementary method for predicting RPIs and constructing RPI networks, which would help advance biological research on ncRNAs and lncRNAs.
Keywords: CNN; deep learning; ncRNA; ncRNA–protein interaction prediction
Year: 2019 PMID: 30832218 PMCID: PMC6429152 DOI: 10.3390/ijms20051070
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. The flowchart of RPITER. The input of RPITER involves two sequence coding parts. The architecture of RPITER consists of four basic modules and an ensemble module. Dense-N means a fully-connected layer with N neurons, while Conv-M indicates a convolution layer with M filters.
Performance comparison between Conjoint Triad Feature (CTF) and our two improved coding methods on dataset RPI2241 by five-fold cross validation.
| Dataset | Coding Method | | | | | | |
|---|---|---|---|---|---|---|---|
| RPI2241 | CTF | 0.848 | 0.826 | 0.869 | 0.864 | 0.697 | 0.929 |
| | Improved CTF | 0.852 | 0.833 | 0.872 | 0.867 | 0.705 | 0.934 |
| | Improved Struct CTF | 0.852 | 0.834 | 0.870 | 0.865 | 0.704 | 0.931 |
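The conjoint triad idea behind the coding methods compared above can be illustrated with a short sketch. It assumes the seven-class amino-acid grouping commonly used for CTF and a simple frequency normalization; neither detail is stated in this record, and the function name `ctf_protein` is hypothetical. It is not the RPITER implementation.

```python
from itertools import product

# Seven amino-acid classes commonly used for the conjoint triad.
# Assumption: this grouping is not given in this record.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_GROUP = {aa: str(i) for i, group in enumerate(GROUPS) for aa in group}


def ctf_protein(seq, k=3):
    """Return normalized k-mer frequencies over the reduced 7-letter alphabet."""
    # Map each residue to its class label; unknown residues are skipped.
    reduced = "".join(AA_TO_GROUP[aa] for aa in seq.upper() if aa in AA_TO_GROUP)
    kmers = ["".join(p) for p in product("0123456", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    # Slide a window of width k over the reduced sequence.
    for i in range(len(reduced) - k + 1):
        counts[reduced[i:i + k]] += 1
    total = sum(counts.values())
    # Assumption: plain frequency normalization; other schemes exist.
    return [counts[m] / total if total else 0.0 for m in kmers]


vec = ctf_protein("MKVLAAGIRDEC")
print(len(vec))  # 343, i.e. 7**3 possible triads
```

With `k=3` the vector has one entry per triad of classes, which is why the reduced alphabet keeps the feature length manageable compared with counting triads over all 20 amino acids.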
Figure 2. Comparison between different sequence coding methods on dataset RPI2241 by five-fold cross validation. (a) CTF, Improved CTF and Improved Struct CTF; (b) One hot, word2vec and doc2vec.
Figure 3. Performance comparison among different basic prediction models on datasets RPI1807, RPI2241 and NPInter by five-fold cross validation.
Performance comparison between convolution neural network (CNN), stacked auto-encoder (SAE), random forest (RF), and support vector machine (SVM) on dataset NPInter by five-fold cross validation.
| Dataset | Models | | | | | | |
|---|---|---|---|---|---|---|---|
| NPInter | CNN | 0.953 | 0.974 | 0.932 | 0.935 | 0.907 | 0.984 |
| | SAE | 0.941 | 0.967 | 0.915 | 0.920 | 0.884 | 0.982 |
| | RF | 0.943 | 0.945 | 0.941 | 0.941 | 0.886 | 0.943 |
| | SVM | 0.933 | 0.940 | 0.925 | 0.926 | 0.866 | 0.933 |
Figure 4. Performance comparison among different prediction modules of RPITER.
Figure 5. Performance comparison among different RNA–protein interaction (RPI) prediction methods on datasets RPI369, RPI488, RPI1807, RPI2241 and NPInter.
Performance comparison of different RNA–protein interaction (RPI) prediction methods on five benchmark datasets by five-fold cross validation.
| Dataset | Method | | | | | | |
|---|---|---|---|---|---|---|---|
| RPI369 | RPITER | | | 0.659 | 0.701 | | |
| | IPMiner | 0.700 | 0.784 | 0.560 | 0.840 | 0.428 | 0.700 |
| | RPISeq-RF | 0.713 | 0.716 | | | 0.426 | 0.713 |
| | lncPro | 0.502 | 0.237 | 0.771 | 0.512 | 0.009 | 0.468 |
| RPI488 | RPITER | | 0.839 | | 0.943 | | 0.911 |
| | IPMiner | | | 0.835 | | | 0.893 |
| | RPISeq-RF | 0.883 | 0.928 | 0.831 | 0.935 | 0.771 | 0.883 |
| | lncPro | 0.856 | 0.770 | | 0.940 | 0.725 | |
| RPI1807 | RPITER | 0.968 | | 0.946 | 0.959 | 0.936 | |
| | IPMiner | 0.968 | 0.965 | | 0.955 | 0.935 | 0.966 |
| | RPISeq-RF | | 0.970 | 0.976 | | | 0.969 |
| | lncPro | 0.472 | 0.445 | 0.506 | 0.532 | -0.049 | 0.506 |
| RPI2241 | RPITER | | | | 0.871 | | |
| | IPMiner | 0.861 | 0.877 | 0.841 | | 0.724 | 0.861 |
| | RPISeq-RF | 0.851 | 0.861 | 0.838 | 0.863 | 0.702 | 0.851 |
| | lncPro | 0.606 | 0.518 | 0.695 | 0.632 | 0.216 | 0.644 |
| NPInter | RPITER | 0.955 | | 0.937 | 0.939 | 0.910 | |
| | IPMiner | | 0.956 | | | | 0.957 |
| | RPISeq-RF | 0.943 | 0.937 | 0.949 | 0.936 | 0.885 | 0.943 |
| | lncPro | 0.508 | 0.739 | 0.276 | 0.505 | 0.017 | 0.517 |
The boldface indicates the highest metric performance among the compared methods on specific dataset.
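The AUC values quoted in the abstract can be read as the probability that a randomly chosen interacting pair receives a higher score than a randomly chosen non-interacting pair. The sketch below computes AUC directly from that definition; the function name `auc` and the example scores are made up for illustration and are not data from the study.

```python
def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) score pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical classifier scores, not results from the paper.
example = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(example)  # 8 of the 9 pairs are ordered correctly
```

This pairwise form is quadratic in the number of samples, so it suits small illustrations; rank-based formulations give the same value more efficiently on large datasets.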
The five benchmark RPI datasets used in this study.
| Dataset | Interaction Pairs | Non-Interaction Pairs | RNAs | Proteins | Reference |
|---|---|---|---|---|---|
| RPI369 | 369 | 0 | 332 | 338 | [ |
| RPI488 | 243 | 245 | 25 | 247 | [ |
| RPI1807 | 1807 | 1436 | 1078 | 3131 | [ |
| RPI2241 | 2241 | 0 | 841 | 2042 | [ |
| NPInter | 10,412 | 0 | 4636 | 449 | [ |
RPI369, RPI2241 and NPInter lack non-interaction pairs to serve as negative training samples. We therefore randomly paired the RNAs and proteins from the positive interaction samples, discarded any pair that already exists as a known interaction, and generated the same number of negative samples as positive samples to construct balanced training datasets [21,26].
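The negative-sampling step described above can be sketched as follows. This is a minimal illustration under assumptions: pairs are represented as `(rna_id, protein_id)` tuples, and the function name `sample_negatives`, the `seed` parameter and the toy identifiers are hypothetical rather than taken from the RPITER code.

```python
import random


def sample_negatives(positive_pairs, seed=0):
    """Randomly pair RNAs and proteins from the positives, skipping known interactions."""
    rng = random.Random(seed)
    positives = set(positive_pairs)
    rnas = sorted({rna for rna, _ in positives})
    proteins = sorted({protein for _, protein in positives})
    # Cap the target so the loop ends even if few non-interacting pairs exist.
    max_negatives = len(rnas) * len(proteins) - len(positives)
    target = min(len(positives), max_negatives)
    negatives = set()
    while len(negatives) < target:
        pair = (rng.choice(rnas), rng.choice(proteins))
        if pair not in positives:  # discard pairs that are known interactions
            negatives.add(pair)
    return sorted(negatives)


# Toy identifiers, not entries from any benchmark dataset.
positives = [("r1", "p1"), ("r2", "p2"), ("r3", "p3")]
negatives = sample_negatives(positives)
print(len(negatives))  # 3, matching the number of positive pairs
```

Pairs produced this way are only presumed non-interacting, since an unobserved pair may still interact in reality.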
The sequence coding lengths for protein and RNA of different encoding methods.
| Sequence | CTF | Improved CTF | Improved Struct CTF |
|---|---|---|---|
| RNA | | | |
| Protein | | | |