Wen-Jun Shen, Wenjuan Cui, Danze Chen, Jieming Zhang, Jianzhen Xu.
Abstract
RNA-protein interactions (RPIs) play critical roles in numerous fundamental biological processes, such as post-transcriptional gene regulation, viral assembly, cellular defence and protein synthesis. As the amount of available experimental RNA-protein binding data has increased rapidly owing to high-throughput sequencing methods, it is now possible to measure and understand RNA-protein interactions by computational methods. In this study, we integrate a sequence-based derived kernel with regularized least squares to perform prediction. The derived kernel exploits the contextual information around an amino acid or a nucleic acid as well as the repetitive conserved motif information. We propose a novel machine learning method, called RPiRLS, to predict the interaction between any RNA and protein of known sequences. For the RPiRLS classifier, each protein sequence is composed of up to 20 different amino acids, whereas for the RPiRLS-7G classifier, each protein sequence is represented by a 7-letter reduced alphabet based on physicochemical properties. We evaluated both methods on a number of benchmark data sets and compared their performance with two recently developed state-of-the-art methods, RPI-Pred and IPMiner. On the non-redundant benchmark test sets extracted from the PRIDB, the RPiRLS method outperformed RPI-Pred and IPMiner in terms of accuracy, specificity and sensitivity. Further, RPiRLS achieved an accuracy of 92% on the prediction of lncRNA-protein interactions. The proposed method can also be extended to construct RNA-protein interaction networks. The RPiRLS web server is freely available at http://bmc.med.stu.edu.cn/RPiRLS.
Keywords: Protein-RNA interactions; derived kernel; lncRNA-protein interaction networks; regularized least squares
Year: 2018 PMID: 29495575 PMCID: PMC6017498 DOI: 10.3390/molecules23030540
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
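The abstract's two core ideas, a reduced protein alphabet and kernel regularized least squares (RLS), can be illustrated with a minimal sketch. Several parts are assumptions rather than the authors' implementation: the 7-group mapping is the conjoint-triad grouping often used in sequence-based RPI predictors (the record does not state the paper's exact grouping), a normalized k-mer spectrum kernel is swapped in for the paper's derived kernel, the pair kernel is taken as the product of the protein and RNA kernels, and `kp`, `kr` and `lam` are hypothetical parameters (k-mer lengths and the ridge term), not the template sizes k and l of the tables. The sequences are toy data.

```python
from collections import Counter

# ASSUMPTION: conjoint-triad style grouping; the paper's grouping may differ.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
REDUCE = {aa: str(i) for i, group in enumerate(GROUPS) for aa in group}

def reduce_protein(seq):
    """Map a 20-letter protein sequence onto the 7-letter alphabet."""
    return "".join(REDUCE[aa] for aa in seq)

def spectrum(seq, k):
    """Count all overlapping k-mers of a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(a, b, k):
    """Normalized k-mer spectrum kernel (a stand-in, not the derived kernel)."""
    sa, sb = spectrum(a, k), spectrum(b, k)
    dot = lambda x, y: sum(v * y[key] for key, v in x.items())
    denom = (dot(sa, sa) * dot(sb, sb)) ** 0.5
    return dot(sa, sb) / denom if denom else 0.0

def pair_kernel(x, y, kp=3, kr=2):
    """ASSUMPTION: kernel on (protein, RNA) pairs as a product of kernels."""
    (p1, r1), (p2, r2) = x, y
    return spectrum_kernel(p1, p2, kp) * spectrum_kernel(r1, r2, kr)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def rls_fit(pairs, labels, lam=0.1):
    """Kernel RLS: solve (K + lam * I) c = y for the coefficients c."""
    n = len(pairs)
    K = [[pair_kernel(pairs[i], pairs[j]) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(K, [float(y) for y in labels])

def rls_score(pairs, coef, query):
    """Decision value f(x) = sum_i c_i * K(x_i, x); its sign is the class."""
    return sum(c * pair_kernel(p, query) for c, p in zip(coef, pairs))

# Toy (protein, RNA) pairs with labels +1 (interacting) and -1 (not).
train = [("AAAAKK", "GGGG"), ("DDDDEE", "UUUU")]
coef = rls_fit(train, [1, -1])
print(rls_score(train, coef, train[0]), rls_score(train, coef, train[1]))
```

An RPiRLS-7G-like variant would pass `reduce_protein(seq)` instead of the raw protein sequence to the same kernel; everything else stays unchanged.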
Predictive performance of RPiRLS in terms of the AUC on the RPI2662 training set over varying template sizes.
| Template Sizes | ||||||||
|---|---|---|---|---|---|---|---|---|
| 0.705 | 0.813 | 0.850 | 0.872 | 0.851 | 0.832 | 0.816 | 0.802 | |
| 0.375 | 0.767 | 0.853 | 0.911 | 0.920 | 0.915 | 0.910 | ||
| 0.219 | 0.644 | 0.802 | 0.881 | 0.910 | 0.921 | 0.924 | 0.922 | |
| 0.202 | 0.321 | 0.767 | 0.854 | 0.887 | 0.902 | 0.912 | 0.918 |
The performance of predicting RPIs was evaluated by using 10-fold stratified cross-validation on the RPI2662 data set. Different combinations of parameters k and l were evaluated. Remark on the symbols of template sizes: k stands for template size of amino acid sequences; l stands for template size of nucleic acid sequences. The best AUC in the table is marked in bold.
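The evaluation protocol described above, stratified 10-fold cross-validation scored by AUC, can be sketched in plain Python. This is an illustration under assumptions, not the authors' pipeline: the `fit` and `score` callables are hypothetical placeholders for any classifier, and averaging per-fold AUCs (rather than pooling predictions across folds) is one common choice that the record does not confirm.

```python
import random

def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each sample to a fold so that class ratios are preserved."""
    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            fold_of[i] = pos % n_folds  # deal samples of this class round-robin
    return fold_of

def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cross_validated_auc(samples, labels, fit, score, n_folds=10):
    """Mean AUC over folds; every fold must contain both classes."""
    fold_of = stratified_folds(labels, n_folds)
    aucs = []
    for f in range(n_folds):
        train = [i for i in range(len(labels)) if fold_of[i] != f]
        test = [i for i in range(len(labels)) if fold_of[i] == f]
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        preds = [score(model, samples[i]) for i in test]
        aucs.append(auc([labels[i] for i in test], preds))
    return sum(aucs) / n_folds
```

Repeating `cross_validated_auc` for each candidate (k, l) combination and keeping the best mean score is one way such a parameter grid could be filled in.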
Predictive performance of RPiRLS in terms of the accuracy on the RPI2662 training set over varying template sizes.
| Template Sizes | ||||||||
|---|---|---|---|---|---|---|---|---|
| 0.673 | 0.763 | 0.812 | 0.817 | 0.779 | 0.769 | 0.756 | 0.743 | |
| 0.412 | 0.730 | 0.796 | 0.830 | 0.814 | 0.800 | 0.794 | ||
| 0.261 | 0.646 | 0.731 | 0.784 | 0.811 | 0.823 | 0.821 | 0.815 | |
| 0.243 | 0.317 | 0.702 | 0.747 | 0.785 | 0.804 | 0.816 | 0.824 |
Remark on the symbols of template sizes: k stands for template size of amino acid sequences; l stands for template size of nucleic acid sequences. The best accuracy in the table is marked in bold.
Predictive performance of RPiRLS-7G in terms of the accuracy on the RPI2662 training data set over varying template sizes.
| Template Sizes | ||||||||
|---|---|---|---|---|---|---|---|---|
| 0.663 | 0.737 | 0.776 | 0.766 | 0.733 | 0.656 | 0.605 | 0.578 | |
| 0.644 | 0.746 | 0.792 | 0.803 | 0.783 | 0.770 | 0.760 | 0.752 | |
| 0.433 | 0.755 | 0.796 | 0.816 | 0.795 | 0.791 | 0.782 | ||
| 0.347 | 0.673 | 0.764 | 0.805 | 0.822 | 0.816 | 0.803 | 0.794 | |
| 0.262 | 0.615 | 0.727 | 0.779 | 0.804 | 0.816 | 0.821 | 0.813 | |
| 0.242 | 0.320 | 0.703 | 0.754 | 0.791 | 0.808 | 0.815 | 0.818 |
The performance of predicting RPIs was evaluated by using 10-fold stratified cross-validation on the RPI2662 data set. Different combinations of parameters k and l were evaluated. Remark on the symbols of template sizes: k stands for template size of amino acid sequences; l stands for template size of nucleic acid sequences. The best accuracy in the table is marked in bold.
Comparison of RPiRLS with other methods on the RPI369 data set in predicting RNA-protein interactions with known structures.
| Measurements | RPiRLS | RPiRLS-7G | RPI-Pred | IPMiner |
|---|---|---|---|---|
| Accuracy | 0.85 | 0.79 | 0.49 | 0.50 |
| AUC | 0.92 | 0.90 | - | - |
| Specificity | 0.84 | 0.72 | 0.34 | 0.52 |
| Sensitivity | 0.86 | 0.87 | 0.63 | 0.48 |
Remark: '-' indicates that the AUC score is not available.
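For reference, the accuracy, specificity and sensitivity reported in this comparison follow the standard confusion-matrix definitions. The sketch below assumes binary labels with 1 for an interacting pair and 0 for a non-interacting pair; the function name `binary_metrics` and the example labels are illustrative only.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity from binary labels (1 = positive).

    Assumes both classes occur in y_true, so no denominator is zero.
    """
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "accuracy": (tp + tn) / len(y_true),   # all correct calls
        "sensitivity": tp / (tp + fn),         # true positive rate
        "specificity": tn / (tn + fp),         # true negative rate
    }

# Made-up labels: 3 interacting pairs and 2 non-interacting pairs.
metrics = binary_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
print(metrics)
```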
Comparison of RPiRLS with other methods on the RPI2241 data set in predicting RNA-protein interactions with known structures.
| Measurements | RPiRLS | RPiRLS-7G | RPI-Pred | IPMiner |
|---|---|---|---|---|
| Accuracy | 0.80 | 0.67 | 0.49 | 0.50 |
| AUC | 0.80 | 0.74 | - | - |
| Specificity | 0.82 | 0.58 | 0.38 | 0.20 |
| Sensitivity | 0.79 | 0.76 | 0.61 | 0.79 |
Remark: '-' indicates that the AUC score is not available.
Comparing the accuracy of RPiRLS with other methods in predicting non-coding RNA-protein interactions.
| Organism | # ncRPIs | RPiRLS | RPiRLS-7G | RPI-Pred |
|---|---|---|---|---|
| 36 | 0.92 | 0.61 | 0.92 | |
| 95 | 0.80 | 0.52 | 0.88 | |
| 202 | 0.54 | 0.52 | 0.90 | |
| 8246 | 0.92 | 0.74 | 0.86 | |
| 3669 | 0.91 | 0.80 | 0.94 | |
| 905 | 0.91 | 0.83 | 0.80 | |
| Weighted average | 13,153 | 0.91 | 0.76 | 0.88 |
The weighted average accuracy weights the accuracy for each organism by that organism's share of the total number of RPIs.
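The weighted averages in the last row can be recomputed from the per-organism rows. The short check below uses the counts and accuracies exactly as listed in the ncRPI table; the variable names are illustrative.

```python
# Per-organism ncRPI counts and accuracies, in table order.
counts = [36, 95, 202, 8246, 3669, 905]
accuracy = {
    "RPiRLS":    [0.92, 0.80, 0.54, 0.92, 0.91, 0.91],
    "RPiRLS-7G": [0.61, 0.52, 0.52, 0.74, 0.80, 0.83],
    "RPI-Pred":  [0.92, 0.88, 0.90, 0.86, 0.94, 0.80],
}

def weighted_average(values, weights):
    """Sum of value * weight divided by the total weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

total = sum(counts)  # total number of ncRPIs across organisms
averages = {m: round(weighted_average(a, counts), 2) for m, a in accuracy.items()}
print(total, averages)
```

Because two organisms contribute roughly 90% of the pairs, the weighted average is dominated by their accuracies, which is why a low score on a small organism set barely moves the final figure.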
Comparing the accuracy of RPiRLS with other methods in predicting long non-coding RNA-protein interactions.
| Organism | # lncRPIs | RPiRLS | RPiRLS-7G | RPI-Pred |
|---|---|---|---|---|
| 4 | 1.00 | 0.75 | 0.75 | |
| 61 | 0.87 | 0.69 | 1.00 | |
| 78 | 0.45 | 0.45 | 0.86 | |
| 8039 | 0.93 | 0.74 | 0.86 | |
| 3495 | 0.92 | 0.83 | 0.95 | |
| 437 | 0.94 | 0.90 | 0.87 | |
| Weighted average | 12,114 | 0.92 | 0.77 | 0.89 |
The weighted average accuracy weights the accuracy for each organism by that organism's share of the total number of RPIs.
Figure 1. Comparison of the long non-coding RNA-protein interaction networks predicted by the RPiRLS and RPI-Pred methods for Caenorhabditis elegans. Networks are visualized with Cytoscape v3.4.0. Green elliptical nodes represent lncRNAs and yellow rectangular nodes represent proteins; an edge (solid line) between two nodes indicates an interaction between them. Blue edges indicate true positive interactions and red edges indicate false negative interactions.
Figure 2. Comparison of the long non-coding RNA-protein interaction networks predicted by the RPiRLS and RPI-Pred methods for Drosophila melanogaster.
Figure 3. Histogram of the observed amino acid frequencies of lncRNA-binding proteins for six organisms. Escherichia coli has relatively higher observed frequencies of amino acids A and V, as well as a much lower content of amino acid S, than the other five organisms (highlighted by arrows).
Figure 4. Comparison of the long non-coding RNA-protein interaction networks predicted by the RPiRLS and RPI-Pred methods for Escherichia coli.
Figure 5. Comparison of the long non-coding RNA-protein interaction networks predicted by the RPiRLS and RPI-Pred methods for Saccharomyces cerevisiae.
Figure 6. The workflow of the proposed RPiRLS method.