Kui Wang, Gang Hu, Zhonghua Wu, Hong Su, Jianyi Yang, Lukasz Kurgan.
Abstract
With close to 30 sequence-based predictors of RNA-binding residues (RBRs), this comparative survey aims to help with understanding and selection of the appropriate tools. We discuss past reviews on this topic, survey a comprehensive collection of predictors, and comparatively assess six representative methods. We provide a novel and well-designed benchmark dataset, and we are the first to report and compare protein-level and dataset-level results and to contextualize performance to specific types of RNAs. The methods considered here are well-cited and rely on machine learning algorithms, on occasion combined with homology-based prediction. Empirical tests reveal that they provide relatively accurate predictions. Virtually all methods perform well for the proteins that interact with rRNAs, some generate accurate predictions for mRNAs, snRNA, SRP and IRES, while proteins that bind tRNAs are predicted poorly. Moreover, except for DRNApred, they confuse DNA-binding and RNA-binding residues. None of the six methods consistently outperforms the others when tested on individual proteins. This variable and complementary protein-level performance suggests that users should not rely on applying just the single best dataset-level predictor. We recommend that future work should focus on the development of approaches that facilitate protein-level selection of accurate predictors and the consensus-based prediction of RBRs.
Keywords: RNA-binding residues; benchmark; messenger RNA; predictive performance; protein-DNA interactions; protein-RNA interactions; ribosomal RNA; signal recognition particle; small nuclear RNA; transfer RNA
Year: 2020 PMID: 32961749 PMCID: PMC7554811 DOI: 10.3390/ijms21186879
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Surveys of the sequence-based predictors of RBRs. While some of these surveys cover the structure-based methods and methods that consider protein-DNA interactions, we specifically focus on their coverage of the sequence-based predictors of RBRs.
| Ref. | Year Released | No. of Predictors Surveyed | No. of Predictors Assessed Empirically | Evaluates or Analyzes: Cross-Prediction between RNA and DNA | Evaluates or Analyzes: Specific Types of RNAs | Evaluates or Analyzes: Protein-Level Performance and Complementarity | Evaluates or Analyzes: Dependence on Sequence Similarity for Homology-Based Predictions |
|---|---|---|---|---|---|---|---|
| This article | 2020 | 28 | 6 | Yes | Yes | Yes | Yes |
| [ | 2019 | 9 | 6 | No | No | No | No |
| [ | 2019 | 18 | 4 | Yes | No | No | No |
| [ | 2016 | 16 | 3 | Yes | No | No | No |
| [ | 2015 | 17 | 8 | Yes | No | No | No |
| [ | 2013 | 10 | 8 | Yes | No | No | No |
| [ | 2012 | 13 | 3 | No | No | No | No |
| [ | 2012 | 7 | 7 | No | No | No | No |
Partner-agnostic sequence-based predictors of RBRs.
| Ref. | Name | Year Published | Model Type | Citations (Total) | Citations (Annual) | Impact Factor | Availability | Webpage | Webserver Available at the Time of Analysis |
|---|---|---|---|---|---|---|---|---|---|
| [ | CNN model | 2019 | Convolutional NN | 0 | 0 | N/A | N | N/A | N/A |
| [ | iDeepE | 2018 | Convolutional NN | 47 | 23 | 4.5 | S | | N/A |
| [ | PredRBR | 2017 | Gradient boosted DT | 28 | 13 | 2.5 | S | | N/A |
| [ | DORAEMON | 2017 | Bayesian classifier | 5 | 2 | 1.9 | S | | N/A |
| [ | RNAProSite | 2016 | RF | 14 | 3 | 2.5 | W | | no |
| [ | SNBRFinder | 2015 | SVM+HT | 13 | 3 | 2.8 | W | | no |
| [ | SRCpred | 2011 | Feedforward NN | 34 | 4 | 2.5 | W | | no |
| [ | PredictRBP | 2011 | SVM | 33 | 4 | 2.5 | S | | N/A |
| [ | SVM model | 2011 | SVM | 24 | 3 | 2.5 | N | N/A | N/A |
| [ | PRBR | 2011 | RF | 62 | 7 | 2.5 | W | | no |
| [ | SPOT-Seq-RNA | 2011 | HT | 52 | 6 | 5.5 | W | | no |
| [ | NAPS | 2010 | DT | 64 | 6 | 11.2 | W | | no |
| [ | RBRpred | 2010 | SVM | 52 | 5 | 1.9 | N | N/A | N/A |
| [ | PRNA | 2010 | RF | 134 | 13 | 4.5 | S | | N/A |
| [ | PiRaNhA | 2010 | SVM | 69 | 7 | 11.2 | W | | no |
| [ | ProteRNA | 2010 | SVM | 22 | 2 | 3.5 | N | N/A | N/A |
| [ | Pprint | 2008 | SVM | 247 | 21 | 2.5 | W | | yes |
| [ | PRINTR | 2008 | SVM | 71 | 6 | 2.5 | W | | no |
| [ | RNAProB | 2008 | SVM | 119 | 10 | 2.5 | N | N/A | N/A |
| [ | RNABindR | 2007 | Naive Bayes | 198 | 15 | 11.2 | W | | no |
| [ | BindN | 2006 | SVM | 416 | 30 | 11.2 | W | | no |
| [ | NN model | 2004 | Feedforward NN | 79 | 5 | N/A | N | N/A | N/A |
We describe the type of the model, which includes neural networks (NNs), random forest (RF), support vector machine (SVM), decision tree (DT), and homology transfer (HT). The citations were collected from Google Scholar on 9 June 2020. The most recent impact factor was obtained from Clarivate Analytics in June 2020; the impact factor is not available (N/A) for the two methods that were published in conference proceedings. The availability is encoded as W, S and N if a webserver, only standalone software, and neither a webserver nor code are available, respectively. Methods shown in bold font are used in the empirical comparative analysis performed in this survey.
Predictive performance of the six partner-agnostic sequence-based predictors of RBRs on the benchmark dataset.
| RNA Type | Predictor | AUC | AULCratio | MCC | F1 | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|
| All RBRs | RNABindRPlus | | | | | | |
| All RBRs | aaRNA | 0.848= | 17.8+ | 0.344+ | 0.370 | 0.370 | 0.974 |
| All RBRs | BindN+ | 0.803+ | 10.3+ | 0.233+ | 0.263 | 0.263 | 0.970 |
| All RBRs | FastRNABindR | 0.792+ | 17.1+ | 0.312+ | 0.339 | 0.341 | 0.972 |
| All RBRs | NucBind | 0.775+ | 16.0+ | 0.307+ | 0.335 | 0.333 | 0.973 |
| All RBRs | DRNApred | 0.608+ | 4.1+ | 0.097+ | 0.132 | 0.132 | 0.964 |
| rRNA | RNABindRPlus | | | | | | |
| rRNA | aaRNA | 0.870= | 20.5+ | 0.356+ | 0.377 | 0.418 | 0.974 |
| rRNA | BindN+ | 0.829+ | 12.0+ | 0.246+ | 0.271 | 0.303 | 0.970 |
| rRNA | FastRNABindR | 0.820+ | 20.5+ | 0.334+ | 0.355 | 0.400 | 0.972 |
| rRNA | NucBind | 0.790+ | 18.7+ | 0.325+ | 0.347 | 0.385 | 0.973 |
| rRNA | DRNApred | 0.601+ | 4.5+ | 0.095+ | 0.126 | 0.141 | 0.964 |
| mRNA | RNABindRPlus | | 2.3 | 0.009 | 0.003 | 0.091 | |
| mRNA | aaRNA | 0.637+ | | | | | 0.974 |
| mRNA | BindN+ | 0.798+ | 7.2– | 0.020– | 0.005 | 0.205 | 0.970 |
| mRNA | FastRNABindR | 0.814+ | 7.8– | 0.025– | 0.007 | 0.227 | 0.972 |
| mRNA | NucBind | 0.844= | 10.6– | 0.030– | 0.008 | 0.273 | 0.973 |
| mRNA | DRNApred | 0.383+ | 4.0= | 0.006= | 0.002 | 0.091 | 0.964 |
| snRNA | RNABindRPlus | | | | | | |
| snRNA | aaRNA | 0.777= | 8.6= | 0.065= | 0.043 | | 0.974 |
| snRNA | BindN+ | 0.716+ | 2.7+ | 0.018+ | 0.015 | 0.088 | 0.970 |
| snRNA | FastRNABindR | 0.769+ | 5.5+ | 0.040+ | 0.028 | 0.150 | 0.972 |
| snRNA | NucBind | 0.685+ | 5.9+ | 0.038= | 0.027 | 0.144 | 0.973 |
| snRNA | DRNApred | 0.535+ | 0.9+ | −0.002+ | 0.004 | 0.029 | 0.964 |
| SRP | RNABindRPlus | 0.774 | | | | 0.426 | |
| SRP | aaRNA | | 7.8+ | 0.025= | 0.008 | 0.204 | 0.974 |
| SRP | BindN+ | 0.625+ | 4.3+ | 0.013+ | 0.004 | 0.130 | 0.970 |
| SRP | FastRNABindR | 0.288+ | 0.3+ | −0.001+ | 0.001 | 0.019 | 0.972 |
| SRP | NucBind | 0.608+ | 21.4= | 0.037= | 0.011 | 0.296 | 0.973 |
| SRP | DRNApred | 0.543+ | 15.2= | 0.051= | 0.013 | | 0.964 |
| IRES | RNABindRPlus | 0.818 | 7.3 | 0.023 | 0.006 | 0.216 | |
| IRES | aaRNA | | 8.5= | 0.022= | 0.006 | 0.216 | 0.974 |
| IRES | BindN+ | 0.729+ | 4.5= | 0.008+ | 0.002 | 0.108 | 0.970 |
| IRES | FastRNABindR | 0.758+ | 0.7+ | 0.000+ | 0.001 | 0.027 | 0.972 |
| IRES | NucBind | 0.780= | 0.7+ | 0.000+ | 0.001 | 0.027 | 0.973 |
| IRES | DRNApred | 0.855= | | | | | 0.964 |
| tRNA | RNABindRPlus | 0.745 | | 0.029 | 0.027 | 0.095 | |
| tRNA | aaRNA | 0.735= | 5.1= | | | | 0.974 |
| tRNA | BindN+ | 0.689+ | 3.7= | 0.030= | 0.026 | 0.111 | 0.970 |
| tRNA | FastRNABindR | 0.739= | 5.2= | 0.040= | 0.033 | 0.131 | 0.972 |
| tRNA | NucBind | | 3.1= | 0.025= | 0.024 | 0.090 | 0.973 |
| tRNA | DRNApred | 0.742= | 1.8= | 0.016= | 0.017 | 0.084 | 0.964 |
Tests for a specific RNA type include RBRs that bind this RNA type and the non-RNA-binding residues; residues that interact with the other RNA types are excluded. The rate of the binary predictions was equalized between predictors such that the numbers of the predicted and the experimentally annotated RBRs are equal, allowing for side-by-side comparison of the binary metrics. The predictors are sorted in the order of their AUCs when tested using all RBRs. =/+/– summarize results of statistical tests and denote that the difference is not significant (p-value > 0.01)/that RNABindRPlus is significantly better (p-value ≤ 0.01)/that RNABindRPlus is significantly worse (p-value ≤ 0.01). The best results for each test are shown in bold font.
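The equalization described in the note above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each predictor outputs one real-valued propensity per residue, and the function name `equalize_and_score` and the toy inputs are hypothetical. The top-scoring residues are labeled as putative RBRs until their count matches the number of annotated RBRs, after which the binary metrics are computed.

```python
import math

def equalize_and_score(scores, labels):
    """Binarize scores so that the number of predicted RBRs equals the
    number of annotated RBRs, then compute binary metrics.
    `scores` and `labels` are hypothetical per-residue inputs (labels are 0/1)."""
    n_pos = sum(labels)
    # Rank residues from the highest to the lowest score; the top n_pos
    # residues become putative RBRs (ties are broken by original order).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    predicted = [0] * len(scores)
    for i in order[:n_pos]:
        predicted[i] = 1

    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 0)

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "f1": f1, "mcc": mcc}

# Toy example with 3 annotated RBRs among 8 residues.
result = equalize_and_score(
    scores=[0.9, 0.8, 0.3, 0.2, 0.7, 0.1, 0.05, 0.4],
    labels=[1, 0, 1, 0, 1, 0, 0, 0],
)
```

When the predicted and annotated counts are equal on the evaluated set, false positives equal false negatives, so precision, sensitivity and F1 coincide. This is consistent with the nearly identical F1 and sensitivity values in the "All RBRs" rows of the table.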
Figure 1. Comparison of the default sensitivity (green bars) and the sensitivity when putative RBRs that are immediate neighbors (one residue away in the sequence) of the experimentally annotated RBRs are assumed correct (blue bars).
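One possible reading of the relaxed sensitivity in Figure 1 is sketched below. The caption does not spell out the exact counting rule, so this is an assumption: an annotated RBR is treated as recovered if it, or a residue at most one position away in the sequence, is predicted as an RBR. The function names and the toy vectors are hypothetical.

```python
def default_sensitivity(predicted, annotated):
    """Fraction of annotated RBRs that are themselves predicted as RBRs."""
    positives = [i for i, y in enumerate(annotated) if y == 1]
    if not positives:
        return 0.0
    return sum(predicted[i] for i in positives) / len(positives)

def neighbor_tolerant_sensitivity(predicted, annotated, window=1):
    """Fraction of annotated RBRs with a putative RBR within `window`
    residues in the sequence (assumed interpretation of the figure)."""
    n = len(annotated)
    positives = [i for i, y in enumerate(annotated) if y == 1]
    if not positives:
        return 0.0
    hits = 0
    for i in positives:
        lo, hi = max(0, i - window), min(n, i + window + 1)
        if any(predicted[j] == 1 for j in range(lo, hi)):
            hits += 1
    return hits / len(positives)

# Toy chain of 8 residues: annotated RBRs at positions 1, 4 and 7.
annotated = [0, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 0, 1, 0, 0, 0]
strict = default_sensitivity(predicted, annotated)             # 1 of 3 recovered
relaxed = neighbor_tolerant_sensitivity(predicted, annotated)  # 2 of 3 recovered
```

The relaxed value can never be lower than the default one, since every exact hit is also a hit within the window.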
Assessment of the over-predictions and cross-predictions for the six partner-agnostic sequence-based predictors of RBRs on the benchmark dataset.
| Predictor | PPR on RNA-Binding Proteins | PPR on DNA-Binding Proteins | Ratio RNA/DNA | PPR on Non-RNA-Binding Proteins | Ratio RNA/Non-RNA |
|---|---|---|---|---|---|
| DRNApred | 0.084 | | | 0.018 | |
| RNABindRPlus | | 0.017+ | 6.0+ | | 9.6= |
| NucBind | 0.083= | 0.019+ | 4.3+ | 0.018= | 4.5= |
| aaRNA | 0.083= | 0.026+ | 3.2+ | 0.019= | 4.4= |
| FastRNABindR | 0.084= | 0.028+ | 3.1+ | 0.019= | 4.5= |
| BindN+ | 0.067+ | 0.032+ | 2.1+ | 0.026+ | 2.5+ |
The over-predictions (cross-predictions) are quantified with the predictive positive rate (PPR), defined as the number of putative RBRs divided by the number of all residues in the subset of the benchmark set that covers 150 proteins that do not bind RNA (75 proteins that interact with DNA). Higher PPR values in these two datasets indicate worse predictions since these values correspond to false positive rates. We also give PPR on the set of 150 RNA-binding proteins, ratioRNA/DNA = PPR for the RNA-binding proteins divided by PPR for the DNA-binding proteins (higher value is better), and ratioRNA/non-RNA = PPR for the RNA-binding proteins divided by the PPR for the non-RNA-binding proteins (higher value is better). The binary predictions of RBRs were equalized between predictors such that the numbers of the predicted and the experimentally annotated RBRs on the benchmark dataset are equal, allowing for side-by-side comparison of the PPR and ratio metrics. The predictors are sorted by their ratioRNA/DNA values. =/+/– summarize results of statistical tests and denote that the difference is not significant (p-value > 0.01)/that DRNApred is significantly better (p-value ≤ 0.01)/that DRNApred is significantly worse (p-value ≤ 0.01). The best results for each test are shown in bold font.
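The PPR and ratio metrics can be illustrated with a short sketch. It assumes binary per-residue predictions grouped by protein; the function names and toy data are hypothetical. Both ratios here divide the PPR on the RNA-binding proteins by the PPR on the other protein set, which agrees with the tabulated values (for NucBind, 0.083 / 0.018 ≈ 4.6 against the reported 4.5, with the gap attributable to rounding of the inputs).

```python
def ppr(predictions_by_protein):
    """Predictive positive rate: putative RBRs divided by all residues,
    pooled over a set of proteins (each protein is a list of 0/1 values)."""
    total = sum(len(p) for p in predictions_by_protein)
    putative = sum(sum(p) for p in predictions_by_protein)
    return putative / total if total else 0.0

def ppr_ratio(ppr_rna, ppr_other):
    """PPR on RNA-binding proteins divided by PPR on another protein set;
    higher values indicate fewer cross-predictions or over-predictions."""
    return ppr_rna / ppr_other if ppr_other else float("inf")

# Toy data: two RNA-binding proteins and two DNA-binding proteins.
rna_binding = [[1, 0, 0, 0], [1, 1, 0, 0, 0, 0]]     # 3 putative RBRs / 10 residues
dna_binding = [[0, 0, 0, 0, 1], [0, 0, 0, 0, 0]]     # 1 putative RBR / 10 residues

ppr_rna = ppr(rna_binding)
ppr_dna = ppr(dna_binding)
ratio_rna_dna = ppr_ratio(ppr_rna, ppr_dna)
```

On proteins that do not bind RNA every putative RBR is a false positive, which is why the PPR on those sets can be read as a false positive rate.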
Figure 2. Protein-level predictive performance measured with AUC for the six partner-agnostic sequence-based predictors of RBRs on the benchmark dataset. This analysis focuses on the RNA-binding proteins as the calculation of the per-protein AUC is not possible for the other proteins. The violin plots in Panel (A) represent the distributions of the per-protein AUC values. The box plots inside the violin plots represent the first quartile (bottom of the box), the second quartile/median (white dot) and the third quartile (top of the box) for these distributions. The black points connected by the black solid lines denote the dataset-level AUC values. Panel (B) shows the relation between the per-protein AUC and the content of RBRs (fraction of RBRs in the protein chain). The color-coded solid lines correspond to the linear fit between the content and the AUC values for a given predictor.
Figure 3. Complementarity of the six partner-agnostic sequence-based predictors of RBRs. This analysis focuses on the RNA-binding proteins as the calculation of the per-protein AUC is not possible for the other proteins. Panel (A) shows the per-protein AUC values for proteins sorted by the AUCs of the best-performing predictor, RNABindRPlus, which are represented by the red line. Panel (B) shows the fractions of the RNA-binding proteins for which a given predictor secures the highest value of AUC. Predictors are sorted in descending order by the value of the fraction.
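The protein-level analysis behind Figures 2 and 3 can be sketched as below. This is an illustrative assumption about the computation, not the authors' code: the AUC is computed with the rank-based (Mann-Whitney) formulation, and the predictor names "A" and "B" and all values are hypothetical. The AUC is undefined when a protein has no annotated RBRs, which mirrors the restriction of this analysis to RNA-binding proteins.

```python
def auc(scores, labels):
    """Rank-based AUC: probability that a random annotated RBR scores
    higher than a random non-binding residue (ties count as one half).
    Returns None when either class is absent."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_predictor_fractions(per_protein_auc):
    """For aligned per-protein AUC lists keyed by predictor name, return
    the fraction of proteins on which each predictor has the highest AUC.
    Ties are assigned to the predictor listed first."""
    names = list(per_protein_auc)
    n_proteins = len(per_protein_auc[names[0]])
    counts = {name: 0 for name in names}
    for k in range(n_proteins):
        best = max(names, key=lambda name: per_protein_auc[name][k])
        counts[best] += 1
    return {name: c / n_proteins for name, c in counts.items()}

# Per-protein AUC for one toy protein (4 residues, 2 annotated RBRs).
toy_auc = auc([0.9, 0.2, 0.6, 0.4], [1, 0, 1, 0])

# Hypothetical per-protein AUCs of two predictors on four proteins.
fractions = best_predictor_fractions({
    "A": [0.9, 0.6, 0.7, 0.5],
    "B": [0.8, 0.7, 0.9, 0.4],
})
```

In this toy case each predictor is best on half of the proteins even though neither dominates, which is the kind of complementarity that Panel (B) summarizes.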
Predictive performance of the six partner-agnostic sequence-based predictors of RBRs on the subsets of the benchmark set that share pre-defined levels of similarity to the templates of aaRNA (top of the table) and RNABindRPlus (bottom of the table).
| Benchmark Proteins Sharing a Given Range of Similarity to Templates of aaRNA | AUC | | | | | | MCC | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | aaRNA | RNABindRPlus | BindN+ | FastRNABindR | NucBind | DRNApred | aaRNA | RNABindRPlus | BindN+ | FastRNABindR | NucBind | DRNApred |
| Below 30% | | 0.85 – | 0.76 = | 0.83 – | 0.77 = | 0.65 = | | 0.33 – | 0.18 + | 0.22 = | 0.25 = | 0.10 + |
| 30–50% | | 0.92 – | 0.86 = | 0.90 – | 0.85 = | 0.72 + | | 0.45 – | 0.22 + | 0.40 – | 0.31 + | 0.25 + |
| 50–80% | | 0.86 + | 0.78 + | 0.83 + | 0.77 + | 0.66 + | | 0.37 = | 0.19 + | 0.28 + | 0.21 + | 0.10 + |
| Above 80% | | 0.86 = | 0.81 + | 0.75 + | 0.78 + | 0.56 + | | 0.42 – | 0.26 + | 0.32 = | 0.35 = | 0.09 + |
| Benchmark Proteins Sharing a Given Range of Similarity to Templates of RNABindRPlus | | | | | | | | | | | | |
| Below 30% | | 0.82 = | 0.79 + | 0.74 + | 0.80 + | 0.57 + | | 0.32 – | 0.18 + | 0.20 + | 0.22 + | 0.06 + |
| 30–50% | | 0.86 + | 0.83 + | 0.86 + | 0.80 + | 0.62 + | | 0.35 + | 0.30 + | 0.38 + | 0.42 + | 0.10 + |
| 50–80% | | 0.82 + | 0.79 + | 0.78 + | 0.67 + | 0.42 + | | 0.33 + | 0.24 + | 0.35 + | 0.31 + | −0.10 + |
| Above 80% | | 0.84 + | 0.80 + | 0.85 + | 0.75 + | 0.67 + | | 0.37 + | 0.31 + | 0.54 + | 0.43 + | 0.20 + |
The rate of the binary predictions was equalized between predictors such that the numbers of the predicted and the experimentally annotated RBRs are equal, allowing for side-by-side comparison of MCCs. We summarize the significance of differences between results generated by aaRNA/RNABindRPlus and each of the other five predictors for the set of proteins that share the same level of similarity; =/+/– denote that the difference between aaRNA/RNABindRPlus and another predictor for the set of proteins that share the same level of similarity is not significant (p-value > 0.05)/that aaRNA/RNABindRPlus is significantly better (p-value ≤ 0.05)/that aaRNA/RNABindRPlus is significantly worse (p-value ≤ 0.05). The comparison of the predictions from aaRNA and RNABindRPlus for the benchmark proteins that share <30% similarity to their templates against the benchmark proteins that share higher levels of similarity to the templates of the same predictor is shown in bold font.