Rasna R Walia, Li C Xue, Katherine Wilkins, Yasser El-Manzalawy, Drena Dobbs, Vasant Honavar.
Abstract
Protein-RNA interactions are central to essential cellular processes such as protein synthesis and regulation of gene expression and play roles in human infectious and genetic diseases. Reliable identification of protein-RNA interfaces is critical for understanding the structural bases and functional implications of such interactions and for developing effective approaches to rational drug design. Sequence-based computational methods offer a viable, cost-effective way to identify putative RNA-binding residues in RNA-binding proteins. Here we report two novel approaches: (i) HomPRIP, a sequence homology-based method for predicting RNA-binding sites in proteins; (ii) RNABindRPlus, a new method that combines predictions from HomPRIP with those from an optimized Support Vector Machine (SVM) classifier trained on a benchmark dataset of 198 RNA-binding proteins. Although highly reliable, HomPRIP cannot make predictions for the unaligned parts of query proteins and its coverage is limited by the availability of close sequence homologs of the query protein with experimentally determined RNA-binding sites. RNABindRPlus overcomes these limitations. We compared the performance of HomPRIP and RNABindRPlus with that of several state-of-the-art predictors on two test sets, RB44 and RB111. On a subset of proteins for which homologs with experimentally determined interfaces could be reliably identified, HomPRIP outperformed all other methods achieving an MCC of 0.63 on RB44 and 0.83 on RB111. RNABindRPlus was able to predict RNA-binding residues of all proteins in both test sets, achieving an MCC of 0.55 and 0.37, respectively, and outperforming all other methods, including those that make use of structure-derived features of proteins. More importantly, RNABindRPlus outperforms all other methods for any choice of tradeoff between precision and recall. 
An important advantage of both HomPRIP and RNABindRPlus is that they rely on readily available sequence and sequence-derived features of RNA-binding proteins. A webserver implementation of both methods is freely available at http://einstein.cs.iastate.edu/RNABindRPlus/.
Year: 2014 PMID: 24846307 PMCID: PMC4028231 DOI: 10.1371/journal.pone.0097725
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Principal Components Analysis (PCA) of interface conservation scores and sequence alignment statistics.
Data points in the plot correspond to the projection of a 6-dimensional vector representing the pairwise alignment of a query and homolog sequence onto a 2-dimensional space defined by the first and second principal components. Blue lines with red circles at their tips represent the axes of the original 6-dimensional space for the 6 variables used in the PCA: -log(E) (where E is the E-value), Identity Score, Positive Score, log(L) (where L is the local alignment length), and two alignment length fractions, taken relative to the lengths of the query and homolog proteins, respectively. Each data point is colored according to its computed score, with higher scores (red/orange) indicating higher interface conservation and lower scores (blue/green) indicating lower interface conservation. The large gray arrow indicates the direction of increasing degree of interface conservation, from Dark to Twilight to Safe Zone.
The Linear Model for Interface Conservation.
| Variable | Parameter Estimate | Standard Error | Type II SS |
| | −0.532 | 0.042 | 8.70 |
| | 0.001 | 0.000 | 1.11 |
| | 0.005 | 0.000 | 12.54 |
| | 0.600 | 0.014 | 97.55 |
| | 0.089 | 0.007 | 8.60 |
Performance of HomPRIP on RB198.
| Homology Zone | Prediction Coverage | Specificity | Sensitivity | F-measure | MCC |
| Safe Zone | 89/198 = 45% | 0.87 | 0.85 | 0.86 | 0.83 |
| Twilight Zone | 54/198 = 27% | 0.64 | 0.49 | 0.55 | 0.50 |
| Dark Zone | 9/198 = 5% | 0.37 | 0.12 | 0.18 | 0.17 |
| All Zones | 152/198 = 77% | 0.79 | 0.69 | 0.73 | 0.69 |
The performance is shown for the Safe, Twilight, and Dark Zones, separately. Prediction coverage is the fraction of queries that can be predicted by HomPRIP in a given zone.
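The columns above can be related with a small sketch. It assumes, since this record does not state the definitions, that Specificity is TP/(TP+FP) (often called precision), Sensitivity is TP/(TP+FN), F-measure is their harmonic mean, and MCC is the standard Matthews correlation coefficient; the function names are illustrative only. Under that reading the Safe Zone row is reproduced (0.87 and 0.85 give 0.86), while other rows agree only to within about 0.01 because the inputs are already rounded.

```python
import math


def coverage(n_predicted, n_total):
    """Fraction of query proteins for which a prediction is returned."""
    return n_predicted / n_total


def f_measure(specificity, sensitivity):
    """Harmonic mean of specificity (assumed TP/(TP+FP)) and sensitivity."""
    if specificity + sensitivity == 0:
        return 0.0
    return 2 * specificity * sensitivity / (specificity + sensitivity)


def confusion_metrics(tp, fp, tn, fn):
    """Residue-level metrics from confusion counts (assumed definitions)."""
    specificity = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "specificity": specificity,
        "sensitivity": sensitivity,
        "f_measure": f_measure(specificity, sensitivity),
        "mcc": mcc,
    }


# Safe Zone row of the table: 89 of 198 proteins, Specificity 0.87, Sensitivity 0.85.
safe_coverage = coverage(89, 198)
safe_f = f_measure(0.87, 0.85)
```

Computing the metrics from pooled residue counts rather than averaging per protein is a design choice of this sketch, not something the table specifies.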
Figure 2. RNABindRPlus flowchart.
Flowchart showing the different components of RNABindRPlus.
Evaluation of Methods on 28 proteins from the RB44 dataset.
| Method | Reference | Specificity | Sensitivity | F-measure | MCC |
| HomPRIP | This paper | 0.84 | 0.62 | 0.71 | 0.63 |
| RNABindRPlus | This paper | 0.76 | 0.67 | 0.71 | 0.60 |
| SVMOpt | This paper | 0.58 | 0.72 | 0.64 | 0.48 |
| Metapredictor | | 0.74 | 0.54 | 0.62 | 0.51 |
| PiRaNhA | | 0.66 | 0.65 | 0.65 | 0.51 |
| BindN+ | | 0.56 | 0.75 | 0.64 | 0.47 |
| PPRInt | | 0.49 | 0.77 | 0.60 | 0.39 |
| PRBR | | 0.58 | 0.45 | 0.51 | 0.34 |
| RNABindR | | 0.60 | 0.39 | 0.48 | 0.32 |
| BindN | | 0.50 | 0.50 | 0.50 | 0.28 |
| NAPS | | 0.43 | 0.58 | 0.49 | 0.23 |
| KYG** | | 0.55 | 0.66 | 0.60 | 0.41 |
| OPRA** | | 0.61 | 0.48 | 0.53 | 0.37 |
| PRIP** | | 0.47 | 0.71 | 0.56 | 0.33 |
The first 11 methods are sequence-based methods. The last 3 methods are structure-based methods (indicated by **). Methods in each category are sorted in descending order of MCC. The highest value in each column is shown in bold font.
HomPRIP Performance by Zone on RB28.
| Homology Zone | Proteins | Specificity | Sensitivity | F-measure | MCC |
| Safe Zone | 2L5D_A, 2XD0_A, 2XZN_J, 3IZV_M, 3IZW_I, 3J00_G, 3J01_5, 3PIP_F, 3PIP_G, 3PIP_T, 3Q2T_A | 0.88 | 0.80 | 0.84 | 0.77 |
| Twilight Zone | 2XXA_D, 2XZM_B, 2XZM_C, 2XZM_G, 2XZM_I, 2XZM_M, 3IZV_X, 2RRA_A, 2XZM_E, 2XZM_Q, 2XZN_L, 2XZM_8, 2XZM_S, 2XZM_U, 3IZW_R | 0.83 | 0.55 | 0.66 | 0.58 |
| Dark Zone | 2XZM_D, 3PDM_P | 0.45 | 0.18 | 0.26 | 0.13 |
All measures are highest for proteins with Safe Zone homologs and lowest for those with Dark Zone homologs.
HomPRIP, RNABindRPlus, and SVMOpt Performance by Zone on RB28.
| Safe Zone | Specificity | Sensitivity | F-measure | MCC |
| HomPRIP | 0.88 | 0.80 | 0.84 | 0.77 |
| RNABindRPlus | 0.79 | 0.67 | 0.72 | 0.61 |
| SVMOpt | 0.63 | 0.68 | 0.65 | 0.48 |
| Twilight Zone | Specificity | Sensitivity | F-measure | MCC |
| HomPRIP | 0.83 | 0.55 | 0.66 | 0.58 |
| RNABindRPlus | 0.73 | 0.69 | 0.71 | 0.60 |
| SVMOpt | 0.54 | 0.76 | 0.63 | 0.47 |
| Dark Zone | Specificity | Sensitivity | F-measure | MCC |
| HomPRIP | 0.45 | 0.18 | 0.26 | 0.13 |
| RNABindRPlus | 0.83 | 0.54 | 0.65 | 0.57 |
| SVMOpt | 0.68 | 0.64 | 0.66 | 0.52 |
Proteins with Safe Zone Homologs in RB111.
| Homology Zone | Proteins |
| Safe Zone | 2XGJ_A, 2XS2_A, 2YSY_A, 3AGV_A, 3AMT_A, 3B0U_X, 3KFU_A, 3KFU_F, 3LWR_A, 3NMR_A, 3R2C_A, 3RC8_A, 3S14_A, 3S14_B, 3T5N_A, 3V22_V, 3V2C_Y, 3ZD6_A, 4AFY_A, 4ARC_A, 4ATO_A, 4B3G_A, 4BTD_2, 4BTD_D, 4BTD_G, 4BTD_S, 4BTD_X, 4DH9_Y, 4DWA_A, 4E78_A, 4ERD_A, 4IFD_A, 4IFD_H, 4K4Z_A, 4KJ5_5, 4KJ5_G, 3NVI_A, 3OIN_A, 3R9X_B, 3RW6_A, 3ULD_A, 3VYX_A, 4AM3_A, 4B3O_A, 4BA2_A, 4F02_A, 4F1N_A, 4FXD_A, 4GV3_A |
There are 49 proteins in RB111 for which HomPRIP can find homologs and return predictions.
Evaluation of Methods on 49 proteins from the RB111 dataset.
| Method | Reference | Specificity | Sensitivity | F-measure | MCC |
| HomPRIP | This paper | | | | 0.83 |
| RNABindRPlus | This paper | 0.64 | 0.54 | 0.59 | 0.55 |
| SVMOpt | This paper | 0.27 | 0.51 | 0.35 | 0.28 |
| BindN+ | | 0.28 | 0.48 | 0.36 | 0.28 |
| RNABindR v2.0 | | 0.19 | 0.67 | 0.30 | 0.24 |
| PPRInt | | 0.21 | 0.56 | 0.31 | 0.23 |
| BindN | | 0.18 | 0.39 | 0.24 | 0.14 |
| KYG** | | 0.20 | 0.46 | 0.28 | 0.19 |
| PRIP** | | 0.19 | 0.49 | 0.27 | 0.19 |
The first 7 methods are sequence-based methods. The last 2 methods are structure-based methods (indicated by **). Methods in each category are sorted in descending order of MCC. The highest value in each column is shown in bold font.
Figure 3. PDB ID: 3NCU, Chain A: RIG-I.
(A) Actual interface residues, (B) Predictions made by HomPRIP, (C) Predictions made by SVMOpt, and (D) Predictions made by RNABindRPlus.
Sequence-based Methods for Predicting RNA-binding sites in Proteins.
| Method | Reference | Description |
| BindN | | An SVM classifier that uses hydrophobicity, side chain pKa, molecular mass and PSSMs for predicting RNA-binding residues. It can also predict DNA-binding residues. Accessible at: |
| BindN+ | | An updated version of BindN that uses an SVM classifier based on PSSMs and several other descriptors of evolutionary information. It can also predict DNA-binding residues. Accessible at: |
| Metapredictor | | A predictor that combines the output of PiRaNhA, PPRInt, and BindN+ to make predictions of RNA-binding residues using a weighted mean. Accessible at: |
| NAPS | | A modified C4.5 decision tree algorithm that uses amino acid identity, residue charge, and PSSMs to predict residues involved in DNA- or RNA-binding. Accessible at: |
| PiRaNhA | | An SVM classifier that makes use of PSSM profiles, interface propensity, predicted solvent accessibility, and hydrophobicity to predict protein-RNA interface residues. Accessible at: |
| PPRInt | | An SVM classifier trained on PSSM profiles. Accessible at: |
| PRBR | | An enriched random forest classifier trained on predicted secondary structure, a combination of PSSMs with physico-chemical properties, a polarity-charge correlation, and a hydrophobicity correlation. Accessible at: |
| RNABindR | | A Naïve Bayes classifier that uses the amino acid sequence identity to predict RNA-binding residues in proteins. Previously accessible at: |
| RNABindR v2.0 | | An SVM classifier that uses sequence PSSMs to predict RNA-binding residues in proteins. Accessible at: |
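The weighted-mean combination described for the Metapredictor can be illustrated with a minimal sketch. The weights, per-residue scores, and the 0.5 decision threshold below are hypothetical placeholders chosen for illustration; they are not the values used by the Metapredictor or by any of the listed servers.

```python
def weighted_mean_scores(score_lists, weights):
    """Combine per-residue scores from several predictors by a weighted mean.

    score_lists: one list of per-residue scores per predictor, all same length.
    weights: one non-negative weight per predictor (hypothetical values here).
    """
    if len(score_lists) != len(weights):
        raise ValueError("need exactly one weight per predictor")
    n_residues = len(score_lists[0])
    if any(len(scores) != n_residues for scores in score_lists):
        raise ValueError("all predictors must score the same residues")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [
        sum(w * scores[i] for w, scores in zip(weights, score_lists)) / total
        for i in range(n_residues)
    ]


def call_binding(scores, threshold=0.5):
    """Label a residue as RNA-binding when its combined score reaches the threshold."""
    return [s >= threshold for s in scores]


# Hypothetical scores for a 4-residue sequence from three predictors.
predictor_a = [1.0, 0.0, 0.5, 0.25]
predictor_b = [1.0, 0.0, 0.5, 0.75]
predictor_c = [0.0, 0.0, 0.5, 0.75]

combined = weighted_mean_scores(
    [predictor_a, predictor_b, predictor_c], [0.5, 0.25, 0.25]
)
calls = call_binding(combined)
```

Dividing by the sum of the weights keeps the combined score on the same scale as the inputs even when the weights do not sum to one.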
Structure-based Methods for Predicting RNA-binding sites in Proteins.
| Method | Reference | Description |
| KYG | | Uses a set of scores based on the RNA-binding propensity of individual and pairs of surface residues of the protein, used alone or in combination with position-specific multiple sequence profiles. Accessible at: |
| OPRA | | Uses patch energy scores calculated using interface propensity scores weighted by the accessible surface area of a residue to predict RNA-binding sites. The program is available upon request from the authors. |
| PRIP | | Uses an SVM classifier and a combination of PSSM profiles, solvent accessible surface area (ASA), betweenness centrality, and retention coefficient as input features. Not accessible via the web server, but results can be obtained via correspondence with the author. |
Evaluation of Methods on the RB44 dataset.
| Method | Reference | Specificity | Sensitivity | F-measure | MCC |
| RNABindRPlus | This paper | 0.72 | 0.63 | | 0.55 |
| SVMOpt | This paper | 0.58 | 0.72 | 0.64 | 0.47 |
| PiRaNhA | | 0.64 | 0.63 | 0.64 | 0.48 |
| Metapredictor | | | 0.49 | 0.59 | 0.47 |
| BindN+ | | 0.54 | | 0.62 | 0.43 |
| PPRInt | | 0.50 | 0.72 | 0.59 | 0.38 |
| RNABindR | | 0.62 | 0.39 | 0.48 | 0.33 |
| PRBR | | 0.58 | 0.41 | 0.48 | 0.31 |
| BindN | | 0.50 | 0.51 | 0.50 | 0.28 |
| NAPS | | 0.43 | 0.58 | 0.49 | 0.22 |
| KYG** | | 0.56 | 0.67 | 0.61 | 0.42 |
| OPRA** | | 0.57 | 0.51 | 0.54 | 0.36 |
| PRIP** | | 0.46 | 0.68 | 0.55 | 0.31 |
The first 10 methods are sequence-based methods. The last 3 methods (indicated by **) are structure-based methods. Methods in each category are sorted in descending order of MCC. The highest value in each column is shown in bold font.
Figure 4. Comparison of SVMOpt, RNABindRPlus, and the Metapredictor on the RB44 dataset using (A) ROC curves and (B) PR curves with a 5 Å distance cut-off for interface residues.
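ROC and PR curves of this kind are traced by sweeping a decision threshold over per-residue scores. The sketch below shows one way to compute the points, assuming residues have already been labeled as interface (1) or non-interface (0) upstream, for example with the 5 Å cut-off mentioned in the caption; the scores and labels are hypothetical.

```python
def curve_points(scores, labels):
    """Return (threshold, false_positive_rate, recall, precision) tuples.

    One tuple is produced per distinct score, from the strictest threshold
    to the most permissive. A residue is predicted as interface when its
    score is greater than or equal to the threshold.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y != 1)
        # tp + fp >= 1 here because the threshold is itself one of the scores.
        points.append((threshold, fp / n_neg, tp / n_pos, tp / (tp + fp)))
    return points


# Hypothetical per-residue scores and interface labels.
example_scores = [0.9, 0.8, 0.4, 0.2]
example_labels = [1, 0, 1, 0]
points = curve_points(example_scores, example_labels)
```

Plotting recall against the false positive rate gives the ROC curve, and precision against recall gives the PR curve; a method that dominates another at every threshold lies above it in both plots.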
Evaluation of Methods on the RB111 dataset.
| Method | Reference | Specificity | Sensitivity | F-measure | MCC |
| RNABindRPlus | This paper | | 0.37 | | 0.37 |
| SVMOpt | This paper | 0.25 | 0.44 | 0.32 | 0.24 |
| BindN+ | | 0.25 | 0.43 | 0.31 | 0.24 |
| RNABindR v2.0 | | 0.18 | | 0.28 | 0.22 |
| PPRInt | | 0.18 | 0.48 | 0.26 | 0.18 |
| BindN | | 0.16 | 0.39 | 0.23 | 0.14 |
| KYG** | | 0.19 | 0.47 | 0.27 | 0.19 |
| PRIP** | | 0.17 | 0.45 | 0.24 | 0.15 |
The first 6 methods are sequence-based methods. The last 2 methods (indicated by **) are structure-based methods. Methods in each category are sorted in descending order of MCC. The highest value in each column is shown in bold font.
Figure 5. Comparison of SVMOpt, RNABindRPlus, RNABindR v2.0, BindN, BindN+ and PPRInt on the RB111 dataset using (A) ROC curves and (B) PR curves with a 5 Å distance cut-off for interface residues.
Boundaries of Safe, Twilight, and Dark Zones used by HomPRIP.
| Homology Zones | |
| Safe Zone | 0.70 |
| Twilight Zone | 0.20 |
| Dark Zone | 0.15 |
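The header of the value column did not survive in this record, so the following is only a sketch under an explicit assumption: each value is treated as an inclusive lower bound on a score computed for a query-homolog pair (plausibly the interface conservation score discussed for Figure 1), and pairs scoring below the Dark Zone bound are assigned to no zone. The function and constant names are illustrative.

```python
# Assumed reading of the table: (zone name, inclusive lower bound), strictest first.
ZONE_LOWER_BOUNDS = [
    ("Safe Zone", 0.70),
    ("Twilight Zone", 0.20),
    ("Dark Zone", 0.15),
]


def homology_zone(score):
    """Return the first zone whose assumed lower bound the score reaches.

    Returns None when the score falls below every bound, i.e. the homolog
    would not be placed in any zone under this reading of the table.
    """
    for zone_name, lower_bound in ZONE_LOWER_BOUNDS:
        if score >= lower_bound:
            return zone_name
    return None


example_zones = [homology_zone(s) for s in (0.85, 0.50, 0.17, 0.05)]
```

If the boundaries are instead upper bounds or apply to a different statistic, only the comparison in `homology_zone` would need to change.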