Yong Jung, Yasser El-Manzalawy, Drena Dobbs, Vasant G Honavar.
Abstract
RNA-protein interactions play essential roles in regulating gene expression. While some RNA-protein interactions are "specific", that is, the RNA-binding proteins preferentially bind to particular RNA sequence or structural motifs, others are "non-RNA specific." Deciphering the protein-RNA recognition code is essential for comprehending the functional implications of these interactions and for developing new therapies for many diseases. Because of the high cost of experimental determination of protein-RNA interfaces, there is a need for computational methods to identify RNA-binding residues in proteins. While most of the existing computational methods for predicting RNA-binding residues in RNA-binding proteins are oblivious to the characteristics of the partner RNA, there is growing interest in methods for partner-specific prediction of RNA binding sites in proteins. In this work, we assess the performance of two recently published partner-specific protein-RNA interface prediction tools, PS-PRIP and PRIdictor, along with our own new tools. Specifically, we introduce a novel metric, the RNA-specificity metric (RSM), for quantifying the RNA-specificity of the RNA-binding residues predicted by such tools. Our results show that the RNA-binding residues predicted by previously published methods are oblivious to the characteristics of the putative RNA binding partner. Moreover, when evaluated using partner-agnostic metrics, RNA partner-specific methods are outperformed by the state-of-the-art partner-agnostic methods. We conjecture that either (a) the protein-RNA complexes in PDB are not representative of the protein-RNA interactions in nature, or (b) the current methods for partner-specific prediction of RNA-binding residues in proteins fail to account for the differences in RNA partner-specific versus partner-agnostic protein-RNA interactions, or both.
Keywords: RNA-specificity metric; partner-specific protein-RNA binding; performance evaluation; protein-RNA Interface prediction; protein-RNA interactions
Year: 2018 PMID: 30536635 PMCID: PMC6389706 DOI: 10.1002/prot.25639
Source DB: PubMed Journal: Proteins ISSN: 0887-3585
Figure 1. Difference between RNA partner‐specific and partner‐agnostic interface residue predictors. U1 small nuclear ribonucleoprotein A is an example of an RNA binding protein with multiple binding sites. Two protein‐RNA complexes are used for this illustration (PDB IDs: 4W90, chains B and C, and 4YB1, chains P and R). A, An RNA partner‐specific interface residue predictor takes as input a query protein and one or more putative RNA partners and returns the predicted binding site for each RNA separately (highlighted using different colors). B, A partner‐agnostic interface residue predictor takes as input a query protein and returns all predicted RNA interface residues for that protein.
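The two input/output contracts described in the caption can be sketched in a few lines. This is only an illustration under stated assumptions: `toy_predictor` is a placeholder rule with no biological meaning, and `predict_per_partner` and `pool_partners` are hypothetical helper names, not functions from any of the tools evaluated here. Pooling per-partner predictions by set union is one simple way to obtain a partner-agnostic-style output, not necessarily the one used in the study.

```python
from typing import Callable, Dict, List, Set

# Hypothetical signature: (protein sequence, RNA sequence) -> predicted residue indices.
PartnerSpecificPredictor = Callable[[str, str], Set[int]]


def toy_predictor(protein: str, rna: str) -> Set[int]:
    """Placeholder rule (no biological meaning): flag K if the RNA contains U, else R."""
    target = "K" if "U" in rna else "R"
    return {i for i, aa in enumerate(protein) if aa == target}


def predict_per_partner(
    predictor: PartnerSpecificPredictor, protein: str, rnas: List[str]
) -> Dict[str, Set[int]]:
    """Partner-specific view: one predicted binding site per putative RNA partner."""
    return {rna: predictor(protein, rna) for rna in rnas}


def pool_partners(per_partner: Dict[str, Set[int]]) -> Set[int]:
    """Partner-agnostic-style view: all predicted interface residues of the protein."""
    pooled: Set[int] = set()
    for residues in per_partner.values():
        pooled |= residues
    return pooled


per_partner = predict_per_partner(toy_predictor, "MKRAK", ["GGU", "GGC"])
pooled = pool_partners(per_partner)
```

A predictor whose per-partner sets are identical for every RNA behaves, in effect, like a partner-agnostic predictor, which is the situation the RNA-specificity metric is meant to detect.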
Protein‐RNA datasets used in this study
| Dataset | No. interfacial pairs | No. non‐interfacial pairs | No. interfacial residues | No. non‐interfacial residues |
|---|---|---|---|---|
| PR122 | 6429 | 1 786 901 | 3328 | 25 474 |
| PR50 | 2662 | 608 602 | 1391 | 8664 |
| PR24 | 1048 | 406 061 | 512 | 2702 |
| PR30 | 1283 | 361 984 | 708 | 5580 |
PR122: A dataset for training, which consists of 122 protein‐RNA complexes.
PR50: A dataset for independent testing, which consists of 50 protein‐RNA complexes.
PR24: A dataset derived from PR122 and PR50 by excluding protein‐RNA pairs where RNA length is less than 100 ribonucleotides.
PR30: A dataset derived from PR50 by excluding protein‐RNA pairs where the protein sequence shares high sequence similarity (> 25%) with any protein sequence in PR122.
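The length-based filter that defines PR24 can be sketched as follows. The `Pair` record, the `keep_long_rna` helper, and the example chain identifiers and lengths are all hypothetical and serve only to illustrate the rule "exclude pairs whose RNA has fewer than 100 ribonucleotides". The similarity-based filter for PR30 would need a sequence alignment tool and is not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Pair:
    protein_chain: str  # hypothetical identifier, e.g. "XXXX_A"
    rna_chain: str      # hypothetical identifier, e.g. "XXXX_B"
    rna_length: int     # number of ribonucleotides in the RNA chain


def keep_long_rna(pairs: List[Pair], min_len: int = 100) -> List[Pair]:
    """Keep only protein-RNA pairs whose RNA has at least `min_len` ribonucleotides."""
    return [p for p in pairs if p.rna_length >= min_len]


# Made-up records, not taken from PR122 or PR50.
candidates = [
    Pair("XXXX_A", "XXXX_B", 24),
    Pair("YYYY_C", "YYYY_D", 100),
    Pair("ZZZZ_E", "ZZZZ_F", 2900),
]
long_rna_pairs = keep_long_rna(candidates)
```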
Performance of different classifiers for predicting interfacial amino acid residue‐ribonucleotide pairs using fivefold cross‐validation and PR122 dataset
| Features | Classifier | Sn | Sp | ACC | MCC | AUC [ROC] | AUC [CROC] |
|---|---|---|---|---|---|---|---|
| Sequence‐based | RF | 0.43 | 0.91 | 0.91 | 0.07 | 0.77 | 0.45 |
| Sequence‐based | SVM‐RBF | 0.16 | 0.98 | 0.98 | 0.06 | 0.74 | 0.39 |
| Sequence‐based | NB | 0.50 | 0.75 | 0.75 | 0.04 | 0.68 | 0.34 |
| Structure‐based | RF | 0.47 | 0.91 | 0.91 | 0.08 | 0.80 | 0.48 |
| Structure‐based | SVM‐RBF | 0.16 | 0.98 | 0.98 | 0.06 | 0.75 | 0.40 |
| Structure‐based | NB | 0.51 | 0.76 | 0.75 | 0.04 | 0.69 | 0.35 |
Abbreviations: ACC, accuracy; AUC, area under the curve; CROC, concentrated receiver operating characteristic; MCC, Matthews correlation coefficient; NB, naïve Bayes; RF, random forest; ROC, receiver operating characteristic; Sn, sensitivity; Sp, specificity; SVM‐RBF, support vector machine with radial basis function kernel.
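The threshold-dependent metrics in this table follow from a confusion matrix, and a short sketch helps explain why ACC can be above 0.9 while MCC stays below 0.1 on such imbalanced data. The function name `confusion_metrics` is hypothetical. The counts below are not reported in the source: they are approximations reconstructed from the rounded Sn and Sp of the sequence‐based RF row and the PR122 pair totals (6429 interfacial, 1 786 901 non‐interfacial), so they only roughly mimic the cross‐validation result.

```python
import math


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Sensitivity, specificity, accuracy and Matthews correlation coefficient."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}


# Assumption: counts rebuilt from rounded rates, not taken from the paper.
positives, negatives = 6429, 1786901
tp = round(0.43 * positives)   # interfacial pairs predicted as interfacial
fn = positives - tp
tn = round(0.91 * negatives)   # non-interfacial pairs predicted as non-interfacial
fp = negatives - tn

metrics = confusion_metrics(tp, fp, tn, fn)
print({name: round(value, 2) for name, value in metrics.items()})
```

Because non‐interfacial pairs outnumber interfacial pairs by more than 250 to 1, ACC is dominated by Sp, whereas MCC is pulled down by the large number of false positives relative to true positives.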
Performance of different classifiers for predicting interfacial amino acid residue‐ribonucleotide pairs using the PR50 independent test set
| Features | Classifier | Sn | Sp | ACC | MCC | AUC [ROC] | AUC [CROC] |
|---|---|---|---|---|---|---|---|
| Sequence‐based | RF | 0.48 | 0.91 | 0.91 | 0.09 | 0.80 | 0.49 |
| Sequence‐based | SVM‐RBF | 0.35 | 0.96 | 0.96 | 0.10 | 0.79 | 0.48 |
| Sequence‐based | NB | 0.47 | 0.77 | 0.77 | 0.04 | 0.69 | 0.30 |
| Structure‐based | RF | 0.52 | 0.90 | 0.90 | 0.09 | 0.83 | 0.52 |
| Structure‐based | SVM‐RBF | 0.36 | 0.96 | 0.96 | 0.11 | 0.81 | 0.50 |
| Structure‐based | NB | 0.49 | 0.76 | 0.76 | 0.04 | 0.69 | 0.30 |
Abbreviations: ACC, accuracy; AUC, area under the curve; CROC, concentrated receiver operating characteristic; MCC, Matthews correlation coefficient; NB, naïve Bayes; RF, random forest; ROC, receiver operating characteristic; Sn, sensitivity; Sp, specificity; SVM‐RBF, support vector machine with radial basis function kernel.
Performance comparison of different methods for mapping predicted interfacial amino acid residue‐ribonucleotide pairs to RNA‐binding residues on protein side using PR50 independent test set
| Model | Method | Sn | Sp | ACC | MCC | AUC [ROC] | AUC [CROC] |
|---|---|---|---|---|---|---|---|
| PSPRInt‐Seq | Max | 0.47 | 0.93 | 0.87 | 0.42 | 0.81 | 0.42 |
| PSPRInt‐Seq | Average | 0.38 | 0.96 | 0.88 | 0.41 | 0.80 | 0.41 |
| PSPRInt‐Seq | Average top‐5 | 0.46 | 0.93 | 0.87 | 0.42 | 0.81 | 0.42 |
| PSPRInt‐Seq | Average top‐15 | 0.44 | 0.94 | 0.87 | 0.41 | 0.81 | 0.42 |
| PSPRInt‐Seq | Average top‐25 | 0.42 | 0.95 | 0.87 | 0.42 | 0.81 | 0.42 |
| PSPRInt‐Str | Max | 0.55 | 0.90 | 0.85 | 0.42 | 0.84 | 0.44 |
| PSPRInt‐Str | Average | 0.41 | 0.95 | 0.88 | 0.42 | 0.83 | 0.43 |
| PSPRInt‐Str | Average top‐5 | 0.54 | 0.91 | 0.86 | 0.43 | 0.84 | 0.44 |
| PSPRInt‐Str | Average top‐15 | 0.50 | 0.93 | 0.87 | 0.43 | 0.84 | 0.44 |
| PSPRInt‐Str | Average top‐25 | 0.48 | 0.93 | 0.87 | 0.43 | 0.84 | 0.43 |
Abbreviations: ACC, accuracy; AUC, area under the curve; CROC, concentrated receiver operating characteristic; MCC, Matthews correlation coefficient; ROC, receiver operating characteristic; Sn, sensitivity; Sp, specificity.
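One plausible reading of the mapping methods named in this table (Max, Average, Average top‐k) is that each amino acid residue receives a single score obtained by aggregating its predicted pair scores over all ribonucleotides of the partner RNA. The sketch below assumes that reading. The function name `residue_scores`, the method labels, and the toy score matrix are hypothetical and are not taken from the PSPRInt implementation.

```python
from statistics import mean
from typing import List, Sequence


def residue_scores(
    pair_scores: Sequence[Sequence[float]], method: str = "max", k: int = 5
) -> List[float]:
    """Collapse residue-by-ribonucleotide pair scores into one score per residue.

    pair_scores[i][j] is the predicted score that residue i contacts ribonucleotide j.
    """
    scores = []
    for row in pair_scores:
        if method == "max":
            scores.append(max(row))
        elif method == "average":
            scores.append(mean(row))
        elif method == "average_top_k":
            # Mean of the k highest pair scores (all of them if the RNA is shorter than k).
            scores.append(mean(sorted(row, reverse=True)[:k]))
        else:
            raise ValueError(f"unknown method: {method}")
    return scores


# Toy matrix: 2 residues x 3 ribonucleotides (made-up values).
toy = [[0.1, 0.9, 0.2], [0.0, 0.1, 0.2]]
by_max = residue_scores(toy, "max")
by_avg = residue_scores(toy, "average")
by_top2 = residue_scores(toy, "average_top_k", k=2)
```

Under this reading, Max is the most permissive aggregation and Average the most conservative, which is in line with Max showing the highest Sn and Average the highest Sp in the table.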
Performance comparison of different methods for mapping predicted interfacial amino acid residue‐ribonucleotide pairs to RNA‐binding residues on protein side using PR30 independent test set
| Model | Method | Sn | Sp | ACC | MCC | AUC [ROC] | AUC [CROC] |
|---|---|---|---|---|---|---|---|
| PSPRInt‐Seq | Max | 0.25 | 0.93 | 0.85 | 0.21 | 0.74 | 0.33 |
| PSPRInt‐Seq | Average | 0.18 | 0.95 | 0.86 | 0.19 | 0.73 | 0.34 |
| PSPRInt‐Seq | Average top‐5 | 0.23 | 0.93 | 0.85 | 0.20 | 0.74 | 0.33 |
| PSPRInt‐Seq | Average top‐15 | 0.21 | 0.94 | 0.85 | 0.19 | 0.74 | 0.33 |
| PSPRInt‐Seq | Average top‐25 | 0.21 | 0.95 | 0.85 | 0.19 | 0.74 | 0.33 |
| PSPRInt‐Str | Max | 0.38 | 0.91 | 0.85 | 0.27 | 0.78 | 0.35 |
| PSPRInt‐Str | Average | 0.24 | 0.96 | 0.88 | 0.26 | 0.78 | 0.37 |
| PSPRInt‐Str | Average top‐5 | 0.36 | 0.92 | 0.86 | 0.28 | 0.78 | 0.36 |
| PSPRInt‐Str | Average top‐15 | 0.32 | 0.93 | 0.87 | 0.28 | 0.78 | 0.36 |
| PSPRInt‐Str | Average top‐25 | 0.30 | 0.94 | 0.87 | 0.26 | 0.78 | 0.36 |
Abbreviations: ACC, accuracy; AUC, area under the curve; CROC, concentrated receiver operating characteristic; MCC, Matthews correlation coefficient; ROC, receiver operating characteristic; Sn, sensitivity; Sp, specificity.
Performance comparisons of PSPRInt‐Seq and PSPRInt‐Str with RNA partner‐agnostic RNA‐binding residue prediction methods using RBPs in PR50 test set
| Method | Sn | Sp | ACC | MCC | AUC [ROC] | AUC [CROC] |
|---|---|---|---|---|---|---|
| PSPRInt‐Seq | | | | | | |
| PSPRInt‐Str | | | | | | |
| RNABindRPlus | 0.47 | 0.94 | 0.88 | 0.44 | 0.83 | 0.44 |
| FastRNABindR | 0.70 | 0.75 | 0.74 | 0.33 | 0.79 | 0.39 |
| RNABindR v2.0 | 0.68 | 0.71 | 0.71 | 0.28 | 0.77 | 0.35 |
| BindN+ | 0.47 | 0.85 | 0.80 | 0.27 | 0.75 | 0.33 |
| RBScore | 0.60 | 0.81 | 0.78 | 0.33 | 0.76 | 0.33 |
Results for two RNA partner‐specific methods are shown in bold, above the line. All other methods (below the line) are RNA partner‐agnostic.
Abbreviations: ACC, accuracy; AUC, area under the curve; CROC, concentrated receiver operating characteristic; MCC, Matthews correlation coefficient; ROC, receiver operating characteristic; Sn, sensitivity; Sp, specificity.
Performance comparisons of PSPRInt‐Seq and PSPRInt‐Str with RNA partner‐agnostic RNA‐binding residue prediction methods using RBPs in PR30 test set
| Methods | Sn | Sp | ACC | MCC | AUC [ROC] | AUC [CROC] |
|---|---|---|---|---|---|---|
| PSPRInt‐Seq | | | | | | |
| PSPRInt‐Str | | | | | | |
| RNABindRPlus | 0.34 | 0.94 | 0.88 | 0.32 | 0.79 | 0.40 |
| FastRNABindR | 0.58 | 0.73 | 0.72 | 0.25 | 0.73 | 0.36 |
| RNABindR v2.0 | 0.57 | 0.70 | 0.70 | 0.22 | 0.72 | 0.35 |
| BindN+ | 0.40 | 0.83 | 0.78 | 0.21 | 0.71 | 0.30 |
| RBScore | 0.50 | 0.79 | 0.76 | 0.27 | 0.73 | 0.31 |
Results for two RNA partner‐specific methods are shown in bold, above the line. All other methods (below the line) are RNA partner‐agnostic.
Abbreviations: ACC, accuracy; AUC, area under the curve; CROC, concentrated receiver operating characteristic; MCC, Matthews correlation coefficient; ROC, receiver operating characteristic; Sn, sensitivity; Sp, specificity.
RSM scores for protein‐RNA pairs in the PR24 dataset determined for four RNA partner‐specific RNA‐binding residue prediction methods
| Interacting protein‐RNA chain | PS‐PRIP | PRIdictor | PSPRInt‐Seq | PSPRInt‐Str |
|---|---|---|---|---|
| 1MFQ_B‐1MFQ_A | | 0.063 | 0.053 | 0.063 |
| 1MFQ_C‐1MFQ_A | | | 0.057 | 0.037 |
| 1NWY_M‐1NWY_9 | 0.288 | 0.400 | 0.029 | |
| 1U6B_A‐1U6B_B | | 0.007 | 0.058 | 0.055 |
| 1VQN_Q‐1VQN_9 | 0.193 | 0.004 | 0.114 | 0.135 |
| 1W2B_V‐1W2B_9 | 0.251 | 0.024 | 0.073 | 0.063 |
| 2OTJ_D‐2OTJ_9 | 0.070 | 0.034 | 0.038 | 0.029 |
| 2ZJR_D‐2ZJR_Y | 0.206 | | 0.032 | 0.058 |
| 3DLL_J‐3DLL_Z | 0.076 | | 0.038 | 0.070 |
| 3DLL_S‐3DLL_Z | | 0.114 | 0.014 | 0.053 |
| 3G71_H‐3G71_9 | 0.129 | 0.054 | 0.054 | 0.053 |
| 3G8T_A‐3G8T_P | | 0.005 | 0.091 | 0.116 |
| 3HHN_B‐3HHN_C | | 0.032 | 0.072 | 0.056 |
| 3I56_N‐3I56_9 | 0.033 | 0.014 | 0.057 | 0.080 |
| 3IVK_H‐3IVK_M | | | 0.047 | 0.130 |
| 3NDB_B‐3NDB_M | 0.367 | 0.167 | 0.070 | 0.085 |
| 3V7E_A‐3V7E_C | | | 0.029 | 0.066 |
| 4IO9_W‐4IO9_Y | 0.697 | 0.132 | 0.077 | 0.062 |
| 4LCK_A‐4LCK_C | | | 0.003 | 0.032 |
| 4P3E_B‐4P3E_A | | 0.108 | 0.038 | 0.033 |
| 4P3E_C‐4P3E_A | | | 0.047 | 0.089 |
| 4UYJ_D‐4UYJ_S | 1.000 | 0.074 | 0.024 | 0.125 |
| 4UYK_A‐4UYK_R | | 0.083 | 0.055 | 0.093 |
| 4W90_B‐4W90_C | | 0.017 | 0.033 | 0.082 |
| Average | 0.138 | 0.056 | 0.050 | 0.069 |
| STDEV | 0.248 | 0.088 | 0.025 | 0.034 |
Abbreviation: STDEV, standard deviation.
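The Average and STDEV rows can be reproduced from the per‐pair values. As a sketch, the snippet below recomputes both for the PSPRInt‐Seq column, the one column for which all 24 values survive in this table. The assignment of values to that column assumes the column order given in the header, and the variable names are hypothetical. The reported STDEV is matched by the sample standard deviation (n − 1 in the denominator), which is what `statistics.stdev` computes.

```python
from statistics import mean, stdev

# PSPRInt-Seq RSM scores for the 24 PR24 pairs, in table order.
psprint_seq_rsm = [
    0.053, 0.057, 0.029, 0.058, 0.114, 0.073, 0.038, 0.032,
    0.038, 0.014, 0.054, 0.091, 0.072, 0.057, 0.047, 0.070,
    0.029, 0.077, 0.003, 0.038, 0.047, 0.024, 0.055, 0.033,
]

column_mean = mean(psprint_seq_rsm)
column_sd = stdev(psprint_seq_rsm)  # sample standard deviation

print(f"Average = {column_mean:.3f}, STDEV = {column_sd:.3f}")
```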
Performance comparisons of RNA partner‐specific RNA‐binding residue predictors with RNA partner‐agnostic RNA‐binding residue predictors using RBPs in PR24 test set
| Methods | Sn | Sp | ACC | MCC | AUC [ROC] | AUC [CROC] |
|---|---|---|---|---|---|---|
| PS‐PRIP | | | | | | |
| PRIdictor | | | | | | |
| PSPRInt‐Seq | | | | | | |
| PSPRInt‐Str | | | | | | |
| RNABindRPlus | 0.79 | 0.81 | 0.80 | 0.54 | 0.88 | 0.36 |
| FastRNABindR | 0.85 | 0.57 | 0.62 | 0.35 | 0.80 | 0.33 |
| RNABindR v2.0 | 0.83 | 0.54 | 0.58 | 0.30 | 0.74 | 0.29 |
| BindN+ | NA | NA | NA | NA | NA | NA |
| RBScore | 0.70 | 0.75 | 0.74 | 0.40 | 0.79 | 0.37 |
Results for the RNA partner‐specific methods are shown in bold, above the line. All other methods (below the line) are RNA partner‐agnostic.
Abbreviations: ACC, accuracy; AUC, area under the curve; CROC, concentrated receiver operating characteristic; MCC, Matthews correlation coefficient; ROC, receiver operating characteristic; Sn, sensitivity; Sp, specificity.
AUC values are NA for methods that return only predicted binary labels. Results for BindN+ are NA because the server was no longer accessible when this experiment was run.
The performance metrics for PSPRInt‐Seq and PSPRInt‐Str assessed using the PR24 dataset should be interpreted with caution because PR24 is derived in part from PR122 (the dataset used to train PSPRInt‐Seq and PSPRInt‐Str) and in part from PR50. PR24 is therefore not an independent test set for the PSPRInts, and their superior performance here may be due to the 19 protein‐RNA pairs that PR24 shares with the training data.