
Computational Prediction of RNA-Binding Proteins and Binding Sites.

Jingna Si¹, Jing Cui², Jin Cheng², Rongling Wu³.

Abstract

Protein-RNA interactions play vital roles in many cellular processes such as protein synthesis, sequence encoding, RNA transfer, and gene regulation at the transcriptional and post-transcriptional levels. Approximately 6%-8% of all proteins are RNA-binding proteins (RBPs). Distinguishing these RBPs or their binding residues is a major aim of structural biology. A number of experimental methods have been developed to determine protein-RNA interactions, but they are expensive, time-consuming, and labor-intensive. As an alternative, researchers have developed many computational approaches to predict RBPs and protein-RNA binding sites by combining various machine learning methods with abundant sequence and/or structural features. These approaches fall into three kinds: prediction from protein sequence, prediction from protein structure, and protein-RNA docking. In this paper, we review existing studies on the prediction of RNA-binding sites, RBPs, and protein-RNA complexes, including the data sets used in different approaches, the sequence and structural features used in several predictors, prediction method classifications, performance comparisons, evaluation methods, and future directions.


Keywords: RNA-binding proteins (RBPs); RNA-binding site; bioinformatics; macromolecular docking; prediction


Substances：
RNA-Binding Proteins
RNA

Year: 2015 PMID： 26540053 PMCID： PMC4661811 DOI： 10.3390/ijms161125952

Source DB: PubMed Journal: Int J Mol Sci ISSN： 1422-0067 Impact factor: 5.923

1. Introduction

Approximately 6%–8% of proteins are RNA-binding proteins (RBPs). These RBPs play an important part in gene expression and regulation. Due to study limitations, only a few types of RBPs have been identified, such as HuR, AUF1, TTP, TIA1, and CUGBP2. These RBPs perform essential roles in various biological processes such as mRNA stability [1], stress responses [2], cell cycle, tumor differentiation [3], apoptosis, and gene regulation at the transcriptional and post-transcriptional levels [4]. Determining the three-dimensional (3D) structures of protein–RNA complexes facilitates the identification of physicochemical properties and biological interactions. The experimental methods typically used for protein–RNA complex structure determination, e.g., nuclear magnetic resonance spectroscopy (NMR) [5] and X-ray crystallography [6], are expensive, time-consuming and labor-intensive. To date, 2274 protein–RNA complex structures determined by experimental methods have been deposited in the Protein Data Bank (PDB) [7], far fewer than the number of complexes that exist in nature. Given the large numbers of known nucleic acid and protein sequences, improved knowledge of how protein–RNA interactions occur could help us to infer functional information. To achieve this goal, it is necessary to develop computational approaches that can reliably and rapidly identify RNA-binding proteins or sites. In contrast with experimental methods, computational tools can identify RNA-binding sites and RBPs inexpensively and quickly, which is useful in studying protein–RNA interactions [8]; however, prediction based only on amino acid sequence information is difficult because organisms are highly complex. Several methods have been developed which focus on predicting RNA-binding sites and determining whether a protein–RNA complex exists.
The majority of previous studies have focused on predicting RNA-binding sites and RBPs based on sequence similarity [9,10,11,12]. The query protein sequence is searched against databases; if its homologous sequences are known RNA-binding proteins, the query protein is regarded as an RNA-binding protein. Similarly, RNA-binding residues and sites in the query sequence can be detected. Methods based on predicted structural and sequence information are the most often used computational approaches to identify RNA-binding sites or RBPs. If the 3D structure of a target protein is known, structure-based prediction can be carried out to distinguish RBPs [13,14,15]. It is believed that structural similarity can provide more reliable and in-depth prediction results. Another technique is docking, which starts from the coordinates of the components and aims at modelling the interaction conformation of macromolecular complexes [16]. Many protein–protein docking tools have been reported, but no specific RNA–protein docking method exists [17]. Several protein–protein docking programs, such as HADDOCK [18], GRAMM [19], HEX [20], PatchDock [21], and FTDock [22], accept RNA and protein coordinates as inputs to generate models of protein–RNA complexes. The above strategies for RNA-binding site and RBP prediction are summarized in Figure 1.

Figure 1

Strategies for RNA-binding site and RBP prediction.

Although the methodology for predicting protein–protein interactions and protein–DNA interactions is well established [23,24], analyses of computational approaches used to identify protein–RNA interactions are lacking [8,17]. In this review, we discuss computational approaches for predicting RBPs and RNA-binding sites based on protein sequences or known protein 3D structures, as well as RNA–protein complex docking methods. We summarize detailed information on these computational tools, including the sequence- and/or structure-based feature vectors, the datasets used, performance comparisons, and machine learning methods. In particular, we list the available web servers for RNA-binding site and RBP prediction, which are convenient for scientists. Finally, we discuss future directions and several implications that can aid method development.

2. Development of Computational Methods for Prediction of RNA-Binding Site

2.1. Data Set

The sequences and structures of protein–RNA complexes are available from the PDB and from other specific protein–RNA interaction databases (Available online: http://pridb.gdcb.iastate.edu/) [25]. We analyzed several previous studies and summarize the datasets and methods used in Table 1. Of all existing datasets, RB344 is the largest and contains 344 RBPs that are almost entirely non-redundant at 30% sequence identity [26]. In several studies, authors employed the same dataset to compare the advantages and disadvantages of various methods. In particular, Cheng et al. [27] constructed a novel PRIPU dataset which differed from previous datasets. The PRIPU dataset contains positive and unlabeled samples, but no negative samples, because the negative samples in earlier datasets are not necessarily genuine negatives and may even be unknown positives.

Table 1

Commonly used data sets for RNA-binding sites identification.

ID	Reference	Publication Year	Notes
PRIPU dataset	[27]	2015	The dataset contains only positive and unlabeled examples, whereas previous datasets usually include negative samples. Such negative samples are not necessarily real negatives; some may even be unknown positive samples
^a RB344	[26]	2015	344 RNA binding proteins, almost entirely non-redundant at 30% sequence identity
RB172	[28]	2014	172 protein entries with sequence identity of less than 25%
RB75	[8]	2012	75 RNP complexes released between 1 January and 28 April 2011 from PDB database ^b, non-redundant at 40% sequence identity
RB199	[25,29]	2011	Extracted dataset (May 2010) from PDB database. Proteins with >30% sequence identity or structures with resolution worse than 3.5 Å were removed
RB164	[30]	2010	The data were downloaded from RsiteDB. After removing protein and RNA chains with sequence identity above 25% and 60%, respectively, 205 non-redundant protein–RNA chains in 164 complexes were obtained
RB86	[31]	2008	86 RNA-binding protein chains were collected for training and fivefold cross validation
RB147	[32]	2007	Adding novel RNA-binding complexes since 2006, based on RB109
RB109	[33]	2006	109 RNA–protein complexes extracted from structures of known RNA–protein complexes solved by X-ray crystallography in the PDB. Proteins with >30% sequence identity or structures with resolution worse than 3.5 Å were removed

a RB: Abbreviation of RNA-binding dataset; b PDB: Protein Data Bank.

RNA-binding residues are determined using two definitions: (i) a residue with any atom within a cutoff distance (3–6 Å, depending on the study) of any atom in a nucleotide; and (ii) a residue involved in hydrophobic, electrostatic, van der Waals, or hydrogen-bonding interactions with nucleotides [25]. Residues satisfying these definitions are considered to be RNA-binding residues. As with protein–DNA and protein–protein complexes, similar sequences are eliminated before dataset construction. Generally, sequences with identity greater than 30%–40% are considered redundant. Clustering programs such as blastclust (available from NCBI), CD-HIT [34], and the PISCES web server are used to generate a non-redundant dataset.
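Definition (i) above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name binding_residues, the input layout (residue id plus Cartesian coordinates in Å), and the toy coordinates are hypothetical and do not come from any of the reviewed tools; a real pipeline would read the atoms from a PDB file.

```python
import math

def binding_residues(protein_atoms, rna_atoms, cutoff=5.0):
    """Return sorted ids of residues with any atom within `cutoff` Å of any RNA atom.

    protein_atoms: iterable of (residue_id, (x, y, z)) tuples (hypothetical layout)
    rna_atoms:     iterable of (x, y, z) tuples
    """
    rna = list(rna_atoms)
    hits = set()
    for res_id, coord in protein_atoms:
        if res_id in hits:
            continue  # residue already labeled as binding
        if any(math.dist(coord, r) <= cutoff for r in rna):
            hits.add(res_id)
    return sorted(hits)

# Toy coordinates, invented for illustration only.
protein = [
    (1, (0.0, 0.0, 0.0)),
    (1, (1.5, 0.0, 0.0)),
    (2, (10.0, 0.0, 0.0)),
    (3, (20.0, 0.0, 0.0)),
]
rna = [(4.0, 0.0, 0.0), (13.0, 0.0, 0.0)]

print(binding_residues(protein, rna, cutoff=3.5))
```

Because the cutoff is a parameter, the same function covers the 3–6 Å range used across studies; the brute-force all-pairs search is fine for a sketch but would normally be replaced by a spatial index for large complexes.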

2.2. Feature Selection for RNA-Binding Residues and Protein Predictors

Many features have been used to identify RBPs and binding sites. They fall into three kinds: sequence-based features, structure-based features, and chemical and physical features. The commonly used features summarized here include amino acid composition, sequence similarity, evolutionary information, accessible surface area (ASA), predicted secondary structures (SSs), hydrophobicity, electrostatic patches, cleft sizes, and other global protein features. Details of these features are given below.

2.2.1. Sequence-Based Features

Amino Acid Composition

One of the most commonly used features of a protein sequence is its amino acid composition, not only in protein–protein interaction site prediction but also in RNA-binding site prediction. The 20 amino acids can be grouped by their properties into hydrophobic residues (G, F, L, M, A, I, P, V), polar residues (Q, T, S, N, C, Y, W), and charged residues (H, R, K, E, D) [35]. One encoding method is based on these physicochemical properties: hydrophobic, polar, and charged residues are encoded as (1 0), (0 1), and (0 0), respectively. Notably, the negatively charged RNA backbone is more likely to bind positively charged residues, as shown in previous studies [36]. The other encoding method is standard binary encoding, which encodes each amino acid as a 20-dimensional binary vector, such as A (1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0), E (0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0), F (0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0),…, and Y (0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1).
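Both encodings can be illustrated with a short Python sketch. The residue groups are the ones listed above; the alphabetical ordering of the 20-dimensional vector is an assumption chosen to match the example vectors in the text (A first, E fourth, F fifth, Y last), and the function names are hypothetical.

```python
# Alphabetical one-letter codes; position i of the binary vector marks AMINO_ACIDS[i].
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Residue groups as listed in the text.
HYDROPHOBIC = set("GFLMAIPV")
POLAR = set("QTSNCYW")
CHARGED = set("HRKED")

def one_hot(residue):
    """Standard binary encoding: a 20-dimensional vector with a single 1."""
    vector = [0] * len(AMINO_ACIDS)
    vector[AMINO_ACIDS.index(residue)] = 1  # raises ValueError for unknown letters
    return vector

def property_code(residue):
    """Two-bit physicochemical encoding: hydrophobic (1 0), polar (0 1), charged (0 0)."""
    if residue in HYDROPHOBIC:
        return (1, 0)
    if residue in POLAR:
        return (0, 1)
    if residue in CHARGED:
        return (0, 0)
    raise ValueError(f"unknown residue: {residue!r}")

def encode_sequence(sequence):
    """Concatenate both encodings into one 22-dimensional vector per residue."""
    return [one_hot(r) + list(property_code(r)) for r in sequence]

features = encode_sequence("AEFY")
print(len(features), len(features[0]))
```

Concatenating the two encodings is just one possible design; individual predictors typically choose one encoding or combine it with other features over a sliding window.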

Sequence Similarity

Sequence similarity (also referred to as sequence conservation) is frequently used for RNA-binding site prediction. The BLAST and PSI-BLAST programs are used to compare the similarities among various protein sequences. Generally, a multiple sequence alignment (MSA) is obtained by comparing the query sequence against the NCBI non-redundant database and is used to calculate each residue’s sequence similarity score. A number of conservation scoring schemes are available, including relative entropy, von Neumann entropy, Shannon entropy, and Scorecons.
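As an illustration of one of these scores, the sketch below computes a Shannon-entropy-based conservation value for each column of a toy alignment. Ignoring gap characters and normalizing by log2(20) are assumptions made for this example, not choices documented for any specific predictor, and the alignment itself is invented.

```python
import math
from collections import Counter

def shannon_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column; gaps are ignored."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0  # all-gap column: treated as zero entropy in this sketch
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def conservation_scores(alignment):
    """Per-column conservation in [0, 1]: 1 means fully conserved."""
    max_entropy = math.log2(20)  # entropy of a uniform distribution over 20 amino acids
    return [1.0 - shannon_entropy(col) / max_entropy for col in zip(*alignment)]

# Toy alignment of four equal-length sequences (hypothetical data).
alignment = ["KRA", "KRG", "KKS", "KRT"]
scores = conservation_scores(alignment)
print([round(s, 3) for s in scores])
```

The first column (all K) is fully conserved, the second is mostly conserved, and the third is the most variable, so the scores decrease from left to right.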

Evolutionary Information

Evolutionary information has often been introduced in functional site predictors in recent studies, including RNA-binding site prediction. Previous studies showed that the position-specific scoring matrix (PSSM), an important form of evolutionary information, greatly improved the performance of RBP prediction. PSSMs were used widely in previous prediction studies because they provide the likelihood of a particular residue substitution based on evolutionary information.

2.2.2. Structure-Based Features

The Secondary Structure (SS)

The secondary structure (SS) provides local geometric patterns and can be obtained in two ways. When the protein structure is available, the actual SS can be calculated using an SS assignment approach such as DSSPcont [37,38]; when the structure is unavailable, a predicted SS can be obtained using an SS prediction algorithm such as PSIPRED [38,39,40]. SS has been employed as an encoding feature in several studies to predict RNA-binding residues [41,42].

Accessible Surface Area (ASA)

RNA-binding residues tend to be exposed and to interact with RNA, so calculating solvent accessibility is helpful in RNA-binding site prediction. The relative ASA can be calculated using NACCESS [43,44] when the protein structure is available. It is worth pointing out that the relative ASA could not be calculated when the DNA molecule was absent. Residues with a relative ASA value greater than 5% are defined as surface accessible residues.
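The 5% rule can be sketched as follows. The absolute ASA values would normally come from a tool such as NACCESS; here they are invented, and the per-residue maximum ASA values in EXAMPLE_MAX_ASA are placeholders for illustration only, not a published reference scale.

```python
# Placeholder maximum ASA values in Å^2 (hypothetical, NOT a published scale).
EXAMPLE_MAX_ASA = {"K": 200.0, "L": 180.0, "G": 85.0}

def surface_residues(asa_by_residue, max_asa, threshold=0.05):
    """Return ids of residues whose relative ASA exceeds `threshold` (5% by default).

    asa_by_residue: dict mapping residue id -> (one-letter code, absolute ASA in Å^2)
    max_asa:        dict mapping one-letter code -> maximum ASA for that residue type
    """
    exposed = []
    for res_id, (aa, asa) in asa_by_residue.items():
        relative = asa / max_asa[aa]  # relative ASA as a fraction of the maximum
        if relative > threshold:
            exposed.append(res_id)
    return exposed

# Invented absolute ASA values for three residues.
asa = {1: ("K", 120.0), 2: ("L", 4.0), 3: ("G", 30.0)}
print(surface_residues(asa, EXAMPLE_MAX_ASA))
```

In this toy input, residue 2 has a relative ASA of about 2% and is treated as buried, while residues 1 and 3 are surface accessible.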

2.2.3. Chemical and Physical Features

Hydrophobicity

Hydrophobicity, which reflects the tendency of a residue to be repelled by water, is frequently used by RNA-binding site predictors. A hydrophobicity scale assigns a numerical value to each amino acid [45].
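A hydrophobicity scale is typically applied as a per-residue lookup, often averaged over a sliding window. The sketch below uses the Kyte-Doolittle hydropathy values as an assumed example scale; the scale cited as [45] may differ, and the window size and function name are illustrative.

```python
# Kyte-Doolittle hydropathy values, used here as an assumed example scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=3):
    """Mean hydropathy in a window centered on each residue (window is truncated at the ends)."""
    half = window // 2
    profile = []
    for i in range(len(sequence)):
        lo = max(0, i - half)
        hi = min(len(sequence), i + half + 1)
        values = [KYTE_DOOLITTLE[aa] for aa in sequence[lo:hi]]
        profile.append(sum(values) / len(values))
    return profile

profile = hydropathy_profile("IIKK", window=3)
print([round(v, 2) for v in profile])
```

The profile drops from strongly positive at the isoleucine end to strongly negative at the lysine end, which is the kind of per-residue signal a predictor would take as input.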

Electrostatic Patches

The state of a protein surface can be described by electrostatic patches. Generally, nucleic acid-binding interfaces are more likely to be positively charged electrostatic patches [46]. Electrostatic patches can be computed using GRASP [47], GRASS [48], or the web server PFplus (PatchFinderPlus; Available online: http://pfp.technion.ac.il) [49].

Cleft Size

Cleft size is an important feature because the largest cleft on a protein surface tends to be where the protein active site is located [50]. The charge, dipole, and quadrupole moments can also be used to adequately recognize RBPs [51].

2.3. Prediction Methods

The computational methods used in previous studies to identify RBPs or RNA-binding sites can be divided into three categories: (1) sequence-based prediction methods, used when the sequence is known but the structure is not; (2) structure-based prediction methods, used when the query protein structure has been resolved; and (3) modeling with a docking method, used when the structure of the complex is unknown. These three approaches are detailed below.

2.3.1. Sequences-Based Methods

Sequence-Based Methods for RNA-Binding Site Prediction

Because few protein–RNA complex structures are known, prediction methods which use only sequence information play an important role. Jeong et al. [52] introduced a predictor for RNA-binding sites that uses predicted secondary structure and amino acid type as inputs to an artificial neural network. Subsequently, Terribilini et al. [33] contributed RNABindR, a classical method that trains naive Bayes (NB) classifiers to predict RNA-binding sites; the RB109 dataset it used is listed in Table 1. Wang and Brown developed the BindN tool, a predictor of RNA- and DNA-binding sites [9]. The sequence features used in this method include molecular mass, hydrophobicity index, and the side chain pKa value. Later predictors added evolutionary information, especially in the form of PSSMs. PPRInt, developed by Kumar et al. [31], combined evolutionary information (PSSM) with support vector machines (SVMs) and improved RNA-binding site and residue predictions significantly. Wang et al. [53] used SVMs and PSSM profiles coupled with predicted SS and PSI-BLAST profiles in the PRINTR method to obtain improved performance. Tong et al. [54] introduced RISP, a hybrid RNA-binding site predictor which uses SVMs in conjunction with PSSMs and achieved a 61.0% increase in sensitivity and an 83.3% increase in specificity. A similar method, RNAProB, which uses an SVM and a novel smoothed PSSM encoding, was developed by Cheng et al. [55] and gave better performance than the then state-of-the-art systems. In 2010, Li et al. [56] constructed a novel method employing evolutionary PSSM and structure-derived features to predict RNA-binding residues, which led to significant improvement. Liu et al. [30] proposed a novel classification system that combined sequence/structure-based features with interaction propensity, a novel interacting feature, and used the random forest machine learning method. 
Furthermore, Liu et al. compared their method with previous methods (e.g., RNAProB, PPRInt, BindN and RNABindR) and achieved enhanced performance. Zhang et al. [57] presented an RNA-binding residue predictor using solvent accessibility, predicted SS, evolutionary conservation and sequence information. RNABindRPlus [58] is a recently developed predictor that combines sequence homology with machine learning methods and clearly improves prediction reliability. Recently, Cheng et al. [27] developed a predictor (PRIPU) for protein–RNA interactions; its most important difference from earlier methods is that it uses only positive and unlabeled samples, not negative samples.

Sequence-Based Methods for RNA-Binding Proteins (RBPs) Prediction

Han et al. [36] explored the SVM machine learning method to predict RBPs directly from their primary sequence. The dataset in this work contained 447 RBPs and 4881 non-RBPs. The prediction accuracy was 40.0% and 99.9% for snRBPs and non-snRBPs, respectively, indicating the need for a sufficient number of proteins to train the SVMs. Shao et al. [59] developed a predictor of RNA-binding proteins that applies SVM methods to sequence characteristics. As in RNA-binding site prediction, evolutionary information was introduced to improve the performance of RBP predictions. Kumar et al. [60] developed RNApred, which combines binding residues, PSSM profiles and the SVM method to discriminate RBPs from non-RBPs. A voting system has also been used to identify RBPs [42]. Zhao et al. developed SPOT for prediction of RBPs using a fold recognition method, which is freely available on the internet for academic users (Table 2).

Table 2

A general selection of Web servers of RNA-binding sites and protein prediction and protein–RNA complex docking.

Methods	URLs	References	Available	Seq/Struc/Docking	Sites/Protein
PRIPU	http://admis.fudan.edu.cn/projects/pripu.htm	Cheng et al. (2015) [27]	○	seq	site
RNABindRPlus	http://einstein.cs.iastate.edu/RNABindRPlus/	Walia et al. (2014) [58]	○	seq	site
CatRAPID omics	http://s.tartaglialab.com/catrapid/omics	Agostini et al. (2013) [81]	○	seq	site
SRCPred	http://tardis.nibio.go.jp/netasa/srcpred	Fernandez et al. (2011) [29]	○	seq	site
SPOT	http://sparks.informatics.iupui.edu	Zhao et al. (2011) [15]	X	seq	protein
PRBR	http://www.cbi.seu.edu.cn/PRBR/	Ma et al. (2011) [12]	○	seq	site
RNAPred	http://www.imtech.res.in/raghava/rnapred/	Kumar et al. (2011) [60]	○	seq	protein
RPISeq	http://pridb.gdcb.iastate.edu/RPISeq/	Muppirala et al. (2011) [82]	○	seq	site
BindN+	http://bioinfo.ggc.org/bindn+/	Wang et al. (2010) [11]	○	seq	site
NAPS	http://prediction.bioengr.uic.edu/	Carson et al. (2010) [81]	X	seq	site
PiRaNhA	http://bioinformatics.sussex.ac.uk/PIRANHA/	Murakami et al. (2010) [10]	○	seq	site
PRNA	http://www.sysbio.ac.cn/datatools.asp	Liu et al. (2010) [56]	X	seq	site
RNA	http://mcgill.3322.org/RNA/	Li et al. (2010) [55]	X	seq	site
RISP	http://grc.seu.edu.cn/RISP	Tong et al. (2008) [54]	X	seq	site
PRINTR	http://210.42.106.80/printr/	Wang et al. (2008) [53]	X	seq	site
PPRInt	http://www.imtech.res.in/raghava/pprint/	Kumar et al. (2008) [52]	○	seq	site
RNABindR	http://bindr2.gdcb.iastate.edu/RNABindR/	Terribilini et al. (2007) [32]	○	seq	site
BindN	http://bioinfo.ggc.org/bindn/	Wang and Brown (2006) [9]	○	seq	site
SVMProt	http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi	Han et al. (2004) [36]	X	seq	protein
RBRDetector	http://ibi.hzau.edu.cn/rbrdetector	Yang et al. (2014) [64]	○	struc	site
SPOT-Seq-RNA	http://sparks-lab.org/server/SPOT-Seq-RNA/	Yang et al. (2014) [65]	X	struc	protein
DRNA	http://sparks.informatics.iupui.edu/yueyang/DFIRE/dRdR-DB-service	Zhao et al. (2011) [15]	X	struc	protein
OPRA	Program available upon request from the authors	Perez-Cano and Fernandez-Recio (2010) [14]	○	struc	site
PRIP	http://www.qfab.org/PRIP	Maetschke et al. (2009) [62]	X	struc	site
KYG	http://cib.cf.ocha.ac.jp/KYG/	Kim et al. (2006) [13]	X	struc	protein
DARS-RNP and QUASI-RNP	http://www.genesilico.pl/RNP/	Tuszynska and Bujnicki (2011) [66]	○	docking	complex
PatchDock	http://bioinfo3d.cs.tau.ac.il/PatchDock/index.html	Schneidman-Duhovny et al. (2005) [21]	○	docking	complex
HADDOCK	http://www.nmr.chem.uu.nl/haddock/; http://haddock.science.uu.nl/services/HADDOCK	Dominguez et al. (2003) [18]	○	docking	complex
Hex	http://hex.loria.fr/; http://hexserver.loria.fr/	Ritchie and Kemp (2000) [20]	○	docking	complex
FTDock (3D-Dock)	http://www.sbg.bio.ic.ac.uk/docking/	Gabb et al. (1997) [22]	○	docking	complex
GRAMM	http://vakser.bioinformatics.ku.edu/main/resources_gramm1.03.php	Katchalski-Katzir et al. (1992) [19]	○	docking	complex

○: the URL was available at the time of writing; X: the URL was not available at the time of writing; URL: Uniform Resource Locator.

2.3.2. Structure-Based Methods

Structure-Based Methods for RNA-Binding Site Prediction

When the structure of the query protein is available and used by the prediction system, the prediction becomes more reliable. There are a number of structure-based RNA-binding site prediction methods. Kim et al. [13] developed the KYG method, which uses sequence profiles, doublets of spatially close residues, a number of binding scores, and their combinations. Chen and Lim [61] developed a predictor based on structure information including electrostatics, evolution, and geometry. The disparate cleft and the surface patch were considered to be RNA-binding sites. Subsequently, PRIP [62] was created, which exploited structural and topological information (retention coefficient, betweenness-centrality, accessible surface area and PSI-BLAST profile) and used two machine learning methods (SVM and naive Bayes classifiers). Towfic et al. [63] contributed Struct-NB, which uses structural features with a naive Bayes classifier to predict RNA-binding sites. Recently, two structure-based predictors were proposed. RBRDetector [64], which uses evolutionary and microenvironmental features as inputs, combines feature- and template-based strategies to improve predictions of RNA-binding residues. The other predictor compares each template patch with surrounding patches and uses the accumulated distances as structural features [26].

Structure-Based Methods for RBP Prediction

Zhao et al. [15] introduced a predictor for RNA-binding domains based on structure information, which combined RNA binding affinity and relative structural similarity. SPOT-Seq-RNA [65] is a template-based structure prediction package which integrates RBP, RNA-binding residue, and protein–RNA complex structure prediction. RBPs and protein–RNA complexes are often modeled using the docking method.

2.3.3. Protein–RNA Complex Docking

Research on protein 3D structure modeling has become increasingly complex. Modeling the structure of a protein–RNA complex is very important for understanding the mechanisms of interaction. Several docking techniques used to predict protein–RNA complexes rely on known RNA and protein structures. There are no dedicated protein–RNA docking algorithms; most reported docking techniques are adapted from protein–ligand and protein–protein docking software by employing energy/scoring functions fitted to protein–RNA interactions. For example, Katchalski-Katzir et al. [19] developed a low-resolution docking program, which requires specific scoring functions for different ligands. During modeling, the program performs a six-dimensional search over the rigid-body rotations and translations of the ligand molecule and generates decoys. Gabb et al. [22] introduced the FTDock program, which handles not only protein–protein docking but also nucleic acid molecules. Ritchie and Kemp [20] introduced Hex, which enables protein–nucleic acid and protein–protein docking. Its decoy scoring method includes electrostatics and shape-matching terms but has no special function for protein–RNA complexes. HADDOCK [18] can dock various molecules (e.g., nucleic acids, proteins and small molecules) and uses biochemical and biophysical information as inputs. Recently, Tuszynska and Bujnicki [66] developed QUASI-RNP and DARS-RNP, which use statistical and quasi-chemical reference states to score protein–RNA decoys.

2.4. Prediction Algorithms

Almost all popular machine learning methods have been used for prediction of RNA-binding sites or RBPs. Generally, machine learning methods achieve satisfactory performance when given valid sequence- and/or structure-based features. The methods frequently used in RNA-binding research include SVMs [27,67], artificial neural networks (ANN) [68], Bayesian networks [29,67], and random forest [12,69]. Puton et al. [8] built a meta-predictor of RNA-binding residues from three of the highest ranked sequence-based primary predictors; this meta-predictor outperforms most other predictors. The template-based approach is another way to predict the structure of a protein–RNA complex when a template structure is available. It recognizes putative RBPs by structurally aligning the query protein to RBPs with known structures. SPARKS X [15] is a program for template-based structure prediction. Similarly, TM-align [70] is a structural alignment program.
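The idea of a meta-predictor can be illustrated with a minimal score-averaging sketch. The combination rule actually used by Puton et al. [8] is not described here, so the unweighted mean, the 0.5 threshold, and the three toy score lists below are assumptions made purely for illustration.

```python
def meta_predict(score_lists, threshold=0.5):
    """Average per-residue scores from several predictors and threshold the consensus.

    score_lists: list of equal-length lists, one per primary predictor,
                 each holding a binding score in [0, 1] for every residue.
    Returns (consensus_scores, binary_labels).
    """
    if not score_lists:
        raise ValueError("at least one predictor is required")
    length = len(score_lists[0])
    if any(len(scores) != length for scores in score_lists):
        raise ValueError("all predictors must score the same residues")
    consensus = [sum(column) / len(column) for column in zip(*score_lists)]
    labels = [int(score >= threshold) for score in consensus]
    return consensus, labels

# Hypothetical scores from three primary predictors for a three-residue protein.
predictor_a = [0.2, 0.8, 0.5]
predictor_b = [0.4, 0.6, 0.1]
predictor_c = [0.6, 1.0, 0.3]

consensus, labels = meta_predict([predictor_a, predictor_b, predictor_c])
print([round(s, 2) for s in consensus], labels)
```

In practice the primary predictors output scores on different scales, so some normalization or learned weighting would be needed before averaging.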

2.5. Evaluation and Performance of Various Predictors

2.5.1. Performance Measures

The parameters commonly used to assess RNA-binding site and RBP prediction performance include sensitivity, accuracy, strength, specificity, F-measure, precision, the Matthews correlation coefficient (MCC), and area under the receiver operating characteristic curve (AUC). These parameters are listed in detail in Table 3.

Table 3

Evaluation parameters.

Parameter	Meaning	Expression
Accuracy (ACC)	Percentage of correct predictions	$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$ ^a
Sensitivity	Percentage of correctly predicted positives	$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$
Specificity	Percentage of correctly predicted negatives	$\mathrm{Specificity} = \frac{TN}{TN + FP}$
Strength	Mean of sensitivity and specificity	$\mathrm{Strength} = \frac{\mathrm{Sensitivity} + \mathrm{Specificity}}{2}$
MCC	Matthews correlation coefficient	$\mathrm{MCC} = \frac{(TP \times TN) - (FN \times FP)}{\sqrt{(TP + FN) \times (TN + FP) \times (TP + FP) \times (TN + FN)}}$
Precision	Positive predictive rate	$\mathrm{Precision} = \frac{TP}{TP + FP}$
F-measure	The harmonic mean of precision and sensitivity	$F\text{-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Sensitivity}}{\mathrm{Precision} + \mathrm{Sensitivity}}$
AUC ^b	Probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one	$\mathrm{AUC} = \sum_{i=1}^{n} \frac{T_i}{nT}$

a TP = True positive number; TN = True negative number; FP = False positive number; FN = False negative number; b In the AUC formulation, i takes on values from 1 to n, T is the total number of positives in the test set, and T_i is the number of positives that score higher than the ith highest scoring negative.

For the formulas presented in Table 3, TP represents true positives, which are correctly predicted RNA-binding residues; FP indicates false positives, which are non-binding residues mistakenly predicted as RNA-binding; TN denotes true negatives, which are correctly predicted non-RNA-binding residues; and FN means false negatives, which are RNA-binding residues wrongly predicted as non-binding. Because positive and negative samples are imbalanced, the MCC is regarded as a proper measure of overall performance. The MCC ranges from −1 to 1: “MCC = 0” means completely random prediction, and “MCC = 1” indicates perfect prediction. A higher MCC represents better prediction accuracy. Another widely used measure is the receiver operating characteristic (ROC) curve, especially in the comparison of several predictors. The x-axis of the ROC curve represents the false positive rate and the y-axis denotes the true positive rate. The larger the area under the curve (AUC), the better the method.
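The measures in Table 3 can be computed directly from the four confusion-matrix counts. The sketch below is a plain-Python illustration; the function names and the example counts are hypothetical, and the AUC function follows the ranking formulation in the table footnote under the assumption that n is the number of negatives and that ties are not counted.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Compute the Table 3 measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    mcc = (tp * tn - fn * fp) / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "strength": (sensitivity + specificity) / 2,
        "mcc": mcc,
        "precision": precision,
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
    }

def auc_by_ranking(positive_scores, negative_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly (ties not counted)."""
    n = len(negative_scores)   # assumed meaning of n in the footnote
    t = len(positive_scores)   # T: total number of positives
    # T_i summed over all negatives: positives scoring higher than each negative.
    total = sum(sum(1 for p in positive_scores if p > neg) for neg in negative_scores)
    return total / (n * t)

# Invented counts: 50 binding residues and 150 non-binding residues.
metrics = confusion_metrics(tp=40, tn=120, fp=30, fn=10)
auc = auc_by_ranking([0.9, 0.8, 0.4], [0.7, 0.3])
print({name: round(value, 3) for name, value in metrics.items()}, round(auc, 3))
```

With these counts the accuracy is 0.8 while the MCC is noticeably lower, which illustrates why the MCC is preferred when the classes are imbalanced.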

2.5.2. Comparison of Various Prediction Methods

The prediction results of existing methods for RNA-binding site and RBP prediction are summarized in Table 4. The accuracy of most predictions is approximately 60%–80%, and the specificity and sensitivity of these methods range widely. Each method has its own strengths because of the different datasets, input features, and algorithms used. Three main datasets are listed in Table 4: RB75, RB172, and RB344. Several original studies [8,28,71] independently compared several predictors on a unified dataset, and their results are summarized in this manuscript. The MCC is generally considered an unbiased measure and has been calculated for most methods, which helps significantly when comparing their performance. A meta-predictor that combines three predictors has also been developed and has satisfactory performance [8].

Table 4

Performance of the state-of-the-art methods for RNA-binding site prediction.

Methods	Data Set	ACC	SEN	SPE	AUC	MCC	Strength	F-Measure	Precision	Reference	Feature
PiRaNhA	RB75	-	-	-	0.822	0.435	-	-	-	[8]	Sequence-based
PPRInt	RB75	-	-	-	0.779	0.339	-	-	-	[8]
PPRInt	RB172	0.71	-	-	-	0.25	0.66	-	-	[28]
PPRInt	RB344	0.70	0.45	0.82	0.68	0.28	-	0.49	0.53	[26]
BindN	RB75	-	-	-	0.733	0.297	-	-	-	[8]
BindN	RB172	0.75	-	-	-	0.23	0.64	-	-	[28]
BindN+	RB75	-	-	-	0.821	0.397	-	-	-	[8]
BindN+	RB172	0.79	-	-	-	0.34	0.71	-	-	[28]
BindN+	RB344	0.72	0.32	0.89	0.68	0.26	-	0.41	0.56	[26]
RNABindR	RB75	-	-	-	0.708	0.317	-	-	-	[8]
RNABindR v2.0	RB172	0.66	-	-	-	0.27	0.69	-	-	[28]
PRBR	RB75	-	-	-	N/A ^a	0.294	-	-	-	[8]
NAPS	RB75	-	-	-	0.679	0.215	-	-	-	[8]
NAPS	RB172	0.66	-	-	-	0.17	0.61	-	-	[28]
RNAProB	RB172	0.82	-	-	-	0.22	0.60	-	-	[28]
KYG *	RB75	-	-	-	N/A	0.382	-	-	-	[8]	Structure-based
DRNA *	RB75	-	-	-	N/A	0.382	-	-	-	[8]
DRNA *	RB344	0.75	0.21	0.94	N/A	0.22	-	0.31	0.54	[26]
OPRA *	RB75	-	-	-	N/A	0.296	-	-	-	[8]
Ren’s method	RB344	0.68	0.48	0.76	0.68	0.26	-	0.48	0.48	[26,83]
Meta-predictor ^b	RB75	-	-	-	0.835	0.460	-	-	-	[8,34]

a N/A—not available; MCC—Matthews Correlation Coefficient; AUC—area under curve; SEN—sensitivity; SPE—specificity; b Meta-predictor developed based on top three sequence-based methods according to authors benchmark (PiRaNhA, PPRInt and BindN+); * The meta-predictor is composed of those methods labeled with asterisk.

2.5.3. Collection of Web Servers of RBPs and RNA-Binding Site Predictors

Many researchers provide web servers when they develop novel methods to predict RNA-binding sites and RBPs. Several protein–RNA complex docking programs are also available. We collected the URLs and divided them into sequence-based predictors, structure-based predictors, and docking methods (Table 2). We tested every web server, labeled it with “○” if it was available and “X” if it was not, and noted whether the approach is aimed at predicting binding sites or RBPs. Web servers provide easy-to-use tools to the community: users can conveniently obtain prediction results and learn about the underlying algorithm, while developers can continually refine their methods in response to users’ feedback.

3. Conclusions and Future Perspectives

Due to the significant biological roles of several RNA types, RNA-binding site prediction has become more and more important in the area of protein functional site prediction. Prediction accuracy has improved significantly during the past decades and a number of web servers are available to experimental scientists. Nevertheless, current predictors have shortcomings and require further research to improve their effectiveness. Three outstanding issues face efforts to predict RNA-binding sites and RBPs. The first is how to distinguish DNA-binding sites from RNA-binding sites. Generally, template-based prediction approaches are more effective than machine learning methods at distinguishing RBPs from DNA-binding proteins. Conversely, for RBPs that template-based methods fail to detect, several machine learning methods can still detect RNA-binding residues. Therefore, combining the strengths of the two approaches has the potential to improve RNA-binding site and RBP prediction. The second issue is that it remains unclear which feature vectors contribute more, and which less, to a mature machine learning predictor. The selection of novel and effective features is likely to be one of the most important topics in RBP and RNA-binding site prediction. The third issue is that existing protein–RNA docking approaches do not take into account the conformational changes that may occur when protein and RNA molecules bind. It would be useful to model the 3D RNA structure using RNA folding simulations [72,73,74] and to adapt those methods to refold RNA fragments, simulate protein–RNA interaction, and minimize energy [75,76,77,78,79]. Rother et al. [80] successfully combined RNA and protein 3D structures into a unified modeling method. 
Moreover, further comparison studies are required to adequately evaluate the advantages and disadvantages of the various methods.

A general selection of web servers for RNA-binding site and RBP prediction and for protein–RNA complex docking. ○: the URL is currently available; X: the URL is currently not available; URL: Uniform Resource Locator.

Evaluation parameters. a TP = number of true positives; TN = number of true negatives; FP = number of false positives; FN = number of false negatives; b In the AUC formulation, i takes on values from 1 to n, T is the total number of positives in the test set, and T_i is the number of positives that score higher than the ith highest-scoring negative.

Performance of the state-of-the-art methods for RNA-binding site prediction. a N/A = not available; MCC = Matthews correlation coefficient; AUC = area under the curve; SEN = sensitivity; SPE = specificity; b Meta-predictor developed from the top three sequence-based methods according to the authors' benchmark (PiRaNhA, PPRInt and BindN+); * The meta-predictor is composed of the methods labeled with an asterisk.
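The evaluation parameters named in the table notes above can be illustrated with a minimal Python sketch. It is not taken from any of the reviewed predictors. The function names `confusion_metrics` and `rank_auc` are hypothetical, the sensitivity, specificity, and MCC expressions are the standard textbook definitions, and the AUC is computed under the assumption that n in the table note is the number of negatives, so that AUC = (sum of T_i over all negatives) / (n × T). Ties between a positive and a negative score are counted as "not higher", which is also an assumption.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Return SEN, SPE and MCC from confusion-matrix counts.

    Standard definitions; a zero denominator yields 0.0 by convention here.
    """
    sen = tp / (tp + fn) if (tp + fn) else 0.0
    spe = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SEN": sen, "SPE": spe, "MCC": mcc}


def rank_auc(scores, labels):
    """Rank-based AUC: for each negative, count the positives scoring higher.

    Assumes labels are 1 (binding) or 0 (non-binding), n = number of
    negatives, and T = number of positives.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    # T_i for each negative, summed over all negatives.
    total = sum(sum(1 for p in positives if p > neg) for neg in negatives)
    return total / (len(negatives) * len(positives))


if __name__ == "__main__":
    # Toy residue-level counts and scores, invented purely for illustration.
    print(confusion_metrics(tp=3, tn=3, fp=1, fn=1))
    print(rank_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

Because binding residues are usually far rarer than non-binding residues, MCC and AUC tend to be more informative than accuracy alone when methods are compared on the same data set.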
