RBscore&NBench: a high-level web server for nucleic acid binding residues prediction with a large-scale benchmarking database.

Abstract

RBscore&NBench combines a web server, RBscore and a database, NBench. RBscore predicts RNA-/DNA-binding residues in proteins and visualizes the prediction scores and features on protein structures. The scoring scheme of RBscore directly links feature values to nucleic acid binding probabilities and illustrates the nucleic acid binding energy funnel on the protein surface. To avoid dataset, binding site definition and assessment metric biases, we compared RBscore with 18 web servers and 3 stand-alone programs on 41 datasets, which demonstrated the high and stable accuracy of RBscore. A comprehensive comparison led us to develop a benchmark database named NBench. The web server is available on: http://ahsoka.u-strasbg.fr/rbscorenbench/.

Substances:
Proteins
RNA
DNA

Year: 2016 PMID: 27084939 PMCID: PMC4987871 DOI: 10.1093/nar/gkw251

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

RNA– and DNA–protein interactions occur in a large number of biological processes. The computational prediction of nucleic acid binding residues on proteins is an important step in understanding protein function. Although the problem of binding site prediction is old (1), the prediction algorithms are not as enlightening or effective as expected. We recently developed RBscore (2), which linearly correlates feature values with nucleic acid binding probability in a residue neighboring network approach in order to predict nucleic acid binding residues. RBscore has merits in several respects. (i) Physicochemical and evolutionary features (electrostatics, solvation energy and conservation entropy) can be directly related to nucleic acid binding probabilities. (ii) RBscore, standing for RNA Binding score, was trained on RNA binding proteins (RBP) but demonstrates even higher accuracy for DNA binding residue prediction. This underscores, firstly, that RBP and DNA binding proteins (DBP) employ common driving forces in their binding to nucleic acids and, secondly, that RBscore has successfully captured the main factors controlling the binding propensities. There are more than 100 cases in the PDB (3) that report the formation of protein–RNA–DNA complexes. Recently, it has been estimated that about 2% of the human proteome may bind both RNA and DNA, and such proteins are named DRBP (4). Thus, it is now a major challenge to be able to predict nucleic acid specificities on the basis of the protein alone. (iii) One can plot the nucleic acid binding energy funnel on the protein surface using RBscore, with the residues closer to the nucleic acid binding region having higher prediction scores. This shows that a binary classification of binding sites (binding versus non-binding) is far from sufficient for the binding site prediction problem. In addition, RBscore demonstrates high and stable accuracy on all DBP and RBP datasets. It also maintains a weighted arithmetic mean of the Area Under the ROC Curve (wAUC) above 0.81, which the other predictors do not achieve. Since the RBscore web server went online in June 2014, it has handled more than 5000 jobs submitted by more than 35 users from 11 different countries.

With many strategies for predicting RNA- or DNA-binding sites on proteins reported every year, fair and rigorous benchmarking is a laborious necessity. However, many programs were assessed only by cross-validation on small-scale datasets and did not fully demonstrate their predictive abilities. Current assessments of binding site prediction programs differ in: (i) the definition of nucleic acid binding sites by distance cutoffs; (ii) the training and test datasets, which can induce dataset bias; and (iii) the criteria used to measure prediction performance. The comparisons in many of the reported works were based only on a single cutoff, binary criteria and small-scale dataset validations, which may introduce bias toward certain methods and lead to misleading conclusions. Together with RBscore, we assessed 18 web servers and 3 stand-alone programs, 25 different approaches in total, on 41 different datasets, including more than 5000 protein chains derived from 3D structures of protein–nucleic acid complexes. The results demonstrate that: (i) dataset bias and distance cutoff bias in binding site definition exist in some methods; (ii) DBP and RBP appear to follow similar driving forces, which have been captured by some of the methods; (iii) the predictors have greatly improved over the years, but there is still room for sorting out the essential mechanisms of binding in sequence-based approaches. According to the results, RBscore also ranks as a top-level predictor, showing high and stable accuracy on all datasets regardless of the distance cutoff used to define binding sites.

This assessment work led us to build a database, NBench (5), to benchmark all the prediction programs and to provide all the prediction result data to the scientific community. We hope that future nucleic acid binding site prediction programs can take advantage of the NBench database and be compared directly with the existing 25 approaches, avoiding unnecessary bias from the dataset, binding site definition or assessment metrics.
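The wAUC figure quoted above is a weighted arithmetic mean of per-dataset AUC values. The text does not state which weights are used, so the minimal sketch below assumes, purely for illustration, that each dataset is weighted by its size; the function name, dataset names and numbers are hypothetical.

```python
def weighted_mean_auc(auc_by_dataset, weight_by_dataset):
    """Weighted arithmetic mean of per-dataset AUC values.

    Assumption: weights are dataset sizes; the source text does not
    specify the weighting scheme.
    """
    total_weight = sum(weight_by_dataset[name] for name in auc_by_dataset)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(
        auc_by_dataset[name] * weight_by_dataset[name] for name in auc_by_dataset
    )
    return weighted_sum / total_weight


# Hypothetical AUC values and dataset sizes, not taken from NBench.
aucs = {"set_a": 0.80, "set_b": 0.90}
sizes = {"set_a": 100, "set_b": 300}
wauc = weighted_mean_auc(aucs, sizes)
```

With these made-up inputs the larger dataset dominates the mean, which is the reason a weighted rather than a plain average is informative when datasets differ widely in size.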

RBscore WEB SERVER

RBscore

Several normalization steps are performed before calculating RBscore: (i) alternative locations are cleaned, leaving only the first state; (ii) residues with incomplete backbone atoms (N, CA and C) are dropped; (iii) selenomethionines are treated as methionines, while other HETATM lines in the file are deleted; (iv) incomplete side-chain atoms are predicted by RASP (6,7). The calculation of RBscore is based on electrostatic potential, solvation energy and sequence conservation entropy feature values. The electrostatic potential is computed with the programs pdb2pqr (8,9) and APBS (10) in a similar way to PatchFinderPlus (11); an implicit model of solvation energy is derived from the accessible surface area calculated by NACCESS (12), while sequence conservation entropy is measured by Shannon entropy (13) using Weblogo (14) according to the multiple sequence alignment (MSA) generated by HHblits (15). The program DMS (16) is used to generate the surface grid around the protein surface and to define the residue-level neighboring interaction network. A total of 104 weighting factors were trained in the model. Details of the RBscore scoring function were described previously (2).

Besides the structure-based binding site predictor RBscore, we also provide a sequence-based predictor, RBscore_SVM, which uses a machine learning approach and can be applied when no structural information is provided. RBscore_SVM first searches a small Uniprot (17) sequence database for homologous sequences to derive a Position Specific Scoring Matrix (PSSM) with PSI-BLAST (18), and then uses a sliding-window approach to generate the input feature vector for a support vector machine (SVM) (19) to build the prediction model. Like other sequence-based predictors, RBscore_SVM is less stable in accuracy (5), more dataset dependent and more distance cutoff dependent, and it cannot depict the binding energy funnel on the protein surface. However, it still achieves top-level accuracy among sequence-based predictors, with a wAUC value >0.71, which makes it a good choice when only sequence information is available.
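The conservation feature mentioned above is a Shannon entropy computed per alignment column. As a minimal sketch, assuming an MSA given as equal-length strings with "-" for gaps, the per-column entropy can be computed as follows. This is not the Weblogo or HHblits pipeline used by the server and does not reproduce any correction or sequence weighting those tools may apply; the function names are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column, gap_chars="-"):
    """Shannon entropy (in bits) of one alignment column, ignoring gaps."""
    residues = [c for c in column if c not in gap_chars]
    if not residues:
        return 0.0
    counts = Counter(residues)
    n = len(residues)
    # H = -sum(p * log2(p)) over the observed residue types.
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def alignment_entropies(msa):
    """Entropy of every column of an MSA given as equal-length strings."""
    return [column_entropy(col) for col in zip(*msa)]


# Toy alignment: the first column is fully conserved, the second is not.
toy_msa = ["AK", "AR", "AK", "AR"]
entropies = alignment_entropies(toy_msa)
```

A fully conserved column has zero entropy, so low values indicate conserved positions; how the server converts entropy into the score it displays is not covered by this sketch.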

Input

The web server integrates the structure-based predictor RBscore and the sequence-based predictor RBscore_SVM. When the protein structure is available, the prediction accuracy is generally better than with sequence information alone, and more information can be visualized. The protein structure can be provided either as a four-character PDB code or by uploading a PDB-formatted file. When only the protein sequence is available, users need to input FASTA-formatted sequences or upload a FASTA file of sequences. A FASTA file may contain more than one sequence, and the prediction results are returned in the same order. There are optional parameter settings for the RBscore_SVM model: (i) the specificity or sensitivity of the prediction, which determines the cutoff value used in the binary prediction of binding sites (by default, the specificity is set to 85%); (ii) the SVM model used in the prediction. The models are derived from RBP alone, DBP alone or both RBP and DBP. According to the results in NBench, the model trained on both the RBP and DBP datasets achieves the highest accuracy. Finally, users can enter an email address to receive the results after the job finishes. Otherwise, the web page is automatically refreshed to show the running log until the prediction results are generated.
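The link between a requested specificity and the binary cutoff can be illustrated with a small sketch. It assumes that classifier scores for known non-binding residues from some reference set are available, and it picks the smallest threshold that leaves at least the requested fraction of them classified as non-binding. How the server actually derives its threshold is not described here; the function names and data are hypothetical.

```python
import math


def threshold_for_specificity(negative_scores, specificity=0.85):
    """Smallest score threshold that keeps at least `specificity` of the
    known non-binding residues at or below it (predicted non-binding)."""
    if not 0.0 < specificity <= 1.0:
        raise ValueError("specificity must be in (0, 1]")
    ordered = sorted(negative_scores)
    if not ordered:
        raise ValueError("need at least one negative score")
    # Number of negatives that must stay at or below the threshold; the
    # small epsilon guards against floating-point overshoot in the product.
    k = math.ceil(specificity * len(ordered) - 1e-9)
    return ordered[k - 1]


def predict_binding(scores, threshold):
    """Binary call: a residue is predicted as binding if its score exceeds the threshold."""
    return [s > threshold for s in scores]


# Hypothetical scores of 20 non-binding residues.
negatives = list(range(1, 21))
cutoff = threshold_for_specificity(negatives, 0.85)
calls = predict_binding([5, 17, 19], cutoff)
```

Raising the requested specificity moves the cutoff up and lowers sensitivity, which is the trade-off the optional parameter exposes.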

Output

The output of RBscore includes three sections: a summary of the prediction, plots of the prediction scores and feature scores on the protein, and detailed residue-wise results.

Summary of prediction (Figure 1A). It provides basic information about the protein and the prediction results, including the job ID, the length of the protein, a rough estimate of the number of nucleic acid binding sites based on the protein sequence and a download link for all the results. For RBscore_SVM predictions, the summary lists the estimated sensitivity and specificity, the resulting threshold for the SVM classifier and the predicted number of binding sites.

Figure 1.

Example of outputs from RBscore. (A) Summary of prediction. (B) Prediction score mappings on protein structure demonstrated by RBscore (2), RBRDetector (30) and aaRNA (26). (C) Detailed residue-wise results.

Plots of prediction scores and feature scores on protein. The prediction scores are plotted on the protein surface with JSmol (20) in rainbow colors, where blue marks the regions with the highest prediction scores, which are most likely to bind nucleic acid, and red marks the least likely regions. Together with RBscore, RBscore predictions based on single features (electrostatics, conservation or solvation energy), as well as the feature values of conservation entropy and electrostatic potential, are also provided for plotting on the protein surface. The feature value plots are similar to those of CONSURF (21) and the APBS tools in pymol (22). By comparing the feature values with the RBscore plotted on the protein, users can intuitively verify the predicted nucleic acid binding region. Conceptually, nucleic acids are more likely to bind positively charged parts of a protein, since nucleic acids are normally negatively charged owing to their phosphate groups, while functional sites are more conserved than other residues in order to maintain function during evolution. Positively charged regions and conserved regions are also plotted in blue to correspond to the high-RBscore regions. Additionally, users can load other molecules or save the figure or the molecule coordinates.

Figure 1B compares three existing schemes for mapping prediction scores onto the structure. RBRDetector uses a binary color scheme on a cartoon model of the protein while highlighting the predicted binding sites with a stick model. Such plots can clearly show the binding sites when the prediction is good but can hardly explain ‘orphan’ binding sites that have no other binding site in their neighborhood. Nor can they show the hierarchical binding energy funnel on the protein surface. aaRNA uses a hierarchical ‘blue_white_red’ color scheme, but its cartoon model of the protein structure excludes all side-chain atoms, which are the most important for protein–nucleic acid binding. The rainbow color scheme in RBscore clearly shows the binding energy funnel on the protein surface, helping users find the binding region intuitively, while the surface model of the protein structure accounts for the accessibility of the residues, easily excluding implausible buried residues as binding sites. Following the hierarchical coloring of the binding energy funnel, ‘orphan’ binding sites can be easily excluded, giving a clearer picture of the nucleic acid binding probability.

Detailed residue-wise result (Figure 1C). It lists the prediction results for each residue, including the residue name, conservation entropy, RBscore, RBscore_SVM value and three other prediction values, each based on a single feature used in the RBscore model. The binary binding site prediction is based on the pre-defined specificity or sensitivity value and the resulting threshold. Binding sites normally have an RBscore > 300 and an RBscore_SVM > −0.44. A residue is more likely to be a nucleic acid binding site when RBscore, RBscore_SVM and conservation entropy all show high scores.
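The rule of thumb for reading the residue-wise table can be written as a simple filter. The sketch below only encodes the two cutoffs quoted in the text (RBscore > 300 and RBscore_SVM > −0.44); it is not the server's classifier, no cutoff is applied to conservation entropy because none is given, and the record layout and residue values are hypothetical.

```python
from dataclasses import dataclass

# Rule-of-thumb cutoffs quoted in the text.
RBSCORE_CUTOFF = 300.0
RBSCORE_SVM_CUTOFF = -0.44


@dataclass
class ResidueScore:
    name: str           # hypothetical residue label, e.g. "ARG45"
    rbscore: float      # structure-based score
    rbscore_svm: float  # sequence-based SVM score


def likely_binding(residues):
    """Residues whose RBscore and RBscore_SVM both exceed the quoted cutoffs."""
    return [
        r.name
        for r in residues
        if r.rbscore > RBSCORE_CUTOFF and r.rbscore_svm > RBSCORE_SVM_CUTOFF
    ]


# Made-up rows standing in for a downloaded residue-wise result table.
table = [
    ResidueScore("ARG45", 410.0, 0.12),
    ResidueScore("LEU46", 120.0, -0.90),
    ResidueScore("LYS47", 350.0, -0.60),
]
consensus = likely_binding(table)
```

Requiring agreement between the two scores is a conservative reading of the table; residues flagged by only one score may still be worth inspecting on the surface plot.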

NBench DATABASE

It is a non-trivial task to compare a new binding site predictor with existing ones to demonstrate its effectiveness. Such a comparison is prone to dataset bias, binding site definition bias and assessment metric bias, and it also depends on the detailed treatment of the datasets. For example, PRBR (23) does not predict binding sites for the N-terminal and C-terminal residues, which may lead to an unfair comparison with other programs that predict on the whole sequence. To minimize the possibility of biased conclusions, NBench contains the prediction results of 25 different approaches on 41 datasets, so that all the programs can be benchmarked directly on the same footing.

Data availability

NBench lists detailed information on the 41 datasets: number of proteins, resolution, sequence identity, structural similarity, year of publication and PDB ID list. It provides all the PDB IDs, curated PDB files, sequence files in FASTA format and binding site definitions based on the RBscore criterion, which considers both the distance cutoff and the change in accessible surface area. In addition, NBench stores all the assessment results of the programs and displays them as 2D heat maps, as shown in Figure 2. Users of the database can select the program, dataset, distance cutoff used to define binding sites and assessment criterion of interest and plot the 2D heat map accordingly. In this way, the comparison can be more specific and concrete. Finally, users can export these heat maps in different formats.
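The dependence of binding site labels on the distance cutoff can be made concrete with a short sketch. It labels a residue as binding when any of its atoms lies within the cutoff of any nucleic acid atom. This covers only the distance part of the definition: the accessible surface area change used in the RBscore criterion is not modeled, and the function name, default cutoff and coordinates are illustrative assumptions.

```python
import math


def binding_residues(protein_atoms, nucleic_atoms, cutoff=3.5):
    """Residue ids with at least one atom within `cutoff` (in angstroms)
    of any nucleic acid atom.

    protein_atoms: iterable of (residue_id, (x, y, z))
    nucleic_atoms: iterable of (x, y, z)
    """
    nucleic = list(nucleic_atoms)  # materialize so it can be scanned repeatedly
    binding = set()
    for residue_id, coord in protein_atoms:
        if residue_id in binding:
            continue  # residue already labeled through another atom
        if any(math.dist(coord, na) <= cutoff for na in nucleic):
            binding.add(residue_id)
    return binding


# Toy coordinates: one nucleic acid atom and two single-atom residues.
protein = [("A10", (0.0, 0.0, 0.0)), ("A11", (5.0, 0.0, 0.0))]
nucleic = [(0.0, 0.0, 3.0)]
strict = binding_residues(protein, nucleic, cutoff=3.5)
loose = binding_residues(protein, nucleic, cutoff=6.0)
```

In this toy case the stricter cutoff labels one residue and the looser cutoff labels both, which is why predictors assessed under different cutoffs are not directly comparable.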

Figure 2.

Structure of NBench and examples of heat map. (A) Structure of NBench database. The database includes detailed information and raw data of 41 reported datasets of protein–nucleic acid interactions, and it lists all information about the currently available predictors. Besides, it benchmarks all the predictors with various criteria considering datasets and distance cutoffs in defining binding sites. (B) Examples of heat maps exported from NBench for comparison.

Potential benchmarking

Many predictors targeting the nucleic acid binding site prediction problem are developed every year, and NBench makes the validation of new predictors easier and more straightforward. On the one hand, developers of a new predictor can download the datasets from NBench, run their predictor and compare its output with the results of other predictors obtained from NBench. Because developers can perform their assessments on the stored results, repeated calls to the other web servers are avoided. On the other hand, developers are also encouraged to submit their prediction results to NBench, so that new predictions can be benchmarked more systematically. New data are added to NBench during regular maintenance of the database or upon request. Both approaches support better validation of upcoming predictors.
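As an illustration of how a developer might score a new predictor on a downloaded dataset, the sketch below computes the area under the ROC curve from binary binding labels and continuous prediction scores using the pairwise (Mann-Whitney) formulation. It is a generic AUC computation, not NBench's own assessment code; the function name and data are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve: the probability that a randomly chosen
    binding residue scores higher than a randomly chosen non-binding one,
    with ties counted as one half."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one binding and one non-binding residue")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = binding) and predictor scores for four residues.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)
```

The double loop is quadratic in the number of residues, which is acceptable for a sketch; a rank-based implementation would be preferable for datasets with thousands of chains.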

SOFTWARE IMPLEMENTATION

The web server was developed on Ubuntu 14.04 Linux and runs on an Apache2 server with PHP 5.3 as the server-side scripting language. The server pipeline was written in Python 2.7 and interacts with the RBscore program, which is written in C++. The prediction result data in NBench are organized and indexed with MySQL. The web pages were written in HTML with Bootstrap and JavaScript and were tested on the latest versions of Firefox and Chrome.

DISCUSSION

RBscore highlights the point that nucleic acid binding site prediction is not merely a binary classification problem: the aim is to find the potential binding region and the potential binding energy funnel, which helps in understanding the underlying nature of protein–nucleic acid interaction. It is the first automated web server reported to predict DNA- and RNA-binding sites within the same prediction model. In addition, it directly combines feature values into a probability score for nucleic acid binding without added complexity and achieves high accuracy on all datasets regardless of binding site definition bias. Nucleic acid binding site prediction is an active field of work: no fewer than eight papers (24–31) targeting this problem were published in 2014. Validation of new predictors is crucial but prone to bias. NBench directly provides normalized datasets and the related results from existing approaches, which can be a valuable resource for validating new predictors. We hope RBscore&NBench can improve our understanding of protein–nucleic acid binding and serve the biological community as a useful tool.

AVAILABILITY

The web server is available on: http://ahsoka.u-strasbg.fr/rbscorenbench/.

26 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Electrostatics of nanosystems: application to microtubules and the ribosome.

Authors: N A Baker; D Sept; S Joseph; M J Holst; J A McCammon
Journal: Proc Natl Acad Sci U S A Date: 2001-08-21 Impact factor: 11.205

3. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment.

Authors: Michael Remmert; Andreas Biegert; Andreas Hauser; Johannes Söding
Journal: Nat Methods Date: 2011-12-25 Impact factor: 28.547

4. RASP: rapid modeling of protein side chain conformations.

Authors: Zhichao Miao; Yang Cao; Taijiao Jiang
Journal: Bioinformatics Date: 2011-09-23 Impact factor: 6.937

5. RBRDetector: improved prediction of binding residues on RNA-binding protein structures using complementary feature- and template-based strategies.

Authors: Xiao-Xia Yang; Zhi-Luo Deng; Rong Liu
Journal: Proteins Date: 2014-06-09

6. Satisfying hydrogen bonding potential in proteins.

Authors: I K McDonald; J M Thornton
Journal: J Mol Biol Date: 1994-05-20 Impact factor: 5.469

7. The SWISS-PROT protein sequence data bank and its new supplement TREMBL.

Authors: A Bairoch; R Apweiler
Journal: Nucleic Acids Res Date: 1996-01-01 Impact factor: 16.971

8. Patch Finder Plus (PFplus): a web server for extracting and displaying positive electrostatic patches on protein surfaces.

Authors: Shula Shazman; Gershon Celniker; Omer Haber; Fabian Glaser; Yael Mandel-Gutfreund
Journal: Nucleic Acids Res Date: 2007-05-30 Impact factor: 16.971

9. RNABindRPlus: a predictor that combines machine learning and sequence homology-based methods to improve the reliability of predicted RNA-binding residues in proteins.

Authors: Rasna R Walia; Li C Xue; Katherine Wilkins; Yasser El-Manzalawy; Drena Dobbs; Vasant Honavar
Journal: PLoS One Date: 2014-05-20 Impact factor: 3.240

10. A graph kernel method for DNA-binding site prediction.

Authors: Changhui Yan; Yingfeng Wang
Journal: BMC Syst Biol Date: 2014-12-08
