PRIDB: a Protein-RNA interface database.

Benjamin A Lewis, Rasna R Walia, Michael Terribilini, Jeff Ferguson, Charles Zheng, Vasant Honavar, Drena Dobbs.

Abstract

The Protein-RNA Interface Database (PRIDB) is a comprehensive database of protein-RNA interfaces extracted from complexes in the Protein Data Bank (PDB). It is designed to facilitate detailed analyses of individual protein-RNA complexes and their interfaces, in addition to automated generation of user-defined data sets of protein-RNA interfaces for statistical analyses and machine learning applications. For any chosen PDB complex or list of complexes, PRIDB rapidly displays interfacial amino acids and ribonucleotides within the primary sequences of the interacting protein and RNA chains. PRIDB also identifies ProSite motifs in protein chains and FR3D motifs in RNA chains and provides links to these external databases, as well as to structure files in the PDB. An integrated JMol applet is provided for visualization of interacting atoms and residues in the context of the 3D complex structures. The current version of PRIDB contains structural information regarding 926 protein-RNA complexes available in the PDB (as of 10 October 2010). Atomic- and residue-level contact information for the entire data set can be downloaded in a simple machine-readable format. Also, several non-redundant benchmark data sets of protein-RNA complexes are provided. The PRIDB database is freely available online at http://bindr.gdcb.iastate.edu/PRIDB.

Year: 2010 PMID: 21071426 PMCID: PMC3013700 DOI: 10.1093/nar/gkq1108

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Protein–RNA interactions play critical roles in myriad and diverse biological processes, including many recently discovered regulatory functions, in addition to well-studied roles in protein synthesis, DNA replication, regulation of gene expression and defense against pathogens (1–9). Despite their importance, structures of protein–RNA complexes have proven difficult to obtain using experimental structure determination methods; such structures constitute only ∼1% of structures in the Protein Data Bank (PDB) (10). For this reason, several computational methods for predicting the interfaces in protein–RNA complexes have been developed (11–21). Virtually all such methods require data in the form of information about structurally characterized protein–RNA complexes and their interfaces. PRIDB is a repository of protein–RNA interface information derived from structures in the PDB. PRIDB is designed to facilitate detailed analyses of individual protein–RNA complexes of interest and rapid identification of interfacial atoms and residues in both the protein and RNA chains of a chosen complex or user-defined set of complexes. In addition, PRIDB can be used to generate data sets of protein–RNA interfaces for machine learning applications, such as the generation of classifiers for predicting interfaces in protein–RNA complexes for which high-resolution structures are not available.

Related databases/servers

To our knowledge, only one other up-to-date and comprehensive online repository of protein–RNA interfaces is currently available: Biological Interaction Database for Protein-Nucleic Acid (BIPA) (22). BIPA provides a list of protein–RNA (and protein–DNA) complexes from the PDB and displays RNA-binding residues within the linear primary sequence of a chosen protein, or within a multiple sequence alignment of related RNA-binding proteins. PRIDB complements BIPA by providing atomic- and residue-level interfacial information for both the RNA and protein chains of complexes, providing previously published reduced-redundancy data sets and allowing users to make advanced queries and compile custom data sets. Other collections of protein–RNA complexes and related resources include NDB (http://ndbserver.rutgers.edu/) (23), PRID (http://www-bioc.rice.edu/∼shamoo/prid.html) (24), RsiteDB (http://bioinfo3d.cs.tau.ac.il/RsiteDB/) (25), w3DNA (http://w3dna.rutgers.edu/) (26), NPIDB (http://monkey.belozersky.msu.ru/NPIDB) (27), ProNIT (http://gibk26.bse.kyutech.ac.jp/jouhou/pronit/pronit.html) (28) and the RNP Databases (http://rnp.uthct.edu/index.html/). Several excellent databases of protein–DNA interfaces are also available, including PDIdb (http://melolab.org/pdidb/) (29) and hPDI (http://bioinfo.wilmer.jhu.edu/PDI/).

DATABASE CONTENTS

Data extraction, interface definition and motif identification

Atomic coordinate information for all 926 protein–RNA complexes in the Protein Data Bank (PDB) on 10 October 2010 was extracted using the REST API advanced search interface. To generate this comprehensive data set (rRB926), no filters based on sequence redundancy, structure resolution or other criteria were applied (see ‘Non-redundant Benchmark data sets’ below). The complex structures in rRB926 were then scanned to identify interacting amino acids and ribonucleotides using two different definitions: (i) a simple distance-based definition in which a given amino acid residue (AA) in a protein chain is defined as interacting with a ribonucleotide (rNT) in an RNA chain if any atom in AA is within a 5-Å radius of any atom in rNT; and (ii) a rule-based definition based on that of Allers and Shamoo (30), in which interactions are classified as van der Waals, hydrogen-bonding, hydrophobic or electrostatic interactions, involving specific AAs and rNTs. All such interacting AAs and rNTs are defined as ‘interface’ residues. ProSite patterns and profiles (31) appearing in any of the protein sequences in the database were retrieved using the ScanProsite REST service (32). RNA structural motifs were identified in RNA sequences using FR3D’s (33) pure symbolic search function; specific motif definitions used for these scans are available in the Tutorial and FAQs section of the PRIDB online server.
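The distance-based definition (i) can be illustrated with a minimal Python sketch. This is not PRIDB's own code (which, as noted below, is written in Perl); the `Atom` record, the `interface_residues` function and the toy coordinates are hypothetical, and treating "within a 5-Å radius" as "distance less than or equal to 5 Å" is an assumption.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Atom:
    """Hypothetical minimal atom record (not a PRIDB data structure)."""
    chain: str      # chain identifier, e.g. "A"
    res_num: int    # residue number within the chain
    res_name: str   # residue name, e.g. "ARG" or "G"
    name: str       # atom name, e.g. "NH1"
    xyz: tuple      # (x, y, z) coordinates in angstroms


def interface_residues(protein_atoms, rna_atoms, cutoff=5.0):
    """Return (interface AAs, interface rNTs) as sets of (chain, res_num, res_name).

    An AA and an rNT are marked as interacting if ANY atom of the AA lies
    within `cutoff` angstroms of ANY atom of the rNT (assumed inclusive).
    """
    aa_hits, rnt_hits = set(), set()
    for p in protein_atoms:
        for r in rna_atoms:
            if dist(p.xyz, r.xyz) <= cutoff:
                aa_hits.add((p.chain, p.res_num, p.res_name))
                rnt_hits.add((r.chain, r.res_num, r.res_name))
    return aa_hits, rnt_hits


# Toy coordinates, invented for illustration only.
protein = [
    Atom("A", 10, "ARG", "NH1", (0.0, 0.0, 0.0)),
    Atom("A", 50, "GLY", "CA", (30.0, 0.0, 0.0)),
]
rna = [
    Atom("B", 3, "G", "O6", (3.0, 0.0, 0.0)),
    Atom("B", 9, "U", "P", (0.0, 20.0, 0.0)),
]
aa, rnt = interface_residues(protein, rna)
print(aa)   # {('A', 10, 'ARG')}
print(rnt)  # {('B', 3, 'G')}
```

The all-pairs loop is quadratic in the number of atoms; for ribosome-sized complexes a spatial index (for example, a grid or k-d tree) would likely be preferable, but the definition itself is unchanged.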

Non-redundant benchmark data sets

Because PRIDB is intended to be a comprehensive collection of protein–RNA complexes from the PDB, the rRB926 data set was not filtered on the basis of redundancy, structure determination method, resolution or protein/RNA chain length. While it is possible to filter with such criteria using PRIDB’s advanced search function, several pre-calculated benchmark data sets, which have been filtered to limit redundancy and to exclude low-resolution structures, are also provided for the user’s convenience. These include two previously published data sets, RB109 (17,34) and RB147 (35), as well as a larger, more recently extracted data set (RB199) (B. Lewis, submitted for publication). Complete lists of the PDB IDs for protein–RNA complexes in these data sets, in addition to the pre-calculated interface residue statistics, can be readily accessed from the ‘Datasets’ section of the PRIDB homepage.
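The kind of filtering described above (by structure determination method, resolution and chain length) can be sketched as follows. This is an illustrative in-memory filter, not PRIDB's advanced search implementation: the `Complex` record, the `filter_complexes` function and the `TOY*` identifiers are hypothetical, and the sketch assumes that a complex passes a length filter if at least one of its chains meets the minimum. Redundancy reduction, which depends on sequence-similarity criteria not detailed here, is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Complex:
    """Hypothetical summary record for one protein-RNA complex."""
    pdb_id: str
    method: str                   # e.g. "X-ray diffraction", "NMR"
    resolution: Optional[float]   # angstroms; None when not reported
    protein_lengths: list         # lengths of the protein chains
    rna_lengths: list             # lengths of the RNA chains


def filter_complexes(complexes, max_resolution=None, methods=None,
                     min_protein_len=0, min_rna_len=0):
    """Keep complexes that satisfy every filter that was supplied."""
    kept = []
    for c in complexes:
        if methods is not None and c.method not in methods:
            continue
        # A resolution threshold excludes structures without a resolution.
        if max_resolution is not None and (
            c.resolution is None or c.resolution > max_resolution
        ):
            continue
        # Assumption: one sufficiently long chain is enough to pass.
        if not any(n >= min_protein_len for n in c.protein_lengths):
            continue
        if not any(n >= min_rna_len for n in c.rna_lengths):
            continue
        kept.append(c)
    return kept


# Toy records with placeholder identifiers (not real PDB entries).
data = [
    Complex("TOY1", "X-ray diffraction", 2.1, [120], [25]),
    Complex("TOY2", "X-ray diffraction", 3.8, [300], [76]),
    Complex("TOY3", "NMR", None, [90], [12]),
]
high_res = filter_complexes(data, max_resolution=3.0)
print([c.pdb_id for c in high_res])  # ['TOY1']
```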

Implementation and availability

PRIDB runs on the Apache 2.2 web server, using MySQL 14.14 as a database backend with AJAX and PHP 5 for user interface functions. Functions not requiring use of the database (e.g. calculating interface residues for a user-submitted complex) are implemented using standalone Perl 5 scripts and the BioPerl module (36). All PRIDB code is available on request under the Creative Commons Attribution Non-Commercial License. All data currently in PRIDB was obtained from databases or programs which impose no restrictions on academic use.

PRIDB summary statistics

As summarized in Table 1, the current version of PRIDB contains structural information for a total of 926 protein–RNA complexes available in the PDB as of 10 October 2010. These structures contain 9689 total protein chains, among which there are only 1174 unique sequences. While this would seem to indicate that most sequences in the database are repeated several times, this is not the case; 395 of the 1174 (34%) sequences appear only once, and 899 (77%) appear fewer than eight times (the ‘expected’ average redundancy). This disparity is due to the large proportion of ribosomal structures in the PDB (and, by extension, in PRIDB); 9 of the top 10 most abundant sequences, each present in more than 70 structures, are ribosomal proteins. The most abundant sequence, repeated more than 100 times, is that of the TRP-responsive attenuation protein, a protein for which numerous multimeric structures have been solved.
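The ‘expected’ average redundancy quoted above is simply the number of chains divided by the number of unique sequences (9689/1174 ≈ 8.25). The short sketch below shows how such a summary could be computed for any list of chain sequences; the function name and the toy sequences are hypothetical, and only the two totals come from the text.

```python
from collections import Counter


def redundancy_summary(chain_sequences):
    """Summarize how unevenly sequences are repeated across chains."""
    counts = Counter(chain_sequences)
    total = len(chain_sequences)
    unique = len(counts)
    mean_copies = total / unique  # the 'expected' average redundancy
    return {
        "total": total,
        "unique": unique,
        "mean_copies": mean_copies,
        "singletons": sum(1 for n in counts.values() if n == 1),
        "below_mean": sum(1 for n in counts.values() if n < mean_copies),
    }


# Average redundancy implied by the totals reported in the text.
print(round(9689 / 1174, 2))  # 8.25

# Toy data: one heavily repeated sequence dominates the average,
# yet most unique sequences occur only once.
toy = ["MKT"] * 9 + ["GSH", "PLA", "VVR"]
summary = redundancy_summary(toy)
print(summary["mean_copies"], summary["singletons"])  # 3.0 3
```

The toy case mirrors the situation described above: a few very abundant sequences raise the mean, while most unique sequences fall well below it.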

Table 1.

PRIDB contents: complexes and chains

	Total Number in PRIDB^a	Unique
Protein–RNA complexes	926	926
Protein chains	9689	1174
RNA chains	2074	746

^a Total number in PRIDB includes redundant complexes, RNA and protein chains (i.e. chains with identical sequences).

As shown in Table 2, PRIDB currently contains 1 475 774 amino acid residues. Based on a 5 Å distance cutoff definition for interfacial residues, 397 216 of these residues interact with RNA; of 851 853 ribonucleotide residues in PRIDB, 322 858 interact with protein. On average, 38% of the amino acids in the RNA-binding proteins directly interact with RNA, and 28% of the ribonucleotides in the bound RNAs directly interact with protein. As before, these averages are skewed by the prevalence of ribosome structures; ribosomal proteins account for ∼90% of interacting amino acid residues and ∼60% of interacting nucleotides.

Table 2.

PRIDB summary statistics

Type	Total (Interface + Non-Interface)	Number in Interfaces (%)
Amino Acids	1 475 774	414 026 (38)
Ribonucleotides	851 853	326 441 (28)

USER INTERFACE

PRIDB provides a ‘Tutorial and FAQs’ section with detailed instructions on using PRIDB’s web interface; a list and brief descriptions of key capabilities of PRIDB are provided here. Using the ‘Basic Search’ function, users can retrieve information about protein–RNA complexes using their PDB ID or a keyword. Using the ‘Advanced Search’ function, users can filter results by specifying: the experimental method used to determine the complex structure (e.g. X-ray diffraction, nuclear magnetic resonance); a resolution range or threshold (for structures determined using X-ray diffraction, electron microscopy or fiber diffraction); the minimum or maximum length of protein or RNA chains within the complex; an amino acid or nucleotide subsequence found within the sequence of at least one of the protein or RNA chains in the complex; and a motif (as defined by ProSite for protein chains or FR3D for RNA chains) found within at least one chain in the complex. The ‘Advanced Search’ function also allows users to either specify a different distance cutoff for the distance-based interaction definition or choose the alternative rule-based definition. As shown in Figure 1, when viewing search results, PRIDB provides: (i) a summary of each complex with basic information (name, resolution and structure determination method), as well as a link to that complex’s PDB entry; (ii) a linear display of the amino acid and nucleotide residues in each chain of each complex, with residues in the protein–RNA interface highlighted; (iii) a display of residues (in red font) that are part of a protein or RNA motif, with information about that motif (and a link back to its source) provided on mouse-over; (iv) a JMol applet for 3D visualization of each complex, with interacting amino acid and nucleotide residues colored (Figure 2A); and (v) a link to a dynamically generated file containing atomic-level interface information for each result in a machine-readable format (Figure 2B).

Figure 1.

Sample PRIDB output. Amino acid residues and ribonucleotides highlighted in yellow are located in the protein–RNA interface; residues in red font are part of a ProSite or FR3D motif.

Figure 2.

(A) PRIDB provides a JMol applet for visualizing and manipulating interfaces within 3-D structures. (B) PRIDB output can be downloaded as a CSV file.

In addition to providing machine-readable results files for all searches, pre-computed results files for the non-redundant RB109, RB147 and RB199 data sets described above have been made available. These files, along with the complete PRIDB database (rRB926), can be downloaded from the ‘Datasets’ section of the website. Users can also generate a machine-readable list of interface residues for any arbitrary collection of complexes by inputting a list of PDB IDs. Results files contain a single line for each pair of interacting atoms, listing the specific interacting atoms (by chain name, residue number and atom name) and the distance between them. Users may also use PRIDB to calculate interface residues for protein–RNA complexes that are not in the PDB by submitting a structure file in PDB format. A results file containing interface residues (as calculated using PRIDB’s 5 Å cutoff) is returned via e-mail.
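As an illustration of how such a results file might be consumed downstream, the sketch below collapses atom-pair lines into residue-level contacts. The column order in `FIELDS`, the absence of a header row and the sample rows are assumptions made for this example; the actual layout of PRIDB's CSV output should be checked against a downloaded file. The sketch also assumes that residue numbers are plain integers.

```python
import csv
import io

# Assumed column order (hypothetical; verify against a real results file).
FIELDS = ["protein_chain", "aa_res_num", "aa_atom",
          "rna_chain", "rnt_res_num", "rnt_atom", "distance"]


def residue_contacts(csv_text, cutoff=5.0):
    """Map each interacting (AA, rNT) residue pair to its shortest
    atom-atom distance, keeping only atom pairs within `cutoff` angstroms."""
    closest = {}
    reader = csv.DictReader(io.StringIO(csv_text), fieldnames=FIELDS)
    for row in reader:
        d = float(row["distance"])
        if d > cutoff:
            continue
        pair = (
            (row["protein_chain"], int(row["aa_res_num"])),
            (row["rna_chain"], int(row["rnt_res_num"])),
        )
        if pair not in closest or d < closest[pair]:
            closest[pair] = d
    return closest


# Invented sample rows: one line per pair of interacting atoms.
sample = """A,10,NH1,B,3,O6,2.9
A,10,NH2,B,3,N7,3.4
A,11,OG,B,4,OP1,4.7
"""
contacts = residue_contacts(sample)
for (aa, rnt), d in sorted(contacts.items()):
    print(aa, rnt, d)
# ('A', 10) ('B', 3) 2.9
# ('A', 11) ('B', 4) 4.7
```

Passing a smaller `cutoff` re-derives the interface under a stricter distance threshold without re-querying the server, provided the file was generated with a cutoff at least as large.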

CONCLUSIONS AND FUTURE DIRECTIONS

PRIDB provides researchers with atomic and residue-level information about structures of protein–RNA complexes and their interfaces, facilitating analyses of protein–RNA interactions by pre-computing commonly used information and by providing structural information both interactively onscreen and in a machine-readable format. It allows users to rapidly identify and visualize interfaces in protein–RNA complexes on a residue-by-residue basis and displays identified ProSite or FR3D motifs along with the amino acid or ribonucleotide sequences. PRIDB can be used to generate custom data sets of protein–RNA interfaces for statistical analyses and machine learning applications. The PRIDB server also provides pre-calculated benchmark data sets of protein–RNA complexes for evaluating the performance of interface prediction methods. PRIDB will be updated regularly as new structures are released through PDB, and is intended to be a stable resource for researchers in the field of protein–RNA interactions. Future versions of PRIDB will include additional protein and RNA motifs from other sources, such as PRINTS (37), PIRSF (38) and other InterPro (39) member databases. In addition, the current JMol 3D visualization capabilities will be extended to user-submitted structures, allowing for more facile manipulation and examination of interfaces in complexes not currently in the PDB.

FUNDING

National Institutes of Health (GM066387 to V.H. and D.D.); the National Science Foundation [IGERT0504304 (to D.D.); GK120947929 (to B.A.L.); NIBIB-NSF0608769 (to V.H., J.F. and C.Z.)]; Iowa State University’s Center for Integrated Animal Genomics (to B.A.L. and D.D.); Center for Computational Intelligence, Learning and Discovery (to V.H.). Funding for open access charge: Center for Computational Intelligence, Learning and Discovery.

Conflict of interest statement. None declared.

39 in total

1. BIPA: a database for protein-nucleic acid interaction in 3D structures.

Authors: Semin Lee; Tom L Blundell
Journal: Bioinformatics Date: 2009-04-08 Impact factor: 6.937

2. Optimal protein-RNA area, OPRA: a propensity-based method to identify RNA-binding sites on proteins.

Authors: Laura Pérez-Cano; Juan Fernández-Recio
Journal: Proteins Date: 2010-01

3. Dissecting the expression dynamics of RNA-binding proteins in posttranscriptional regulatory networks.

Authors: Nitish Mittal; Nilanjan Roy; M Madan Babu; Sarath Chandra Janga
Journal: Proc Natl Acad Sci U S A Date: 2009-11-16 Impact factor: 11.205

4. Struct-NB: predicting protein-RNA binding sites using structural features.

Authors: Fadi Towfic; Cornelia Caragea; David C Gemperline; Drena Dobbs; Vasant Honavar
Journal: Int J Data Min Bioinform Date: 2010 Impact factor: 0.667

Review 5. RNA processing and its regulation: global insights into biological networks.

Authors: Donny D Licatalosi; Robert B Darnell
Journal: Nat Rev Genet Date: 2010-01 Impact factor: 53.242

6. PROSITE, a protein domain database for functional characterization and annotation.

Authors: Christian J A Sigrist; Lorenzo Cerutti; Edouard de Castro; Petra S Langendijk-Genevaux; Virginie Bulliard; Amos Bairoch; Nicolas Hulo
Journal: Nucleic Acids Res Date: 2009-10-25 Impact factor: 16.971

Review 7. Role of plant RNA-binding proteins in development, stress response and genome organization.

Authors: Zdravko J Lorković
Journal: Trends Plant Sci Date: 2009-03-13 Impact factor: 18.313

Review 8. The ribonome: a dominant force in co-ordinating gene expression.

Authors: Kyle D Mansfield; Jack D Keene
Journal: Biol Cell Date: 2009-03 Impact factor: 4.458

9. Web 3DNA--a web server for the analysis, reconstruction, and visualization of three-dimensional nucleic-acid structures.

Authors: Guohui Zheng; Xiang-Jun Lu; Wilma K Olson
Journal: Nucleic Acids Res Date: 2009-05-27 Impact factor: 16.971

10. Exploiting structural and topological information to improve prediction of RNA-protein binding sites.

Authors: Stefan R Maetschke; Zheng Yuan
Journal: BMC Bioinformatics Date: 2009-10-18 Impact factor: 3.169
