
Data of protein-RNA binding sites.

Wook Lee, Byungkyu Park, Daesik Choi, Kyungsook Han.

Abstract

Despite the increasing number of protein-RNA complexes in structure databases, few data resources have been made available which can be readily used in developing or testing a method for predicting either protein-binding sites in RNA sequences or RNA-binding sites in protein sequences. The problem of predicting protein-binding sites in RNA has received much less attention than the problem of predicting RNA-binding sites in protein. The data presented in this paper are related to the article entitled "PRIdictor: Protein-RNA Interaction predictor" (Tuvshinjargal et al. 2016) [1]. PRIdictor can predict protein-binding sites in RNA as well as RNA-binding sites in protein at the nucleotide- and residue-levels. This paper presents four datasets that were used to test four prediction models of PRIdictor: (1) model RP for predicting protein-binding sites in RNA from protein and RNA sequences, (2) model RaP for predicting protein-binding sites in RNA from RNA sequence alone, (3) model PR for predicting RNA-binding sites in protein from protein and RNA sequences, and (4) model PaR for predicting RNA-binding sites in protein from protein sequence alone. The datasets supplied in this article can be used as a valuable resource to evaluate and compare different methods for predicting protein-RNA binding sites.

Keywords:  Binding sites; Prediction; Protein-RNA interactions

Year:  2016        PMID: 28070546      PMCID: PMC5219607          DOI: 10.1016/j.dib.2016.12.041

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409


Value of the data

Few data resources are readily usable for developing or assessing methods that predict protein-binding sites in RNA sequences or RNA-binding sites in protein sequences.

Protein-RNA binding sites annotated at the nucleotide and residue levels can facilitate the development of new methods for predicting protein-RNA binding sites.

The protein-RNA binding sites provided here can be used to evaluate and compare different methods for predicting protein-binding nucleotides in RNAs and/or RNA-binding residues in proteins.

Data

The four datasets S1-S4 in XML format can be used to evaluate various methods for predicting: (1) protein-binding nucleotides from protein and RNA sequences, (2) protein-binding nucleotides from RNA sequence alone, (3) RNA-binding amino acids from protein and RNA sequences, and (4) RNA-binding amino acids from protein sequence alone.
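
As a rough illustration of how such an evaluation might be scored, the sketch below compares per-position predictions with reference labels and reports sensitivity, specificity and the Matthews correlation coefficient. The XML schema of S1-S4 is not described here, so the example assumes the labels have already been extracted into strings of '1' (binding) and '0' (non-binding); the function names and the choice of metrics are illustrative assumptions, not part of the datasets or of PRIdictor.

```python
from math import sqrt


def confusion_counts(true_labels, predicted_labels):
    """Count TP, FP, TN, FN over two equal-length strings of '0'/'1' labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label strings must have the same length")
    tp = fp = tn = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == "1" and guess == "1":
            tp += 1
        elif truth == "0" and guess == "1":
            fp += 1
        elif truth == "0" and guess == "0":
            tn += 1
        elif truth == "1" and guess == "0":
            fn += 1
        else:
            raise ValueError("labels must be '0' or '1'")
    return tp, fp, tn, fn


def summarize(tp, fp, tn, fn):
    """Return sensitivity, specificity and MCC; undefined ratios are reported as 0.0."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "mcc": mcc}


# Hypothetical labels for one 7-nucleotide RNA (not taken from S1-S4).
reference = "0011010"
prediction = "0010110"
scores = summarize(*confusion_counts(reference, prediction))
```

Because non-binding positions outnumber binding positions in all four datasets, a balance-aware measure such as MCC is usually more informative than plain accuracy.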

Experimental design, materials and methods

From the Protein Data Bank (PDB) [2], we collected structures of protein-RNA complexes that do not include ribosomal RNAs and were determined by X-ray crystallography at a resolution of ≤3.0 Å. As of September 2013, there were 542 such protein-RNA complexes, containing 546 protein-RNA sequence pairs between 376 protein sequences and 439 RNA sequences.

We defined a protein-RNA binding site using three types of protein-RNA interactions: hydrogen bonds, water bridges and hydrophobic interactions. A nucleotide (or amino acid) involved in at least one of these interactions was classified as a protein-binding (or RNA-binding) site. For each protein-RNA complex from the PDB, we obtained the three types of interactions from the Nucleic acid-Protein Interaction DataBase (NPIDB) [3] and incorporated them into the RNA and protein sequences.

To reduce overlap between training and test datasets, we ran CD-HIT-EST on the RNA sequences, selected RNA sequences with a similarity of 80% or lower to the other RNA sequences, and constructed test datasets S1 and S2 for models RP and RaP [1], respectively. Datasets S1 and S2 contain the same RNA sequences but differ in two ways: protein sequences are included in dataset S1 only, and in dataset S2 the protein-binding sites of the same RNA sequence with different protein partners are incorporated together into that RNA sequence. Dataset S1 contains 130 protein sequences and 155 RNA sequences with 1848 protein-binding nucleotides and 4631 non-binding nucleotides. Dataset S2 contains 155 RNA sequences with 1795 protein-binding nucleotides and 4235 non-binding nucleotides.

The test datasets S3 and S4 for models PR and PaR were constructed in a similar way. Dataset S3 contains 44 RNA sequences and 46 protein sequences with 923 RNA-binding residues and 7578 non-binding residues. Dataset S4 contains 49 protein sequences with 1349 RNA-binding residues and 11,217 non-binding residues.
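
The labeling rule and the S1/S2 distinction described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the (position, interaction type) record format, the type names and all function names are hypothetical, since the NPIDB output format is not described here. The class-balance figures are computed directly from the counts reported in the text.

```python
# Hypothetical names for the three interaction types named in the text.
INTERACTION_TYPES = ("hydrogen_bond", "water_bridge", "hydrophobic")


def label_sites(sequence_length, interactions):
    """Mark a position as binding (1) if it takes part in at least one interaction.

    `interactions` is an iterable of (0-based position, interaction type) pairs.
    """
    labels = [0] * sequence_length
    for position, kind in interactions:
        if kind not in INTERACTION_TYPES:
            raise ValueError(f"unknown interaction type: {kind}")
        if not 0 <= position < sequence_length:
            raise ValueError(f"position out of range: {position}")
        labels[position] = 1
    return labels


def merge_partner_labels(label_lists):
    """S2-style labels: a nucleotide is binding if it binds any protein partner."""
    return [int(any(column)) for column in zip(*label_lists)]


def binding_fraction(binding, non_binding):
    """Fraction of positions labeled as binding."""
    return binding / (binding + non_binding)


# One hypothetical 6-nucleotide RNA observed with two protein partners.
with_partner_a = label_sites(6, [(1, "hydrogen_bond"), (2, "water_bridge")])
with_partner_b = label_sites(6, [(2, "hydrophobic"), (4, "hydrogen_bond")])
merged = merge_partner_labels([with_partner_a, with_partner_b])

# (binding, non-binding) counts as reported for each dataset in the text.
REPORTED_COUNTS = {
    "S1": (1848, 4631),
    "S2": (1795, 4235),
    "S3": (923, 7578),
    "S4": (1349, 11217),
}
fractions = {name: binding_fraction(*counts) for name, counts in REPORTED_COUNTS.items()}
```

Keeping per-partner labels separate corresponds to the pair-based dataset S1, while merging them corresponds to the sequence-only dataset S2. From the reported counts, binding positions make up a minority of every dataset, which is worth keeping in mind when choosing evaluation measures.
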

Specifications Table

Subject area: Bioinformatics, computational biology
More specific subject area: Molecular structures
Type of data: Text files in XML format
How data was acquired: Protein Data Bank (PDB) [2] and Nucleic acid-Protein Interaction DataBase (NPIDB) [3]
Data format: Filtered and processed
Experimental factors:
Experimental features:
Data source location: Department of Computer Science and Engineering, Inha University, Incheon, South Korea
Data accessibility: Data is provided with this article.

  3 in total

1.  PRIdictor: Protein-RNA Interaction predictor.

Authors:  Narankhuu Tuvshinjargal; Wook Lee; Byungkyu Park; Kyungsook Han
Journal:  Biosystems       Date:  2015-12-01       Impact factor: 1.973

2.  The RCSB Protein Data Bank: redesigned web site and web services.

Authors:  Peter W Rose; Bojan Beran; Chunxiao Bi; Wolfgang F Bluhm; Dimitris Dimitropoulos; David S Goodsell; Andreas Prlic; Martha Quesada; Gregory B Quinn; John D Westbrook; Jasmine Young; Benjamin Yukich; Christine Zardecki; Helen M Berman; Philip E Bourne
Journal:  Nucleic Acids Res       Date:  2010-10-29       Impact factor: 16.971

3.  NPIDB: Nucleic acid-Protein Interaction DataBase.

Authors:  Dmitry D Kirsanov; Olga N Zanegina; Evgeniy A Aksianov; Sergei A Spirin; Anna S Karyagina; Andrei V Alexeevski
Journal:  Nucleic Acids Res       Date:  2012-11-27       Impact factor: 16.971
