Yasser El-Manzalawy, Mostafa Abbas, Qutaibah Malluhi, Vasant Honavar.
Abstract
A wide range of biological processes, including regulation of gene expression, protein synthesis, and replication and assembly of many viruses, are mediated by RNA-protein interactions. However, experimental determination of the structures of protein-RNA complexes is expensive and technically challenging. Hence, a number of computational tools have been developed for predicting protein-RNA interfaces. Some of the state-of-the-art protein-RNA interface predictors rely on position-specific scoring matrix (PSSM)-based encoding of the protein sequences. The computational effort needed to generate PSSMs severely limits the practical utility of protein-RNA interface prediction servers. In this work, we experiment with two approaches, random sampling and sequence similarity reduction, for extracting a representative reference database of protein sequences from more than 50 million protein sequences in UniRef100. Our results suggest that randomly sampled databases produce better PSSM profiles, as judged by the number of hits used to generate the profile, the distance of the generated profile to the corresponding profile generated using the entire UniRef100 data, and the accuracy of the machine learning classifier trained using these profiles. Based on our results, we developed FastRNABindR, an improved version of RNABindR for predicting protein-RNA interface residues using PSSM profiles generated using 1% of the UniRef100 sequences sampled uniformly at random. To the best of our knowledge, FastRNABindR is the only protein-RNA interface residue prediction online server that requires generation of PSSM profiles for query sequences and accepts hundreds of protein sequences per submission.
Our approach for determining the optimal BLAST database for a protein-RNA interface residue classification task has the potential of substantially speeding up, and hence increasing the practical utility of, other amino acid sequence based predictors of protein-protein and protein-DNA interfaces.
Year: 2016 PMID: 27383535 PMCID: PMC4934694 DOI: 10.1371/journal.pone.0158445
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Number of interface and non-interface residues in the RB198, RB44, and RB111 datasets.
| Dataset | No. of Interface residues | No. of Non-interface residues |
|---|---|---|
| RB198_1 | 1666 | 7618 |
| RB198_2 | 1636 | 11456 |
| RB198_3 | 1496 | 8805 |
| RB198_4 | 1452 | 8365 |
| RB198_5 | 1700 | 9466 |
| RB44 | 1956 | 4521 |
| RB111 | 3305 | 34255 |
Data for RB198 is provided for each cross-validation fold.
Number of protein sequences in the UniRef100 database and its variants.
| Database | No. of sequences |
|---|---|
| UR100 | 50,371,270 |
| UR50 | 11,992,242 |
| UR50R | 11,992,242 |
| UR40 | 9,893,262 |
| UR40R | 9,893,262 |
| UR30 | 8,888,952 |
| UR30R | 8,888,952 |
| UR10R | 5,037,127 |
| UR5R | 2,518,564 |
| UR1R | 503,713 |
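The randomly sampled variants above (UR10R, UR5R, UR1R) each hold a fixed fraction of the UR100 sequences. The source does not say how the sampling was implemented, so the following is only a minimal sketch of one way to draw a uniform fixed-size sample from a large FASTA file in a single pass, using reservoir sampling. The function names `read_fasta` and `reservoir_sample` and the `seed` parameter are hypothetical and not taken from the paper.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reservoir_sample(records: Iterable[Record], k: int, seed: int = 0) -> List[Record]:
    """Return k records chosen uniformly at random without replacement.

    Uses reservoir sampling (Algorithm R), so the input is read once and
    only k records are held in memory at any time.
    """
    rng = random.Random(seed)
    reservoir: List[Record] = []
    for i, record in enumerate(records):
        if i < k:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir


if __name__ == "__main__":
    # Tiny in-memory demo; a real run would pass an open file handle instead.
    demo = [">a", "MKV", ">b", "GLS", ">c", "TTP", ">d", "QQR"]
    for header, seq in reservoir_sample(read_fasta(demo), k=2, seed=42):
        print(f">{header}\n{seq}")
```

For a 1% variant such as UR1R, `k` would be set to the target size listed in the table (503,713), and the resulting records written back out as FASTA before building a BLAST database from them.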
List of existing protein-RNA interface residue prediction servers that require generation of PSSM profiles for query sequence(s).
| Method | BLAST database | BLAST database size | No. of sequences | URL |
|---|---|---|---|---|
| BindN+ | UniProtKB | 50,371,270 | 1 | |
| PPRInt | NCBI nr | 78,002,046 | 1 | |
| PRBR | NCBI nr | 78,002,046 | 1 | |
| RBScore | Swiss-Prot | 462,819 | ≤5 | |
| RNABindR v2.0 | NCBI nr | 78,002,046 | ≤20 | |
| RNABindRPlus | NCBI nr | 78,002,046 | ≤20 | |
| SNBRFinder | NCBI nr | 78,002,046 | 1 | |
BLAST database size refers to the size of the database as of February 2016 and not the precise size of the database used by the servers. No. of sequences refers to the maximum number of protein sequences that can be processed by the corresponding server in a single submission.
Performance comparison using cross-validation tests.
| Features | NB | RF100 | SVML | SVMRBF |
|---|---|---|---|---|
| UR100 | 0.75 | 0.75 | 0.77 | 0.79 |
| UR50 | 0.73 | 0.77 | 0.79 | 0.80 |
| UR50R | 0.73 | 0.76 | 0.78 | 0.80 |
| UR40 | 0.70 | 0.77 | 0.78 | 0.80 |
| UR40R | 0.73 | 0.76 | 0.78 | 0.80 |
| UR30 | 0.70 | 0.76 | 0.78 | 0.80 |
| UR30R | 0.73 | 0.76 | 0.78 | 0.80 |
| UR10R | 0.76 | 0.77 | 0.78 | 0.80 |
| UR5R | 0.75 | 0.77 | 0.78 | 0.80 |
| UR1R | 0.74 | 0.77 | 0.78 | 0.79 |
AUC of different classifiers using 5-fold cross-validation and 10 different variants of PSSM-based encodings generated using the UR100 database and its variants.
Performance comparison using independent tests.
| Features | NB | RF100 | SVML | SVMRBF |
|---|---|---|---|---|
| UR100 | 0.69 | 0.72 | 0.77 | 0.78 |
| UR50 | 0.74 | 0.78 | 0.78 | 0.80 |
| UR50R | 0.70 | 0.76 | 0.79 | 0.80 |
| UR40 | 0.73 | 0.77 | 0.78 | 0.80 |
| UR40R | 0.71 | 0.76 | 0.78 | 0.80 |
| UR30 | 0.73 | 0.78 | 0.79 | 0.80 |
| UR30R | 0.72 | 0.77 | 0.79 | 0.80 |
| UR10R | 0.78 | 0.80 | 0.79 | 0.81 |
| UR5R | 0.76 | 0.78 | 0.79 | 0.81 |
| UR1R | 0.75 | 0.78 | 0.78 | 0.79 |
AUC of different classifiers trained using RB198 and tested using RB44 for 10 different variants of PSSM-based encodings generated using the UR100 database and its variants.
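The two tables above report AUC, the area under the ROC curve. As a reminder of what that number measures, here is a small sketch that computes AUC from per-residue labels and classifier scores using the rank-based (Mann-Whitney) formulation: the fraction of interface/non-interface pairs in which the interface residue receives the higher score, counting ties as half. The function name `roc_auc` is hypothetical, and this is not the evaluation code used in the paper.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outscores a negative.

    labels: 1 for interface residues, 0 for non-interface residues.
    scores: classifier outputs, higher meaning "more likely interface".
    This pairwise version is quadratic in the number of residues; it is
    meant for illustration, not for large datasets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    y = [1, 0, 1, 0]
    s = [0.9, 0.8, 0.2, 0.1]
    print(roc_auc(y, s))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which puts the differences of 0.01 to 0.02 between database variants in these tables in context.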
Fig 1. PSI-BLAST run time.
The total PSI-BLAST run time (in hours) for generating PSSM profiles for RB198 sequences using UniRef100 versus its sequence-similarity-reduced variants (A) and its randomly sampled variants (B).
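To make the run-time comparison concrete, the sketch below assembles the BLAST+ command lines one might use to build a protein database from a sampled FASTA file and to generate an ASCII PSSM for a single query. The number of iterations, the inclusion threshold, the thread count, and all file and database names are assumptions chosen for illustration; the source does not state the PSI-BLAST settings that were used.

```python
import shlex
from typing import List


def makeblastdb_cmd(fasta_path: str, db_name: str) -> List[str]:
    """Command that formats a protein FASTA file as a BLAST database."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "prot", "-out", db_name]


def psiblast_cmd(
    query_fasta: str,
    db_name: str,
    pssm_out: str,
    iterations: int = 3,            # assumed value, not from the paper
    inclusion_ethresh: float = 0.001,  # assumed value, not from the paper
    threads: int = 1,
) -> List[str]:
    """Command that runs PSI-BLAST for one query and writes its ASCII PSSM."""
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", db_name,
        "-num_iterations", str(iterations),
        "-inclusion_ethresh", str(inclusion_ethresh),
        "-num_threads", str(threads),
        "-out_ascii_pssm", pssm_out,
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch works
    # without BLAST+ installed. Pass the lists to subprocess.run to execute.
    print(shlex.join(makeblastdb_cmd("ur1r.fasta", "ur1r")))
    print(shlex.join(psiblast_cmd("query.fasta", "ur1r", "query.pssm")))
```

Wrapping the `psiblast` call in a timer and repeating it for each database variant would give a comparison of the kind summarized in the figure.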
PSI-BLAST memory usage.
| Database | RB198 | RB44 |
|---|---|---|
| UR100 | 12.00 | 12.00 |
| UR50 | 3.50 | 3.50 |
| UR50R | 4.20 | 4.20 |
| UR40 | 2.80 | 2.70 |
| UR40R | 3.50 | 3.50 |
| UR30 | 2.40 | 2.40 |
| UR30R | 3.10 | 3.10 |
| UR10R | 1.80 | 1.80 |
| UR5R | 0.91 | 0.89 |
| UR1R | 0.21 | 0.20 |
Maximum memory (in gigabytes) allocated for PSI-BLAST during the generation of PSSM profiles for the RB198 and RB44 datasets using UniRef100 and its variants.
Average number of hits used for generating PSSM profiles.
| Features | RB198 | RB44 |
|---|---|---|
| UR100 | 453 | 492 |
| UR50 | 362 | 331 |
| UR50R | 422 | 433 |
| UR40 | 318 | 261 |
| UR40R | 415 | 416 |
| UR30 | 295 | 239 |
| UR30R | 413 | 416 |
| UR10R | 393 | 371 |
| UR5R | 336 | 291 |
| UR1R | 166 | 99 |
Average number of hits found by PSI-BLAST when generating PSSM profiles for the RB198 and RB44 datasets using UniRef100 and its variants.
Fig 2. Average pairwise distances between different PSSM profiles of RB198 sequences.
Average pairwise NSSD (A) and NKL (B) distances over RB198 PSSM profiles. Randomly sampled UniRef variants are more representative of UR100 than similarity-reduced UniRef variants.
Evaluation of servers using RB111 test set.
| Method | ACC (%) | Sensitivity | Specificity | MCC |
|---|---|---|---|---|
| FastRNABindR | 75.1 | 0.61 | 0.76 | 0.24 |
| RNABindR v2 | 72.0 | 0.63 | 0.73 | 0.22 |
| BindN+ | 83.5 | 0.43 | 0.87 | 0.24 |
| PPRInt | 76.1 | 0.48 | 0.79 | 0.18 |
| KYG | 77.5 | 0.47 | 0.80 | 0.19 |
| PRIP | 75.2 | 0.45 | 0.78 | 0.15 |
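The server comparison above can be sanity-checked against the RB111 class counts given earlier (3305 interface and 34255 non-interface residues). The sketch below assumes that the third and fourth numeric columns are sensitivity and specificity, which is an inference from the arithmetic rather than something the extracted header states. Under that assumption, accuracy and MCC can be recomputed from each row. The function names are hypothetical.

```python
import math
from typing import Tuple


def confusion_from_rates(
    sensitivity: float, specificity: float, n_pos: int, n_neg: int
) -> Tuple[float, float, float, float]:
    """Return (TP, FP, TN, FN) implied by the two rates and class sizes."""
    tp = sensitivity * n_pos
    fn = n_pos - tp
    tn = specificity * n_neg
    fp = n_neg - tn
    return tp, fp, tn, fn


def accuracy(tp: float, fp: float, tn: float, fn: float) -> float:
    """Fraction of all residues classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def mcc(tp: float, fp: float, tn: float, fn: float) -> float:
    """Matthews correlation coefficient; 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # RB111 class sizes from the dataset table; rates from the FastRNABindR row.
    counts = confusion_from_rates(0.61, 0.76, n_pos=3305, n_neg=34255)
    print(f"ACC = {100 * accuracy(*counts):.1f}%  MCC = {mcc(*counts):.2f}")
```

Because the table gives the rates to two decimals only, the recomputed accuracy and MCC agree with the reported values only to within rounding. The check also shows why BindN+ reaches the highest accuracy despite the lowest sensitivity: with about ten non-interface residues per interface residue, accuracy is dominated by specificity.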