Zhenling Peng, Lukasz Kurgan.
Abstract
Intrinsically disordered proteins and regions (IDPs and IDRs) lack stable 3D structure under physiological conditions in vitro, are common in eukaryotes, and facilitate interactions with RNA, DNA and proteins. Current methods for prediction of IDPs and IDRs do not provide insights into their functions, except for a handful of methods that predict protein-binding regions. We report DisoRDPbind, a first-of-its-kind computational method for high-throughput prediction of RNA-, DNA- and protein-binding residues located in IDRs from protein sequences. DisoRDPbind is implemented using a runtime-efficient multi-layered design that utilizes information extracted from physicochemical properties of amino acids, sequence complexity, putative secondary structure and disorder, and sequence alignment. Empirical tests demonstrate that it provides accurate predictions that are competitive with other predictors of disorder-mediated protein-binding regions and complementary to the methods that predict RNA- and DNA-binding residues annotated based on crystal structures. Application to the Homo sapiens, Mus musculus, Caenorhabditis elegans and Drosophila melanogaster proteomes reveals that RNA- and DNA-binding proteins predicted by DisoRDPbind complement and overlap with the corresponding known binding proteins collected from several sources. Also, the number of putative protein-binding regions predicted with DisoRDPbind correlates with the promiscuity of proteins in the corresponding protein-protein interaction networks. Webserver: http://biomine.ece.ualberta.ca/DisoRDPbind/.
Year: 2015 PMID: 26109352 PMCID: PMC4605291 DOI: 10.1093/nar/gkv585
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Architecture of DisoRDPbind.
Figure 2. Empirical assessment of the prediction of the disordered RNA-, DNA- and protein-binding residues. (A) Predictive performance measured with AUC calculated per residue, and significance of differences in AUC values when comparing DisoRDPbind with other methods on two benchmark data sets: TEST114 (solid bars) and TEST36 (hollow bars); * means that the AUC of a given method is statistically significantly lower than the AUC of DisoRDPbind at P-value < 0.05; statistical significance was assessed over 10 random subsets, each with half of the proteins from a given test set; error bars show the corresponding standard errors (details in Online Methods). ‘DisoRDP w/o BLAST’ denotes DisoRDPbind without the use of the BLAST-based alignment. (B) PCC values between the propensity scores generated by the pairs of RNA- (red dots), DNA- (green dots) and protein- (black dots) binding predictors listed on the x- and y-axes; results for the TEST114 and TEST36 data sets are shown above and below the dashed diagonal line, respectively; dot sizes are proportional to the corresponding PCC values, which are shown next to the dots. (C) Relation between the length of protein chains (x-axis) and the runtime (y-axis, logarithmic scale) computed for proteins from the TEST114 and TEST36 data sets using a modern desktop; we include DisoRDPbind (solid circles), ANCHOR (hollow triangles), and one iteration of PSI-BLAST (hollow circles); the solid black, solid gray and dotted black lines represent the quadratic fit for DisoRDPbind, ANCHOR and PSI-BLAST, respectively.
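The per-residue AUC evaluation over random half-subsets described for panel (A) can be illustrated with a minimal sketch. It assumes each protein is given as a pair of per-residue binary labels and propensity scores; the function names `auc` and `half_subset_aucs`, the pooling of residues across chains, and the fixed seed are illustrative assumptions, not the authors' code. The statistical test itself is described in the Online Methods and is not reproduced here.

```python
import math
import random
from statistics import mean, stdev


def auc(labels, scores):
    """Rank-based (Mann-Whitney) AUC; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the group of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs both binding and non-binding residues")
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def half_subset_aucs(proteins, n_repeats=10, seed=0):
    """proteins: list of (labels, scores) per chain (hypothetical layout).

    Draws n_repeats random subsets with half of the proteins and returns
    the per-residue AUC of each subset, pooling residues across chains.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_repeats):
        subset = rng.sample(proteins, len(proteins) // 2)
        labels = [y for chain_labels, _ in subset for y in chain_labels]
        scores = [s for _, chain_scores in subset for s in chain_scores]
        results.append(auc(labels, scores))
    return results


def mean_and_standard_error(values):
    """Mean and standard error of the mean, as used for error bars."""
    return mean(values), stdev(values) / math.sqrt(len(values))
```

Calling `mean_and_standard_error(half_subset_aucs(proteins))` yields one bar height and one error bar for a single method on a single test set under these assumptions.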
Figure 3. Evaluation of predictions of the disordered RNA-, DNA- and protein-binding in the H. sapiens, M. musculus, C. elegans and D. melanogaster genomes. (A) Venn diagrams of the overlap between the disordered RNA-binding (DNA-binding) proteins predicted by DisoRDPbind and the known binding proteins collected from the GO_RNA (GO_DNA) and RBPDB (animalTFDB) data sets, respectively. (B) Venn diagrams of the overlap between the disordered RNA-binding (DNA-binding) proteins predicted by DisoRDPbind and the known binding proteins collected from the GO_RNA (GO_DNA) data set and from the recently curated RNA-binding (DNA-binding) protein data set DB_RNA (DB_DNA), respectively. The area of the rectangles corresponds to 40% of the size of a given proteome; the counts of proteins in a given data set and in the intersections of the data sets are given inside the corresponding rectangles. (C) Median ratio between the actual overlap between the RNA-binding (DNA-binding) proteins predicted by DisoRDPbind and proteins annotated in GO_RNA, RBPDB and DB_RNA (GO_DNA, animalTFDB and DB_DNA), and the overlap of the proteins from these databases with a randomly chosen set of proteins. The median ratio is over 10 repetitions with half of the data; error bars are the 30th and 70th centiles; the number of chains in a given database and the percentage of overlap with the predictions of DisoRDPbind are given inside the bars; * means that the difference between the two values of overlap is statistically significant at P-value < 0.0005.
(D) Median ratio (over 10 repetitions with half of the data; error bars are the 30th and 70th centiles) between the actual overlap between the cellular localizations of novel putative RNA (DNA) binders and the localizations that are significantly associated with the proteins known to bind RNA from GO_RNA, RBPDB and DB_RNA (known to bind DNA from GO_DNA, animalTFDB and DB_DNA), and the overlap in cellular localizations of the proteins from these databases with a randomly chosen set of proteins. This analysis was done in M. musculus since annotations of localizations in other genomes were not sufficiently complete. The percentage of the overlap with the predictions of DisoRDPbind is given inside the bars; * denotes that the difference between the two values of overlap is statistically significant at P-value < 0.05. (E) Relation between the promiscuity of proteins in PPI networks collected from mentha and the number of the disordered protein-binding regions predicted with DisoRDPbind. The relation was quantified with the Pearson correlation coefficient (PCC), which is shown inside the bars. Bars show the median ratio (over 10 repetitions with half of the data; error bars are the 30th and 70th centiles) in logarithmic scale between these PCC values and the ‘random PCC’ where the promiscuity values were shuffled; * denotes that the difference between the two values of PCC is statistically significant at P-value < 0.05.
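The comparison in panel (E) between the actual PCC and the ‘random PCC’ obtained after shuffling promiscuity values can be sketched as follows. This is a minimal illustration under stated assumptions: `promiscuity` and `n_regions` are hypothetical parallel lists with one value per protein, the function names are invented for this example, and the sketch returns the (actual, random) PCC pairs rather than their ratio, because the caption does not specify how ratios involving near-zero or negative random PCC values were handled.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("PCC is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def actual_vs_shuffled_pcc(promiscuity, n_regions, n_repeats=10, seed=0):
    """Return (actual PCC, shuffled PCC) for each random half of the data.

    promiscuity: interaction-partner count per protein (assumed layout).
    n_regions:   number of predicted protein-binding regions per protein.
    """
    rng = random.Random(seed)
    indices = list(range(len(promiscuity)))
    pairs = []
    for _ in range(n_repeats):
        half = rng.sample(indices, len(indices) // 2)
        prom = [promiscuity[i] for i in half]
        regions = [n_regions[i] for i in half]
        actual = pearson(prom, regions)
        shuffled = prom[:]
        rng.shuffle(shuffled)  # breaks the pairing with region counts
        pairs.append((actual, pearson(shuffled, regions)))
    return pairs
```

The median and the 30th and 70th centiles of whatever ratio is derived from these pairs would then give the bar height and error bars.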