Unfoldomics of prostate cancer: on the abundance and roles of intrinsically disordered proteins in prostate cancer.

Kevin S Landau¹, Insung Na¹, Ryan O Schenck¹, Vladimir N Uversky².

Abstract

Prostatic diseases such as prostate cancer and benign prostatic hyperplasia are highly prevalent among men. The number of studies focused on the abundance and roles of intrinsically disordered proteins in prostate cancer is rather limited. The goal of this study is to analyze the prevalence and degree of disorder in proteins that were previously associated with the prostate cancer pathogenesis and to compare these proteins to the entire human proteome. The analysis of these datasets provides means for drawing conclusions on the roles of disordered proteins in this common male disease. We also hope that the results of our analysis can potentially lead to future experimental studies of these proteins to find novel pathways associated with this disease.

Year: 2016; PMID: 27453073; PMCID: PMC5000786; DOI: 10.4103/1008-682X.184999

Source DB: PubMed; Journal: Asian J Androl; ISSN: 1008-682X; Impact factor: 3.285

INTRODUCTION

Prostate cancer is the most prevalent form of cancer in males. It is estimated that in 2016, it will account for 180 890 (21%) of new cancer cases. It is also estimated that 26 120 (8%) of all cancer deaths that will occur in the United States male population in 2016 will be caused by prostate cancer.1 Although there has been progress in the detection and treatment of prostate cancer, it is clear that more research is needed to make the diagnosis more reliable. The need for constant improvement of diagnostic tools is illustrated, for example, by the recently established effect of obesity (measured in terms of the body mass index) on the predictive performance of the well-established and widely used biomarker, prostate-specific antigen (PSA), and PSA-related markers for prostate cancer.2 These and similar facts alone should be sufficient to spark the scientific community's interest in further study of the possible causes of this very common male disease. Sometimes, a new approach needs to be taken to advance biomedical research.
One such approach is to consider the roles of intrinsically disordered proteins (IDPs), which are proteins that lack stable secondary and tertiary structure and have been shown to constitute a noticeable part of any proteome of interest.3456789101112131415 At the primary sequence level, IDPs are characterized by pronounced compositional biases, being depleted in order-promoting amino acids (Cys, Trp, Tyr, Phe, Ile, Leu, Val, and Asn) and enriched in disorder-promoting residues (Pro, Arg, Gly, Gln, Ser, Glu, Lys, and Ala).3161718 This peculiar physicochemical nature does not favor spontaneous folding into well-defined globular structures, and these proteins therefore remain “floppy.”1920 Among other important features, this “floppiness” provides IDPs with the ability to be promiscuous binders and to be involved in the regulation and control of various signaling processes, while being controlled themselves at multiple levels.452122 Furthermore, intrinsically disordered protein regions (IDPRs) are often targeted for posttranslational modifications (PTMs)23242526 and are known to be involved in a myriad of biological processes.45161820272829303132 Proteins that are involved in multiple functions and processes are sometimes referred to as “moonlighting” proteins, and many of these moonlighting proteins have been shown to be either completely disordered or to possess long IDPRs.33 Often, IDPs are specifically compartmentalized; e.g., IDPs and proteins with IDPRs are found in various nuclear membrane-less organelles.34 Since cells in various cancers, not just of the prostate, grow and divide in an uncontrolled manner, it would be reasonable to assume that these moonlighting proteins with multiple disordered regions have a significant role in many oncological processes.
The validity of this hypothesis was demonstrated in an earlier bioinformatics study, in which the majority of human cancer-associated proteins (HCAP) were shown to contain long IDPRs (i.e., regions possessing ≥30 consecutive disordered residues).35 More generally, the D2 (disorder in disorders) concept was introduced to emphasize that many proteins related to various human diseases such as cancer, neurodegeneration, diabetes, cardiovascular disease, and amyloidosis are intrinsically disordered.36 This two-part study is dedicated to analyzing the prevalence and functionality of disordered proteins in prostate cancer cells and to comparing them with the abundance and functionality of IDPs in the human proteome as a whole. The present paper, which represents the first part of this two-part study, reports the results of a global bioinformatics analysis of the disorderedness of proteins that were previously shown to be associated with prostate cancer pathogenesis and compares these proteins with the entire human proteome. The second part continues this analysis, focusing on a more detailed characterization of the prostate cancer-associated proteins from the KEGG database.

MATERIALS AND METHODS

Acquisition of protein datasets

UniProt IDs of human proteins were obtained from the reviewed UniProt release 2015–09.37 From those, entries whose sequences contain uncommon amino acid codes (J, X, O, U, B, and Z) and proteins shorter than 30 amino acids were removed. The remaining 20 120 proteins were considered as the human proteome set. To obtain a dataset of proteins already shown experimentally to be involved in prostate cancer, we used the results of a proteomic study that reported the characterization of a total of 359 proteins, including 17 potential biomarkers of prostate cancer from prostatic cell lines, entered into the prostate cancer proteomics (PCP) database (http://ef.inbi.ras.ru).38 This database is described as a multilevel informational database that was created using the results of a proteomic study of human prostate carcinoma and benign hyperplasia tissues, and of some cultured human cell lines. Prostate cancer-related proteins entered into this database were first separated by 2D electrophoresis and subsequently identified by mass spectrometry.38 Proteins are extensively annotated using data from published articles and existing databases, and the entries contain direct Internet links to the information in the NCBI and UniProt databases.38 Although the PCP website was originally designed in Russian, it can be toggled into English. The PCP database consists of 7 interrelated modules that contain proteomic data from different cell lines.38 The seven modules that were used in our analysis are prostate proteins (hyperplasia, cancer), LNCAP (IPG-2DE), LNCAP (IEF-2DE), proteins of Rhabdomyosarcoma A-204, normal human myoblasts proteins, PC3, and BPH-1. The gene names were retrieved from the PCP database and then used as identifiers to find the corresponding human proteins in UniProtKB. Some gene product names could not be mapped to UniProt IDs and were not included in our analysis.
Once these UniProt IDs were found, obvious duplicates were removed, which decreased this dataset to 291 proteins. Since the PCP dataset had 7 interrelated modules, there were many repeated UniProt IDs, including those corresponding to various isoforms. Therefore, a series of programs was written to obtain unique representatives for subsequent analysis. After parsing the files, a dataset consisting of 251 proteins was assembled. This dataset included 196 reviewed UniProt proteins and 55 unreviewed TrEMBL proteins. The isoforms (designated with a dash and a number in their UniProt IDs) fell into the unreviewed category. Collectively, the dataset containing the 196 UniProt-reviewed proteins from the PCP database is referred to as the “Russian dataset” in our analysis and corresponding files. An independent search of the KEGG database resource (http://www.genome.jp/kegg/)39 conducted in September 2015 provided us with a set of 48 proteins experimentally shown to be involved in prostate cancer. This dataset, referred to as the “KEGG dataset”, was used in the global analysis of the abundance of intrinsic disorder in this set of prostate cancer-related proteins described in the Results section. A new search of the KEGG database conducted on May 28, 2016, gave 40 additional proteins with an experimentally validated connection to prostate cancer. The resulting set of 88 proteins constituted the “extended KEGG dataset”, and proteins from these datasets were subjected to a more focused and detailed analysis of their predisposition for intrinsic disorder in the companion paper. Proteins from the Russian and KEGG datasets that were not found in the whole human proteome dataset were filtered out. For comparison purposes, a modified human proteome dataset was generated from which the proteins included in the Russian and KEGG datasets were excluded.

Supplementary material: Illustrative representation of the 2×2 contingency table used to find significantly associated GO terms.
Since transmembrane domains of membrane proteins are typically enriched in hydrophobic, order-promoting residues, the Russian, KEGG, and human proteome datasets were further divided into two categories: membrane and nonmembrane. Membrane and nonmembrane proteins were classified using the QuickGO gene association file (gene_association.goa_human.gz, updated on September 14, 2015) and 41 Cellular Component Gene Ontology (GO) terms with ‘integral’ and ‘membrane’ in their names.4041
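As an illustration of the sequence-level filter described in this section, the following Python sketch keeps only sequences that are at least 30 residues long and contain none of the codes J, X, O, U, B, or Z. It is a minimal sketch under assumptions and not the authors' scripts: the function names and the dictionary input (accession mapped to sequence) are hypothetical, and the filter is read as removing entries that contain the uncommon codes.

```python
# Hypothetical helper names; only the two criteria come from the text above.
UNCOMMON_CODES = set("JXOUBZ")  # uncommon amino acid codes listed in the text
MIN_LENGTH = 30                 # proteins shorter than this are removed


def keep_sequence(seq: str) -> bool:
    """Return True if a sequence passes both filters."""
    seq = seq.upper()
    has_uncommon = bool(set(seq) & UNCOMMON_CODES)
    return len(seq) >= MIN_LENGTH and not has_uncommon


def filter_proteome(sequences: dict[str, str]) -> dict[str, str]:
    """Keep only the accession -> sequence pairs that pass the filters."""
    return {acc: seq for acc, seq in sequences.items() if keep_sequence(seq)}
```

Applied to a dictionary of reviewed human entries, `filter_proteome` would return the subset that plays the role of the human proteome set in the text.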

Functional annotation using gene ontology (GO) terms

The GO terms for the human proteome were downloaded on September 14, 2015, from QuickGO (https://www.ebi.ac.uk/QuickGO/). The resulting file shows all of the biological processes (P), molecular functions (F), and cellular component (C) for all of the human proteins in UniProt format. All of the GO terms that are specific for annotating the cellular component were downloaded from the Gene Ontology Consortium (http://www.geneontology.org).

Evaluation of intrinsic disorder propensity and disorder-based functionality

Intrinsic disorder propensities of human proteins were evaluated using PONDR-FIT, PONDR® VLXT, and PONDR® VSL2 algorithms.4243444546 For each protein, after obtaining an average disorder score by each predictor, all three predictor-specific average scores were averaged again to generate a per-protein intrinsic disorder score. The use of consensuses for evaluation of intrinsic disorder is motivated by empirical observations that this leads to an increase in the predictive performance compared to the use of a single predictor.474849 The dataset average disorder scores (AVG scores) were then calculated for human membrane and nonmembrane proteins based on the corresponding per-protein scores, and those AVG scores were used to prepare contingency tables. The UniProt IDs in the set of human proteins were then used to match the disorder scores to the UniProt IDs of proteins in Russian and KEGG datasets to generate disorder scores for all of the proteins in our datasets. Since the human proteome disorder scores were only retrieved for the SwissProt reviewed human proteins, only the disorder propensities of the reviewed proteins in the Russian dataset were analyzed.
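The two-level averaging described above (a mean score per predictor, then a mean over the three predictors, then a dataset-level AVG) can be sketched in Python as follows. This is only an illustration under assumptions: the per-residue scores are assumed to be already available as lists of numbers, and the function names and dictionary layout are hypothetical.

```python
from statistics import mean


def per_protein_score(per_residue_scores: dict[str, list[float]]) -> float:
    """Consensus score of one protein.

    per_residue_scores maps a predictor name (e.g., "PONDR-FIT") to that
    predictor's per-residue disorder scores for the protein.
    """
    # Step 1: average disorder score of the protein for each predictor.
    predictor_means = [mean(scores) for scores in per_residue_scores.values()]
    # Step 2: average the predictor-specific means.
    return mean(predictor_means)


def dataset_avg(per_protein: dict[str, float]) -> float:
    """Dataset-level AVG score computed from the per-protein scores."""
    return mean(list(per_protein.values()))
```

Each protein's score from `per_protein_score` can then be compared against the `dataset_avg` of the relevant (membrane or nonmembrane) human proteome subset when filling the contingency tables.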

Contingency table constructions, statistical test, and common GO term search

Fisher's exact test was used to assess whether the frequency (number of proteins) with which proteins in the prostate cancer datasets (Russian and KEGG) have an average disorder score, calculated from the scores of all three disorder predictors (PONDR-FIT, PONDR® VSL2, and PONDR® VLXT), greater than or equal to the AVG score of the entire human proteome is significant. These data were used to construct the 2 × 2 contingency tables (an illustrative example of such a table is provided in the supplementary material), so that the subsequent statistical analyses could be performed.50 The null (H0) and research (H1) hypotheses shown below were tested for both the membrane and nonmembrane subsets of all datasets using α = 0.01 as the statistical significance level.

H0: a/(a + b) ≤ c/(c + d)
H1: a/(a + b) > c/(c + d)

IBM SPSS 22 software (IBM Corporation, USA) was used to perform the Fisher's exact test, whereas the R statistical package was used to create the figures presenting the data. A similar approach was also used to find GO terms significantly associated with intrinsically disordered proteins in all datasets. Here, after categorization of the protein sets in contingency tables, Fisher's exact test was performed to find GO terms significantly associated with proteins possessing a per-protein intrinsic disorder score greater than the AVG score in the Russian or KEGG datasets separately when those were compared with the whole human proteome. In these studies, membrane and nonmembrane proteins were treated separately. Fisher's exact test P values were obtained using Python 3 and its scipy module. Next, significant GO terms were analyzed using AmiGo2.51 First, GO terms with P values smaller than 0.05 or 0.1 were obtained for the KEGG or Russian dataset versus the whole human proteome, separately. Those GO terms were uploaded to the visualization server, and the resulting GO term maps were downloaded in the SVG format.
These maps included significant GO terms specific for the proteins in the KEGG or Russian dataset and were used to find the intersection, i.e., the common GO terms present in both datasets. Those intersection GO terms were analyzed using REVIGO.52 For prostate cancer-related proteins from the KEGG dataset, we analyzed the per-residue disorder propensities by the PONDR-FIT, PONDR® VLXT, and PONDR® VSL2 algorithms,4243444546 and by the PONDR® VL3 predictor, which possesses high accuracy in finding long IDPRs.53 We also used the consensus approach MobiDB,5455 applied a binary disorder classifier, the charge-hydropathy plot (CH-plot), which evaluates the predisposition of a given protein to be ordered or disordered as a whole,1856 predicted potential disorder-based binding sites using the ANCHOR algorithm,5758 looked at functional disorder using the D2P2 database,59 and analyzed the interactivity of these proteins with STRING.60 The MobiDB database (http://mobidb.bio.unipd.it/)5455 generates consensus disorder scores by aggregating the output from ten predictors: two versions of IUPred (IUPred-long and IUPred-short),61 three versions of ESpritz (ESpritz-DisProt, ESpritz-NMR, and ESpritz-XRay),62 two versions of DisEMBL (DisEMBL-465 and DisEMBL-HL),63 JRONN,64 PONDR® VSL2B,4665 and GlobPlot.66 A CH-plot represents an input protein as a point within a 2D graph where the mean Kyte-Doolittle hydrophobicity and the mean absolute net charge are used as the X- and Y-coordinates, respectively. In the corresponding CH-plot, fully structured proteins and fully disordered proteins can be separated by a boundary line.
All proteins located above this boundary line are highly likely to be extended, whereas proteins located below this line are likely to be compact.1856 In addition to the CH-plot, another binary disorder predictor, cumulative distribution function (CDF) analysis, was used.56 It summarizes the per-residue disorder predictions by plotting PONDR scores against their cumulative frequency, which allows ordered and disordered proteins to be distinguished on the basis of the distribution of prediction scores.56 At any given point on the CDF curve, the ordinate gives the proportion of residues with a PONDR score less than or equal to the abscissa. The optimal boundary that provided the most accurate order-disorder classification was shown to consist of seven points located in the 12th through 18th bins.56 Thus, in the CDF analysis, order-disorder classification is based on whether the CDF curve of a given protein is above (ordered) or below (disordered) a majority of the boundary points.56 Disorder evaluations together with important disorder-related functional information were retrieved from the D2P2 database (http://d2p2.pro/),59 which is a database of predicted disorder for a large library of proteins from completely sequenced genomes.59 The D2P2 database uses the outputs of IUPred,61 PONDR® VLXT,42 PrDOS,67 PONDR® VSL2B,4665 PV2,59 and ESpritz.62 The database is further supplemented by data on the location of various curated posttranslational modifications and predicted disorder-based protein-binding sites. Additional functional information for these proteins was retrieved using the Search Tool for the Retrieval of Interacting Genes (STRING; http://string-db.org/), which generates a network of predicted associations based on predicted and experimentally validated information on the interaction partners of a protein of interest.60 In the corresponding network, the nodes correspond to proteins, whereas the edges show predicted or known functional associations.
Seven types of evidence are used to build the corresponding network where they are indicated by the differently colored lines: a green line represents neighborhood evidence; a red line - the presence of fusion evidence; a purple line - experimental evidence; a blue line – co-occurrence evidence; a light blue line - database evidence; a yellow line – text mining evidence; and a black line – co-expression evidence.60 In our analysis, the most stringent criteria were used for selection of interacting proteins by choosing the highest cutoff of 0.9 as the minimal required confidence level. Finally, potential disorder-based protein-binding sites of prostate cancer-related proteins from the KEGG dataset were identified by the ANCHOR algorithm.5758 This algorithm utilizes the pair-wise energy estimation approach originally used by IUPred.6168 This approach acts on the hypothesis that long regions of disorder include localized potential-binding sites which are not capable of folding on their own due to not being able to form enough favorable intrachain interactions, but can obtain the energy to stabilize via interaction with a globular protein partner.5758
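The one-sided Fisher's exact test on the 2 × 2 contingency tables described in this section can be illustrated with a short, dependency-free Python sketch. The cell layout is an assumption made for illustration, since the definitions of a, b, c, and d are given only in the supplementary material: a and b are taken as the numbers of dataset proteins with per-protein disorder scores ≥ AVG and < AVG, and c and d as the corresponding numbers for the human proteome. The function name is hypothetical. The call `scipy.stats.fisher_exact([[a, b], [c, d]], alternative="greater")` is expected to return the same one-sided P value.

```python
from math import comb


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test P value for H1: a/(a+b) > c/(c+d).

    Assumed layout (not stated in the main text):
        row 1 = dataset proteins   (a: score >= AVG, b: score < AVG)
        row 2 = proteome proteins  (c: score >= AVG, d: score < AVG)
    The P value is the hypergeometric probability of observing a or more
    proteins in cell a when all table margins are held fixed.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)
    k_max = min(row1, col1)
    favourable = sum(
        comb(row1, k) * comb(row2, col1 - k) for k in range(a, k_max + 1)
    )
    return favourable / total_tables


# Toy counts (made up for illustration, not taken from the paper).
p_value = fisher_one_sided(3, 1, 1, 3)
reject_h0 = p_value < 0.01  # significance level alpha = 0.01 used in the text
```

With the toy table above the P value is 17/70, so H0 would not be rejected at α = 0.01.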

Evaluation of interactability of prostate cancer-related proteins in the KEGG dataset

Interactability of human proteins related to prostate cancer from the KEGG dataset was further evaluated by the APID (Agile Protein Interactomes DataServer) web server (http://apid.dep.usal.es).69 APID has information on 90 379 distinct proteins from more than 400 organisms (including Homo sapiens) and on the 678 441 singular protein-protein interactions. For each protein–protein interaction (PPI), the server provides currently reported information about its experimental validation. For each protein, APID unifies PPIs found in five major primary databases of molecular interactions, such as BioGRID,70 Database of Interacting Proteins (DIP),71 Human Protein Reference Database (HPRD),72 IntAct,73 and the Molecular Interaction (MINT) database,74 as well as from the BioPlex (biophysical interactions of ORFeome-based complexes)75 and from the protein databank (PDB) entries of protein complexes.76 This server provides a simple way to evaluate the interactability of individual proteins in a given dataset and also allows researchers to create a specific protein-protein interaction network in which proteins from the query dataset are engaged. Of 88 prostate cancer-related proteins in the KEGG dataset, APID was able to find protein-protein-related information on 87 proteins. No such information was available for the 3-oxo-5-alpha-steroid 4-dehydrogenase 2 (SRD5A2, UniProt ID: P31213).

RESULTS

Abundance of intrinsic disorder in Russian and KEGG datasets of prostate cancer-related proteins

There were 20 120 proteins in the whole human proteome dataset used. This dataset was separated into 5193 membrane proteins and 14 927 nonmembrane proteins. The Russian dataset included 196 unique proteins that were mined from the PCP database. This dataset was then separated into 6 membrane proteins and 190 nonmembrane proteins. The KEGG dataset contained 48 proteins, 4 of which were classified as membrane and the remaining 44 as nonmembrane proteins. Peculiarities of the distribution of disorder within the members of these six datasets are shown in Figure 1 in the form of DSFIT vs DSVSL2 plots, where DSFIT and DSVSL2 correspond to the mean disorder scores calculated for query proteins using the PONDR-FIT and PONDR® VSL2 algorithms, respectively.

Figure 1

Abundance of intrinsic disorder in the six datasets analyzed in this study: (a) human membrane (red and blue circles) and nonmembrane proteins (pink and cyan circles); (b) human membrane (red and blue circles) and nonmembrane prostate cancer-related proteins (pink and cyan circles) from the Russian dataset; (c) human membrane (red and blue circles) and nonmembrane prostate cancer-related proteins (pink and cyan circles) from the KEGG dataset. Data for all these sets are shown in the form of proteins with per-protein disorder scores above (red and pink circles) or below AVG (blue and cyan circles). AVGs for nonmembrane and membrane proteins are shown as medium dashed dark green and short dashed dark gray lines.

To avoid redundancy in the subsequent statistical analysis of GO terms, the membrane and nonmembrane subsets of the whole human proteome dataset were adjusted to exclude prostate cancer-associated proteins. The Supplementary Tables show the resulting 2 × 2 contingency tables, and the corresponding results are further shown in Figure 2, where they are depicted as bar graphs.

Figure 2

Bar graphs presenting data from the Supplementary Tables. These graphs represent the analyzed datasets split into membrane and nonmembrane groups; each bar represents data for either the human proteome, the Russian (PCP) dataset (a), or the KEGG dataset (b) within its respective group. The red section of each bar represents the fraction of proteins with a per-protein average disorder score ≥ AVG, and the blue section represents the fraction of proteins with a per-protein average disorder score < AVG.

Supplementary material: 2×2 contingency table of the membrane KEGG dataset and membrane proteins in the whole human proteome (membrane AVG: 0.301).
Supplementary material: 2×2 contingency table of the nonmembrane KEGG dataset and nonmembrane proteins in the whole human proteome (nonmembrane AVG: 0.437).
Supplementary material: 2×2 contingency table of the membrane Russian dataset and membrane proteins in the whole human proteome.
Supplementary material: 2×2 contingency table of the nonmembrane Russian dataset and nonmembrane proteins in the whole human proteome.

We recognize that the values of some cells in the 2 × 2 contingency tables are rather low, which makes it risky to draw solid statistical conclusions, especially when a very small sample is compared with a very large one. Furthermore, the disorder propensities of the proteins analyzed in this study were evaluated using a set of standard disorder predictors that are characterized by an accuracy of 80%–85%. This indicates that the confidence of the conclusions outlined below (especially for the cases with very limited samples) is further influenced by the limited accuracy of the predictors. Therefore, the data presented below should be taken only as an indication of potential tendencies in the disorder predisposition and not as final statistically significant conclusions.
Supplementary material: Characterization of human proteins involved in the prostate cancer pathway.

The average disorder scores across all three predictors (PONDR® VSL2, PONDR® VLXT, and PONDR-FIT) were calculated for the nonadjusted subsets of membrane and nonmembrane proteins in the whole human proteome. These average disorder scores of 0.301 and 0.437 for the membrane and nonmembrane datasets, respectively, were then used to perform the statistical analysis of the proportion of disordered proteins in the KEGG and Russian datasets compared to the proportion of disordered proteins in the entire human proteome. The one-sided Fisher's exact test P value for the membrane proteins in the Russian dataset is 0.363, and therefore, H0 cannot be rejected at the significance level of α = 0.01. This means that in the membrane subset of the Russian dataset, the proportion of proteins with per-protein disorder scores ≥ the AVG score is not significantly different from the corresponding proportion among human membrane proteins. On the other hand, the one-sided Fisher's exact test P value for the nonmembrane proteins from the Russian dataset is <0.001. Therefore, we reject H0 at the significance level of α = 0.01. From these data, we conclude that the fraction of nonmembrane proteins with per-protein disorder scores exceeding the AVG score is significantly higher in the Russian dataset of prostate cancer-associated proteins than in the human proteome. This indicates that proteins with high intrinsic disorder levels are found significantly more often in the nonmembrane Russian dataset than in the whole human proteome, suggesting that a higher abundance of IDPs/IDPRs can be related to cancer development.
This observation is in agreement with the results of earlier studies on the high abundance of intrinsic disorder in cancer-related proteins.35 For the KEGG membrane proteins versus the dataset of human membrane proteins, we cannot reject the H0 hypothesis at α = 0.05. Therefore, there is no statistically significant difference between the proportions of KEGG membrane proteins with per-protein disorder scores ≥ AVG and human membrane proteins with per-protein disorder scores ≥ AVG. Similarly, for the KEGG nonmembrane proteins versus human nonmembrane proteins, we also cannot reject the H0 hypothesis at α = 0.05. Although this analysis revealed that the intrinsic disorder propensities of the prostate cancer-related proteins in the KEGG dataset are not significantly different from the disorder predispositions of proteins in the human proteome, we conducted a more detailed disorder-oriented analysis of proteins from the extended KEGG dataset to illustrate the peculiarities of disorder distribution in these proteins and to see how IDPRs can be related to function and pathology. Of the 17 potential biomarkers of prostate cancer outlined in the paper by Shishkin et al.,38 only 12 were included in the reviewed protein dataset analyzed in our study. All of these 12 potential biomarkers are in the nonmembrane subset, whose members are characterized by per-protein disorder scores below the AVG disorder score. Furthermore, the well-known prostate cancer biomarker PSA was also present in our set of nonmembrane prostate cancer-related proteins characterized by disorder scores below the AVG disorder score of human nonmembrane proteins. These observations suggest that potential biomarkers of prostate cancer are characterized by lower disorder levels than an average human protein.

Finding GO terms significantly associated with intrinsically disordered prostate cancer-related proteins in the KEGG and Russian datasets

GO terms are based on three structured ontologies that are designed for consistent functional descriptions of proteins in a species-independent manner. These terms show the relations of the query proteins to the biological processes they are involved in, the molecular functions they perform, and the cellular components where they can be found. Because GO terms are specifically designed as general terms for the functional classification of proteins, they are applicable to any annotated protein (not only to proteins found in various pathologies but also to normally functioning proteins). However, GO terms can also be used to find a correlation between protein intrinsic disorder and functionality. In fact, it has been shown in several previous studies that some GO terms are preferentially associated with IDPs, whereas other GO terms are more suitable for the characterization of ordered proteins. Our analysis was conducted to show that many of the GO terms ascribed to the IDPs associated with prostate cancer describe disorder-related functions. This observation reemphasizes the importance of intrinsic disorder for these proteins. When the GO term analysis was performed using AmiGo2,51 no noticeable number of disorder-associated GO terms was produced with the P value threshold of less than 0.05. Since we wanted to see which functions can be assigned to the intrinsically disordered proteins related to prostate cancer, we decided to use a looser P value threshold of less than 0.1 to find disorder-associated GO terms. One should keep in mind that although the use of the looser threshold (P < 0.1) does not provide statistically significant data, the corresponding analysis reveals a statistical trend that can be used to look for weakly significant correlations between intrinsic disorder and function.
Among those GO terms with a tentatively significant correlation to intrinsic disorder, four process-oriented GO terms, GO:0006915 (apoptotic process), GO:0008219 (cell death), GO:0012501 (programmed cell death), and GO:0016265 (death), were related to programmed cell death according to REVIGO.52 Moreover, those four chosen GO terms produced a simple map in the AmiGo2 visualization tool.

Supplementary material: Prostate cancer-related proteins from the KEGG and Russian datasets with disorder-associated common Gene Ontology terms.
Supplementary material: Functions of intrinsically disordered proteins related to prostate cancer. (a) REVIGO analysis of the cellular process GO terms with P < 0.1. (b) AmiGo2 result for all GO terms with P value less than 0.1. Highlighted are the programmed cell death-related cellular process GO terms.
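The selection of common disorder-associated GO terms described in this section amounts to thresholding the per-term P values in each dataset and intersecting the two resulting sets. The Python sketch below is a simplified illustration of that logic, not the authors' workflow (which went through the AmiGo2 maps): the function names are hypothetical, and the P values in the example are made up.

```python
def significant_terms(p_values: dict[str, float], threshold: float = 0.1) -> set[str]:
    """GO IDs whose Fisher's exact test P value is below the threshold."""
    return {go_id for go_id, p in p_values.items() if p < threshold}


def common_terms(
    kegg_p: dict[str, float], russian_p: dict[str, float], threshold: float = 0.1
) -> set[str]:
    """GO IDs that pass the threshold in both datasets."""
    return significant_terms(kegg_p, threshold) & significant_terms(russian_p, threshold)


# Made-up P values, for illustration only.
kegg_p = {"GO:0006915": 0.04, "GO:0008219": 0.08, "GO:0012501": 0.30}
russian_p = {"GO:0006915": 0.09, "GO:0008219": 0.20, "GO:0012501": 0.02}
shared = common_terms(kegg_p, russian_p)
```

In this toy example only GO:0006915 passes the P < 0.1 threshold in both datasets. Tightening `threshold` to 0.05 leaves no shared term, which mirrors why the looser threshold was adopted in the text.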

Bioinformatics analysis of the prostate cancer-related proteins in the KEGG dataset

To gain information on the disorder status of prostate cancer-related proteins in the KEGG dataset and on the potential functional roles of their predicted IDPRs, we analyzed them using a set of disorder predictors of the PONDR family (PONDR-FIT, PONDR® VSL2, PONDR® VL3, and PONDR® VLXT), the binary disorder predictor CH-plot, the consensus disorder-evaluating internet tools MobiDB and D2P2, the STRING platform for finding potential interaction partners of a protein of interest, and a tool for predicting disorder-based binding sites (ANCHOR). Some of these results are summarized below.

Figure 3

3D representation of the results of the evaluation of disorder levels in human prostate cancer-related proteins from the extended KEGG dataset. Here, the percentages of residues in these proteins predicted to be disordered by PONDR-FIT and PONDR® VSL2 are compared with the percentages of disordered residues predicted by the MobiDB platform, which aggregates the output from ten disorder predictors. The goal of this plot is to show the overall agreement between the outputs of the different disorder predictors used in this study.

Figure 3 illustrates that the disorder predictions generated for these KEGG proteins by PONDR-FIT, PONDR® VSL2, and MobiDB, in the form of per-protein disorder scores (DSs) or the percentage of disordered residues (per-protein content of disordered residues, CDRs, for PONDR-FIT and MobiDB), generally agree. Furthermore, it is clearly seen that many proteins in this dataset are predicted to be moderately or highly disordered when proteins are classified as highly ordered, moderately disordered, or highly disordered if their CDR <10%, 10% ≤ CDR <30%, and CDR ≥30%, respectively.77 In fact, according to these criteria, the KEGG dataset includes 34 (38.6%), 38 (43.2%), and 16 (18.2%) highly disordered, moderately disordered, and highly ordered proteins, respectively. This suggests that almost 82% of the prostate cancer-related proteins in this dataset are very noticeably disordered. Although the disorder propensities of these proteins are spread over a wide range (e.g., by PONDR-FIT, from 3% of disordered residues in PI3K-α to 100% of such residues in BAD), the vast majority of them possess sizable IDPRs containing at least 10 consecutive residues predicted to be disordered, with many of these proteins having several such regions.
In fact, only eleven proteins from the KEGG dataset were shown not to have such regions, and the remaining 77 proteins possessed 293 IDPRs (i.e., on average, each of these 77 proteins is expected to contain 3.81 such regions). Often, IDPRs contain local regions with a strong tendency to become ordered upon interaction with specific binding partners. Therefore, these regions might undergo coupled folding and binding, as shown for many of them by NMR studies.78–83 Furthermore, such short local segments of order located within long disordered regions were shown to often coincide with potential binding sites.84 Therefore, a number of computational tools for finding such molecular recognition features (MoRFs) were developed (e.g., a tool for predicting short binding regions with high α-helix-forming propensity, α-MoRFs,85,86 or the more general ANCHOR algorithm for finding potential disorder-based binding sites,57,58 which are termed below AIBSs for ANCHOR-identified binding sites). Curiously, in earlier studies, a systematic application of such computational tools indicated that α-MoRFs are likely to play important roles in protein-protein interactions involved in signaling events.85 Our analysis revealed that the majority of proteins in the KEGG dataset (71 of 88) are predicted to have at least one AIBS, and that many of these proteins are expected to have multiple such binding regions. Furthermore, AIBSs are found in almost every KEGG protein that has at least one disordered region, and many of these proteins are shown to contain multiple disorder-based binding sites, with 34 and 30 AIBSs being found in CREB-binding protein (Q92793) and histone acetyltransferase p300 (Q09472), respectively. In fact, we found 472 AIBSs in 71 human proteins associated with prostate cancer, suggesting that on average, each of these proteins contains >6.6 disorder-based binding sites.
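The CDR-based classification and the IDPR criterion (at least 10 consecutive residues predicted to be disordered) used above can be illustrated with a minimal Python sketch. The function names and the per-residue cutoff of 0.5 are assumptions made for illustration only; the text does not state the cutoff applied to the predictor output, and the score profile below is a toy example rather than data from this study.

```python
from typing import List, Tuple


def cdr_percent(scores: List[float], cutoff: float = 0.5) -> float:
    """Content of disordered residues (CDR), in percent.

    `cutoff` is an assumed per-residue threshold, not taken from the text.
    """
    return 100.0 * sum(s >= cutoff for s in scores) / len(scores)


def classify_by_cdr(cdr: float) -> str:
    """Apply the thresholds quoted in the text: <10%, 10% to <30%, >=30%."""
    if cdr < 10.0:
        return "highly ordered"
    if cdr < 30.0:
        return "moderately disordered"
    return "highly disordered"


def find_idprs(scores: List[float], cutoff: float = 0.5,
               min_len: int = 10) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) positions of every run of at
    least `min_len` consecutive residues scoring at or above `cutoff`."""
    regions = []
    start = None  # 0-based index where the current disordered run began
    for i, s in enumerate(scores):
        if s >= cutoff:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                regions.append((start + 1, i))
            start = None
    # Close a run that extends to the C-terminus.
    if start is not None and len(scores) - start >= min_len:
        regions.append((start + 1, len(scores)))
    return regions


if __name__ == "__main__":
    # Toy profile: a 12-residue disordered run and a 4-residue one.
    toy = [0.1] * 5 + [0.9] * 12 + [0.2] * 3 + [0.8] * 4
    print(find_idprs(toy))
    print(classify_by_cdr(cdr_percent(toy)))
    # Averages quoted in the text: 293 IDPRs / 77 proteins, 472 AIBSs / 71 proteins.
    print(round(293 / 77, 2), round(472 / 71, 2))
```

Short disordered stretches (such as the 4-residue run in the toy profile) still count toward the CDR but are not reported as IDPRs, which is why a protein can have a non-zero CDR and no IDPR.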
The data also show that the length of AIBSs ranges from 6 to 150 residues, and the overall content of residues involved in disorder-based interactions ranges from 0% to 70.8%. The presence of more than one AIBS in a protein suggests that many prostate cancer-related proteins in the KEGG dataset commonly utilize disorder for their interactions with binding partners, and that these proteins are involved either in polyvalent interactions, using multiple binding sites to interact with one binding partner, or in scaffolding-like interactions, using multiple binding sites to interact with multiple binding partners. The wide spread of lengths of the identified AIBSs also suggests the presence of multiple disorder-based binding mechanisms (ranging from local folding-on-binding of short regions, to a wrapping-around binding mode, to global binding-induced folding of large regions). We established that many of the proteins in the KEGG dataset are predicted to have noticeable amounts of intrinsic disorder. Furthermore, this analysis revealed that 14 of these proteins (P05019, P14625, P36402, P46527, Q00987, Q92934, Q99801, Q9UJU2, P38936, P07900, Q92569, P08238, Q9Y6K9, and Q02930) are expected to be disordered as a whole according to the CH-plot, i.e., are located above the boundary line separating compact and extended disordered proteins, and 16 more of these proteins, being located below this boundary, are found in its proximity (P04085, P04637, P25963, P60484, Q09472, Q92793, Q9NQB0, P62993, Q9Y243, P01127, Q9GZP0, P27986, O00459, Q9HCS4, Q96BA8, and P18848).
Analogous analysis using another binary predictor, the cumulative distribution function (CDF) plot (where proteins expected to be disordered or ordered as a whole are identified based on the position of their CDF curve relative to the boundary separating mostly ordered and disordered proteins), revealed that 39 proteins are expected to be mostly disordered (O00716, P04085, P04637, P05019, P06400, P10275, P10415, P16220, P25963, P36402, P46527, P49841, Q00987, Q01094, Q07889, Q09472, Q12778, Q14209, Q92793, Q92934, Q99801, Q9NQB0, Q8WYR1, P38936, O43889, Q9UJU2, P01127, O00459, P15056, Q07890, Q04206, Q9Y6K9, Q9HCS4, Q02930, Q8TEY5, Q68CJ9, Q70SY1, Q96BA8, and P18848) since their CDF curves are mostly located below the boundary, whereas the CDF curves of 11 additional proteins (O14920, P01308, P10398, P14625, P24864, P36507, Q02750, P07900, O15530, P27986, and Q92569) follow the boundary almost exactly, suggesting that these proteins are definitely not ordered as a whole, as their overall disorder status is “undecided”. It was pointed out that combined analysis of protein disorder status using the CH-plot and CDF analysis simultaneously can provide additional important information on the classification of protein disorder. This combined approach is known as the CH-CDF analysis,87–89 and it is based on a principal difference between the sensitivity of the CH-plot and the CDF analysis to different types of disorder. Here, the CH-plot can discriminate proteins with a substantial amount of extended disorder (random coils and pre-molten globules) from proteins with compact conformations (molten globule-like and rigid well-structured proteins), whereas the CDF analysis may discriminate all disordered conformations, including molten globules and mixed proteins containing both disordered and ordered regions, from rigid well-folded proteins.
Therefore, the CH-CDF analysis represents a computational tool to discriminate proteins with extended disorder from potential molten globules and mixed proteins containing comparable amounts of ordered and disordered regions.87–89 Based on the combination of outputs of their CH-plot and CDF analyses, proteins can be classified as follows: proteins predicted to be disordered by CH-plots, but ordered by CDF; ordered proteins (i.e., proteins predicted as ordered by both tools); putative molten globules or mixed proteins (i.e., proteins predicted to be disordered by CDF, but compact by CH-plot); and proteins with extended disorder (i.e., proteins predicted to be disordered by both methods).87–89 Based on this classification, there are 13 prostate cancer-associated proteins with extended disorder (P05019, P14625, P36402, P46527, Q00987, Q92934, Q99801, Q9UJU2, P38936, P07900, Q92569, Q9Y6K9, and Q02930) and there are at least 29 putative native molten globules or mixed proteins containing comparable amounts of ordered and disordered regions (O00716, P04085, P04637, P06400, P10275, P10415, P16220, P25963, P49841, Q01094, Q07889, Q09472, Q12778, Q14209, Q92793, Q9NQB0, Q8WYR1, O43889, P01127, O00459, P15056, Q07890, Q04206, Q9HCS4, Q8TEY5, Q68CJ9, Q70SY1, Q96BA8, and P18848). Furthermore, our analysis not only indicated the disorder status of these 88 proteins but also showed that disorder is crucial for the functionality of many of the prostate cancer-related proteins. This conclusion follows from the analysis of the data generated for each protein in the KEGG dataset by D2P2, which, in a visually attractive form, provides access to the precomputed disorder predictions59 generated by PONDR® VLXT,42 IUPred,61 PONDR® VSL2B,46,65 PrDOS,67 ESpritz,62 and PV2,59 and also provides information on the curated sites of various posttranslational modifications and on the location of predicted potential disorder-based binding sites.
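The four-way CH-CDF classification described above amounts to combining two binary calls per protein. The following Python sketch shows only that combination step; it assumes the per-protein CH-plot and CDF calls have already been made (how borderline cases are resolved is not modelled), and the function name, class labels, and toy identifiers are hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple


def ch_cdf_class(ch_disordered: bool, cdf_disordered: bool) -> str:
    """Combine a CH-plot call and a CDF call into one of four classes."""
    if ch_disordered and cdf_disordered:
        return "extended disorder"
    if cdf_disordered:
        # Disordered by CDF, but compact by CH-plot.
        return "molten globule or mixed"
    if ch_disordered:
        return "disordered by CH-plot, ordered by CDF"
    return "ordered"


if __name__ == "__main__":
    # Toy calls for hypothetical proteins: (CH-plot disordered?, CDF disordered?)
    calls: Dict[str, Tuple[bool, bool]] = {
        "prot_A": (True, True),
        "prot_B": (False, True),
        "prot_C": (False, False),
        "prot_D": (True, False),
    }
    classes = {name: ch_cdf_class(ch, cdf) for name, (ch, cdf) in calls.items()}
    for name, label in classes.items():
        print(name, "->", label)
    print(Counter(classes.values()))
```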
Our analysis clearly shows that many human prostate cancer-related proteins from the KEGG dataset are predicted to have disordered regions of various lengths, often possess numerous potential disorder-based binding motifs, and contain multiple sites of various posttranslational modifications (PTMs). The finding that the IDPRs of these proteins have a multitude of PTMs is in agreement with the well-known fact that phosphorylation23 and many other enzymatically catalyzed PTMs are preferentially located within IDPRs.23–26 The interactivity of prostate cancer-associated proteins from the KEGG dataset was further evaluated by the STRING computational platform, which provides information on both experimentally validated and predicted interactions of query proteins.60 The corresponding STRING-generated protein-protein interaction (PPI) networks of these proteins (data not shown) indicate that all KEGG proteins are predicted to have very well-developed interactomes. This observation suggests that these proteins can serve as hubs in their functional PPI networks. These findings are in accord with earlier observations that intrinsic disorder plays an important role in the functionality of hubs, defining their ability to be promiscuous binders engaged in interactions with a multitude of often unrelated partners.90–96 Furthermore, previous studies showed that many hubs are intrinsically disordered or contain functional IDPRs, and that the partners of ordered hubs are preferentially intrinsically disordered.90–96 It is likely that this binding promiscuity of the prostate cancer-related proteins can be at least partially attributed to the fact that many of them have numerous AIBSs. Also, this astonishing capability of mostly ordered, hybrid, and mostly disordered proteins from the KEGG dataset to be heavily connected hubs represents a major hurdle for the development of drugs targeting these proteins.

DISCUSSION

Interaction network of prostate cancer-related proteins

Results of the application of the APID server for evaluation of the interactivity of the 87 prostate cancer-related proteins from the KEGG dataset (as mentioned, no protein-protein interaction-related information was available for the 3-oxo-5-alpha-steroid 4-dehydrogenase 2 [SRD5A2, UniProt ID: P31213]) clearly showed that each of these proteins is known to be engaged in multiple protein-protein interactions (PPIs). The number of PPIs ranges from 4 (TCF7L1 and INSRR) to >1,000 (e.g., p53, a 393-residue-long protein for which the content of disordered residues predicted by PONDR-FIT (CDRFIT) is 54.96%, and EGFR, with a length of 1021 residues and CDRFIT of 9.1%, interact with 1072 and 1031 partners, respectively). In fact, 81 proteins in this dataset are able to interact with more than 10 partners each. This observation suggests that the vast majority of the prostate cancer-related proteins from the KEGG dataset can be considered hub proteins. Figure 4 shows that the interactivity of these proteins is not correlated with their length (Figure 4a) or their intrinsic disorder content (Figure 4b), suggesting that these two features do not directly determine the ability of a prostate cancer-related protein to be involved in many PPIs and to serve as a hub. This lack of correlation between the interactability of a given protein and its disorder content seems to contradict the claim that the ability of a protein to be engaged in many PPIs relies on intrinsic disorder.
However, it is known that some hub proteins can be entirely disordered, other hubs may contain both ordered and disordered regions, and still other hubs can be highly structured.97 Moreover, the binding regions of the partner proteins of ordered hubs were found to be intrinsically disordered.98,99 These observations suggested two primary mechanisms by which disorder is utilized in protein-protein interaction networks, namely, one disordered region binding to many partners and many disordered regions binding to one partner.91–94,97,100,101
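In this analysis, interactability is evaluated as the number of PPIs per 100 residues and is compared with the PONDR-FIT-based content of disordered residues. The Python sketch below illustrates that normalization using the two proteins for which the text quotes all three values (p53 and EGFR), together with a plain Pearson correlation helper. The helper names are hypothetical, and the choice of Pearson's r is an assumption, since the text does not state which correlation measure was used.

```python
import math
from typing import Sequence


def ppis_per_100_residues(n_ppis: int, length: int) -> float:
    """Interactability: number of PPIs normalized per 100 residues."""
    return 100.0 * n_ppis / length


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient (assumed measure, for illustration)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long series with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Values quoted in the text: (name, length, number of partners, CDRFIT in %)
    quoted = [("p53", 393, 1072, 54.96), ("EGFR", 1021, 1031, 9.1)]
    for name, length, n_ppis, cdr in quoted:
        print(name, round(ppis_per_100_residues(n_ppis, length), 1), cdr)
```

With per-protein values for the full dataset, `pearson_r` could be applied to the interactability and CDR columns; two data points are of course not enough to draw any conclusion about correlation.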

Figure 4

Characterization of the interactability of prostate cancer-related proteins from the extended KEGG dataset based on the results of their analysis by the APID server. (a) Correlation between the number of PPIs found by APID for individual proteins and protein length. Note the logarithmic scale of this plot. (b) Correlation between interactability (evaluated as the number of PPIs per 100 residues of a given protein) and intrinsic disorder (measured as the PONDR-FIT-based content of predicted disordered residues in a query protein). Note the semi-logarithmic scale of this plot.

As follows from the brief description of several highly disordered proteins related to prostate cancer, these proteins are often found to interact with each other. To understand how common this phenomenon is, we used the ability of the APID web server (http://apid.dep.usal.es) to build a specific PPI network between proteins included in a query list.69 Figure 5 represents the results of the application of this tool to prostate cancer-related proteins from the KEGG dataset and shows that all proteins with known interactions are involved in the formation of a common interactive cluster, where each prostate cancer-related protein interacts with at least one other prostate cancer-related protein. Figure 5 shows the results of this analysis in the form of a grid, where each node corresponds to a protein from the KEGG dataset and PPIs are shown as edges, the thickness of which reflects the reliability of a given interaction. The resulting prostate cancer-related interactome clearly shows that almost all proteins currently known to be related to the pathogenesis of this disease are interconnected. Therefore, both internal connectivity (interactions with other prostate cancer-related proteins) and external connectivity (interactions with other proteins) are high for many prostate cancer-related proteins.

Figure 5

Evaluation of the inter-set interactivity of prostate cancer-related proteins from the KEGG dataset using the APID web server (http://apid.dep.usal.es). This tool builds a PPI network between proteins included in a query list. Results are shown in a form of a grid, where each node corresponds to a protein from the KEGG dataset, and where PPIs are shown as corresponding edges, thickness of which reflects the reliability of a given interaction.
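The two properties reported for this query-list network (every protein has at least one partner within the set, and all proteins form one common cluster) can be checked on any edge list with a short graph traversal. The Python sketch below is a generic illustration, not the APID implementation; the function names, protein labels, and interactions are placeholders invented for the example.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def build_adjacency(query: Iterable[str],
                    interactions: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Keep only interactions in which both partners belong to the query list."""
    adj: Dict[str, Set[str]] = {p: set() for p in query}
    for a, b in interactions:
        if a in adj and b in adj and a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def proteins_without_partner(adj: Dict[str, Set[str]]) -> List[str]:
    """Query proteins with no interaction partner inside the query list."""
    return sorted(p for p, partners in adj.items() if not partners)


def is_single_cluster(adj: Dict[str, Set[str]]) -> bool:
    """True if every query protein is reachable from every other one."""
    if not adj:
        return True
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first traversal from an arbitrary protein
        for neighbour in adj[queue.popleft()]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(adj)


if __name__ == "__main__":
    query = ["A", "B", "C", "D"]                      # placeholder protein labels
    edges = [("A", "B"), ("B", "C"), ("C", "X"), ("D", "Y")]
    adj = build_adjacency(query, edges)
    print(proteins_without_partner(adj), is_single_cluster(adj))
```

Interactions with proteins outside the query list (here `X` and `Y`) are ignored when building the network, which corresponds to the distinction between internal and external connectivity made in the text.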

Functionality of intrinsic disorder in a major player, androgen receptor

Androgen receptor (AR, CDRFIT is 42.7%) is one of the steroid hormone receptors. AR is a 920-residue-long ligand-activated transcription factor controlling expression of various eukaryotic genes and affecting proliferation and differentiation of cells in target tissues. AR plays a key role in the development and progression of prostate cancer and contributes to the development of resistance to androgen deprivation, leading to the formation of the castration-resistant form of prostate cancer.102 There are several functional regions/domains in AR. For example, most of the AR transcriptional activity is due to the N-terminal transcriptional activation domain (NTD), whereas interaction with DNA is attributed to the central DNA-binding domain, which contains 2 zinc-finger motifs. Nuclear localization upon activation is driven by a short, flexible hinge located after the DNA-binding domain. Finally, interaction of AR with ligands is conducted through its C-terminal ligand-binding domain (LBD). Structural information is available for the LBD (residues 659–920) and the N-terminal peptide, AR20–30.103 Analysis of the solution structure of the AR NTD (residues 1–537) by circular dichroism revealed that this domain has a relatively limited amount of stable secondary structure.104 The lack of stable structure in the NTD is further supported by the results of the evaluation of disorder predisposition in this protein. As is typical for IDPs and hybrid proteins containing ordered domains and IDPRs, AR has several regions with compositional biases, such as a Gln-rich region (residues 58–120), poly-Gln regions (residues 58–80, 86–91, and 195–199), a poly-Pro region (residues 374–383), and poly-Ala and poly-Gly regions (residues 398–404 and 451–473, respectively).
There are four alternatively spliced isoforms of this protein, where in comparison with the canonical form, isoform 2 lacks residues 1–532 and has a GPYGDMR → MILWLHS substitution at positions 533–539; isoform 3 has a substitution ARKLKKLGNLKLQEEG → EKFRVGNCKHLKMTRP (residues 629–644) and lacks region 645–920; finally, in isoform 4, residues 649–920 are missing and region 630–648 has a RKLKKLGNLKLQEEGEASS → AVVVSERILRVFGVSEWLP substitution. As follows from our analysis, the majority of regions affected by alternative splicing are predicted to be disordered.

CONCLUSIONS

This study shows that intrinsically disordered proteins and proteins with long disordered regions are commonly found in prostate cancer. Many of these proteins are predicted to be moderately or highly disordered using either the per-protein disorder score (DS) for classification of proteins as highly ordered (DS <0.1), moderately disordered (0.1 ≤ DS <0.3), and highly disordered (DS ≥0.3) or the per-protein content of disordered residues (CDR), with proteins being classified as highly ordered, moderately disordered, or highly disordered if their CDR <10%, 10% ≤ CDR <30%, and CDR ≥30%, respectively. Functions of these proteins are regulated by various posttranslational modifications. Furthermore, many of these proteins are promiscuous binders and contain numerous disorder-based binding sites. We also show that irrespective of their disorder status (i.e., irrespective of being mostly ordered, mixed, or mostly disordered), these proteins are characterized by an astonishing capability to be heavily connected hubs involved in a broad range of interactions. This binding promiscuity might represent a major hurdle for the development of drugs targeting these proteins. Although our analysis revealed that potential biomarkers of prostate cancer are characterized by lower disorder levels than an average human protein, in our view, there is no contradiction between these observations and our discussion of the fact that IDPs and proteins with long disordered regions are commonly found in prostate cancer, where they might play a number of important roles. In fact, what we are showing is that intrinsically disordered regions are commonly present among prostate cancer-related proteins. Furthermore, the fact that disorder might be more commonly found in the whole human proteome than in a set of proteins related to prostate cancer does not mean that IDPs are not important for the progression of this disease.
In addition to the information on how many IDPs are associated with prostate cancer, a very important consideration is who these cancer-related IDPs are and what they do. For example, AR, PTEN, and p53 are known as major players in prostate cancer development (and, as a matter of fact, PTEN and p53 are involved in the development of many other pathological conditions). Loss of function of these three proteins (all of which have long disordered regions) contributes greatly to disease progression. Another example is given by NKX3.1, which is also a very disordered protein, whose loss is associated with the prostate cancer development. Therefore, we think that our data make an important contribution to the field by bringing attention to IDPs potentially related to prostate cancer.

AUTHOR CONTRIBUTIONS

KSL collected datasets, conducted computational analysis, participated in data analysis, and drafted the manuscript. IN participated in collecting datasets, conducted computational analysis, participated in data analysis, conducted statistical analysis, and drafted the manuscript. ROS participated in data analysis and contributed to drafting the manuscript. VNU conceived the idea of the study, designed and coordinated experiments, analyzed data, conducted computational analysis, and drafted the manuscript.

COMPETING FINANCIAL INTERESTS

The authors declared lack of any competing financial interest. Supplementary information is linked to the online version of the paper on the Asian Journal of Andrology website.

103 in total

1. Transient structure of the amyloid precursor protein cytoplasmic tail indicates preordering of structure for binding to cytosolic factors.

Authors: T A Ramelot; L N Gentile; L K Nicholson
Journal: Biochemistry Date: 2000-03-14 Impact factor: 3.162

Review 2. Protein folding revisited. A polypeptide chain at the folding-misfolding-nonfolding cross-roads: which way to go?

Authors: V N Uversky
Journal: Cell Mol Life Sci Date: 2003-09 Impact factor: 9.261

Review 3. Flexible nets. The roles of intrinsic disorder in protein interaction networks.

Authors: A Keith Dunker; Marc S Cortese; Pedro Romero; Lilia M Iakoucheva; Vladimir N Uversky
Journal: FEBS J Date: 2005-10 Impact factor: 5.542

Review 4. Structural disorder throws new light on moonlighting.

Authors: Peter Tompa; Csilla Szász; László Buday
Journal: Trends Biochem Sci Date: 2005-09 Impact factor: 13.807

5. A majority of the cancer/testis antigens are intrinsically disordered proteins.

Authors: Krithika Rajagopalan; Steven M Mooney; Nehal Parekh; Robert H Getzenberg; Prakash Kulkarni
Journal: J Cell Biochem Date: 2011-11 Impact factor: 4.429

6. Proteomic and bioinformatic analysis of a nuclear intrinsically disordered proteome.

Authors: Bozena Skupien-Rabian; Urszula Jankowska; Bianka Swiderska; Sylwia Lukasiewicz; Damian Ryszawy; Marta Dziedzicka-Wasylewska; Sylwia Kedracka-Krok
Journal: J Proteomics Date: 2015-09-12 Impact factor: 4.044

7. Consequences of poly-glutamine repeat length for the conformation and folding of the androgen receptor amino-terminal domain.

Authors: Philippa Davies; Kate Watt; Sharon M Kelly; Caroline Clark; Nicholas C Price; Iain J McEwan
Journal: J Mol Endocrinol Date: 2008-09-01 Impact factor: 5.098