Prediction of protein binding regions in disordered proteins.

Bálint Mészáros¹, István Simon, Zsuzsanna Dosztányi.

Abstract

Many disordered proteins function via binding to a structured partner and undergo a disorder-to-order transition. The coupled folding and binding can confer several functional advantages such as the precise control of binding specificity without increased affinity. Additionally, the inherent flexibility allows the binding site to adopt various conformations and to bind to multiple partners. These features explain the prevalence of such binding elements in signaling and regulatory processes. In this work, we report ANCHOR, a method for the prediction of disordered binding regions. ANCHOR relies on the pairwise energy estimation approach that is the basis of IUPred, a previous general disorder prediction method. In order to predict disordered binding regions, we seek to identify segments that are in disordered regions, cannot form enough favorable intrachain interactions to fold on their own, and are likely to gain stabilizing energy by interacting with a globular protein partner. The performance of ANCHOR was found to be largely independent from the amino acid composition and adopted secondary structure. Longer binding sites generally were predicted to be segmented, in agreement with available experimentally characterized examples. Scanning several hundred proteomes showed that the occurrence of disordered binding sites increased with the complexity of the organisms even compared to disordered regions in general. Furthermore, the length distribution of binding sites was different from disordered protein regions in general and was dominated by shorter segments. These results underline the importance of disordered proteins and protein segments in establishing new binding regions. Due to their specific biophysical properties, disordered binding sites generally carry a robust sequence signal, and this signal is efficiently captured by our method. Through its generality, ANCHOR opens new ways to study the essential functional sites of disordered proteins.

Year: 2009 PMID: 19412530 PMCID: PMC2671142 DOI: 10.1371/journal.pcbi.1000376

Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475

Introduction

The classical view of protein function holds that a protein requires a well-defined three-dimensional structure in order to function. However, as experimental evidence against the generality of this concept accumulated, this paradigm had to be reassessed [1]. It has become evident that a large number of proteins do not require a stable structure, even under physiological conditions, in order to fulfill their biological role [2]–[4]. These intrinsically unstructured/disordered proteins (IUPs/IDPs) lack a well-defined tertiary structure and exhibit a multitude of conformations that change dynamically over time and population. The importance of protein disorder is underlined by the abundance of partially or fully disordered proteins encoded in higher eukaryotic genomes [5],[6]. Disordered proteins are involved in many important biological functions [2],[7], which complement the functional repertoire of globular proteins [7].

Recent characterization of IUPs based on their functions shows that disorder can help these proteins to fulfill their functions in various ways [8],[9]. In the case of entropic chains, the biological function is directly mediated by disorder (e.g. MAP2 projection domain [10], titin's PEVK domain [11], NF-M and NF-H between neurofilaments [12],[13], nucleoporin complex [14]). Furthermore, disordered segments often act as flexible linkers between folded domains in multidomain proteins [2],[15]. Alternatively, many disordered proteins function by binding specifically to other proteins, DNA or RNA. This process, termed coupled folding and binding, involves a transition from a disordered state to a more ordered state with stable secondary and tertiary structural elements [16],[17]. The coupled folding and binding confers several functional advantages in certain types of molecular interactions.
Since – at least partial – folding happens together with binding, the entropic penalty counterbalances the enthalpy gain coming from the binding [18],[19]. This way disorder uncouples specificity from binding strength allowing for weak transient, still specific interactions that are essential for signaling processes. These properties enable disordered proteins to play an important role in molecular recognition including gene regulation, cell cycle control and other key cellular processes [20]–[23]. The kinetic and thermodynamic details of the binding are influenced by conformational preferences present prior to binding [24]. Although disordered proteins in general lack secondary and tertiary structure, some exhibit partial secondary structure at closer inspection. For example, CD analysis indicated that p21 and p27 possess α-helical segments [19],[25],[26]. Detailed NMR characterization of p27 and other proteins showed that several segments can have a pronounced tendency to adopt α-helical, or even β strand conformations [9]. Upon binding, these inherent structural preferences can either be solidified or overwritten by the partner molecule [27]. Some regions can preserve flexibility even within the complex, mitigating the unfavorable entropy term [28]. This allows the fine-tuning of the affinity of interactions over a wide range. As a general rule, however, these interactions are driven largely enthalpically by the favorable interactions formed with the partner molecule [18],[19],[29]. The inherent flexibility of disordered proteins offers further advantages in binding. It results in a malleable interface that can allow binding to several partners or to adopt different conformations, manifested in increased binding capability [8],[20]. In accordance, several analyses of protein interaction networks revealed that disordered proteins are abundant among hub proteins, proteins with a large number of interacting partners [30],[31]. 
In a different scenario, the binding partners of an ordered protein are disordered, as shown for the binding of 14-3-3 proteins, thus allowing a single protein to bind multiple partners [32]. Besides their involvement in protein-protein interactions, these proteins are also subject to various post-translational modifications that control their functions, localization and turnover [33]. In this way, these proteins can integrate and mediate multiple signals from various sources, and act as the central elements in signaling or regulatory networks. The centrality of these proteins, however, is also their weakness. It has been suggested that the targeted attack of hubs can cause serious disruption in protein interaction networks [34]. Furthermore, disordered proteins are often associated with various diseases [35]. For example, the primary importance of p53 originates from its involvement in 50% of cancers [36]. In general, 79% of human cancer associated proteins have been classified as IUPs, compared to 47% of all eukaryotic proteins in the SwissProt database [22]. Disordered proteins were also suggested to be common in diabetes and cardiovascular diseases [35],[37]. Several disordered proteins, such as Aβ, τ, α synuclein, and prion protein, are involved in neurodegenerative diseases and are also prone to amyloid formation [38]–[40]. On the other hand, due to their specific way of interacting, disordered proteins can also be attractive targets for drug discovery. A novel strategy for drug discovery exploiting binding sites within disordered regions has already been suggested [41]. This adds further support to the importance of finding specific functional sites in proteins that undergo a disorder-to-order transition upon binding, or disordered binding regions for short.

Despite their importance, the number of well characterized examples of disordered proteins undergoing a disorder-to-order transition is very small.
The PDB also offers only a limited sample of proteins adopting a well defined conformation as part of a complex. However, recent comparisons of these structures with complexes formed between ordered proteins pointed out several differences [42]–[44]. In general, disordered proteins adopted a largely extended conformation in the complex exposing the majority of their residues for interacting with their partner. The interface of disordered proteins was enriched in hydrophobic residues compared to the interface of ordered proteins, but also to disordered regions in general. The higher number of interchain contacts was suggested to be a sign of better adaptation of disordered proteins to the surface of their partner. In general, the regions that become ordered were shorter as compared to globular domains, usually less than 30–40 residues. While the interface of globular proteins was most often formed by distant segments of the amino acid sequence brought together by folding, disordered binding sites were much more localized in the primary structure. These features demonstrate that the underlying principles of molecular recognition of disordered binding regions are different from the complex formation of globular proteins [43]. Disordered binding sites are also expected to be distinguishable from general disordered sites that are not directly involved in binding. A common notion is that protein disorder comes in many flavors, and these should be targeted by specific prediction methods [45],[46]. However, training specific methods would require significantly larger datasets than those that are available today. Nevertheless, existing general protein disorder prediction methods might already be equipped for this problem. It has been suggested that specific patterns of disorder prediction profiles can be associated with regions undergoing disorder-to-order transitions [47]. 
Since these regions can be ordered as well as disordered, there is no clear recipe for whether these regions should be predicted as ordered, disordered, or as borderline cases. A recent analysis compared several methods to recognize short protein-protein interaction motifs containing α-helical elements in their bound state, the so-called α-MoRFs [48]. As expected, the various methods showed large variations in the predicted order/disorder tendency corresponding to binding regions. One of the earliest prediction methods, PONDR VL-XT [49]–[51], was quite consistent in predicting these regions as ordered within a broader disordered region, giving them the characteristic appearance of dips in the prediction output. Based on this specific prediction output, a method was developed to recognize α-MoRFs from the amino acid sequence [48],[52]. First, regions predicted with dips in the output of VL-XT were selected, and these were then filtered further by a neural network using several additional properties. This prediction method is restricted to recognizing short, α-helical binding regions within disordered proteins.

Here we present a general method to identify specific binding regions undergoing a disorder-to-order transition. Our method relies on the general disorder prediction method IUPred [53],[54]. IUPred is based on the assumption that disordered proteins have a specific amino acid composition that does not allow the formation of a stable well-defined structure. The method utilizes statistical potentials that can be used to calculate the pairwise interaction energy from known coordinates. Using a dataset of globular proteins only, a method was developed to estimate the pairwise interaction energy of proteins directly from the amino acid sequence. With this algorithm, residues with unfavorable estimated pairwise energies are predicted to be disordered.
The estimation of the energy for each residue is based on its amino acid type, and the amino acid composition of its sequential neighborhood. Through the amino acid composition of the sequential environment, IUPred can take into account that the disorder tendency of residues can be modulated by their environment [53]. This property of IUPred is exploited in order to recognize regions that are most likely to undergo a disorder-to-order transition based on their estimated pairwise energies in different contexts. The prediction of binding sites is based on estimating the energy content in free and in the bound states, and identifying segments that are potentially sensitive to these changes. In a previous work, the ability to predict specific contacts was emphasized in order to recognize disordered regions that are involved in binding externally rather than internally [46]. In our model, however, there was no attempt made to model specific interactions. Instead, the environment is taken into account simply at the level of amino acid composition. Here we show that this simple model captures the essential property of disordered binding regions and allows their robust prediction. We termed our disordered binding site prediction method ANCHOR, to reflect the primary importance of short segments driving the complex formation between a disordered protein and its partner.

Results

The outline of the algorithm

The goal of the present work was to recognize a special class of disordered segments from the amino acid sequence, namely those that are capable of undergoing a disorder-to-order transition upon binding to a globular protein partner. The essential feature of such binding regions is that they behave in a characteristically different manner in isolation than when bound to their partner protein. In their free state, they behave as disordered proteins, existing as a highly flexible structural ensemble. In their bound state they usually adopt a rigid conformation, similar to regions within globular structures. This capability to behave in drastically different ways in different environments is targeted by our approach. We seek to identify segments in a generally disordered region that cannot form enough favorable intrachain interactions but can gain energetically by interacting with a globular partner protein.

Our prediction is based on three properties. The first criterion ensures that a given residue belongs to a long disordered region, and filters out globular domains. The second criterion corresponds to the isolated state: it ensures that a residue is not able to form enough favorable contacts with its own local sequential neighbors to fold, since otherwise it would be prone to adopt a well-defined structure on its own. The third criterion tests whether a given residue can form enough favorable interactions with globular proteins upon binding, that is, whether there is an energy gain from interacting with globular regions. These properties are estimated individually and are combined into a single predictor via optimized weights.

In more detail, the prediction of these three properties relies on the energy estimation framework implemented in IUPred, a general disorder prediction method. The core element of IUPred is the energy predictor matrix $P$.
The parameters in $P$ were trained only on globular proteins with known structures, without relying on any kind of disordered dataset. These parameters were determined so as to minimize the difference between the estimated energies and the energies calculated from the known structures on the dataset of globular proteins. Using the energy predictor matrix, IUPred by default predicts the interaction energy $e_k^{i}$ of each residue with the following formula:

$$e_k^{i} = \sum_{j=1}^{20} P_{ij}\, c_k^{j}(w) \qquad (1)$$

where $i$ denotes the type of the $k$-th amino acid, $P_{ij}$ is the element of the energy predictor matrix that estimates the pairwise energy of a residue of type $i$ in the presence of a residue of type $j$, and $c_k^{j}(w)$ is the fraction of residue type $j$ in the sequential environment within $w$ residues of residue $k$. The size of the neighborhood considered ($w$) equals 100 residues in both directions, and the result is smoothed over a window of 10 residues (also in both directions from the $k$-th residue, so 21 residues are considered in total). For the final prediction output, the energies are transformed into probability values, denoted as $s$. For more details see Dosztányi et al. [53].

The disordered binding site prediction is based on three different scores that are calculated with a slight modification of the original energy estimation scheme. The parameters of $P$ were taken directly from IUPred. The following three scores are assigned to each residue in a protein according to the criteria (1–3) described above:

1. To measure the tendency of the neighborhood of an amino acid to be disordered, we use the IUPred algorithm and assign a score $S_k$ to the $k$-th residue of the chain by averaging the IUPred scores in the $w_1$ neighborhood of the residue in question:

$$S_k = \frac{1}{N} \sum_{j=b_l}^{b_u} s_j \qquad (2)$$

where $s_j$ is the IUPred score of the $j$-th residue of the chain, $N$ is the number of amino acids in the averaging, and $b_l$ and $b_u$ are the lower and upper boundaries of the neighborhood of the $k$-th residue, that is $b_l = \max(k - w_1, 1)$ and $b_u = \min(k + w_1, l)$, where $l$ is the chain length.

2. We estimate the pairwise interaction energy the given residue may gain by forming intrachain contacts. This is done in exactly the same way as in IUPred using (1), except that the size of the considered neighborhood ($w_2$) is left as a parameter and is set during the training of the predictor:

$$E_{int,k} = \sum_{j=1}^{20} P_{ij}\, c_k^{j}(w_2) \qquad (3)$$

A smaller window size corresponds to more local behavior.

3. The pairwise energy that the residue may gain by interacting with a globular protein is approximated using the average amino acid composition of globular proteins:

$$E_{glob,k} = \sum_{j=1}^{20} P_{ij}\, c_{glob}^{j} \qquad (4)$$

where $c_{glob}^{j}$ is the fraction of residue type $j$ in the averaged reference amino acid composition of globular proteins shown in Table 1. By subtracting this energy from $E_{int,k}$ one can estimate the energy that the residue may gain by interacting with a hypothetical globular protein compared to forming intrachain contacts ($E_{gain,k} = E_{int,k} - E_{glob,k}$).
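The composition-based energy terms described above can be illustrated with a short sketch. It is not the IUPred or ANCHOR implementation: the predictor matrix here is a hypothetical placeholder (the identity matrix, chosen only so that results are easy to check by hand), the function names are invented for illustration, and including residue k itself in its own window is an assumption. The reference composition is the one listed in Table 1.

```python
# Sketch of the composition-based energy terms E_int, E_glob and E_gain.
AMINO_ACIDS = "RKDENQSGHTAPYVMCLFIW"

# Reference globular composition from Table 1, converted from percent to fractions.
GLOBULAR_FRACTION = dict(zip(AMINO_ACIDS, (x / 100 for x in (
    3.68, 6.37, 4.92, 5.43, 4.69, 3.86, 8.05, 8.46, 2.00, 6.35,
    7.67, 4.89, 3.86, 7.13, 1.84, 2.43, 8.22, 3.19, 5.20, 1.76))))

# HYPOTHETICAL placeholder matrix (identity), NOT the trained IUPred parameters.
TOY_P = {i: {j: (1.0 if i == j else 0.0) for j in AMINO_ACIDS} for i in AMINO_ACIDS}


def local_composition(seq, k, w):
    """Fraction of each residue type within w positions of 0-based index k."""
    window = seq[max(k - w, 0):min(k + w, len(seq) - 1) + 1]
    return {aa: window.count(aa) / len(window) for aa in AMINO_ACIDS}


def pair_energy(res_type, composition, P):
    """Sum over residue types j of P[i][j] times the fraction of type j."""
    return sum(P[res_type][j] * composition[j] for j in AMINO_ACIDS)


def energy_terms(seq, k, w2, P=TOY_P):
    """Return (E_int, E_glob, E_gain) for residue k, with E_gain = E_int - E_glob."""
    e_int = pair_energy(seq[k], local_composition(seq, k, w2), P)
    e_glob = pair_energy(seq[k], GLOBULAR_FRACTION, P)
    return e_int, e_glob, e_int - e_glob
```

With the identity placeholder, `energy_terms(seq, k, w2)` simply compares how common the residue's own type is in its local window with how common it is in the globular reference; a trained matrix would replace `TOY_P` without changing the structure of the calculation.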

Table 1

Reference amino acid composition of globular proteins.

AA	f (%)
R	3.68
K	6.37
D	4.92
E	5.43
N	4.69
Q	3.86
S	8.05
G	8.46
H	2.00
T	6.35
A	7.67
P	4.89
Y	3.86
V	7.13
M	1.84
C	2.43
L	8.22
F	3.19
I	5.20
W	1.76

Amino acid composition of the reference globular protein dataset, comprising all the amino acids in the longer chains of the ordered complexes dataset. Amino acids are sorted by increasing hydrophobicity based on the Fauchere-Pliska hydrophobicity scale [94]. AA denotes amino acid codes and f denotes the fraction of the respective amino acid expressed as a percentage.

The final prediction score of the residue is given by the linear combination of the above three terms:

$$I_k = p_1 S_k + p_2 E_{int,k} + p_3 E_{gain,k} \qquad (5)$$

where the $p_1$, $p_2$ and $p_3$ coefficients are determined during the training of the predictor together with the optimal values of the $w_1$ and $w_2$ window sizes. $I_k$ is then converted into a $p_k$ value that expresses the probability of that residue being in a disordered binding site. For a binary classification, residues with scores above 0.5 are predicted to be in a disordered binding site. Since the second and third terms of (5) may vary heavily between neighboring residues, the final score is smoothed in a window of 4 residues. The optimal values for the three weights ($p_1$, $p_2$ and $p_3$) and the two window sizes ($w_1$ and $w_2$) are determined using a dataset of disordered protein complexes and ordered monomeric proteins by three-fold cross validation (see Methods and Figure S1 for a schematic representation and outline of this procedure). The small dataset of known disordered proteins bound to ordered proteins represents a serious bottleneck during optimization. Therefore, it is a clear advantage of our approach that it greatly reduces the dependence on the existing dataset of disordered complexes, and leaves us with only 5 parameters to be optimized on this small dataset.

The behavior of the various scores is illustrated on the N-terminal domain (residues 1–100) of the human p53 tumor suppressor protein, which plays an important regulatory role [55]. Its N-terminal region is completely disordered [56] and is known to be able to bind to (at least) three different globular proteins, as shown in Figure 1.
The segment between residues 17–27 binds to MDM2 [57], the other two binding sites overlap with residues 33–56 binding to RPA 70N [58] and residues 45–58 binding to the B subunit of RNA polymerase II [59]. The three calculated quantities for this domain are also shown in Figure 1. It is worth noting that the MDM2 binding site in the N-terminal region of p53 appears to be on the border of being disordered. Although the disordered prediction is part of ANCHOR, the output of this prediction (E, described in Theory) is linearly combined with two other quantities meaning that predicted disorder is not strictly a prerequisite of a successful disordered binding site prediction.
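The linear combination and smoothing of the three per-residue scores can be sketched as follows. This is an illustration under stated assumptions rather than the ANCHOR code: the function names are hypothetical, the "window of 4 residues" is read as 4 residues on either side, and the final conversion of the combined score into a probability is left out because its form is not given here.

```python
def window_mean(values, k, w):
    """Mean of values within w positions of 0-based index k, clipped at the chain ends."""
    lo, hi = max(k - w, 0), min(k + w, len(values) - 1)
    window = values[lo:hi + 1]
    return sum(window) / len(window)


def combined_score(iupred, e_int, e_gain, w1, p1, p2, p3, smooth=4):
    """Per-residue p1*S + p2*E_int + p3*E_gain, followed by smoothing.

    iupred, e_int and e_gain are equal-length lists of per-residue values.
    S is the IUPred score averaged over the w1 neighborhood.
    smooth=4 is interpreted as +/-4 residues (an assumption).
    """
    n = len(iupred)
    raw = [p1 * window_mean(iupred, k, w1) + p2 * e_int[k] + p3 * e_gain[k]
           for k in range(n)]
    return [window_mean(raw, k, smooth) for k in range(n)]
```

As an example of use, the first parameter set in Table 2 would correspond to `combined_score(iupred, e_int, e_gain, w1=25, p1=0.4630, p2=0.3847, p3=0.7985)`, with `e_int` computed beforehand using the corresponding second window size.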

Figure 1

The construction of the ANCHOR prediction method demonstrated on the N-terminal domain of human p53.

Left: IUPred prediction score for the full length human p53 (top) and $S_k$, $E_{int,k}$ and $E_{gain,k}$ calculated for the disordered N-terminal domain of human p53 (middle). Grey boxes show the three binding sites, with the overlap of the RPA70N and RNAPII binding sites shown in dark grey. The outputs of the three individually optimized predictors are shown in black and their average, the final prediction score, is shown in purple (bottom). Right: PDB structures of the binding sites in the N-terminal region of p53 (yellow) complexed with the respective partners (blue): MDM2 (top, PDB ID: 1ycq [57]), RPA 70N (middle, PDB ID: 2b3g [58]) and RNA PII (bottom, PDB ID: 2gs0 [59]).

Testing of the algorithm

Testing of the predictor was done by dividing both our negative and positive datasets (Globular proteins and Short disordered complexes) into three subsets, training the predictor on two of these and evaluating it on the remaining third one. This was done in all three possible combinations yielding three optimal parameter sets. The parameters calculated on the training sets are shown in Table 2 together with the respective True Positive Rates (TPR) and the fraction of the amino acids in disordered regions of the Disprot dataset predicted to be in disordered binding sites (F values). The optimal parameters were chosen to maximize the amount of correctly predicted disordered binding sites (TPR) while minimizing predicted binding sites in globular proteins (FPR) and also restricting predicted binding sites within disordered regions in general (F). The fact that the three parameter sets do not differ significantly implies that our method is robust.
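The three-fold procedure, together with a score cutoff fixed to a 5% false positive rate on the training negatives, can be sketched as below. This is a minimal illustration, not the authors' optimization code: the function names are hypothetical, the folds are formed by simple interleaving (the actual partitioning is not described here), and a residue is counted as positive when its score is strictly above the cutoff.

```python
def three_fold_splits(items):
    """Yield (train, test) pairs; each third of the items is held out once."""
    folds = [items[i::3] for i in range(3)]
    for held_out in range(3):
        train = [x for i, fold in enumerate(folds) if i != held_out for x in fold]
        yield train, folds[held_out]


def cutoff_for_fpr(negative_scores, target_fpr=0.05):
    """Score cutoff such that at most target_fpr of the negatives lie strictly above it."""
    ranked = sorted(negative_scores, reverse=True)
    n_allowed = int(target_fpr * len(ranked) + 1e-9)
    return ranked[min(n_allowed, len(ranked) - 1)]


def false_positive_rate(negative_scores, cutoff):
    """Fraction of negative residues scoring strictly above the cutoff."""
    return sum(s > cutoff for s in negative_scores) / len(negative_scores)
```

A cutoff chosen this way on the training negatives gives 5% FPR there by construction, while the FPR measured on the held-out third can deviate slightly from that value.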

Table 2

Parameter and prediction accuracy values obtained during the optimization of ANCHOR.

	w₁	w₂	p₁	p₂	p₃	F (%)	TPR (%)	FPR (%)
Training set 1	25	60	0.4630	0.3847	0.7985	46.0	69.8	5.0
Training set 2	27	60	0.6075	0.4149	0.6773	47.4	67.7	5.0
Training set 3	29	90	0.6990	0.4585	0.5488	43.4	64.8	5.0

Optimal parameters of the predictor determined during training. $w_1$, $w_2$, $p_1$, $p_2$ and $p_3$ are the optimized parameters, F is the fraction of the residues in the disordered regions in the Disprot database that are predicted to be in binding sites, and TPR and FPR are the True and False Positive Rates, respectively.

The output of the predictor with all three parameter sets and the combined final predictor (the average of these three) are shown for the example of the N-terminal region of p53 in Figure 1. A few additional well characterized examples are shown in the Supporting Information (Figure S2, Figure S3, Figure S4, Figure S5, and Figure S6). The results obtained on the three independent testing subsets as well as their average are given in Table 3. Since the cutoffs are set by the training process such that we achieve exactly 5% False Positive Rate (FPR) on the respective training sets (i.e. the part of the original Globular proteins dataset that was used in the training of the respective subpredictor), the FPRs on the testing sets are also quoted (they can differ slightly from 5%). Besides the overall TPR calculated on a residue basis (marked TPR_AA), we also calculated the percentage of binding sites identified, termed TPR_SEG. A binding site was considered to be found if at least five of its amino acids were correctly classified. The results show that ANCHOR performs at 62% TPR_AA with a slightly higher TPR_SEG of 68% on average, while maintaining a 5% FPR.

ANCHOR is also specific to disordered binding sites as opposed to disorder in general. If all disordered proteins had approximately equal capability of binding, then the fraction of correctly identified disordered binding sites (TPR_AA) could not be significantly different from the fraction of disordered regions predicted to be binding sites (F value). As this is not the case (TPR_AA = 62% vs. F = 42%), we can conclude that common features of known disordered binding sites that distinguish them from general disordered protein regions are successfully recognized.
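The two true positive rates can be made concrete with a small sketch. The data layout is an assumption made for illustration: binding sites are given as inclusive (start, end) residue ranges and the prediction as a set of residue positions; the function names are hypothetical. The five-residue criterion is the one stated above.

```python
def tpr_aa(sites, predicted):
    """Residue-based TPR: fraction of binding-site residues predicted as binding."""
    positions = [i for start, end in sites for i in range(start, end + 1)]
    return sum(i in predicted for i in positions) / len(positions)


def tpr_seg(sites, predicted, min_hits=5):
    """Segment-based TPR: a site counts as found if at least min_hits residues are predicted."""
    found = sum(
        1 for start, end in sites
        if sum(i in predicted for i in range(start, end + 1)) >= min_hits
    )
    return found / len(sites)


# Toy example: two 10-residue sites, with six and two residues predicted.
sites = [(1, 10), (21, 30)]
predicted = set(range(1, 7)) | {21, 22}
print(tpr_aa(sites, predicted))   # 0.4
print(tpr_seg(sites, predicted))  # 0.5
```

In the toy example, 8 of 20 site residues are recovered (0.4), but only the first site reaches five hits, so the segment-based rate is 0.5. Note that in this layout overlapping sites would have their shared residues counted once per site.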

Table 3

Prediction efficiency of ANCHOR evaluated on the testing datasets.

	TPR_AA (%)	TPR_SEG (%)	FPR (%)
Testing set 1	61.1	62.5	5.7
Testing set 2	69.5	80.0	4.4
Testing set 3	54.7	62.5	5.1
Average	61.8	68.3	5.1

Results of the testing of ANCHOR on the three testing datasets. TPR_AA denotes the ratio of correctly identified amino acids belonging to binding sites. TPR_SEG denotes the ratio of binding sites found by the algorithm.

Another standard way of describing prediction algorithms is by Receiver Operating Characteristic (ROC) curves [60], that is, the TPR versus the FPR of the algorithm. This relationship is mapped by scanning the interval between 0 and 1 with the score cutoff. The three ROC curves of the predictor with the three different parameter sets evaluated on the respective testing sets are shown in Figure 2. A single-number measure of performance is the area under the curve (AUC), with random predictors scoring AUC = 0.5 and perfect predictors scoring AUC = 1. The AUC values of the predictors trained and tested on the respective subsets are 0.8675, 0.8781 and 0.8993.
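How a ROC curve and its AUC follow from scanning the score cutoff can be sketched in a few lines. This is a generic illustration with hypothetical function names, using every distinct score as a cutoff and the trapezoidal rule for the area; it is not tied to the evaluation scripts used in this work.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by lowering the cutoff through all distinct scores.

    labels are 1 for binding-site residues and 0 for negatives; a residue is
    predicted positive when its score is at or above the cutoff.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and not y)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A predictor that ranks every positive above every negative gives an area of 1, one that ranks them in exactly the wrong order gives 0, and uninformative scores fall near 0.5.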

Figure 2

ROC curves obtained during the testing of ANCHOR.

ROC curves of the predictor with parameter sets optimized on each of the three training subsets and evaluated on the respective testing subsets are shown with red, green and blue lines. The line with unity slope corresponding to random prediction is also shown. The vertical line corresponds to FPR = 0.05, where the final predictor (the average of these three) is used.

Since the interacting regions of a disordered and an ordered protein are inherently different, we expect that the predictor will only recognize binding sites in disordered proteins that interact with globular proteins but are not part of globular proteins themselves. In order to verify this hypothesis we tested the combined final predictor on a dataset of complexes containing only ordered chains (that is, three-state complexes; see Methods). The prediction was done on the short interacting chain of the complexes. This gave a false positive rate of only 3.7%, which is even lower than the value obtained on our testing set, although this might be only a consequence of the relatively small size of our ordered complex set (72 complexes). Overall, we could ensure that our predictor makes very few mistakes on both globular proteins and complexes of globular proteins, while it can still recognize the majority of disordered binding regions. This implies that our algorithm is specific to disordered binding sites as opposed to globular proteins, the interface between globular proteins, or disordered proteins in general.

Our predictor was also tested on a completely independent dataset of α-MoRFs, a set of short disordered complexes that was assembled by Cheng et al. [48] and is composed of 40 proteins containing binding regions that adopt mostly α-helical structure upon binding. The results of the prediction on this dataset can be seen in Table 4. Although the residue based TPR is somewhat lower than that calculated on our testing set (57.0% instead of 61.8%), the segment based TPR is almost the same for the two sets (67.5% and 68.3%). Overall, these results are comparable to the ones calculated on our training set.

Table 4

Prediction efficiency of ANCHOR evaluated on an independent dataset (α-MoRFs dataset).

	H	E	C	Total	SEG
In dataset	263	8	210	479	40
Found	147	5	121	273	27
Ratio (TPR)	55.9%	62.5%	57.6%	57.0%	67.5%

Prediction results for the α-MoRFs dataset. SEG denotes segment based results where each binding site is considered one segment and one such segment is considered found if at least five of its amino acids are correctly identified.

Amino acid based evaluation of the predictor

The specific construction of the algorithm for the prediction of interaction energy implies that the method will be sensitive to amino acid compositions. The differences between the composition of disordered binding sites and the amino acid composition of any of the negative sets (globular proteins, ordered interfaces and disordered proteins in general) are shown in Figure 3A, 3B, and 3C, respectively. The amino acid compositions of all three datasets are significantly different from that of disordered binding segments (data not shown).

Figure 3

The distinct amino acid composition of short disordered binding sites.

The average amino acid composition of the interacting parts of the short disordered binding sites compared to the average amino acid composition of (A) the globular proteins dataset, (B) the disordered proteins dataset and (C) the interacting parts of the shorter chains of the ordered complexes. Amino acids are arranged according to increasing hydrophobicity.

The final prediction is based on three different scores that combine local and global disorder tendency with sensitivity to the structural environment. Although the individual quantities that are combined for the final score can work selectively better or worse for various types of residues, the effect of these differences on the efficiency of the final prediction is not trivial. This effect was tested by comparing the number of residues of each amino acid type in the short disordered binding sites to the number recovered from these by the predictor. These data are shown in Table 5 together with the calculated p values quantifying their differences. As all of the p values are fairly large, these differences are likely to occur by chance alone. For example, proline rich binding sites are found with similar accuracy as binding sites enriched in hydrophobic amino acids. Therefore, one may conclude that there is no statistical evidence based on the available dataset that the efficiency of the predictor depends significantly on the amino acid composition of the disordered binding site in question.

Table 5

The independence of the efficiency of ANCHOR from the amino acid composition of the binding sites.

AA	N_int	N_found	p
R	42	21	0.122
K	47	36	0.362
D	40	27	1.000
E	41	20	0.116
N	14	6	0.252
Q	22	11	0.358
S	46	34	0.497
G	23	14	0.758
H	9	7	1.000
T	31	20	1.000
A	39	33	0.068
P	40	19	0.113
Y	17	11	1.000
V	29	20	1.000
M	17	16	0.085
C	4	2	1.000
L	69	47	0.857
F	26	19	0.764
I	31	26	0.146
W	6	5	1.000

N_int shows the number of interacting residues in the short disordered binding sites, and N_found shows the number of these that are correctly found by the predictor. As some types of amino acids are rare, Fisher's exact test was used to calculate (two-tailed) p values to determine whether the predictor works significantly better or worse for certain amino acid types, with high p values corresponding to no significant difference.
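A two-tailed Fisher's exact test on a single 2×2 table can be sketched in plain Python as below. How the tables behind Table 5 were laid out is not stated, so the layout used in the example (found vs. missed residues of one amino acid type against all other types, with totals obtained by summing the columns of Table 5) is an assumption, and the p value it yields is not claimed to reproduce the tabulated one. The function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# ASSUMED layout for arginine (R): 21 of 42 found, versus all other residue types
# (column sums of Table 5: 593 interacting residues, 394 found).
found_r, missed_r = 21, 42 - 21
found_rest, missed_rest = 394 - 21, (593 - 42) - (394 - 21)
p_arg = fisher_exact_two_sided(found_r, missed_r, found_rest, missed_rest)
```

The exact test is preferred here over a chi-square approximation because several amino acid types (for example C and W) contribute only a handful of residues.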

Secondary structures and the efficiency of ANCHOR

The relationship between the efficiency of the prediction and the secondary structure types was also assessed by considering three types of secondary structural elements: helix (H, including α- and 3₁₀ helices), extended (E) and coil (C, including everything else), as defined by DSSP [61]. Table 6 shows the number of amino acids in each conformation found in the PDB structures of our positive training set (short disordered complexes), among the interacting residues of these structures, and among the interacting residues that are correctly identified by the predictor. These data are represented graphically as distributions in Figure 4. The secondary structure content in this type of interaction is heavily biased towards coil conformation. It can also be seen in Figure 4 that the predictor seems to work slightly better for H and E conformations. However, comparing the distribution of secondary structures in the interacting residues with that in the subset identified correctly by ANCHOR shows that this difference is not statistically significant at the 5% level (χ² = 5.32, p = 0.070).
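The reported statistic can be approximately reproduced from the counts in Table 6. The exact form of the test is not stated, so the sketch below assumes a chi-square goodness-of-fit test of the correctly identified residues against the distribution of the interacting residues, with two degrees of freedom; under that assumption the counts give χ² of about 5.28 and p of about 0.072, close to the quoted values.

```python
from math import exp

# Counts from Table 6 (H = helix, E = extended, C = coil).
interacting = {"H": 200, "E": 25, "C": 371}
identified = {"H": 144, "E": 23, "C": 225}


def chi_square_gof(observed, reference):
    """Chi-square statistic of observed counts against reference proportions."""
    n_obs = sum(observed.values())
    n_ref = sum(reference.values())
    chi2 = 0.0
    for key, count in observed.items():
        expected = n_obs * reference[key] / n_ref
        chi2 += (count - expected) ** 2 / expected
    return chi2


chi2 = chi_square_gof(identified, interacting)
# For 2 degrees of freedom the chi-square survival function is exp(-x / 2).
p_value = exp(-chi2 / 2)
```

The closed-form p value holds only for two degrees of freedom, which is what three categories give; other category counts would need a general chi-square distribution function.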

Table 6

Secondary structure distributions in the short disordered binding site dataset.

	Total in PDB		Interacting residues		Correctly identified
	Number	Fraction (%)	Number	Fraction (%)	Number	Fraction (%)
H	297	35.7	200	33.6	144	36.7
E	25	3.0	25	4.2	23	5.9
C	510	61.3	371	62.2	225	57.4
Total	832		596		392

The number and fraction of amino acids in different secondary structures in the disordered chains of the complexes. The three groups show these data for all the amino acids in the PDB structures, those in interaction, and those correctly identified as part of a binding site by ANCHOR.
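The comparison of secondary structure distributions can be sketched as a chi-square goodness-of-fit test on the counts in Table 6. The layout below is an assumption: the correctly identified residues are taken as observed counts, and the expected counts are derived from the distribution of all interacting residues. The result can be compared with the reported χ² = 5.32 and p = 0.070; small differences may arise from rounding or from a different test layout.

```python
from math import exp

# Counts copied from Table 6 for helix (H), extended (E) and coil (C).
INTERACTING = {"H": 200, "E": 25, "C": 371}
CORRECT = {"H": 144, "E": 23, "C": 225}


def chi_square_gof(observed, reference):
    """Chi-square statistic of `observed` against the proportions in `reference`."""
    n_obs = sum(observed.values())
    n_ref = sum(reference.values())
    chi2 = 0.0
    for key, obs in observed.items():
        expected = n_obs * reference[key] / n_ref
        chi2 += (obs - expected) ** 2 / expected
    return chi2


def p_value_df2(chi2):
    """Upper-tail p value; with exactly 2 degrees of freedom it equals exp(-x/2)."""
    return exp(-chi2 / 2.0)


chi2 = chi_square_gof(CORRECT, INTERACTING)
p = p_value_df2(chi2)

if __name__ == "__main__":
    print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```

Three categories give two degrees of freedom, which is why the closed-form survival function can be used without a statistics library.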

Figure 4

Secondary structure distributions in the short disordered binding site dataset.


Fraction of amino acids in different secondary structures in the disordered chains of the complexes. The three groups denote the fractions calculated on all the residues in the PDB structures, only the interacting ones and the ones correctly identified by the predictor.

Furthermore, a similar result holds true if binding sites are categorized based on their dominant secondary structure type: there is no significant correlation between the secondary structure type the binding regions adopt upon binding and the efficiency of the predictor. (Dataset S1 shows the secondary structure types determined for the short disordered chains in the disordered complexes as described in Protocol S1.) Overall, this means that there is no significant difference in the efficiency of the prediction on different secondary structural elements.

Testing on long disordered regions

Since the predictor was trained on the short disordered dataset, it is informative to see how it performs on long disordered binding sites. There is experimental evidence that at least some long disordered chains are not uniform concerning binding strength but contain short stretches of strongly interacting residues separated by segments that interact with the partner only weakly, if at all [19]. In these cases, it is expected that the predictor will be unable to identify the weakly interacting parts: although these parts may also form interchain contacts, they would not be able to bind to the partner in the absence of their sequential neighbors. The distribution of predicted binding regions for the short and long disordered chains in Figure 5A shows a strong preference for predicting multiple interacting regions for longer chains. This inevitably yields a lower residue-based TPR, but the segment-based TPR is not expected to drop. Testing the predictor on the long disordered data confirms this assumption, with a decreased residue-based TPR of 47.7% (as opposed to 65.8% obtained by running the final predictor on the whole set of short disordered complexes) but with a basically unchanged segment-based TPR of 78.6% (compared to the 76.1% calculated on short disordered complexes). These data suggest that the method either finds a short disordered binding site as a whole or misses it completely. However, this may not be true for long binding regions. Figure 5B shows the distribution of the fraction of amino acids successfully identified during prediction in the two types of binding sites. The effect can clearly be seen: about 59% of short binding regions are either fully recovered or completely missed (the sum of the rightmost and leftmost columns), whereas this ratio is only about 29% for long binding sites.
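The difference between the residue-based and segment-based true positive rates can be illustrated with a short sketch. The data below are toy values, not taken from the datasets, and the criterion for counting a site as found (`min_overlap`, here at least one shared residue) is an assumption, since the actual criterion is not given in this section.

```python
def residue_tpr(sites, predicted):
    """Fraction of true binding residues that are predicted, pooled over all sites."""
    total = sum(len(site) for site in sites)
    hits = sum(len(site & pred) for site, pred in zip(sites, predicted))
    return hits / total


def segment_tpr(sites, predicted, min_overlap=1):
    """Fraction of sites counted as found; `min_overlap` is an ASSUMED criterion."""
    found = sum(
        1 for site, pred in zip(sites, predicted) if len(site & pred) >= min_overlap
    )
    return found / len(sites)


def recovered_fractions(sites, predicted):
    """Per-site fraction of recovered residues (the quantity binned in Figure 5B)."""
    return [len(site & pred) / len(site) for site, pred in zip(sites, predicted)]


# Toy example: two short sites (one fully found, one missed) and one long site
# that is recovered only in two separate stretches.
sites = [set(range(10, 20)), set(range(5, 15)), set(range(1, 41))]
predicted = [set(range(10, 20)), set(), set(range(3, 9)) | set(range(30, 36))]

if __name__ == "__main__":
    print("residue TPR:", round(residue_tpr(sites, predicted), 3))
    print("segment TPR:", round(segment_tpr(sites, predicted), 3))
    print("recovered fractions:", recovered_fractions(sites, predicted))
```

In the toy data the long site counts as found at the segment level although only part of its residues are recovered, which is the pattern described for the long disordered dataset.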

Figure 5

Prediction accuracies and segmentation for the short and long disordered binding sites.


(A) The distribution of the number of binding segments predicted in short (white bars) and long (black bars) binding sites. It shows the segmented nature of longer binding sites. (B) The distribution of the fraction of correctly recovered interacting residues in both the short (white bars) and long (black bars) disordered binding sites.

This type of behavior is illustrated by the disordered human p27. This protein is involved in controlling eukaryotic cell division through interactions with cyclin-dependent kinases. Its kinase inhibitory domain binds both subunits of the CDK2-cyclin A complex in an extended conformation (PDB ID: 1jsu [62]). It is known from kinetic measurements that the binding of p27 is hierarchical through its three domains: first, the D1 domain (residues 25–36) binds to cyclin A, which anchors the neighboring LH domain (residues 38–60), which exhibits transient helical structure in the monomeric state as well [63]. After the binding of D1, this transient structure is stabilized and places the rest of the chain (D2 domain, residues 62–90) in the correct position to bind to CDK2. Figure 6 shows the prediction output for p27. Four interacting regions are identified, with the first one (27–37) clearly corresponding to D1. The gap between the first two regions (38–58) coincides with the weakly interacting LH domain. The last three regions (59–67, 74–77 and 79–90) cover the strongly interacting D2. Figure 6 also shows the number of atomic contacts per residue for p27 (averaged in a window of size 3). This contact number profile exhibits well-pronounced peaks that line up with the regions predicted by our algorithm. The figure also shows the four predicted regions mapped to the crystal structure of the complex.
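The correspondence between the predicted regions and the p27 domains can be checked with simple interval arithmetic. The sketch below uses only the residue ranges quoted in the text, treated as inclusive; the function names are illustrative.

```python
# Domain boundaries and predicted regions as quoted in the text (inclusive numbering).
DOMAINS = {"D1": (25, 36), "LH": (38, 60), "D2": (62, 90)}
PREDICTED = [(27, 37), (59, 67), (74, 77), (79, 90)]


def overlap(a, b):
    """Number of residues shared by two inclusive intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def coverage(domain):
    """Return (residues inside predicted regions, domain length)."""
    start, end = DOMAINS[domain]
    # The predicted regions do not overlap each other, so overlaps can be summed.
    covered = sum(overlap((start, end), region) for region in PREDICTED)
    return covered, end - start + 1


if __name__ == "__main__":
    for name in DOMAINS:
        covered, length = coverage(name)
        print(f"{name}: {covered}/{length} residues inside predicted regions")
```

With these ranges, most of D1 and D2 falls inside predicted regions, while only the last two residues of LH do, consistent with the description of LH as the weakly interacting linker.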

Figure 6

ANCHOR prediction for human p27.

Top: Number of atomic contacts (green) and prediction output (blue) for the N-terminal binding region of human p27. “D1” and “D2” denote the two strongly interacting domains (red boxes) and “LH” denotes the weakly interacting linker domain between them (yellow box). Bottom: Crystal structure of human p27 (red and yellow) complexed with CDK2 (magenta) and Cyclin A (blue) (PDB ID: 1jsu [62]). Red parts denote regions that are predicted to bind by the predictor. These regions correspond to the experimentally verified strongly binding regions of p27. The figure was generated by PyMOL.


Wiskott-Aldrich Syndrome protein (WASp)

The examples discussed so far represent various fragments of proteins. Here we present an additional case showing the prediction output for a complete protein sequence. The human Wiskott-Aldrich Syndrome protein (WASp) is a 502-residue protein that is expressed in the cells of the hematopoietic system [64]. Its mutations can be linked to the Wiskott-Aldrich Syndrome (WAS), a disease characterized by actin cytoskeleton defects leading to deficiencies in blood clotting and immune response. The protein is composed of various functional domains. It contains the WH1 domain near the N terminus (residues 39–148), the GTPase-binding domain (GBD, 230–310), a polyproline-rich region and a C-terminal verprolin homology/central region/acidic region (VCA, 430–502) domain [65] that also contains the WH2 domain (430–447). Apart from the structured WH1 domain, it is predicted to be largely disordered and contains several low complexity regions (enriched in P, G and acidic amino acids). There is experimental evidence that the activated WASp hubs a number of interactions with partners including CDC42, RAC, NCK, FYN, SRC kinase FGR, BTK, ABL, PSTPIP1, WIP, and the p85 subunit of PLC-gamma as well as the Arp2/3 complex. However, the location of many of these binding regions is not known. The domain structure of WASp is shown in Figure 7 together with the known binding regions.

Figure 7

ANCHOR prediction for human WASp.


Red bars mark known interaction sites, green box marks the globular WH1 domain, blue boxes mark the GBD and VCA domains. Light red boxes indicate the regions with putative SH3 domain interaction sites.

In its inactive state WASp exists in an autoinhibited form with the GBD domain bound to the VCA domain. When WASp is activated, the GBD domain is bound to CDC42 and this interaction disrupts the GBD-VCA interaction. This initiates a conformational change where WASp opens up and becomes able to bind to the Arp2/3 complex, leading to its activation and actin nucleation. Both GBD and VCA regions were shown to be disordered in their free state [65],[66], with GBD adopting a loosely packed, compact conformation. However, the structure of both complexes could be determined using NMR, by covalently linking GBD to CDC42 or the VCA region, respectively [65],[67]. In these two structures WASp GBD adopts related but distinct folds. The plasticity that can be seen by comparing these two complexes is enabled by the absence of discrete tertiary structure in isolation. As can be seen in Figure 7, ANCHOR captures these disordered binding sites correctly.

It is known that WASp is able to bind to SRC Homology 3 (SH3) domains through one of its proline-rich regions, although the exact binding site is not known. The interaction with SH3 domains is usually mediated by a short, linear sequence motif that is present in the interaction partner. In the Eukaryotic Linear Motifs (ELM) database (http://elm.eu.org/ [68]) there are five different motifs annotated as SH3 recognition sites. Multiple instances of the following three can be found in human WASp: LIG_SH3_1, LIG_SH3_2 and LIG_SH3_3, represented by the consensus sequences [RKY]..P..P, P..P.[KR] and ...[PV]..P, for interaction with Class I SH3 domains, Class II SH3 domains, and SH3 domains with a non-canonical Class I recognition specificity, respectively. The motifs found are clustered in two separate regions, mainly falling into the proline-rich regions of WASp (Figure 7). Although there is no direct evidence for the location of interaction with SH3 domains on human WASp, the interaction sites have been identified for Las17 [69], the yeast homologue of this protein. In total, four distinct regions containing multiple binding sites were identified experimentally in Las17 that interact with various SH3 domains. These sites correspond to the proline-rich regions in WASp (155–194 and 306–427) that also match several SH3 binding motifs. As linear motifs were shown to have a preference to reside in disordered regions [70], it is plausible to expect ANCHOR to be able to recognize the SH3 binding region of WASp. In accordance with this, both regions containing putative SH3 binding sites contain binding sites predicted by ANCHOR. This prediction can restrict the candidate sequence regions for SH3 binding and can guide experimental studies to localize true binding sites.
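The motif search described above can be sketched with regular expressions, since the three consensus sequences quoted in the text are already valid patterns in which a dot matches any residue. The test sequences below are hypothetical toy strings, not fragments of WASp, and the function name is illustrative.

```python
import re

# Consensus patterns as quoted in the text, written as regular expressions.
SH3_MOTIFS = {
    "LIG_SH3_1": r"[RKY]..P..P",
    "LIG_SH3_2": r"P..P.[KR]",
    "LIG_SH3_3": r"...[PV]..P",
}


def find_motifs(sequence):
    """Return {motif name: [(start, end), ...]} with 1-based inclusive positions."""
    hits = {}
    for name, pattern in SH3_MOTIFS.items():
        # Wrapping the pattern in a lookahead reports overlapping instances too.
        regex = re.compile(f"(?=({pattern}))")
        hits[name] = [(m.start(1) + 1, m.end(1)) for m in regex.finditer(sequence)]
    return hits


if __name__ == "__main__":
    toy_sequence = "RAAPAAP"  # hypothetical sequence for illustration only
    for name, positions in find_motifs(toy_sequence).items():
        print(name, positions)
```

Overlapping matches matter in proline-rich regions, where consecutive prolines can give several motif instances that start one residue apart.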

Complete proteome scans

In order to gain some evolutionary insight concerning disordered binding sites, the predictor was run on the 736 complete proteomes (53 archaea, 639 bacteria and 44 eukaryota, see Dataset S5, Dataset S6, and Dataset S7, respectively) that are currently available from the SwissProt database (ftp://ftp.expasy.org/). In agreement with previous analyses [5],[6], there is a clear trend of increasing protein disorder as the complexity of the organism increases (see Figure 8). However, Figure 8 also shows that the fraction of disordered amino acids predicted to be in disordered binding sites increases even compared to the fraction of disordered residues as the complexity of organisms grows. Generally, archaea have the least amount of both disorder and binding sites. On the other hand, eukaryota generally have the largest ratio of disordered and binding amino acids, with bacteria falling between these two groups on average. However, there are a few exceptions to these general trends, marked separately in Figure 8.
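The two quantities compared for each proteome can be written down explicitly: the fraction of residues in disordered regions, and the fraction of those disordered residues that fall in disordered binding sites. The sketch below assumes per-residue 0/1 flags for each protein as input and counts a binding residue only if it is also flagged as disordered; both the input format and that rule are assumptions made for illustration.

```python
def proteome_fractions(proteins):
    """Compute (disordered / total, binding / disordered) over a whole proteome.

    `proteins` is an iterable of (disorder_flags, binding_flags) pairs, each a
    pair of equal-length sequences of 0/1 values (ASSUMED input format).
    """
    total = disordered = binding = 0
    for disorder_flags, binding_flags in proteins:
        total += len(disorder_flags)
        disordered += sum(1 for d in disorder_flags if d)
        # ASSUMPTION: a binding residue is counted only if it is also disordered.
        binding += sum(1 for d, b in zip(disorder_flags, binding_flags) if d and b)
    frac_disordered = disordered / total if total else 0.0
    frac_binding = binding / disordered if disordered else 0.0
    return frac_disordered, frac_binding


# Toy proteome with one partly disordered protein and one fully ordered protein.
toy_proteome = [
    ([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]),
    ([0] * 10, [0] * 10),
]

if __name__ == "__main__":
    print(proteome_fractions(toy_proteome))
```

Normalizing the binding residues by the disordered residues, rather than by the proteome size, is what allows the binding fraction to be compared across organisms with very different amounts of disorder.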

Figure 8

Fraction of disordered and disordered binding site residues in complete proteomes.


The number of amino acids in disordered binding sites divided by the number of amino acids in disordered regions, plotted as a function of the number of amino acids in disordered regions divided by the total number of residues in the proteome of the organism, for the 736 complete proteomes deposited in the SwissProt database, colored according to the three kingdoms of life. The outlying points are marked with the name of the corresponding organism.

Considering archaea, mesophiles generally contain a larger amount of disorder and a larger fraction of disordered binding sites than most extremophiles (thermophiles, cryophiles and acidophiles). However, the group of halophilic archaea (archaea that favor high saline concentration) is a distinct exception, with the fraction of disordered amino acids ranging from 0.2 to 0.25, as opposed to other extremophiles' values not exceeding 0.07. This group includes all the halophilic archaea in our study, namely Natronomonas pharaonis, Haloarcula marismortui, Haloquadratum walsbyi and two types of Halobacterium salinarum. Cenarchaeum symbiosum, the only example of an obligate endosymbiont among archaea, also has an unusually large amount of disordered protein segments in its proteome (0.12). While Cenarchaeum symbiosum is closely related to thermophilic archaea, it is adapted to the much lower living temperature of its host [71]. This adaptation could explain the relatively large amount of protein disorder and disordered binding sites. In general, these clear differences in the predicted disorder between various archaea point to different strategies to adapt to various extreme environmental conditions, resulting in biased amino acid compositions. However, we cannot rule out the possibility that under such extreme conditions, such as high salt concentration or high temperature, the amount of disorder can be over- or underpredicted depending on how these conditions affect the presence of protein disorder.
Among bacterial proteomes, there are a few examples of organisms that seem to utilize a surprisingly large fraction of their disordered amino acids in binding. The three most extreme cases (Carsonella ruddii, Sulcia muelleri and Buchnera aphidicola subsp. Cinara cedri) are marked separately in Figure 8. These are the three smallest complete bacterial proteomes, none of them reaching the size of the smallest archaeal proteome. These organisms present extreme cases of streamlined genomes as a result of endosymbiosis [72]–[74]. As these proteomes are very small, the predicted amounts of disorder and disordered binding sites are within the false positive range and should be treated more cautiously.

Eukaryotes tend to appear more consistent both in using a larger amount of disordered residues and in using a larger fraction of disordered residues for binding compared to the other two kingdoms (Figure 8). The only notable outlier, in terms of both an extremely low amount of disordered proteins and of disordered binding sites, is Encephalitozoon cuniculi. This organism is the only microsporidian parasite in our dataset and has an extremely small proteome. This lack of complexity and the dependence on a eukaryotic host to function might explain the lack of disordered proteins.

The length distributions of the predicted disordered regions and binding sites in the three kingdoms of life were also analyzed and are shown in Figure 9A and 9B, respectively. As complexity increases, longer disordered segments are preferred, and the difference between eukaryota and lower complexity organisms becomes even more apparent for longer regions (over 30 residues). A similar trend can be observed in the length distribution of disordered binding sites. While in archaea and bacteria predicted binding regions are generally below 30 residues, longer binding sites are much more common in eukaryotic organisms. There are at least three different effects that can contribute to this phenomenon.
First, as the number of binding sites rises, there is also an increasing possibility of these binding sites becoming very close to each other or even overlapping with each other. This scenario was demonstrated in the case of the N-terminal domain of p53, as shown in Figure 1. Second, extremely large disordered binding regions may be needed for special functions. Some members of the mucin protein family provide an example of this. Human MUC1 contains a large repeat region (20–120 repeats, one repeat being 20 amino acids long) that enables it to aggregate and to perform its function [75]. As each repeat is correctly identified as a disordered binding site, the whole repeat region is predicted as one large binding region. This mechanism can create binding sites up to several hundred residues long in extreme cases. Third, we cannot exclude the possibility that longer binding sites are not always segmented by weakly interacting regions as in the case of p27, thus forming long, continuous binding regions. Nevertheless, the majority of predicted binding sites are shorter than 30 residues, although no such restriction on the length of disordered binding sites was enforced.
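The length statistics discussed above follow from converting per-residue predictions into contiguous segments. The sketch below assumes a per-residue 0/1 prediction as input and uses the 30-residue length mentioned in the text as the cutoff; the toy profiles are invented for illustration.

```python
def segments(flags):
    """Return inclusive 1-based (start, end) runs of truthy values."""
    runs, start = [], None
    for position, flag in enumerate(flags, start=1):
        if flag and start is None:
            start = position
        elif not flag and start is not None:
            runs.append((start, position - 1))
            start = None
    if start is not None:
        # Close a run that extends to the end of the sequence.
        runs.append((start, len(flags)))
    return runs


def length_summary(flags, cutoff=30):
    """Return (segment lengths, fraction of segments shorter than `cutoff`)."""
    lengths = [end - start + 1 for start, end in segments(flags)]
    short = sum(1 for length in lengths if length < cutoff)
    return lengths, (short / len(lengths) if lengths else 0.0)


# Toy profile: one 12-residue site and one 40-residue site.
toy_flags = [0] * 5 + [1] * 12 + [0] * 3 + [1] * 40 + [0] * 2

# Three back-to-back 20-residue repeats, each predicted as binding, merge into
# a single 60-residue segment, as described for repeat regions.
repeat_flags = [1] * 20 * 3

if __name__ == "__main__":
    print(segments(toy_flags), length_summary(toy_flags))
    print(segments(repeat_flags))
```

The second toy profile shows why adjacent sites inflate the length distribution: without an intervening unpredicted residue, neighboring sites cannot be separated at the level of the prediction output.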

Figure 9

Length distribution of disordered and disordered binding sites in complete proteomes.

The length distribution of (A) the disordered protein segments determined by IUPred and (B) predicted disordered binding sites determined by ANCHOR for the 736 complete proteomes available, grouped according to the three kingdoms of life.


Discussion

Regions undergoing disorder-to-order transitions upon binding are essential elements in the molecular recognition process involving disordered proteins. The main property of these binding regions is that they can exist in a disordered state as well as in bound state, adopting at least partially a well-defined conformation. The presence of these two separate states discriminates them from monomeric globular proteins as well as from complexes formed between globular proteins and from disordered proteins in general. They are also expected to differ from dual personality fragments [76], which occur within globular proteins, however, mostly as a result of perturbations of environmental conditions. In this work we aimed to recognize such disordered binding regions from the amino acid sequence. So far, the limited number of well characterized examples hindered the development of general prediction methods. Nevertheless, biophysical considerations suggest that in most cases there is a strong signal in the amino acid sequence highlighting regions involved in coupled folding and binding. These regions are linear in sequence, unlike in the case of globular proteins, where distinct sites in the amino acid sequence are brought together to form the interface for interaction [43]. An additional difference is that binding of disordered proteins is driven by a large enthalpic component to compensate for the entropy penalty due to the loss of conformational freedom [9]. These features result in a relatively short sequence segment containing residues with a pronounced tendency to make interactions, leading to a characteristic sequence signal. Our approach relies on a basic physical model of disordered binding sites and it is based on modeling the interaction capacity in the free disordered state and in the bound ordered state. 
Previously, it was shown that ordered proteins can be discriminated from disordered proteins based on estimated pairwise energy content, and this approach was implemented in IUPred, a general disorder prediction method [53]. This method takes into account that disorder/order tendency can be modulated by the sequential neighborhood simply at the level of amino acid composition, without attempting to model the specific interactions. Taking it one step further, the same energy estimation calculations were used to identify disordered binding regions in proteins. Our model assumes that the specific properties of disordered binding sites are dictated by the combination of preferences to bind to an ordered protein on the one hand, and the ability to remain in a disordered state in isolation, on the other. Based on this simple model, ANCHOR achieved approximately 67% accuracy at a 5% false positive rate (Tables 2–4). Furthermore, this approach was validated by the ability to reproduce the specific amino acid composition of disordered binding sites, which is distinct from that of ordered proteins as well as disordered proteins in general (Table 5).

During binding, the formation of intermolecular contacts is accompanied by the formation or the stabilization of secondary structure elements. The secondary structure composition of the binding sites is highly unequal (Table 6 and Figure 4). The most dominant secondary structure element adopted in the bound conformation is coil, while β strand conformation is rare. Helical conformations are observed as frequently in disordered complexes as in globular proteins [27]. It was found that the adopted secondary structure can be predicted from the amino acid sequence with similar accuracy as in the case of globular proteins, suggesting that the adopted secondary structure can be imprinted into the sequence of the binding motif [27]. The secondary structure observed in the complex can also be dictated by the template structure.
An extreme example of this is the C-terminal region of p53 (see Supporting Information), observed in all three secondary structure classes [32]. It is clear that not all of these conformations can be the result of inherent preferences. Interestingly, our prediction method does not seem to be sensitive to the adopted secondary structure conformation and it works with the same accuracy for all secondary structure conformations (Table 6 and Figure 4). This independence of secondary structure elements underlines the generality of ANCHOR. These results also suggest that disordered binding sites can be recognized without taking into account the adopted secondary structure in the majority of cases. Nevertheless, the details of conformational preferences can still be crucial in selecting the specific binding partner, or in determining the kinetic and thermodynamic properties of the associations.

Besides our algorithm, a previously published method called the α-MoRF predictor also exploited a general disorder prediction method to recognize short binding elements [48],[52]. Although a direct comparison between the two methods was not possible, because the α-MoRF predictor is not yet publicly available, some basic differences between the two methods should be noted. First, the α-MoRF predictor directly relies on the prediction output of PONDR VLXT, which essentially predicts binding regions as ordered structural elements, and a subsequent neural network is applied to filter out valid disordered binding sites. Although very high accuracies were reported for the performance of the neural network based filtering, the complete method is limited by finding dips based on PONDR VLXT [49]–[51]. It should therefore be taken into account that this program is a first generation prediction method that was trained on only 15 proteins. In the case of IUPred, dips corresponding to certain binding sites were also observed, although to a smaller extent [48],[53].
This observation, however, is not directly exploited in our prediction method. Instead, the core parameters of the energy prediction of IUPred are used to create three separate scores characterizing three important attributes of disordered binding regions. The second main difference is that ANCHOR is not restricted to a single secondary structure class like the α-MoRF predictor that was trained to recognize only α-helical segments. The example of the C-terminal region of p53 (Figure S2), where four short overlapping regions were shown to bind in different conformations representing all three secondary structure classes, indicates that such restriction can be a serious disadvantage for recognizing some extremely adaptable disordered binding motifs. An alternative approach for binding site identification is based on the observation that protein-protein interactions are often mediated through short linear motifs (approximately three to eight residues) [77]. Such motifs are defined by a consensus pattern, which captures the key residues involved in function or binding. Prominent examples include the nuclear receptor box motif, MDM2 binding sites, SH2/SH3 domain recognition patterns or 14-3-3 domain binding sites [68]. Although there are known examples of motifs that reside within globular domains, many of them are required to be in a disordered region to function properly and it was suggested that such motifs share many similarities with disordered binding regions [70]. Our preliminary results support previous observations of the partial overlap between short linear motifs and disordered binding segments. Nevertheless, short disordered binding sites and sequence specific linear motifs capture different aspects of certain binding regions. Linear motifs are defined on the basis of a per residue binding strength, and they are specific to a certain partner or to a group of partner molecules. 
However, such short linear motifs can also occur purely by chance, with no biological significance. Also, sequence patterns alone cannot ensure the accessibility of the site and the potential flexibility of the binding region that could be necessary for the complex formation. Complementary to sequence motifs, ANCHOR aims to capture a broader structural context. Based on their specific structural properties, it can recognize disordered binding regions that are capable of undergoing disorder-to-order transition. The predictions are made without taking into account the partner molecules and are expected to be less sensitive to sequence details. For certain motifs, this molecular environment can be a prerequisite of functionality and could help to identify biologically significant binding motifs.

In our work we assumed that short binding regions undergoing disorder-to-order transition can be viewed as elementary binding units that are necessary for molecular recognition. Therefore, such examples were used for the optimization of our method. In accordance with this elementary unit picture, ANCHOR generally recognized them as a single continuous binding site (Figure 5). Regions undergoing disorder-to-order transition, however, are not limited to such short segments, as there are several examples of longer disordered segments becoming ordered upon complex formation. Such segments can be as long as 100 residues. However, these longer regions can contain segments which bind only weakly or might not become ordered at all [63],[78],[79]. This segmentation of longer binding regions can occur for structural reasons. The segmentation can prevent the accumulation of the critical amount of residues that would lead to the formation of a collapsed structure or non-specific aggregates. The possible functional advantages of the segmented nature of a binding site were demonstrated for the well characterized example of p27.
The kinase inhibitory domain of p27 can be divided into several subdomains which dock and fold in a stepwise manner on the surface of the Cdk2-cyclin A complex [19]. These segments can also evolve independently, increasing the repertoire for specificity for different cellular location or species. Intervening segments of higher flexibility are accessible for modifications such as phosphorylations and ubiquitinations. This way p27 can integrate and process various signals to regulate cell proliferation, in which the flexibility and modularity of p27 is essential [63]. The segmented nature of binding is reflected in the prediction output, with predicted binding sites corresponding to the strongly interacting regions (Figure 6 for p27, and Figure S4 for a similar example, calpastatin). In the dataset of longer disordered binding segments, we found this segmentation to be quite general. In these cases, the predicted sites generally give only partial coverage of the PDB structure, and multiple binding sites are predicted in the majority of cases (Figure 5). This suggests that our prediction method is likely to find those sites that interact more strongly, anchoring the disordered segments to their partner protein. While the segmented nature of binding is prominent in the case of long binding regions, to a smaller extent, it can also affect shorter binding regions. Indeed, around 20% of short disordered binding regions are predicted as 2 or 3 segments (Figure 5). This could also account for the significantly lower per residue efficiency compared to the segment based efficiency. By looking at further individual examples, one can already see remarkable variations in the details of disorder-to-order transitions even within the limited collection that is available today. The adopted conformation in these complexes can be quite different, both in terms of secondary or tertiary structure. Furthermore, the transition to an ordered structure might not be complete [28]. 
This could leave terminal residues or linker regions flexible and inaccessible to structure determination. It was also suggested that specific binding can be possible even without adopting a well-defined conformation, as in the case of the ζ-chain of T-cell receptor [80] (see Figure S6). Differences are also present at the level of the sequence. Some binding regions rely largely on hydrophobic or aromatic residues (MDM2 binding regions, Figure 1), others use proline-rich regions (WASp SH3 binding regions, Figure 7). Disordered binding regions can contain conserved linear motifs, while large divergence in sequence was noted in other cases (C-terminal domain of histones [81]). These examples represent multiple ways disordered regions can be utilized for binding. A single protein sequence can contain several distinct binding regions; a single region can also be involved in binding to multiple partners, or several regions can be used in combination to hub several interactions (p53 – see Figure 1 and Figure S2, WASp – see Figure 7). In an alternative scenario, disorder present in the partner molecules allows a well-folded protein to be bound by a large number of proteins (β-catenin [82], Figure S3). Even further variations are expected as the number of examples grows in the future. Nevertheless, the success of ANCHOR confirms our hypothesis that, despite these differences, disordered binding regions have a common property that predisposes them to coupled folding and binding.

The occurrence of disordered binding sites is clearly tied to the presence of disordered protein regions. Their relationship was further analyzed at the level of complete proteomes. Previous studies have shown that the amount of predicted disordered regions increases with the complexity of organisms throughout evolution and reaches a high level in multicellular organisms [5],[6].
This increase can be mostly attributed to the appearance of long, domain-sized segments of protein disorder or fully disordered proteins (Figure 9A). Our analysis showed that the amount of disordered binding segments increases in eukaryotes in a similar way; however, their fraction is elevated even compared to disordered regions in general (Figure 8). The observed trend is valid through a wide range of organisms, and occasional exceptions occur either due to adaptation to extreme habitat conditions, or as a result of endosymbiosis. These findings imply that the newly introduced disordered proteins and protein segments mainly serve as a carrier for new binding regions in eukaryotic organisms. The importance of disordered regions in protein-protein interactions is also supported by the increased ratio of disordered proteins among hub proteins [30],[31]. Disordered segments are often involved in complex signaling and regulatory processes [20] such as cell cycle control, gene regulation or signal transduction in the intracellular region of transmembrane proteins [83]. These processes rely on interactions involving multiple partners and on high specificity/low affinity interactions, which disordered binding segments can provide by their very nature. The disordered segments can harbor multiple binding sites which can act relatively independently. In other cases segmented binding sites can be involved in simultaneous binding to larger complexes. Overlapping binding sites (such as in the case of the p53 N- and C-terminal regions) suggest competition between binding partners. We are only beginning to comprehend how disordered binding regions are exploited to provide versatile interaction sites in proteins.

In conclusion, disordered binding regions represent a specific subclass of disordered proteins that can undergo a disorder-to-order transition upon binding. These binding sites generally have distinct properties both structurally and functionally.
Due to the inherent flexibility, these regions are difficult to study experimentally [84], making specific prediction methods even more valuable. While there are several methods available for prediction of disordered regions [85],[86], recognizing disordered binding sites was regarded as a more challenging problem [9] due to the limited number of well-characterized examples. In this work we report a general method to recognize disordered binding sites based on a basic biophysical model. Our method relies on a simple energy estimation procedure that was developed earlier for the IUPred disorder prediction method. This way, the problem of small datasets can be largely avoided. We showed that these regions can be characterized by highly disordered sequential neighborhood, unfavorable intrachain energies and more favorable interaction energies with a globular partner. The combination of these properties allowed the recognition of disordered binding sites independent of their secondary structure or amino acid composition, underlining the generality of the method. As such binding sites are essential functional elements of disordered proteins, their prediction directly provides information about functionally important residues in these proteins. In this way, ANCHOR broadens the repertoire of prediction methods for functional sites in proteins aiming to decrease the large number of unannotated sequences [87]. Generally, the complete understanding of protein-protein interactions involving disordered binding sites requires the knowledge of their partners as well as possible post-translational modifications that can influence their binding. While predictions can be made even without taking the partner molecule into account, certain cases might require incorporating the specific feature of the partner. Nevertheless, our method can provide the starting point for such scientific explorations, by finding potential regions involved in such binding.

Methods

Databases

The primary source of data for the present analysis is a carefully assembled dataset of binding regions undergoing disorder-to-order transition. The strict requirement of the experimental verification of both the disordered status in isolation and the formation of an ordered structure in complex distinguishes our dataset from a previously collected dataset for disordered binding regions [88]. The length of disordered regions involved in the binding can vary on a large scale. In the case of longer regions it is not guaranteed that each residue is equally important for binding, therefore complexes of short disordered regions were treated separately, and only these were used for tuning the method.

Short disordered complexes

Complexes from the PDB [89] were collected by scanning the chains in the PDB entries against the Disprot database [90]. A complex was accepted if it consisted of a chain with length between 10 and 30 residues that was found in the Disprot database as part of an annotated disordered segment and at least one interacting partner that was at least 40 residues long. Furthermore, complexes containing transmembrane proteins, RNA or DNA, chimeras, disulfide bonds between the disordered and ordered chains or a large number of unknown residues (marked with an X) were excluded. A few experimentally verified disordered complexes missing from Disprot were added to this set [42], [43], [62], [91]–[93]. A sequence similarity filter of 50% has also been applied to remove closely related proteins or protein segments. This procedure yielded a set of 46 complexes that are listed in Dataset S1.
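The selection criteria above can be read as a simple filter over candidate complexes. The following is a minimal sketch only, not the authors' pipeline: the `Complex` record, its field names and the `MAX_UNKNOWN_FRACTION` cutoff are hypothetical (the text says only "a large number of unknown residues"), and the 50% sequence similarity filter is not included.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Complex:
    """Hypothetical record for one candidate PDB complex (illustrative fields)."""
    disordered_len: int             # length of the chain annotated as disordered in Disprot
    partner_lens: List[int]         # lengths of the interacting partner chains
    has_transmembrane: bool = False
    has_nucleic_acid: bool = False  # RNA or DNA present in the complex
    is_chimera: bool = False
    has_interchain_disulfide: bool = False
    unknown_fraction: float = 0.0   # fraction of residues marked with an X


# Assumed cutoff; the source does not give a number.
MAX_UNKNOWN_FRACTION = 0.1


def is_short_disordered_complex(c: Complex) -> bool:
    """Apply the length and exclusion criteria for short disordered complexes."""
    if not 10 <= c.disordered_len <= 30:
        return False
    if not any(n >= 40 for n in c.partner_lens):
        return False
    excluded = (
        c.has_transmembrane
        or c.has_nucleic_acid
        or c.is_chimera
        or c.has_interchain_disulfide
        or c.unknown_fraction > MAX_UNKNOWN_FRACTION
    )
    return not excluded
```

For the long disordered complexes described in the next subsection, the same structure would apply with a minimum disordered chain length of 30 and a minimum partner length of 70.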

Long disordered complexes

Complexes containing long disordered chains were collected in the same fashion as short ones but with different criteria for the length of the interacting partners. Here the length of the disordered chains was required to be at least 30 residues and they had to have an interacting partner of 70 residues or more. The resulting set of 28 complexes is listed in Dataset S2.

α-MoRFs dataset

This dataset originally consisted of 53 complexes [48]. Complexes that were contained in our Short disordered complexes dataset as well were excluded in order to get a truly independent set. Three complexes were further removed from the remainder since one of them is part of the ribosome subunit S23 and the other two can be found in the PDB with structures containing only the disordered chain, that is, they are presumably capable of folding on their own. The rationale behind this exclusion is that our predictor is neither trained to recognize RNA/DNA-protein interactions nor to identify globular-globular interfaces. This left 40 complexes in total.

Globular proteins

Globular proteins were collected from PDB entries that had only one chain of at least 30 residues [53]. Transmembrane proteins and complexes with RNA/DNA were also filtered out. This dataset contains 553 proteins and is presented in Dataset S3.

Ordered complexes

This set contains protein complexes that consist of two partners both of which are ordered. These data were taken from the literature [43]. The dataset does not include cases of crystal packing dimers, chimeras and fragments and consists of 72 complexes (Dataset S4).

Disordered proteins

For the analysis of disordered proteins and protein segments, version 3.7 of the Disprot database was used (http://www.disprot.org/) [90], considering only annotated disordered segments of 10 residues or longer.

Parameter optimization

The optimal parameters were determined by three-fold cross-validation, by dividing both our negative and positive datasets (Globular proteins and Short disordered complexes, respectively) into three parts. In each turn we used two parts for training and the remaining part for testing. To avoid any bias, the different subsets were chosen such that the distribution of chain lengths in both the positive and negative sets and the distribution of secondary structure types in the positive set were approximately the same. Our approach relies on IUPred, a general disorder prediction method, and its energy predictor matrix. These parameters (i.e. the elements of the energy predictor matrix) have been determined earlier, independently of disordered binding regions. Only five additional parameters, w1, w2, p1, p2 and p3, were optimized for this specific problem and were selected by a grid search procedure. Specifically, w1 was varied in the range of 20 to 100 in steps of 10 (giving 9 possible values), w2 was varied in the range of 5 to 35 in steps of 2 (giving 16 possible values), and p1, p2 and p3 were selected from 1000 sets of randomly generated values. Taking into account that the prediction performance is insensitive to the norm and the sign of the vector corresponding to the p1, p2 and p3 values, the search was restricted to 1000 random sets that were evenly distributed on the surface of the upper half of the unit sphere. This means that p1 and p2 were randomly selected from the interval [−1;1] and p3 was selected from the interval [0;1] in a way that the sum of their squares is always equal to 1. This yielded 1000 different (p1, p2, p3) combinations. These, combined with all possible values of w1 and w2 (9 × 16 = 144 combinations), gave 144,000 different parameter sets in total. These were considered in order to select the optimal one, containing the five optimal parameters for each round of the cross-validation. 
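The size of the search space described above can be checked with a short sketch. It is illustrative only: the source does not say how the random directions were drawn, so normalizing Gaussian samples and reflecting them into the upper hemisphere is an assumption, and the function names and the seed are hypothetical.

```python
import itertools
import math
import random


def random_upper_hemisphere_point(rng):
    """Return (p1, p2, p3) uniformly distributed on the unit sphere with p3 >= 0.

    Assumed sampling scheme: normalize a 3D Gaussian sample, then reflect
    the third coordinate into the upper half of the sphere.
    """
    while True:
        x, y, z = (rng.gauss(0.0, 1.0) for _ in range(3))
        norm = math.sqrt(x * x + y * y + z * z)
        if norm > 0.0:
            return (x / norm, y / norm, abs(z) / norm)


def build_parameter_sets(n_directions=1000, seed=0):
    """Combine the two window-size grids with the random (p1, p2, p3) directions."""
    rng = random.Random(seed)
    w1_values = range(20, 101, 10)  # 20, 30, ..., 100: 9 values
    w2_values = range(5, 36, 2)     # 5, 7, ..., 35: 16 values
    directions = [random_upper_hemisphere_point(rng) for _ in range(n_directions)]
    return [
        (w1, w2, p1, p2, p3)
        for w1, w2, (p1, p2, p3) in itertools.product(w1_values, w2_values, directions)
    ]


parameter_sets = build_parameter_sets()
```

With the default settings, `len(parameter_sets)` equals 9 × 16 × 1000 = 144,000, matching the number of candidate parameter sets quoted in the text.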
To quantify the performance of the predictor for a given set of parameters, we calculated the True Positive Rate (TPR) at a False Positive Rate (FPR) fixed at 5%, with globular proteins as the negative set. However, a full characterization of the performance of the algorithm would also require a set of disordered proteins that are known not to bind to globular proteins. Unfortunately, such a dataset cannot be constructed, since there is hardly any way to prove that a protein does not contain binding sites. This problem was addressed by calculating the fraction of amino acids that are predicted as binding sites in general disordered regions of the Disprot database that are correctly recognized as disordered by IUPred. This fraction was denoted as F. Optimal parameters should combine high TPR with low F while keeping the FPR very low. During optimization of the algorithm, the performance on three different datasets needed to be monitored at the same time (set of globular proteins, set of disordered binding sites and Disprot). The best parameter set was chosen manually, by reducing the parameter set in a step-wise manner based on the following steps:

1. Calculate TPR (at fixed FPR = 5%) and F for each of the 144,000 candidate sets of parameters.
2. Discard all for which F>50%.
3. Discard all for which TPR<60%.
4. From the remainder, choose the 20 for which the difference between TPR and F is the largest.
5. Choose the one for which TPR is maximal (the TPR-F difference among these 20 sets varies only within a range of less than 0.02, so it is not a good measure for choosing the best one).

The negative and positive sets were divided into three parts, resulting in three different optimal parameter sets. The final predictor algorithm is constructed by averaging the outputs of the three corresponding predictors. 
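The step-wise selection described above can be sketched as follows. This is a minimal illustration, assuming that TPR (at FPR = 5%) and F have already been computed for every candidate; the function name and the dictionary keys `params`, `tpr` and `f` are hypothetical. The authors describe the choice as manual, so this only mirrors the stated thresholds.

```python
def select_best_parameter_set(candidates, max_f=0.50, min_tpr=0.60, shortlist_size=20):
    """Pick one parameter set following the stated filtering steps.

    candidates: list of dicts with keys 'params', 'tpr' and 'f', where 'tpr'
    is the true positive rate at FPR = 5% and 'f' is the fraction of Disprot
    disordered residues predicted as binding.
    Returns the chosen dict, or None if every candidate is discarded.
    """
    # Discard candidates with F > 50% or TPR < 60%.
    kept = [c for c in candidates if c["f"] <= max_f and c["tpr"] >= min_tpr]
    if not kept:
        return None
    # Keep the candidates with the largest TPR - F difference.
    shortlist = sorted(kept, key=lambda c: c["tpr"] - c["f"], reverse=True)[:shortlist_size]
    # Among the shortlist, choose the one with maximal TPR.
    return max(shortlist, key=lambda c: c["tpr"])
```

In the procedure described in the text, this selection is carried out once per cross-validation round, giving three parameter sets.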
As the training sets only contained binding regions of at least 10 amino acids and we aim to identify at least 5 residues of each region, all predicted binding sites were removed that did not exceed 5 consecutive residues. A schematic figure of the training procedure is given in Figure S1.
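The removal of short predicted sites can be illustrated with a small post-processing sketch. It assumes the per-residue prediction is available as a list of booleans, and it reads "did not exceed 5 consecutive residues" literally, so runs of 5 or fewer residues are cleared; both the function name and the `min_length=6` default are assumptions that follow from that reading.

```python
def remove_short_segments(is_binding, min_length=6):
    """Clear predicted binding runs shorter than min_length residues.

    is_binding: sequence of booleans, one per residue.
    Returns a new list in which every run of consecutive True values
    shorter than min_length has been set to False.
    """
    result = list(is_binding)
    n = len(result)
    start = 0
    while start < n:
        if not result[start]:
            start += 1
            continue
        # Find the end of the current run of predicted binding residues.
        end = start
        while end < n and result[end]:
            end += 1
        if end - start < min_length:
            for i in range(start, end):
                result[i] = False
        start = end
    return result
```

Setting `min_length=5` would instead keep runs of exactly 5 residues, which is the other plausible reading of the sentence.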

Availability

ANCHOR is available upon request from the authors.

Supporting Information

46 complexes of short disordered and long globular proteins. Column 4 contains the secondary structure type of the bound disordered chains based on the structure found in the PDB record as defined in Data and Methods. Thick lines separate the three groups used during parameter optimization. (0.07 MB DOC)

28 complexes of long disordered and long globular proteins. Column 4 contains the secondary structure type of the bound disordered chains based on the structure found in the PDB record as defined in Data and Methods. (0.05 MB DOC)

553 monomeric globular proteins that were used as a negative dataset [2]. Columns correspond to the grouping used during parameter optimization. (0.20 MB DOC)

72 complexes of ordered proteins [3]. The interaction is considered between the shortest chain and its interaction partners. (0.08 MB DOC)

The 53 complete archaea proteomes available from SwissProt (ftp://ftp.expasy.org/) used for full proteome scans. The fraction of total amino acids in disordered regions and the fraction of disordered amino acids in disordered binding sites are indicated together for each organism. (0.09 MB DOC)

The 639 complete bacteria proteomes available from SwissProt (ftp://ftp.expasy.org/) used for full proteome scans. The fraction of total amino acids in disordered regions and the fraction of disordered amino acids in disordered binding sites are indicated together for each organism. (0.86 MB DOC)

The 44 complete eukaryota proteomes available from SwissProt (ftp://ftp.expasy.org/) used for full proteome scans. The fraction of total amino acids in disordered regions and the fraction of disordered amino acids in disordered binding sites are indicated together for each organism. (0.08 MB DOC)

Development of ANCHOR. In the first step, our Short Disordered Binding Sites dataset and Globular Proteins dataset (positive and negative datasets) are split up and only 2/3 is used in the subsequent steps. Then a parameter set (w1, w2, p1, p2, p3) is selected from the 144,000 candidate sets. This parameter set is used to calculate S, Eint and Egain for every position in every sequence in the three input datasets using the fixed energy predictor matrix P (see Theory). Based on these calculations the evaluating measures are calculated: TPR is calculated on Short Disordered Binding Sites, FPR is calculated on Globular Proteins and F is calculated on Disordered Proteins. Based on these measures, the best parameter set out of 144,000 is chosen (see Data and Methods). Then this parameter set is evaluated on the remaining one third of the datasets. These results are reported in Table 3. This procedure is repeated for all three subsets of Short Disordered Binding Sites and Globular Proteins. The outputs of the three optimized predictors are combined into one final predictor by averaging. (0.05 MB PPT)

ANCHOR prediction output for the C-terminal domain of human p53. Prediction for the C-terminal disordered domain of human p53. The regulatory binding site around residues 375–390 is able to adopt all three secondary structural elements upon binding to globular partners [4]. (0.04 MB TIF)

ANCHOR prediction output for Tcf4. Prediction output for transcription factor Tcf4 (blue) together with the number of atomic contacts (green) determined in the complexed form with Beta-catenin (PDB ID: 2gl7 [5]). Beta-catenin is known to bind several disordered binding regions. (0.03 MB TIF)

ANCHOR prediction output for human calpastatin. Prediction output for the I. domain of human calpastatin. Subdomains A, B and C (grey boxes) are known to bind to calpain and inhibit it. Subdomains A and C bind via a preformed alpha-helix, while subdomain B does not exhibit strong structural preference in solution [6]. (0.04 MB TIF)

ANCHOR prediction output for the KID domain of CREB. Prediction output for the KID domain of CREB. The region marked with a grey box interacts with the KIX domain of CBP via two preformed alpha-helices [7]. (0.03 MB TIF)

ANCHOR prediction output for the ζ-chain of T-cell receptor. Prediction output for the zeta-chain of the T-cell receptor. The transmembrane region is marked with a red box and the three intracellular ITAM regions are marked with blue boxes. (0.12 MB TIF)

Protocol including references for the Supporting Information. (0.04 MB DOC)

93 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. PEVK domain of titin: an entropic spring with actin-binding properties.

Authors: Wolfgang A Linke; Michael Kulke; Hongbin Li; Setsuko Fujita-Becker; Ciprian Neagoe; Dietmar J Manstein; Mathias Gautel; Julio M Fernandez
Journal: J Struct Biol Date: 2002 Jan-Feb Impact factor: 2.867

3. Preformed structural elements feature in partner recognition by intrinsically unstructured proteins.

Authors: Monika Fuxreiter; István Simon; Peter Friedrich; Peter Tompa
Journal: J Mol Biol Date: 2004-05-14 Impact factor: 5.469

4. Molecular mechanisms for organizing the neuronal cytoskeleton.

Authors: Rajendrani Mukhopadhyay; Sanjay Kumar; Jan H Hoh
Journal: Bioessays Date: 2004-09 Impact factor: 4.345

Review 5. Prediction of protein disorder at the domain level.

Authors: Zsuzsanna Dosztányi; Márk Sándor; Peter Tompa; István Simon
Journal: Curr Protein Pept Sci Date: 2007-04 Impact factor: 3.272

6. Crystal structure of a beta-catenin/BCL9/Tcf4 complex.

Authors: James Sampietro; Caroline L Dahlberg; Uhn Soo Cho; Thomas R Hinds; David Kimelman; Wenqing Xu
Journal: Mol Cell Date: 2006-10-20 Impact factor: 17.970

Review 7. Molecular mechanisms of alpha-synuclein neurodegeneration.

Authors: Elisa A Waxman; Benoit I Giasson
Journal: Biochim Biophys Acta Date: 2008-10-09

Review 8. Intrinsically disordered proteins in human diseases: introducing the D2 concept.

Authors: Vladimir N Uversky; Christopher J Oldfield; A Keith Dunker
Journal: Annu Rev Biophys Date: 2008 Impact factor: 12.981

9. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life.

Authors: J J Ward; J S Sodhi; L J McGuffin; B F Buxton; D T Jones
Journal: J Mol Biol Date: 2004-03-26 Impact factor: 5.469

10. Metabolic complementarity and genomics of the dual bacterial symbiosis of sharpshooters.

Authors: Dongying Wu; Sean C Daugherty; Susan E Van Aken; Grace H Pai; Kisha L Watkins; Hoda Khouri; Luke J Tallon; Jennifer M Zaborsky; Helen E Dunbar; Phat L Tran; Nancy A Moran; Jonathan A Eisen
Journal: PLoS Biol Date: 2006-06 Impact factor: 8.029
