
Disordered patterns in clustered Protein Data Bank and in eukaryotic and bacterial proteomes.

Michail Yu Lobanov, Oxana V Galzitskaya.

Abstract

We have constructed the clustered Protein Data Bank and obtained clusters of chains of different identity inside each cluster, http://bioinfo.protres.ru/st_pdb/. We have compiled the largest database of disordered patterns (141) from the clustered PDB where identity between chains inside a cluster is larger than or equal to 75% (version of 28 June 2010) by using simple rules of selection. The results of these analyses would help to further our understanding of the physicochemical and structural determinants of intrinsically disordered regions that serve as molecular recognition elements. We have analyzed the occurrence of the selected patterns in 97 eukaryotic and in 26 bacterial proteomes. The disordered patterns appear more often in eukaryotic than in bacterial proteomes. The matrix of correlation coefficients between numbers of proteins where a disordered pattern from the library of 141 disordered patterns appears at least once in 9 kingdoms of eukaryota and 5 phyla of bacteria has been calculated. As a rule, the correlation coefficients are higher within a kingdom than between kingdoms. The patterns with the most frequent occurrence in proteomes have low complexity (PPPPP, GGGGG, EEEED, HHHH, KKKKK, SSTSS, QQQQQP), and the types of patterns vary across different proteomes, http://bioinfo.protres.ru/fp/search_new_pattern.html.

Year:  2011        PMID: 22073276      PMCID: PMC3208572          DOI: 10.1371/journal.pone.0027142

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Intrinsically disordered regions serve as molecular recognition elements, and play an important role in the control of many cellular processes and signaling pathways [1]–[6]. It is useful to be able to predict positions of disordered regions in protein chains. Prediction methods are aimed at identifying disordered regions through the analysis of amino acid sequences using mainly the physicochemical properties of amino acids [7]–[16] or evolutionary conservation [17]–[20]. Many examples of proteins with intrinsically disordered regions which exhibit coupling between folding and binding have been described in the literature [4]–[6], [21]–[23]. Nevertheless, the universality of this phenomenon and functional importance of many disordered regions remain unclear. A database of continuous protein fragments (Molecular Recognition Features or MORFs) was compiled from the Protein Data Bank which includes short protein chains (with fewer than 70 residues) bound to larger proteins [24], [25]. It has been argued that MORFs participate in the coupling of binding and folding, a hypothesis that was supported by the analysis of the composition and predicted disorder of MORF segments. As a result of studying the subtle structural differences of the same proteins in bound (Complex) and unbound (Single) states in relation to their intrinsic disorder the database of protein structures (ComSin) has been constructed [26]. Recently several computational tools for identifying Linear motifs [27] and minimotifs in protein-protein interactions [28] have been published. Linear motifs are short segments of multidomain proteins that provide regulatory functions independently of protein tertiary structure [27] but minimotifs are short functional peptide sequences obtained after analysis of known protein-protein interactions [28]. 
Low-complexity regions attract our attention because they are regions of a protein in which a particular amino acid, or a small number of different amino acids, is enriched. Single amino acid repeats (homorepeats) belong to these regions. Homorepeats play important roles in some biological processes [29] and may play a more important role in human diseases than was previously recognized. In the current study we search for sequence patterns consisting of a number of consecutive residues along the polypeptide chain that are nearly always associated with disordered segments. It has been found that two types of patterns appear to be recurrent: a proline-rich pattern and a positively or negatively charged pattern [30]. It should be noted that the old and new versions of our libraries include patterns enriched in proline and charged residues [31]. The statistical analysis of disordered residues was done considering 34 464 unique protein chains taken from the PDB. In this database, 4.95% of residues are disordered (i.e. invisible in X-ray structures) [31]. The statistics were obtained separately for the N- and C-termini as well as for the central part of the protein chain. It has been shown that the frequencies of occurrence of disordered residues of the 20 types at the termini of protein chains differ from those in the middle part of the protein chain [31], [32]. It is necessary to construct a clustered PDB because this simplifies the filtering of protein structures during their analysis and the search for general structural characteristics among non-identical proteins; it is also important for the analysis of up-to-date data. In this work we constructed a clustered PDB and used clusters of protein chains where identity between chains inside the cluster is at least 75% (version of 28 June 2010).
Combining motif discovery with the identification of disordered protein segments in the clustered PDB allows us to create the largest library of disordered patterns [31]. At present the library includes 141 disordered patterns. Such an approach is new and promising for further studying and understanding the functional role of the obtained patterns in different proteomes. Taking the library of disordered patterns into account should help to improve the accuracy of predicting whether residues inside a given region are structured or unstructured. The previous version of the library includes 109 disordered patterns and has restrictions on the minimal length of the patterns. Using simpler rules without restriction on the pattern length, and a clustered PDB of the same version, we constructed the largest library of disordered patterns. The patterns occur more often as short fragments: patterns four to six residues long are the most frequent (105 out of 141) among the disordered patterns of the library. It should be noted that six-residue patches affect the folding/aggregation features of proteins, and they are important “words” for the understanding of protein dynamics [33]. Moreover, nucleation sites are constrained by patches of approximately six residues [34], [35]. There is evidence that the minimum length necessary for a peptide to elicit an allergenic response and molecular mimicry (a patch of a protein eliciting an immune response equivalent to the entire protein) is about six [36]. All these facts suggest the existence of a fragment of biologically meaningful information located along approximately six residues [33]. Proteome-wide calculations place our work in a larger, evolutionary frame.
In this paper we are interested in the occurrence of the 141 disordered patterns in 97 eukaryotic proteomes, since eukaryotic proteomes include more disordered regions than other proteomes [17], [37], [38], and, for comparison, in 26 bacterial proteomes. A comparative analysis of the number of proteins containing the 141 selected disordered patterns in these proteomes has been performed. The disordered patterns with the most frequent occurrence in eukaryotic and bacterial proteomes have low complexity. It should be noted that each proteome has a specific set of disordered patterns, and this results in different correlation coefficients between numbers of proteins where a disordered pattern appears at least once. After analyzing the occurrence of disordered patterns in 123 proteomes, we observed that the correlation coefficient is higher within a kingdom or a phylum than across kingdoms or phyla. The disordered patterns appear more often in eukaryotic than in bacterial proteomes. One can suggest that such short similar motifs are responsible for common functions in nonhomologous, unrelated proteins from different organisms.

Materials and Methods

Construction of clustered PDB

We have considered all protein structures determined by X-ray analysis with a resolution better than 3 Å and a protein size of at least 40 amino acid residues, published in the PDB (version of June 28, 2010); these structures contain 116 997 protein chains (51 048 PDB entries). At the first step these 116 997 chains were divided into 34 464 classes, which we call clusters with 100% identity. This means that chains from the same cluster have the same amino acid sequence, whereas sequences of chains from different clusters differ in at least one position. In total these 34 464 different sequences contain 9 085 893 residues. At the second step we created clusters of chains with identity inside each cluster ≥75%. Identity is calculated from I, the number of identical residues, and L1 and L2, the numbers of amino acid residues in the two proteins considered. For the calculation of identity we used BLAST with default parameters [39]. First, the pair of chains with maximal identity was combined, then another pair of chains, or a chain and a cluster, again with maximal identity, etc. When a chain was combined with a cluster, or two clusters were combined, the average identity of the resulting group was considered. If the identity of at least one pair of chains from different clusters was less than 75%, the clusters were not combined. The procedure was repeated as long as there were clusters that could be combined. At this second step of grouping we obtained 18 775 clusters of chains with identity inside each cluster ≥75% (clusters C75). Then the clusters C75 were combined into clusters with identity ≥50%, etc. Figure 1 shows the dependence of the number of clusters on the identity between chains inside the cluster. Below we consider the identity of 75% because the general grouping occurs below 90% identity.
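The merging procedure described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `cluster_chains` is hypothetical, and the pairwise identities (obtained with BLAST in the paper) are assumed to be supplied by the caller as a symmetric matrix of values between 0 and 1.

```python
from itertools import combinations

def cluster_chains(identity, threshold=0.75):
    """Greedy agglomerative clustering sketch (hypothetical helper).

    identity: symmetric matrix (list of lists) of pairwise identities in [0, 1].
    Two clusters may merge only if every cross-cluster pair has
    identity >= threshold; among the allowed merges, the pair of clusters
    with the highest average identity is merged first.
    """
    clusters = [[i] for i in range(len(identity))]
    while True:
        best = None  # (average identity, index a, index b)
        for a, b in combinations(range(len(clusters)), 2):
            pairs = [identity[i][j] for i in clusters[a] for j in clusters[b]]
            if min(pairs) < threshold:
                continue  # at least one pair of chains is too dissimilar
            avg = sum(pairs) / len(pairs)
            if best is None or avg > best[0]:
                best = (avg, a, b)
        if best is None:
            return clusters  # nothing left that can be combined
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
```

With three chains whose identities are 0.9 (chains 0 and 1), 0.8 (chains 1 and 2) and 0.7 (chains 0 and 2), chains 0 and 1 are merged first, and chain 2 stays separate because one of its pairs falls below the threshold.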
Figure 1

Dependence of the number of clusters on identity between protein chains.

Inside each cluster at the given identity between chains the identity is larger than the considered identity between clusters.

Construction of the library of disordered patterns

Among the 116 997 chains, approximately 4.5% of residues are disordered, i.e. are not resolved by X-ray analysis. To reveal such residues, we compared (for each protein chain) the SEQRES and ATOM records in the corresponding PDB file. Residues that were present in the SEQRES record but whose coordinates were absent from the ATOM record (namely, the coordinates of the Cα atom were absent) were considered disordered. Below we consider only clusters with ≥75% identity between any pair of chains inside each cluster because the general grouping occurs below 90% identity. At this level of identity we have created the Clustered Disordered Residues Data Base (CDRDB), whose elements are 18 775 clusters of protein chains. Figure 2 illustrates two clusters with 100% identity combined into one cluster with 75% identity. One can see that the sequences from the two clusters differ at one position, 110, where serine is replaced by cysteine. The weight of each chain from any cluster is calculated as $w = 1/(N \cdot M)$, where N is the number of chains in the chain's cluster with 100% identity and M is the number of clusters with 100% identity inside the cluster with 75% identity. In the example of Figure 2, M = 2, so a chain from the first cluster has the weight $1/(2N_1)$ and a chain from the second cluster has the weight $1/(2N_2)$, where $N_1$ and $N_2$ are the numbers of chains in the two clusters. It should be noted that the sum of weights inside one cluster with 75% identity is therefore equal to one. We take the weight of a residue to be the same as the weight of its chain, because at the level of 75% identity a cluster may include protein chains of different lengths.
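The weighting scheme can be illustrated with a short sketch. It assumes that the weight of a chain is 1/(N·M), with N the number of chains in its 100%-identity cluster and M read as the number of 100%-identity clusters inside the 75%-identity cluster; this reading is an assumption chosen because it makes the weights inside one 75%-identity cluster sum to one, as the text requires. The function name and the chain identifiers are hypothetical.

```python
def chain_weights(c75_cluster):
    """Return {chain_id: weight} for one cluster with 75% identity.

    c75_cluster: list of 100%-identity clusters, each a list of chain ids.
    Assumed rule: weight = 1 / (N * M), where N is the size of the
    100%-identity cluster and M is the number of such clusters.
    """
    m = len(c75_cluster)
    weights = {}
    for c100 in c75_cluster:
        n = len(c100)
        for chain in c100:
            weights[chain] = 1.0 / (n * m)
    return weights
```

For a 75%-identity cluster made of one 100%-identity cluster with three chains and another with a single chain, the three identical chains each get 1/6 and the single chain gets 1/2, so every distinct sequence contributes equally.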
Figure 2

Example of two clusters with 100% identity combined in one cluster with 75% identity.

The sequences from the two clusters differ at only one position (110), where serine is replaced by cysteine. U denotes disordered residues in the chain and a dash denotes ordered residues.

Our goal is to create a database of disordered patterns, i.e. amino acid sequences that are likely to be found in disordered parts of protein chains, using the CDRDB and applying simpler rules for the creation of the library than in our previous work [31]. Let P be a protein chain and A be a pattern of length L. The database was compiled using a two-stage procedure. At the first stage, we created a list of candidate patterns. To be a candidate, a pattern should be disordered in half of the cases among the chains from a cluster with 100% identity. Such patterns were placed in the candidate list; 855 775 candidate disordered patterns were gathered. We say that pattern A matches chain P at position s if the following conditions hold: the two residues at each end of the pattern coincide with the corresponding residues of the chain, and substitutions occur at no more than L/5 positions r in the middle of the pattern, i.e. positions at which the residue of the chain differs from the residue of the pattern. This means that for patterns with a length of L≤5 no change is possible, for 5<L≤10 one change is possible, etc. If the distance between the edge of the pattern and the end of the chain is less than 40 residues, the pattern is considered to match these residues as well. The pattern length is not limited in this paper. Below we use the following terminology: Nu is the sum of weights (w) of disordered residues matched by the pattern; Nf is the sum of weights (w) of ordered residues matched by the pattern; Cu is the number of clusters with identity 75% (C75) in which Nu>Nf; Cf is the number of clusters with identity 75% (C75) in which Nu≤Nf. Protein P has an occurrence of pattern A if A matches P at some position s. A fragment A of chain P is considered a candidate disordered pattern if it meets conditions C1, C2, and C3. There are 16 918 patterns meeting conditions C1, C2, and C3.
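The matching rule can be sketched in code. This is a minimal illustration under assumptions: the number of allowed substitutions is taken as (L − 1) // 5, which gives no change for L ≤ 5 and one change for L = 6, in agreement with the 39 compatible sequences quoted for n = 6 in the statistics section; the extension of a match to chain termini closer than 40 residues is not modeled. The function names are hypothetical.

```python
def max_substitutions(length):
    # Assumption: 0 changes for L <= 5, 1 for 6..10, 2 for 11..15, and so on.
    return (length - 1) // 5

def matches(pattern, chain, s):
    """True if pattern matches chain at 0-based position s."""
    L = len(pattern)
    fragment = chain[s:s + L]
    if len(fragment) < L:
        return False  # pattern would run past the end of the chain
    # The two residues at each end must coincide exactly.
    if fragment[:2] != pattern[:2] or fragment[-2:] != pattern[-2:]:
        return False
    # Count substitutions in the middle part of the pattern.
    mismatches = sum(a != b for a, b in zip(pattern[2:-2], fragment[2:-2]))
    return mismatches <= max_substitutions(L)
```

For example, ENLYFQ matches ENAYFQ (one substitution in the middle of a six-residue pattern), whereas PPPPP does not match PPAPP because five-residue patterns allow no change.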
The longest pattern has a length of 45 amino acid residues (HHHHHHSSGLVPRGSGMKETAAAKFERQHMDSPDLGTDDDDKAMA), and the shortest pattern has 2 residues (HH). In the next step we selected disordered patterns from the candidate list using the following iterative greedy procedure. From the 16 918 patterns we chose the pattern with the maximal value D = Nu−Nf. Then for the remaining patterns the values of Nu, Nf, Cu, and Cf were recalculated without taking into account the residues matched by the first pattern. All remaining patterns were again checked against conditions C1, C2, and C3, and among those meeting the conditions the pattern with the maximal D value was chosen. The procedure was repeated until no patterns met conditions C1, C2, and C3; it stopped after 390 patterns had been selected (D>0). Finally, we were interested in the patterns for which D≥10 and D≥25 (the value 25 corresponds to the summation of weights of 5 whole disordered patterns with 5 residues in length in 5 clusters without neighboring regions, or terminal occurrence). The numbers of such patterns are 249 and 141, respectively (see Dataset S1). The lengths of the patterns lie in the range 4≤L≤24. Below we consider only the set of patterns meeting the condition D≥25.
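The iterative greedy selection can be sketched as follows. This is a simplified illustration, not the procedure used to build the library: conditions C1, C2, and C3 are not checked, the only stopping rule is that the best remaining D is not positive, and the data layout (`candidates`, `residues`) is hypothetical.

```python
def greedy_select(candidates, residues, min_d=0.0):
    """Greedy selection by D = Nu - Nf (sketch).

    candidates: {pattern: set of residue ids matched by the pattern}
    residues:   {residue id: (weight, is_disordered)}
    Returns a list of (pattern, D) in the order of selection. After each
    selection, residues already matched by a chosen pattern are ignored
    when D is recalculated for the remaining patterns.
    """
    covered = set()
    selected = []
    remaining = dict(candidates)

    def score(pattern):
        d = 0.0
        for r in remaining[pattern] - covered:
            weight, disordered = residues[r]
            d += weight if disordered else -weight
        return d

    while remaining:
        best = max(remaining, key=score)
        d = score(best)
        if d <= min_d:
            break  # no remaining pattern has a positive D
        selected.append((best, d))
        covered |= remaining.pop(best)
    return selected
```

Recomputing D on the uncovered residues only is what keeps two strongly overlapping patterns from both being selected for the same disordered segment.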

Significance of disordered occurrences

We have studied the statistical significance of the selected patterns from two points of view. First, we were interested in whether disordered fragments are overrepresented among the occurrences of each pattern, and, second, whether the patterns are overrepresented in the database. Each feature is described with its own Z-score. To estimate the significance of the number of disordered occurrences of a pattern we implemented the following procedure. First, we determined the fraction of disordered fragments among all fragments of the given length, taking into account the weight of the disordered residues in each case: $p = \sum_{i=1}^{N} w_i \sum_{j=1}^{L_i-n+1} \delta_{ij} \Big/ \sum_{i=1}^{N} w_i (L_i-n+1)$, where N is the number of chains in the CDRDB, $L_i$ is the length of the considered chain, n is the fragment length, $w_i$ is the chain weight, and $\delta_{ij}$ is equal to 1 if the fragment at position j, with its adjoining regions, is disordered at more than half of its positions, and 0 otherwise. For each pattern we know the number of clusters Cu where this pattern is disordered in more than half of the cases, and also the number of clusters Cf where this pattern is folded in more than half of the cases (see Dataset S1, columns J and K). We then calculate the probability P(Y) that the number of successes is larger than or equal to Cu at the given number of trials Y = Cu+Cf: $P(Y) = \sum_{k=C_u}^{Y} \binom{Y}{k} p^k (1-p)^{Y-k}$, where p is the probability of success in one trial (see above). The significance of disordered occurrences is estimated with the corresponding Z-score.
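The probability that at least Cu of Y = Cu + Cf clusters are disordered is an upper binomial tail, which can be computed directly. The sketch below is only an illustration: the function name is hypothetical, and the per-trial probability p has to be supplied by the caller because its value is not quoted in the text.

```python
from math import comb

def binomial_tail(successes, trials, p):
    """Probability of at least `successes` successes in `trials`
    independent trials, each with success probability p."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )
```

For the least favorable case mentioned in the Results (Cu = 5, Cf = 4), the call would be `binomial_tail(5, 9, p)` with p taken from the weighted fraction of disordered fragments of the corresponding length.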

Statistical significance of the observed number of occurrences of pattern X in proteomes

The probability of finding a pattern with possible changes is equal to the sum of probabilities over all sequences compatible with the given pattern: $p(X) = \sum_{i} \prod_{j=1}^{n} p_{a_{ij}}$, where the sum runs over the sequences compatible with the given one (see the rules of coincidence; for example, there are 39 such sequences for n = 6), $a_{ij}$ is the amino acid at position j of the i-th compatible sequence, p(X) is the probability that pattern X occurs at a given position in a sequence, and $p_a$ is the probability of occurrence of amino acid a in the considered proteome. We calculated the probability p(X, N) that pattern X with n amino acid residues occurs in a sequence of length N: $p(X, N) = 1 - (1 - p(X))^{N-n+1}$. The probability distribution on protein sequences is assumed to be binomial. The statistical significance of pattern X is estimated with the Z-score $Z = \frac{S - R\,p(X,N)}{\sqrt{R\,p(X,N)\,(1-p(X,N))}}$, where S is the number of sequences containing at least one occurrence of pattern X, R is the number of proteins in the considered proteome, and N is the average length of proteins in the considered proteome.
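The calculation described above can be sketched numerically. Several points are assumptions rather than statements from the text: the sketch handles at most one substitution (sufficient for patterns up to 10 residues under the assumed matching rule), it treats start positions as independent so that p(X, N) = 1 − (1 − p(X))^(N−n+1), and it uses the usual binomial Z-score with R trials. All function names are hypothetical.

```python
from math import prod, sqrt

def pattern_probability(pattern, freqs, allow_one_substitution):
    """Sum of probabilities of all sequences compatible with the pattern.

    freqs: {amino acid: frequency in the proteome}.
    The two residues at each end are fixed; optionally one middle
    position may be replaced by any other amino acid.
    """
    L = len(pattern)
    exact = prod(freqs[a] for a in pattern)
    total = exact
    if allow_one_substitution:
        for i in range(2, L - 2):
            f = freqs[pattern[i]]
            total += exact / f * (1 - f)  # position i replaced by another residue
    return total

def occurrence_probability(p_x, n, N):
    # Assumed form: at least one hit among N - n + 1 start positions.
    return 1 - (1 - p_x) ** (N - n + 1)

def z_score(S, R, p):
    # Binomial Z-score: observed S versus expectation R * p.
    return (S - R * p) / sqrt(R * p * (1 - p))
```

With uniform amino acid frequencies of 1/20, a six-residue pattern gets 39 times the probability of its exact sequence, which matches the count of compatible sequences quoted for n = 6.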

Statistical significance of the observed number of occurrences of pattern X in two different proteomes

Let $n_i$ and $n_j$ be the numbers of proteins with the given pattern X in proteomes i and j, and let $N_i$ and $N_j$ be the total numbers of proteins in the two proteomes, so that $n_i/N_i$ and $n_j/N_j$ are the frequencies of proteins with the given pattern. $L_i$ and $L_j$ are the average lengths of proteins in the considered proteomes i and j. The scoring function is a Z-score for the difference between the two frequencies, expressed in units of its standard deviation. We consider the difference significant if the absolute value of its Z-score exceeds 3 or 5. These values correspond to the probabilities 3×10^−3 and 6×10^−7, respectively. The correlation coefficient (r) was calculated using the equation $r = \mathrm{cov}(x, y)/(S_x S_y)$, where cov(x, y) is the covariance of x and y, and $S_x$ and $S_y$ are the standard deviations of variables x and y.
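The correlation coefficient between two proteomes can be computed with the standard Pearson formula. In the sketch below, x and y would be vectors with one entry per pattern (the number of proteins in each proteome where the pattern occurs at least once); the function name is hypothetical and the standard Pearson definition is assumed.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sx * sy)  # normalization factors cancel in the ratio
```

Applying this function to every pair of the 123 proteomes would give the matrix of correlation coefficients discussed in the text.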

Database of proteomes

We considered 3279 proteomes from the EBI site (ftp://ftp.ebi.ac.uk/pub/databases/SPproteomes/uniprot/proteomes/). Since the patterns with the frequent occurrence in proteomes have low complexity we did a preliminary analysis. The analysis showed that the number of proteins with at least one occurrence of homorepeats of 6 residues long is less than 500 for proteomes with an overall number of residues below 2500000. Even so, only 22 proteomes out of 3156 have more than 100 proteins with at least one occurrence of 6-residue homorepeats. The data gave grounds for our research involving only proteomes with an overall number of residues exceeding 2500000. We obtained 123 proteomes taking into account the length of proteomes representing 9 kingdoms of eukaryotes and 5 phyla of bacteria (see Table 1 and Dataset S2). Unfortunately, only three kingdoms of eukaryotes (Metazoa, Viridiplantae, and Fungi) are given at http://www.ncbi.nlm.nih.gov/Taxonomy/. In other cases, the rank of kingdom is missing. In such situations, we chose the highest taxonomic category proceeding from the subkingdom of eukaryotes instead of the kingdom. We chose 97 out of 120 eukaryotic proteomes, and a small number of bacterial proteomes. The smallest eukaryotic proteome belongs to Hemiselmis andersenii, class Cryptophyta. It is evident that 498 proteins with an overall number of 167452 of amino acid residues are not sufficient for reliable statistics. Historically, the superkingdom of bacteria is divided into phyla but not kingdoms. We preferred to consider such phyla separately.
Table 1

Names of 97 eukaryotic and 26 bacterial proteomes.

Eukaryota (group) | Eukaryota (proteome) | Eukaryota (Fungi) | Bacteria*** (phylum) | Bacteria (proteome)
Metazoa | 25.H_sapiens | 34310.A_capsulata_ATCC_26029 | Acidobacteria | 25797.S_usitatus
 | 22974.B_taurus | 34967.A_capsulata_H143 | Actinobacteria | 37022.A_mediterranei
 | 59.M_musculus | 34495.A_dermatitidis_SLH14081 | | 33926.C_acidiphila
 | 122.R_norvegicus | 34498.A_dermatitidis_ER-3 | | 35278.Frankia_sp_EuI1c
 | 21457.G_gallus | 35919.A_benhamiae | | 35534.F_sp
 | 20721.D_rerio | 29154.A_clavatus | | 74443.K_setae
 | 22388.T_nigroviridis | 33020.A_flavus | | 33113.R_opacus
 | 17.D_melanogaster | 22118.A_fumigatus_FGSC_A1100 | | 25456.Rhodococcus_sp
 | 25396.D_pseudoobscura | 31018.A_fumigatus_CEA10 | | 131.S_avermitilis
 | 31436.A_aegypti | 29130.A_niger | | 36666.S_bingchenggensis
 | 78607.A_darlingi | 23077.A_oryzae | | 84.S_coelicolor
 | 22426.A_gambiae | 28239.A_terreus | | 34910.S_scabies
 | 21633.C_briggsae | 30100.B_fuckeliana | | 35554.S_sp_ACT-1
 | 9.C_elegans | 22024.C_albicans_SC5314 | | 58962.S_violaceusniger
 | 64800.L_loa | 32738.C_dubliniensis | | 34011.S_roseum
 | 79720.T_spiralis | 19665.C_glabrata | Proteobacteria | 112.B_japonicum
 | 30565.N_vectensis | 34491.C_tropicalis | | 22343.Burkholderia_sp_ATCC_17760
Viridiplantae | 23214.O_sativa | 25585.C_globosum_IFO_6347 | | 25388.B_xenovorans
 | 3.A_thaliana | 34493.C_lusitaniae | | 33223.H_ochraceum
 | 33157.Micromonas_sp | 34218.C_posadasii | | 23351.M_xanthus
 | 29351.O_lucimarinus | 79902.C_graminicola | | 32044.P_pacifica
 | 25972.O_tauri | 20018.D_hansenii | | 30295.S_cellulosum
Stramenopiles* | 35109.E_siliculosus | 34482.L_thermotolerans | | 33616.S_aurantiaca
Choanoflagellida** | 30562.M_brevicollis | 29447.L_elongisporus | Bacteroidetes | 33930.C_pinensis
Euglenozoa* | 83400.L_braziliensis | 22028.M_oryzae | | 32144.M_marina
 | 83363.L_infantum | 34471.N_otae | Chloroflexi | 36622.K_racemifer
 | 71330.T_brucei | 34970.N_haematococca | |
 | 33602.T_cruzi | 29157.N_fischeri | |
Alveolata* | 32114.P_berghei | 22025.N_crassa | |
 | 31998.P_chabaudi | 34307.P_brasiliensis_Pb03 | |
 | 493.P_falciparum | 34389.P_brasiliensis_Pb18 | |
 | 31342.P_knowlesi | 34392.P_brasiliensis_ATCC_MYA-826 | |
 | 31632.P_vivax | 31898.P_chrysogenum | |
 | 21631.P_yoelii | 32999.P_marneffei | |
Amoebozoa* | 21395.D_discoideum | 25591.P_nodorum | |
 | 35301.P_pallidum | 29448.P_guilliermondii | |
Diplomonadida* | 33600.G_intestinalis_ATCC_50803 | 28727.P_stipitis | |
 | 35295.G_intestinalis_ATCC_50581 | 79908.P_graminis | |
 | 65115.G_intestinalis | 79905.P_teres | |
 | | 30091.S_cerevisiae_YJM789 | |
 | | 31651.S_cerevisiae_RM11-1a | |
 | | 34506.S_cerevisiae_JAY291 | |
 | | 35062.S_cerevisiae_Lalvin_EC1118 | |
 | | 71242.S_cerevisiae | |
 | | 30103.S_sclerotiorum | |
 | | 35280.S_macrospora | |
 | | 33056.T_stipitatus | |
 | | 35921.T_verrucosum | |
 | | 34386.U_reesii | |
 | | 30097.V_polyspora | |
 | | 35359.V_albo-atrum | |
 | | 20011.Y_lipolytica | |
 | | 31020.C_cinerea | |
 | | 20846.C_neoformans_JEC21 | |
 | | 21380.C_neoformans_B-3501A | |
 | | 31023.L_bicolor | |
 | | 33031.P_placenta | |
 | | 22029.U_maydis | |

*Category without rank is given.

**The name of order is given because the highest ranks are missing in the taxonomic description.

***The superkingdom of bacteria is divided in phyla rather than kingdoms.

Among the 97 eukaryotic proteomes, 17 belong to the kingdom Metazoa (animals): Homo sapiens (51778 protein sequences), Bos taurus (18405), Mus musculus (42120), Rattus norvegicus (28166), Gallus gallus (12954), Danio rerio (21576), and Tetraodon nigroviridis (27836) belong to the phylum Chordata; Drosophila melanogaster (15101), Drosophila pseudoobscura (16000), Aedes aegypti (16042), Anopheles darlingi (11437), and Anopheles gambiae (12455) to arthropods; Caenorhabditis briggsae (18531), Caenorhabditis elegans (23817), Loa loa (16271), and Trichinella spiralis (16040) to nematodes; and Nematostella vectensis (24435) belongs to the phylum Cnidaria.

Results and Discussion

Library of disordered patterns

Following the procedure described in the Materials and Methods section, we constructed the clustered PDB (CDRDB) at the 75% identity level (http://bioinfo.protres.ru/st_pdb/) and obtained a library of disordered patterns. The dataset includes 141 patterns (see Dataset S1). Figure 3 shows the distribution of the patterns according to their lengths. The patterns occur more frequently as short fragments (105 out of 141 are patterns 4–6 residues long). The longest pattern satisfying the condition D≥25 consists of 17 amino acid residues (HHHHHHSSGLEVLFQGP). It is interesting that the strong pattern is HHHH, and not HHHHHH as in the previous version of the library [31]. We suggest that residues matched by these patterns will be disordered in new protein chains because more than half of the residues in these patterns are disordered (see conditions C2 and C3 in the Materials and Methods section).
Figure 3

Dependence of the number of patterns on the length (number of amino acid residues).

The statistical significance of disordered occurrences in the selected patterns was estimated with the Z-score (see Materials and Methods). We calculated the probability that the number of successes is larger than or equal to Cu at the given number of trials Cu+Cf (for each pattern we know the number of clusters Cu where this pattern is disordered in more than half of the cases, and also the number of clusters Cf where this pattern is ordered in more than half of the cases). This probability is less than 7×10^−5 for all 141 disordered patterns, and all 141 patterns have a Z-score corresponding to this P-value, which is in good agreement with the procedure used to determine the disordered patterns. The worst variant is Cu = 5, Cf = 4 with a pattern length of 6. We have four such cases: SVAESS, ASIGQA, PPSGSP, and DSDVSL (see Dataset S1, columns O and P).

Comparison of the new and the previous libraries of disordered patterns

After construction of the new library, the question arises of how similar the two databases (previous and new) are. For this purpose the previous patterns were matched against the clustered PDB (CDRDB) and the sum of weights was calculated in the same way as for the new patterns. Then we calculated the sum of weights for residues matching both the previous and the new patterns (intersections, I12). The number of clusters with identity of 75% in which both new and previous patterns occurred was calculated, as well as the number of intersections. The degree of coincidence was calculated using equations (13) and (14): $F_1 = I_{12}/(N_1 + N_2 - I_{12})$ and $F_2 = I_{12}/\min(N_1, N_2)$, where I12 is the sum of weights for intersections (coincidences), and N1 and N2 are the weights of the single patterns. We considered only pairs where F2>0.1, F2(C75)>0.1, I12≥3, and the number of clusters where the two patterns appear together, C12≥3 (see Dataset S3). The measure F1 indicates the coincidence between the two considered patterns, whereas the measure F2 shows the level of inclusion of the pattern with the smaller N into the one with the larger N. A large difference between N1 and N2 results in a large difference between F1 and F2. For example, the sequence GSSHHHHHHSSGLVPRGSHM occurs in 393 clusters at the N-terminus, where it is more than half disordered in 387 clusters. This sequence is matched by pattern GSHM, and its beginning is matched by the HHHH pattern. If we have a test database with one protein containing this sequence at the N-terminus, then N_GSHM = 20 (N is the weight of a pure pattern together with the neighboring part, in this case the length of the whole N-terminal fragment), N_HHHH = 9, I12 = 9, F1 = 9/20 = 0.45, and F2 = 9/9 = 1. In the real situation, in the whole CDRDB, N_HHHH = 29 560.4, N_GSHM = 8 452.0, I12 = 3 163.1, F1 = 0.09, and F2 = 0.37. It should be noted that the real F2 is less than the test F2. This occurs because GSHM usually appears in the sequence GSSHHHHHHSSGLVPRGSHM or in similar sequences, yet sometimes GSHM appears alone.
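Both worked examples in the paragraph above are reproduced by F1 = I12/(N1 + N2 − I12) and F2 = I12/min(N1, N2). The sketch below assumes these forms, which are inferred from the quoted numbers rather than taken from a separate specification; the function name is hypothetical.

```python
def overlap_measures(n1, n2, i12):
    """Return (F1, F2) for two patterns with weights n1 and n2
    and an intersection weight i12 (assumed definitions)."""
    f1 = i12 / (n1 + n2 - i12)   # overlap relative to the union of covered residues
    f2 = i12 / min(n1, n2)       # inclusion of the smaller pattern in the larger one
    return f1, f2

# Test database with a single N-terminal fragment (values from the text).
print(overlap_measures(20, 9, 9))

# Whole CDRDB (values from the text), rounded to two decimals.
f1, f2 = overlap_measures(29560.4, 8452.0, 3163.1)
print(round(f1, 2), round(f2, 2))
```

The union in the denominator of F1 is what makes it drop sharply when one pattern covers far more residues than the other, while F2 stays high as long as the smaller pattern lies mostly inside the larger one.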
The result of the intersection of the two libraries (the previous library includes 109 patterns and the new one includes 390 patterns if D>0) is presented in Fig. 4. One can see that there are 16 precisely coinciding patterns: ENLYFQ, ASMTGGQQMGR, GSSHHH, WSHPQFEK, EGGSHHHHH, RRGKKK, PTTENLYFQGAM, SHHHHHHSQDP, HHHHHMA, SMTGGQQMGRGS, KKGEKK, SRSHHHH, ENLYFGGS, GGRHHH, HHHGSM, and GSHMSQ, and 8 with imprecise coincidence, for example HHHHHH and HHHH (Dataset S3).
Figure 4

Dependence of the number of coinciding patterns between previous and new libraries of disordered patterns at the given level of coincidence (F1).

The measure F1 points to the coincidence of protein regions covered by the considered patterns.

It is interesting that some patterns appear in a protein together with other patterns (57 out of 141). Such pairs can be seen in Dataset S4. We also calculated the number of patterns that appear in proteins together with the considered pattern (see Fig. 5). Pattern HHHH occurs together with other patterns in proteins more often than any other pattern. It should be noted that there are several patterns which appear alone in the CDRDB (see Fig. 5, Dataset S4). We used the same criteria as for the intersections of the two libraries.
Figure 5

Number of patterns with which the given pattern appears together in the same protein in the clustered PDB.

Pattern HHHH appears 45 times together with some other patterns from the library and 84 patterns appear alone in the clustered PDB.


Occurrence of disordered patterns in 97 eukaryotic and 26 bacterial proteomes

After creating the library of disordered patterns taken from the CDRDB, another interesting question arises: how often do the obtained patterns occur in proteomes? Since eukaryotic proteomes include more disordered regions than other proteomes [17], [37], [38], we compared 97 eukaryotic proteomes and 26 bacterial ones (see Table 1, Dataset S1, and Materials and Methods). We considered two cases of coincidence. In the first case we calculated the number of proteins where a pattern matches a polypeptide chain fragment precisely. In the second case we analyzed the coincidence according to the definition suggested here and in [31]. According to the rule given in the Materials and Methods section, for patterns with a length of L≤5 no change may occur, for 5<L≤10 one change may occur, etc. Among the 141 disordered patterns, 17 occur (with precise coincidence) in the PDB but are very rare in the 123 proteomes (see Dataset S5). Such patterns as RASQPELAPEDPED, SMTGGQQMGRGS, SHHHHHHSQDP, PTTENLYFQGAM, HHHHHHSSGLEVLFQGP, EQKLISEEDLN, and ASMTGGQQMGR do not appear in the analyzed proteomes in either case (precise coincidence, or exact coincidence of the two terminal residues at each end with mismatches allowed at L/5 positions) (see Figure 6). This suggests that such patterns are artificial additions to proteins from the CDRDB made to facilitate their purification.
Figure 6

Occurrence of disordered patterns in four proteomes.

(A) H. sapiens, Chordata phylum; (B) D. melanogaster, Arthropoda phylum; (C) C. elegans, Nematoda phylum; (D) N. vectensis, Cnidaria phylum. Blue corresponds to precise coincidence of the considered patterns with a fragment of the polypeptide chain; aqua corresponds to exact coincidence of the two terminal residues at both termini and incomplete coincidence in the L/5 positions.

From Figure 6 it is evident that homorepeats occur very often in eukaryotic proteomes. The patterns that occur most frequently in the eukaryotic proteomes have low complexity: PPPPP, GGGGG, EEEED, HHHH, KKKKK, SSTSS, and QQQQQP. Tables 2 and 3 show that the most frequent disordered patterns in the eukaryotic and even in the bacterial proteomes are low-complexity patterns such as GGGGG, PPPPP, TTTPTT, GGGGSGG, and KKKKK.
Table 2

Number of proteins with the most frequent occurrence of disordered patterns in 17 considered animal proteomes in the case of incomplete coincidence.

Proteome H. sapiens B. taurus M. musculus R. norvegicus G. gallus D. rerio T. nigroviridis D. melanogaster D. pseudoobscura A. aegypti A. darlingi A. gambiae C. briggsae C. elegans L. loa T. spiralis N. vectensis
Number of proteins 51778 18405 42120 28166 12954 21576 27836 15101 16000 16042 11437 12455 18531 23817 16271 16040 24435
PPPPP12604671002739282352581560526276385326318478127130153
GGGGG77933960643519827052271888539110257822013856411597
GGGGSGG478203364266128148276529579225698513103212465956
EEEED800320604524235340350159170140182186215182416771
EEEEVEE80427965058118929534611411268130125148144495455
HHHH22471201138671641923834163258174091161656210729
TTTPTT17856160110451027637341225657928327142023219958
KKKKK5033304233531795641961028813984911691817879118
SSTSS3531153102181251812372952772183011682383331168269
QQQQQP240601951455778107437528230501239109176568223
EDEREE46718536035412923524214313597140109138179373955
NSSSS182741791368211610227229318128714373114589931
PAPPP385146298230741121811481447410881100113235640
PPAPP3971613272601031041761111155681889190232237
APIPAP388150282242599513112221163827181127121326
PSRSPS31011325122579129204981116672545979311845
KKGEKK2101311971771001701215747765056151209354097
DDDDEDD10841120104301447310517210431110260851316177
PSPPP3061132691837410917095826280594861221865
KKEKK2257418914576147734852663946117143352447
Table 3

Average number of proteins with the most frequent occurrence of disordered patterns in 123 considered proteomes in the case of incomplete coincidence.

Metazoa (17) Viridiplantae (5) Stramenopiles (1) Choanoflagellida (1) Euglenozoa (4) Alveolata (6) Amoebozoa (2) Diplomonadida (3) Fungi (58) Acidobacteria (1) Actinobacteria (14) Proteobacteria (8) Bacteroidetes (2) Chloroflexi (1)
GGGGG46011762691752774140971217567732126
PPPPP46867961813329718249251715438701424
TTTPTT22414711519627518182617112102310819
GGGGSGG28756219414710530233472494053916
KKKKK2161521197357741196968004174
EEEED27023334871107103280798113610
SSTSS214118135961042845553106279518
EEEEVEE2441874705312594299883014310
QQQQQP19281142671015693118176438
EDEREE179140163787482167982167810
DDDDEDD1082062031545882308383021033
HHHH22916899445817292162445510
APIPAP127174179116433301963428488522
PSRSPS114135905280813421817261316
NSSSS1429073305154534185322438
PAPPP13615589595542765339334458
PSPPP10723613348855608568101506
PPAPP1321341025087221441173544410
KKGEKK113776426362381061039213237
RGRPRG891611325541111734442118210
Following [31], we suggest that these patterns will be disordered in most cases. It should be noted that low-complexity regions can also occur in ordered structural proteins or proteins with a strong structural propensity, such as collagens, coiled coils, or fibrous proteins [40]. Recently, it has been demonstrated that an increased number of perfect tandem repeats correlates with a stronger tendency to be unstructured [41]. Moreover, a strong association between homorepeats and unstructured regions has been shown elsewhere [42]. Such patterns as GGGGSGG, EEEEVEE, EDEREE, APIPAP, and PSRSPS (see Table 2) often occur in the 17 considered animal proteomes. It should be noted that poly-H fragments in the PDB are artificial parts of proteins added to facilitate purification, whereas in eukaryotic proteomes such a repeat is likely to have a biological function. The locations of poly-H fragments in different proteomes can be found on our site, http://bioinfo.protres.ru/fp/search_new_pattern.html.

We calculated the statistical significance of the observed patterns in the 123 proteomes by using equation (10) (see Materials and Methods). It should be noted that the average protein length in the considered proteomes (about 400 residues) is larger than the average protein length in the PDB (about 260 residues). The number of patterns with Zoccur≤0 varies from 40 for the human proteome to 91 for the bacterial proteome B. xenovorans, whereas the number of patterns with Zoccur>5 varies from 65 in the rice proteome (O. sativa) to 8 in B. xenovorans. Several examples deserve attention. For instance, the number of occurrences of pattern GGSGGGGSGGG varies from 7 in T. spiralis (the expected occurrence is 0.0004) to 149 in humans (the expected occurrence is 0.013), with Zoccur values of 353 and 1291, respectively.
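As a rough plausibility check of the GGSGGGGSGGG example, one can compare the observed and expected occurrences with a Poisson-type Z-score. Equation (10) is not reproduced in this section, so the form used below, (observed − expected) / sqrt(expected), is an assumption and may differ from the authors' definition; the function name `z_occurrence` is hypothetical.

```python
import math

def z_occurrence(observed, expected):
    # Assumed Poisson-type score: the variance of the count is taken equal to its mean.
    return (observed - expected) / math.sqrt(expected)

# Observed and expected occurrences of GGSGGGGSGGG as quoted in the text.
examples = [
    ("T. spiralis", 7, 0.0004, 353),
    ("H. sapiens", 149, 0.013, 1291),
]
for proteome, observed, expected, reported in examples:
    z = z_occurrence(observed, expected)
    print(f"{proteome}: assumed-form Z = {z:.0f}, reported Zoccur = {reported}")
```

Under this assumption the scores come out at roughly 350 and 1307, the same order as the reported 353 and 1291. The remaining differences would be consistent with the expected occurrences being rounded in the text, but they may equally reflect a different definition in equation (10).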
Such patterns as MSLN and SNAM appear less often than expected (Zoccur<0) in all 17 considered animal proteomes, although the first pattern occurs 100 times in the human proteome (which is not rare) and the second occurs 61 times. At the same time, pattern HHHH appears more often than expected (from 10 for the human proteome to 4 for the sea anemone N. vectensis), with Z values of 68 and 12, respectively.

We calculated the frequencies of occurrence of the 141 disordered patterns in the 123 proteomes. To state that a given pattern X occurs more often in proteome i than in proteome j, we introduced a scoring function for the difference between the occurrences of the pattern in the two proteomes (equation (11), see Materials and Methods). According to the central limit theorem, this scoring function should be normally distributed. We considered the difference in occurrence of the 141 patterns for some pairs of proteomes (see Dataset S5) and illustrate here the example of the eukaryota and bacteria superkingdoms. It turns out that the occurrences of 55 patterns in the two superkingdoms do not differ significantly at the level of 10⁻⁷. A negative value of the scoring function indicates that the frequency of a given pattern is higher in bacteria than in eukaryota. For example, pattern APIPAP occurs 1.5 times more frequently in the 26 bacterial proteomes than in the 97 eukaryotic proteomes (score −20.4). In addition, the HHHH and QQQQQP patterns occur more often in Arthropoda proteomes than in Chordata proteomes (scores −38.4 and −34.7, respectively) (see Table 2 and Dataset S5).

For each proteome we calculated a set of 141 values, one per pattern in the library, each giving the number of proteins that contain the pattern at least once.
Then, considering all possible pairs of proteomes, we calculated the correlation coefficients between these sets of 141 values, which gave the matrix of correlation coefficients. The correlation coefficient was calculated for each pair of proteomes separately (see Table 4) and then averaged within each kingdom and phylum (see Table 5). As a rule, the correlation coefficients are higher within a kingdom or phylum than between them.
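The pairwise correlation and the subsequent averaging can be sketched as follows. This is a minimal illustration under stated assumptions: the Pearson coefficient is used (the text does not name the type of correlation), the proteome names and counts are invented, and only five patterns are used instead of 141.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def mean_correlation(group_a, group_b, counts):
    """Average correlation (in percent) over all pairs of distinct proteomes."""
    pairs = [(a, b) for a in group_a for b in group_b if a != b]
    if not pairs:
        return None  # a group with a single proteome has no within-group pair
    values = [pearson(counts[a], counts[b]) for a, b in pairs]
    return 100.0 * sum(values) / len(values)

# Hypothetical counts: proteins containing each of five patterns at least once.
counts = {
    "A1": [10, 20, 30, 40, 50],
    "A2": [12, 18, 33, 41, 49],
    "B1": [50, 40, 30, 20, 10],
    "B2": [48, 41, 28, 22, 9],
}
group_a, group_b = ["A1", "A2"], ["B1", "B2"]

within_a = mean_correlation(group_a, group_a, counts)
between = mean_correlation(group_a, group_b, counts)
print(round(within_a), round(between))
```

In this toy case the within-group average is far higher than the between-group average, which is the kind of contrast the text reports for kingdoms and phyla.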
Table 4

Correlation coefficients (in percent) between 17 animal proteomes (kingdom Metazoa).

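Phylum | Proteome | H. sapiens | B. taurus | M. musculus | R. norvegicus | G. gallus | D. rerio | T. nigroviridis | D. melanogaster | D. pseudoobscura | A. aegypti | A. darlingi | A. gambiae | C. briggsae | C. elegans | L. loa | T. spiralis | N. vectensis
Chordata | H. sapiens |  | 98 | 99 | 99 | 97 | 86 | 97 | 73 | 69 | 71 | 57 | 68 | 84 | 80 | 56 | 66 | 83
Chordata | B. taurus | 98 |  | 98 | 98 | 97 | 91 | 95 | 71 | 67 | 70 | 56 | 67 | 83 | 79 | 54 | 65 | 86
Chordata | M. musculus | 99 | 98 |  | 99 | 97 | 88 | 97 | 74 | 70 | 73 | 59 | 68 | 86 | 82 | 59 | 69 | 85
Chordata | R. norvegicus | 99 | 98 | 99 |  | 97 | 89 | 96 | 69 | 65 | 70 | 54 | 64 | 85 | 79 | 57 | 66 | 84
Chordata | G. gallus | 97 | 97 | 97 | 97 |  | 92 | 95 | 72 | 68 | 75 | 58 | 68 | 88 | 83 | 60 | 70 | 88
Chordata | D. rerio | 86 | 91 | 88 | 89 | 92 |  | 83 | 62 | 58 | 70 | 53 | 59 | 84 | 76 | 63 | 70 | 88
Chordata | T. nigroviridis | 97 | 95 | 97 | 96 | 95 | 83 |  | 80 | 77 | 78 | 67 | 77 | 84 | 82 | 57 | 69 | 83
Arthropoda | D. melanogaster | 73 | 71 | 74 | 69 | 72 | 62 | 80 |  | 99 | 95 | 94 | 96 | 79 | 87 | 69 | 83 | 67
Arthropoda | D. pseudoobscura | 69 | 67 | 70 | 65 | 68 | 58 | 77 | 99 |  | 94 | 96 | 96 | 74 | 83 | 66 | 80 | 62
Arthropoda | A. aegypti | 71 | 70 | 73 | 70 | 75 | 70 | 78 | 95 | 94 |  | 94 | 92 | 85 | 90 | 78 | 90 | 73
Arthropoda | A. darlingi | 57 | 56 | 59 | 54 | 58 | 53 | 67 | 94 | 96 | 94 |  | 97 | 68 | 77 | 67 | 80 | 54
Arthropoda | A. gambiae | 68 | 67 | 68 | 64 | 68 | 59 | 77 | 96 | 96 | 92 | 97 |  | 71 | 79 | 60 | 75 | 61
Nematoda | C. briggsae | 84 | 83 | 86 | 85 | 88 | 84 | 84 | 79 | 74 | 85 | 68 | 71 |  | 97 | 84 | 88 | 86
Nematoda | C. elegans | 80 | 79 | 82 | 79 | 83 | 76 | 82 | 87 | 83 | 90 | 77 | 79 | 97 |  | 85 | 90 | 84
Nematoda | L. loa | 56 | 54 | 59 | 57 | 60 | 63 | 57 | 69 | 66 | 78 | 67 | 60 | 84 | 85 |  | 91 | 74
Nematoda | T. spiralis | 66 | 65 | 69 | 66 | 70 | 70 | 69 | 83 | 80 | 90 | 80 | 75 | 88 | 90 | 91 |  | 74
Cnidaria | N. vectensis | 83 | 86 | 85 | 84 | 88 | 88 | 83 | 67 | 62 | 73 | 54 | 61 | 86 | 84 | 74 | 74 | 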
Table 5

Averaged correlation coefficients (in percent) between the numbers of proteins in which each of the 141 disordered patterns appears at least once, for 9 kingdoms of eukaryota and 5 phyla of bacteria.

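 | Metazoa (17) | Viridiplantae (5) | Stramenopiles (1) | Choanoflagellida (1) | Euglenozoa (4) | Alveolata (6) | Amoebozoa (2) | Diplomonadida (3) | Fungi (58) | Acidobacteria (1) | Actinobacteria (14) | Proteobacteria (8) | Bacteroidetes (2) | Chloroflexi (1)
Metazoa (17) | 78 | 71 | 59 | 70 | 75 | 32 | 55 | 47 | 75 | 55 | 43 | 48 | 52 | 69
Viridiplantae (5) | 71 | 77 | 73 | 67 | 73 | 20 | 43 | 32 | 66 | 63 | 51 | 55 | 49 | 64
Stramenopiles (1) | 59 | 73 |  | 34 | 60 | 7 | 25 | 12 | 42 | 73 | 52 | 54 | 43 | 55
Choanoflagellida (1) | 70 | 67 | 34 |  | 74 | 22 | 71 | 57 | 78 | 48 | 53 | 53 | 42 | 72
Euglenozoa (4) | 75 | 73 | 60 | 74 | 79 | 14 | 55 | 53 | 73 | 65 | 54 | 57 | 47 | 73
Alveolata (6) | 32 | 20 | 7 | 22 | 14 | 96 | 15 | 12 | 29 | −5 | −7 | −4 | 45 | 6
Amoebozoa (2) | 55 | 43 | 25 | 71 | 55 | 15 | 99 | 39 | 56 | 17 | 18 | 12 | 28 | 48
Diplomonadida (3) | 47 | 32 | 12 | 57 | 53 | 12 | 39 | 94 | 58 | 28 | 31 | 34 | 35 | 60
Fungi (58) | 75 | 66 | 42 | 78 | 73 | 29 | 56 | 58 | 80 | 44 | 39 | 43 | 46 | 66
Acidobacteria (1) | 55 | 63 | 73 | 48 | 65 | −5 | 17 | 28 | 44 |  | 83 | 84 | 44 | 70
Actinobacteria (14) | 43 | 51 | 52 | 53 | 54 | −7 | 18 | 31 | 39 | 83 | 90 | 85 | 32 | 68
Proteobacteria (8) | 48 | 55 | 54 | 53 | 57 | −4 | 12 | 34 | 43 | 84 | 85 | 84 | 39 | 70
Bacteroidetes (2) | 52 | 49 | 43 | 42 | 47 | 45 | 28 | 35 | 46 | 44 | 32 | 39 | 64 | 54
Chloroflexi (1) | 69 | 64 | 55 | 72 | 73 | 6 | 48 | 60 | 66 | 70 | 68 | 70 | 54 | 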
Four clusters with high correlation coefficients can be selected from Table 4, which covers all pairs of the 17 animal proteomes. The first cluster corresponds to phylum Chordata (7 proteomes), the second to Arthropoda (5 proteomes), the third to Nematoda (4 proteomes), and the fourth to Cnidaria (only 1 proteome). In Tables 4 and 5, bold formatting marks correlations higher than 75%, normal-size numbers mark correlations from 50% to 75%, and smaller numbers mark correlations below 50%. Table 4 shows that the numbers of proteins from the human proteome correlate less with those from the chicken and fish proteomes than with those from the bovine, rat, and mouse proteomes. At the same time, the Chordata proteomes correlate highly with such proteomes as C. briggsae and C. elegans. High correlation coefficients are also observed for T. spiralis with the Arthropoda proteomes and for N. vectensis with the Chordata proteomes.

Combining motif discovery with the identification of disordered protein segments in the clustered PDB allowed us to create the largest library of disordered patterns. At present the library includes 141 disordered patterns. This approach is promising for further study and understanding of the functional role of the obtained patterns in different proteomes. The analysis of 123 proteomes led us to some general conclusions. The disordered patterns appear more often in eukaryotic than in bacterial proteomes. The occurrence of disordered patterns is more uniform within the same kingdom (phylum) than between kingdoms (phyla). One can suggest that such short similar motifs are responsible for common functions in nonhomologous, unrelated proteins from different organisms.

List of 141 disordered patterns. (XLS)
Number of proteins and residues for each of the 123 proteomes. (XLS)
Comparison of the new and the previous libraries of disordered patterns. (XLS)
Pairs of patterns which appear in the same protein from the whole clustered PDB. (XLS)
Occurrence of disordered patterns in 97 eukaryotic and 26 bacterial proteomes in the cases of precise and incomplete coincidence. (XLS)
  42 in total

1.  Sequence patterns associated with disordered regions in proteins.

Authors:  S Lise; D T Jones
Journal:  Proteins       Date:  2005-01-01

2.  Prediction of unfolded segments in a protein sequence based on amino acid composition.

Authors:  Karen Coeytaux; Anne Poupon
Journal:  Bioinformatics       Date:  2005-01-18       Impact factor: 6.937

3.  Exploiting heterogeneous sequence properties improves prediction of protein disorder.

Authors:  Zoran Obradovic; Kang Peng; Slobodan Vucetic; Predrag Radivojac; A Keith Dunker
Journal:  Proteins       Date:  2005

4.  IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content.

Authors:  Zsuzsanna Dosztányi; Veronika Csizmok; Peter Tompa; István Simon
Journal:  Bioinformatics       Date:  2005-06-14       Impact factor: 6.937

5.  RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins.

Authors:  Zheng Rong Yang; Rebecca Thomson; Philip McNeil; Robert M Esnouf
Journal:  Bioinformatics       Date:  2005-06-09       Impact factor: 6.937

6.  The Ising model for prediction of disordered residues from protein sequence alone.

Authors:  Michail Yu Lobanov; Oxana V Galzitskaya
Journal:  Phys Biol       Date:  2011-05-13       Impact factor: 2.583

Review 7.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors:  S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal:  Nucleic Acids Res       Date:  1997-09-01       Impact factor: 16.971

Review 8.  Intrinsically unstructured proteins and their functions.

Authors:  H Jane Dyson; Peter E Wright
Journal:  Nat Rev Mol Cell Biol       Date:  2005-03       Impact factor: 94.444

9.  Trinucleotide repeats and long homopeptides in genes and proteins associated with nervous system disease and development.

Authors:  S Karlin; C Burge
Journal:  Proc Natl Acad Sci U S A       Date:  1996-02-20       Impact factor: 11.205

10.  Prediction and functional analysis of native disorder in proteins from the three kingdoms of life.

Authors:  J J Ward; J S Sodhi; L J McGuffin; B F Buxton; D T Jones
Journal:  J Mol Biol       Date:  2004-03-26       Impact factor: 5.469
