Intrinsically disordered proteins are frequently involved in important regulatory processes in the cell thanks to their ability to bind several different targets performing sometimes even opposite functions. The PentUnFOLD algorithm is a physicochemical method that is based on new propensity scales for disordered, nonstable and stable elements of secondary structure and on the counting of stabilizing and destabilizing intraprotein contacts. Unlike other methods, it works with a PDB file, and it can determine not only those fragments of alpha helices, beta strands, and random coils that can turn into disordered state (the "dark" side of the disorder), but also nonstable regions of alpha helices and beta strands which are able to turn into random coils (the "light" side), and vice versa (H ↔ C, E ↔ C). The scales have been obtained from structural data on disordered regions from the middle parts of amino acid sequences only, and not on their expectedly disordered N- and C-termini. Among other tendencies we have found that regions of both alpha helices and beta strands that can turn into the disordered state are relatively enriched in residues of Ala, Met, Asp, and Lys, while regions of both alpha helices and beta strands that can turn into random coil are relatively enriched in hydrophilic residues, and Cys, Pro, and Gly. Moreover, PentUnFOLD has the option to determine the effect of secondary structure transitions on the stability of a given region of a protein. The PentUnFOLD algorithm is freely available at http://3.17.12.213/pent-un-fold and http://chemres.bsmu.by/PentUnFOLD.htm .
The material for this study includes five initial sets of 3D structures of proteins: (1) alpha helical eukaryotic proteins; (2) beta structural eukaryotic proteins; (3) alpha + beta eukaryotic proteins; (4) alpha/beta eukaryotic proteins; (5) bacterial proteins; as well as (6) a control set of proteins of different origins and structural classes. Each set contains no homologs, since the similarity between sequences was lower than 25% according to the Decrease Redundancy algorithm (https://web.expasy.org/decrease_redundancy/). Each protein has two to five different 3D structures in PDB. Those structures belong to proteins with 100% identity of amino acid sequences, but their secondary structures may differ. Thus, the samples consist of 100 alpha helical eukaryotic proteins with 378 structures, 100 beta structural eukaryotic proteins with 355 structures, 100 alpha + beta eukaryotic proteins with 387 structures, 100 alpha/beta eukaryotic proteins with 386 structures, and 189 bacterial proteins with 610 structures. The average resolution of all X-ray structures is 2.21 Å for alpha helical proteins, 2.11 Å for beta structural proteins, 2.08 Å for alpha + beta proteins, 2.00 Å for alpha/beta proteins, and 2.12 Å for bacterial proteins. The control set consists of 74 eukaryotic, bacterial and viral proteins with 249 structures. The average resolution of all X-ray structures from the control set is 2.29 Å. The PDB IDs of all 3D structures, as well as their resolutions, are provided in Table 1S and Table 2S of the Supplementary Material. Since we used a new control set of proteins, we also provide the amino acid sequences of all 3D structures used, their secondary structures, and the results of all the algorithms described in the manuscript in Table 3S of the Supplementary Material.
Secondary structure was assigned with the help of the DSSP algorithm (Kabsch and Sander 1983). 
Finally, for each protein we found: those random coils (C), alpha helices (H), and beta strands (E) that stay the same in all identical structures; those residues of alpha helices that exist as random coil in some of the structures (HC); those residues of beta strands that exist as random coil in some structures (EC); absolutely disordered fragments that cannot be seen in any of the examined structures (0); random coils that can turn into the disordered state (0C); alpha helices that can turn into the disordered state (0H); and beta strands that can turn into the disordered state (0E). We also found a significant number of cases in which an alpha helix turns into random coil in some structures but into the disordered state in others (0HC), and cases in which a beta strand turns into random coil in some structures and into the disordered state in others (0EC). Interestingly, the number of residues that can exist in both the alpha helical and the beta structural state is quite low. In total, we analyzed 46,249 cases of H, 27,798 of E, 59,023 of C, 1260 of 0, 3960 of HC, 2835 of EC, 1596 of 0C, 274 of 0H, 69 of 0E, 106 of 0HC, 33 of 0EC, 4 of HE, and 2 of HEC. Disordered N-terminal and C-terminal parts of proteins were ignored in all calculations, except the testing of the PentUnFOLD 1D algorithm on the control set.
The amino acid content of each of the abovementioned structural states was calculated. Then, the usages of amino acid residues in different structural states were compared with each other by a two-tailed t test for relative values; standard errors are shown in the figures. In the same manner, we compared the pentapeptide contents of those structural states. Pentapeptides were used to account for the influence of short-range interactions between amino acid residues, and of alternations of hydrophilic (P) and hydrophobic (H) residues, on the probability of structural shifts. 
In those pentapeptides, amino acid residues are roughly divided into hydrophilic (Ser, Thr, Asp, Glu, Asn, Gln, His, Arg, Lys) and hydrophobic ones (Gly, Ala, Met, Leu, Ile, Val, Phe, Tyr, Trp, Cys, Pro) (Tina et al. 2007). The methodology of such comparisons has been described in detail previously (Khrustalev et al. 2019).
Additionally, we calculated the amino acid content and pentapeptide content of the first and last amino acid residues of alpha helices and beta strands with stable and nonstable N- and C-termini, as well as of the flanking random coil residues. For alpha helices, we also considered the second positions of both N- and C-termini.
Propensity scales (both amino acid and pentapeptide ones) were created for each set of structural states. All those scales can be seen in Table 4S of the Supplementary Material. The workflow of the PentUnFOLD algorithm is described in a subsection of the Results and Discussion section; the manual is available as the Supplementary Material file “PentUnFOLD-manual.pdf”.
The secondary structure of each protein from all six sets was determined with the DSSP program (Kabsch and Sander 1983); the tertiary structure of each protein from the control set was studied with the help of the PIC server (Tina et al. 2007). Among intraprotein interactions we consider hydrogen bonds, hydrophobic contacts, ionic contacts, cation-pi interactions, aromatic-aromatic and aromatic-sulfur interactions, as well as disulfide bonds. We use the same criteria for them as the PIC server (Tina et al. 2007). The number of amino acid residues that make contacts with a given residue is calculated. 
Then the algorithm counts, for every amino acid residue, the number of contacts with stable residues, with nonstable residues, with disordered ones, and with completely disordered ones.
The information on 103 structures of human serum albumin can be found in Table 5S of the Supplementary Material.
Three PentUnFOLD algorithms are available on the web server (http://3.17.12.213/pent-un-fold) and on the page of our university (http://chemres.bsmu.by/PentUnFOLD.htm). PentUnFOLD 1D, PentUnFOLD 2D, and PentUnFOLD 3D require a PDB file as input, while PentUnFOLD 1D can also work with an amino acid sequence. The output of those algorithms is provided as a downloadable MS Excel file. Predictions can be easily copied from those files, and new calculations or formatting can be performed directly in the MS Excel worksheets. The server uses the output of the DSSP algorithm to determine secondary structure for PentUnFOLD 2D and PentUnFOLD 3D. In case of DSSP server failure, secondary structure is determined by our own JAVA script based on the DSSP criteria (Kabsch and Sander 1983). For the 3D version of the algorithm, our server finds all the possible intraprotein interactions according to the criteria of the PIC server using a new JAVA script.
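The per-residue labels defined earlier in this section (C, H, E, HC, EC, 0, 0C, 0H, 0E, 0HC, 0EC, HE, HEC) can be derived by collecting the states observed at each position across all structures of the same sequence. The following is a minimal sketch of that bookkeeping, not the authors' implementation. It assumes that each structure has already been reduced to a three-state string (H, E, C) aligned to the full sequence, with "0" marking residues absent from the coordinates; the function name `combine_states` and this input encoding are hypothetical. It does not exclude disordered N- and C-termini, which the study ignores in most calculations.

```python
from typing import List, Sequence

# Order in which observed states are joined into a label, chosen so that
# the output matches the labels used in the text (e.g. "0HC", "HEC").
STATE_ORDER = "0HEC"


def combine_states(assignments: Sequence[str]) -> List[str]:
    """Return one combined label per residue position.

    `assignments` holds one string per 3D structure of the same protein.
    Each character is 'H', 'E', 'C', or '0' (residue not resolved).
    """
    if not assignments:
        raise ValueError("at least one structure is required")
    length = len(assignments[0])
    if any(len(a) != length for a in assignments):
        raise ValueError("all assignments must have the same length")

    labels = []
    for column in zip(*assignments):
        observed = set(column)
        unknown = observed - set(STATE_ORDER)
        if unknown:
            raise ValueError(f"unexpected state symbols: {sorted(unknown)}")
        # Keep each observed state once, in the fixed order 0, H, E, C.
        labels.append("".join(s for s in STATE_ORDER if s in observed))
    return labels


# Three hypothetical structures of one six-residue protein.
example = ["CHHHHC", "CCHHHC", "0CHH0C"]
print(combine_states(example))  # ['0C', 'HC', 'H', 'H', '0H', 'C']
```

A residue labelled "HC" here was helical in at least one structure and coil in at least one other; a label such as "0" arises only when the residue is missing from every structure.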
Results
Comparison between completely disordered state, disordered random coil, and stable random coil
One of the fundamental questions of this study is whether there are any differences between the regions of proteins that can never be seen in PDB files and those that can be seen in some PDB files but “disappear” in others. In Fig. 1, we show the amino acid content of completely disordered regions (excluding those in the N-termini and C-termini of proteins), the amino acid content of regions that are disordered in some 3D structures and form random coil in other structures with 100% similarity of amino acid sequence, and the amino acid content of those regions of random coil that stay in the random coil state in all examined 3D structures of identical proteins. The differences between stable random coils and unstable ones are as follows: the usage of several hydrophobic residues is significantly higher in stable random coils (Leu, Ile, Val, Phe, Tyr, Trp, Cys, Pro), as is the usage of some hydrophilic ones (Asp, Asn, His). The differences between stable random coils and completely disordered regions are quite similar: the usage of several hydrophobic residues is significantly higher in stable random coils (Met, Leu, Ile, Val, Phe, Tyr, Trp, Cys, Pro), as is the usage of some hydrophilic ones (Thr, Asn, Gln, His). Completely disordered regions are enriched in several amino acid residues relative to the unstable random coils: Gly, Ala, Leu, Val, Trp, Pro, Asp, and Lys. However, the magnitude of the differences in amino acid usage is higher than 25% for only three amino acid residues: Met, Trp, and His. At the same time, the differences between the completely disordered state and stable random coils are higher than 25% for Ser, Glu, Leu, Ile, Tyr, Trp, Cys, Phe, and His. Among the differences between unstable and stable random coils with a magnitude higher than 25%, we find the same Ser, Glu, Leu, Ile, Tyr, Trp, and Cys, but not Phe and His.
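The comparisons above rest on a two-tailed test for relative values (proportions) with standard errors. The exact procedure is the one described in Khrustalev et al. 2019; the sketch below is only an assumed, generic version of such a test. It uses the binomial standard error of each proportion and a normal approximation for the two-tailed p value, which is reasonable only for large samples. The function names and the residue counts in the example are hypothetical, not data from this study.

```python
import math
from typing import Tuple


def proportion_se(count: int, total: int) -> float:
    """Binomial standard error of the proportion count / total."""
    p = count / total
    return math.sqrt(p * (1.0 - p) / total)


def compare_usage(count_a: int, total_a: int,
                  count_b: int, total_b: int) -> Tuple[float, float]:
    """Compare the usage of one residue type in two structural states.

    Returns (statistic, two_tailed_p). The p value uses the normal
    approximation, an assumption made for this illustration.
    """
    p_a = count_a / total_a
    p_b = count_b / total_b
    se = math.sqrt(proportion_se(count_a, total_a) ** 2
                   + proportion_se(count_b, total_b) ** 2)
    if se == 0.0:
        raise ValueError("standard error is zero; the test is undefined")
    statistic = (p_a - p_b) / se
    # Two-tailed p value for a standard normal statistic.
    p_value = math.erfc(abs(statistic) / math.sqrt(2.0))
    return statistic, p_value


# Hypothetical counts: 100 of 1000 residues in one state versus
# 50 of 1000 residues in another state.
stat, p = compare_usage(100, 1000, 50, 1000)
print(f"statistic = {stat:.2f}, significant at 0.05: {p < 0.05}")
```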
Fig. 1
Amino acid content of completely disordered regions (“0”), amino acid content of regions that exist as disordered ones in some 3D structures and form random coil in other structures with 100% similarity of amino acid sequence (“0C”), and amino acid content of those regions of random coil that stay in random coil state (“C”) in all examined 3D structures of identical proteins. Amino acids from N-termini and C-termini of proteins were excluded. Standard errors are provided in the barcharts. Names of hydrophobic residues are in the yellow bar, names of negatively charged residues are in the red bar, while names of positively charged residues are in the blue bar
In Fig. 2, we show the same comparison, but for 32 pentapeptides. The enrichment of both completely disordered regions and unstable random coils in hydrophilic amino acid residues, and in pentapeptides composed of them, is clear. However, some pentapeptides are landmarks of the unstable random coil and not of completely disordered regions: HPPPP; PPPPH; PHPPP; PPPHP; HHPPP; PPPHH; PPHHH; HHHPP; HPHPP (P is a hydrophilic amino acid residue, H is a hydrophobic amino acid residue). In the same way, a few pentapeptides are quite frequent specifically in completely disordered regions: PHHPP; PPHHP; PPHPP; PHHPH; HPHHP. Interestingly, the last five pentapeptides are known as alpha helical ones (Khrustalev and Barkovsky 2012).
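The P/H pentapeptide encoding used in this comparison can be sketched as follows, using the hydrophilic/hydrophobic division given in the Methods (Ser, Thr, Asp, Glu, Asn, Gln, His, Arg, Lys versus Gly, Ala, Met, Leu, Ile, Val, Phe, Tyr, Trp, Cys, Pro). The function names are hypothetical, and counting every overlapping window of five residues is an assumption made for illustration; the authors' exact counting procedure is described in Khrustalev et al. 2019.

```python
from collections import Counter
from itertools import product
from typing import Dict

# One-letter codes for the two residue classes listed in the Methods.
HYDROPHILIC = set("STDENQHRK")
HYDROPHOBIC = set("GAMLIVFYWCP")


def to_ph(sequence: str) -> str:
    """Translate an amino acid sequence into a string of P and H symbols."""
    symbols = []
    for residue in sequence.upper():
        if residue in HYDROPHILIC:
            symbols.append("P")
        elif residue in HYDROPHOBIC:
            symbols.append("H")
        else:
            raise ValueError(f"non-standard residue: {residue!r}")
    return "".join(symbols)


def pentapeptide_usage(sequence: str) -> Dict[str, float]:
    """Relative usage of each of the 32 P/H pentapeptides.

    Assumption: every overlapping window of five residues is counted.
    """
    pattern = to_ph(sequence)
    counts = Counter(pattern[i:i + 5] for i in range(len(pattern) - 4))
    total = sum(counts.values())
    all_patterns = ("".join(p) for p in product("PH", repeat=5))
    return {p: (counts[p] / total if total else 0.0) for p in all_patterns}


print(to_ph("SADKL"))  # PHPPH
usage = pentapeptide_usage("SADKLG")
print({k: v for k, v in usage.items() if v > 0})
```

Note that the letter H means histidine in the amino acid sequence but "hydrophobic" in the encoded pattern, so the two alphabets should not be mixed.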
Fig. 2
Pentapeptide content of completely disordered regions (“0”), pentapeptide content of regions that exist as disordered ones in some 3D structures and form random coil in other structures with 100% similarity of amino acid sequence (“0C”), and pentapeptide content of those regions of random coil that stay in random coil state (“C”) in all examined 3D structures of identical proteins. Pentapeptides from N-termini and C-termini of proteins were excluded. “P” is a hydrophilic amino acid residue, “H” is a hydrophobic amino acid residue. Standard errors are provided in the barcharts. The pentapeptides are arranged in order of increasing hydrophobicity from left to right
On the one hand, the amino acid and pentapeptide contents of completely disordered regions and unstable random coils are closer to each other than to those of stable random coils. On the other hand, there are some sharp differences between them, especially if we consider combinations of hydrophobic and hydrophilic residues in pentapeptides. For these reasons, we included two methods to find disordered regions in random coils of proteins. In the first, we use a combined scale based on the average characteristics of both completely disordered regions and unstable random coils. In the second, we distinguish completely disordered regions from unstable random coils.
Comparison between alpha helices that can turn into the disordered state, those that can turn into random coil, and stable alpha helices
In this study we have found that alpha helices that can turn into the disordered state are different from those that can turn into random coil. As one can see in Fig. 3, alpha helices able to form random coil are enriched in all nine hydrophilic amino acid residues, as well as in Cys, Pro, and Gly, relative to stable alpha helices. In contrast, alpha helices able to turn into the disordered state, and not into random coil, are enriched in Ala and Met, and depleted in Gly, Pro, Thr, Asn, His, and Arg residues, relative to stable alpha helices. Even more surprisingly, alpha helices that are able to form the disordered state are significantly enriched in Ala, Met, Ile, Val, Tyr, Asp, Glu, Gln, and Lys relative to those that can turn into random coil. Especially prominent differences are evident (Fig. 3) for Ala, Asp, Glu, Gln, and Lys residues. These residues (except Asp) are well-known helix formers (Chou and Fasman 1978), but their usages are higher in those alpha helices that can turn into the disordered state than in stable alpha helices and in those that can form random coil.
Fig. 3
Amino acid content of alpha helices that can turn into the disordered state (“0H”), amino acid content of alpha helices that can turn into random coil (“HC”), and amino acid content of those regions of alpha helices that stay in alpha helical state (“H”) in all examined 3D structures of identical proteins. Amino acids from N-termini and C-termini of proteins were excluded. Standard errors are provided in the barcharts. Names of hydrophobic residues are in the yellow bar, names of negatively charged residues are in the red bar, while names of positively charged residues are in the blue bar
In Fig. 4, one can see that alpha helices prone to form disordered fragments of proteins are enriched in hydrophilic pentapeptides, as well as in a few less hydrophilic ones: PPHPH; HPPHP; PHPHH; HHHHP. Stable alpha helices are enriched in most of the hydrophobic pentapeptides. Those alpha helices that are prone to form random coil show a higher usage of hydrophilic pentapeptides than stable alpha helices, but not as high as those alpha helices that can turn into the disordered state. The opposite holds for some hydrophobic pentapeptides. Several pentapeptides with an average usage of hydrophilic residues are more frequent in alpha helices prone to form random coil than in the two other types, for example: HHPPP; HPHPP; HPPPH.
Fig. 4
Pentapeptide content of alpha helices that can turn into the disordered state (“0H”), pentapeptide content of alpha helices that can turn into random coil (“HC”), and pentapeptide content of those regions of alpha helices that stay in alpha helical state (“H”) in all examined 3D-structures of identical proteins. Pentapeptides from N-termini and C-termini of proteins were excluded. “P” is a hydrophilic amino acid residue, “H” is a hydrophobic amino acid residue. Standard errors are provided in the barcharts. The pentapeptides are arranged in order of increasing hydrophobicity from left to right
In the PentUnFOLD algorithm, we first check whether a fragment of an alpha helix is prone to form random coil, and then we check whether it can turn into a disordered region, using separate propensity scales (Table 4S).
Comparison between beta strands that can turn into the disordered state, those that can turn into random coil, and stable beta strands
Exactly as with alpha helices, beta strands able to form random coil are enriched in all nine hydrophilic amino acid residues, as well as in Cys, Pro, and Gly, relative to stable beta strands (Fig. 5). At the same time, the amino acid contents of alpha helices and beta strands (both able and unable to turn into random coil) are quite different. Beta strands prone to form disordered regions are enriched in several amino acids relative to the two other types of beta strands: Ala, Met, Tyr, Asp, His, Lys. Interestingly, alpha helices prone to form disordered regions are also enriched in Ala, Met, Asp, and Lys relative to the two other types of alpha helices. It seems that, when enriched in the same types of amino acid residues, both beta strands and alpha helices become prone to form a disordered region, while when enriched in other types of amino acid residues they both become able to turn into random coil. However, both transitions (to the disordered state and to random coil) become possible for beta strands and for alpha helices while they still have quite different amino acid contents.
Fig. 5
Amino acid content of beta strands that can turn into the disordered state (“0E”), amino acid content of beta strands that can turn into random coil (“EC”), and amino acid content of stable beta strands (“E”). Amino acids from N-termini and C-termini of proteins were excluded. Standard errors are provided in the barcharts. Names of hydrophobic residues are in the yellow bar, names of negatively charged residues are in the red bar, while names of positively charged residues are in the blue bar
It is not surprising that beta strands able to turn into random coil are enriched in more hydrophilic pentapeptides, while stable beta strands are more hydrophobic (Fig. 6). Surprisingly, beta strands that are prone to turn into the disordered state are especially enriched in several specific pentapeptides: PPPPH; PPPHH; PHPHP; PPHHH; PHHHP; HHHPH. This information is used by the PentUnFOLD algorithm to check whether a beta strand fragment can turn into random coil, and whether it is prone to turn into the disordered state.
Fig. 6
Pentapeptide content of beta strands that can turn into the disordered state (“0E”), pentapeptide content of beta strands that can turn into random coil (“EC”), and pentapeptide content of stable beta strands (“E”). Pentapeptides from N-termini and C-termini of proteins were excluded. “P” is a hydrophilic amino acid residue, “H” is a hydrophobic amino acid residue. Standard errors are provided in the barcharts. The pentapeptides are arranged in order of increasing hydrophobicity from left to right
The information on instability of N- and C-termini of alpha helices and beta strands
We found most helix-to-coil and beta-sheet-to-coil transitions in the N- and C-termini of alpha helices and beta strands, respectively. These transitions were studied separately from cases of complete helix-to-coil and sheet-to-coil transformations. In the previous sections we described the amino acid content of alpha helices and beta strands that are able to turn into random coil completely. Here, we compare the amino acid content of N-termini of stable alpha helices with that of instable N-termini of alpha helices. Since in N-caps of alpha helices the residue situated before the helix usually makes hydrogen bonds with residues of the helix, those residues were compared for stable and instable N-termini of helices as well. We also considered helix-to-coil transitions made by a single N-terminal residue separately from transitions made by two N-terminal residues. The same comparisons were made for C-termini of alpha helices.
In Fig. 7, one can observe that proline in the first position can significantly stabilize the N-terminus of an alpha helix, while aspartic acid usually makes the first position of an alpha helix prone to turn into random coil. C-termini of alpha helices are stable if Val, Tyr or Asp is situated there. Nonstable C-terminal residues are Ser and Arg. The pentapeptide that stabilizes N-termini of alpha helices better than others is HPHPP (the alpha helix starts from the third position).
Fig. 7
Amino acid content of stable and instable N- and C-termini of alpha helices. Standard errors are provided in the barcharts
N-terminal residues of beta strands that prevent those N-termini from turning into random coil are Met, Ile, and Val (Fig. 8). If a beta strand starts with Asp or Glu, it is prone to become shorter from its N-terminus. C-termini of beta strands are significantly stabilized by Leu, Ile, and Trp. If Gly is situated in the C-terminal position of a beta strand, that strand has a high chance of becoming shorter from its C-terminus (Fig. 8). The pentapeptide that stabilizes N-termini of beta strands better than others has the following sequence: HPHHH. The best stabilizer of the C-terminus of a beta strand is the HHHPP pentapeptide.
Fig. 8
Amino acid content of stable and instable N- and C-termini of beta strands. Standard errors are provided in the barcharts
The information on amino acid residues that make N- and C-termini of alpha helices and beta strands more or less stable is used by PentUnFOLD to assess the stability of those caps. In the PentUnFOLD 2D version, users are able to change the positions of N-termini and C-termini of alpha helices and beta strands to search for their most stable (and so most expected) positions.
The principles of the PentUnFOLD algorithms
There are three versions of the PentUnFOLD algorithm: the 1D version makes predictions from an amino acid sequence alone; the 2D version uses both the amino acid sequence and the data on secondary structure; the 3D version uses the amino acid sequence, the secondary structure, and the map of intraprotein contacts between amino acid residues.
The PentUnFOLD algorithm predicts fragments of alpha helices and beta strands that can turn into random coil (referred to as “N” residues) separately from those fragments of alpha helices, beta strands, and random coils that can turn into the completely disordered state (referred to as “D” residues). A fragment of a protein may be both “N” and “D”.
As input, the PentUnFOLD algorithm requires the description of a polypeptide chain from a PDB file, the results of the evaluation of its secondary structure by the DSSP algorithm, and the information on intraprotein contacts between amino acid residues. The amino acid sequence is extracted from the “ATOM” lines of the PDB file. However, users of the PentUnFOLD 2D version can introduce an unlimited number of amino acid substitutions into the sequence. The information on the secondary structure is extracted from the output of the DSSP algorithm, but users can change the secondary structure manually in the 2D version of the algorithm.
The algorithm solves the following problems: (i) it checks the stability of alpha helices and beta strands in terms of their ability to turn into random coil; (ii) it checks the ability of alpha helices, beta strands, and random coils to turn into the disordered state; (iii) it finds regions of random coil that are able to form an alpha helix or a beta strand. The PentUnFOLD 1D version also predicts secondary structure for a protein, considering that it can completely turn into a disordered state and fold back.
The first problem requires two separate calculations: the estimation of the stability of N- and C-termini of alpha helices and beta strands, and the estimation of the stability of their central regions. 
For termini of beta strands, the algorithm uses four scales: amino acid and pentapeptide propensity scales for the first (last) residue in a beta strand and for the flanking residue from the random coil. If the average value from these four scales is higher than 0.5, the terminal residue is considered to be stable. The terminal residue is also considered to be stable if a beta strand is predicted there by both the amino acid and the pentapeptide scales. Amino acid residues from the body of a beta strand are judged first by the amino acid and pentapeptide scales, and then by the average results of the scales (amino acid and pentapeptide ones) for stable and nonstable bodies of beta strands. Finally, the algorithm shows residues of beta strands that have a low probability of turning into random coil (ES), and those residues that have a high probability of turning into random coil (EN).
The algorithm works with alpha helices in the same way as with beta strands, but the procedure for their N- and C-termini is more complicated, since it also includes a calculation featuring the second residue from the N-terminus and the second residue from the C-terminus. The first (last) residue in an alpha helix is considered to be stable if it is predicted to be stable by both methods (the one featuring just the first residue, and the one considering two N-terminal or C-terminal residues). 
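The two terminus rules described above can be sketched as simple decision functions. This is an illustrative reading of the text, not the published implementation: the scale values are assumed to be numbers already looked up in the propensity scales (Table 4S), the boolean inputs are assumed to come from the separate prediction steps, and all function and parameter names are hypothetical.

```python
from typing import Sequence


def beta_terminus_is_stable(scale_values: Sequence[float],
                            strand_by_aa_scale: bool,
                            strand_by_pentapeptide_scale: bool,
                            threshold: float = 0.5) -> bool:
    """Decide whether the first (last) residue of a beta strand is stable.

    `scale_values` holds four numbers: the amino acid and pentapeptide
    scale values for the terminal residue of the strand and for the
    flanking random coil residue.
    """
    if len(scale_values) != 4:
        raise ValueError("exactly four scale values are expected")
    average = sum(scale_values) / 4.0
    # Rule 1: average of the four scales above the threshold.
    # Rule 2: a beta strand is predicted there by both kinds of scales.
    return average > threshold or (strand_by_aa_scale
                                   and strand_by_pentapeptide_scale)


def helix_first_residue_is_stable(stable_by_one_residue_method: bool,
                                  stable_by_two_residue_method: bool) -> bool:
    """The first (last) helix residue is stable only if both methods agree."""
    return stable_by_one_residue_method and stable_by_two_residue_method


# Hypothetical scale values for one beta strand terminus.
print(beta_terminus_is_stable([0.6, 0.7, 0.4, 0.5], False, False))  # True
print(beta_terminus_is_stable([0.5, 0.5, 0.5, 0.5], False, False))  # False
```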
The stability of the second residue in an alpha helix is assessed by a method that uses the average result of six scales: amino acid scales and pentapeptide scales for the first and the second residues in an alpha helix, and for the residue in random coil before (after) the alpha helix.
The ability of a random coil to turn into the disordered state is assessed by comparing four scales: (i) the combined scale for the completely disordered state and for random coil that can turn into the disordered state; (ii) the scale for stable random coil; (iii) the combined scale for alpha helices that can turn into the disordered state and those that can turn into both the disordered state and random coil; (iv) the combined scale for beta strands that can turn into the disordered state and those that can turn into both the disordered state and random coil. If a random coil is considered to be disordered by this method, it is also judged by a method that features just two scales: one for the completely disordered state and one for random coils that can turn into the disordered state. With the first method the algorithm can also identify fragments of random coil that can turn into an alpha helix or a beta strand.
Alpha helices are assessed for their propensity to turn into the disordered state by a method that features amino acid and pentapeptide scales with two options each: the scale for stable alpha helices, and the combined scale for alpha helices that can turn into the disordered state and those that can turn into both the disordered state and random coil. If a residue is considered to be disordered by this method, it is also judged by the next one, which compares the completely disordered state with the combined scale for alpha helices that can turn into the disordered state and those that can turn into both the disordered state and random coil. 
In the same manner the algorithm finds disordered residues in beta strands, and chooses absolutely disordered residues (V) among them.
Finally, the algorithm predicts unstable alpha helices and beta strands that can appear in place of random coil or the disordered state. This prediction may be useful for considering the structure of proteins that can completely turn into the disordered state and fold back.
At the end of the 2D prediction step amino acid residues are classified into five categories: “V” means completely disordered residues; “D” means disordered residues; “N” means nonstable residues of alpha helices and beta strands that can turn into random coil or vice versa; “Z” means residues of random coil that are neither stable nor nonstable; “S” means stable residues.
The purpose of the 3D prediction step is to consider the influence of stabilizing and destabilizing contacts between amino acid residues. Only if a residue is predicted to be completely disordered (V) do we ignore any contacts it can make with other parts of the protein. If a residue is predicted to be just disordered (D) by the 2D algorithm, it stays disordered only if the number of its contacts with stable (S) residues is less than 3. If a residue is predicted to be nonstable (N), it is classified as disordered only if it makes fewer than 4 contacts with stable (S) residues. If a residue of a random coil is neither stable nor nonstable (Z), it is classified as disordered if it makes no contacts with other residues at all, or if the sum of its contacts with V, D, and N residues (the latter counted as 0.5 · N) is higher than the number of its contacts with stable residues. Even a stable (S) residue can become disordered if the sum of the numbers of its contacts with “V”, 0.5 · “D”, and 0.25 · “N” residues is higher than the number of its contacts with other “S” residues.
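The contact rules of the 3D step can be written as a small decision function. The sketch below is an illustration, not the authors' spreadsheet implementation: the function name and the representation of contacts as a list of 2D class letters are hypothetical, and the way contacts are detected in the PDB file is not modelled. The Z rule is read here as V + D + 0.5 · N compared with S, which is an interpretation of the wording in the text.

```python
from collections import Counter


def is_disordered_3d(label, neighbour_labels):
    """Apply the 3D contact rules to one residue.

    label: 2D class of the residue, one of "V", "D", "N", "Z", "S".
    neighbour_labels: 2D classes of the residues it contacts
        (hypothetical input; contact detection is out of scope).
    Returns True if the residue ends up disordered.
    """
    counts = Counter(neighbour_labels)
    v, d, n, s = counts["V"], counts["D"], counts["N"], counts["S"]

    if label == "V":
        return True          # contacts are ignored for completely disordered residues
    if not neighbour_labels:
        return True          # the text treats residues without any contacts as disordered
    if label == "D":
        return s < 3         # stays disordered with fewer than 3 contacts to S
    if label == "N":
        return s < 4         # disordered with fewer than 4 contacts to S
    if label == "Z":
        return v + d + 0.5 * n > s
    if label == "S":
        return v + 0.5 * d + 0.25 * n > s
    raise ValueError(f"unknown 2D class: {label!r}")
```

For example, a D residue with two stable neighbours stays disordered, while the same residue with three stable neighbours is reclassified as ordered.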
Taken together, the 3D prediction finds those residues that are situated in a disordered 2D environment but are surrounded by stable residues in 3D space, as well as residues in a stable 2D environment that are destabilized by contacts with other residues. In addition, residues that make no contacts are considered to be disordered, which is mostly the case for the unfolded N- and C-termini of proteins.
At the final step, residues surrounded by disordered ones in the primary sequence are considered to be disordered (DXD = DDD), and then residues surrounded by ordered ones (O) are considered to be ordered (OXO = OOO).
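The final smoothing step (DXD = DDD, then OXO = OOO) can be sketched as two passes over a string of per-residue states. The function name and the "D"/"O" string encoding are hypothetical. The text does not say whether a change made within a pass may trigger further changes in the same pass, so this sketch assumes that each pass looks only at the states it received as input.

```python
def smooth_states(states):
    """Smooth a per-residue string of 'D' (disordered) and 'O' (ordered).

    Pass 1: a residue flanked by 'D' on both sides becomes 'D'.
    Pass 2: a residue flanked by 'O' on both sides becomes 'O'.
    """

    def one_pass(seq, flank):
        out = list(seq)
        # Terminal residues have only one neighbour and are left unchanged.
        for i in range(1, len(seq) - 1):
            if seq[i - 1] == flank and seq[i + 1] == flank:
                out[i] = flank
        return "".join(out)

    return one_pass(one_pass(states, "D"), "O")
```

Because the disordered pass runs first, an isolated ordered residue between two disordered ones is absorbed before the ordered pass can act on its neighbours.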
Performance of PentUnFOLD algorithms
In the test set, we classified residues as disordered if they were missing from at least one of the structures with 100% sequence identity to the randomly selected initial one. There were 1782 disordered residues in the whole test set. As one can see in Table 1, PentUnFOLD 3D showed the highest sensitivity (71.44%) compared to nine other methods (from 3.59% for Depicter to 43.60% for Foldindex). This increased sensitivity was reached thanks to the 3D step of prediction. Indeed, for PentUnFOLD 1D the sensitivity is equal to 12.21%, and for the PentUnFOLD 2D algorithm it is equal to 39.25%. Interestingly, if we consider just those residues that are classified as prone to turn into the disordered state (“D” and “V”) by the PentUnFOLD 2D algorithm, its sensitivity is equal to 11.34%, while if we consider just structurally nonstable residues (N), the sensitivity is even higher (22.18%). These results show that helix-to-coil and strand-to-coil transitions (and vice versa) make a significant contribution to the “disappearance” of protein fragments from 3D structures. The data also show that interactions between amino acid residues are the third key to understanding the nature of intrinsically disordered regions of proteins, while the second key is the actual secondary structure, and the first key is their amino acid content and composition.
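The labelling rule used for the test set (a residue counts as disordered if it is missing from at least one structure of the same sequence) amounts to a union of per-structure gaps. The sketch below illustrates this under assumptions: the function name is hypothetical, residues are numbered 1..N along the sequence, and the sets of observed positions are supplied by the caller rather than parsed from PDB files.

```python
def disordered_positions(sequence_length, observed_per_structure):
    """Return positions missing from at least one structure.

    sequence_length: number of residues in the common sequence.
    observed_per_structure: iterable of collections, each holding the
        positions (1-based) that have coordinates in one structure.
    """
    all_positions = set(range(1, sequence_length + 1))
    missing_somewhere = set()
    for observed in observed_per_structure:
        # Positions absent from this structure are added to the union.
        missing_somewhere |= all_positions - set(observed)
    return sorted(missing_somewhere)
```

Since the result is a union, adding more structures of the same protein can only keep the set of disordered positions the same or enlarge it.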
Table 1
The results of the testing of the PentUnFOLD and other algorithms on the test set of 74 proteins
| Algorithms | Sensitivity, % | Specificity, % | Accuracy | MCC | F1 |
|---|---|---|---|---|---|
| Depicter | 3.59 | 9.22 | 0.890 | 0.006 | 0.052 |
| Foldindex | 43.60 | 9.69 | 0.612 | 0.037 | 0.159 |
| GlobPlot | 7.24 | 4.25 | 0.785 | −0.061 | 0.054 |
| PONDR VL-XT | 22.05 | 9.38 | 0.756 | 0.018 | 0.132 |
| PONDR VSL2 | 23.91 | 10.01 | 0.756 | 0.029 | 0.141 |
| flDPnn | 7.30 | 14.79 | 0.887 | 0.048 | 0.098 |
| AUCpreD | 8.59 | 27.77 | 0.905 | 0.114 | 0.131 |
| DISOPRED3 | 6.90 | 29.29 | 0.908 | 0.107 | 0.112 |
| DISOPRED3 (disordered) | 3.09 | 26.57 | 0.912 | 0.065 | 0.055 |
| PentUnFOLD 1D | 12.22 | 14.14 | 0.853 | 0.052 | 0.131 |
| PentUnFOLD 2D | 39.25 | 7.99 | 0.570 | −0.012 | 0.133 |
| PentUnFOLD 2D (D) | 11.34 | 8.04 | 0.817 | −0.005 | 0.094 |
| PentUnFOLD 2D (N) | 22.18 | 7.66 | 0.711 | −0.015 | 0.114 |
| PentUnFOLD 3D | 71.44 | 9.05 | 0.374 | 0.034 | 0.161 |
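The columns of Table 1 are confusion-matrix statistics, and a short sketch may help in reading them. The source does not give the formulas it used, so the definitions below are the conventional ones and the function names are hypothetical. One observation is an inference rather than something the text states: the reported F1 values are numerically reproduced when the "Specificity, %" column is treated as TP / (TP + FP), the quantity often called precision.

```python
import math


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def confusion_metrics(tp, fp, tn, fn):
    """Conventional statistics from true/false positive/negative counts."""
    sensitivity = _ratio(tp, tp + fn)          # fraction of disordered residues found
    ppv = _ratio(tp, tp + fp)                  # fraction of predictions that are correct
    accuracy = _ratio(tp + tn, tp + fp + tn + fn)
    f1 = _ratio(2 * ppv * sensitivity, ppv + sensitivity)
    mcc = _ratio(tp * tn - fp * fn,
                 math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"sensitivity": sensitivity, "ppv": ppv,
            "accuracy": accuracy, "f1": f1, "mcc": mcc}


def f1_from_percentages(ppv_percent, sensitivity_percent):
    """Harmonic mean of two rates given in percent."""
    p, r = ppv_percent / 100.0, sensitivity_percent / 100.0
    return _ratio(2 * p * r, p + r)
```

For instance, `round(f1_from_percentages(9.05, 71.44), 3)` combines the two percentage columns of the PentUnFOLD 3D row and can be compared with the F1 column of the same row.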
Interestingly, the nine other tested algorithms show low specificity (from 4.25 to 29.29%, Table 1). This means that all of them largely overpredict disordered regions. PentUnFOLD is no exception in terms of specificity: it is equal to 9.04% for its 3D version and 7.99% for its 2D version, but it is higher (14.14%) for its 1D version. Does that mean that we cannot trust such predictions, or does it mean that the definition of disordered regions in the current test was too strict? To answer this question, we studied a set of 103 structures of human serum albumin (HSA). Indeed, the more 3D structures are available for a given protein, the higher the percentage of residues that are missing from at least one of them. In fact, 59% of the amino acid residues of HSA are disordered according to our definition.
As one can see in Table 2, the levels of specificity for all tested algorithms are much higher for HSA than for proteins with only a few known 3D structures. For Depicter, the level of specificity is even equal to 100%, since it predicted just a single disordered residue, and that N-terminal residue may indeed disappear from 3D structures. For the other algorithms, specificity varies from 35.71% (DISOPRED3, disordered) to 75.00% (flDPnn). The specificity of PentUnFOLD 3D is equal to 60.87%, while its sensitivity (65.50%) was again the highest among the tested algorithms. As in the test set of proteins, PentUnFOLD 1D showed worse sensitivity than PentUnFOLD 2D (7.02% vs. 48.83%), while their specificities were comparable (53.33% vs. 58.19%).
Table 2
The results of the testing of the PentUnFOLD and other algorithms on the human serum albumin (HSA) with 103 available 3D structures
| Algorithms | Sensitivity, % | Specificity, % | Accuracy | MCC | F1 |
|---|---|---|---|---|---|
| Depicter | 0.29 | 100.00 | 0.411 | 0.022 | 0.006 |
| Foldindex | 47.95 | 65.60 | 0.544 | 0.011 | 0.554 |
| GlobPlot | 2.05 | 46.67 | 0.408 | −0.010 | 0.039 |
| PONDR VL-XT | 11.40 | 55.71 | 0.423 | −0.003 | 0.189 |
| PONDR VSL2 | 38.30 | 70.43 | 0.541 | 0.014 | 0.496 |
| flDPnn | 0.88 | 75.00 | 0.413 | 0.010 | 0.017 |
| DISOPRED3 | 7.89 | 65.85 | 0.432 | 0.005 | 0.141 |
| DISOPRED3 (disordered) | 1.46 | 35.71 | 0.402 | −0.022 | 0.028 |
| PentUnFOLD 1D | 7.02 | 53.33 | 0.415 | −0.005 | 0.124 |
| PentUnFOLD 2D | 48.83 | 58.19 | 0.491 | −0.002 | 0.531 |
| PentUnFOLD 2D (D) | 18.42 | 62.38 | 0.453 | 0.003 | 0.284 |
| PentUnFOLD 2D (N) | 42.98 | 57.65 | 0.476 | −0.002 | 0.492 |
| PentUnFOLD 3D | 65.50 | 60.87 | 0.547 | 0.006 | 0.631 |
The results provided above show that proteins usually have many regions that can change their structure or turn into the disordered state. Even subtle changes in conditions or the binding of specific ligands can change the network of intraprotein amino acid contacts and release disordered regions from stabilizing interactions.
Evaluation of the consequences of amino acid substitutions in the most disordered region of human major prion protein by the PentUnFOLD algorithm
According to the results of the PentUnFOLD algorithm, there are several disordered areas in the human major prion protein. We used the NMR structure with PDB ID 1HJM as an input. According to the 2D predictions, the first alpha helix (144–152) contains a disordered N-terminal part (3 residues) and a disordered C-terminal residue. There are just two stable residues in that alpha helix: 149–150. The long second alpha helix (174–194) is almost completely nonstable: just two of its residues are stable (Phe175 and Val180). The C-terminal part of the second alpha helix is disordered (190–194), and one of those residues (Thr192) is absolutely disordered. The third alpha helix (200–225) contains a disordered N-terminal part (5 residues), a disordered residue 215, and one more disordered region (218–223). At the same time, just 10 out of 26 residues from the third alpha helix are predicted as nonstable. Residue 196, situated in the random coil between the second and the third alpha helices, is predicted to be disordered.
The results of the PentUnFOLD algorithm are in agreement with previous works showing that the region containing the C-terminal part of the second alpha helix, the N-terminal part of the third alpha helix, and the random coil between them is prone to form beta structure (Khrustalev et al. 2016).
With the help of our new algorithm we estimated the consequences of amino acid substitutions in this region that are known to be associated with prion diseases.
H187R. This amino acid substitution is associated with Gerstmann-Straussler disease (GSD). As a result of the substitution, the two amino acid residues before R187 become disordered. If we consider that there is a beta strand in place of the second alpha helix, there are no changes after the amino acid substitution (Table 3).
Table 3
Consequences of amino acid substitutions associated with the development of human prion diseases according to the original PentUnFOLD and other algorithms (2D predictions)
Comparing the performance of different prediction methods and computer algorithms is not completely straightforward or objective. There are many different criteria for evaluating predictive ability, and they usually give different results. From this point of view, it is important to discuss the advantages and disadvantages of those algorithms in order to understand when and why they are suitable, and to identify the conditions in which they become misleading.
Coming back to Table 1, one can see that the highest accuracy belongs to the DISOPRED3 (disordered) algorithm (91.16%). Intriguingly, the closest value of accuracy among our algorithms (85.31%) belongs to PentUnFOLD 1D. The second best of our algorithms in terms of accuracy (81.70%) is PentUnFOLD 2D when only “D” and “V” residues are considered. However, PentUnFOLD 3D has an accuracy of just 37.42%. The reason for this difference in accuracy lies in the common style of disorder prediction shared by PentUnFOLD 1D, DISOPRED3 (disordered), DISOPRED3, AUCpreD, and flDPnn. All of these algorithms have low sensitivity to disordered residues, but high sensitivity to ordered residues. The fraction of ordered residues is higher than the fraction of disordered ones. That is why the ratio between the sum of true positive and true negative residues and the total number of residues is so high. One may choose those algorithms to find sequences with a high tendency to turn into the disordered state, as well as regions that are usually ordered. This ability is well reflected by the MCC (Matthews correlation coefficient). The highest values of MCC, which are still far from 1, belong to AUCpreD, DISOPRED3, and DISOPRED3 (disordered) (Table 1). Among the PentUnFOLD algorithms, only PentUnFOLD 1D has an MCC value that is close to that of DISOPRED3 (disordered).
However, in both Tables 1 and 2 the MCC values are close to 0, which shows that there is still a need for new ideas and approaches from the developers of disorder prediction software.
In Table 2, accuracy values never reach values as high as in Table 1 for any algorithm. Indeed, many residues predicted to be ordered are really disordered in at least some structures of HSA. So, three other algorithms show the highest accuracy values in the set of HSA structures: PentUnFOLD 3D, Foldindex, and PONDR VSL2. Those algorithms largely overpredict disorder in the test set (Table 1), but perform much better in the set with an increased percentage of disordered residues (Table 2). So, these algorithms are recommended if one wants to find all the regions that have a chance (even a low one) to turn into the disordered state. Indeed, if we consider the F1 index, which is largely focused on the ability of algorithms to find true positives, we see that it has the highest values for PentUnFOLD 3D, Foldindex, and PONDR VSL2. Notice that the values of F1 for these algorithms are much higher in Table 2 than in Table 1.
Taken together, the test of the performance of current algorithms on a new set of proteins showed that they can be classified into two groups: those that are good at identifying regions that have a high probability of turning into the disordered state, and those that are good at identifying regions that have a high, average or even low probability of becoming disordered.
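The argument about accuracy made in this discussion can be stated compactly: accuracy is the average of the recall on disordered residues and the recall on ordered residues, weighted by how common each class is. The sketch below illustrates this with hypothetical recall values that do not describe any of the tested algorithms; only the 59% disordered fraction for HSA is taken from the text, and the function name is an assumption.

```python
def accuracy_from_rates(disordered_fraction, disordered_recall, ordered_recall):
    """Accuracy as a class-weighted average of the two per-class recalls.

    disordered_fraction: share of residues labelled disordered (0..1).
    disordered_recall: fraction of disordered residues predicted as such.
    ordered_recall: fraction of ordered residues predicted as ordered.
    """
    if not 0.0 <= disordered_fraction <= 1.0:
        raise ValueError("disordered_fraction must lie between 0 and 1")
    return (disordered_fraction * disordered_recall
            + (1.0 - disordered_fraction) * ordered_recall)


# Hypothetical predictors, used only to illustrate the effect of class balance.
conservative = {"disordered_recall": 0.05, "ordered_recall": 0.98}
permissive = {"disordered_recall": 0.70, "ordered_recall": 0.40}

for fraction in (0.10, 0.59):  # 0.59 is the HSA value quoted in the text
    for name, rates in (("conservative", conservative), ("permissive", permissive)):
        print(fraction, name, round(accuracy_from_rates(fraction, **rates), 3))
```

With these illustrative numbers the conservative predictor is more accurate when only 10% of residues are disordered, and the permissive one overtakes it when 59% are, which mirrors the reversal between Tables 1 and 2. A predictor that calls every residue ordered would score 1 − 0.59 = 0.41 on a set with 59% disordered residues.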
Conclusions
Due to the enormous functional and medical importance of IDPs/IDPRs, prediction of intrinsic protein disorder from amino acid sequence has become an area of active research. Such proteins are frequently involved in some of the most important regulatory functions in the cell, and the intrinsic lack of structure can confer functional advantages on a protein, including the ability to bind to several different targets performing sometimes even opposite functions. Many diseases are associated with different structural transitions. That is why approaches to creating new predictive algorithms are being developed.
Our algorithm, PentUnFOLD, is based on newly obtained propensity scales. It can determine not only fragments of alpha helices, beta strands, and random coils that can turn into the completely disordered state, but also regions of alpha helices and beta strands that are able to turn into random coils, and vice versa (H ↔ C, E ↔ C), not just at the N- and C-termini of proteins, but also in the middle of their sequences. Moreover, PentUnFOLD has the option to determine the effect not only of amino acid substitutions, but also of secondary structure transitions on the stability of a given region in an unmodified or modified protein.
Prediction of disordered regions from the 3D structure brings some benefits compared to prediction from the amino acid sequence. First, the amino acid content of alpha helices, beta strands and random coils prone to turn into the disordered state differs. So, it is better to know the secondary structure of a given fragment of the polypeptide chain when considering its ability to turn into random coil or the disordered state. Second, interactions between amino acid residues may decrease or increase the possibility of the transition of a given fragment to the disordered state.
Our web server (http://3.17.12.213/pent-un-fold) processes one PDB file or amino acid sequence at a time. The algorithm itself is incorporated into an MS Excel spreadsheet, so all the data are inserted into the spreadsheet automatically by the JAVA scripts from our server. The user then has to download the resulting file and open it with either MS Excel or LibreOffice Calc. Users are also welcome to perform those operations manually with the original spreadsheets (http://chemres.bsmu.by/PentUnFOLD.htm).
Below is the link to the electronic supplementary material.
Supplementary file1 (DOCX 54 KB)
Supplementary file2 (DOCX 23 KB)
Supplementary file3 (XLSX 12169 KB)
Supplementary file4 (XLSX 62 KB)
Supplementary file5 (DOCX 22 KB)
Supplementary file6 (DOCX 62 KB)
Supplementary file7 (PDF 118 KB)