Conformational Entropy of Intrinsically Disordered Proteins from Amino Acid Triads.

Anupaul Baruah¹, Pooja Rani¹, Parbati Biswas¹.

Abstract

This work quantitatively characterizes intrinsic disorder in proteins in terms of sequence composition and backbone conformational entropy. Analysis of the normalized relative composition of the amino acid triads highlights a distinct boundary between globular and disordered proteins. The conformational entropy is calculated from the dihedral angles of the middle amino acid in the amino acid triad for the conformational ensemble of the globular, partially and completely disordered proteins relative to the non-redundant database. Both Monte Carlo (MC) and Molecular Dynamics (MD) simulations are used to characterize the conformational ensemble of the representative proteins of each group. The results show that the globular proteins span approximately half of the allowed conformational states in the Ramachandran space, while the amino acid triads in disordered proteins sample the entire range of the allowed dihedral angle space following Flory's isolated-pair hypothesis. Therefore, only the sequence information in terms of the relative amino acid triad composition may be sufficient to predict protein disorder and the backbone conformational entropy, even in the absence of well-defined structure. The predicted entropies are found to agree with those calculated using mutual information expansion and the histogram method.

Year: 2015 PMID: 26138206 PMCID: PMC4490338 DOI: 10.1038/srep11740

Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379

The conformational entropy of a protein is a proxy measure of its internal dynamics, which may be characterized by enumerating the different microscopic structural states involved in atomic motion [1–3]. Backbone conformational entropy constitutes a critical component of protein stability and plays a key role in the energetics of protein folding. Experimental measurement of conformational entropy has been difficult, even though atomic motions in the ps-ns range may be characterized via NMR relaxation methods [4], or by experiments such as AFM unfolding [5] and neutron spectroscopy, which demonstrates the role of conformational entropy in thermal protein unfolding [6]. Theoretical studies [7,8] that quantify changes in conformational entropy are, however, computationally demanding. The first attempt in this context [9] calculated the relative entropy of protein unfolding for various residues with respect to glycine. A different approach [10] correlated the backbone entropy with the distribution of main-chain ϕ - ψ dihedral angles in crystallographic structures by studying unfolded or flexible denatured states for a set of 61 nonhomologous proteins. The conformational entropy of a protein may be characterized in terms of dihedral angles. Neglecting the time-dependent correlation of the dihedral angles may often lead to an upper bound on the value of the conformational entropy [11,12]. The correlation between the dihedral angles may be extracted from molecular dynamics simulation trajectories coupled with experimental NMR relaxation data for proteins through generalized order parameters, S², derived from spin relaxation [13,14]. This method is not useful for disordered proteins since they lack a well-defined structure. Computational approaches link atomic motions with conformational entropy through principal component analysis of the variance-covariance matrix of the protein's internal or positional coordinates.
The eigenvalues of the variance-covariance matrix are used to evaluate the quasi-harmonic entropy via the entropy equation of the quantum mechanical harmonic oscillator [15,16]. This method overestimates the conformational entropy because the internal coordinates are assumed to follow a multidimensional Gaussian distribution [17–19], which may not be valid for the dihedral angles of disordered proteins. This assumption is hardly true even for globular proteins, where some dihedral angles follow non-Gaussian distributions. The mutual information expansion method [12] evaluates the entropy by approximating the correlation between internal coordinates up to a certain order. However, this method is complicated by sampling statistics and convergence problems at or beyond third-order correlations, even for medium-sized proteins [20]. A recent study by Genheden, Akke and Ryde [21] infers that even long MD simulations do not completely equilibrate protein conformations, while the configurational entropy depends on both sampling statistics and simulation time. Hence, these methods have limited scope for intrinsically disordered proteins (IDPs), where the disordered regions/domains are characterized by a conspicuous absence of interpretable electron density due to the fast motions of the atoms, rendering them invisible. Intrinsically disordered proteins may be best represented as a dynamic ensemble of rapidly interconverting conformations [22], either at the level of secondary or tertiary structure, resembling the unfolded protein under physiological conditions. In this context, Molecular Dynamics (MD) and Monte Carlo (MC) simulations are important in modeling the conformational ensembles of IDPs and intrinsically disordered protein regions (IDPRs). Despite the absence of a unique three-dimensional structure, disordered proteins exhibit functional diversity which complements the functions of ordered protein regions [23–25].
Another important feature of intrinsically disordered proteins is their disorder-order transition upon binding to specific targets, which explains the mechanism of regulating various cellular processes like transcription, translation and cell cycle control [26–30]. Sequence analysis reveals that disordered domains/proteins are associated with lower sequence complexity [31]. The sequences of these proteins are characterized by a low content of hydrophobic residues and a large number of charged residues, which disfavor the folding process [26,32] and result in fewer two-body contacts [33]. This article quantifies the structural disorder of partially/completely disordered proteins in terms of both sequence composition and backbone conformational entropy, relative to that of the globular proteins. The non-redundant database and the selected data sets of globular proteins and of partially and completely disordered proteins (IDPRs and IDPs) are compiled separately for sequence analysis. MC and MD simulations characterize the conformational ensemble of the representative proteins of each group. A sequence analysis of the normalized relative composition for each individual amino acid reveals a considerable overlap between globular proteins and IDPs. However, the normalized relative composition calculated using amino acid triads is higher for IDPs than for globular proteins and is well demarcated from the region occupied by globular proteins. The conformational entropy of a disordered protein sequence may be expressed as the sum of the conformational entropies of the corresponding triads present in the non-redundant database, obeying Flory's isolated-pair hypothesis. Thus protein disorder may be predicted from the relative sequence composition only, while the backbone conformational entropy provides an appropriate measure of this structural disorder.
The advantage of this method is that it avoids the need for extra-long simulations, which are both computationally expensive and time consuming.

Materials and Methods

Database Selection for Sequence Analysis

Non-redundant database

The non-redundant database chosen for this study comprises non-homologous protein chains with a sequence identity of ≤25% (Feb 2010 release of PDB-select 1992–2009 [34]). Proteins with X-ray crystallographic structures of resolution ≤3 Å and R-factor ≤0.3 are selected from the Protein Data Bank (PDB) [35]. The compiled non-redundant database consists of 4316 chains from 4163 proteins with chain lengths ranging from 25 to 1015 residues.

Globular Proteins

A data set of globular proteins is compiled from the RCSB PDB with X-ray crystallographic structures of resolution ≤3 Å. All proteins consist of single chains without any missing residues (listed in REMARK 465 of the PDB file). A sequence similarity cutoff of ≤25% and a length of ≥40 residues are applied using the PISCES server [36]. The final data set of globular proteins consists of 1917 chains with chain lengths ranging between 40 and 1724 residues.

Group I

A data set of protein chains with intrinsically disordered protein regions (IDPRs) is selected from the PDB. The disordered regions are characterized by missing residues in the electron density map of the respective X-ray crystallographic structures. A non-redundant data set of 9508 protein chains is obtained from the compiled database using the PISCES server with the selection criteria of ≤25% sequence similarity and length ≥40 residues. Of these, 138 chains with >50% structural disorder (the percentage of structural disorder is calculated with respect to the total length of the protein chain) constitute the Group I data set. The chain lengths of these proteins range from 40 to 926 residues.

Group II

A set of 109 Intrinsically Disordered Proteins (IDPs) is selected from DisProt, the Database of Protein Disorder (Release 6.02) [37]. These proteins are completely disordered. A sequence similarity cutoff of ≤25% and a length of ≥40 residues are applied to these proteins using the PISCES server. The final data set of IDPs comprises 91 proteins with chain lengths ranging between 40 and 1861 residues. The detailed method for the selection of globular, Group I and Group II proteins may be found in Ref. 38. The protein IDs for the selected data sets of globular, Group I and Group II proteins are provided in the Supplementary information.

Selected Proteins for Simulation

Conformational ensembles of the representative proteins from different groups are generated independently via Molecular Dynamics and Monte Carlo simulations. The representative globular protein is α-lactalbumin, whose crystal structure is taken from the PDB (ID: 1A4V). Disordered proteins are selected from two classes: (i) 1CD3, 1F0R and 1MVF, with well-defined secondary structure coexisting with highly flexible disordered regions, and (ii) α-synuclein, which lacks well-defined tertiary structure and is completely disordered. The representative proteins are selected for the following reasons: (i) the selected proteins collectively possess the complete set of 20 amino acids in their disordered regions; (ii) the proteins exhibit varying degrees of structural disorder, with 1CD3 having the minimum disorder content and α-synuclein the maximum; (iii) the disordered regions vary in location, i.e. at the C-terminus of the sequence for 1MVF, at the N-terminus for 1F0R, in the middle regions of the sequence for 1CD3 and throughout the entire sequence for α-synuclein. The three-dimensional structures of the ordered regions of 1CD3, 1F0R and 1MVF are obtained from the PDB, while missing residues in the disordered regions are modeled with MODELLER [39] using inputs from the protein sequence and the structure of the ordered regions. The missing residues are incorporated by MODELLER such that the structure of the ordered part of the protein is exactly conserved. For α-synuclein, the sequence is the only input used to model its structure. Supplementary Table S1 online summarizes the proteins used for MD and MC simulations along with the method of modeling their disordered regions.

Molecular Dynamics Simulation

Molecular Dynamics simulations are performed for each of the representative proteins (1A4V, 1CD3, 1F0R, 1MVF and α-synuclein) in explicit water using the AMBER 12 simulation package [40]. The LEAP subroutine is used to add the missing hydrogen atoms in each of the protein structures. The ff99SB force field is employed for the proteins and the TIP3P model [41] for water, with periodic boundary conditions. This force field presents a careful reparametrization of the backbone torsion terms compared to ff99 and provides an improved proportion of helical versus extended structures [42,43]. Each protein structure is solvated in a cubic box of TIP3P water whose edges are kept at a distance of 10 Å from the protein surface, with a closeness parameter of 1 Å. The charge of each protein is neutralized by adding either Na⁺ or Cl⁻ ions, depending on the charge of the solvated protein. The PME algorithm [44] with a real-space cutoff of 8.0 Å is used to treat the long-range electrostatic interactions, and an 8 Å distance cutoff is applied for the non-bonded interactions. The hydrogen atoms are constrained to the equilibrium bond length using the SHAKE algorithm [45], which allows a larger time step of 0.002 ps. Temperature and pressure are controlled through Berendsen's temperature bath with a coupling constant of 2 ps and a barostat with a coupling constant of 1 ps, respectively [46]. Each solvated protein is energy minimized in two stages: first the solvent is minimized using the conjugate gradient method while the protein is kept constrained, and then the whole solvated protein is minimized. The minimized protein is first equilibrated at an initial temperature of 100 K in the NVT ensemble, after which the temperature is gradually increased up to 300 K at constant volume. An NPT equilibration of 5 ns is then performed for each solvated protein at a constant temperature of 300 K and a pressure of 1 bar, followed by an extended NPT simulation of 100 ns.
The root-mean-square deviation (RMSD) and radius of gyration (Rg) are plotted as a function of the simulation time for each protein in Supplementary Fig. S1 online.

Monte Carlo Simulation

Metropolis Monte Carlo simulation is also performed for 1CD3, 1F0R, 1MVF and α-synuclein to complement the results of the Molecular Dynamics simulations. The Cα backbones of the modeled conformations of these four proteins are used as input structures for the Monte Carlo simulations. The pseudo Cα(i)-Cα(i + 1) and Cα(i)-Cα(i + 2) bond lengths are restricted to 3.8 ± 0.15 Å [47,48] and 6.0 ± 1.5 Å [49] respectively. A 6–12 Lennard-Jones potential accounts for the van der Waals interactions. The hydrophobic, electrostatic and steric interactions are also modeled through coarse-grained two-body interaction potentials, and the energy function is expressed in terms of these contributions. The hydrophobic term depends on the normalized hydrophobicities of the two interacting residues and on the distance between the sites j and k. The choice of the potential ensures that hydrophobic-hydrophobic interactions are preferred while polar-polar interactions are not. The electrostatic term depends on the charges on the two residues. The steric term depends on the normalized molecular weight of the residue and the normalized coordination number of the j-th site in the i-th conformation. The steric energy function is chosen such that residues with high molecular weight are preferred at sites with low coordination number, while residues with low molecular weight are preferred at highly coordinated sites. For each Monte Carlo simulation, 2000 conformations are randomly selected for each of the representative proteins. These conformations are solvated using the TIP3P water model and energy minimized using AMBER 12. The root-mean-square deviation (RMSD) and radius of gyration (Rg) for the selected MC conformations are plotted for each protein in Supplementary Fig. S2 online.
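The acceptance step of a Metropolis scheme with the Cα restraints quoted above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the `kT` parameter and the treatment of the restraints as a hard accept/reject filter are assumptions, and the coarse-grained energy function itself is not reproduced here.

```python
import math
import random

def metropolis_accept(delta_e, kT, rng=random):
    """Standard Metropolis criterion: downhill moves are always accepted,
    uphill moves with probability exp(-delta_e / kT)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / kT)

def _within(coords, step, target, tol):
    """True if every C-alpha(i)-C-alpha(i + step) distance lies in target +/- tol."""
    return all(abs(math.dist(a, b) - target) <= tol
               for a, b in zip(coords, coords[step:]))

def restraints_ok(coords):
    """Pseudo-bond restraints from the text (angstroms):
    i,i+1 within 3.8 +/- 0.15 and i,i+2 within 6.0 +/- 1.5.
    Applying them as a hard filter is an assumption of this sketch."""
    return _within(coords, 1, 3.8, 0.15) and _within(coords, 2, 6.0, 1.5)
```

In such a sketch a trial move would be rejected outright when `restraints_ok` fails and otherwise passed to `metropolis_accept` with the energy change of the move. Note that a fully extended chain with 3.8 Å spacing violates the i, i + 2 restraint, since its i, i + 2 distance is 7.6 Å.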

Residuewise Conformational Entropy

Backbone conformational entropy may be evaluated from the probability distribution of the ϕ - ψ dihedral angles of the main chain of polypeptides [10,50]. This entropy measure may be used as a criterion to distinguish globular proteins from disordered proteins, which lack well-defined secondary or tertiary structures. In this work, conformational entropy is calculated in terms of the Shannon entropy, which may be expressed as [8,51]

$$S = -\sum_{i} P_i \ln P_i \qquad (5)$$

where $P_i$ is the fraction of amino acids present in the i-th bin for a specific range of ϕ - ψ angles of a given peptide segment. For each amino acid, the distribution of dihedral angles is obtained from the database of conformations. For the calculation of residuewise conformational entropy, individual amino acids are binned across the specified range of ϕ - ψ angles [8]. The Ramachandran plot is divided into a 90 × 90 grid of equally spaced cells, each 4° in height and width. Each amino acid in the protein is assigned to a specific cell according to the values of its ϕ - ψ angles. The corresponding $P_i$ value is calculated from the fraction of the respective amino acid in that specific cell in the database. The conformational entropy of a protein is calculated from its $P_i$ values as defined in eq 5.
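The binning and entropy calculation described above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the natural logarithm is an assumption about the units of the entropy, and angles are assumed to be given in degrees and wrapped into [-180°, 180°).

```python
import math
from collections import Counter

def bin_index(angle, bin_width):
    """Map an angle in degrees onto a bin index after wrapping into [-180, 180)."""
    wrapped = (angle + 180.0) % 360.0
    return int(wrapped // bin_width)

def shannon_entropy(phi_psi, bin_width=4.0):
    """Shannon entropy of a list of (phi, psi) pairs on a square grid.

    bin_width=4.0 gives the 90 x 90 grid used for single residues;
    empty cells contribute nothing to the sum.
    """
    counts = Counter((bin_index(phi, bin_width), bin_index(psi, bin_width))
                     for phi, psi in phi_psi)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

A sample confined to a single cell gives zero entropy, while a sample split evenly between two cells gives ln 2, which is a quick sanity check on any implementation of the histogram estimate.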

Two-body Contacts

The average number of two-body contacts is a measure of the residue-residue interactions present in a protein, and it discriminates globular proteins from disordered ones. A pair of non-hydrogen atoms is considered to be in contact when they are separated by a distance of less than 8 Å [52]. For the i-th residue of a protein, neighbors along the sequence (i − 1, i − 2, i + 1 and i + 2) are neglected, since contacts between these residues are also present in the denatured state with high probability [53].
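A contact count following the rule above might look like the sketch below. The data layout (one list of heavy-atom coordinates per residue) and the convention that two residues are in contact when any pair of their heavy atoms is closer than the cutoff are assumptions made for illustration; the text does not state how atom-level contacts are aggregated per residue.

```python
import math

def count_contacts(residues, cutoff=8.0, min_separation=3):
    """Per-residue number of two-body contacts.

    residues: list with one entry per residue, each a list of (x, y, z)
    heavy-atom coordinates in angstroms (hypothetical input layout).
    Sequence neighbours i +/- 1 and i +/- 2 are skipped via min_separation=3.
    """
    n = len(residues)
    contacts = [0] * n
    for i in range(n):
        for j in range(i + min_separation, n):
            if any(math.dist(a, b) < cutoff
                   for a in residues[i] for b in residues[j]):
                contacts[i] += 1
                contacts[j] += 1
    return contacts
```

The average of the returned list over all residues gives the average number of two-body contacts for one conformation.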

Results and Discussions

The residuewise conformational entropy is calculated using eq 5 for each amino acid in the non-redundant database and in the conformational ensembles of each of the representative proteins generated by MD and MC simulations. The conformational entropy of IDPs and IDPRs calculated using the conformational ensembles generated by MD simulations is compared with that in the non-redundant database and the globular protein (1A4V) in Supplementary Fig. S3(a) online, while the same comparison is depicted in Supplementary Fig. S3(b) online for the conformational ensembles generated by MC simulations. The relative composition of the individual amino acids is calculated from the fraction of each amino acid present in the data set of Group II proteins relative to that of the globular proteins. The relative composition of an amino acid is given by eq 6, where j = 1 to 20, P denotes the fraction of the j-th residue in the i-th sequence of length n, and the summation is over all sequences in the respective data sets of proteins. The normalized relative composition, which is used to differentiate between the globular and Group II proteins, is given by eq 7, where R is the total number of residues in a sequence and Δ represents the normalized relative composition of the r-th residue in a sequence in the data set X, for either globular proteins or Group II proteins. The normalized relative composition is calculated for individual sequences in the data sets of globular and Group II proteins and the values are plotted in Fig. 1(a). The figure depicts a fuzzy boundary between the globular and Group II proteins, even though Group II proteins, on average, show a higher value of the normalized relative composition. Thus, a new method is proposed which considers combinations of three amino acids, rather than individual ones, to distinguish the Group II proteins from the globular ones.
The combinations of three amino acids, termed amino acid triads (8000 combinations in total), are considered, since the conformational flexibility of a specific amino acid residue depends on its flanking neighbors at the two ends. The normalized relative composition of these triads is calculated for each sequence of the globular and Group II proteins. The calculation of the normalized relative composition of the triads is similar to that of the individual amino acids using eqs 6 and 7. The normalized relative composition is calculated for all possible triads (i.e. 20³ = 8000). For amino acid triads, P is the fraction of the j-th triad in the i-th sequence and R represents the total number of such triads in the sequences of the data sets for globular and Group II proteins. It is found that the triads HHH, LEH, ALS and VLD are least preferred (excluding the triads which do not exist in Group II proteins), while the triads QPF, PQQ, PHW and QQP are highly preferred in Group II proteins (see Supplementary Fig. S4 online). In Supplementary Fig. S4(a) the relative composition of the top 50 triads, which are most preferred in Group II proteins, is plotted with 99% confidence intervals from 2000 bootstrap resampling iterations. Supplementary Fig. S4(b) depicts the relative composition of 50 triads which do not exhibit any clear preference; an increased or decreased frequency of these triads in any sequence occurs by chance. The relative composition of the 50 least preferred triads is plotted in Supplementary Fig. S4(c) online. In Fig. 1(b), the normalized relative composition is plotted for each individual sequence in the data sets of globular and Group II proteins. This figure depicts a higher value of the normalized relative composition for Group II proteins, with a distinct boundary separating the Group II proteins from the globular proteins.
Thus, the composition of amino acid triads provides a novel method to differentiate between disordered and globular proteins and hence serves as an appropriate yardstick to predict disorder. The distribution of the normalized relative composition calculated using amino acid triads is plotted for the data sets of globular, Group I and Group II proteins (see Supplementary Fig. S5 online). The results show that disordered proteins may be differentiated from well-structured globular proteins using sequence information only.
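The triad bookkeeping behind this analysis can be sketched as below. Counting overlapping windows of three residues follows from the definition of a triad around each middle residue; the pooled fractions and, in particular, the `relative_composition` formula are assumptions chosen for illustration and should not be read as eqs 6 and 7, which are not reproduced in this text.

```python
from collections import Counter

def triad_counts(sequence):
    """Count overlapping amino acid triads in a one-letter sequence."""
    return Counter(sequence[i:i + 3] for i in range(len(sequence) - 2))

def triad_fractions(sequences):
    """Fraction of each triad pooled over a data set of sequences."""
    total = Counter()
    for seq in sequences:
        total.update(triad_counts(seq))
    n = sum(total.values())
    return {triad: c / n for triad, c in total.items()}

def relative_composition(fractions_x, fractions_ref):
    """ASSUMED form (f_X - f_ref) / f_ref, evaluated for every triad that
    occurs in the reference set; positive values mean enrichment in X."""
    return {t: (fractions_x.get(t, 0.0) - f) / f
            for t, f in fractions_ref.items()}
```

With the Group II sequences as `X` and the globular sequences as the reference, triads such as QPF or QQP would be expected to receive positive values and triads such as HHH or VLD negative ones, in line with the preferences reported above.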

Figure 1

Normalized relative composition calculated using (a) individual amino acid, (b) triads of amino acids, as a function of sequences in data set of globular and Group II proteins.

Blue line in (b) represents the boundary between globular and Group II proteins.

The above results confirm that the choice of amino acid triads provides a simple and effective method to predict disorder in proteins. This may suggest that these triads encode important information about the stability, flexibility and conformational entropy of proteins. Thus, the conformational entropy of these triads is calculated from the distribution of ϕ - ψ angles for the different conformational ensembles of the non-redundant database, the globular protein (1A4V), IDPRs (1CD3, 1MVF, 1F0R) and IDPs (α-synuclein) respectively. The conformational entropy of each conformational ensemble may be calculated by the histogram method as

$$S = -\sum_{i} P_i \ln P_i \qquad (8)$$

where $P_i$ is the fraction of triads present in the i-th bin for a given range of the dihedral angles, ϕ - ψ. The Ramachandran plot is divided into a 12 × 12 grid of equally spaced cells, each 30° in height and width. A sufficiently large bin size of 30° is chosen to minimize the statistical errors in the calculated probabilities of amino acid triads. Each amino acid triad in the protein is binned in the grid according to the ϕ - ψ angles of the middle amino acid of the triad. The value of $P_i$ is calculated from the fraction of triads in the specific cell of the database. The probability distributions of the triads in the non-redundant data set are depicted in Fig. 2. In Fig. 2(a–c) the probability distribution contour is plotted for the triads EAL, DAT and TAS. Despite the identical middle residue in each of these triads, the residue Alanine populates different regions of the dihedral angle space in the Ramachandran plot. Among these three triads, EAL has the lowest entropy (S = 1.516) and is restricted to the α-helix region of the Ramachandran plot, DAT (S = 2.717) populates the α-helix and PP-II helix regions, while TAS (S = 2.855) populates the α-helix, PP-II helix and β-sheet regions. This suggests that the flanking residues of an amino acid may have a significant influence on the structural degrees of freedom of that amino acid.
Similarly, Proline, which is known to be structurally rigid, may exhibit varied structural flexibility and conformational entropy, as is evident from Fig. 2(d–f). The highly flexible Glycine may contribute differently to the conformational entropy of a protein depending on its nearest neighbors in the triad, as depicted in Fig. 2(g,h). Interestingly, Aspartic acid is found to have a higher entropy than Glycine when it is flanked by residues D and K (Fig. 2(i)). Figure 2(j–l) depicts the change in the conformational entropy of the amino acid triads when only the middle amino acid residue is changed. Among these three triads, REI exhibits the lowest entropy (S = 1.819) while RNI exhibits the highest entropy (S = 3.006). Supplementary Table S2 lists the relative entropies of amino acid triads (XXX) that show higher entropy relative to XGX, with 99% confidence intervals from 2000 bootstrap resampling iterations. The relative entropies of the triads that exhibit minimum entropy with respect to XGX are given in Supplementary Table S3 online. Figure 2 and Supplementary Tables S2 and S3 imply that changing the middle amino acid for the same flanking residues, as well as changing the flanking residues for the same middle amino acid, may have a significant impact on the conformational entropy. Therefore, the study of triads in terms of conformational entropy is important. The conformational entropy of each amino acid triad is evaluated for the globular protein (1A4V), IDPRs (1CD3, 1F0R and 1MVF) and IDPs (α-synuclein) using MD and MC generated conformational ensembles. The ratio of the conformational entropy of a specific triad in a data set of proteins to the corresponding conformational entropy of the triad in the non-redundant database is calculated, and the probability distribution of this ratio is plotted in Fig. 3(a,b) for MD and MC simulations respectively.
Globular proteins exhibit the lowest entropy ratio, with an average of ~0.65, which implies that a triad present in a protein with a well-defined structure populates approximately half of the accessible ϕ - ψ range allowed for that triad. The conformational entropy of the ordered regions of IDPRs lies between those of their disordered counterparts and the globular proteins. α-synuclein shows the highest entropy ratio for both the MD (average of ~1.0) and the MC (average of ~1.47) generated conformational ensembles. This implies that in a completely disordered protein, an amino acid triad spans the entire range of allowed ϕ - ψ angles over a period of time, imparting high flexibility and consequently high conformational entropy to the structural ensemble.
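The triad-level entropy and its ratio to the reference database, as discussed above, can be sketched as follows. The record layout `(triad, phi, psi)`, the function names and the use of the natural logarithm are assumptions of this sketch; the ϕ - ψ values are those of the middle residue of each triad, binned on the 30° grid.

```python
import math
from collections import Counter, defaultdict

def grid_cell(phi, psi, width=30.0):
    """Cell of the 12 x 12 Ramachandran grid for angles in degrees."""
    return (int(((phi + 180.0) % 360.0) // width),
            int(((psi + 180.0) % 360.0) // width))

def triad_entropies(records, width=30.0):
    """Histogram entropy per triad.

    records: iterable of (triad, phi, psi), where phi and psi belong to the
    middle residue of the triad (hypothetical input layout).
    """
    cells = defaultdict(Counter)
    for triad, phi, psi in records:
        cells[triad][grid_cell(phi, psi, width)] += 1
    entropies = {}
    for triad, counts in cells.items():
        n = sum(counts.values())
        entropies[triad] = -sum((c / n) * math.log(c / n)
                                for c in counts.values())
    return entropies

def entropy_ratio(ensemble, reference):
    """Ratio of ensemble entropy to reference entropy for shared triads;
    triads with zero or missing reference entropy are skipped."""
    return {t: s / reference[t] for t, s in ensemble.items()
            if reference.get(t, 0.0) > 0.0}
```

In this picture a ratio near 1 means that the triad samples the ensemble as broadly as it does across the reference database, while smaller ratios indicate a more restricted distribution.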

Figure 2

The probability distribution of ϕ - ψ angles in the non-redundant database for the following triads: (a) EAL, (b) DAT, (c) TAS, (d) VPV, (e) DPA, (f) LPF, (g) DGK, (h) LGT, (i) DDK, (j) REI, (k) RLI and (l) RNI.

The corresponding entropy values are mentioned in parentheses.

Figure 3

Conformational entropy calculated using triads relative to non-redundant database for (a) Molecular dynamics, (b) Monte Carlo generated ensembles of conformations.

The ratio of the conformational entropy of any triad in the conformational ensembles of the partially and completely disordered proteins to that in the non-redundant database is calculated for conformational ensembles generated by MD simulations at 60 ns, 80 ns and 100 ns. The probability distribution of this ratio is plotted in Supplementary Figs S8 and S9 for partially disordered and completely disordered proteins. Despite the slight difference in the conformational entropy values between the 60 ns and 100 ns MD trajectories, the 80 ns and 100 ns MD trajectories depict a similar distribution of the conformational entropy for both partially and completely disordered proteins. The position of the maximum is coincident for the 80 ns and 100 ns MD trajectories in both figures. This implies that a 100 ns MD simulation may be sufficient to sample the ϕ - ψ range of the triads in disordered proteins. Recent studies [54–56] also support the view that a 100 ns long MD simulation may adequately sample the dihedral angle space of intrinsically disordered proteins. It is observed that dihedral angles are strongly correlated for IDPs; a few of the most populated dihedral combinations may dominate a major fraction of the entire conformational ensemble [55]. Therefore longer simulations may help to sample more conformations which are not frequently populated, but the overall ϕ - ψ distribution and the position of the maximum may not change. The ϕ - ψ angles of the selected triads from each ensemble of conformations of the globular protein (1A4V), IDPRs (1CD3, 1F0R and 1MVF) and IDPs (α-synuclein) are plotted in Fig. 4 (also see Supplementary Figs S6 and S7 online). These figures clearly demonstrate that the amino acid triads (LLK, IVE and NKL) in globular proteins span approximately half of the ϕ - ψ region as compared to the non-redundant data set of proteins.
This may be due to the fact that triads in globular proteins exhibit a propensity for either helix or sheet structures and hence span limited regions of the Ramachandran space. In contrast, the amino acid triads in IDPRs (LAE, AEL and TLA) and IDPs (AAA, AVA and AEK) populate the entire range spanned by the non-redundant database of proteins. Since disordered proteins lack specific secondary structures, triads in IDPRs and IDPs assume all possible ϕ - ψ angles in the allowed dihedral angle space. The triads in IDPRs and IDPs thus obey Flory's isolated-pair hypothesis [57], which states that the ϕ - ψ angles for a given pair of residues are independent of those of the adjoining pair of residues (except for proline and residues preceding proline) in a protein. Within this approximation, we propose that the conformational entropy of a disordered protein sequence can be expressed as the sum of the conformational entropies of the corresponding triads present in the non-redundant database.
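The additive estimate proposed above can be sketched as a lookup and a sum. The function name, the use of overlapping triads (one per middle residue) and the entropy values in the usage note are illustrative assumptions; a real table would hold the triad entropies computed from the non-redundant database.

```python
def predicted_entropy(sequence, triad_entropy):
    """Sum reference-database triad entropies over all overlapping triads.

    sequence: one-letter amino acid string.
    triad_entropy: dict mapping each triad to its entropy in the reference
    database (hypothetical lookup table). A triad missing from the table
    raises KeyError rather than being silently ignored.
    """
    return sum(triad_entropy[sequence[i:i + 3]]
               for i in range(len(sequence) - 2))
```

For example, with a made-up table `{"AAA": 2.0, "AAV": 2.5, "AVA": 3.0}`, the sequence `"AAAVA"` contains the triads AAA, AAV and AVA, so the estimate is their sum.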

Figure 4

ϕ - ψ angles of the triad IVE in globular (1A4V), AEL in IDPRs (1CD3, 1MVF and 1F0R) and AVA in IDP (α-synuclein).

The ϕ - ψ angles are extracted from the MD simulation generated conformational ensemble and non-redundant database.

The proposed sum over triads may be written as

$$S(i) = \sum_{j=1}^{t} S_j^{\mathrm{NR}} \qquad (9)$$

where S(i) represents the conformational entropy of the i-th disordered protein, t is the total number of triads in the i-th protein sequence and $S_j^{\mathrm{NR}}$ denotes the entropy of the j-th triad in the non-redundant database. Similarly, for a sequence of a globular protein, the average conformational entropy may be estimated using eq 10. The mutual information expansion method is used to verify the validity of the results obtained from our entropy prediction method. The systematic expansion of the entropy in mutual-information terms up to second order may be written as [12]

$$S \approx \sum_{i} S_1(x_i) - \sum_{i<j} I_2(x_i, x_j)$$

where

$$I_2(x_i, x_j) = S_1(x_i) + S_1(x_j) - S_2(x_i, x_j)$$

Here, $I_2(x_i, x_j)$ denotes the mutual information, a measure of the correlation between two degrees of freedom (dihedral angles in this case). $S_1(x_i)$ and $S_1(x_j)$ represent the uncorrelated Shannon entropies of the dihedral angles $x_i$ and $x_j$ respectively, and $S_2(x_i, x_j)$ is the correlated entropy for the pair of dihedrals $x_i$ and $x_j$. For 1A4V, the correlation of the i-th dihedral with the (i + 1)-th, (i + 2)-th and (i + 3)-th dihedrals is found to be 0.07, 0.016 and 0.012 per dihedral pair, respectively, for grids with a height and width of 30°. The correlation between two dihedrals is found to decrease with increasing distance between the dihedrals along the sequence. Thus, the correlation of each dihedral angle is calculated only up to the nearest neighbor (i.e. for the i-th dihedral, correlation is considered up to the (i + 1)-th dihedral). The conformational entropy predicted using our method (eqns 9 and 10) is compared (see Fig. 5) with the values calculated from the mutual information expansion method and from the simple histogram method, both of which use the MD generated conformational ensemble of each protein. Conformational entropy is calculated for 1A4V, α-synuclein and the disordered regions of 1F0R, 1CD3 and 1MVF. Figure 5 shows that the entropy values calculated using our method agree with those calculated using the mutual information expansion (a maximum difference of 5.7%) and the histogram binning method (a maximum difference of 3.3%).
Hence, our sequence-based method provides fairly accurate estimation of the conformational entropy using the sequence information only, with the nearest neighbor pair correlation.
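The histogram and second-order mutual information estimates described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of natural logarithms (entropy in nats) and the input layout (one column per dihedral, angles in degrees within [-180, 180]) are assumptions; only the 30° bin width and the restriction to nearest-neighbor pairs are taken from the text.

```python
import numpy as np

def histogram_entropy(angles, bin_width=30.0):
    """Shannon entropy (nats) of one dihedral from a 1-D histogram."""
    edges = np.arange(-180.0, 180.0 + bin_width, bin_width)
    counts, _ = np.histogram(angles, bins=edges)
    p = counts[counts > 0] / counts.sum()  # drop empty bins before taking logs
    return float(-np.sum(p * np.log(p)))

def joint_entropy(a, b, bin_width=30.0):
    """Shannon entropy (nats) of a dihedral pair from a 2-D histogram."""
    edges = np.arange(-180.0, 180.0 + bin_width, bin_width)
    counts, _, _ = np.histogram2d(a, b, bins=[edges, edges])
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log(p)))

def mie_entropy(dihedrals, bin_width=30.0):
    """Second-order expansion restricted to nearest-neighbor pairs.

    dihedrals: array of shape (n_frames, n_dihedrals), in degrees.
    Returns sum_i S1(x_i) - sum_i I(x_i, x_{i+1}).
    """
    dihedrals = np.asarray(dihedrals, dtype=float)
    n = dihedrals.shape[1]
    s1 = [histogram_entropy(dihedrals[:, k], bin_width) for k in range(n)]
    total = sum(s1)
    for k in range(n - 1):
        s_pair = joint_entropy(dihedrals[:, k], dihedrals[:, k + 1], bin_width)
        total -= s1[k] + s1[k + 1] - s_pair  # subtract I(x_k, x_{k+1})
    return total
```

For two perfectly correlated dihedrals the pair term removes the double counting, so `mie_entropy` returns the entropy of a single dihedral; for independent dihedrals the mutual information tends to zero with adequate sampling and the result approaches the plain sum of the one-dimensional entropies.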

Figure 5

Conformational entropy calculated using mutual information expansion, histogram and our method (using eqns 9 and 10).

MC-generated conformations for IDPRs and IDPs span a wider range of ϕ - ψ angles than the non-redundant database. This is due to the choice of the coarse-grained potential for the MC simulations, which imposes fewer restrictions on the accessible conformations. Thus, this study proposes a method to predict disorder in proteins from the relative composition of the amino acid triads, together with a measure of the average conformational entropy for that sequence, since it may not be feasible to determine the conformational entropy of a large number of protein sequences, especially completely disordered proteins, by simulation.

The lower conformational entropy of globular proteins may be attributed to their having the highest number of favorable two-body contacts, which are the primary stabilizing factor for well-defined structures. The two-body contacts are calculated for the non-redundant database and for the conformational ensembles generated from MD and MC simulations (1A4V, 1CD3, 1F0R, 1MVF and α-synuclein) with a distance cutoff of 8 Å. The distribution of two-body contacts for all data sets of proteins is plotted in Supplementary Fig. S10 online. The non-redundant database, comprising well-structured globular proteins, shows the highest number of contacts in the native state. Both IDPRs and IDPs show fewer two-body contacts on average in the conformational ensemble generated by MD simulation. For the MC-generated conformational ensembles of 1CD3, 1F0R, 1MVF and α-synuclein, the two-body contacts show a broad distribution, with peaks matching those of the respective MD ensembles. The broad distribution of the MC-generated ensemble is due to the use of the coarse-grained model, which imposes fewer restrictions.
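A two-body contact count with the 8 Å cutoff can be sketched as below. The text does not state which atoms define a contact or whether sequence neighbors are excluded, so this sketch assumes one representative coordinate per residue (for example Cα) and a hypothetical `min_seq_sep` parameter that skips residues closer than two positions along the chain; both are illustrative choices.

```python
import numpy as np

def count_two_body_contacts(coords, cutoff=8.0, min_seq_sep=2):
    """Count residue pairs (i, j) with j - i >= min_seq_sep within `cutoff` angstroms.

    coords: array-like of shape (n_residues, 3), one point per residue
    (assumed here to be C-alpha positions).
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # full pairwise distance matrix
    i, j = np.triu_indices(len(coords), k=min_seq_sep)  # each pair counted once
    return int(np.count_nonzero(dist[i, j] <= cutoff))
```

Applied to every frame of an ensemble, the resulting counts give a contact distribution of the kind compared across the MD, MC and database sets.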

Conclusions

This work quantitatively characterizes the structural order/disorder in proteins in terms of the backbone conformational entropy, the number of two-body contacts and the relative composition of amino acids compared to those of globular proteins. Three groups, comprising globular, Group I and Group II proteins, are compiled from the PDB and the DisProt database for sequence analysis. MD and MC simulations are used to characterize the conformational ensemble of the representative proteins of each group. Sequence analysis of these groups reveals substantial overlap between the globular and completely disordered proteins in the normalized relative composition of individual amino acids. However, the normalized relative composition calculated using amino acid triads shows a distinct boundary separating globular and disordered proteins. Thus, the structural order/disorder in a protein may be accurately predicted from the sequence information alone.

Analysis of the conformational entropy of the triads is important: a change in the middle amino acid of a triad, or in the flanking residues of a specific amino acid, is observed to affect the conformational stability and flexibility of the triad, which in turn may affect the conformational entropy of the protein. The conformational entropy evaluated from the heterogeneous conformational ensemble of the representative proteins from each group reveals that in globular proteins the amino acid triads sample about half of all possible conformational states, while those in a disordered protein follow Flory's isolated-pair hypothesis and sample the entire range of allowed ϕ - ψ angles. Thus, the conformational entropy of a disordered protein sequence may be expressed as the sum of the conformational entropies of the corresponding triads in the non-redundant database.

The conformational sampling and convergence of MD simulations for IDPs/IDPRs are inconclusive. Even microsecond-long MD simulations sometimes do not exhibit complete convergence58. However, it is observed that a 100 ns long MD simulation may capture the ϕ - ψ distribution effectively54,55. The predicted values of the conformational entropy for globular and disordered proteins agree well with those calculated using the mutual information expansion and the histogram method. Thus, our sequence-based method may estimate the conformational entropy of proteins up to nearest-neighbor pair correlation without the need for computationally expensive and time-consuming simulations. The higher conformational entropy of the intrinsically disordered proteins is also reflected in their smaller number of two-body contacts, which are the primary stabilizing factor for well-structured globular proteins. Thus, protein disorder may be predicted from the relative sequence composition of amino acid triads alone, while the backbone conformational entropy provides an appropriate measure of this structural disorder.
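The statement that the entropy of a disordered sequence is the sum of its triad entropies can be illustrated with a short sketch. It assumes overlapping triads (one per interior residue) and a hypothetical lookup table `triad_entropy` mapping each three-letter triad to its entropy in the non-redundant database; the numbers in the usage note are placeholders, not values from the paper.

```python
def triads(sequence):
    """Overlapping amino acid triads of a one-letter sequence (assumed windowing)."""
    return [sequence[k:k + 3] for k in range(len(sequence) - 2)]

def disordered_entropy(sequence, triad_entropy):
    """Sum of per-triad entropies for a disordered sequence.

    triad_entropy: dict mapping a triad such as "MDV" to its entropy
    (hypothetical table; it would have to be built from database dihedrals).
    """
    seq_triads = triads(sequence)
    missing = [t for t in seq_triads if t not in triad_entropy]
    if missing:
        raise KeyError(f"no entropy available for triads: {missing}")
    return sum(triad_entropy[t] for t in seq_triads)
```

With the placeholder table `{"MDV": 1.0, "DVF": 2.0, "VFM": 0.5}`, the call `disordered_entropy("MDVFM", table)` adds the three triad entries. Raising on an unknown triad, rather than silently skipping it, keeps an incomplete table from producing an underestimated sum.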

Additional Information

How to cite this article: Baruah, A. et al. Conformational Entropy of Intrinsically Disordered Proteins from Amino Acid Triads. Sci. Rep. 5, 11740; doi: 10.1038/srep11740 (2015).

