Long-range correlation in protein dynamics: Confirmation by structural data and normal mode analysis.

Abstract

Proteins in cellular environments are highly susceptible. Local perturbations to any residue can be sensed by other spatially distal residues in the protein molecule, showing long-range correlations in the native dynamics of proteins. The long-range correlations of proteins contribute to many biological processes such as allostery, catalysis, and transportation. Revealing the structural origin of such long-range correlations is of great significance in understanding the design principle of biologically functional proteins. In this work, based on a large set of globular proteins determined by X-ray crystallography, by conducting normal mode analysis with the elastic network models, we demonstrate that such long-range correlations are encoded in the native topology of the proteins. To understand how native topology defines the structure and the dynamics of the proteins, we conduct scaling analysis on the size dependence of the slowest vibration mode, average path length, and modularity. Our results quantitatively describe how native proteins balance between order and disorder, showing both dense packing and fractal topology. It is suggested that the balance between stability and flexibility acts as an evolutionary constraint for proteins at different sizes. Overall, our result not only gives a new perspective bridging the protein structure and its dynamics but also reveals a universal principle in the evolution of proteins at all different sizes.

Year: 2020 PMID: 32053592 PMCID: PMC7043781 DOI: 10.1371/journal.pcbi.1007670

Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475

Introduction

Proteins, including the globular, fibrous, membrane and intrinsically disordered proteins, are responsible for diverse functions in almost every process of cellular life. Globular proteins, as the majority type of the proteins in nature, can fold from disordered peptide chains into specific three-dimensional (3D) structures on minimal-frustrated energy landscape [1-4]. Such kind of 3D structures, which are encoded by the amino acid sequences, are known as native states. It is worth noting that the native state of a protein is not static, but exhibits dynamical fluctuations around the energy minimum. Experiments and molecular simulations have shown that thermal fluctuations trigger the motions of proteins such as domain movements and allosteric transitions, which enable the biological functions of proteins such as catalysis [5], ligand binding [6, 7], biomolecular recognition [8], and transportation [9]. Uncovering the relations between the structure and the function of proteins is a fundamental question in molecular biophysics. To answer it, the fluctuations at the native states may provide a key. One of the most fascinating properties of proteins is the long-range correlated fluctuations around the native states [10-12]. Thanks to the long-range correlations, local perturbations to any residue can be sensed by every other residue of the entire protein, even when the two sites are spatially distant. Such a property plays an important role in the functionality of the proteins. For example, for allosteric proteins, long-range correlations warrant the binding at one site can be transmitted to other functional sites [13, 14], and enable the high susceptibility for proteins in cellular environments. Based on the correlation analysis of structural ensembles determined by solution nuclear magnetic resonance (NMR), it was already demonstrated that the native proteins exhibit long-range correlations and high susceptibility in the native dynamics [15]. 
Such a phenomenon is also in line with other theoretical and experimental results, for example, the long-range conformational forces related to the hydrophobicity scales of the proteins [16-20], the fractal dimension in the oscillation spectrum [21] and configuration space [22], the slow relaxation of protein molecules in the solution [23, 24], the volume fluctuation of allosteric proteins [25], and the overlap between the low-frequency collective oscillation modes and large-scale conformational changes in allosteric transitions [26-30]. Accumulating evidence indicates that native proteins are not only stable enough to warrant structural robustness, but also susceptible enough to sense the signals in the milieu, and ready to perform large-scale conformational changes. However, the origin of such kind of dynamics is still unclear. In the present paper, we concentrate on the structure and the equilibrium fluctuation dynamics of a large set of globular proteins determined by X-ray crystallography, ranging from a single hairpin structure to large protein assemblies. Firstly, to elucidate the connection between the long-range correlations and protein structures, we conduct correlation analysis based on the elastic network models (ENMs) [26-30]. We find that the long-range correlations and the scaling laws can be robustly reproduced by the ENMs with different model parameters. Such a result indicates that the long-range correlations are encoded in the native topology of the proteins. Secondly, we conduct normal mode analysis [31-33] for protein molecules, ideal polymer chains, and lattice systems. A similar scaling relation holds for polymers, lattices, and proteins, but the scaling coefficients are different. Such a result shows how native proteins balance between order and disorder, which resemble the physical systems near the critical point of a phase transition. 
Thirdly, we introduce the average path length and modularity to describe the topological characteristics of the proteins. Scaling relations are also observed between these topological descriptors and the size of the proteins. According to the result of the scaling analysis, we conclude that native proteins show both dense packing and fractal topology. Lastly, we focus on the size dependence of proteins’ shape. With a given chain length, the shape of a protein is not random, but a most-probable shape factor always exists. Such a constraint suggests that native proteins balance between stability and functionality. Overall, our result not only gives a new perspective bridging the protein structure and its dynamics but also reveals a universal principle in the evolution of proteins at all different sizes.

Results

The critical dynamics of proteins are robustly encoded in the native structures

In previous studies, based on structural ensembles determined by solution nuclear magnetic resonance (NMR), it was observed that native proteins in solution exhibit long-range correlations and high susceptibility in their dynamics [15]. The native fluctuations of proteins behave as though the proteins are near the critical point of a phase transition [34-36]. The question arises whether the critical dynamics of native proteins are encoded in the native structure or driven by other factors in the milieu. To answer this question, we employ a minimal model of proteins, the elastic network model (ENM), to conduct our analysis. In an ENM, a protein molecule is described as a set of nodes (the residues, represented by their Cα atoms) connected by edges that act as elastic springs. As shown in Fig 1A, the 3D structure of a protein can be simplified as a network based on the topology of residue contacts. Note that the elastic networks are constructed based only on the spatial distances between residues. If an ENM can successfully reproduce the long-range correlations in the fluctuations of native proteins, then it can be concluded that the critical dynamics of proteins are encoded by the local contacts in the native structures.

Fig 1

The critical dynamics of proteins are robustly encoded in the native structure.

(A) An illustration of the elastic network model (cutoff distance $r_c$ = 9Å) of the protein CI2 (PDB code: 2CI2). The beads denote the residues, and the bonds denote the elastic springs in the model. (B) The correlation functions ϕ(r) for proteins of different sizes predicted by GNM with cutoff distance $r_c$ = 9Å. (C) Correlation functions with the distance scaled by the radius of gyration R of the proteins. (D) The correlation functions ϕ(r) predicted by GNM with different cutoff distances $r_c$ for proteins of similar sizes (19.5Å ≤ R < 20.5Å). (E) With different cutoff distances, for proteins of different sizes, the correlation length ξ is always proportional to the size of the protein R. (F) The susceptibility χ vs. chain length N shows the power-law relation $\chi \sim N^{\alpha\gamma/\nu}$, and the scaling coefficient αγ/ν ≈ 1 is maintained for different $r_c$ (inset).

The correlated motions of residues can be represented by a covariance matrix C, in which the matrix element is $C_{ij} = \langle \Delta\mathbf{r}_i \cdot \Delta\mathbf{r}_j \rangle$, where $\Delta\mathbf{r}_i$ denotes the displacement of residue i from its equilibrium position. For simplification, we conduct our analysis based on the Gaussian network model (GNM) [37, 38]. In GNM, the covariance matrix C is proportional to the pseudoinverse of the Kirchhoff matrix Γ, i.e., $C_{ij} \propto (\Gamma^{+})_{ij}$, where $\Gamma^{+}$ denotes the pseudoinverse [26, 37]. Normalizing the covariance matrix, a pairwise cross correlation $\tilde{C}_{ij} = C_{ij} / \sqrt{C_{ii} C_{jj}}$ can be obtained. Similar to previous works [15, 39, 40], a distance-dependent correlation function ϕ(r) can be defined by averaging the correlations over residue pairs at mutual distance r:

$$\phi(r) = \frac{\sum_{i \neq j} \tilde{C}_{ij}\, \delta(r - r_{ij})}{\sum_{i \neq j} \delta(r - r_{ij})},$$

where $r_{ij}$ denotes the spatial distance between residues i and j, and δ(x) is the Dirac delta function selecting residue pairs at mutual distance r. Here, the correlation length ξ is defined as the distance where ϕ(r) first decays to zero, i.e., ϕ(ξ) = 0. To examine whether the correlation scales with the protein size, we sample proteins across different sizes. By averaging the distance-dependent correlation function ϕ(r) over a subset of proteins, we can define the averaged correlation function 〈ϕ(r)〉 for a group of proteins. 
Here, we divide the dataset into subsets according to the radius of gyration R of the proteins (e.g., subset {R ∼ 12Å} contains proteins of size 11.5Å ≤ R < 12.5Å), and calculate the distance-dependent correlation functions ϕ(r) for proteins of different sizes. As shown in Fig 1B, the correlation function first decreases from its maximum at short distances, crosses zero at r = ξ, continues to decline, and reaches a negative minimum. As a notable sign of criticality, for proteins of different sizes, the correlation length ξ is proportional to their radius of gyration R. Therefore, when the distance is scaled by the size (R) of the proteins, all the correlation functions collapse onto a single curve (Fig 1C). This result indicates that correlations in the native fluctuations of proteins are scale-free: no matter how large the protein molecule is, the correlation length can extend to the size of the entire system. Such long-range correlation contributes to the functionality of a large variety of proteins. For example, for allosteric proteins, the long-range correlation ensures that binding at one site can be transmitted to other functional sites [13, 14], even when the two sites are spatially distant. To validate the previous analysis, let us consider the parameter sensitivity in the prediction of the cross correlations in protein dynamics. The only free parameter in GNM is the cutoff distance $r_c$. With different $r_c$, the correlation has a different magnitude at short distances; however, as shown in Fig 1D, the correlation length ξ remains constant for different cutoff distances $r_c$. As shown in Fig 1E, for cutoff distances ranging from 6 Å to 15 Å, the correlation length ξ is always proportional to the radius of gyration R, showing that the critical dynamics of native proteins are a stable property, insensitive to the selection of the cutoff distance. 
With only short-range interactions between residues taken into account, GNM can successfully capture the long-range correlations in the native dynamics of the proteins. To investigate the criticality further, it is necessary to validate the scaling relations in the dynamics of proteins. Here, for illustration, we take the power-law relation between the susceptibility χ and the chain length N as an example. For protein systems, a finite-size version of the susceptibility χ is introduced to quantify the response of the system under perturbation [15]. It is defined as the total correlation in a unit volume within the correlation length:

$$\chi = \frac{s}{N} \sum_{i \neq j} \tilde{C}_{ij}\, \theta(\xi - r_{ij}),$$

where $\tilde{C}_{ij}$ is the normalized cross correlation between residues i and j, $r_{ij}$ is their spatial distance, s denotes the shape factor of the protein, and θ(x) denotes the Heaviside function. Previously, based on NMR-determined protein ensembles [15], it was observed that $\chi \sim N^{\alpha\gamma/\nu}$, with the scaling coefficient αγ/ν ≈ 1 (definitions of α, γ and ν are listed in S1 Appendix). Here, as shown in Fig 1F, by employing the GNM, similar scaling relations can also be observed. Such a result demonstrates that, no matter how large the molecule is, proteins can maintain high sensitivity in executing their functions, because the magnitude of the susceptibility grows with the chain length of the proteins. Besides, the scaling coefficients are insensitive to changes in the cutoff distance (inset), demonstrating that the scale-free correlation of native proteins is a robust property. Our correlation analysis and scaling analysis methods can also be extended to other versions of elastic network models. For example, with the harmonic Cα potential model (HCA) [41, 42], similar scaling coefficients can also be observed (see S1 Appendix). However, some models cannot correctly reproduce the scaling relation between χ and N, for instance, the parameter-free GNM (pfGNM) [43]. In fact, pfGNM fails to predict all the scaling relations in the proteins (see S1 Appendix). 
Previous research has found that pfGNM is applicable only to proteins in crystalline conditions, and that it agrees poorly with the collective motions given by molecular dynamics [42]. Such a result indicates that the scaling coefficient may help us to probe whether the protein is solvated or in a crystalline condition.
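The correlation analysis described in this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the 1 Å bin width, the linear interpolation used to locate ξ, and the toy helical coordinates (which do not correspond to any real protein) are all introduced here for illustration. The covariance is taken as the pseudoinverse of the Kirchhoff matrix up to a constant prefactor, which cancels in the normalized correlation.

```python
import numpy as np

def kirchhoff(coords, cutoff):
    """Kirchhoff matrix of the contact network, plus the pairwise distances."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    contact = (dist <= cutoff) & ~np.eye(len(coords), dtype=bool)
    gamma = -contact.astype(float)
    np.fill_diagonal(gamma, contact.sum(axis=1))  # diagonal = node degree
    return gamma, dist

def cross_correlation(gamma):
    """Normalized cross correlation from the pseudoinverse of gamma."""
    cov = np.linalg.pinv(gamma, rcond=1e-10, hermitian=True)
    sd = np.sqrt(np.diag(cov))
    return cov / np.outer(sd, sd)

def correlation_function(corr, dist, bin_width=1.0):
    """Average correlation over residue pairs binned by mutual distance."""
    iu = np.triu_indices_from(corr, k=1)
    r, c = dist[iu], corr[iu]
    bins = (r // bin_width).astype(int)
    sums = np.bincount(bins, weights=c)
    counts = np.bincount(bins)
    keep = counts > 0
    centers = (np.arange(len(counts)) + 0.5) * bin_width
    return centers[keep], sums[keep] / counts[keep]

def correlation_length(centers, phi):
    """Distance at which phi first reaches zero (linear interpolation)."""
    below = np.nonzero(phi <= 0)[0]
    if len(below) == 0 or below[0] == 0:
        return None
    k = below[0]
    x0, x1, y0, y1 = centers[k - 1], centers[k], phi[k - 1], phi[k]
    return x0 + (x1 - x0) * y0 / (y0 - y1)

# Toy input (hypothetical): 60 beads on a helix, neighbors about 5 Å apart.
idx = np.arange(60)
coords = np.column_stack([5.0 * np.cos(idx), 5.0 * np.sin(idx), 1.5 * idx])

gamma, dist = kirchhoff(coords, cutoff=9.0)
corr = cross_correlation(gamma)
centers, phi = correlation_function(corr, dist)
xi = correlation_length(centers, phi)
```

For a real structure, `coords` would hold the Cα coordinates read from a PDB file, and the analysis would be repeated over subsets of proteins grouped by radius of gyration.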

The size dependence of slowest modes reveals criticality of native proteins

Normal mode analysis is a practical tool to elucidate the global dynamics [31-33] and the evolutionary constraints [44, 45] of proteins. Physically, the slow modes, i.e., the low-frequency modes of a system, are related to motions with low excitation energy, long wavelengths (long-range correlation), long time scales (on the order of microseconds to seconds), and large amplitudes. Usually, the motions that correspond to the slow modes (especially the slowest nonzero mode) overlap significantly with the large displacements during functional motions [46]. These functional motions usually involve relative movements of large subunits in the proteins or cooperative conformational changes of the whole proteins. Previously, the unique spectral properties of the residue contact networks have been noticed [47, 48], but the detailed differences have never been examined. To demonstrate the particularity of the spectrum of proteins, we compare proteins with ideal polymer chains (detailed information is listed in S1 Appendix) and lattice systems. Our analysis focuses on the size dependence of the slow modes. As shown in Fig 2A, for all these systems, the eigenvalues of the slowest few modes follow power laws in the system size N. Among these slow modes, we specifically focus on the eigenvalue λ1, which corresponds to the slowest nonzero mode. A similar power law $\lambda_1 \sim N^{-\zeta}$ holds for ideal polymers, lattices, and proteins. However, the scaling coefficients ζ are different in these systems. As shown in Fig 2A, for ideal polymer chains, the scaling coefficient ζ ≈ 1.674. For the face-centered cubic (fcc) lattice, by conducting normal mode analysis (where atoms are connected by springs with their nearest neighbors and 2nd nearest neighbors), we have ζ ≈ 0.727. Theoretically, for lattice systems, the maximum wavelength l corresponds to the slowest elastic mode, and l is proportional to the characteristic length of the system. 
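Since the maximum wavelength scales as $l \sim N^{1/3}$ and the eigenvalue of the slowest elastic mode scales as $\lambda_1 \sim l^{-2}$, one can estimate that $\lambda_1 \sim N^{-2/3}$, i.e., ζ = 2/3 ≈ 0.667, which is close to 0.727. In contrast to ideal polymers and lattices, ζ ≈ 1 holds for protein molecules.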
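The lattice estimate above can be checked numerically with a small sketch. As a simplifying assumption, the example uses a simple cubic lattice with nearest-neighbor springs and free boundaries rather than the fcc lattice with second neighbors analyzed in the text, so the fitted exponent is only expected to be near 2/3, not to reproduce 0.727. The helper names and the range of lattice sizes are illustrative choices.

```python
import numpy as np

def path_laplacian(n):
    """Graph Laplacian of a chain of n nodes with free ends."""
    lap = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    lap[0, 0] = lap[-1, -1] = 1.0
    return lap

def cubic_lattice_laplacian(side):
    """Laplacian of a side x side x side simple cubic lattice.

    The lattice is the Cartesian product of three chains, so its
    Laplacian is a Kronecker sum of the chain Laplacian.
    """
    one, eye = path_laplacian(side), np.eye(side)
    return (np.kron(np.kron(one, eye), eye)
            + np.kron(np.kron(eye, one), eye)
            + np.kron(np.kron(eye, eye), one))

def slowest_nonzero_eigenvalue(lap, tol=1e-8):
    """Smallest eigenvalue above the numerical zero mode."""
    vals = np.linalg.eigvalsh(lap)  # sorted in ascending order
    return vals[vals > tol][0]

sides = np.arange(3, 9)
sizes = sides ** 3  # number of nodes N
lam1 = np.array([slowest_nonzero_eigenvalue(cubic_lattice_laplacian(s))
                 for s in sides])

# Power-law fit lambda_1 ~ N^(-zeta) by linear regression in log-log space.
slope, intercept = np.polyfit(np.log(sizes), np.log(lam1), 1)
zeta = -slope
print(f"fitted zeta = {zeta:.3f}")
```

For these small lattices the fitted exponent sits slightly below 2/3 because of finite-size effects; the same log-log fit applies unchanged to the λ1 values of proteins or polymer chains.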

Fig 2

The slow modes of proteins are robustly defined by native structure.

(A) The 1st, 2nd and 3rd non-zero eigenvalues λ1, λ2, and λ3 vs. the chain length N of the proteins follow power laws. (Cutoff distance $r_c$ = 9Å; the scaling coefficients of λ1(N), λ2(N), and λ3(N) are 1.074, 0.900, and 0.868, respectively.) For comparison, similar scaling relations in lattices and ideal polymer chains are also illustrated, with scaling coefficients 0.728 (lattices) and 1.674 (polymer). (B) The eigenvalue of the slowest nonzero mode λ1 versus chain length N shows the scaling relation $\lambda_1 \sim N^{-\zeta}$; the inset shows the scaling coefficient ζ vs. the cutoff distance $r_c$. (C) The histogram of the eigenvalue distribution g(λ) for proteins of similar sizes (chain length 180 ≤ N < 220).

The scaling relations in the slowest modes of proteins are robust to variation in the model parameters. As shown in Fig 2B, the selection of the cutoff distance $r_c$ does not affect the scaling coefficient ζ. However, the robustness of the scaling coefficient cannot be attributed to that of the eigenvalue distribution. As shown in Fig 2C, selecting a different $r_c$ changes the mode distribution g(λ) of native proteins. The mode distribution g(λ), especially its low-frequency part, can be enhanced by selecting a short cutoff distance $r_c$. Such a result is also consistent with a previous theoretical analysis of protein elastic networks and ranges of cooperativity [43], which states that with a shorter interaction range, the predicted dynamics are more cooperative and overlap better with the displacements in large-scale conformational changes. It is worth noting that the scaling coefficients in the size dependence of the slowest mode demonstrate that the structure of proteins stands between lattices and ideal polymer chains. For proteins, the exponent ζ ≈ 1 lies above the value obtained from lattices (ζ ≈ 0.727) and below the value obtained from polymer chains (ζ ≈ 1.674). 
Thus, compared with ideal polymer chains, the proteins have higher structural stability, whereas compared with lattices, the proteins have higher flexibility and exhibit slower vibrations. Native proteins stand between lattices and polymers, acting as the “critical point” that separates the ordered and disordered phase. Not only are native proteins stable enough to ensure structural robustness and functional specificity, but also susceptible enough to sense the signals in the environment, and ready to perform large-scale conformational changes. Interestingly, staying at the critical point seems to be a common organizing principle of a large variety of biological systems [49-55]: If the system is too disordered, the system cannot stably exist; if it is too ordered, it cannot adapt or respond to perturbations from the environments. Our result of scaling analysis provides additional evidence to support the criticality hypothesis.

Protein structure: Dense packing with fractal topology

In previous sections, we demonstrated that the critical dynamics of proteins are encoded in their native structures, and we showed that the equilibrium dynamics of protein molecules are different from those of lattices and polymers. How does the topology of the residue contact network encode such dynamics? To answer the question, in this subsection, we bridge the vibration spectrum with the architecture of the protein by focusing on the network topology. In network analysis, the average path length 〈l〉 is one of the most important topological descriptors, quantifying the overall connectivity among the nodes. Here, we first focus on the scaling relation between the average path length 〈l〉 and the system size N. As shown in Fig 3A, for proteins of different sizes, there is a power-law relation between the average path length 〈l〉 and the chain length N: $\langle l \rangle \sim N^{\alpha}$, with α ≈ 0.338, which is close to 1/3. In the calculation, the cutoff distance $r_c$ is set to 8Å. Although different cutoff distances $r_c$ lead to different 〈l〉, the scaling exponent is invariant (see S1 Appendix). The scaling relation in proteins is very similar to that in lattice structures. Theoretically, for 3D lattices, the exponent would be α = 1/3. Such a scaling relation is confirmed in Fig 3A. For ideal polymer chains, by contrast, the extended structure leads to longer average path lengths, and fitting gives α ≈ 0.675. Such a result demonstrates that the residue contact networks show a dense packing property similar to that of regular lattices. Both lattice and protein networks have much shorter path lengths 〈l〉 than ideal polymers.
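As an illustration of the descriptor used above, the average path length of a residue contact network can be computed with a breadth-first search from every node. The sketch below is an assumption-laden example rather than the authors' implementation: the function names are hypothetical, the toy coordinates are a straight four-bead chain, and pairs in disconnected components are simply ignored.

```python
from collections import deque

def contact_graph(coords, cutoff):
    """Adjacency lists: two nodes are linked if within the cutoff distance."""
    n = len(coords)
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            if d2 <= cutoff ** 2:
                adj[i].append(j)
                adj[j].append(i)
    return adj

def average_path_length(adj):
    """Mean shortest-path length (in edges) over all reachable node pairs."""
    total, pairs = 0, 0
    for source in range(len(adj)):
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from the source node
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    if pairs == 0:
        raise ValueError("graph has no connected node pairs")
    return total / pairs

# Toy input (hypothetical): four beads in a line, 3.8 Å apart.
chain = [(3.8 * i, 0.0, 0.0) for i in range(4)]
avg_len = average_path_length(contact_graph(chain, cutoff=4.0))
print(avg_len)
```

For a linear chain of n nodes linked only to their immediate neighbors, the result equals (n + 1)/3, which gives a quick sanity check before applying the function to real contact networks.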

Fig 3

The protein dynamics can be quantified by topological descriptors of the residue contact network.

(A) The average path length 〈l〉 vs. system size N for the contact networks of proteins ($r_c$ = 8Å), fcc lattices and ideal polymers. (B) Similarly, the modularity Q vs. system size N for proteins, fcc lattices and ideal polymers. The inset shows the log-log plot of 1 − Q vs. N. (C) For proteins of similar sizes (180 ≤ N < 220), the scatter plot (yellow dots, each dot represents a protein molecule), the binned average (red dots) and the basic trend (red curve) of the average path length 〈l〉 vs. Q, and (D) the smallest non-zero eigenvalue λ1 vs. Q.

Although proteins and lattices share similar dense packing properties, the residue contact networks of proteins still exhibit unique properties. To demonstrate the difference between the residue contact networks and the lattice networks, another measure, the modularity Q, is introduced into the study [56, 57]. Intuitively, a network that can be more easily divided into modules has a higher Q value. The modularity Q also scales with the system size. For a d-dimensional cubic lattice network with N nodes, it was proved theoretically that the modularity Q versus N follows the relation $Q = 1 - K \cdot N^{-\eta}$, where the scaling coefficient is $\eta = 1/(d+1)$, and K is a constant that depends on the average degree z and the dimension d [58]. Inverting this relation, a fitted exponent η corresponds to an effective dimension $d_{\mathrm{eff}} = 1/\eta - 1$. For ideal polymer chains, the fitting gives η ≈ 0.465, indicating an effective fractal dimension $d_{\mathrm{eff}}$ ≈ 1.15, which is much lower than 3. For a 3D cubic lattice, theoretically, η = 1/4. For fcc lattices, as shown in Fig 3B, fitting gives η ≈ 0.231 < 1/4, indicating $d_{\mathrm{eff}}$ ≈ 3.33 > 3; this is because, in the fcc lattice, every atom has more neighbors than in the cubic lattice. For the proteins in our dataset, when taking $r_c$ = 8Å, a similar power law can also be observed, but the scaling coefficient is η = 0.279 > 1/4. Such an exponent indicates that the proteins have an effective dimension $d_{\mathrm{eff}} = 1/\eta - 1 \approx 2.58$, which is lower than 3. Such a scaling coefficient shows that the residue contact networks have a fractal topology, with a fractal dimension below 3. 
It is worth noting that, in this work, the fractal dimension of proteins is obtained by scaling analysis over proteins of different sizes. The effective dimension obtained here is consistent with the fractal dimension (d ≈ 2.7) of proteins determined by structural analysis methods (see S1 Appendix). The scaling analysis of the average path length reveals that proteins have dense packing properties similar to those of ordered lattices, but the scaling analysis of the modularity suggests that proteins exhibit fractal structures, similar to disordered polymer structures. In short, the topological analysis demonstrates again that native proteins balance between order and disorder. In the discussion above, by averaging the topological descriptors of proteins of similar sizes, we analyzed the size dependence of the topological properties. In fact, for proteins of similar sizes, topological descriptors can also play an important role in capturing the main features of the protein dynamics. To illustrate this, we select the protein molecules with chain length 180 ≤ N < 220 from our dataset. Although these proteins have similar chain lengths, their structures may differ considerably. Our discussion centers on the modularity Q. When the modularity Q of a protein increases, as shown in Fig 3C, the average path length 〈l〉 also increases. This is because, in a highly modularized network, there are few connections between different communities, so on average it takes more steps to get from one node to another. As shown in Fig 3D, as the modularity Q increases, the smallest non-zero eigenvalue λ1 decreases, in line with the common knowledge that modularized structures in proteins contribute to slow-mode motions. Such a result is consistent with spectral graph theory. Indeed, the spectrum of the graph Laplacian is closely related to the community structure of the network [59]. 
Our analysis quantitatively demonstrates that modularized structures contribute to the large-scale motions and slow relaxations of the proteins.
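To make the modularity descriptor concrete, the sketch below evaluates the Newman modularity of a network for a given partition into communities, together with the effective dimension implied by a fitted exponent η under the lattice relation η = 1/(d + 1) quoted above. Several points are assumptions: the text does not state which community-detection algorithm produces the partition, so that step is not shown and the partition is supplied by hand; the function names and the toy graph (two triangles joined by one edge) are hypothetical.

```python
def modularity(edges, community):
    """Newman modularity Q of an undirected, unweighted graph.

    edges     -- list of (u, v) node pairs, each edge listed once
    community -- dict mapping each node to its community label
    Q = sum over communities of [e_c / m - (d_c / (2 m))^2], where e_c is
    the number of internal edges, d_c the total degree, m the edge count.
    """
    m = len(edges)
    internal, degree_sum = {}, {}
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())

def effective_dimension(eta):
    """Effective dimension from Q = 1 - K * N^(-eta) with eta = 1/(d + 1)."""
    return 1.0 / eta - 1.0

# Toy graph (hypothetical): two triangles connected by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_modules = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_module = {node: "a" for node in range(6)}

q_split = modularity(edges, two_modules)
q_whole = modularity(edges, one_module)
d_protein = effective_dimension(0.279)  # exponent reported for proteins
print(q_split, q_whole, round(d_protein, 2))
```

Splitting the toy graph along the bridge gives a positive Q, whereas placing all nodes in one community gives Q = 0, which matches the intuition that more separable networks have higher modularity.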

Stability-functionality constraint: The size dependence of proteins’ shape

The intrinsic dynamics of proteins are encoded in their structures. Since the scaling relation between the dynamics and the size of the protein has already been discussed in the previous sections, we focus in this section on the relationship between the structure and the size of the protein. The shape factor s can be introduced to describe the general architecture of a protein molecule [15]. According to its definition, the shape factor can be understood as the residue packing density within the inertia ellipsoid. When residues are tightly packed in a globular shape, the shape factor s is large. When disordered loops or flexible linkers connect multiple domains, the shape of the molecule deviates from an ellipsoid, and s is small. Here, for illustration, three proteins with similar chain lengths (180 ≤ N < 220) but different shape factors s are shown in Fig 4A. On the left, the receptor-binding domain of the short tail fiber (STF) is illustrated. Such a molecule has hardly any regular secondary structures like α-helices or β-strands [60]. The structure of such a molecule in its monomer state has a small shape factor and high modularity. To perform its functions, a knitted trimeric assembly has to be formed [60]. In the middle is the human molecular chaperone heat-shock protein 90 (Hsp90) [61], with a medium shape factor and modularity. On the right, a de novo designed helical repeat protein, DHR10, is illustrated. By repeating a simple helix–loop–helix–loop structural motif, the DHR10 protein is highly ordered and very stable, staying folded even at 95°C [62]. Generally, proteins with larger shape factors show higher stability, and proteins with smaller shape factors show higher flexibility.

Fig 4

The shape factor correlates with the chain lengths of the proteins.

(A) Three proteins with similar chain lengths: (Left) The receptor-binding domain of T4 STF (PDB: 1OCY, s = 0.84, Q = 0.74); (Middle) Human Hsp90 protein (PDB: 3T0H, s = 1.77, Q = 0.65); and (Right) The DHR10 protein (PDB: 5CWG, s = 2.37, Q = 0.63). (B) For proteins of similar sizes (chain length 180 ≤ N < 220), the scatter plot (yellow dots), binned average (red dots) and trend line (red line) of the shape factor s vs. the modularity Q. The histograms of the shape factor s (right vertical) and the modularity Q (top horizontal) are also shown. (C) For all the proteins in our dataset, the 2D histogram (in the background) of s vs. N and the plot (in navy blue) of the most-probable shape factor s* vs. chain length N.

Although the definition of the shape factor does not involve any detailed information on secondary structures or residue contacts, the shape factor is closely related to the topological descriptors of the residue contact network. Here, statistics are collected for the proteins with similar chain lengths (180 ≤ N < 220). The scatter plot of the shape factor s versus the modularity Q is shown in Fig 4B. A trend line (in red) shows that as the modularity Q increases, the shape factor s decreases. The result is easy to understand intuitively: a protein molecule whose shape deviates from an ellipsoid is likely to have multiple domains or flexible linkers connecting multiple ordered regions. Interestingly, although proteins can have very different shapes, for protein molecules with a specific chain length, the value of the shape factor does not vary much. In Fig 4B, histograms of the shape factor s (right vertical) and the modularity Q (top horizontal) are plotted. The histograms show that there exists a most-probable shape factor s* and a corresponding modularity Q*. Most natural proteins have shape factors close to s* and exhibit a balance between stability and flexibility [21]. 
In fact, for proteins with different chain lengths, the most-probable shape factor s* always exists, which can be recognized as a constraint in the shape of the protein. As shown in Fig 4C, it was observed that larger proteins prefer smaller shape factors. A similar relation is also observed based on NMR-determined ensembles [15]. These observations provide additional pieces of evidence to support the criticality of native proteins. The native proteins have to balance between stability and flexibility. With short chain lengths, the proteins tend to have a larger shape factor to ensure a stable folded state. Accordingly, small proteins usually have higher residue packing density. However, as the chain length of the proteins increases, to execute functional motions, flexibility becomes the main demand of the proteins. One good example is the designed protein DHR10 as illustrated in Fig 4A. DHR10 has high structural stability, but it is hard for such a protein to execute any biological functions. In such a situation, smaller shape factors, which usually correspond with disordered loops or multi-domain structures, are demanded by the functionality. Our results suggest that the balance between stability and flexibility acts as an evolutionary constraint for proteins at different sizes.

Discussion

The long-range correlated fluctuations contribute to many biological processes of proteins, such as allostery, catalysis, and transportation. To understand the origin of such long-range correlations, we conduct normal mode analysis with the elastic network model for a large dataset of globular proteins determined by X-ray crystallography. First, we predict the correlated motions for proteins at different sizes. It is observed that the correlation length of a protein can extend to the size of the whole protein, no matter how large the protein molecule is. Moreover, with different model parameters, the scale-free correlations and the scaling laws can be reproduced by the elastic network model, which is the minimal structure-based model of native proteins. Such a result indicates that the critical dynamics characterized by the power-law relations are robustly encoded in the native topology of the proteins. Second, for proteins at different sizes, we conduct normal mode analysis and perform scaling analysis for the slow vibration modes of the proteins. To demonstrate the particularity of the spectrum of proteins, we compare the proteins with ideal polymer chains and lattice systems. Native proteins stand between ordered lattices and disordered polymers, acting as the “critical point” that separates the ordered and disordered phases. Our scaling analysis provides additional evidence to support the criticality hypothesis. Third, to understand how the native topology determines the architecture and the dynamics of the proteins, we conduct scaling analysis for the topological descriptors and the size of the proteins. Our results demonstrate that, although proteins have an average path length similar to that of lattice structures, their residue contact networks are more modularized. Last, we focus on the size dependence of the proteins’ shape. For proteins with different chain lengths, a most-probable shape factor always exists. Larger proteins prefer smaller shape factors. Such a constraint results from the balance between stability and functionality of proteins. In summary, our work quantitatively demonstrates how the native contact topology defines the long-range correlations and the slow dynamics of native proteins. Our work not only provides quantitative scaling relations supporting the “structure-dynamics-function” paradigm but also reveals evolutionary constraints for proteins at different sizes. These results may shed light on a large variety of biophysical problems such as structure prediction, multi-scale molecular simulations, and the design of molecular machines.

Materials and methods

Dataset

Our dataset contains 13081 proteins selected from the Protein Data Bank (PDB) [63]. The structures of these proteins are all determined by X-ray diffraction at high resolution (≤ 2.0 Å). No structure in the dataset contains DNA, RNA, or hybrid molecules, and the chain length N satisfies 30 ≤ N ≤ 1200. Any two proteins in the dataset share less than 30% sequence similarity. The PDB codes of all the proteins in our dataset are listed in the Supplementary Information (S1 and S2 Files).

The elastic network models

The elastic network models are widely applied to predict the functional dynamics of a variety of proteins and bio-machineries [26, 27, 29, 30]. With the assumption that all residue fluctuations are Gaussian variables distributed around their equilibrium coordinates, the Gaussian Network Model (GNM) can successfully reproduce the residue fluctuations as determined by experiments [37, 38]. For a protein consisting of N residues, based on the native structure, the potential energy of the network is given by $V = \frac{\kappa}{2} \sum_{i,j=1}^{N} \Gamma_{ij}\, \Delta\mathbf{R}_i \cdot \Delta\mathbf{R}_j$, in which κ is a uniform force constant; $\Delta\mathbf{R}_i$ and $\Delta\mathbf{R}_j$ are the displacements of residues i and j, respectively; and $\Gamma_{ij}$ is an element of the Kirchhoff matrix Γ, which, from a graph theory perspective, is the graph Laplacian of the residue-residue contact network. The elements of matrix Γ are defined according to the contact topology of the native structure: for a residue pair i − j (i ≠ j) separated by distance $r_{ij}$, if $r_{ij} \le r_c$, then $\Gamma_{ij} = -1$; if $r_{ij} > r_c$, then $\Gamma_{ij} = 0$; and for the diagonal elements, $\Gamma_{ii} = -\sum_{j \ne i} \Gamma_{ij} = k_i$, where $k_i$ denotes the degree of node i. In GNM with homogeneous contact strength, the only control parameter is the cutoff distance $r_c$. With a large $r_c$, residue pairs at long distances can interact with each other, while for a smaller $r_c$, only short-range interactions contribute to the elastic energy of the system. One may also introduce distance-dependent force constants [41-43] to refine the predictions of elastic network models. In these models, the force constant κ becomes a function of the mutual distance between residues i and j. Further details and other variations of the elastic network models are listed in the S1 Appendix.
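The contact rule above can be illustrated with a short sketch. This is not the authors' supplementary code: the function name `kirchhoff_matrix`, the use of one coordinate per residue (for example the Cα atom, which is an assumption here), and the toy coordinates are all hypothetical and serve only to show how Γ follows from the cutoff distance.

```python
import math


def kirchhoff_matrix(coords, cutoff):
    """Build a GNM Kirchhoff matrix from one coordinate per residue.

    coords: list of (x, y, z) tuples (assumed here to be one point per residue).
    cutoff: contact cutoff distance r_c, in the same units as coords.
    """
    n = len(coords)
    gamma = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Residues within the cutoff are in contact: off-diagonal entry -1.
            if math.dist(coords[i], coords[j]) <= cutoff:
                gamma[i][j] = gamma[j][i] = -1
                # Each contact raises the degree (diagonal entry) of both residues.
                gamma[i][i] += 1
                gamma[j][j] += 1
    return gamma


# Hypothetical toy input: four points on a line, 3.8 apart, so that a cutoff
# of 4.0 connects only consecutive points (a simple chain).
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
toy_gamma = kirchhoff_matrix(toy_coords, cutoff=4.0)
```

Every row of the resulting matrix sums to zero, which is the defining property of a graph Laplacian; a larger cutoff would add contacts between non-consecutive points.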

Normal mode analysis and the spectrum of the graph Laplacian

Based on GNM, by diagonalizing the Kirchhoff matrix Γ, we can obtain all the eigenvalues and the corresponding eigenvectors describing the motions of every normal mode [32]. To compare the mode distribution for proteins of different chain lengths, the Kirchhoff (Laplacian) matrices corresponding to the topology of native proteins are normalized. By normalizing all the diagonal elements to 1, we obtain the symmetric normalized graph Laplacian [48]: $L = D^{-1/2}\, \Gamma\, D^{-1/2}$, in which $D = \mathrm{diag}[\Gamma_{1,1}, \Gamma_{2,2}, \cdots, \Gamma_{N,N}]$ is the diagonal matrix formed by the diagonal elements of Γ, describing the local packing status of each residue. Diagonalizing matrix L, we have $L = U \Lambda U^{T}$, in which the eigenvalues are $\Lambda = \mathrm{diag}[\lambda_0, \lambda_1, \lambda_2, \cdots, \lambda_{N-1}]$ ($\lambda_0 \le \lambda_1 \le \lambda_2 \le \cdots \le \lambda_{N-1}$) and the eigenvectors are $U = [u_0, u_1, u_2, \cdots, u_{N-1}]$. The eigenvalue $\lambda_i$ describes the frequency $\omega_i$ of the i-th eigenmode ($\lambda_i \propto \omega_i^2$), and the eigenvector $u_i$ describes the motion profile of the corresponding eigenmode. Note that the zero mode corresponds to the eigenvalue λ0 = 0, and eigenvector u0 describes the collective translational or rotational motions of the system. The code of normal mode analysis is listed in the Supplementary Code (S2 Appendix and S3 File).
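As a minimal sketch of the normalization and diagonalization described above (again not the authors' supplementary code), the following uses NumPy. The function name `normalized_laplacian_spectrum` and the toy chain matrix are hypothetical, and the sketch assumes every residue has at least one contact so that no diagonal element is zero.

```python
import numpy as np


def normalized_laplacian_spectrum(gamma):
    """Return eigenvalues (ascending) and eigenvectors of L = D^{-1/2} Gamma D^{-1/2}."""
    gamma = np.asarray(gamma, dtype=float)
    degrees = np.diag(gamma)  # diagonal of the Kirchhoff matrix = node degrees
    # Assumption: all degrees are positive (no isolated residues).
    inv_sqrt_d = 1.0 / np.sqrt(degrees)
    # Scale row i and column j by d_i^{-1/2} and d_j^{-1/2}; the diagonal becomes 1.
    lap = gamma * inv_sqrt_d[:, None] * inv_sqrt_d[None, :]
    # eigh is intended for symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(lap)
    return eigvals, eigvecs


# Hypothetical toy input: the Kirchhoff matrix of a four-node chain.
toy_chain_gamma = [[1, -1, 0, 0],
                   [-1, 2, -1, 0],
                   [0, -1, 2, -1],
                   [0, 0, -1, 1]]
eigvals, eigvecs = normalized_laplacian_spectrum(toy_chain_gamma)
slowest_nonzero = eigvals[1]  # lambda_1, the slowest non-zero mode
```

Because the eigenvalues of a symmetric normalized Laplacian always lie between 0 and 2, spectra of networks with different numbers of nodes can be compared on a common scale, which is the purpose of the normalization in the text.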

Shape factor

To have a general description of the structure of a protein molecule, a dimensionless shape factor s is defined [15]. By calculating the moments of inertia of a protein molecule, one can estimate the residue packing density within the inertia ellipsoid as , in which a = 3.8 Å is the residue size, and L1, L2 and L3 are the lengths of the principal axes of the protein (L1 > L2 > L3). The shape factors of the proteins in our dataset are listed in the Supplementary Data (S4 File).

Average path length

The average (or characteristic) path length 〈l〉 usually works as a measure of the information transfer efficiency on a network. It is defined as the average number of steps along the shortest paths for all possible pairs of network nodes. When $l_{ij}$ denotes the shortest distance between nodes i and j in a network of N nodes, the average path length is $\langle l \rangle = \frac{1}{N(N-1)} \sum_{i \ne j} l_{ij}$.
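The definition above can be sketched with a breadth-first search from every node. This is an illustrative sketch under stated assumptions rather than the authors' code: the function name `average_path_length`, the adjacency-dictionary input format, and the toy chain are hypothetical, and the network is assumed to be undirected, unweighted, and connected.

```python
from collections import deque


def average_path_length(adjacency):
    """Mean shortest-path length over all ordered node pairs (i != j).

    adjacency: dict mapping each node to an iterable of its neighbours
    (assumed undirected and connected).
    """
    nodes = list(adjacency)
    n = len(nodes)
    total = 0
    for source in nodes:
        # Breadth-first search gives shortest step counts in an unweighted network.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) != n:
            raise ValueError("network is not connected")
        total += sum(dist.values())
    return total / (n * (n - 1))


# Hypothetical toy input: a four-node chain 0-1-2-3.
toy_chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
toy_avg = average_path_length(toy_chain)
```

For a residue contact network, the adjacency dictionary could be derived from the non-zero off-diagonal elements of the Kirchhoff matrix.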

Modularity

Modularity is a topological descriptor designed to quantify how easily a network can be divided into modules. Consider a network with N nodes and M edges whose topology is described by the adjacency matrix A, where $A_{ij} = 1$ if and only if nodes i and j are connected. Modularity is defined as the fraction of the edges that fall within the given modules minus the expected fraction if edges were distributed at random [56, 57]. According to the definition, one can introduce the modularity matrix B with elements $B_{ij} = A_{ij} - \frac{k_i k_j}{2M}$, in which the term $k_i k_j / (2M)$ describes the expected number of edges between nodes i and j, and $k_i$ and $k_j$ denote the degrees of nodes i and j, respectively. Based on matrix B, the modularity can be calculated as $Q = \frac{1}{4M}\, x^{T} B\, x$, in which x is the column vector describing the partition of a network. Vector x has elements $x_i = \pm 1$ indicating the module to which node i belongs. The value of Q lies in the range −1 ≤ Q ≤ 1. For any given partition x of a network, one can calculate the corresponding Q. The appropriate partition of a network maximizes the modularity Q [64]. In this work, we used the Louvain method [65] to partition the network and maximize the modularity Q. The code of topological analysis is listed in the Supplementary Code (S2 Appendix and S3 File).
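To make the expression for Q concrete, the sketch below evaluates the modularity of a given two-module partition directly from the adjacency matrix. It does not implement the Louvain optimization used in the paper; it only scores a partition that is supplied by hand. The function name `modularity` and the toy network (two triangles joined by a single edge) are hypothetical.

```python
def modularity(adjacency, partition):
    """Q = x^T B x / (4M) for a two-module partition with x_i = +1 or -1.

    adjacency: symmetric 0/1 matrix as a list of lists.
    partition: list of +1 / -1 labels, one per node.
    """
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    two_m = sum(degrees)  # each edge is counted twice, so this equals 2M
    q = 0.0
    for i in range(n):
        for j in range(n):
            # Modularity matrix element: observed minus expected number of edges.
            b_ij = adjacency[i][j] - degrees[i] * degrees[j] / two_m
            q += partition[i] * b_ij * partition[j]
    return q / (2 * two_m)  # 4M = 2 * (2M)


# Hypothetical toy network: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
toy_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
toy_adjacency = [[0] * 6 for _ in range(6)]
for a, b in toy_edges:
    toy_adjacency[a][b] = toy_adjacency[b][a] = 1

q_split = modularity(toy_adjacency, [1, 1, 1, -1, -1, -1])  # one triangle per module
q_single = modularity(toy_adjacency, [1, 1, 1, 1, 1, 1])    # everything in one module
```

Splitting the toy network along the bridging edge gives a positive Q, whereas placing all nodes in one module gives Q = 0, because the elements of B sum to zero. An optimizer such as the Louvain method searches over partitions (including more than two modules) for the one with the highest Q.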

Supplementary information.

Detailed descriptions of the structural datasets involved in this research. Additional information concerning the scaling relations, generation of polymer structures, and other variations of elastic network models are also included in the Supplementary Information. (PDF)

Supplementary code.

The code (written in Python) for PDB file processing, correlation analysis, normal mode analysis, and topological analysis is listed in the Supplementary Code. (PDF)

The PDB codes and the chain length of the proteins in Dataset A (13081 proteins determined by X-ray crystallography) are listed in the file.

(TXT)

The PDB codes and the chain length of the proteins in Dataset B (5078 proteins determined by solution nuclear magnetic resonance) are listed in the file.

(TXT)

A Jupyter Notebook version of the supplementary code.

(ZIP)

The data (chain length N, radius of gyration R, average path length 〈l〉, smallest non-zero eigenvalue λ1, shape factor s and susceptibility χ) for all the proteins in our dataset are listed in the file.

(TXT)

8 Nov 2019

Dear Dr Tang,

Thank you very much for submitting your manuscript 'Long-range Correlation in Protein Dynamics: Confirmation by Structural Data and Normal Mode Analysis' for review by PLOS Computational Biology. Your manuscript has been fully evaluated by the PLOS Computational Biology editorial team and in this case also by independent peer reviewers. The reviewers appreciated the attention to an important problem, but both raised some substantial concerns about the manuscript as it currently stands. While your manuscript cannot be accepted in its present form, we are willing to consider a revised version in which the issues raised by the reviewers have been adequately addressed. We cannot, of course, promise publication at that time.

In particular, the lack of software and data availability is not acceptable. Please follow the guidelines we have published on how to make data and software reproducible, see: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006649

Your revisions should address the specific points made by each reviewer.

Sincerely,
Bert L. de Groot, Associate Editor, PLOS Computational Biology
Arne Elofsson, Deputy Editor, PLOS Computational Biology

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #1: The paper presents a large-scale analysis of protein dynamics primarily using GNMs. The work appears to have been carried out carefully and the paper is well-written and understandable. The results reveal a scaling behaviour that appears to be unique to proteins and these results would be of interest to a certain group of scientists although perhaps not many biologists. I recommend that more is written about how these results can impact on our understanding of protein function and how they might help a biologist investigating a particular system. I appreciate the GNM part of the paper and have no issues with it. However, I do have issues with the part of the paper that uses B-factors and indeed I think it is flawed. The authors suggest that those residues with a positive product of Z values are correlated and those with a negative product are anticorrelated. They then look at the distance behaviour of the correlation and see that it scales according to the radius of gyration. This shows a "scale-free" behaviour. My interpretation would be different.
The correlation distance is not a correlation distance but a distance that relates to a surface boundary region where there is a transition between residues with low B-factors (internal) and a region where residues have high B-factors (on or near the surface). It is not surprising therefore that this scales with the size of the protein. Can the authors rule out this explanation? If so they should and if not they might consider removing this section and rely instead on the GNM results alone. I found the paragraph before the Conclusion section long-winded. The authors should consider shortening it.

Reviewer #2: This article presents a network-centric analysis of collective motions in proteins, based on two approaches: (1) the analysis of crystallographic B-factors and (2) the analysis of Elastic Network Models constructed from crystallographic protein structures. They compare protein-derived networks with random networks and interaction networks of lattice structures. Their findings are interesting, shedding new light on phenomena that have been known empirically for a long time. They also seem novel to me, but I may be unaware of similar prior work because I have not followed network-centric approaches specifically over the last few years. For the same reason, I have not been able to verify many specific assertions on network properties made in the article.

My main criticism of this work is that it is not reproducible. No statement is made about the algorithms applied and software used. No references to published software, no project-specific software available as supplementary material. Unfortunately, the PLOS policy on software sharing is weak and unrealistic (https://journals.plos.org/ploscompbiol/s/materials-and-software-sharing), so the lack of software looks compatible with this policy, but as a reviewer I have to say that it makes it impossible for me to verify the results of this submission.

General comments:

- The authors refer to critical behavior in many places. This term has different (though related) meanings in different disciplines. The authors' use of the term best fits the concept of self-organized criticality in my opinion (which is also suggested by the titles of references 17 to 20). The authors should then say this in the article (or add some other clarification if they don't agree with mine). The authors should also explain in much more detail how their work relates to the earlier studies on self-organized criticality in proteins they refer to.

- The choice of random networks and lattice structures as references for comparison looks a bit arbitrary. Random networks do not correspond to any interaction network in physical systems. Lattices do, but they are very far from proteins in terms of physical properties. The most interesting systems to compare to, in my opinion, are non-biological soft matter systems, such as polymers.

- The authors make specific predictions, e.g. on the scaling of the slowest modes with protein size, that should be amenable to experimental validation. Have they searched for experimental studies in the literature?

Specific comments by page number:

Page 2: "Although there are only short-range physical interactions among the residues," Residues being charged, there are long-range interactions as well. It is true however that models that leave out the long-range interactions (e.g. ENMs) also exhibit long-range correlations, suggesting that the long-range interactions are perhaps not essential.

Page 3: "B-factors" The authors repeat a popular false assumption in the study of protein dynamics: the idea that B-factors measure thermal fluctuations. In crystallographic structure refinement, a distribution of conformations is fitted to the observed Bragg peaks. This distribution is most commonly described by a Gaussian model, consisting of an average structure and a variance. The variance matrix is usually approximated by a diagonal matrix, whose non-zero elements are the B-factors. B-factors thus measure all deviations from an ideal crystal at temperature zero: finite-size effects, crystal disorder, and thermal fluctuations. Crystallography studies of the same protein at different temperatures (e.g. PDB codes 1IEE and 2LYM, both for tetragonal lysozyme) show that the influence of temperature on B-factors is very small, suggesting that the dominant contribution is crystal disorder. This doesn't invalidate the authors' analysis, as crystal disorder effects also spread through the protein via inter-residue contacts. It is only the presentation in terms of conformational fluctuations that needs to be revised.

Page 4: "X-ray diffraction can only provide one static structure" See above. An X-ray structure is not static, it is the average structure in a Gaussian ensemble.

Page 5: "From the B-factor profile, one can estimate protein flexibility, " This looks dubious. As said above, B-factors don't really measure fluctuations. And even if they did, it is not obvious how fluctuations at the atomic level, without inter-atomic conformational correlations, can be translated into some measure of flexibility.

Page 5: "A positive value of C^{(Z)}_ij implies that the fluctuations of residues i and j are both above or below the average..." Why is this relevant? Does the average have any scientific interpretation that makes above/below average meaningful?

Page 7: "it can be concluded that the collective motions of residues and the critical fluctuations of native proteins are encoded in the native structures" The encoding of collective motions in the structure is the fundamental hypothesis of Elastic Network Models, so it cannot be concluded from an analysis of their results. The added value contributed by the authors is a more specific characterization of these dynamics in comparison to other types of networks.

Page 8: "Slow modes and critical fluctuations of native proteins" The authors use the simplest form of ENM, in which the force constant for each residue pair can only take the values 0 or 1. It has been shown in the past (see e.g. http://doi.org/10.1021/ct400399x) that these models describe the "real" protein dynamics (obtained from experiment or from more detailed models) rather badly and that ENMs with distance-dependent force constants yield better matches. This raises the question if the mode frequency scaling behavior observed by the authors also holds if more accurate protein models are used.

Page 9: "Noting that the dynamics are encoded in the structures, that is to say, the structures of proteins, which are optimized through the process of molecular evolution, are significantly different from the regular 3D lattice structures." If the goal is to show the difference between biologically evolved systems and simpler non-adaptive physico-chemical systems, the comparison should not be 3D lattices but non-biological soft matter, e.g. polymers.

Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes
Reviewer #2: Yes

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review?
Reviewer #1: No
Reviewer #2: Yes: Konrad Hinsen

20 Dec 2019

Submitted filename: Response.pdf

21 Jan 2020

Dear Dr. Tang,

We are pleased to inform you that your manuscript 'Long-range Correlation in Protein Dynamics: Confirmation by Structural Data and Normal Mode Analysis' has been provisionally accepted for publication in PLOS Computational Biology.

Best regards,
Bert L. de Groot, Associate Editor, PLOS Computational Biology
Arne Elofsson, Deputy Editor, PLOS Computational Biology

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #1: The authors have addressed my original concerns.

Reviewer #2: In their revision, the authors have taken all of my criticism and suggestions into account, even beyond my expectations. I find the comparison with polymer chains particularly interesting, and the supplied code, although only partial, to be particularly helpful.

Have all data underlying the figures and results presented in the manuscript been provided?

Reviewer #1: Yes
Reviewer #2: No: References to experimental input data (PDB entries) have been provided. Generating the figures from this input requires non-trivial software, which is only partially provided. It is not possible for readers to recompute the figures, but it is possible to understand the ideas and models implemented.

Do you want your identity to be public for this peer review?

Reviewer #1: No
Reviewer #2: Yes: Konrad Hinsen

4 Feb 2020

PCOMPBIOL-D-19-01780R1

Long-range Correlation in Protein Dynamics: Confirmation by Structural Data and Normal Mode Analysis

Dear Dr Tang,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

With kind regards,
Laura Mallard, PLOS Computational Biology
