Statistical potentials from the Gaussian scaling behaviour of chain fragments buried within protein globules.

Stefano Zamuner^1, Flavio Seno^2,3, Antonio Trovato^2,3.

Abstract

Knowledge-based approaches use the statistics collected from protein data-bank structures to estimate effective interaction potentials between amino acid pairs. They typically rely on empirical relations that hinge on the crucial choice of a reference state, associated with the null interaction case. Despite their significant effectiveness, the physical interpretation of knowledge-based potentials has been repeatedly questioned, with no consensus on the choice of the reference state. Here we use the fact that the Flory theorem, originally derived for chains in a dense polymer melt, also holds for chain fragments within the core of globular proteins, provided that the average is taken over buried fragments collected from different non-redundant native structures. After verifying that the ensuing Gaussian statistics, a hallmark of effectively non-interacting polymer chains, holds for a wide range of fragment lengths, although with significant deviations at short spatial scales, we use it to define a 'bona fide' reference state. Notably, although the latter does depend on fragment length, deviations from it do not. This allows us to estimate an effective interaction potential which is not biased by the presence of correlations due to the connectivity of the protein chain. We show how different sequence-independent effective statistical potentials can be derived using this approach by coarse-graining the protein representation at varying levels. The possibility of defining sequence-dependent potentials is explored.

Year: 2022 PMID: 35085247 PMCID: PMC8794220 DOI: 10.1371/journal.pone.0254969

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

Proteins are linear flexible hetero-polymers, made up of 20 different natural amino-acid species [1]. Most natural proteins in solution have roughly compact shapes, and thus are usually referred to as globular proteins. The fundamental fact about globular protein sequences is their ability to attain a compact native three-dimensional folded conformation in physiological conditions [2]. The biological functionality of proteins is intimately related to their native structures and to the dynamical properties encoded in them [3]. Quantitative theoretical modeling requires in principle a detailed description at atomic level, for example to take accurately into account the subtle yet dramatic effects that can be brought about by a single residue mutation. On the other hand, processes such as protein self-assembly and aggregation, involve time scales and system sizes which are currently unattainable by atomistic models [4]. Several schemes were thus developed to coarse-grain the representation of protein structures, and of the physical interactions between the representing entities, at a low resolution level [5]. The surprising success of coarse-graining approaches in computational protein science is related to the presence of robust qualitative emergent properties in protein systems, amenable to prediction by low resolution models [6]. For example, the native topology both shapes equilibrium fluctuations and determines folding and unfolding pathways, allowing for successful predictions by structure-based coarse-grained models [7]. 
An even more remarkable example of successful coarse-graining is the use of statistical potentials, as both effective interaction potentials to be used in folding simulations [8-10], and scoring functions employed in different contexts such as protein structure and function prediction [11], “de novo” protein design [12], model quality assessment [13, 14], aggregation propensity prediction [15-17], protein-protein interactions [18-22], prediction of binding affinities and of stability changes upon mutations [23-26], and many others. Statistical “knowledge-based” potentials can be introduced at different coarse-graining levels, including atomic resolution (in this case, the coarse-graining is due to solvent molecules being integrated out). They renounce a physics-based description of the effective interactions between representing entities; interactions are instead parametrized using the statistics empirically collected from the Protein Data Bank (PDB) [27]. In paradigmatic examples [28, 29], “contact statistical potentials” evaluate the effective interaction between a pair of amino acid residues based on the observed frequency of contacts between that pair in PDB structures. This approach can be generalized to many different observables, such as solvent accessibility, backbone dihedrals, orientation-dependent or many body interactions [30-34]. The conversion of empirical frequencies into an energy function is normally done employing Boltzmann inversion, as originally suggested by Sippl for pairwise “distance dependent potentials”, in analogy to the pairwise potentials of mean force [35]. Complementary potentials are typically estimated separately, to be then combined together, for interactions either short-range or long-range along the chain [36], with the aim of correctly capturing local structure elements. 
In state-of-the-art approaches several different statistical potential terms, each related to a different observable, can be combined together, optimizing the relative weights by means of supervised learning techniques [14, 37–40]. A crucial element in the definition of statistical potentials via Boltzmann inversion is the choice of a “reference state”. The probability distribution observed in the latter is used to normalize the statistics collected over the PDB structures for a given residue pair. The reference state should then be taken as an ensemble of protein-like structures where no specific direct interactions between amino acids are present. A simple choice is to consider the ensemble of all residue pairs from the PDB structures [41], but many different recipes are still possible to define the reference state [42-45]. Besides the uncertainty in the reference state definition, the very use of Boltzmann inversion for the statistics collected from different PDB structures was extensively debated [46, 47], in particular with reference to chain connectivity. The Boltzmann inversion has been justified by using information theory arguments within a Bayesian approach [48]. In this context, statistical potentials are considered as statistical preferences that can be obtained “a posteriori” from empirical data, whereas the reference state contains the “a priori” information about the system.
In this work, we propose to use a reference state for deriving pairwise distance dependent potentials based on purely polymer physics considerations. Our strategy can be used at different levels of coarse-graining. In particular, we use the fact that a data set of protein “fragments”, when collected and properly filtered from PDB structures, exhibits Gaussian statistics, the one expected for ideal chains in the absence of any interaction. This property had already been uncovered by Banavar et al. [49], who found that fragments buried within globular proteins obey on average the same Flory theorem [50, 51] derived for polymer melts, that is, concentrated solutions of different chains. The same theorem has been shown to hold for fragments buried in the interior of single compact polymer chains, when selected with appropriate constraints [52]. The Flory theorem states that, for polymer chains within a dense polymer melt, excluded volume repulsion is effectively canceled by solvent-mediated attractive forces between the monomers. Therefore the chains exhibit statistics which are characteristic of random walk behavior. The first purpose of this work is to confirm the existence of a Flory regime for buried protein fragments when a much larger data set of proteins is considered. We then take advantage of this fact by using the ensuing Gaussian reference state in order to obtain an unbiased estimation of a distance dependent effective interaction potential between amino acids [41, 48, 53]. The statistical potentials estimated with this strategy could be either sequence independent or sequence dependent. In the first part of the paper, by analyzing a data set of 7793 non-redundant globular proteins, we confirm that the Flory theorem holds for compact native structures with a good degree of accuracy. This is achieved by showing that the properly rescaled distributions of the fragment end-to-end distances collapse to the same Maxwell distribution when fragment lengths larger than m = 70 and smaller than $N^{2/3}$, where N is the length of the protein chain, are considered. The upper cut-off is introduced in order to select buried fragments [52]. The lower one is instead necessary to achieve a uniform Kuhn length. Our results extend the findings of Ref. [49], showing that the Gaussian scaling holds for fragments in a larger range of sizes, provided a non-uniform Kuhn length is considered.
As a consequence of the validity of the Flory theorem, we can assume that within protein globules the excluded volume repulsion is on average effectively canceled by solvent-mediated attractive forces between the monomers. We therefore interpret systematic deviations from the expected Gaussian behavior, which are visible in the short spatial range regime, as an effective intra-molecular interaction that can be considered unbiased by the spurious correlations due in general to the chain constraint, to local conformational preferences, or to interior-exterior partitioning effects. We devote the second part of this manuscript to exploring the feasibility of this idea, by estimating a sequence-independent effective potential based on the statistics observed for buried protein fragments at different coarse-graining levels: CA-based, all heavy atoms, all atoms (including hydrogen atoms). The estimated potentials consistently change in the three cases. In particular, a power law repulsive term is present at short distances, whereas the potential vanishes beyond ≈20 Å in all cases. Well defined minima with negative energy are present for the atomistic resolutions at distances compatible with the sum of Van der Waals radii or with hydrogen bond geometry. If the analysis is repeated by classifying the protein segments according to the amino-acid types which occupy the first and the last position along the fragment, we obtain a direct measure of the effective interactions between amino-acid types as a function of the distance. This method could be of great interest for a wide range of applications in protein physics.

Materials and methods

Dataset

Our reference database is Top8000 [54], which contains a set of 7957 high-resolution protein structures. We filtered the dataset by excluding those structures of length N that do not exhibit a globular shape, i.e. whose gyration radius R(N) does not scale as $N^{1/3}$. To this end, we fitted the experimental data with the relation $R(N) \propto N^{1/3}$ and discarded all the structures that fall more than three standard deviations away from the fitted curve. 164 structures were discarded in this phase. The tangent vector at each residue was computed as the difference between the coordinates of the next and the previous residues along the chain. We computed the average cosine of the angle between the tangent vectors of pairs of residues as a function of their separation m along the chain. As the tangent-tangent correlation goes to zero when m ∼ 30, we decided to exclude from our analysis chain fragments shorter than 30 amino acids. We therefore split all the protein chains into fragments of length $30 \le m < N^{2/3}$ and grouped them by length. All other fragments were discarded.
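The globularity filter described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and it assumes that the fit is performed on log R with the exponent fixed at 1/3, a detail the text does not specify.

```python
import math

def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their centroid."""
    n = len(coords)
    center = [sum(p[k] for p in coords) / n for k in range(3)]
    sq = sum((p[k] - center[k]) ** 2 for p in coords for k in range(3))
    return math.sqrt(sq / n)

def filter_globular(lengths, radii, n_sigma=3.0):
    """Fit R = a * N**(1/3) and flag structures within n_sigma of the fit.

    Assumption: the fit is done in log space with the exponent fixed at 1/3,
    so the least-squares estimate of log(a) is the mean of log R - log(N)/3.
    """
    offsets = [math.log(r) - math.log(n) / 3.0 for n, r in zip(lengths, radii)]
    log_a = sum(offsets) / len(offsets)
    residuals = [x - log_a for x in offsets]
    sigma = math.sqrt(sum(x * x for x in residuals) / len(residuals))
    keep = [abs(x) <= n_sigma * sigma for x in residuals]
    return math.exp(log_a), keep
```

Each entry of `keep` tells whether the corresponding structure lies within `n_sigma` standard deviations of the fitted curve; structures flagged `False` would be discarded.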

Reference and empirical distributions

We measured the end-to-end distance R of all fragments of given length m. We fitted the rescaled data at fixed m with a Maxwell distribution with a single free parameter b (the scale, i.e. the Kuhn length), using the scipy.stats Python package and a maximum likelihood fit. The empirical distribution was obtained by Kernel Density Estimation (KDE) with a Gaussian kernel:

$$\hat{P}_w(R) = \frac{1}{M} \sum_{i=1}^{M} \frac{1}{\sqrt{2\pi}\,w} \exp\left[-\frac{(R - r_i)^2}{2 w^2}\right]$$

The sum extends over all M values $r_i$ in the dataset of end-to-end distances of fragments of length m. We used cross validation to establish the optimal kernel bandwidth w for fragment lengths m ∈ {42, 48, 60, 64, 66, 72, 78, 84, 92}. We divided every set of end-to-end distances of fixed fragment length into five groups: an empirical distribution was computed using the data of four groups, and the width of the Gaussian kernel was then adjusted to maximize the likelihood that the data from the fifth group were drawn from the same empirical distribution. In order to estimate the optimal bandwidth for all other datasets, we assumed the power-law relation $w = a\,n^{s}$ between the bandwidth w and the number of points n in the dataset. We fitted the parameters a and s by minimizing the RMSD with the cross-validated bandwidths (see S1 Fig in S1 File).
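The bandwidth selection can be illustrated with a short, dependency-free sketch. It is not the authors' implementation: the function names are hypothetical, the candidate bandwidths are supplied by the caller, and the held-out group is rotated over all five folds, whereas the text may refer to a single split.

```python
import math

def gaussian_kde(points, w):
    """Return the Gaussian-kernel density estimate as a function of R."""
    norm = 1.0 / (len(points) * w * math.sqrt(2.0 * math.pi))
    def density(R):
        return norm * sum(math.exp(-0.5 * ((R - r) / w) ** 2) for r in points)
    return density

def cv_log_likelihood(data, w, k=5):
    """Mean held-out log-likelihood over k folds for bandwidth w."""
    folds = [data[i::k] for i in range(k)]
    total = 0.0
    for i, held_out in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        pdf = gaussian_kde(train, w)
        # Floor the density to avoid log(0) for very narrow kernels.
        total += sum(math.log(max(pdf(r), 1e-300)) for r in held_out)
    return total / len(data)

def best_bandwidth(data, candidates, k=5):
    """Pick the candidate bandwidth with the highest held-out likelihood."""
    return max(candidates, key=lambda w: cv_log_likelihood(data, w, k))
```

A bandwidth that is too small scores poorly because held-out points fall between kernels, while one that is too large oversmooths the distribution; the cross-validated likelihood penalizes both.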

Potential

For every sequence separation m, we estimated the potential as a potential of mean force depending on the distance R between two residues, using the ensemble of buried fragments selected as described in the Dataset subsection. Following a seminal approach [35], we assume R to be distributed according to the Boltzmann distribution

$$P(R) = \frac{e^{-F(R)/\kappa T}}{Z}$$

where κ is the Boltzmann constant, T is the temperature at which thermodynamic equilibrium is assumed to hold, and Z is the canonical partition function. In what follows, we assume κT = 1 for simplicity. One should keep in mind that F(R) is in fact an effective free energy, since it is obtained by coarse-graining other degrees of freedom, including for example the ones associated with solvent molecules. Boltzmann inversion then implies

$$F(R) = -\ln P(R) - \ln Z$$

The potential of mean force is then defined as a free energy difference ΔF(R) with respect to the ideal reference state, characterized by the Maxwell distribution $P_M(R|b)$ and the partition function $Z_M$, with the scale b determined as described in the previous subsection:

$$\Delta F(R) = F(R) - F_M(R) = -\ln\frac{P(R)}{P_M(R|b)} - \ln\frac{Z}{Z_M}$$

The term with the (unknown) partition function ratio can be neglected since it does not depend on R, and as a proxy of the Boltzmann distribution P(R) we use the empirical distribution $\hat{P}_w(R)$, evaluated using KDE, with bandwidth w optimized as described in the previous subsection. This leads finally to our estimation for the statistical potential:

$$V(R|b, w) = -\ln\frac{\hat{P}_w(R)}{P_M(R|b)} \tag{6}$$

Note that Eq (6) defines an average pairwise residue-residue sequence-independent potential. Both the empirical and the Maxwell distributions entering Eq (6) are obtained based on the statistics of all fragments in our filtered dataset. The pairwise decomposition of the total score for a whole protein is in general an approximation, since it neglects correlations between different residue pairs. The major point in our analysis is related to the absence of effective interactions that is actually realized in the reference state. 
This implies that the pairwise decomposition is exact for the reference state (the denominator in the r.h.s of Eq (7)). As we show that the potential does not depend on m, we finally computed V*(R) as the average over all possible fragment lengths of V(R|b, w) in the Flory regime, 70 ≤ m ≤ 90. Note that both parameters b and w depend on fragment length m. We fitted the short range repulsive part of the potential by minimizing the root mean square deviation between the logarithm of V*(R) and a linear function.
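The Boltzmann inversion against the ideal-chain reference can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, the reference is taken to be the standard end-to-end distance distribution of an ideal chain of m steps with mean square distance m b², and energies are in units of κT.

```python
import math

def maxwell_pdf(R, m, b):
    """End-to-end distance density of an ideal chain with <R^2> = m * b**2."""
    var = m * b * b / 3.0  # variance of each Cartesian component
    return (4.0 * math.pi * R * R * (2.0 * math.pi * var) ** -1.5
            * math.exp(-R * R / (2.0 * var)))

def potential(R, empirical_pdf, m, b):
    """V(R) = -ln[P_emp(R) / P_Maxwell(R)], in units of kT."""
    return -math.log(empirical_pdf(R) / maxwell_pdf(R, m, b))

def average_potential(R, pdfs_by_m, b_by_m):
    """Average V(R) over the fragment lengths m present in pdfs_by_m."""
    values = [potential(R, pdf, m, b_by_m[m]) for m, pdf in pdfs_by_m.items()]
    return sum(values) / len(values)
```

Here `empirical_pdf` stands for any callable density, for instance a KDE of the observed distances. Where the empirical density equals the reference the potential is zero, and where contacts are enriched relative to the ideal chain it is negative.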

Sequence-dependent analysis

We repeated the previous analysis while filtering the fragments according to the amino acid types at their ends, so that both the empirical distribution and the Maxwell distribution are obtained from such restricted fragment sets. To increase the available statistics, the average of V(R|b, w) is now taken over all fragment lengths 30 ≤ m ≤ 90 with a Gaussian reference state.

Results

In order to assess the hypothesis that the Flory theorem holds for fragments buried in the interior of globular proteins, we analyzed a large data set of 7793 globular proteins. This protein ensemble was obtained by refining the TOP8000 data bank [54], removing the non-globular structures as explained in Methods. In data-set pruning, each protein is represented as a polymer whose monomers are placed at the Cα atomic positions of the N amino acids. The logarithmic plot of the radius of gyration of these polymers versus their length N is shown in S2 Fig in S1 File for the full TOP8000 data bank. The proteins in the final pruned dataset (highlighted in S2 Fig in S1 File) have been selected in such a way that their radius of gyration scales as $N^{1/3}$, as expected for globular proteins.

Long enough buried protein fragments follow Gaussian statistics: The thermal exponent

To investigate the validity of the Flory theorem, we analyzed an ensemble of protein fragments of different lengths, extracted from the pruned database. For any given chain of length N we considered only fragments with length $m < N^{2/3}$, so that they belong to a part of the protein which is likely to be far from globule boundaries and thus buried within the globule interior [52]. Fig 1a shows on a logarithmic scale the behavior of the average end-to-end distance R of such fragments as a function of their length m, when the end-to-end distance is evaluated using Cα backbone atoms.
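The fragment selection can be sketched as below. This is an illustrative reading of the procedure, with hypothetical function names: the fragment length m is taken as the sequence separation between the two end residues, and each chain is a list of (x, y, z) coordinates, one per residue.

```python
import math
from collections import defaultdict

def buried_fragment_distances(chains, m_min=30):
    """Collect end-to-end distances of fragments with m_min <= m < N**(2/3)."""
    by_length = defaultdict(list)
    for coords in chains:
        n = len(coords)
        m = m_min
        while m ** 3 < n ** 2:  # integer form of m < N**(2/3)
            for i in range(n - m):
                by_length[m].append(math.dist(coords[i], coords[i + m]))
            m += 1
    return by_length

def mean_end_to_end(by_length):
    """Average end-to-end distance for each fragment length."""
    return {m: sum(d) / len(d) for m, d in sorted(by_length.items())}
```

Writing the cut-off as `m ** 3 < n ** 2` avoids floating-point rounding at the boundary. Chains shorter than about 165 residues contribute no fragments when `m_min` is 30.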

Fig 1

Gaussian statistics for buried fragments: The thermal exponent and the Kuhn length.

(a) Log-log plot of the average CA end-to-end distance R of protein fragments versus fragment length m. The plot was obtained by averaging over all fragments of length m from the data set selected as shown in S2 Fig in S1 File. For any given m, R was determined as the average over all fragments of that length in proteins whose overall lengths are larger than $m^{3/2}$, in order to consider only fragments likely to be buried in the globule interior [52]. The error bars are of the order of the size of the symbols. The Flory regime, i.e. $R \sim m^{1/2}$, is reached when m ≥ 70. For m > 90 the statistical analysis deteriorates due to the fast decrease of available data with increasing m. (b) The Kuhn length b, obtained by maximizing the likelihood to Maxwellian distributions (1) of the empirical CA end-to-end distance data, plotted versus the length m of the protein fragments considered in the statistical analysis. The error bars were estimated based on the Fisher information evaluated at b(m) (see main text). The values of b decrease monotonically and reach a plateau in the region 70 ≤ m ≤ 90. The uniform plateau value is estimated to be b* = 3.67±0.01 Å. Only in this region do all the rescaled empirical distributions collapse (see Fig 3), thereby showing the existence of the Flory regime. The number of fragments in the analyzed ensembles decreases with m as well. For m > 90 the ensemble population becomes too small to allow good estimations. Figure drawn with python package matplotlib, version 3.4.1. URL https://pypi.org/project/matplotlib/.

Fig 3

Data collapse of end-to-end distance distributions in the Flory regime for buried fragments.

The rescaled empirical probability distributions as a function of the rescaled length $R/m^{1/2}$ for 70 ≤ m ≤ 90. All curves collapse rather well together. The reference Maxwell distribution (8) evaluated for the plateau scale parameter b* = 3.67 Å is shown for comparison. A significant deviation appears only for small distances and is due to the effect of excluded volume, which at very short range cannot disappear for real protein chains. It is worth noting that, despite the presence of secondary structures, the value of b is close to the average distance, ≃3.8 Å, found between consecutive Cα atoms in protein native structures. Figure drawn with python package matplotlib, version 3.4.1. URL https://pypi.org/project/matplotlib/.

The Flory regime requires a scaling law $R \sim m^{\nu}$ with $\nu = 1/2$. Our data for CA end-to-end distances show that such a regime is valid only for the longer fragments, approximately when m > 70. This can be explained by the presence of secondary structures, which introduce a strong bias in the scaling behavior for short fragments. This behavior can be understood by looking at S3 Fig in S1 File, which shows the average tangent-tangent correlation as a function of sequence separation along the chain. For short sequence separations the direction of the chain is highly correlated, reflecting the existence of short, effectively one-dimensional, rigid motifs such as α-helices and β-strands. On the other hand, the sharp anti-correlation minimum at m = 13 reveals a bending propensity in the opposite direction. This finds its counterpart in the almost flat behaviour of the average end-to-end distance in Fig 1a for m ≳ 30. This picture is confirmed by noticing that the value at which the correlation function again reaches zero (m ∼ 25) is almost twice the value at the anti-correlation minimum, suggesting that for m ∼ 25 protein chains are expected to loop back on themselves significantly more than for other values of sequence separation. In fact, the above observation is consistent with the peak described in [55] for the probability of loop formation. This analysis suggests that the Flory regime cannot be observed for short fragments because of the effects induced by secondary structures.
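The tangent-tangent correlation discussed here can be computed along the following lines. This is a sketch with hypothetical function names: the tangent at residue i is taken as the normalized vector from residue i-1 to residue i+1, as described in Methods, and the average is pooled over all chains.

```python
import math

def tangents(coords):
    """Unit tangent at each interior residue, from residue i-1 to residue i+1."""
    out = []
    for i in range(1, len(coords) - 1):
        d = [coords[i + 1][k] - coords[i - 1][k] for k in range(3)]
        norm = math.sqrt(sum(x * x for x in d))
        out.append([x / norm for x in d])
    return out

def tangent_correlation(chains, m):
    """Average cosine between tangents at sequence separation m."""
    total, count = 0.0, 0
    for coords in chains:
        t = tangents(coords)
        for i in range(len(t) - m):
            total += sum(a * b for a, b in zip(t[i], t[i + m]))
            count += 1
    return total / count if count else float("nan")
```

For a straight rod the correlation is 1 at every separation, while a chain that folds back on itself gives negative values at the separation where it reverses direction.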

Long enough buried protein fragments follow Gaussian statistics: Maxwellian distributions for end-to-end distances

In order to investigate in more detail the existence of a Flory regime, we studied whether the end-to-end distance R for fragments of length m follows the Maxwell distribution

$$P_M(R|b) = 4\pi R^2 \left(\frac{3}{2\pi m b^2}\right)^{3/2} \exp\left(-\frac{3R^2}{2 m b^2}\right) \tag{8}$$

where the scale parameter b refers to the distance between consecutive monomers in an ideal Gaussian chain, namely the Kuhn length of the polymer. The distances between residues are computed in three different ways: as the distance between their α-carbons, as the minimum distance between any atom of the two residues, and as the minimum distance between any heavy atom of the residues. We will refer to these three different levels of coarse-graining as CA (α-carbon level), HH (hydrogen atoms level) and HV (heavy atoms level) respectively. For all three coarse-graining levels of description, we fitted the Kuhn length b of a Maxwell distribution to maximize the likelihood that the empirical data have been drawn from it. This is done separately for any given m, so that the optimized Kuhn length b(m) depends on fragment length m. In Fig 1b we plot the optimized b as a function of m for CA end-to-end distances. The standard deviations σ associated with the maximum likelihood estimators b(m) are shown in Fig 1b as error bars. Based on the Fisher information evaluated at b(m), they were estimated as $\sigma(m) = b(m)/\sqrt{6\,n(m)}$, where n(m) is the number of fragments in the dataset for a given m (see Table 1). The values of b decrease towards a plateau beginning approximately at m = 70. For m > 90 the values of b change dramatically as a consequence of the poorer statistics (see Table 1). At the plateau we estimate the m-independent Kuhn length b* = 3.67±0.01 Å for CA end-to-end distances. The standard deviation of b* is estimated accordingly. Similar results are obtained for HV and HH as well, as shown in S4 Fig in S1 File. The plateau values estimated for the Kuhn length are b* = 3.35±0.01 Å for HV and b* = 3.27±0.01 Å for HH.
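The maximum likelihood estimate of the Kuhn length has a closed form, which the following sketch uses in place of a numerical fit. It assumes the ideal-chain convention in which the mean square end-to-end distance equals m b²; the function name is hypothetical, and the error is the Cramér-Rao bound obtained from the Fisher information of the Maxwell scale parameter.

```python
import math

def fit_kuhn_length(distances, m):
    """Closed-form maximum-likelihood Kuhn length and its standard deviation.

    Assumes R follows the ideal-chain Maxwell law with <R^2> = m * b**2,
    for which the likelihood is maximized by b**2 = mean(R**2) / m.
    """
    n = len(distances)
    b = math.sqrt(sum(r * r for r in distances) / (n * m))
    sigma = b / math.sqrt(6.0 * n)  # from the Fisher information 6 n / b**2
    return b, sigma
```

With scipy.stats, a fit of the Maxwell scale to the raw distances at fixed location zero should give an equivalent result after converting the scale a to b = a·sqrt(3/m), under the same convention.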

Table 1

Length dependent statistics of buried fragments.

fragment length	number of fragments in the dataset
20	1640293
30	1352751
40	991104
50	563178
60	276421
70	120448
90	24967

Number of buried ($m < N^{2/3}$) fragments in the pruned (see S2 Fig in S1 File) dataset as a function of fragment length.

Fig 1b shows that the estimated Kuhn length is higher than in the plateau for lower values of m. This could explain the discrepancy with the higher value (b* = 3.75 Å) obtained in [49], as a different range of values of m was used in that work. Empirical probability distributions are inferred from raw data using Kernel Density Estimation (KDE), with the kernel bandwidth estimated separately for each m by a maximum likelihood approach through a cross-validation procedure. The whole methodology is explained in detail in Methods. In Fig 2, the empirical CA end-to-end distance probability distributions for four different fragment lengths (m = 30, 50, 70, 90) are shown together with their best Maxwellian fits. Similar plots for HV and HH end-to-end distance distributions are shown in S5 and S6 Figs in S1 File. The competing effects of increasing m and of b(m) decreasing with m are both visible in Fig 2.

Fig 2

Gaussian statistics for buried fragments: End-to-end distance.

The CA end-to-end distance probability distributions for four different fragment lengths, (a) m = 30, (b) m = 50, (c) m = 70, (d) m = 90, are shown together with their best fits to the Maxwell distribution (8). The parameters b used in the plots are obtained by maximizing the likelihood that the empirical data belong to the Maxwell distribution (8). The value of b decreases with m and reaches a plateau for 70 ≤ m ≤ 90, corresponding to the Flory regime (see Fig 1b). Figure drawn with python package matplotlib, version 3.4.1. URL https://pypi.org/project/matplotlib/.

It is interesting to observe that, as shown in Fig 2 and in S5 and S6 Figs in S1 File, the Gaussian behavior of the end-to-end distances is mostly preserved even in a broader range of fragment lengths, 30 ≤ m ≤ 90, but since b(m) is not uniform for m < 70, we cannot talk about a Flory regime in that range. Nevertheless, the existence of a Gaussian reference state can be fruitfully exploited to derive an effective statistical potential in the full 30 ≤ m ≤ 90 range. In the region with uniform Kuhn length b (70 ≤ m ≤ 90) empirical distributions can be collapsed by rescaling the end-to-end distances by $m^{1/2}$, i.e. plotting them against $R/m^{1/2}$, and multiplying the probability distributions by the same quantity, as shown in Fig 3 for CA. This graph vividly shows the existence for globular proteins of a range of fragment lengths in which their statistics is well described by Gaussian ideal chains with a uniform estimated Kuhn length, b* = 3.67 Å, close to the average distance, ≃3.8 Å, found between consecutive Cα atoms in protein native structures. The value of the Kuhn length may appear surprisingly low. We can rationalize this fact by looking at the average tangent-tangent correlation function (S3 Fig in S1 File). 
Since the value of the Kuhn length is related to the integral of the latter over all sequence separations [56], the presence of a sharp minimum at a negative correlation, implying a significant negative contribution to the integral, is at least consistent with the low value that we find for the Kuhn length. A data collapse of similar quality can be obtained for both HV and HH, as shown in S7 Fig in S1 File. S7 Fig in S1 File also shows that the collapsed empirical distributions are more skewed with respect to the reference Maxwell distribution in the HV and HH cases.

Statistical potentials with a Gaussian reference state: Sequence independent effective interaction

The Maxwell distribution (8) fits the CA experimental data very well for large values of the end-to-end distance, where the full cancellation of competing interactions, e.g. attractive and excluded volume, is effectively occurring. For short distances, however, as we can see from Fig 3, there are important deviations from the ideal distribution. These are expected, since for an ideal chain excluded volume is absent even at short range, whereas for real protein chains it is always present. We then propose to use deviations from the ideal Gaussian behavior as a proxy for the effective short range interactions between protein residues. To test this hypothesis, we define through Boltzmann inversion a sequence independent statistical potential for any given fragment length m, as minus the logarithm of the ratio between the empirical probability density (already shown in Figs 2 and 3 for different fragment lengths) and the reference Maxwell distribution:

$$V(R|b, w) = -\ln\frac{\hat{P}_w(R)}{P_M(R|b)} \tag{9}$$

where $P_M(R|b)$ is the reference Maxwell distribution (8) and $\hat{P}_w(R)$ is the empirical end-to-end distance distribution for fragments of length m, obtained with KDE using a kernel bandwidth w (see Methods for details). Eq (9) highlights the dependence of such a potential on the scale parameter b used for the reference state and on the kernel bandwidth w used to obtain the empirical distribution. It is worth noting that both parameters are obtained through a maximum likelihood approach. We plot in Fig 4 the effective potentials V(R) in the Flory regime 70 ≤ m ≤ 90, in which b(m) ≃ b* is effectively uniform, for all coarse-graining levels used in this work. Remarkably, the curves for different fragment lengths m collapse nicely together, allowing us to recover well defined effective potentials V*(R), which we define as the average of the potentials obtained from Eq (9) over all fragment lengths 70 ≤ m ≤ 90 in the Flory regime (the resulting average potentials are shown in S8 Fig in S1 File for all coarse-graining levels, along with the corresponding standard deviation). 
This result points to the existence of a robust underlying mechanism, which is revealed by using the ratio between the empirical and the ideal reference distributions and which can allow for a consistent estimate of amino-acid interactions. Even more remarkably, we observe that while the empirical end-to-end distance distributions collapse upon rescaling (see Fig 3), deviations from the Maxwellian reference state collapse when rescaling back to the physical distance values (see Fig 4). For example, S7 Fig in S1 File clearly shows, for the HH and HV cases, how the position of the sharp small peak at short distances, which determines the main minimum of the statistical potential V*(R), drifts upon changing m when using rescaled distances.

Fig 4

Empirical knowledge-based potentials for different coarse-graining levels.

Effective potential V(R) estimated for each 70 ≤ m ≤ 90 in the Flory regime using Eq (9). Remarkably, the curves obtained with this procedure do not depend on the fragment length and can therefore be interpreted as an effective potential between the terminal fragment residues. In this case, where all fragments are considered regardless of the type of amino acids at their ends, the potential can be interpreted as a generic sequence independent interaction between all residues. (a) CA representation. (b) HV representation. (c) HH representation. Figure drawn with python package matplotlib, version 3.4.1. URL https://pypi.org/project/matplotlib/.

The effective statistical potentials V*(R) differ significantly depending on the coarse-graining level, as in fact expected for physics-based interactions. The potentials obtained when considering all atoms, either with (HH) or without (HV) hydrogens, share in fact similar features: a steep short range repulsive part and a series of well defined attractive minima with decreasing depth for increasing distance (see Table 2). At large distances the potential vanishes towards zero, although shallower minima can still be identified (see Table 2). However, in the more coarse-grained HV case, the first minimum is partially smoothed out and the barrier separating the first two minima becomes repulsive. In the even more coarse-grained CA case, the minima of the statistical potential get much more smoothed out and the potential becomes repulsive for all distances.

Table 2

Local minima features of the average effective statistical potential.

Position (Å)			Value (k_BT)
HH	HV	CA	HH	HV	CA
2.21	3.54	5.81	−1.91	−1.07	0.53
6.34	7.16	10.59	−0.61	−0.46	0.03
11.01	9.35	12.03	−0.26	−0.21	0.02
14.90	11.44	14.92	−0.13	−0.21	−0.08
15.69	15.90	19.55	−0.13	−0.12	−0.13
18.95	17.63		−0.06	−0.11

Positions and values of the minima of the average effective statistical potential V*(R) for different coarse-graining levels.

The positions and depths of the minima of the statistical potentials for different coarse-graining levels are reported in Table 2. Minima features are extracted using the V*(R) potential obtained in the Flory regime 70 ≤ m ≤ 90. We observe that the deepest minimum in the HH case (2.21 Å) corresponds to twice the van der Waals radius of the hydrogen atom, 1.1 Å [57], whereas the deepest minimum in the HV case (3.54 Å) is within the distance range observed between donor nitrogen and acceptor oxygen atoms for hydrogen bonds occurring in proteins [58].

In order to study the short-distance repulsive behavior of the effective potential, more statistics are needed at very short distances. To this end, we consider the average of the statistical potentials defined by Eq (9), taken over the wider range of fragment lengths, 30 ≤ m ≤ 90, for which the rescaled end-to-end distance distributions are close to Maxwellians (see Fig 2). The reference state is thus now defined with a non-uniform Kuhn length b(m). The quality of the collapse of the different V(R), 30 ≤ m ≤ 90, worsens, yet is still acceptable, as shown in S9 Fig in S1 File for all coarse-graining levels. The sequence-independent effective potential is plotted in logarithmic scale in Fig 5 for the CA case, together with a linear regression fit for the short-range region.
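As a rough sketch of how an average potential and a list of minima such as the one in Table 2 could be obtained, the following Python fragment averages per-m potentials sampled on a common distance grid and then picks strict interior local minima. The common grid, the absence of smoothing and the function names are assumptions made for illustration; the binning and minima-detection procedure actually used is not described in this section.

```python
import numpy as np

def average_potential(potentials):
    """Average V_m(R) over m on a shared grid, skipping NaN (empty) bins."""
    return np.nanmean(np.asarray(potentials, dtype=float), axis=0)

def local_minima(R, V):
    """Positions and values of strict interior local minima of a sampled V(R)."""
    R = np.asarray(R, dtype=float)
    V = np.asarray(V, dtype=float)
    # A grid point is a minimum if it lies below both of its neighbours.
    is_min = (V[1:-1] < V[:-2]) & (V[1:-1] < V[2:])
    idx = np.where(is_min)[0] + 1
    return R[idx], V[idx]
```

On noisy data, this naive neighbour comparison would report many spurious minima, so some smoothing or a minimum-depth threshold would probably be needed in practice.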

Fig 5

Short distance behavior of the empirical knowledge-based potential.

Potential for the CA case, obtained when averaging the statistical potentials (9) over all fragment lengths 30 ≤ m ≤ 90, shown together with the power-law fit at short distances. The exponent estimate is −5.7±0.3. (a) log-log scale; the standard deviation is also shown. (b) linear scale with data collapse of all V(R) potentials for different values of sequence separation 30 ≤ m ≤ 90. Figure drawn with Python package matplotlib, version 3.4.1. URL https://pypi.org/project/matplotlib/.

The data are consistent with a power-law behavior at short distances. The resulting estimate for the exponent is −5.7±0.3, which might be related to the presence of distinctive dipole-dipole interactions. Nevertheless, we caution that this estimate may depend on the limited range of short distances that can be probed with the available statistics. As a matter of fact, a dipole-based description of peptide groups has already been used successfully in coarse-grained simulations of protein folding [59].
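The linear regression in logarithmic scale mentioned for Fig 5 can be illustrated as follows. This is a minimal sketch that assumes the repulsive potential is positive in the fitted region and follows V(R) ≈ A·R^α; the function name is hypothetical, and the choice of the short-range window, which strongly affects the estimate, is left to the caller.

```python
import numpy as np

def fit_power_law(R, V):
    """Least-squares fit of log V = log A + alpha * log R on points with V > 0."""
    R = np.asarray(R, dtype=float)
    V = np.asarray(V, dtype=float)
    mask = (R > 0) & (V > 0) & np.isfinite(V)
    alpha, log_A = np.polyfit(np.log(R[mask]), np.log(V[mask]), 1)
    return alpha, np.exp(log_A)
```

An exponent close to −6 would be in line with the dipole-dipole interpretation suggested in the text, but the fit alone cannot establish the physical origin of the repulsion.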

Statistical potentials with a Gaussian reference state: Sequence dependent effective interactions

The analysis carried out in the previous subsection can be repeated by splitting the full data set according to the specific amino acid types found at the ends of the considered protein fragments. The resulting statistical potential should be interpreted as an effective interaction between the terminal residues. Unfortunately, the reduced statistics push our approach to its very limit, even when considering the average potential over all sequence separation values 30 ≤ m ≤ 90 and a reference state with variable b. For completeness, we report in Fig 6 some examples of average statistical potentials derived with our approach for the CA case, involving two cysteine residues (CYS-CYS), two small non-polar residues (ALA-ALA), two charged residues (GLU-GLU) and two hydrophobic residues (LEU-LEU).

Fig 6

Sequence-dependent empirical knowledge-based potentials.

Examples of sequence-dependent potentials for the CA case, obtained when averaging the statistical potentials (9) over all fragment lengths 30 ≤ m ≤ 90, with their standard deviation (gray areas). The sequence-independent potential (dashed line) is shown as a reference. (a) Two cysteine residues (CYS-CYS). (b) Two small non-polar residues (ALA-ALA). (c) Two negatively charged residues (GLU-GLU). (d) Two hydrophobic residues (LEU-LEU). Figure drawn with Python package matplotlib, version 3.4.1. URL https://pypi.org/project/matplotlib/.

The sequence-dependent potential can clearly differ significantly from the average sequence-independent one. It is also interesting to notice how the obtained potentials reflect the physical-chemical properties of the amino acids. We indeed obtain a strongly negative (attractive) interaction between two cysteines and a strongly repulsive one between two equally charged amino acids (GLU-GLU). The interaction between two small non-polar residues matches very closely the average behavior of the sequence-independent potential, whereas two hydrophobic residues, although strongly repulsive at small distances, show an attractive interaction at longer distances. Similar plots for the HV and HH cases are shown in S10 and S11 Figs in S1 File, respectively.

Discussion

As a first result of this paper, we have confirmed that the statistical properties of an ensemble of sufficiently long fragments, collected from different globular proteins and selected to be buried in their interior, are similar to those of Gaussian ideal chains in a polymer melt [49]. The data set that we use [54] is based on experimentally derived protein native structures [27]. Fig 1a in fact shows that for sequence separations 70 ≤ m ≤ 90 the average fragment end-to-end distance, computed between Cα atoms, scales as m^(1/2), as expected for an ideal chain. At the same time, Fig 1b shows that the scale parameter b(m), which maximizes the likelihood of the Maxwell distribution expected for ideal end-to-end distances, consistently plateaus to a uniform value b* = 3.67±0.01 Å in the same sequence separation range. On the other hand, Fig 2 shows that, even outside the 70 ≤ m ≤ 90 range where the Flory theorem seems to hold, empirical distributions are reasonably approximated by Maxwell distributions in the whole sequence separation range 30 ≤ m ≤ 90, with a scale parameter b(m) that is non-uniform for m < 70. In fact, both bounds for the larger range are only approximately determined in this work. The lower bound m ≳ 30 is necessary to avoid the local rigidity effects brought about mostly by secondary structure elements. The latter are otherwise seen to play a role for m ≲ 30, resulting in a non-zero tangent-tangent correlation (see S3 Fig in S1 File), which is not consistent with a Gaussian regime. The presence of secondary structure elements is instead fully compatible with the observation of ideal Gaussian statistics for longer fragments. On the other hand, the upper bound m ≲ 90 is due to the lack of statistics caused by the constraint m < N^(2/3) for buried fragments, where N is the chain length, combined with the available protein lengths in the dataset. Were longer proteins available, we would expect the Gaussian statistics to hold for even longer buried fragments.
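The maximum-likelihood estimate of b(m) referred to above has a closed form if one assumes the standard ideal-chain parametrization P(R) ∝ R² exp[−3R²/(2mb²)]; this parametrization is an assumption here, since the explicit formula is not reproduced in this section. Under it, b² = ⟨R²⟩/m, and the inverse Fisher information (minus the second derivative of the log-likelihood, as described in the response to reviewers further on) gives a standard deviation b/√(6n) for n fragments. The Python sketch below, with a hypothetical function name, implements exactly these two expressions.

```python
import numpy as np

def kuhn_length_mle(distances, m):
    """MLE of the Kuhn length b and its Fisher-information standard deviation.

    Assumes P(R | b) = 4 pi R^2 (3 / (2 pi m b^2))^(3/2) exp(-3 R^2 / (2 m b^2)).
    """
    R = np.asarray(distances, dtype=float)
    n = R.size
    # Setting d(log L)/db = 0 gives b^2 = <R^2> / m.
    b = np.sqrt(np.mean(R**2) / m)
    # -d^2(log L)/db^2 evaluated at the MLE equals 6 n / b^2.
    sigma_b = b / np.sqrt(6.0 * n)
    return b, sigma_b
```

Averaging the values returned for 70 ≤ m ≤ 90 would then give an estimate analogous to b*, under the same assumptions.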
The intriguing observation that medium-sized buried protein fragments follow ideal chain statistics with varying Kuhn length is a novel result and we believe it deserves further investigation. Fig 3 shows the remarkable data collapse of empirical end-to-end distance distributions in the Flory regime 70 ≤ m ≤ 90, obtained upon rescaling Cα−Cα distances by m^(1/2). Notably, as shown in S4-S7 Figs in S1 File, we find similar results when computing fragment end-to-end distances with a more fine-grained representation of the protein chain, using either all atoms (including hydrogen atoms, HH) or all heavy atoms (excluding hydrogen atoms, HV). The observed ideal chain behaviour is due to the compensation between excluded volume effects and amino-acid interactions, as predicted by the Flory theorem in a polymer melt. Our results extend previous findings based on a smaller data set of protein structures [49]. Moreover, we clearly show how the region in which the theorem applies should be determined. The crucial novel observation that we make in our study is that, although the Gaussian statistics is valid for a large range of end-to-end distances, at short spatial scales there are deviations, because the excluded volume effect, as well as other interactions, cannot fully disappear for real protein fragments. We exploit these deviations to extract effective interaction potentials between amino acids at fragment ends by comparing the empirical probability distribution with the ideal one taken as a reference. The effective canceling out of different interactions, achieved in our reference state ensemble due to Gaussian statistics, allows us to estimate an unbiased physics-based pairwise interaction potential, without the spurious correlations generally present because of the chain constraint, local conformational preferences, or interior-exterior partitioning effects.
Following this approach, a different statistical potential can be estimated separately for any given value of sequence separation for which the ideal statistics is a good approximation of the empirical distribution. The main result of this work, shown in Fig 4, is the collapse of the different statistical potentials in the Flory regime 70 ≤ m ≤ 90. Most remarkably, while the reference states for different sequence separations collapse when rescaling physical distances according to the ideal Gaussian scaling (see Fig 3), deviations from the reference states collapse when rescaling back to physical distances (see Fig 4). This is a strong hint that the statistical potentials that we estimate in this work do indeed capture physics-based effective interactions. As a consequence, short-range deviations consistently behave as finite-size corrections, drifting towards zero rescaled distance for longer and longer fragments, with the universal Gaussian behavior eventually obtained in the limit of infinite fragment size. Even more remarkably, different potentials are obtained for the different coarse-graining levels used in this work, again as expected for physics-based effective interaction potentials. For all coarse-graining levels, the statistical potential vanishes at large distances. Well-defined local minima can be observed, listed in Table 2, with the deepest ones corresponding, for the atomistic resolutions, to steric (sum of van der Waals radii) or hydrogen bonding interactions. The potential minima get smoothed when considering a coarser representation, as expected for a proper coarse-graining when the finer degrees of freedom are averaged out. Within the Cα representation, the statistical potential is basically always repulsive; this can be rationalized by observing that the ideal Gaussian reference state already takes into account the average hydrophobic attraction needed for stabilizing a protein globule.
The short-range behaviour is not easy to investigate, since small values of end-to-end distance are scarcely sampled, and the use of shorter, more numerous, fragments is then required. Within the Cα representation, Fig 5 shows that a power-law repulsion is found, with an exponent estimate consistent with −6. This could be related to the presence of peculiar dipole-dipole interactions. Finally, we show in Fig 6 how the same approach can be used to derive sequence-dependent statistical potentials. Unfortunately, the statistics available for buried protein fragments with a given pair of amino acid types at their ends are barely enough to provide significant signals. Nonetheless, we observe trends consistent with what is expected from the physical-chemical features of the probed types of residue pairs. The Gaussian reference state from buried protein fragments could be used, in principle, to estimate an orientation-dependent statistical potential, when needed to properly represent specific interaction modes, such as disulfide bonds between pairs of cysteine residues. Available statistics would currently be a major issue, but this bottleneck is likely to be overcome in the near future due to the rapid increase of native structures being deposited in the PDB. An ideal chain reference state was already used to define statistical potentials, in order to take into account chain connectivity in a minimal way [44]. However, our work shows that the use of an ideal chain reference state is well justified only for buried protein fragments, being in fact rooted in the non-trivial polymer physics properties of protein globules. In fact, for fragment lengths m ≲ N, well above the threshold N^(2/3) used here to identify buried fragments, one expects to observe a non-Gaussian behaviour characterized by the thermal exponent ν = 1/3 typical of compact globules (see S2 Fig in S1 File for evidence of the “compact globule” scaling of gyration radius with protein length).
We believe this finding may be an important conceptual advance. For example, we can revisit one of the main criticisms raised against the standard derivation of statistical potentials [46, 47]. According to this critique, the Boltzmann inversion should in principle be used when averaging over different configurations from the thermal ensemble of the same protein system, not when averaging over a set of “fixed” configurations (native PDB structures) from different protein systems. However, it was observed that for some protein “substructures”, the frequencies of different “states” in the PDB database correlate with what is expected from thermodynamic behavior, although with different apparent temperatures. In particular, a main role is played by the apparent temperature associated with interior-exterior partitioning, which was shown to depend on the length, composition and compactness of the proteins in the database [46]. The finding that buried protein fragments collected from different protein systems do follow Gaussian statistics (the thermodynamically expected behavior for a polymer melt system) may at least be rationalized by noting that, when only buried fragments are used, the possible variation of the apparent temperature associated with interior-exterior partitioning no longer plays a role. We believe further work is needed to investigate in more detail the role of the constraint used to select buried protein fragments. In particular, relaxing that constraint, i.e. using m < aN^(2/3) with a ≳ 1, could be a way to gather more statistics and obtain more reliable sequence-dependent potentials. A trade-off is at play: the larger a, the more fragment statistics can be collected, but the less effective the constraint is in selecting fragments that are actually buried.
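The trade-off controlled by the parameter a can be made concrete with a small Python helper that lists, for a chain of N residues, the sequence separations compatible with the length constraint m < aN^(2/3) and with the lower bound m ≥ 30 discussed earlier. This is only an illustration of the length criterion: the function name is hypothetical, and any additional geometric test applied to decide whether a fragment is actually buried is not modelled here.

```python
def allowed_separations(N, a=1.0, m_min=30):
    """Sequence separations m with m_min <= m < a * N**(2/3) for chain length N."""
    m_max = a * N ** (2.0 / 3.0)
    # The explicit comparison keeps the inequality strict even if m_max is an integer.
    return [m for m in range(m_min, int(m_max) + 1) if m < m_max]
```

For a chain of 800 residues the default a = 1 admits separations up to m = 86, while a = 1.2 extends the range to m = 103, at the price of a weaker guarantee that the added fragments are buried.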
It is important to observe that the statistical potential presented here, in order to be tested in practical applications such as model quality assessment, should necessarily be complemented with other scoring terms, assessing for example solvent accessibilities and local conformational preferences. These properties are crucial for the correct folding of proteins and cannot be detected in our reference state of buried fragments. Moreover, it would be interesting to compare the results obtained here for buried fragments in protein globules to the properties of fragments buried in the interior of polymer conformations sampled in the compact phase below the θ-point. In particular, it is interesting to speculate whether the Gaussian behaviour with non-uniform Kuhn length found here for intermediate-size fragments is peculiar to proteins or not. To conclude, we observe that the statistical properties uncovered in this work were derived by analyzing ensembles built with different protein chains. Nonetheless, we may predict that the very same properties, reminiscent of ideal chain behaviour, could be observed for single protein chains in native conditions, for the specific case of Intrinsically Disordered Proteins (IDPs) that can form collapsed, globular ensembles while simultaneously exhibiting significant conformational heterogeneity [60]. This prediction could in principle be tested by means of single-molecule FRET experiments in which fluorescent labels can be placed across different chain fragments, thereby providing a direct measurement of end-to-end fragment distances [61]. Similar experiments were in fact already carried out for IDPs that form extended heterogeneous ensembles [62].

16 Aug 2021

PONE-D-21-21648
Statistical potentials from the Gaussian scaling behaviour of chain fragments buried within protein globules
PLOS ONE

Dear Dr. Trovato,

Thank you for submitting your manuscript to PLOS ONE.
After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. As you will read, the reviewers are mostly supportive of your manuscript, but still have a few requests for clarifications, and - to a lesser extent - requests for a quantified statistical perspective. I agree with them and I hope you can address these in your revised manuscript.

Please submit your revised manuscript by Sep 30 2021 11:59PM.

We look forward to receiving your revised manuscript.

Kind regards,
Jerome Baudry, Ph.D.
Academic Editor
PLOS ONE

Journal Requirements: We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?
Reviewer #1: Yes
Reviewer #2: Yes

2. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: Yes
Reviewer #2: Yes

3. Have the authors made all data underlying the findings in their manuscript fully available?
Reviewer #1: Yes
Reviewer #2: Yes

4. Is the manuscript presented in an intelligible fashion and written in standard English?
Reviewer #1: Yes
Reviewer #2: Yes

5. Review Comments to the Author

Reviewer #1: The study finds that long enough buried protein fragments follow Gaussian statistics, and further uses this Gaussian distribution as a reference state in the derivation of statistical potentials. This is an interesting and statistical mechanically rigorous way to define the reference state for statistical potentials. Overall, the manuscript is clearly written and the conclusion is supported by their data. I have a couple of general comments:

1. The Gaussian reference state is m-dependent. Would it be possible to define m-independent Gaussian reference states, at least for a certain range of m values?

2. For the m region whereby proteins behave like a random Gaussian chain, the use of a Gaussian reference state seems reasonable. What about the regions with small m values (m < 30) or large m values (m > 90)?

Reviewer #2: The manuscript is well explained and interesting. I enjoyed reading it. A particular strength is that the limitations of the theoretical approach are described in detail. Some points for the authors to consider when revising the manuscript:

- The abstract states that "Gaussian statistics [...] holds for a wide range of fragment lengths [...]". However, the authors note many times in the manuscript that they found deviations from Gaussian statistics at short spatial scales. The sentence in the Abstract should be modified to reflect more fully the data presented in the manuscript.
- Uncertainties ("error bars") are not provided for some of the fitted quantities. For example b* = 3.67 Å (line 242).
- How does the Kuhn length b* = 3.67 Å compare to the average distance between alpha carbons in proteins?
- It is stated in the Discussion that "[...] the use of an ideal chain reference state is well justified **only** for buried protein fragments," (I added * for emphasis). This statement is not backed by the data presented: the authors have examined only buried fragments and not non-buried/exposed fragments.
- Variable R is first introduced in line 130, but not defined there. R is defined later in line 148.
- Line 173: do the authors mean r.h.s. instead of l.h.s?

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. Do you want your identity to be public for this peer review?
Reviewer #1: No
Reviewer #2: No
30 Sep 2021

We thank both reviewers for their supportive and careful comments, which helped us to improve our manuscript. Below is a detailed point-by-point response to their comments.

Reviewer #1: The study finds that long enough buried protein fragments follow Gaussian statistics, and further uses this Gaussian distribution as a reference state in the derivation of statistical potentials. This is an interesting and statistical mechanically rigorous way to define the reference state for statistical potentials. Overall, the manuscript is clearly written and the conclusion is supported by their data.

We thank the reviewer for her/his positive evaluation.

I have a couple of general comments:

1. The Gaussian reference state is m-dependent. Would it be possible to define m-independent Gaussian reference states, at least for a certain range of m values?

We thank the reviewer for her/his observation. In fact, we realized that the description of how we define the “average” potential V(R) was possibly misleading and we corrected it in the revised version. As a matter of fact, we always compute V(R) as an average of potentials defined for different values of m, each having a different Gaussian reference state (i.e. a different Kuhn length b(m)). In the Flory regime (70 ≤ m ≤ 90) we could in fact use the same Gaussian reference state (with the “uniform” Kuhn length b* that we estimate in our manuscript) for all the potentials, as suggested by the referee. However, the resulting average potential would in practice be the same as the one computed with m-dependent reference states (see the following figures, where both average potentials are shown for all coarse-graining levels that we used in our work - figures visible in the pdf file). Finally, we realized that the Flory range was mistakenly labeled as 70 < m < 90 in the former version of the manuscript. We corrected this to 70 ≤ m ≤ 90 (and similarly to 30 ≤ m ≤ 90 for the overall range of fragment lengths) in the revised version.
2. For the m region whereby proteins behave like a random Gaussian chain, the use of a Gaussian reference state seems reasonable. What about the regions with small m values (m < 30) or large m values (m > 90)?

We thank the reviewer for her/his observation. It is definitely worth discussing why we could not find a Gaussian reference state outside the 30 ≤ m ≤ 90 range. First, we acknowledge that both bounds for the larger range are only approximately determined in our work. The lower bound is necessary to avoid the local rigidity effects brought about mostly by secondary structure elements. The latter are otherwise seen to play a role for m < 30, resulting in a non-zero tangent-tangent correlation (see S3 Fig), not consistent with a Gaussian regime. The presence of secondary structure elements is instead fully compatible with the observation of ideal Gaussian statistics for longer fragments. On the other hand, the upper bound m ≲ 90 is due to the lack of statistics caused by the constraint m < N^(2/3) for buried fragments combined with the available protein lengths in the dataset. Were longer proteins available, we would expect the Gaussian statistics to hold for even longer buried fragments. We inserted a paragraph in the Discussion section along the above lines.

Reviewer #2: The manuscript is well explained and interesting. I enjoyed reading it. A particular strength is that the limitations of the theoretical approach are described in detail. Some points for the authors to consider when revising the manuscript

We thank the reviewer for her/his positive evaluation and for her/his observations.

- The abstract states that "Gaussian statistics [...] holds for a wide range of fragment lengths [...]". However, the authors note many times in the manuscript that they found deviations from Gaussian statistics at short spatial scales. The sentence in the Abstract should be modified to reflect more fully the data presented in the manuscript.
We agree with the referee and we modified the Abstract accordingly.

- Uncertainties ("error bars") are not provided for some of the fitted quantities. For example b* = 3.67 Å (line 242).

We agree with the referee (and the editor) that error bars need to be provided. Accordingly, we updated Fig 1b and S4 Fig by showing the error bars associated with the maximum likelihood estimators b(m). Based on the Fisher information evaluated at b(m), standard deviations were estimated as (equation visible in the pdf file), where n(m) is the number of fragments in the dataset for a given m (see Table 1). In detail, the variance of b(m) is estimated as the inverse of the Fisher information, the latter being minus the second derivative of the log-likelihood function with respect to the parameter b. Moreover, we also provided the uncertainty for b*, estimated as the standard deviation of the 21 b(m) values (70 ≤ m ≤ 90) whose mean defines b*. For all coarse-graining levels the b* uncertainty turns out to be 0.01 Å. Such uncertainties were also represented in the updated Fig 1b and S4 Fig. A proper explanation of how the standard deviations for b(m) and b* are estimated was added to the Results section of the revised manuscript.

- How does the Kuhn length b* = 3.67 Å compare to the average distance between alpha carbons in proteins?

We thank the reviewer for her/his observation. The Kuhn length is in fact close to the average distance, 3.8 Å, found between consecutive CA atoms in protein native structures. We added a sentence in the Results section.

- It is stated in the Discussion that "[...] the use of an ideal chain reference state is well justified **only** for buried protein fragments," (I added * for emphasis). This statement is not backed by the data presented: the authors have examined only buried fragments and not non-buried/exposed fragments.

We thank the reviewer for her/his observation. We agree that we did not examine the behavior of non-buried/exposed fragments.
However, our statement can be backed by the following, admittedly indirect, argument. For fragment lengths m < N but of the same order as chain length N, that is well above the threshold N^(2/3) used here to identify buried fragments, one expects to observe a non Gaussian behaviour characterized by the thermal exponent ν = 1/3 typical of compact globules. In fact, we provide evidence of the “compact globule" scaling of gyration radius with protein length in S2 Fig. We inserted a sentence along the above lines in the Discussion section. - Variable R is first introduced in line 130, but not defined there. R is defined later in line 148. The referee is correct, we introduced the quantity R in a confusing way. First, we now correctly states (lines 116-118 in Dataset subsection) that the gyration radius scaling is used to filter the protein dataset. Second we now introduce R as the end-to-end distance in line 130 (Materials and Methods section) and line 206 (Discussion section). - Line 173: do the authors mean r.h.s. instead of l.h.s? Yes we mean r.h.s. Thanks for spotting this! Submitted filename: Response to Reviewers.pdf Click here for additional data file. 29 Oct 2021 Statistical potentials from the Gaussian scaling behaviour of chain fragments buried within protein globules PONE-D-21-21648R1 Dear Dr. Trovato, We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance. 
To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible, no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Jerome Baudry, Ph.D.
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the "Comments to the Author" section, enter your conflict of interest statement in the "Confidential to Editor" section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

6. Review Comments to the Author. Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

7 Jan 2022

PONE-D-21-21648R1

Statistical potentials from the Gaussian scaling behaviour of chain fragments buried within protein globules

Dear Dr. Trovato:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Dr. Jerome Baudry
Academic Editor
PLOS ONE
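As an illustration of the error-bar estimate described in the response to Reviewer #2 (variance of b(m) taken as the inverse of the Fisher information), the following minimal Python sketch may help. It rests on an assumption that is not confirmed by the text above: that end-to-end distances R of fragments of length m follow the standard three-dimensional Gaussian-chain distribution with ⟨R²⟩ = m b². Under that assumption only, the log-likelihood has second derivative -6n/b² at its maximum, so the standard deviation is b/√(6n). The expression actually used by the authors is available only in the PDF and is not reproduced here; the function names `kuhn_length_mle` and `sample_gaussian_distances` are hypothetical.

```python
import math
import random


def kuhn_length_mle(distances, m):
    """Return (b_hat, sigma_b) for end-to-end distances of fragments of length m.

    ASSUMPTION: each distance R is drawn from a 3D Gaussian chain,
    P(R) ~ R^2 * exp(-3 R^2 / (2 m b^2)), so that <R^2> = m * b^2.
    """
    n = len(distances)
    sum_sq = sum(r * r for r in distances)
    # Maximum of the log-likelihood: b^2 = sum(R^2) / (n * m).
    b_hat = math.sqrt(sum_sq / (n * m))
    # Fisher information = minus the second derivative of the
    # log-likelihood with respect to b, evaluated at b_hat: 6 n / b^2.
    fisher_information = 6.0 * n / b_hat ** 2
    # Variance estimated as the inverse of the Fisher information.
    sigma_b = 1.0 / math.sqrt(fisher_information)
    return b_hat, sigma_b


def sample_gaussian_distances(n, m, b, rng):
    """Draw n synthetic end-to-end distances with <R^2> = m * b^2."""
    component_sd = b * math.sqrt(m / 3.0)  # standard deviation per Cartesian component
    return [
        math.sqrt(sum(rng.gauss(0.0, component_sd) ** 2 for _ in range(3)))
        for _ in range(n)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data only; these are not fragments from the paper's dataset.
    distances = sample_gaussian_distances(n=5000, m=50, b=3.67, rng=rng)
    b_hat, sigma_b = kuhn_length_mle(distances, m=50)
    print(f"b(m) = {b_hat:.3f} +/- {sigma_b:.3f}")
```

The synthetic sampler is included only so that the estimator can be exercised without real data; with real fragments, `distances` would hold the measured end-to-end distances for a single value of m.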
