Literature DB >> 32436706

Pseudo-Improper-Dihedral Model for Intrinsically Disordered Proteins.

Łukasz Mioduszewski¹, Bartosz Różycki¹, Marek Cieplak¹.

Abstract

We present a new coarse-grained Cα-based protein model with a nonradial multibody pseudo-improper-dihedral potential that is transferable, time-independent, and suitable for molecular dynamics. It captures the nature of backbone and side-chain interactions between amino acid residues by adapting a simple improper dihedral term for a one-bead-per-residue model. It is parameterized for intrinsically disordered proteins and applicable to simulations of such proteins and their assemblies on millisecond time scales.

Entities: Chemical Disease Gene Species

Mesh：

Substances：

Year: 2020 PMID： 32436706 PMCID： PMC7588027 DOI： 10.1021/acs.jctc.0c00338

Source DB: PubMed Journal: J Chem Theory Comput ISSN： 1549-9618 Impact factor: 6.006

Introduction

Molecular dynamics simulations provide insights into the properties of biomolecular systems. They make use of empirical potentials[1−3] that depend on the length scale of the description. The atomic level descriptions[4−9] are substantially distinct from the coarse-grained (CG) residue-level descriptions in which a residue is represented by a single bead. In between, there are CG models in which several atoms are grouped into beads with still different sets of potentials.[10−14] The one-bead-per-residue CG protein models are especially useful when analyzing large systems at long time scales such as those occurring in the dynamics of virus capsids (assembly and indentation)[15−17] or protein stretching at near-experimental speeds.[18−20] The simplest versions of such a model are structure-based or Go-like,[21−24] meaning that all parameters in the potentials are derived from the experimentally determined native structure[25−28] and the solvent is implicit. There is no unique way to construct a Go-like model because its most important descriptor is the contact map, and there are various criteria to define it. In addition, various contact potentials and the backbone stiffness terms can be employed. Our benchmarking to the stretching experiments[24] indicates that the Lennard-Jones (LJ) potential between the effective beads combined with the native contacts determined through an atomic overlap criterion[29−31] can correctly recreate protein dynamics of stretching and, in addition, leads to proper folding. The criterion involves checking for an overlap between spherical spaces associated with heavy atoms in a pair of residues in the native state. Its presence introduces an attractive potential well, and its absence results in a soft repulsion between the beads. The structure-based approach clearly cannot be applied to the intrinsically disordered proteins (IDP)[32] because such proteins dynamically adopt significantly differing conformations and there is no dominant “native state”. Sampling their rich energy landscape may be challenging for all-atom models, especially in the case of simulating processes involving multiple chains, like aggregation.[33] Short all-atom simulations may still be used to parameterize CG models built solely for the purpose of simulating one specific system (like in the multiscale approaches[34−36]); however, our goal is to construct a CG potential that is transferable to many systems. We have argued[37] that the contact-based CG description for IDPs is still possible provided that the contacts are determined dynamically from the instantaneous shape of the backbone, as described by the locations of the Cα atoms, and are thus allowed to form and disappear in an adiabatic fashion. The contacts can effectively arise either from the side-chain–side-chain, side-chain–backbone, or backbone–backbone interactions. This model also includes electrostatic interactions as described by a Debye–Hückel (D-H) potential,[38] and it leads to a reasonable agreement with experimental and all-atom theoretical results pertaining to the average geometry of the conformations for a set of systems. It is also appealing computationally because it effectively involves only two-body interactions. We have already used this model to determine the phase diagram for aggregation of polyglutamines,[39] which involved simulating 1800 residues for over 1 ms. However, switching the contacts dynamically on and off violates the detailed balance and can lead to nonequilibrium stationary states, which have been studied, for example, in the context of active biomembranes.[40−42] One may hope that sufficiently slow, adiabatic switches may mask such glitches, but it is nevertheless desirable to construct a model without any time-dependent potentials. In our approach, we distinguish side-chain and backbone interactions using only the positions of the Cα atoms. Determining the type of interaction between two residues in the previous model required knowledge of the positions of six residues, but after the contact was quasi-adiabatically turned on, the forces were acting only between a pair of residues. Without this switching, we could either return to Monte Carlo sampling (where the idea was originally implemented[43,44]) or introduce a multibody term in the potential (such terms are crucial in reproducing the fine structure of residues that is lost in coarse-graining[45,46]). An example of such a term is the one used for dipole–dipole interactions that can describe hydrogen bonding in the protein backbone.[47] Here, we propose a new and empirically motivated molecular dynamics model in which the short-range interactions are represented by time-independent four-body potentials. The point of departure is an observation that the backbone stiffness energy involves computation of a four-body dihedral potential[48] that restricts the angle between two planes, each set by three residues. This potential can be used in an “improper” way to take into account the rigidity of the side chain in a two-bead-per-residue model. When two residues interact, this picture can be further simplified by removing the side-chain bead (in our one-bead-per-residue model, the beads are centered on the Cα atoms). We can still use the improper dihedral term by replacing the side-chain bead by the bead of the second residue that participates in the interaction and vice versa: the side-chain bead of the second residue in the interacting pair can be replaced by the bead representing the first residue. The planes defined by this procedure are shown in Figure . Thus, despite using the four-body terms, we still retain the pairwise nature of interactions. In our model, each contact between residues is described by two pseudo-improper-dihedral (PID) angles associated with each residue.

Figure 1

Idea of the PID angles. The interaction between residues i and j involves angles Ψ (defined by i – 1, i, i + 1, and j beads) and Ψ (defined by j – 1, j, j + 1, and i beads).

Idea of the PID angles. The interaction between residues i and j involves angles Ψ (defined by i – 1, i, i + 1, and j beads) and Ψ (defined by j – 1, j, j + 1, and i beads). We made a survey of the structures from the Protein Data Bank, which showed specific patterns made by the PID angles. Those patterns are clearly different for the interactions made by the backbone and the side-chain groups, which proves that we can distinguish these two cases in our approach. We show that our new model can successfully recreate the experimentally determined radii of gyration (Rg) for a set of 23 IDPs. Because replacing the quasi-adiabatic contacts with four-body PID potentials significantly changed the dynamics, we had to reparameterize all aspects of the model. In order to quantify which variant of the model after reparameterization agrees best with the experiment, we computed Pearson coefficients that show how close the simulation and experimental results are. The best variants surpass our previous model. We also compared energy distributions of the models, which proved that adiabatic switching caused discrepancies from the Boltzmann distribution. Those discrepancies were not present in the new model with the PID potential. However, the new model turned out to require significantly more computational resources.

Methods

Results of the PDB Survey

Virtually all proteins are made from the same set of 20 amino acids, and the geometry of interactions between the residues should be similar for the case of IDPs and structured proteins even if the relative occurrence of those interactions may be different (since, e.g., the IDPs contain much fewer hydrophobic residues). Therefore, we made a survey of 21,090 structured proteins from the CATH database[49] (the set of proteins with the sequence similarity not exceeding 40%: cath-dataset-nonredundant-S40.pdb) to determine which PID angles are favorable in the inter-residue interactions. We computed the values of PID angles for each pair of contacting residues from the database, where a contact between residues is defined through the overlap criterion. These heavy atoms may be a part of a backbone or a side chain. If a backbone atom from one residue overlaps with a side-chain atom from another residue, we call it a backbone–side-chain contact (bs). We analogously define side-chain–side-chain (ss) and backbone–backbone (bb) contacts. There may be many overlaps, so one residue pair can form more than one type of contact simultaneously. In order to associate a characteristic set of PID angles and distances with a unique type of contact, we derive subdistributions corresponding to situations in which the overlaps arise only in one class of atoms (e.g., only ss). Table shows how many overlap contacts of each type are in the database.

Table 1

Number of Overlap Contacts for All Proteins in the Used Database (7,974,804 in Total)

type of contact	bb	bs	ss
number of residue pairs with only this type of contact	624,699	953,782	1,870,746
number of residue pairs that include this type of contact	3,742,271	3,872,292	4,297,919

Figure shows PID angle distributions for the ss, bs, and bb cases. The distinction between contact types is based on the overlap criterion described above (green histograms) or on a subset of these contacts that fulfill directional criteria introduced in the previous model[37] (red histograms). Figure shows distributions of contacts that are only one type (e.g., only bb overlaps). Figure S2 in the Supporting Information is the same as Figure , but contacts can be of more than one type (differences are minor and refer only to the bs case). The Boltzmann inversion potential VB = – kBT ln (p(ΨPID)) made from these distributions was fitted by an analytical function described in the next section. We observe that side-chain and backbone contacts are associated with different sets of the PID angles. The two minima for the bb case correspond to the parallel or antiparallel β sheet or to the right- or left-handed α helix (for the i, i +3 contacts, only one minimum is present, which reflects right-handedness, see Figure S3 in the Supporting Information).

Figure 2

Distributions of the PID angles in the contacts from the PDB survey that are only of one type (green histograms). Local i, i + 3 and i, i + 4 contacts are excluded. Each contact has two angles. Distribution of the first (ΨPID) is on the top panels, and that of the second (ΨPID) is on the bottom panels. Subdistributions made from contacts that obey the directional criteria defined in ref (37) are shown as red histograms. The potential resulting from the Boltzmann inversion procedure (blue dots; unit of energy, ε ≈ 1.5 kcal/mol) was fitted to an analytical function (purple line). Figure shows two-dimensional distributions, where the PID angle ΨPID is on one axis, and the Cα–Cα distance r is on the other axis. Different side chains result in different distance distributions, as shown for GLN or ALA residues, but the PID angle distributions stay mostly the same (with the exception of special cases, PRO and GLY). We find that the backbone contacts correspond to smaller and better defined distances than the side-chain contacts, which result in a diffuse cloud for ΨPID ≈ 0 rad and r > 6 Å (side chain) and sharp peaks for ΨPID ≈ ±1 rad and r < 6 Å (backbone). Two-dimensional distributions of the PID angle and distance for contacts made by all 20 types of residues are available in Figure S4 in the Supporting Information, and one-dimensional distance distributions used to determine the equilibrium distances for ss interactions (rminss) are available in our previous article;[37] however, the values of rminss are reprinted in Table S1 in the Supporting Information.

Figure 3

Two-dimensional distributions of contacts, where the PID angle ΨPID is on one axis, and the Cα–Cα distance r is on the other axis. The i, i + 3 and i, i + 4 contacts are excluded. The distributions include each contact obtained in the PDB survey where at least one residue in the pair was of the given amino acid type (PRO, GLY, ALA, and GLN). The PID angle ΨPID for each contact is the one associated with this residue (if both residues were the same amino acid, then it corresponds to two counts with two different PID angles).

Implementation of the PID Potential

In our model, the interaction between two residues depends on their distance and the two PID angles they make. Therefore, we chose our PID potential to be a product of three terms: V(ψA, ψB, r) = λA(ψA)λB(ψB)ϕ(r), where ψA is the first PID angle in a pair, ψB is the second angle in the pair, and r is the Cα–Cα distance. As the first approximation, we decided to use the cosine function for λ and the LJ potential for , where rmin is the minimum, and εLJ is the depth (discussed later). Due to the broad character of the bs distribution (green histograms in Figure ), we take only the bb and ss contacts into account (see section 2.3 in the Supporting Information). This feature is distinct from our previous model in which the bs interactions were included.[37] For the bb and ss interactions, we have clearly defined peaks that can be fitted to the potential function (separately for PID angles and distances). Each peak has a different width and center, so the detailed form of the λ function is Each pair of the ss contacts has rminss corresponding to the minimum identified in our previous work.[37] Because for r < rmin the ϕ(r) potential becomes strongly repulsive, α parameters for ss and bb contacts must be chosen so that if λbb ≠ 0, then λss = 0 and vice versa. Because the bb PID angle distribution has two peaks, the bb potential has two terms corresponding to both of them (ψ0bb+ and ψ0bb–). Therefore The repulsive part of the LJ potential should always be present for small distances to prevent the residues from passing through one another (excluded volume effect). This is why the bb terms have a more complicated form: This formula ensures the presence of the excluded volume because, for r < rminbb, the LJ potential ϕbb is no longer multiplied by the λbb factors. For the ss contacts, Vss(ψA, ψB, r) = λss(ψA)λss(ψB)ϕss(r) and rminss depends on the types of amino acids in the pair in contact (see Section 2.4 about the details of the LJ potential). The total PID potential is V = Vss + Vbb. Fitting the function to the Boltzmann inversion potential based on the contact distributions of only one type (bb, bs, or ss) that fulfill directional criteria defined in ref (37) (red histograms in Figure ) resulted in the following parameters: αbb+ = 6.4, αbb– = 6.0, αss = 1.2, ψ0ss = – 0.23 rad, ψ0bb+ = 1.05 rad, and ψ0bb– = – 1.44 rad. Fitting to the distributions that do not have to fulfill directional criteria (green histograms in Figure ) resulted in a model that agreed poorly with the experiment (see section 3.2 of the Supporting Information). In order to improve numerical efficiency, the cosine function was replaced by its algebraic approximation, defined in section 1 of the Supporting Information. Values of rmin are given in ref (37) and Table S1 in the Supporting Information. In Section 3, we denote the pseudo-improper-dihedral potential by letter P and the old, quasi-adiabatic model by letter A.

Backbone Stiffness and the Thermostat

Our model has an implicit solvent, represented by the Langevin thermostat with the damping term, so the equation of motion for the ith residue is , where m is the average amino acid mass, r⃗ is the position of the residue, F⃗ is the force resulting from the potential, γ = 2m/τ is the damping coefficient, and Γ⃗ is the thermal white noise with the variance σ2 = 2γkBT. The time unit τ ≈ 1 ns was verified for a different model with the same equations of motion,[50] and even if for this new model τ is expected to be slightly different, it should still correspond to an overdamped case with diffusional (not ballistic) dynamics. The residues in our model are connected harmonically with the spring constant k = 100 Å–2·ε and equilibrium distance 3.8 Å (ε ≈ 1.5 kcal/mol is the energy unit[51]). The backbone stiffness potential in our model consists of a bond angle and (proper) dihedral terms. Its depth and form were obtained from a Boltzmann inversion potential based on a random coil library.[52] Its exact analytical form is described in ref (37). It is defined in kcal/mol, so it can be used to verify what is the effective room temperature. The results for polyproline, which cannot form side-chain–side-chain contacts (see the next section) and has high backbone stiffness, indicate[37] that simulations for the room temperature 0.38 ε/kB give the best agreement with experimental values for the polyproline end-to-end distance, so we set the temperature to 0.38 in the reduced units. Because the potential for backbone stiffness, as in the previous model, is based on a random coil library, it does not favor any secondary structure. In order to make structures like α helices or β turns possible, we allow attractive contacts between the ith and i + 3rd residues in the chains as these contacts correspond to hydrogen bonds between backbone atoms in an all-atom representation.[53] However, the nature of i, i + 4 contacts is different (see section 2.1 in the Supporting Information), so in the results, we tried models with i, i + 4 contacts (+ in superscript) and without them (− in superscript).

Depth and Form of the Lennard-Jones Potential

The energy unit ε = 110 pN·Å ≈ 1.5 kcal/mol is taken from the earlier models,[24,37] and although it corresponds to the strength of one contact in those models, it is not necessarily the case for the PID model. We keep εbb = ε because it is roughly equal to the energy of one hydrogen bond made by the protein backbone,[54] but the varied nature of ss contacts required parameterization. We checked values between 0 and 1 ε for the uniform εss potential (we denote this case as ME) and tried two matrices where εss depends on the pair of residues: the classic Miyazawa–Jernigan matrix based on PDB statistics from 1996[55] (denoted MJ) and the MDCG matrix derived from all-atom simulations[56] (denoted MD). We also tried scaling them by a factor from 0 to 1 (in the results, the factor is denoted in subscript). In all three cases (ME, MJ, and MD), εss = 0 for PRO and GLY residues. The distance distributions for some pairs of residues are very broad.[37] This is caused by the ability of longer side chains to deform and adapt different conformations. This ability is lost in any CG model in which each residue is represented by a spherical bead. To correct this, one may imagine a modified form of the LJ potential for ss contacts: We tested both this modified form (denoted by letter F, for flat) and the traditional form (denoted by letter L) of the LJ potential.

Electrostatics

Our previous model used a Debye–Hückel-screened electrostatic potential with electric permittivity ϵ = 4 Å/r depending on the Cα– Cα distance r following the approach of Tozzini et al.[38] (thus we denote this version of electrostatic interactions by letter T). However, this approach was designed for structured proteins, where the permittivity inside the hydrophobic core is significantly lower than in water. IDPs lack this core and are more solvent-exposed, so we tried a simpler term with ϵ = 80 (we denote this term by letter C). In both cases, the form of the electrostatic potential is , where s is the screening length that depends on the ionic strength. In our model, histidine is considered to be uncharged.

Results and Discussion

We tested our model on a benchmark of 23 IDPs whose radius of gyration was measured in SAXS (small-angle X-ray scattering) experiments[57−67] (their sequences, screening lengths used, and a Pappu diagram[68] are in section 3.1 of the Supporting Information). We considered over 200 variants of the model: with a pseudo-improper-dihedral potential (denoted by letter P) or the previous version with a quasi-adiabatic potential (letter A), with or without i, i + 4 attractive contacts (+ or – in superscript), with the standard (letter L) or flat (letter F) form of the LJ potential, different depths of this potential (we considered three residue–residue matrices: uniform, ME; Miyazawa–Jernigan, MJ; and all-atom based, MD; all scaled by a factor denoted in subscript), and with standard (letter C) or distance-dependent (letter T) electric permittivity. Thus, the name of each version is made from four symbols (the full legend is available in Table S4 in the Supporting Information), e.g., A+ L ME1 T is the model with quasi-adiabatic switching, attractive i, i + 4 contacts, LJ potential with uniform ε, and electrostatics used by Tozzini et al.[38] P– F MD0.1 C means a model with the PID potential, no i, i + 4 contacts, LJ potential with a flat region and depth depending on the identity of residues according to the MDCG matrix rescaled by 0.1, and classic D-H electrostatics. For each model, we computed its Pearson coefficient[24] defined as , where Rgexp is the radius of gyration from the experiment, Rgsim is from the simulation, and the sum is over each of N = 23 proteins. The full list of models with their Pearson coefficients P and χ2 values is in section 2 of the Supporting Information. Here, we show just the five most significant ones, starting from the original model[37] shown in panel (a) of Figure .

Figure 4

Comparison between the simulated and measured radius of gyration for 23 IDPs. Panels (a)–(f) show simulation results obtained within six different models. The model names are specified in the panels and defined in the text. The value of the Pearson coefficient is also indicated in each panel. Colors indicate the number of residues constituting a given IDP. Applying a pseudo-improper-dihedral potential to this model results in worse agreement with the experiment (panel (b)), so we reparameterized the model by changing five features described above. The end result (panel (c)) has much better agreement with the experiment. We checked that this is indeed the result of using the new form of the potential as applying the reparameterized features to the previous model (panels (d) and (e)) improves it only slightly. Panel (f) shows the results of the Gaussian chain model[69] calculated with the formula , where n is the number of residues, and b is the effective Kuhn length that may be treated as a fit parameter. By fitting with the least-squares method to the set of 23 IDPs, we obtained b = 6.7 Å. The Rg values for the set of 23 IDPs have been measured in SAXS experiments. We note that the direct result of a biomolecular SAXS experiment is a scattering profile that contains information not only about the value of Rg but also about the average shape and size of the IDP under study. It is possible to compute such a scattering profile from an ensemble of simulation structures and compare it with the experiment,[70] but this task involves fitting parameters and requires raw experimental data, which are not available for many proteins used in the testing set. On the other hand, the value of Rg is a single number that can be easily compared to simulation results. For these reasons, we decided to compare our results only with the Rg values. An example of a comparison with a SAXS intensity profile for protein 6AAA is shown in Figure S12 in the Supporting Information. Another way to validate our model is to perform a histogram test,[71] where we can check if our simulation is consistent with the Boltzmann distribution of energy: the state with energy E should occur with the probability p(E) = Ω(E) exp ( – E/kBT)/Q, where Ω(E) is the state density, and Q is the normalization factor. If we perform simulations using two different temperatures T1 and T2, then we can compute the following quantity: log(p1(E)/p2(E)) = log (Q1/Q2) + E · (1/kBT2 – 1/kBT1). This quantity should depend on the energy linearly with the coefficient (1/kBT2 – 1/kBT1). We can plot this dependence (treating log(Q1/Q2) as a fit parameter) for the new version (P– F MD0.1 C) and previous version (A+ L ME1 T) of the model.[37] Such a plot for proteins his5 (24 residues), NRG1 (75 residues), and p53 (93 residues) is shown in Figure . The data points obtained with the new model lie closer to the line with the coefficient (1/kBT2 – 1/kBT1).

Figure 5

Histogram test[71] for the new model (P– F MD0.1 C) and previous model (A+ L ME1 T) based on simulations of proteins his5, NRG1, and p53 at temperatures 0.35 and 0.4 ε/kB. Insets show the energy histograms binned every 1 ε and used to construct the quantity log(p1(E)/p2(E)) shown on the main plots as a function of energy. The new version of the model has significant advantages over the previous one: it better agrees with the experimental data and the Boltzmann distribution. However, even though the computational speed of both models scales almost linearly with the system size, the new model is usually at least five times slower (see Figure ).

Figure 6

Physical time of 1 ns-simulation run on a single core for the new model (P– F MD0.1 C, squares) and previous model (A+ L ME1 T, stars) as a function of the system size. The number of protein chains in the simulation is indicated by color. The system density of multichain simulations was set to 1 residue/nm3. Fitted lines have coefficients of 0.003 and 0.0005 s/residue, respectively, with a fit error in the order of 0.0001 s/residue. The new model is also harder to parallelize (due to multibody terms) and does not perform well for folded proteins: we tried folding small proteins (PDB access codes 1L2Y, 1UBQ, and 1ERY) with it but arrived with an RMSD of 6 Å or higher even for the smallest protein, 1L2Y (see section 4 of the Supporting Information). This is not surprising because IDPs have very weak inter-residue interactions: the top seven variants of our model considered in IDP parameterization had their residue–residue matrices multiplied by a factor smaller than 0.5 (see Figure ). It is interesting to note that two out of our top seven variants multiply the matrix by 0, meaning that there are no interactions with the exception of excluded volume, backbone stiffness, and electrostatics. A simple Gaussian chain model also works quite well for IDPs (panel (f) in Figure ). This proves that water is a good solvent for IDPs, and their inter-residue interactions affect chain dimensions in a minor way.[32] This fact was also used in a recent hierarchical approach for studying IDPs, where only local fragments are modeled in detail, and the whole chain is constructed from those fragments.[72]

Figure 7

Top seven variants of the PID model ranked by their Pearson coefficient.

Top seven variants of the PID model ranked by their Pearson coefficient. The top-ranked variants of the model may be further refined by fine-tuning the ss interactions. In future studies, our new CG model can be used to explore conformational dynamics of IDPs and their assemblies. It can also be adapted to study physical properties of flexible linkers in multidomain proteins.[73,74] In this case, each of the structured domains can be kept stable by a Go-model potential,[24] and each of the linkers (as well as their interactions with the domains) can be described by the PID potential.

65 in total

Pseudo-Improper-Dihedral Model for Intrinsically Disordered Proteins.

Introduction

Methods

Results of the PDB Survey

Implementation of the PID Potential

Backbone Stiffness and the Thermostat

Depth and Form of the Lennard-Jones Potential

Electrostatics

Results and Discussion

1. Computer simulation of protein folding.

2. Multiscale ensemble modeling of intrinsically disordered proteins: p53 N-terminal domain.

3. Adhesion of membranes with active stickers.

Review 4. Coarse-grain modelling of protein-protein interactions.

5. Structural analysis of poly-SUMO chain recognition by the RNF4-SIMs domain.

6. AWSEM-IDP: A Coarse-Grained Force Field for Intrinsically Disordered Proteins.

7. Sequence dependent aggregation of peptides and fibril formation.

8. Lattice model for rapidly folding protein-like heteropolymers.

9. CIDER: Resources to Analyze Sequence-Ensemble Relationships of Intrinsically Disordered Proteins.

10. Mechanical and assembly units of viral capsids identified via quasi-rigid domain decomposition.

1. Computational Models for the Study of Protein Aggregation.

Review 2. Unifying coarse-grained force fields for folded and disordered proteins.

3. Simulation of FUS Protein Condensates with an Adapted Coarse-Grained Model.

4. Global Structure of the Intrinsically Disordered Protein Tau Emerges from Its Local Structure.