Literature DB >> 26808351

Using Local States To Drive the Sampling of Global Conformations in Proteins.

Abstract

Conformational changes associated with protein function often occur beyond the time scale currently accessible to unbiased molecular dynamics (MD) simulations, so that different approaches have been developed to accelerate their sampling. Here we investigate how the knowledge of backbone conformations preferentially adopted by protein fragments, as contained in precalculated libraries known as structural alphabets (SA), can be used to explore the landscape of protein conformations in MD simulations. We find that (a) enhancing the sampling of native local states in both metadynamics and steered MD simulations allows the recovery of global folded states in small proteins; (b) folded states can still be recovered when the amount of information on the native local states is reduced by using a low-resolution version of the SA, where states are clustered into macrostates; and (c) sequences of SA states derived from collections of structural motifs can be used to sample alternative conformations of preselected protein regions. The present findings have potential impact on several applications, ranging from protein model refinement to protein folding and design.

Entities: Chemical Disease Species

Mesh：

Substances：
Proteins

Year: 2016 PMID： 26808351 PMCID： PMC5356493 DOI： 10.1021/acs.jctc.5b00992

Source DB: PubMed Journal: J Chem Theory Comput ISSN： 1549-9618 Impact factor: 6.006

Introduction

Conformational changes in proteins are often associated with protein–protein interactions,[1] ligand binding,[2] and posttranslational modifications.[3] They are at the basis of powerful mechanisms for functional regulation such as allostery,[4,5] and they can be fuelled by chemical reactions to produce large-scale mechanochemical motions in molecular motors.[6] The structural and energetic characterization of conformational transitions is therefore of central interest in understanding protein function. Computational approaches such as molecular dynamics (MD) simulations offer a powerful way to investigate such processes with atomic resolution. However, the conformational transitions usually found in biologically relevant systems are beyond the time scale currently accessible to equilibrium MD simulations. Different methods have been developed to overcome this problem by accelerating the sampling of “rare events”.[2,7−9] Most of the available methods are based on the use of collective variables (CVs),[7] which define the part of the space where the sampling is enhanced and usually represent the progress of the system along the process of interest. A range of CVs have been used in the literature to describe protein conformational changes, ranging from general descriptors of the protein global shape to local geometric parameters specific to the process under examination.[10−18] While the details of the global conformational landscape of a protein are uniquely determined by its amino acid sequence, it is now well-established that protein fragments tend to adopt recurring backbone conformations. In particular, coarse-grained and sequence-independent structural alphabets[19] (SAs) have been derived, which contain the minimal set of typical Cα conformations of protein fragments (local states or “letters”) needed to reconstruct experimental protein structures with a high level of accuracy. SAs have been exploited in the past for a number of applications, including local structure[20] and flexibility[21] prediction, sequence-based structural comparison,[22] structure mining,[23−25] fold classification[26] and recognition,[27] and de novo prediction.[28,29] Recently they have been shown to correctly represent also the dynamical properties of proteins.[30] Indeed, they have been used to analyze trajectories from MD simulations[31−33] and in particular to detect signal transmission in allosteric proteins.[32,34] The main goal of the present work is to investigate if the a priori knowledge provided by SAs on the local states preferentially adopted in protein structures can be exploited to accelerate the exploration of protein conformations in MD simulations. In the following we use a SA-based CV (CVSA) to guide the sampling of fragment conformations in small proteins toward predefined local states in metadynamics and steered MD (SMD) simulations. We first address the question of whether enhancing the sampling of native local structures can accelerate the sampling of global native structures. We show that folded structures of small proteins can be reproducibly and efficiently recovered in short simulations using the SA letters that best represent the local native structures. We then reduce the extent of the information on the local native structure provided a priori by using a simplified or reduced version of the SA (rSA), where local states are replaced by macrostates defining fragment shapes instead of detailed structures. We show that in all cases the missing information required to get the global folded state is recovered by the MD sampling, which allows the fragments to adapt to their specific environment by adjusting their conformation and relative orientation. Finally, we investigate possible ways to produce synthetic SA strings to guide the sampling when no information on the desired final state is available. In particular, we show that libraries of SA strings can be derived from databases of structural motifs and can be used to sample alternative conformations of parts of a protein. The resulting conformations can then be ranked a posteriori with a scoring function to identify the most native-like structures. The present findings indicate that MD simulations and knowledge-based SAs can be effectively combined to enhance the exploration of the conformational space of proteins. The proposed CVSA has a wide range of applications, ranging from protein model refinement to protein folding and design.

Results

The results presented in this work are derived using a new CV based on the 25-letter SA M32K25[30] (CVSA). To define a CVSA, the structure of the protein is first partitioned into four-residue fragments of Cα atoms. The CVSA is then calculated as a sum of single-fragment terms (eq in Methods), each consisting of a switching function related to the deviation of the fragment from a predefined reference letter extracted from the SA (Figure A). The overall CVSA measures the number of fragments matching the structure of their corresponding reference SA letters. The fragments used to define a CVSA can overlap, and they can cover the whole protein or only selected regions.

Figure 1

Schematic description of the CVSA and its applications. (A) A plot of the switching function used in the definition of the CVSA is reported, showing the dependence of the function on the Cα RMSD (ρ) of a fragment (gray to dark green licorice) from a reference SA state (green). Using the CVSA, the sampling of local states can be biased for all fragments in a protein (B) or for fragments in selected regions (C).

In the following we investigate if, when combined with an enhanced sampling technique, the CVSA can be used to guide the folding of the entire protein to its native state (Figure B) and to sample alternative conformations in selected regions (Figure C). Schematic description of the CVSA and its applications. (A) A plot of the switching function used in the definition of the CVSA is reported, showing the dependence of the function on the Cα RMSD (ρ) of a fragment (gray to dark green licorice) from a reference SA state (green). Using the CVSA, the sampling of local states can be biased for all fragments in a protein (B) or for fragments in selected regions (C).

Folding 3D Structures from SA Strings

In this section we apply the CVSA to fold two small fast-folders, the β-hairpin of the GB1 protein, and the Trp-cage mini-protein. These are small proteins that are known to adopt a stable fold in solution with relatively short folding times (6 μs for the GB1 β-hairpin[35] and 4 μs for the Trp-cage[36]), so that they are often used as model systems to study protein folding. As expected from their folding rates, equilibrium MD simulations in the nanosecond time scale are usually not suitable to observe folding events for these two proteins,[37] and either enhanced sampling[11,38,39] or long (∼10–100 μs) MD simulations[40,41] have been used in the past. In this section we use the CVSA combined with the enhanced sampling method metadynamics[42] (Methods). The reference SA letters are determined from the local-fit encoding (Methods) of the experimental native structure, where each SA letter is the best match of a fragment in its native conformation (Table ). CVSA contains all the overlapping fragments in the protein. In the local-fit mode used to assign the letters, no information on adjacent fragments is used to determine the letter of a given fragment. The sampling along the CVSA is then biased in metadynamics simulations (Methods), so that structures with CVSA values in the whole accessible range [0, CVSAmax] are explored, where CVSAmax is equal to the number of fragments included in the CVSA.

Table 1

M32K25 SA and rSA Strings Used for the Folding and Refinement Simulations

Fragments covering the central loop residues are indicated in bold.

The fragments include residues 18–31.

The fragments include residues 19–30.

The fragments include residues 38–45.

The fragments include residues 39–44.

Fragments covering the central loop residues are indicated in bold. The fragments include residues 18–31. The fragments include residues 19–30. The fragments include residues 38–45. The fragments include residues 39–44. The CVSA defined in this way contains information on the local structure of the native state but not on the global folded structure. Indeed, using the CVSA differs from providing the coordinates of the folded conformation as reference structure in that (a) the native conformations of single fragments are used, with no information on the relative arrangement of different fragments, and (b) discrete SA states are used to represent fragment conformations, so that the CVSA does not contain the actual native fragment structures, but the SA letters closest to them. Folded β-hairpin conformations, i.e., conformations with Cα RMSD from the experimental native structure (RMSDnat) ≤ 2 Å, were sampled multiple times within a 100 ns metadynamics simulation (Figure A) starting from an extended conformation, with a minimum RMSDnat of 0.4 Å. Considering that the CVSA was defined using reference SA letters from the native structure, folded conformations are expected to have high CVSA values. Indeed, almost all low-RMSDnat structures belong to the high-CVSA ensemble, defined as composed of structures with CVSA ≥ CVSAmax – 2 (blue points in Figure A). Moreover, after filtering the trajectory for high-CVSA structures, a significant enrichment was found in folded conformations, going from a percentage of folded structures of 6% for the overall trajectory to 39% for the CVSA-filtered one (Table ). Metadynamics simulations of the Trp-cage showed a similar behavior. Folded structures were sampled with RMSDnat as low as 0.5 Å (Figure C), and high-CVSA ensembles were composed for the 27% of native-like structures (Table ). As expected, for both proteins no folded conformations were found in control unbiased MD simulations of the same length (Table ).

Figure 2

Time evolution of the Cα RMSD from the experimental native structure of the GB1 β-hairpin (upper panels) and the Trp-cage mini-protein (lower panels) during metadynamics simulations. The CVSA was defined using the SA (left panels) and rSA (right panels) encodings of the experimental structures. High-CVSA points are colored in blue. For each panel, the experimental structures (white cartoon) are superimposed to the best matching high-CVSA structure (blue).

Table 2

Fraction of Native-like Structures in Metadynamics and SMD Simulations of the GB1 β-Hairpin and the Trp-Cage Mini-protein

	p^Nat_M (%)d	p^Nat_FiltCVsa (%)e	p^Nat_SMD (%)f
GB1 β-hairpin (SA)a	5.9	39.4
GB1 β-hairpin (rSA)b	2.0	37.5	36.0
GB1 β-hairpin (unbiased MD)c	0.0
Trp-cage (SA)a	3.3	26.5
Trp-cage (rSA)b	0.2	0.4	33.3
Trp-cage (unbiased MD)c	0.0

CVSA-biased simulation with SA encoding.

CVSA-biased simulation with rSA encoding.

Unbiased MD simulation.

Percentage of structures in the whole metadynamics trajectory with RMSDnat ≤ 2 Å.

Percentage of structures in the metadynamics high-CVSA ensemble with RMSDnat ≤ 2 Å.

Percentage of productive SMD trajectories (final structure with RMSDnat ≤ 2 Å).

CVSA-biased simulation with SA encoding. CVSA-biased simulation with rSA encoding. Unbiased MD simulation. Percentage of structures in the whole metadynamics trajectory with RMSDnat ≤ 2 Å. Percentage of structures in the metadynamics high-CVSA ensemble with RMSDnat ≤ 2 Å. Percentage of productive SMD trajectories (final structure with RMSDnat ≤ 2 Å). Time evolution of the Cα RMSD from the experimental native structure of the GB1 β-hairpin (upper panels) and the Trp-cage mini-protein (lower panels) during metadynamics simulations. The CVSA was defined using the SA (left panels) and rSA (right panels) encodings of the experimental structures. High-CVSA points are colored in blue. For each panel, the experimental structures (white cartoon) are superimposed to the best matching high-CVSA structure (blue). Non-native structures with high-CVSA values were also sampled (blue points above the dashed lines in Figure A,C). In these conformations the single fragments match their reference SA letter, but their global arrangement differs from the native one. Hairpin-like conformations were found where terminal segments are in a β-strand conformation, but they are too distant to form a β-sheet, resulting in a percentage of native contacts as low as 68% (Supporting Information Figure S1A). This is consistent with the fact that the CVSA does not contain a bias toward nonlocal native contacts, so that non-native global arrangements of native local states can be sampled during the simulation. In summary, the present results show that providing local structural information can enhance the folding toward the global native structure of small proteins. Using the CVSA promotes the sampling of structures with the desired sequence of discrete local states (or strings of letters). During the simulation, different relative arrangements of these fragments are explored, with a significant fraction of folded global structures. In the following section we further reduce the information needed a priori to build the CVSA by introducing a simplified version of the structural alphabet.

Generating SA Strings: Simplified Alphabet

The extent to which the target local structures are known before the simulation can vary significantly, and in some cases it might be limited to a prediction of the shape of the fragment. It is thus convenient to have a reduced set of SA letters representing more general fragment states (macrostates). To this end, a cluster analysis was performed on the 25 letters of the M32K25 SA (Methods). The resulting six clusters (Figure ), named after the letter of their representative, describe fragment states that differ mainly for the pseudotorsion around the two central Cα atoms (Supporting Information Table S1).

Figure 3

Licorice representation of the six clusters of M32K25 letters. The cluster representatives forming the reduced alphabet rSA are labeled and shown in darker colors.

Licorice representation of the six clusters of M32K25 letters. The cluster representatives forming the reduced alphabet rSA are labeled and shown in darker colors. SA encodings can be translated into rSA encodings by replacing each letter with the corresponding cluster representative (Table S1). For example, the GB1 β-hairpin SA string “DBAALVWOQABAB” becomes “AAAAAUUKRAAAA” (Table ), where the turn-encoding letters are easily recognizable as being sandwiched between two stretches of strand-encoding “A” letters. When going from the SA to the rSA encoding, the maximum RMSD between each fragment experimental structure and its corresponding letter for the two proteins increases from 0.4 to 1 Å, thus reducing the extent of information on the fragment target states contained in the rSA-based CVSA. Nevertheless, metadynamics simulations showed that folded conformations can still be recovered for both proteins (Figure , panels B and D), with overall RMSDnat values as low as 0.5 Å (β-hairpin) and 1.8 Å (Trp-cage). Filtering the trajectories for high-CVSA structures produced an enrichment in folded structures for the β-hairpin from 2 to 38% (Table ). A less extensive sampling of high-CVSA structures and hence of folded states was instead observed for the Trp-cage (Table ). The performance of the rSA-based CVSA was also tested by using a different enhanced sampling technique, the steered MD[43] (SMD; see Methods), where the CVSA was steered from the starting value to its maximum value CVSAmax in a predetermined amount of time and in multiple replicas. The folded state was recovered at the end of the simulation in about one-third of the runs (productive runs) for both molecules (Figure and Table S2), confirming that the rSA-based CVSA contains sufficient information to reach the folded state for both proteins.

Figure 4

Productive (blue) and nonproductive (orange) SMD runs of the GB1 β-hairpin (A) and the Trp-cage mini-protein (B). Final SMD structures (colored cartoon) are superimposed onto the experimental structures (white cartoon). Low-work relaxed structures (Supporting Information) are reported. A SMD run is defined productive if RMSDnat ≤ 2 Å at the end of the simulation. The average value of RMSDnat is also reported for productive runs.

Since the steering bias was applied on the overall CVSA, the single fragments were free to reach the reference structure at different times (Figure S2). In the productive runs, the folding mechanism mainly resembled a zipper mechanism,[37] where the contacts between the β-strands started to form from the turn toward the termini (Figure S2). Productive (blue) and nonproductive (orange) SMD runs of the GB1 β-hairpin (A) and the Trp-cage mini-protein (B). Final SMD structures (colored cartoon) are superimposed onto the experimental structures (white cartoon). Low-work relaxed structures (Supporting Information) are reported. A SMD run is defined productive if RMSDnat ≤ 2 Å at the end of the simulation. The average value of RMSDnat is also reported for productive runs. In the nonproductive trajectories, the final structures reached the CVSAmax maximum value, but they were trapped in non-native conformations where the N- and C-terminal segments were differently aligned (Figure , orange structural ensembles) and a smaller fraction of native contacts was recovered (Figure S3, orange lines) compared to the productive trajectories. The relaxation to the native state was hindered by the early formation of non-native contacts (Figure S4). Interestingly, we observed that part of the nonproductive SMD trajectories can be filtered out without using information on the target native state by calculating the work needed to drive the overall transition for each trajectory. Indeed, the trajectories associated with low-work transformations (Supporting Information) were found to have a higher success rate in producing a correctly folded structure. The present results show that the reduced representation of the SA can be effectively used to sample folded conformations of small proteins within both the metadynamics and the SMD frameworks. The generation of the rSA fragment macrostates is a first step toward the definition of artificial sequences of fragment states, which do not require a priori knowledge of the final structure. In the next section we show that recurring motives can be identified in the rSA encoding of experimental loop structures, indicating that libraries of predefined strings can be used to define the CVSA reference states.

Generating SA Strings: Recurrence of SA Motifs in Loops

Loops, defined as nonrepetitive structural units connecting regular secondary structures, have been shown to adopt recurring conformations by different classification schemes.[19] The existence of supersecondary structural motifs has been exploited in the past for both function and structure prediction.[19,44] In this section, we analyze the loops contained in the ArchDB database[44] (Supporting Information) to show that a large number of loop structures can be encoded by a small number of recurring rSA strings or rSA motifs. In the following we focus on β-hairpins of length 4 (ArchDB.BN4.100, Supporting Information), which are one of the most populated loop types in ArchDB,[44] but the analysis can be easily extended to any loop type. For each β-hairpin, the structure of the loop plus the first residue from each flanking β-strand was encoded into three-letter strings with the rSA. Of the 56 possible combinations of six letters in strings of length 3, only 26 rSA motifs were observed in the 3314 loop structures of the ArchDB.BN4.100 data set (Table S3). Interestingly, 97% of all the structures fall in the first five rSA motifs (Figure ). All together, these motifs cover 20 of the original 21 ArchDB subclasses (Supporting Information). The structures of the five rSA motifs are significantly different from each other, with an average intermotif RMSD of 2.2 Å (Table S4).

Figure 5

Structures of the most populated rSA motifs for β-hairpin loops of length 4 in the ArchDB.BN4.100 database. The cluster representatives of the loop structures belonging to each motif are shown, with the most populated cluster highlighted as a thicker tube. The existence of a small number of recurring rSA motifs can be exploited to build CVSA variables that are not tailored to a specific experimental structure as was done in the previous sections. As we will see in the next section, such CVSA can be used to dictate the general shape of a loop, while the specific conformation is determined by the amino acid sequence and/or the environment of the loop.

Folding Different Amino Acid Sequences Using the Same rSA String

In this section we show that simulations based on the same rSA string can correctly reproduce differences in loops that, while having a similar shape, present structural differences due to their different amino acid composition. We identified two groups of β-hairpin loops with similar shape (encoded by the rSA string “KUN”) but with differences in the specific structure that correlate with differences in their amino acid sequence (Supporting Information). A structural superimposition (Figure A) shows that KUN loops are clustered in two groups (KUNg1 and KUNg2), with KUNg2 structures (green) featuring a more bent conformation around the residue in position p4 compared to KUNg1 (blue), a different orientation of the p4 CO bond (licorice), and significantly different Ramachandran plots at the p4 and p5 positions (arrows in Figure D). A sequence alignment shows that all KUNg2 loops have a Gly in position p5 (Figure S5A), which allows for the larger bending of the loop.

Figure 6

Comparison of KUNg1 and KUNg2 β-hairpins. (A) Superimposition of experimental structures of KUNg1 (blue) and KUNg2 (green) β-hairpins. (B) Superimposition of experimental (dark colors) and simulated (light colors) structures of KUNg1 (blue) and KUNg2 (green) β-hairpins. The simulated structures were extracted from the high-CVSA metadynamics ensembles with a cluster analysis. The backbone CO bonds of the residues in positions p4 and p5 are represented as licorice. (C) His3-Gly5 hydrogen-bonded interaction in the KUNg2 metadynamics structure. (D) Ramachandran plots of experimental (large numbered circles), metadynamics (diamonds), and SMD (small circles) folded structures for KUNg1 (left panel) and KUNg2 (right panel). Two sequences representative of the two groups were selected. CVSA-biased MD simulations were run to fold each amino acid sequence into its corresponding β-hairpin starting from an extended conformation. The same rSA encoding was used for the two β-hairpins (Table ). In both cases, the native state was recovered in metadynamics and SMD simulations (Table S5), with a percentage of native-like structures ranging from 4–10% (metadynamics) to 36–72% (SMD). Remarkably, even if the same rSA encoding was used for both amino acid sequences, the simulated folded structures reproduce the differences between the experimental ones. Indeed, KUNg2 simulated structures (Figure B, light green) are more bent at p4 than KUNg1 (light blue) and the simulated conformations of each amino acid sequence are more similar to the experimental structure with the same sequence than to the other one (Table S6). Correspondingly, the Ramachandran plots of simulated structures show differences in the distribution of φ and ψ angles that parallel the differences between the experimental structures, in particular for residues in positions p4 and p5 (arrows in Figure D). A possible explanation of the conformation adopted by KUNg2 at position p4 can be found by looking at the nearby side chains. Indeed, the bent loop arrangement in KUNg2 allows the Gly5 backbone NH group to be on the same side of the His3 side chain, favoring the formation of a His3-Gly5 hydrogen bond both in metadynamics (Figure C) and SMD (Figure S5B) simulations. The present results indicate that the CVSA based on rSA motifs contains sufficient information to guide the folding toward the correct general shape of the loop backbone while at the same time allowing for adjustments to obtain sequence-specific structures. Side chains are not included in the definition of CVSA, but they are explicitly taken into account during the simulation by the all-atom force field and they can modulate the loop conformation via direct side chain–backbone interactions.

Real Life Application: Protein Model Refinement

In this section we use the CVSA and the library of loop motifs to refine a protein model. Differently from the previous sections, only part of the structure is considered in the definition of the CVSA, while the rest is left unbiased during the simulation. The protein to be refined was selected from the targets of the Refinement category of the CASP8 exercise (Supporting Information). The best model generated in the normal prediction exercise failed to correctly describe two regions L1 (residues 19–30) and L2 (residues 39–44) (Figure A), indicated by CASP organizers as problematic and showing deviations from the experimental structure > 4.0 Å (Table ).

Figure 7

Table 3

Refinement of the CASP8 Target TR464

	p^Nat (%)b	p^Nat_FiltCVsa (%)c	RMSD_nat (Å)d	RMSD_nat(L1) (Å)e	RMSD_nat(L2) (Å)e
Best Model			2.94	4.00 (3.22)	4.35 (2.02)
Best Refined Model			2.23	1.57 (1.38)	2.38 (0.49)
Metadyn (Target)a	2.1	21.4	2.00	0.53 (0.27)	1.50 (0.30)
unbiased MD	0.0		2.32	2.86 (2.45)	2.48 (1.64)

CVSA metadynamics simulation run using the Target rSA string.

Percentage of structures in the whole trajectory with both RMSDnat(L1) and RMSDnat(L2) ≤ 2 Å.

Percentage of structures in the high-CVSA ensemble with both RMSDnat(L1) and RMSDnat(L2) ≤ 2 Å.

Cα RMSD from the native structure calculated over the whole structure. Minimum values observed during the simulation are reported for metadynamics and unbiased MD.

Cα RMSD from the native structure calculated over only L1 and L2. Minimum values observed during the simulation are reported for metadynamics and unbiased MD. Values in parentheses are calculated in local fit (only the L1 or L2 structures are used for the best fit superposition instead of the whole structure).

Refinement of the TR464 target from CASP8. (A) Superimposition of the L1 (residues 19–30) and L2 (residues 39–44) regions for the starting unrefined model (pink) and the target experimental structure (green). The rest of the structures is represented as white cartoon. Residues E5 and R42 are highlighted as licorice. A E5-R42 salt bridge interaction is formed only in the experimental structure. (B) Superimposition of the L1 and L2 regions in the target structure (green) and in a representative structure from metadynamics (blue). The E5-R42 interaction is recovered in the metadynamics simulation (blue licorice). (C) Superimposition of the L1 and L2 regions in final structures from productive SMD runs (blue cartoon). The E5-R42 salt bridge is recovered in all of the productive SMD runs. (D) Comparison of the structures with the best Rosetta score (blue) from the high-CVSA metadynamics ensembles obtained using the BlindL2a (left), BlindL2b (center), and BlindL2c (right) encodings for L2. The Rosetta score is reported together with the L2 RMSDnat calculated in local fit. CVSA metadynamics simulation run using the Target rSA string. Percentage of structures in the whole trajectory with both RMSDnat(L1) and RMSDnat(L2) ≤ 2 Å. Percentage of structures in the high-CVSA ensemble with both RMSDnat(L1) and RMSDnat(L2) ≤ 2 Å. Cα RMSD from the native structure calculated over the whole structure. Minimum values observed during the simulation are reported for metadynamics and unbiased MD. Cα RMSD from the native structure calculated over only L1 and L2. Minimum values observed during the simulation are reported for metadynamics and unbiased MD. Values in parentheses are calculated in local fit (only the L1 or L2 structures are used for the best fit superposition instead of the whole structure). The rSA encoding sequences of L1 and L2 in the best model and in the target experimental structure (Table ) show that L1 changes from an α–β-hairpin with the central turn in a “UUK” conformation to a “KUK” β-hairpin, while the central turn in L2 changes from “UUK” to “UKR”. Interestingly, these three turn encodings are among the top five motifs for length-4 loops in β-hairpins described before (Table S3). Enhanced sampling simulations were first run with a CVSA based on the rSA target encodings of L1 and L2 (Target rSA in Table ) to test if local information can be used to guide the refinement. A single CVSA coordinate was used containing the fragments of both regions. The distance between the structures sampled during the simulations and the target conformation was measured by calculating the Cα RMSD for the overall structure (RMSDnat) and separately for the two regions L1 (RMSDnat(L1)) and L2 (RMSDnat(L2)). Structures significantly closer to the target structure than the starting model were sampled during 100 ns long metadynamics simulations. Both RMSDnat(L1) and RMSDnat(L2) were reduced to ≤2 Å in ∼21% of the CVSA-filtered trajectory (pNatFiltCVSA in Table ), with deviations for the single regions as low as 0.5 (L1) and 1.5 Å (L2) for the best metadynamics structure (Figure B). Similarly, SMD simulations with rSA encoding improved the starting model in 18% of the runs, with RMSDnat values as low as 1.7 Å (Figure C and Table S7). Control unbiased simulations failed to significantly improve the starting model. Indeed, no structures were found where both L1 and L2 were in a native-like conformation (unbiased MD pNat in Table ). The analysis of the time evolution of RMSDnat for metadynamics (Figure S6B, upper panel) and SMD runs (Figure S7A,B) indicates that L2 is the last loop to adopt the native conformation, suggesting that its rearrangement is the difficult step in the model refinement. To test the performance of CVSA in a real life situation where the target state is not known, multiple metadynamics simulations were run using “blind” rSA encodings for the L2 loop (BlindL2a–c in Table ). L2 encodings were chosen among the top five motifs for length 4 loops in β-hairpins identified in the previous section (Figure ). The BlindL2a encoding “UKR” coincides with the native Target rSA encoding used in the previous calculations. The blind trajectories were filtered for high CVSA values (CVSA ≥ CVSAmax – 2), and the resulting structures were rescored using the Rosetta scoring function[45,46] (Methods). The best scoring structure was found in the BlindL2a simulation (Table S8), with a RMSDnat(L2) of 0.7 Å when using only L2 for the best fit superposition (local fit or LF) and of 3.0 Å when calculated using the whole protein structure (global fit or GF). Indeed, the L2 conformation in the best CVSA-refined structure (blue in Figure D, left panel), while featuring a residual translational displacement from the experimental structure (green), has a very good match to the experimental shape. Remarkably, 13 of the top 20 scoring structures are from the BlindL2a simulation (gray rows in Table S8). The best structures from the other two blind simulations differed significantly from the experimental L2 conformation (BlindL2b and BlindL2c in Figure D), with RMSDnat(L2) values of 2.2 (BlindL2b) and 1.9 Å (BlindL2c) in the local fit. Interestingly, they have also poorer scores according to Rosetta (Table S8). Thus, rescoring the structures with Rosetta allows BlindL2a structures to be identified as the closest to the native state without using information on the experimental structure. To summarize this section, we showed how the CVSA coupled with libraries of rSA loop motifs can be effectively used to sample multiple alternative loop conformations and, when combined with a knowledge-based rescoring potential, to refine protein models.

Discussion

The occurrence of recurring local structures in proteins has inspired several approaches to structure prediction and design. Large libraries of sequence-dependent fragments have been successfully employed in fragment-assembly strategies such as Rosetta.[45−47] Alternatively, small sets of coarse-grained sequence-independent fragments were used as structural alphabets,[30,48−50] often coupled with machine learning predictors.[20] Recently, it was also shown that local structural changes observed in MD conformational ensembles can be analyzed in terms of changes of SA letters without loss of information. This was particularly true for the M32K25 SA, which has been derived from conformational attractors, i.e., regions in the fragment conformational space that are highly populated by experimental structures.[30] The use of this SA turned out to be particularly effective in investigating allosteric proteins.[32] All these data indicate that SAs can provide a compact and reliable representation of the most populated conformational states of protein fragments, recapitulating the structural features of a large number of experimental structures.[30] The central idea of the present work is to exploit the experimental information distilled in a SA to accelerate the exploration of protein conformations in MD simulations. We thus introduced a SA-based collective variable (CVSA) to control the match between simulated and SA fragment conformations. Combining the CVSA with either metadynamics or steered MD techniques, it was possible to bias the sampling of fragment conformations toward experimentally preferred local states. While SAs have been used in the past for postprocessing MD trajectories,[31−33] this is the first time to our knowledge that they are used to enhance the sampling during an MD simulation. The use of the CVSA allows the introduction of knowledge-based elements in the simulation without loss of generality. Since the CVSA is based on local states, no assumption is required on nonlocal contacts and thus on the relative arrangement of the fragments. Moreover, the SA fragments contain only Cα atoms, so that they can be used with any amino acid sequence and no a priori information is needed on side chain structures. CVs based on secondary structures have been used in the past either to accelerate the folding of small proteins to their native states[13] or to explore the space of their accessible folds.[51] Performances comparable to the CVSA were obtained when folding the GB1 β-hairpin with metadynamics simulations.[13] However, these CVs are based on canonical structures of blocks of secondary structure elements, and for β structures they contain pairs of β-strands.[13] The CVSA differ from these in that (a) it does not require nonlocal information on specific arrangements of secondary structure elements and (b) it can be used to describe regions with irregular structure. The performance of the CVSA was first tested on the folding of peptides and mini-proteins. In all cases, conformations with a CVSA value equal or close to CVSAmax were sampled during metadynamics simulations starting from unfolded conformations. In these high-CVSA ensembles, all the fragments were in their target SA state or close to it, indicating that the structures had a local native-like conformation. Remarkably, a significant portion of each ensemble (27–40%) had also a global native-like conformation. In the SMD runs, the CVSA was explicitly steered to its maximum value, so that high-CVSA states were ensured to be sampled in a fixed amount of time. Similarly to metadynamics, the ensemble of high-CVSA conformations composed by the final SMD structures had a significant proportion of native-like conformations, with percentages up to 72%. Folded structures were thus observed when the fragments, in addition to being in the correct local state, had also a native-like arrangement, with native-like interfragment contacts. No information was directly provided on nonlocal native contacts, but folded structures were successfully formed during the relatively short simulations performed here. The bias on the CVSA ensured an increased probability of observing native-like local structures, which in turn increased the probability of finding them in a global native-like arrangement compared to an unbiased simulation. While the local states used to perform these calculations were derived from the target global structures, experimental and predicted information on the local structure of a protein may be available even in the absence of its global structure. For example, the experimental secondary structure composition can be derived from CD spectra,[52] while the sequence of secondary structure elements can be usually predicted with a high level of accuracy.[53−55] For irregular regions, libraries of structural motifs are available,[44] while protein regions involved in conformational changes can be identified by hydrogen/deuterium exchange mass spectrometry.[56] By construction, high CVSA values can be used to discriminate conformations with a local native structure among those generated during the simulation, but these are not necessarily globally folded. To fold larger and more complex systems than the ones considered in this work in a comparable amount of time, additional information on interfragment contacts would need to be provided. This information could be derived from inter-residue contact prediction, for example using recently proposed methods based on coevolution.[57] On the other end, when no contact information is provided a priori, the possibility to sample multiple global arrangements compatible with the same sequence of local states could be exploited for protein design methods. Indeed, simulations using the same reference SA string with different amino acid sequences would show how the spectrum of different interfragment arrangements is modulated by the primary structure of the protein. The form chosen for the CVSA (eq ) allows for some degree of structural variability also in the local structure of high-CVSA ensembles. Indeed, the conformation of a fragment is not required to exactly match its reference SA letter to contribute significantly to the CVSA, but small adjustments in the local structure are possible if energetically favored. This behavior is regulated by the ρ0 parameter (eq ), which defines the tolerance on the fragment deviation from the reference letter in the switching function. In most of the calculations, a value of 0.6 Å was used, which is comparable to the cluster radius in the macrostates of the reduced alphabet (rSA). When using a CVSA with the rSA encoding, the requirement for the fragment is thus to match a macrostate, with freedom to adapt to any of the SA states that compose it. Coupling the MD sampling with the reduced version of the alphabet rSA instead of the full SA has the advantage that less information on the system needs to be provided before the simulation. Indeed, even when using the same reference rSA string for different amino acid sequences, the structural differences of the experimental structures are still recovered during the MD simulation. While the rSA-based CVSA provides information on the sequence of local macrostates, during the simulation the fragments can adopt the states that best match their specific chemical environment. The reduced complexity of rSA strings can be exploited for the generation of guess reference strings for the CVSA without detailed knowledge of the desired final structure. In particular, secondary structure predictors[53−55] can be used for regular structures, while loop encodings can be extracted from loop databases. Indeed, we showed that a large part of the loops in experimental β-hairpin structures can be described by a reduced number of rSA strings. This analysis can be easily extended to other types of loops,[44] generating a comprehensive library of rSA motifs. The use of rSA sequence libraries was exemplified with the refinement of a protein model. Alternative rSA sequences, generated from the most populated motifs for β-hairpins, were used to refine the L2 loop in the CASP8 target TR464. No information on the final state was used to define the CVSA. L2 structures closer to the native state than the starting model were identified by rescoring with Rosetta. In particular, the ensemble generated with the native rSA motif could be identified as the best one because of the higher proportion of its conformations in the top ranking structures after rescoring. The results discussed above suggest that the combination of CVSA-based MD sampling with rSA libraries is a promising approach for the development of protein refinement methods. In particular, the increasing availability of parallel computing resources could be exploited to test a large number of rSA strings, which might be needed to be considered when multiple regions are involved in the refinement. The development of refinement methods that can systematically improve the quality of protein models has proven to be particularly challenging so far, and it is still an area of ongoing work. Structure prediction relying on knowledge-based approaches seems to have reached a plateau in their accuracy,[58] and different refinement strategies are now required to achieve effective improvements. The combination of knowledge-based prediction with physics-based methods looks particularly promising. Indeed, an MD-based method was found for the first time to be the best performing in recent rounds of CASP.[58,59] Key factors in this success were the coupling of MD with knowledge-based elements and the use of an averaging procedure over multiple parallel trajectories to enhance the structure sampling and to generate an enrichment of native-like versus non-native features.[60] Another example of these types of approaches is the use of distance maps taken from high-resolution experimental structures as restraints in MD simulations.[59] Sampling remains critical especially if the refinement requires large energy barriers to be overcome.[58] In these cases, enhanced sampling, as performed in our study, is necessary. Moreover, CASP11 results demonstrated that loops tend to be more challenging to refine. To this end, a CVSA-based strategy is particularly suitable since it uses tunable local biases for loop regions. Enhanced sampling methods were used in this work mainly to accelerate the sampling of high-CVSA conformations and not to derive energetic or kinetic data. Snapshots from metadynamics simulations were rescored with an external scoring function for model refinement, while work values in SMD simulations were only used to prefilter candidate structures. However, provided that appropriate simulation lengths and postprocessing protocols are used, it is in principle possible to extract direct energetic information from CVSA-biased trajectories and derive free energy changes for the processes involved, ranging from free energy changes associated with loop rearrangements or changes in secondary structure to folding energies. Moreover, when combined with suitable global CVs, the CVSA could be used to investigate the relative kinetics of formation of secondary and tertiary elements[61,62] in the folding of proteins, for example when comparing diffusion–collision and nucleation–condensation pathways.[63] Following recent successful examples where secondary structure-based CVs were combined with NMR chemical shifts to study denatured states[64] and IDPs,[65] the CVSA could in principle be coupled with CVs containing specific experimental information on unfolded or denatured states. In this context, using the CVSA is particularly suitable since it is able to describe both regular and irregular local structures. As a final remark, the CVSA can be used in any CV-based enhanced sampling approach, including hybrid approaches mixing CVs with REMD-like methods, such as parallel tempering[11] or bias exchange metadynamics,[66] where exchanges are allowed between replicas that use different CVs. The second approach would be particularly suitable to explore alternative local structure arrangements by using multiple CVSA with different SA strings. In conclusion, we showed that, by enhancing the sampling of local states from a structural alphabet, it is possible to recover the global native state in MD simulations of small proteins. This finding is robust against approximate definitions of local states. Moreover, we showed how artificial sequences of SA states from libraries of recurring SA motifs can be used to generate alternative conformations of protein regions. Biasing the sampling of local states has a wide range of potential applications, going from protein design to the study of conformational changes based on large local rearrangements such as hinged motions, secondary structure transitions, or loop remodelling.

Methods

Full details on methods, simulation setup and data analyses can be found in the Supporting Information.

Structural Alphabet and Structure Encoding

A SA is a collection of prototypical backbone conformations adopted by short fragments in protein structures, where each letter represents a fragment conformational state. In this work, we used the M32K25 SA, composed of 25 representative fragments of four consecutive Cα atoms.[30] A protein structure can be encoded into a SA string by progressively labeling each overlapping four-residue fragment (i.e., residues 1–4 for fragment 1, 2–5 for fragment 2, and so on) from the N-term to the C-term of the protein with a SA letter (A–Y), so that conformation of a protein of N residues is encoded into a structural string of length N – 3. In this work, the labeling of a fragment is performed by identifying the SA letter that has the minimum root-mean-square deviation (RMSD) from the fragment (local-fit encoding). No information from adjacent fragments is used, so that letters encoding consecutive fragments are assigned independently from each other, allowing for a nonexact match in the overlapping region. Local macrostates were defined by clustering the 25 letters of M32K25 (Supporting Information). The representatives of the six resulting clusters define a reduced version of the M32K25 SA (rSA).

CVSA Definition

CVSA is defined as the number of four-residue fragments f in the protein with Cα RMSD ρ from a preassigned SA letter X (reference state) within a given cutoff ρ0:Each term in the sum is a differentiable function[13] switching from 1 (ρ ≪ ρ0, f very close to the reference state X) to 0 (ρ ≫ ρ0, f very far from X), where n and m are user-defined parameters that modulate the switching rate. It follows that the maximum value that a CVSA can adopt (CVSAmax) corresponds to the number of fragments used for its definition. The RMSD ρ is calculated on the positions of Cα atoms. The CVSA was implemented in a modified version of PLUMED[10] 1.3. The resulting user interface for the CVSA definition is flexible, and any number of fragments can be used, spanning the entire protein structure or parts of it. The sequence of X letters defines the CVSA reference string. The CVSA section of sample PLUMED input files is reported in the Supporting Information for the GB1 β-hairpin (Appendix S1) and TR464 (Appendix S2). Different criteria can be used for the string assignment, as it is shown in Results. The CVSA can be used in combination with any CV-based enhanced sampling method implemented in PLUMED. The patch files used to implement the CVSA can be downloaded from https://afornililab.wordpress.com/software or http://people.brunel.ac.uk/~csstaap2/software.html.

MD Simulations

The GROMACS 4.5.5 program[67] was used to prepare the initial system coordinates and to run the simulations. The Amber99SB*-ILDN[68] force field was used for all the simulations. Enhanced sampling simulations were performed by coupling GROMACS with PLUMED-1.3.[10] In the metadynamics folding simulations, a CV describing a global property was used in addition to the CVSA. In SMD simulations, the steering was performed with harmonic restraints moving at constant velocity.

Tools for Trajectory Postprocessing and Analysis

MD trajectories were analyzed with the GROMACS 4.5.5 tools[67] and with in-house scripts in R (available from https://afornililab.wordpress.com/software or http://people.brunel.ac.uk/~csstaap2/software.html). The bio3d[69] R package was used for coordinate manipulation and for the analysis of the ArchDB database.[44] The encoding of experimental structures and MD trajectories with the M32K25 SA was performed with GSATools.[34] Structures from the blind loop refinement of the TR464 CASP target were independently rescored with Rosetta[45−47] (ver. 3.3).

66 in total

1. Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks.

Authors: A G de Brevern; C Etchebest; S Hazout
Journal: Proteins Date: 2000-11-15

2. Escaping free-energy minima.

Authors: Alessandro Laio; Michele Parrinello
Journal: Proc Natl Acad Sci U S A Date: 2002-09-23 Impact factor: 11.205

3. Using circular dichroism spectra to estimate protein secondary structure.

Authors: Norma J Greenfield
Journal: Nat Protoc Date: 2006 Impact factor: 13.491

Review 4. Moving into the cell: single-molecule studies of molecular motors in complex environments.

Authors: Claudia Veigel; Christoph F Schmidt
Journal: Nat Rev Mol Cell Biol Date: 2011-02-16 Impact factor: 94.444

5. GROMACS 4.5: a high-throughput and highly parallel open source molecular simulation toolkit.

Authors: Sander Pronk; Szilárd Páll; Roland Schulz; Per Larsson; Pär Bjelkmar; Rossen Apostolov; Michael R Shirts; Jeremy C Smith; Peter M Kasson; David van der Spoel; Berk Hess; Erik Lindahl
Journal: Bioinformatics Date: 2013-02-13 Impact factor: 6.937

6. Evolutionary conserved Tyr169 stabilizes the β2-α2 loop of the prion protein.

Authors: Danzhi Huang; Amedeo Caflisch
Journal: J Am Chem Soc Date: 2015-02-20 Impact factor: 15.419

7. Structural alphabets derived from attractors in conformational space.

Authors: Alessandro Pandini; Arianna Fornili; Jens Kleinjung
Journal: BMC Bioinformatics Date: 2010-02-20 Impact factor: 3.169

8. Knotted vs. unknotted proteins: evidence of knot-promoting loops.

Authors: Raffaello Potestio; Cristian Micheletti; Henri Orland
Journal: PLoS Comput Biol Date: 2010-07-29 Impact factor: 4.475

9. In silico phosphorylation of the autoinhibited form of p47(phox): insights into the mechanism of activation.

Authors: Flavia Autore; Bruno Pagano; Arianna Fornili; Katrin Rittinger; Franca Fraternali
Journal: Biophys J Date: 2010-12-01 Impact factor: 4.033

10. Improved side-chain torsion potentials for the Amber ff99SB protein force field.

Authors: Kresten Lindorff-Larsen; Stefano Piana; Kim Palmo; Paul Maragakis; John L Klepeis; Ron O Dror; David E Shaw
Journal: Proteins Date: 2010-06

6 in total

1. Temperature evolution of Trp-cage folding pathways: An analysis by dividing the probability flux field into stream tubes.

Authors: Vladimir A Andryushchenko; Sergei F Chekmarev
Journal: J Biol Phys Date: 2017-10-05 Impact factor: 1.365

2. Exploring the potential of a structural alphabet-based tool for mining multiple target conformations and target flexibility insight.

Authors: Leslie Regad; Jean-Baptiste Chéron; Dhoha Triki; Caroline Senac; Delphine Flatters; Anne-Claude Camproux
Journal: PLoS One Date: 2017-08-17 Impact factor: 3.240