Alessandro Pandini1, Arianna Fornili2. 1. Department of Computer Science, College of Engineering, Design and Physical Sciences and Synthetic Biology Theme, Institute of Environment, Health and Societies, Brunel University London , Uxbridge UB8 3PH, United Kingdom. 2. School of Biological and Chemical Sciences, Queen Mary University of London , Mile End Road, London E1 4NS, United Kingdom.
Abstract
Conformational changes associated with protein function often occur beyond the time scale currently accessible to unbiased molecular dynamics (MD) simulations, so that different approaches have been developed to accelerate their sampling. Here we investigate how the knowledge of backbone conformations preferentially adopted by protein fragments, as contained in precalculated libraries known as structural alphabets (SA), can be used to explore the landscape of protein conformations in MD simulations. We find that (a) enhancing the sampling of native local states in both metadynamics and steered MD simulations allows the recovery of global folded states in small proteins; (b) folded states can still be recovered when the amount of information on the native local states is reduced by using a low-resolution version of the SA, where states are clustered into macrostates; and (c) sequences of SA states derived from collections of structural motifs can be used to sample alternative conformations of preselected protein regions. The present findings have potential impact on several applications, ranging from protein model refinement to protein folding and design.
Conformational changes associated with protein function often occur beyond the time scale currently accessible to unbiased molecular dynamics (MD) simulations, so that different approaches have been developed to accelerate their sampling. Here we investigate how the knowledge of backbone conformations preferentially adopted by protein fragments, as contained in precalculated libraries known as structural alphabets (SA), can be used to explore the landscape of protein conformations in MD simulations. We find that (a) enhancing the sampling of native local states in both metadynamics and steered MD simulations allows the recovery of global folded states in small proteins; (b) folded states can still be recovered when the amount of information on the native local states is reduced by using a low-resolution version of the SA, where states are clustered into macrostates; and (c) sequences of SA states derived from collections of structural motifs can be used to sample alternative conformations of preselected protein regions. The present findings have potential impact on several applications, ranging from protein model refinement to protein folding and design.
Conformational changes
in proteins are often associated with protein–protein
interactions,[1] ligand binding,[2] and posttranslational modifications.[3] They are at the basis of powerful mechanisms
for functional regulation such as allostery,[4,5] and
they can be fuelled by chemical reactions to produce large-scale mechanochemical
motions in molecular motors.[6]The
structural and energetic characterization of conformational
transitions is therefore of central interest in understanding protein
function. Computational approaches such as molecular dynamics (MD)
simulations offer a powerful way to investigate such processes with
atomic resolution. However, the conformational transitions usually
found in biologically relevant systems are beyond the time scale currently
accessible to equilibrium MD simulations. Different methods have been
developed to overcome this problem by accelerating the sampling of
“rare events”.[2,7−9]Most of the available methods are based on the use of collective
variables (CVs),[7] which define the part
of the space where the sampling is enhanced and usually represent
the progress of the system along the process of interest. A range
of CVs have been used in the literature to describe protein conformational
changes, ranging from general descriptors of the protein global shape
to local geometric parameters specific to the process under examination.[10−18]While the details of the global conformational landscape of
a protein
are uniquely determined by its amino acid sequence, it is now well-established
that protein fragments tend to adopt recurring backbone conformations.
In particular, coarse-grained and sequence-independent structural
alphabets[19] (SAs) have been derived, which
contain the minimal set of typical Cα conformations
of protein fragments (local states or “letters”) needed
to reconstruct experimental protein structures with a high level of
accuracy. SAs have been exploited in the past for a number of applications,
including local structure[20] and flexibility[21] prediction, sequence-based structural comparison,[22] structure mining,[23−25] fold classification[26] and recognition,[27] and de novo prediction.[28,29] Recently they have
been shown to correctly represent also the dynamical properties of
proteins.[30] Indeed, they have been used
to analyze trajectories from MD simulations[31−33] and in particular
to detect signal transmission in allosteric proteins.[32,34]The main goal of the present work is to investigate if the
a priori
knowledge provided by SAs on the local states preferentially adopted
in protein structures can be exploited to accelerate the exploration
of protein conformations in MD simulations. In the following we use
a SA-based CV (CVSA) to guide the sampling of fragment
conformations in small proteins toward predefined local states in
metadynamics and steered MD (SMD) simulations.We first address
the question of whether enhancing the sampling
of native local structures can accelerate the sampling of global native
structures. We show that folded structures of small proteins can be
reproducibly and efficiently recovered in short simulations using
the SA letters that best represent the local native structures. We
then reduce the extent of the information on the local native structure
provided a priori by using a simplified or reduced version of the
SA (rSA), where local states are replaced by macrostates defining
fragment shapes instead of detailed structures. We show that in all
cases the missing information required to get the global folded state
is recovered by the MD sampling, which allows the fragments to adapt
to their specific environment by adjusting their conformation and
relative orientation.Finally, we investigate possible ways
to produce synthetic SA strings
to guide the sampling when no information on the desired final state
is available. In particular, we show that libraries of SA strings
can be derived from databases of structural motifs and can be used
to sample alternative conformations of parts of a protein. The resulting
conformations can then be ranked a posteriori with a scoring function
to identify the most native-like structures.The present findings
indicate that MD simulations and knowledge-based
SAs can be effectively combined to enhance the exploration of the
conformational space of proteins. The proposed CVSA has
a wide range of applications, ranging from protein model refinement
to protein folding and design.
Results
The results presented in
this work are derived using a new CV based
on the 25-letter SA M32K25[30] (CVSA). To define a CVSA, the structure of the protein is first
partitioned into four-residue fragments of Cα atoms.
The CVSA is then calculated as a sum of single-fragment
terms (eq in Methods), each consisting of a switching function
related to the deviation of the fragment from a predefined reference
letter extracted from the SA (Figure A). The overall CVSA measures the number
of fragments matching the structure of their corresponding reference
SA letters. The fragments used to define a CVSA can overlap,
and they can cover the whole protein or only selected regions.
Figure 1
Schematic description of the CVSA and its applications.
(A) A plot of the switching function used in the definition of the
CVSA is reported, showing the dependence of the function
on the Cα RMSD (ρ) of a fragment (gray to dark
green licorice) from a reference SA state (green). Using the CVSA, the sampling of local states can be biased for all fragments
in a protein (B) or for fragments in selected regions (C).
In the following we investigate if, when combined with an enhanced
sampling technique, the CVSA can be used to guide the folding
of the entire protein to its native state (Figure B) and to sample alternative conformations
in selected regions (Figure C).Schematic description of the CVSA and its applications.
(A) A plot of the switching function used in the definition of the
CVSA is reported, showing the dependence of the function
on the Cα RMSD (ρ) of a fragment (gray to dark
green licorice) from a reference SA state (green). Using the CVSA, the sampling of local states can be biased for all fragments
in a protein (B) or for fragments in selected regions (C).
Folding 3D Structures from SA Strings
In this section
we apply the CVSA to fold two small fast-folders, the β-hairpin
of the GB1 protein, and the Trp-cage mini-protein. These are small
proteins that are known to adopt a stable fold in solution with relatively
short folding times (6 μs for the GB1 β-hairpin[35] and 4 μs for the Trp-cage[36]), so that they are often used as model systems to study
protein folding. As expected from their folding rates, equilibrium
MD simulations in the nanosecond time scale are usually not suitable
to observe folding events for these two proteins,[37] and either enhanced sampling[11,38,39] or long (∼10–100 μs) MD simulations[40,41] have been used in the past.In this section we use the CVSA combined with the enhanced sampling method metadynamics[42] (Methods). The reference
SA letters are determined from the local-fit encoding (Methods) of the experimental native structure, where each
SA letter is the best match of a fragment in its native conformation
(Table ). CVSA contains all the overlapping fragments in the protein. In the local-fit
mode used to assign the letters, no information on adjacent fragments
is used to determine the letter of a given fragment. The sampling
along the CVSA is then biased in metadynamics simulations
(Methods), so that structures with CVSA values in the whole accessible range [0, CVSAmax] are explored, where CVSAmax is equal to the number of fragments included in the CVSA.
Table 1
M32K25 SA and rSA Strings Used for
the Folding and Refinement Simulations
Fragments covering the central loop
residues are indicated in bold.
The fragments include residues 18–31.
The fragments include residues 19–30.
The fragments include residues 38–45.
The fragments include residues
39–44.
Fragments covering the central loop
residues are indicated in bold.The fragments include residues 18–31.The fragments include residues 19–30.The fragments include residues 38–45.The fragments include residues
39–44.The CVSA defined in this way contains information on
the local structure of the native state but not on the global folded
structure. Indeed, using the CVSA differs from providing
the coordinates of the folded conformation as reference structure
in that (a) the native conformations of single fragments are used,
with no information on the relative arrangement of different fragments,
and (b) discrete SA states are used to represent fragment conformations,
so that the CVSA does not contain the actual native fragment
structures, but the SA letters closest to them.Folded β-hairpin
conformations, i.e., conformations with
Cα RMSD from the experimental native structure (RMSDnat) ≤ 2 Å, were sampled multiple times within
a 100 ns metadynamics simulation (Figure A) starting from an extended conformation,
with a minimum RMSDnat of 0.4 Å. Considering that
the CVSA was defined using reference SA letters from the
native structure, folded conformations are expected to have high CVSA values. Indeed, almost all low-RMSDnat structures
belong to the high-CVSA ensemble, defined as composed of
structures with CVSA ≥ CVSAmax – 2 (blue points in Figure A). Moreover, after filtering the trajectory for high-CVSA structures, a significant enrichment was found in folded
conformations, going from a percentage of folded structures of 6%
for the overall trajectory to 39% for the CVSA-filtered
one (Table ). Metadynamics
simulations of the Trp-cage showed a similar behavior. Folded structures
were sampled with RMSDnat as low as 0.5 Å (Figure C), and high-CVSA ensembles were composed for the 27% of native-like structures
(Table ). As expected,
for both proteins no folded conformations were found in control unbiased
MD simulations of the same length (Table ).
Figure 2
Time evolution of the Cα RMSD
from the experimental
native structure of the GB1 β-hairpin (upper panels) and the
Trp-cage mini-protein (lower panels) during metadynamics simulations.
The CVSA was defined using the SA (left panels) and rSA
(right panels) encodings of the experimental structures. High-CVSA points are colored in blue. For each panel, the experimental
structures (white cartoon) are superimposed to the best matching high-CVSA structure (blue).
Table 2
Fraction of Native-like
Structures
in Metadynamics and SMD Simulations of the GB1 β-Hairpin and
the Trp-Cage Mini-protein
pNatM (%)d
pNatFiltCVsa (%)e
pNatSMD (%)f
GB1 β-hairpin (SA)a
5.9
39.4
GB1 β-hairpin (rSA)b
2.0
37.5
36.0
GB1 β-hairpin (unbiased MD)c
0.0
Trp-cage (SA)a
3.3
26.5
Trp-cage (rSA)b
0.2
0.4
33.3
Trp-cage (unbiased MD)c
0.0
CVSA-biased simulation
with SA encoding.
CVSA-biased simulation
with rSA encoding.
Unbiased
MD simulation.
Percentage
of structures in the
whole metadynamics trajectory with RMSDnat ≤ 2 Å.
Percentage of structures in
the
metadynamics high-CVSA ensemble with RMSDnat ≤ 2 Å.
Percentage
of productive SMD trajectories
(final structure with RMSDnat ≤ 2 Å).
CVSA-biased simulation
with SA encoding.CVSA-biased simulation
with rSA encoding.Unbiased
MD simulation.Percentage
of structures in the
whole metadynamics trajectory with RMSDnat ≤ 2 Å.Percentage of structures in
the
metadynamics high-CVSA ensemble with RMSDnat ≤ 2 Å.Percentage
of productive SMD trajectories
(final structure with RMSDnat ≤ 2 Å).Time evolution of the Cα RMSD
from the experimental
native structure of the GB1 β-hairpin (upper panels) and the
Trp-cage mini-protein (lower panels) during metadynamics simulations.
The CVSA was defined using the SA (left panels) and rSA
(right panels) encodings of the experimental structures. High-CVSA points are colored in blue. For each panel, the experimental
structures (white cartoon) are superimposed to the best matching high-CVSA structure (blue).Non-native structures with high-CVSA values were
also
sampled (blue points above the dashed lines in Figure A,C). In these conformations the single fragments
match their reference SA letter, but their global arrangement differs
from the native one. Hairpin-like conformations were found where terminal
segments are in a β-strand conformation, but they are too distant
to form a β-sheet, resulting in a percentage of native contacts
as low as 68% (Supporting Information Figure
S1A). This is consistent with the fact that the CVSA does
not contain a bias toward nonlocal native contacts, so that non-native
global arrangements of native local states can be sampled during the
simulation.In summary, the present results show that providing
local structural
information can enhance the folding toward the global native structure
of small proteins. Using the CVSA promotes the sampling
of structures with the desired sequence of discrete local states (or
strings of letters). During the simulation, different relative arrangements
of these fragments are explored, with a significant fraction of folded
global structures. In the following section we further reduce the
information needed a priori to build the CVSA by introducing
a simplified version of the structural alphabet.
Generating
SA Strings: Simplified Alphabet
The extent
to which the target local structures are known before the simulation
can vary significantly, and in some cases it might be limited to a
prediction of the shape of the fragment. It is thus convenient to
have a reduced set of SA letters representing more general fragment
states (macrostates). To this end, a cluster analysis was performed
on the 25 letters of the M32K25 SA (Methods). The resulting six clusters (Figure ), named after the letter of their representative,
describe fragment states that differ mainly for the pseudotorsion
around the two central Cα atoms (Supporting Information Table S1).
Figure 3
Licorice representation
of the six clusters of M32K25 letters.
The cluster representatives forming the reduced alphabet rSA are labeled
and shown in darker colors.
Licorice representation
of the six clusters of M32K25 letters.
The cluster representatives forming the reduced alphabet rSA are labeled
and shown in darker colors.SA encodings can be translated into rSA encodings by replacing
each letter with the corresponding cluster representative (Table S1). For example, the GB1 β-hairpin
SA string “DBAALVWOQABAB” becomes “AAAAAUUKRAAAA”
(Table ), where the
turn-encoding letters are easily recognizable as being sandwiched
between two stretches of strand-encoding “A” letters.When going from the SA to the rSA encoding, the maximum RMSD between
each fragment experimental structure and its corresponding letter
for the two proteins increases from 0.4 to 1 Å, thus reducing
the extent of information on the fragment target states contained
in the rSA-based CVSA. Nevertheless, metadynamics simulations
showed that folded conformations can still be recovered for both proteins
(Figure , panels B
and D), with overall RMSDnat values as low as 0.5 Å
(β-hairpin) and 1.8 Å (Trp-cage). Filtering the trajectories
for high-CVSA structures produced an enrichment in folded
structures for the β-hairpin from 2 to 38% (Table ). A less extensive sampling
of high-CVSA structures and hence of folded states was
instead observed for the Trp-cage (Table ).The performance of the rSA-based
CVSA was also tested
by using a different enhanced sampling technique, the steered MD[43] (SMD; see Methods),
where the CVSA was steered from the starting value to its
maximum value CVSAmax in a predetermined amount
of time and in multiple replicas. The folded state was recovered at
the end of the simulation in about one-third of the runs (productive
runs) for both molecules (Figure and Table S2), confirming
that the rSA-based CVSA contains sufficient information
to reach the folded state for both proteins.
Figure 4
Productive (blue) and nonproductive (orange) SMD runs
of the GB1
β-hairpin (A) and the Trp-cage mini-protein (B). Final SMD structures
(colored cartoon) are superimposed onto the experimental structures
(white cartoon). Low-work relaxed structures (Supporting Information) are reported. A SMD run is defined
productive if RMSDnat ≤ 2 Å at the end of the
simulation. The average value of RMSDnat is also reported
for productive runs.
Since the steering
bias was applied on the overall CVSA, the single fragments
were free to reach the reference structure
at different times (Figure S2). In the
productive runs, the folding mechanism mainly resembled a zipper mechanism,[37] where the contacts between the β-strands
started to form from the turn toward the termini (Figure S2).Productive (blue) and nonproductive (orange) SMD runs
of the GB1
β-hairpin (A) and the Trp-cage mini-protein (B). Final SMD structures
(colored cartoon) are superimposed onto the experimental structures
(white cartoon). Low-work relaxed structures (Supporting Information) are reported. A SMD run is defined
productive if RMSDnat ≤ 2 Å at the end of the
simulation. The average value of RMSDnat is also reported
for productive runs.In the nonproductive trajectories, the final structures reached
the CVSAmax maximum value, but they were trapped
in non-native conformations where the N- and C-terminal segments were
differently aligned (Figure , orange structural ensembles) and a smaller fraction of native
contacts was recovered (Figure S3, orange
lines) compared to the productive trajectories. The relaxation to
the native state was hindered by the early formation of non-native
contacts (Figure S4).Interestingly,
we observed that part of the nonproductive SMD trajectories
can be filtered out without using information on the target native
state by calculating the work needed to drive the overall transition
for each trajectory. Indeed, the trajectories associated with low-work
transformations (Supporting Information) were found to have a higher success rate in producing a correctly
folded structure.The present results show that the reduced
representation of the
SA can be effectively used to sample folded conformations of small
proteins within both the metadynamics and the SMD frameworks. The
generation of the rSA fragment macrostates is a first step toward
the definition of artificial sequences of fragment states, which do
not require a priori knowledge of the final structure. In the next
section we show that recurring motives can be identified in the rSA
encoding of experimental loop structures, indicating that libraries
of predefined strings can be used to define the CVSA reference
states.
Generating SA Strings: Recurrence of SA Motifs in Loops
Loops, defined as nonrepetitive structural units connecting regular
secondary structures, have been shown to adopt recurring conformations
by different classification schemes.[19] The
existence of supersecondary structural motifs has been exploited in
the past for both function and structure prediction.[19,44] In this section, we analyze the loops contained in the ArchDB database[44] (Supporting Information) to show that a large number of loop structures can be encoded by
a small number of recurring rSA strings or rSA motifs. In the following
we focus on β-hairpins of length 4 (ArchDB.BN4.100, Supporting Information), which are one of the
most populated loop types in ArchDB,[44] but
the analysis can be easily extended to any loop type. For each β-hairpin,
the structure of the loop plus the first residue from each flanking
β-strand was encoded into three-letter strings with the rSA.Of the 56 possible combinations of six letters in strings of length
3, only 26 rSA motifs were observed in the 3314 loop structures of
the ArchDB.BN4.100 data set (Table S3).
Interestingly, 97% of all the structures fall in the first five rSA
motifs (Figure ).
All together, these motifs cover 20 of the original 21 ArchDB subclasses
(Supporting Information). The structures
of the five rSA motifs are significantly different from each other,
with an average intermotif RMSD of 2.2 Å (Table S4).
Figure 5
Structures of the most populated rSA motifs for β-hairpin
loops of length 4 in the ArchDB.BN4.100 database. The cluster representatives
of the loop structures belonging to each motif are shown, with the
most populated cluster highlighted as a thicker tube.
Structures of the most populated rSA motifs for β-hairpin
loops of length 4 in the ArchDB.BN4.100 database. The cluster representatives
of the loop structures belonging to each motif are shown, with the
most populated cluster highlighted as a thicker tube.The existence of a small number of recurring rSA
motifs can be
exploited to build CVSA variables that are not tailored
to a specific experimental structure as was done in the previous sections.
As we will see in the next section, such CVSA can be used
to dictate the general shape of a loop, while the specific conformation
is determined by the amino acid sequence and/or the environment of
the loop.
Folding Different Amino Acid Sequences Using the Same rSA String
In this section we show that simulations based on the same rSA
string can correctly reproduce differences in loops that, while having
a similar shape, present structural differences due to their different
amino acid composition.We identified two groups of β-hairpin
loops with similar shape (encoded by the rSA string “KUN”)
but with differences in the specific structure that correlate with
differences in their amino acid sequence (Supporting Information).A structural superimposition (Figure A) shows that KUN
loops are clustered in
two groups (KUNg1 and KUNg2), with KUNg2 structures (green) featuring
a more bent conformation around the residue in position p4 compared
to KUNg1 (blue), a different orientation of the p4 CO bond (licorice),
and significantly different Ramachandran plots at the p4 and p5 positions
(arrows in Figure D). A sequence alignment shows that all KUNg2 loops have a Gly in
position p5 (Figure S5A), which allows
for the larger bending of the loop.
Figure 6
Comparison of KUNg1 and KUNg2 β-hairpins.
(A) Superimposition
of experimental structures of KUNg1 (blue) and KUNg2 (green) β-hairpins.
(B) Superimposition of experimental (dark colors) and simulated (light
colors) structures of KUNg1 (blue) and KUNg2 (green) β-hairpins.
The simulated structures were extracted from the high-CVSA metadynamics ensembles with a cluster analysis. The backbone CO
bonds of the residues in positions p4 and p5 are represented as licorice.
(C) His3-Gly5 hydrogen-bonded interaction in the KUNg2 metadynamics
structure. (D) Ramachandran plots of experimental (large numbered
circles), metadynamics (diamonds), and SMD (small circles) folded
structures for KUNg1 (left panel) and KUNg2 (right panel).
Comparison of KUNg1 and KUNg2 β-hairpins.
(A) Superimposition
of experimental structures of KUNg1 (blue) and KUNg2 (green) β-hairpins.
(B) Superimposition of experimental (dark colors) and simulated (light
colors) structures of KUNg1 (blue) and KUNg2 (green) β-hairpins.
The simulated structures were extracted from the high-CVSA metadynamics ensembles with a cluster analysis. The backbone CO
bonds of the residues in positions p4 and p5 are represented as licorice.
(C) His3-Gly5hydrogen-bonded interaction in the KUNg2 metadynamics
structure. (D) Ramachandran plots of experimental (large numbered
circles), metadynamics (diamonds), and SMD (small circles) folded
structures for KUNg1 (left panel) and KUNg2 (right panel).Two sequences representative of the two groups
were selected. CVSA-biased MD simulations were run to fold
each amino acid sequence
into its corresponding β-hairpin starting from an extended conformation.
The same rSA encoding was used for the two β-hairpins (Table ).In both cases,
the native state was recovered in metadynamics and
SMD simulations (Table S5), with a percentage
of native-like structures ranging from 4–10% (metadynamics)
to 36–72% (SMD). Remarkably, even if the same rSA encoding
was used for both amino acid sequences, the simulated folded structures
reproduce the differences between the experimental ones. Indeed, KUNg2
simulated structures (Figure B, light green) are more bent at p4 than KUNg1 (light blue)
and the simulated conformations of each amino acid sequence are more
similar to the experimental structure with the same sequence than
to the other one (Table S6). Correspondingly,
the Ramachandran plots of simulated structures show differences in
the distribution of φ and ψ angles that parallel the differences
between the experimental structures, in particular for residues in
positions p4 and p5 (arrows in Figure D).A possible explanation of the conformation
adopted by KUNg2 at
position p4 can be found by looking at the nearby side chains. Indeed,
the bent loop arrangement in KUNg2 allows the Gly5 backbone NH group
to be on the same side of the His3 side chain, favoring the formation
of a His3-Gly5hydrogen bond both in metadynamics (Figure C) and SMD (Figure S5B) simulations.The present results indicate
that the CVSA based on
rSA motifs contains sufficient information to guide the folding toward
the correct general shape of the loop backbone while at the same time
allowing for adjustments to obtain sequence-specific structures. Side
chains are not included in the definition of CVSA, but
they are explicitly taken into account during the simulation by the
all-atom force field and they can modulate the loop conformation via
direct side chain–backbone interactions.
Real Life Application:
Protein Model Refinement
In
this section we use the CVSA and the library of loop motifs
to refine a protein model. Differently from the previous sections,
only part of the structure is considered in the definition of the
CVSA, while the rest is left unbiased during the simulation.The protein to be refined was selected from the targets of the
Refinement category of the CASP8 exercise (Supporting Information). The best model generated in the normal prediction
exercise failed to correctly describe two regions L1 (residues 19–30)
and L2 (residues 39–44) (Figure A), indicated by CASP organizers as problematic and
showing deviations from the experimental structure > 4.0 Å
(Table ).
Figure 7
Refinement
of the TR464 target from CASP8. (A) Superimposition
of the L1 (residues 19–30) and L2 (residues 39–44) regions
for the starting unrefined model (pink) and the target experimental
structure (green). The rest of the structures is represented as white
cartoon. Residues E5 and R42 are highlighted as licorice. A E5-R42
salt bridge interaction is formed only in the experimental structure.
(B) Superimposition of the L1 and L2 regions in the target structure
(green) and in a representative structure from metadynamics (blue).
The E5-R42 interaction is recovered in the metadynamics simulation
(blue licorice). (C) Superimposition of the L1 and L2 regions in final
structures from productive SMD runs (blue cartoon). The E5-R42 salt
bridge is recovered in all of the productive SMD runs. (D) Comparison
of the structures with the best Rosetta score (blue) from the high-CVSA metadynamics ensembles obtained using the BlindL2a (left),
BlindL2b (center), and BlindL2c (right) encodings for L2. The Rosetta
score is reported together with the L2 RMSDnat calculated
in local fit.
Table 3
Refinement
of the CASP8 Target TR464
pNat (%)b
pNatFiltCVsa (%)c
RMSDnat (Å)d
RMSDnat(L1) (Å)e
RMSDnat(L2) (Å)e
Best Model
2.94
4.00 (3.22)
4.35 (2.02)
Best Refined Model
2.23
1.57 (1.38)
2.38 (0.49)
Metadyn (Target)a
2.1
21.4
2.00
0.53 (0.27)
1.50 (0.30)
unbiased MD
0.0
2.32
2.86 (2.45)
2.48 (1.64)
CVSA metadynamics simulation run
using the Target rSA string.
Percentage of structures in the
whole trajectory with both RMSDnat(L1) and RMSDnat(L2) ≤ 2 Å.
Percentage of structures in the
high-CVSA ensemble with both RMSDnat(L1) and RMSDnat(L2) ≤ 2 Å.
Cα RMSD from the
native structure calculated over the whole structure. Minimum values
observed during the simulation are reported for metadynamics and unbiased
MD.
Cα RMSD
from the
native structure calculated over only L1 and L2. Minimum values observed
during the simulation are reported for metadynamics and unbiased MD.
Values in parentheses are calculated in local fit (only the L1 or
L2 structures are used for the best fit superposition instead of the
whole structure).
Refinement
of the TR464 target from CASP8. (A) Superimposition
of the L1 (residues 19–30) and L2 (residues 39–44) regions
for the starting unrefined model (pink) and the target experimental
structure (green). The rest of the structures is represented as white
cartoon. Residues E5 and R42 are highlighted as licorice. A E5-R42
salt bridge interaction is formed only in the experimental structure.
(B) Superimposition of the L1 and L2 regions in the target structure
(green) and in a representative structure from metadynamics (blue).
The E5-R42 interaction is recovered in the metadynamics simulation
(blue licorice). (C) Superimposition of the L1 and L2 regions in final
structures from productive SMD runs (blue cartoon). The E5-R42 salt
bridge is recovered in all of the productive SMD runs. (D) Comparison
of the structures with the best Rosetta score (blue) from the high-CVSA metadynamics ensembles obtained using the BlindL2a (left),
BlindL2b (center), and BlindL2c (right) encodings for L2. The Rosetta
score is reported together with the L2 RMSDnat calculated
in local fit.CVSA metadynamics simulation run
using the Target rSA string.Percentage of structures in the
whole trajectory with both RMSDnat(L1) and RMSDnat(L2) ≤ 2 Å.Percentage of structures in the
high-CVSA ensemble with both RMSDnat(L1) and RMSDnat(L2) ≤ 2 Å.Cα RMSD from the
native structure calculated over the whole structure. Minimum values
observed during the simulation are reported for metadynamics and unbiased
MD.Cα RMSD
from the
native structure calculated over only L1 and L2. Minimum values observed
during the simulation are reported for metadynamics and unbiased MD.
Values in parentheses are calculated in local fit (only the L1 or
L2 structures are used for the best fit superposition instead of the
whole structure).The rSA
encoding sequences of L1 and L2 in the best model and in
the target experimental structure (Table ) show that L1 changes from an α–β-hairpin
with the central turn in a “UUK” conformation to a “KUK”
β-hairpin, while the central turn in L2 changes from “UUK”
to “UKR”. Interestingly, these three turn encodings
are among the top five motifs for length-4 loops in β-hairpins
described before (Table S3).Enhanced
sampling simulations were first run with a CVSA based on
the rSA target encodings of L1 and L2 (Target rSA in Table ) to test if local
information can be used to guide the refinement. A single CVSA coordinate was used containing the fragments of both regions. The
distance between the structures sampled during the simulations and
the target conformation was measured by calculating the Cα RMSD for the overall structure (RMSDnat) and separately
for the two regions L1 (RMSDnat(L1)) and L2 (RMSDnat(L2)).Structures significantly closer to the target structure
than the
starting model were sampled during 100 ns long metadynamics simulations.
Both RMSDnat(L1) and RMSDnat(L2) were reduced
to ≤2 Å in ∼21% of the CVSA-filtered
trajectory (pNatFiltCVSA in Table ), with deviations
for the single regions as low as 0.5 (L1) and 1.5 Å (L2) for
the best metadynamics structure (Figure B). Similarly, SMD simulations with rSA encoding
improved the starting model in 18% of the runs, with RMSDnat values as low as 1.7 Å (Figure C and Table S7). Control
unbiased simulations failed to significantly improve the starting
model. Indeed, no structures were found where both L1 and L2 were
in a native-like conformation (unbiased MD pNat in Table ).The analysis
of the time evolution of RMSDnat for metadynamics
(Figure S6B, upper panel) and SMD runs
(Figure S7A,B) indicates that L2 is the
last loop to adopt the native conformation, suggesting that its rearrangement
is the difficult step in the model refinement.To test the performance
of CVSA in a real life situation
where the target state is not known, multiple metadynamics simulations
were run using “blind” rSA encodings for the L2 loop
(BlindL2a–c in Table ). L2 encodings were chosen among the top five motifs for
length 4 loops in β-hairpins identified in the previous section
(Figure ). The BlindL2a
encoding “UKR” coincides with the native Target rSA
encoding used in the previous calculations.The blind trajectories
were filtered for high CVSA values
(CVSA ≥ CVSAmax – 2),
and the resulting structures were rescored using the Rosetta scoring
function[45,46] (Methods). The best
scoring structure was found in the BlindL2a simulation (Table S8), with a RMSDnat(L2) of 0.7
Å when using only L2 for the best fit superposition (local fit
or LF) and of 3.0 Å when calculated using the whole protein structure
(global fit or GF). Indeed, the L2 conformation in the best CVSA-refined structure (blue in Figure D, left panel), while featuring a residual
translational displacement from the experimental structure (green),
has a very good match to the experimental shape. Remarkably, 13 of
the top 20 scoring structures are from the BlindL2a simulation (gray
rows in Table S8).The best structures
from the other two blind simulations differed
significantly from the experimental L2 conformation (BlindL2b and
BlindL2c in Figure D), with RMSDnat(L2) values of 2.2 (BlindL2b) and 1.9
Å (BlindL2c) in the local fit. Interestingly, they have also
poorer scores according to Rosetta (Table S8). Thus, rescoring the structures with Rosetta allows BlindL2a structures
to be identified as the closest to the native state without using
information on the experimental structure.To summarize this
section, we showed how the CVSA coupled
with libraries of rSA loop motifs can be effectively used to sample
multiple alternative loop conformations and, when combined with a
knowledge-based rescoring potential, to refine protein models.
Discussion
The occurrence of recurring local structures in proteins has inspired
several approaches to structure prediction and design. Large libraries
of sequence-dependent fragments have been successfully employed in
fragment-assembly strategies such as Rosetta.[45−47] Alternatively,
small sets of coarse-grained sequence-independent fragments were used
as structural alphabets,[30,48−50] often coupled with machine learning predictors.[20] Recently, it was also shown that local structural changes
observed in MD conformational ensembles can be analyzed in terms of
changes of SA letters without loss of information. This was particularly
true for the M32K25 SA, which has been derived from conformational
attractors, i.e., regions in the fragment conformational space that
are highly populated by experimental structures.[30] The use of this SA turned out to be particularly effective
in investigating allosteric proteins.[32]All these data indicate that SAs can provide a compact and
reliable
representation of the most populated conformational states of protein
fragments, recapitulating the structural features of a large number
of experimental structures.[30] The central
idea of the present work is to exploit the experimental information
distilled in a SA to accelerate the exploration of protein conformations
in MD simulations. We thus introduced a SA-based collective variable
(CVSA) to control the match between simulated and SA fragment
conformations. Combining the CVSA with either metadynamics
or steered MD techniques, it was possible to bias the sampling of
fragment conformations toward experimentally preferred local states.
While SAs have been used in the past for postprocessing MD trajectories,[31−33] this is the first time to our knowledge that they are used to enhance
the sampling during an MD simulation.The use of the CVSA allows the introduction of knowledge-based
elements in the simulation without loss of generality. Since the CVSA is based on local states, no assumption is required on nonlocal
contacts and thus on the relative arrangement of the fragments. Moreover,
the SA fragments contain only Cα atoms, so that they
can be used with any amino acid sequence and no a priori information
is needed on side chain structures.CVs based on secondary structures
have been used in the past either
to accelerate the folding of small proteins to their native states[13] or to explore the space of their accessible
folds.[51] Performances comparable to the
CVSA were obtained when folding the GB1 β-hairpin
with metadynamics simulations.[13] However,
these CVs are based on canonical structures of blocks of secondary
structure elements, and for β structures they contain pairs
of β-strands.[13] The CVSA differ from these in that (a) it does not require nonlocal information
on specific arrangements of secondary structure elements and (b) it
can be used to describe regions with irregular structure.The
performance of the CVSA was first tested on the
folding of peptides and mini-proteins. In all cases, conformations
with a CVSA value equal or close to CVSAmax were sampled during metadynamics simulations starting from
unfolded conformations. In these high-CVSA ensembles, all
the fragments were in their target SA state or close to it, indicating
that the structures had a local native-like conformation. Remarkably,
a significant portion of each ensemble (27–40%) had also a
global native-like conformation. In the SMD runs, the CVSA was explicitly steered to its maximum value, so that high-CVSA states were ensured to be sampled in a fixed amount of time.
Similarly to metadynamics, the ensemble of high-CVSA conformations
composed by the final SMD structures had a significant proportion
of native-like conformations, with percentages up to 72%.Folded
structures were thus observed when the fragments, in addition
to being in the correct local state, had also a native-like arrangement,
with native-like interfragment contacts. No information was directly
provided on nonlocal native contacts, but folded structures were successfully
formed during the relatively short simulations performed here. The
bias on the CVSA ensured an increased probability of observing
native-like local structures, which in turn increased the probability
of finding them in a global native-like arrangement compared to an
unbiased simulation.While the local states used to perform
these calculations were
derived from the target global structures, experimental and predicted
information on the local structure of a protein may be available even
in the absence of its global structure. For example, the experimental
secondary structure composition can be derived from CD spectra,[52] while the sequence of secondary structure elements
can be usually predicted with a high level of accuracy.[53−55] For irregular regions, libraries of structural motifs are available,[44] while protein regions involved in conformational
changes can be identified by hydrogen/deuterium exchange mass spectrometry.[56]By construction, high CVSA values
can be used to discriminate
conformations with a local native structure among those generated
during the simulation, but these are not necessarily globally folded.
To fold larger and more complex systems than the ones considered in
this work in a comparable amount of time, additional information on
interfragment contacts would need to be provided. This information
could be derived from inter-residue contact prediction, for example
using recently proposed methods based on coevolution.[57] On the other end, when no contact information is provided
a priori, the possibility to sample multiple global arrangements compatible
with the same sequence of local states could be exploited for protein
design methods. Indeed, simulations using the same reference SA string
with different amino acid sequences would show how the spectrum of
different interfragment arrangements is modulated by the primary structure
of the protein.The form chosen for the CVSA (eq ) allows for some degree
of structural variability
also in the local structure of high-CVSA ensembles. Indeed,
the conformation of a fragment is not required to exactly match its
reference SA letter to contribute significantly to the CVSA, but small adjustments in the local structure are possible if energetically
favored. This behavior is regulated by the ρ0 parameter
(eq ), which defines
the tolerance on the fragment deviation from the reference letter
in the switching function. In most of the calculations, a value of
0.6 Å was used, which is comparable to the cluster radius in
the macrostates of the reduced alphabet (rSA). When using a CVSA with the rSA encoding, the requirement for the fragment
is thus to match a macrostate, with freedom to adapt to any of the
SA states that compose it.Coupling the MD sampling with the
reduced version of the alphabet
rSA instead of the full SA has the advantage that less information
on the system needs to be provided before the simulation. Indeed,
even when using the same reference rSA string for different amino
acid sequences, the structural differences of the experimental structures
are still recovered during the MD simulation. While the rSA-based
CVSA provides information on the sequence of local macrostates,
during the simulation the fragments can adopt the states that best
match their specific chemical environment.The reduced complexity
of rSA strings can be exploited for the
generation of guess reference strings for the CVSA without
detailed knowledge of the desired final structure. In particular,
secondary structure predictors[53−55] can be used for regular structures,
while loop encodings can be extracted from loop databases. Indeed,
we showed that a large part of the loops in experimental β-hairpin
structures can be described by a reduced number of rSA strings. This
analysis can be easily extended to other types of loops,[44] generating a comprehensive library of rSA motifs.The use of rSA sequence libraries was exemplified with the refinement
of a protein model. Alternative rSA sequences, generated from the
most populated motifs for β-hairpins, were used to refine the
L2 loop in the CASP8 target TR464. No information on the final state
was used to define the CVSA. L2 structures closer to the
native state than the starting model were identified by rescoring
with Rosetta. In particular, the ensemble generated with the native
rSA motif could be identified as the best one because of the higher
proportion of its conformations in the top ranking structures after
rescoring.The results discussed above suggest that the combination
of CVSA-based MD sampling with rSA libraries is a promising
approach
for the development of protein refinement methods. In particular,
the increasing availability of parallel computing resources could
be exploited to test a large number of rSA strings, which might be
needed to be considered when multiple regions are involved in the
refinement.The development of refinement methods that can systematically
improve
the quality of protein models has proven to be particularly challenging
so far, and it is still an area of ongoing work. Structure prediction
relying on knowledge-based approaches seems to have reached a plateau
in their accuracy,[58] and different refinement
strategies are now required to achieve effective improvements. The
combination of knowledge-based prediction with physics-based methods
looks particularly promising. Indeed, an MD-based method was found
for the first time to be the best performing in recent rounds of CASP.[58,59] Key factors in this success were the coupling of MD with knowledge-based
elements and the use of an averaging procedure over multiple parallel
trajectories to enhance the structure sampling and to generate an
enrichment of native-like versus non-native features.[60] Another example of these types of approaches is the use
of distance maps taken from high-resolution experimental structures
as restraints in MD simulations.[59]Sampling remains critical especially if the refinement requires
large energy barriers to be overcome.[58] In these cases, enhanced sampling, as performed in our study, is
necessary. Moreover, CASP11 results demonstrated that loops tend to
be more challenging to refine. To this end, a CVSA-based
strategy is particularly suitable since it uses tunable local biases
for loop regions.Enhanced sampling methods were used in this
work mainly to accelerate
the sampling of high-CVSA conformations and not to derive
energetic or kinetic data. Snapshots from metadynamics simulations
were rescored with an external scoring function for model refinement,
while work values in SMD simulations were only used to prefilter candidate
structures. However, provided that appropriate simulation lengths
and postprocessing protocols are used, it is in principle possible
to extract direct energetic information from CVSA-biased
trajectories and derive free energy changes for the processes involved,
ranging from free energy changes associated with loop rearrangements
or changes in secondary structure to folding energies. Moreover, when
combined with suitable global CVs, the CVSA could be used
to investigate the relative kinetics of formation of secondary and
tertiary elements[61,62] in the folding of proteins, for
example when comparing diffusion–collision and nucleation–condensation
pathways.[63] Following recent successful
examples where secondary structure-based CVs were combined with NMR
chemical shifts to study denatured states[64] and IDPs,[65] the CVSA could
in principle be coupled with CVs containing specific experimental
information on unfolded or denatured states. In this context, using
the CVSA is particularly suitable since it is able to describe
both regular and irregular local structures.As a final remark,
the CVSA can be used in any CV-based
enhanced sampling approach, including hybrid approaches mixing CVs
with REMD-like methods, such as parallel tempering[11] or bias exchange metadynamics,[66] where exchanges are allowed between replicas that use different
CVs. The second approach would be particularly suitable to explore
alternative local structure arrangements by using multiple CVSA with different SA strings.In conclusion, we showed
that, by enhancing the sampling of local
states from a structural alphabet, it is possible to recover the global
native state in MD simulations of small proteins. This finding is
robust against approximate definitions of local states. Moreover,
we showed how artificial sequences of SA states from libraries of
recurring SA motifs can be used to generate alternative conformations
of protein regions. Biasing the sampling of local states has a wide
range of potential applications, going from protein design to the
study of conformational changes based on large local rearrangements
such as hinged motions, secondary structure transitions, or loop remodelling.
Methods
Full details on methods, simulation setup and data analyses can
be found in the Supporting Information.
Structural
Alphabet and Structure Encoding
A SA is
a collection of prototypical backbone conformations adopted by short
fragments in protein structures, where each letter represents a fragment
conformational state. In this work, we used the M32K25 SA, composed
of 25 representative fragments of four consecutive Cα atoms.[30] A protein structure can be encoded
into a SA string by progressively labeling each overlapping four-residue
fragment (i.e., residues 1–4 for fragment 1, 2–5 for
fragment 2, and so on) from the N-term to the C-term of the protein
with a SA letter (A–Y), so that conformation of a protein of N residues is encoded into a structural string of length N – 3. In this work, the labeling of a fragment is
performed by identifying the SA letter that has the minimum root-mean-square
deviation (RMSD) from the fragment (local-fit encoding). No information
from adjacent fragments is used, so that letters encoding consecutive
fragments are assigned independently from each other, allowing for
a nonexact match in the overlapping region.Local macrostates
were defined by clustering the 25 letters of M32K25 (Supporting Information). The representatives of the six resulting
clusters define a reduced version of the M32K25 SA (rSA).
CVSA Definition
CVSA is defined
as the number of four-residue fragments f in the protein with Cα RMSD ρ
from a preassigned SA letter X (reference state) within a given cutoff ρ0:Each term in the sum is a differentiable function[13] switching from 1 (ρ ≪ ρ0, f very close
to the reference state X) to 0 (ρ ≫ ρ0, f very far from X), where n and m are user-defined parameters that modulate the switching rate. It
follows that the maximum value that a CVSA can adopt (CVSAmax) corresponds to the number of fragments used
for its definition. The RMSD ρ is calculated on the positions
of Cα atoms. The CVSA was implemented
in a modified version of PLUMED[10] 1.3.
The resulting user interface for the CVSA definition is
flexible, and any number of fragments can be used, spanning the entire
protein structure or parts of it. The sequence of X letters defines the CVSA reference string. The CVSA section of sample PLUMED input
files is reported in the Supporting Information for the GB1 β-hairpin (Appendix S1) and TR464 (Appendix S2).
Different criteria can be used for the string assignment, as it is
shown in Results. The CVSA can
be used in combination with any CV-based enhanced sampling method
implemented in PLUMED. The patch files used to implement the CVSA can be downloaded from https://afornililab.wordpress.com/software or http://people.brunel.ac.uk/~csstaap2/software.html.
MD Simulations
The GROMACS 4.5.5 program[67] was used to prepare the initial system coordinates
and to run the simulations. The Amber99SB*-ILDN[68] force field was used for all the simulations. Enhanced
sampling simulations were performed by coupling GROMACS with PLUMED-1.3.[10] In the metadynamics folding simulations, a CV
describing a global property was used in addition to the CVSA. In SMD simulations, the steering was performed with harmonic restraints
moving at constant velocity.
Tools for Trajectory Postprocessing and Analysis
MD
trajectories were analyzed with the GROMACS 4.5.5 tools[67] and with in-house scripts in R (available from https://afornililab.wordpress.com/software or http://people.brunel.ac.uk/~csstaap2/software.html). The bio3d[69] R package was used for
coordinate manipulation and for the analysis of the ArchDB database.[44] The encoding of experimental structures and
MD trajectories with the M32K25 SA was performed with GSATools.[34] Structures from the blind loop refinement of
the TR464 CASP target were independently rescored with Rosetta[45−47] (ver. 3.3).
Authors: Sander Pronk; Szilárd Páll; Roland Schulz; Per Larsson; Pär Bjelkmar; Rossen Apostolov; Michael R Shirts; Jeremy C Smith; Peter M Kasson; David van der Spoel; Berk Hess; Erik Lindahl Journal: Bioinformatics Date: 2013-02-13 Impact factor: 6.937