Yinglong Miao1, Victoria A Feher1, J Andrew McCammon1. 1. Howard Hughes Medical Institute, Department of Chemistry and Biochemistry, and Department of Pharmacology, University of California at San Diego , La Jolla, California 92093, United States.
Abstract
A Gaussian accelerated molecular dynamics (GaMD) approach for simultaneous enhanced sampling and free energy calculation of biomolecules is presented. By constructing a boost potential that follows Gaussian distribution, accurate reweighting of the GaMD simulations is achieved using cumulant expansion to the second order. Here, GaMD is demonstrated on three biomolecular model systems: alanine dipeptide, chignolin folding, and ligand binding to the T4-lysozyme. Without the need to set predefined reaction coordinates, GaMD enables unconstrained enhanced sampling of these biomolecules. Furthermore, the free energy profiles obtained from reweighting of the GaMD simulations allow us to identify distinct low-energy states of the biomolecules and characterize the protein-folding and ligand-binding pathways quantitatively.
A Gaussian accelerated molecular dynamics (GaMD) approach for simultaneous enhanced sampling and free energy calculation of biomolecules is presented. By constructing a boost potential that follows Gaussian distribution, accurate reweighting of the GaMD simulations is achieved using cumulant expansion to the second order. Here, GaMD is demonstrated on three biomolecular model systems: alanine dipeptide, chignolin folding, and ligand binding to the T4-lysozyme. Without the need to set predefined reaction coordinates, GaMD enables unconstrained enhanced sampling of these biomolecules. Furthermore, the free energy profiles obtained from reweighting of the GaMD simulations allow us to identify distinct low-energy states of the biomolecules and characterize the protein-folding and ligand-binding pathways quantitatively.
Dynamic
biomolecules often undergo large-scale structural changes
and visit distinct conformational states during their biological function.[1] It is of great biological and pharmaceutical
interest to characterize their structures and conformational transition
pathways. Ideally, detailed free energy landscapes are sought to understand
the functional mechanisms of biomolecules in a quantitative manner.
However, due to large energy barriers, conformational transitions
of biomolecules usually take place on time scales of milliseconds
or even longer.[1a,2] This has presented a grand challenge
for computational molecular dynamics (MD) simulations that are limited
to typically hundreds-of-nanoseconds to tens-of-microseconds.[3]To address this challenge, biasing simulation
methods have been
found useful in enhanced sampling and free energy calculation of the
biomolecules. These methods include umbrella sampling,[4] conformational flooding,[5] metadynamics,[6] adaptive biasing force (ABF) calculations,[7] and orthogonal space sampling,[8] etc. During the simulations, a potential or force bias
is applied along certain reaction coordinates (or collective variables)
to facilitate the biomolecular conformational transitions across high-energy
barriers. Typical reaction coordinates include atom distances, torsional
dihedrals, root-mean-square deviation (RMSD) relative to a reference
configuration, eigenvectors generated from the principal component
analysis,[5b] and so on. The definition of
the reaction coordinates, however, often requires expert knowledge
of the studied systems. Furthermore, the predefined reaction coordinates
largely place constraints on the pathway and conformational space
to be sampled during the biasing simulations. It often leads to slow
convergence of the simulations when important reaction coordinates
are missed during the simulation setup.[6b]Accelerated molecular dynamics (aMD) is an enhanced sampling
technique
that works often by adding a non-negative boost potential to smoothen
the biomolecular potential energy surface. The boost potential, ΔV, decreases the energy barriers and thus accelerates transitions
between the different low-energy states.[9] With this, aMD is able to sample distinct biomolecular conformations
and rare barrier-crossing events that are not accessible to conventional
MD (cMD) simulations. Unlike the previously mentioned biasing simulation
methods, aMD does not require predefined reaction coordinate(s), which
can be advantageous for exploring the biomolecular conformational
space without a priori knowledge or restraints. aMD has been successfully
applied to a number of biological systems, (10) and hundreds-of-nanosecond aMD simulations have been shown to capture
millisecond-time-scale events in both globular and membrane proteins.[11]While aMD has been demonstrated to greatly
enhance conformational
sampling of biomolecules, it suffers from large energetic noise during
reweighting.[12] The boost potential applied
in aMD simulations is typically on the order of tens-to-hundreds of
kilocalories per mole, which is much greater in magnitude and wider
in distribution than that of other biasing simulation methods that
make use of predefined reaction coordinates (e.g., several kilocalories
per mole). It has been a long-standing problem to accurately reweight
aMD simulations and recover the original free energy landscapes, especially
for large proteins.[13] Our recent studies
showed that when the boost potential follows near-Gaussian distribution,
cumulant expansion to the second order provides improved reweighting
of aMD simulations compared with the previously used exponential average
and Maclaurin series expansion reweighting methods.[14] The reweighted free energy profiles are in good agreement
with the long-time-scale cMD simulations as demonstrated on alaninedipeptide and fast-folding proteins.[15] However,
such improvement is limited to rather small systems (e.g., proteins
with less than ∼35 amino acid residues).[15] In simulations of larger systems, the boost potential exhibits
significantly wider distribution and does not allow for accurate reweighting.Here, a Gaussian accelerated molecular dynamics (GaMD) approach
is presented to reduce the energetic noise for simultaneous unstrained
enhanced sampling and free energy calculation of biomolecules, even
for large proteins. GaMD makes use of harmonic functions to construct
the boost potential that is adaptively added to the biomolecular potential
energy surface. A minimal set of simulation parameters is dynamically
adjusted to control the magnitude and distribution width of the boost
potential. As such, the resulting boost potential follows Gaussian
distribution and allows for accurate reweighting of the simulations
using cumulant expansion to the second order. In this study, GaMD
is demonstrated on unconstrained simulations of the alanine dipeptide,
chignolin folding, and ligand binding to the T4-lysozyme.
Methods
Gaussian Accelerated
Molecular Dynamics
Gaussian accelerated
molecular dynamics enhances the conformational sampling of biomolecules
by adding a harmonic boost potential to smoothen the system potential
energy surface (Figure ). Consider a system with N atoms at positions = {1,...,} . When the
system potential V() is lower than
a threshold energy, E, a boost potential is added
aswhere k is
the harmonic force
constant. The modified system potential, V*() = V() + ΔV(), is given
byOtherwise, when the system potential
is above
the threshold energy, i.e., V() ≥ E, the boost
potential is set to zero and V*() = V().
Figure 1
Scheme illustration of Gaussian accelerated molecular dynamics
(GaMD). When the threshold energy is set to the maximum potential
(E = Vmax), the system
potential energy surface is smoothened by adding a harmonic boost
potential that follows Gaussian distribution. The coefficient k0 in the range of 0–1 determines the
magnitude of the applied boost potential. With greater k0, higher boost potential is added to the original energy
surface in conventional molecular dynamics (cMD), which provides enhanced
sampling of biomolecules across decreased energy barriers.
Scheme illustration of Gaussian accelerated molecular dynamics
(GaMD). When the threshold energy is set to the maximum potential
(E = Vmax), the system
potential energy surface is smoothened by adding a harmonic boost
potential that follows Gaussian distribution. The coefficient k0 in the range of 0–1 determines the
magnitude of the applied boost potential. With greater k0, higher boost potential is added to the original energy
surface in conventional molecular dynamics (cMD), which provides enhanced
sampling of biomolecules across decreased energy barriers.In order to smoothen the potential energy surface
for enhanced
sampling, the boost potential needs to satisfy the following criteria.
First, for any two arbitrary potential values V1() and V2() found on the
original energy surface,
if V1() < V2(), ΔV should be a
monotonic function that does not change the relative order of the
biased potential values; i.e., V1*() < V2*(). By replacing V*() with eq and isolating E, we then obtainSecond,
if V1() < V2(), the potential
difference observed on
the smoothened energy surface should be smaller than that of the original;
i.e., V2*() – V1*() < V2() – V1(). Similarly,
by replacing V*() with eq , we can deriveWith Vmin ≤ V1() < V2() ≤ Vmax, we need to set the threshold energy E in the following
range by combining eqs and 4:where Vmin and Vmax are the system
minimum and maximum potential
energies. To ensure that eq is valid, Vmax ≤ Vmin + (1/k) and k has to satisfy the following:We define k ≡ k0(1/(Vmax – Vmin)); then 0 < k0 ≤ 1. As illustrated in Figure , k0 determines the magnitude
of the applied boost potential. With greater k0, higher boost potential is added to the potential energy
surface, which provides enhanced sampling of biomolecules across decreased
energy barriers.Third, the standard deviation of ΔV needs
to be small enough (i.e., narrow distribution) to ensure accurate
reweighting using cumulant expansion to the second order:[14]where Vav and
σ are the average and standard
deviation of the system potential energies and σΔ is the standard deviation of ΔV with σ0 as a user-specified upper limit (e.g.,
10kBT) for accurate reweighting.Provided eq gives
the range of threshold energy E, when E is set to the lower bound E = Vmax, we plug in E and k and obtainWe define the right-hand side in eq as k0′ = (σ0/σ)((Vmax – Vmin)/(Vmax – Vav)). For efficient
enhanced sampling with the highest possible acceleration, k0 can then be set to its upper bound asThe greater σΔ that is obtained from the original potential energy surface
(particularly for large biomolecules), the smaller k0 may be applicable to allow for accurate reweighting.
Alternatively, when the threshold energy E is set
to its upper bound E = Vmin + (1/k) according to eq , we plug in E and k in eq and
obtainWe define the
right-hand side in eq as k0″ ≡
(1 – (σ0/σ)) ((Vmax – Vmin)/(Vav – Vmin)). When k0″ ≤ 0 with σ ≤ σ0, k0 can be theoretically set to any value between zero and
1, although k0 = 1 is applied in this
case for the current implementation of GaMD (See Appendix A). Note that a smaller k0 will give higher threshold energy E, but smaller
force constant k. When 0 < k0″ ≤
1, k0 can be set to either k0″ for
the highest threshold energy E or its upper bound
1.0 for the greatest force constant k. In this case k0 = k0″ is applied in the current GaMD
implementation. When k0″ > 1, we have to lower the
threshold
energy E to ensure that 0 < k0 ≤ 1; e.g., E = Vmax for default k0 = 1.Given E and k0, we
can calculate the boost potential asSimilar
to aMD, GaMD provides
options to add only the total potential boost ΔVP, only dihedral potential boost ΔVD, or the dual potential boost (both ΔVP and ΔVD). The dual-boost
simulation generally provides higher acceleration than the other two
types of simulations for enhanced sampling.[9b] The simulation parameters comprise of the threshold energy values
and the effective harmonic force constants, k0P and k0D for the total and dihedral
potential boost, respectively.To characterize the extent to
which ΔV follows
Gaussian distribution, its distribution anharmonicity, γ, is
calculated as was done previously:[14]where ΔV is
dimensionless
as divided by kBT with kB and T being the Boltzmann
constant and system temperature, respectively, and Smax = (1/2) ln(2πeσΔ2) is the maximum entropy of ΔV.[16] When γ is zero, ΔV follows exact Gaussian distribution with sufficient sampling. Reweighting
by approximating the exponential average term with cumulant expansion
to the second order is able to accurately recover the original free
energy landscape (Appendix B). As γ
increases, the ΔV distribution becomes less
harmonic and the reweighted free energy profile obtained from cumulant
expansion to the second order deviates from the original. As demonstrated
on the alanine dipeptide, chignolin, and T4-lysozyme in this study,
a Gaussian distribution of the boost potential is normally achieved
in the GaMD simulations. The anharmonicity of ΔV distribution serves as an indicator of the enhanced sampling convergence
and accuracy of the reweighted free energy.
Simulation Protocols
GaMD is currently implemented
in the GPU version of AMBER 12[17] (see Appendix A for implementation details). Simulations
of the alanine dipeptide, chignolin, and T4-lysozyme were performed
using the AMBER ff99SB force field on GPUs.[18] The simulated systems were built using the Xleap module of the AMBER
package. Alanine dipeptide and chignolin were constructed as described
previously.[19] For T4-lysozyme, the ligand
benzene was removed from the X-ray crystal structure of the Leu99Ala
mutant (Protein Data Bank (PDB), 181L). Another four benzene molecules were
placed in the bulk solvent at least 40 Å away from the ligand-binding
site in the starting configuration. By solvating the structures in
a TIP3P[20] water box that extends 8–10
Å from the solute surface, the alanine dipeptide system contained
630 water molecules, 2,211 waters for chignolin, and 9,011 waters
for T4-lysozyme. The total number of atoms in the three systems are
1,912; 6,773; and 29,692 for alanine dipeptide, chignolin, and T4-lysozyme,
respectively (Table ).
Table 1
Biomolecular Model Systems Simulated
with GaMD in the Present Study (Nres and Natoms, Number of Residues and System Atoms,
Respectively; ΔVav and σΔ, Average, Standard Deviation of the
Boost Potential, ΔV)
system
Nres
Natoms
GaMD
ΔVav (kcal/mol)
σΔV (kcal/mol)
alanine dipeptide
2
1,912
30 ns × 3
10.9
2.9
chignolin
10
6,773
300 ns × 3
9.8
2.4
T4-lysozyme
162
29,692
800 ns × 4
36.3
4.7
1800 ns
Periodic boundary conditions were
applied for all of the simulation
systems. Bonds containing hydrogen atoms were restrained with the
SHAKE algorithm,[21] and a 2 fs time step
was used. Weak coupling to an external temperature and pressure bath
was used to control both temperature and pressure.[22] The electrostatic interactions were calculated using the
PME (particle mesh Ewald summation)[23] with
a cutoff of 8.0 Å for long-range interactions.The three
systems were initially minimized for 2,000 steps using
the conjugate gradient minimization algorithm, and then the solvent
was equilibrated for 50 ps in an isothermal–isobaric (NPT) ensemble with the solute atoms fixed. Another minimization
was performed with all atoms free, and the systems were slowly heated
to 300 K over 500 ps. Final system equilibration was achieved by a
200 ps isothermal–isovolumetric (NVT) and
400 ps NPT run to ensure that the water box of the
simulated systems had reached the appropriate density.In the
present study, the system threshold energy is set as E = Vmax for all GaMD simulations.
The maximum, minimum, average, and standard deviation values of the
system potential (Vmax, Vmin, Vav, and σ) were obtained from an initial 2 ns NVT simulation with no boost potential. For optimal acceleration,
the greatest σ0 and k0 were determined through short testing simulations (e.g., 2 ns):
with increasing σ0, either k0 calculated using eq reaches 1.0 (highest acceleration level) or σΔ reaches 10kBT (the upper limit of the ΔV distribution
width for accurate reweighting). Each GaMD simulation proceeds with
a 2 ns equilibration run, followed by production simulations.Testing simulations were performed with either the “total
potential boost” or “dihedral potential boost”
only to identify the optimal σ0 and k0 acceleration parameters (Supporting
Information Tables S1–S3). Long-time dual-boost GaMD
simulations were then obtained for analysis, including three independent
30 ns simulations of alanine dipeptide and three independent 300 ns
simulations of chignolin. For the T4-lysozyme, five independent 800
ns simulations were initially performed. Complete binding of benzene
to the target ligand-binding site was observed in one of the five
simulations. Even when the simulation was extended to 1,800 ns, benzene
remained tightly bound in the ligand-binding cavity. The simulation
frames were saved every 0.1 ps for analysis. A summary of the simulations
is listed in Table .
Simulation Analysis
Time courses of dihedral angles,
RMSD, radius of gyration (Rg), and residue
distances in the AMBER simulation trajectories were analyzed using
the cpptraj tool.[24] Particularly,
the backbone dihedral angles Φ and Ψ were calculated for
alanine dipeptide (Figure A). For chignolin, the Rg and
RMSD of simulation frames relative to the PDB native structure (Figure A) were calculated
for the protein Cα atoms with the terminal residues
Gly1 and Gly10 excluded. For ligand binding to the T4-lysozyme (Figure A), the symmetry-corrected
RMSD of benzene was obtained by generating six symmetrically imaged
reference benzene configurations from the 181L crystal structure, calculating the RMSDs
for the diffusing benzene molecules in each frame after aligning the
protein C-terminal domain (residues 80–160) and then extracting
the minimum value of the calculated RMSDs. Moreover, the number of
protein atoms found within 5 Å of benzene (Ncontact, only heavy atoms are considered) was calculated
using pbwithin in VMD that accounts for the periodic
boundary conditions.[25]
Figure 2
(A) Scheme
representation of backbone dihedrals Φ and Ψ
in alanine dipeptide. (B) Distribution of the boost potential ΔV applied in the GaMD simulation. (C) 2D potential of mean
force (PMF) of backbone dihedrals (Φ, Ψ) calculated from
three 30 ns GaMD simulations combined using cumulant expansion to
the second order. The low-energy wells are labeled corresponding to
the right-handed α helix (αR), left-handed
α helix (αL), β-sheet (β), and
polyproline II (PII) conformations. (D) Distribution anharmonicity
of ΔV of frames found in each bin of the PMF
profile.
Figure 3
Folding of chignolin captured in GaMD simulations: (A) comparison
of simulation-folded chignolin (blue) with the PDB (1UAO) native structure
(red) that exhibits 0.2 Å RMSD, (B) distribution of the boost
potential ΔV, (C) 2D (RMSD, Rg) PMF calculated by reweighting the three 300 ns GaMD
simulations combined, and (D) the distribution anharmonicity of ΔV of frames found in each bin of the PMF profile.
Figure 5
Binding of benzene to
the Leu99Ala T4-lysozyme simulated via GaMD:
(A) comparison of simulation-derived complex structure that captures
benzene binding (blue) with 0.1 Å ligand RMSD relative to the
181L PDB structure (red), (B) distribution of the boost potential
ΔV, (C) 2D (ligand RMSD, Ncontact) PMF calculated by reweighting the 1,800 ns GaMD
simulation, and (D) distribution anharmonicity of ΔV of frames found in each bin of the free energy profile.
The PyReweighting toolkit[14] was used
to reweight the GaMD simulations for calculating the PMF profiles
and to examine the boost potential distributions. Two-dimensional
(2D) PMF profiles were computed for backbone dihedrals (Φ, Ψ)
in alanine dipeptide. A bin size of 6° is selected to balance
between reducing the anharmonicity and increasing the bin resolution
as discussed earlier.[14] Two-dimensional
PMF profiles were also constructed using (RMSD, Rg) for chignolin with a bin size of (1.0 Å, 1.0 Å).
For benzene binding to the T4-lysozyme, 2D PMF was constructed using
(ligand RMSD, Ncontact) with a bin size
of (1.0 Å, 5). When the number of simulation frames within a
bin is lower than a certain limit (i.e., cutoff), the bin is not sufficiently
sampled and thus is excluded for reweighting. The cutoff can be determined
by iteratively increasing it until the minimum position of the PMF
profile does not change.[14] The final cutoff
was set as 10, 50, and 1000 for reweighting of GaMD simulations on
alanine dipeptide, chignolin, and T4-lysozyme, respectively.
Results
GaMD provides enhanced sampling of conformational transitions in
alanine dipeptide, chignolin folding, and ligand binding to the T4-lysozyme.
Furthermore, the boost potential applied in the present GaMD simulations
follows Gaussian distribution, which allows accurate reweighting using
cumulant expansion to the second order and recovery of the original
biomolecular free energy landscapes, even for proteins as large as
the T4-lysozyme. Notably, hundreds-of-nanoseconds GaMD simulations
are able to capture complete folding of chignolin and ligand binding
of benzene to the T4-lysozyme that take place on significantly longer
time scales.
Alanine Dipeptide
In order to balance between achieving
the highest acceleration (large ΔV) and ensuring
accurate reweighting (small enough standard deviation, σΔ), short GaMD simulations (2 ns) were
tested on alanine dipeptide to search for optimal acceleration parameters.
For the total potential boost, when σ0P was adjusted
from 1.0 to 2.0, the resulting k0P calculated
using eq was increased
from 0.21 to 1.0, accompanied by increases of the ΔVP standard deviation from 1.03 to 1.75 kcal/mol and the
ΔVP average from 2.36 to 3.85 kcal/mol
(see Supporting Information Table S1a).
After k0P reached its maximum 1.0, slight
changes were observed when σ0P was further increased
from 2.0 to 3.0. For the dihedral potential boost, the calculated k0D reached 1.0 when σ0D was
increased to 3.0 (Supporting Information Table S1b). Therefore, (σ0P,σ0D) were set to (3.0, 3.0) for the production dual-boost GaMD simulations
of alanine dipeptide, which enables the highest acceleration as well
as accurate reweighting with both k0P and k0D equal to 1.0 (Supporting
Information Table S1c).As shown in Figure B, the boost potential ΔV applied in GaMD
simulation of the alanine dipeptide follows Gaussian distribution.
Its distribution anharmonicity is small enough to be 1.69 × 10–3. The average and standard deviation of ΔV are 10.9 and 2.9 kcal/mol, respectively (Table ). Figure C plots the 2D PMF of backbone dihedrals
(Φ, Ψ) that are obtained by reweighting the three 30 ns
GaMD simulations combined using cumulant expansion to the second order.
The reweighted PMF is able to recover five energy minimum wells that
are centered around (−162°, −12°) and (−72°,
−12°) for the right-handed α helix (αR), (48°, 18°) for the left-handed α helix
(αL), (−156°, 162°) for the β-sheet,
and (−66°, 156°) for the polyproline II (PII) conformation. The corresponding minimum free energies are estimated
as 0, 1.40, 1.07, 2.94, and 4.27 kcal/mol, respectively. This is in
good agreement with the PMF results of an exceptionally long 1000
ns cMD simulation as presented earlier.[14] Additionally, the distribution anharmonicity of ΔVof frames clustered in each bin of the 2D PMF exhibits values smaller
than 0.10 in the low-energy regions (Figure D), which suggests sufficient sampling for
reweighting using cumulant expansion to the second order. Greater
anharmonicity is found in high-energy regions (especially the energy
barriers), for which low boost potential is applied and less sampling
is normally achieved. Therefore, the anharmonicity of ΔV distribution appears to be a good indicator for sufficiency
of the enhanced sampling and accuracy of the reweighted free energy.(A) Scheme
representation of backbone dihedrals Φ and Ψ
in alanine dipeptide. (B) Distribution of the boost potential ΔV applied in the GaMD simulation. (C) 2D potential of mean
force (PMF) of backbone dihedrals (Φ, Ψ) calculated from
three 30 ns GaMD simulations combined using cumulant expansion to
the second order. The low-energy wells are labeled corresponding to
the right-handed α helix (αR), left-handed
α helix (αL), β-sheet (β), and
polyproline II (PII) conformations. (D) Distribution anharmonicity
of ΔV of frames found in each bin of the PMF
profile.
Folding of Chignolin
For chignolin, short 2 ns testing
GaMD simulations showed that k0P reached
the 1.0 maximum for the highest total potential boost when σ0P was increased to 3.0 (Supporting Information Table S2a). For the dihedral potential boost, the simulation crashed
when σ0D was increased to 0.9 and stable simulation
was achieved only when σ0D was increased to 0.6 (Supporting Information Table S2b). Thus, (σ0P,σ0D) were set to (3.0, 0.6) for the production
dual-boost GaMD simulations of chignolin (Supporting
Information Table S2c). The resulting boost potential follows
Gaussian distribution and the calculated γ = 9.22 × 10–3 (Figure B). The average
and standard deviation of ΔV are 9.8 and 2.4
kcal/mol, respectively (Table ).Folding of chignolin captured in GaMD simulations: (A) comparison
of simulation-folded chignolin (blue) with the PDB (1UAO) native structure
(red) that exhibits 0.2 Å RMSD, (B) distribution of the boost
potential ΔV, (C) 2D (RMSD, Rg) PMF calculated by reweighting the three 300 ns GaMD
simulations combined, and (D) the distribution anharmonicity of ΔV of frames found in each bin of the PMF profile.Started from an extended conformation,
GaMD simulations were able
to capture complete folding of chignolin into its native structure
(Supporting Information Movie S1). The
RMSD obtained between the simulation-folded chignolin and NMR experimental
native structure (PDB, 1UAO) reaches a minimum of 0.2 Å (Figure A). Using the protein RMSD
relative to the PDB native structure and Rg, a 2D PMF profile was calculated by reweighting the three 300 ns
GaMD simulations combined (Figure C). The reweighted PMF allows us to identify three
distinct low-energy conformational states: the folded (“F”),
unfolded (“U”), and an intermediate (“I”).
The folded state corresponds to the global energy minimum at (1.0
Å, 4.0 Å), the unfolded state is 3.68 kcal/mol higher in
a local-energy well centered at (6.0 Å, 7.0 Å), and the
intermediate in a 3.06 kcal/mol free energy well centered at (4.0
Å, 5.5 Å). The energy barrier for chignolin folding between
the unfolded and folded states is ∼4.0 kcal/mol, and ∼3.5
kcal/mol for transitions between the intermediate and folded states
(Figure C). In comparison,
when the three 300 ns GaMD simulations were analyzed separately, the
reweighted PMF profiles exhibit significant differences (Supporting Information Figure S2). Whereas the
three folded, intermediate, and unfolded low-energy states were captured
in sim2, the unfolded and intermediate states were not sufficiently
sampled (no converged low-energy wells) in sim1 and sim3, respectively.
Improved sampling was obtained by combining the three independent
GaMD simulations. Figure D plots the distribution anharmonicity of ΔV for frames found in each bin of the 2D PMF as shown in Figure C. The anharmonicity
exhibits values smaller than 0.05 in the simulation sampled conformational
space, suggesting that ΔV achieves sufficient
sampling for reweighting using cumulant expansion to the second order.Compared with the native structure, the unfolded chignolin exhibits
an extended conformation without proper secondary structure formed
in the protein backbone as shown in Figure A. The intermediate conformation observed
during folding of chignolin is characterized by hydrophobic interactions
between the Pro4 and Trp9 residue side chains and a turn at Thr8.
Such conformation was also observed in previous microsecond-time-scale
cMD simulations.[26] In the simulation-derived
folded state, the protein residue side chains exhibit closely similar
conformations as in the NMR native structure (Figure C). Residues Tyr2 and Pro4 form hydrophobic
interactions between their side chains. Hydrophilic residues Asp3,
Glu5, Thr6, and Thr8 expose side chains to the solvent.
Figure 4
Structures
of chignolin observed during GaMD simulations corresponding
to the (A) unfolded (“U”), (B) intermediate (“I”),
and (C) folded (“F”) states (blue) aligned to the PDB
native structure (red). Residues including Tyr2, Asp3, Pro4, Glu5,
Thr6, Thr8, and Trp9 are represented by sticks. Notably, Trp9 and
Pro4 form hydrophobic interactions in the intermediate state.
Structures
of chignolin observed during GaMD simulations corresponding
to the (A) unfolded (“U”), (B) intermediate (“I”),
and (C) folded (“F”) states (blue) aligned to the PDB
native structure (red). Residues including Tyr2, Asp3, Pro4, Glu5,
Thr6, Thr8, and Trp9 are represented by sticks. Notably, Trp9 and
Pro4 form hydrophobic interactions in the intermediate state.Finally, a 2D free energy profile
of chignolin was calculated using
the RMSD of the protein Cα atoms relative to the
native PDB structure and the backbone dihedral Ψ of residue
Gly7 (Supporting Information Figure S3).
The three independent 300 ns GaMD simulations were combined for the
reweighting. This allows for identification of a misfolded (“M”)
low-energy state, in addition to the F, I, and U states as observed
in Figure C. The M
conformation of chignolin shows ∼180° rotation of the
C-terminal strand about its long axis relative to the native PDB structure.
The free energy of the misfolded chignolin is found to be 1.89 kcal/mol
greater than that of the folded state. This is consistent with previous
findings that the misfolded chignolin was observed with higher probability
in simulations with the AMBER ff99SB force field,[27] which is used in the present study.
Benzene Binding to T4-Lysozyme
Short testing GaMD simulations
of the T4-lysozyme showed that k0P reached
the 1.0 maximum for the highest total potential boost when σ0P was increased to 3.0 (Supporting Information Table S3a). For the dihedral potential boost, the simulation crashed
when σ0D was increased to 5.0 (Supporting Information Table S3b). Stable simulation was achieved
when σ0D was increased to 4.0 and the calculated k0D was 0.35. With further testing for stable
dual-boost GaMD simulation, (σ0P,σ0D) were set to (3.0, 4.0) for the final production simulations of
the T4-lysozyme (Supporting Information Table S3c).GaMD captured complete binding of benzene to the
deeply buried ligand-binding cavity in the Leu99Ala T4-lysozyme within
∼100 ns in one of the five independent 800 ns simulations (Supporting Information Movie S2). Benzene remained
bound in the ligand-binding site even when the simulation was extended
to 1,800 ns. By aligning the C-terminal domain (residues 80–160)
of the T4-lysozyme, the RMSD of the diffusing benzene molecules relative
to the bound pose in the 181L X-ray crystal structure reaches a minimum of 0.1 Å
(Figure A). The boost potential applied during the 1,800 ns
GaMD simulation follows Gaussian distribution, and its γ = 1.39
× 10–3 (Figure B). The average and standard deviation of ΔV are 36.5 and 4.7 kcal/mol, respectively (Table ). Although the ΔV average values exhibit variations between five independent
simulations, the ΔV standard deviations are
closely similar to each other provided that (σ0P,σ0D) were set to (3.0, 4.0) (Supporting
Information Table S3c). Using the RMSD of benzene relative
to the bound pose and the number of protein heavy atoms that are within
5 Å of benzene (Ncontact), a 2D PMF
profile was calculated by reweighting the 1,800 ns GaMD simulation
(Figure C). The reweighted PMF allows us to identify three
distinct low-energy states: the unbound (“U”), intermediate
(“I”), and bound (“B”) states. The bound
state corresponds to the global energy minimum located at ∼(0
Å, 30), the unbound state in a local-energy well centered at
∼(33 Å, 0), and the intermediate centered at ∼(11
Å, 20). It is important to note that since the complete binding
of benzene to the target ligand-binding site was observed only once,
the calculated binding free energy between the bound and unbound states
is subject to the error of limited sampling. Nevertheless, benzene
visits the intermediate site many times during the 1800 ns GaMD simulation
with the ligand RMSD decreased to ∼11 Å (Supporting Information Figure S2A). Repeated sampling of the
intermediate state was observed in the other four 800 ns GaMD simulations
as well (Supporting Information Figure
S2), for which a local-energy well appears around (11.0 Å, 20)
in the 2D PMF profiles (Supporting Information Figure S3). The relative free energy between the intermediate and
unbound states is estimated to be 0.53 ± 0.46 kcal/mol from PMF
profiles of the five GaMD simulations. Furthermore, benzene was observed
to bind another intermediate 2 (“I2”) site that is located
in the pocket formed by the hinge αC helix and the
αB helix from the N-terminal domain (Supporting Information Figure S2). A corresponding
local energy well of the I2 state appears in the calculated 2D PMF
profiles. Figure D
plots the ΔV distribution anharmonicity, γ,
for frames found in each bin of the 2D PMF. It exhibits relatively
large values in the high-energy regions (less sampling), notably the
energy barrier between the intermediate and bound states. The ligand
entry from the intermediate to the bound state is thus suggested to
be the rate-limiting step for benzene binding. In comparison, γ
exhibits values smaller than 0.01 in the energy well regions, suggesting
that ΔV achieves sufficient sampling for reweighting
using cumulant expansion to the second order.Binding of benzene to
the Leu99Ala T4-lysozyme simulated via GaMD:
(A) comparison of simulation-derived complex structure that captures
benzene binding (blue) with 0.1 Å ligand RMSD relative to the
181L PDB structure (red), (B) distribution of the boost potential
ΔV, (C) 2D (ligand RMSD, Ncontact) PMF calculated by reweighting the 1,800 ns GaMD
simulation, and (D) distribution anharmonicity of ΔV of frames found in each bin of the free energy profile.A complete binding pathway of benzene that was
observed in the
GaMD simulation is shown in Figure A. Benzene diffuses from the bulk solvent to the protein
surface formed by the αD and αG helices
and then to the target ligand-binding site in the protein C-terminal
domain. Figure B depicts
the unbound pose of a benzene molecule that is located far away from
the ligand-binding site in the 181L X-ray crystal structure. In the intermediate
state, benzene interacts with residues Lys83, Pro86, and Val87 from
the αD helix and the Thr115, Thr119, and Gln122 residues
from the αG helix (Figure C). In the bound pose, benzene is superimposable
with the ligand cocrystallized in the 181L crystal structure. It forms hydrophobic
interactions with residues Ile78, Leu84, Tyr88, Val87, Leu91, Val111,
Leu118, and Leu121 in the deeply buried protein cavity (Figure D). Supporting
Information Figure S5 shows a transient snapshot observed during
benzene binding between the intermediate and bound poses. Benzene
appears to slide into the ligand-binding cavity, interacting with
residues Gln81, Lys83, Leu84, Val111, Phe114, Thr115, and Leu118 from
the αD, αF, and αG helices.
Figure 6
(A) Pathway of benzene binding to the T4-lysozyme observed during
the GaMD simulation. (B–D) The (B) unbound (“U”),
(C) intermediate (“I”), and (D) bound (“B”)
poses of the protein–ligand complex (blue) with the protein
C-terminal domain (residues 80–160) aligned to the PDB native
structure (red). The protein and benzene are represented by ribbons
and spheres, respectively, and they are colored by blue for the simulation
structure while red for the PDB native structure, except that in A
the simulated benzene is represented by lines and colored by simulation
time in a BWR color scale. Residues with heavy atoms found within
3 Å of benzene are represented by sticks.
(A) Pathway of benzene binding to the T4-lysozyme observed during
the GaMD simulation. (B–D) The (B) unbound (“U”),
(C) intermediate (“I”), and (D) bound (“B”)
poses of the protein–ligand complex (blue) with the protein
C-terminal domain (residues 80–160) aligned to the PDB native
structure (red). The protein and benzene are represented by ribbons
and spheres, respectively, and they are colored by blue for the simulation
structure while red for the PDB native structure, except that in A
the simulated benzene is represented by lines and colored by simulation
time in a BWR color scale. Residues with heavy atoms found within
3 Å of benzene are represented by sticks.
Discussion
By adaptively adding a harmonic boost potential
to smoothen the
system energy surface, GaMD provides both unconstrained enhanced sampling
and free energy calculation of biomolecules. Important statistical
properties of the system potential, such as the average, maximum,
minimum, and standard deviation values, are used to calculate the
simulation acceleration parameters, particularly the threshold energy, E, and force constant, k0.For alanine dipeptide, both k0P and k0D can be increased to the 1.0 maximum for the
greatest possible boost of the total and dihedral potential energies.
The resulting standard deviation of ΔV that
follows Gaussian distribution is 2.9 kcal/mol, which allows for accurate
reweighting using cumulant expansion to the second order. Notably,
high-energy regions of the GaMD sampled free energy surface (especially
the energy barriers) exhibit increased anharmonicity compared with
the low-energy wells (Figure D). Thus, the free energy barriers appear unconverged and
still suffer from insufficient sampling during the three short 30
ns GaMD simulations. Nevertheless, the GaMD reweighted PMF profile
is able to recover low-energy states of the system that are in good
agreement with those observed in a 1000 ns cMD simulation.[14]Furthermore, GaMD is able to fold chignolin
from an extended conformation
to the NMR native structure in three independent 300 ns simulations.
While k0P can be increased to 1.0 for
the greatest total potential boost, k0D can reach only 0.15–0.17 for stable simulation. Nevertheless,
this seems to provide sufficient sampling of the chignolin-folding
process. Among the three independent simulations, although the folded,
intermediate, and unfolded low-energy states are captured in sim2,
the unfolded and intermediate states are not sufficiently sampled
in sim1 and sim3, respectively (Supporting Information Figure S2). More sufficient sampling is obtained by combining the
three GaMD simulations. All three folded, intermediate, and unfolded
states are identified from the resulting reweighted free energy profiles.
Particularly, the intermediate conformation was also observed in earlier
microsecond-time-scale cMD simulations.[26] GaMD appears to achieve better convergence than the previous aMD
simulations that were not able to distinguish between the intermediate
and unfolded states within the same simulation time.[28]Finally, GaMD captured complete binding of benzene
to the ligand-binding
site of the T4-lysozyme. In one of the five independent simulations,
benzene diffused into the deeply buried ligand-binding cavity within
∼100 ns and remained bound even when the simulation was extended
to 1800 ns. Similar to chignolin, while k0P can be increased to the 1.0 maximum for the greatest total potential
boost in the T4-lysozyme simulations, k0D can reach only 0.33–0.35 for the greatest dihedral boost.
The resulting ΔV standard deviation is ∼4.7
kcal/mol in the final dual-boost GaMD simulations. Such narrow distribution
of ΔV ensures accurate reweighting using cumulant
expansion to the second order. Distinct low-energy unbound, intermediate,
and bound states were identified from the reweighted PMF profiles.
The atomistic GaMD simulation also elucidates a highly detailed binding
pathway of benzene that diffuses from the bulk solvent to an intermediate
site located on the protein surface formed by the αD and αG helices and then slides into the target
ligand-binding cavity through a channel formed by the αD, αF, and αG helices. The
free energy difference between the intermediate and unbound states
was found to be small at 0.53 ± 0.46 kcal/mol as estimated from
the five independent GaMD simulations. Benzene repeatedly visits the
intermediate site on the protein surface. In comparison, the ligand
entry from protein surface to the deeply buried protein cavity appears
to be the rate-limiting step for complete benzene binding. It is important
to note that the complete ligand binding was not observed in the four
800 ns GaMD simulations, suggesting that the present GaMD simulations
still suffer from insufficient sampling of the ligand entry process
and the reweighted free energy profiles remain unconverged (Figure C and Supporting Information Figure S5). This is also
indicated by the increased anharmonicity corresponding to the free
energy barrier between the intermediate and bound states as shown
in Figure D. Nevertheless,
our GaMD simulation captured a binding pathway of benzene to the T4-lysozyme.
The ligand entry site is indeed adjacent to the mobile αF helix (residues 108–113), which has been suggested
earlier[29] based on the finding that the
αF helix exhibits increased B-factors in the Leu99Ala
complex structures compared to the apo structures.[30]Here, GaMD is compared with the original aMD, particularly
on the
performance of smoothening the potential energy surface and energetic
reweighting. Supporting Information Table
S4 summarizes the statistical properties of the boost potential and
the original and modified potential energies as obtained from aMD
and GaMD simulations of the alanine dipeptide, chignolin, and T4-lysozyme,
with reference to the cMD simulations. Specifically, dual-boost aMD
simulations of the alanine dipeptide and chignolin obtained from our
previous study[14] were used for comparison.
For the T4-lysozyme, restarting from the initial configuration of
GaMD simulations, a 200 ns dual-boost aMD simulation was performed
using the following acceleration parameters: Edihed = Vdihed_av + 4Nres, αdihed = 4Nres/5, Etotal = Vtotal_av + 0.2Natoms, and
αtotal = 0.2Natoms, where Vdihed_av and Vtotal_av are the average dihedral and total potential energies calculated
from a short 10 ns cMD simulation. As shown in Supporting Information Table S4, although higher average boost
potentials are applied in GaMD simulations of the three systems than
in the aMD simulations, the boost potentials exhibit smaller standard
deviations (i.e., narrower distribution) in the GaMD simulations,
except for the alanine dipeptide. Furthermore, anharmonicity of the
boost potential distribution is significantly reduced in the GaMD
simulations. Relative to cMD, both aMD and GaMD simulations mostly
exhibit smaller standard deviations in the modified dihedral and total
potential energies. Notably for the T4-lysozyme, standard deviations
of the modified potential energies are significantly smaller in the
GaMD simulations than in aMD. Thus, the potential energy surfaces
appear more smoothened in GaMD enhanced sampling of the T4-lysozyme.
Provided the narrower distribution and lower anharmonicity of the
boost potential (Supporting Information Table S4), GaMD allows for more accurate approximation of exponential
average reweighting factor using cumulant expansion to the second
order and thus improved free energy calculation, especially for the
T4-lysozyme.In the present simulations of ligand binding to
the T4-lysozyme,
complete binding of benzene was observed only once, largely because
the target ligand-binding site is deeply buried in the protein. However,
benzene binding to the intermediate site on the protein surface formed
by the αD and αG helices was captured
many times, which provides better statistics for free energy calculation.
It has been suggested that a PMF-based approach is appropriate to
calculate binding free energy for ligands (especially the charged)
that bind to the protein surface.[31] Systems
of this type for future applications of GaMD might include binding
of benzamidine to trypsin[32] and allosteric
modulators to the protein surface of the G-protein coupled receptors.[33]In comparison with many enhanced sampling
methods such as umbrella
sampling,[4] conformational flooding,[5] metadynamics,[6] ABF
calculations,[7] and orthogonal space sampling,[8] GaMD has the advantage of no need to set predefined
reaction coordinates. Metadynamics, in particular, is another potential
biasing technique that has been widely used to map the free energy
landscapes of biomolecules such as protein conformational changes[34] and protein–ligand binding.[6b,35] By monitoring the energy surface of biomolecules during the simulation,
metadynamics keeps adding small Gaussians of potential energies to
the low-energy regions. This will eventually fill the low-energy wells
and achieve uniform sampling of the free energy surface along selected
reaction coordinates. The usage of predefined coordinates greatly
reduces the complexity of biomolecular simulation problems and facilitates
the free energy calculations (e.g., significantly lower energetic
noise compared with aMD simulations). However, it is key to select
proper reaction coordinates, which often requires expert knowledge
of the studied systems. Construction of biomolecular reaction coordinates
or collective variables has thus been one of the main objectives in
metadynamics studies.[6a] When important
reaction coordinates are missed during the simulation setup, metadynamics
simulations may suffer from slow convergence problems as discussed
earlier. Furthermore, the predefined reaction coordinates tend to
place constraints on the sampled space and pathways. It seems difficult
to identify certain intermediate states during the protein-folding
and ligand-binding pathways, e.g., intermediate 2 as observed during
the binding of benzene to the T4-lysozyme. In comparison, aMD simulations
are not constrained by reaction coordinates, but this also leads to
much higher energetic noise and presents grand challenge for accurate
reweighting to recover the original free energy landscapes of biomolecules.[12] Although cumulant expansion to the second order
was shown to improve aMD reweighting when the boost potential follows
near Gaussian distribution,[14] such improved
reweighting is still limited to small systems such as protein with
≤35 residues.[15] Here, by constructing
boost potential using a harmonic function that follows Gaussian distribution,
GaMD enables rigorous energetic reweighting through cumulant expansion
to the second order, even for simulations of larger proteins (e.g.,
T4-lysozyme). With this, GaMD achieves simultaneous unconstrained
enhanced sampling and free energy calculations.However, several
cautions also result from this study. First, while
the present GaMD simulations seem to provide sufficient sampling of
the low-energy regions, they appear to remain unconverged in sampling
of the high-energy barriers. This is particularly true for the ligand
entry step in the GaMD simulation of benzene binding to the T4-lysozyme.
It is worth recalling that the threshold energy for adding the boost
potential is set to its lower bound in the present GaMD simulations.
It is subject to future investigation into whether using the upper
bound of the threshold energy will facilitate sampling of the high-energy
barriers in GaMD simulations. Second, based on a potential biasing
approach, GaMD mainly accelerates transitions across enthalpic energy
barriers. Improvement for its application to systems with high entropic
barriers is still in need. On this regard, GaMD can be potentially
combined with the parallel tempering[36] and
replica exchange[37] algorithms for further
enhanced sampling. Particularly, the combination of parallel tempering
and metadynamics (PT-MetaD)[34a] has been
shown to facilitate enhanced sampling of biomolecules over entropic
barriers.In summary, without the need to set predefined reaction
coordinates,
GaMD is generally applicable to a wide range of biomolecular systems
as demonstrated on protein folding and ligand binding in this study.
For systems of increasing size, the upper limit of the ΔV standard deviation, σ0 can be adjusted
dynamically to ensure that the distribution width of the applied boost
potential is narrow enough for accurate energetic reweighting using
cumulant expansion to the second order. Therefore, GaMD provides both
unconstrained enhanced sampling and free energy calculation for biomolecular
simulations.
Authors: Romelia Salomon-Ferrer; Andreas W Götz; Duncan Poole; Scott Le Grand; Ross C Walker Journal: J Chem Theory Comput Date: 2013-08-20 Impact factor: 6.006
Authors: Yinglong Miao; Sara E Nichols; Paul M Gasper; Vincent T Metzger; J Andrew McCammon Journal: Proc Natl Acad Sci U S A Date: 2013-06-18 Impact factor: 11.205
Authors: Andrew C Kruse; Aaron M Ring; Aashish Manglik; Jianxin Hu; Kelly Hu; Katrin Eitel; Harald Hübner; Els Pardon; Celine Valant; Patrick M Sexton; Arthur Christopoulos; Christian C Felder; Peter Gmeiner; Jan Steyaert; William I Weis; K Christopher Garcia; Jürgen Wess; Brian K Kobilka Journal: Nature Date: 2013-11-20 Impact factor: 49.962
Authors: Yulin Zhao; Peter Man-Un Ung; Gergely Zahoránszky-Kőhalmi; Alexey V Zakharov; Natalia J Martinez; Anton Simeonov; Ian W Glaaser; Ganesha Rai; Avner Schlessinger; Juan J Marugan; Paul A Slesinger Journal: Cell Rep Date: 2020-06-16 Impact factor: 9.423
Authors: Andrea Rizzi; Travis Jensen; David R Slochower; Matteo Aldeghi; Vytautas Gapsys; Dimitris Ntekoumes; Stefano Bosisio; Michail Papadourakis; Niel M Henriksen; Bert L de Groot; Zoe Cournia; Alex Dickson; Julien Michel; Michael K Gilson; Michael R Shirts; David L Mobley; John D Chodera Journal: J Comput Aided Mol Des Date: 2020-01-27 Impact factor: 3.686
Authors: Giulia Palermo; Yinglong Miao; Ross C Walker; Martin Jinek; J Andrew McCammon Journal: Proc Natl Acad Sci U S A Date: 2017-06-26 Impact factor: 11.205