Samuel D Lotz1, Alex Dickson1,2. 1. Department of Biochemistry & Molecular Biology, Michigan State University, East Lansing 48824, Michigan, United States. 2. Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing 48824, Michigan, United States.
Abstract
Here, we introduce the open-source software framework wepy (https://github.com/ADicksonLab/wepy) which is a toolkit for running and analyzing weighted ensemble (WE) simulations. The wepy toolkit is in pure Python and as such is highly portable and extensible, making it an excellent platform to develop and use new WE resampling algorithms such as WExplore, REVO, and others while leveraging the entire Python ecosystem. In addition, wepy simplifies WE-specific analyses by defining out-of-core tree-like data structures using the cross-platform HDF5 file format. In this paper, we discuss the motivations and challenges for simulating rare events in biomolecular systems. As has previously been shown, high-dimensional WE resampling algorithms such as WExplore and REVO have been successful at these tasks, especially for rare events that are difficult to describe by one or two collective variables. We explain in detail how wepy facilitates implementation of these algorithms, as well as aids in analyzing the unique structure of WE simulation results. To explain how wepy and WE work in general, we describe the mathematical formalism of WE, an overview of the architecture of wepy, and provide code examples of how to construct, run, and analyze simulation results for a protein-ligand system (T4 Lysozyme in an implicit solvent). This paper is written with a variety of readers in mind, including (1) those curious about how to leverage WE rare-event simulations for their domain, (2) current WE users who want to begin using new high-dimensional resamplers such as WExplore and REVO, and (3) expert users who would like to prototype or implement their own algorithms that can be easily adopted by others.
Here, we introduce the open-source software framework wepy (https://github.com/ADicksonLab/wepy) which is a toolkit for running and analyzing weighted ensemble (WE) simulations. The wepy toolkit is in pure Python and as such is highly portable and extensible, making it an excellent platform to develop and use new WE resampling algorithms such as WExplore, REVO, and others while leveraging the entire Python ecosystem. In addition, wepy simplifies WE-specific analyses by defining out-of-core tree-like data structures using the cross-platform HDF5 file format. In this paper, we discuss the motivations and challenges for simulating rare events in biomolecular systems. As has previously been shown, high-dimensional WE resampling algorithms such as WExplore and REVO have been successful at these tasks, especially for rare events that are difficult to describe by one or two collective variables. We explain in detail how wepy facilitates implementation of these algorithms, as well as aids in analyzing the unique structure of WE simulation results. To explain how wepy and WE work in general, we describe the mathematical formalism of WE, an overview of the architecture of wepy, and provide code examples of how to construct, run, and analyze simulation results for a protein-ligand system (T4 Lysozyme in an implicit solvent). This paper is written with a variety of readers in mind, including (1) those curious about how to leverage WE rare-event simulations for their domain, (2) current WE users who want to begin using new high-dimensional resamplers such as WExplore and REVO, and (3) expert users who would like to prototype or implement their own algorithms that can be easily adopted by others.
Biomolecular simulation is a valuable tool to gain insight into
the atomic-level mechanisms of biological systems. Molecular processes
such as ligand (un)binding, protein (un)folding, protein–protein
association/dissociation, conformational changes, transport, and enzymatic
reactions underlie the functioning of all living organisms. Because
the length scales of these phenomena are too small to be observed
experimentally at atomic resolution, computer simulations have long
attempted to serve as a proxy. This application of molecular dynamics
(MD) has been limited by two factors: the accuracy of force fields
and available computational power. Despite huge breakthroughs in both
force fields[1,2] and hardware,[3−5] it is still
difficult to perform simulations of sufficient length to capture the
biomolecular processes of interest. For instance, the waiting time
for ligand-unbinding events of pharmacologically relevant ligands
can extend to tens of minutes or hours,[6] while MD trajectories are typically limited to microsecond timescales.
Thus, despite our knowledge (from experiment) that these transitions
occur readily at macroscopic timescales, the occurrence of these events
during a microsecond molecular simulation can be seen as a “rare
event”.A series of algorithms have been developed called
“enhanced
sampling” methods, which attempt to gain information about
long-timescale processes using only short-timescale simulations. Many
of these methods use perturbations to the system by either applying
biasing forces or increasing the system temperature. This includes
enhanced sampling methods such as replica exchange,[7] metadynamics,[8−11] temperature-accelerated MD,[12,13] and umbrella sampling.[14] These methods
invoke an assumption that the molecular systems are in equilibrium,
usually in the form of a canonical probability density function. Additionally,
although there are some approaches to approximate rates based on biased
simulations,[8] this approach makes it very
difficult to recover full atomistic detail of transition-state ensembles
that govern the forward and backward rate constants for a given transition.There is thus a need for a method for accelerating the simulation
of rare-event processes that generate trajectories using the unbiased
Hamiltonian of interest. One such class of methods is called “path
sampling”, which are unique in that they apply a sampling process
over a collection of trajectories rather than single conformations.[15] Importantly, they do not require modifications
to the underlying dynamics model and provide contiguous trajectories
of transition paths. This makes them suitable for studying all varieties
of path-dependent observables [e.g., (un)binding
rates], as well as providing detailed atomistic models of transition
states. There are a variety of path-sampling methods including transition
path sampling (TPS),[16−18] forward flux sampling (FFS),[19] multilevel splitting,[20] and weighted
ensemble (WE).[15,21] Of these, WE has advantages in
which it does not in general require a Markovian assumption or a priori knowledge of full trajectories connecting two states
to start[15,22] and does not require the definition of a
progress coordinate, which will be discussed further below. At the
heart of WE is a resampling process that uses a set of
cloning and merging operations among a set of parallel simulation
replicas. More formal definitions of resampling and WE are discussed
in Section as
well as in refs.[15,22,23] WE is a
conceptually portable method that can be applied to any field of study
including molecular biophysics,[23−33] systems biology,[34−41] telecommunications,[42] and aerospace and
engineering.[43]Another benefit of
WE is the flexibility of its framework, allowing
for resampling methods that avoid the predefinition of progress coordinates
and collective variables (CVs).[20,44−46] Approaches involving CVs are typically limited to low-dimensional
representations (often ≤3), while WE-based resamplers such
as WExplore[45] and REVO[46] perform well in high-dimensional spaces. High-dimensional
adaptive resampling algorithms have been especially successful in
obtaining preliminary rare-event simulations for systems with waiting
times well beyond what is typical.[33,41,47−49] Notably, WExplore simulations
produced unbinding trajectories of a drug ligand (TPPU) from its target
(soluble epoxide hydrolase) that has an experimentally determined
mean first passage time of 11 min, using less than 1 μs of simulation
and a speedup of 109-fold.[48]a WExplore and REVO have shown to be particularly
useful for discovering multiple pathways. Both algorithms were able
to discover multiple ligand dissociation pathways for the trypsin–benzamidine
system, which requires substantial rearrangement of the loops comprising
the trypsin ligand-binding pocket.[46,47] In contrast,
single and low-dimensional projections such as reaction coordinates
often constrict the search space to particular paths, which precludes
the discovery of alternative paths (and transition states) between
macrostates. Finding multiple pathways can be particularly useful
for applications such as kinetics-based drug design when we want to
understand the structure of the ligand-binding transition state not
only for a particular ligand but also for closely related ligands.
Finally, adaptive algorithms such as WExplore and REVO (as well as
the history-augmented Markov state model WE method[50]) require less upfront parametrization such as the definition
of bin boundaries.Despite the advantages of high-dimensional
adaptive WE algorithms
such as WExplore and REVO, their adoption has been hindered for a
number of reasons. First, the implementation of these resampling algorithms
is complex and difficult to implement correctly. Second, independent
implementations of WE algorithms lack interoperability of produced
data and so are difficult to compare. Third, the barrier to entry
for other researchers to write an implementation of the resampling
algorithm as well as progress metrics for their system of interest
is prohibitive.Here, we introduce the open-source wepy software
framework for
running WE simulations that attempts to address these issues. We first
describe a software and data architecture that both reflect a simple
mathematical formalism (described in 3.1) and also decompose into multiple modular components. The software
architecture allows for reuse of vetted resampling algorithm implementations
written by researchers with domain-specific progress metrics written
by users. The data architecture solves interoperability through the
introduction of a general purpose decision record design (described
in the Appendix).Wepy is implemented
in the Python 3 programming language and thus
allows users to natively leverage a massive ecosystem for scientific
computing. Other benefits of a pure-Python implementation are that
it (1) increases portability between platforms, (2) has a uniform
interface that can be used as a library and embedded into other software
easily, and (3) only requires knowledge of a single popular programming
language (Python), which if necessary has facilities for writing extremely
high performance code (e.g., numba,[51] dask[52]). Currently, wepy is
tightly integrated with the OpenMM[53] MD
engine and provides excellent support for running GPU MD simulations.
The architecture of wepy, however, is agnostic to the underlying dynamics
engine as well as to any particular parallel computing strategy or
framework. The wepy project also introduces a high-performance single-file
storage format and schema for cloning-merging-type simulations implemented
in HDF5.[54]b Use of
HDF5 also provides “out-of-core” data structures which
allow access to simulation data that do not fit entirely into computer
main memory. On top of this, an extensive interface [application programmer
interface (API)] is provided to make querying, analysis, conversion
to other formats of complex path trajectories easy. Part of this interface
is to support the intuitive representation of WE trajectories which
have been cloned and merged as trees (referred to as “tree-like”
data structures).Although there are two other WE frameworks
that have been developed,
these have different strengths and design goals. The AWE-WQ[23] system provides an implementation of the accelerated
weighted ensemble (AWE) along with a “Work Queue”-distributed
computing framework. However, AWE-WQ is less flexible than wepy in
that it solely implements AWE simulations and is opinionated about
the distributed computing framework. The WESTPA[55] software suite is a popular implementation of many binning-based
WE resamplers and provides excellent support for running simulations
on large clusters of CPU cores and GPUs. It is implemented in Python
and typically run using a combination of UNIX-like shell scripts and
YAML configuration files. It shares many of the same benefits of wepy
discussed above including modularity and the use of HDF5 files for
simulation data. WESTPA however did not satisfy our requirements because
the architecture is oriented around fixed-topology explicit binning
approaches making implementation of binless algorithms such as REVO[46] difficult. Additionally, we also favor the configuration-as-code
approach where simulation components are constructed in Python code.In this paper, we first briefly describe WE, resampling, and rate
calculations along with a sketch of some of the adaptive resamplers
that motivated the construction of wepy. We then introduce a mathematical
formalism that provides an overview of the design and architecture
of the system followed by a description of the major software components
in wepy including how to initialize, run, and analyze simulations.
Finally, we present an example ligand-unbinding scenario, using a
Lysozyme model system, along with concrete code examples and explanations.
Algorithms
Weighted Ensemble Algorithm
The WE
algorithm is a general strategy for simulating rare or long-timescale
events in stochastic systems.[21] Its fundamental
features are as follows. A set of trajectories are propagated forward
in time in a parallel fashion, each one assigned a statistical weight
(p). Periodically, the trajectories are “resampled”
to rebalance the computational effort toward lower-probability regions.
This is done by cloning trajectories in sparsely populated, lower-probability
regions, and merging together trajectories in overpopulated, higher-probability
regions. To maintain proper statistics, the weight of a cloned trajectory
is divided among its progeny, and the weights of merged trajectories
are summed and given to the resulting trajectory. Here, we refer to
this process as “resampling”.It was shown by
Zhang et al.(44) that this
is a statistically exact process in that ensemble averages were obtained
with WE converge to those of the underlying distribution (e.g., the canonical distribution) in the limit of large
sampling, regardless of the particular resampling function. One of
the first strategies for resampling is a simple binning of trajectories
along a progress coordinate. Trajectories are then cloned in under-represented
bins and merged in over-represented bins until the count in each bin
is equal to a target value. However, because of the combinatorial
explosion for tessellating high-dimensional search spaces with uniform
bins, algorithms such as WExplore and REVO were developed to effectively
leverage the WE algorithm for processes that take place in inherently
high-dimensional spaces.
WExplore Resampler
The main problem
with defining bins in a high-dimensional space is that the number
of bins needed to cover the space scales exponentially with the dimensionality
of a space. This number of bins is proportional to the overall computational
effort of the simulation, as a target number of trajectories are to
be run in each bin. The WExplore algorithm was developed by Dickson
and Brooks to circumvent this difficulty.[45] The key is to construct a set of hierarchical regions: a small set
of large regions that tile the entire space, which are each subdivided
by a set of smaller regions, which are in turn subdivided by even
smaller regions. In the WExplore resampler, these regions are Voronoi
polyhedra that are defined by points (called “images”)
in a high-dimensional space. In order to assign a trajectory to a
region, we must measure the distance from that trajectory state to
each image. The trajectory is then assigned to the region with the
closest image. More information on this algorithm can be found in
the original paper.[45]In a typical
WExplore simulation, we start with only a single image and define
new images as they are visited by the set of trajectories. A new image
is defined when a trajectory in the ensemble reaches a new region
of space, or more precisely, when the distance to the closest image
is greater than a critical value. The list of critical values (dmin = (d1, d2, ..., d)) thus control the sizes of the regions at each level of the
hierarchy and are parameters of the WExplore resampler. As the set
of images grows over the course of the simulation, the WExplore resampling
function (denoted below)
can change with each resampling
step.
REVO Resampler
Although WExplore
presents a viable means of indexing tessellations of high-dimensional
spaces, there are still some behaviors related to the construction
of regions that are nonoptimal—most notably, the discontinuous
behavior related to reaching new levels of the hierarchy for the first
time. For this reason, a new resampling algorithm was recently proposed
which avoids the construction of sampling regions altogether. In the
REVO algorithm (resampling of ensembles by variation optimization),
an objective quantity called the “trajectory variation”
(denoted, V) is used to guide the merging and cloning
process.[33,46] We calculate V before and
after each proposed merging and cloning operation and execute only
the operations that cause V to increase.The
variation is given bywhere d(X, X) represents the distance between trajectories i and j and ϕ(X)
is a function
that describes the importance of individual trajectories. The exponent
α allows us to balance the relative strength of the distances
and the importance functions, and the d0 value does not affect resampling but serves to keep the variation
function unitless. On the whole, the variation function increases
as the ensemble of trajectories gets further apart.The importance
functions can be defined to take the probabilistic
weight of the trajectories into accountwhere C is a constant.
This
has the effect of prioritizing not only a broad ensemble of trajectories
but one where the highest weighted trajectories are distributed as
far as possible from each other. This strategy can minimize the error
in the calculation of observables that depend on the weights of rare
trajectories, such as transition rates.
Calculating
Transition Rates Using Boundary
Conditions
The wepy software supports the construction of
nonequilibrium ensembles to calculate transition rates. In this technique,[56−58] a set of history-dependent ensembles are defined using a set of
“basins”. For instance, if we are studying ligand-binding
transition pathways, then the basins will be the “bound”
and the “unbound” states. The unbinding ensemble is then defined as the set of trajectories that have most recently
visited the bound state. When a trajectory from the unbinding ensemble
enters the unbound state, it transitions to the binding ensemble. In a typical wepy simulation, we can initialize trajectories in
a given basin and use boundary conditions that terminate trajectories
that enter the opposite basin. The weights of these trajectories are
used to calculate a transition flux (probability per unit time), which
is used to calculate a rate constant (e.g., kon or koff). These trajectories
are then “warped” back to the initial state, retaining
the same statistical weight.
Design
and Architecture
Formalism
Let
us first define ensemble
resampling simulations in general. First, we define an ensemble to
be a finite multiset of walkers, w, of size Nwhere a walker is a set of two elements:
a
probabilistic weight, p, and a state, XAn ensemble of walkers defines
a probability
distribution, P(X), and must be
normalized such thatThere are at least two
steps in an ensemble resampling simulation:
propagating dynamics and resampling. We can also introduce a third
step for the so-called boundary conditions that allow for running
nonequilibrium simulations and calculation of rate constants. An ensemble
resampling simulation process, (for
“apparatus”), is made
up of three components: a runner function , a boundary
condition function , and
a resampling function .The dynamics can be any stochastic
dynamical process such as MD,
Monte Carlo simulations, and so forth. Formally, a runner function
has the formwhere is a stochastic function such that multiple
evaluations of an input state X might not yield identical X′s and τ is the number of steps of propagation.
Although acts independently
on each walker state X, it is convenient to write
it asThe resampling
function, , maps an ensemble, , to another ensemble ′
along with an updated resampling functionThe walkers in this new ensemble must satisfy these constraintsIn words, ′ consists of walkers
with states chosen from the states of the ensemble. Note that these constraints are necessary but not
sufficient to ensure that the sampled ensembles are consistent with
the cloning and merging process in the WE algorithm.The updated
resampler allows for algorithms that take into account
history dependence. Thus, a resampling functionis said to be stateless and does not take
into account any walker history.Although more commonly stateless,
boundary conditions and runners
also have equivalent history-dependent definitionsThese
functions typically utilize an application strategy that
followsTo simplify this, we write it asSimilarly, a single cycle of simulation is the application
of the three components of the apparatus to an initial ensemble, 0This interleaved application of apparatus components can be
simplified
towhere which can easily adopt
the notation mentioned
above for n cycles applied to the initial walkersA final useful construct is the snapshot which is
the complete state of a simulation
Software Components
In this section,
we describe concretely the software components that make up the wepy
framework and how they are integrated. We begin by describing the simulation manager, which implements the apparatus described
above and handles the reporting of output to the user. We then discuss
some unique features of WE algorithms that can lead to challenges
during data analysis and tools provided by wepy that make this analysis
easier.
Building and Running Simulations
To run simulations, we need two things: an ensemble of initial walkers
and a simulation manager. A simulation manager is an object that contains
all of the apparatus necessary to move the walkers forward.The flowchart in Figure describes the functioning of a simulation manager acting
on a set of initial walkers. First, the input walkers have their states
propagated by the runner, which can be any sort of dynamics
engine. The initial release of wepy supports OpenMM as an MD engine,
as well as experimental support for other engines such as NAMD. Because
the runner propagation is typically extremely compute-intensive, we
also provide another work mapper component, which can
be customized for different computing environments. Wepy includes
a serial implementation that can be used for testing on a single core
or device, as well as two additional work mappers that use the python
built-in multiprocessing module that are suitable for simulations
using hardware on a single node. Because the work mapper has been
factored out of the implementation of the runner itself, these mappers
will work for all different MD engines with little modification.
Figure 1
Diagram
showing the components and flow of data for the main simulation
loop of the simulation manager. Dynamics are performed in the runner
component and parallelized onto the compute nodes via the work mapper. The warping and resampling records are shown in
the data flow, which are serialized and saved using an HFD5 reporter.
Other reporters can be defined to record simulation data, which is
indicated by the I/O elements.
Diagram
showing the components and flow of data for the main simulation
loop of the simulation manager. Dynamics are performed in the runner
component and parallelized onto the compute nodes via the work mapper. The warping and resampling records are shown in
the data flow, which are serialized and saved using an HFD5 reporter.
Other reporters can be defined to record simulation data, which is
indicated by the I/O elements.After propagation by the runner, each walker is tested by the boundary
condition function to see if it has met the criteria. If any walker
satisfies the boundary condition, the function is free to modify the
state of those walkers; this is called a warping event.
In addition to the new walkers, a set of warping records are produced
that describe how the warping event(s) occurred. These records are
then made available for generating the final data structure and any
other report that requests them. In wepy, two kinds of warping events
are recognized: continuous and discontinuous. A discontinuous warping event is one where dynamical variables
of the walker are modified, for example, restarting walkers at the
initial state after reaching a target state. A continuous warping
event is one in which none of those dynamical variables are modified.
This could be an auxiliary attribute such as a “color”
after a walker passes through some boundary[59] or none at all in which case only a record will be produced.After the boundary conditions are applied, the resampler resamples
the walkers. Again, a collection of records is produced for the resampling;
however, unlike the warping records, a record for each walker is produced
every cycle. These resampling records contain critical information
about the lineages of walker states that is necessary for reconstructing
continuous trajectories. It is useful to use a diagram to depict the
histories of each walker as they pass through these stages in a single
cycle, an example of which we have shown in Figure . Finally, at the end of a cycle, the walkers
and records are passed off to an arbitrary number of reporters which can record whatever output is necessary or desired.
Figure 2
Walker history
schematic. The vertically stacked boxes are the
ensembles of walkers with labels below. The index of the “box”
is the walker index in that ensemble. The color of the circle inside
the box represents the state of the walker, where we treat white as
the initial state of the simulation. The jagged lines indicate propagation
of the walker state through the dynamics of the runner. The lines
for the boundary condition step indicate whether a warping event occurred,
where the dot indicates that a continuous warp occurred and the slash
indicates a discontinuous warp, in this case, returning the walker
state to the initial conditions. The lines in the resampling phase
show cloning and merging, where a solid line indicates that a child
inherits the state of its parent and a dashed line indicates a transfer
of weight to another walker. In this example, w0 has cloned itself to produce two child walkers while w1 has been “squashed” and its
state forgotten. It is “merged” and its weight was added
to the resultant w2.
Walker history
schematic. The vertically stacked boxes are the
ensembles of walkers with labels below. The index of the “box”
is the walker index in that ensemble. The color of the circle inside
the box represents the state of the walker, where we treat white as
the initial state of the simulation. The jagged lines indicate propagation
of the walker state through the dynamics of the runner. The lines
for the boundary condition step indicate whether a warping event occurred,
where the dot indicates that a continuous warp occurred and the slash
indicates a discontinuous warp, in this case, returning the walker
state to the initial conditions. The lines in the resampling phase
show cloning and merging, where a solid line indicates that a child
inherits the state of its parent and a dashed line indicates a transfer
of weight to another walker. In this example, w0 has cloned itself to produce two child walkers while w1 has been “squashed” and its
state forgotten. It is “merged” and its weight was added
to the resultant w2.Although reporters can be customized for a particular simulation,
there are a number of useful reporters included with wepy. The WepyHDF5
reporter generates HDF5 output files, which are a major component
of wepy and will be discussed on their own later in relation to analysis.
The rest of the reporters are designed to give real-time insights
to potentially long-running simulations as well as to give an accessible
summary of the simulation results. The dashboard is an executive summary
of the simulation provided in a simple plaintext file that is formatted
in emacs org-mode that makes it easy to read a potentially large output
file in a hierarchical folding manner. The walker ensemble reporter
is for visualizing the current state of the walkers that simply outputs
a reference topology file and a DCD trajectory file that can be used
for visualizing in any of the main 3D molecular visualization programs,
while these snapshots of simulations show only transient information
that they are useful to indicate rough sketches of progress or for
debugging purposes. Finally, the resampling tree output will generate
a GEXF formatted XML file that shows the “family tree”
of all walkers in the simulation. This is useful for helping understand
both when and where resampling and warping events are taking place.
These reporters are provided only for convenience as all of this information
can be generated from WepyHDF5 files.
Data
Format and Analysis
The WE
algorithm is built on branching trajectories: simulations that have
a single starting point may have multiple ending points. We call these
kinds of tree-like branching trajectories nonlinear and
a diagram of the difference between them and more traditional linear
trajectories can be seen in Figure . In wepy, we call a locally linear selection of frames
from the entire nonlinear dataset a trace. These are
shown in Figure as
the solid colored lines drawn next to the frames they encompass. Nonlinear
trajectories introduce additional complexity and typically require
modifications to common time-dependent analyses.
Figure 3
Nonlinear trajectory
comparison. The black squares indicate single
frames which are connected in time by the thin black lines. The larger
lines indicate a locally linear trajectory, here called a “trace”.
Nonlinear trajectory
comparison. The black squares indicate single
frames which are connected in time by the thin black lines. The larger
lines indicate a locally linear trajectory, here called a “trace”.First, visualizing a single linear trajectory requires
a selection
step and the input of information. For instance, in Figure , the single red and blue traces
for the linear trajectories are trivial, whereas there is some redundancy
in the nonlinear trajectories from the tree. In a nonlinear tree,
we can choose a frame for which we are interested in its history and
recover its lineage as a trajectory. These linear trajectories
can then be exported to a common format (e.g., CHARMM
DCD) and visualized.Second, state network models such as Markov
State models (MSMs)
are a common method for representing and understanding large amounts
of simulation data. This is a perfect match for WE simulations, as
the trajectory segments are typically generated using unbiased dynamics.
However, current tools for constructing MSMs do not support construction
of the transition matrices from nonlinear trajectory trees. Fundamentally,
the problem involves avoiding the double-counting frames when counting
the transitions for lag times greater than our cycle length (τ).
A comparison for the solution for the nonlinear case is compared to
the linear case in Figure , which is implemented in wepy.Third, when calculating
free energies of macrostates, the weights
of trajectory observations must be taken into account. This simply
amounts to doing weighted sums rather than counts for macrostate bins
and requires careful association of trajectory frames to the instantaneous
values of trajectory weights.Finally, observations in different
trajectories cannot be assumed
to be statistically independent. Take for example, a set of observed
warping events take place where a ligand molecule reaches the threshold
of unbinding for a receptor. Each single instance contributes to the
overall rate of unbinding proportional to its weight via the Hill relation.[15] However, the calculation
of macroscopic observable uncertainties (such as the probability of
an unbinding event) is complicated by the duplication of single observations via cloning. The degree to which two observations can be
seen as “independent” depends on the time point of their
last common ancestor.To deal with these fundamental differences
in the abstract structure
of data generated by a WE simulation, wepy implements a new mechanism
for storing trajectory data. To accomplish this, we have designed
and implemented a storage layer using the common HDF5 format, which
is suited to storing large amounts of heterogeneous data. In this
implementation, all data results from a simulation are contained in
a single monolithic binary file. Although the data can be accessed
with any tool that supports HDF5, in wepy, we provide an extensive
API for creating, accessing, querying, processing, and transforming
the data at a useful semantic level.In the WepyHDF5 format,
we bundle together the three critical and
interdependent data pieces into a single file: walker data (including
weights and states), resampling data, and boundary condition warping
data. This combination is what allows us to generate views of linear
trajectories from nonlinear data. The resampling data inform the parent-child
relationships between walkers and the warping data alerts to the presence
of discontinuities in dynamics of these walker lineages. These primitives
solve most of the major problems listed above for dealing with nonlinear
trajectories with tools provided in wepy.
Results
In this section, we first discuss code examples
on how to set up,
run (Section ),
and analyze (Section ) a simulation for the test system used in this paper: Lysozyme
ligand unbinding in an implicit solvent. We then describe all the
parameters and details about the simulations that were run, the results
of which are briefly discussed. This small experiment is shown to
give an example of the kinds of results and analysis that are typical
for wepy simulations.
Code for Running Simulations
Here,
we give a brief sketch of how these components were constructed and
put together into a simulation manager in a Python script. The complete
code is given in the Supporting Information. The system and parameter choices will be discussed in Section .We first
need to set up our wepy runner for OpenMM, which requires an OpenMM
system and integrator objects. Using OpenMM-Systems helper library
(installed with wepy), we can easily create a ready-to-go MD system.The integrator can be constructed using the OpenMM constructorwhere the three arguments are the temperature, friction coefficient,
and dynamics time step, respectively. We can then create a runner
object that contains everything needed to propagate the system, where
the reference platform specifies a cross-platform reference implementation
on a CPU. Note that our production simulations for this work were
run with the CUDA platform.Second, we need to create the initial
ensemble of walker objects
from which to start our simulations. We use a wepy helper function
gen_walker_state to generate state objects directly from OpenMM systems.init_state is a WalkerState object which can be put inside a walker.
Because we are using a Langevin integrator which has a stochastic
component (required by all WE simulations), we can copy the same structure
for all starting replicas. Each worker manually has a weight assigned
to them; in this case, it is a uniform distribution.Third, we construct a resampler to use. Both the WExplore and REVO
resamplers require a metric, which is a way of measuring the distance
between two walkers. Details regarding the distance metrics are discussed
in more detail in Appendix . For receptor ligand-based systems, there are distance metrics
already included in wepy:where lig_idxs and bs_idxs are
the atomic indices of the ligand
and the binding site in the system. A user can also easily make their
own distance metrics; a recipe for doing this is shown belowThe dependencies here are the distance base class and some helper
functions for performing geometric operations on arrays from our custom
library geomm. The distance object is created in a similar way, except
that we do not need the reference state (which is needed in UnbindingDistance
for performance reasons):Fourth, we construct the resampler
that we are going to use. Here,
we create an instance of the WExploreResampler using the distance
and init_state objects that we created earlier as well as some additional
algorithm parameters (as discussed in Section )Fifth, to set up a nonequilibrium
ligand-unbinding simulation,
we will construct boundary conditions that capture walkers as they
cross into the unbound state. Wepy also comes with some built-in modules
under receptor-based boundary conditions, which we can import and
parametrize as followswhere json_top is an internal JSON-based
topology format in wepy,
more information on this is given in the supplemental example script.
Again, this is easy to customize for your application; we show a simplified
example of a custom boundary condition below. The warp_walkers function
computes the minimum distance between the ligand and the receptor
atoms for each walker and if it exceeds a threshold, we “warp”
it by replacing that walker state with the initial bound state, while
keeping its weight constant.Finally, we assemble
these components into a simulation managerOnce the simulation manager is constructed, all we need to do now
is to tell it to run. A simulation for 10 cycles, each having 10,000
steps per walker per cycle runs as followsThis returns the walkers at the end of the simulation along with
the final state of the runner, resampler, and boundary conditions
which are contained in the final_apparatus.Another important
component is the use of reporters. In addition
to the walkers and apparatus returned by the simulation manager functions,
wepy supports a plugin system to output data as the simulation is
progressing. In the following example, we will show you how to use
two different reporters: one for creating the HDF5 files and another
that produces a plaintext dashboard file for every cycle to show the
progress of long running simulations in a high-level overview. If
you have the resampler and boundary conditions already constructed,
it is very simple to make the HDF5 reporter. This will work out of
the box but should likely be customized as the default is to save
all data, including the velocities, at each step.The dashboard is composed of different sections for the different
components. There is a generic one that displays the number of cycle
runs, walker weights, and total simulation time, and there are component-specific
sections for information on resampling, boundary conditions, and so
forth.These reporters are attached to
the simulation manager at creation
timeAs mentioned in Section , the HDF5 reporter is of utmost importance
and is
a purpose-built fully featured storage format implemented in HDF5.
Saving your data in the HDF5 format will let you use an extensive
API designed specifically for manipulating WE data (the WepyHDF5 class
in the wepy.hdf5 module) as well as allowing lower-level manipulations via libraries such as h5py. Furthermore, a fairly comprehensive
analysis toolkit is made available in the wepy.analysis module. This
toolkit makes it easy to transform and structure data to interoperate
with other analysis toolkits such as scipy,[60] dask,[52] mdtraj,[61] and gephi.[62] In the next section, we
show some examples of how to leverage these tools to perform common
analyses and visualizations.
Code for Analysis and Visualizations
Probability Distributions
One of
the most common ways to visualize simulation data is to project structural
data to a small number of dimensions and then plot this as a probability
distribution. In practice, free-energy profiles are computed by binning
the projection domain and then computing the weighted histogram over
those bins by summing the weights of the samples in those bins. The
bin values are then normalized to get a probability distribution,
which is then transformed by −ln(p) to get
a free energy-like value. Typically, the term free energy refers to
probability distributions over equilibrium ensembles. However, for
convenience, here, we refer to probability distributions transformed
as described as “free energy” or more specifically “nonequilibrium
free energy” even if the underlying ensemble is not necessarily
an equilibrium one. In a single linear MD trajectory, the frames are
equally weighted, but in an ensemble of walkers in WE, the weights
of each frame can be different and vary over time. As such, the trajectory
coordinate data are associated with the weights in the wepy HDF5 file.Here, we show how to create 1D free-energy profiles for each experimental
group projected onto the ligand RMSD relative to the bound pose (see Figure ). The process is
to first compute the ligand RMSD for all of the simulation frames
as an “observable” and then to compute the free-energy
profiles which can be plotted with a tool such as matplotlib.[63]
Figure 4
Series of −log(p) distributions
for ensemble
simulation groups at different time points. The time shown in the
bottom right of each panel is the total amount of sampling across
all replicas in the ensemble.
Series of −log(p) distributions
for ensemble
simulation groups at different time points. The time shown in the
bottom right of each panel is the total amount of sampling across
all replicas in the ensemble.To compute the ligand RMSD observable, we open the file with the
WepyHDF5 API in PythonThen, we define a function for
computing RMSD that will be mapped
over all of the dataWe then apply the function to the
datawhere the second argument asks to retrieve the relevant
data from
each frame, which here is just the positions. This also saves the
observable into the database as the field “lig-rmsd”.
By default, the calc_rmsd_observable function is applied in serial.
In wepy, we also provide a distributed version which can connect to
a dask cluster server for distributed parallelism on large computing
clusters (see documentation).Now, we can create the free-energy
profile for this observable.
One intermediate step, although, is to use another representation
called the “contig tree”, which makes combining multiple
contiguous simulations (such as when a simulation is restarted) much
easier to analyze. Construction of a contig tree requires (1) a dataset,
(2) a “decision” (used internally by the resampler see Appendix for more details),
and optionally (3) the boundary condition class that was used for
the simulation if any.With the contig tree constructed,
we can then feed it to the profiler,
bin the domain, and calculate the free energies for each.A line plot of fe against bin is shown for each simulation type
in Figure and discussed
in Section .
Visualizing Ligand-Unbinding Events
To make rate estimates for a process or to analyze the transition
path ensemble, we need an efficient way to examine the pathways from
one basin to another. There are tools in wepy to help analyze this
kind of boundary condition data, which we call the “warping”
records. The data for all the warping events can be found in the HDF5
and are accessed through either the WepyHDF5 object or the ContigTree.
Here, we can make a pandas data frame table for run 0, which can easily
be exported to any number of formats.Each row of this table
contains the index, weight,
and point in time that the event took place.To get trajectories
from these warping events to the starting structure,
we use the following functions on a Contig object which is a single
simulation from a ContigTreeOn the first line, we
first generate a linearized Contig from the
ContigTree which is necessary for traversal. Then, we open a context
for the contig to open the HDF5 file where we can then access the
simulation data. We then query for a “trace” of all
frames in which a warping event occurred as the warp_trace. From each
of these events, we then produce the individual histories (“lineages”)
of that walker to the initial state. For this example, we only choose
to look at one of these, which is set as lineage_trace. We now want
to actually retrieve the simulation data to be able to visualize and
perform analysis on it. To do this, we use the lineage trace and choose
which attributes we want. For this example, we get only the positions
and box vectors because our purpose is to generate visualizations.
Now that we have the trace_fields for this trajectory, we can easily
convert to an mdtraj object which can then be used to save the trajectory
to a file format visualization software can read.
Simulating Lysozyme Ligand Unbinding
To showcase the
use of wepy for performing simulations of rare events
in real systems, we simulate the small-molecule p-xylene unbinding
from the T4 lysozymeL99A mutant protein, which we refer to henceforth
as “lysozyme”. Lysozyme is a common model system for
ligand-unbinding studies both experimentally and computationally.[64] Lysozyme-unbinding pathways have been simulated
through a variety of enhanced and brute-force methods.[65−68] Here, we simulate lysozyme interacting with the p-xylene ligand
in an implicit solvent where the event of interest, ligand unbinding,
is not trivial to observe but is still tractable with straightforward
MD simulations.The system was prepared in an OBC GBSA implicit
solvent, using Amber ff96 for the protein force field and a GAFF and
AM1-BCC parametrization of the ligand. A Langevin integrator was used
with a 2 fs time step, a temperature of 300 K, and a friction coefficient
of 1 ps–1. Here, we determine a target p-xylene
residence time in an implicit solvent using a total of 3.385 μs
of straightforward MD simulation, where p-xylene is “warped”
back to the binding site upon unbinding a total number of 11 times.
This resulted in an unbinding rate of 3.25 μs–1, which we use as a target rate for comparison with both WE simulations
and trajectory ensembles using wepy without resampling. The test groups
that were simulated are as follows: ensemble of 48 walkers with no
resampling, that is, straightforward (SF group); ensemble of 48 walkers
with the REVO resampler (REVO group); ensemble of 48 walkers with
the WExplore resampler (WExplore group).For each group, four
independent simulations were run, each for
a total sampling time (summation across all replicas of ensemble)
of 1 μs. All simulations used the same boundary condition criterion,
which is that the minimum of all ligand–protein interatomic
distances is greater than 1.0 nm. Boundary conditions are checked
at the end of every cycle, which was τ = 20 ps for every group.
The WE simulations (REVO and WExplore) employed the following parameters:
a minimum walker weight of 10–12 and a maximum walker
weight of 5.0 × 10–1. Both REVO and WExplore
used the UnbindingDistance, where the distance between two walker
structures is computed as the distance between their ligands after
alignment of their binding sites to the initial starting structure.
This is included in wepy and has been used with success in other ligand
release simulations.[28,33,47−49]For the WExplore simulation, the same parameters
from previous
publications[28,47,48] were used; four region hierarchy levels with cutoffs d = 10, 5, 3.5, and 2.5 Å with a maximum of 10 subregions per
parent region. For REVO,[46] the following
parameters were used: a characteristic distance of 1 Å, a merge
distance of 2.5 nmc, and a distance exponent
of 4, and we use the weight-based importance function, as described
in eq .
Analysis of Simulation Results
Probability
Distributions
Figure shows a series of
plots of free energies of each simulation group projected onto the
RMSD of the ligand to the starting pose as a function of aggregate
simulation time. Again, the “free energies” here are
more accurately the negative logarithm of the nonequilibrium probability
distributions because we are warping unbound walkers back to the starting
position. Accordingly, we see a local free-energy minimum at RMSD
close to 0 Å but no corresponding minimum in the unbound state.REVO and WExplore reach much larger ligand RMSD values than the
straightforward (SF) simulations. Both WExplore and REVO observed
close to 5 nm RMSD values at 0.052 μs (Figure B), whereas SF has only reached ligand RMSD
values of around 1.7 nm. The probability of large RMSD states tends
to be underestimated early on, as can be seen by comparison between
panel A and B and the final estimates in panel D.By the end
of the simulations (Figure D), all the profiles are very similar between
the different groups until around 4 nm. The curves for WExplore and
REVO beyond that are noisy because of the difficulty in reaching very
large distances (e.g., 7 nm) without reaching the
unbinding boundary condition. In general, note that early time predictions
of WE free-energy profiles can differ significantly from equilibrium
free-energy profiles. These should instead be viewed as direct estimates
of a conditional time-dependent probability distribution: P(x, t|x0, t0), or the probability
of being at a point x and time t, given that we began at point x0 at
time t0. Although we expect these to converge
to equilibrium probabilities only for t →
∞, we can often learn valuable information from the tails of
these distributions.We note that a peculiar artifact in this
system is that the p-xylene
ligand sticks strongly to the surface of lysozyme, and breaking out
of the binding pocket was much easier than leaving the surface of
the protein. This likely explains high-variance ligand RMSD values
above 4 nm and performance could be improved by incorporating the
latter process (leaving the surface) more directly into the distance
metric that governs the resampling process.
Unbinding
Events and Trajectories
A primary interest in the study of
rare events in nonequilibrium
systems is to understand the kinetics of transition paths. Although
we can use a variety of simulation methods to estimate the rate constants
associated with events, it can be much more difficult to obtain accurate
data on the actual mechanistic determinants of these rates. Primarily,
this is the structural details
of transition states, but all structures along a path can potentially
be useful for design purposes. Transition paths obtained with path
sampling methods, such as WE, are especially meaningful because the
Hamiltonian is unperturbed throughout the sampling process. Here,
we show how wepy can easily obtain and analyze continuous, unbiased
trajectories of ligand unbinding.Table shows that both REVO and WExplore outperform
the SF methods in terms of the absolute number of unbinding events
that are observed. More important, although, is that the initial times
of the first observations for REVO and WExplore are much faster than
the SF ensemble simulations. All replicates for REVO and WExplore
simulations have exit points within the first 100 ns, while the first
among SF simulations takes about 3 times as long (at 300 ns) and is
highly variable among replicates. The slowest SF simulation does not
observe an unbinding event until around 700 ns. This highlights the
utility of high-dimensional resamplers such as WExplore and REVO for
generating observations of rare events with a more modest investment
of computing power. In Figure , we show structures along the first unbinding paths generated
by the SF and REVO simulations. From this, we can see that the REVO
simulation (panel B) takes a much shorter, more directed path to unbinding
compared to the SF one (panel A). The ligand in the SF simulation
can be seen to move around to multiple other locations on the surface
of the protein before ultimately unbinding.
Table 1
Unbinding Events and Final Rate Estimates
of Simulations
group
num warps
SF
(numerical target)
11
SF (ensemble)
39
WExplore
486
REVO
4337
Figure 5
Ligand-unbinding trajectories.
These 3D renderings of lysozyme
protein (from the trajectory seed structure) showing the positions
of the p-xylene ligand from the bound positions (red) to unbound positions
(blue). A cartoon representation of lysozyme is in purple surrounded
by a gray surface representation. Positions of the ligand are shown
for the first observed unbinding trajectory from each of the two simulation
groups shown at the end of each WE cycle (τ, i.e., every 20 ps). (A) Unbinding trajectory from the SF group (no resampling).
(B) Trajectory from the REVO group.
Ligand-unbinding trajectories.
These 3D renderings of lysozyme
protein (from the trajectory seed structure) showing the positions
of the p-xylene ligand from the bound positions (red) to unbound positions
(blue). A cartoon representation of lysozyme is in purple surrounded
by a gray surface representation. Positions of the ligand are shown
for the first observed unbinding trajectory from each of the two simulation
groups shown at the end of each WE cycle (τ, i.e., every 20 ps). (A) Unbinding trajectory from the SF group (no resampling).
(B) Trajectory from the REVO group.
Discussion
Successes
Wepy is a useful and flexible
implementation of advanced WE simulations with a growing number of
applications.[33,46,49,69] In our experience, wepy has been particularly
useful in three major ways. First, it has made the prototyping of
new methods very easy even for researchers with little experience
in programming or the Python language. The initial goal of wepy was
to simplify and modularize the original implementation of the WExplore
algorithm.[45] Following this, the REVO algorithm
was fully prototyped, tested, and eventually used as a resampler for
a variety of problems.[33,46,69] This process was made much simpler and faster by the sharing of
the same infrastructure that was already developed. In addition, the
binless nature of the REVO algorithm was actually conceived of partly
as a response to the abstractions formalized by wepy.This highlights
the second point: not only does wepy provide useful software to perform
simulations but it also provides a common language with which researchers
could communicate with each other about their simulations. We have
found the use of diagrams such as Figures and 2 to be valuable
not only for reasoning about our programs but also in explaining WE
and nonequilibrium simulations in general.Finally, wepy has
made the process of analyzing WE simulations
dramatically simpler. The biggest contribution here is the support
for out-of-core data structures (via the HDF5 format)
and expressive in-memory representations that reflect the tree-like
trajectory structures (i.e., WepyHDF5 and ContigTree).
Using this data structure, wepy also provides facilities for easy
analysis and visualization of “resampling trees”.As mentioned in the Introduction (Section ), a big asset to the WE method is that the
microscopic trajectories are generated using the unbiased Hamiltonian.
In wepy, we ensure that these data can be fully leveraged however
the user sees fit. One major use case for these data is the construction
of MSMs with long lag times, which in turn can be used to predict
steady-state probabilities or identify transition states. Although
not mentioned here, wepy provides facilities for constructing macrostate
models [such as conformation state networks (CSNs) and MSMs].Others are encouraged to use, share components for, or even contribute
to wepy which is open source with an MIT license. The source code
is currently hosted on github (https://github.com/ADicksonLab/wepy) and documentation is currently available at https://adicksonlab.github.io/wepy. At the time of publication, the 1.00 release of wepy has been made
and has been archived and given a persistent identifier from Zenodo.org (DOI: 10.5281/zenodo.3973431).
Opportunities for Future Development
Although
wepy provides many essential and novel features, there are
areas for potential improvements that could make it even more widely
useful. Currently, it has not yet been integrated with major MD engines
such as GROMACS,[70] CHARMM,[71] Amber,[72] and Desmond,[73] and there is currently only experimental support
for NAMD[74] and ASE.[75] We note that the architecture of some of these engines
makes it difficult to interface without going through a UNIX-like
system environment (a difficulty for all software using these tools).
We note that OpenMM allows for the use of force fields native to each
engine (e.g., CHARMM, AMBER) to be used within it,
so the issue is more about choice of implementation rather than the
content of the simulations.Second, the priority for wepy developers
has been on implementing and prototyping new adaptive and high-dimensional
resampling algorithms rather than implementing standard static binning
methods or accelerated WE.[23,59] Fortunately, many of
these methods are not complex to implement as resampler objects in
wepy and we hope that these will either be included in future releases
or as standalone libraries. We note that a goal between WESTPA[55] and wepy developers is to enable a modular program
design, where resampler objects could be used interchangeably between
the two WE implementations.Third, for protein and other macromolecular
simulations, wepy resorted
to using an ad hoc serialization format in JSON for
molecular topologies (the schema of which was borrowed from an internal
representation in the mdtraj HDF5 implementation). In principle, wepy
is agnostic to topology formats but in practice, this is an extremely
important component in building simulations and performing analyses.
Despite there being a large number of software packages implementing
in-memory representations of molecular topologies, there are no formats
suitable for serialization and communication between software. We
encourage users to consider the merits of the JSON topology used in
wepy but by no means recommend it as a general purpose standard nor
do we wish it to become a de facto standard.Finally, although the use of an HDF5 based file format has proven
to be a good choice for many reasons, it is currently not supported
natively by any molecular visualization software. Thus, trajectories
must be converted and duplicated into separate files with a supported
format (a simple task using the integration with the mdtraj library).
In our work flows, we treat these files as temporary intermediates;
however, this can bloat the necessary disk space needed and cause
some time delays when (re)generating them. We note, although, that
visualization tools could benefit immensely by adopting a random-access
format such as HDF5 which would allow for visualizing single trajectories
which do not fit in memory, a problem we frequently encounter when
attempting to view very long trajectories.
Authors: B R Brooks; C L Brooks; A D Mackerell; L Nilsson; R J Petrella; B Roux; Y Won; G Archontis; C Bartels; S Boresch; A Caflisch; L Caves; Q Cui; A R Dinner; M Feig; S Fischer; J Gao; M Hodoscek; W Im; K Kuczera; T Lazaridis; J Ma; V Ovchinnikov; E Paci; R W Pastor; C B Post; J Z Pu; M Schaefer; B Tidor; R M Venable; H L Woodcock; X Wu; W Yang; D M York; M Karplus Journal: J Comput Chem Date: 2009-07-30 Impact factor: 3.376
Authors: Ask Hjorth Larsen; Jens Jørgen Mortensen; Jakob Blomqvist; Ivano E Castelli; Rune Christensen; Marcin Dułak; Jesper Friis; Michael N Groves; Bjørk Hammer; Cory Hargus; Eric D Hermes; Paul C Jennings; Peter Bjerre Jensen; James Kermode; John R Kitchin; Esben Leonhard Kolsbjerg; Joseph Kubal; Kristen Kaasbjerg; Steen Lysgaard; Jón Bergmann Maronsson; Tristan Maxson; Thomas Olsen; Lars Pastewka; Andrew Peterson; Carsten Rostgaard; Jakob Schiøtz; Ole Schütt; Mikkel Strange; Kristian S Thygesen; Tejs Vegge; Lasse Vilhelmsen; Michael Walter; Zhenhua Zeng; Karsten W Jacobsen Journal: J Phys Condens Matter Date: 2017-03-21 Impact factor: 2.333
Authors: John D Russo; She Zhang; Jeremy M G Leung; Anthony T Bogetti; Jeff P Thompson; Alex J DeGrave; Paul A Torrillo; A J Pratt; Kim F Wong; Junchao Xia; Jeremy Copperman; Joshua L Adelman; Matthew C Zwier; David N LeBard; Daniel M Zuckerman; Lillian T Chong Journal: J Chem Theory Comput Date: 2022-01-19 Impact factor: 6.006