Literature DB >> 33344813

Wepy: A Flexible Software Framework for Simulating Rare Events with Weighted Ensemble Resampling.

Abstract

Here, we introduce the open-source software framework wepy (https://github.com/ADicksonLab/wepy) which is a toolkit for running and analyzing weighted ensemble (WE) simulations. The wepy toolkit is in pure Python and as such is highly portable and extensible, making it an excellent platform to develop and use new WE resampling algorithms such as WExplore, REVO, and others while leveraging the entire Python ecosystem. In addition, wepy simplifies WE-specific analyses by defining out-of-core tree-like data structures using the cross-platform HDF5 file format. In this paper, we discuss the motivations and challenges for simulating rare events in biomolecular systems. As has previously been shown, high-dimensional WE resampling algorithms such as WExplore and REVO have been successful at these tasks, especially for rare events that are difficult to describe by one or two collective variables. We explain in detail how wepy facilitates implementation of these algorithms, as well as aids in analyzing the unique structure of WE simulation results. To explain how wepy and WE work in general, we describe the mathematical formalism of WE, an overview of the architecture of wepy, and provide code examples of how to construct, run, and analyze simulation results for a protein-ligand system (T4 Lysozyme in an implicit solvent). This paper is written with a variety of readers in mind, including (1) those curious about how to leverage WE rare-event simulations for their domain, (2) current WE users who want to begin using new high-dimensional resamplers such as WExplore and REVO, and (3) expert users who would like to prototype or implement their own algorithms that can be easily adopted by others.

Entities: Chemical Disease Gene Mutation Species

Year: 2020 PMID： 33344813 PMCID： PMC7745226 DOI： 10.1021/acsomega.0c03892

Source DB: PubMed Journal: ACS Omega ISSN： 2470-1343

Introduction

Biomolecular simulation is a valuable tool to gain insight into the atomic-level mechanisms of biological systems. Molecular processes such as ligand (un)binding, protein (un)folding, protein–protein association/dissociation, conformational changes, transport, and enzymatic reactions underlie the functioning of all living organisms. Because the length scales of these phenomena are too small to be observed experimentally at atomic resolution, computer simulations have long attempted to serve as a proxy. This application of molecular dynamics (MD) has been limited by two factors: the accuracy of force fields and available computational power. Despite huge breakthroughs in both force fields[1,2] and hardware,[3−5] it is still difficult to perform simulations of sufficient length to capture the biomolecular processes of interest. For instance, the waiting time for ligand-unbinding events of pharmacologically relevant ligands can extend to tens of minutes or hours,[6] while MD trajectories are typically limited to microsecond timescales. Thus, despite our knowledge (from experiment) that these transitions occur readily at macroscopic timescales, the occurrence of these events during a microsecond molecular simulation can be seen as a “rare event”. A series of algorithms have been developed called “enhanced sampling” methods, which attempt to gain information about long-timescale processes using only short-timescale simulations. Many of these methods use perturbations to the system by either applying biasing forces or increasing the system temperature. This includes enhanced sampling methods such as replica exchange,[7] metadynamics,[8−11] temperature-accelerated MD,[12,13] and umbrella sampling.[14] These methods invoke an assumption that the molecular systems are in equilibrium, usually in the form of a canonical probability density function. Additionally, although there are some approaches to approximate rates based on biased simulations,[8] this approach makes it very difficult to recover full atomistic detail of transition-state ensembles that govern the forward and backward rate constants for a given transition. There is thus a need for a method for accelerating the simulation of rare-event processes that generate trajectories using the unbiased Hamiltonian of interest. One such class of methods is called “path sampling”, which are unique in that they apply a sampling process over a collection of trajectories rather than single conformations.[15] Importantly, they do not require modifications to the underlying dynamics model and provide contiguous trajectories of transition paths. This makes them suitable for studying all varieties of path-dependent observables [e.g., (un)binding rates], as well as providing detailed atomistic models of transition states. There are a variety of path-sampling methods including transition path sampling (TPS),[16−18] forward flux sampling (FFS),[19] multilevel splitting,[20] and weighted ensemble (WE).[15,21] Of these, WE has advantages in which it does not in general require a Markovian assumption or a priori knowledge of full trajectories connecting two states to start[15,22] and does not require the definition of a progress coordinate, which will be discussed further below. At the heart of WE is a resampling process that uses a set of cloning and merging operations among a set of parallel simulation replicas. More formal definitions of resampling and WE are discussed in Section as well as in refs.[15,22,23] WE is a conceptually portable method that can be applied to any field of study including molecular biophysics,[23−33] systems biology,[34−41] telecommunications,[42] and aerospace and engineering.[43] Another benefit of WE is the flexibility of its framework, allowing for resampling methods that avoid the predefinition of progress coordinates and collective variables (CVs).[20,44−46] Approaches involving CVs are typically limited to low-dimensional representations (often ≤3), while WE-based resamplers such as WExplore[45] and REVO[46] perform well in high-dimensional spaces. High-dimensional adaptive resampling algorithms have been especially successful in obtaining preliminary rare-event simulations for systems with waiting times well beyond what is typical.[33,41,47−49] Notably, WExplore simulations produced unbinding trajectories of a drug ligand (TPPU) from its target (soluble epoxide hydrolase) that has an experimentally determined mean first passage time of 11 min, using less than 1 μs of simulation and a speedup of 109-fold.[48]a WExplore and REVO have shown to be particularly useful for discovering multiple pathways. Both algorithms were able to discover multiple ligand dissociation pathways for the trypsin–benzamidine system, which requires substantial rearrangement of the loops comprising the trypsin ligand-binding pocket.[46,47] In contrast, single and low-dimensional projections such as reaction coordinates often constrict the search space to particular paths, which precludes the discovery of alternative paths (and transition states) between macrostates. Finding multiple pathways can be particularly useful for applications such as kinetics-based drug design when we want to understand the structure of the ligand-binding transition state not only for a particular ligand but also for closely related ligands. Finally, adaptive algorithms such as WExplore and REVO (as well as the history-augmented Markov state model WE method[50]) require less upfront parametrization such as the definition of bin boundaries. Despite the advantages of high-dimensional adaptive WE algorithms such as WExplore and REVO, their adoption has been hindered for a number of reasons. First, the implementation of these resampling algorithms is complex and difficult to implement correctly. Second, independent implementations of WE algorithms lack interoperability of produced data and so are difficult to compare. Third, the barrier to entry for other researchers to write an implementation of the resampling algorithm as well as progress metrics for their system of interest is prohibitive. Here, we introduce the open-source wepy software framework for running WE simulations that attempts to address these issues. We first describe a software and data architecture that both reflect a simple mathematical formalism (described in 3.1) and also decompose into multiple modular components. The software architecture allows for reuse of vetted resampling algorithm implementations written by researchers with domain-specific progress metrics written by users. The data architecture solves interoperability through the introduction of a general purpose decision record design (described in the Appendix). Wepy is implemented in the Python 3 programming language and thus allows users to natively leverage a massive ecosystem for scientific computing. Other benefits of a pure-Python implementation are that it (1) increases portability between platforms, (2) has a uniform interface that can be used as a library and embedded into other software easily, and (3) only requires knowledge of a single popular programming language (Python), which if necessary has facilities for writing extremely high performance code (e.g., numba,[51] dask[52]). Currently, wepy is tightly integrated with the OpenMM[53] MD engine and provides excellent support for running GPU MD simulations. The architecture of wepy, however, is agnostic to the underlying dynamics engine as well as to any particular parallel computing strategy or framework. The wepy project also introduces a high-performance single-file storage format and schema for cloning-merging-type simulations implemented in HDF5.[54]b Use of HDF5 also provides “out-of-core” data structures which allow access to simulation data that do not fit entirely into computer main memory. On top of this, an extensive interface [application programmer interface (API)] is provided to make querying, analysis, conversion to other formats of complex path trajectories easy. Part of this interface is to support the intuitive representation of WE trajectories which have been cloned and merged as trees (referred to as “tree-like” data structures). Although there are two other WE frameworks that have been developed, these have different strengths and design goals. The AWE-WQ[23] system provides an implementation of the accelerated weighted ensemble (AWE) along with a “Work Queue”-distributed computing framework. However, AWE-WQ is less flexible than wepy in that it solely implements AWE simulations and is opinionated about the distributed computing framework. The WESTPA[55] software suite is a popular implementation of many binning-based WE resamplers and provides excellent support for running simulations on large clusters of CPU cores and GPUs. It is implemented in Python and typically run using a combination of UNIX-like shell scripts and YAML configuration files. It shares many of the same benefits of wepy discussed above including modularity and the use of HDF5 files for simulation data. WESTPA however did not satisfy our requirements because the architecture is oriented around fixed-topology explicit binning approaches making implementation of binless algorithms such as REVO[46] difficult. Additionally, we also favor the configuration-as-code approach where simulation components are constructed in Python code. In this paper, we first briefly describe WE, resampling, and rate calculations along with a sketch of some of the adaptive resamplers that motivated the construction of wepy. We then introduce a mathematical formalism that provides an overview of the design and architecture of the system followed by a description of the major software components in wepy including how to initialize, run, and analyze simulations. Finally, we present an example ligand-unbinding scenario, using a Lysozyme model system, along with concrete code examples and explanations.

Algorithms

Weighted Ensemble Algorithm

The WE algorithm is a general strategy for simulating rare or long-timescale events in stochastic systems.[21] Its fundamental features are as follows. A set of trajectories are propagated forward in time in a parallel fashion, each one assigned a statistical weight (p). Periodically, the trajectories are “resampled” to rebalance the computational effort toward lower-probability regions. This is done by cloning trajectories in sparsely populated, lower-probability regions, and merging together trajectories in overpopulated, higher-probability regions. To maintain proper statistics, the weight of a cloned trajectory is divided among its progeny, and the weights of merged trajectories are summed and given to the resulting trajectory. Here, we refer to this process as “resampling”. It was shown by Zhang et al.(44) that this is a statistically exact process in that ensemble averages were obtained with WE converge to those of the underlying distribution (e.g., the canonical distribution) in the limit of large sampling, regardless of the particular resampling function. One of the first strategies for resampling is a simple binning of trajectories along a progress coordinate. Trajectories are then cloned in under-represented bins and merged in over-represented bins until the count in each bin is equal to a target value. However, because of the combinatorial explosion for tessellating high-dimensional search spaces with uniform bins, algorithms such as WExplore and REVO were developed to effectively leverage the WE algorithm for processes that take place in inherently high-dimensional spaces.

WExplore Resampler

The main problem with defining bins in a high-dimensional space is that the number of bins needed to cover the space scales exponentially with the dimensionality of a space. This number of bins is proportional to the overall computational effort of the simulation, as a target number of trajectories are to be run in each bin. The WExplore algorithm was developed by Dickson and Brooks to circumvent this difficulty.[45] The key is to construct a set of hierarchical regions: a small set of large regions that tile the entire space, which are each subdivided by a set of smaller regions, which are in turn subdivided by even smaller regions. In the WExplore resampler, these regions are Voronoi polyhedra that are defined by points (called “images”) in a high-dimensional space. In order to assign a trajectory to a region, we must measure the distance from that trajectory state to each image. The trajectory is then assigned to the region with the closest image. More information on this algorithm can be found in the original paper.[45] In a typical WExplore simulation, we start with only a single image and define new images as they are visited by the set of trajectories. A new image is defined when a trajectory in the ensemble reaches a new region of space, or more precisely, when the distance to the closest image is greater than a critical value. The list of critical values (dmin = (d1, d2, ..., d)) thus control the sizes of the regions at each level of the hierarchy and are parameters of the WExplore resampler. As the set of images grows over the course of the simulation, the WExplore resampling function (denoted below) can change with each resampling step.

REVO Resampler

Although WExplore presents a viable means of indexing tessellations of high-dimensional spaces, there are still some behaviors related to the construction of regions that are nonoptimal—most notably, the discontinuous behavior related to reaching new levels of the hierarchy for the first time. For this reason, a new resampling algorithm was recently proposed which avoids the construction of sampling regions altogether. In the REVO algorithm (resampling of ensembles by variation optimization), an objective quantity called the “trajectory variation” (denoted, V) is used to guide the merging and cloning process.[33,46] We calculate V before and after each proposed merging and cloning operation and execute only the operations that cause V to increase. The variation is given bywhere d(X, X) represents the distance between trajectories i and j and ϕ(X) is a function that describes the importance of individual trajectories. The exponent α allows us to balance the relative strength of the distances and the importance functions, and the d0 value does not affect resampling but serves to keep the variation function unitless. On the whole, the variation function increases as the ensemble of trajectories gets further apart. The importance functions can be defined to take the probabilistic weight of the trajectories into accountwhere C is a constant. This has the effect of prioritizing not only a broad ensemble of trajectories but one where the highest weighted trajectories are distributed as far as possible from each other. This strategy can minimize the error in the calculation of observables that depend on the weights of rare trajectories, such as transition rates.

Calculating Transition Rates Using Boundary Conditions

The wepy software supports the construction of nonequilibrium ensembles to calculate transition rates. In this technique,[56−58] a set of history-dependent ensembles are defined using a set of “basins”. For instance, if we are studying ligand-binding transition pathways, then the basins will be the “bound” and the “unbound” states. The unbinding ensemble is then defined as the set of trajectories that have most recently visited the bound state. When a trajectory from the unbinding ensemble enters the unbound state, it transitions to the binding ensemble. In a typical wepy simulation, we can initialize trajectories in a given basin and use boundary conditions that terminate trajectories that enter the opposite basin. The weights of these trajectories are used to calculate a transition flux (probability per unit time), which is used to calculate a rate constant (e.g., kon or koff). These trajectories are then “warped” back to the initial state, retaining the same statistical weight.

Design and Architecture

Formalism

Let us first define ensemble resampling simulations in general. First, we define an ensemble to be a finite multiset of walkers, w, of size Nwhere a walker is a set of two elements: a probabilistic weight, p, and a state, X An ensemble of walkers defines a probability distribution, P(X), and must be normalized such that There are at least two steps in an ensemble resampling simulation: propagating dynamics and resampling. We can also introduce a third step for the so-called boundary conditions that allow for running nonequilibrium simulations and calculation of rate constants. An ensemble resampling simulation process, (for “apparatus”), is made up of three components: a runner function , a boundary condition function , and a resampling function . The dynamics can be any stochastic dynamical process such as MD, Monte Carlo simulations, and so forth. Formally, a runner function has the formwhere is a stochastic function such that multiple evaluations of an input state X might not yield identical X′s and τ is the number of steps of propagation. Although acts independently on each walker state X, it is convenient to write it as The resampling function, , maps an ensemble, , to another ensemble ′ along with an updated resampling function The walkers in this new ensemble must satisfy these constraints In words, ′ consists of walkers with states chosen from the states of the ensemble. Note that these constraints are necessary but not sufficient to ensure that the sampled ensembles are consistent with the cloning and merging process in the WE algorithm. The updated resampler allows for algorithms that take into account history dependence. Thus, a resampling functionis said to be stateless and does not take into account any walker history. Although more commonly stateless, boundary conditions and runners also have equivalent history-dependent definitions These functions typically utilize an application strategy that follows To simplify this, we write it as Similarly, a single cycle of simulation is the application of the three components of the apparatus to an initial ensemble, 0 This interleaved application of apparatus components can be simplified towhere which can easily adopt the notation mentioned above for n cycles applied to the initial walkers A final useful construct is the snapshot which is the complete state of a simulation

Software Components

In this section, we describe concretely the software components that make up the wepy framework and how they are integrated. We begin by describing the simulation manager, which implements the apparatus described above and handles the reporting of output to the user. We then discuss some unique features of WE algorithms that can lead to challenges during data analysis and tools provided by wepy that make this analysis easier.

Building and Running Simulations

To run simulations, we need two things: an ensemble of initial walkers and a simulation manager. A simulation manager is an object that contains all of the apparatus necessary to move the walkers forward. The flowchart in Figure describes the functioning of a simulation manager acting on a set of initial walkers. First, the input walkers have their states propagated by the runner, which can be any sort of dynamics engine. The initial release of wepy supports OpenMM as an MD engine, as well as experimental support for other engines such as NAMD. Because the runner propagation is typically extremely compute-intensive, we also provide another work mapper component, which can be customized for different computing environments. Wepy includes a serial implementation that can be used for testing on a single core or device, as well as two additional work mappers that use the python built-in multiprocessing module that are suitable for simulations using hardware on a single node. Because the work mapper has been factored out of the implementation of the runner itself, these mappers will work for all different MD engines with little modification.

Figure 1

Diagram showing the components and flow of data for the main simulation loop of the simulation manager. Dynamics are performed in the runner component and parallelized onto the compute nodes via the work mapper. The warping and resampling records are shown in the data flow, which are serialized and saved using an HFD5 reporter. Other reporters can be defined to record simulation data, which is indicated by the I/O elements. After propagation by the runner, each walker is tested by the boundary condition function to see if it has met the criteria. If any walker satisfies the boundary condition, the function is free to modify the state of those walkers; this is called a warping event. In addition to the new walkers, a set of warping records are produced that describe how the warping event(s) occurred. These records are then made available for generating the final data structure and any other report that requests them. In wepy, two kinds of warping events are recognized: continuous and discontinuous. A discontinuous warping event is one where dynamical variables of the walker are modified, for example, restarting walkers at the initial state after reaching a target state. A continuous warping event is one in which none of those dynamical variables are modified. This could be an auxiliary attribute such as a “color” after a walker passes through some boundary[59] or none at all in which case only a record will be produced. After the boundary conditions are applied, the resampler resamples the walkers. Again, a collection of records is produced for the resampling; however, unlike the warping records, a record for each walker is produced every cycle. These resampling records contain critical information about the lineages of walker states that is necessary for reconstructing continuous trajectories. It is useful to use a diagram to depict the histories of each walker as they pass through these stages in a single cycle, an example of which we have shown in Figure . Finally, at the end of a cycle, the walkers and records are passed off to an arbitrary number of reporters which can record whatever output is necessary or desired.

Figure 2

Walker history schematic. The vertically stacked boxes are the ensembles of walkers with labels below. The index of the “box” is the walker index in that ensemble. The color of the circle inside the box represents the state of the walker, where we treat white as the initial state of the simulation. The jagged lines indicate propagation of the walker state through the dynamics of the runner. The lines for the boundary condition step indicate whether a warping event occurred, where the dot indicates that a continuous warp occurred and the slash indicates a discontinuous warp, in this case, returning the walker state to the initial conditions. The lines in the resampling phase show cloning and merging, where a solid line indicates that a child inherits the state of its parent and a dashed line indicates a transfer of weight to another walker. In this example, w0 has cloned itself to produce two child walkers while w1 has been “squashed” and its state forgotten. It is “merged” and its weight was added to the resultant w2. Although reporters can be customized for a particular simulation, there are a number of useful reporters included with wepy. The WepyHDF5 reporter generates HDF5 output files, which are a major component of wepy and will be discussed on their own later in relation to analysis. The rest of the reporters are designed to give real-time insights to potentially long-running simulations as well as to give an accessible summary of the simulation results. The dashboard is an executive summary of the simulation provided in a simple plaintext file that is formatted in emacs org-mode that makes it easy to read a potentially large output file in a hierarchical folding manner. The walker ensemble reporter is for visualizing the current state of the walkers that simply outputs a reference topology file and a DCD trajectory file that can be used for visualizing in any of the main 3D molecular visualization programs, while these snapshots of simulations show only transient information that they are useful to indicate rough sketches of progress or for debugging purposes. Finally, the resampling tree output will generate a GEXF formatted XML file that shows the “family tree” of all walkers in the simulation. This is useful for helping understand both when and where resampling and warping events are taking place. These reporters are provided only for convenience as all of this information can be generated from WepyHDF5 files.

Data Format and Analysis

The WE algorithm is built on branching trajectories: simulations that have a single starting point may have multiple ending points. We call these kinds of tree-like branching trajectories nonlinear and a diagram of the difference between them and more traditional linear trajectories can be seen in Figure . In wepy, we call a locally linear selection of frames from the entire nonlinear dataset a trace. These are shown in Figure as the solid colored lines drawn next to the frames they encompass. Nonlinear trajectories introduce additional complexity and typically require modifications to common time-dependent analyses.

Figure 3

Nonlinear trajectory comparison. The black squares indicate single frames which are connected in time by the thin black lines. The larger lines indicate a locally linear trajectory, here called a “trace”. First, visualizing a single linear trajectory requires a selection step and the input of information. For instance, in Figure , the single red and blue traces for the linear trajectories are trivial, whereas there is some redundancy in the nonlinear trajectories from the tree. In a nonlinear tree, we can choose a frame for which we are interested in its history and recover its lineage as a trajectory. These linear trajectories can then be exported to a common format (e.g., CHARMM DCD) and visualized. Second, state network models such as Markov State models (MSMs) are a common method for representing and understanding large amounts of simulation data. This is a perfect match for WE simulations, as the trajectory segments are typically generated using unbiased dynamics. However, current tools for constructing MSMs do not support construction of the transition matrices from nonlinear trajectory trees. Fundamentally, the problem involves avoiding the double-counting frames when counting the transitions for lag times greater than our cycle length (τ). A comparison for the solution for the nonlinear case is compared to the linear case in Figure , which is implemented in wepy. Third, when calculating free energies of macrostates, the weights of trajectory observations must be taken into account. This simply amounts to doing weighted sums rather than counts for macrostate bins and requires careful association of trajectory frames to the instantaneous values of trajectory weights. Finally, observations in different trajectories cannot be assumed to be statistically independent. Take for example, a set of observed warping events take place where a ligand molecule reaches the threshold of unbinding for a receptor. Each single instance contributes to the overall rate of unbinding proportional to its weight via the Hill relation.[15] However, the calculation of macroscopic observable uncertainties (such as the probability of an unbinding event) is complicated by the duplication of single observations via cloning. The degree to which two observations can be seen as “independent” depends on the time point of their last common ancestor. To deal with these fundamental differences in the abstract structure of data generated by a WE simulation, wepy implements a new mechanism for storing trajectory data. To accomplish this, we have designed and implemented a storage layer using the common HDF5 format, which is suited to storing large amounts of heterogeneous data. In this implementation, all data results from a simulation are contained in a single monolithic binary file. Although the data can be accessed with any tool that supports HDF5, in wepy, we provide an extensive API for creating, accessing, querying, processing, and transforming the data at a useful semantic level. In the WepyHDF5 format, we bundle together the three critical and interdependent data pieces into a single file: walker data (including weights and states), resampling data, and boundary condition warping data. This combination is what allows us to generate views of linear trajectories from nonlinear data. The resampling data inform the parent-child relationships between walkers and the warping data alerts to the presence of discontinuities in dynamics of these walker lineages. These primitives solve most of the major problems listed above for dealing with nonlinear trajectories with tools provided in wepy.

Results

In this section, we first discuss code examples on how to set up, run (Section ), and analyze (Section ) a simulation for the test system used in this paper: Lysozyme ligand unbinding in an implicit solvent. We then describe all the parameters and details about the simulations that were run, the results of which are briefly discussed. This small experiment is shown to give an example of the kinds of results and analysis that are typical for wepy simulations.

Code for Running Simulations

Here, we give a brief sketch of how these components were constructed and put together into a simulation manager in a Python script. The complete code is given in the Supporting Information. The system and parameter choices will be discussed in Section . We first need to set up our wepy runner for OpenMM, which requires an OpenMM system and integrator objects. Using OpenMM-Systems helper library (installed with wepy), we can easily create a ready-to-go MD system. The integrator can be constructed using the OpenMM constructor where the three arguments are the temperature, friction coefficient, and dynamics time step, respectively. We can then create a runner object that contains everything needed to propagate the system, where the reference platform specifies a cross-platform reference implementation on a CPU. Note that our production simulations for this work were run with the CUDA platform. Second, we need to create the initial ensemble of walker objects from which to start our simulations. We use a wepy helper function gen_walker_state to generate state objects directly from OpenMM systems. init_state is a WalkerState object which can be put inside a walker. Because we are using a Langevin integrator which has a stochastic component (required by all WE simulations), we can copy the same structure for all starting replicas. Each worker manually has a weight assigned to them; in this case, it is a uniform distribution. Third, we construct a resampler to use. Both the WExplore and REVO resamplers require a metric, which is a way of measuring the distance between two walkers. Details regarding the distance metrics are discussed in more detail in Appendix . For receptor ligand-based systems, there are distance metrics already included in wepy: where lig_idxs and bs_idxs are the atomic indices of the ligand and the binding site in the system. A user can also easily make their own distance metrics; a recipe for doing this is shown below The dependencies here are the distance base class and some helper functions for performing geometric operations on arrays from our custom library geomm. The distance object is created in a similar way, except that we do not need the reference state (which is needed in UnbindingDistance for performance reasons): Fourth, we construct the resampler that we are going to use. Here, we create an instance of the WExploreResampler using the distance and init_state objects that we created earlier as well as some additional algorithm parameters (as discussed in Section ) Fifth, to set up a nonequilibrium ligand-unbinding simulation, we will construct boundary conditions that capture walkers as they cross into the unbound state. Wepy also comes with some built-in modules under receptor-based boundary conditions, which we can import and parametrize as follows where json_top is an internal JSON-based topology format in wepy, more information on this is given in the supplemental example script. Again, this is easy to customize for your application; we show a simplified example of a custom boundary condition below. The warp_walkers function computes the minimum distance between the ligand and the receptor atoms for each walker and if it exceeds a threshold, we “warp” it by replacing that walker state with the initial bound state, while keeping its weight constant. Finally, we assemble these components into a simulation manager Once the simulation manager is constructed, all we need to do now is to tell it to run. A simulation for 10 cycles, each having 10,000 steps per walker per cycle runs as follows This returns the walkers at the end of the simulation along with the final state of the runner, resampler, and boundary conditions which are contained in the final_apparatus. Another important component is the use of reporters. In addition to the walkers and apparatus returned by the simulation manager functions, wepy supports a plugin system to output data as the simulation is progressing. In the following example, we will show you how to use two different reporters: one for creating the HDF5 files and another that produces a plaintext dashboard file for every cycle to show the progress of long running simulations in a high-level overview. If you have the resampler and boundary conditions already constructed, it is very simple to make the HDF5 reporter. This will work out of the box but should likely be customized as the default is to save all data, including the velocities, at each step. The dashboard is composed of different sections for the different components. There is a generic one that displays the number of cycle runs, walker weights, and total simulation time, and there are component-specific sections for information on resampling, boundary conditions, and so forth. These reporters are attached to the simulation manager at creation time As mentioned in Section , the HDF5 reporter is of utmost importance and is a purpose-built fully featured storage format implemented in HDF5. Saving your data in the HDF5 format will let you use an extensive API designed specifically for manipulating WE data (the WepyHDF5 class in the wepy.hdf5 module) as well as allowing lower-level manipulations via libraries such as h5py. Furthermore, a fairly comprehensive analysis toolkit is made available in the wepy.analysis module. This toolkit makes it easy to transform and structure data to interoperate with other analysis toolkits such as scipy,[60] dask,[52] mdtraj,[61] and gephi.[62] In the next section, we show some examples of how to leverage these tools to perform common analyses and visualizations.

Code for Analysis and Visualizations

Probability Distributions

One of the most common ways to visualize simulation data is to project structural data to a small number of dimensions and then plot this as a probability distribution. In practice, free-energy profiles are computed by binning the projection domain and then computing the weighted histogram over those bins by summing the weights of the samples in those bins. The bin values are then normalized to get a probability distribution, which is then transformed by −ln(p) to get a free energy-like value. Typically, the term free energy refers to probability distributions over equilibrium ensembles. However, for convenience, here, we refer to probability distributions transformed as described as “free energy” or more specifically “nonequilibrium free energy” even if the underlying ensemble is not necessarily an equilibrium one. In a single linear MD trajectory, the frames are equally weighted, but in an ensemble of walkers in WE, the weights of each frame can be different and vary over time. As such, the trajectory coordinate data are associated with the weights in the wepy HDF5 file. Here, we show how to create 1D free-energy profiles for each experimental group projected onto the ligand RMSD relative to the bound pose (see Figure ). The process is to first compute the ligand RMSD for all of the simulation frames as an “observable” and then to compute the free-energy profiles which can be plotted with a tool such as matplotlib.[63]

Figure 4

Series of −log(p) distributions for ensemble simulation groups at different time points. The time shown in the bottom right of each panel is the total amount of sampling across all replicas in the ensemble. To compute the ligand RMSD observable, we open the file with the WepyHDF5 API in Python Then, we define a function for computing RMSD that will be mapped over all of the data We then apply the function to the data where the second argument asks to retrieve the relevant data from each frame, which here is just the positions. This also saves the observable into the database as the field “lig-rmsd”. By default, the calc_rmsd_observable function is applied in serial. In wepy, we also provide a distributed version which can connect to a dask cluster server for distributed parallelism on large computing clusters (see documentation). Now, we can create the free-energy profile for this observable. One intermediate step, although, is to use another representation called the “contig tree”, which makes combining multiple contiguous simulations (such as when a simulation is restarted) much easier to analyze. Construction of a contig tree requires (1) a dataset, (2) a “decision” (used internally by the resampler see Appendix for more details), and optionally (3) the boundary condition class that was used for the simulation if any. With the contig tree constructed, we can then feed it to the profiler, bin the domain, and calculate the free energies for each. A line plot of fe against bin is shown for each simulation type in Figure and discussed in Section .

Visualizing Ligand-Unbinding Events

To make rate estimates for a process or to analyze the transition path ensemble, we need an efficient way to examine the pathways from one basin to another. There are tools in wepy to help analyze this kind of boundary condition data, which we call the “warping” records. The data for all the warping events can be found in the HDF5 and are accessed through either the WepyHDF5 object or the ContigTree. Here, we can make a pandas data frame table for run 0, which can easily be exported to any number of formats. Each row of this table contains the index, weight, and point in time that the event took place. To get trajectories from these warping events to the starting structure, we use the following functions on a Contig object which is a single simulation from a ContigTree On the first line, we first generate a linearized Contig from the ContigTree which is necessary for traversal. Then, we open a context for the contig to open the HDF5 file where we can then access the simulation data. We then query for a “trace” of all frames in which a warping event occurred as the warp_trace. From each of these events, we then produce the individual histories (“lineages”) of that walker to the initial state. For this example, we only choose to look at one of these, which is set as lineage_trace. We now want to actually retrieve the simulation data to be able to visualize and perform analysis on it. To do this, we use the lineage trace and choose which attributes we want. For this example, we get only the positions and box vectors because our purpose is to generate visualizations. Now that we have the trace_fields for this trajectory, we can easily convert to an mdtraj object which can then be used to save the trajectory to a file format visualization software can read.

Simulating Lysozyme Ligand Unbinding

To showcase the use of wepy for performing simulations of rare events in real systems, we simulate the small-molecule p-xylene unbinding from the T4 lysozyme L99A mutant protein, which we refer to henceforth as “lysozyme”. Lysozyme is a common model system for ligand-unbinding studies both experimentally and computationally.[64] Lysozyme-unbinding pathways have been simulated through a variety of enhanced and brute-force methods.[65−68] Here, we simulate lysozyme interacting with the p-xylene ligand in an implicit solvent where the event of interest, ligand unbinding, is not trivial to observe but is still tractable with straightforward MD simulations. The system was prepared in an OBC GBSA implicit solvent, using Amber ff96 for the protein force field and a GAFF and AM1-BCC parametrization of the ligand. A Langevin integrator was used with a 2 fs time step, a temperature of 300 K, and a friction coefficient of 1 ps–1. Here, we determine a target p-xylene residence time in an implicit solvent using a total of 3.385 μs of straightforward MD simulation, where p-xylene is “warped” back to the binding site upon unbinding a total number of 11 times. This resulted in an unbinding rate of 3.25 μs–1, which we use as a target rate for comparison with both WE simulations and trajectory ensembles using wepy without resampling. The test groups that were simulated are as follows: ensemble of 48 walkers with no resampling, that is, straightforward (SF group); ensemble of 48 walkers with the REVO resampler (REVO group); ensemble of 48 walkers with the WExplore resampler (WExplore group). For each group, four independent simulations were run, each for a total sampling time (summation across all replicas of ensemble) of 1 μs. All simulations used the same boundary condition criterion, which is that the minimum of all ligand–protein interatomic distances is greater than 1.0 nm. Boundary conditions are checked at the end of every cycle, which was τ = 20 ps for every group. The WE simulations (REVO and WExplore) employed the following parameters: a minimum walker weight of 10–12 and a maximum walker weight of 5.0 × 10–1. Both REVO and WExplore used the UnbindingDistance, where the distance between two walker structures is computed as the distance between their ligands after alignment of their binding sites to the initial starting structure. This is included in wepy and has been used with success in other ligand release simulations.[28,33,47−49] For the WExplore simulation, the same parameters from previous publications[28,47,48] were used; four region hierarchy levels with cutoffs d = 10, 5, 3.5, and 2.5 Å with a maximum of 10 subregions per parent region. For REVO,[46] the following parameters were used: a characteristic distance of 1 Å, a merge distance of 2.5 nmc, and a distance exponent of 4, and we use the weight-based importance function, as described in eq .

Analysis of Simulation Results

Probability Distributions

Figure shows a series of plots of free energies of each simulation group projected onto the RMSD of the ligand to the starting pose as a function of aggregate simulation time. Again, the “free energies” here are more accurately the negative logarithm of the nonequilibrium probability distributions because we are warping unbound walkers back to the starting position. Accordingly, we see a local free-energy minimum at RMSD close to 0 Å but no corresponding minimum in the unbound state. REVO and WExplore reach much larger ligand RMSD values than the straightforward (SF) simulations. Both WExplore and REVO observed close to 5 nm RMSD values at 0.052 μs (Figure B), whereas SF has only reached ligand RMSD values of around 1.7 nm. The probability of large RMSD states tends to be underestimated early on, as can be seen by comparison between panel A and B and the final estimates in panel D. By the end of the simulations (Figure D), all the profiles are very similar between the different groups until around 4 nm. The curves for WExplore and REVO beyond that are noisy because of the difficulty in reaching very large distances (e.g., 7 nm) without reaching the unbinding boundary condition. In general, note that early time predictions of WE free-energy profiles can differ significantly from equilibrium free-energy profiles. These should instead be viewed as direct estimates of a conditional time-dependent probability distribution: P(x, t|x0, t0), or the probability of being at a point x and time t, given that we began at point x0 at time t0. Although we expect these to converge to equilibrium probabilities only for t → ∞, we can often learn valuable information from the tails of these distributions. We note that a peculiar artifact in this system is that the p-xylene ligand sticks strongly to the surface of lysozyme, and breaking out of the binding pocket was much easier than leaving the surface of the protein. This likely explains high-variance ligand RMSD values above 4 nm and performance could be improved by incorporating the latter process (leaving the surface) more directly into the distance metric that governs the resampling process.

Unbinding Events and Trajectories

A primary interest in the study of rare events in nonequilibrium systems is to understand the kinetics of transition paths. Although we can use a variety of simulation methods to estimate the rate constants associated with events, it can be much more difficult to obtain accurate data on the actual mechanistic determinants of these rates. Primarily, this is the structural details of transition states, but all structures along a path can potentially be useful for design purposes. Transition paths obtained with path sampling methods, such as WE, are especially meaningful because the Hamiltonian is unperturbed throughout the sampling process. Here, we show how wepy can easily obtain and analyze continuous, unbiased trajectories of ligand unbinding. Table shows that both REVO and WExplore outperform the SF methods in terms of the absolute number of unbinding events that are observed. More important, although, is that the initial times of the first observations for REVO and WExplore are much faster than the SF ensemble simulations. All replicates for REVO and WExplore simulations have exit points within the first 100 ns, while the first among SF simulations takes about 3 times as long (at 300 ns) and is highly variable among replicates. The slowest SF simulation does not observe an unbinding event until around 700 ns. This highlights the utility of high-dimensional resamplers such as WExplore and REVO for generating observations of rare events with a more modest investment of computing power. In Figure , we show structures along the first unbinding paths generated by the SF and REVO simulations. From this, we can see that the REVO simulation (panel B) takes a much shorter, more directed path to unbinding compared to the SF one (panel A). The ligand in the SF simulation can be seen to move around to multiple other locations on the surface of the protein before ultimately unbinding.

Table 1

Unbinding Events and Final Rate Estimates of Simulations

group	num warps
SF (numerical target)	11
SF (ensemble)	39
WExplore	486
REVO	4337

Figure 5

Ligand-unbinding trajectories. These 3D renderings of lysozyme protein (from the trajectory seed structure) showing the positions of the p-xylene ligand from the bound positions (red) to unbound positions (blue). A cartoon representation of lysozyme is in purple surrounded by a gray surface representation. Positions of the ligand are shown for the first observed unbinding trajectory from each of the two simulation groups shown at the end of each WE cycle (τ, i.e., every 20 ps). (A) Unbinding trajectory from the SF group (no resampling). (B) Trajectory from the REVO group.

Discussion

Successes

Wepy is a useful and flexible implementation of advanced WE simulations with a growing number of applications.[33,46,49,69] In our experience, wepy has been particularly useful in three major ways. First, it has made the prototyping of new methods very easy even for researchers with little experience in programming or the Python language. The initial goal of wepy was to simplify and modularize the original implementation of the WExplore algorithm.[45] Following this, the REVO algorithm was fully prototyped, tested, and eventually used as a resampler for a variety of problems.[33,46,69] This process was made much simpler and faster by the sharing of the same infrastructure that was already developed. In addition, the binless nature of the REVO algorithm was actually conceived of partly as a response to the abstractions formalized by wepy. This highlights the second point: not only does wepy provide useful software to perform simulations but it also provides a common language with which researchers could communicate with each other about their simulations. We have found the use of diagrams such as Figures and 2 to be valuable not only for reasoning about our programs but also in explaining WE and nonequilibrium simulations in general. Finally, wepy has made the process of analyzing WE simulations dramatically simpler. The biggest contribution here is the support for out-of-core data structures (via the HDF5 format) and expressive in-memory representations that reflect the tree-like trajectory structures (i.e., WepyHDF5 and ContigTree). Using this data structure, wepy also provides facilities for easy analysis and visualization of “resampling trees”. As mentioned in the Introduction (Section ), a big asset to the WE method is that the microscopic trajectories are generated using the unbiased Hamiltonian. In wepy, we ensure that these data can be fully leveraged however the user sees fit. One major use case for these data is the construction of MSMs with long lag times, which in turn can be used to predict steady-state probabilities or identify transition states. Although not mentioned here, wepy provides facilities for constructing macrostate models [such as conformation state networks (CSNs) and MSMs]. Others are encouraged to use, share components for, or even contribute to wepy which is open source with an MIT license. The source code is currently hosted on github (https://github.com/ADicksonLab/wepy) and documentation is currently available at https://adicksonlab.github.io/wepy. At the time of publication, the 1.00 release of wepy has been made and has been archived and given a persistent identifier from Zenodo.org (DOI: 10.5281/zenodo.3973431).

Opportunities for Future Development

Although wepy provides many essential and novel features, there are areas for potential improvements that could make it even more widely useful. Currently, it has not yet been integrated with major MD engines such as GROMACS,[70] CHARMM,[71] Amber,[72] and Desmond,[73] and there is currently only experimental support for NAMD[74] and ASE.[75] We note that the architecture of some of these engines makes it difficult to interface without going through a UNIX-like system environment (a difficulty for all software using these tools). We note that OpenMM allows for the use of force fields native to each engine (e.g., CHARMM, AMBER) to be used within it, so the issue is more about choice of implementation rather than the content of the simulations. Second, the priority for wepy developers has been on implementing and prototyping new adaptive and high-dimensional resampling algorithms rather than implementing standard static binning methods or accelerated WE.[23,59] Fortunately, many of these methods are not complex to implement as resampler objects in wepy and we hope that these will either be included in future releases or as standalone libraries. We note that a goal between WESTPA[55] and wepy developers is to enable a modular program design, where resampler objects could be used interchangeably between the two WE implementations. Third, for protein and other macromolecular simulations, wepy resorted to using an ad hoc serialization format in JSON for molecular topologies (the schema of which was borrowed from an internal representation in the mdtraj HDF5 implementation). In principle, wepy is agnostic to topology formats but in practice, this is an extremely important component in building simulations and performing analyses. Despite there being a large number of software packages implementing in-memory representations of molecular topologies, there are no formats suitable for serialization and communication between software. We encourage users to consider the merits of the JSON topology used in wepy but by no means recommend it as a general purpose standard nor do we wish it to become a de facto standard. Finally, although the use of an HDF5 based file format has proven to be a good choice for many reasons, it is currently not supported natively by any molecular visualization software. Thus, trajectories must be converted and duplicated into separate files with a supported format (a simple task using the integration with the mdtraj library). In our work flows, we treat these files as temporary intermediates; however, this can bloat the necessary disk space needed and cause some time delays when (re)generating them. We note, although, that visualization tools could benefit immensely by adopting a random-access format such as HDF5 which would allow for visualizing single trajectories which do not fit in memory, a problem we frequently encounter when attempting to view very long trajectories.

58 in total

1. Computer simulation of protein folding.

Authors: M Levitt; A Warshel
Journal: Nature Date: 1975-02-27 Impact factor: 49.962

2. DNA-Binding Kinetics Determines the Mechanism of Noise-Induced Switching in Gene Networks.

Authors: Margaret J Tse; Brian K Chu; Mahua Roy; Elizabeth L Read
Journal: Biophys J Date: 2015-10-20 Impact factor: 4.033

3. Efficient and verified simulation of a path ensemble for conformational change in a united-residue model of calmodulin.

Authors: Bin W Zhang; David Jasnow; Daniel M Zuckerman
Journal: Proc Natl Acad Sci U S A Date: 2007-11-01 Impact factor: 11.205

Review 4. CHARMM: the biomolecular simulation program.

Authors: B R Brooks; C L Brooks; A D Mackerell; L Nilsson; R J Petrella; B Roux; Y Won; G Archontis; C Bartels; S Boresch; A Caflisch; L Caves; Q Cui; A R Dinner; M Feig; S Fischer; J Gao; M Hodoscek; W Im; K Kuczera; T Lazaridis; J Ma; V Ovchinnikov; E Paci; R W Pastor; C B Post; J Z Pu; M Schaefer; B Tidor; R M Venable; H L Woodcock; X Wu; W Yang; D M York; M Karplus
Journal: J Comput Chem Date: 2009-07-30 Impact factor: 3.376

5. Multiple Ligand Unbinding Pathways and Ligand-Induced Destabilization Revealed by WExplore.

Authors: Alex Dickson; Samuel D Lotz
Journal: Biophys J Date: 2017-02-28 Impact factor: 4.033

6. The atomic simulation environment-a Python library for working with atoms.

Authors: Ask Hjorth Larsen; Jens Jørgen Mortensen; Jakob Blomqvist; Ivano E Castelli; Rune Christensen; Marcin Dułak; Jesper Friis; Michael N Groves; Bjørk Hammer; Cory Hargus; Eric D Hermes; Paul C Jennings; Peter Bjerre Jensen; James Kermode; John R Kitchin; Esben Leonhard Kolsbjerg; Joseph Kubal; Kristen Kaasbjerg; Steen Lysgaard; Jón Bergmann Maronsson; Tristan Maxson; Thomas Olsen; Lars Pastewka; Andrew Peterson; Carsten Rostgaard; Jakob Schiøtz; Ole Schütt; Mikkel Strange; Kristian S Thygesen; Tejs Vegge; Lasse Vilhelmsen; Michael Walter; Zhenhua Zeng; Karsten W Jacobsen
Journal: J Phys Condens Matter Date: 2017-03-21 Impact factor: 2.333

7. Predicting ligand binding affinity using on- and off-rates for the SAMPL6 SAMPLing challenge.

Authors: Tom Dixon; Samuel D Lotz; Alex Dickson
Journal: J Comput Aided Mol Des Date: 2018-08-23 Impact factor: 3.686

8. Funnel metadynamics as accurate binding free-energy method.

Authors: Vittorio Limongelli; Massimiliano Bonomi; Michele Parrinello
Journal: Proc Natl Acad Sci U S A Date: 2013-04-03 Impact factor: 11.205

Review 9. Drug-target residence time: critical information for lead optimization.

Authors: Hao Lu; Peter J Tonge
Journal: Curr Opin Chem Biol Date: 2010-07-19 Impact factor: 8.822

10. Distance-based topological polynomials and indices of friendship graphs.

Authors: Wei Gao; Mohammad Reza Farahani; Muhammad Imran; M R Rajesh Kanna
Journal: Springerplus Date: 2016-09-15

6 in total

1. CHARMM-GUI Implicit Solvent Modeler for Various Generalized Born Models in Different Simulation Programs.

Authors: Kye Won Wang; Jumin Lee; Han Zhang; Donghyuk Suh; Wonpil Im
Journal: J Phys Chem B Date: 2022-09-18 Impact factor: 3.466

Review 2. Enhanced-Sampling Simulations for the Estimation of Ligand Binding Kinetics: Current Status and Perspective.

Authors: Katya Ahmad; Andrea Rizzi; Riccardo Capelli; Davide Mandelli; Wenping Lyu; Paolo Carloni
Journal: Front Mol Biosci Date: 2022-06-08

3. Molecular Paradigms for Biological Mechanosensing.

Authors: David Gomez; Willmor J Peña Ccoa; Yuvraj Singh; Enrique Rojas; Glen M Hocky
Journal: J Phys Chem B Date: 2021-10-28 Impact factor: 3.466

4. Single-molecule studies reveal method for tuning the heterogeneous activity of alkaline phosphatase.

Authors: Tal Gilboa; Alana F Ogata; Charles B Reilly; David R Walt
Journal: Biophys J Date: 2022-05-07 Impact factor: 3.699

5. WESTPA 2.0: High-Performance Upgrades for Weighted Ensemble Simulations and Analysis of Longer-Timescale Applications.

Authors: John D Russo; She Zhang; Jeremy M G Leung; Anthony T Bogetti; Jeff P Thompson; Alex J DeGrave; Paul A Torrillo; A J Pratt; Kim F Wong; Junchao Xia; Jeremy Copperman; Joshua L Adelman; Matthew C Zwier; David N LeBard; Daniel M Zuckerman; Lillian T Chong
Journal: J Chem Theory Comput Date: 2022-01-19 Impact factor: 6.006

6. Local Ion Densities can Influence Transition Paths of Molecular Binding.

Authors: Nicole M Roussey; Alex Dickson
Journal: Front Mol Biosci Date: 2022-04-26

6 in total