Literature DB >> 25246856

Simultaneous Computation of Dynamical and Equilibrium Information Using a Weighted Ensemble of Trajectories.

Ernesto Suárez¹, Steven Lettieri¹, Matthew C Zwier², Carsen A Stringer³, Sundar Raman Subramanian¹, Lillian T Chong², Daniel M Zuckerman¹.

Abstract

Equilibrium formally can be represented as an ensemble of uncoupled systems undergoing unbiased dynamics in which detailed balance is maintained. Many nonequilibrium processes can be described by suitable subsets of the equilibrium ensemble. Here, we employ the "weighted ensemble" (WE) simulation protocol [Huber and Kim, Biophys. J.1996, 70, 97-110] to generate equilibrium trajectory ensembles and extract nonequilibrium subsets for computing kinetic quantities. States do not need to be chosen in advance. The procedure formally allows estimation of kinetic rates between arbitrary states chosen after the simulation, along with their equilibrium populations. We also describe a related history-dependent matrix procedure for estimating equilibrium and nonequilibrium observables when phase space has been divided into arbitrary non-Markovian regions, whether in WE or ordinary simulation. In this proof-of-principle study, these methods are successfully applied and validated on two molecular systems: explicitly solvated methane association and the implicitly solvated Ala4 peptide. We comment on challenges remaining in WE calculations.

Entities: Chemical Disease Gene

Year: 2014 PMID： 25246856 PMCID： PMC4168800 DOI： 10.1021/ct401065r

Source DB: PubMed Journal: J Chem Theory Comput ISSN： 1549-9618 Impact factor: 6.006

Introduction

Although it is textbook knowledge that the functions of biomacromolecules are strongly coupled to their conformational motions and fluctuations,[1] computer simulation of such motions has been a challenge for decades.[2] Typically, distinct algorithms are employed to estimate equilibrium quantities (e.g., refs (3) and (4)) and dynamical properties (e.g., refs (5−10)). In principle, a single long dynamics trajectory would be sufficient to determine both equilibrium and dynamical properties,[11] but such simulations remain impractical for most systems of interest. Aside from straightforward simulations, more technical approaches that can yield both equilibrium and dynamical simulation, sometimes under minor assumptions, have drawn increasing attention. A number of approaches employ Markov state models (MSMs) as part their overall computational strategy. On the basis of replica exchange molecular dynamics (REMD),[12,13] it is possible to extract kinetic information from continuous trajectory segments between exchanges and thereby construct an MSM.[13] The adaptive seeding method (ASM) similarly builds an MSM based on trajectories seeded from states discovered via REMD or another of the so-called generalized ensemble (GE) algorithms.[14] MSMs have also been used in combination with short, off-equilibrium simulations to construct the equilibrium ensemble of folding pathways of a protein.[15] Another general strategy is to employ a series of nonintersecting interfaces that interpolate between states of interest selected in advance. Milestoning generates and analyzes transitions between interfaces assuming prior history does not affect the distribution of trajectories.[16,17] Transition interface sampling (TIS)[18,19] and its variants also analyze such transitions and can yield free energy barriers in addition to rates while accounting for some history information.[20] Forward flux sampling (FFS) again samples interface transitions: it accounts for history information and can yield rates and equilibrium information.[7,21] The “weighted ensemble” (WE) simulation strategy[5] (see Figure 1), which has a rigorous basis as a path-sampling method,[22] has also been suggested as an approach for computation of both equilibrium and nonequilibrium properties.[23,24] Although WE was originally developed as a tool for characterizing nonequilibrium dynamical pathways and rates (e.g., refs (5), (25−28)), the strategy was extended to steady-state conditions including equilibrium.[23] The simultaneous computation of equilibrium and kinetic properties using WE was demonstrated with configuration space separated into two states by a dividing surface[24] and later for arbitrary states defined in advance of a simulation.[29] In contrast to many other advanced sampling strategies, WE generates an ensemble of continuous trajectories, all at the physical condition (e.g., temperature) of interest.

Figure 1

Equilibrium in different representations. (a) Ensemble of trajectories with arrow tips indicating the instantaneous configuration and tails showing recent history in the space of two schematic coordinates q1 and q2. States A and B, shown in gray, are two arbitrary regions of phase space. (b) Dissection into two subsets based on whether a trajectory was most recently in state A (black solid arrows, the “α” steady state) or state B (red dashed, the “β” steady state). (c) Statistically equivalent ensemble of weighted trajectories, with arrow thickness suggesting weight. Configuration space has been divided into cells (“bins”) which each containing an equal number of trajectories. Here, we further develop the capability of WE simulation to calculate equilibrium and nonequilibrium quantities simultaneously in several ways that may be important for future studies of increasingly complex systems. (i) The approach described below permits the calculation of rates between arbitrary states, which can be defined after a simulation has been completed. In a complex system, the most important physical states, including intermediates, generally will not be obvious prior to simulation. Further, the present approach opens up the possibility to use rate calculations to aid in the state-definition process. (ii) The non-Markovian analysis described here enables unbiased rate calculations in the typical case where “bins” used by WE simulation do not exhibit Markovian behavior. The analysis is general and can be applied outside the WE context, including the analysis of ordinary long trajectories. (iii) The non-Markovian analysis can improve the efficiency of WE simulations by yielding accurate estimates of observables from shorter simulations. The analysis is based on a previously suggested decomposition of the equilibrium ensemble into two nonequilibrium steady states.[9,20,21,30,31] Generally speaking, WE provides an attractive basis for complex simulations. WE is easily parallelizable because it employs multiple trajectories and was recently used with 3500 cores.[32] Because there is no need to “catch” trajectories at precise transition interfaces, WE algorithms lend themselves to a scripting-like implementation which has been employed to study a wide range of stochastic systems via regular molecular dynamics,[28] Monte Carlo,[26] the string strategy,[33] and Gillespie-algorithm dynamics of chemical kinetic networks.[34]

Theoretical Formulation

WE simulation uses multiple simultaneous trajectories, with weights that sum to one, that are occasionally coupled by replication or combination events every τ units of time.[5] The coupling events typically are governed by a static partition of configuration space into “bins” (Figure 1c), although dynamical/adaptive bins may be used.[22] In the case of static bins, when one or more trajectories enters an unoccupied bin, those trajectories are replicated so that their count conforms to a (typically) preset value, M. Replicated “daughter” trajectories inherit equal shares of the parent’s weight. If more than M trajectories are found to occupy a bin, trajectories are combined statistically in a pairwise fashion until M remain, with weight from pruned trajectories assigned to others in the same bin. These procedures are carried out in such a way that dynamics remain statistically unbiased.[22] This study does not adjust weights according to previously developed reweighting procedures[23] during the simulation. Rather, the WE simulations described here are long enough to permit relaxation to the equilibrium state.

Direct Calculation of Observables

Once the equilibrium state is reached in a WE simulation, meaning that there is a detailed balance of probability flow between any two states, equilibrium observables such as state populations or a potential of mean force can be calculated simply by summing trajectory weights in the corresponding regions of phase space. We term this “direct” estimation of observables. To calculate rates, the equilibrium set of trajectories (Figure 1a) is decomposed into two steady states as shown in Figure 1b: the α steady state consisting of trajectories more recently in A than B, and the β steady state with those most recently in B;[9,31] these were denoted “AB” and “BA” steady states, respectively, in ref (31). Trajectories are “labeled” according to the last state visited, i.e., classified as α or β, during a WE simulation or in a postsimulation analysis (“post-analysis”). The direct rate kAB estimate is computed from the probability arriving at the final state[4,7,9,20,23,35] viawhere MFPT is the mean-first-passage time, Flux(A → B|α) is the probability per unit time arriving at state B in the α steady state, and p(α) is the total probability in the α steady state. By construction p(α) + p(β) = 1. Normalizing by p(α) effectively excludes the reverse steady state, and the rate calculation only “sees” the unidirectional α steady state as in ref (23). An expression analogous to eq 1 applies for kBA. Also note that the effective first order rate constant, defined by Flux(A → B|α)/pAeq, can be determined from equilibrium WE simulation because PAeq can be directly computed by summing weights in A. We note that analogous direct calculation of observables can be performed from an equilibrium ensemble of unweighted (i.e., “brute force”) trajectories by assigning equal weights to each.

Non-Markovian Matrix Calculation of Observables

Beyond the direct estimates of observables based on trajectory weights, we also generalize previous matrix formulations for nonequilibrium steady states[9,30,36] into an equilibrium formulation that explicitly accounts for the embedded steady states (as in Figure 1b,c). These non-Markovian matrix estimates are tested below and may prove important for future WE studies using shorter simulations, as described in the Discussion. Our matrix approach explicitly uses the decomposition of the equilibrium population into α and β components for each bin i:which implies p(α) = ∑pα and p(β) = ∑pβ. We called this a “labeled” analysis. Thus, with N bins, a set of 2N probabilities is required rather than N. Similarly, a 2N × 2N rate matrix is required: k, where μ and ν can be either the α or β subsets of trajectories. See Figure 2. Each of the previously considered k rate elements is thus decomposed into four history-dependent elements which account for whether the particular trajectory was last in state A or B and whether the trajectory transitions between the α and β subsets. The analysis assumes states consist strictly of one or more bins, but this is always possible in a post-analysis without a loss of generality. In other words, given the flexibility we have when we define the bins, it is not a real limitation that the states have to be strictly constituted by bins.

Figure 2

Constructing a labeled rate matrix for unbiased calculations. For purposes of illustration, here state A consists solely of bin 1 and state B solely of bin 3. Left: A traditional rate matrix with history-blind elements. The rate k gives the conditional probability for transitioning from bin i to bin j in a fixed time increment, regardless of previous history. Right: The labeled rate matrix accounting for history. The element kμv is the conditional probability for the i to j transition for trajectories initially in the μ subensemble which transition to the ν subensemble, where μ and ν are either α or β. The labeled rate matrix correctly assigns the α and β subpopulations of each bin, whereas the traditional matrix may not. We wish to emphasize that this analysis is “non-Markovian” because we are explicitly including history information (i.e., α and β labels) in the new 2N × 2N rate matrix. Once the matrix is built, the steady state observables are obtained using the same mathematical formalism that would be used in a regular Markov model. However, the matrix should be seen as a tool of linear algebra and not as embodying any physical assumptions. Note that more than half the kμv elements are zero. For example, consider a bin in the “intermediate” region (neither A nor B), such as bin 2 in Figure 2. In this region, an α trajectory cannot change into a β trajectory, nor vice versa; hence rates for these processes are zero. Similarly, an α trajectory in the intermediate region which enters a bin in B must turn into a β trajectory, so the rate will always be zero to the α components of bins in B. The non-Markovian results below stem from the division into α and β steady states, but several steps are required. First, rates among bins are estimated in a post-analysis aswhere ωμν is the probability flux, for a given iteration, from bin i to j of trajectories only with initial and final “labels” μ and ν, respectively, while ωμ is the population labeled as μ which is initially in i. The subscript “2” in the numerator indicates that the rate kμv is estimated to be nonzero only when more than one transition is observed; after the second event, all events are included, from the first one, to avoid bias. The requirement for two transitions was found to greatly enhance numerical stability in estimating fluxes and rates between macroscopic states: rates estimated from single events exhibit large fluctuations. Notice that eq 3 is a ratio of averages and differs from the average ratio ⟨ωμν/ωμ⟩, which might seem equally or more “natural.” However, our data show that eq 3 yields unbiased estimates, while the average ratio may not (data not shown). The difference between the two estimators indicates that transitions are correlated with trajectory weights. Perhaps more importantly, the average ratio places less importance on high weight transitions due to the instantaneous normalization—and so, in a time-average sense, may be incorrect. That is, low-weight transitions count as heavily as high-weight events, which evidently biases the rate estimate. In the ratio of averages, high-weight events appropriately count more. To obtain “macroscopic” rates between states consisting of arbitrary sets of bins (noting that arbitrary bins can be employed in a post-analysis), we calculate “labeled” fluxes for use in eq 1 viaThe labeled bin populations pα and pβ are obtained from the steady-state solution of the labeled rate matrix K = {kμν}. A summary of the “labeled” or non-Markovian matrix procedure for estimating rates between arbitrary states is as follows. First, we obtain the labeled rate matrix K = {kμv} using eq 3 to average interbin transitions. Second, we solve the matrix problem KpSS = pSS, yielding the steady state solution pSS. Notice that the equilibrium bin populations can be computed by eq 2. Then, the steady state solution pSS along with the labeled rate matrix elements are used to calculate the α flux entering state B and the β flux entering A (eq 4). Finally, the MFPT values are obtained from eq 1. In the graphs below, each non-Markovian estimate shown is from the matrix solution using the kμν rates calculated based on all data obtained until the given iteration of the simulation. The non-Markovian matrix formulation exhibits a number of desirable properties: (i) Unlike with unlabeled (i.e., implicitly Markovian) analysis, kinetic properties will be unbiased as shown below. (ii) Solution of both the α and β steady states is performed simultaneously via a standard Markov-state-like analysis of the kμν rate matrix. By contrast, if the α and β steady states are independently solved within a Markov formalism, there can be substantial ambiguity in how to assign feedback from the target to initial state when the initial state consists of more than one bin. (iii) The labeled formulation guarantees, by construction, the flux balance intrinsic to equilibrium, namely, Flux(A → B|α) = Flux(B → A|β). (iv) The analysis can be performed using arbitrary bins (and states defined as sets of these bins). It is not necessary to employ the bins originally used to run the WE simulation because a post-analysis can calculate rates among any regions of configuration space. (v) The analysis is equally applicable to ordinary brute-force simulations.

Markovian Matrix Calculation of Observables

For reference, we also perform a traditional Markov analysis of the trajectories, which will prove to yield biased rate estimates because most divisions of configuration space (e.g., WE bins) are not true Markovian states. The Markov analysis proceeds without labeling the trajectories. Elements of the rate matrix are estimated aswhere the subscript “2” again means that we only estimate a rate as nonzero once at least two transitions from i to j have occurred. Bin populations are then computed by solving for the steady-state solution of the Markov matrix with elements k. The computation of an MFPT requires the use of source (A) and sink (B) states. This task is automatically performed within the labeled formalism previously described. Hence, we determine Markovian macroscopic rates by substituting the Markovian k for all nonzero elements of the kμν. We emphasize that this is merely an accounting trick to establish sources and sinks and simultaneously measure both A-to-B and B-to-A fluxes/rates. We perform a smoothing operation on the macroscopic Markovian rates because otherwise the data are fairly noisy. The MFPT results shown for the Markovian matrix analysis are running averages based on the last 50% of the estimates (where each estimate is from the matrix solution using k estimates from all data obtained until the particular iteration). We confirmed numerically that such smoothing did not contribute bias to any of the MFPT estimates.

Model Systems and Simulation Details

Weighted ensemble simulations were performed on two systems: the alanine tetrapeptide (Ala4) solvated implicitly and a pair of explicitly solvated methane molecules. All simulations were performed at 300 K with a stochastic thermostat (Langevin thermostat). Friction constants of 5.0 and 1.0 ps–1 were used for Ala4 and methane systems, respectively. The molecular dynamics time step used for all systems was Δt = 2 fs. An iteration is defined to be the simultaneous propagation of all trajectories in the ensemble for some amount of time, τ. In these studies, a value of τ = 2500Δt is used for Ala4 and τ = 250Δt for the methane–methane system. For Ala4, the all-atom AMBER ff99SB force field[37] with implicit GB/SA solvent and no cutoff for the evaluation of nonbonded interactions was simulated using the AMBER 11 software package.[38] The Hawkins, Cramer, and Truhlar[39,40] pairwise generalized Born model is used, with parameters described by Tsui and Case[41] (option igb=1 in AMBER 11 input file). The progress coordinates were selected and “binned” using a 10 × 10 partition of a 2D space. A dihedral distance D = ((1/N)∑d2)1/2 ∈ [0,180] with respect to a reference set of torsions is used in the first dimension, where N is the number of torsional angles considered and d is the circular distance between the current value of the ith angle and our reference, i.e., the smaller of the two arclengths along the circumference. This dimension was divided every 14° from 0 to 126° and then a final partition covering the space (126,180]). In the second dimension, a regular RMSD, using only heavy atoms, is measured with respect to an α-helical structure. In this case, the space was divided every 0.4 Å from 0 to 3.6 Å and then a final partition covering the space [3.6,∞). Values and coordinates for the references used to compute the order parameters are given in the Supporting Information (SI). The methane molecules were simulated using the GROMACS 4.5 software package[42] with the united-atom GROMOS 45a3 force field[43] and dodecahedral periodic box of TIP3P water molecules[44] (about 900 water molecules in a 34 × 34 × 24 Å box). van der Waals interactions were switched off smoothly between 8 and 9 Å; real-space electrostatic interactions were truncated at 10 Å. Long range electrostatic interactions were calculated using particle mesh Ewald (PME) summation. The single progress coordinate was the distance r between the two methane molecules, following ref (28). The coordinate r ∈ [0,∞) Å was partitioned with a bin spacing of 1 Å from 0 to 16 Å and a last bin covering the space r ∈ [16,∞) Å. For the post analysis of methane, different bins were used to demonstrate the flexibility of the approach. The coordinate r ∈ [0,∞) Å was partitioned so that the first bin is the space r ∈ [0,5) Å, then a bin spacing of 2 Å was used from 5 to 17, while the last bin covers the space r ∈ [17,∞) Å. The results shown below include all data generated in all trajectories: no transient or relaxation period has been omitted.

Results

Ala4

For Ala4, populations and MFPTs are estimated using WE and compared to independent measurements based on ordinary “brute force” (BF) simulation. Rates are estimated in both directions between the two sets of states A1,B1 and A2,B2 shown in Figure 3 (see SI to visualize representative structures). The second set is less populated and consequently expected to be more difficult to sample. Figure 3 also shows the bin definitions used in the post-analysis, which were the same as those used during the WE simulation. However, as we shall see in our second system, we can use any partition of the space for the post analysis.

Figure 3

The Ala4 free energy surface. The surface is projected onto two coordinates: D = ((1/N)∑d2)1/2 ∈ [0,180] from one reference structure (see SI) and the RMSD with respect to an ideal α-helix. The surface was computed using 3.0 μs of ordinary “brute force” simulation. The set of states A1,B1 is highlighted in green, while the second set A2,B2 is highlighted in red. The grid shows bins that were used both for WE simulation and for the post-analysis calculation of observables via the non-Markovian matrix formulation. The data shown below are based on the same total simulation times in BF and WE. The BF estimates and confidence intervals are based on a single long trajectory of 3.0 μs where thousands of transitions between states were observed. Five independent WE simulations were run, each employing a total of 3.0 μs accounting for all the trajectories. The use of independent WE runs permits straightforward error analysis for comparison with BF.

Direct Estimation of Observables via WE

As described above, “direct” WE measurements sum trajectory weights for population and flux calculations. Figures 4 and 5 show direct estimates for both equilibrium and kinetic quantities for both sets of states. WE estimates as a function of simulation time are compared to 95% confidence intervals for BF simulation.

Figure 4

Figure 5

Direct WE estimates for populations and mean first passage times for Ala4 states A2,B2 from Figure 3. Five independent WE runs are shown, each based on 3.0 μs of total simulation time. Dashed lines indicate roughly a 95% confidence interval based on 3.0 μs of brute force simulation. Each nanosecond of molecular time corresponds to approximately 200 ns of WE simulation accounting for all trajectories in a single run.

Direct WE estimates for populations and mean first passage times (MFPTs) for Ala4 states A1,B1 from Figure 3. Five independent WE runs are shown, each based on 3.0 μs of total simulation time. Dashed lines indicate roughly a 95% confidence interval based on 3.0 μs of brute force simulation. Each nanosecond of molecular (single-trajectory) time corresponds to approximately 200 ns of WE simulation including all trajectories in a single run. Direct WE estimates for populations and mean first passage times for Ala4 states A2,B2 from Figure 3. Five independent WE runs are shown, each based on 3.0 μs of total simulation time. Dashed lines indicate roughly a 95% confidence interval based on 3.0 μs of brute force simulation. Each nanosecond of molecular time corresponds to approximately 200 ns of WE simulation accounting for all trajectories in a single run. As with all observables, data from five independent WE simulations are shown. The final/rightmost point from each run is the estimate using all data from the run and thus is based on a total simulation time equal to that of BF (3 μs). The spread of the rightmost WE data points therefore can be compared with the BF confidence interval to gauge statistical quality. The mean values of the direct estimates are in agreement with BF confidence intervals in all cases. In some cases, the spread of WE estimates is significantly less than that for BF prior to the full extent of WE simulation. Each nanosecond of “molecular time” in Figures 4 and 5 (i.e., single-trajectory time) corresponds to approximately 200 ns of total simulation in a single WE run accounting for all trajectories. Hence, in some cases, considerably less WE simulation is required for an estimate of the same statistical quality as resulted from the full BF simulation of 3.0 μs.

Non-Markovian Matrix Analysis

We also show results of the non-Markovian matrix analysis for select observables. Figure 6 shows that the non-Markovian analysis yields unbiased estimates of the same equilibrium and nonequilibrium properties calculated with direct estimates. (Results for other observables, like the population of A1 and the A1→B1 MFPT, not shown, exhibit qualitatively similar agreement.) The agreement contrasts with a purely Markovian matrix formulation, which does not account for the “labeling” described above, which can yield statistically biased estimates for kinetic quantities (see methane results, below). Unbiased matrix-based estimates are important when reweighting is used in WE[23] as noted in the Discussion. Reweighting was not used in the present study, however.

Figure 6

Population of A2 and mean first passage time for Ala4 from A2 to B2, estimated by the non-Markovian matrix analysis of WE data. Dashed lines indicate roughly a 95% confidence interval from brute force simulation, as in Figures 4 and 5. The states are defined in Figure 3.

Methane

In the methane system, WE simulation is used to measure first-passage times based on a range of state definitions. For a complex system, analyzing the sensitivity of the MFPT to state definitions could aid in the definition of states. The MFPT was estimated directly, as well as by both non-Markovian and Markovian matrix analysis. To assess statistical uncertainty, once again five independent WE simulations were run. The bins used for post-analysis differ from those used in the original WE simulation, as a matter of convenience—underscoring the flexibility of the approach. Figure 7 shows passage times measured as a function of the boundary position for the unbound state. The boundary of the bound state A was held fixed at a separation of 5 Å while the definition of the unbound state was varied from 5 to 17 Å. The passage times were measured in increments of 2 Å and compared with BF results as shown in Figure 7. The BF confidence intervals are based on a single long trajectory of 0.4 μs, the same total simulation time used in each WE simulation.

Figure 7

The mean first passage time for methane association (B to A) and dissociation (A to B) measured “directly” and from the non-Markovian matrix analysis from WE simulation as a function of the boundary of state A. The inset displays the PMF along with the definitions of the unbound and bound states, indicated by B and A, respectively. Dashed lines indicate roughly a 95% confidence interval based on 0.4 μs of brute force simulation. Figure 7 shows that both direct and non-Markovian matrix estimates are in agreement with BF confidence intervals. For fixed state definitions, Figure 8 shows the evolution of state populations MFPTs, as was done for Ala4. We fix the movable boundary position in Figure 7 (inset), defining state B as all configurations with r > 11 Å.

Figure 8

Methane association/dissociation observables. Direct and non-Markovian WE estimates for populations and mean first passage times (MFPTs) are plotted vs molecular time. Five independent WE runs are shown, each based on 0.4 μs of total simulation time. Dashed lines indicate roughly a 95% confidence interval based on 0.4 μs of brute force simulation. Each nanosecond of molecular time corresponds to approximately 80 ns of WE simulation accounting for all trajectories in a single run. The bound state (A) is defined by distances less than 5 Å, and B is defined by distances greater than 11 Å. The performance of the non-Markovian matrix estimates are particularly noteworthy in Figure 8. The matrix estimates converge faster than direct estimates to the exact results for the state populations. Presumably, this is because the direct approach requires relaxation of the full probability distribution to equilibrium, whereas the matrix approach requires only relaxation of the distribution with each bin (in order to obtain accurate interbin rates kμν). In contrast to the unbiased MFPT estimates obtained by both direct and non-Markovian analysis, the Markov analysis can be significantly biased for the MFPT. Figure 9 shows that applying the Markovian analysis (section 2.3) leads to MFPT estimates clearly outside the BF confidence interval. Data in the SI show that the use of a more sophisticated model such as a maximum-likelihood estimator for reversible Markov models[45] yields similar results and does not correct the bias.

Figure 9

Populations of A (r < 5 Å) and B (r > 11 Å) and MFPTs for the methane system, estimated by the non-Markovian matrix analysis and the Markovian analysis without history information. Dashed lines indicate roughly a 95% confidence interval from brute force simulation based on 0.4 μs of total simulation time. Equilibrium properties, however, can be estimated without bias in a Markovian analysis because history dependence is immaterial. Figure 9 also illustrates correct (equilibrium) population estimates based on the Markovian analysis.

Discussion

To our knowledge, this is the first weighted ensemble (WE) study using the original Huber and Kim algorithm[5] to simultaneously calculate both equilibrium and nonequilibrium quantities. The present study estimates observables (populations and MFPTs) based on arbitrary states defined in a postsimulation analysis, permitting the examination of different state definitions and their effects on observables. Two qualitatively different estimation schemes were examined, including a non-Markovian rate-matrix formulation which shows promise for reducing transient initial-state bias (a bias which is intrinsic to direct estimation of observables based on weights). Both schemes showed substantial efficiency gains for some observables even in the test systems which appear to lack significant energy barriers in their configurational landscapes. All results were validated using independent “brute force” simulations. Nevertheless, as described below, the present data do point to further challenges likely to be exhibited by larger, more complex systems.

Flexibility in State Choice

One key feature of the WE implementation studied here is the ability to investigate a range of state choices. As computer simulations tackle systems of growing complexity, it seems increasingly unlikely that states chosen prior to a study will prove physically or biochemically relevant. Indeed, it is already the case that specialized algorithms are invoked to identify physical states, separated by the slowest time scales, from existing trajectories.[46,47] With WE simulation, as suggested by our methane data, one can adjust state boundaries to minimize the sensitivity of rates to those boundaries. A possible concern with postsimulation state construction is the need to store a potentially large set of coordinates to ensure sufficient flexibility in post analysis. However, modern hardware should be sufficient for most cases of interest. As an illustration, storage of {x,y,z} coordinates for 1000 heavy atoms in a WE run of 1000 iterations using 1000 trajectories would require ∼10 GB.

Simultaneous Calculation of Nonequilibrium and Equilibrium Observables

The estimation of both equilibrium and kinetic properties from relatively short simulations is an important goal of current methods development, including for WE.[24,29] Here, we have demonstrated as a “proof of principle” that WE simulation can do this efficiently (compared to brute force simulation), without bias, in parallel, and with flexibility in defining states. Given the relatively fast time scales (nanosecond scale) characterizing the present systems, it is somewhat surprising that WE is better than brute-force simulation for some of the observables and never worse. Previous studies suggest that WE has the potential for greater efficiency in more complex systems.[27,28,48]

Non-Markovian Behavior

Many of our results employ a non-Markovian analysis. Once a configuration space is discretized (e.g., bins in WE simulation), one expects in general that transitions among such discrete regions will not be Markovian. To take the simplest example, in a 1D system, whether a trajectory enters a finite-width bin from the left or right will affect the probability to make a transition in a given direction. So generally, discretized systems are non-Markovian, even when the underlying continuous dynamics are Markovian.

Reweighting and the Matrix Formulation

This study compared estimation of equilibrium and nonequilibrium observables using the original WE algorithm and via post-analysis. As mentioned in the Introduction, the occasional rescaling of weights to match an equilibrium or nonequilibrium steady-state condition[23] was not used to avoid any potential complications. Our data clearly show that a standard Markovian analysis of WE simulation is inadequate (Figure 9), since WE bins typically are not Markovian. Additional information—history dependence, as embodied in the α/β labeling scheme—is needed to obtain unbiased results. Inclusion of history information in the matrix analysis means it is intrinsically “non-Markovian” regardless of the linear algebra employed. Future work will incorporate the rate estimation and non-Markovian matrix schemes developed here, as well as possibly the simpler Markovian scheme shown in section 2.3. Our data (Figure 8) suggest that these could be very successful in bringing a WE simulation closer to a specified steady state. But it is an open question whether reweighting simulations will prove superior to the type of post-analysis suggested here. Importantly, data presented here indicate that some rate estimators could lead to biased estimates for populations, which, in turn, would bias a reweighted simulation. One practical future approach, suggested by the work of Darve and co-workers,[49] could be to define preliminary states in advance to aid sampling transitions in both directions and then to subject the data to the same post analysis performed here to examine additional state definitions besides the initial choices.

Limitations and Future Work

The present study has not addressed some of the intrinsic limitations of the WE approach, which are the related issues of correlations among trajectories (due to the replication and merging events) and sampling “orthogonal” coordinates not divided up by WE bins. In the systems examined here, there was sufficient sampling in orthogonal dimensions to obtain excellent agreement with brute force results in all cases. However, significant future effort will be required to address correlations and orthogonal sampling, the latter being a problem common to methods which preselect coordinates such as multiple-window umbrella sampling[36,50,51] and metadynamics.[52−54]

Conclusions

In this proof-of-principle study, the parallel weighted ensemble (WE) approach has been applied to measure equilibrium and kinetic properties from a single simulation in small but nontrivial molecular systems. Importantly, populations and rates could be measured for arbitrary states chosen after the simulation. For all tested observables, unbiased estimates were obtained, as validated by independent brute-force simulations. In a number of instances, WE was significantly more efficient—yielding estimates of a given statistical quality in less overall computing time compared to simple simulation, including all trajectories. In this sense, not only is WE a parallel method but it can exhibit “super-linear scaling;” e.g., 100 cores can yield desired information more than 100 times faster than single-core simulation. We also developed a non-Markovian matrix approach for analyzing WE or brute-force trajectories, capable of yielding unbiased results, sometimes faster than direct estimates of observables from WE. The non-Markovian formulation also yields simultaneous estimates of equilibrium and nonequilibrium observables based on an arbitrary division of phase space, which is not possible in a standard Markovian analysis. The approaches tested here will need to be further developed and tested in more complex systems.

42 in total

1. Computing time scales from reaction coordinates by milestoning.

Authors: Anton K Faradjian; Ron Elber
Journal: J Chem Phys Date: 2004-06-15 Impact factor: 3.488

2. Simulations of the alternating access mechanism of the sodium symporter Mhp1.

Authors: Joshua L Adelman; Amy L Dale; Matthew C Zwier; Divesh Bhatt; Lillian T Chong; Daniel M Zuckerman; Michael Grabe
Journal: Biophys J Date: 2011-11-15 Impact factor: 4.033

3. Equilibrium free energies from nonequilibrium metadynamics.

Authors: Giovanni Bussi; Alessandro Laio; Michele Parrinello
Journal: Phys Rev Lett Date: 2006-03-07 Impact factor: 9.161

4. Comparison of multiple Amber force fields and development of improved protein backbone parameters.

Authors: Viktor Hornak; Robert Abel; Asim Okur; Bentley Strockbine; Adrian Roitberg; Carlos Simmerling
Journal: Proteins Date: 2006-11-15

5. Random walk in orthogonal space to achieve efficient free-energy simulation of complex systems.

Authors: Lianqing Zheng; Mengen Chen; Wei Yang
Journal: Proc Natl Acad Sci U S A Date: 2008-12-15 Impact factor: 11.205

6. Efficient stochastic simulation of chemical kinetics networks using a weighted ensemble of trajectories.

Authors: Rory M Donovan; Andrew J Sedgewick; James R Faeder; Daniel M Zuckerman
Journal: J Chem Phys Date: 2013-09-21 Impact factor: 3.488

7. Weighted-ensemble Brownian dynamics simulations for protein association reactions.

Authors: G A Huber; S Kim
Journal: Biophys J Date: 1996-01 Impact factor: 4.033

8. Folding Proteins at 500 ns/hour with Work Queue.

Authors: Badi' Abdul-Wahid; Li Yu; Dinesh Rajan; Haoyun Feng; Eric Darve; Douglas Thain; Jesús A Izaguirre
Journal: Proc IEEE Int Conf Escience Date: 2012-10

Review 9. Equilibrium sampling in biomolecular simulations.

Authors: Daniel M Zuckerman
Journal: Annu Rev Biophys Date: 2011 Impact factor: 12.981

32 in total

1. Tabulation as a high-resolution alternative to coarse-graining protein interactions: Initial application to virus capsid subunits.

Authors: Justin Spiriti; Daniel M Zuckerman
Journal: J Chem Phys Date: 2015-12-28 Impact factor: 3.488

2. Estimating first-passage time distributions from weighted ensemble simulations and non-Markovian analyses.

Authors: Ernesto Suárez; Adam J Pratt; Lillian T Chong; Daniel M Zuckerman
Journal: Protein Sci Date: 2015-09-09 Impact factor: 6.725

3. Galerkin approximation of dynamical quantities using trajectory data.

Authors: Erik H Thiede; Dimitrios Giannakis; Aaron R Dinner; Jonathan Weare
Journal: J Chem Phys Date: 2019-06-28 Impact factor: 3.488

4. Statistical Uncertainty Analysis for Small-Sample, High Log-Variance Data: Cautions for Bootstrapping and Bayesian Bootstrapping.

Authors: Barmak Mostofian; Daniel M Zuckerman
Journal: J Chem Theory Comput Date: 2019-05-07 Impact factor: 6.006

5. Multiple Ligand Unbinding Pathways and Ligand-Induced Destabilization Revealed by WExplore.

Authors: Alex Dickson; Samuel D Lotz
Journal: Biophys J Date: 2017-02-28 Impact factor: 4.033

6. Predicting ligand binding affinity using on- and off-rates for the SAMPL6 SAMPLing challenge.

Authors: Tom Dixon; Samuel D Lotz; Alex Dickson
Journal: J Comput Aided Mol Des Date: 2018-08-23 Impact factor: 3.686

7. Computational Estimation of Microsecond to Second Atomistic Folding Times.

Authors: Upendra Adhikari; Barmak Mostofian; Jeremy Copperman; Sundar Raman Subramanian; Andrew A Petersen; Daniel M Zuckerman
Journal: J Am Chem Soc Date: 2019-04-12 Impact factor: 15.419

Review 8. Principles and Overview of Sampling Methods for Modeling Macromolecular Structure and Dynamics.

Authors: Tatiana Maximova; Ryan Moffatt; Buyong Ma; Ruth Nussinov; Amarda Shehu
Journal: PLoS Comput Biol Date: 2016-04-28 Impact factor: 4.475

Review 9. Path-sampling strategies for simulating rare events in biomolecular systems.

Authors: Lillian T Chong; Ali S Saglam; Daniel M Zuckerman
Journal: Curr Opin Struct Biol Date: 2016-12-13 Impact factor: 6.809

10. WESTPA: an interoperable, highly scalable software package for weighted ensemble simulation and analysis.

Authors: Matthew C Zwier; Joshua L Adelman; Joseph W Kaus; Adam J Pratt; Kim F Wong; Nicholas B Rego; Ernesto Suárez; Steven Lettieri; David W Wang; Michael Grabe; Daniel M Zuckerman; Lillian T Chong
Journal: J Chem Theory Comput Date: 2015-02-10 Impact factor: 6.006