Literature DB >> 27340835

Accurate Estimation of Protein Folding and Unfolding Times: Beyond Markov State Models.

Ernesto Suárez¹, Joshua L Adelman², Daniel M Zuckerman¹.

Abstract

Because standard molecular dynamics (MD) simulations are unable to access time scales of interest in complex biomolecular systems, it is common to "stitch together" information from multiple shorter trajectories using approximate Markov state model (MSM) analysis. However, MSMs may require significant tuning and can yield biased results. Here, by analyzing some of the longest protein MD data sets available (>100 μs per protein), we show that estimators constructed based on exact non-Markovian (NM) principles can yield significantly improved mean first-passage times (MFPTs) for protein folding and unfolding. In some cases, MSM bias of more than an order of magnitude can be corrected when identical trajectory data are reanalyzed by non-Markovian approaches. The NM analysis includes "history" information, higher order time correlations compared to MSMs, that is available in every MD trajectory. The NM strategy is insensitive to fine details of the states used and works well when a fine time-discretization (i.e., small "lag time") is used.

Entities: Chemical Disease Gene

Mesh：

Substances：

Year: 2016 PMID： 27340835 PMCID： PMC5022777 DOI： 10.1021/acs.jctc.6b00339

Source DB: PubMed Journal: J Chem Theory Comput ISSN： 1549-9618 Impact factor: 6.006

Introduction

The field of biomolecular simulation is poised at a moment of opportunity and challenge: with current parallel and distributed computing capabilities, vast amounts of trajectory data can readily be generated; however, the optimal means of orchestrating simulations and analyzing the output remain unresolved. Currently, a number of sophisticated approaches, such as “replica exchange”,[1−4] path-sampling,[5−9] and other strategies,[10−13] provide precise prescriptions for the generation and analysis of trajectories. In another popular approach, large collections of relatively short molecular dynamics (MD) trajectories initiated in diverse regions of configuration space are often analyzed using Markov state models (MSMs) which assume that future dynamics depends only on the present state and not on prior history.[14−16] Extremely long continuous MD trajectories have also been generated and subjected to an array of analyses.[17−20] The present report attempts to improve on the analysis of MD data, which in turn can aid in optimizing the design of increasingly common multiple-trajectory studies.[21] By improving the extraction of kinetic information from trajectory segments, which can depend on segment length, this work offers quantitative insight into optimizing the allocation of resources. The focus of the present study is estimation of the slowest “macroscopic” time scales, particularly the mean first-passage time (MFPT) for protein folding and unfolding events. The MFPT is a standard definition of the inverse rate for a process.[22] All of the analyses discussed are general and also apply to rate estimation for conformational change and (un)binding processes. Our work builds on the extensive theory and software development that already underpins MSMs.[14−16,21,23−25] MSM estimation of kinetic behavior typically proceeds from a particular set of trajectory data by (i) subdivision of the full configuration space into distinct regions or “states” based on clustering of trajectory configurations, (ii) generating a history-independent transition matrix based on transition counts observed in the data, (iii) extracting “implied time scales” based on dominant eigenvalues, often as a function of lag time (time-discretization), and (iv) repeating the process multiple times because of the nonuniqueness of clustering to ensure robust estimation of time scales. Our revised procedure modifies steps (ii) and (iii) to account for history information. We also revisit the issues of lag-time dependence and sensitivity to clustering in MSMs. We note that, in contrast to (i), alternative procedures for constructing states that do not include all of the configuration space have been proposed for Markov analysis,[26−28] but those procedures are not generally used for trajectory postanalysis and will not be considered here. It is essential to recognize that even when the underlying dynamics of a continuous system (e.g., molecular) is Markovian, the system’s behavior becomes non-Markovian when projected onto a finite discrete space.[29] A simple example illustrates the point. Consider a particle exhibiting simple diffusive continuous dynamics in a one-dimensional space: when the space is divided into finite sequentially numbered states, the probability to transition from state i to i+1 will depend on whether the particle entered state i from i–1 (further away from i+1) or from i+1 (closer). The issue becomes more pronounced when projecting from a high-dimensional space to a relatively small number of discrete states. A key non-Markovian precedent for this work is the unbiased procedure for calculating the MFPT when sufficient history information is present in the trajectory data set.[30−33] Specifically, the MFPT for a transition between arbitrary “macrostates” A and B can be estimated without bias from a transition matrix calculated using “last-in-A” trajectories, i.e., the subset of trajectories which can be traced back in time and were more recently in state A than B. Remarkably, this result holds regardless of the clustering or division of configuration space defining the transition matrix.[32] However, MFPT estimates generally will be biased when trajectories do not satisfy the last-in-A requirement.[32] An important distinction of the present approach is the focus on well-defined time scales between specified states, in contrast to calculating less specific “implied time scales”. The choice is motivated by the interest of many investigators in specific processes (e.g., folding, functional conformational changes), and equally importantly, the ability to compare MFPT estimates with independent reference calculations. In particular, recent μs–ms scale MD simulations exhibiting multiple protein folding and unfolding events[17] provide an unprecedented opportunity for quantitative validation of the approach. We construct and test a series of non-Markovian (NM) MFPT estimators, i.e., estimators that use trajectory history in constructing transition matrices. The most useful estimators are guaranteed to recover unbiased MFPT values, for the lag-time used, when trajectories possess sufficient history information, independently of the partition of the space into states. The data show that using configuration-space clustering generated by standard MSM software, the NM estimators provide MFPT values in substantially better agreement with long-trajectory reference values, compared to MSM estimates, for every system and every trajectory set examined. The NM estimators function equally well with crude states based on a simple coordinate.

Theoretical Formulation

MFPT, Lag Time, and Discretization Error

The MFPT is the primary kinetic observable studied in this report because its inverse quantifies the rate of a process such as folding. The MFPT is the mean time to first reach a defined target state “B” (e.g., the folded state of a protein) starting from a defined initial distribution in state “A” (e.g., the unfolded state). To estimate the MFPT accurately, it is essential to understand its relation to the lag time τ, the fixed interval at which trajectory data is examined. Kinetic observables fundamentally are defined by the τ → 0 limit, although considerations of expediency may motivate use of larger τ: for example, in the molecular simulation trajectories analyzed below, configurations have been saved only every δt = 0.2 ns. Whenever τ is finite, discretization error must be expected, but the error in the MFPT should be negligible for the small τ = δt = 0.2 ns used in the present analysis. Importantly, τ is not an upper bound on the discretization error in the MFPT because of potential recrossing of the state B boundary, which can be a significant effect. Regardless of how the target state of a first-passage process is defined, one expects discretization error in the MFPT to increase monotonically with τ because eliminating trajectory points can only delay arrival to the target state. This behavior is indeed seen for protein folding and unfolding (see Figures S1–S6 in Supporting Information (SI)). Importantly, the long lag-time limit of the MFPT does not depend on the kinetics of a system but instead is completely specified by the equilibrium probability of the target state peq(B) via MFPT(τ → ∞) ∼ τ/peq(B). Increasing the lag time in any MFPT analysis (Markov or otherwise) thus presents the danger of introducing artifactual (nonkinetic) dependence. For these reasons, this report considers only τ = δt.

Markovian Estimation of the MFPT

The main focus of this report is the non-Markovian estimation of observables, but for reference, we also perform traditional Markov analysis of the trajectories. In a MSM, the full configuration space is first divided into relatively small states, and kinetic analysis is performed without employing history information.[14−16,21,23−25] That is, the rate k between states i and j is defined as the conditional transition probabilitywhere X is the random variable representing the state of the system at time t, and τ is the lag-time used for the Markov model. Each rate can be estimated by the maximum likelihood estimator (MLE) c/∑c,[24] where c are elements of the count matrix C, the total number of i-to-j transitions observed during the simulation at the given lag-time. However, since we we will be analyzing equilibrium ensembles where detailed balance must hold, it is more convenient to introduce that symmetry directly to our Markov model. Algorithms to compute the MLE of the transition matrix for reversible Markov models are well-known,[23,24,34] and here, we follow the one proposed by Prinz et al.[24] The MFPT is then computed analytically following ref (32). Importantly, MSM MFPT estimates generally are biased, as the data below shows. This is true even when the MSM is built from a single continuous trajectory many times the length of the MFPT. As noted, discrete state dynamics are intrinsically non-Markovian, so bias in MFPT estimation should be expected.

Non-Markovian Calculation of the MFPT without Bias Using Full History

Although our primary focus is the analysis of limited-history trajectory segments (with segment length less than the MFPT), we develop estimators based on exact procedures available when complete history information is available. We now briefly describe the full-history procedures, which are detailed elsewhere.[30−33] To calculate a MFPT, the exact procedure requires only a single history-based label indicating whether macroscopic state A or B was visited most recently. Trajectories most recently (or currently) in A are given the α label, and β denotes those most recently in B. See Figure a. Such labels may be called “color” information. States A and B cannot be overlapping, but their union need not cover the full phase space. For convenience, we take A and B to consist precisely of subsets of nonmacroscopic states i.

Figure 1

Non-Markovian trajectory analysis with full and partial data. (a) Schematic representation of a single continuous trajectory where the color indicates the last state visited: blue trajectories were last in state B and red trajectories were last in A. (b) Multiple trajectory segments where color information is not always known, as indicated by gray and partially gray segments. If one considers a large equilibrium ensemble of labeled trajectories, each nonmacroscopic state i will contain both α and β trajectories that must together sum to the equilibrium probabilityWith N states, a set of 2N probabilities is therefore required. Similarly, a 2N × 2N “colored” rate matrix is used for MFPT estimation, where there are now two additional history labels μ and ν (either α or β) for each element of the transition matrix {k} that track whether the label has changed.[29,32] Only if a trajectory enters a new macrostate will ν change from μ. Formally, we havewhere L and L account for the label of the trajectory (α or β) before and after the lag time τ, respectively. Solving the steady-state solution of the matrix with elements {k} yields both equilibrium and unbiased kinetic information, as described previously.[29,32] The steady state essentially describes the circulating flow of trajectories as they continually make transitions and switch back and forth between α and β labels. Equilibrium information is obtained simply by summing labeled probabilities via (2), whereas unbiased MFPT estimates result from computing fluxes solely based on a specific label α or β.[29,32] In contrast, MSM-based MFPT estimates fail to separate the α and β components, leading to bias.

Second-Order Markov Models for Using Additional History

Given the value of history information in generating unbiased MFPT estimates, we consider a number of alternatives for using information beyond the two time-point pairs employed by typical MSMs. We first examine second-order Markov analyses, as sketched in Figure .

Figure 2

Schematic representation of the second-order Markov analysis. The y-axis represents the continuous configuration space divided in discrete states (shown at right). Regular Markov analysis examines two time points, t and t + τ, whereas the second-order analysis additionally considers the system state at time t – nτ for arbitrary n, which is a form of history information. The rates in a second-order Markov model have three indexes since they will depend on states the system visited at two prior time points. For maximum generality, we consider the family of such models where the prior time points are separated by an arbitrary multiple n of the lag time; rates are defined bywhere n can be adjusted for the given τ (see Figure ). If n = 0, the rates and results are equivalent to a regular first order Markov model. The same is true when n → ∞ since the transition between i and j becomes independent of what happened in the very distant past. As described in the SI, numerical estimates of the MFPT are obtained by kinetic simulation using the transition probabilities specified in the rate matrix at every time step.

Non-Markovian Estimation of the MFPT with Limited “Color” Information

Building further on the sufficiency of the “color” labels α and β for the unbiased MFPT calculation described above, we now construct approximate estimators applicable to sets of trajectories that may lack color labels in some instances (see Figure b). A finite trajectory segment will lack a label, for example, if it never visits either initial macrostate A or target state B. The new estimators use color information if it is available and otherwise assign labels based on Markovian statistical models. More precisely, the construction of the rate matrix used for MFPT estimation proceeds in two steps. First, elements of the 2N × 2N colored rate matrix are assigned preliminary label-independent values, k = k based on all trajectory data, which provides estimates for pα and pβ in a Markovian picture. Second, the colored matrix is completely recalculated by examining each pair of τ-separated points in every trajectory. If a given segment possesses sufficient history to assign an α or β label, that is done. If not, the label μ is assigned with probability pμ/peq previously inferred from the preliminary Markov calculation. This procedure is used to construct the approximate colored 2N × 2N count matrix C = {c}, that is transformed into the rate matrix used for MFPT estimation by the same steady-state procedure noted above. This procedure defines the “Markov+Color” estimator used throughout the study. We can define a similar MFPT estimator combining color information with the second-order Markov models described above. As for the “Markov+Color” estimator, we use the color information when it is available, but the procedure is somewhat different because the MFPT ultimately will be determined via kinetic simulation. We examine each trajectory segment using a fixed n value and determine whether either macroscopic state A or B was visited between t–nτ and t. If neither was visited during this interval, the state m is used as the third element of the three-dimensional (3D) matrix in eq . Otherwise, the “color” index (α or β) is stored instead. In practice, for limited history windows corresponding to small n values, color information will rarely be available. A numerical estimate of the MFPT, termed “2nd-Markov+Color” is obtained by kinetic simulation based on the 3D rate matrix. By construction, both partial-history estimators, Markov+ Color and second-order Markov+Color, converge to the true MFPT when sufficiently long trajectories are available. Below, we show the dependence of all the estimators as a function of the history length, defined as the maximum time length we are allow to go back in time in order to assign a color label or a previous system state (see SI for further details).

History Length

Although we have very long trajectories available, we want to mimic the effects of having shorter fragments. We define the (nominal) history length as the maximum time we review in order to assign color labels or a previous system state. However, the effective or average history length actually used will depend on the length of the trajectory fragments we are considering. If the fragments are very long as compared to the history length we are considering, then the nominal history length and the effective history length used are going to be very similar. To understand this, consider, for instance, that our fragments have n+1 time points, and we want to analyze them with a history length n. That history length can only be used completely in the last point of the fragment, and in general, for the time point k we can only go back in history k–1 steps, even when the history length we are considering is n > k. In average, in this case, the effective history length used would be (n+1)/2.

Systems, Model Construction, and Reference Values

Model Systems and Macrostate Definitions

We analyzed trajectories for four proteins previously studied in explicit solvent by μs–ms scale atomistic molecular dynamics simulations generated using the special purpose Anton supercomputer.[17] We examined Chignolin, Trp-cage, NTL9, and Villin. In the case of NTL9, more than one trajectory was available, but we only analyzed the longest trajectory reported. Table shows, for each system, the RMSD basis of the folded and unfolded macrostates, the Protein Data Bank code of the reference structure used for RMSD calculation, the total simulation time considered for the analysis and the number of residues. Further details about the trajectories can be found in the original paper.[17]

Table 1

Protein Models Used for Markovian and Non-Markovian Analysesa

protein	num. residues	time (μs)	RMSD (folded)	RMSD (unfolded)	reference structure (PDB ID)
chignolin	10	106	<1.10 Å	>7.0 Å	5AWL
Trp-cage	20	208	<1.75 Å	>10.0 Å	2JOF
NTL9	39	1100	<1.50 Å	>10.0 Å	2HBA
villin	35	125	<1.50 Å	>11.0 Å	2F4K

For each system, the table shows the number of residues, the total simulation time used in the analysis, and the state definitions based on heavy-atom RMSD with respect to the folded structure whose protein data bank code (PDB ID) is given in the last column.

Markov State Model Construction with “Optimized” Bins

Trajectories were projected into a reduced dimension space represented by the ϕ, ψ, and χ1 dihedral angles of each residue of the protein. Each frame of the trajectory was then characterized by a vector containing the sine and cosine of each of these torsions in order to account for the periodicity of the angles. We applied time-independent component analysis (tICA) to further reduce the dimensionality of the model space before constructing the Markov state model.[35,36] All transformations of the trajectory data, along with subsequent construction of the MSM were performed using MSMBuilder3[34] together with MDTraj.[37] We then constructed a large set of candidate MSMs by varying the hyperparameters of the model, specifically the number of microstates (clusters), tICA components and both the lag time of the MSM and the lag time used in determining the tICA components. The grid of tested parameters is shown in Table , and iterating over all combinations of parameters resulted in 576 distinct models.

Table 2

MSM Hyperparameters

hyperparameter	grid search values
number of tICA components	1, 2, 4, 8
tICA lag time (ns)	50, 100, 200
number of MSM microstates	50, 100, 250, 500, 1000,
	2000, 4000, 8000
MSM lag time (ns)	0.2, 5, 10, 25, 50, 100

A subset of these models was then selected for further analysis based on the MFPTs of folding and unfolding calculated from projecting the full continuous trajectory onto the space of discrete states. If the dynamics in the discrete space reproduced the MFPTs calculated from the continuous trajectories[17] to within a reasonable interval (see Table ), then the model is considered as acceptable and used in subsequent analysis. Here, we refer not to MSM behavior, but merely the averaging of first-passage events occurring in the long continuous trajectories when mapped on to the discrete-state model with macrostates defined in the manner described above. Models not reproducing established MFPTs were discarded to avoid artifactual MFPT behavior in both MSM and non-Markov analysis.

Table 3

Definitions of Acceptable Modelsa

	MFPT interval (μs)
protein	folding	unfolding
chignolin	0.1–3.0	0.5–8.0
Trp-cage	8.0–10	1.0–5.0
NTL9	5.0–30	75–250
villin	1.0–5.0	0.2–3.0

The models with MFPTs within the intervals shown around the values reported in the literature[17] are considered as good models and otherwise are discarded.

The models with MFPTs within the intervals shown around the values reported in the literature[17] are considered as good models and otherwise are discarded. Definitions of the folded and unfolded states in the discrete space were defined by examining the mean RMSD to the native conformation of all frames assigned to a particular cluster subject to a system-specific cutoff (see Table ). That is, if the average RMSD within a cluster was within the specified RMSD cutoff, the cluster/state was considered part of the corresponding macrostate, folded or unfolded. Definitions of the unfolded and folded states for the continuous trajectories were also defined using a dual-cutoff approach as in ref (38). Culling models with state-spaces that are unable to reproduce MFPTs generated from the full continuous trajectories ensures that a reasonable comparison can be made between different estimators and the long brute-force trajectories.

RMSD-Based States

In addition to the partition of the systems in states as previously described using the tools given by MSMBuilder3[34] and for the sake of comparison, we also examined the results from a simple RMSD partition, with respect to the folded structure, of the space where the folded and unfolded states are defined by single states, and the intermediate region is evenly divided in 48 additional states. The definition of folded and unfolded states in terms of the RMSD are given in Table .

Reference Calculation of Observables from Long Trajectories

All the simulations analyzed here are single regular molecular dynamics trajectories[17] that were saved every 0.2 ns. The first passage times we use as reference are measured and averaged at that time resolution directly during the simulations by tracing the evolution of the trajectories and recording the time points at which the trajectories entered and exited the folded and unfolded states. Folded and unfolded state definitions are given above.

Results and Discussion

We compare non-Markov and Markov estimates of the MFPT using trajectory data previously generated[17] for four peptides and proteins: Chignolin, Trp-cage, the Villin headpiece, and NTL9 (see Figure ). The available trajectory data contain at least 10 events of each type for every system, providing reliable reference values. The analyses are performed on models using two approaches to dividing configuration space: (i) states constructed by the MSMBuilder software (v3.3)[34] and (ii) simple states based on subdividing a single coordinate, the RMSD with respect to the folded structure. Further, we perform analyses using all the available data, as well as subsets corresponding to less than the sum of folding and unfolding MFPTs.

Figure 3

Four proteins studied. The total simulation time is shown along with the Protein Data Bank entry of the experimental structure.

Four proteins studied. The total simulation time is shown along with the Protein Data Bank entry of the experimental structure. After analyzing hundreds of models for each protein with reasonable agreement with the MFPTs reported,[17] we found that the dependence of the estimated Markovian MFPTs on the number of states is very small as compared to the strong sensitivity to the lag-time. For that reason, we decided to choose a model for further analyses, for each protein, with only 50 states. The model was selected randomly among all the acceptable models we generated. Non-Markovian MFPT estimates are shown along with Markov-based values for reference. For the model selected in each system, the lag time used is τ = 0.2ns, which is the time interval used to store trajectory snapshots and hence minimizes the discretization error discussed above. (Figures S1–S6 motivate our choice, showing there is considerable ambiguity and potential error in selecting a longer lag time.) As the data show, the bias in the MFPTs can be corrected by just considering a relatively limited amount of history in the analysis.

MFPT Estimates Using All Data

MSM-Optimized States

We performed different analyses on states constructed via MSMBuilder. First, we confirmed the correctness of the non-Markovian analysis previously proposed[29,32] and described by eq . We generated MFPT estimates from the steady-state solution of the rate matrix (3), with matrix elements calculated based on the full trajectories. Thus, we used all available history information, and every segment could be categorized according to a label α or β. These matrix-based MFPT estimates were compared with reference values obtained by direct averaging of FPTs obtained from the long trajectories. Figure shows the excellent agreement for all the proteins’ folding and unfolding times and motivates pursuing history-based MFPT estimation using less history.

Figure 4

Exactness of the “color”-labeled formulation. Comparison of reference MFPT values for both folding and unfolding of all four proteins with MFPT estimates from solving the steady-state solution of the labeled rate matrix (eq 3) when the full continuous trajectories are used to assign color labels, i.e., when all available history information is used. Reference values are direct averages of first-passage times from the trajectories. We next examined the effects of reducing the amount of history information used in MFPT estimation, as shown in Figure . Although all trajectory data are included, the non-Markovian estimates use only the finite amount of history indicated on the horizontal axis. Both estimators using color labels (when available) can drastically reduce the bias based on a history length significantly shorter than the MFPTs themselves. The amount of history needed is of the order of the transition event duration, t,[39] which is usually very small as compared to the MFPTs.

Figure 5

MFPT estimates for protein folding (first row) and unfolding (second row) using MSM-optimized states. MFPT values are normalized by reference values (shown with text) obtained from long MD simulations[17] at the shortest lag time. The plots show the dependence of the various estimators on the history length used (see text). The green horizontal stripes are the 95% confidence interval for the reference values. In general, the second-order Markov approach improves the estimate for short history lengths but then converges to the Markovian value as expected. Only if color history information is included do first-order and second-order Markov estimates converge to the exact value. In every case, the (history independent) Markov MFPT estimates are significantly shorter than the reference value, by one or more orders of magnitude. Although Markov MFPT estimates based on longer lag times do increase, their large-τ behavior is linear and ultimately must exceed the reference value; as shown in the SI, there is no unambiguous way to select a longer lag time and longer lags certainly include discretization error.

Simple RMSD-Based States

Because construction of a MSM is not a simple task that may require some degree of user curation, we also examined MFPT estimates based on very simple coordinates. We chose RMSD from the (experimental) folded state because it is straightforward to calculate and interpret, although it has well-known weaknesses.[40] With enough history information, the “quality” of the states does not seem to negatively affect our non-Markovian estimators. If simple states based on subdividing the RMSD with respect to the folded structure (see SI for further details) are used, we still have unbiased results when the color information is considered (Figure ). Because the states used for the RMSD-based analysis differ slightly from those based on MSM-optimized states, the absolute MFPT values differ slightly as well.

Figure 6

MFPT estimates for protein folding (first row) and unfolding (second row) using RMSD-based states. MFPT values are normalized by reference values (shown with text) obtained from long MD simulations[17] at the shortest lag time. The plots show the dependence of the various estimators on the history length used (see text). The green horizontal stripes are the 95% confidence interval for the reference values. Note that absolute MFPT values differ slightly from those in Figure ; see text.

Estimates with Reduced data and MSM-Optimized States

A major goal of trajectory analysis is to predict the long time behavior of a system using a minimum amount of data, i.e., a subset that can be generated affordably. Because the preceding results show that the amount of history needed to obtain satisfactory results is small compared to the MFPTs, and in general as compared to the longest time scales of the system, it is reasonable to think that multiple short trajectories starting from different regions of the configurational space would be a valid way to predict observables. We mimic low-data scenarios by dividing the continuous trajectories of Chignolin, Trp-Cage, and Villin in fragments of 0.5 μs and the trajectory of NTL9 in fragments of 1.0 μs, again using states constructed by MSMBuilder. Such trajectory sets can be considered an idealized use case without artifacts from initiating multiple trajectories and hence constitute at least a minimal test for any valid analysis. In all cases, the length of the MFPT is significantly longer than the fragment size (see Figure ). MFPTs are estimated from a small fraction of the fragments, chosen with replacement and limiting the number of trajectories starting in each state. When the amount of trajectory data is drastically reduced, the second-order approaches start to fail because there are more parameters to estimate with the same amount of information. Here, we will only show results when the data are reduced bellow 5% of the original data, as this is the regime of greatest practical interest. Because the second-order approaches are overwhelmed by noise in the low-data regime, we only show first-order estimates employing color information. Figure shows that the Markov+Color estimator functions well in the low-data regimes when the history length is sufficient. Most of the bias is removed, as compared with the history-independent Markov estimator. The fluctuations in Markov estimates are not true history dependence but simply noise since every point in the plots is generated from an independent subsample, extracted with replacement from the full data set.

Figure 7

Folding (first row) and unfolding (second row) MFPT estimates using MSM-optimized states. The estimations were done from subsamples that represent less than 5% of the available MD data[17] for each protein at the shortest lag time. For each value of history length, four independent subsamples were extracted with replacement. A pair of folding and unfolding MFPT curves is generated using a total amount of trajectory data less than the round-trip time (sum of forward and backward MFPTs), as indicated for each protein.

Estimates with Reduced Data, Simple RMSD-Based States

Figure shows MFPTs estimated from a small fraction of the simulation extracted as described in the previous section. In this case, the model was built from simple RMSD-based bins.

Figure 8

Folding (first row) and unfolding (second row) MFPT estimates using RMSD-based states. The estimations were done from subsamples that represent less than 5% of the available MD data[17] for each protein at the shortest lag time. For each value of history length, four independent subsamples were extracted with replacement. A pair of folding and unfolding MFPT curves is generated using a total amount of trajectory data less than the round-trip time (sum of forward and backward MFPTs), as indicated for each protein.

Conclusions and Outlook

When projected onto finite discrete spaces, molecular dynamics is intrinsically non-Markovian, i.e., it is history-dependent. By analyzing continuous trajectories that exhibit multiple (un)folding transitions and reliable MFPT values, our data indicate that non-Markovian analyses offer the prospect of drastically reducing the bias intrinsic to history-independent Markov MFPT estimation. For the proteins considered here, this conclusion holds regardless of whether configuration space was subdivided using sophisticated approaches with MSM-optimized states or much simpler “bins” in a one-dimensional RMSD space. Importantly, the non-Markov analyses are readily applied at the shortest lag times where the discretization error is minimal. We note that the bias in MSMs reported here has been observed in the context of standard MSMs used for postanalysis which tile configuration space so that the union of the states or clusters comprises the full configuration space. Although the non-Markovian estimators offer clear advantages, their optimal use in typical cases may require somewhat longer trajectory segments (hundreds of nanoseconds or more) than were readily available in the past; however, given modern computing hardware, such trajectory sets may soon be routinely available. Further, because the non-Markovian estimators converge to the unbiased MFPT, their behavior with varying history length offers a self-diagnostic tool. The main conclusion, in any case, is that analysis of complex systems should exploit history information when it is available. Future work will explore the effects of non-Markovian analysis on pathway inference.

35 in total

1. Escaping free-energy minima.

Authors: Alessandro Laio; Michele Parrinello
Journal: Proc Natl Acad Sci U S A Date: 2002-09-23 Impact factor: 11.205

2. Computing time scales from reaction coordinates by milestoning.

Authors: Anton K Faradjian; Ron Elber
Journal: J Chem Phys Date: 2004-06-15 Impact factor: 3.488

3. How fast-folding proteins fold.

Authors: Kresten Lindorff-Larsen; Stefano Piana; Ron O Dror; David E Shaw
Journal: Science Date: 2011-10-28 Impact factor: 47.728

4. Separating forward and backward pathways in nonequilibrium umbrella sampling.

Authors: Alex Dickson; Aryeh Warmflash; Aaron R Dinner
Journal: J Chem Phys Date: 2009-10-21 Impact factor: 3.488

5. Equilibrium free energies from nonequilibrium metadynamics.

Authors: Giovanni Bussi; Alessandro Laio; Michele Parrinello
Journal: Phys Rev Lett Date: 2006-03-07 Impact factor: 9.161

6. Dynamics and optimal number of replicas in parallel tempering simulations.

Authors: Walter Nadler; Ulrich H E Hansmann
Journal: Phys Rev E Stat Nonlin Soft Matter Phys Date: 2007-12-04

7. Identification of slow molecular order parameters for Markov model construction.

Authors: Guillermo Pérez-Hernández; Fabian Paul; Toni Giorgino; Gianni De Fabritiis; Frank Noé
Journal: J Chem Phys Date: 2013-07-07 Impact factor: 3.488

8. Progress and challenges in the automated construction of Markov state models for full protein systems.

Authors: Gregory R Bowman; Kyle A Beauchamp; George Boxer; Vijay S Pande
Journal: J Chem Phys Date: 2009-09-28 Impact factor: 3.488

9. Enhanced modeling via network theory: Adaptive sampling of Markov state models.

Authors: Gregory R Bowman; Daniel L Ensign; Vijay S Pande
Journal: J Chem Theory Comput Date: 2010 Impact factor: 6.006

10. From metadynamics to dynamics.

Authors: Pratyush Tiwary; Michele Parrinello
Journal: Phys Rev Lett Date: 2013-12-03 Impact factor: 9.161

20 in total

1. How proteins bind to DNA: target discrimination and dynamic sequence search by the telomeric protein TRF1.

Authors: Milosz Wieczór; Jacek Czub
Journal: Nucleic Acids Res Date: 2017-07-27 Impact factor: 16.971

2. Galerkin approximation of dynamical quantities using trajectory data.

Authors: Erik H Thiede; Dimitrios Giannakis; Aaron R Dinner; Jonathan Weare
Journal: J Chem Phys Date: 2019-06-28 Impact factor: 3.488

3. Computational Estimation of Microsecond to Second Atomistic Folding Times.

Authors: Upendra Adhikari; Barmak Mostofian; Jeremy Copperman; Sundar Raman Subramanian; Andrew A Petersen; Daniel M Zuckerman
Journal: J Am Chem Soc Date: 2019-04-12 Impact factor: 15.419

Review 4. Computational membrane biophysics: From ion channel interactions with drugs to cellular function.

Authors: Williams E Miranda; Van A Ngo; Laura L Perissinotti; Sergei Yu Noskov
Journal: Biochim Biophys Acta Proteins Proteom Date: 2017-08-26 Impact factor: 3.036

5. ivis Dimensionality Reduction Framework for Biomacromolecular Simulations.

Authors: Hao Tian; Peng Tao
Journal: J Chem Inf Model Date: 2020-09-01 Impact factor: 4.956

Review 6. Path-sampling strategies for simulating rare events in biomolecular systems.

Authors: Lillian T Chong; Ali S Saglam; Daniel M Zuckerman
Journal: Curr Opin Struct Biol Date: 2016-12-13 Impact factor: 6.809

7. Transient probability currents provide upper and lower bounds on non-equilibrium steady-state currents in the Smoluchowski picture.

Authors: Jeremy Copperman; David Aristoff; Dmitrii E Makarov; Gideon Simpson; Daniel M Zuckerman
Journal: J Chem Phys Date: 2019-11-07 Impact factor: 3.488

Review 8. Weighted Ensemble Simulation: Review of Methodology, Applications, and Software.

Authors: Daniel M Zuckerman; Lillian T Chong
Journal: Annu Rev Biophys Date: 2017-03-01 Impact factor: 12.981

9. Integrated Variational Approach to Conformational Dynamics: A Robust Strategy for Identifying Eigenfunctions of Dynamical Operators.

Authors: Chatipat Lorpaiboon; Erik Henning Thiede; Robert J Webber; Jonathan Weare; Aaron R Dinner
Journal: J Phys Chem B Date: 2020-10-09 Impact factor: 2.991

10. Deciphering the Allosteric Process of the Phaeodactylum tricornutum Aureochrome 1a LOV Domain.

Authors: Hao Tian; Francesco Trozzi; Brian D Zoltowski; Peng Tao
Journal: J Phys Chem B Date: 2020-10-01 Impact factor: 2.991