Literature DB >> 30939006

Efficient Ensemble Refinement by Reweighting.

Jürgen Köfinger¹, Lukas S Stelzl¹, Klaus Reuter², César Allande², Katrin Reichel¹, Gerhard Hummer^1,3.

Abstract

Ensemble refinement produces structural ensembles of flexible and dynamic biomolecules by integrating experimental data and molecular simulations. Here we present two efficient numerical methods to solve the computationally challenging maximum-entropy problem arising from a Bayesian formulation of ensemble refinement. Recasting the resulting constrained weight optimization problem into an unconstrained form enables the use of gradient-based algorithms. In two complementary formulations that differ in their dimensionality, we optimize either the log-weights directly or the generalized forces appearing in the explicit analytical form of the solution. We first demonstrate the robustness, accuracy, and efficiency of the two methods using synthetic data. We then use NMR J-couplings to reweight an all-atom molecular dynamics simulation ensemble of the disordered peptide Ala-5 simulated with the AMBER99SB*-ildn-q force field. After reweighting, we find a consistent increase in the population of the polyproline-II conformations and a decrease of α-helical-like conformations. Ensemble refinement makes it possible to infer detailed structural models for biomolecules exhibiting significant dynamics, such as intrinsically disordered proteins, by combining input from experiment and simulation in a balanced manner.

Entities: Chemical Disease Gene

Mesh：

Substances：
Peptides
polyproline

Year: 2019 PMID： 30939006 PMCID： PMC6727217 DOI： 10.1021/acs.jctc.8b01231

Source DB: PubMed Journal: J Chem Theory Comput ISSN： 1549-9618 Impact factor: 6.006

Introduction

To infer structures and functions of biological macromolecules, we combine information from diverse experimental and theoretical sources.[1−3] However, in many experiments the observables reporting on biomolecular structure are averaged over ensembles. Nuclear magnetic resonance (NMR) and pulsed electron paramagnetic resonance (EPR) experiments provide ensemble-averaged high-resolution information about distances (e.g., using the nuclear Overhauser effect, paramagnetic relaxation enhancement, or double electron–electron resonance (DEER))[4−9] and angles (e.g., using J-couplings and residual dipolar couplings).[10,11] Small-angle X-ray scattering (SAXS) experiments provide ensemble-averaged information about macromolecular size and shape,[12] and wide-angle X-ray scattering (WAXS) experiments report on secondary structure and fold.[13] Ensemble refinement promises faithful descriptions of the true ensemble of structures underlying the experimental data even for highly dynamic systems.[8,14−17] The conformational diversity of the ensemble can be described in terms of a set of representative reference structures. The relative weights of the ensemble members are then determined by ensemble refinement against experimental data. To regularize this inverse problem one can, for example, restrict the number of conformers as is done in minimal-ensemble refinement[8,18] or replica simulations,[15,19,20] limit the weight changes relative to the reference ensemble as is done in maximum-entropy approaches[14,15,21,22] or in Bayesian formulations,[4,21] or limit both ensemble size and weight changes.[8] See ref (15) for an in-depth discussion and further references. The reference ensemble is often defined in terms of a molecular simulation force field, that is, a classical potential energy function for which one has some confidence that it captures essential features. The experimental data can then be used directly as a bias in molecular dynamics (MD) simulations[5,19,23−28] or a posteriori to reweight an unbiased ensemble in a way that improves the agreement with experiment.[8,18,21,29,30] Biased simulations improve the coverage of the configuration space but suffer from finite-size effects due to a limited ensemble size in simulations. Reweighting requires good coverage but can handle much larger ensemble sizes. The “Bayesian inference of ensembles” (BioEn) approach[21] makes it possible to combine, if needed, biased sampling and subsequent reweighting to ensure both good coverage of the configuration space and a well-defined, converged ensemble. Ensemble refinement by reweighting is a computationally challenging optimization problem because the number of structures in the ensemble, usually generated in simulations, and the number of experimental data points provided by experiments can both be large. Simulations can easily create hundreds of thousands of structures. In general, we would like to include as many structures as possible in ensemble refinement, not only to avoid artifacts due to the finite size of the ensemble[21] but also to ensure that we pick up small but significant subensembles. Experiments like NMR, SAXS/WAXS, and DEER can provide thousands of data points. The numbers of pixels or voxels in electron-microscopy projection images or 3D maps, respectively, are of even larger magnitude. More than ten thousand data points are thus common when integrating data from different experimental sources. With respect to computational efficiency, we also have to take into account that we usually want to perform multiple reweighting runs for different subensembles and subsets of the experimental data, while at the same time varying the confidence that we have in the reference ensemble. Consequently, we have to be able to efficiently solve the optimization problem underlying ensemble refinement by reweighting for large numbers of structures and data points. The paper is organized as follows. In section , “Theory”, we present two complementary numerical methods to calculate the optimal ensemble by reweighting based on the “ensemble refinement of SAXS” (EROS) method,[14] which is a special case of BioEn.[21] In both methods, positivity and normalization constraints on the statistical weights are taken into account implicitly such that we can take advantage of efficient gradient-based optimization algorithms. In the first method, we solve for the logarithms of the N statistical weights, where N is the number of structures. In the second method, we solve for M generalized forces, where M is the number of experimental data points. The efficiency of the two methods depends on N and M. For both methods, we derive analytical expressions for the gradients that render gradient-based optimization algorithms highly efficient. In section , “Results”, we systematically investigate the efficiency and accuracy of these methods using synthetic data. For illustration, we then refine fully atomistic MD simulations of Ala-5 using J-couplings. In the Supporting Information, we present a detailed derivation of the gradients for correlated Gaussian errors.

Theory

We first present the BioEn posterior,[21] whose maximum determines the optimal statistical weights of the structures in the ensemble. We then show that the optimal solution is unique. To be able to apply gradient-based optimization methods to the constrained optimization problem, we recast the posterior as a function of log-weights and as a function of the generalized forces, as already introduced in ref (21). For both formulations, we calculate the respective gradients analytically, facilitating efficient optimization. We focus here on uncorrelated Gaussian errors. Supporting Information contains a detailed derivation of the gradients for correlated Gaussian errors, which includes the expressions for uncorrelated Gaussians in the main text as special cases.

Background

In the BioEn method,[21] which is a generalization of the EROS method,[14] we determine the optimum of the posterior probability as a function of the statistical weights, wα, where α is the index of the N ensemble members (α = 1, ..., N) given the experimental data,P(data|w) is the likelihood function, and w is the vector of weights wα. The prior is given bywhereis the Kullback–Leibler divergence.[31] Both refined weights (wα > 0) and reference weights (wα0 > 0) are normalized, ∑α=1wα = ∑α=1wα0 = 1. The parameter θ expresses the confidence in the reference ensemble. Large values of θ express high confidence, and the optimal weights will be close to the reference weights. Instead of maximizing the posterior with respect to wα, we can minimize the negative log-posterior given byThe optimization problem is constrained byandthat is, the weights lie in a simplex. For uncorrelated Gaussian errors, σ, of the ensemble-averaged measurements, Y, of the observables i = 1, ..., M, the likelihood is given byHere, yα is the calculated value of observable i for the individual structure α. Note that the measurements Y can stem from different experimental methods, for example, from SAXS and NMR, and that σ2 = (σ)2 + (σ)2 is the sum of uncertainties in the experiment and in the calculation of the yα.[4] The negative log-posterior then becomesNote that for Gaussian errors the negative log-posterior L corresponds to the EROS free energy χ2 – θS, where S = −SKL is the negative Kullback–Leibler divergence.[14] The BioEn and EROS formulations differ by a factor 1/2 scaling χ2, which is equivalent to a trivial rescaling of θ. To solve this optimization problem efficiently, we first show that the negative log-posterior is convex such that there is a unique solution. The gradient of eq is given bywhere angular brackets indicate the average over the reweighted ensemble, that is, ⟨y⟩ = ∑α=1wαyα. The Hessian is given bywhere δαγ = 1 if α = γ and δαγ = 0 otherwise. By casting the Hessian in this form, as a sum of a positive definite diagonal matrix and of dyadic products of vectors, it is straightforward to show that the quadratic form xThx is positive definite,for |x|2 = ∑α=1xα2 = 1. The Hessian is thus positive definite everywhere, and the optimal solution is unique. A possible concern is that the optimal solution is on the boundary of the simplex, that is, wα = 0 for some α, because the Kullback–Leibler divergence is bounded. One might then not be able to use gradient-based methods without modification. However, because of the nonanalytical character of the logarithm, the partial derivatives of L with respect to every wα diverge to negative and positive infinity at wα = 0 and 1, respectively, and are monotonic in between. Therefore, the optimal solution is contained within the simplex, not on its surface. Another concern is that to find the unique optimal solution, we have to take into account the constraints acting on the weights given by eqs and 6. One could optimize the log-posterior given by eq using algorithms for constrained optimization like LBFGS-B that take advantage of the gradient.[32] To avoid the performance penalty associated with treating constraints explicitly, we instead recast the optimization problem into an unconstrained form.

Optimization via Log-Weights

To optimize the log-posterior given by eq under the constraints given by eqs and 6 and to determine the optimal values of the weights wα > 0 by gradient-based minimization, we introduce log-weightswhich are only determined up to an additive constant. This constant cancels in the normalization of wα. We can then writeWithout loss of generality, because all wα > 0, we can set g = 0. For the gradient of L with respect to the remaining gμ (μ = 1, ..., N – 1), we havewhere Gα = ln wα0 and angular brackets indicate the average over the reweighted ensemble, for example, ⟨g⟩ = ∑α=1wαgα. We simplified the expressions by taking advantage of the normalization condition. Importantly, we need to minimize L only with respect to the N – 1 variables gμ (μ = 1, ..., N – 1). A starting point of a gradient-based minimization of L could be the normalized prior wα0, corresponding to gα = ln(wα0/w0) = Gα – G. In a practical implementation, a procedure to evaluate L and its gradients, called with gμ (μ = 1, ..., N – 1) as arguments, would do the following: define g = 0, evaluate wα according to eq for α = 1, ..., N, evaluate L according to eq or eq below, and evaluate the gradient according to eq . Both L and its gradient can be evaluated efficiently using vector-matrix operations. Given the gα, we define vα = e, s = ∑α=1vα = ∑α=1e, s0 = ∑α=1e, and wα = vα/s (all being efficiently evaluated in vector form). The averages can be calculated as vector dot products:We then havewhere ỹ is an M × N matrix with components ỹ = yα/σ, and Ỹ is a vector with M components Y/σ that can be precalculated. To evaluate the gradient, the averages in eq can be evaluated as dot products. The first part on the right-hand side of eq can then be evaluated as an in-place vector operation. The second part can also be evaluated by a combination of matrix-vector multiplication (for ⟨y⟩), vector dot products (for the sum over i), and in-place vector operations (for the different μ).

Optimization via Generalized Forces

We showed previously[21] that the weights at the maximum of the log-posterior can be expressed in terms of generalized forcesasNote that these generalized forces correspond to Lagrange multipliers in closely related maximum entropy (MaxEnt) approaches to ensemble refinement.[25,30,33−35] See ref (21) and the Discussion (section ) below concerning the relation between MaxEnt and BioEn methods. In many practical cases, we have fewer observables than weights, M ≪ N. In such cases, one may want to take advantage of eq and minimize L with respect to the M generalized forces F instead of the N weights. By applying the chain rule, we obtain the gradient with respect to the generalized forces asIn a numerical minimization of L with respect to the M generalized forces, one would thus at each iteration step do the following: calculate the current weights wα from the forces according to eq ; evaluate L according to eq or eq ; evaluate the gradient according to eq . Equations , 19, and 20 can be evaluated efficiently by using vector-matrix methods in NumPy etc., using precalculated vectors of intermediates. However, for large M × N, care should be taken to minimize the memory requirements by avoiding M × N matrices other than yα.

Optimization Strategies

Small θ values are more challenging than large θ values because the optimal weights will deviate more from the reference weights. In practice, we usually do not know how to set θ a priori. In such cases, we recommend to perform an L-curve analysis.[36] In an L-curve or elbow plot, we plot χ2 or the reduced chi-square value, χ2/M, as a function of the relative entropy SKL for the optimal solutions at different θ values. The χ2 values will decrease with increasing relative entropy SKL, and we can choose a θ value corresponding to the elbow in this plot. Finding optimal solutions for a series of θ values also has the advantage that we can use the more rapidly converging optimal solutions at large θ values as starting points for optimizations at smaller θ values.

Methods

Implementation

With the analytical expressions for gradients in the log-weights and forces formulations derived above, we can take advantage of highly optimized gradient-based optimization methods. The BioEn optimize package, which can be downloaded from https://github.com/bio-phys/BioEn, provides Python and C implementations of the log-posterior and its gradient for both methods and a selection of different gradient-based optimizers and implementations. The reference implementation is based on Python and on the packages NumPy and SciPy in particular. The log-posterior and its derivatives are written in NumPy notation, and the BFGS minimizer from SciPy is used to compute the minimum.[37] Thanks to the fact that NumPy is typically linked to high-performance mathematical libraries such as MKL, the Python-based implementation is capable of exploiting vectorization and multithreading on state-of-the-art hardware. On the other hand, there is some overhead associated with NumPy related to the use of temporary buffers during expression evaluation. To improve the performance, we provide C-based implementations of the log-posterior functions and their derivatives, largely avoiding temporary buffers by using explicit loops to implement the expressions. OpenMP directives are used to explicitly leverage vectorization and thread parallelization. The Python interfaces are written in Cython. While these kernels are significantly faster than the NumPy-based code, there is still some overhead when the BFGS minimizer from SciPy is used because it is written in plain Python. To eliminate the bottleneck caused by the SciPy minimizer, we have implemented a Cython-based interface to the multidimensional minimizers of the GNU Scientific Library (GSL), that is, conjugate gradient, BFGS, and steepest descent minimizers.[38] In doing so, the minimization is performed completely in the C layer without any overhead from the Python layer. Additionally, the C implementation of Jorge Nocedal’s Fortran implementation of the limited-memory BFGS algorithm[39,40] by Naoaki Okazaki (https://github.com/chokkan/liblbfgs) can be used. A test suite is provided to check the implementations against each other. During code development work, we noticed that performing parallel reductions can lead to numerically slightly different results. The reason is that parallel reductions introduce nondeterministic summation orders such that round-off errors vary between runs. Therefore, we also provide parallelized C kernels where we eliminated any nonreproducibility effects.

Simulation Details

Ala-5 was simulated at pH 2, using the AMBER99SB*-ildn-q force field matching the experimental solution conditions.[41] To describe the protonated C-terminus at a low pH, we took partial charges from the protonated aspartate side chain. Excess charges were distributed across the C-terminal residue. The simulations of Ala-5 were run for 1 μs using simulation options previously described.[42]J-couplings were calculated as in previous work[43] for the 50000 structures used for the BioEn reweighting. Chemical shifts were calculated with SPARTA+[44] using MDTraj.[45] MD simulations were analyzed using MDAnalysis.[46,47]

Results

We first investigate the stability, accuracy, and efficiency of the optimization methods using log-weights and generalized forces by applying them to synthetic data. We then refine molecular dynamics simulation ensembles for Ala-5 using J-couplings.

Accuracy and Performance of Optimization Methods

We investigate how accuracy and efficiency of the log-weights and forces methods depend on the size of the ensemble N and the number of data points M using synthetic data. To generate a data set, we drew M experimental values Y from a normal distribution, that is, . We generated calculated observables y by drawing Gaussian numbers from N(Y + 1, 2), where the offset of 1 mimics systematic deviations due to force field inaccuracies. We set the experimental error for all data points to σ = 0.5. For each combination of five M-values, M = 102, 316 (∼102.5), 103, 3162 (∼103.5), and 104, and nine N-values, N = 102, 316 (∼102.5), 103, 3162 (∼103.5), 104, 31623 (∼104.5), 105, 316228 (∼105.5), and 106, we generated randomly four sets, giving us 5 × 9 × 4 = 180 data sets in total. To fully define the optimization problem, we chose uniform reference weights wα0 = 1/N and a value for the confidence parameter θ = 0.01. The latter expresses little confidence in our reference ensemble, such that the optimal weights will be significantly different from the reference weights, rendering this optimization more challenging than for large values of θ. We minimize the negative log-posterior L given by eq for each data set using the log-weights and forces methods. The efficiency and accuracy of gradient-based optimization methods depends strongly on their detailed parametrization. Here, we present results for the limited-memory BFGS (LBFGS) algorithm.[39,40] Due to its memory efficiency, we can refine larger ensembles using more data points compared to other algorithms like BFGS or conjugate gradients. Specifically, we explored the effect of the choice of the line search algorithm and the convergence criteria on the convergence behavior. We found that using the backtracking line search algorithm applying the Wolfe condition[48,49] in connection with a convergence criterion acting on the relative difference of the log-posterior with respect to a previous value (relative difference 10 iterations before the current one <10–6) strikes the best balance between accuracy, efficiency, and robustness. We used these parameters to obtain the results we show in the following. From all optimal solutions found in our exploration of the parameter space of the LBFGS algorithm, we chose for each data set the solution with the lowest negative log-posterior to compare with. We call these solutions the most optimal solutions in the following. To characterize the optimization problem for the synthetic data sets considered here, we plot the optimal reduced χ2 value as a function of the optimal relative entropies, SKL, in Figure . The larger the value of the relative entropy SKL, the more the optimal weights differ from the reference weights and the more challenging is the optimization problem. In general, we found that the optimal values for the log-weights and forces methods agree well with each other. Due to the nature of the synthetic data sets, results for individual (M, N) can be visually identified as clusters, especially for large ensemble sizes N. Note that for the data sets considered here, the clusters for large N pose more challenging optimization problems because the optimal weights are further from the initial weights.

Figure 1

Scatter plot of the optimal reduced χ2 and the optimal relative entropy, SKL, obtained with the log-weights method (circles) and the forces method (crosses) for 5 × 9 = 45 values of (M, N) and θ = 0.01. For each value of (M, N), we show results for four synthetic data sets drawn at random as specified in the text. Crosses on top of circles indicate excellent agreement of the two methods. Optimal values for the four data sets for a specific (M, N) can be visually identified as clusters, especially for large N. The optimal weights obtained with the two methods are highly correlated and correlate excellently with the most optimal weights found in our exploration of parameter space of the LBFGS algorithm. We quantify these correlations using Pearson’s correlation coefficient r,[50] which for two sets of weights wα(1) and wα(2) is given byWe find that the cumulative distribution functions of the correlation coefficient for the forces and log-weights methods with respect to the most optimal weights found are strongly peaked at r = 1 (see Figure ). For the forces method, 91% of all samples have a correlation coefficient of r > 0.99. For the log-weights method, the peak at r = 1 is even narrower as 95% of all samples have a correlation coefficient of r > 0.99. However, the log-weights solutions of fewer than 10 out of 180 samples have a correlation coefficient of r < 0.9 and thus show poorer correlation with the most optimal weights.

Figure 2

Cumulative distribution functions of Pearson’s correlation coefficient r given by eq of the optimized weights obtained with the log-weights and forces methods with respect to the most optimal weights found in optimizations with different parameters for the LBFGS algorithm. The lowest r values for particular (M, N) combinations are shown in Figure .

Figure 3

Optimality of solutions as a function of the ensemble size N (horizontal axis) and number of experimental data points M (vertical axis) with respect to the most optimal solutions found with different convergence criteria and line search algorithms. (top) Difference of negative log-posterior values to optimum for the forces method (left) and the log-weights method (right). (bottom) Minimum value of the Pearson correlation coefficient r over the four samples at a given (M, N) with respect to the optimal weights for the forces method (left) and log-weights method (right).

A more detailed analysis of the accuracy shows that the log-weights method performs not as well in cases where the ensemble size is much larger than the number of data points, N ≫ M. To quantify the accuracy, we calculate the difference in log-posterior, ΔL, obtained with the forces and log-weights methods to the most optimal negative log-posterior, L(opt), found. An average over all samples for given M and N indicates that the forces method performs well (see Figure , top left). Only when M ≈ N, we find occasional small deviations from the most optimal values. The log-weights method performs excellently for M ≈ N, but not as well where N ≫ M. This behavior is also reflected in the minimum value of the correlation coefficients over the four random samples at given M and N (see Figure , bottom). Optimality of solutions as a function of the ensemble size N (horizontal axis) and number of experimental data points M (vertical axis) with respect to the most optimal solutions found with different convergence criteria and line search algorithms. (top) Difference of negative log-posterior values to optimum for the forces method (left) and the log-weights method (right). (bottom) Minimum value of the Pearson correlation coefficient r over the four samples at a given (M, N) with respect to the optimal weights for the forces method (left) and log-weights method (right). For the chosen convergence criterion and line search algorithm, the log-weights method is computationally more efficient than the forces method (Figure ). We performed benchmark calculations on a single node with two E5-2680-v3 CPUs, 12 cores each, and 64 GB RAM using OpenMP. For the largest system considered, (N, M) = (106, 105) we used a machine with identical CPUs but 128 GB RAM. For all values of the number of data points M, the run time as a function of the ensemble size N shows a step where the matrix of calculated observables y has reached a size of ∼107 elements, that is, at M × N = 102 × 105, 103 × 104, and 104 × 103. At this size, the matrix y no longer fits into the CPU cache. However, for larger sizes the run time again depends linearly on the ensemble size. In Table , we summarize the average run times for the largest ensemble size considered here (N = 106). For M = 100 the log-weights methods is ∼20 times faster than the forces method (∼12 s versus ∼4 min on a single node; see Table ).

Figure 4

Table 1

Average Single-Node Run Time in Minutes and Minimum and Mean Value of the Pearson’s Correlation Coefficient, r, Calculated for the Optimized and Most Optimal Weights for the Largest Ensemble Size, N = 106 and M = 100, 1000, and 10000 Data Points

	run time [min]		min./avg. r
M	log-weights	forces	log-weights	forces
10²	0.2	4	0.63/0.78	1.00/1.00
10³	1.5	7	0.98/0.99	1.00/1.00
10⁴	48	140	1.00/1.00	0.98/0.99

Run times for the log-weights (circles) and forces (crosses) optimization methods as a function of ensemble size N for different numbers of data points M = 100, 1000, and 10000 (in green, orange, and blue, respectively). Run times have been averaged (bold symbols) over four different synthetic data sets each (light symbols). In conclusion, for the chosen convergence criterion and line search algorithm, optimization using the LBFGS algorithm is stable, efficient, and accurate for both the forces method and the log-weights method. In cases where the ensemble size is much larger than the number of data points, N ≫ M, the forces method is more accurate but also less efficient. In cases where N ≈ M, the log-weights method is both more efficient and more accurate than the forces method. The BioEn optimization library has been written to make it easy and straightforward not only to choose from a variety of optimization algorithms, but also to fine-tune the chosen optimization algorithms to further improve accuracy or efficiency or both.

Refinement of Ala-5 Using J-Couplings

As a realistic example for a biomolecular system, we have conducted BioEn refinement of the disordered peptide penta-alanine (Ala-5) against NMR J-couplings.[41] The Ala-5 model system is simple enough that well converged simulations can be obtained straightforwardly. Nevertheless, it displays much of the complexity encountered in MD simulations of intrinsically disordered proteins (IDPs) with a myriad of shallow free energy minima. Hence, details of the force field matter greatly for such systems, and simulations do not provide results at a level routinely achieved for well-ordered proteins. NMR observables such as J-couplings, which report on dihedral angle equilibria, provide accurate information on disordered systems.[51] We assessed the quality of a 1 μs simulation of Ala-5 with the AMBER99SB*-ildn-q force field by comparison to experimental J-couplings.[41]J-couplings were calculated from the MD trajectory using the Karplus parameters from the original publication[41] and two sets of Karplus parameters determined from DFT calculations (DFT1 and DFT2).[52] The DFT2 parameters were used to define the AMBER99SB*-ildn-q force field, and hence we initially focused on this set of Karplus parameters. Even without refinement, the MD simulation gives very good agreement with the experimental J-couplings with χ2/M ≈ 1.0 (1.1 and 0.8 for original and DFT1 Karplus parameters, respectively) using the error model of ref (43). For uncorrelated errors, χ2/M < 1 would signify agreement within the experimental uncertainty on average. However, a closer inspection of measured and calculated J-couplings shows that there are systematic deviations. For the 3JHNHα and 3JHαC′ couplings, which report on the ϕ-dihedral angle equilibrium, the simulations predict larger couplings than in experiments (Figure A,C). In addition, for the 2JNCα couplings, which are sensitive to the ψ-dihedral angle equilibrium, couplings calculated from simulations are all smaller than the experimental couplings (Figure G).

Figure 5

Comparison of J-couplings measured by NMR[41] (black squares) and calculated from MD simulation with the AMBER99SB*-ildn-q force field (red squares) and the optimal BioEn ensemble (blue circles, θ = 6.65). The DFT2 set of Karplus parameters was used to calculate J-couplings. With BioEn reweighting, we refined the weights of 50000 structures from the 1 μs simulation of Ala-5 against 28 experimental J-couplings. Optimizing the effective log-posterior at different values of the confidence parameter θ (Figure A), we see the expected drop in χ2 as θ is decreased. At small values of θ, we find only marginal improvements in χ2, but start to move away from the reference weights as indicated by a substantial increase in the relative entropy. At θ = 6.65, we find a good compromise between reducing χ2 and staying close to the reference weights. The agreement with experiment increased or stayed the same for all J-couplings (Figure S6) except for 3JHNC′ and 3JHNCβ of residue 2 for which the already very good agreement got somewhat worse (Supporting Information). The overall improvement demonstrates that the different experiments are consistent with each other. In particular, for the 3JHNHα (Figure A) and 3JHαC′ (Figure C) couplings, which report on the ϕ dihedral angle, and the 2JNCα couplings (Figure G), reporting on the ψ dihedral angles, systematic deviations from the experiment disappear with the refinement. The changes in the weights are associated with an entropy SKL ≈ 0.5 (Figure A). The weights of most structures were changed only slightly by the reweighting. In the optimal BioEn ensemble, the most important 20% of the structures constitute ∼60% of the refined ensemble (Figure B). The weights of these structures approximately double with the refinement. After refinement ∼20% of the structures contribute negligibly to the ensemble, with weights close to zero. As expected, the optimal weights from the log-weights or generalized forces methods were highly correlated (Figure S1), confirming the equivalence of the two methods to solve the BioEn reweighting problem. Using the LBFGS algorithm, the run times of the forces and log-weights optimizations for all θ values are comparable, at 42 and 33 s, respectively, on a standard workstation.

Figure 6

BioEn optimization for Ala-5. (A) L-curve analysis to determine the optimal value of the confidence parameter θ by plotting χ2 as a function of SKL for different values of θ. (B) Cumulative weight of rank-ordered wα for the uniformly distributed reference weights wα0 (red) and for optimized weights (blue) at θ = 6.65 with SKL ≈ 0.5. The polyproline-II (ppII) conformation at ϕ ≈ −60° and ψ ≈ 150° becomes more populated in the optimal ensemble (Figure D), irrespective of the choice of Karplus parameters. The shift to ppII is in agreement with the original analysis of the J-couplings for Ala-5,[41] where it was concluded that the ppII state dominates the conformational equilibrium, and with infrared (IR) spectroscopy.[53] The same conclusion was drawn from refining Ala-3 MD simulation ensembles against 2D-IR data.[54] The 3JCC′ coupling for residue 2 has been highlighted as potentially spurious by Best et al.[55] because the reported coupling is atypical for a polyalanine. Leaving out this observable from the BioEn refinement results in an essentially unchanged refined ensemble (Figure S5). Using alternative Karplus parameters to calculate the J-couplings (Figure S2) also leads to a shift to the ppII state (original and DFT1 in Figures S3 and S4, respectively), and in all cases, the ppII state becomes more favorable at the expense of α-helical like conformations (Figure S5). For the original Karplus parameters, we also find a reduction in β-strand like conformations and an even larger ppII population than for DFT1 and DFT2. While the choice of Karplus parameter model somewhat affects the optimal ensemble, the overall conclusions are robust.

Figure 7

Ala-5 Ramachandran maps. (A) Free energy surface G(ϕ,ψ) = −ln p(ϕ, ψ) from MD simulation with the AMBER99SB*-ildn-q force field averaged over the central residues 2–4. (B) Ramachandran plot for Ala residues outside of regular secondary structure from the PDB.[11] (C) Free energy surface for the optimal BioEn ensemble with DFT2 Karplus parameters. (D) Free energy differences between initial ensemble and the optimal BioEn ensemble. The ensemble refinement improves and preserves the agreement with experimental data not included in the refinement, Protein Data Bank (PDB) statistics, and the experimental chemical shifts. The distribution of ϕ and ψ angles for Ala residues outside of regular secondary structure from the PDB,[11] while clearly not reflective of the structure of a specific disordered protein in solution, provides a measure of conformational preferences of disordered proteins. Indeed, the BioEn reweighting of the α and ppII conformations leads to a Ramachandran plot agreeing more closely with the PDB statistics, with a large reduction in the population of left-handed α-helical conformations as is apparent from Figure D, Figure S3, and Figure S4. No information from PDB statistics was included in the refinement and the improved agreement with an independent data set is encouraging. As a second independent data set, which was not included in the BioEn refinement, we compare the experimental chemical shifts for Ala-5[41] to the initial ensemble and the optimal ensemble. The chemical shifts predicted by SPARTA+[44] are within the prediction error before and after ensemble refinement (Figure S7). The comparison shows the following for Ala-5: (1) Chemical shifts cannot be used to refine the ensemble because the initial ensemble already agrees with experiment within the large prediction error. (2) Ensemble refinement either improves or leaves unchanged predictions for observables not included in the refinement. The BioEn reweighting leads to a better description of the disordered peptide Ala-5 and highlights the trade-offs inherent even in the most advanced force fields. Current fixed-charge protein force fields underestimate the cooperativity of the helix–coil equilibrium[56] because force fields describe the formation hydrogen bonds relatively poorly. To compensate for the lack of cooperativity of helix formation, the formation of α-helices was favored by the “star” correction to the ψ torsion potential with the aim to define a force field balanced between helix and coil conformations. The slight rebalancing of the AMBER force field[56] enabled the folding of both α-helical and β-sheet proteins.[57] Here BioEn reweighting compensates for an adverse effect of the overall very successful rebalancing of the AMBER force field, that is, the overestimation of the helix content for short peptides such as Ala-5. BioEn reweighting can thus serve as a system specific correction to the force field, which is a promising avenue to tackle systems such as intrinsically disordered proteins where the details of the force field are critical.[58,59]

Discussion

We have presented two separate approaches to optimizing the BioEn posterior, the log-weights and generalized forces methods. Both approaches have in common that the resulting optimization problem is unconstrained, that is, both log-weights and forces can take on any real value in principle (with positivity of the weights enforced by the Kullback–Leibler divergence). For such unconstrained optimization problems, efficient gradient-based optimization methods exist. We take advantage of these by deriving analytical expressions for the gradients in both formulations. The main differences between the log-weights and the forces methods concerns the dimensionality of the underlying optimization problem. Usually, higher-dimensional problems are harder to optimize. We can either optimize for N – 1 log-weights, where N is the ensemble size, or for M generalized forces, where M is the number of data points. We have shown here for the memory efficient LBFGS optimization algorithm[39,40] and synthetic data sets that optima corresponding to identical weights are reliably and efficiently found with both formulations. Importantly, the BioEn method contains solutions of traditional MaxEnt approaches to ensemble refinement as special cases (see Figure ). These methods either treat experimental observables as strict constraints[20,22,61,62] or consider errors explicitly.[25,30,60] If solutions for these methods exist then they correspond to different choices of the value of the confidence parameter θ: The BioEn optimal ensemble approaches the traditional MaxEnt solution with strict constraints forcing deviations from the experimental values to vanish, that is, χ2 = 0, in the limit of θ → 0+. Note that if an experimental observable does not fall within the range of the calculated observables, such a strict constraint cannot be fulfilled and the MaxEnt method in principle fails to give a solution (though, in practice, methods such as replica sampling[20] will still give a result). In these cases where the MaxEnt solution does not exist, the limit of θ → 0+ corresponds to the least-χ2 solution under the constraints that all weights are positive and normalized. As for MaxEnt approaches that account for Gaussian errors, the method of Cesari et al. gives the same solution as BioEn for θ = 1.[25,30] The MaxEnt method of Gull and Daniell[60] includes errors but uses a strict constraint by demanding that the reduced χ2 is equal to one. If such a solution exists, then this condition determines a particular value of θ.

Figure 8

MaxEnt approaches to ensemble refinement as special cases of BioEn. For BioEn optimal ensembles, we plot the reduced χ2 and the relative entropy SKL parametrized by the confidence parameter θ (blue). The solution of Gull–Daniell-type[60] methods is given by the intersection of this curve with χ2 = 1 (orange), of traditional MaxEnt methods[20,22,61,62] by the intersection with χ2 = 0 (green), and of the method of Cesari et al.[25,30] by the BioEn solution for θ = 1 (red). For a simple model (M × N = 1 × 2, y = (0,1), σ = 0.14), we vary the experimental value Y, top to bottom. (A) All methods provide a solution for an experimental value Y = 0.09 within the range of calculated observables. (B) Traditional MaxEnt methods fail to give a solution for Y values outside this range (Y = 1.08). (C) Both traditional MaxEnt and Gull–Daniell-type methods fail to give a solution where a reduced χ2 ≤ 1 cannot be realized by reweighting (Y = 1.16). The vertical gray lines indicate the maximum value SKL = ln(2) for a two-state system. Reweighting relies on good coverage of the conformational space such that the true ensemble underlying the experimental data is a subensemble of the simulation ensemble.[63] In coarse-grained simulations, sampling is efficient and the free-energy landscapes are smooth such that good coverage can be achieved. In atomistic simulations, where sampling is more expensive and the free energy landscape is rougher, we often have to apply enhanced sampling methods to obtain good coverage. Independent of the details of the enhanced sampling method and with or without steering by experimental values, one can use binless WHAM[42,64] or MBAR[65] to obtain the reference weights of the unbiased ensemble, which serve as input for ensemble refinement by reweighting.[21] Here, we demonstrated that even without applying enhanced sampling methods, refinement of fully atomistic trajectories of penta-alanine using J-couplings alleviates deficiencies in the force field and leads to better agreement not only with the NMR data but also with expectations from experimental structures for proteins. These results indicate that ensemble refinement via reweighting is a promising route for highly flexible systems such as nucleic acids[30] and intrinsically disordered proteins.[58,59] For such systems, the number of accessible states can be enormous and consequently even small inaccuracies in the simulation force fields can lead to a poor representation of the experimental ensemble. Importantly, ensembles do not have to be generated by simulations to be amenable to ensemble refinement via reweighting. For example, in the analysis of EPR experiments like DEER, libraries of the rotameric states of spin labels are used. For a specific residue, one selects from this library all rotameric states that do not have steric clashes with the protein structure. However, the interactions of the spin label with its surroundings can make some rotameric states in this ensemble more preferable than others. To account for this uncertainty, one can perform a BioEn refinement using the DEER data and the ensemble of rotameric states. This procedure has been used recently to resolve angstrom-scale protein domain movements.[66] BioEn-type ensemble refinement has also been applied successfully to IDP structural modeling using NMR data as input and coil libraries as reference.[10,11] To integrate experimental results, we often have to take nuisance parameters into account. For refining against SAXS intensities, we have to consider an unknown scaling parameter and often use an additive constant to account for inelastic scattering and, to a first approximation, for differences in the contrast. Using DEER data, we have to determine the modulation depths. We can include such nuisance parameters in the optimization either directly (by minimizing L simultaneously with respect to the weights and nuisance parameters) or iteratively. In the iterative approach, we perform the following: (1) A least chi-squared fit of the calculated ensemble averages determined by the current weights to the experimental data sets with the corresponding nuisance parameters as fit parameters. We have to perform one fit for every experimental method providing data. (2) With these fitted values of the nuisance parameters, we adjust the calculated observables yα. These enter another round of optimization from which we obtain the optimal weights given the values of the nuisance parameters. (3) We use these weights for another round starting with step 1 until convergence is achieved. Note that instead of using least-chi-squared fits, one can also include priors acting on the nuisance parameters in both the direct and iterative formulations. Interestingly, ensemble refinement by reweighting offers a way to quantify the agreement between simulations and experiment. After reweighting, we can make a quantitative statement of how much we would have had to change the simulated ensemble, expressed by the relative entropy or Kullback–Leibler divergence to be able to obtain agreement with experiment. The quantification of the agreement between simulation and experiment can also be used to identify and correct deficiencies in molecular dynamics force fields.[21] In a perturbative formulation, one can seek force field corrections that capture the weight change.[25] BioEn accommodates a wide range of error models. With the gradients of the BioEn log-posterior presented here for Gaussian error models, with and without correlation, we already cover a large range of experimental methods. Moreover, in many cases the Gaussian error model can be used to efficiently obtain an initial estimate for the optimal weights. These estimates can then be used as initial weights for an optimization using a more accurate error model but perhaps a less efficient optimization method. We provide an open-source implementation at https://github.com/bio-phys/BioEn at no cost under the GPLv3 license.

17 in total

1. Refining All-Atom Protein Force Fields for Polar-Rich, Prion-like, Low-Complexity Intrinsically Disordered Proteins.

Authors: Wai Shing Tang; Nicolas L Fawzi; Jeetain Mittal
Journal: J Phys Chem B Date: 2020-10-20 Impact factor: 2.991

Review 2. Characterizing conformational ensembles of multi-domain proteins using anisotropic paramagnetic NMR restraints.

Authors: Xue-Ni Hou; Hidehito Tochio
Journal: Biophys Rev Date: 2022-01-11

3. Further Optimization and Validation of the Classical Drude Polarizable Protein Force Field.

Authors: Fang-Yu Lin; Jing Huang; Poonam Pandey; Chetan Rupakheti; Jing Li; Benoı T Roux; Alexander D MacKerell
Journal: J Chem Theory Comput Date: 2020-04-27 Impact factor: 6.006

Review 4. FRET-based dynamic structural biology: Challenges, perspectives and an appeal for open-science practices.

Authors: Eitan Lerner; Anders Barth; Jelle Hendrix; Benjamin Ambrose; Victoria Birkedal; Scott C Blanchard; Richard Börner; Hoi Sung Chung; Thorben Cordes; Timothy D Craggs; Ashok A Deniz; Jiajie Diao; Jingyi Fei; Ruben L Gonzalez; Irina V Gopich; Taekjip Ha; Christian A Hanke; Gilad Haran; Nikos S Hatzakis; Sungchul Hohng; Seok-Cheol Hong; Thorsten Hugel; Antonino Ingargiola; Chirlmin Joo; Achillefs N Kapanidis; Harold D Kim; Ted Laurence; Nam Ki Lee; Tae-Hee Lee; Edward A Lemke; Emmanuel Margeat; Jens Michaelis; Xavier Michalet; Sua Myong; Daniel Nettels; Thomas-Otavio Peulen; Evelyn Ploetz; Yair Razvag; Nicole C Robb; Benjamin Schuler; Hamid Soleimaninejad; Chun Tang; Reza Vafabakhsh; Don C Lamb; Claus Am Seidel; Shimon Weiss
Journal: Elife Date: 2021-03-29 Impact factor: 8.140

5. Interpretation of HDX Data by Maximum-Entropy Reweighting of Simulated Structural Ensembles.

Authors: Richard T Bradshaw; Fabrizio Marinelli; José D Faraldo-Gómez; Lucy R Forrest
Journal: Biophys J Date: 2020-02-15 Impact factor: 4.033

6. DEER-PREdict: Software for efficient calculation of spin-labeling EPR and NMR data from conformational ensembles.

Authors: Giulio Tesei; João M Martins; Micha B A Kunze; Yong Wang; Ramon Crehuet; Kresten Lindorff-Larsen
Journal: PLoS Comput Biol Date: 2021-01-22 Impact factor: 4.475

7. Characterization of Weak Protein Domain Structure by Spin-Label Distance Distributions.

Authors: Irina Ritsch; Laura Esteban-Hofer; Elisabeth Lehmann; Leonidas Emmanouilidis; Maxim Yulikov; Frédéric H-T Allain; Gunnar Jeschke
Journal: Front Mol Biosci Date: 2021-04-12

8. Federating Structural Models and Data: Outcomes from A Workshop on Archiving Integrative Structures.

Authors: Helen M Berman; Paul D Adams; Alexandre A Bonvin; Stephen K Burley; Bridget Carragher; Wah Chiu; Frank DiMaio; Thomas E Ferrin; Margaret J Gabanyi; Thomas D Goddard; Patrick R Griffin; Juergen Haas; Christian A Hanke; Jeffrey C Hoch; Gerhard Hummer; Genji Kurisu; Catherine L Lawson; Alexander Leitner; John L Markley; Jens Meiler; Gaetano T Montelione; George N Phillips; Thomas Prisner; Juri Rappsilber; David C Schriemer; Torsten Schwede; Claus A M Seidel; Timothy S Strutzenberg; Dmitri I Svergun; Emad Tajkhorshid; Jill Trewhella; Brinda Vallat; Sameer Velankar; Geerten W Vuister; Benjamin Webb; John D Westbrook; Kate L White; Andrej Sali
Journal: Structure Date: 2019-11-25 Impact factor: 5.006

9. A novel high-dimensional NMR experiment for resolving protein backbone dihedral angle ambiguities.

Authors: Clemens Kauffmann; Krzysztof Kazimierczuk; Thomas C Schwarz; Robert Konrat; Anna Zawadzka-Kazimierczuk
Journal: J Biomol NMR Date: 2020-04-01 Impact factor: 2.835

10. Using Cross-Correlated Spin Relaxation to Characterize Backbone Dihedral Angle Distributions of Flexible Protein Segments.

Authors: Clemens Kauffmann; Anna Zawadzka-Kazimierczuk; Georg Kontaxis; Robert Konrat
Journal: Chemphyschem Date: 2020-12-10 Impact factor: 3.102