Jiang Wang1,2, Simon Olsson3, Christoph Wehmeyer3, Adrià Pérez4, Nicholas E Charron1,5, Gianni de Fabritiis4,6, Frank Noé1,2,3, Cecilia Clementi1,2,3,5. 1. Center for Theoretical Biological Physics, Rice University, Houston, Texas 77005, United States. 2. Department of Chemistry, Rice University, Houston, Texas 77005, United States. 3. Department of Mathematics and Computer Science, Freie Universität Berlin, Arnimallee 6, 14195 Berlin, Germany. 4. Computational Science Laboratory, Universitat Pompeu Fabra, PRBB, C/Dr. Aiguader 88, 08003 Barcelona, Spain. 5. Department of Physics, Rice University, Houston, Texas 77005, United States. 6. Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig Lluís Companys 23, 08010 Barcelona, Spain.
Abstract
Atomistic or ab initio molecular dynamics simulations are widely used to predict thermodynamics and kinetics and relate them to molecular structure. A common approach to go beyond the time- and length-scales accessible with such computationally expensive simulations is the definition of coarse-grained molecular models. Existing coarse-graining approaches define an effective interaction potential to match defined properties of high-resolution models or experimental data. In this paper, we reformulate coarse-graining as a supervised machine learning problem. We use statistical learning theory to decompose the coarse-graining error and cross-validation to select and compare the performance of different models. We introduce CGnets, a deep learning approach that learns coarse-grained free energy functions and can be trained by a force-matching scheme. CGnets maintain all physically relevant invariances and allow one to incorporate prior physics knowledge to avoid sampling of unphysical structures. We show that CGnets can capture all-atom explicit-solvent free energy surfaces with models using only a few coarse-grained beads and no solvent, while classical coarse-graining methods fail to capture crucial features of the free energy surface. Thus, CGnets are able to capture multibody terms that emerge from the dimensionality reduction.
Recent technological and methodological advances have made it possible to simulate macromolecular systems on biologically relevant time-scales.[1−3] For instance, one can simulate binding, folding, and conformational changes of small to intermediate proteins on time-scales of milliseconds,
seconds, or beyond.[4−8] However, the extensive sampling of large macromolecular complexes
on biological time-scales at atomistic resolution is still out of
reach. For this reason, the design of simplified, yet predictive,
models is of great interest,[9−11] in particular, to interpret the
experimental data that are becoming increasingly accessible in high
throughput and resolution. Experimental data provide a partial view
of certain aspects of a macromolecular system but do not directly
give a full dynamical representation, and simulation can help obtain
a more comprehensive understanding.[12−14] As it is clear that
not every single atom is important in determining the relevant collective
features of biomolecular dynamics and function, simplified models
could provide more insights into the general physicochemical principles
regulating biophysical systems at the molecular level. Here we use
recent advances in machine learning to design optimal reduced models
to reproduce the equilibrium thermodynamics of a macromolecule.

Significant effort has been devoted in the past few years to applying
machine learning (e.g., deep neural network or kernel methods) to
learn effective models from detailed simulations[15−19] and specifically to learn potential energy surfaces
from quantum-mechanical calculations on small molecules.[20−36] In principle a similar philosophy could be used to define models
at lower resolutions, that is, to learn the effective potential energy
of coarse-grained (CG) models from fine-grained (e.g., atomistic)
molecular dynamics (MD) simulation data.[37−41] There are, however, important differences. In
the definition of potential
energy surfaces from quantum calculations, the relevant quantity to
reproduce is the energy, and it is relatively straightforward to design
a loss function for a neural network to minimize the difference between
the quantum-mechanical and classical energy (and forces[25,33]) over a sample of configurations. In contrast, in the definition
of a CG model, the potential energy cannot be matched because of
the reduction in dimension, and it is important to define which properties of the system need to be preserved by the coarse-graining.
The approximation of free energy surfaces, e.g., from enhanced sampling
simulations, is therefore a related problem.[42−44]

Several
approaches have been proposed to design effective CG energy
functions for large molecular systems that either reproduce structural
features of atomistic models (bottom-up)[45−50] or reproduce macroscopic properties for one or a range of systems.[12−14,51−54] Popular bottom-up approaches
require that the CG model reproduce the canonical configurational distribution
determined by the atomistic model. For instance, one may want to be
able to represent the different metastable states populated by a protein
undergoing large conformational changes. One of the difficulties in
the practical application of these methods has been that, in general,
a CG potential optimally reproducing selected properties of a macromolecular
system includes many-body terms that are not easily modeled in the
energy functions.

Here, we formulate the well-known force-matching
procedure for
coarse-graining as a supervised machine learning problem. Previously,
coarse-graining has been mostly discussed as a fitting procedure,
but the aim of machine learning is to find a model that has minimal
prediction error on data not used for the training. We use classical
statistical learning theory to show that the force-matching error
can be decomposed into Bias, Variance, and Noise terms and explain
their physical meaning. We also show that the different CG models
can be ranked using their cross-validation score.

Second, we discuss a class of neural networks, which we refer to as CGnets, for coarse-graining molecular force fields. CGnets share many similarities with neural networks used to learn potential
energy surfaces from quantum data, such as enforcing the relevant
invariances (e.g., rotational and translational invariance of the
predicted energy, equivariance of the predicted force). In contrast
to potential energy networks, CGnets predict a free energy (potential
of mean force) and then use the gradient of this free energy with
respect to the input coordinates to compute a mean force on the CG
coordinates. As the CG free energy is not known initially, only the
force information can be used to train the network.

Third, CGnets are extended to regularized CGnets. Using a generic
function approximator such as a neural network to fit the CG force
field from training data only may lead to force predictions that are
“catastrophically wrong” for configurations not captured
by the training data, i.e., predictions of forces in the direction
of increasingly unphysical states that lead to diverging and unrealistic
simulation results. We address this problem by adding a prior energy
to the free energy network that does not compromise the model accuracy
within the training data region, but ensures that the free energy
approaches infinity for unphysical states, resulting in a restoring
force toward physically meaningful states.

Finally, we demonstrate
that CGnets succeed in learning the CG
mean force and the CG free energy for a 2D toy model, as well as for
the coarse-graining of all-atom explicit-solvent simulations of (i)
alanine dipeptide to a CG model with 5 particles and no solvent and
(ii) the folding/unfolding of the polypeptide Chignolin to a CG model
consisting only of the protein Cα atoms and no solvent. We show explicitly that CGnets achieve
a systematically better performance than classical CG approaches which
construct the CG free energy as a sum of few-body terms. In the case
of the Chignolin protein, the classical few-body model cannot reproduce
the folding/unfolding dynamics. On the contrary, the inherently multibody
CGnet energy function approximates the all-atom folding/unfolding
landscape well and captures all free energy minima. This study highlights
the importance of machine learning and generic function approximators
in the CG problem.
Theory and Methods
Here we introduce the main theoretical concepts, define the machine learning problems involved in coarse-graining via the force-matching principle, and present CGnets and regularized CGnets. The more practically
inclined reader may skip to the CGnets: Learning
CG Force Fields with Neural Networks section.
Coarse-Graining with Thermodynamic Consistency
We first
define what we mean by coarse-graining and which physical properties
shall be preserved in the coarse-grained model.

The starting
point in the design of a molecular model with resolution coarser than
atomistic is the definition of the variables. The choice of the coarse
coordinates is usually made by replacing a group of atoms by one effective
particle. Because of the modularity of a protein backbone or a DNA
molecule, popular models coarse-grain a macromolecule to a few interaction
sites per residue or nucleotide, e.g., the Cα and Cβ atoms for a protein.[51,54−56] Alternative schemes have also been proposed for the
partitioning of the atoms into coarse-grained coordinates.[57,58] In general, given a high-dimensional atomistic representation of the system $\mathbf{r} \in \mathbb{R}^{3N}$, a CG representation is given by a coordinate transformation to a lower-dimensional space:

$$\mathbf{x} = \xi(\mathbf{r}) \in \mathbb{R}^{3n} \qquad (1)$$

with $n < N$. Here we assume that ξ is linear; i.e., there is some coarse-graining matrix $\Xi \in \mathbb{R}^{3n \times 3N}$ that clusters atoms to coarse-grained beads: $\mathbf{x} = \Xi \mathbf{r}$.

The aim is to learn a coarse-grained energy function U(x; θ) that will be used
in conjunction
with a dynamical model, e.g., Langevin dynamics, to simulate the CG
molecule. Here θ denotes the parameters of the coarse-grained model: in classical CG approaches these are parameters of the potential energy function, such as force constants and partial charges, while here they are the weights of the neural network.

A
common objective in coarse-graining methods is to preserve the
equilibrium distribution; i.e., the equilibrium distribution of the
coarse-grained model shall be as close as possible to the equilibrium
distribution of the atomistic model when mapped to the CG coordinates.
We will be using a simulation algorithm for the dynamics such that
the system’s equilibrium distribution is identical to the Boltzmann
distribution of the employed potential U; therefore
this objective can be achieved by enforcing the thermodynamic consistency:

$$U(\mathbf{x}; \theta) = -k_B T \ln p^{CG}(\mathbf{x}) + \mathrm{const} \qquad (2)$$

where $k_B T$ is the thermal energy with Boltzmann constant $k_B$ and temperature $T$, the probability distribution $p^{CG}(\mathbf{x})$ is the equilibrium distribution of the atomistic model, mapped to the CG coordinates,

$$p^{CG}(\mathbf{x}) = \frac{\int \mu(\mathbf{r})\, \delta\big(\mathbf{x} - \xi(\mathbf{r})\big)\, \mathrm{d}\mathbf{r}}{\int \mu(\mathbf{r})\, \mathrm{d}\mathbf{r}} \qquad (3)$$

and $\mu(\mathbf{r}) = \exp(-V(\mathbf{r})/k_B T)$ is the Boltzmann weight associated with the atomistic energy model $V(\mathbf{r})$. Note that the additive constant in eq 2 can be chosen arbitrarily. Therefore,
this constant will be omitted in the expressions below, which means
that it will absorb normalization constants that are not affecting
the CG procedure, such as the logarithm of the partition function.

Several methods have been proposed for defining a coarse-grained potential U(x) that variationally approximates the consistency relation (eq 2) at a particular thermodynamic state (temperature, pressure, etc.).
Two popular approaches are the multiscale coarse-graining (force-matching)[48,59] and the relative entropy method[50] (the
two approaches are connected[60]).
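For concreteness, the linear mapping ξ defined above can be sketched in a few lines of Python (NumPy); the system size and the retained atom indices below are illustrative placeholders, not values taken from the systems studied here:

```python
import numpy as np

# Illustrative linear CG map: keep 5 of N = 22 atoms (a placeholder for,
# e.g., the five central backbone atoms of alanine dipeptide).
N = 22
kept = [4, 6, 8, 14, 16]           # hypothetical atom indices
n = len(kept)

# Xi is a (3n x 3N) matrix with a single 1 per row, so x = Xi @ r simply
# selects the Cartesian coordinates of the retained atoms.
Xi = np.zeros((3 * n, 3 * N))
for bead, atom in enumerate(kept):
    for d in range(3):             # x, y, z components
        Xi[3 * bead + d, 3 * atom + d] = 1.0

r = np.random.randn(3 * N)         # a flattened atomistic configuration
x = Xi @ r                         # CG configuration, shape (3n,)

# For such a slicing map, instantaneous forces project the same way:
# xi(F(r)) = Xi @ F(r).
```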
CG Parameter Estimation as a Machine Learning Problem
Here, we follow
the force-matching scheme. It has been shown that
thermodynamic consistency (eq 2) is achieved when the CG model predicts the instantaneous CG forces with minimal mean square error.[48,59] We call the instantaneous atomistic forces F(r) and the instantaneous force projected on the CG coordinates ξ(F(r)). At the same time, the CG model predicts a force −∇U(x; θ) for a CG configuration x. The force-matching error is defined as

$$\chi^2(\theta) = \Big\langle \big\| \xi(\mathbf{F}(\mathbf{r})) + \nabla U\big(\xi(\mathbf{r}); \theta\big) \big\|^2 \Big\rangle_{\mathbf{r} \sim \mu} \qquad (4)$$

The average ⟨·⟩ is over the equilibrium distribution of the atomistic model, i.e., r ∼ μ(r).

We reiterate a result shown in ref (59) that has important consequences for using eq 4 in machine learning. For this, we introduce the mean force:

$$\mathbf{f}(\mathbf{x}) = \big\langle \xi(\mathbf{F}(\mathbf{r})) \big\rangle_{\mathbf{r}|\mathbf{x}} \qquad (5)$$

where r|x indicates the equilibrium distribution of r constrained to the CG coordinates x, i.e., the ensemble of all atomistic configurations that map to the same CG configuration. Then we can decompose eq 4 as follows (see the SI for derivation):

$$\chi^2(\theta) = \mathrm{PMF}(\theta) + \mathrm{Noise} \qquad (6)$$

with the terms

$$\mathrm{PMF}(\theta) = \Big\langle \big\| \mathbf{f}(\xi(\mathbf{r})) + \nabla U\big(\xi(\mathbf{r}); \theta\big) \big\|^2 \Big\rangle_{\mathbf{r} \sim \mu}, \qquad \mathrm{Noise} = \Big\langle \big\| \xi(\mathbf{F}(\mathbf{r})) - \mathbf{f}(\xi(\mathbf{r})) \big\|^2 \Big\rangle_{\mathbf{r} \sim \mu} \qquad (7)$$

This loss function differs
from the force-matching loss function used in the learning of force
fields from quantum data by the Noise term. The Noise term is purely
a function of the CG map ξ (and when training with finite simulation
data also of the data set), and it cannot be changed by varying the
parameters θ. As a result, the total force-matching
error cannot be made zero, but it is bounded from below by χ²(θ) ≥ Noise.[59] On the contrary, when matching force fields from quantum data, the error χ² approaches zero for a sufficiently
powerful model. Physically, the Noise term arises from the fact that
instantaneous forces on the CG coordinates vary in the different atomistic
configurations associated with the same CG configuration. Here, we
call this term Noise as it corresponds to the noise term known in
statistical estimator theory for regression problems.[61]

The learning problem is now to find a CG model and its parameters θ that minimize the potential of mean force (PMF) error
term. To obtain a physical interpretation, we apply eq 3 and write the average purely in CG coordinates:

$$\mathrm{PMF}(\theta) = \Big\langle \big\| \mathbf{f}(\mathbf{x}) + \nabla U(\mathbf{x}; \theta) \big\|^2 \Big\rangle_{\mathbf{x} \sim p^{CG}} \qquad (8)$$

This error term is the matching error between the mean force at the CG coordinates, f(x), and the CG forces predicted by the CG potential, −∇U(x; θ). Hence, the
machine learning
task is to find the free energy U whose negative
derivatives best approximate the mean forces in eq 5, and U is thus called a potential of mean force (PMF). Equation 8 implies that the mean force field f is conservative, as it is generated by the free energy U(x).

Machine learning the CG model is complicated
by two aspects: (i)
As the PMF error cannot be computed directly, its minimization in
practice is accomplished by minimizing the variational bound, eq 4. Thus, to learn f(x) accurately, we need to collect enough data
“close” to every CG configuration x such
that the learning problem is dominated by the variations in the PMF
error term and not by the variations in the Noise term. As a result,
machine learning CG models typically require more data points than
force-matching for potential energy surfaces. (ii) The free energy U(x) is not known a priori but must be learned.
In contrast to fitting potential energy surfaces, we can therefore not directly use energies as inputs.

For a finite data set $R = (\mathbf{r}_1, ..., \mathbf{r}_M)$ with $M$ samples, we define the force-matching loss function by the direct estimator:

$$L(\theta; R) = \frac{1}{M} \big\| \xi(\mathbf{F}(R)) + \nabla U\big(\xi(R); \theta\big) \big\|_F^2 \qquad (9)$$

where $\xi(R) = [\xi(\mathbf{r}_1), ..., \xi(\mathbf{r}_M)]^\top \in \mathbb{R}^{M \times 3n}$ and $\xi(\mathbf{F}(R)) = [\xi(\mathbf{F}(\mathbf{r}_1)), ..., \xi(\mathbf{F}(\mathbf{r}_M))]^\top \in \mathbb{R}^{M \times 3n}$ are data matrices of coarse-grained coordinates and coarse-grained instantaneous forces that serve as an input to the learning method, and $\|\cdot\|_F$ denotes the Frobenius norm.
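As a concrete illustration, the following is a minimal PyTorch sketch of this direct estimator; the callable `U` (a model mapping a batch of CG coordinates to scalar free energies) and all shapes are assumptions of the sketch, not the authors' implementation:

```python
import torch

def force_matching_loss(U, x, f_inst):
    """Direct estimator of the force-matching loss (eq 9).

    U      : callable mapping CG coordinates (M, 3n) -> free energies (M,)
    x      : CG coordinates xi(R), shape (M, 3n)
    f_inst : instantaneous CG forces xi(F(R)), shape (M, 3n)
    """
    x = x.detach().requires_grad_(True)
    energy = U(x).sum()
    # The model's force prediction is the negative gradient of U.
    f_pred = -torch.autograd.grad(energy, x, create_graph=True)[0]
    # Mean squared deviation between predicted and instantaneous forces.
    return ((f_pred - f_inst) ** 2).mean()
```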
CG Hyperparameter Estimation as a Machine Learning Problem
While eq 9 defines the training method,
machine learning is not simply about fitting parameters for a given
data set but rather about minimizing the expected prediction error
(also called “risk”) for data not used for training.
This concept is important to be able to select an optimal model, i.e.,
to choose the hyperparameters of the model, such as the type and number
of neurons and layers in a neural network, or even to distinguish
between different learning models such as a neural network and a spline
model.

Statistical estimator theory is the field that studies
optimal prediction errors.[61] To compute
the prediction error, we perform the following thought experiment:
We consider a fixed set of CG configurations $X = [\mathbf{x}_1, ..., \mathbf{x}_M]^\top$ at which we want to fit the mean forces. We assume that these configurations have been generated by MD or MCMC such that the full atomistic configurations, $R = (\mathbf{r}_1, ..., \mathbf{r}_M)$, are Boltzmann distributed conditioned on the CG configurations, i.e., $\mathbf{r}_i \sim \mathbf{r}|\mathbf{x}_i$. Now we ask: if we repeat this experiment, i.e., in every iteration we produce a new set of all-atom configurations $\mathbf{r}_i \sim \mathbf{r}|\mathbf{x}_i$ and thereby a new set of instantaneous forces on the CG configurations, what is the expected prediction error, or risk, of the force-matching error? More formally, the following is performed:

1. Given CG coordinates $X$, generate a training set $R_{\mathrm{train}} \sim R|X$ and find $\hat{\theta} = \arg\min_{\theta} L(\theta; R_{\mathrm{train}})$.
2. Generate a test set $R_{\mathrm{test}} \sim R|X$ and compute $L(\hat{\theta}; R_{\mathrm{test}})$.

The risk is the expected test loss,

$$E[L] = \mathbb{E}\big[L(\hat{\theta}; R_{\mathrm{test}})\big] \qquad (10)$$

where $R_{\mathrm{train}}$ and $R_{\mathrm{test}}$ are two independent realizations. Although we cannot execute this thought experiment in practice, we can approximate it by cross-validation, and we can obtain insightful expressions for the form of the expected prediction error. As the loss function in force-matching is a least-squares regression problem, the form of the expected prediction error is well-known (see the SI for a short derivation) and can be written as

$$E[L] = \mathrm{Bias} + \mathrm{Variance} + \mathrm{Noise} \qquad (11)$$

with the Noise term as given in eq 7 and the bias and variance terms given by

$$\mathrm{Bias} = \Big\langle \big\| \mathbf{f}(\mathbf{x}) + \nabla \bar{U}(\mathbf{x}) \big\|^2 \Big\rangle \qquad (12)$$

$$\mathrm{Variance} = \mathbb{E}\Big[ \big\| \nabla U(\mathbf{x}; \hat{\theta}) - \nabla \bar{U}(\mathbf{x}) \big\|^2 \Big] \qquad (13)$$

where

$$-\nabla \bar{U}(\mathbf{x}) = \mathbb{E}\big[ -\nabla U(\mathbf{x}; \hat{\theta}) \big] \qquad (14)$$

is the mean estimator, i.e., the average force field learnt when the training is repeated many times for different data realizations. The terms in eqs 12 and 13 have the following meaning: Equation 12 is the expected error between the mean forces and the average predicted force field. It is therefore the systematic bias of the machine learning model. The variance (eq 13) is the fluctuation of the individual estimates from single training procedures around the mean estimator and thus represents the estimator’s fluctuation due to finite-sample effects.

As the optimal model
minimizes the PMF error, it must balance bias
and variance. These contributions are typically counteracting: A too
simple model (e.g., too small neural network) typically leads to low
variance but high bias, and it corresponds to “underfitting”
the data. A too complex model (e.g., too large neural network) leads
to low bias but large variance, and it corresponds to “overfitting”
the data. The behavior of bias, variance, and estimator error for
a fixed data set size is illustrated in Figure 1.
Figure 1
Typical bias–variance trade-off for fixed
data set size,
indicating the balance between underfitting and overfitting. The noise
level is defined by the CG scheme (i.e., which particles are kept
and which are discarded) and is the lower bound for the mean prediction
error.
The optimum at which bias and
variance balance depends on the amount
of data used, and in the limit of an infinitely large data set, the
variance is zero, and the optimal model can be made very complex to
also make the bias zero. For small data sets, it is often favorable
to reduce the model complexity and accept significant bias, to avoid
large variance.

To implement model selection, we approximate
the “ideal”
iteration above by the commonly used cross-validation method[62,63] and then choose the model or hyperparameter set that has the minimal
cross-validation score. In cross-validation, the estimator error (eq ) is estimated as the
validation error, averaged over different segmentations of all available
data into training and validation data.
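A minimal sketch of this procedure as K-fold cross-validation over the force data; `train_fn` and `loss_fn` are hypothetical stand-ins for model fitting and for evaluating eq 9 on held-out data:

```python
import numpy as np

def cross_validation_score(train_fn, loss_fn, x, f_inst, k=5, seed=0):
    """Average validation error over k train/validation splits."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(x[train], f_inst[train])      # fit on one split
        scores.append(loss_fn(model, x[val], f_inst[val]))
    # Hyperparameters are chosen to minimize this estimate of the risk.
    return float(np.mean(scores))
```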
CGnets: Learning CG Force Fields with Neural Networks
It is well-known that the CG
potential U(x; θ)
defined by thermodynamic consistency may
be a complex multibody potential even if the underlying atomistic
potential has only few-body interactions.[59] To address this problem, we use artificial neural networks (ANNs)
to represent U(x; θ) as ANNs can approximate any smooth function on a bounded set of
inputs, including multibody functions.[64] Therefore, we use ANNs to model U(x), train them by minimizing the loss (eq 9), and select optimal models by minimizing the cross-validation error.
For the purpose of training CG molecular models, we would like to
have the following physical constraints and invariances, which determine
parts of the architecture of the neural network.

Differentiable free energy function: To train U(x; θ) and simulate the associated dynamics by means of Langevin simulations, it must be continuously differentiable. As the present networks do not need to be very deep, vanishing gradients are not an issue, and we select tanh activation functions here. After D nonlinear layers we always add one linear layer to map to one output neuron representing the free energy.

Invariances of the free energy: The energy of molecules that are not subject to an external field only depends on internal interactions and is invariant with respect to translation or rotation of the entire molecule. The CG free energy may also be invariant with respect to permutation of certain groups of CG particles, e.g., exchange of identical molecules, or certain symmetric groups within molecules. Compared to quantum-mechanical potential energies, permutation invariance is less abundant in CG. For example, permutation invariance does not apply to the α-carbons in a protein backbone (not even for identical amino acids), as they are ordered by the MD bonding topology. CGnets include a transformation

$$\mathbf{y} = g(\mathbf{x}) \qquad (15)$$

from CG Cartesian coordinates x to a set of features that contain all desired invariances and use the features y as an input to the network that computes the free energy, U(g(x); θ). This transformation can be chosen in many different ways, e.g., by using local coordinate systems,[34] two- or three-body correlation functions,[20] permutation-invariant distance metrics,[65−67] or a learned representation.[29] In this work, only translation and rotation invariances are needed, and we hence choose the following features: the distances between all pairs of CG atoms, the angles between three consecutive CG atoms, and the cos and sin of torsion angles defined by the CG atoms.

Conservative PMF: The PMF is a conservative force field generated by the free energy (eq 8). As in quantum potential energy learning,[25,29] we enforce this requirement by computing the free energy U with a neural network and then adding a gradient layer to compute the derivatives with respect to the input coordinates:

$$-\nabla U\big(g(\mathbf{x}); \theta\big) \qquad (16)$$

Figure 2a shows the neural network architecture resulting from these choices. The free energy network is D layers deep, and each layer is W neurons wide. Additionally, we use L2 Lipschitz regularization[68] in the network, with a tunable parameter λ that regulates the strength of the regularization. Thus, (D, W, λ) are the remaining hyperparameters to be selected (as discussed in the Results section).

Figure 2

Neural network schemes. (a) CGnet. (b) Regularized CGnet with prior energy. (c) Spline model representing a standard CG approach, for comparison. Each energy term is a function of only one feature, and the features are defined as all the bonds, angles, dihedrals, and nonbonded pairs of atoms.
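To make the architecture concrete, here is a minimal PyTorch sketch of a CGnet-style model with the three ingredients above: an invariant featurization layer, a free energy network, and a gradient layer. For brevity it uses only pairwise distances as features; all sizes are illustrative, and this is not the published implementation:

```python
import torch
import torch.nn as nn

class CGnetSketch(nn.Module):
    """Free energy network U(g(x); theta) with a gradient layer that
    outputs a conservative, equivariant mean force -dU/dx (eq 16)."""

    def __init__(self, n_beads, width=50, depth=1):
        super().__init__()
        n_feat = n_beads * (n_beads - 1) // 2    # all pairwise distances
        layers, dim = [], n_feat
        for _ in range(depth):
            layers += [nn.Linear(dim, width), nn.Tanh()]
            dim = width
        layers += [nn.Linear(dim, 1)]            # final linear layer -> scalar U
        self.net = nn.Sequential(*layers)
        i, j = torch.triu_indices(n_beads, n_beads, offset=1)
        self.i, self.j = i, j

    def features(self, x):
        # x: (batch, n_beads, 3) -> translation/rotation-invariant distances
        return (x[:, self.i] - x[:, self.j]).norm(dim=-1)

    def forward(self, x):
        x = x.detach().requires_grad_(True)
        U = self.net(self.features(x)).sum()
        # Gradient layer: force = -dU/dx, computed by automatic differentiation.
        force = -torch.autograd.grad(U, x, create_graph=True)[0]
        return U, force
```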
Simulating the CGnet Model
Once the neural network
has been trained to produce a free energy U(x), it can be used to simulate dynamical trajectories of the
CG model. Here we use overdamped Langevin dynamics to advance the
coordinates of the CG model from $\mathbf{x}_t$ at time $t$ to $\mathbf{x}_{t+\tau}$ after a time-step τ:

$$\mathbf{x}_{t+\tau} = \mathbf{x}_t - \frac{\tau D}{k_B T} \nabla U(\mathbf{x}_t) + \sqrt{2\tau D}\, \boldsymbol{\eta}_t \qquad (17)$$

where $\mathbf{x}_t$ is the CG configuration at time $t$ (e.g., the x coordinate in the toy model, a 15-dimensional vector in the alanine dipeptide, and a 30-dimensional vector in the Chignolin applications presented below), $\boldsymbol{\eta}_t$ is Gaussian random noise with zero mean and identity as covariance matrix, τ is the integration time-step, and $D$ is the diffusion constant of the system. In the following, we use reduced energy units; i.e., all energies are in multiples of $k_B T$.

Since the implementation of CGnet is vectorized,
it is more efficient to compute free energies and mean forces for
an entire batch of configurations rather than a single configuration
at a time. Therefore, we run simulations in parallel for the examples
shown below. This is done by sampling 100 starting points randomly
from atomistic simulations, coarse-graining them, and then integrating eq 17 stepwise.
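A sketch of this batched integrator for eq 17, assuming the `CGnetSketch` model above; the time-step, diffusion constant, and batch size are illustrative:

```python
import torch

def simulate(model, x0, n_steps=1000, tau=0.02, diffusion=1.0, kT=1.0):
    """Parallel overdamped Langevin dynamics (eq 17) for a batch of
    starting configurations x0 of shape (batch, n_beads, 3)."""
    x, traj = x0.clone(), []
    for _ in range(n_steps):
        _, force = model(x)                     # force = -grad U, so the
        drift = (tau * diffusion / kT) * force  # drift term is +tau*D/kT * f
        noise = (2.0 * tau * diffusion) ** 0.5 * torch.randn_like(x)
        x = (x + drift + noise).detach()        # detach: drop autograd graph
        traj.append(x)
    return torch.stack(traj)

# Usage sketch: start from 100 coarse-grained atomistic frames.
# model = CGnetSketch(n_beads=5)
# x0 = torch.randn(100, 5, 3)    # placeholder for CG'd starting points
# traj = simulate(model, x0)
```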
Regularizing the Free Energy with a Baseline Energy Model
Training the
free energy with a network as shown in Figure 2a and subsequently using it to simulate the dynamics with eq 17 produces trajectories of new CG coordinates x. When parts
of the coordinate space are reached that are very different from any
point in the training set, it is possible that the network makes unphysical
predictions.

In particular, the atomistic force field used to
produce the training data has terms that ensure the energy will go
toward infinity when departing from physical states, e.g., when stretching
bonds or when moving atoms too close to each other. These regions
will not be sampled in the underlying MD simulations, and therefore
result in “empty” parts of configuration space that
contain no training data. Simulating a network trained only on physically
valid training data via eq 17 may still produce points x that enter this “forbidden regime”
where bonds are overstretched or atoms start to overlap. At this point
the simulation can become unstable if there is no regularizing effect
ensuring that the predicted free energy U(x; θ) will increase toward infinity when going
deeper into the forbidden regime.

Methods to modify a learning
problem to reduce prediction errors
are collectively known as “regularization” methods.[69] To avoid the catastrophically wrong prediction
problem described above, we introduce regularized CGnets (Figure 2b). In a regularized CGnet, we define the energy function as

$$U(\mathbf{x}; \theta) = U_{\mathrm{net}}(\mathbf{x}; \theta) + U_0(\mathbf{x}) \qquad (18)$$

where $U_{\mathrm{net}}(\mathbf{x}; \theta)$ is a neural network free energy as before, and $U_0(\mathbf{x})$ is a baseline energy that contains constraint terms that ensure
basic physical behavior. Such baseline energies to regularize a more
complex multibody energy function have also been used in the machine
learning of QM potential energy functions.[70−72] Note that eq 18 can still be used to
represent any smooth free energy because Unet(x; θ) is a universal approximator.
The role of U0(x) is to enforce U → ∞ for unphysical states x that are outside the training data.

As for many other regularizers, the baseline energy $U_0(\mathbf{x})$ in eq 18 takes the role of a prior distribution in a probabilistic interpretation: the equilibrium distribution generated by eq 18 becomes

$$p(\mathbf{x}) \propto \exp\!\Big(-\big[U_{\mathrm{net}}(\mathbf{x}; \theta) + U_0(\mathbf{x})\big]/k_B T\Big) \qquad (19)$$

Here, U0(x) is simply a sum of harmonic
and excluded volume terms.
For the 2D toy model, a harmonic term of the form $U_0(x) = \frac{k}{2}(x - x_0)^2$ is used, and the parameters $k$ and $x_0$ are determined by the force-matching scheme restricted to the scarcely populated regions defined by the 100 sampled points with the highest and the 100 with the lowest x-values (see Figure 3).
Figure 3
Machine-learned coarse-graining of dynamics in a rugged 2D potential.
(a) Two-dimensional potential used as a toy system. (b) Exact free
energy along x. (c) Instantaneous forces and the
learned mean forces using feature regression and CGnet models (regularized
and unregularized) compared to the exact forces. The unit of the force
is $k_B T$, with the unit of length equal to 1. (d) Free
energy (PMF) along x predicted using feature regression,
and CGnet models compared to the exact free energy. Free energies
are also computed from histogramming simulation data directly, using
the underlying 2D trajectory, or simulations run with the feature
regression and CGnet models (dashed lines).
For alanine dipeptide, we use harmonic terms for the distance
between atoms that are adjacent (connected by covalent bonds) and for angles between three consecutive atoms. For each bond $i$, we use $U_{\mathrm{bond},i} = \frac{k_i}{2}(r_i - r_{0,i})^2$, where $r_i$ is the instantaneous distance between the two consecutive atoms defining the bond, $r_{0,i}$ is the equilibrium bond length, and $k_i$ is a constant. Analogously, for each angle $j$, we use $U_{\mathrm{angle},j} = \frac{k_j}{2}(\theta_j - \theta_{0,j})^2$, where $\theta_j$ is the instantaneous value of the angle, $\theta_{0,j}$ is the equilibrium value for the angle, and $k_j$ is a constant. When statistically independent, each such term would give rise to a Gaussian equilibrium distribution:

$$p(r_i) \propto \exp\!\left(-\frac{(r_i - \mu)^2}{2\sigma^2}\right) \qquad (20)$$

with mean $\mu = r_{0,i}$ (or $\mu = \theta_{0,j}$) and variance $\sigma^2 = k_B T / k_i$ (or $\sigma^2 = k_B T / k_j$). The prior energy is obtained by assuming independence between these energy terms and estimating these means and variances from the atomistic simulations.

For the application
of CGnet to the protein Chignolin, an additional
term is added to the baseline energy to enforce excluded volume and
penalize clashes between nonbonded CG particles. In particular, we
add a term $U_{\mathrm{rep}}(r)$ for each pairwise distance between CG particles that are more distant than two covalent bonds, in the form

$$U_{\mathrm{rep}}(r) = \left(\frac{\sigma}{r}\right)^c \qquad (21)$$

where the exponent $c$ and the effective excluded volume radius σ are two additional hyperparameters that are optimized by cross-validation.

We note
that in general one could use classical CG approaches with
predefined energy functions to first define the prior CG energy U0 and then use an ANN to correct it with multibody
terms.
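A sketch of such a baseline energy: harmonic bond and angle terms whose spring constants follow from the atomistic statistics via k = k_BT/σ², plus the repulsive term of eq 21 for nonbonded pairs. Tensor names, shapes, and the `stats` container are illustrative assumptions:

```python
import torch

def harmonic_terms(feat, mean, var, kT=1.0):
    """Sum of k/2 (f - f0)^2 terms per configuration, with spring
    constants k = kT / var estimated from the atomistic simulations."""
    return 0.5 * ((kT / var) * (feat - mean) ** 2).sum(dim=-1)

def excluded_volume(dist, sigma=5.5, c=6):
    """Repulsive (sigma / r)^c terms (eq 21) for nonbonded pairwise
    distances; sigma and c are tuned by cross-validation."""
    return ((sigma / dist) ** c).sum(dim=-1)

def baseline_energy(bonds, angles, nonbonded, stats):
    """U0 as a sum of independent harmonic and excluded volume terms.
    `stats` holds per-feature means and variances from the atomistic data."""
    u = harmonic_terms(bonds, stats["bond_mean"], stats["bond_var"])
    u += harmonic_terms(angles, stats["angle_mean"], stats["angle_var"])
    u += excluded_volume(nonbonded)
    return u
```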
Results
Two-Dimensional Toy Model
As a simple illustration,
we first present the results on the coarse-graining of a two-dimensional
toy model. The potential energy is shown in Figure 3a. The potential corresponds to a double well along the x-axis and a harmonic confinement along the y-axis; an additional term adds small-scale fluctuations, appearing as small ripples in Figure 3a.

The coarse-graining mapping is given by the
projection of the 2-dimensional model onto the x-axis.
In this simple toy model, the coarse-grained free energy (potential of mean force) can be computed exactly (Figure 3b).

We generate a long (one million time-steps) simulation trajectory
of the 2-dimensional model and use the x component
of the forces computed along the trajectories in the loss function
(eq 9). We report below the resulting CG potential
obtained by (1) using a feature regression, i.e., least-squares regression
with a set of feature functions defined in the SI, Section B, and (2) a CGnet (regularized and unregularized).

Cross-validation is used to select the best hyperparameters for
the least-squares regression and the CGnet architectures. For the
feature regression, the same cross-validation procedure as introduced
in ref (73) was used
and returned a linear combination of four basis functions among the
selected set (see Figure S1a and the Supporting
Information for details). For the regularized CGnet, a two-stage cross-validation
is conducted, first choosing the depth D with a fixed
width of W = 50, and then choosing the width W (Figure S1b,c). The minimal
prediction error is obtained with D = 1 (one hidden
layer) and W = 50. For the unregularized CGnet, a
similar procedure is performed, and the best hyperparameters are selected
as D = 1, W = 120. Note that the
prediction error cannot become zero, but is bounded from below by
the chosen CG scheme (Figure 1, eq 7), in this case by neglecting the y variable.

Figure 3c,d shows
the results of the predicted mean forces and free energies (potentials
of mean force) in the x-direction. The instantaneous
force fluctuates around the mean but serves to accurately fit the
exact mean force in the x range where sampling is
abundant using both feature regression and CGnet (Figure c). At the boundary where few
samples are in the training data, the predictors start to diverge
from the exact mean force and free energy (Figure c,d). This effect is more dramatic for the
unregularized CGnet; in particular, at large x values,
the CGnet makes an arbitrary prediction: here the force tends to zero.
In the present example, reaching these states is highly improbable.
However, a CGnet simulation reaching this region can fail dramatically,
as the simulation may continue to diffuse away from the low energy
regime. As discussed above, this behavior can be avoided by adding
a suitable prior energy that ensures that the free energy keeps increasing
outside the training data, while not affecting the accuracy of the
learned free energy within the training data (Figure 3c,d). Note that the quantitative mismatch in the low-probability regimes is not important for equilibrium simulations.

The matching mean forces translate into matching free energies
(potentials of mean force, Figure 3d). Finally, we conduct simulations with the learned models and generate trajectories $\{x_t\}$ using eq 17. From these trajectories, free energies can be computed by

$$\tilde{U}(x) = -k_B T \ln \tilde{p}(x) + \mathrm{const} \qquad (22)$$

where $\tilde{p}(x)$ is a histogram estimate of the probability density of x in the simulation trajectories. As shown in Figure 3d, free energies agree well in the x range that has significant equilibrium probability.
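A small sketch of this histogram estimator (eq 22) for a one-dimensional coordinate; the bin count and reduced units are illustrative:

```python
import numpy as np

def free_energy_from_histogram(traj, bins=100, kT=1.0):
    """Estimate U(x) = -kT ln p(x) + const (eq 22) from trajectories."""
    p, edges = np.histogram(traj, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    mask = p > 0                      # skip empty bins to avoid log(0)
    fe = -kT * np.log(p[mask])
    fe -= fe.min()                    # fix the arbitrary additive constant
    return centers[mask], fe
```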
Coarse-Graining of Alanine Dipeptide in Water
We now
demonstrate CGnets on the coarse-graining of an all-atom MD simulation
of alanine dipeptide in explicit solvent at T = 300
K to a simple model with 5 CG particles located at the five central
backbone atoms of the molecule (Figure 4). One trajectory of length 1 μs was generated
using the simulation setup described in ref (74); coordinates and forces
were saved every picosecond, giving rise to one million data points.
The CG model has no solvent; therefore, the CG procedure must learn
the solvation free energy for all CG configurations.
Figure 4
Mapping of alanine dipeptide
from an all-atom solvated model (top)
to a CG model consisting of the five central backbone atoms (bottom).
We compare two different CG models. The first model, called “spline
The first model, called “spline
model”, uses the state-of the art approach established in MD
coarse-graining:[11,49,59] to express the CG potential as a sum of few-body interaction terms,
similar as in classical MD force fields. The most flexible among these
approaches is to fit one-dimensional splines for each of the pairwise
distance, angle, and dihedral terms to parametrize two-, three-, and
four-body interactions.[75] To ensure a consistent
comparison, we represent 1D splines with neural networks that map
from a single input feature (pairwise distance, angle, or dihedral)
to a single free energy term, resulting in the spline model network
shown in Figure 2c. We use the same regularization and baseline energy for spline model networks and CGnets.

The second model uses a regularized multibody
CGnet, i.e., a fully
connected neural network shown in Figure 2b, to approximate the CG free energy. The
comparison of the results from the two models allows us to evaluate
the importance of multibody interactions that are captured by the
CGnet but are generally absent in CG models that use interaction terms
involving a few atoms only.

The hyperparameters for both models
consist of the number of layers
(depth, D), the number of neurons per layer (width, W) of the network, and the Lipschitz regularization strength
(λ)[68] and are optimized by a three-stage
cross-validation. In the first stage, we find the optimal D at fixed W = 30 and λ = ∞
(no Lipschitz regularization); subsequently, we choose W at the optimal D, and λ at the optimal W, D. This results in D = 5, W = 160, and λ = 4.0 for CGnet and D = 4, W = 30 (for each feature), and λ
= 10.0 for the spline model (Figure 5). The cross-validation error of CGnet is significantly lower than the cross-validation error of the spline model (Figure 5a–c). We point
out that the cross-validation error cannot become zero but is bounded
from below by the chosen CG scheme (Figure 1, eq 7), in this case by coarse-graining all solvent molecules
and all solute atoms except the five central backbone atoms away.
Hence, the absolute values of the cross-validation error in Figure 5a–c are not
meaningful and only differences matter.
Figure 5
(a–c) Cross-validated
force-matching error in [kcal/(mol Å)]² for the selection of the optimum structure of the
network. (d–f) Difference between the two-dimensional free
energy surfaces obtained from the CG models and from the reference
all-atom simulations (see Figure 6)
for the regularized CGnet and the spline model of alanine dipeptide.
(a) Selection of the number of layers, D. (b) Selection
of the number of neurons per layer, W. (c) Selection
of the Lipschitz regularization strength, λ. The selected hyperparameters,
corresponding to the smallest cross-validation error, are highlighted
by orange boxes. Blue dashed lines represent the regularized CGnet,
red dashed lines the spline model, and vertical bars the standard
error of the mean. (d–f) Difference between the reference all-atom
free energy surface and the free energy surfaces corresponding to
the choices of hyperparameters indicated in panels a–c as (C1,
C2, C3, C4, C5) for CGnet and as (S1, S2, S3, S4) for the spline model.
Figure 6
Free energy profiles and simulated structures of alanine
dipeptide
using all-atom and machine-learned coarse-grained models. (a) Reference
free energy as a function of the dihedral angles, as obtained from
direct histogram estimation from all-atom simulation. (b) Standard
coarse-grained model using a sum of splines of individual internal
coordinates. (c) Regularized CGnet as proposed here. (d) Unregularized
CGnet. (e) Representative structures in the six free energy minima,
from atomistic simulation (ball-and-stick representation) and regularized
CGnet simulation (licorice representation).
CG MD simulations are generated
for the selected models by iterating eq 17. For each model, one
hundred independent simulations starting from structures sampled randomly
from the atomistic simulation are performed for 1 million steps each,
and the aggregated data are used to produce the free energy as a function
of the dihedral coordinates. Figure 6 compares the free energy computed via eq 22 from the underlying
atomistic MD simulations and the free energy resulting from the selected
CG models. Only the regularized CGnet model can correctly reproduce
the position of all the main free energy minima (Figure 6a,c). On the contrary, the
spline model is not able to capture the shallow minima corresponding
to positive values of the dihedral angle ϕ, and introduces several
spurious minima (Figure 6b). This comparison confirms that selecting CG models by minimal
mean force prediction error achieves models that are better from a
physical viewpoint.

As an a posteriori analysis of the results, we have performed
MD
simulation for the CG models corresponding to different choices of
hyperparameters, both for the spline model and CGnet. For each choice
of hyperparameters, we have computed the difference between the free
energy as a function of the dihedral angles resulting from the CG
simulations and the one from the all-atom models. Differences in free
energy were estimated by discretizing the space spanned by the two
dihedral angles and computing the mean square difference on all bins.
The difference between a given model and CGnet was computed by shifting
the free energy of CGnet by a constant value that minimizes the overall
mean square difference. The free energy difference for the spline
models is always significantly larger than for the CGnet models (Figure 5d–f). Interestingly,
the minima in the difference in free energy correspond to the minima
in the cross-validation curves reported in Figure 5a–c, and the optimal values of hyperparameters
selected by cross-validation yield the absolute minimum in the free
energy difference (indicated in Figure 5f as C5 for CGnet and S4 for the spline model). This
point is illustrated more explicitly in the SI (Section E, Figures S4 and S5), and demonstrates that the cross-validation
error of different models is correlated with errors in approximating
the two-dimensional free energy surface of alanine dipeptide.

For the CGnet, regularization is extremely important: without regularization,
the free energy only matches near the most pronounced minima, and
unphysical structures are sampled outside (Figure 6d and the SI,
Section D). With regularization, these unphysical regimes are avoided;
all sampled structures appear chemically valid (Figure 6e), and the distributions of bonds and angles
follow those in the atomistic simulations (SI, Section D and Figure S3).
Coarse-Graining of Chignolin Folding/Unfolding in Water
Finally, we test the CGnet on a much more challenging
problem: the
folding/unfolding dynamics of the polypeptide Chignolin in water.
Chignolin consists of 10 amino acids plus termini and exhibits a clear
folding/unfolding transition. The all-atom model contains 1881 water
molecules, salt ions, and the Chignolin molecule, resulting in nearly
6000 atoms. To focus on the folding/unfolding transition, data were
generated at the melting temperature 350 K, mimicking the setup used
for the Anton supercomputer simulation in ref (76). To obtain a well-converged
ground truth, we generated 3742 short MD simulations with an aggregate
length of 187.2 μs on GPUgrid[77] using
the ACEMD program.[78] The free energy landscape
is computed on the two collective variables describing the slowest
processes, computed by the TICA method.[79] Since the individual MD simulations are too short to reach equilibrium,
the equilibrium distribution was recovered by reweighting all data
using a Markov state model.[80] See the SI for details on the MD simulation and Markov
model analysis.

Figure 7a shows the free energy as a function of the first two TICA
coordinates. Three minima are clearly identifiable on this free energy
landscape: states a (folded), b (unfolded), and c (partially misfolded),
ordered alphabetically from most to least populated. Representative
configurations in these minima are shown in Figure 7e. As a result, the first TICA mode is a
folding/unfolding coordinate, while the second is a misfolding coordinate.
Figure 7
Free energy
landscape of Chignolin for the different models. (a)
Free energy as obtained from all-atom simulation, as a function of
the first two TICA coordinates. (b) Free energy as obtained from the
spline model, as a function of the same two coordinates used in the
all-atom model. (c) Free energy as obtained from CGnet, as a function
of the same two coordinates. (d) Comparison of the one-dimensional
free energy as a function of the first TICA coordinate (corresponding
to the folding/unfolding transition) for the three models: all-atom
(blue), spline (green), and CGnet (red). (e) Representative Chignolin
configurations in the three minima from (a–c) all-atom simulation
and (a′–c′) CGnet.
Using a regularized CGnet, we coarse-grain the 6000-atom
system
to 10 CG beads representing the α-carbons of Chignolin. Thus,
not only is the polypeptide coarse-grained, but also the solvation
free energy is implicitly included in the CG model. Similar to what
was done for alanine dipeptide, roto-translational invariance of the
energy was implemented by using a CGnet featurization layer that maps
the Cα Cartesian
coordinates to all pairwise distances between CG beads, all angles
defined by three adjacent CG beads, and the cos and sin of all the dihedral angles defined by four adjacent CG
beads. The regularizing baseline energy includes a harmonic term for
each bond and angle and an excluded volume term for each pairwise
distance between CG particles that are separated by more than two
bonds.

Similar to the case of alanine dipeptide, a classical
few-body
spline model was defined for comparison whose CG potential is a sum
of bonded and nonbonded terms, where each term is a nonlinear function
of a single feature (pairwise distances, angles, dihedrals).

Both the CGnet and spline model are optimized through a five-stage
cross-validation search for the hyperparameters in the following order:
depth D, width W, exponent of the
excluded volume term c, radius of the excluded volume
term σ, and Lipschitz regularization strength λ. The results
of the cross-validation are shown in Figure S8. This optimization resulted in the hyperparameter values D = 5, W = 250, c = 6,
σ = 5.5, and λ = 4.0. For the spline model, the optimal
values of the hyperparameters are D = 3, W = 12 (for each feature), c = 10, σ
= 4.0, and λ = 5.0 (Figure S8). The
potential resulting from CGnet and the spline model is then used to
run long simulations with eq 17. One hundred simulations of 1 million steps each were generated
using randomly sampled configurations from the training data as starting
points. For comparison, the aggregated data are projected onto the
TICA coordinates obtained from all-atom simulations, and free energy
landscapes are computed directly using eq 22 (Figure 7b,c). For a more quantitative comparison, the free
energies are also reported along the first TICA coordinate that indicates
folding/unfolding (Figure 7d).

These figures clearly show that the spline model cannot reproduce
cannot reproduce
the folding/unfolding dynamics of Chignolin, as the folded and unfolded
states are not well-defined (Figure 7b,d). On the contrary, CGnet not only can consistently
fold and unfold the protein but also correctly identifies three well-defined
minima: the folded (a′), unfolded (b′), and partially
misfolded (c′) ensembles corresponding to the minima a, b,
and c in the all-atom fully solvated model (Figure 7c,d). Representative structures in the three
minima are shown in Figure 7e: the structures obtained from the CGnet simulations are
remarkably similar to the ones obtained in the all-atom simulations.
These results reinforce what has already been observed for alanine dipeptide above: the multibody interactions captured by CGnet are
essential for correct reproduction of the free energy landscape for
the protein Chignolin. The absence of such interactions in the spline
model dramatically alters the corresponding free energy landscape
to the point that the model cannot reproduce the folding/unfolding
behavior of the protein.
Conclusions
Here we have formulated
coarse-graining based on the force-matching
principle as a machine learning method. An important consequence of
this formulation is that coarse-graining is a supervised learning
problem whose loss function can be decomposed into the standard terms
of statistical estimator theory: Bias, Variance, and Noise. These
terms have well-defined physical meanings and can be used in conjunction
with cross-validation to select model hyperparameters and rank the
quality of different coarse-graining models.

We have also introduced
CGnets, a class of neural networks that
can be trained with the force-matching principle and can encode all
physically relevant invariances and constraints: (1) invariance of
the free energy and mean force with respect to translation of the
molecule, (2) invariance of the free energy and equivariance of the
mean force with respect to rotation of the molecule, (3) the mean
force being a conservative force field generated by the free energy,
and (4) a prior energy that can be applied to prevent simulations with CGnets from diverging into unphysical state space regions outside
the training data, such as states with overstretched bonds or clashing
atoms. Future CGnets may include additional invariances, such as permutational
invariance of identical CG particles, e.g., permutation of identical
particles in symmetric rings.

The results presented above show
that CGnet can be used to define
effective energies for CG models that optimally reproduce the equilibrium
distribution of a target atomistic model. CGnet provides a better
approximation than functional forms commonly used for CG models as
it automatically includes multibody effects and nonlinearities. The
work presented here provides a proof of principle for this approach
on relatively small solutes, but already demonstrates that the complex
solvation free energy involved in the folding/unfolding of a polypeptide
such as Chignolin can be encoded in a CGnet consisting of only the Cα atoms. The extension
to larger and more complex molecules presents additional challenges
and may require to include additional terms to enforce physical constraints.Additionally, the CG model considered here is designed ad hoc for
a specific molecule and is not transferable to the study of different
systems. Transferability remains an outstanding issue in the design
of coarse-grained models,[11] and its requirement
may decrease the ability to reproduce faithfully properties of specific
systems.[49,81−84] In principle, transferable potentials
can be obtained by designing input features for CGnet imposing a dependence
of the energy function on the CG particle types and their environment,[82] similarly to what is done in the learning of
potential energy functions from quantum mechanics data (see, e.g.,
refs (20, 24, 27, 33, and 66)). This approach may be able to define transferable functions if
enough data are used in the training.[27,33] We leave the
investigation on the trade-off between transferability and accuracy
for future studies.

It is also important to note that the formulation
used here to
define an optimal CG potential aims at reproducing structural properties
of the system, but it does not determine the equations for its dynamical
evolution. If one is interested in designing CG models that can reproduce
molecular dynamical mechanisms, e.g., to reproduce the slow dynamical
processes of the fine-grained model, alternative approaches need to
be investigated.