Huilin Ye1, Weikang Xian1, Ying Li1,2. 1. Department of Mechanical Engineering, University of Connecticut, Storrs, Connecticut 06269, United States. 2. Polymer Program, Institute of Materials Science, University of Connecticut, Storrs, Connecticut 06269, United States.
Abstract
Machine learning (ML) has emerged as one of the most powerful tools transforming all areas of science and engineering. The nature of molecular dynamics (MD) simulations, complex and time-consuming calculations, makes them particularly suitable for ML research. This review article focuses on recent advances in developing efficient and accurate coarse-grained (CG) models using various ML methods, in terms of regulating the coarse-graining process, constructing adequate descriptors/features, generating representative training data sets, and optimizing the loss function. Two classes of CG models are introduced: bottom-up and top-down CG methods. To illustrate these methods and the open methodological questions, we survey several important principles in constructing CG models and show how they are incorporated into ML methods and improved with specific learning techniques. Finally, we discuss some key aspects of developing machine-learned CG models with high accuracy and efficiency, describe how these aspects are tackled in state-of-the-art methods, and identify which remain to be addressed in the near future. We expect that machine-learned CG models can address the thermodynamic consistency, transferability, and representability issues of classical CG models.
Over the past decades,
molecular dynamics (MD) simulations have
become one of the most important computational techniques to study
the relationship between the properties of materials and the interactions
of atoms, especially with the advances of computational resources,
i.e., high-performance computing (HPC).[1] Although MD has achieved great success, the progress driven by the
demand to model more complex systems across multiple spatial and temporal
scales is severely restrained by computational limitations.[2] As shown in Figure 1a, the sweet spot for MD simulations is confined
by the boundaries of memory limit (size of the system), communication
limit (time across multiscale), and parallel limit (HPC). To model
a larger system or a longer time, coarse-graining is required to overcome
the current limitations and push the boundaries of MD simulations
for a broad range of applications, such as self-assembly of organic
molecules and crystallization of polymers.[3−5]
Figure 1
(a) Boundaries of all-atom
molecular dynamics (MD) simulation using
high performance computing facilities in terms of system size and
time scale. (b) Schematic to show two types of coarse-grained methods:
bottom-up and top-down in different time and length scales. Besides,
the coarse-graining process of a polypeptide, polyalanine, is shown
in the inset.
The central problem for coarse-graining
is how to link simulations
of detailed models with simulations of coarse-grained (CG) ones through
the propagation of information. Note that the CG model enables efficient simulations covering different spatial and temporal scales, from quantum mechanics, with the highest level of accuracy, to the macroscopic continuum model, which depends on a theoretical constitutive law or an experimental empirical relationship. According to the direction of
information propagation, CG methods can be classified into two different
categories: bottom-up[3] and top-down[6] approaches. For the bottom-up methods, the fundamental
physical principles at the smaller scales are used to parametrize
a model at the CG scale, whereas in top-down CG methods the behavior at larger scales is used to inform the interactions at smaller scales.
When designing a CG model, the first step is to define a pseudo atom
or CG site. These sites can be designed to represent combined groups
of multiple atoms or functional groups. For example, Figure 1b shows the coarse-graining
process of polyalanine. The pseudo atom sites can be either groups
of hydrogens or functional groups. The second key aspect in developing
the CG model is to define an effective energy function U for the CG model, which determines the interactions between pseudo
atoms.

In the top-down CG methods, the energy function U is constructed and parametrized by either atomistic simulations
or experiments to reproduce target properties, such as density, diffusivity,
and partition energy. For example, Marrink et al.[6] developed the MARTINI model for biomolecular simulations,
especially for the modeling of lipids and proteins. In the MARTINI
model, the force field is parametrized in a systematic way to reproduce
the free energies between polar and apolar phases of a large number
of chemical compounds through increasing the number of possible interaction
levels of the CG sites. On the contrary, in the bottom-up CG methods,
the target properties are not used to optimize the potentials; instead,
they are predicted by the CG potentials. Typical bottom-up methods
use structure- and force-based approaches, depending on the target
quantities. If the method aims to reproduce the structural distributions given by all-atom (AA) simulations, it is structure-based, as in the iterative Boltzmann inversion (IBI) method, the inverse Monte Carlo (IMC) method, and the relative entropy method.[5] The force-matching method instead aims to match the force distribution of the CG model to that obtained from AA molecular simulations. Each method has its advantages and disadvantages: the IBI method can reproduce the correct structural distributions and conformations but misses the right thermodynamics and many-body free-energy landscape, while the force-matching method captures many-body effects and proper dynamics but can misrepresent the structural distributions.
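To make the structure-based route concrete, the minimal sketch below performs one IBI refinement of a tabulated pair potential; the temperature, radial grid, and synthetic RDFs are hypothetical stand-ins for data from actual AA and CG runs.

```python
import numpy as np

kT = 2.494  # kJ/mol at 300 K (assumed state point)

def ibi_update(U, g_cg, g_target, damping=0.2, eps=1e-8):
    # IBI update rule: U_{n+1}(r) = U_n(r) + kT * ln(g_n(r) / g_target(r));
    # a damping factor < 1 stabilizes the iteration.
    return U + damping * kT * np.log((g_cg + eps) / (g_target + eps))

# Synthetic placeholder RDFs on a radial grid (nm)
r = np.linspace(0.3, 1.5, 240)
g_target = 1.0 + 0.8 * np.exp(-((r - 0.50) / 0.08) ** 2)  # from AA simulation
g_cg = 1.0 + 0.6 * np.exp(-((r - 0.55) / 0.10) ** 2)      # from current CG run

U0 = -kT * np.log(np.maximum(g_target, 1e-8))  # Boltzmann-inverted initial guess
U1 = ibi_update(U0, g_cg, g_target)            # one refinement step
```

In practice, the CG simulation is rerun with the updated potential table, and the loop repeats until the CG RDF matches the AA target.

Although these CG models are straightforward and can model much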
larger systems with reasonable computation costs, they still suffer
from some important issues in terms of thermodynamic representability,
transferability, and consistency.[5] The CG model cannot fully reproduce the properties of the AA system because degrees of freedom are lost during the CG process. Properties that are highly sensitive to small-scale phenomena will be lost in simulations using CG models.[7] Thus, the mapping from AA coordinates to CG sites strongly depends on an understanding of the physical problem and on chemical intuition, and a poorly chosen mapping discards important information. Moreover,
both CG methods take distributions from AA molecular simulations for
a specific thermodynamic state. Therefore, the derived effective CG
potentials have limited thermodynamic transferability to other conditions,
such as temperature and pressure. As a result, the derived CG potentials are applicable only to a subset of thermodynamic states. Last but not least, thermodynamic consistency can be affected by the CG process. The most common inconsistency arises from the dilemma of balancing structural and dynamic properties. Taking the IBI method as an example, the forces acting on the CG sites are not rigorously reproduced, because the optimization objective is preservation of the structural distributions rather than of the forces, as in the force-matching method. Consequently, a CG model optimized with the IBI method usually shows faster dynamics and requires a dynamic rescaling factor, despite faithfully preserving structural distributions and properties.[5]

Recently, machine learning (ML) of potentials is
emerging as an
alternative approach, in comparison with classical force fields with
explicit functions. This approach represents potential-energy surfaces by training on large data sets from density-functional theory (DFT) calculations.
Machine-learned AA potentials have excellent representability, accuracy,
efficiency, and thermodynamic transferability, in comparison to DFT
simulations. Similarly, machine-learned CG potentials are proposed
to represent the free-energy landscape of AA molecular models with
high efficiency and accuracy. Thus, ML of CG potentials becomes a
new way to bridge the gap between accurate but computationally expensive ab initio methods and approximate but computationally cheap
CG methods.[8−19] In particular, machine-learned CG potentials can predict the physical
properties of complex molecular systems with ab initio accuracy. The general idea of this approach is to use statistical
learning techniques, such as artificial neural network (ANN),[13] deep neural network (DNN),[12] and convolutional neural network (CNN),[18] to name a few, to formulate complex potentials with many
model parameters, which are undetermined and optimized using ab initio or AA molecular simulation results as references.
Similarly, based on the classes of CG methods, the machine-learned
CG methods fall into two categories: bottom-up machine-learned CG methods[9−11] and top-down machine-learned CG methods,[12−16] as listed in Table 1.
Table 1
Classification of Different Machine-Learned CG Models^a

method | architecture | key technique | training data
autoencoder-based[15] | encoder and decoder | Gumbel-softmax reparametrization | all-atom MD simulation
kernel-based[14] | kernel optimization | Hessian kernel mean force estimation | atomistic simulation
graph-based[16] | graph-theoretic principles | adjacency matrix structure of material | atomistic simulation

^a (p, q) in the architecture column are the numbers of nodes in the layers of the neural network.

For the top-down machine-learned CG method, an empirical
expression
of the interactions between CG particles is given a priori, such as the Tersoff–Brenner potential for CG water[9] and the Lennard-Jones potential for simple molecular
liquids,[10] to perform MD simulations. Usually,
the potential parameters in these expressions are undetermined and
optimized by statistical learning techniques to match the structural
properties (e.g., radial distribution functions, angular distribution
functions) and dynamics properties (e.g., diffusion coefficient) with
the corresponding experimental/computational results. On the contrary,
in the bottom-up machine-learned CG method, an unknown CG potential U should be introduced and U is related
to the representations (features associated with the coordinates of
CG particles) through high-dimensional neural networks.[12] The neural networks can be trained and optimized
by using the data set from AA molecular simulations. For instance,
the mean force is used to regularize the optimization of the loss
function in the learning process. In this study, we will review some
of the typical machine-learned CG methods in these two categories.
We will focus on the ML framework, the validation of the
trained model, and its applications. We will also discuss the open
problems and challenges facing the ML of CG models and possible solutions.
We expect machine-learned CG models can play an important role in
understanding the physical and mechanical properties of complex molecular
systems and hope that this work inspires future studies in this field.
Top-down Machine-Learned CG Methods
Machine-Learned Bond Order Potential for CG Water
Accurate and efficient molecular models for water are highly desirable for understanding its phase transformation behaviors at different temperatures and pressures. However, developing such robust and efficient models remains challenging.
Recently, Chan and co-workers[9] have introduced
an ML workflow to train the CG models that can accurately describe
the behaviors of ice, supercooled, and normal liquid water at the
mesoscopic scale. Two bond-order CG models are utilized: bond-order
potential (BOP) and BOP with on-the-fly dihedrals (BOPdih), both based on the Tersoff–Brenner potential, with 7 and 11 undetermined parameters to be optimized, respectively.

In conventional MD simulations, these parameters in the potential functions are given a priori. This limits the applicability of the potential, and modeling systems at different thermodynamic states requires recalibration of the potential parameters. In
this study, the CG potential parameters are optimized by a multilevel evolutionary strategy, the hierarchical objective genetic algorithm (HOGA), as shown in Figure 2a. A two-stage optimization technique is introduced to find the global minimum in the objective-value landscape. During the global optimization, the genetic algorithm[20] is used. It begins with random sampling of N parameter sets. The objective value of each parameter set is evaluated, and convergence is checked; if the convergence criteria are not met, a new list of N parameter sets is generated using genetic operations. Typically, this stage returns a list of close-to-optimal parameter sets. Starting from these parameter sets, the second-stage local optimization then searches for the final parameter set with the Nelder–Mead simplex algorithm.
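The two-stage logic can be caricatured in a few lines; this is a minimal sketch rather than the authors' HOGA implementation, and the objective function is a hypothetical stand-in for the hierarchical objectives evaluated from CG-MD runs.

```python
import numpy as np
from scipy.optimize import minimize

def objective(params):
    # Placeholder: the real workflow runs a CG-MD simulation here and scores
    # errors in, e.g., RDF, density, and melting point (hierarchical objective)
    return np.sum((params - np.array([1.2, 0.8, 2.5])) ** 2)

rng = np.random.default_rng(0)

# Stage 1: genetic-algorithm-style global search over N parameter sets
pop = rng.uniform(0.0, 4.0, size=(40, 3))
for _ in range(50):
    scores = np.array([objective(p) for p in pop])
    parents = pop[np.argsort(scores)[:10]]           # selection of the fittest
    pop = (parents[rng.integers(0, 10, 40)]
           + rng.normal(0.0, 0.1, size=(40, 3)))     # mutation of offspring

# Stage 2: Nelder-Mead local refinement from the close-to-optimal sets
scores = np.array([objective(p) for p in pop])
best = min((minimize(objective, s, method="Nelder-Mead")
            for s in pop[np.argsort(scores)[:5]]),
           key=lambda res: res.fun)
print(best.x)   # final optimized parameter set
```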
Figure 2
(a) Workflow for the ML CG model of water molecules through hierarchical
hierarchical
objective genetic algorithm. Two stages: global minimization and local
minimization are shown. (b) Comparisons of the results of density
anomaly, diffusion coefficients, radial distribution functions and
heat capacity between the experiments and predictions by ML CG model.
(c) Application of ML model to predict the nucleation and crystallization
process of water (Reprinted with permission from the work of Chan
et al.[9] Copyright 2019 Springer Nature
Limited).
The extensive training data set
of energies and structural properties
for ice and liquid water is taken from both an atomistic model (TIP4P/2005)
through MD simulations and experimental data. The data set is chosen
with the requirement to ensure the adequate representation of the
diverse configurational space of ice and liquid water. The trained
CG model is first validated by comparing the structural and dynamic
properties with experimental data, such as radial distribution function
(RDF), angular distribution function (ADF), and diffusion coefficient.
From Figure 2b, it
is found that the ML-BOP models can successfully capture the best-known
thermodynamic anomaly and the existence of a density maximum at 277
K. Also, it can correctly describe the freezing/melting transition
at 273 ± 1 K, densities of ice (140–273 K), and water
(243–273 K) within 1.4% of experiments. For the transport properties
of water, the room-temperature diffusivity is calibrated as ~3 × 10^-5 cm^2 s^-1, which is close to the experimental value (2.3 × 10^-5 cm^2 s^-1). Regarding the structural property, the O–O RDFs for ice at 77 K and water at 254 K are compared with the experiments. As shown in Figure 2b, the locations and intensities of the peaks of these RDFs are in good agreement with the corresponding experimental results. Finally, the heat capacities C for liquid water are investigated. The model reproduces the thermodynamic anomaly indicated by the sharp increase in C of supercooled water (cf. Figure 2b). After the validation,
a representative test case is performed on multimillion water molecules
to study nucleation of supercooled water, leading up to the formation
and growth of grains, as shown in Figure 2c. The water is slowly cooled from 275 to 210 K over 130.4 ns. After the first nucleus forms, the temperature is held at 210 K for another 100 ns; the first nucleation event is followed by a slow transformation (150 ns), an accelerated transformation of a large number of nuclei (200 ns), and completion of grain growth to form polycrystalline ice.

This workflow demonstrates the
high accuracy of the machine-learned
CG model in capturing the structural and dynamic properties of water,
in comparison to the experimental results. Besides, this workflow
shows almost the same computational cost as the other CG method—the
monatomic water (mW) model—but with higher accuracy. Nevertheless,
this workflow suffers from some limitations. First, it is partially
temperature-transferable: it can only reproduce the thermodynamics within the regime covered by the training data set. Second,
it cannot fully reproduce all the physical properties. For example,
to improve the predictions on diffusion coefficient and temperature
of maximum density, the prediction accuracy of the melting point is
sacrificed.
Transfer-Learning-Based CG Method for Simple Molecular Fluids
In MD simulations, once the interatomic
potential is specified, many properties like RDF and ADF are determined.
However, the inverse problem, parametrization of the potential for a specific property, is not straightforward. Moradzadeh et al.[10] adopted a DNN to study this inverse problem, in terms of the relation between the RDF and the Lennard-Jones (LJ) potential parameters. Figure 3a shows the schematic of the DNN. The training data are taken from
MD simulations with the potential parameters and thermodynamic states
(density ρ and temperature T) sampled over
a wide range. Here, the LJ potential has the form

U(r) = C12/r^12 − C6/r^6,

where C12 and C6 are potential parameters. After performing the MD simulations, the RDFs g(r) under specific thermodynamic states (ρ, T) are collected. The complex relationship between the potential parameters (C12, C6) and the RDFs can then be expressed as (C12, C6) = f(g(r); ρ, T), where f is
a vector function. In this work, a feed-forward neural network (FFNN) is adopted to represent this function, based on the universal approximation theorem. The input vector x_i is the concatenation of the RDF of system i (of size n) with its thermodynamic-state variables ρ and T (each of size p), for a data set with a total of m systems. Each node in the DNN applies a linear transformation to the incoming signal followed by a nonlinear activation function, producing an output signal. The DNN can thus be expressed mathematically, layer by layer, as

y^(k) = ϕ_k(W^(k) y^(k-1) + b^(k)),

where ϕ_k is the nonlinear activation function of layer k and W^(k) and b^(k) are its weights and biases. The FFNN is trained with a loss function given by the mean-squared error (MSE) between the ground-truth and predicted parameters,

L(θ) = (1/m) Σ_i |v_GT,i − v_DL,i(θ, x_i)|^2,

where θ represents the free parameters, i.e., the weights and biases, and v_GT and v_DL(θ, x) are the ground-truth and predicted LJ interaction parameters.
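A minimal sketch of such an inverse network is shown below (PyTorch); the layer widths, RDF bin count, and random stand-in data are illustrative assumptions, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

n_bins = 100                         # assumed number of RDF bins in g(r)
model = nn.Sequential(
    nn.Linear(n_bins + 2, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 2),                # output: predicted (C12, C6)
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()               # MSE between ground-truth and prediction

# Hypothetical batch: each input concatenates an RDF with (rho, T)
x = torch.randn(32, n_bins + 2)      # stand-in for simulated data
y_true = torch.randn(32, 2)          # stand-in ground-truth (C12, C6)

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(x), y_true)
    loss.backward()
    opt.step()
```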
Figure 3
(a) Schematic
of a deep neural network (DNN). (b) Comparison of
the pair potential parameters determined from the DNN with the ground-truth
values for training, validation, testing, and transferability data
sets. (c) Comparison of RDFs obtained with DNN, relative entropy,
and simplex CG models and the all-atom (AA) model. Three types of
molecules are studied: CO, F2, and CH4 (Reproduced
from the work of Moradzadeh et al.[10] Copyright
2019 American Chemical Society).
The performance of the DNN is examined by investigating two cases.
The first one is the generalizability and transferability of the interatomic
potential parameters. The generalizability refers to the use of DNN
to estimate the case, which is not part of the training data set,
but falls within the range of the training data set; the transferability
refers to the use of DNN to estimate the case, which is outside the
range of the training data set. One-to-one comparisons between the ground-truth (denoted by subscript GT) and predicted (denoted by subscript DL) potential parameters are shown in Figure 3b. The left four and right four panels are for C12 and C6, respectively. In each panel, the first plot is for the training process, the second shows the validation, the third gives the generalizability, and the last demonstrates the transferability. The accuracy is quantified by the mean absolute percentage error (MAPE), defined as MAPE = (100%/m) Σ_i |(v_GT,i − v_DL,i)/v_GT,i|. The solid black line indicates predicted parameters exactly consistent with the ground-truth parameters, and the dashed black lines are the boundaries within which ~99% of the prediction points lie, in comparison to the ground truth of the training data set. The low MAPE in Figure 3b demonstrates that the DNN has good generalizability and transferability.

The second case is
transfer learning, which is studied by developing
single-bead CG models for simple molecular liquids such as carbon
monoxide, fluorine, and methane. The corresponding RDFs predicted
from the DNN, simplex, and relative entropy methods and from AA molecular simulations are shown in Figure 3c. These comparisons indicate that the DNN prediction has high accuracy. However, this high accuracy is limited to simple
molecular liquids. For complex molecules that cannot be described
by the simple LJ potential functions, the CG process to a single LJ
particle will lose plenty of information, leading to the wrong predictions
of both structural and dynamic properties.
ANN-Assisted Particle Swarm Optimization for Transferable CG Models
Particle swarm optimization (PSO),
inspired by nature and first applied to simulate social behaviors,
is an iterative global optimization technique powered by population
migration. Although it is capable of nonlinear optimization, the convergence
becomes slow when applied to force field development because of the
increasing amount of expensive MD simulations. In addition, due to
its population-based nature, its optimization path relies on only
the local bests and the global bests, instead of all the MD results.
Therefore, a cost-effective way to harness all the costly simulation
data is desirable. Recently, Bejagam et al.[11] developed an artificial neural network (ANN) assisted-PSO framework
to optimize the potential parameters for CG models.In a traditional
PSO optimization, a particle represents a state in the multidimensional
optimization space. In the case of CG potential development, the state
represents the potential parameters. In each iteration of the PSO,
MD simulations are first performed with particle-specified potential
parameters. By comparing each of the MD results and target properties,
a fitness value, which quantifies the particle-specific deviation
from the target properties, is assigned to the particle as a personal
score. For each particle, the local best is updated if the current fitness value is better than all previous values. The best of all the local bests is assigned as the global optimum for the current iteration and used to guide the next iteration. The optimization proceeds until the global best particle (the best potential parameters) is capable of reproducing the target properties with a satisfactorily small error, usually 2–5%.

On top
of the classic PSO framework, an ANN is used to accelerate
the convergence of the PSO. As shown in Figure 4a, in each iteration of the PSO, after the
MD runs, all the MD data are used to train a dynamic ANN. The ANN
then makes a prediction of an extra particle (a set of potential parameters).
This extra particle is fed back to the group of particles in the PSO.
In the next iteration, the new group of particles is used to advance
the PSO. Thus, the combination of PSO and ANN algorithms could dramatically
accelerate the searching process for the best solution for CG potential
parameters.
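The sketch below gives one simple realization of this idea: a surrogate network is fitted to all evaluated (parameter, fitness) pairs, and its minimizer is injected as the extra particle. The fitness function, swarm settings, and network size are hypothetical placeholders, not the authors' setup.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.neural_network import MLPRegressor

def fitness(params):
    # Placeholder for an MD run scored against targets (density, diffusivity)
    return np.sum((params - np.array([0.65, 3.1])) ** 2)

rng = np.random.default_rng(1)
pos = rng.uniform(0.1, 5.0, (8, 2))           # 8 particles, 2 parameters each
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_f = np.array([fitness(p) for p in pos])
hist_x, hist_f = list(pos), list(pbest_f)     # every MD evaluation is kept

for _ in range(10):
    gbest = pbest[np.argmin(pbest_f)]
    vel = (0.7 * vel
           + 1.5 * rng.random(pos.shape) * (pbest - pos)
           + 1.5 * rng.random(pos.shape) * (gbest - pos))
    pos = pos + vel
    f = np.array([fitness(p) for p in pos])
    better = f < pbest_f                      # lower error = better fitness
    pbest[better], pbest_f[better] = pos[better], f[better]
    hist_x.extend(pos); hist_f.extend(f)

    # Surrogate ANN trained on *all* evaluations proposes an extra particle
    ann = MLPRegressor((32, 32), max_iter=2000).fit(np.array(hist_x),
                                                    np.array(hist_f))
    extra = minimize(lambda p: ann.predict(p.reshape(1, -1))[0],
                     gbest, method="Nelder-Mead").x
    worst = np.argmax(pbest_f)
    if fitness(extra) < pbest_f[worst]:       # inject the ANN suggestion
        pos[worst], pbest[worst], pbest_f[worst] = extra, extra, fitness(extra)
```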
Figure 4
Artificial neural network (ANN) assisted particle swarm optimization
(PSO). (a) Flowchart of the ANN-PSO. In every iteration, the ANN trained
by data collection of all the previous iterations provides a new particle
(a set of force field parameters). (b) Molecules D2O and
DMF coarse-grained to nonpolar beads. (c) Optimization of the D2O. Error histories of systems with 40, 8, and 4 particles.
(d) Optimization of the DMF. Error histories of systems with 40, 8,
and 4 particles (Reproduced from the work of Bejagam et al.[11] Copyright 2018 American Chemical Society).
Molecules of D2O and DMF, as depicted
in Figure 4b, are used to demonstrate the competence of the proposed framework. The molecules are coarse-grained as one and two nonpolarizable beads, respectively. The single-bead CG model for D2O needs two parameters, i.e., the ϵ and σ of the 12–6 LJ potential, to describe the interaction between beads. The model for DMF needs five parameters: k_b, the strength of the bond between the AM bead and the CGD2 bead; ϵ_AM and σ_AM for the AM bead; and ϵ_CGD2 and σ_CGD2 for the CGD2 bead. The results of the optimization for D2O and DMF are shown in Figure 4c and d, respectively. The error of each
iteration is calculated as the root-mean-square error between the
optimization and the experimental target values of the densities and
diffusion coefficients. The particle numbers used in the optimization
are 40, 8, and 4 from the left to the right, respectively. In all
the cases, the ANN-PSO method shows better predictions and faster convergence, with smaller errors, than the plain PSO method.

In this study, the ANN is used to exploit all the MD results that the PSO produces to accelerate the convergence of the PSO, reducing the demand for expensive MD simulations. However, one still needs to predefine the CG potential form (the 12–6 LJ form in this study). As a result, this framework can only preserve limited structural and dynamic properties of the CG models.
Bottom-up Machine-Learned CG Approaches
DeePCG: Constructing a CG Model via DNN
Defining an accurate free
energy function in the space of CG variables
is always the key to developing a CG model. However, it is also the most difficult part, requiring substantial physical and chemical intuition.
ML methods can address this problem in a more accurate and automated
way. Nevertheless, most of the ML approaches so far focus on the representation
of the potential-energy surface, such as the deep potential method.[21] It allows us to perform MD simulations with accuracy comparable to ab initio molecular dynamics (AIMD) but at the cost of classical empirical force fields. To
find a good ML approach to represent the free energy surface, Zhang
et al.[12] developed the deep coarse-grained
potential (DeePCG) scheme. Note that the ML of AA and CG models tends
to represent potential-energy and free-energy surfaces, respectively,
by high-dimensional neural networks.

In DeePCG, a neural network representation U_w(ξ) is adopted for the CG potential U(ξ) (see Figure 5a). Here, the ξ variables are the coordinates of the CG particles. The CG potential is assumed to be the sum of local contributions of the CG particles, i.e., U_w(ξ) = Σ_i U_w,i(ξ), where U_w,i is the potential contribution of CG particle i. It is constructed in two steps: (1) first, a local frame transformation produces the descriptor {D_i}, which includes radial distance and angular information (Figure 5b); (2) the descriptor {D_i} is then given as input to a fully connected FFNN (four layers in total, with two hidden layers) to compute the potential contribution of CG particle i (see Figure 5a). Because the CG potential in DeePCG is a free energy that is not directly available, optimization of the CG potential is required. Here, a force-matching scheme is employed that fits accurate mean forces from AIMD simulations. Therefore, the loss function in DeePCG is defined as

L(w) = (1/(3D)) Σ_ξ |F(ξ) + ∇_ξ U_w(ξ)|^2,

where D is the number of CG configurations in the data set and F(ξ) is the mean force estimated from AIMD simulations. The loss function is highly nonconvex, corresponding to a rugged landscape in the large parameter space, and is minimized by the stochastic gradient descent (SGD) method.
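A minimal force-matching step in the spirit of DeePCG can be sketched as follows (PyTorch). A flat descriptor vector stands in for the local-frame descriptors, and gradients are taken with respect to the network inputs for brevity, whereas DeePCG maps them back to the CG coordinates via the chain rule.

```python
import torch
import torch.nn as nn

# Toy energy network: a flat descriptor vector per configuration -> energy.
net = nn.Sequential(nn.Linear(30, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))
opt = torch.optim.SGD(net.parameters(), lr=1e-3)

xi = torch.randn(128, 30, requires_grad=True)   # stand-in CG descriptors
F_ref = torch.randn(128, 30)                    # stand-in mean forces (AIMD)

for _ in range(200):
    opt.zero_grad()
    U = net(xi).sum()                           # total learned CG free energy
    # Predicted CG forces as the negative gradient of the learned potential
    F_pred = -torch.autograd.grad(U, xi, create_graph=True)[0]
    loss = ((F_pred - F_ref) ** 2).mean()       # force-matching loss
    loss.backward()
    opt.step()
```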
Figure 5
(a) Schematic of the
deep neural network (DNN). (b) Example for
CG of water molecules and the local transformation of coordinates
of CG sites; the two descriptor sets {D} contain radial and angular information, respectively. (c) (upper) comparison of the O–O RDFs for liquid
water between AIMD and DeePMD and DeePCG simulation; (lower) deviations
of DeePMD and of two DeePCG models relative to the AIMD result. (d)
O–O–O ADFs of liquid water from AIMD, DeePMD, and DeePCG
simulations with four different cutoff radii (Reprinted with permission
from the work of Zhang et al.[12] Copyright
2018 AIP Publishing LLC).
To demonstrate the capability of DeePCG, the CG model of liquid
water is adopted. The training data set is obtained from the AIMD
simulations. The AIMD data set consists of a total of 40 000
snapshots, during a 20 ps-long trajectory in the NVT ensemble with the system size N = 192 atoms (64
H2O molecules) at T = 300 K. After the
training process, the NVT simulation with the trained DeePCG model
is performed on the CG waters. The results of O–O RDF and O–O–O
ADFs are compared with those from AIMD and DeePMD (see Figure 5c). It is found that DeePCG reproduces the same RDF as AIMD and DeePMD. The ADF results are also examined for the CG system with different cutoff radii R_c, as shown in Figure 5d. In DeePCG, the CG site is the oxygen atom, and the cutoff determines the maximum distance over which the surrounding CG particles interact with each other. It is found that, as the cutoff distance R_c increases, the ADF from DeePCG becomes more consistent with those from AIMD and DeePMD. This means that a short cutoff distance may lose information
contained in the interactions among the CG particles.

From the modeling of liquid water, DeePCG demonstrates the same accuracy as AIMD. It is also faster than the corresponding AIMD, since DeePCG only considers local contributions to the potential, similar to classical atomistic MD simulations (see Figure 5b). Nevertheless, in the DeePCG simulations, a force discontinuity is found due to the sharp cutoff and the limitation of the descriptors for angular information (see the ADF results in Figure 5d). Moreover, the thermodynamic transferability of the developed CG model remains to be further examined with DeePCG.
CGnet: ML CG Force Fields
Another
force-matching based ML approach for CG potential is named CGnet,
as shown in Figure 6a. The difference is that in CGnet, the CG potential U(x) is represented by an ANN. Similarly,
the ANN includes a transformation y = g(x) from CG particle’s coordinates x to a set of features y that contains the invariance
of the free energy. The invariance is conserved by using the features:
distances between all pairs of CG particles and the angles between
three consecutive CG particles, like those shown in Figure 6b. The features y are then given as inputs to the ANN. The loss function of the ANN is defined as

L(θ) = ⟨|ξ(F(R)) + ∇U(ξ(R); θ)|^2⟩,

where θ are the parameters of the ANN, ξ(R) are the matrices of CG particle coordinates obtained by mapping the atomistic coordinates R, and ξ(F(R)) are the instantaneous force components projected onto the CG coordinates, with F(R) the instantaneous atomistic forces. This loss function reflects the potential of mean force (PMF) error between the projected instantaneous forces ξ(F(R)) and the CG force predicted by the negative gradient of the CG potential, −∇U(ξ(R); θ).
To minimize this error, another gradient layer is added into the ANN
to compute the derivatives with respect to the input coordinates/features.
With this ANN at hand, the trained CG potential form can be used to
perform MD simulations that will produce new CG coordinates. When
part of the new coordinates are outside the training data set, it
is possible that the ANN can generate unphysical predictions. To avoid
the unphysical predictions, a regularized CGnet is introduced by utilizing
“regularization” methods (see Figure 6a). In the regularized CGnet, a baseline
(prior) energy U0(x) is added
into the energy function as U(x; θ)
= U0(x) + Unet(x; θ), where Unet is the neural network free energy as used in CGnet.
The role of U0(x) is to enforce U → ∞ for unphysical states,
in which the new CG coordinates are outside the training data set.
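A minimal sketch of the regularized energy U(x; θ) = U0(x) + Unet(x; θ) is given below (PyTorch), assuming a harmonic prior on the invariant features; the feature size, prior stiffness, and network width are illustrative choices, not those of the published CGnet.

```python
import torch
import torch.nn as nn

class RegularizedCGnet(nn.Module):
    """Energy U = U0 + Unet, with a harmonic prior U0 that grows away from
    reference features y0, pushing U upward for unphysical states outside
    the training data (stiffness k is an assumed value)."""
    def __init__(self, n_feat, y0, k=100.0):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_feat, 64), nn.Tanh(),
                                 nn.Linear(64, 1))
        self.y0, self.k = y0, k

    def forward(self, y):  # y: invariant features (distances, angles)
        u_prior = 0.5 * self.k * ((y - self.y0) ** 2).sum(-1, keepdim=True)
        return u_prior + self.net(y)

model = RegularizedCGnet(10, torch.zeros(10))
y = torch.randn(4, 10, requires_grad=True)
forces = -torch.autograd.grad(model(y).sum(), y)[0]  # forces via gradient layer
```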
Figure 6
(a) ML
schemes of CGnet and regularized CGnet. (b) Comparisons
of force and free energy between predictions from CGnet, feature regression,
and the exact results for the two-dimensional toy model. (c) Free
energy profiles of alanine dipeptide using all-atom and machine-learned
CG models (Reproduced from the work of Wang et al.[13] Copyright 2019 American Chemical Society).
A two-dimensional toy model is used as an illustration to
show
the predicted potential energies using different CG models. The two-dimensional potential energy has a known analytical form, shown in Figure 6b. The CG mapping is defined by the projection of the two-dimensional model onto the x-axis (see Figure 6b). A long simulation
trajectory of the two-dimensional model is obtained, which is used
as the training data set. The CG potential and mean force are compared
between those from feature regression, i.e., least-squares regression,
and those from CGnet and regularized CGnet. It is found that within
the training data set, both feature regression and CGnets (both CGnet
and regularized CGnet) can accurately capture the mean force and free
energy. However, outside the training data set, the predictions from CGnet deviate severely from the exact results, whereas the regularized CGnet can still approximately capture the exact results, owing to the introduction of the prior energy (here, a harmonic form). To further demonstrate the application of CGnet, the coarse-graining of alanine dipeptide in an explicit solvent is studied. For comparison, another CG model named the “spline model” is also studied.[2] The free energy profiles from these three CG models are displayed together with the result from the AA molecular model (see Figure 6c). Only the regularized CGnet model correctly reproduces the positions of all the main free-energy minima. In contrast, the spline model cannot capture the energy minima corresponding to positive values of the dihedral angle ϕ, and the CGnet can only reproduce part of the free-energy minima.

The above results demonstrate that the CGnets (CGnet and regularized
CGnet) can be used to reproduce effective free energies for CG models,
which can capture the equilibrium distribution of a specific atomistic
model. However, the CGnet here is not transferable to the study of
different systems, since it is designed ad hoc for
a specific molecule. Besides, CGnet can only reproduce the structural/configurational properties of the system; the dynamic properties remain to be explored using alternative approaches.
Kernel-Based ML of the CG Model for Efficient MD Simulation
Since current ML models for CG force fields
suffer from high computational cost at every integration time step,
Scherer et al.[14] proposed a kernel-based
ML of the CG model for efficient simulations of molecular liquids.
The central idea is to utilize a kernel machine to learn the mapping Q → F, where Q is the representation of the system, i.e., a transformation of the coordinates of the atoms, and F is the force. The kernel function is K̂ = K(Q, Q′). Directly learning this mapping is very challenging, due to the large interpolation size of Q, even within a region defined by the cutoff distance r_cut. In this work, Q is decomposed into two terms, q^(2) and q^(3), which represent two-body pair and three-body triplet interactions, respectively. To predict a rotationally invariant property, the two-body pair representation is chosen as the interparticle distance, q^(2) = r_ij, and the three-body triplet representation is chosen as a vector of three interparticle distances, q^(3) = (r_ij, r_ik, r_jk) (see the interaction part of Figure 7a). Accordingly, the force F can be split into two terms, f^(2) and f^(3), respectively (see the body expansion in Figure 7a). Therefore, the problem is simplified to learning the local mappings q^(b) → f^(b), b = 2, 3, with the local kernels k̂^(b). Once the local kernels are trained, the ML model can be used to predict the local interactions f^(b)*. After correlating the local kernels k̂^(b) with the global kernel K̂, the new ML model can be used to predict the global force F*.
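A drastically simplified two-body analogue of this idea is sketched below: Gaussian kernel ridge regression from the pair representation q^(2) = r_ij to the radial force, with Lennard-Jones forces as stand-in training data. The Hessian-kernel machinery and three-body terms of the actual method are not reproduced here.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=0.1):
    # k(q, q') = exp(-(q - q')^2 / (2 sigma^2)) on scalar pair distances
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * sigma ** 2))

# Training data: pair distances and radial force components; a Lennard-Jones
# force law stands in for forces collected from reference simulations.
r_train = np.linspace(0.95, 2.5, 60)
f_train = 24.0 * (2.0 / r_train**13 - 1.0 / r_train**7)

K = gaussian_kernel(r_train, r_train)
alpha = np.linalg.solve(K + 1e-8 * np.eye(len(r_train)), f_train)  # ridge fit

r_test = np.linspace(1.0, 2.4, 200)
f_pred = gaussian_kernel(r_test, r_train) @ alpha   # predicted pair forces
```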
Figure 7
(a) (left) Description of the body expansion on pairs
and triplets.
Two- and three-body forces are shown. (right) Representations used
for two- and three-body interactions. (b) Learning curves for two-body
LJ fluid with different kernel matrices. (c) Learning curves for three-body
SW fluids with different kernel matrices. (d) Comparison of MD simulation
of tabulated three-body kernel predictions with the reference SW simulation,
including RDF and ADF results. (e) Comparisons of RDF and ADF between
atomistic simulations, the CG model using the force-matching scheme,
and the SW model (Reproduced from the work of Scherer et al.[14] Copyright 2020 American Chemical Society).
One of the key aspects of this ML framework is the construction of the
to construct the
kernel matrix K̂. The basic energy conservation
requires the kernel matrix is curl-free and the property rotates with
the sample. Based on these requirements, three approaches are proposed:
explicit rotations, integration over SO(3), and Hessian kernel. The
performances of these three types of kernels are compared by studying
the learning process for a two-body Lennard-Jones fluid (Figure b) and three-body
Stillinger–Weber (SW) fluid (Figure c). It is found that, for the two-body interaction,
there is no difference between the performances of the three kernels.
However, for the three-body SW fluid, the Hessian kernel outperforms
the other two kernels. Besides, by sorting and permutating the triplet
representation vector q(3), the learning
process will be significantly accelerated. Before performing MD simulations
using the trained ML model, a switch function is used close to the
cutoff distance to avoid numerical instabilities and maintain energy
conservation. Furthermore, to reduce the error induced by the kernel
predictions, the training data size is increased by the covariant
meshing technique.

To demonstrate this ML framework, the trained CG model is adopted
to run MD simulations. The RDF and ADF results are compared with those from the reference SW model. It is found that all the curves are almost identical, which means the kernel predictions lead to the correct
sampling of the canonical ensemble. In addition, the computational
costs of these simulations are comparable with the original SW potential,
without the restriction on the explicit functional form of three-body
potentials.

This ML framework is further applied to the learning of the CG force field. After the CG mapping R = R(r), the instantaneous collective force (ICF) on CG bead I is expressed as F_I(r) = Σ_{i∈I} f_i(r), where the f_i(r) are the atomistic forces. Here, the mapping is not directly from R to the mean force. Instead, a two-step procedure is provided: first, the two-body force F_2-body is obtained by the force-matching scheme; second, the three-body forces are averaged as the residual force between the target and the two-body force, ΔF = F − F_2-body. To learn these residual forces, a binned three-body Hessian kernel is adopted. The simulation results for liquid water with three different CG models are compared in Figure 7e. The Hessian
kernel-based ML model is found to be very close to the atomistic results
of RDF and ADF. More interestingly, it is found that the ML model is even more accurate than the force-matched SW model.

This kernel-based ML framework demonstrates that, by adding a switch function close to the cutoff distance and covariantly meshing the training data set, it can accurately capture two-body and three-body interactions at a computational cost comparable to that of the original SW model. The extension to CG liquids confirms that it is a promising technique for constructing efficient two- and three-body models for a wide range of CG applications. However, for higher many-body interactions, e.g., the four-body (dihedral) interaction, more complex and efficient kernel functions should be introduced. The computational efficiency could also become a significant challenge as more interactions are included in the kernel matrix within the cutoff distance.
Autoencoders for CG Molecular Models
The selection
of an appropriate mapping from AA to CG models plays
a critical role in developing CG models to reproduce consistent dynamics,
structural correlation, and thermodynamics.[5] In general, the criteria for selecting CG mappings depend strongly on prior physical and chemical knowledge and intuition. Although many efforts have been devoted to developing forward- and back-mapping algorithms, the statistical connections needed to reversibly bridge resolutions across different scales are missing. To solve this issue, a powerful unsupervised learning technique, the variational autoencoder (VAE), is introduced. VAEs compress the data through an information bottleneck that first continuously maps complex data into a low-dimensional latent space and then probabilistically infers the real data distribution via a generating process.[22] Accordingly,
Wang et al.[15] proposed an autoencoder-based
generative modeling framework that includes four steps: (1) learning
discrete CG variables in latent space and decoding back to atomistic
detail via geometric back-mapping; (2) using a reconstruction
loss to help capture collective features from AA reference data; (3)
regularizing the CG space with a semisupervised mean instantaneous
force minimization to obtain a smoothed CG free-energy landscape;
and (4) variationally finding the highly complex CG potential that
matches the instantaneous mean force acting on the AA training data. Figure 8a shows the autoencoding
framework. The atomistic data are first reconstructed by encoding
atomistic trajectories through a low-dimensional bottleneck, and the
CG mapping is parametrized by using the Gumbel-softmax reparametrization.
In the training process, the encoder and decoder are variationally
optimized to minimize the loss

L = L_recon + ρ_h L_force,

where the first term on the right-hand side is the atom-wise reconstruction loss and the second term is the average instantaneous mean force regularization; ρ_h is the weighting hyperparameter. Using this loss function, the force fluctuations of the encoded space are minimized by the instantaneous force regularizer, and the CG free-energy landscape is smoothed.
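The structure of this combined objective can be sketched as follows (PyTorch). The linear encoder/decoder, placeholder forces, and value of ρ_h are illustrative assumptions; in the actual framework, the CG mapping is the Gumbel-softmax encoder and the instantaneous forces are projected through it.

```python
import torch
import torch.nn as nn

n_atoms, n_cg = 22, 3
encoder = nn.Linear(3 * n_atoms, 3 * n_cg)  # stand-in for the Gumbel-softmax map
decoder = nn.Linear(3 * n_cg, 3 * n_atoms)  # stand-in geometric back-mapping

x = torch.randn(64, 3 * n_atoms)            # AA configurations (placeholder)
f_inst = torch.randn(64, 3 * n_cg)          # instantaneous CG forces (placeholder)

rho_h = 0.01                                # regularization hyperparameter
x_rec = decoder(encoder(x))
recon = ((x_rec - x) ** 2).mean()           # atom-wise reconstruction loss
force_reg = (f_inst ** 2).mean()            # mean instantaneous force magnitude
loss = recon + rho_h * force_reg
loss.backward()                             # trains encoder and decoder jointly
```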
Figure 8
(a) CG autoencoding framework
including the encoder, decoder, and
the reconstruction of original all-atom data. (b) Reconstruction and
mean force losses in Auto-Encoder training of gas-phase molecules
with different resolutions. (c) Coarse-graining encoding and decoding
for alanine dipeptide and the free energy profiles with different
CG resolutions, compared to the atomistic simulation. (d) Comparison
of the simulation results on RDF, bond distance distribution, and
mean squared displacement (MSD) between atomistic and CG models for
liquid ethane using a CG resolution of 2 per molecule (Reprinted with
permission from the work of Wang et al.[15] Copyright 2019 Springer Nature Limited).
This unsupervised autoencoding process is first demonstrated for gas-phase
ortho-terphenyl (OTP) (Figure 8b) and alanine dipeptide (Figure 8c). In the case of OTP, it is found that if the number of degrees of freedom is less than 4, for example, in a 3-bead mapping that partitions each of the phenyl rings, the encoding loses the configurational information that describes the relative rotation of the two side rings and thus leads to higher errors for the decoded structures. In the case of the small peptide molecule, latent variables of 8 CG beads are adequate to recover the heavy-atom positions, although the arrangement of the hydrogen atoms is not fully reproduced.

With the minimization of the
average of instantaneous mean forces
as a regularization term, the learning of a smooth CG free energy
surface is realized. The applicability of the framework is demonstrated
by bulk simulations of liquids. Here, an example of liquid ethane is shown in Figure 8d. The CG resolution is 2 beads per molecule. An autoencoder is trained to obtain the latent CG variables, and a neural network-based CG force field is then trained with a force-matching scheme. Finally, the CG simulations are performed at the same density and temperature as the atomistic simulation. From Figure 8d, it is found that the RDFs of both COM–COM and CG1–CG1 are captured accurately, compared to those in the atomistic simulations. The bond length distributions are also in good agreement with those from atomistic simulations. The dynamic property informed by the mean squared displacement (MSD) is shown in Figure 8d as well. The CG model exhibits faster dynamics than the atomistic simulations, which is the typical issue in classical CG models.[5]

This
ML framework proposes to use the CG coordinates as latent
variables that can be regularized with force. Through the training
of the encoding mapping, the deterministic decoding, and a CG potential, larger systems can be simulated for longer times, thus accelerating the MD simulations. However, this framework has its limitations. First,
the deterministic CG mapping leads to an irreversible loss of information
that is reflected in the reconstruction of average AA molecular structures.
Second, due to the force-matching scheme, the individual pair correlation
cannot be fully recovered compared to the atomistic trajectories.
Lastly, like other bottom-up approaches based on force-matching,
this framework can only reproduce structural correlation function
at one point in thermodynamic space. It is not transferable among
different thermodynamic conditions, which should be further addressed.[18]
Graph-Based CG Molecular Model
The
representability, in terms of describing the ground-truth system,
has drawn less attention than transferability. However,
its importance still deserves special notice because it is the baseline
of building CG models and the success of a CG model relies on a good
representation.[5] Webb et al.[16] proposed a graph-based coarse-graining (GBCG)
scheme. The method essentially uses graph theory to describe the chemical
connectivity of an organic molecule, mapping the ground-truth system
of the molecule to a coarser description by the basic graph operation of edge contraction, with a controllable degree of coarse-graining.

Generally, a unique adjacency matrix is a complete description
of the chemical structure of an organic molecule. The specific adjacency
matrix includes vertexes, representing the atoms in the molecule,
and edges, standing for the topological connections of these atoms.
With this knowledge, the GBCG is essentially a series of grouping and contraction operations, as illustrated in Figure 9a. In the grouping process, related vertexes are assigned to new groups, while in the contraction, vertexes in the new groups are combined to form coarser sites. Here, a molecule with the SMILES string [C(=O)1OCC(CO)C(CC)C1] is coarse-grained into models with 12, 6, 2, and 1 sites in a sequential and systematic manner.
Figure 9
CG mapping
scheme based on the graphic description of organic molecules.
(a) Logic of the graph-based coarse-graining (GBCG). The groupings
and contractions produce different levels of CG descriptions of the
original molecule. (b) (dashed-line box) Adjacency matrix A. (solid-line
box) Grouping method contracts the adjacency matrix A. The spectral
grouping ensures that those topologically important atoms are coarse-grained
later than those less important atoms. (c) Example of toluene. Sequential
CG mappings are shown. (d) Structural properties predicted by the
CG models with potentials developed based on the different levels
of coarse-graining. The results suggest the graphic CG mappings preserve
the structural properties well (Reproduced from the work of Webb et
al.[16] Copyright 2019 American Chemical
Society).
The essential questions to be answered in the GBCG are how
and in what sequence one should combine the vertexes. Generally speaking, there are different methods; in the following, a simple but effective protocol, spectral grouping, which uses only the adjacency matrix, is adopted as a demonstration. An example of dimethyl carbonate
is shown in Figure 9b. All the hydrogen atoms are first discarded, and the remaining atoms are treated in united-atom fashion, for simplicity. The 6 united atoms are defined as 6 vertexes. A 6 × 6 symmetric adjacency matrix A, highlighted in the dashed-line box, is used to describe the connectivity of the system, with the element A_ij equal to either 1 or 0, representing whether the connection between vertexes N_i and N_j exists or not.
The grouping logic is in the solid-line box. The rank-order is defined
by the eigenvector corresponding to the largest eigenvalue of the
adjacency matrix A, based on the eigenvector centrality
that counts the contributions of the vertexes to the overall connectivity.
The grouping follows the queue, in which vertexes with a low-rank
order lie in the front. Every vertex is grouped with its direct neighbor.
The contraction of the atoms here is by the center of mass for simplicity.
According to the listed ranks, the carbon atoms 1 and 5 are first merged with their neighbors, the oxygen atoms 2 and 4. Then the oxygen atom 6 is combined with its neighbor, the carbon atom
3. Following this protocol, a toluene molecule is coarse-grained as
shown in Figure 9c.
The original molecule is labeled as TOL0, and it is sequentially coarse-grained
into CG models TOL1, TOL2, and TOL3 that have 4, 2, and 1 beads, respectively.
Validation of the example is shown in Figure 9d, where the RDFs of the CG models at three different levels, equipped with potentials derived from the IBI and force-matching methods, are shown. These results suggest that the graph method gives legitimate CG mappings that capture the essential structural properties, in comparison to the ground-truth atomistic system.
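The spectral-grouping pass on dimethyl carbonate described above can be reproduced with a short NumPy sketch; the single-pass loop and the first-unmerged-neighbor choice are simplifications of the published protocol.

```python
import numpy as np

# Dimethyl carbonate united atoms: C1-O2-C3(-O6)-O4-C5 (hydrogens discarded)
A = np.zeros((6, 6), dtype=int)             # adjacency matrix
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (2, 5)]:
    A[i, j] = A[j, i] = 1

# Eigenvector centrality: the eigenvector of the largest eigenvalue ranks
# vertexes by their contribution to the overall connectivity.
w, v = np.linalg.eigh(A)
centrality = np.abs(v[:, np.argmax(w)])
queue = np.argsort(centrality)              # low-rank vertexes grouped first

groups = {i: {i} for i in range(6)}
merged = set()
for i in queue:
    if int(i) in merged:
        continue
    nbrs = [j for j in np.flatnonzero(A[i]) if int(j) not in merged]
    if nbrs:                                # contract the edge to a neighbor
        j = int(nbrs[0])                    # (real protocol picks by rank order)
        groups[j] |= groups.pop(int(i))
        merged |= {int(i), j}
print(list(groups.values()))                # three CG sites after one pass
```

Running this pass merges the terminal carbons with oxygen atoms 2 and 4 and then oxygen atom 6 with carbon atom 3, reproducing the three-site grouping described above.

It is worth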
noting the difference between GBCG and the intuitive definition of CG sites as centers of mass of the original molecule: the mappings in this method form a sequence that follows the topological connectivity of the chemical structure. Consequently, the GBCG gives a systematic way to derive a CG mapping that retains the chemical essence.
Discussion and Remarks
CG molecular simulations offer a unique opportunity to address
challenging problems, such as phase separation, self-assembly, and
crystallization of organic molecules and polymers. However, classical CG models are limited in thermodynamic transferability and consistency.[5] Effective potential functions for a CG model are usually derived and optimized at one set of thermodynamic conditions. Thus, a CG model derived at one thermodynamic state is not transferable to other conditions, limiting its application to a small range of thermodynamic states. Moreover, during the CG process the free-energy landscape of the system changes dramatically, which leads to thermodynamic inconsistency between AA and CG models. To overcome
these limitations, the ML-based CG models provide an alternative solution,
which can be optimized by using an extensive training data set of
forces and structures from AA molecular simulations. In direct comparison
with classical CG models, we believe that the ML-based CG models could
have the following advantages: (i) thermodynamic consistency, as neural networks can in principle approximate any function to arbitrary accuracy, capturing the roughness and highly nonconvex free-energy landscape of CG molecules; (ii) temperature/pressure transferability, as neural networks have multiple layers of nonlinear functions and many model parameters, which can be optimized with extensive training data from AA simulations at different thermodynamic states; (iii) representability, as neural networks can take two- and multibody molecular descriptors as inputs, representing the many-body interactions that are typically ignored in classical CG models. We should emphasize
that the ML of AA and CG models tends to represent potential-energy
and free-energy surfaces, respectively, by high-dimensional neural
networks. During the ML of the AA model, both energies and forces
are obtained directly from DFT simulations, leading to a fast and
accurate neural network construction for the potential-energy surface.
However, for the ML of CG models, only structures and forces are available from AA molecular simulations, and the free-energy surface is challenging to estimate. Thus, most machine-learned CG models are formulated based
on the force-matching scheme.[12−15] Note that both structure and force distributions
are related to the free-energy landscape of the CG model. Therefore,
the structural distributions (e.g., RDF and ADF) should be used to
train the ML-based CG models in the near future. We also expect that
these ML-based CG models can unify both structure- and force-matching
during the optimization of neural network parameters.
Analogous
to the typical ML of AA models for potential-energy surface
construction, the ML of a CG model includes four major steps: (1)
generating an extensive training data set using AA molecular simulations
at different thermodynamic states; (2) mapping from an AA model to a CG model and extracting target structure or force distributions; (3) transforming the Cartesian coordinates of CG particles into a suitable set of input coordinates (descriptors, features, or symmetry functions)
for the training of neural networks; (4) defining the architecture
of neural networks and formulating a fitting procedure to optimize
the weights and minimize the loss function. Each step can dramatically
affect the accuracy and efficiency of the machine-learned CG models,
in terms of reproducing the correct structural or dynamical properties
of AA molecular systems. In the following, we discuss the efforts
and unsolved issues in each step.
Generation of Training Data Set
ML is usually a highly nonlinear fitting process with an enormous number of parameters to be determined in order to minimize the loss function of a neural network. Theoretically, an ML model can fit an arbitrary system, provided that the training data set is sufficiently representative and diverse and that the computational resources are adequate. In reality, the training data cannot cover all thermodynamic conditions, and in practice we are usually interested only in a system within a specific regime of thermodynamic states. Nevertheless, the training data set within that regime should be representative and diverse, meaning that it needs to cover all the typical behaviors occurring within the regime across the relevant thermodynamic states (e.g., pressure and temperature). An insufficient data set will result in poor thermodynamic transferability and consistency: a CG model trained on data from a specific regime of temperature or pressure cannot predict physical states outside the trained regime (no extrapolation or exploration). In general, the sampling of training data within the regime of interest should be uniform, except for extreme events.
For example, two CG water beads can approach each other closely. If this scenario is not included in the training data set, the trained ML model will produce unstable MD simulations whenever two CG beads come close to each other: frequent extrapolation becomes necessary, which can generate unphysical results. Therefore, the training data set should include enough information about different events, particularly extreme ones. An alternative approach to alleviating a lack of representation or diversity in the training data set is to apply physical constraints to the trained ML model, as in physics-informed ML. For instance, in the CGnet model,[13] an a priori energy form is adopted to regularize the motion of CG particles in the simulations when extrapolation is needed. This ensures correct physical behavior near the boundary of the training data set, as shown in Figure b.
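As an illustration of such a physics-informed constraint, the sketch below adds prior energy terms to a learned free-energy network so that the total energy rises steeply wherever the network must extrapolate. This is a minimal PyTorch sketch inspired by the CGnet idea; the harmonic and repulsive forms and all parameter values are illustrative assumptions, not the published implementation.

```python
import torch
import torch.nn as nn

class PriorRegularizedCG(nn.Module):
    """Free energy U = U_net(features) + U_prior(distances).

    The prior terms dominate outside the training distribution, keeping
    simulations stable where the network would otherwise extrapolate.
    """
    def __init__(self, n_features, k_bond=100.0, r0=1.5, sigma=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, 1),
        )
        self.k_bond, self.r0, self.sigma = k_bond, r0, sigma

    def forward(self, features, bond_r, pair_r):
        # Learned many-body contribution to the free energy.
        u_net = self.net(features).sum(-1)
        # Harmonic prior keeps bonded CG beads near a reference length r0.
        u_bond = 0.5 * self.k_bond * ((bond_r - self.r0) ** 2).sum(-1)
        # Repulsive prior penalizes nonbonded beads that approach closely,
        # covering the "two beads overlap" extreme event discussed above.
        u_rep = ((self.sigma / pair_r) ** 12).sum(-1)
        return u_net + u_bond + u_rep
```

Since the simulation forces are the negative gradient of this total energy, the priors contribute restoring forces precisely in the regions where training data are sparse.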
Mapping from AA to CG Models: Representation
As mentioned before, the CG process from atomistic coordinates to CG sites strongly depends on an understanding of the physical problem and on chemical intuition.[5] Coarse-graining reduces the system's degrees of freedom and loses some important information that is observable only in atomistic simulations. A poor CG representation loses even more important information, which cannot be recovered no matter how efficiently and accurately the CG model performs in the subsequent steps. Typically, for well-known
systems, for example, water molecules and peptides, it is easy to
define the CG site on the oxygen atom and the backbone carbon atom,
respectively. However, for unfamiliar systems, the center of mass
is the most commonly used definition of the CG site, which is straightforward
at first glance. However, the ambiguous nature of this method may jeopardize its usefulness. For example, when coarse-graining a linear flexible polymer chain, a CG site based on the center of mass of several monomers can be an inferior choice, because the thermodynamic fluctuation of the center of mass potentially leads to undesirable variation of the CG site.[5] Therefore, rigorous methods
that give consistent and effective definitions of the mapping from
AA to CG models are particularly important. To address this issue,
the VAE-based ML framework can learn discrete CG variables in the
latent space and decode them back to atomistic detail with high accuracy.[15] In addition, graph-based CG adopts graph theory to describe the chemical connectivity of an organic molecule.[16] It maps the ground-truth system of the molecule to a coarser description through the basic graph operation of edge contraction, which provides a promising and efficient route for mapping from AA to CG models.
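For the commonly used linear mapping, each CG coordinate is a mass-weighted average of its constituent atomistic coordinates. Below is a minimal NumPy sketch; the atom-to-bead assignment shown is the dimethyl carbonate grouping discussed above, used purely for illustration.

```python
import numpy as np

def com_mapping(x, masses, bead_of_atom, n_beads):
    """Map atomistic coordinates onto CG sites by center of mass.

    x            : (n_atoms, 3) atomistic coordinates
    masses       : (n_atoms,) atomic masses
    bead_of_atom : (n_atoms,) index of the CG bead each atom belongs to
    """
    M = np.zeros(n_beads)
    R = np.zeros((n_beads, 3))
    np.add.at(M, bead_of_atom, masses)               # accumulate bead masses
    np.add.at(R, bead_of_atom, masses[:, None] * x)  # mass-weighted sums
    return R / M[:, None]

# Example: the six united atoms of dimethyl carbonate mapped onto three
# CG beads, following the grouping {1,2}, {3,6}, {4,5} described above
# (zero-indexed here).
bead_of_atom = np.array([0, 0, 1, 2, 2, 1])
```

The same operation can be written as a sparse linear operator acting on the atomistic coordinates, a form commonly assumed in bottom-up coarse-graining theory.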
Inputs of ML Models: Descriptors, Features, or Symmetry Functions
The potential energy of a CG model should depend only on the internal interactions and be invariant with respect to translation, rotation, or permutation of the entire molecules.[8] ML is a numerical fitting method, and the output of an ML model depends on the absolute values of its inputs. Thus, the Cartesian coordinates of the CG particles are not suitable as inputs for ML models. Accordingly, molecular descriptors, features, or symmetry functions should be used to train these ML models.[23] The most straightforward input for an ML-based CG model is the distance between two CG particles. However, this choice is not unique and cannot capture many-body interactions. Currently, there are different transformations of the Cartesian coordinates of CG particles into inputs of ML models, such as local coordinate systems,[12] two- or three-body correlation functions,[14] permutation-invariant distance metrics,[24] and VAE-based encoder-decoders.[15] Apart from the energy invariance, the symmetry functions
for ML models should be first- and second-order continuous, so that the force can be derived as the gradient of the CG potential with respect to the coordinates. Furthermore, the symmetry functions should be broad enough to capture the interactions between two nearby CG particles and to ensure that the energy decays to zero as the distance between two CG particles approaches the cutoff. Therefore, symmetry functions that are too simple or too limited can lower the accuracy of machine-learned CG models.[8]
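A standard example satisfying these requirements is a Behler–Parrinello-type radial symmetry function. The sketch below uses the tanh³ cutoff, whose value and first two derivatives all vanish at the cutoff radius; the specific parameter choices are illustrative.

```python
import numpy as np

def cutoff(r, r_c):
    """tanh^3 cutoff: value, first, and second derivatives all vanish
    smoothly at r = r_c, meeting the continuity requirement above."""
    return np.where(r < r_c, np.tanh(1.0 - r / r_c) ** 3, 0.0)

def radial_symmetry_function(r_ij, eta, r_s, r_c):
    """G = sum_j exp(-eta * (r_ij - r_s)^2) * f_c(r_ij).

    Invariant to translation, rotation, and permutation of neighbors.
    r_ij : distances from one CG particle to its neighbors
    eta  : Gaussian width; r_s : Gaussian center; r_c : cutoff radius
    """
    return np.sum(np.exp(-eta * (r_ij - r_s) ** 2) * cutoff(r_ij, r_c))
```

A set of (eta, r_s) pairs probes different radial shells, turning raw coordinates into a fixed-length, invariant fingerprint of each particle's local environment.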
Optimization of Weights in Neural Networks
In the development of a machine-learned CG model, the optimization of parameters is always the most time-consuming part. For neural networks, a variety of gradient-based optimization algorithms are available, from simple gradient descent (back-propagation) and variants like RPROP (resilient back-propagation) or Adam to higher-order methods, such as L-BFGS and the extended Kalman filter (EKF).[25] The choice of optimization algorithm depends on the specific problem. For example, EKF performs very well in training networks with one or two hidden layers of fewer than 100 neurons each, but poorly in training networks with four or more hidden layers. In addition, considering the available HPC resources, we need to choose an optimization algorithm that can train on a large data set efficiently.
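As a concrete example of such gradient-based optimization, the following minimal PyTorch force-matching loop uses Adam. It assumes a model that maps CG coordinates to a scalar free energy (computing invariant features internally, as discussed above); full-batch training and the hyperparameters are simplifying assumptions.

```python
import torch

def train_force_matching(model, coords, aa_forces, n_epochs=100, lr=1e-3):
    """Optimize network weights so that -dU/dR matches the mapped AA forces.

    coords    : (n_frames, n_beads, 3) CG coordinates from mapped AA frames
    aa_forces : (n_frames, n_beads, 3) AA forces mapped onto the CG beads
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(n_epochs):
        optimizer.zero_grad()
        x = coords.clone().requires_grad_(True)
        energy = model(x).sum()
        # The CG force is the negative gradient of the free energy;
        # create_graph=True lets the loss backpropagate through it.
        cg_forces = -torch.autograd.grad(energy, x, create_graph=True)[0]
        loss = torch.mean((cg_forces - aa_forces) ** 2)
        loss.backward()
        optimizer.step()
    return model
```

Swapping `torch.optim.Adam` for `torch.optim.LBFGS` (with the closure interface it requires) changes only a few lines, which makes it straightforward to compare optimizers on the same problem.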
Future Opportunities
So far, machine-learned CG models have typically concentrated on only one or two of the aspects discussed above to improve accuracy, which has limited the applications of these models. For example, the kernel-based and graph-based CG models focus on the representation aspect and provide different protocols to retain as much of the information contained in the molecular structure as possible, while their optimization remains limited to the force-matching scheme. This can lead to poor reproduction of structural properties, such as the RDF and ADF. A potential way to resolve this issue is to add another loss term (target), such as the RDF, when optimizing the CG models. Given the expressiveness of high-dimensional neural networks, both structural and force distributions could be reproduced simultaneously by machine-learned CG models.
a wide range of temperature states.[9] Meanwhile,
pressure-transferability remains to be studied. If we include more
train data sets at different pressures, the ML-BOP model may realize
pressure-transferability and, then, can predict the nucleation process
of the water molecules more accurately. For the choice of symmetry
functions, in general, it depends on the experience of trial-and-error.
On one side, the symmetry functions should be more than enough to
capture all the typical interactions among CG particles. On the other
side, symmetry functions should be small enough to have good performance
in terms of computational efficiency. Therefore, we need to pay close
attention to how to choose appropriate symmetry functions. In conclusion,
we are still facing enormous challenges and opportunities for constructing
an accurate and efficient machine-learned CG model to be thermodynamically
consistent, transferable, and representative.