| Literature DB >> 28989618 |
Qian Yang1, Carlos A Sing-Long2, Evan J Reed1,3.
Abstract
We propose a novel statistical learning framework for automatically and efficiently building reduced kinetic Monte Carlo (KMC) models of large-scale elementary reaction networks from data generated by a single or few molecular dynamics simulations (MD). Existing approaches for identifying species and reactions from molecular dynamics typically use bond length and duration criteria, where bond duration is a fixed parameter motivated by an understanding of bond vibrational frequencies. In contrast, we show that for highly reactive systems, bond duration should be a model parameter that is chosen to maximize the predictive power of the resulting statistical model. We demonstrate our method on a high temperature, high pressure system of reacting liquid methane, and show that the learned KMC model is able to extrapolate more than an order of magnitude in time for key molecules. Additionally, our KMC model of elementary reactions enables us to isolate the most important set of reactions governing the behavior of key molecules found in the MD simulation. We develop a new data-driven algorithm to reduce the chemical reaction network which can be solved either as an integer program or efficiently using L1 regularization, and compare our results with simple count-based reduction. For our liquid methane system, we discover that rare reactions do not play a significant role in the system, and find that less than 7% of the approximately 2000 reactions observed from molecular dynamics are necessary to reproduce the molecular concentration over time of methane. The framework described in this work paves the way towards a genomic approach to studying complex chemical systems, where expensive MD simulation data can be reused to contribute to an increasingly large and accurate genome of elementary reactions and rates.Entities:
Year: 2017 PMID: 28989618 PMCID: PMC5625287 DOI: 10.1039/c7sc01052d
Source DB: PubMed Journal: Chem Sci ISSN: 2041-6520 Impact factor: 9.825
Fig. 1Schematic of our overall approach. A given chemical system is described by its potential energy surface. We first sample from this surface via molecular dynamics simulations. We then use parameter estimation to derive a stochastic model consisting of elementary reactions and corresponding reaction rates from the molecular dynamics data. This model is called the full stochastic model because it includes all reactions observed from the molecular dynamics simulation. We apply one of several model reduction techniques to reduce the full stochastic model by eliminating reactions. Finally, we compare the dynamics of the reduced stochastic model to the molecular concentration trajectories observed in the molecular dynamics simulations.
Fig. 2Six independent molecular dynamics simulations of the same system under the same thermodynamic conditions, resulting in somewhat different molecular concentration trajectories due to different initial velocities. Each colored line corresponds to a projected molecular concentration trajectory derived from a single molecular dynamics simulation. The dotted black line corresponds to the mean of these trajectories. We see that although the general trends are the same across simulations, the number of molecules at any given time can fluctuate between simulations. The average difference (in the root mean square sense) between each molecular concentration trajectory and the mean trajectory is about 8.0, 5.9, and 3.3 molecules for CH4, H2, and C2H6, respectively.
Fig. 3Schematic of how bond duration τ is used to smooth out the signal of whether or not atoms are bonded. Two atoms are not considered bonded unless the bond has endured for τ timesteps. Two atoms that were previously bonded are not considered unbonded unless the bond has been broken for τ timesteps. Note that as τ increases, fewer events are detected.
Fig. 4As the chosen bond duration τ increases, there are fewer of all types of reactions, thus decreasing the overall complexity of the system. For small τ, it is likely that many of the observed reactions actually correspond to atomic vibrations. Note that in our high pressure, high temperature system, some trimolecular and more complex reactions may reasonably be considered elementary.
Fig. 5Root mean square error between molecular dynamics simulations and the corresponding stochastic model as a function of the bond duration criterion τ for the three highest concentration molecules. The error is computed according to eqn (6) and averaged over all six molecular dynamics simulations. For smaller τ, there is likely to be error from atomic vibrations. For larger τ, the reactions identified are unlikely to be elementary. The dotted line indicates the optimal choice of τ we use for the remainder of our study.
Fig. 6Two examples of Gillespie stochastic simulations of the chemical system compared to the molecular dynamics simulations they were trained from, when all of the molecular dynamics data are used. We see that both achieve reasonable agreement, especially in comparison to fluctuations between MD simulations as shown in Fig. 2. Differences in accuracy of the Gillespie simulations compared to the molecular dynamics are due to a variety of factors, including approximations made by the model as well as the stochasticity involved in a single simulation.
Fig. 7Time extrapolation: we use only the first 12, 25, 50, or 100 ps of available MD data to build a KMC model, then test model predictions at longer timescales. The colored lines show the molecular concentrations of CH4 (left panel) and H2 (right panel) averaged over 20 Gillespie simulations of the corresponding models. The black line corresponds to the true molecular concentrations from MD. While 12 ps of data is not enough to sufficiently model either molecule, 25 ps of data is enough to model CH4 for 200 ps and H2 for 500 ps.
Fig. 8Time extrapolation of carbon-containing clusters: since larger molecules with more than 5 carbon atoms do not appear in the system until after about 50 ps, we find that training data from times out to 150 ps is needed to capture their presence and growth in the chemical system over 550 ps. All colored lines showing molecular concentrations of carbon clusters are averaged over 10 Gillespie simulations.
Fig. 9Schematic of L1 regularization. We try to find the best approximate solution to a x = b that satisfies the L1 constraint. With high probability, this leads to a solution on a vertex of the L1 ball, which has sparsity (zero entries) in the solution.
Fig. 10Training on the first 200 ps of molecular dynamics data, we use both the count-based and LASSO methods to reduce the chemical reaction network and compare how predictive the reduced models are when extrapolated to 500 ps of simulation time. Errors were computed using eqn (6) over S = 10 Gillespie simulations. The full KMC model observed in 200 ps of molecular dynamics data contains 629 reactions. The minimum overall error for CH4, H2, and C2H6 is achieved by a reduced model consisting of about 100 reactions using the count-based method and between 300 and 450 reactions using the LASSO method. Comparing with Fig. 11, however, we see that the 450 reaction network obtained by LASSO is able to capture carbon cluster growth, whereas the 100 reaction network obtained by the count-based method does not result in any carbon clusters.
Fig. 11We train reduced KMC models with the first 200 ps of molecular dynamics data, and show the final number of molecules containing more than 5 carbons after 500 ps of simulation. The blue points correspond to reduced models built using the count-based method, averaged over 10 Gillespie simulations. The red line corresponds to reduced models built using the LASSO method, similarly averaged over 10 simulations. The dotted yellow line corresponds to the full KMC model averaged over 10 simulations, which is a good approximation for the MD data. Due to the lack of granularity in the count-based method, only the full model of 629 reactions is able to reasonably simulate carbon cluster growth. The next largest reduced model contains only 247 reactions, and overshoots the carbon cluster concentration. Using the LASSO method, we see that about 450 reactions are needed to reasonably model carbon cluster growth. We note that this was also among the best models for the largest concentration molecules in Fig. 10.
Fig. 12Comparison of the molecular concentration trajectories of CH4, H2, and C2H6 observed from a molecular dynamics simulation and a set of corresponding reduced stochastic models derived using the count-based method (top row) and the LASSO method (bottom row). Note that the reduced stochastic models are plotted using the mean concentration trajectories over 50 Gillespie simulations (and thus appear to have less fluctuations than the single MD simulation). The full stochastic model (100% reacts) contains 2000 reactions.
Fig. 13We compare the results of three model reduction methods: removing infrequent reactions (freq), solving the integer programming problem (iqp), and solving the constrained lasso problem (lasso). For each method, we adjust a single parameter λ over a range of values to obtain a series of increasingly small sets of reactions and corresponding reaction rates. We then simulate each reduced order model using Gillespie stochastic simulation. We compare the mean molecular concentration trajectories of CH4, H2, and C2H6 obtained over 30 Gillespie simulations with reference trajectories. In the top row, the reference trajectories are single molecular dynamics simulations. In the bottom row, the reference trajectories are the mean molecular concentration trajectories obtained from 30 Gillespie simulations over the full stochastic model. This process is repeated for all 30 pairs of training and test molecular dynamics simulations. All results above are averaged over these 30 pairs, and the error bars represent standard deviations. To provide a frame of reference for the amount of unavoidable error due to fluctuations in any single molecular dynamics simulation, the green dotted line in the top row represents the average deviation from the mean molecular dynamics trajectory among the 6 MD datasets (see Fig. 2). We can see that only around 100 reactions are needed to achieve approximately the same amount of error in CH4 compared to MD as the full model of almost 2000 reactions.
Fig. 14The reactions chosen by different reduced models may differ depending on the full stochastic model they were derived from, which are each constructed from a particular molecular dynamics simulation. For reduced models of different sizes, the plot shows the fraction of reactions that are common to all 6 models derived using each model reduction method from the 6 different full stochastic models corresponding to independent MD simulations. The purple line finds the percentage of reactions common to all models across all sets of data.
Fig. 15Comparison of the reactions selected by each of the three model reduction methods for a molecular dynamics simulation with 1848 total observed reactions. The total number of reactions in the reduced model given by the count-based method (COUNT) is 267; the reduced models given by the integer program (IQP) and its convex relaxation (LASSO) have 250 and 267 reactions, respectively. While it is difficult to compare reduced models due to differences in the number of reactions selected, we have attempted to choose similarly sized models here; the IQP and LASSO models were derived using the same λ and subsequently the most similarly sized RARE model was chosen. We can see that roughly half of the 250 to 267 reactions in each model are chosen by all three methods. As expected, the IQP and LASSO models are much more similar to each other, with the 250 reactions chosen by IQP a proper subset of the 267 reactions chosen by LASSO.
Important reactions for methane CH4. Reactions are numbered by the order in which they appear in the molecular dynamics simulation. The list of 63 reactions is computed from the count-based method for model reduction trained over a single molecular dynamics simulation. All of the selected reactions are observed within the first 150 ps of the molecular dynamics simulation. This reduced network exhibits an average error of 8 molecules when tested on the other 5 molecular dynamics simulations
| No. | Reaction |
| 2 | C1 H3 3(H–C) + H1 ⇒ C1 H4 4(H–C) |
| 4 | C1 H4 4(H–C) ⇒ C1 H3 3(H–C) + H1 |
| 5 | C1 H4 4(H–C) + H1 ⇒ C1 H5 4(H–C) 1(H–H) |
| 6 | C1 H5 4(H–C) 1(H–H) ⇒ C1 H3 3(H–C) + H2 1(H–H) |
| 8 | C1 H3 3(H–C) + C1 H4 4(H–C) ⇒ C2 H7 8(H–C) |
| 9 | C2 H7 8(H–C) ⇒ C1 H3 3(H–C) + C1 H4 4(H–C) |
| 11 | C2 H6 1(C–C) 6(H–C) + H1 ⇒ C2 H7 1(C–C) 7(H–C) |
| 12 | C2 H7 1(C–C) 7(H–C) ⇒ C2 H6 1(C–C) 6(H–C) + H1 |
| 13 | C2 H6 1(C–C) 6(H–C) ⇒ C2 H5 1(C–C) 5(H–C) + H1 |
| 14 | H1 + H1 ⇒ H2 1(H–H) |
| 15 | C1 H4 4(H–C) + C2 H5 1(C–C) 5(H–C) ⇒ C1 H3 3(H–C) + C2 H6 1(C–C) 6(H–C) |
| 17 | C1 H3 3(H–C) + C1 H4 4(H–C) ⇒ C2 H7 1(C–C) 7(H–C) |
| 18 | C2 H7 1(C–C) 7(H–C) ⇒ C1 H3 3(H–C) + C1 H4 4(H–C) |
| 21 | C1 H4 4(H–C) + H1 ⇒ C1 H3 3(H–C) + H2 1(H–H) |
| 22 | C1 H3 3(H–C) + C1 H3 3(H–C) ⇒ C2 H6 1(C–C) 6(H–C) |
| 27 | C1 H3 3(H–C) + H2 1(H–H) ⇒ C1 H4 4(H–C) + H1 |
| 28 | C1 H3 3(H–C) + H2 1(H–H) ⇒ C1 H5 4(H–C) 1(H–H) |
| 29 | C1 H5 4(H–C) 1(H–H) ⇒ C1 H4 4(H–C) + H1 |
| 35 | C1 H3 3(H–C) + C2 H6 1(C–C) 6(H–C) ⇒ C3 H9 1(C–C) 10(H–C) |
| 36 | C3 H9 1(C–C) 10(H–C) ⇒ C1 H4 4(H–C) + C2 H5 1(C–C) 5(H–C) |
| 37 | C1 H3 3(H–C) + C2 H6 1(C–C) 6(H–C) ⇒ C3 H9 2(C–C) 9(H–C) |
| 38 | C3 H9 2(C–C) 9(H–C) ⇒ C1 H3 3(H–C) + C2 H6 1(C–C) 6(H–C) |
| 42 | C1 H3 3(H–C) + C2 H6 1(C–C) 6(H–C) ⇒ C1 H4 4(H–C) + C2 H5 1(C–C) 5(H–C) |
| 43 | C2 H5 1(C–C) 5(H–C) + H1 ⇒ C2 H6 1(C–C) 6(H–C) |
| 45 | C1 H4 4(H–C) + H1 ⇒ C1 H5 5(H–C) |
| 46 | C1 H5 5(H–C) ⇒ C1 H4 4(H–C) + H1 |
| 47 | C1 H4 4(H–C) + C2 H5 1(C–C) 5(H–C) ⇒ C3 H9 1(C–C) 10(H–C) |
| 48 | C3 H9 1(C–C) 10(H–C) ⇒ C1 H3 3(H–C) + C2 H6 1(C–C) 6(H–C) |
| 49 | C2 H5 1(C–C) 5(H–C) ⇒ C2 H5 1(C–C) 6(H–C) |
| 50 | C2 H5 1(C–C) 6(H–C) ⇒ C2 H5 1(C–C) 5(H–C) |
| 51 | C2 H6 1(C–C) 6(H–C) + H1 ⇒ C2 H7 1(C–C) 6(H–C) 1(H–H) |
| 52 | C2 H7 1(C–C) 6(H–C) 1(H–H) ⇒ C2 H5 1(C–C) 5(H–C) + H2 1(H–H) |
| 53 | C2 H5 1(C–C) 5(H–C) ⇒ C2 H4 1(C–C) 4(H–C) + H1 |
| 58 | H2 1(H–H) ⇒ H1 + H1 |
| 60 | C2 H4 1(C–C) 4(H–C) + H1 ⇒ C2 H5 1(C–C) 5(H–C) |
| 61 | C2 H6 1(C–C) 6(H–C) ⇒ C1 H3 3(H–C) + C1 H3 3(H–C) |
| 62 | C3 H9 2(C–C) 9(H–C) ⇒ C3 H8 2(C–C) 8(H–C) + H1 |
| 64 | C2 H6 1(C–C) 6(H–C) + H1 ⇒ C2 H5 1(C–C) 5(H–C) + H2 1(H–H) |
| 65 | C2 H5 1(C–C) 5(H–C) + H2 1(H–H) ⇒ C2 H7 1(C–C) 6(H–C) 1(H–H) |
| 67 | C2 H7 1(C–C) 6(H–C) 1(H–H) ⇒ C2 H6 1(C–C) 6(H–C) + H1 |
| 69 | C1 H3 3(H–C) + C2 H5 1(C–C) 5(H–C) ⇒ C3 H8 2(C–C) 8(H–C) |
| 71 | C1 H3 3(H–C) + C3 H8 2(C–C) 8(H–C) ⇒ C4 H11 2(C–C) 12(H–C) |
| 72 | C4 H11 2(C–C) 12(H–C) ⇒ C1 H4 4(H–C) + C3 H7 2(C–C) 7(H–C) |
| 73 | C4 H11 2(C–C) 12(H–C) ⇒ C1 H3 3(H–C) + C3 H8 2(C–C) 8(H–C) |
| 75 | C1 H4 4(H–C) + C3 H7 2(C–C) 7(H–C) ⇒ C1 H3 3(H–C) + C3 H8 2(C–C) 8(H–C) |
| 86 | C2 H5 1(C–C) 5(H–C) + C2 H6 1(C–C) 6(H–C) ⇒ C4 H11 2(C–C) 12(H–C) |
| 87 | C4 H11 2(C–C) 12(H–C) ⇒ C2 H5 1(C–C) 5(H–C) + C2 H6 1(C–C) 6(H–C) |
| 88 | C3 H8 2(C–C) 8(H–C) ⇒ C3 H7 2(C–C) 7(H–C) + H1 |
| 89 | C3 H7 2(C–C) 7(H–C) ⇒ C3 H7 2(C–C) 8(H–C) |
| 90 | C3 H7 2(C–C) 8(H–C) ⇒ C3 H7 2(C–C) 7(H–C) |
| 92 | C1 H3 3(H–C) + C3 H8 2(C–C) 8(H–C) ⇒ C1 H4 4(H–C) + C3 H7 2(C–C) 7(H–C) |
| 96 | C1 H4 4(H–C) + C3 H7 2(C–C) 7(H–C) ⇒ C4 H11 2(C–C) 12(H–C) |
| 118 | C1 H3 3(H–C) + C4 H10 3(C–C) 10(H–C) ⇒ C5 H13 3(C–C) 14(H–C) |
| 119 | C5 H13 3(C–C) 14(H–C) ⇒ C1 H4 4(H–C) + C4 H9 3(C–C) 9(H–C) |
| 122 | C4 H9 3(C–C) 9(H–C) ⇒ C4 H9 3(C–C) 10(H–C) |
| 123 | C4 H9 3(C–C) 10(H–C) ⇒ C4 H9 3(C–C) 9(H–C) |
| 126 | C1 H4 4(H–C) + C4 H9 3(C–C) 9(H–C) ⇒ C1 H3 3(H–C) + C4 H10 3(C–C) 10(H–C) |
| 172 | C5 H11 4(C–C) 11(H–C) ⇒ C5 H11 4(C–C) 12(H–C) |
| 173 | C5 H11 4(C–C) 12(H–C) ⇒ C5 H11 4(C–C) 11(H–C) |
| 178 | C2 H5 1(C–C) 5(H–C) + H2 1(H–H) ⇒ C2 H6 1(C–C) 6(H–C) + H1 |
| 254 | C4 H10 3(C–C) 10(H–C) ⇒ C4 H9 3(C–C) 9(H–C) + H1 |
| 255 | C1 H4 4(H–C) + C4 H9 3(C–C) 9(H–C) ⇒ C5 H13 3(C–C) 14(H–C) |
| 256 | C5 H13 3(C–C) 14(H–C) ⇒ C1 H3 3(H–C) + C4 H10 3(C–C) 10(H–C) |