The application of machine learning to theoretical chemistry has made it possible to combine the accuracy of quantum chemical energetics with the thorough sampling of finite-temperature fluctuations. To reach this goal, a diverse set of methods has been proposed, ranging from simple linear models to kernel regression and highly nonlinear neural networks. Here we apply two widely different approaches to the same, challenging problem: the sampling of the conformational landscape of polypeptides at finite temperature. We develop a local kernel regression (LKR) coupled with a supervised sparsity method and compare it with a more established approach based on Behler-Parrinello type neural networks. In the context of the LKR, we discuss how the supervised selection of the reference pool of environments is crucial to achieve accurate potential energy surfaces at a competitive computational cost and leverage the locality of the model to infer which chemical environments are poorly described by the DFTB baseline. We then discuss the relative merits of the two frameworks and perform Hamiltonian-reservoir replica-exchange Monte Carlo sampling and metadynamics simulations, respectively, to demonstrate that both frameworks can achieve converged and transferable sampling of the conformational landscape of complex and flexible biomolecules with comparable accuracy and computational cost.
The application of machine learning to theoretical chemistry has made it possible to combine the accuracy of quantum chemical energetics with the thorough sampling of finite-temperature fluctuations. To reach this goal, a diverse set of methods has been proposed, ranging from simple linear models to kernel regression and highly nonlinear neural networks. Here we apply two widely different approaches to the same, challenging problem: the sampling of the conformational landscape of polypeptides at finite temperature. We develop a local kernel regression (LKR) coupled with a supervised sparsity method and compare it with a more established approach based on Behler-Parrinello type neural networks. In the context of the LKR, we discuss how the supervised selection of the reference pool of environments is crucial to achieve accurate potential energy surfaces at a competitive computational cost and leverage the locality of the model to infer which chemical environments are poorly described by the DFTB baseline. We then discuss the relative merits of the two frameworks and perform Hamiltonian-reservoir replica-exchange Monte Carlo sampling and metadynamics simulations, respectively, to demonstrate that both frameworks can achieve converged and transferable sampling of the conformational landscape of complex and flexible biomolecules with comparable accuracy and computational cost.
Machine
learning (ML) techniques have begun to supplement atomistic
simulations by facilitating access to the potential energy surfaces
(PES) with outstanding accuracy at a greatly reduced computational
cost.[1−4] Behler and Parrinello’s seminal work introduced one of the
first condensed-phase potentials based on a neural network (NN). Using
atom-centered symmetry functions to encode the molecular structures[1,5] and expressing the corresponding potential energy as a sum of the
atomic contributions makes the potentials transferable and scalable.
In recent years, several NN architectures for atomic-based potentials
have been proposed, including SchNet[6−9] and PhysNet,[10,11] which predict
energies, forces, and other properties (e.g., dipole moments or chemical
potentials) of various chemical systems. Roitberg and co-workers also
introduced the ANI-1[12] model, where single-atom
atomic environment vectors (AEVs) are used to build deep NN potentials
to approach the golden standard of CCSD(T)/CBS for reaction thermochemistry,
isomerization, and drug-like molecular torsions.[13] Despite their widespread use, NNs have drawbacks: lack
of interpretability, the nondeterministic and computationally demanding
training, and the large amounts of training data required are some
of them.As an alternative to artificial NNs, kernel-based approaches
such
as kernel ridge regression (KRR) and gaussian process regression (GPR)
overcome some of these limitations.[14] Kernel
methods build a map between a target system and its properties by
evaluating a similarity measure between the target and a set of known
reference points. Gaussian approximation potentials (GAPs)[15,16] pioneered the use of kernels in molecular dynamic simulations and
demonstrated that they can achieve results equivalent to NNs. Since
then, they have been used to model bulk materials ranging from simple
silicon,[17−21] to ternary Ge2Sb2Te5.[22,23] In the wake of GAPs, numerous alternative kernel-based and linear
methods have been proposed to predict PESs for atomistic simulations,
including support vector machines (SVM),[24] the spectral neighbor analysis potentials (SNAPs),[25] general reproducing kernels models,[26,27] and coarse-grained approaches.[28] More
recently, the symmetrized gradient-domain (sGDML) model has proven
to yield nearly exact molecular dynamics simulations for small molecules
based on coupled-cluster energies and forces.[2,29,30] However, despite the increasing number of
kernel-based ML potentials, artificial NNs remain dominant for driving
atomistic simulations.[5,31−39]When paired with global molecular representations (e.g., Coulomb
matrix,[40] bag of bonds (BoB),[41] or the Spectral London Axilrod–Teller-Muto[42] (SLATM)), which encode the key physical information
about the structure and composition of molecules as whole indivisible
entities, kernel models are often lightweight, making them ideal for
predicting molecular properties.[17,43−46] However, predictions made with these global representations are
expected to be accurate only for molecules of similar size and composition
with respect to those in the training set. This constraint limit severely
the exploration and extrapolation to larger chemical and conformational
spaces. Local representations (e.g., FCHL,[47,48] aSLATM,[42] and SOAP[49]), which describe molecules as a collection of atoms within
their local environments, provide a greater transferability[50,51] but also significantly increase the computational cost of kernel-based
methods, as the similarity between molecules is then computed as a
function of the pairwise similarity between atoms.[52] To restore the data-efficiency typical of kernel-based
methods and efficiently exploit the local representations, one can
resort to sparse regression techniques. The simplest form amounts
to sampling the entire set of atom-centered environments and retaining
only the (a priori) most informative environments,
assuming that substantial redundancy arises from recurring environments
across training structures. The criteria for selection tend to be
based on techniques such as farthest point sampling (FPS) or CUR matrix
decomposition[53] that maximize the dissimilarity
of the selected environments. While the environments sampled with
FPS- or CUR-based methods represent the most varied set among the
training instances, they are not necessarily the best for regressing
the property of interest,[54] as the dissimilarity
in the representation space does not necessarily correlate with dissimilarity
in the property space.[55] (Δ-)56ML approaches represent a typical illustration of this issue,
as the vast majority of chemical environments are well described by
an approximated baseline model while the error is concentrated in
localized areas of the feature space. This is particularly true when
predicting PESs, where capturing the conformational changes (e.g.,
torsion of a single dihedral angle) is as crucial as capturing the
dependence on chemical diversity.In this work, our goal is
to address the limitations of traditional
unsupervised sparsification techniques and leverage the data-efficiency
and transferability of local kernel models, by combining a local kernel
regression (LKR) framework with a flexible orthogonal matching pursuit
(OMP) algorithm. The efficiency of the resulting model is demonstrated
by learning the PES of oligopeptides using a set of 52,000 conformations
of dipeptides comprised of 26 amino acids. In this context, the OMP
controls the sparsification process and selects (among tens of thousands
of atom-centered environments present in the training set) the best
possible reference pool for predicting the PES of any dipeptide. To
increase the smoothness of the target energies, the model is baselined
with density functional tight-binding (DFTB[57,58]) using a Δ-ML approach,[56] with
the model improving the description of the PES in regions that are
traditionally not accurately captured with the semiempirical baseline
method (e.g., hydrogen atoms and polarized bonds). To further illustrate
the transferability of LKR, we compare its performance with a state-of-the-art
Behler-Parrinello type neural network, both on the dipeptide set and
in an extrapolation test based on the Phe-Gly-Phe tripeptide. The
two ML models are then used to drive enhanced sampling simulations
to describe the free energy landscape of the tripeptide with DFT accuracy.
Methods
Machine
Learning Models
The ML potentials presented
in this work correct a semiempirical baseline obtained from density
functional tight-binding (DFTB) with the D3(BJ)[59] dispersion correction (shortened DFTB hereafter) to reproduce
the target PBE[60]-dDsC[61−63] (shortened
PBE hereafter) for DFT energetics. For each molecule in the data set,
the property learned within the Δ-ML framework corresponds to
the difference between the atomization energy evaluated at DFTB and
PBE. For both levels, the atomization energies are computed using
a two-step procedure. First, the contribution of each atom type to
the total energy is evaluated by a multilinear regression (MLR) on
the full data set (dressed-atom energies). Then, the difference between
the computed total energy and the sum of the dressed-atom energies
yields the atomization energy used herein. The following sections
describe the two types of complementary ML architectures exploited
in this work.
ML Model 1: Sparse Local Kernel Regression
The LKR
inputs are the target molecular properties and the atomic representations
of the corresponding molecular structures. In this case, we used the
atomic spectral London Axilrod–Teller–Muto[42] representation (aSLATM) (see step 1 in Figure , upper panel), but
other local atomic representations could be used. As it is standard
procedure for local kernel-based atomistic models, LKR uses a selected
pool of reference atomic environments taken from the training structures
as the regression basis for predicting the target property. The structures
available for the training are projected onto the pool of atomic environments
using a Gaussian kernel to create the matrix S, effectively
generating a new vectorial representation of the molecules (see step
2 in Figure , upper
panel). By assuming a linear relationship between the features of S and the global molecular properties, LKR allows one to obtain
the regression coefficients for each reference atomic environment
without requiring an a priori decomposition of the
target property, which is sometimes possible[64] but highly nontrivial for complex PES like the ones discussed here.
If the pool of atomic environments is too large, prefiltering, which
reduces the redundancy of the pool, is needed. Here, we use FPS,[53] which selects the most distinct environments
in terms of their Euclidean distances.
Figure 1
Workflow and schematic
depiction of the LKR model.
Workflow and schematic
depiction of the LKR model.For the final selection of the reference environments, the reduction
of the training environments is commonly performed by constructing
multiple models including a variable number of the FPS points, which
is gradually increased until achieving a satisfying accuracy. It was
already hypothesized[65] that some sort of
supervision in the sparsification procedure would be desirable. Here,
we rely on a supervised sparse regression model called orthogonal
matching pursuit (OMP).[66] OMP is a greedy
optimization algorithm that finds the best sparse choice of reference
environments for a particular application (see step 3 in Figure , upper panel). The
OMP algorithm searches greedily through the whole pool of atom-centered
environments and selects at each time the specific environment that
reduces the prediction error the most (i.e., the one with the highest
inner product with the targeted property). At each iteration, the
contributions from the previously selected environments to the global
target property are subtracted and the search continues for the best
match of the residual until convergence. With this procedure, OMP
automatically identifies the most suitable, property-specific environment
subset (i.e., best-matching basis) for the regression of the targeted
molecular property in one shot. In the prediction step (see step 4
in Figure , upper
panel), the similarity of each new atomic environment with respect
to the reference pool is evaluated by computing a kernel sum with
all the selected environments. The reader is referred to Figure for a schematic
depiction of the workflow and to the Supporting Information for a more detailed description of the model and
procedure.Overall, LKR-OMP combines the scalability and transferability
of
NNs, with the faster training and stability of kernel-based models.
The addition and removal of training data also require minimal computational
effort, as opposed to an NN, for which the procedure requires at best
a partial retraining. This would be especially beneficial for active
learning approaches,[67] when the training
data evolves throughout the process. The counterpart is that the cost
of the model scales linearly with the number of reference environments,
while the cost of NNs is fixed by the architecture.
ML Model
2: Behler-Parrinello Neural Networks
To benchmark
the LKR model against an established NN architecture, we further construct
a Behler-Parrinello artificial NN.[1] For
each atom, we describe the positions of all neighboring atoms inside
a cutoff radius (its “atomic environment”) by a set
of atom-centered many-body symmetry functions (SF)[68] (see the Computational Details).To allow for on-the-fly estimation of the uncertainties
in the predictions, a committee of four Behler-Parinello NNs,[1,68] which only differ in the random initialization of the NN weights
and the internal cross-validation splitting of the training data,
was trained to reproduce the differences between the DFTB baseline
and the target DFT energies and forces. This permits estimating the
uncertainty associated with each committee prediction of the Δ-ML
correction.[69] The uncertainty estimates
were also used to modulate the application of the NN correction, using
the weighted baseline scheme proposed by Imbalzano and co-workers.[70] This procedure minimizes the uncertainty in
the total potential and ensures that it falls back to the baseline
whenever the Δ-ML correction enters the extrapolative regime,
thereby stabilizing the simulation. The total energy is calculated
as the sum of the outputs of atomic NNs, and analytic gradients and
thus forces are readily available. To train the NN models both energies
and forces were used.
Training Data
The training set for
the construction
of the models described in the previous sections was built by selecting
configurations from the 300 K replica of a DFTB-based temperature
replica exchange (T-RE) simulation (with replicas at temperatures
between 300 K and 1000 K) for each amino acid dipeptide. The most
distinct 2000 configurations of each dipeptide were selected by means
of FPS, using the Ramachandran plot[71] coordinates
as the independent variables.For a total of 26 amino acid dipeptides,[72] we obtained a pool of 52,000 conformations.
Finally, to include the effects of side chain-side chain interactions
into the model, the training set was enriched with an additional set
of 3378 optimized peptide dimers from the BioFragment Database.[73] Single point computations were performed to
obtain energy and forces at the target and baseline levels.
Enhanced Sampling Methods for the Tripeptide
We use
the reservoir-Hamiltonian Replica Exchange (resH-RE)[3] technique to sample the canonical ensemble of the selected
Phe-Gly-Phe tripeptide at 300 K with the LKR potential. ResH-RE
is an enhanced Hamiltonian Replica Exchange[74] scheme, which serves to accelerate the sampling of the configurational
space at a high level of theory using a canonical reservoir of structures
generated with a less accurate but computationally cheaper potential
energy. The replicas essentially help to capture the local diffusion
in the phase space, whereas the most dramatic conformational changes,
such as swaps between local minima and crossings of energy barriers,
occur through coupling with the reservoir. By construction, the resH-RE
simulation can be driven by molecular dynamics in the NVT ensemble
but also by simpler Monte Carlo (MC) moves (i.e., random particle
moves), which are otherwise largely inefficient for systems characterized
by highly nonlinear PESs.[75] The possibility
of using both molecular dynamics and Monte Carlo moves within resH-RE
is especially advantageous given that the atomic forces are not readily
available with the LKR model used here albeit, in principle, obtainable
through computing the LKR energy derivatives[76] with respect to the nuclear coordinates.[77]Considering that the forces are available, and actually needed
to increase the robustness of the NN potential (vide infra), the NN-based sampling of the tripeptide was performed using the
ATLAS metadynamics framework,[78] which employs
a divide-and-conquer strategy to enable efficient biasing when working
with several collective variables (CV). In ATLAS, the high-dimensional
CV space is divided into basins, each of which is described by an
automatically determined, low-dimensional subset of the CVs on which
a local, well-tempered metadynamics-like bias is constructed. The
local biases are translated into an effectively high-dimensional bias
using indicator functions based on a Gaussian mixture model. Given
the high dimensionality of the CV space of the Phe-Gly-Phe tripeptide,
attempting convergence with conventional metadynamics would be futile.
Meanwhile, the ATLAS framework, which was specifically designed to
work in high dimensions, has already been tested on 6D spaces.[78] While alternative sampling techniques such as
those based on temperature acceleration could have been used for the
NN sampling,[79,80] overcoming the high energy barriers
between basins would have required temperatures impractical for the
problem at hand.In this work, space is divided into five basins,
identified by
applying the PAMM framework[81] to an initial
well-tempered metadynamics trajectory using the end-to-end distance
of the backbone as the sole CV. Each basin is described and biased
based on the two principal axes determined by performing a principal
components analysis on the associated distributions of configurations
in the six-dimensional CV space. The resultant metadynamics trajectories
were unbiased using the ITRE scheme,[82] which
makes efficient use of the entire trajectory and does not require
the distribution to be evaluated on a grid, rendering it suitable
for high-dimensional CV spaces.
Computational Details
All the baseline computations
for the Δ-ML model were performed with DFTB3/3OB[57,58] in combination with the D3(BJ)[59] dispersion
correction (DFTB), as implemented in the DFTB+ software.[83] The target potential was set at PBE[60]-dDsC[61−63] using the def2-TZVP basis set,
as implemented in GAMESS-US.[84,85] Canonical sampling
of each dipeptide was performed using T-RE simulations using the REMD@DFTB[86] protocol implemented in i-PI.[87] The simulations included 16 replicas with temperatures
ranging from 300 K to 1000 K, equally spaced on a logarithmic
scale. A time step of 0.75 fs was used in the dynamics, which
ensured the stability and energy conservation of the dynamics (see Figure S8), with a Langevin thermostat to control
the temperature. The simulations were run for two million steps, which
ensured statistical convergence of the results (see Figure S10). The final batch of structures was split in two
separate sets (70% (40,000) and 30% (15,378) of the molecules, respectively),
which were used for training and testing of the models. The resH-RE
simulations were run using the MORESIM python package.[3] They included four replicas with a potential linearly evolving
from DFTB to DFTB + LKR. This choice resulted in an exchange acceptance
probability of 40%. The resH-RE simulations were run for two million
steps, which provided converged results. A global random displacement
with a Gaussian distribution of standard deviation 0.001 Å was
chosen as the Monte Carlo step, which resulted in a 50% acceptance
rate.All metadynamics simulations were performed by coupling
the i-PI energy and force engine[88] to the
open-source, community-developed PLUMED library[89] version 2.8.0-dev (git: 79bcb8947)[90] to apply a well-tempered bias and the DFTB+[83] and LAMMPS[91] codes to evaluate the baseline
potential and Δ-learned correction, respectively. All metadynamics
simulations employed a time-step of 0.5 fs to ensure the stability
of the dynamics and the NN correction (see Figure S8) and a generalized Langevin equation (GLE) thermostat.[92,93]The local kernel regression implementation (available on github[94]) relies on a Gaussian Kernel[40] and on the aSLATM representation, as provided in the QML-toolkit.[95] The width of the Gaussian kernel, the adimensional
parameter σ, was chosen to be σ = 4.5 after a systematic
grid search. We used FPS to preselect a first pool of 39,000 local
atomic environments. The optimal number of reference environment selected
by OMP can be obtained using a grid search optimization of this parameter
(LKR-optimal), although the bigger the number the higher the cost
of the model (Figure S1b). To achieve a
converged statistical sampling (with resH-RE) at a reasonable computational
cost (see the Supporting Information for
a more detailed discussion on the computational cost), the size of
the pool of reference environment is limited to 1,000. The relevance
of this particular trade-off between accuracy and computational cost
is shown in Figure S1b. The python library
Sci-Kit Learn[96] was used to perform the
OMP regression.The NN models were trained using the N2P2 code.[97] Initial many-body symmetry functions (SF),[68] which describe the local atomic environment
of each atom
in a configuration and provide the inputs to the NNs, were generated
following the protocol of Imbalzano et al.[53] and included G2 functions with N = 12 and cutoffs r = 8, 12, and 16 Bohr and
G3 functions with N = 4, r = 8 Bohr, and ζ = 1, 2, 4 and with N = 2, r =
12 Bohr, and ζ = 1, 2. The cut-offs are long enough to describe
the environment of the central atom substantially beyond its nearest
neighbors in order to address the local differences between DFTB and
DFT (long-range discrepancies between DFTB and DFT are also accounted
for, albeit in a mean-field manner, through their effect on the local
atomic environments). The 512 most informative among them were extracted
using the semisupervised PCovCUR scheme;[98] a modification to the CUR approach, which uses a mixing parameter
(here set to 0.5) to smoothly interpolate between a feature-covariance
and a linear regression-like loss to identify features that reflect
the (structural) variance of the data set while also correlating with
the target property. Their values for a given atomic environment are
concatenated into a feature vector and fed into the “atomic”
NNs, which in the following consists of two fully connected, hidden
layers with 24 nodes each. This particular architecture has previously
proven sufficiently flexible to describe molecular crystals containing
up to four chemical species,[99,100] and multilayer perceptron
networks with similar depths and widths have seen widespread success
for a variety of molecular and condensed matter systems.[101,102]
Results and Discussion
Performance of the Trained Machine Learning
Models
The need for correcting DFTB to obtain reliable PESs
for each amino
acid dipeptide is made evident by Figure a, showing the histogram of the differences
with respect to the target PBE (after removing the multilinear regression
contribution). The inaccuracy of DFTB is also illustrated by the regression
slopes between the atomization energies at the DFTB level with and
without ML corrections (Figure b) and the PBE atomization energies. For each dipeptide, the
slope between uncorrected DFTB and PBE is consistently smaller than
unity, implying a systematic overstabilization of the most distorted
configurations and an energy understabilization for the most stable
ones (see Figure S2 for a more detailed
analysis of the individual dipeptides). The flatter characteristic
of the DFTB PESs has previously been discussed[3,103] and attributed to the limited amount of atomic overlap afforded
by its minimal valence basis, which also affects the rotational barriers.[57] As shown in Figure b,c, the LKR and NN models correct for the
systematic flattening of the PESs (slope ∼1, Figure b) and also decrease the absolute
errors for each dipeptide. As shown by the learning curves (Figure d), the NN (0.58 kcal/mol,
40,000 training dipeptides) and LKR-OMP (optimal) (0.57 kcal/mol,
40,000 training dipeptides) predictions are equally accurate. The
more computationally efficient LKR-OMP(1000) model discussed above
achieves an accuracy of 0.74 kcal/mol. The relevance of using
OMP for the selection of the reference environments instead of simpler
algorithms is illustrated by comparing the accuracy of LKR-OMP(1000)
and a ridge regression based on the same number of environments chosen
by FPS. The LKR-OMP(1000) model (referred simply as “LKR”
for the rest of the article) is significantly more accurate than the
LKR based on FPS, which additionally highlights the importance of
selecting atomic environments tailored for the specific target property.
While the performance of the NN is slightly superior to the LKR in
the training step, it must be noted that the latter model is only
trained on energy data, whereas the NN uses both energies and forces
(i.e., 3 × Natoms times more training
scalar quantities). However, the mean absolute error for each individual
dipeptide is consistently below 1 kcal/mol for both models.
The learning rates of both approaches, defined as the error as a function
of the number of training structures, are also both very similar and
characterized by a decay exponent of −0.2 on a logarithmic
scale.
Figure 2
(a) Histogram of errors in test samples of the dipeptide data set.
(b) Regression slopes between “bonding energies” of
DFTB and PBE for each of the training dipeptides and for the dimers.
(c) MAE achieved by the models in the test data for each dipeptite
and for the peptide dimers. (d) Learning curves, i.e., achieved MAE
vs number of structures used for the training. The different learning
curves are LKR using OMP with the optimized number of atomic environments
(blue), LKR exploiting OMP to select the best 1,000 environment (orange),
the Behler-Parrinello-based NN (green), LKR using FPS to select the
most distinct atomic environments, using 200 atoms per atom type (FPS
1000) (red), and LKR using FPS to select the most distinct atomic
environments but with the same distribution as OMP (FPS+ 1000) (purple).
(a) Histogram of errors in test samples of the dipeptide data set.
(b) Regression slopes between “bonding energies” of
DFTB and PBE for each of the training dipeptides and for the dimers.
(c) MAE achieved by the models in the test data for each dipeptite
and for the peptide dimers. (d) Learning curves, i.e., achieved MAE
vs number of structures used for the training. The different learning
curves are LKR using OMP with the optimized number of atomic environments
(blue), LKR exploiting OMP to select the best 1,000 environment (orange),
the Behler-Parrinello-based NN (green), LKR using FPS to select the
most distinct atomic environments, using 200 atoms per atom type (FPS
1000) (red), and LKR using FPS to select the most distinct atomic
environments but with the same distribution as OMP (FPS+ 1000) (purple).The OMP algorithm provides insightful complementary
information,
allowing one to identify which atomic environment is associated with
the largest difficulties in the learning procedure. This feature is
unique to OMP and not available for standard kernel or NN-based approaches
that do not rely on supervised sparsity methods. In particular, OMP
identifies that only a few of the 39,000 atomic environments (as low
as 300) are sufficient to reach the accuracy threshold (1 kcal/mol)
for the predictions of the dipeptide atomization energies. The OMP
selection within LKR is 45.1% C, 2.9% H, 18.6% O, 28.9% N, and 4.5%
S atoms. For the sake of comparison, the atomic composition of the
pool of dipeptide training structures is 29.5% C, 53% H, 8.5% O, 9%
N, and 0.3% S atoms. Evidently, the optimal reference atomic environments
selected by OMP do not follow the same atomic distribution as in the
overall pool of structures. OMP does not only find an adequate percentage
of atom types but also picks the most tailored atomic environments
for the target property. In contrast, the FPS selection with the same
enforced atomic distribution as OMP (FPS+ 1000) is not sufficient
to achieve a MAE as low as OMP (Figure d). In fact, on average three times more atomic environments
are needed for FPS+ to match OMP (see Figure S1b). This is further demonstrated by the 2D t-SNE (t-Stochastic Neighbor
Embedding[104]) projection (Figure ) of the training atomic environments
(constructed using the aSLATM representation as input data for the
t-SNE).
Figure 3
t-SNE maps constructed with the aSLATM representation as input
for each atom type. Each point represents an atomic environment in
the training data. The color code in the first two rows shows how
well represented the training environments are by the reference environments
chosen by OMP and FPS+. As representation score, we use the average
“atomic Kernel Representation Score” (aKRS), the average
value of the kernel similarity between each of the training atomic
environments, and the selected reference environments of the same
atom type. The color code in the last row shows the LKR correction
on each of the training atomic environments.
t-SNE maps constructed with the aSLATM representation as input
for each atom type. Each point represents an atomic environment in
the training data. The color code in the first two rows shows how
well represented the training environments are by the reference environments
chosen by OMP and FPS+. As representation score, we use the average
“atomic Kernel Representation Score” (aKRS), the average
value of the kernel similarity between each of the training atomic
environments, and the selected reference environments of the same
atom type. The color code in the last row shows the LKR correction
on each of the training atomic environments.The first two rows of t-SNE maps are color-coded based on the average
“atomic Kernel Representation Score” (aKRS), i.e., the
average value of the kernel similarity between the training atomic
environments and the selected reference (, where j represents
the
index of an environment in the training data and i is runs over the N selected reference environments of each atom type). The score is
computed for the reference environments selected by OMP (first row
of Figure ) and FPS+
(second row of Figure ). This score, bound between zero and one, shows how well an atomic
environment is represented by the selected reference environments.
The most striking differences between OMP and FPS+ is in the selection
of the oxygen and hydrogen atomic environments, whereas carbon, nitrogen,
and sulfur are treated very similarly. In other words, the assumption
behind the usage of FPS (the larger the variability in the reference
environment, the higher the accuracy) is correct for carbon, nitrogen,
and sulfur but not for hydrogen and oxygen. The oxygen maps are formed
by one large smooth cluster, which represents the amide-bond oxygen
atoms [O(a)], and two smaller regions regrouping the carboxylate [O(b)]
and hydroxyl [O(c)] oxygen atoms, respectively. In comparison to FPS,
OMP is placing more emphasis on the amide oxygens and the carboxylate
groups but much less on the oxygen in the hydroxyl groups. For the
hydrogen atoms, the large number of isolated clusters in the t-SNE
is indicative of a large variability in the hydrogen environments,
which could intuitively suggest that a high number of hydrogen reference
atoms are necessary to get an accurate model. Yet, OMP only selects
2.9% of them. This result reinforces that the choice of tailored environments
is the key to achieving a more robust regression model. Interestingly,
OMP favors carbon-bonded hydrogen atoms lying in the central cluster
rather than polar hydrogens (e.g., in a O–H bond). Since the
model is constructed to capture the variations of the potential energy
as a function of the molecule structural changes, the selection of
more carbon-bonded hydrogens than any other type has to be attributed
to the higher conformational variability of the environments surrounding
a C–H bond.Another useful analysis of how the model
behaves involves comparing
the choice of atomic environments by OMP with the magnitude of the
ML correction in terms of atomic-contributions (last row of Figure ). While one might
expect a direct relationship between the atomic selection and the
magnitude of the ML atomic error, this intuition is actually incorrect.
In fact, a large DFTB error for a given atom type does not necessarily
imply that the learning process would be improved by including more
atom-environments of the same type. This is especially true if the
electronic nature of the DFTB error is uniform across all the conformation
available in the training set. This lack of correlation is evident
while looking at the bottom panels of Figure . The DFTB errors are the largest for the
hydroxyl functional groups [H(c) and O(c) in the figure], while only
a small portion of carbonyl or amide oxygen atoms are characterized
by similar errors of opposite sign. This trend is not reflected in
the optimal OMP selection of reference atomic environments. Similarly,
the most problematic carbon atoms (in terms of ML errors) are the
oxygen-bonded carbons, which include the amide functions (the center
cluster), as well as the carbons of the terminal guanidino group of
arginine (HNC(NH2)2). However, OMP does not
place special attention to these environments when selecting the best
reference carbons. Nitrogen behaves similarly to carbon. The central
cluster is the most well described, which is representative of C–NH–C
nitrogens (mainly present in the amide bonds), while the outer clusters,
including terminal amines (−NH2), the proline rings,
and guanidino groups, are less sampled. In contrast to other atoms,
the ML correction for nitrogen has similar magnitude in all the clusters.
An interactive application to visualize and explore this data is available
at https://atomic-environments-dipeptides.herokuapp.com, built
with the Molecular Explorer Software.[105]
Extrapolation
The local nature of the two ML potentials
can in principle be used to make predictions for any system containing
no chemical species other than C, H, O, N, and S, although high accuracy
is expected only for local environments similar to those present in
the training set, i.e., in peptide chains or oligopeptides. Here we
demonstrate the transferability of the two models by exploring the
potential and free energy landscapes of the Phe-Gly-Phe tripeptide.
The Phe-Gly-Phe tripeptide (in neutral form) is an appealing target
to test the transferability of the ML models as it is one of the most
suitable chemical systems to model noncovalent interactions in proteins.[106] Additionally, this tripeptide is not an adequate
target for existing force fields, which are typically parametrized
either for capped peptides or for charged forms. The Phe-Gly-Phe tripeptide
is in the gas phase and uncapped, and contains a combination of neutral
NH2 and COOH groups that are not stable in solution. As a result,
many force fields (AMOEBA, AMBER) do not accept it as input or, alternatively,
generate unstable dynamics (GAFF).To assess the quality of
the extrapolated energies, we compile two data sets of 1000 Phe-Gly-Phe
structures subdivided into 900/100 subsets illustrative of the conformational
landscape explored at 300 and 0 K, respectively (for further details,
see Supporting Information), at both the
baseline and target levels. The first set corresponds to 1,000 structures
selected at random from the 300 K replica from the DFTB-based
T-RE simulation. Out of these 1,000 structures, 100 are optimized
at the same DFTB level (i.e., 0 K static optimization). The
second set is a random selection of 1,000 structures taken from the
300 K sampling at the DFTB + LKR level (see the next section)
out of which 100 are optimized with PBE.The most striking difference
between the error distributions of
DFTB and the ML corrected versions (respectively the blue and orange/green
histograms, Figure a) is the transition from a bimodal Gaussian distribution to the
expected normal distribution centered at zero. The two peaks correspond
to the DFTB energies of conformers generated using DFTB as underlying
potential [DFTB//DFTB, overstabilized] and to the DFTB energies of
conformers generated using a different potential [DFTB//PBE, understabilized].
The transition from a bimodal to a single Gaussian distribution upon
application of the ML-corrections reveals that the DFTB-sampled conformational
space (i.e., the set of visited structures) would be energetically
disjoint from the reference-sampled space at PBE if we had to drive
the dynamics using a DFTB potential. The ML-corrections allow concluding
that this separation is spurious and that the DFTB structures from
the PBE perspective [PBE//DFTB] are not peculiar. Interestingly, the
slope between DFTB and PBE energies for all the tripeptide test structures
combined is 0.96 (see Figures S4 and S6), which would suggest that systematic flattening of the PES by DFTB
is not observed in this case. However, the correlation between DFT
and DFTB breaks down when considering the 300 and 0 K conformations
separately (in a clear example of the Simpson’s paradox[107]), where the typical behavior of DFTB is recovered
(slopes: 0.78 at 300 K and 0.73 at 0 K, see again Figure S6). Finite temperature effects offset
the energies of the 300 K ensemble with respect to the 0 K,
so that the joint distribution seems to correlate better with the
DFT values.
Figure 4
(a) Histogram of prediction errors made on the tripeptide test
set. (b) Bar plots with the mean shifts of the error distributions
and their MAE after being centered.
(a) Histogram of prediction errors made on the tripeptide test
set. (b) Bar plots with the mean shifts of the error distributions
and their MAE after being centered.The ML corrections (Figure orange and green data) overcome all the issues present in
the uncorrected DFTB potential. First, the mean bonding energy shift
is reduced from 10.6 to 1.3 kcal/mol by the LKR and to 0.8 kcal/mol
by the NN model (see Figure b). This error does not influence the conformational sampling
of the molecule, as a constant shift in energy does not alter the
relative probability of the conformers. Nevertheless, a decreased
error is beneficial when comparing the electronic energies of different
molecules. Most importantly, the average absolute deviation from the
mean is reduced from 4.2 to 1.6 kcal/mol by the LKR model and to 1.9 kcal/mol
by the NN (see Figure b). All the errors of the LKR model are below 8 kcal/mol, while the
NN predictions on the tripeptide present two outliers of −15
and +26 kcal/mol. Additionally, the regression slope between the predictions
and the target energies is also corrected to 0.99 for all the sets
(see Figure S4b). These results are crucial
since the standard deviation and the regression slope are the most
important quantities for conformational sampling. Even a slight deviation
from 1 in the regression slope causes significant changes in the resulting
free energy surfaces. In particular, the observed regression slope
between DFTB and PBE at 300 K (0.75) is roughly equivalent to perform
sampling with a temperature 1.33 times higher (e.g., 400 K
instead of 300 K). At the same time, outliers can lead to unstable
dynamics and alter the results of sampling simulations.Overall,
while the NN model performs better on the dipeptide test
structures, the LKR provides a more robust extrapolation (lower MAE,
less outliers) for the Phe-Gly-Phe tripeptide. It must be noted that
the superior stability of LKR is not a consequence of using only energy
data for the training. In fact, the NN model trained only with energies
shows much poorer transferability and scalability capabilities (see Figure S5).As shown in the previous section,
the atomic decomposition of the
ML correction naturally provides a measure of the error localization
in the molecule. To visualize the error for the tripeptide, we constructed
a scalar field using atomic-centered Gaussian functions scaled such
as to match the LKR atomic predictions (see Figure ). With the use of this procedure, it is
possible to construct a real-space map highlighting the regions of
the tripeptide where the DFTB potential deviates from the PBE reference.
An example of these critical regions is identifiable between the oxygen
and hydrogen atom forming an intramolecular hydrogen bond (e.g., between
atoms 4 and 22 in Figure ). In Figure , we have shown that the hydrogens bound to an oxygen or a nitrogen
are the most difficult to describe at the DFTB level. Figure shows the understabilization
of the hydrogen bond between the NH2 and the CO by DFTB,
which is corrected by our models. However, this particular example
does not imply that all hydrogen bonds are poorly described and in
a systematic manner. For example, equivalent figures show that the
OCO–H bond in the dipeptide of aspartate is actually overstabilized
by DFTB, while the CO–HN in the protonated histidine is understabilized
(see Figure S3). These inconsistencies
have been shown to arise at the DFTB level due to a poor description
of short-range electrostatic and polarization interactions arising
from the use of a minimal valence basis.[108] While several empirical corrections to DFTB and more generally to
semiempirical methods have been proposed,[108−112] the use of the D3H5 correction (the last of such corrections DFTB-D3H5[113]) does not change the performance of DFTB on
the dipeptide set significantly (see Figure S7).
Figure 5
Histogram with the mean absolute atomic contribution to the LKR
corrections for the tripeptide for the 2,000 test structures. The
figure includes a particular conformation of the tripeptide with isosurfaces
of a scalar field representing the localization of the ML correction.
The scalar field was generated with the LKR atomic corrections to
the energy for that structure, convoluted with the atomic positions
and a Gaussian filter of width 1 Å. The isosurfaces correspond
to the isovalues −5, −2, +2, and +5.
Histogram with the mean absolute atomic contribution to the LKR
corrections for the tripeptide for the 2,000 test structures. The
figure includes a particular conformation of the tripeptide with isosurfaces
of a scalar field representing the localization of the ML correction.
The scalar field was generated with the LKR atomic corrections to
the energy for that structure, convoluted with the atomic positions
and a Gaussian filter of width 1 Å. The isosurfaces correspond
to the isovalues −5, −2, +2, and +5.Furthermore, the analysis reported in Figure shows that the description of the hydrogen-bond
interactions are not the only limitation of DFTB. More generally,
the highest absolute ML corrections appears whenever the bond between
two atoms is polarized, such as in the region of the terminal carboxylic
acid (atoms 2, 3, and 4 in Figure ) and the amide moiety of the peptide bond (atoms 7,
8, 9, and 10 in Figure ). In contrast to existing corrections, which are not meant to improve
the description of these polarized bonds, the ML models guarantee
by construction an equally accurate description for all the regions.
Free Energy Surface of Tripeptides
Having assessed
the robustness of the ML models by evaluating the accuracy of the
energy predictions on Phe-Gly-Phe tripeptide conformations and by
providing comparisons with uncorrected DFTB, this section goes further
and applies the ML corrections to sample the free-energy landscape
of the tripeptide in the gas phase. As described in the Computational Details section, for the LKR model
we use the resH-RE approach for a 300 K canonical sampling
of the tripeptide generated with DFTB as a reservoir to accelerate
the DFTB+LKR sampling without the need for high temperatures or bias
potentials. The reservoir was generated using T-RE because in cases
such as this where the barriers between conformers are not expected
to be too high, and where relatively long trajectories are affordable,
it allows one to obtain an explicit canonical distribution of conformers
(that is needed for the reservoir of resH-RE simulations) without
having to determine CVs and/or perform reweighting steps. The use
of other accelerated sampling schemes (resH-RE or ATLAS) becomes necessary
in the presence of a rougher PES. Figure a shows the set of characteristic collective
variables (CVs) chosen to analyze the free-energy landscape. The set
of CVs includes all the Ramachandran dihedral angles as well as the
distance between the benzene rings at each end of the chain. To visually
represent the resultant seven-dimensional FES, the 2D FESs for all
pairs of CVs are obtained by marginalizing the seven-dimensional distribution.
The DFTB-based 2D FES DFTB are shown in the lower triangle of Figure b,c, while their
DFTB+LKR-based counterparts are shown in the upper triangle of Figure b. The C4–N2–C5–C6
and C2–N1–C3–C4 dihedral angles were excluded
from the plot because their values remain constant throughout the
sampling (see Figure S9).
Figure 6
(a) Tripeptide Phe-Gly-Phe
with highlighted atoms used for the
collective variables in the analysis of the sampling simulations.
(b and c) Grids with 2D free energy landscapes for each pair of the
selected collective variables. The lower diagonals contain results
from T-RE simulations using DFTB-D3(BJ). The upper diagonals contains
the results of the resH-RE simulations using DFTB-D3(BJ) + LKR (b)
and DFTB-D3(BJ) + NN (c). In the diagonal are the probability distributions
of each collective variable for DFTB-D3(BJ)(blue), DFTB-D3(BJ) + LKR(orange),
and DFTB-D3(BJ) + NN(green).
(a) Tripeptide Phe-Gly-Phe
with highlighted atoms used for the
collective variables in the analysis of the sampling simulations.
(b and c) Grids with 2D free energy landscapes for each pair of the
selected collective variables. The lower diagonals contain results
from T-RE simulations using DFTB-D3(BJ). The upper diagonals contains
the results of the resH-RE simulations using DFTB-D3(BJ) + LKR (b)
and DFTB-D3(BJ) + NN (c). In the diagonal are the probability distributions
of each collective variable for DFTB-D3(BJ)(blue), DFTB-D3(BJ) + LKR(orange),
and DFTB-D3(BJ) + NN(green).To provide a complementary view, we further sample the same tripeptide
FESs using the committee of NN models (upper diagonal of Figure c). We exploit the
availability of forces to perform well-tempered metadynamics simulations
and make use of the ability to assess the uncertainties in the predicted
corrections to smoothly fall back onto the DFTB baseline when the
NN predictions become uncertain. This suppresses instabilities in
the dynamics due to unphysical NN corrections in areas of the PES,
which are underrepresented in the training data. The metadynamics
are biased in the six-dimensional CV space spanned by the six Ramachandran
dihedral angles (for further information see the Computational Details section).The comparison between
the DFTB T-RE results (lower triangular
portions of Figure ) and the results of the ML potentials (upper triangular portions
of Figure ) shows
the effects of correcting the flat PES on the final free energy landscape.
In addition to increasing the free energy barriers, translated in
very low populations in basin transition areas, the ML corrections
dramatically affect the relative stability of the different basins,
altering the qualitative dynamic behavior of the tripeptide at 300 K.
These effects can be equally observed in both the sampling based on
LKR and NN. The results obtained show good agreement. The single CV
populations are nearly identical, and the lowest free energy minima
are unequivocally determined. However, some disagreement in the free
energy surfaces obtained by sampling using the two ML frameworks can
be observed for the higher-energy portions of the FESs. Given the
highly nontrivial nature of this exercise, it is not easy to pinpoint
the source of the discrepancy. The entanglement between uncertainties
arising from (i) finite statistics and (ii) possible discrepancies
of the ML models complicates the analysis of their relative weight.
As a benchmark, sampling results following the same methodologies
applied to the alanine dipeptide show a very good agreement between
the DFTB+LKR and DFTB+NN approaches (see Figure S11). Overall, it is clear that both ML-corrected frameworks
predict a much sharper variation of the free energy compared with
DFTB that instead predicts a very smooth landscape as a function of
the dihedral angles. This qualitative difference is also clearly visible
in a 2D Sketchmap[114,115] projection (Figure ), which indicates that the
more diffuse structural distribution at DFTB is a direct consequence
of the flatness of the associated PES.
Figure 7
Sketchmap computed with
DFTB-D3(BJ) (left) DFTB-D3(BJ) + LKR (middle)
and DFTB-D3(BJ) + NN (right) sampling at 300 K using the selected
CVs from Figure .
Sketchmap computed with
DFTB-D3(BJ) (left) DFTB-D3(BJ) + LKR (middle)
and DFTB-D3(BJ) + NN (right) sampling at 300 K using the selected
CVs from Figure .As a final note, it is important to stress that
the generation
of converged statistics using the target potential (PBE) would have
been computationally unfeasible. While appealing, an alternative comparison
with experimental results would require incorporating solvent effects,
which is outside the scope of this work.
Conclusion
We
introduced LKR-OMP, a local kernel regression model which exploits
the supervised sparsity algorithm OMP and compared its performance
along with that of a Behler-Parrinello neural network. LKR-OMP benefits
from the straightforward training of kernel methods, combining it
with the scalability and transferability of models based on neural
networks.We juxtapose the two approaches by applying them to
the challenging
task of learning the PES of oligopeptides at the PBE-dDsC level, using
the semiempirical DFTB-D3(BJ) potential as a baseline and training
on a combination of dipeptide structures and dimers of small organic
fragments. To achieve comparable computational cost between sparse
kernel regression and NNs, it is essential to select carefully the
most representative environments. We show, both by comparing the final
model accuracy and by combining the representation score with a 2D
projection of the local atomic environments, that selection methods
relying exclusively on structural information, such as FPS or CUR,
are not always optimal and that substantial improvements can be achieved
with the supervised strategy adopted in the LKR-OMP scheme.Using only energies for training, the LRK-OMP model achieves an
accuracy and transferability compared to that of the NN-based model,
that also uses forces to optimize its parameters. Thanks to the atom-centered
construction of the ML correction, we can reveal the origin of the
DFTB-D3(BJ) error relative to DFT, interpret in terms of chemical
and atomic patterns, and demonstrate the relevance of relying upon
a correction based on nonlinear regression techniques. As a final
demonstration of the possibilities brought about by the use of ML
corrections of the PES, we use them in combination with enhanced sampling
approaches to explore the conformational energy landscape of the tripeptide
Phe-Gly-Phe at an effective PBE-dDsC level. We use two different sampling
strategies: resH-RE for LKR-OMP, which at present does not provide
easy access to energy derivatives, and ATLAS metadynamics for the
NN potential, that instead does. The free energy landscapes obtained
with the two frameworks are consistent with each other, and show striking
differences compared to the uncorrected baseline potential. This provides
another example of the exaggerated smoothness of the DFTB potentials
and highlights the dire need to make the accuracy of higher electronic
structure levels accessible to the size and time scale that are necessary
for free energy computations. In this respect, the fact that ML corrections
have now become a mature, trustworthy approach to achieve this goal,
with entirely different frameworks achieving comparable accuracy and
efficiency, is very encouraging. The LKR-OMP model, in particular,
offers a good compromise in terms of data-intensiveness, computational
cost, generality and accuracy, in addition to providing unique analytical
insight into the model performance.
Authors: Lori A Burns; John C Faver; Zheng Zheng; Michael S Marshall; Daniel G A Smith; Kenno Vanommeslaeghe; Alexander D MacKerell; Kenneth M Merz; C David Sherrill Journal: J Chem Phys Date: 2017-10-28 Impact factor: 3.488