Adrian Jinich1,2, Benjamin Sanchez-Lengeling1, Haniu Ren1, Rebecca Harman1, Alán Aspuru-Guzik3,4,5. 1. Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts 02138, United States. 2. Division of Infectious Diseases, Weill Department of Medicine, Weill-Cornell Medical College, New York, New York 10065, United States. 3. Department of Chemistry and Department of Computer Science, University of Toronto, 80 St. George Street, Toronto, Ontario M5S 3H6, Canada. 4. Vector Institute, Toronto, Ontario M5G 1M1, Canada. 5. Biologically-Inspired Solar Energy Program, Canadian Institute for Advanced Research (CIFAR), Toronto, Ontario M5S 1M1, Canada.
Abstract
A quantitative understanding of the thermodynamics of biochemical reactions is essential for accurately modeling metabolism. The group contribution method (GCM) is one of the most widely used approaches to estimate standard Gibbs energies and redox potentials of reactions for which no experimental measurements exist. Previous work has shown that quantum chemical predictions of biochemical thermodynamics are a promising approach to overcome the limitations of GCM. However, the quantum chemistry approach is significantly more expensive. Here, we use a combination of quantum chemistry and machine learning to obtain a fast and accurate method for predicting the thermodynamics of biochemical redox reactions. We focus on predicting the redox potentials of carbonyl functional group reductions to alcohols and amines, two of the most ubiquitous carbon redox transformations in biology. Our method relies on semiempirical quantum chemistry calculations calibrated with Gaussian process (GP) regression against available experimental data and results in higher predictive power than the GCM at low computational cost. Direct calibration of GCM and fingerprint-based predictions (without quantum chemistry) with GP regression also results in significant improvements in prediction accuracy, demonstrating the versatility of the approach. We design and implement a network expansion algorithm that iteratively reduces and oxidizes a set of natural seed metabolites and demonstrate the high-throughput applicability of our method by predicting the standard potentials of more than 315 000 redox reactions involving approximately 70 000 compounds. Additionally, we developed a novel fingerprint-based framework for detecting molecular environment motifs that are enriched or depleted across different regions of the redox potential landscape. We provide open access to all source code and data generated.
A quantitative understanding of the thermodynamics of biochemical reactions is essential for accurately modeling metabolism. The group contribution method (GCM) is one of the most widely used approaches to estimate standard Gibbs energies and redox potentials of reactions for which no experimental measurements exist. Previous work has shown that quantum chemical predictions of biochemical thermodynamics are a promising approach to overcome the limitations of GCM. However, the quantum chemistry approach is significantly more expensive. Here, we use a combination of quantum chemistry and machine learning to obtain a fast and accurate method for predicting the thermodynamics of biochemical redox reactions. We focus on predicting the redox potentials of carbonyl functional group reductions to alcohols and amines, two of the most ubiquitous carbon redox transformations in biology. Our method relies on semiempirical quantum chemistry calculations calibrated with Gaussian process (GP) regression against available experimental data and results in higher predictive power than the GCM at low computational cost. Direct calibration of GCM and fingerprint-based predictions (without quantum chemistry) with GP regression also results in significant improvements in prediction accuracy, demonstrating the versatility of the approach. We design and implement a network expansion algorithm that iteratively reduces and oxidizes a set of natural seed metabolites and demonstrate the high-throughput applicability of our method by predicting the standard potentials of more than 315 000 redox reactions involving approximately 70 000 compounds. Additionally, we developed a novel fingerprint-based framework for detecting molecular environment motifs that are enriched or depleted across different regions of the redox potential landscape. We provide open access to all source code and data generated.
All living systems
are sustained by complex networks of biochemical
reactions that extract energy from organic compounds and generate
the building blocks that make up cells.[1] Recent work[2−5] has revived decades-old efforts[6−8] to obtain a quantitative
understanding of the thermodynamics of such metabolic networks. Accurately
predicting the thermodynamic parameters, such as Gibbs reaction energies
and redox potentials, of biochemical reactions informs both metabolic
engineering applications[9,10] and the discovery of
evolutionary design principles of natural pathways.[11] This applies both to the prediction of the thermodynamics
of known metabolic reactions and to non-natural reactions, which can
be used to expand nature’s metabolic toolkit.[12,13]Experimentally available reaction Gibbs energies and redox
potentials
provide coverage for only about ∼10% of known natural metabolic
reactions. The metabolic modeling community relies mainly on group
contribution method (GCM) approaches[14−18] to estimate missing thermodynamic values. GCM decomposes
metabolites into functional groups and assigns group energies by calibration
against experimental data. The most widely used implementation of
the GCM for biochemistry is the eQuilibrator,[19,20] an online thermodynamics calculator. In addition to using GCM to
predict the formation energies of compounds, the eQuilibrator makes
use of experimental reactant formation energies when available and
combines them with group energies in a consistent manner in what is
known as the component contribution method. This results in significant
increases in accuracy.[18] However, whenever
experimental reactant formation energies are not available, estimates
are based solely on group contribution energies. Such GCM-based estimates,
which do not capture intramolecular functional group interactions,
have limited prediction accuracy.[21,22]In this
work, we focus on predicting the thermodynamics of redox
biochemistry. Redox reactions, which are fundamental to living systems
and are ubiquitous throughout metabolism, consist of electron transfers
between two or more redox pairs or half-reactions.[23] Quantum chemistry has recently emerged as an important
alternative modeling tool for the accurate prediction of biochemical
thermodynamics.[21,22,24,25] However, quantum chemical methods tend to
have very high computational cost in comparison with the GCM or other
cheminformatic-based alternatives.Recent work in the intersection
of quantum chemistry and machine
learning has resulted in hybrid approaches that significantly lower
computational cost without sacrificing prediction accuracy.[26−29] One such hybrid quantum chemistry/machine learning approach, previously
applied to organic photovoltaics material design,[30,31] relies on Gaussian process (GP) regression[32] to calibrate quantum chemical predictions against experimental data.
GP regression is an established probabilistic framework in machine
learning to build flexible models, which also furnishes uncertainty
bounds on predicted data points. Gaussian process regression uses
the “kernel trick”[33] to make
probabilistic predictions, leveraging the distance between a data
point of interest and a training set.Here, we present a mixed
quantum chemistry/machine learning approach
for the accurate and high-throughput prediction of biochemical redox
potentials. We focus on predicting the thermodynamics of carbonyl
(C=O) functional group reductions to alcohols (C—O)
or amines (C—N) (Figure A), two of the most abundant redox reaction categories in
metabolism, for which a significant amount of experimental data is
available. The approach relies on predicting electronic energies of
compounds using the semiempirical PM7 method (Figure B)[34] and then
correcting for systematic errors in the predictions with GP regression
against experimental data (Figures C and 2). We innovate on existing
GP regression techniques by making use of novel reaction fingerprints[35] to compute the similarity between redox reactions.
We demonstrate that the method results in higher predictive power
than the GCM approach. Importantly, this comes at significantly lower
computational cost than previously developed quantum chemical methodologies
for redox biochemical thermochemistry.[22] In addition, we demonstrate the wide applicability of the GP regression
method by directly calibrating GCM and fingerprint-based predictions
(without quantum chemistry), also obtaining significant improvements
in prediction accuracy. We demonstrate the high-throughput nature
of the approach by predicting the redox potentials of what is to our
knowledge the largest existing database of non-natural biochemical
redox reactions, consisting of more than 315 000 iterative
oxidations and reductions of “seed” metabolites from
the KEGG database (Figure A).[36−39] Using this vast data resource, we apply a novel molecular fingerprint-based
analysis—termed “reaction fingerprint heatmap”—to
decipher the molecular environment motifs associated with high and
low redox potentials, thus providing insights into important structure–energy
relationships. We provide open access to the full source code and
data, including metabolite geometries, electronic energies, pKa’s, and redox potentials at https://github.com/aspuru-guzik-group/gp_redox_rxn.
Figure 1
Schematic representation of the workflow. (A) We generate a network
of redox reactions obtained from iterative reduction and oxidation
of natural seed metabolites. The redox reactions considered are reductions
of carbonyl functional groups to both alcohol and amine groups. (B)
Starting from simplified molecular-input line-entry system (SMILES)[40] string representations of all the compounds
in the network at pH = 0, we preoptimize molecular geometries using
the Universal Force Field, and run semiempirical quantum chemistry
calculations using PM7. (C) We estimate the standard (millimolar state)
redox potentials Em for the major species
at pH = 0 from the difference in electronic energies. The transformed
standard redox potentials E′m(pH
= 7) is obtained from cheminformatic pKa estimates and the Alberty–Legendre transform. We calibrated
a training set of compounds against available experimental E′m(pH = 7) values using Gaussian process
(GP) regression.
Figure 2
Schematic diagram
of GP regression with reaction fingerprints.
A redox reaction of interest x* is represented using
a reaction fingerprint, which captures the molecular structure of
both substrate (oxidized) and product (reduced). This reaction fingerprint
is compared against the set of all redox reactions in the experimental
data set using the covariance function of the Gaussian process k(x*, x) (see the Methods section). The GP
correction (E′mGP) to
the quantum chemical prediction (E′mPM7) of the standard redox potential is then obtained
from a similarity-based average of the correction terms for the experimental
data set.
Schematic representation of the workflow. (A) We generate a network
of redox reactions obtained from iterative reduction and oxidation
of natural seed metabolites. The redox reactions considered are reductions
of carbonyl functional groups to both alcohol and amine groups. (B)
Starting from simplified molecular-input line-entry system (SMILES)[40] string representations of all the compounds
in the network at pH = 0, we preoptimize molecular geometries using
the Universal Force Field, and run semiempirical quantum chemistry
calculations using PM7. (C) We estimate the standard (millimolar state)
redox potentials Em for the major species
at pH = 0 from the difference in electronic energies. The transformed
standard redox potentials E′m(pH
= 7) is obtained from cheminformatic pKa estimates and the Alberty–Legendre transform. We calibrated
a training set of compounds against available experimental E′m(pH = 7) values using Gaussian process
(GP) regression.
Results
Optimization
of Redox Potential Prediction Strategy That Combines
Semiempirical Quantum Chemistry with GP Regression
Our goal
is to develop a calibrated quantum chemistry modeling framework that
can accurately predict the reduction potentials of biochemical redox
reactions in a high-throughput manner. Several different types of
redox reactions that change the oxidation state of carbon atoms exist
in biochemistry. These include the reduction of carboxylic acids to
aldehydes, the reduction of carbonyls to alcohols or amines, and the
reduction of alcohols to hydrocarbons.[4] In this work, we focus our efforts on predicting the standard redox
potentials (as a function of pH and ionic strength) for the reduction
of carbonyls to alcohols or amines. This reaction category represents
the most common type of carbon redox transformation in biochemistry.[39] In addition, it is the category for which the
largest amount of experimental data is available in the NIST database
for the thermodynamics of enzyme-catalyzed reactions (TECRDB).[40]Similar to previous implementations, our
approach consists of running quantum chemistry calculations to obtain
electronic structure energies of substrates and products for the redox
reactions of interest.[22] We take the difference
in electronic energies as an estimate of the standard redox potential
of the most abundant species (protonation states) of metabolites at
pH = 0. We use cheminformatic pKa estimates
and the Alberty–Legendre transform[41] to convert the redox potentials to transformed standard redox potentials E′° (pH = 7) (Methods). We then use GP regression to correct for systematic errors in
the ab initio estimates (Figure ) as well as pKa estimates that go into the model (Methods). To test and optimize the accuracy of our modeling framework, we
compared predicted potentials against a data set of 81 experimental
redox potentials obtained from the NIST database of thermodynamics
of enzyme-catalyzed reactions.[40]Schematic diagram
of GP regression with reaction fingerprints.
A redox reaction of interest x* is represented using
a reaction fingerprint, which captures the molecular structure of
both substrate (oxidized) and product (reduced). This reaction fingerprint
is compared against the set of all redox reactions in the experimental
data set using the covariance function of the Gaussian process k(x*, x) (see the Methods section). The GP
correction (E′mGP) to
the quantum chemical prediction (E′mPM7) of the standard redox potential is then obtained
from a similarity-based average of the correction terms for the experimental
data set.We tested several different model
chemistries to maximize the accuracy
and minimize the computational cost of our methodology (Methods section, Table , Table S1). In this context, a
“model chemistry” consists of the combination of multiple
modeling choices that go into running a quantum chemical simulation.
These include the conformer generation strategy and geometry optimization
procedure, the electronic structure method (e.g., density functional
theory (DFT), wave function approaches, or semiempirical methods)
as well as the water model used.
Table 1
Summary Statistics
of Prediction Accuracy
for Calibration with Gaussian Process Regression of Different Modeling
Approachesa
calibration/regression
linear fit
approach
geometry source
average
compute time (s)
MAE (σ) (mV)
r
R2
R2
GCM
N/A
0.01
27.15 (24.04)
0.58
0.33
0.33
Rxn-FP + GP
N/A
0.01
24.35 (23.44)
0.65
0.42
N/A
GCM + GP
N/A
0.01
21.20 (20.50)
0.75
0.56
0.33
PM7 + GP
opt.
0.36
22.47 (19.68)
0.74
0.54
0.22
HF-3c + GP
opt.
588
21.07 (19.76)
0.76
0.57
0.28
DSD-PBEP86 + GP
PM7 opt.
1223
21.15 (17.97)
0.79
0.61
0.24
DLPNO–CCSD(T)//cc-pVDZ + GP
B3LYP conf.
6111
20.45 (19.97)
0.76
0.57
0.47
See the Methods section for detailed descriptions of each modeling approach (quantum
model chemistry). Rxn-FP = reaction fingerprints; GCM = group contribution
method; GP = calibration with Gaussian process regression; opt. =
geometry optimization of a single geometric conformer with the same
level of theory; conf. = geometry optimization of multiple geometric
conformers at a given theory level. Regression statistics correspond
to GP calibration using Morgan fingerprints of radius 6. Additionally
we report the R2 (coefficient of determination)
obtained using calibration with linear regression to the experimental
data. The statistics are MAE = mean absolute error; σ = standard
deviation of the absolute error; r = Pearson r; R2 = coefficient of determination;
average compute time = estimated average computational time per electron,
in seconds. Linear fit is without GP regression.
See the Methods section for detailed descriptions of each modeling approach (quantum
model chemistry). Rxn-FP = reaction fingerprints; GCM = group contribution
method; GP = calibration with Gaussian process regression; opt. =
geometry optimization of a single geometric conformer with the same
level of theory; conf. = geometry optimization of multiple geometric
conformers at a given theory level. Regression statistics correspond
to GP calibration using Morgan fingerprints of radius 6. Additionally
we report the R2 (coefficient of determination)
obtained using calibration with linear regression to the experimental
data. The statistics are MAE = mean absolute error; σ = standard
deviation of the absolute error; r = Pearson r; R2 = coefficient of determination;
average compute time = estimated average computational time per electron,
in seconds. Linear fit is without GP regression.The GP regression-based calibration
step depends on a particular
choice of a covariance function, which in turn involves a distance
or similarity measure between redox reactions or compounds. We found
that the following combination of quantum model chemistry and kernel/distance
function between reactions resulted in the most efficient (in terms
of a trade-off between accuracy and computational cost) prediction
strategy: the PM7[34] semiempirical method
for both geometry optimizations and single-point electronic energies,
with the COSMO implicit solvation model,[42] and a reaction fingerprint obtained by taking the difference between
counted Morgan fingerprint vectors of products and substrates,[35] with a kernel function that is a mixture of
a squared exponential kernel, and a noise kernel (Methods). We note that the model chemistry using the double-hybrid
functional DSD-PBEP86 (based on a double-hybrid DFT functional) gives
the best accuracy but at a higher computational cost.
GP-Calibrated
Quantum Chemistry Yields High Prediction Accuracy
at a Low Computational Cost
Importantly, our GP-calibrated
semiempirical quantum chemistry method results in higher predictive
power than the most commonly used approach, the GCM (Table , Figure ). Using cross-validation to test prediction
accuracy, the GP-calibrated PM7 approach results in a higher Pearson
correlation coefficient (r), a higher coefficient
of determination (R2), and a lower mean
absolute error (MAE) than GCM predictions when tested against the
set of experimental potentials (p < 0.05, Methods). The GP calibration also returns a GP uncertainty
for each predicted value, and we find that it linearly correlates
with the absolute deviation from uncertainty (Figure S1).
Figure 3
Comparison of prediction accuracy for the group contribution
method
and quantum chemistry with GP regression. Scatter plots of experimental
standard transformed redox potentials against predictions from several
sources: (a) PM7, (b) linearly calibrated PM7, (c) group contribution
method (GCM), and (d) GP-calibrated PM7. The experimental data consist
of available standard transformed reduction potentials for carbonyl-to-alcohol
and carbonyl-to-amine reductions. Data points in the scatter plot
were obtained from a leave-one-out cross-validation (LOOCV) procedure,
where the model is trained on all data points except the one point
to be predicted. Note that we use the millimolar standard state, E′m, where reactant concentrations are
taken as 1 mM, since this is significantly closer to relevant physiological
concentrations of metabolites in cells than 1 M. The legend in each
plot contains summary statistics for the fit; MAE (mean absolute error)
also has the associated standard deviation in parentheses.
Comparison of prediction accuracy for the group contribution
method
and quantum chemistry with GP regression. Scatter plots of experimental
standard transformed redox potentials against predictions from several
sources: (a) PM7, (b) linearly calibrated PM7, (c) group contribution
method (GCM), and (d) GP-calibrated PM7. The experimental data consist
of available standard transformed reduction potentials for carbonyl-to-alcohol
and carbonyl-to-amine reductions. Data points in the scatter plot
were obtained from a leave-one-out cross-validation (LOOCV) procedure,
where the model is trained on all data points except the one point
to be predicted. Note that we use the millimolar standard state, E′m, where reactant concentrations are
taken as 1 mM, since this is significantly closer to relevant physiological
concentrations of metabolites in cells than 1 M. The legend in each
plot contains summary statistics for the fit; MAE (mean absolute error)
also has the associated standard deviation in parentheses.We note that the GCM predictions, as well as fingerprint-based
predictions (without relying on any quantum chemistry calculations),
can also be calibrated using GP regression, resulting in significant
improvements in their prediction accuracy. Thus, regardless of the
underlying modeling choice, the GP calibration approach is an accurate,
computationally efficient, and widely applicable modeling framework
for the high-throughput prediction of biochemical redox potentials.One of the advantages of our approach is its low computational
cost (Table S1) in comparison to other
model chemistries. For instance, the PM7-GP-calibrated calculations
are approximately 1000-fold faster than the DLPNO–CCSD(T) calibrated
model chemistry, despite both approaches yielding comparable accuracies.
We note that the pipeline involves estimating pKa’s and major protonation states at pH = 0 for all metabolites
at a slight additional computational cost. However, this cost is also
incurred by the GCM approach.
Generation of a Large Network
of Non-Natural Redox Reactions
Using a Novel Network Expansion Algorithm
To demonstrate
the high-throughput nature of our method, we implemented a redox reaction
network expansion algorithm to generate what is, to our knowledge,
the largest database of non-natural biochemical redox reactions (Methods). The network expansion algorithm makes
use of the RDKit cheminformatics software to iteratively apply simplified
molecular-input line-entry system (SMILES)[35,43] reaction strings to natural metabolites in the KEGG database. We
generate two types of redox reactions, implemented in both the reductive
and oxidative directions. The first type is the reduction of carbonyls
to alcohols (and the reverse oxidations); the second is the reduction
of carbonyls to amines (and the reverse oxidations). Applying the
cheminformatic transformation to metabolites in the KEGG database
results in a first set of nearest-neighbor reduction or oxidation
products. We then iteratively apply the reactions to the resulting
set of products, until there are no more carbonyl functional groups
to reduce (or alcohol or amine functional groups to oxidize).Although the KEGG database of natural metabolic compounds and reactions
contains molecules with up to 135 carbon atoms (a peptidoglycan cell
wall component is the molecule with the highest number of carbon atoms),[44] we limit our network expansion algorithm to
molecules with 21 carbon atoms or less. This threshold is imposed
in consideration of the size of the compounds in the training set;
if experimental data were available for larger compounds, this methodology
could easily be adapted.Table shows the
number and type of reactions that make up the network. It consists
of ∼75 000 compounds connected by ∼315 000
reactions (reductions and oxidations). Of these reactions, 83% convert
carbonyls to alcohols (or alcohols to carbonyls), while 17% convert
carbonyls to amines (or amines to carbonyls). The large majority of
the reactions (more than 80%) are oxidations of alcohols to carbonyls.
Thus, the alcohol functional group is significantly more ubiquitous
in the set of natural metabolites than either carbonyls or amine functional
groups. The large majority of the reactions in the network—97%—stem
from the recursive oxidations of seed natural metabolites,
while only ∼3% come from recursive reductions. This is consistent
with the notion that carbonyl functional groups can cause damage to
macromolecules[45,46] and are thus kept at check in
the cell.We analyzed the structure of the resulting network
by looking at
the distribution of compound sizes. Figure A shows the number of compounds in the networks
as a function of their number of carbon atoms. The distribution shows
three peaks corresponding to molecules with 6n carbon
atoms (n = 1, 2, 3). These peaks reflect a large
number of products that result from combinatorially oxidizing all
alcohol groups in mono-, di-, and trisaccharides (Figure B).
Figure 4
The network of redox
reactions obtained from iterative reductions
and oxidations of natural metabolites. (A) Carbon number distribution
for the full set of ∼70 000 compounds in the network.
The peaks at multiples of 6 carbon atoms reflect the combinatorial
oxidation of mono-, di-, and trisaccharides and their multiple alcohol
functional groups. (B) Schematic diagram of a natural trisaccharide
being iteratively oxidized at all possible alcohol functional groups.
Blue circles depict sites of initial alcohol-to-carbonyl oxidation,
and green circles depict a second oxidation site.
The network of redox
reactions obtained from iterative reductions
and oxidations of natural metabolites. (A) Carbon number distribution
for the full set of ∼70 000 compounds in the network.
The peaks at multiples of 6 carbon atoms reflect the combinatorial
oxidation of mono-, di-, and trisaccharides and their multiple alcohol
functional groups. (B) Schematic diagram of a natural trisaccharide
being iteratively oxidized at all possible alcohol functional groups.
Blue circles depict sites of initial alcohol-to-carbonyl oxidation,
and green circles depict a second oxidation site.
High-Throughput Prediction of Redox Potentials and Elucidation
of Structure–Energy Relationships
Using our GP-calibrated
semiempirical quantum chemistry method, we predicted the standard
transformed redox potentials for the set of ∼315 000
reactions generated with the iterative network expansion algorithm.
The resulting distribution of standard transformed redox potentials
is shown in Figure . We note the use of the millimolar standard state, E′m, instead of the more commonly used E′° (where reactant concentrations are taken as 1 M).
This is a standard state that is commonly used in the field of biochemical
thermodynamics since 1 mM is closer to the relevant physiological
concentrations of metabolites in cells.[4] The resulting distributions for the two reaction categories peak
at very close but slightly different values, with ⟨E′m⟩ ∼ −220 mV for
carbonyl-to-alcohol reductions and ⟨E′m⟩ ∼ −235 mV for carbonyl-to-amine reductions.
Figure 5
Distribution
of predicted standard redox potentials (pH = 7) for
a set of ∼315 000 non-natural redox reactions. Predicted
potentials were obtained by running semiempirical quantum chemical
calculations (Methods) and correcting for
systematic errors using Gaussian process regression. The green distribution
shows the E′m(pH = 7) values for
53 095 reductions of carbonyls to amines. The blue distribution
shows the E′m(pH = 7) values for
261 903 reductions of carbonyls to alcohols.
Distribution
of predicted standard redox potentials (pH = 7) for
a set of ∼315 000 non-natural redox reactions. Predicted
potentials were obtained by running semiempirical quantum chemical
calculations (Methods) and correcting for
systematic errors using Gaussian process regression. The green distribution
shows the E′m(pH = 7) values for
53 095 reductions of carbonyls to amines. The blue distribution
shows the E′m(pH = 7) values for
261 903 reductions of carbonyls to alcohols.Importantly, our vast network of non-natural redox
reactions and
their associated redox potentials are amenable to structure–energy
analyses to elucidate molecular structures that are enriched or depleted
in reactions with high or low potentials. Toward this end, we developed
a novel molecular fingerprint-based structure–energy analysis,
which we term “reaction fingerprint heatmaps”, to detect
such reaction motifs (Figure ). The main idea behind reaction fingerprint heatmaps is that,
by comparing the average reaction fingerprints of
reactions with different redox potential values, one can visually
detect the structural patterns that correlate with energetics. More
specifically, we first divide the distribution of predicted standard
redox potentials into arbitrary discrete bins, in our case 100 bins.
We then find, for each standard redox potential bin, the subset of
reactions with potentials falling within that range and average their
corresponding fingerprint vectors. We then extract structure–energy
trends by looking at the changes in average fingerprints values across
consecutive bins. When these changes are large for a particular bit,
it signifies that a meaningful molecular environment (motif) is enriched
or depleted at that particular bin (range) of redox potential values.
For any particular reaction, we can reverse engineer the reaction
fingerprint to obtain substructures that activate the particular bit.
For example (Figure ), we find that Bit-4 is enriched in reactions with standard potentials
of −160 mV < E′m <
−100 mV. By mapping Bit-4 back to specific reactions, we find
a molecular environment that corresponds to a dicarbonyl within a
heterocyclic ring (see example reaction in Figure ) that is enriched in substrate molecules
within that range of standard potentials. While, across all reactions,
the difference between the substrate and product molecules is the
reduction of a single carbonyl into a hydroxy carbon group, our reaction
fingerprint heatmaps detect the molecular environments dictating the
energetics associated with that transformation.
Figure 6
Elucidating structure–energy
relationships using a reaction
fingerprint heatmap. The top right panel shows the distribution of
predicted standard redox potentials (in the millimolar standard state)
using our GP-calibrated quantum chemical approach. The bottom right
panel shows a reaction fingerprint heatmap, where the y-axis corresponds to the bits in the reaction fingerprint vector
averaged across all reactions falling within a redox potential bin
(range of values). The heatmap highlights bits that have significantly
different values (on average) as a function of redox potential. The
example reactions on the left correspond to each of the significant
reaction fingerprint bits. For example, Bit-4 is a structure enriched
in substrate molecules of reactions with high values of standard redox
potentials.
Elucidating structure–energy
relationships using a reaction
fingerprint heatmap. The top right panel shows the distribution of
predicted standard redox potentials (in the millimolar standard state)
using our GP-calibrated quantum chemical approach. The bottom right
panel shows a reaction fingerprint heatmap, where the y-axis corresponds to the bits in the reaction fingerprint vector
averaged across all reactions falling within a redox potential bin
(range of values). The heatmap highlights bits that have significantly
different values (on average) as a function of redox potential. The
example reactions on the left correspond to each of the significant
reaction fingerprint bits. For example, Bit-4 is a structure enriched
in substrate molecules of reactions with high values of standard redox
potentials.
Discussion
In
this work, we developed a mixed quantum chemistry/machine learning
modeling approach for the prediction of biochemical redox potentials
of carbonyl functional groups. The method is based on calculating
the electronic energy difference between substrates and products in
redox reactions using the semiempirical quantum chemistry method PM7.[34] The raw quantum chemistry estimates are then
calibrated against experiment using GP regression.We demonstrated
that the method has better prediction power than
the commonly used GCM approach. Furthermore, the computational resources
required are significantly lower than previous quantum chemistry strategies
for biochemical thermodynamics. The GCM decomposes compounds into
discrete functional groups and assigns energies to each group by calibrating
via linear regression against experimentally available Gibbs energies.
It then estimates reaction energies as the group energy difference
between products and substrates. As Figure A shows, GCM collapses a large fraction of
the redox potentials of carbonyl compounds into a few sets of values.
We also note that, in a recent variant of the GCM, group energies
can be combined with experimental reactant Gibbs formation energies
to significantly increase the accuracy of predictions, resulting in
what is known as the CCM.[18] Since we seek
to apply our modeling approach to predicting redox potentials of non-natural
reactions, the CCM is not applicable to the vast majority of the reactions
in our data set. However, we also found that the GP regression technique
can be used to further calibrate both GCM and fingerprint-based predictions
(without relying on quantum chemical calculations), resulting in significant
improvements to the prediction accuracies of these two approaches.
Thus, the decision on what modeling choice to use can be based on
a balancing act between accuracy, computational cost, and the preferred
underlying modeling baseline. For example, if one desires an accurate,
computationally cheap approach that relies on an underlying electronic
structure method, PM7+GP is an optimal choice. Alternatively, for
more accurate results that rely on more expensive electronic structure
theory, DSDPEP89 (based on a double-hybrid DFT functional) gives the
best accuracy but at higher computational cost. Likewise, if a good
baseline for your chemical application is GCM, at a very low cost
and almost comparable accuracy GCM+GP will be a good modeling choice.
GP calibration proves to be a versatile approach to improve the accuracy
of a wide variety of prediction methods.We considered model
chemistries that rely on a single conformation
per molecule and also ensembles of conformers. In our case we did
not observe a huge difference in predicted accuracy between both approaches.
Other chemical applications and data sets should be tested to see
if performance is impacted by conformer diversity. A single missed
hydrogen bond in one reacting partner, but not the other, could easily
introduce large errors in the predicted redox potential.Current
work on GP calibration in computational chemistry relies
on molecular fingerprints to compute the similarity
between molecules of interest and compounds with experimental data.
Our implementation is to our knowledge the first application of GP
to reaction energy predictions using reaction fingerprints. We chose
to use GP—as opposed to other machine learning techniques such
as neural networks—on the basis of the amount of experimental
data available. GPs are suited for this task since they grow in complexity
according to the data; a GP is robust to overfitting since it penalizes
complex models and is able to provide estimates of uncertainty on
predictions.[32,47,48] In contrast, techniques such as deep neural networks—which
have gained traction in recent machine learning applications to quantum
chemistry[49−51]—are only applicable when much larger data
sets are available.We focused our approach to the prediction
of the standard redox
potentials of carbonyl functional groups (i.e., reduction to alcohols
or amines). The application of GP regression to other redox transformations—including
the reduction of carboxylic acids to aldehydes and the reduction of
alcohols to hydrocarbons—is limited by the availability of
experimental data. This highlights the importance of a concerted effort
to generate more experimental data for biochemical redox potentials
in controlled conditions (such as pH, ionic strength, temperature,
and buffer) to increase the predictive power and scope of calibrated
quantum chemical approaches.We demonstrated the high-throughput
nature of the methodology by
generating a network of more than 315 000 non-natural biochemical
reactions involving ∼70 000 compounds. Our algorithm
is different from other network expansion algorithms previously applied
to metabolic networks. By solely focusing on redox biochemistry, iterative
application of the reactions converges to molecules that cannot be
further reduced or oxidized. This contrasts with other network expansion
algorithms[52,53] used to investigate scenarios
related to the origins of life, where several types of chemical transformations
are considered, including carbon-bond formation and cleavage. Other
network expansion algorithms also start from natural seed compounds
but only consider natural metabolites.[12] In contrast, our algorithm uses natural metabolites as seeds but
results in the expansion to a vast network of non-natural compounds.Despite the improvements obtained over GCM with our method, the
variance in the prediction accuracies across individual reactions
is still significant. In practice, this spread places a limit on the
confidence with which one can predict the redox potential of an individual
reaction of interest. This is true, and to a larger degree, for other
methods currently widely used, such as GCM. However, the main value
of developing a method with lower average prediction error lies in
our ability to analyze large sets of compounds, where average trends
in the energies of different molecular substructures can be captured.
We thus developed a novel structure–energy relationship analysis
framework which we term reaction fingerprint heatmaps. These allow
us to detect molecular environments that are enriched or depleted
in reactions that fall within specific regions (bins) of the redox
potential distribution. This could potentially be useful for future
metabolic engineering applications: being able to pinpoint the exact
molecular environments that correlate with redox potentials would
allow the fine-tuning of the energetics of synthetic metabolites.
We note that this approach captures large-scale structure–energy
trends. An alternative approach, which is able to associate weights
to individual atoms or subfragments, are the similarity maps as introduced
by Riniker et al.[54]We envision several
applications of our biochemical redox potential
prediction methodology. One such application is studying the thermodynamic
landscape of specific families of natural metabolic compounds that
undergo combinatorial reductions (or oxidations) of carbonyl (or alcohol
or amine) functional groups. One such family of compounds are the
brassinosteroids and the oxylipins,[55−57] structurally diverse
plant metabolites that play important roles in many physiological
processes. In addition, in the context of drug metabolism, our methodology
could be applied to obtain quantitative insights into the thermodynamics
of redox transformations such as those mediated by the P450 (CYP)
superfamily of enzymes.[58]To our
knowledge, our database contains the largest set of geometric
structures, electronic energies, pKa estimates,
and redox potentials of natural metabolites and compounds related
to these through oxidoreductive transformations. We make all the code
and data sets generated in this work—including metabolite 3D
geometries, electronic energies, pKa’s,
and redox potential estimates—available as an open source repository
at https://github.com/aspuru-guzik-group/gp_redox_rxn.
Methods
Generation
of Most Abundant (Major) Protonation States at pH
= 0
Our pipeline is based on performing electronic structure
simulations of the most protonated species of each biochemical compound
involved in every redox reaction of interest. We run the quantum chemistry
simulations on the major protonation states at pH = 0 to avoid errors
associated with the prediction of anionic compound energies.[21,59,60] We use the ChemAxon calculator
plugin (Marvin 17.7.0, 2017, ChemAxon) cxcalc majormicrospecies to generate, for each substrate and product involved in a biochemical
redox reaction of interest, the major microspecies at pH = 0.
Geometry
Optimization
Using the SMILES string representation[43] for each compound in the reaction network, we
obtain the string corresponding to the most abundant (major) microspecies
at pH = 0 using ChemAxon. This smiles string is then converted to
an RDKit “mol” object. From here we obtain 3D structures
based on two decisions. (1) Whenever we want a single 3D geometry,
we generate an initial conformer with RDKit (using the EmbedMolecule
command, which uses distance geometry to obtain initial coordinates
for a molecule), and we optimize this conformer’s geometry
Universal Force Field (UFF) as implemented in RDKit.[61] (2) If we want several conformers (to Boltzmann average
their energies), we generate around 1000 random “smart”
conformers via the ETKDG force field,[62] calculate their energies, and pick between 5 and 20 conformers (based
on size) and on clustering of energies (using the Butina clustering
algorithm). Depending on the baseline model chemistry, either we further
refine the geometry of the PM7[34] semiempirical
method with the COSMO implicit solvation model[42] and then a quantum chemistry approach (PBE-3H or B3LYP),
or we simply use the base PM7 or UFF geometry and perform single-point
energy (SPE) calculations.
Single-Point Energy Estimates
We
then compute the electronic
energy Eelectronic of every compound involved
in a given redox reaction using the PM7 semiempirical method[34] with the COSMO implicit solvation model[42] (see below for other single-point energy SPE
model chemistries tested). For both reductions of carbonyl groups
to alcohols and amines, we add a hydrogen molecule to the substrate
side of the redox reaction (in the direction of reduction) and compute
its electronic energy. In addition, to balance reductions of carbonyls
to amine functional groups, we add an ammonia molecule to the substrate
side of the reaction and a water molecule to the product side of the
reaction (in the direction of reduction) and compute their electronic
energies.We use the difference in the electronic energies of
products and substrates, ΔEelectronic, as an estimate of the standard redox potential E°(major species, MS, at pH = 0):where n is the number of
electrons (2 for all reactions considered here), and F is Faraday’s constant.Our approach ignores the contribution
of ro–vibrational
enthalpies and entropies to Gibbs reaction energies. This significantly
reduces the computational cost associated with quantum chemical simulations,
and we correct for the systematic errors introduced by this approximation
through the Gaussian process regression (see below). We note that
previous work has shown that the value of ΔEelectronic obtained for these types of redox reactions
strongly correlates linearly with the full ΔGr° prediction obtained from including ro–vibrational
enthalpic and entropic contribution (data not shown).
Transforming
Chemical Potentials E°(Major
Species at pH = 0) to Biochemical Potentials, E′°(pH
= 7)
Having estimated the standard redox potential E°(major species at pH = 0), we use cheminformatic
pKa estimates of reactants and the Alberty–Legendre
transform[41] (p 67, eqs 4.4–10) to
convert E°(major species at pH = 0) to the transformed standard redox potential, E′°(pH = 7), which is a function of pH.[63] To estimate the pKa’s
of all reactants in every redox reaction of interest, we use the ChemAxon
calculator plugin (Marvin 17.7.0, 2017, ChemAxon) cxcalc pka. Internally, the pKa calculator plugin
is based on the calculation of the partial charge of atoms in the
molecule.[64,65]
Gaussian Process Calibration Using Reaction
Fingerprints
We calibrate biochemical redox potentials E′°(pH
= 7) obtained from quantum chemical simulations against available
experimental data to correct for systematic errors in our simulations
and the cheminformatic pKa estimates.
One simple strategy of calibration of energy values would be to use
a two-parameter linear regression. We note that our group has successfully
employed this in the context of compounds for redox flow battery applications.[66−71] Here, we make use of the information provided by the difference
between the structural similarity of products and substrates and utilize
a GP regression approach with reaction fingerprints to calibrate redox
potential energies. For our calibrations with Gaussian process regression
we used GPMol, a Python library based on GPflow,[72] which in turn is a package for building Gaussian process
models using TensorFlow.GP regression relies on a similarity
or distance metric between data points. To construct the notion of
similarity between reactions, we utilize reaction fingerprints.[35] Several possible choices of kernel function
and distance measure between molecules and reactions exist. Generally,
the distance measures between molecules and reactions make use of
fingerprint representations of compounds, which encode the structure
of a molecule in a binary vector form.[73] Although several varieties of reaction fingerprints exist, we tested
our approach on the Morgan2 and AP3 (atom pair) reaction FPs of Schneider
et al.[35] and obtain them from the difference
of fingerprint vectors of products and substrates. We find that in
the context of Gaussian processes for biochemical redox potential
predictions Morgan fingerprints outperform AP3 fingerprints by approximately
5–4 mV. The Gaussian processes’ predictive power comes
from learning relationships between relative distances in feature
space, so it might be the case that for distance-based ML algorithms
Morgan fingerprints are better. However, we note that we did not incorporate
the usage of any agent to the weighting scheme, so doing so might
change the results. Additional performance metrics for comparison
can be seen in Table S1.One key
ability of Morgan fingerprints is that we can invert the
bits back to a corresponding molecular substructure to perform our
structure–energy analysis. For each molecule, we generate a
counted Morgan fingerprint,[73] a fixed-length
counted vector indicating the absence (zeros) or counted presence
(positive) of a particular graph–connectivity–environment.
Each environment captures the local topological information on a molecule
by mapping the local vicinity of connected atoms, along with their
formal charges, type of chemical bond, and its position relative to
a cyclic structure. Our Morgan fingerprints are constructed from 2048
length-vectors using a radius of 6 and are reduced to a lower dimension
(∼200) using a gradient boosted tree. We find that the prediction
accuracy did not improve significantly upon using Morgan fingerprints
with a larger radius. When we consider the difference of two Morgan
fingerprints, we are looking at the subtraction and addition of molecular
environments between two molecules. We expect that similar molecular
transformations will have similar changes in molecular environments.
When these vectors are normalized, the inner product between two reaction
fingerprints x1 and x2 is the cosine distance,[74] which is a measure of similarity between two vectors. This notion
is adapted in the field of Gaussian processes to quantify the similarity
between reactions in GP regression.Predictions obtained from
GPs can be interpreted as weighted averages
of the training data, where the weights are probabilistic in nature.
GPs are distinct because of their associated covariance functions
(i.e., the kernel).[33] The equation below
shows that the kernel we employ is a mixture of a squared exponential
kernel and a noise kernel, which increases the robustness of the model.Here, K is the kernel function, σrxn is the variance of the reaction fingerprint
kernel, l is the length scale parameter, and σnoise is the white noise variance parameter.
Database of Experimental Redox Potentials for Calibration with
Gaussian Process Regression
To perform the calibration, a
data set of available experimental transformed standard redox potentials
is used. This data set of redox half-reactions and their associated
transformed standard reduction potentials E′°(pH
= 7) was generated by Bar-Even et al.,[4] by compiling experimental equilibrium constants (i.e., standard
Gibbs reaction energies) from the NIST database of Thermodynamics
of Enzyme-Catalyzed Reactions (TECRDB)[40] and the Robert Alberty database of biochemical compound Gibbs formation
energies.[41] The data set consists of 57
redox reactions that reduce a carbonyl functional group into an alcohol
functional group and 24 redox reactions that reduce a carbonyl functional
group into an amine group.To ensure we are not overfitting
in the process of calibrating with GP regression, we used leave-one-out
cross-validation (LOOCV), where we train our model on all data points
except one point and predict its value. This is done for all 81 data
points, and reported predictions, accuracies, and resulting scatter
plots always come from untrained data. It is also important to note
that GPs are inherently robust to overfitting since the training procedure
penalizes more complex models (higher-rank kernels) via an objective
function.[32]
Selecting a Model Chemistry
with Low Computational Cost and
High Prediction Accuracy
To obtain fast and accurate predictions
of redox potentials using the GP regression calibrated quantum chemistry
strategy described above, we tested several different quantum model
chemistries. A model chemistry consists of a combination of a geometry
optimization (GO) procedure, a method to calculate the single-point
electronic energies (SPEs) of optimized molecular geometries, as well
as other modeling considerations, such as the number of geometric
conformations per compound, the water solvation model, and the pKa estimation strategy.We explored the
following set of methods to obtain optimized geometric conformations:
the Universal Force Field (UFF),[61] PM7,[34] DFTB,[75] HF3-c,[76] and PBEh-3c.[77] Additionally,
to compute the electronic energies of the optimized structures through
single-point energy (SPE) calculations, we considered the following
approaches: DFTB, PM7, HF3-c, PBEh-3c, DSD-PBEP86/SVP,[78] and DLPNO–CCD(T)/SVP.[79] Given computational resource constraints, only a subset
of all possible combinations of geometry optimization and SPE procedures
were explored.To select a model chemistry from the exploration
set, we compared
their prediction accuracies—after the GP regression calibration
step described above—when tested on the set of experimental
redox potentials. We quantified prediction accuracy using three different
metrics: Pearson correlation coefficient (r), the
coefficient of determination (R2), and
mean absolute error (MAE, in mV). Before applying the GP regression
calibration step, the double-hybrid functional and linear-scaling
couple-clustered methods DSD-PBEP86 and DLPNO–CCD(T) resulted
in the highest accuracy. Due to the relatively small variation in
prediction accuracy, we picked the cheapest method that still gave
reasonable accuracy (MAE < 30 meV), which turned out to be the
PM7 semiempirical method for both geometry optimizations and single-point
electronic energies. It should be noted that our geometry is a single,
UFF optimized conformation per molecule. For PBEh-3c and HF-3c we
also considered utilizing multiple conformers obtained with the ETKDG
force field[62] with significant improvement
in accuracy over a single conformer (see Table S1). For other chemical applications (e.g., redox flow batteries)
and molecular data sets, adequately accounting for multiple conformers
might have an important influence on the accuracy of the calculations.
A single missed hydrogen bond in one reacting partner, but not the
other, could easily introduce large errors in the predicted redox
potential. As such the adequacy of using just a single geometry should
be verified.To test for statistically significant differences
of the prediction
accuracy summary statistics of different model chemistries (e.g.,
GCM versus GP-calibrated PM7), we performed a nonparametric bootstrap
hypothesis test: we subsampled data points from the table of redox
reactions with experimental data and computed the prediction accuracy
summary statistics [Pearson r, R2, MAE, root-mean-square error (RMSE)] for each subsample.
We then count the fraction of subsampled GCM data sets that result
in summary statistics that are greater than or equal in value to the
summary statistics of the model chemistry we are comparing against
(e.g., GP-calibrated PM7).
Group Contribution Method (GCM) Estimates
of Redox Potentials
We compared the accuracies obtained using
the GP regression calibrated
quantum chemistry approach described above against those obtained
with the commonly used GCM.[14,18,80] We use the GCM implementation of Noor et al. which was adapted to
the thermodynamics of biochemical reactions.[18] Briefly, group energies of all compounds in the KEGG database are
stored in a group matrix, with rows corresponding to compounds and
columns corresponding to groups. Associated to each group is a group
energy corresponding to the energy for the group’s major protonation
state (major species) at pH = 7. To obtain a GCM-based estimate for
a standard transformed reduction potential for the major species at
pH = 7, we take the difference in group energy vectors for all products
and substrate in the reaction. We then transform the resulting standard
redox potentials E°(major species at pH = 7)
to the standard transformed redox potential at pH
= 7, E′°(pH = 7), using the same pKa estimates and Alberty transform approach described
above.
Redox Network Expansion Algorithm with Cheminformatic Reaction
Strings
We implemented a network expansion algorithm to iteratively
reduce carbonyl functional groups and oxidized alcohol/amine functional
groups for a subset of the natural metabolites of the KEGG database.
Due to constraints in available computational resources to run electronic
structure calculations, we used the set of all metabolic compounds
in the KEGG database with 21 carbon atoms or fewer as seed metabolites.
Using the RDKit open source cheminformatics software, we identify
carbonyl functional groups (alcohol/amine functional groups) in each
of the seed KEGG metabolites and transform them through a redox reaction
to the corresponding reduced (oxidized) alcohol/amine (carbonyl).
In practice, this is done using SMARTS reaction strings as implemented
in RDKit and iteratively applying them to SMILES representations[43] of the KEGG metabolites. The SMARTS reaction
strings for the reduction and oxidation reactions considered here
are shown in Tables S2 and S3.We
iteratively apply the reduction/oxidation reaction transformations
to the compounds generated from KEGG seed metabolites. We terminate
the iterations when the resulting product has, in the case of reductions,
no more carbonyl functional groups that can be reduced (or in the
case of oxidative transformations, no more alcohol/amine functional
groups that can be oxidized to carbonyls). All software was written
using the Python programming language.
Authors: Rajendra P Joshi; Andrew McNaughton; Dennis G Thomas; Christopher S Henry; Shane R Canon; Lee Ann McCue; Neeraj Kumar Journal: ACS Omega Date: 2021-03-25