Kazu Osaki1, Toru Ekimoto1, Tsutomu Yamane2, Mitsunori Ikeguchi1,2. 1. Graduate School of Medical Life Science, Yokohama City University, 1-7-29 Suehiro-cho, Tsurumi-ku, Yokohama 230-0045, Japan. 2. Center for Computational Science, RIKEN, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama 230-0045, Japan.
Abstract
Hydration free energy (HFE) is a key factor in improving protein-ligand binding free energy (BFE) prediction accuracy. The HFE itself can be calculated using the three-dimensional reference interaction model (3D-RISM); however, the BFE predictions solely evaluated using 3D-RISM are not correlated to the experimental BFE for abundant protein-ligand pairs. In this study, to predict the BFE for multiple sets of protein-ligand pairs, we propose a machine learning approach incorporating the HFEs obtained using 3D-RISM, termed 3D-RISM-AI. In the learning process, structural metrics, intra-/intermolecular energies, and HFEs obtained via 3D-RISM of ∼4000 complexes in the PDBbind database (ver. 2018) were used. The BFEs predicted using 3D-RISM-AI were well correlated to the experimental data (Pearson's correlation coefficient of 0.80 and root-mean-square error of 1.91 kcal/mol). As important factors for the prediction, the difference in the solvent accessible surface area between the bound and unbound structures and the hydration properties of the ligands were detected during the learning process.
Hydration free energy (HFE) is a key factor in improving protein-ligand binding free energy (BFE) prediction accuracy. The HFE itself can be calculated using the three-dimensional reference interaction model (3D-RISM); however, the BFE predictions solely evaluated using 3D-RISM are not correlated to the experimental BFE for abundant protein-ligand pairs. In this study, to predict the BFE for multiple sets of protein-ligand pairs, we propose a machine learning approach incorporating the HFEs obtained using 3D-RISM, termed 3D-RISM-AI. In the learning process, structural metrics, intra-/intermolecular energies, and HFEs obtained via 3D-RISM of ∼4000 complexes in the PDBbind database (ver. 2018) were used. The BFEs predicted using 3D-RISM-AI were well correlated to the experimental data (Pearson's correlation coefficient of 0.80 and root-mean-square error of 1.91 kcal/mol). As important factors for the prediction, the difference in the solvent accessible surface area between the bound and unbound structures and the hydration properties of the ligands were detected during the learning process.
Developing accurate and efficient methods
for predicting the binding
affinity of ligands to target proteins is required in computer-aided
drug discovery.[1] The binding affinity experimentally
evaluated using the half-maximal inhibitory concentration (IC50) or
the dissociation constant (Kd) can be
converted to binding free energy (BFE).[2,3] Currently,
BFE calculations based on atomic structures are widely performed in
pharmaceutical processes daily. For example, in in silico screening processes, to rank many ligands in terms of affinity,
BFEs of the ligands are calculated quickly. Thus, empirical scores
of docking simulations (e.g., Glide[4]) are
frequently used. After the screening process, ligand optimization
processes are conducted to increase the ligand activity. Therefore,
the BFEs of a few tens of ligands must be calculated accurately to
examine the structure–activity relationships. Thus, free energy
calculations based on molecular dynamics (MD) simulations are useful.[1] To improve the accuracy of BFE calculations,
the accurate treatment of the hydration effects is a key factor because
many water molecules are involved in the ligand-binding process. Upon
ligand binding, water molecules in the binding site are replaced by
the ligand and are forced to rearrange in the bound state.[5] The hydration water molecules around the ligand
in the isolated state are dispelled upon ligand binding. In MD simulations
using an explicit solvent, the hydration effects can be considered
explicitly. However, MD simulations have high computational demands,
leading to a loss in computational efficiency. Therefore, precisely
handling the hydration effects is a trade-off between accuracy and
efficiency.Various computational methods for BFE calculation
based on physicochemical
approaches[1,4,6−22] using the thermodynamic cycle (Figure ) or data-driven approaches using machine
learning[23−31] have been proposed to date. The required computational burdens of
physicochemical approaches differ depending on the incorporation methods
of the hydration effects. In most physicochemical approaches, the
interaction energy between a protein and a ligand can be calculated
using force fields, whereas the hydration effects are handled differently.
For example, in docking simulations with abundant protein–ligand
pairs, the receptor structure is often fixed, and the hydration effect
is implicitly and approximately evaluated using empirical score functions
(e.g., Glide score[4]) to quickly evaluate
BFEs. In molecular mechanics (MM) and Poisson–Boltzmann surface
area (PB/SA) or MM and generalized Born surface area (GB/SA) methods,
the interaction energy between a protein and a ligand and the configurational
entropies are calculated using MM, where the hydration free energies
(HFEs) of the protein, ligand, and complex are approximately calculated
using a continuous dielectric model (i.e., implicit solvent model).[9,10] In continuous dielectric models, the detailed molecular features
of water molecules such as hydrogen bonding, the hydrophobic effect,
and the rearrangement of the water molecules upon solute insertion
are missing. Recently, instead of continuum models, a solution statistical-mechanics
theory-based method, called the three-dimensional reference interaction
model (3D-RISM), was combined with MM.[11−16] The 3D-RISM method can evaluate HFEs, preserving the molecular features.[34−37] The 3D-RISM method provides a thermally averaged distribution of
solvent molecules around solute molecules, and HFE can be calculated
through an integral equation using the distribution function. The
MM/3D-RISM method can successfully predict the BFE of a group of ligands
with different activities for the same protein. The 3D-RISM method
has also been applied to the statistical analysis of hydration water
based on a large number of 3D-RISM calculations,[38] development of an efficient method to calculate HFEs,[39] or binding-site searches by extending the theory
to the distribution of atoms constituting the ligand.[40] The 3D-RISM method, combined with MM or extended protocols
described above, improves the prediction of HFEs.[11−16,38−40]
Figure 1
Thermodynamic cycle of
the binding free energy. The protein (gray)
and ligand (green) are shown under the upper and lower conditions
representing vacuum and solution, respectively. The binding free energy
(ΔGbind) is obtained from the binding
free energy in vacuum (ΔGbindvacuum) and the hydration free
energies for the protein (Δμprotein), ligand
(Δμligand), and complex (Δμcomplex).
Thermodynamic cycle of
the binding free energy. The protein (gray)
and ligand (green) are shown under the upper and lower conditions
representing vacuum and solution, respectively. The binding free energy
(ΔGbind) is obtained from the binding
free energy in vacuum (ΔGbindvacuum) and the hydration free
energies for the protein (Δμprotein), ligand
(Δμligand), and complex (Δμcomplex).As a more rigorous approach, free energy calculations
using all-atom
MD simulations have been used to obtain the BFE between the bound
and unbound states, such as the alchemical approach,[17] the potential of mean force (PMF)-based approach,[18] free energy perturbation (FEP)+,[19] generalized replica exchange with solute tempering
(gREST)+FEP,[20] pmx,[21] and massively parallel computation of absolute BFE with
well-equilibrated states (MP-CAFEE).[22] The
free energy calculations have been applied to the relative or absolute
BFE calculations, and they exhibit a good correlation with experimental
BFEs. However, because of the high computational demands of MD simulations,
free energy calculations for abundant protein–ligand pairs
are practically difficult.In contrast to the physicochemical
approach described above, machine
learning approaches predict BFEs by learning the correlation between
experimental BFE data and input features, such as structural metrics
and scoring functions.[23−31] Recently, a large amount of experimental BFE data and crystal structures
of the protein–ligand complex have been stored in databases
(e.g., the PDBbind database[32,33]), and machine learning
models using the data for BFE prediction, for example, KDEEP,[24] XGB-Score,[25] and SFCscoreRF[26], have been
proposed. As the input features, the descriptors given by structural
metrics, such as atomic coordinates, distances between atoms, and
amino acid sequences, or energetic metrics, such as scoring functions,
were employed,[23] in which the relationships
between the descriptors and the experimental BFE data were learned
and a regression model was built. The selection of the appropriate
features describing the experimental BFE data is one of the key points
for accuracy.[23] Therefore, as descriptors,
incorporating well-defined physicochemical quantities related to BFE,
as well as conventional structural metrics, has the potential for
accurate predictions of abundant protein–ligand pairs. Thus,
incorporating hydration effects as an input feature may improve the
accuracy of BFE predictions.Herein, we propose a machine learning
approach combined with the
3D-RISM method for BFE predictions. First, we calculated the BFEs
for 3993 protein–ligand pairs in the PDBbind database using
MM/3D-RISM (Figure ). However, the BFEs evaluated using the 3D-RISM method exhibited
a poor correlation with the experimental BFEs. Although the MM/3D-RISM
strategy predicted the BFEs of a few ligands with a similar scaffold
for the same protein, the strategy failed with multiple types of protein–ligand
pairs. We also attempted to apply the improved version of 3D-RISM
to BFE calculations. According to Palmer et al., the HFEs for 185
neutral small molecules calculated using the 3D-RISM method deviated
from the experimental HFEs, and the difference was proportional to
the partial molar volume.[41] Therefore,
they proposed a universal correction, in which the contribution of
the partial molar volume was simply subtracted from the HFE obtained
using the 3D-RISM method. However, in our calculation of the BFEs,
their correction did not improve the correlation between the calculated
and experimental BFEs. Therefore, we developed a machine learning
approach using thermodynamic quantities obtained from the 3D-RISM
method as principal descriptors, termed 3D-RISM-AI. By introducing
a machine learning algorithm for regression, we aimed to predict BFEs
for abundant protein–ligand pairs, which cannot be expressed
simply by energy addition and subtraction operations in the thermodynamic
cycle. Regression models were constructed from the structural features
and thermodynamic quantities calculated using 3D-RISM for the 3993
protein–ligand pairs in the PDBbind database. Four machine
learning algorithms were used for the regression, and their performance
was verified. The best-performing learning model exhibited a high
correlation between the predicted and experimental BFEs: Pearson’s
correlation coefficient (R) of 0.80, Spearman’s
rank correlation coefficient (ρ) of 0.77, and root-mean- square
error (RMSE) of 1.91 kcal/mol. Although the performance of 3D-RISM-AI
is comparable to that of other machine learning models (R = 0.753–0.806, ρ = 0.647–0.796, and RMSE = 1.80–2.22
kcal/mol), the advantage of 3D-RISM-AI is that the feature importance
analysis allows us to determine thermodynamic quantities that are
effective in the BFE predictions.
Theoretical Background
The BFE (ΔGbind) between a protein
and a ligand is defined aswhere Gcomplex, Gprotein, and Gligand are the free energies of the complex, protein, and ligand,
respectively. The free energy is decomposed into the following three
terms: the internal energy of the solute molecule (E), configurational entropy of the solute molecule (S), and HFE (Δμ), as follows:where T is the temperature
and X is one of the complexes, proteins, or ligands.
According to the thermodynamic cycle of BFE (Figure ), ΔGbind can be obtained by the sum of the BFE in vacuum (ΔGbindvacuum) and the difference in the HFE (ΔΔμ) as follows:Here, we focus on one conformation of the
solute complex (e.g., the crystal structure) and suppose that the
conformations of both protein and ligand do not change from those
in the complex, and the contributions of the entropy term can be neglected.
Under this assumption, the terms on the right-hand side of eq areThe internal energies in eq can be calculated using a force field.The HFE (eq ) is
obtained using the 3D-RISM method.[34−37] On the basis of the statistical
solution theory, thermodynamic quantities are obtained through the
pair distribution function, g(r),
which can be obtained by solving the Ornstein–Zernike (OZ)
integral equation as a function of the total correlation function, h(r) = g(r) – 1, together with incorporating closure approximations.
In the case of molecular liquids, the OZ equation includes degrees
of freedom for both configuration and rotation, and it is difficult
to solve such high-dimensional equations for complicated molecules.
In contrast, using the RISM approximation, the molecules are described
as a set of atom sites corresponding to atom types, and the molecular
OZ equation can be approximately rewritten as the equation of site–site
distance, which is called 1D-RISM.[42,43] Because the
1D-RISM approach uses the spherically symmetric site–site correlation
function, the three-dimensional distribution of the solvent molecules
around the solute molecule cannot be described. Therefore, an extension
of RISM to three dimensions, called 3D-RISM, was introduced. In 3D-RISM,
the total correlation functions of the solute and solvent sites are
obtained using the 3D-RISM equation:where hα(r) is the total correlation function of the solute
site and the solvent site of the atom type for α, cξ(r′) is the direct correlation
function of the solvent atom type for ξ, and χ (|-’ |) is the susceptibility function
of the solvent sites in the bulk solvent given by eq . In eq , ωξα(r) is the intramolecular correlation function of the solvent
molecule, ρ is the site-number
density for atom type α in the bulk solvent, and hξα(r) is the total correlation
function of the intramolecular sites calculated from the 1D-RISM.
To solve eq , the Kovalenko–Hirata
(KH) closure, which is an approximate relationship between the total
and direct correlation functions, is introduced.[37] The KH closure is given bywhere β is 1/kBT, kB is the
Boltzmann constant, and uα(r) is the solute–solvent site potential calculated
from the force field. Using eqs –9, HFE is calculated as follows:where Θ is the Heaviside step function.In the supervised machine learning, the relationship between objective
and explanatory variables is expressed as a regression model.[23] Using the training data set comprising the experimental
BFE values for the n protein–ligand pairs
as the objective variable and the thermodynamic quantities and the
structural indices calculated from each pair as the explanatory valuables,
a regression model is given bywhere ΔGbind_exp( is the experimental BFE for the nth protein–ligand
pair, {x(, y(, ..., z(} represents the input feature vector comprising
multiple descriptors calculated from the nth pair,
θ is the hyperparameter, and the function f is determined by optimizing the objective function L asUsing the training regression model f, the BFE for the new (n + 1)th pair,
which is not included in the training data set, can be predicted asThe settings of function f, the hyperparameters, and the objective function to be optimized
depend on the regression algorithm. In this study, four types of regression
algorithms were employed: ridge regression (RR),[44] support vector regression (SVR),[45] random forest regression (RFR),[46] and
extreme gradient boosting regression (XGBR).[47] RR and SVR use a linear function with regularization term or the
support vector and kernel trick, and RFR and XGBR use the decision
tree for the regression process. The machine learning approach is
expected to provide functions of BFEs that cannot be expressed by
a simple linear operation of energies (eqs –5 and 10), as defined by the thermodynamic cycle, because
the function given by the regression algorithm is represented by a
nonlinear optimized function depending on the data set and explanatory
variables used.
Methods
Data Sets
The experimental BFE values and the crystal
structures for the protein–ligand complex in the PDBbind database
(ver. 2018) were used.[32,33] In the PDBbind database, from
16151, the refined set (ver. 2018, the number of complexes: n = 4463) was defined according to the quality of the crystal
structures.[48] In addition, the core set
(ver. 2016, n = 285) was selected from the refined
set, 57 clusters were determined using amino acid sequence similarities,
and five representative proteins were selected from each cluster.
Crystal structures in the refined set containing unnatural amino acids
or atomic collisions that could not be treated in the computational
preprocess were excluded from the calculations. Finally, we used 3933
complexes in the refined set, including 217 complexes as the core
set. In the machine learning process, 217 complexes for the core set
were used as test data and 3716 complexes (the refined set except
for the core set) were used as training data. The complexes used as
training data did not contain complexes in the test data.
3D-RISM Calculations
Before the 3D-RISM calculations,
structural modeling was performed on 3993 crystal structures. All
crystal water molecules were removed. With multiple protein complexes,
a protein located within 10 Å of the ligand was used. For missing
residues, the acetyl cap and N-methyl cap were added
to the missing N- and C-termini, respectively. The protonation states
of histidine were determined using the ProtAssign module implemented
in Maestro.[49] Energy minimizations for
the modeling structures were performed using the GB solvent model
implemented in Amber18 and AmberTools19.[50] The force fields for proteins and ligands were ff14SB and the generalized
amber force field (GAFF), respectively.[51,52] The partial
charge of the ligand was determined using the AM1-BCC method.[53,54] Two-step minimizations were performed: a 250-step steepest descent
method and a 250-step conjugate gradient method. In the energy minimizations,
position constraints with a force constant of 5 kcal/mol were added
to the heavy atoms. The cutoff distance of the long-range interactions
was set to 12 Å.3D-RISM calculations were performed using
the rism3d.snglpnt command in AmberTools19.[50] The complex structure after the energy minimizations was used as
the input structure. For closure, the KH closure was used.[37] For 1D-RISM, the dielectrically consistent reference
interaction site model (DRISM) theory was used.[36] The water model was the SPC/E model,[55] the water density was set to 0.999 g/cm3, and
the temperature was 298 K. The buffer distance between the solute
molecule and the boundary in the calculation box was set to 20 Å,
and the grid spacing on the three-dimensional grid was set to 0.5
× 0.5 × 0.5 Å3. The other parameters were
set to default values.
Descriptors Used in 3D-RISM-AI
Thirteen thermodynamic
quantities described below were calculated from complex structures.
Using the Amber force field and 3D-RISM method, the internal energy
(E, eq ) and HFE (Δμ, eqs and 10) were calculated for the complex,
protein, and ligand, as well as BFE (eq ). The protein and ligand structures were extracted
from the complex structures. From the 3D-RISM calculations, thermodynamic
quantities related to the HFE, such as the partial molar volume (V), enthalpy term (ε), and entropy term (−TS), were calculated. In addition, the Δμ, V, ε, and TS terms were decomposed into polar and
apolar contributions. Consequently, 13 thermodynamic quantities (E, Δμ, Δμapolar, Δμpolar, V, Vapolar, Vpolar, −TS, −TSapolar, −TSpolar, ε, εapolar, and εpolar) were obtained.The structural indices were also
calculated. Because the BFEs are correlated to the solvent accessible
surface area (SASA),[1] the SASA values for
the complex, protein, and ligand were calculated using CppTraj in
AmberTools19.[50] To incorporate the conformational
entropy of the ligand, the number of rotatable bonds was calculated
using Rdkit,[56] because the entropy is
correlated to the rotatable bonds.[57] In
addition, the difference in each quantity between the complex and
the isolated protein or ligand was calculated as XBind = Xcomplex – Xprotein – Xligand. Hereinafter, the difference was denoted as “Bind”
and the Bind descriptors were calculated for the 13 thermodynamic
quantities and SASA.In summary, 58 descriptors were used as
input features in the 3D-RISM-AI.
The descriptors comprised four types (i.e., complex, protein, ligand,
and Bind) of the 13 thermodynamic quantities and SASA, the number
of rotatable bonds of the ligand, and BFE. The overall procedure and
summary of the descriptors are shown in Figure . Each descriptor value was standardized
to a mean of zero and a variance of one. All training data are available
on GitHub (https://github.com/IkeguchiLab/3D-RISM-AI).
Figure 2
Overall procedure of 3D-RISM-AI. Quantities of the complex, protein,
ligand, and binding are denoted as Com_A, Prt_A, Lig_A, and Bind_A,
respectively. The descriptors denoted as A are the
internal energy (E), HFE (Δμ), partial
molar volume (V), enthalpy term of the HFE (ε),
entropy term of the HFE (−TS), and solvent
accessible surface area (SASA). The index i represents
the total value and the polar and apolar decomposed values. The number
of rotatable bonds of the ligand is Lig_Nrotatable. The
BFE is Bind_Gtheory. The regression algorithms used are
the extreme gradient boosting regression (XGBR), random forest regression
(RFR), support vector regression (SVR), and ridge regression (RR).
Overall procedure of 3D-RISM-AI. Quantities of the complex, protein,
ligand, and binding are denoted as Com_A, Prt_A, Lig_A, and Bind_A,
respectively. The descriptors denoted as A are the
internal energy (E), HFE (Δμ), partial
molar volume (V), enthalpy term of the HFE (ε),
entropy term of the HFE (−TS), and solvent
accessible surface area (SASA). The index i represents
the total value and the polar and apolar decomposed values. The number
of rotatable bonds of the ligand is Lig_Nrotatable. The
BFE is Bind_Gtheory. The regression algorithms used are
the extreme gradient boosting regression (XGBR), random forest regression
(RFR), support vector regression (SVR), and ridge regression (RR).
Machine Learning
Four regression algorithms were used
in supervised machine learning: RR, SVR, RFR, and XGBR. The models
for RR, SVR, and RFR were built using scikit-learn,[58] and the model for XGBR was built using xgboost.[47] A Gaussian kernel is used in SVR. The hyperparameters
and feature selection were optimized by 5-fold cross-validation using
the training data set. A grid search is used in the hyperparameter
search. The objective variables—the experimental BFE data—were
calculated from Kd as ΔG = kBT ln Kd. The optimization was evaluated using the averaged root-mean-squared
error (RMSE) from the 5-fold cross-validation. After the optimization,
a regression model with all training data (n = 3716)
was built using a combination of the hyperparameters and the set of
descriptors that exhibited the lowest mean RMSE.Using the training
regression model, BFEs were predicted for the test data set (n = 217). Compared to the experimental and predicted BFEs,
the accuracy of the learning model was evaluated using Pearson’s
correlation coefficient (R), Spearman’s rank
correlation coefficient (ρ), and the RMSE. In the evaluation
using RMSE, data with an absolute value of the difference between
the experimental and predicted values within 2 kcal/mol were classified
as a low absolute error (LAE), and the data with an error larger than
2 kcal/mol were classified as a high absolute error (HAE). In the
decision tree-based algorithms (RFR and XGBR), the feature importance
was evaluated using the information gain, a measure of improvement
in the objective function when creating a branch in the decision tree,
and the default measure was used in scikit-lean (RFR) and xgboost
(XGBR).[47,58] All test data together with learning and
prediction codes are available on GitHub (https://github.com/IkeguchiLab/3D-RISM-AI).
Results and Discussion
Binding Free Energies Calculated Using 3D-RISM and the Thermodynamic
Cycle
First, we calculated the BFEs for 3993 protein–ligand
pairs using the 3D-RISM method and the thermodynamic cycle (Figure ). The calculated
BFEs exhibited a poor correlation with the experimental BFEs (Figure ), with a Pearson’s
correlation coefficient (R) of 0.047, Spearman’s
rank correlation coefficient (ρ) of 0.069, and RMSE of 49.25
kcal/mol. The calculated BFEs were one digit larger than the experimental
BFEs. We also attempted an improved version of 3D-RISM in which the
contribution of the partial molar volume was subtracted from the HFE
obtained using the original 3D-RISM method.[41] However, the correlation between the calculated and experimental
BFEs did not improve (Figure S1), with R = 0.049, ρ = 0.124, and RMSE = 187.68 kcal/mol.
Figure 3
Correlation
between the experimental and calculated BFEs using
the 3D-RISM method and thermodynamic cycle for 3993 protein–ligand
pairs. The Pearson’s correlation coefficient (R), Spearman’s rank correlation coefficient (ρ), and
root-mean-squared error (RMSE) are 0.047, 0.069, and 49.25 kcal/mol,
respectively.
Correlation
between the experimental and calculated BFEs using
the 3D-RISM method and thermodynamic cycle for 3993 protein–ligand
pairs. The Pearson’s correlation coefficient (R), Spearman’s rank correlation coefficient (ρ), and
root-mean-squared error (RMSE) are 0.047, 0.069, and 49.25 kcal/mol,
respectively.Although the MM/3D-RISM strategy predicted the
BFEs of a few ligands
with a similar scaffold for the same protein, the strategy failed
with multiple types of protein–ligand pairs. Therefore, we
developed a machine learning method, termed 3D-RISM-AI, using thermodynamic
quantities obtained by the 3D-RISM method as principal descriptors.
Because the machine learning method can handle nonlinear relationships
between an objective variable (BFEs) and explanatory variables (thermodynamic
and structural features), it is possible to improve the BFE prediction
using 3D-RISM.
Training Process
To build an optimal regression model,
feature selection was performed using the training data set (n = 3716). The best combination of descriptor types was
searched for in each regression algorithm (Figure ). In this feature selection, eight combinations
of the descriptor types involved in Bind and the three remaining types
(i.e., complex, protein, and ligand) were examined, and the mean RMSE
was evaluated with 5-fold cross-validation (Table S1). For all algorithms, combining the descriptor types of
the ligand, protein, and complex, besides the Bind, resulted in a
lower mean RMSE, suggesting that the Bind-type descriptors alone were
not effective for the BFE predictions.
Figure 4
Performance depends on
features included in the regression algorithms.
The feature combination patterns denoted as A to H are represented
in the table.
Performance depends on
features included in the regression algorithms.
The feature combination patterns denoted as A to H are represented
in the table.Consequently, we selected the best combination
exhibiting the lowest
mean RMSE and employed all descriptor types (denoted as H in Figure ) for XGBR, RFR,
and SVR and the Bind, ligand, and complex descriptors (denoted as
F in Figure ) for
the RR. Before examining the performance of the test data set, as
shown in the next section, using all training data sets, a learning
model was constructed for each algorithm with the optimized hyperparameters
and the best combination of the feature types.
Performance of the 3D-RISM-AI
Using the test data set
(n = 217), BFEs were predicted using each learning
model (Figure ). In
contrast to the BFE prediction based solely on the 3D-RISM method
(Figure ), the BFEs
predicted from all regression models in 3D-RISM-AI were well correlated
to the experimental BFEs. In a comparison of the four algorithms,
XGBR exhibited the best performance for all indicators: R = 0.80, ρ = 0.77, RMSE = 1.91 kcal/mol, and 154 data in LAE
(Figure a). Here,
the data in LAE represents the number of predicted BFEs within a 2
kcal/mol deviation from the experimental BFE. RFR, which is a decision
tree-based method similar to XGBR, also showed a better performance
(R = 0.78, ρ = 0.75, RMSE = 2.02 kcal/mol,
and 141 data in LAE) (Figure b). In contrast, the performance of RR (R = 0.65, ρ = 0.66, RMSE = 2.36 kcal/mol, and 132 data points
in LAE) and SVR (R = 0.68, ρ = 0.64, RMSE =
2.21 kcal/mol, and 150 data points in LAE) were relatively poor in
this examination (Figure c,d).
Figure 5
Comparison of the predicted BFEs using four learning models.
Blue
and orange dots represent the low absolute error (LAE) data and the
high absolute error (HAE) data, respectively. LAE means that the difference
between the predicted and experimental BFE is smaller than 2 kcal/mol.
The difference for HAE is larger than 2 kcal/mol.
Comparison of the predicted BFEs using four learning models.
Blue
and orange dots represent the low absolute error (LAE) data and the
high absolute error (HAE) data, respectively. LAE means that the difference
between the predicted and experimental BFE is smaller than 2 kcal/mol.
The difference for HAE is larger than 2 kcal/mol.Decision tree-type regressions, such as XGBR and
RFR, worked well
for predicting BFE. Because RR uses a linear function for the regression
model, its capability to express the relationship between the experimental
BFE and the input features may be poor. As for SVR, the mean RMSE
in the training process for only Bind descriptors (denoted as A in Figure ) showed similar
values of XGBR or RFR; however, the mean RMSE of SVR did not decrease
when other descriptors were included, which was a different behavior
from the decision tree-type algorithms.The performance of XGBR
and RFR in 3D-RISM-AI was compared with
that of other machine learning approaches. For example, the performance
of the KDEEP based on the three-dimensional
convolutional neural network was R = 0.82, ρ
= 0.82, and RMSE = 1.73 kcal/mol.[24] As
for the decision tree type learning model with structure and interaction-energy
features, such as RF-Score, ID-Score, SFCscoreRF, XGB-Score,
and ΔvinaXGB, their performances were R = 0.753–0.806, ρ = 0.647–0.796, and RMSE = 1.80–2.22
kcal/mol for the best model for each approach.[25−29] The performance of the 3D-RISM-AI was comparable
to that of previous machine learning models using different descriptors
and algorithms. However, because 3D-RISM-AI was based on well-defined
thermodynamic quantities, the factor analysis of the decision tree
method allowed us to analyze descriptors that were important for predicting
the BFE, as shown in the next section.
Feature Importance in the BFE Prediction
To understand
descriptors that were effective for BFE prediction, the feature importance
was analyzed using information gain in the XGBR learning model. The
contribution to reducing the loss function was evaluated by determining
the information gain of each descriptor. According to the ratio of
the information gain of each descriptor to the total gain, the difference
in SASA upon ligand binding (Bind_SASA) and the descriptors related
to the HFE components of ligands (Lig_–TStotal, Lig_Vtotal, Lig_μtotal, Lig_Vpol, and Lig_εapol) were important in the learning process (Figure ). Among the top seven descriptors
with ratios above 0.02, besides Bind_SASA, only Bind_εapol was the Bind-type descriptor. In RFR, the most important descriptors
were almost the same as the top important descriptors in XGBR. In
particular, Bind_SASA and the HFE components of the ligands were assigned
as important descriptors (Figure S2). These
results suggest that, except for Bind_SASA, the descriptors related
to the HFE components of the ligands calculated from 3D-RISM notably
contributed to the BFE prediction.
Figure 6
Rate of information gain in the training
process of XGBR. The top
descriptor is the difference in SASA upon ligand binding (Bind_SASA).
The second to fifth descriptors are the HFE components of ligands:
the entropy term (Lig_–TStotal),
partial molar volume (Lig_Vtotal), HFE
(Lig_μtotal), and polar component of the partial
molar volume (Lig_Vpol). The sixth and
seventh descriptors are the apolar component of the enthalpy (Lig_εapol) and the apolar component of the enthalpy difference in
HFE upon ligand binding (Bind_εapol), respectively.
Rate of information gain in the training
process of XGBR. The top
descriptor is the difference in SASA upon ligand binding (Bind_SASA).
The second to fifth descriptors are the HFE components of ligands:
the entropy term (Lig_–TStotal),
partial molar volume (Lig_Vtotal), HFE
(Lig_μtotal), and polar component of the partial
molar volume (Lig_Vpol). The sixth and
seventh descriptors are the apolar component of the enthalpy (Lig_εapol) and the apolar component of the enthalpy difference in
HFE upon ligand binding (Bind_εapol), respectively.To further understand the important physicochemical
descriptors
for prediction, the correlation among the descriptors was analyzed.
The top seven descriptors of XGBR described above were classified
into two groups exhibiting high correlations among the descriptors,
which were hydrophobic and hydrophilic features (Figure ). Although the descriptors
related to hydrophobicity (Bind_SASA, Lig_–TStotal, and Lig_Vtotal) were
the top three (Figure ), the remaining important descriptors were hydrophilic features.
Because both hydrophobic and hydrophilic features are important for
BFE prediction, the physicochemical features of hydrophobicity and
hydrophilicity should be learned in a well-balanced manner.
Figure 7
Correlation
matrix of the top seven descriptors. The polar and
apolar components, that is, hydrophilic and hydrophobic features,
are denoted by Pol and Apol, respectively.
Correlation
matrix of the top seven descriptors. The polar and
apolar components, that is, hydrophilic and hydrophobic features,
are denoted by Pol and Apol, respectively.The correlations between the descriptors and experimental
BFEs
were compared with the machine learning results (Figure ). The largest correlation
coefficient of the individual descriptors with the experimental BFEs
was less than 0.5. Considering that the correlation coefficients of
the machine learning results were 0.6–0.8 (Figure ), the machine learning algorithms
could develop regression models showing higher correlations with the
experimental BFEs. The top descriptors correlated to the experimental
BFEs were similar to those that were important in the learning process
of the decision tree (Figure ). The top descriptors were Bind_SASA, Bind_εapol, and the ligand HFE-related descriptors. The top seven descriptors
were hydrophobic features that correlated well with each other in
the correlation matrix (Figure S3). Interestingly,
the correlations between the hydrophilic features and the experimental
BFEs were relatively small (no hydrophilic descriptor in the top seven),
although their feature importance was not small (three in the top
seven). These results indicate that in the machine learning algorithms
the weights of the hydrophobic and hydrophilic descriptors were learned
in a well-balanced manner.
Figure 8
Correlations between the descriptors and experimental
BFEs. The
correlation is evaluated using the absolute value of Pearson’s
correlation coefficient (R).
Correlations between the descriptors and experimental
BFEs. The
correlation is evaluated using the absolute value of Pearson’s
correlation coefficient (R).
Trends in Predicted BFEs by XGBR and Data Bias
To investigate
data bias, the distributions of the predicted BFEs using XGBR and
the experimental BFEs were compared (Figure ). In a comparison of the distributions across
all test data (Figure a), the predicted BFEs were biased toward the mean of BFEs about
−9 kcal/mol. In the LAE data, the distributions were similar
to each other (Figure b). By contrast, in the HAE data, the experimental BFEs were broadly
distributed and the predicted BFEs were biased toward the mean (Figure c). This inaccuracy
could be attributed to the experimental BFE data used in the training
process were not uniformly distributed and biased toward the mean
(Figure S4). Therefore, the amount of data
largely deviating from the mean was small, and the construction of
the regression model failed to predict the data deviating from the
mean.
Figure 9
Histograms of BFEs. The distributions are (a) all test data sets,
(b) data for the low absolute error (LAE) with |ΔGXGBR – ΔGexp)|
≤ 2 kcal/mol, and (c) data for the high absolute error (HAE)
with |ΔGXGBR – ΔGexp)| > 2 kcal/mol. The solid and dashed
lines
represent the predicted BFEs by XGBR and the experimental BFEs, respectively.
Histograms of BFEs. The distributions are (a) all test data sets,
(b) data for the low absolute error (LAE) with |ΔGXGBR – ΔGexp)|
≤ 2 kcal/mol, and (c) data for the high absolute error (HAE)
with |ΔGXGBR – ΔGexp)| > 2 kcal/mol. The solid and dashed
lines
represent the predicted BFEs by XGBR and the experimental BFEs, respectively.Finally, we should discuss future approaches to
address the limitations
in the above analyses and to improve accuracy by incorporating new
features not considered in the 3D-RISM-AI. First, because the training
data set in the PDBbind database was biased around the mean, more
training data, including uniformly distributed BFEs over a wide energy
range, are necessary. The BFE data for multiple types of ligands frequently
used in the structure–activity relationship analyses would
be effective. Second, the performance of 3D-RISM-AI depended on the
performance of the 3D-RISM method and the force field used. Because
the important features were the ligand HFEs and their components are
given by the 3D-RISM method, solving the challenges inherent to the
theoretical framework and numerical calculations, such as the closure
and force field, would directly improve 3D-RISM-AI. In addition, the
interactions of ligand molecules include weak interactions, such as
cation−π, CH−π, and interactions related
to halogen atoms, such as halogen bonding and sigma holes, which cannot
be represented by conventional force fields but captured by quantum
chemical (QM) calculations. Therefore, it would be a good strategy
to improve the accuracy of the ligand force field or to add the interactions
given by QM calculations as new descriptors. Third, the 3D-RISM-AI
does not incorporate dynamic features. Both structural fluctuations
in the solution and the structural diversity of the unbound states
leading to configurational entropy changes could be considered combined
with MD simulations. Fourth, 3D-RISM-AI is complementary to other
machine learning approaches, such as KDEEP, which employs
structural features as descriptors. The 3D-RISM-AI uses thermodynamic
quantities based on HFE as descriptors and can incorporate structural
information with different properties in a complementary manner.Here, we discuss several issues to be resolved in the future before
applying 3D-RISM-AI to in silico pharmaceutical processes,
such as high-throughput virtual screening (HTVS) and structure–activity
relationship analyses for abundant protein–ligand pairs. First,
in these processes, complex models are obtained using docking simulations;
however, the models usually deviate from experimental ones. As only
experimental structures were used in the current 3D-RISM-AI, it is
not clear how sensitive the machine learning model is to the quality
of the structures of the protein–ligand complexes. Next, the
computational time required to calculate the descriptors is an issue
in HTVS. In this study, it required approximately 6 h using one central
processing unit (CPU) core to conduct the 3D-RISM calculation for
a protein or a complex, depending on the size of the protein: approximately
2 and 42 h were the minimum and maximum, respectively, and 6 h was
the median. To perform the 3D-RISM calculations for abundant pairs
in HTVS, which typically involves at least 100000s of molecules and
often millions, a large number of CPU cores such as supercomputers
are required. Alternatively, a graphics processing unit (GPU) version
of the 3D-RISM method could be used for speeding up the 3D-RISM calculation.[59] Furthermore, the recently developed deep learning
model for the 3D-RISM results,[60] which
is capable of quickly predicting water distributions around proteins,
could potentially work with 3D-RISM-AI. Finally, it is required to
examine whether 3D-RISM-AI is a better approach to find hit compounds
because it is not clear whether the machine learning model is significantly
more accurate than the docking scores. Solving these issues and applying
3D-RISM-AI to pharmaceutical processes are future challenges.
Conclusions
Using the HFE based on the 3D-RISM method
as the principal input
feature, we proposed a machine learning approach to predict the BFE
for abundant protein–ligand pairs, termed 3D-RISM-AI. Whereas
the BFEs solely evaluated using the 3D-RISM method through the thermodynamic
cycle were not correlated to the experimental data, the BFEs predicted
using 3D-RISM-AI showed a good correlation with the experimental BFEs: R = 0.80, ρ = 0.77, and RMSE = 1.91 kcal/mol. Although
the performance was comparable to that of other machine learning approaches
using other input features, the important factor analysis allowed
us to understand the important features for predicting BFEs, such
as the difference in the SASA between the unbound and bound structures
and the ligand HFE-related descriptors. The physicochemical features
described by the most important descriptors were hydrophobic, but
the rest were hydrophilic, indicating that the balance between them
is important for the predictions. Although the importance of both
hydrophobic and hydrophilic interactions is well-known, the fine balance
between them could be automatically detected using the machine learning
approach of the physicochemical-based 3D-RISM-AI. In addition, the
machine learning framework of 3D-RISM-AI can incorporate both structural
diversity sampled from MD simulations and other structural or energetic
input features.
Authors: Richard A Friesner; Robert B Murphy; Matthew P Repasky; Leah L Frye; Jeremy R Greenwood; Thomas A Halgren; Paul C Sanschagrin; Daniel T Mainz Journal: J Med Chem Date: 2006-10-19 Impact factor: 7.446