Kazuhiro J Fujimoto1,2, Shota Minami2, Takeshi Yanai1,2. 1. Institute of Transformative Bio-Molecules (WPI-ITbM), Nagoya University, Furocho, Chikusa, Nagoya 464-8601, Japan. 2. Department of Chemistry, Graduate School of Science, Nagoya University, Furocho, Chikusa, Nagoya 464-8601, Japan.
Abstract
We propose a novel machine-learning-based scoring function for drug discovery that incorporates ligand and protein structural information into a knowledge-based PMF score. Molecular docking, a simulation method for structure-based drug design (SBDD), is expected to reduce the enormous costs associated with conventional experimental methods in terms of rational drug discovery. Molecular docking has two main purposes: to predict ligand-binding structures for target proteins and to predict protein-ligand binding affinity. Currently available programs of molecular docking offer an accurate prediction of ligand binding structures for many systems. However, the accurate prediction of binding affinity remains challenging. In this study, we developed a new scoring function that incorporates fingerprints representing ligand and protein structures as descriptors in the PMF score. Here, regression analysis of the scoring function was performed using the following machine learning techniques: least absolute shrinkage and selection operator (LASSO) and light gradient boosting machine (LightGBM). The results on a test data set showed that the binding affinity delivered by the newly developed scoring function has a Pearson correlation coefficient of 0.79 with the experimental value, which surpasses that of the conventional scoring functions. Further analysis provided a chemical understanding of the descriptors that contributed significantly to the improvement in prediction accuracy. Our approach and findings are useful for rational drug discovery.
We propose a novel machine-learning-based scoring function for drug discovery that incorporates ligand and protein structural information into a knowledge-based PMF score. Molecular docking, a simulation method for structure-based drug design (SBDD), is expected to reduce the enormous costs associated with conventional experimental methods in terms of rational drug discovery. Molecular docking has two main purposes: to predict ligand-binding structures for target proteins and to predict protein-ligand binding affinity. Currently available programs of molecular docking offer an accurate prediction of ligand binding structures for many systems. However, the accurate prediction of binding affinity remains challenging. In this study, we developed a new scoring function that incorporates fingerprints representing ligand and protein structures as descriptors in the PMF score. Here, regression analysis of the scoring function was performed using the following machine learning techniques: least absolute shrinkage and selection operator (LASSO) and light gradient boosting machine (LightGBM). The results on a test data set showed that the binding affinity delivered by the newly developed scoring function has a Pearson correlation coefficient of 0.79 with the experimental value, which surpasses that of the conventional scoring functions. Further analysis provided a chemical understanding of the descriptors that contributed significantly to the improvement in prediction accuracy. Our approach and findings are useful for rational drug discovery.
The
development of a new drug requires a long period of time and
a large amount of money before approval and marketing.[1] To reduce these costs, there is a strong demand for rational
drug discovery.[2] In the early stages of
drug discovery, drug candidates are experimentally searched for from
a vast array of compounds in terms of their binding affinity to the
target protein (lead identification), and the compounds are repeatedly
improved to further increase their binding affinity (lead optimization).[3] These processes are based on the idea of hit
or miss and therefore bring about the enormous costs of drug discovery.Computer-aided drug discovery (in silico drug discovery) has been
attracting significant attention as a solution to such problems in
conventional drug discovery.[4−6] Structure-based drug design (SBDD),[7,8] which uses the three-dimensional structure of target proteins, plays
a central role in computer-aided drug discovery. If the protein pocket
is regarded as a keyhole, SBDD is a method to design a matching key
based on the shape of the keyhole. Therefore, computer-aided drug
discovery by SBDD is expected to be more rational than the conventional
approach in designing drug candidates with high activity and selectivity.
The anti-influenza drug oseltamivir[9] and
the chronic myelogenous leukemia drug imatinib[10] are representative examples of successful SBDD.The
background for the realization of SBDD is the accumulation
of a large amount of data on three-dimensional protein structures
due to the development of structural analysis infrastructure and the
dramatic improvement in computational power. In addition, the development
of a computational science method called molecular docking constitutes
another major factor. It is a type of computer simulation and has
become an indispensable tool for SBDD.[5,7]The purpose
of molecular docking is twofold: first, to predict
the binding structure of the ligand in the binding pocket of the target
protein and second, to predict the protein–ligand binding affinity.
These accurate predictions are key to rational drug discovery. To
this end, a number of molecular docking programs have been developed,
ranging from commercial to academic use.[11−20] Simulations using these programs reproduce actual ligand-binding
structures well in many systems. Thus, one of the goals of molecular
docking, the prediction of ligand-binding structures, is now possible
with considerable accuracy. In contrast, the other goal of molecular
docking, the prediction of binding affinity (binding energy), has
not yet been achieved to a satisfactory degree, and there is still
a large discrepancy between the predicted and experimental values.
Due to the inability to predict absolute binding affinities, the relative
binding affinities of various ligands to a given target protein are
often evaluated in drug discovery, but the effectiveness of this approach
is highly system-dependent. For this reason, the accurate prediction
of binding affinity is one of the major challenges in computer-aided
drug discovery.The poor prediction of binding affinities by
current molecular
docking is mainly attributed to inaccuracies of the scoring function.
It has been reported that incorporating solvation and entropy effects
into the scoring function improves the prediction accuracy,[21−26] but these incorporations are usually computationally expensive.
On the other hand, one of the knowledge-based scoring functions, the
PMF score,[27−30] can simply take into account these contributions. The PMF score
expresses the interatomic potential in a protein–ligand complex
as a frequency of occurrence of two atoms at a certain distance in
the crystal structure data and hence gives large negative values at
binding distances that often appear in the crystal structures. Thus,
the PMF score has the advantage of implicitly accounting for many
kinds of contributions to the binding, such as solvation, entropy,
and enthalpy effects, at low computational cost.This study
aims to develop a new scoring function based on the
PMF score that provides an accurate prediction of binding affinity.
In recent years, the application of machine-learning-based regression
algorithms has been proposed for the development of new scoring functions,[31−33] where random forests and support vector machines have been used
successfully.[34−36] Some of the scoring functions developed have shown
significant improvements over conventional scoring functions,[37−41] with Pearson correlation coefficients between predicted and experimental
values exceeding 0.7 to 0.8. Based on these successes, machine learning
is employed in this study to develop scoring functions. It should
be noted, however, that we employ a different strategy from the one
reported previously in the sense that the model function is constructed
by adding fingerprints representing the ligand and protein structure
into the PMF score. Incorporating ligand fingerprints has already
been reported,[42] but there are no reports
of scoring functions that take protein fingerprints into account.
Regression analysis is then performed using machine learning algorithms
such as LASSO[43] and LightGBM.[44] The scoring function developed in this study
exhibits a Pearson correlation coefficient of 0.79, indicating a higher
reproducibility to infer experimental binding affinities than the
conventional methods. Further analysis reveals that ligand and protein
fingerprints compensate for the poor description of the PMF score.
Our approach and findings would be useful for the further development
of computer-aided drug discovery.
Methods
In this section, we briefly explain how to develop a new machine-learning-based
scoring function in which ligand and protein structural information
is incorporated into the knowledge-based PMF score.[27−30] First, descriptors constituting
model functions are introduced, and then the model is optimized by
machine learning.
PMF Score
The
PMF score[27−30]V(r) represents the
knowledge-based protein–ligand binding energy and is defined
as the sum of the interaction energies A(r) of the protein–ligand atom pairs at distance r.Here, the indices i and j are protein and ligand atoms, respectively,
and rijc is the cutoff distance defined for each pair of protein and
ligand atoms. There are 17 types of protein atoms and 34 types of
ligand atoms.[30] The interactions beyond rijc are not considered in the PMF score. The interatomic energy A(r) is calculated using the
Boltzmann distribution as follows:where kB is Boltzmann’s
constant, T is temperature,
ρ(r) is the number
density of ligand–protein pair ij at distance r, and ρ0 is the number density in the reference
state where the atomic interaction of pair ij is
zero. Therefore, corresponds to the probability
density
distribution of pair ij in the structural data set
of the protein–ligand complex. f(r) is called the ligand volume correction factor
and is used to remove the volume of ligand atoms from the number density.[29] In this study, the PMF score is used as the
basis for developing the scoring function.
Ligand
Fingerprints
To incorporate
the structural information of the ligand in the PMF score, we employ
the molecular fingerprint method. Molecular fingerprints are the representation
of molecular structures based on certain rules using binary vectors
representing presence or absence as 1 or 0, or count vectors representing
the frequency of occurrence. It is one of the most common methods
to convert a molecular structure into a computer-readable form and
is widely used in chemoinformatics[45−47] and in silico drug discovery.[48] Hereafter, molecular fingerprints are called
ligand fingerprints to distinguish them from fingerprints on proteins,
which will be introduced later.There are two major types of
ligand fingerprints. The first one assigns a substructure in the molecule
to each bit of the array. Here, substructures refer to fragments such
as functional groups or carbon skeletons. This method can appropriately
represent molecular structures composed of the substructures registered
as keys in a bit array. However, if the substructures are not registered,
an accurate representation of the molecular structure is not possible.
The second fingerprint method constructs molecular structures using
the information on the relative positions of atoms in the molecule.
This method can generate ligand fingerprints that reflect any chemical
structure. However, it is difficult to extract only the substructure
information from the generated fingerprint. This is a drawback of
this method in the analysis of machine-learning-based results. In
this study, we have employed MACCSKeys[49] and extended connectivity fingerprint (ECFP)[50,51] as representative examples of the first and second fingerprint methods,
respectively.The MACCSKeys fingerprint is a bit array consisting
of 166 keys
representing different substructures. The concept of MACCSKeys is
illustrated in Figure a. If a substructure in the molecule is among the types associated
with the 166 keys, the corresponding bit is set to 1; otherwise, it
is set to 0. It should be noted that even if a molecule contains several
substructures of the same type, the number of substructures is not
reflected in MACCSKeys.
Figure 1
Representation of molecular structures using
(a) MACCSKeys and
(b) ECFP. The MACCSKeys bit is set to 1 if the molecule contains a
registered substructure; otherwise, it is set to 0. ECFP searches
for the relative position of each atom in the molecule using a bond
layer l and records the structural information in
a bit array.
Representation of molecular structures using
(a) MACCSKeys and
(b) ECFP. The MACCSKeys bit is set to 1 if the molecule contains a
registered substructure; otherwise, it is set to 0. ECFP searches
for the relative position of each atom in the molecule using a bond
layer l and records the structural information in
a bit array.The ECFP fingerprint is one of
the circular fingerprints,[52] which represents
molecular structures by considering
the position of atoms within a bond layer l centered
on each atom of the molecule. The concept of ECFP is illustrated in Figure b. First, the relative
positions of atoms within a given bond layer are iteratively explored,
and then the resulting atom environments are transformed to a bit
array through a hash function. The size of the bond layer and bit
array can be determined arbitrarily, but typically, the maximum bond
layer of 2 (ECFP4) and 1024 bits are used, respectively. ECFP has
the advantage of not requiring a predefined substructure data set,
in contrast to MACCSKeys. In addition, ECFP can consider several functional
groups of the same type separately, which makes it a more accurate
representation than MACCSKeys.
Protein
Fingerprints
In addition
to the structural information of the ligand, we have attempted to
include the structural information of the protein in the scoring function
using our original fingerprint method shown below. This method is
named protein fingerprint (PF), and its concept is shown in Figure . We assign three
types of moieties to each amino acid constituting the given protein:
N-linked main chain, O-linked main chain, and side chain. To account
for direct protein–ligand interactions, we only incorporate
atoms constituting amino acids located within 5 Å of the ligand
atoms in the binding pocket. Hydrogen atoms are not considered here.
Our descriptors distinguish 20 types of amino acids including the
three types of the associated moieties except for glycine, resulting
in a total of 59 different descriptors, which were included in the
scoring function. The PF method employs count vectors to record structural
information, whereas MACCSKeys and ECFP use binary vectors. The following
procedure is used to represent PF. First, all the atoms in the protein
that are within 5 Å of the ligand are searched for, then the
corresponding atoms are classified into the 59 different amino acid
sites, and the number of their occurrences is stored in the corresponding
array of descriptors. It should be noted that the PF method takes
into account the numbers of amino acids of the same type, which are
stored in the count vectors.
Figure 2
Protein fingerprints (PF) developed in this
study. (a) Each amino
acid is described by three descriptors: N-linked main chain, O-linked
main chain, and side chain. (b) Example to illustrate PF. Here we
consider a situation where only the main chain of alanine is within
5 Å of the ligand. The counter associated with the N-linked main
chain descriptor of alanine (blue box) is incremented by 2 due to
the presence of a nitrogen atom and an alpha carbon atom in the alanine
main chain. Similarly, the counter associated with the O-linked main
chain descriptor of alanine (red box) is incremented by 2 due to the
carbon and oxygen atoms forming the peptide bond. It should be noted
that the number of hydrogen atoms is not considered in PF.
Protein fingerprints (PF) developed in this
study. (a) Each amino
acid is described by three descriptors: N-linked main chain, O-linked
main chain, and side chain. (b) Example to illustrate PF. Here we
consider a situation where only the main chain of alanine is within
5 Å of the ligand. The counter associated with the N-linked main
chain descriptor of alanine (blue box) is incremented by 2 due to
the presence of a nitrogen atom and an alpha carbon atom in the alanine
main chain. Similarly, the counter associated with the O-linked main
chain descriptor of alanine (red box) is incremented by 2 due to the
carbon and oxygen atoms forming the peptide bond. It should be noted
that the number of hydrogen atoms is not considered in PF.
Machine Learning with LASSO and LightGBM
To develop a new scoring function, the PMF score modified with
ligand and protein fingerprints is trained with the machine learning
method. In this study, we employ two machine learning techniques:
the least absolute shrinkage and selection operator (LASSO)[43] and the light gradient boosting machine (LightGBM).[44] Here, we attempt to construct the following
relation for the objective variable y and m descriptors (explanatory variables) x1, x2, ...x:where f(x1, x2, ...x) is the objective function. It should be noted
that we assume the protein–ligand binding energy for the objective
variable and the components of PMF score, ligand fingerprints, and
protein fingerprints for the descriptors.LASSO is a regression
analysis technique that imposes regularization on the least squares
process.[43] First, we consider a linear
model with the following form:where c (for l = 0 to m) is the
partial regression coefficient. The values of these coefficients are
determined by the least squares method. To this end, the following
loss function L is considered:where y and xlk (for l = 1 to m) denote the objective variable and descriptors for training data k, respectively, and the summation runs over all n data. If the absolute values of the partial regression
coefficients become large, the model fits too well to the training
data and poorly to the test data. This phenomenon is called overfitting
(overtraining). To circumvent this problem, LASSO adds a regularization
(penalty) term to the loss function, which is the sum of the absolute
values of the partial regression coefficients multiplied by a hyperparameter
α.In this form, the loss function L increases due
to the regularization term as the partial
regression coefficient increases. In this way, LASSO prevents an increase
in the absolute value of the partial regression coefficient in the
least squares treatment, making overfitting less likely to occur.
The suppression of overfitting strongly depends on the value of the
hyperparameter α. Therefore, this study adopts the value of
α that gives the highest prediction accuracy for the pretest
data set.LightGBM is a supervised machine learning method based
on the gradient-boosting
decision tree (GBDT) algorithm,[44] which
has recently been applied in various fields such as life sciences,[53−55] engineering,[56,57] and economics.[58] An attempt to apply XGBoost, a type of GBDT, to protein–ligand
binding affinity has also been reported.[59] Since LightGBM is a combination of the decision tree and gradient
boosting methods, each is described here. First, the decision tree[60] is a tree-based data analysis method that can
be applied to both classification and regression problems. In this
study, the regression model of eq is considered. As illustrated in Figure a, the properties of the data
were evaluated at the root and internal nodes using conditional expressions
on the descriptors, and finally, the value of the objective variable y is predicted based on the result at the leaf node. Another
feature of LightGBM is gradient boosting,[61,62] which is a technique that attempts to iteratively improve the prediction
accuracy of a model, f(x1, x2, ...x), by adding a new estimator, δ(x1, x2, ...x).where μ
represents the
iteration step and β is a hyperparameter called the learning
rate. In LightGBM, δμ(x1, x2, ...x) corresponds to a decision tree that represents the error
between the actual (experimental) value y and the
predicted value by the μ – 1 step decision tree fμ – 1(x1, x2, ...x). A schematic illustration of LightGBM is shown in Figure b. Repetition of
the operation according to eq can gradually reduce the error. However, too much iteration
leads to overfitting. To avoid this problem, we employ a method called
early stopping, in which the number of iterations M (the number of trees) with the highest prediction accuracy is determined
based on the results on the pretest data set conducted with different
iteration counts.
Figure 3
(a) Example of a decision tree for a regression problem.
The characteristics
of the descriptors of the data are diagnosed by the conditional expressions
at the root and internal nodes, and the predicted value of the objective
variable is returned at the leaf node. (b) Schematic illustration
of LightGBM. For each iteration, a new decision tree is added and
the error decreases accordingly.
(a) Example of a decision tree for a regression problem.
The characteristics
of the descriptors of the data are diagnosed by the conditional expressions
at the root and internal nodes, and the predicted value of the objective
variable is returned at the leaf node. (b) Schematic illustration
of LightGBM. For each iteration, a new decision tree is added and
the error decreases accordingly.
Computational Details
PDBbind[63] v2019 was used as the data set for developing
new scoring functions. PDBbind is known as a high-quality protein–ligand
structure data set that can be used to develop and validate scoring
functions. In this study, we used 6271 data structures from PDBbind
v2019 in which dissociation constant Kd values exist. Of the 6271 protein–ligand complex structures,
4933 were used as the training data set, 1234 as the pretest data
set, and 104 as the test data set. The training data set was used
in machine learning with LASSO and LightGBM, while the pretest data
set was used to determine the value of the hyperparameter α
for LASSO and the number of iterations M for LightGBM.
The test data set corresponds to the core set in PDBbind (CASF-2016[64]), which is widely used as the benchmark data
for evaluating the accuracy of scoring functions.[65−67] This study
also used it only as an external test data set, not as a training
data set, to evaluate the performance of the scoring function. All
atomic coordinates of the protein and ligand, including hydrogens,
were used without any optimization for the PDBbind structures.In this study, we aimed to improve the prediction accuracy of the
PMF score by incorporating the structural information of ligands and
proteins using the fingerprint methods. In addition, we examined the
effect of incorporating van der Waals (vdW) interactions into the
PMF score using the 6–12 Lennard–Jones potentials[68] with the amber99 force field.[69] The pairwise interatomic potentials, PMF scores and vdW
interactions, were calculated using the atomic coordinates of the
PDBbind structures and included in the scoring function as descriptors.
It should be noted that the PMF score in this study does not consider
the interatomic potentials associated with hydrogen, whereas the vdW
interaction does. MACCSKeys and ECFP were applied to the ligand structure,
and each bit comprising the bit array was used for a single descriptor.
Protein fingerprints were applied to protein atoms within 5 Å
of the ligand atoms, which were used as descriptors. Machine learning
using LASSO or LightGBM was performed on the objective functions consisting
of the PMF score, MACCSKeys, ECFP, PF, and vdW. The hyperparameter
α for the LASSO regression was determined through machine learning
on the pretest data set with α varying from 0.001 to 0.01 in
0.001 increments. The number of iterations M for
LightGBM early-stopping was determined based on the results of machine
learning on the pretest data set with M varying up
to 10,000. It should be noted that the optimal values for the hyperparameter
α and the number of iterations M are different
for each model. The learning rate β of 0.1 was used for LightGBM.The number of descriptors included in the scoring function is 448
for PMF scores, 166 for MACCSKeys, 1024 for ECFP, 59 for PF, and 493
for vdW. We validated a combination of these descriptors to improve
the prediction accuracy of the scoring function.The RDKit program
package[70] is used
for generating ligand fingerprints with MACCSKeys and ECFP. For machine
learning of the models, the scikit-learn[71] and LightGBM[72] libraries were used for
LASSO and LightGBM, respectively.Two methods were employed
in this study to assess the validity
of the model. The first is the Pearson correlation coefficient R, given bywhere y and f are the experimental
and predicted binding affinities for the kth complex,
respectively, and y̅ and f̅ are their average values. The second evaluation method is the root
mean squared error (RMSE), which is given byBoth R and
RMSE were calculated using the test data set containing the 104 complexes
(n = 104).The training results vary depending
on how the 6271 complex structures
are divided into 4933 training and 1234 pretest data sets. To avoid
such a dependence, the following steps were taken. First, we created
10 different combinations of training and pretest data sets with different
seed values involved in the generation of random numbers and then
performed machine learning on these data sets using LightGBM. Finally,
we employed the data set combination that produced the correlation
coefficient closest to the average of the resulting 10 correlation
coefficients.
Results and Discussion
Comparison of Binding Affinities Calculated
by New Scoring Functions with Experimental Values
We applied
our scoring functions to a test data set of 104 structures, which
is the core data set of PDBbind, and examined the Pearson correlation
coefficients between the experimental and predicted binding energies.
The results of the scatter plots are shown in Figure , and the correlation coefficient R and RMSE values are summarized in Table . For comparison, the same analysis was also
performed using various conventional scoring functions. The predicted
values of binding affinity by each scoring function were taken from
the CASF-201664 data set. These results are listed in Table . The results of the
Pearson correlation coefficients show that including ligand and protein
fingerprints in the scoring function improves the prediction accuracy
of the binding energy step by step from the PMF model. The most accurate
result was obtained from the PMF + ECFP + PF model trained with LightGBM,
with a Pearson correlation coefficient of 0.79. Pearson correlation
coefficients calculated with the conventional scoring functions range
from 0.24 to 0.74, with the best result being obtained for deltaVinaRF20.[35] These results indicate that our newly developed
scoring function has a higher prediction accuracy than the conventional
ones. The RMSE of the PMF + ECFP + PF model is 1.64 kcal/mol, which
also shows better prediction accuracy of binding energies than the
results of AutodockVina (2.42 kcal/mol), ASE@MOE (9.63 kcal/mol),
and deltaVinaRF20 (10.88 kcal/mol). These results clearly show that
the newly developed scoring functions are applicable not only to the
relative comparison of protein–ligand binding energies but
also to the evaluation of their absolute values.
Figure 4
Scatter plots of experimental
values versus predicted values by
scoring functions trained with (a) LightGBM and (b) LASSO.
Table 1
Pearson Correlation Coefficient (R) and RMSE between the Experimental Values and the Predicted
Values by the Newly Developed Scoring Functions
model
method
R
RMSE
(kcal/mol)
PMF
LASSO
0.67
2.04
LightGBM
0.73
1.85
PMF + MACCSKeys
LASSO
0.76
1.84
LightGBM
0.69
1.93
PMF + ECFP
LASSO
0.77
1.76
LightGBM
0.75
1.76
PMF + PF
LASSO
0.69
2.00
LightGBM
0.75
1.77
PMF + vdW
LASSO
0.67
2.02
LightGBM
0.74
1.82
PMF + MACCSKeys
+ PF
LASSO
0.75
1.83
LightGBM
0.73
1.83
PMF + ECFP
+ PF
LASSO
0.77
1.77
LightGBM
0.79
1.64
Table 2
Pearson Correlation Coefficient (R) between the
Experimental Values and the Predicted Values
by the Conventional Scoring Functions
model
R
model
R
ASE@MOE
0.54
LigScore1@DS
0.24
Affinity-dG@MOE
0.55
LigScore2@DS
0.41
Alpha-HB@MOE
0.55
London-dG@MOE
0.33
AutodockVina
0.54
PLP1@DS
0.41
ChemScore@SYBYL
0.45
PLP2@DS
0.43
D-Score@SYBYL
0.40
PMF04@DS
0.43
DrugScore2018
0.43
PMF@DS
0.50
DrugScoreCSD
0.49
PMF@SYBYL
0.37
G-Score@SYBYL
0.40
X-Score
0.55
GBVI_WSA-dG@MOE
0.24
X-ScoreHM
0.50
LUDI1@DS
0.41
deltaSAS
0.53
LUDI2@DS
0.45
deltaVinaRF20
0.74
LUDI3@DS
0.41
Scatter plots of experimental
values versus predicted values by
scoring functions trained with (a) LightGBM and (b) LASSO.Overall, the accuracy in
terms of the correlation coefficient and
RMSE was found to be improved by adding ligand fingerprints as descriptors,
although there were some cases where the prediction accuracy worsened,
such as the PMF + MACCSKeys models trained by LightGBM. In addition,
the inclusion of protein fingerprints in the scoring function further
improved the correlation coefficient and RMSE results. In the case
of LightGBM, the change from the PMF to the PMF + ECFP + PF models
improved the correlation coefficient by 0.06 and rectified the RMSE
by 0.21 kcal/mol. These results clearly show the importance of adding
ligand and protein structural information as descriptors to the scoring
function.The best scoring function in our results was obtained
from training
with LightGBM, but in some models, LASSO gave better results. This
indicates that the best machine learning method depends on the type
of descriptor included in the scoring function. Interestingly, in
the present results, the PMF + ECFP + PF model provided the best values
of the correlation coefficient and RMSE for LightGBM as well as LASSO.
This strongly suggests that the ECFP and PF fingerprints are suitable
for descriptors to be included in the scoring function.In the
present calculations, the ECFP fingerprints gave better
results than the MACCSKeys fingerprints. MACCSKeys contains 166 bits,
each bit corresponding to a specific substructure such as carbon skeletons
and functional groups. Therefore, if a test compound consists of substructures
registered in MACCSKeys, the molecular structure information will
be reflected well in the scoring function; otherwise, it cannot be
accurately represented. On the other hand, the ECFP fingerprint can
represent any molecular structure by considering the relative positions
of atoms within the bond layer l centered on each
atom of the compound. For this reason, ECFP probably gave better results
than MACCSKeys in the data set used here.The worst score in
the present results was obtained from the PMF
+ vdW model trained with LASSO, with a correlation coefficient of
0.67 and a RMSE of 2.02 kcal/mol. This result is almost the same as
that of the PMF model trained by LASSO, indicating that the addition
of the vdW interaction does not improve the prediction accuracy. The
PMF model implicitly deals with many types of interactions, such as
desolvation effects and enthalpy, and hence the effect of the vdW
interaction is already included. This means that the high dependency
of the vdW interaction on the PMF model hinders the improvement of
the prediction accuracy of the scoring function. Such a strong correlation
between descriptors is called multicollinearity.[73]
Analysis of Descriptors
Affecting the Scoring
Function
To gain a chemical understanding of the improvement
in prediction accuracy, we analyzed the contribution of each descriptor
in the scoring function. Here, we focus on the partial regression
coefficients (eq ) in
the PMF + MACCSKeys model trained with LASSO regression. The reason
for choosing this model is that MACCSKeys provides a clear chemical
interpretation due to the one-to-one correspondence of its components
to specific substructures and that LASSO allows us to easily estimate
the contribution of descriptors based on the magnitude of the regression
coefficient.Table summarizes the descriptors with the large contributions to
the binding energies. It should be noted that the positive and negative
values of regression coefficients correspond to repulsive and attractive
interactions, respectively. As for the negative regression coefficients,
we can see that there are many large contributions from heteroatoms
such as key numbers 23, 71, 131, and 159, which are related to hydrogen
bonding. In addition, the large contributions from halogen atoms such
as key numbers 27 and 87 are shown, suggesting that the consideration
of halogen effects, which could not be incorporated by PMF alone,
leads to a higher prediction accuracy. As for the positive coefficients,
the large contributions came from ring structures and aromatics such
as key numbers 125, 162, and 163 and hydrocarbons such as 28, 82,
and 160. These substructures cause destabilization to protein–ligand
binding due to their positive coefficients. Therefore, it was suggested
that these substructures are not suitable for drug design.
Table 3
MACCSKeys Descriptors with Large Contribution
descriptor
descriptor
key no.
key description
contribution
(kcal/mol)
key no.
key description
contribution
(kcal/mol)
23
NC(O)O
–0.74
92
OC(N)C
0.44
159
O > 1
–0.60
28
QCH2Q
0.4
27
I
–0.56
57
O HETEROCYCLE
0.38
36
S HETEROCYCRE
–0.45
47
SAN
0.37
131
QH > 1
–0.40
162
AROMATIC
0.33
71
NO
–0.36
125
AROMATIC RING >1
0.31
87
X!A$A
–0.36
163
6M RING
0.25
70
QNQ
–0.35
49
CHARGE
0.24
127
A$A!O > 1
–0.34
160
CH3
0.23
116
CH3AACH2A
–0.32
82
ACH2QH
0.21
A similar analysis
was performed for the PF fingerprints. Table summarizes the descriptors
of PF that have made significant contributions. Although the PF fingerprints
based on count vectors are not simply comparable to MACCSKeys with
binary vectors, we can see that the overall contribution of PF to
the scoring function was smaller than that of MACCSKeys. In this study,
we have constructed the PF fingerprints using three descriptors for
each amino acid except glycine: the N-linked and C-linked main chain, and its side chain. The analysis
showed that the contributions of the main chains were larger than
those of the side chains. This was contrary to our chemical intuition
that the side chain properties would give a larger contribution than
the main chain. The result also showed a significant contribution
from the cysteine side chain other than the main chain. This descriptor
may compensate for interactions involving sulfur atoms that are not
adequately represented by the PMF score. Another possibility is the
incorporation of the disulfide bond effect. Since disulfide bonds
are not considered in the descriptors other than the PF fingerprints,
the contribution of the cysteine side chain may have been greater.
Table 4
PF Descriptors with Large Contribution
descriptora
contribution
(kcal/mol)
descriptora
contribution
(kcal/mol)
Asn-N
–0.13
Asn-O
0.16
His-O
–0.12
Gln-O
0.12
Cys-S
–0.12
Ala-O
0.10
Phe-N
–0.11
Lys-O
0.10
Met-N
0.10
Pro-O
0.08
Ile-O
–0.09
His-N
0.08
Asp-N
–0.09
Trp-N
0.07
Ala-N
–0.08
Cys-O
0.06
Gly-N
–0.07
Asp-O
0.06
Tyr-O
–0.07
Asn-O
0.06
N-linked and O-linked main chain,
and side chain are abbreviated as N, O, and S, respectively.
N-linked and O-linked main chain,
and side chain are abbreviated as N, O, and S, respectively.
Systems with Improved and
Unimproved Scores
The above analysis succeeded in finding
descriptors that contribute
significantly to the new scoring function. We then turned our attention
to specific systems in which the incorporation of ligand fingerprints
greatly improved the computational results. For ease of interpretation,
we again analyzed the PMF + MACCSKeys model trained with LASSO. Table summarizes the calculated
binding energies for the PMF and PMF + MACCSKeys models, and their
chemical structures are illustrated in Figure . The results clearly show that the addition
of MACCSKeys improves the binding energy by 1.84 and 0.50 kcal/mol
for the systems with PDB ID of 4J3L and 3ARP, respectively. We further analyzed the
MACCSKeys descriptors and found significant contributions of the halogen
atom (key number 87: −0.36 kcal/mol) and functional groups
involved in hydrogen bonding (key numbers 159 and 131: −0.60
and −0.40 kcal/mol). This result indicates that the PMF score
alone cannot sufficiently describe these effects for compound 1. The analysis also showed that MACCSKeys for compound 2 provides a large destabilization of the charged functional
group (key number 49: 0.24 kcal/mol) due to its positive regression
coefficient. This is a counterintuitive result because electrostatic
interactions generally contribute significantly to protein–ligand
interactions. The reason for this could be that the MACCSKeys descriptor
of the charged functional group ameliorates the overestimation of
the implicit electrostatic effect in the PMF score.
Table 5
Binding Energies for the PMF and PMF
+ MACCSKeys Models Along with the Experimental Values, Exemplifying
Change in Predicted Value by Adding MACCSKeys (kcal/mol)
model
PDB ID
PMF
PMF + MACCSKeys
exptl.
4J3L
–8.58
–10.42
–10.67
3ARP
–10.41
–9.91
–9.82
3LKA
–8.88
–7.50
–3.87
Figure 5
Ligand structures included
in the test data set. For 1 and 2, our scoring
function showed improved prediction
accuracy with the incorporation of ligand fingerprints. For 3, the improvement over PMF score was not observed. The corresponding
PDB IDs are shown in parentheses.
Ligand structures included
in the test data set. For 1 and 2, our scoring
function showed improved prediction
accuracy with the incorporation of ligand fingerprints. For 3, the improvement over PMF score was not observed. The corresponding
PDB IDs are shown in parentheses.We also investigated systems with large deviations
from the experimental
values. As also shown in Table , the PMF + MACCSKeys model for a system with PDB ID of 3LKA gave a large error
(3.63 kcal/mol) from the experimental value. Such a large error was
also observed in the PMF + ECFP model trained with LightGBM. The error
calculated by the PMF model alone was 5.01 kcal/mol, which means that
the interaction is not correctly described at the stage by the PMF
model. From these results, we can see that the inclusion of ligand
fingerprints in the PMF score did not sufficiently improve the prediction
accuracy for this system. In the PMF score, the nitrogen atom in compound 3 is classified as the atom type of ND. However, the environment
of the nitrogen atom in sulfonamides would be different from that
of most nitrogen atoms with the ND type because it is bound to a sulfonyl
group. For this reason, its classification into ND is considered to
be inappropriate for the nitrogen atom in compound 3,
and the error of the PMF model is likely to be large. As shown above,
a model including ligand fingerprints can greatly improve the prediction
accuracy for many systems, but no such effect was observed for 3LKA.
The reason for this may lie in the training data. In this study, the
PDBbind structures were used as training data without any structural
optimization. In addition, no evaluation of the protonation state
was performed for both the ligands and the proteins. Such lack of
structural refinement may have caused the poor prediction accuracy
for 3LKA. If we can prepare a large amount of training data with structural
optimization and protonation state evaluation in the future, the prediction
accuracy may be further improved.
Conclusions
In this study, we developed a new scoring function for the protein–ligand
binding energy prediction by introducing the ligand fingerprints and
protein fingerprints into the knowledge-based PMF function. Here,
the machine learning techniques LASSO and LightGBM were used to train
the scoring functions. The PMF + ECFP + PF model trained with LightGBM
showed the best results, with a Pearson correlation coefficient of
0.79 and a RMSE of 1.64 kcal/mol, indicating a higher prediction accuracy
for binding energy than the conventional scoring functions. Further
analysis revealed that the ECFP and PF fingerprints are suitable for
combination with PMF regardless of the two machine learning methods
used in this study. To the best of our knowledge, this is the first
example of introducing protein fingerprints into a scoring function.
Although there are scoring functions that show better correlation
coefficients than our results,[37−41] the results of this study successfully demonstrate the effectiveness
of protein fingerprints as well as ligand fingerprints.We also
analyzed the components of the binding energy obtained
from our newly developed scoring function and found that the inclusion
of the ligand fingerprints into the PMF score improves prediction
accuracy, especially for halogen atoms and functional groups involved
in hydrogen bonding. On the other hand, we found a specific system
in which the use of ligand fingerprints was not effective. This analysis
strongly suggests the need for improving the quality of the training
data with structural optimization and protonation state evaluation.Incorporating ligand and protein structural information using the
fingerprint methods into the scoring function is a promising approach
for predicting binding energies. The PF fingerprints developed in
this study are based on a representation using only three descriptors
for each amino acid. Even such a simple handling of the protein structure
significantly contributed to the improvement in prediction accuracy.
A more appropriate representation of the protein structure would possibly
lead to better results of the binding energies. The development of
advanced protein fingerprints is a future challenge. The proposed
methods and findings in this study will be useful for the development
of computational drug discovery.
Authors: Helen M Berman; Tammy Battistuz; T N Bhat; Wolfgang F Bluhm; Philip E Bourne; Kyle Burkhardt; Zukang Feng; Gary L Gilliland; Lisa Iype; Shri Jain; Phoebe Fagan; Jessica Marvin; David Padilla; Veerasamy Ravichandran; Bohdan Schneider; Narmada Thanki; Helge Weissig; John D Westbrook; Christine Zardecki Journal: Acta Crystallogr D Biol Crystallogr Date: 2002-05-29
Authors: Richard A Friesner; Jay L Banks; Robert B Murphy; Thomas A Halgren; Jasna J Klicic; Daniel T Mainz; Matthew P Repasky; Eric H Knoll; Mee Shelley; Jason K Perry; David E Shaw; Perry Francis; Peter S Shenkin Journal: J Med Chem Date: 2004-03-25 Impact factor: 7.446