Huisi Liu1,2, Minyi Su2, Hai-Xia Lin1, Renxiao Wang3, Yan Li3. 1. Department of Chemistry, College of Sciences, Shanghai University, 99 Shangda Road, Shanghai 200444, People's Republic of China. 2. State Key Laboratory of Bioorganic and Natural Products Chemistry, Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences, Shanghai 200032, People's Republic of China. 3. Department of Medicinal Chemistry, School of Pharmacy, Fudan University, 826 Zhangheng Road, Shanghai 201203, People's Republic of China.
Abstract
Protein-ligand binding affinity reflects the equilibrium thermodynamics of the protein-ligand binding process. Binding/unbinding kinetics is the other side of the coin. Computational models for interpreting the quantitative structure-kinetics relationship (QSKR) aim at predicting protein-ligand binding/unbinding kinetics based on protein structure, ligand structure, or their complex structure, which in principle can provide a more rational basis for structure-based drug design. Thus far, most of the public data sets used for deriving such QSKR models are rather limited in sample size and structural diversity. To tackle this problem, we have compiled a set of 680 protein-ligand complexes with experimental dissociation rate constants (k off), which were mainly curated from the references accumulated for updating our PDBbind database. Three-dimensional structure of each protein-ligand complex in this data set was either retrieved from the Protein Data Bank or carefully modeled based on a proper template. The entire data set covers 155 types of protein, with their dissociation kinetic constants (k off) spanning nearly 10 orders of magnitude. To the best of our knowledge, this data set is the largest of its kind reported publicly. Utilizing this data set, we derived a random forest (RF) model based on protein-ligand atom pair descriptors for predicting k off values. We also demonstrated that utilizing modeled structures as additional training samples will benefit the model performance. The RF model with mixed structures can serve as a baseline for testifying other more sophisticated QSKR models. The whole data set, namely, PDBbind-koff-2020, is available for free download at our PDBbind-CN web site (http://www.pdbbind.org.cn/download.php).
Protein-ligand binding affinity reflects the equilibrium thermodynamics of the protein-ligand binding process. Binding/unbinding kinetics is the other side of the coin. Computational models for interpreting the quantitative structure-kinetics relationship (QSKR) aim at predicting protein-ligand binding/unbinding kinetics based on protein structure, ligand structure, or their complex structure, which in principle can provide a more rational basis for structure-based drug design. Thus far, most of the public data sets used for deriving such QSKR models are rather limited in sample size and structural diversity. To tackle this problem, we have compiled a set of 680 protein-ligand complexes with experimental dissociation rate constants (k off), which were mainly curated from the references accumulated for updating our PDBbind database. Three-dimensional structure of each protein-ligand complex in this data set was either retrieved from the Protein Data Bank or carefully modeled based on a proper template. The entire data set covers 155 types of protein, with their dissociation kinetic constants (k off) spanning nearly 10 orders of magnitude. To the best of our knowledge, this data set is the largest of its kind reported publicly. Utilizing this data set, we derived a random forest (RF) model based on protein-ligand atom pair descriptors for predicting k off values. We also demonstrated that utilizing modeled structures as additional training samples will benefit the model performance. The RF model with mixed structures can serve as a baseline for testifying other more sophisticated QSKR models. The whole data set, namely, PDBbind-koff-2020, is available for free download at our PDBbind-CN web site (http://www.pdbbind.org.cn/download.php).
Optimizing the affinity and selectivity of the candidate drug to
the target is still the mainstream strategy in current drug discovery
and development programs. However, the success rate of final approval
of the drug is still disappointingly low despite the tremendous progress
made in both experimental and computational techniques. Some studies
suggest that in vivo activity of a drug is strongly correlated with
the binding/unbinding kinetic properties rather than the equilibrium
thermodynamic properties in the drug-target binding process.[1−4] For example, Guo et al. studied the relationships between the intracellular
efficacy and the binding affinity, the residence time of 10 adenosine
A2A receptor agonists.[2] It was found that
the intracellular efficacy of these agonists correlated much better
with the residence time (R2 = 0.90) than
the inhibition constant Ki did (R2 = 0.13). Doornbos et al. reported a series
of 7-aryl-1,2,4-triazolo[4,3-a]pyridines with mGlu2
positive allosteric modulator activity and affinity.[4] Evaluations of the compounds with short, medium, and long
residence time in a label-free assay demonstrated a correlation between
the residence time and their pharmaceutical effect. They also pointed
out that affinity-only driven selection would lead to compounds with
high kon values but not necessarily optimized
residence time, which is the key indicator for predicting optimal
efficacy in vivo. All these results indicate that improving the protein–ligand
binding/unbinding kinetics has become a new trend in lead optimization.[5,6]Kinetic properties of protein–ligand binding (i.e.,
residence
time and association and dissociation rate constants) can be measured
by some experimental methods, such as surface plasmon resonance (SPR),
radioligand binding, time-resolved fluorescence resonance energy transfer,
and so on. However, all these experimental techniques are not only
expensive and time-consuming but also insufficient to characterize
the kinetics of protein–ligand interactions in detail. Therefore,
it is necessary to develop computational methods for elucidating the
protein–ligand binding/unbinding process. To answer this demand,
molecular dynamics (MD) simulations have been widely applied to explore
the dissociation pathway of the ligand from the protein in full atom
detail.[7−10] For example, Platter and Noe combined an extensive set of 150 μs
MD simulation and Markov model-based analysis to investigate the interplay
of conformational change and ligand-binding kinetics for the serine
protease trypsin and its competitive inhibitor benzamidine.[9] Seven metastable conformations with different
binding pocket structures were found to interconvert at timescales
of tens of microseconds, which all had corresponding crystal structures
released in PDB. Miao et al. developed a new computational method,
ligand Gaussian accelerated MD, which provides a powerful enhanced
sampling approach for characterizing ligand binding thermodynamics
and kinetics simultaneously.[10] It was applied
to the protein–ligand binding model system, that is, trypsin
complexed with a benzamidine inhibitor. The calculated ligand binding
free energy and kinetic rate constants were well consistent with the
experimental data. Obviously, MD simulation is gradually playing a
pivotal role in investigating the protein–ligand binding kinetics.
However, MD-based methods usually require a lot of computational resources
and thus are not suitable for high-throughput applications.In recent years, with the increasing availability of protein binding
kinetics data, data-driven modeling provides an alternative and efficient
solution for studying the protein–ligand interaction kinetics.
For example, the typical quantitative structure–kinetics relationship
(QSKR) strategy has been adopted to build models for predicting protein–ligand
dissociation kinetic constants and give guidance for structure-based
designs.[11−13] Mei H. et al. adopted partial least squares and support
vector machine to build a QSKR model on a training set composed of
37 HIV-1 protease inhibitors.[12] The predictive
model showed that several descriptors derived from water and hydrophobic
probes are dominant factors for the kinetic and thermodynamic properties.
Wade et al. applied a similar protocol to the data set of 66 HSP and
33 HIV-1 protease inhibitors by using the comparative binding energy
analysis.[13] Target-specific scoring functions
were derived by correlating koff with
a subset of weighted interaction energy components, which were promising
for binding kinetics-guided lead optimization.[13]Obviously, a data set of protein–ligand complexes
with reliable
kinetics data and three-dimensional (3D) structures is a prerequisite
for the development of QSKR models. However, current public collections
of protein–ligand kinetic data are rather rare. Some well-known
public databases (such as ChEMBL,[14,15] Binding MOAD,[16] etc.) that collect the protein–ligand
binding constants do not include kinetic-related data. This is why
the published studies have focused on data sets with very few targets.
Even though there are some databases that contain kinetic data (such
as KDBI,[17,18] KOFFI,[19] BindingDB,[20] etc.), they are either small in size or lack
the 3D structures of the protein–ligand complexes. Recently,
Fedorov et al. have described a set of 501 protein–ligand complexes
with dissociation rate constants collected from the public literature.[21] This data set was the largest data set of this
kind to the best of our knowledge. In addition, Fedorov et al. utilized
their data set to develop a random forest (RF) model for predicting
dissociation rate constants. Their work probably represents the state-of-the-art
of QSKR models.To overcome the limitations in the development
of QSKR models,
we have attempted to compile an even larger data set of protein–ligand
binding kinetic data and the corresponding protein–ligand complex
structures. Our data set covers more diverse target proteins and ligand
molecules, which can serve as a benchmark for developing or validating
computational QSKR models. For this purpose, we utilized the 38000+
references accumulated during our work of updating the PDBbind database[22,23] to collect protein–ligand dissociation rate constants (i.e., koff) and relevant information (e.g., protein–ligand
complex structure, experimental methods, etc.). In addition, we integrated
two data sets described in the public literature[12,13] into our data set. Finally, our data set contained experimental koff values for a total of 680 protein–ligand
complexes. These complexes were formed by 155 types of protein and
a wide range of ligand molecules. The 3D structure of each protein–ligand
complex in our data set was either retrieved from the Protein Data
Bank (PDB)[24] or carefully modeled based
on a proper template, making our data set more suitable for QSKR studies.
Based on our data set, we also provided an RF model for predicting koff values, which is in concept similar to Fedorov’s
model.[21] Our model may serve as a baseline
model for evaluating other QSKR models. The whole data set, namely, PDBbind-koff-2020, is freely available from our PDBbind-CN
web site (http://www.pdbbind.org.cn/download.php).
Methods
Collection of Protein–Ligand
Complexes
with Kinetic Constants
The protein–ligand binding
kinetic data in our data set were mainly collected from the references
that we have processed for the regular update of the PDBbind database.[22,23] As on January 2020, we had accumulated a total of over 38,000 references
for this purpose. All those references were first filtered by a few
keywords (e.g., kon, koff, off-rate, and dissociation rate). This step was conducted
with a computer program and resulted in ∼3400 references. Then,
those references were manually read to curate koff values of protein–ligand complex, the applied experimental
methods and settings, and other relevant information. During this
process, kinetic data of covalently bound protein–ligand complexes
were ignored. If the ligand molecule contained unusual elements (such
as B, Be, etc.), the data of this complex was not considered, either.
For the collected complexes, their corresponding target information
(i.e., PDB code and UniProt ID) was also recorded. After preliminary
collection, we found that the kinetic data collected were redundant
in several situations. The first case was that for the same protein–ligand
complex, multiple sets of kinetic data and complex structures were
obtained from different experimental methods or conditions. In this
case, we gave priority to retaining the kinetic data and its corresponding
structure measured by the SPR method under normal conditions (such
as room temperature). The second case was that multiple sets of kinetic
data for a protein–ligand complex were also determined from
different research groups, but not each one corresponded to a complex
structure. Thus, the complex with a crystal structure and its kinetic
data should be retained first. If there were multiple crystal structures
to choose from, the priority was given to retaining the one with better
resolution. The last case was that the proteins in the complex structures
had mutations or were from different species. In this case, we checked
whether the residues around the binding site were significantly different.
If so, they were considered as different complex structures and their
kinetic data were retained as unique. Otherwise, they were considered
as the same complexes, which would be handled as previous two cases.In addition, we also solicited two data sets of protein–ligand
dissociation kinetic constants from the literature, one containing
39 HIV-1 protease complexes[12] and the other
containing 89 HSP90 complexes.[13] Finally,
a total of 680 records of koff with their
target and ligand information were included in our data set. The complete
information of the data set is provided in the Supporting Information (Table S1). To the best of our knowledge,
this data set is the largest collection of protein–ligand dissociation
kinetic constants supplied with complex structures, which is publicly
reported in the literature.
Process of Protein–Ligand
Complex Structures
We required each protein–ligand
complex included in our
data set to have a 3D structure. This is obviously an important feature
for our data set to become a popular benchmark. In our data set, some
samples had known crystal complex structures, while others did not.
For those samples with known crystal complex structures, we downloaded
the coordinates from the PDB (http://www.rcsb.org/pdb).[24] The
original structural files from the PDB were processed so that they
could be readily applied to the software for computational tasks.
The biological unit of each complex was split into a protein molecule
and a ligand molecule. They were prepared using the Protein Preparation
Wizard in Schrödinger software (Schrödinger LLC, version
2019). Water molecules and metal ions were removed from the protein
structure. In some cases, the coordinates of some residues or heavy
atoms were missing from the protein structure. Thus, the “Advanced
Homology Modeling” module in Schrödinger was used to
deal with this problem. The ligand molecule was also visually examined
and corrected. The ProToss tool (https://protein.plus/) developed by Rarey’s group[25] was used to add hydrogen atoms to the protein–ligand
complex and predict the protonation states. The processed protein
molecule and ligand molecule were saved in a PDB file format and a
MOL2 file format, respectively.For those samples without known
crystal complex structures, we first searched the PDB to find out
whether there were crystal complexes with the same protein and similar
ligands. We then constructed the 3D structure of the unknown ligand
based on the maximum common substructure (MCS) of the matched ligand.
Take compound 1 as an example (Figure ). Its congeneric compound 2 was found to bind with the target protein choline kinase (PDB code: 5EQY). The structure
of compound 2 was extracted directly from the crystal
structure and defined as the template. The 3D structure of compound 1 could be obtained by simply replacing the cyano and its
adjacent hydrogen atom with two fluorine atoms, keeping the coordinates
of the MCS unchanged. Assuming no apparent violations in the binding
pocket, we believe that ligands derived based on the same molecular
framework should adopt a similar binding pose. Therefore, it is reasonable
to some extent to build a 3D ligand based on a common molecular framework.
This strategy was also adopted in the previous work.[26,27]
Figure 1
Illustration
of the process of constructing a 3D model of a protein–ligand
complex without an available crystal structure. The template ligand
was selected from the crystal structure in PDB entry 5EQY as an example.
Illustration
of the process of constructing a 3D model of a protein–ligand
complex without an available crystal structure. The template ligand
was selected from the crystal structure in PDB entry 5EQY as an example.
Modeling Protein–Ligand
Complex Structures
In our work, the GLIDE module with the
GlideScore-SP scoring function[28,29] implemented in the
Schrödinger software was adopted to predict
the binding poses of the complexes with unknown crystal structures.
Based on the assumption that ligands with similar chemical structures
may adopt similar binding modes to the target protein, we visually
inspected the docking output results and selected the reasonable binding
pose. Here, we defined the binding pose that was most similar to the
template ligand as the “best-fit” binding pose. If the
“best-fit” binding pose was found, it was selected as
the possible reasonable binding pose. However, for a few ligands that
could not get a reference to determine the “best-fit”
binding pose, we selected 1–3 representatives from the “high-score”
binding poses. Both the “best-fit” and “high-score”
binding poses would be applied to MD simulations with subsequent binding
free energy analysis to determine the final reasonable protein–ligand
binding pose.Short-length MD simulation was applied to further
refine the protein–ligand complex structure. Here, we utilized
the AMBER software (version 2014)[30] to
carry out the MD simulation tasks. The GAFF force field parameters
were applied to the ligand molecule. Its atomic partial charges were
derived with the RESP method[31] based on
the HF/6-31G* computation results given by the Gaussian software (version
09, released by Gaussian Inc).[32] The protein
atoms were assigned according to the FF14SB template charges, and
all ionizable residues were set at the default protonation states
at pH = 7. The complex structure was then soaked in a cubic box of
TIP3P water molecules with a margin of 10 Å along each dimension.
Counterions (i.e., Na+ or Cl–) were added
to neutralize the whole system.A regular MD protocol was applied
to the prepared protein–ligand–water
complex structure, which includes the following: (1) Minimizing the
complex structure using the steepest descent and the conjugate gradient
algorithms. (2) Heating from 0 to 300 K in 200 ps followed by a 300
ps equilibration under the NPT ensemble. In this process, the heavy
atoms of the protein–ligand complex were restrained with a
constant of 10.0 kcal mol–1 Å–2. (3) Restraining the protein–ligand complex from the crystal
structure for a 2 ns production (restraint_wt = 10.0 kcal mol–1 Å–2). For the protein–ligand
complex from molecular docking, a production simulation of 2 ns long
was performed without restraints.We first analyzed the conformational
fluctuations of the protein–ligand
complex during the 2 ns MD simulation. In most systems, the complex
structure reached convergence in terms of RMSD. A total of 45 complexes
required another 2 ns production to reach convergence. The MM-GB/SA
method was used to calculate the binding free energy of the final
snapshot on the MD trajectory.[33] As mentioned
above, “best-fit” and “high-score” binding
poses may exist for the protein–ligand complex from molecular
docking. According to the MM-GB/SA results, we selected the one with
the more favorable (i.e., more negative) binding energy as the final
structural model of the given protein–ligand complex.
Simple Model for Predicting Protein–Ligand
Dissociation Constants
Fedorov’s work demonstrated
what an RF model could achieve on their data set.[21] Because our data set is the same in nature as theirs, we
also trained a similar RF model[34] for computing
the koff value of a given protein–ligand
complex on our data set. To achieve this goal, the atom pair descriptors
implemented in the RF-Score scoring function[35] were adopted here to construct the RF model. Four types of protein
atoms (i.e., C, N, O, and S) and nine types of ligand atoms (i.e.,
C, N, O, F, P, S, Cl, Br, and I) were defined, resulting in a total
of 36 protein–ligand atom pairs. Those atom pair descriptors
count the occurrence of each possible protein–ligand atom pair
within a distance range of 12 Å. This relatively wide range was
intended to cover the protein structure relevant to the ligand dissociation
pathway. This range was equally spaced into six bins, that is, [0–2),
[2–4), [4–6), [6–8), [8–10), and [10-12].
As result, the final descriptor for encoding a protein–ligand
complex was a one-dimensional (1D) vector consisting of 36 ×
6 = 216 elements (Figure ).
Figure 2
Definition of the protein–ligand atom pair descriptors in
our RF model. Each protein–ligand complex was encoded as an
1D vector consisting of 216 elements, which included the occurrences
of 36 protein–ligand atom pairs in six even bins (0–12
Å).
Definition of the protein–ligand atom pair descriptors in
our RF model. Each protein–ligand complex was encoded as an
1D vector consisting of 216 elements, which included the occurrences
of 36 protein–ligand atom pairs in six even bins (0–12
Å).In order to evaluate the predictive
power of this model, we selected
the complexes formed by the three most populated proteins in our data
set, that is, HSP90 (89 complexes), HIV-1 protease (39 complexes),
and FAK (33 complexes), as the external test sets. Besides, 10 complexes
including a large stapled peptide as the ligand were defined as the
fourth test set, because those peptides had very different physiochemical
properties from common small-molecule ligands. Based on the distance
between two complexes calculated by the atom pair descriptors, the
Kennard–Stone algorithm[36] was applied
to split the remaining 509 complexes into a training set (407 complexes)
and a validation set (102 complexes) in a ratio of 8:2. To examine
whether the modeled structures can be used as an extended data set
in the development of machine learning prediction models, we also
trained and validated the RF models by using two sets of 312 crystal
structures and 197 modeled structures separated from the 509 complexes.
The two data sets were also split into training and validation sets
in a ratio of 8:2 by using the Kennard–Stone algorithm. The
RF models trained and tested on the crystal structures, modeled structures,
and mixed structures were named “RF_crystal”, “RF_model”,
and “RF_mixed”, respectively.We used the Scikit-learn
(version 0.21.3) software[37] to construct
the RF regression model. Note that three hyperparameters
significantly affected the performance of an RF model. They were the
number of data used in the model (n_estimators),
the maximal depth of the decision trees (max_depth), and the criterion of feature selection (criterion). The “grid search” method was applied to sample those
hyperparameters. It first determined the range of each hyperparameter
and then attempted all combinations of those three hyperparameters
to select the optimal one. During the sampling process, the value
of n_estimators gradually increased
from 10 to 600 at a step size of 10, and the value of max_depth increased from 2 to 30 at a step size of 2. With a certain set of
hyperparameters, the RF model was first generated by considering all
samples in the training set and then evaluated on the validation set.
The optimal set of hyperparameters was recorded when the Pearson correlation
coefficient (Rp) between the predicted
values and the experimental data in the validation set reached the
maximum.
Results and Discussion
Basic Features of the Data Set
Our
data set mainly comes from two resources. Most of the data are kinetic
constants manually curated from the literature included in the PDBbind
database. The rest part of the data was taken from public literature.
Basic information of the protein–ligand complexes in this data
set, including PDB codes, experimental kinetic data, detection method,
temperature, protein names, UniProt IDs, ligand SMILEs, and the related
citations, are summarized in Table S1 in the Supporting Information. In all 680 records of the protein–ligand
dissociation kinetic constants, over half of them (54.7%, 372) have
the corresponding complex structures resolved by X-ray crystal diffraction
method. For the remaining ones, their complex structures were derived
through molecular modeling (i.e., molecular docking followed by short-length
MD simulation). This data set, including the experimental koff values and the 3D structures of all complexes,
is free available from our PDBbind-CN web site (http://www.pdbbind.org.cn/download.php). This data set is named PDBbind-koff-2020 because
it is basically an extension of PDBbind version 2020.The curated
binding kinetic data koff stem from a
variety of detection methods (Figure ). Overall, SPR took up the largest share, with 69.6%
of all detection methods. Fluorescence-based methods ranked second,
accounting for approximately 15.5%. Other methods were used less frequently,
such as biolayer interferometry, isothermal titration calorimetry,
radioligand binding analysis, and so on. The dissociation rate constants
(koff) of the protein–ligand complexes
in this data set range from −3 to 7 (in log units), spanning
over nearly 10 magnitudes (Figure ). The Shapiro–Wilk normal distribution test[38] was performed on the data set, and the result
showed that the distribution conformed to the normal distribution
at the 95% confidence level (p-value = 0.144).
Figure 3
Distribution
of the detection methods used for measuring the protein–ligand
dissociation rate constants (koff) included
in our data set.
Figure 4
Distribution of the dissociation
rate constants (koff in s–1) of the 680 ligands in our
data set.
Distribution
of the detection methods used for measuring the protein–ligand
dissociation rate constants (koff) included
in our data set.Distribution of the dissociation
rate constants (koff in s–1) of the 680 ligands in our
data set.The entire data set was further
clustered by the sequence of the
protein molecule in each complex using the CD-hit program (v 4.5.4),[39] with the similarity threshold setting to 90%.
The clustering results indicated that all samples in the data set
could be divided into 155 groups, covering different protein families.
The top 10 protein families are HSP90 (89 complexes), HIV-1 protease
(39 complexes), focal adhesion kinase (33 complexes), p38α (26
complexes), soluble epoxide hydrolase (sEH) (19 complexes), thermolysin
(17 complexes), lipid transfer protein (11 complexes), thrombin (11
complexes), FimH protein (11 complexes), and eukaryotic translation
initiation factor (10 complexes). Distributions of six key “drug-likeness”
properties of the ligand molecules in this data set, including molecular
weight (MW), the number of rotatable bonds (RBs), the number of hydrogen
bond donor atoms (HBDs), the number of hydrogen bond acceptor atoms
(HBAs), the polar surface area (PSA),[40] and the computed log P value,[41] are also given in Figure . A total of 314 ligands satisfied the Lipinski’s
rule of 5,[42] accounting for 46.2% of the
entire set. However, there are also a number of ligand molecules with
MW or other properties far beyond Lipinski’s rule of 5. In
general, our data set shows significant diversity, whether it is from
a variety of protein types or the wide distribution of the physiochemical
properties of ligands.
Figure 5
Distribution of six “drug-likeness” properties
of
the 680 ligands in our data set. (A) MW; (B) number of RBs; (C) number
of HBAs; (D) number of hydrogen bond donors; (E) PSA; and (F) octanol–water
partition coefficient (log P).
Distribution of six “drug-likeness” properties
of
the 680 ligands in our data set. (A) MW; (B) number of RBs; (C) number
of HBAs; (D) number of hydrogen bond donors; (E) PSA; and (F) octanol–water
partition coefficient (log P).
Baseline Model for Predicting Dissociation
Rate Constants
Once publicly released, we expect that our
data set will be employed by other researchers to develop computational
models for predicting ligand dissociation rate constants. Using the
ligand “drug-likeness” properties as the simplest molecular
descriptors, we found that they were basically uncorrelated with experimental
unbinding kinetics measured by −log koff (Figure S1). It indicates that
it is inappropriate to predict koff by
the linear regression of simple descriptors. Fedorov’s work
proposed that the RF model could be used as a baseline to well correlate
the protein–ligand interactions with their dissociation kinetics.[21] Therefore, we have derived a simple RF model
based on our data set as described in the Methods section.The purpose of this RF model is to set a baseline
to testify the true power of other predictive models derived on the
same data set. This RF model was constructed by using protein–ligand
atom pairs as the input descriptors. Those descriptors are used to
encode the protein environment relevant to protein–ligand interaction
as well as ligand dissociation kinetics. The performance of all tested
RF models, as evaluated on the validation set of 102 diverse protein–ligand
complexes, is illustrated as a function of the hyperparameters including “n_estimators”, “max_depth”, and “criterion” in Figure . The final optimal
RF model was selected when n_estimators = 50, max_depth = 14, and criterion = “mae”. Under this setting, the RF model produced a
Pearson correlation coefficient (Rp) of
0.968 and an RMSE of 0.474 on the training set (Figure A) and a Pearson correlation coefficient
(Rp) of 0.706 and an RMSE of 0.986 (in
−log koff units) on the validation
set (Figure B). These
results were generally consistent with Fedorov’s RF models.
We also tested the ability of an empirical scoring function X-Score[43] to predict the koff values. The Pearson correlation coefficients on the training set
and test set were 0.227 and 0.101, respectively (Figure S2). It again demonstrates that these scoring models
developed based on protein–ligand equilibrium thermodynamics
cannot be used to predict ligand kinetics at all.
Figure 6
Pearson correlation coefficients
between experimentally measured
dissociation rate constants (−log koff) in the validation set and the corresponding predicted values produced
by the RF model. The correlation coefficients are plotted as the function
of “n_estimators” and “max_depth” when the criterion was “mae” (A) or “mse”
(B).
Pearson correlation coefficients
between experimentally measured
dissociation rate constants (−log koff) in the validation set and the corresponding predicted values produced
by the RF model. The correlation coefficients are plotted as the function
of “n_estimators” and “max_depth” when the criterion was “mae” (A) or “mse”
(B).The RF model was tested on four
external sets. On three of them,
that is, HSP90, HIV-1 protease, and FAK, the Pearson correlation coefficients
produced by this RF model were 0.501, 0.306, and 0.241, respectively,
with corresponding RMSE values of 0.891, 1.602, and 1.044 (Figure C–E). Thus,
this model exhibited moderate predictive power on the HSP 90 test
set but obviously failed on the HIV-1 protease and FAK test sets.
We have applied the t-SNE software[44] to
visualize the protein–ligand complexes in our data set after
dimension reduction conducted on atom-pair descriptors. The outcome
reveals that the HSP90 complexes are relatively dispersed in the descriptor
space compared to the HIV-1 protease complexes and FAK complexes in
our data set (Figure ). In other words, it indicates that the ligand molecules in the
HIV-1 protease complexes and FAK complexes in our data set represent
certain limited types of structure, and those types of structure are
not well understood by our RF model. This RF model was also evaluated
on the test set of large stapled peptides. Here, one can see a good
correlation coefficient (Rp = 0.834),
but the model does not reproduce the absolute values of experimental
data well (Figure F).
Figure 7
Correlation between experimentally measured dissociation rate constants
(−log koff) and the corresponding
predicted values for: (A) training set, where N =
407, Rp = 0.968, and RMSE = 0.474; (B)
validation set, where N = 102, Rp = 0.706, and RMSE = 0.986; (C) test set of HSP90 complexes,
where N = 89, Rp = 0.501,
and RMSE = 0.891; (D) test set of HIV-1 protease complexes, where N = 39, Rp = 0.306, and RMSE
= 1.602; (E) test set of FAK complexes, where N =
33, Rp = 0.241, and RMSE = 1.044; and
(F) test set of complexes containing stapled peptide ligands, where N = 10, Rp = 0.834, and RMSE
= 1.546.
Figure 8
t-SNE visualization of the protein–ligand
complexes in our
data set based on protein–ligand atom pair descriptors. Complexes
of the 10 most populated proteins are rendered as colored circles,
while other complexes are rendered as gray circles.
Correlation between experimentally measured dissociation rate constants
(−log koff) and the corresponding
predicted values for: (A) training set, where N =
407, Rp = 0.968, and RMSE = 0.474; (B)
validation set, where N = 102, Rp = 0.706, and RMSE = 0.986; (C) test set of HSP90 complexes,
where N = 89, Rp = 0.501,
and RMSE = 0.891; (D) test set of HIV-1 protease complexes, where N = 39, Rp = 0.306, and RMSE
= 1.602; (E) test set of FAK complexes, where N =
33, Rp = 0.241, and RMSE = 1.044; and
(F) test set of complexes containing stapled peptide ligands, where N = 10, Rp = 0.834, and RMSE
= 1.546.t-SNE visualization of the protein–ligand
complexes in our
data set based on protein–ligand atom pair descriptors. Complexes
of the 10 most populated proteins are rendered as colored circles,
while other complexes are rendered as gray circles.Our results described above indicate that it might be too
naïve
to predict the protein–ligand dissociation rate constant based
merely on a static protein–ligand complex structure. Again,
our purpose is to showcase how well a baseline model can perform in
those cases, and hopefully one will have a more realistic interpretation
of other computational models, either machine-learned or not, derived
for dissociation rate prediction.To verify whether utilizing
modeled structures as additional training
samples will benefit the model performance, we also trained and validated
the RF models by using two separate data sets containing only crystal
structures and only modeled structures. The splitting of training
and validation sets followed the same workflow as the previous one.
Crystal structures and modeled structures in the HSP set and HIV-1
set were used as external test sets for these new models. FAK set
and stapled peptide set were not considered for validations due to
the limited sample size. The performances of three models (i.e., “RF_crystal”,
“RF_model”, and “RF_mixed”) are summarized
in Table . Evaluation
results showed that the model trained on modeled structures had comparable
performance to the model trained on crystal structures. “RF_model”
and “RF_crystal” produced Pearson correlation coefficients
of 0.971 and 0.975 on the training sets and 0.677 and 0.635 on the
validation sets, respectively. Compared to “RF_crystal”,
“RF_model” had a slightly higher RMSE value on the validation
set. These results were a bit worse compared to the model with mixed
structures, that is, “RF_mixed”. It is actually understandable,
because fewer samples were used for training these two separate models,
leading to some key features probably missing in model learning. Thinking
in turn, it also confirmed that using the modeled structures as an
extended set is beneficial for the development of machine learning
models.
Table 1
Performances of Models Trained on
Crystal Structures, Modeled Structures, and Mixed Structuresa
data sets
training set
validation set
HSP set
HIV-1
set
N
Rp
RMSE
N
Rp
RMSE
N
Rp
RMSE
N
Rp
RMSE
RF_crystal
250
0.975
0.457
62
0.635
0.943
39
0.463
0.968
12
0.337
1.782
RF_model
158
0.971
0.458
39
0.677
1.003
50
–0.348
1.012
27
–0.023
1.374
RF_mixed
407
0.968
0.474
102
0.706
0.986
89
0.501
0.891
39
0.306
1.602
The validation results for the FAK
set and stapled peptide set are not presented here due to the small
sample size.
The validation results for the FAK
set and stapled peptide set are not presented here due to the small
sample size.Furthermore,
we found that “RF_model” performed poorly
on the external HSP and HIV-1 sets. It was partially due to the smaller
size of the training set compared to the crystal structures. On the
other hand, all modeled structures were constructed based on the MCS
of the crystal ligands, which could be considered as a subset of the
chemical space covered by the crystal structures. Consequently, a
model trained on the modeled structures would miss some types that
were present only in crystal structures and thus performed worse when
it was extrapolated to an external test set. When the crystal structures
and modeled structures were combined for training, the model achieved
improved performance on both the validation set and external test
sets. This further supports the use of modeled structures to augment
training samples for developing machine learning models.
Comparison to Other Data Sets of Kinetic Data
A few
resources provide extensive collections of the thermodynamic
binding affinity data of protein–ligand complexes, such as
PDBbind,[22,23] ChEMBL,[14,15] Binding MOAD,[16] and BindingDB.[20] However,
public collections of kinetic data are rather rare. KDBI[17,18] provides experimentally measured kinetic data for protein–protein,
protein–nucleic acid, and protein–small molecule interactions.
Its latest release provides a total of 19,263 entries of kinetic data.
However, those kinetic data are not associated with corresponding
3D complex structures, which limited their applications to molecular
modeling. KOFFI is a similar database.[19] It added more information, such as the assay method, device, chip,
and so on. It collects a total of 1705 individual entries, most of
which are kinetic data of protein–protein and protein–nucleic
acid complexes, with very few records of protein–small molecules.
Note that KOFFI does not provide the linkage between kinetic data
and 3D complex structures either.BindingDB is a public, web-accessible
database of binding affinity data,[20] focusing
chiefly on the interactions of protein considered to be drug targets
with small, drug-like molecules. It also collects the binding kinetic
data of some targets and small molecules. The data set established
in this work was compared with the kinetic data included in BindingDB
in terms of sample size, protein type, and so on. As of May 10, 2021,
BindingDB contained a total of 301 non-redundant koff records of 12 protein types. Each protein has an average
of nearly 25 records of its ligands. In fact, the distribution of
the kinetic data in different protein types is extremely unequal.
The protein human leukocyte elastase had 194 records of kinetic data,
accounting for about two-thirds of all data. In contrast, our data
set includes kinetic data for 680 protein–ligand complexes,
which cover 155 different proteins. The size of our data set is about
twice the collection available in BindingDB. Besides, our data set
obviously presents a wider of structural diversity.The most
attractive feature of our data set is that a protein–ligand
complex structure is provided for each record of kinetic data. Here,
crystal structures are available for 372 complexes, which account
for 54.7% of the entire data set. Other similar database rarely collects
related crystal structures for these protein–ligand complexes,
such as KDBI and KOFFI. BindingDB also contains only 14 crystal structures
for their koff data. For the protein–ligand
complex with no available crystal structure, we constructed the initial
binding model based on a similar ligand template through molecular
docking. The binding models derived from crystal structures and molecular
docking were further refined by a short-time MD simulation. With these
3D structures and the corresponding kinetic data, we can use machine
learning methods to explore the QSKR between the targets and their
ligands, which cannot be achieved by other databases such as KDBI,
BindingDB, and so on.When we were preparing this article, Fedorov
et al. reported a
similar collection of kinetic data.[21] Their
data set consisted of 501 protein–ligand complexes with experimentally
measured dissociation rate constants. A comprehensive comparison of
the two data sets was carried out from the aspects of the koff data distribution, protein types, ligand
structural diversity, and complex structures. Although the −log koff data of the two sets are mainly located
in the range of −2 to 6, they have quite different distributions
(Figure S3A,B). The peak value is biased
toward the interval of 3–4 in Fedorov’s set, which does
not pass the Shapiro–Wilk normal distribution test (p-value < 0.05). The koff data collected in our set generally obey the Shapiro–Wilk
normal distribution (p-value = 0.144), with the peak
falling between 2 and 3. Setting the protein sequence identity above
90% as the threshold, Fedorov’s set contains 53 different protein
families, with an average of 9.5 protein–ligand complexes per
family. Our set covers 155 protein families, with an average of 4.4
complexes. We further compared the top 10 protein types that appear
frequently in the two data sets. In Fedorov’s set, the top
10 proteins accounted for more than two-thirds of the total sample
size, while in our data set, this number is only 39%, reflecting more
abundant protein types (Figure S4A,B).
It is worth mentioning that there are four common protein types that
frequently appear in the two data sets, including HSP90, HIV-1 protease,
p38α, and sEH. It indicates that they are the targets of most
interest in kinetics-guided drug discovery research. The physiochemical
properties of the ligands were also compared (Figure S5). Broader distributions of six physiochemical properties
were indicated for the ligands in our set. In summary, our data set
is more diverse in terms of kinetic data, target types, and ligand
properties. In addition, more than 50% of the protein–ligand
complexes in our set have crystal structures resolved by X-ray crystallography,
while this percentage in Fedorov’s set is only 33%. It demonstrates
the advantages of our data set in developing structure-based prediction
models.
Conclusions
A reliable
computational model for predicting protein–ligand
binding/unbinding kinetic properties can hopefully provide a more
rational basis for structure-based drug design. Apparently, a high-quality
data set is the basis for developing such QSKR models, especially
for machine-learning models. In this work, we have compiled a data
set of experimental dissociation rate constants and the corresponding
protein–ligand complex structures. This data set was mostly
curated from over 38,000 references accumulated during our work of
updating the PDBbind database. It includes 680 records of protein–ligand
dissociation rate constants (koff), ranging
from −3 to 7 (in log units). The target proteins included in
this data set could be clustered into 155 groups at a sequence similarity
cutoff of 90%, covering a range of enzymes, receptors, transporters,
and so on. The physiochemical properties of the ligand molecules included
in this data set also had a wide distribution. This data set has exceeded
those described in the literature in terms of sample size and structural
diversity. In addition, we have derived a simple RF model based on
this data set for predicting koff values.
It is of course naïve to expect that binding kinetics can be
depicted from a static protein–ligand complex structure. Thus,
this model should be considered as a baseline for testifying the true
power of other more sophisticated QSKR models. The whole data set,
namely, PDBbind-koff-2020, is freely available from
the PDBbind-CN web site (http://www.pdbbind.org.cn/download.php).
Authors: Thomas A Halgren; Robert B Murphy; Richard A Friesner; Hege S Beard; Leah L Frye; W Thomas Pollard; Jay L Banks Journal: J Med Chem Date: 2004-03-25 Impact factor: 7.446
Authors: Maarten L J Doornbos; José María Cid; Jordi Haubrich; Alexandro Nunes; Jasper W van de Sande; Sophie C Vermond; Thea Mulder-Krieger; Andrés A Trabanco; Abdellah Ahnaou; Wilhelmus H Drinkenburg; Hilde Lavreysen; Laura H Heitman; Adriaan P IJzerman; Gary Tresadern Journal: J Med Chem Date: 2017-08-01 Impact factor: 7.446
Authors: Doris A Schuetz; Wilhelmus Egbertus Arnout de Witte; Yin Cheong Wong; Bernhard Knasmueller; Lars Richter; Daria B Kokh; S Kashif Sadiq; Reggie Bosma; Indira Nederpelt; Laura H Heitman; Elena Segala; Marta Amaral; Dong Guo; Dorothee Andres; Victoria Georgi; Leigh A Stoddart; Steve Hill; Robert M Cooke; Chris De Graaf; Rob Leurs; Matthias Frech; Rebecca C Wade; Elizabeth Cunera Maria de Lange; Adriaan P IJzerman; Anke Müller-Fahrnow; Gerhard F Ecker Journal: Drug Discov Today Date: 2017-04-13 Impact factor: 7.851