Lena Kalinowsky1, Julia Weber1, Shantheya Balasupramaniam2, Knut Baumann2, Ewgenij Proschak1. 1. Institute of Pharmaceutical Chemistry, Goethe University Frankfurt, Max-von-Laue Str. 9, Frankfurt am Main D-60438, Germany. 2. Institute of Medicinal and Pharmaceutical Chemistry, University of Technology of Braunschweig, Beethovenstr. 55, Braunschweig D-38106, Germany.
Abstract
The prediction of protein-ligand interactions and their corresponding binding free energy is a challenging task in structure-based drug design and related applications. Docking and scoring is broadly used to propose the binding mode and underlying interactions as well as to provide a measure for ligand affinity or differentiate between active and inactive ligands. Various studies have revealed that most docking software packages reliably predict the binding mode, although scoring remains a challenge. Here, a diverse benchmark data set of 99 matched molecular pairs (3D-MMPs) with experimentally determined X-ray structures and corresponding binding affinities is introduced. This data set was used to study the predictive power of 13 commonly used scoring functions to demonstrate the applicability of the 3D-MMP data set as a valuable tool for benchmarking scoring functions.
The prediction of protein-ligand interactions and their corresponding binding free energy is a challenging task in structure-based drug design and related applications. Docking and scoring is broadly used to propose the binding mode and underlying interactions as well as to provide a measure for ligand affinity or differentiate between active and inactive ligands. Various studies have revealed that most docking software packages reliably predict the binding mode, although scoring remains a challenge. Here, a diverse benchmark data set of 99 matched molecular pairs (3D-MMPs) with experimentally determined X-ray structures and corresponding binding affinities is introduced. This data set was used to study the predictive power of 13 commonly used scoring functions to demonstrate the applicability of the 3D-MMP data set as a valuable tool for benchmarking scoring functions.
Since
the 1980s, a variety of docking and scoring methods have
been developed, which are used for three main purposes: the prediction
of the bioactive conformation of a known active ligand, virtual screening
to identify new ligands for a specific target, and the prediction
of binding affinities for a series of related compounds.[1] In a recently published comparative assessment
of scoring functions, 20 commercially and freely available scoring
functions were evaluated in terms of “docking power”,
“ranking power”, and “scoring power” using
a diverse test set of 195 protein–ligand complexes.[2,3] The docking power evaluates the ability to identify the active binding
mode among a decoy set of ligand binding poses. The ranking power
evaluates the ability to rank known ligands according to their binding
affinities. The scoring power evaluates the ability to generate scores
that are (preferably) linearly correlated with the experimental binding
data. Li et al. showed that the evaluated functions performed better
in the docking power test than in the scoring/ranking power test.[2,3] These results support the common assumption that the “docking”
problem has been solved for the case of rigid receptors, whereas the
“scoring” problem still remains a major challenge.[4] Unfortunately, current scoring functions are
still far from being able to accurately predict the binding free energy
of a protein–ligand complex. Additionally, the inclusion of
solvation and rotational entropy contributions as well as protein
reorganization energy in the calculation of the binding free energy
remains critical.[5−8] Furthermore, most of the scoring functions assume the binding affinity
to consist of the sum of several independent terms, which often leads
to scores that correlate with the molecular size rather than with
binding affinity.[4,9]To demonstrate the predictive
power and to investigate the strengths
and weaknesses of scoring functions, several benchmark test sets have
been developed.[10−12] These data sets are characterized by their high diversity
in terms of protein families, ligand chemotypes, and binding affinities.
The high diversity is well suited for the evaluation and comparison
of the global performance of docking and scoring software. However,
understanding the local behavior of a scoring function, for example,
how well it can differentiate between similar molecules, is almost
impossible with these data sets. Here, a novel benchmark data set
based on matched molecular pairs (MMPs) was developed to study the
local behavior of scoring functions. MMPs are defined as molecules
that differ in one well-defined transformation associated with a change
in an arbitrary molecular property (transformation effect).[13] The PDBbind core set[14,15] forms the basis of the diverse data set containing 99 co-crystallized
MMPs (3D-MMPs) stored together with the transformation effect on the
binding affinity of the corresponding ligands. The assembled 3D-MMP
data set was used to investigate whether the scoring functions can
correctly differentiate between chemically related compounds (i.e.,
the pairwise ranking power was assessed). Therefore, the 3D-MMPs were
scored in the respective crystal structures without any posing (i.e.,
the position of the small molecule was not changed) to focus on scoring
and to exclude the influence of posing (i.e., the placement algorithm).
Thirteen well-established scoring functions were included in the study
covering a broad range of different scoring technologies. Not included
were the recent machine-learning–based scoring functions. It
has been shown that the machine-learning part may greatly improve
the scoring and ranking power. Setting up the machine-learning part
of the scoring functions needs a training data set whose source also
commonly is the PDBbind database.[16−21] Hence, the complexes of the data set proposed here may already be
known to the respective machine-learning–based scoring function,
which would bias their results in the benchmark. Although it cannot
be ruled out that some or all of the complexes were used to parametrize
one or several of the studied scoring functions, the influence of
being included in the training set of a machine-learning–based
scoring function on the resulting scoring power is expected to be
far greater than in cases of classically parametrized scoring functions.
In the former case, the machine-learning–based scoring function
simply needs to recall the result of the respective complex. As a
result, this initial analysis of the scoring power was restricted
to classically parametrized scoring functions.
Results
A diverse
benchmark data set of 99 3D-MMPs associated with 33 diverse
target clusters is assembled. The detailed composition of the data
set is described in the Supporting Information (Table S1). For each target cluster, three 3D-MMPs are selected.
The transformation effect on the binding affinity of the corresponding
ligands is calculated as follows: first, the logarithm (base 10) of
the affinity data is taken. Second, the difference in the logarithmically
transformed data is computed, where the identity of the minuend and
subtrahend is stored. The thus-obtained difference is referred to
as Δaffinity.Various scoring functions are used to score
the data set of 99
3D-MMPs, and most of them are available within the commercially available
software MOE 2014.09[22] (London dG, ASE,
Affinity dG, Alpha HB, GBVI/WSA dG) and GOLD Suite 5.2.2[23−30] (ASP, ChemPLP, ChemScore and GoldScore). To ensure that the results
are not artifacts from the preparation and processing of the 3D-MMPs,
the computations are carried out independently in two different laboratories
at the Goethe University Frankfurt (UF), Germany, and the University
of Technology Braunschweig (UB), Germany. In addition to the nine
aforementioned scoring functions, UF also evaluated the freely available
docking tools AutoDock 4.2.6[31,32] and AutoDock Vina 1.1.2,[33] as well as the independent scoring functions
X-Score[34] and DSX.[35] For some scoring functions, small values are optimal, whereas for
others, large values indicate a high binding affinity. The latter
scoring functions are multiplied by a negative one so that all of
the scoring functions are commensurate. The difference in scores for
a particular pair is referred to as Δscores. To compute this
value, the minuend and subtrahend are the same as those used for computing
Δaffinity. Because of this preprocessing, a positive Δscore
represents an increase in the predicted binding affinity caused by
the transformation, whereas a negative Δscore represents a decrease
in the predicted binding affinity. The consensus for the predictions
(i.e., the pairwise ranking power) was determined as the majority
result of each single ranking over all scoring functions.The
protein–ligand complexes are scored by the different
scoring functions with and without considering the water molecules
in the active site (except for the DSX and X-Score, where water is
not parameterized). The analysis is carried out on different subsets.
First, a subset containing only those transformations with an absolute
transformation effect (i.e., |Δaffinity|) of at least 0.5 log
units (n = 58) is selected to eliminate the impact
of experimental uncertainties.[36] This subset
is referred to as subset 1. Out of subset 1, subsets with absolute
transformation effects of at least 1.0 (n = 39) and
2.0 log units (n = 15) are used to study whether
increasing the activity difference improves the results. These subsets
are referred to as subsets 2 and 3, in which subset 3 is a subset
of subset 2.The ability of the scoring functions to predict
the direction of
a transformation effect (positive or negative) is examined first (Table ). This pairwise test
between the two members of the respective 3D-MMPs basically amounts
to checking the sign of Δscore and Δaffinity. If the affinity
is improved by the molecular transformation, the score of the transformed
molecule should also improve. Because there are only two possible
outcomes (correct vs incorrect prediction), the sign test can be used
to assess the statistical significance. If the number of correct predictions
is larger than σcrit, the scoring function yields
a prediction accuracy that is significantly different from chance
(i.e., 50%). It should be noted that this does not tell anything about
the mechanism that led to the prediction accuracy better than chance.[37] Overall, X-Score reaches the highest prediction
accuracy of 69.0% in subset 1. No scoring function reaches a prediction
accuracy significantly different from chance in subset 3. Significant
results are only obtained when water is not considered. The prediction
accuracy is also computed for the 9 scoring functions and the various
subsets studied at UB (Table , second value). Only Affinity dG, London dG, Alpha HB, and
consensus scoring achieve statistically significant results, with
the best prediction accuracy in subset 1 of 74.1% (Affinity dG). The
prediction accuracy for 3D-MMPs with activity constants smaller than
1 μM and 3D-MMPs after geometry optimization can be found
in the Supporting Information (Tables S2–S4).
Table 1
Prediction Accuracy (in %) of the
13 Scoring Functions Calculated at UF (First Value) and UB (Second
Value)a
3D-MMPs with |Δaffinity| ≥ 0.5
3D-MMPs with |Δaffinity| ≥ 1.0
3D-MMPs with |Δaffinity| ≥ 2.0
scoring functionb
w/ waterc
w/o waterd
w/ waterc
w/o waterd
w/ waterc
w/o waterd
Affinity dG
55.2/56.9
63.8/74.1
56.4/59.0
61.5/71.8
73.3/73.3
66.7/73.3
London dG
56.9/56.9
53.4/60.3
53.8/56.4
56.4/61.5
60.0/73.3
66.7/73.3
Alpha HB
55.2/58.6
51.7/60.3
59.0/64.1
61.5/66.7
73.3/73.3
60.0/73.3
ASE
60.3/60.3
55.2/58.6
56.4/56.4
53.8/56.4
66.7/66.7
46.7/53.3
GBVI/WSA dG
46.6/58.6
48.3/55.2
43.6/61.5
48.7/56.4
53.3/66.7
46.7/60.0
ChemScore
48.3/46.6
56.9/51.7
43.6/53.8
61.5/53.8
60.0/60.0
53.3/53.3
GoldScore
55.2/50.0
62.1/48.3
53.8/56.4
61.5/59.0
53.3/60.0
53.3/60.0
ChemPLP
53.4/50.0
53.4/50.0
53.8/53.8
56.4/53.8
66.7/60.0
60.0/60.0
ASP
55.2/50.0
55.2/53.4
51.3/48.7
53.8/51.3
60.0/60.0
60.0/73.3
AutoDock
48.3/–
50.0/–
53.8/–
51.3/–
66.7/–
60.0/–
AutoDock Vina
37.9/–
46.6/–
38.5/–
43.6/–
46.7/–
53.3/–
DSX
–/–
58.6/–
–/–
61.5/–
–/–
66.7/–
X-Score
–/–
69.0/–
–/–
66.7/–
–/–
73.3/–
Consensus
58.6/53.4
63.8/62.1
59.0/53.9
59.0/59.0
66.7/60.0
60.0/60.0
The ability of
the scoring functions
to predict the direction of a transformation effect (positive or negative)
with and without the consideration of water is shown. Results significantly
different from chance (i.e., 50%) are in bold.
Scoring functions tested for the
prediction accuracy of the transformation effect. The prediction accuracy
of the majority vote over all scoring functions for each 3D-MMP is
listed as “Consensus”.
Scoring under consideration of water.
Scoring without consideration of
water. σcrit: [60.3% 64.1% 73.3%] for n: [58 39 15]; Note that σcrit is the critical value,
whereas n represents the respective subset size;
The calculations of the critical value are provided in the Supporting Information; values are given for
subsets 1–3.
The ability of
the scoring functions
to predict the direction of a transformation effect (positive or negative)
with and without the consideration of water is shown. Results significantly
different from chance (i.e., 50%) are in bold.Scoring functions tested for the
prediction accuracy of the transformation effect. The prediction accuracy
of the majority vote over all scoring functions for each 3D-MMP is
listed as “Consensus”.Scoring under consideration of water.Scoring without consideration of
water. σcrit: [60.3% 64.1% 73.3%] for n: [58 39 15]; Note that σcrit is the critical value,
whereas n represents the respective subset size;
The calculations of the critical value are provided in the Supporting Information; values are given for
subsets 1–3.Furthermore,
the relationship between the prediction accuracy and
the size of the molecules is analyzed (Table ). The number of heavy atoms is used to characterize
the molecular size. 3D-MMPs with (approximately) identical binding
affinity are excluded from the analysis (n = 6).
Previous studies have shown that the additive character of many scoring
functions leads to higher scores as the size of the molecules increase[9] and that simply using the molecular size leads
to better results than using a scoring function.[4] In fact, Δaffinity increases in nearly 72% of the
3D-MMPs when the molecular size increases (not counting molecules
of equal size). Owing to the latter fact, the relationship between
score improvements and size increases could not directly be studied
as it has been done previously because affinity acts as a confounder
in this case. Put differently, size increases and concordant score
increases may actually be caused by an increase in the affinity and
may not be governed just by a molecular size bias of the scoring function.
To control for the confounder, three subgroups were built: (1) 3D-MMPs
where the larger molecule is a more potent binder (n = 51), (2) 3D-MMPs where the smaller molecule binds more potently
(n = 20), and (3) 3D-MMPs of equal molecular size
(n = 22). The prediction accuracy of the direction
of the transformation effect is then analyzed in these three subgroups.
The analysis is restricted to complexes without water, as this yields
better results on an average, and to the entire set and subset 1 (|Δaffinity|
≥ 0.5) as even in subset 1 the subgroups are already rather
small. The results based on the scorings at UB are shown in Table . It can be seen that
most scoring functions perform better in the case where the larger
molecule is a more potent binder, whereas the opposite case is more
difficult to predict correctly. In the case of equally sized molecules,
there is no clear trend, but many scoring functions do not perform
better than chance. For some scoring functions such as London dG and
Affinity dG, there is clear preference for assigning better score
to a larger molecule. They perform very well on the subset where the
larger molecule is more affine, whereas they perform badly in the
opposite case. As opposed to this, GBVI/WSA dG and ChemScore perform
equally good (poor) irrespective of the molecular size.
Table 2
Prediction Accuracy (in %) of the
Scoring Functions for All 3D-MMPs (First Value) and Subset 1 (Second
Value) Calculated at UBa
scoring function
larger mol. is more
affine (n = 51/30)
smaller mol. is more affine (n = 20/14)
equal size (n = 22/14)
Affinity dG
88.2/86.7
35.0/35.7
59.1/85.7
London dG
78.4/80.0
25.0/28.6
63.6/50.0
Alpha HB
76.5/73.3
45.0/50.0
50.0/42.9
ASE
58.8/56.7
35.0/48.9
77.3/78.6
GBVI/WSA dG
58.8/56.7
55.0/50.0
63.6/57.1
ChemScore
54.9/56.7
55.0/57.1
40.9/35.7
GoldScore
49.0/53.3
40.0/35.7
50.0/50.0
ChemPLP
60.8/63.3
40.0/35.7
45.5/35.7
ASP
64.7/63.3
25.0/21.4
59.1/64.3
Consensus
78.4/76.7
35.0/35.7
59.1/57.1
The ability of the scoring functions
to correctly predict the more affine molecule of the pair in different
subgroups is shown. The subgroups were built on the heavy atom count
of the ligands.
The ability of the scoring functions
to correctly predict the more affine molecule of the pair in different
subgroups is shown. The subgroups were built on the heavy atom count
of the ligands.To examine
the validity of the underlying setup, UB processed the
CSAR-NRC benchmark data set with the same protocol as that used for
the 3D-MMP data set and compared the results against the previously
published results.[38,39] The first analysis attempts to
reproduce the results of Smith et al.[38] In the aforementioned study, the correlation between the experimentally
measured binding affinity data for 332 energy-optimized complexes
of the CSAR-NRC data set, excluding the crystal structures of Factor
Xa, and the corresponding scores is determined. ASE and Affinity dG
are studied by Smith et al., as well as in this work. For ASE, the
Pearson correlation is 0.61 (Figure S1A), and for Affinity dG, a value of 0.51 is obtained (Figure S1B). A one-to-one comparison is not possible
because the authors of the previous study used pseudonyms for the
scoring functions. However, the two values obtained in our analysis
can be compared to the distribution of values previously published.
As a minimum requirement, the determined correlation coefficients
for ASE and Affinity dG are larger in magnitude than that of the weakest
method of Smith et al., with a Pearson correlation coefficient of
0.35. The resulting correlation coefficient for Affinity dG is included
in the confidence interval of the scoring functions ranked 12–16
(out of 17; Table 1 of ref (38)), and that for ASE is included in the confidence interval
for scoring functions ranked 3–16. Hence, the results obtained
here are within the distribution of the previously published values,
with the result for ASE being centrally located in that distribution,
whereas that for Affinity dG is in the last third.The second
validation study attempted to reproduce the study of
Corbeil et al.[39] The correlation between
the experimentally determined binding affinity data for the energy-optimized
complexes of the entire CSAR-NRC data set and the corresponding scores
of GBVI/WSA dG is determined. The coefficient of determination R2 of GBVI/WSA dG published by Corbeil et al.
is 0.30, whereas that determined according to the protocol used at
UB is 0.29 (Figure S2). Hence, the employed
protocol used here yields results that are in close agreement with
those of Corbeil et al.The UF laboratory analyzed the transformations
in which many scoring
functions failed to predict the correct effect. Figure shows that the substitution of a heterocycle
led to a significant increase in the binding affinity of the corresponding
ligand. All of the scoring functions predicted the opposite effect.
Apparently, the hydrophobic interaction between fluorine and the side
chains of methionine Met126 and leucineLeu123 led to a higher contribution
to the binding affinity from the scoring functions than the electrostatic
interaction between the electronegative heterocycle and the lysineLys72 side chain. Furthermore, the Lys72 side chain undergoes considerable
conformational change, which might lead to a significant change in
the conformational strain of the protein, which is not considered
by the most scoring functions, as well as the change in the conformational
strain of the ligand caused by the exchange of the 6-carbon to nitrogen.
Figure 1
Heterocycle
substitution. A 3D-MMP representing two orotidine 5′-monophosphate
decarboxylase ligands is shown. The binding affinity increased by
1.7 orders of magnitude upon substitution of the heterocycle. 5-Fluoro-uridine-5′-monophosphate
is shown in orange (PDBcode: 3G1V), and 6-aza uridine-5′-monophosphate is shown
in blue (PDBcode: 1KM3). The surface of the binding pocket is colored by its lipophilicity
(white: neutral, green: lipophilic, magenta: hydrophilic).
Heterocycle
substitution. A 3D-MMP representing two orotidine 5′-monophosphate
decarboxylase ligands is shown. The binding affinity increased by
1.7 orders of magnitude upon substitution of the heterocycle. 5-Fluoro-uridine-5′-monophosphate
is shown in orange (PDBcode: 3G1V), and 6-aza uridine-5′-monophosphate is shown
in blue (PDBcode: 1KM3). The surface of the binding pocket is colored by its lipophilicity
(white: neutral, green: lipophilic, magenta: hydrophilic).In Figure , the
methylation of an aromatic moiety is shown to lead to a loss in binding
affinity of more than 2 orders of magnitude. The side chain of Tyr106
undergoes a shift, which might weaken the aromatic interactions with
the heterocycle. All of the scoring functions predicted a gain in
binding affinity when the water molecules are not considered. Only
two scoring functions (Affinity dG and Alpha HB) correctly predict
the transformation direction when including water in the active site.
Figure 2
Methylation
of an aromatic ring. A 3D-MMP representing two tRNA-guanine
transglycosylase inhibitors is shown. The binding affinity decreased
by 2.5 orders of magnitude upon methylation of the aromatic ring.
The unsubstituted analogue is shown in blue (PDBcode: 1S39), and the methylated
analogue is shown in orange (PDBcode: 1S38). The surface of the binding pocket is
colored by its lipophilicity (white: neutral, green: lipophilic, magenta:
hydrophilic).
Methylation
of an aromatic ring. A 3D-MMP representing two tRNA-guanine
transglycosylase inhibitors is shown. The binding affinity decreased
by 2.5 orders of magnitude upon methylation of the aromatic ring.
The unsubstituted analogue is shown in blue (PDBcode: 1S39), and the methylated
analogue is shown in orange (PDBcode: 1S38). The surface of the binding pocket is
colored by its lipophilicity (white: neutral, green: lipophilic, magenta:
hydrophilic).Figure shows the
substitution of pyrazole on a benzene ring. This substitution leads
to a decrease in the binding affinity of more than 1 order of magnitude.
The smaller heterocycle allows the ligand to move deeper into the
binding site and changes the distance between the pyrazolenitrogen
and the amino group of Lys38, leading to a more favorable hydrogen
bond distance.[40] Only one scoring function
(ASE) correctly predicts the transformation direction when considering
water.
Figure 3
A 3D-MMP representing two checkpoint kinase 1 ligands is shown.
The binding affinity decreased by 1.3 orders of magnitude upon substitution
of a heterocycle to the benzene ring. The heterocycle analogue is
shown in orange (PDBcode: 2XEZ), and the benzene analogue is shown in blue (PDBcode: 2XF0). The surface of
the binding pocket is colored by its lipophilicity (white: neutral,
green: lipophilic, magenta: hydrophilic).
A 3D-MMP representing two checkpoint kinase 1 ligands is shown.
The binding affinity decreased by 1.3 orders of magnitude upon substitution
of a heterocycle to the benzene ring. The heterocycle analogue is
shown in orange (PDBcode: 2XEZ), and the benzene analogue is shown in blue (PDBcode: 2XF0). The surface of
the binding pocket is colored by its lipophilicity (white: neutral,
green: lipophilic, magenta: hydrophilic).
Discussion
A diverse benchmark data set of 99 3D-MMPs was
assembled (Figure ). In this study,
13 commonly used scoring functions were evaluated, and it was shown
that the concordance between the change in affinity and the change
in score (i.e., use of the sign of the score differences as a predictor
for the change in affinity) is rather mediocre. Even in subsets in
which the pairs have large affinity differences, the results were
not generally improved. This is rather unexpected because it should
be easier for a scoring function to differentiate two molecules with
very different affinities. Focusing on the more potent compounds (i.e.,
affinity < 1 μM) again eases the differentiation of the two
molecules and consequently improves the results in many cases (see
the Supporting Information, Table S2).
However, there is no clear pattern with respect to the performance
of the scoring functions. It should be noted that owing to a decreasing
size of the subsets with larger affinity differences, larger values
of the prediction accuracies are more likely just by chance. Hence,
the critical values also increase, which renders the head-to-head
comparison of differently sized subsets difficult.
Figure 4
The profile of the data
set. The affinity of the ligands was widely
distributed from the sub-nanomolar to the two-digit-micromolar level,
with a slight emphasis on the rather potent sub-micromolar ligands.
In 42.4% of all of the cases, Δaffinity ranged from 0.0 to 0.5
log units. In this case, the ligands can be considered as almost equipotent.
However, more than half of the transformations within a 3D-MMP are
larger than 0.5 log units and can be regarded as markedly different.
The profile of the data
set. The affinity of the ligands was widely
distributed from the sub-nanomolar to the two-digit-micromolar level,
with a slight emphasis on the rather potent sub-micromolar ligands.
In 42.4% of all of the cases, Δaffinity ranged from 0.0 to 0.5
log units. In this case, the ligands can be considered as almost equipotent.
However, more than half of the transformations within a 3D-MMP are
larger than 0.5 log units and can be regarded as markedly different.In addition to the prediction
accuracy of Δaffinity based
on Δscore (Table ), the relationship between the molecular size (expressed as the
number of heavy atoms) and the prediction accuracy was studied (Table ). This was done in
subgroups because there is a strong relationship between molecular
size and affinity. In 71.8% of the cases (51 out of 71), the larger
molecule was more active, a trend well known from lead optimization.
The remaining molecules showed either no difference in affinity (n = 6) or were of equal size (n = 22).
It was shown that it is easier to correctly predict the more affine
molecule in the 3D-MMP if it is the larger one and vice versa. Hence,
there is a general preference to assign the larger molecule a better
score irrespective of the actual affinity. However, the strength of
this preference varies largely across different scoring functions
and is absent for some scoring functions. The preference for larger
molecules is strong for London dG, Affinity dG, and the consensus
prediction and slightly less pronounced for Alpha HB (Table ). Exactly these scoring functions
yielded statistically significant results for the prediction accuracy
at UB (Table ). Hence,
it is likely that these prediction accuracies turned significant owing
to the strong preference for larger molecules given that the data
set is composed of mainly 3D-MMPs where the larger molecule is more
affine. Put another way, it is likely that the data set composition
acted as confounder in these cases. This shows that a careful subgroup
analysis is important to identify potential confounders. Unfortunately,
the interesting subgroups where the more affine molecule is smaller
and the subgroup of equally sized molecules is rather small in this
data set. It would be beneficial to enlarge these groups in future
releases of the benchmark set to eliminate this potential confounder.The computations were carried out independently in two different
laboratories to remove artifacts from preprocessing of the complexes
and to strengthen the results. For preprocessing and processing the
complexes, slightly different protocols were used deliberately. A
detailed description of these steps can be found in the Materials and Methods section. Consequently, the obtained
results vary, which is a phenomenon well known in the literature.[39] The prediction accuracy varies up to 15% in
both directions. The median difference is about 5%. Taking a closer
look at the data without considering water (i.e., columns 2, 4, 6,
and 8 of Table ),
some trends can be seen. In the entire set and subset 1, the UFs results
for GoldScore are better (and statistically significant). This effect
could be due to a slight difference in the binding site definition
at UF and UB; in particular, at UB, a slightly extended binding site
was defined. Another striking difference is the better performance
of all of the MOE scoring functions at UB, which can be traced back
to the differently employed force fields (MMFF94x@UF vs Amber10:EHT@UB).The differences in the preparation of these data sets in terms
of protonation and minimization of the protein–ligand complexes
are difficult to control apart from the obvious changes in the settings
(e.g., the force field). Yet, controlling the preprocessing steps
is important, as these processes can lead to completely different
scoring results.[41] Some examples were given
above. In light of the sources of variability, it should be recalled
that the comparison between UB and previous studies by Smith et al.[38] and Corbeil et al.[39] revealed no striking differences, supporting the validity of preprocessing
as it was done here.The main advantage of the 3D-MMP data set
is that it is suited
for the global benchmarking of scoring functions. The relatively small
size and high diversity of the 3D-MMP data set make it a valuable
tool for the evaluation of the strengths and weaknesses of individual
and consensus functions. Cases in which the scoring functions were
not able to predict the correct sign change in the affinity of the
transformation are shown in Figures –3, and these cases exemplify
the current challenges of the available software tools.Figure shows the
substitutions, which seem to be well represented by the terms of most
scoring functions. The introduction of an additional lipophilic group
leading to an increased activity is recognized by all of the scoring
functions. Furthermore, the introduction of additional functional
groups leading to the formation of novel, clearly defined interactions,
such as metal chelation of ionic interactions, is well represented
by all of the scoring functions. In addition, all of the scoring functions
clearly recognize the exchange of a functional group by another one
with more favorable directed interactions. However, the success in
these cases might also arise from the inclusion of the shown examples
into the training set of the scoring functions. These examples show
that some interactions are very well represented by the scoring functions,
whereas others are not. The 3D-MMP data set can provide a valuable
basis for the identification of the strengths and weaknesses of the
scoring functions, which will ultimately lead to improvements in docking
and scoring software.
Figure 5
Four 3D-MMPs where at least 12 out of 13 functions were
able to
predict the correct transformation direction are shown. The binding
affinity increased or decreased at least by 2.0 orders of magnitude
after the shown substitutions. Ligands from PDBcodes 2WB5, 3LKA, 1F8D, and 2ZDA are shown in blue.
Ligands from PDBcodes 2J62, 3LK8, 1F8B, and 3SI4 are shown in orange.
The surfaces of the binding pockets are colored by their lipophilicity
(white: neutral, green: lipophilic, magenta: hydrophilic).
Four 3D-MMPs where at least 12 out of 13 functions were
able to
predict the correct transformation direction are shown. The binding
affinity increased or decreased at least by 2.0 orders of magnitude
after the shown substitutions. Ligands from PDBcodes 2WB5, 3LKA, 1F8D, and 2ZDA are shown in blue.
Ligands from PDBcodes 2J62, 3LK8, 1F8B, and 3SI4 are shown in orange.
The surfaces of the binding pockets are colored by their lipophilicity
(white: neutral, green: lipophilic, magenta: hydrophilic).
Conclusions
In this study, we compiled
a novel data set of 3D-MMPs, which is
suitable for benchmarking and fine-tuning of scoring functions. The
data set comprises 99 3D-MMPs, which are highly diverse in terms of
target proteins, chemical chemotypes of the co-crystallized ligands,
binding affinities, and differences in binding affinities. Although
the size of the data set is rather small, it offers the possibility
to examine cases of erroneous and correct predictions manually. By
this, it provides detailed insights into the capabilities of scoring
functions and other methods for structure-based affinity comparison.
In this study, we demonstrated the applicability of the 3D-MMP data
set to performance evaluation of scoring functions and identified
systematic and individual weaknesses of these. The 3D-MMP data set
can be used for further optimization and evaluation of algorithms
for structure-based computer-aided drug design in the future.
Materials
and Methods
Basis
The PDBbind v2014[14,15] forms the
basis of the diverse benchmark data set of 3D-MMPSs. The PDBbind provides
a broad collection of binding affinity data for protein–ligand
complexes embedded in the Protein Data Bank (PDB).[42] PDBbind v2014 comprises 44 569 complexes formed
between proteins, ligands, and nucleic acids. Furthermore, three subdivided
sets of different size and quality are available (general set, refined
set, and core set).[14,15]
Clustering
All
of the data were processed using the
workflow management tool KNIME (Konstanz Information Miner, KNIME
Analytics Platform 2.10.1, KNIME.com AG, 2014).[43] In the first step, PDB codes from the PDBbind general set
were assigned to new cluster IDs, where one cluster corresponds to
one target protein. This assignment was carried out using UniProtKB
(UniProt Knowledgebase) numbers[44] via mapping
on the UniProt homepage (www.uniprot.org). Due to ambiguity in the assignment of the PDB codes to UniProtKB
numbers (i.e., one PDB code with two different UniProtKB numbers),
the assignment had to be extended. By employing a KNIME workflow with
the SubGraph Extractor node (available in the KNIME
Labs), a connection network between the PDB codes and UniProtKB numbers
was established. The Object Inserter node is used
to define nodes (PDBcodes) and edges (UniProtKB numbers). Row to Network and Node Table nodes are
then used to aggregate the PDBcodes corresponding to the same UniProtKB
number forming a new cluster corresponding to a new cluster ID. To
ensure correct clustering of the new cluster IDs, sequence alignment
(sequence identity > 90%) was carried out. The new cluster IDs
based
on the PDBbind general set were used to assign the PDB codes from
the PDBbind core set with the same newly created cluster IDs. Because
the PDBbind core set is already a diverse set of target proteins,
we used the target proteins from the core set to build up our new
benchmark data set. Therefore, the 68 cluster IDs from the core set
were used to collect all of the corresponding PDB codes from the general
set. This results in 3958 PDB codes belonging to the 68 core set cluster
IDs.
3D-MMP Generation
MMPs were generated using the available Matched Pairs Detector node (provided by Erl Wood Cheminformatics)
in KNIME.[45−47] Sequence alignment was applied (sequence identity
> 90%) to ensure affiliation to the same target protein (cluster
ID)
within one MMP. By calculating the root-mean-square deviation (RMSD)
from the maximum common substructure (MCS), the orientation and location
of the ligands within one MMP were analyzed. An RMSD < 1 Å
of the MCS was necessary to provide comparability of the ligands within
one MMP. In cases where the RMSD was larger than 1 Å, structural
alignment (superposition) was conducted. Therefore, a superposition
of the proteins was carried out, and a rotation matrix for each protein
was obtained. These rotation matrices were used to rotate the ligands
in the same manner. Afterward, the RMSD was calculated again, and
the MMPs with an RMSD < 1 Å were collected. This alignment
of complex coordinates led to 3D-MMPs. Restriction rules for the collected
3D-MMPs were implemented in the next step. The maximum size of a cyclic
substituent was limited to nine nonhydrogen atoms, and a noncyclic
substituent was limited to five nonhydrogen atoms. The common core
in the molecules was restricted to be at least 50% of the size of
the entire molecule. Furthermore, the measured binding affinities
within a 3D-MMP were of the same experimental type (Ki, Kd, or IC50).
3D-MMPs with IC50 values were only accepted if they were
obtained from the same publication. This rule was applied to avoid
inaccuracies because of the high dependence of the assay on the conditions.
To achieve diversity with respect to the target and the binding affinity
effect (Δaffinity), three 3D-MMPs were selected for each target
cluster with the smallest, largest, and mean Δaffinity. After
applying all of the substituent restriction and diversity rules, a
diverse data set of 99 3D-MMPs, corresponding to 33 target clusters,
was obtained.
Quality Assessment of the Data Set
To ensure the high
quality of the data set, all 99 3D-MMPs X-ray structures (178 protein–ligand
complexes) were reviewed. The resolutions of the protein–ligand
X-ray structures were used for an initial global assessment of quality.
A mean resolution of 1.90 Å was achieved in our data set. However,
this value only describes the theoretical limit on the precision of
the model. It does not provide any quality information about the specific
parts of the structure, such as the ligand or binding pocket. The
additional global quality measures are the refinement R-factors (Rwork, Rfree). Rwork is a measure of the
difference between the measured data and the model-predicted data.
Therefore, Rwork values can be used to
estimate the model quality. Before refinement, a random subset of
the data is collected (Rfree reflections),
which is later used for cross-validation to avoid overfitting the
data.[48] These global quality measures (resolution, Rwork and Rfree)
can be found in the Supporting Information (Table S5). For local quality assessment specific to the ligand and
the binding pocket, the electron density of all of the ligands was
examined using the electron density score for multiple atoms (EDIAm). The EIDA values quantify the electron density fit of an
atom by calculating a weighted sum over an oversampled electron density
grid in the proximity of the atom.[49−51] According to Friedrich et al.,[49−51] an EDIAm value above 0.8 indicates a satisfying
fit of a molecular structure on the observed electron density. The
EDIAm values for the data set can be found in the Supporting
Information (Table S6). In cases in which
an EDIAm value of 0.8 was not achieved, the electron density
maps were manually examined (http://www.ebi.ac.uk/pdbe/eds), and polder maps were generated.
A polder map is an OMIT map that excludes the bulk solvent around
an omitted region. In our case, the omitted region is the ligand.
Polder maps are a helpful tool to visualize weak electron densities
around the ligand or other regions of interest. The polder maps were
generated using the software Phenix.[52] A
detailed table including the electron density and polder maps associated
with structures with an EDIAm value smaller than 0.8 can
be found in the Supporting Information (Table S7). Using the global and local quality measures mentioned
above, it can be concluded that our 3D-MMP data set fulfills the conventional
quality criteria for a high-quality benchmark data set.
Data Set
Preparation at Goethe University Frankfurt (UF)
Complex
Preparation
All of the protein–ligand
complexes were downloaded from the PDB. In every case, chain A was
used for all further applications. The protonation of proteins and
ligands was carried out in a KNIME workflow using the available Protonate 3D node (provided by MOE; default settings).[53] Water was taken into consideration to analyze
the influence during the predictive power investigation. Therefore,
all of the predictions were performed twice, with and without the
consideration of water.
Scoring Procedure
Thirteen commonly
used scoring functions
were evaluated in the study conducted at UF. These included five scoring
functions provided in MOE 2014.09[22] (London
dG, ASE, Affinity dG, Alpha HB, and GBVI/WSA dG), four scoring functions
provided in the software package GOLD Suite 5.2.2[23−30] (ASP, ChemPLP, ChemScore, and GoldScore), the freely available docking
tools, AutoDock 4.2.6[31,32] and AutoDock Vina 1.1.2,[33] and the independent scoring functions, X-Score[34] and DSX.[35] Seven
of these scoring functions can be considered as empirical scoring
functions, three as force-field-based, and another three as knowledge-based.
A brief description of these scoring functions can be found in the Supporting Information.The MOE scoring
was realized through the MOE extensions in KNIME, specifically the complex scoring node. Receptors were read in as .mol2, and
ligands as .sdf. The complex scoring node was executed
to yield scores for each complex using the five scoring functions
implemented in MOE.The GOLD scoring was realized through a
KNIME workflow. GOLD docking
or, in our case, re-scoring is based on a configuration file (gold.conf)
that contains all of the necessary information. For each scoring function
and each complex, a gold.conf file was generated. The gold.conf file
is built up of information pointing to the corresponding ligand file
(.mol) and protein file (.pdb), the binding site (protein atoms within
5 Å of the ligand), and the scoring function. It also contains
information for only the re-scoring of a ligand with no advanced ligand
minimization. These gold.conf files were used to run the GOLD re-scoring
for all four scoring functions. The scores were extracted from the
output files (solutions.rescore.log) for each complex and each scoring
function.Re-scoring with AutoDock 4.2.6 was carried out in
a KNIME workflow.
Python scripts for all of the preparation steps were available in
the AutoDock Tools (ADT) provided in the MGLTools package.[54] AutoDock 4.2.6 required the receptor and ligand
file to be written in PDBQT format. The PDBQT format had additional
partial charges and AutoDock atom types to the normal PDB format.
Receptor and ligand files were prepared using the prepare_receptor4.py and prepare_ligand4.py Python scripts. Next, a
grid parameter file was prepared for AutoGrid 4, which precalculates
the grid maps of the interaction energies later used by AutoDock 4.2.6
to determine the total interaction energy for a protein–ligand
complex. The grid parameter files were prepared using the prepare_gpf4.py Python script. The prepared grid parameter
files and receptor PDBQT files were then used to run AutoGrid 4 and
generate grid maps (.glg). To run AutoDock 4.2.6, a docking parameter
file (.dpf) was needed. This file was generated using the prepare_dpf4.py Python script. The obtained docking parameter
files had to be modified to only perform re-scoring with AutoDock
4.2.6. Therefore, we used a Python script to remove all of the lines
responsible for docking in the .dpf file and append the parameter epdb. After adding this parameter, AutoDock 4.2.6 was used
to calculate the energy of the ligand provided in the PDBQT ligand
file. In the final step, AutoDock 4.2.6 was run using the modified
docking parameter file to yield the .dlg result file. As a score,
the estimated free energy of binding was extracted from the generated
.dlg file using a Python script.AutoDock Vina 1.1.2, as the
successor of AutoDock 4.2, used the
same receptor and ligand format (PDBQT). Additionally, only a configuration
text file (config.txt) was needed to run AutoDock Vina 1.1.2. This
configuration file contains the receptor and ligand PDBQT file, coordinates
for the center grid points (coordinates taken from the corresponding
.gpf file generated in the AutoDock 4.2.6 workflow) and the number
of grid points in each direction (x = y = z = 40). After generating the configuration files
for all of the complexes, re-scoring with AutoDock Vina 1.1.2 was
achieved by using the flag “--score_only”. The resulting
scores were extracted from the generated .log files.The independent
scoring function X-Score required a receptor PDB
file and a ligand .mol2 file. Using a Python script, re-scoring with
X-Score was run in KNIME using the flag “-score”. The
resulting .log files were used to extract three single scores (HPScore,
HMScore, and HSScore). The average of these three scores gave the
final X-Score score.The independent scoring function DSX required
a receptor PDB file
and a ligand .mol2 file. DSX was run using a simple Java script in
KNIME. The resulting text files were used to extract the final DSX
scores.
Data Set Preparation at TU Braunschweig (UB)
The complexes with and without
considering water and their energy-optimized forms were processed
as follows. The complexes were prepared with MOE 2013.08.[55] Amber10:EHT was chosen as the force field, using
the reaction field electrostatics as the solvent model. Each protein
was loaded into MOE, and the structural discrepancies detected by
the structure preparation tool of MOE were fixed. The default settings
for protonation were adjusted to physiological conditions (i.e., T [K] = 310.15, pH = 7.4, salt-conc. [mol/L] = 0.9). Subsequently,
partial charges based on the selected force field and hydrogens were
added. These prepared proteins were saved as .pdb. The ligands were
prepared in the same manner and saved as .mol2 and into a Molecular
Database file (.mdb). Energy-optimized complexes were obtained by
loading the prepared proteins and their corresponding ligands into
MOE, where an active site limited energy minimization was conducted.
Upon convergence, proteins and ligands were saved separately, as described
previously.At UB, all of
the scoring functions
of MOE (London dG, ASE, Affinity dG, Alpha HB, and GBVI/WSA dG) and
GOLD Suite 5.2.2[23−30] (ASP, ChemPLP, ChemScore, and GoldScore) were evaluated. The MOE
scoring was carried out using KNIME version 2.11.2[56] with the MOE extensions for KNIME (knimoe 2.2.0). Proteins were read in as .pdb, and ligands as .mdb. Once
the ligands were assigned to their corresponding proteins, the complex
scoring node was executed to yield the scores of the five MOE scoring
functions for each complex. The force field of the complex scoring
node in knimoe was changed to Amber10:EHT before scoring.[57]GOLD scoring was realized via command
line. Therefore, four gold.conf files (one per GOLD scoring function)
for each prepared protein–ligand pair were generated. In addition
to containing the respective scoring function, each gold.conf file
pointed to the corresponding protein and ligand. Furthermore, they
contained default settings for re-scoring, with the following changes
from the default settings. The binding site was defined as all atoms
within 10 Å of the ligand, and the re-scoring options “perform
local optimization”, “retrieve rotated protein atom
positions (if available)”, and “replace score tags in
file” were disabled.Moreover, a subset of 3D-MMPs was
composed of only more potent
compounds in which their complexes had binding affinities of less
than 1 μM (n = 54 3D-MMPs, subset 4). This
subset was divided again into 3D-MMPs with log10 differences
in the affinity equal to or greater than 0.5 (n =
25 3D-MMPs, subset 5) and 1.0 (n = 13 3D-MMPs, subset
6) to result in six subsets.
CSAR-NRC Data Set
The CSAR-NRC data set[38] was downloaded
(CSAR-NRC HiQ from www.csardock.org) and processed
in the same manner as the complexes in the 3D-MMP data set. The CSAR-NRC
data set contains “343 high-quality, protein–ligand
crystal structures” and was used by Smith et al.[38] for a benchmark exercise in 2010. This data
set was also made available in an energy-minimized form. Participants
scored both sets of crystal structures of the CSAR-NRC data set with
varying parameters. The results were published using pseudonyms so
that only the distribution of the figures of merit could be analyzed.Each crystal structure in the CSAR-NRC data set was loaded into
MOE separately. The ligands were extracted from the complexes and
saved into a Molecular Database file (.mdb). Proteins and ligands
were prepared and saved in the same way as the proteins and ligands
in the 3D-MMP data set. Subsequently, the prepared proteins and their
corresponding ligands underwent energy minimization based on the Amber10:EHT
force field and were saved as .pdb (protein) and .mdb (ligands). Scoring
of the CSAR-NRC data set was conducted in an identical manner to the
3D-MMP data set, as described in the previous section. Because the
correlation measures between the scores and affinities published by
Smith et al. were restricted to the 332 complexes in the entire CSAR-NRC
data set by excluding the crystal structures of Factor Xa, only that
subset was used here. From the scoring functions tested at UB, ASE,
Affinity dG, and ChemScore were part of the core methods of the benchmark
exercise in 2010, which included a total of 17 scoring functions.
Within this benchmark exercise, these three scoring functions were
tested on the minimized complexes because they better correlated with
the experimental data. The use of pseudonyms made a one-to-one comparison
impossible. Nevertheless, it was checked whether the correlations
of ASE and Affinity dG for the 332 minimized complexes at least outperformed
the weakest method of Smith et al. by considering the Pearson correlation
coefficients to ensure that the observed results are of sufficient
technical quality (i.e., are not artifacts of the employed protocol
for preparing and processing the data). Furthermore, Corbeil et al.[39] used the CSAR-NRC data set as a test set to
validate the GBVI/WSA dG scoring function. Before applying the GBVI/WSA
dG scoring function on the CSAR-NRC data set, Corbeil et al. minimized
its complexes based on the MMFF94x force field with reaction field
electrostatics. Corbeil et al. used the entire CSAR-NRC data set,
including the Factor Xa protein–ligand complexes. For comparison,
we also considered the entire energy-minimized CSAR-NRC data set and
determined the R2 accordingly.
Scoring Function Analysis at UF and UB
For a detailed
analysis, the 3D-MMPs were divided into different subsets. The subsets
comprised 3D-MMPs with log10 differences in the affinity
equal to or greater than 0.5 (n = 58 3D-MMPs, subset
1), 1.0 (n = 39 3D-MMPs, subset 2), and 2.0 (n = 15 3D-MMPs, subset 3). The scoring results were extracted
and analyzed in two different ways. The predicted transformation effect
was calculated by subtracting the single scores for each 3D-MMP (Δscore).
The ability of the scoring functions to predict the direction of a
transformation effect (positive or negative) with and without the
consideration of water (prediction accuracy) was analyzed first (Table ). This was done for
all of the subsets. Furthermore, it was investigated as to what extent
the number of heavy atoms (as a surrogate of molecular size) affected
the prediction accuracy (Table ). 3D-MMPs with the same affinity were removed from this analysis
(n = 6), which led to a reduced number of 3D-MMPs
in the entire data set (n = 93) and in subset 1 (n = 58). Finally, the consensus was determined as the majority
vote of all of the scoring functions.