Literature DB >> 23548044

CSAR benchmark exercise 2011-2012: evaluation of results from docking and relative ranking of blinded congeneric series.

Kelly L Damm-Ganamet¹, Richard D Smith, James B Dunbar, Jeanne A Stuckey, Heather A Carlson.

Abstract

The Community Structure-Activity Resource (CSAR) recently held its first blinded exercise based on data provided by Abbott, Vertex, and colleagues at the University of Michigan, Ann Arbor. A total of 20 research groups submitted results for the benchmark exercise where the goal was to compare different improvements for pose prediction, enrichment, and relative ranking of congeneric series of compounds. The exercise was built around blinded high-quality experimental data from four protein targets: LpxC, Urokinase, Chk1, and Erk2. Pose prediction proved to be the most straightforward task, and most methods were able to successfully reproduce binding poses when the crystal structure employed was co-crystallized with a ligand from the same chemical series. Multiple evaluation metrics were examined, and we found that RMSD and native contact metrics together provide a robust evaluation of the predicted poses. It was notable that most scoring functions underpredicted contacts between the hetero atoms (i.e., N, O, S, etc.) of the protein and ligand. Relative ranking was found to be the most difficult area for the methods, but many of the scoring functions were able to properly identify Urokinase actives from the inactives in the series. Lastly, we found that minimizing the protein and correcting histidine tautomeric states positively trended with low RMSD for pose prediction but minimizing the ligand negatively trended. Pregenerated ligand conformations performed better than those that were generated on the fly. Optimizing docking parameters and pretraining with the native ligand had a positive effect on the docking performance as did using restraints, substructure fitting, and shape fitting. Lastly, for both sampling and ranking scoring functions, the use of the empirical scoring function appeared to trend positively with the RMSD. Here, by combining the results of many methods, we hope to provide a statistically relevant evaluation and elucidate specific shortcomings of docking methodology for the community.

Entities: Chemical Disease Gene Species

Mesh：

Substances：
Ligands
Proteins

Year: 2013 PMID： 23548044 PMCID： PMC3753884 DOI： 10.1021/ci400025f

Source DB: PubMed Journal: J Chem Inf Model ISSN： 1549-9596 Impact factor: 4.956

Introduction

Structure-based drug design (SBDD) is a valuable technology that is seeing increased utilization to advance the process of drug discovery research.[1−6] A typical docking protocol is comprised of two components: the search algorithm and scoring function. An exhaustive search algorithm would account for all possible binding poses by allowing both the protein and ligand to be fully flexible; however, although ligand flexibility can be accurately reproduced, replicating the innumerable degrees of freedom of a protein is impractical due to the enormity of the conformational space that must be searched. Developing methods that incorporate protein flexibility in a computationally tractable manner has been recognized as a means to improve SBDD techniques.[2,7−13] The scoring function is used to evaluate and rank each pose by predicting the binding affinity between the ligand and protein. Many simplifications and assumptions are made to the scoring function to increase its speed, such as neglecting entropy and solvation, but these result in a loss of accuracy. Scoring function development is also an active area of SBDD research.[14−18]

Benchmark Docking Exercises

In order to facilitate the development of docking software, the Community Structure–Activity Resource (CSAR) center was funded by the National Institutes of Health (NIH) in 2008 to increase the amount of high quality experimental data publicly available for development, validation, and benchmarking of docking methodologies. CSAR conducted its first benchmark exercise in 2010 with the goal of (1) evaluating the current ability of the field to predict the free energy of binding for protein–ligand complexes and (2) investigating the properties of the complexes and methods that appear to hinder scoring.[19,20] This exercise illuminated that scoring functions are still not able to successfully predict binding affinity and, hence, are not capable of correctly rank ordering ligands.[20] Additionally, the size of the ligand did not appear to affect the scoring quality, but hydrogen bonding and torsional strain were found to be significantly different between well-scored and poorly scored complexes. Detailed results from most participants can be found in a special issue of the Journal of Chemical Information and Modeling [J. Chem. Inf. Model.2011, 51, (9), 2025]. Previously in 2007, Nicholls and Jain organized a Docking and Scoring Challenge with an emphasis on developing standards for evaluation of methods, data set preparation, and data set sharing.[21] A second Challenge was conducted in 2011, where it was found that GLIDE[22,23] and Surflex-Dock[24,25] outperformed other methods tested in both pose prediction and virtual screening (enrichment).[26] A prevalent theme that emerged from the various participants was that optimizing the protein structures prior to docking improved performance.[27−29] Special issues of the Journal of Computer-Aided Molecular Design were dedicated to both the 2007 competiton [J. Comput.-Aided Mol. Des.2008, 22, (3–4), 131] and 2011 competition [J. Comput.-Aided Mol. Des.2012, 26, (6), 675]; detailed evaluations from participating groups can be found there. Moreover, OpenEye periodically runs the SAMPL Experiment to assess additional aspects of computational modeling relevant to SBDD such as prediction of vacuum–water transfer energies, binding affinities of aqueous host–guest systems, prediction of solvation energies, tautomer ratios, etc.[30−32] Various groups have conducted independent evaluations of docking programs and have found that many search routines are capable of predicting the native binding pose of the ligand within a RMSD of 2 Å for a range of protein targets.[6,33] Furthermore, while not able to predict binding affinity well, current methods have proven to be successful at enriching hit rates (i.e., identifying active molecules from decoys).[6,33,34] However, consistently ranking inhibitors with nM-level affinity over those with μM-level affinity has proven to be a challenge, as is identifying “activity cliffs” where small changes result in significant increases or decreases in affinity. Most methods appear to do well at either pose prediction or enrichment; only a few are capable of successfully performing both.[33] A further caveat is that expert knowledge is necessary as small modifications to the software’s parameters can have large effects on the docking results.[33,35] A review by Cole et al. discusses how assessing and comparing the performance of docking software is a difficult task and can be misleading as one is not always comparing apples to apples.[35] An additional study found that the quality of the crystal structures in publicly available data sets can affect docking results; poor resolution structures led to unsuccessful docking and vice versa.[36] To that end, it has become increasingly clear that one major limitation to the field was a large, standardized, high quality data set of experimentally determined protein–ligand complexes.[27,37−39] Furthermore, computational chemists need reliable experimental data with the complexes such as binding affinity, solubility, pKa, and logP/logD of the ligand. The CSAR center was created to fulfill this need, and details of our high quality data sets can be found in our data set paper in the same CSAR special issue of the Journal of Chemical Information and Modeling, along with a review of other publically available data sets.[40]

Assessment

In addition to high quality data, proper evaluation protocols are imperative to assess the performance of the docking methodology.[20,21,41] Pose prediction of the native or cognate ligand is a common approach for evaluating the search algorithm. Recently, this practice has been called into question; however, we must take a step back and recognize that while this task may not be performed regularly for drug discovery purposes, it is an essential positive control in the research lab. Cross-docking exercises are more relevant as they are the actual application of the docking software and should be conducted once it is confirmed that the method is capable of reproducing native binding poses. A variety of measures exist for evaluating pose predictions: RMSD (root-mean-square deviation), DPI (data precision index)/RDE (relative displacement error),[41−43] number of native contacts predicted,[44−46] RSR (real space R) and RSCC (real space correlation coefficient),[47] and coordinate error.[39] RMSD is the standard for evaluating poses, but it can be misleading. Crystal structures are simply a static snapshot of the protein–ligand complex, but more importantly, the coordinates are only a model of the true experimental data. Furthermore, RMSD can be biased by the frame of reference; for example, binding-site residues and a ligand can shift just 1 Å in flexible docking and result in an artificially large RMSD despite maintaining all relevant contacts. Lastly, a random placement of a small ligand can have low RMSDs while symmetric molecules that are not handled properly can produce artificially high RMSD values.[45] Native contacts appear to be a robust measurement that can capture the complex interactions and, used in combination with the RMSD, provide a thorough evaluation of the predicted poses. Although noted in the literature that a drawback of native contacts is its inability to be automated,[45] we have created an automated tool in python to calculate both the percentage of native contacts correct between the protein and predicted ligand pose and a raw count of contacts (all contacts made between the protein and predicted ligand pose). To assess the performance of the scoring function in a virtual screening-type application, enrichment and relative-ranking studies are commonly employed. The area under the curve (AUC) of receiver operator characteristic (ROC) curves[48,49] based on the rank score are typically reported for enrichment; this metric is able to assess how well a ranking function identifies known inhibitors as high ranking and discriminates them from inactive ligands. Standard correlation measures are used to evaluate the ability of scoring functions to rank-order active compounds, and sound statistical methods are necessary to identify true trends in the data.[20,35,39,41] Pearson’s correlation (r) is typically employed to provide a linear relationship, whereas Spearman’s rho (ρ) and Kendall’s tau (τ) measures the strength of the nonparametric relationship between the ranks of the data. Hence, r is a better measure for assessing absolute predictions, while ρ and τ are more appropriate metrics for relative ranking. Here, we present an evaluation of the results from the CSAR center’s first blinded benchmark exercise. In order to avoid a winner-vs-loser mentality, participants were asked to submit two sets of results and focus on testing a hypothesis of their choice, rather than comparing their results to others. The exercise concluded with a symposium at the Fall 2012 National Meeting of the American Chemical Society (ACS) in Philadelphia, with eight speakers and an open discussion session.

Contributors

Most of the pose predictions and ranking values evaluated were calculated by authors featured in the same CSAR special issue of the Journal of Chemical Information and Modeling, some by the CSAR team, and a few by participants who spoke at the ACS symposium but were unable to submit papers to the special issue due to various time constraints. A variety of methods/codes were utilized in the exercise and include Gold, MOE, AutoDock, AutoDock Vina, MedusaDock, RosettaLigand, Schrödinger Induced Fit Dock, Q-Mol, OEDocking, CDOCKER, ICM-VLS, BlurDock, Glide, MDock, Sol, and WILMA. Most are custom versions or in-house software and were expert-guided docking protocols. To provide anonymity to each group, they are denoted below as A–U, and each method submitted as 1–6. If a group submitted results using more than one docking program, they were separated into multiple groups (i.e., A, B, and C). We have done this to avoid a win–lose mentality as this benchmark exercise is not meant to be a contest but rather a means to elucidate important and common deficiencies across the methods in predicting and scoring binding poses. Additionally, a breakdown of the various sampling and ranking scoring functions and docking program used by each group is provided in Table 1. The docking programs are denoted a–r to once again provide anonymity, but this allows the reader to determine which groups used the same programs. Our hope is that this study will help direct the computational community to where the most significant effort is needed for future methodology development.

Table 1

Breakdown of Employed Sampling and Ranking Scoring Function and Docking Software for Each Group

group	sampling scoring function	ranking scoring function	docking softwarea
A	force field-based	force field-based	a
B	force field-based	force field-based	b
C	knowledge- and force field-based combined	knowledg- and force field-based combined	c
D	knowledge-based	knowledge-based	d
E	knowledge-based	knowledge-based	e
F	empirical-based	empirical-based	f
G	empirical-based	empirical-based	g
H	force field-based	force field-based	h
I	empirical-based	empirical-based	f
J	empirical-based	empirical-based	i
K	empirical-based	force field-based	j
L	empirical-based	empirical-based	k
M	force field-based	knowledge-based	l
N	knowledge- and empirical-based combined	knowledge- and empirical-based combined	m
O	force field-based	empirical-based	n
P	crude shape complementarity	knowledge-based	o
Q	knowledge-based	knowledge- and empirical-based combined	p
R	force field- and empirical-based combined	force field- and empirical-based combined	l
S	empirical-based	empirical-based	n
T	force field-based and shape/functionality-based complementarity	force field-based and shape/functionality-based complementarity	q
U	force field-based	force field-based	r

Docking codes used by more than one group are shown in bold.

Methods

Data Set and Participation

The goal of the 2011–2012 blinded exercise was to compare different improvements for docking and relative ranking of congeneric series of compounds, testing three areas: (1) pose prediction, (2) enrichment/discriminating active from inactives, and (3) relative ranking. The exercise was built around blinded, high-quality, experimental data from four protein targets: LpxC (University of Michigan data), Urokinase (Abbott data), Chk1 (Abbott data), and Erk2 (Vertex data). Participants were provided with a set of SMILES strings of the active and inactive ligands, the pH of the assay used to determine the binding data, and a PDB code to use for docking. Cross-docking studies, in addition to an analysis of the active site, were conducted in-house to determine the most appropriate PDB structure for use with each target (3P3E[50] for LpxC, 1OWE(51) for Urokinase, 2E9N(52) for Chk1, and 3I5Z(53) for Erk2, as shown in Table 2). Participants were asked to submit two sets of results and test a hypothesis of their choice. Twenty groups worldwide participated in the exercise where 17 sent pose predictions (the majority sent in the top three poses using multiple methods, resulting in 3250 total poses) and 18 sent in rankings (the majority sent in one set of rankings using multiple methods, resulting in 174 total rankings).

Table 2

Summary of Data Sets Employed in Benchmark Exercise and Predictions Received

protein target	data source	PDB structure employed	# of structures	pose predictions received	# of active ligands	# of inactive ligands	ranking predictions received
LpxC	University of Michigan	3P3E	4	458	3	8	39
Urokinase	Abbott	1OWE	4	390	16	4	45
Chk1	Abbott	2E9N	14	1279	38	9	47
Erk2	Vertex	3I5Z	12	1123	39	–	43

Obviously, participants were not told which ligands were active or inactive (inactive ligands will not have corresponding structures). As such, participating groups submitted poses of all ligands, but only those with corresponding high-quality CSAR structures were used in our pose prediction analysis. Active molecules that do not have corresponding crystal structures were also included in the enrichment/relative ranking portion of the exercise. A summary of the number of CSAR protein–ligand complexes and active/inactive molecules employed for each target is provided in Table 2 along with the number of predictions received for pose prediction and enrichment/relative ranking broken down by protein. A complete description of how the four data sets were curated and prepared for this exercise can be found in ref (40). An online questionnaire was sent to all participating groups to gather additional data on the details of their methodology. We had a 100% response rate and gathered the following information: details on ligand setup, protein setup, and the molecular docking protocol, in addition to thoughts on the analysis conducted by the CSAR team. Having the methodology details used by each group allowed us to identify how particular aspects of docking programs across multiple groups’ results affected the pose/ranking predictions.

Pose Prediction

RMSD and native contacts were used to evaluate the predicted poses. All poses, the best pose, and the top-scoring pose were evaluated for the predictions. We superimposed the submitted protein–ligand complexes using the wRMSD method[54] to the unpublished CSAR crystal structure to provide the same frame of reference. Groups were asked to dock into a protein conformation found in the PDB, but the poses were compared back to the crystal structure solved by the CSAR group. If only the ligand pose was sent, we first placed it in the context of the suggested PDB structure for the exercise (i.e., 3P3E for LpxC) and then performed the superposition. Once the complex was superimposed, we calculated the RMSD between the predicted ligand pose and the crystallographic coordinates. Only heavy atoms were used, and symmetry was accounted for in the calculation. The script used was graciously donated by the Chemical Computing Group and run in MOE 2010.11.[55] A “correct” docking pose was defined as having a RMSD of less than 2 Å.[56,57] The RMSD script was utilized to pull out the atom correspondence for each ligand pose. Additionally, a python script was written in-house to analyze for hetero–hetero, carbon–carbon, and packing contacts between the protein and predicted ligand pose. For the packing contact, atom types are ignored and all contacts are counted. Waters were not included in the analysis, but in the LpxC test case, the catalytic zinc was included as part of the protein. If a zinc atom was not present in any submission, the location of the 3P3E catalytic zinc was used. The cutoff values used for the interactions are as follows: O–O, O–N, O–F, N–N, and N–F ≤ 3.5 Å (approximate hydrogen bonding and electrostatic interactions) S-x, Br-x, Cl-x ≤ 3.8 Å, where x is O or N (approximate longer and weaker electrostatic interactions) Zn-x ≤ 2.8 Å, where x is O or N (ion coordination) C–C ≤ 4.0 Å (capture tight and loose van der Waals interactions) Packing ≤ 4.0 Å (capture all interactions) Two types of analysis were conducted with the contacts: (1) the number of native contacts correctly predicted (i.e., same contacts as those in the co-crystal structure) and (2) a raw count of contacts (i.e., all contacts made between the protein and ligand). The number of native contacts correctly predicted can be thought of as an assessment metric: Is the predicted pose making the same contacts to the protein as in the native co-crystal structure? Is this pose right or wrong? The raw count is not an assessment of the pose per se but provides information on the actual contacts being made between the predicted pose and protein. Here, we are probing the reason why the pose is different to elucidate the cause of the problem; for example, which types of contacts are being sacrificed or overpredicted by the various scoring functions. For the number of native contacts correctly predicted, only those contacts made between the predicted ligand pose and protein structure that are also present in the CSAR co-crystal structure (i.e., the native contacts) are summed and then broken down by hetero–hetero contacts (%Het–Het) and carbon–carbon contacts (%C–C) or as a total count (%Total). The contact script accounts for the symmetry of the ligand and histidine tautomers and residue equivalence. For example, a contact to either aspartic acid acidic oxygen would count equally. On the same note, we count all carbons in a residue equally, and hence, a contact to any carbon would count. For the raw count of pose contacts, all of the hetero–hetero contacts made between the predicted ligand pose and protein structure are summed up (Het–Het), then carbon–carbon contacts (C–C), and finally packing contacts. We then determined whether the pose overpredicted, underpredicted, or had the same number of contacts in the co-crystal structure within 10%. To obtain a 95% confidence interval, 10,000 random bootstrap samples of the raw count contact data were taken with replacement. From each sample, the 95% confidence interval was determined by the 2.5 percentile and 97.5 percentile of the distribution of overpredicted, underpredicted, and same contacts from the bootstrap samples. Furthermore, we have attempted to use RSRs and RSCCs for the assessment of the predicted ligand poses. RSR and RSCC provide a fit of the predicted ligand pose to the electron density and, as such, are an evaluation based on the raw experimental data. A crystal structure is a model. Hence, comparing back to the experiment data will remove a layer of bias from the evaluation—the error of the model is not propagated into the analysis. Unfortunately, neither the RSR or RSCC values were reproducible between the different versions of CCP4[58] used (4.1.2 and 4.2.0). We were also unable to reproduce the values reported by the Uppsala Electron Density Server (EDS)[59] for the original crystal structure. Because of this inconsistency, we feel it is difficult to trust these values, and as such, they are not used in our pose prediction analysis.

Ranking Evaluation and Statistical Analysis

We have evaluated the ability of scoring functions to properly identify the inactives in the series and their ability to rank-order the active compounds. ROC plots were generated to determine the AUC;[49] the greater the AUC, the better the ability of the method to identify active over inactive compounds. To obtain 95% confidence intervals around the AUC, bootstrap sampling was performed by randomly selecting samples with replacement 10,000 times. The size of each sample was the same as the size of the set used to generate the ROC plot. The AUC was calculated for each sample and the 2.5 percentile and 97.5 percentile of the resulting distribution of 10,000 AUC values were computed to give the 95% confidence interval.[25,60] Software was kindly provided by Ajay Jain to compute the confidence intervals. Pearson’s (r) parametric correlation coefficient and Spearman (ρ) and Kendall (τ) nonparametric correlation coefficients were calculated to determine the correlation between the predicted and known affinities. The software JMP[61] was used to calculate all statistics, unless otherwise noted. Fisher transformations combined with standard deviations were used to determine 95% confidence intervals around the Pearson correlation and Spearman correlation.[62] For the Kendall statistic, the Fisher transformation cannot be used; therefore, the approximation of 1.96 × (1 – τ2)((2(2n + 5))/(9n(n – 1)))1/2 was used to determine the 95% confidence interval.[63] In our previous evaluation paper, we discussed the use of heavy atoms and SlogP as a “yardstick” to determine a baseline or null correlation.[20] Here, we have calculated the molecular weight of the ligand and SlogP using MOE 2010.11[55] as null control cases and identified groups that are statistically significant from these values. In order to compare the R2 values across individual groups to the yardsticks, the variance in the residuals from the linear regression were compared using Levene’s F-test using R.[64] A probability of a F-statistic less than 0.05 indicates that the error between the two fits is statistically different.

Results and Discussion

How well did methods perform overall on predicting poses?

Figure 1 shows a RMSD box plot of the best pose for each protein–ligand complex broken up by group–method, each method from each group. RMSD box plots provide the distribution of the RMSD values and the associated statistics (mean, median, 95% confidence interval around the mean, interquartile range, outliers, etc.). For all protein targets combined, the median RMSD across all group–methods was 3.0 Å. Additionally, 37% of the group–methods have a median RMSD of less than 2.0 Å. The results were also broken down by protein target; the RMSD box plots are provided in the Supporting Information. The median RMSD was 1.14 Å for LpxC, 1.25 Å for Urokinase, 3.50 Å for Chk1, and 5.03 Å for Erk2. Table 3 provides a breakdown by protein on the percent of predictions less than 2.0 Å for each group–method. Only three groups (<10%) were able to predict well across all protein systems (of these groups, two used empirical-based scoring functions for sampling and ranking and one used force field-based). This is not surprising as it is known that most scoring functions are not robust enough to perform well across binding sites of various sizes, accessibility, and chemical properties.

Figure 1

Table 3

% Predictions <2 Å for Each Group–Method by Protein Target (Best Pose)a

	LpxC % predictions <2 Å	Urokinase % predictions <2 Å	Chk1 % predictions <2 Å	Erk2 % predictions <2 Å
group A-1	0.0	0.0	0.0	0.0
group B-1	100.0	100.0	71.4	50.0
group D-1	100.0	100.0	71.4	25.0
group D-2	100.0	100.0	71.4	25.0
group E-1	N/A	N/A	0.0	100.0
group E-2	N/A	100.0	33.3	100.0
group G-1	100.0	50.0	42.9	41.7
group G-2	75.0	50.0	28.6	41.7
group H-1	75.0	50.0	28.6	50.0
group H-2	50.0	50.0	35.7	50.0
group I-1	100.0	0.0	28.6	0.0
group I-2	100.0	100.0	35.7	8.3
group I-3	100.0	75.0	28.6	0.0
group J-1	100.0	75.0	42.9	25.0
group J-2	100.0	75.0	42.9	25.0
group K-1	100.0	100.0	57.1	25.0
group K-2	N/A	100.0	78.6	41.7
group L-1	100.0	100.0	50.0	33.3
group L-2	100.0	100.0	28.6	41.7
group L-3	100.0	100.0	50.0	50.0
group M-1	100.0	75.0	28.6	8.3
group M-2	100.0	50.0	21.4	25.0
group M-3	100.0	100.0	14.3	0.0
group M-4	100.0	50.0	28.6	16.7
group N-1	0.0	100.0	23.1	0.0
group O-1	100.0	N/A	50.0	33.3
group P-1	100.0	100.0	28.6	0.0
group P-2	100.0	75.0	30.8	16.7
group P-3	75.0	100.0	21.4	9.1
group P-4	75.0	75.0	21.4	9.1
group Q-1	100.0	100.0	14.3	25.0
group R-1	0.0	25.0	21.4	0.0
group R-2	25.0	50.0	21.4	8.3
group S-1	100.0	75.0	7.1	16.7
group S-2	100.0	100.0	57.1	50.0
group T-1	25.0	25.0	21.4	0.0
group T-2	75.0	25.0	0.0	0.0
group U-1	50.0	66.7	100.0	0.0

Group–methods able to predict greater than or equal to 50% are bolded.

RMSD box plot of the best pose for each protein–ligand complex broken down by group–method. The rectangular box indicates the interquartile range (25–75%), and the bars the 1.5× interquartile range. The median is shown by the line in the box, and the diamond denotes the mean and 95% confidence interval around the mean. The red bracket signifies the shortest interval that contains 50% of the data, and outliers are indicated by squares above the bars. Group–method, which submitted scores for all ligands of LpxC, Urokinase, Chk1, and Erk2, are bolded. Group–methods able to predict greater than or equal to 50% are bolded. The best pose is presented because it attempts to remove the bias of the scoring function on the results and asks if the method was able to find the correct pose in the top three predicted poses submitted by each group. We found that the best pose was also the top scoring pose 50.6% of the time for all protein targets combined, 58.6% for just LpxC, 46.7% for just Urokinase, 50.3% for just Chk1, and 48.9% for just Erk2. Essentially, scoring functions are predicting the best pose as the top pose better than random but still not at the rate that is necessary for drug discovery purposes. However, it is important to note that in most cases for this analysis the trends are essentially the same whether all poses, best poses, or top poses are utilized.

Which test sets were most challenging for predicting poses?

Figure 2 shows a RMSD box plot of the best pose for each protein–ligand complex broken down by protein. LpxC and Urokinase had the smallest median RMSD (1.14 Å and 1.25 Å, respectfully) with Chk1 a bit higher (3.50 Å) and then Erk2 (5.03 Å). As demonstrated in Table 3, the majority of groups were able to predict 75–100% of LpxC and Urokinase poses with a RMSD of less than 2.0 Å. Table 4 provides the distribution of RMSD for all, best, and top poses. Various benchmark studies have been conducted using the same test cases as discussed above.[6,25,26] However, a direct comparison cannot be made between our analysis and the published studies as different data sets of ligands were used. This also illustrates the need for standardized data sets such as those developed by CSAR; if groups were consistent with the benchmark data sets employed when evaluating their methodology developments, then the field would be able to assess whether positive improvements have actually been made.

Figure 2

Table 4

Distribution of Pose RMSD Values by Proteina

		<1 Å (%)	1–2 Å (%)	2–3 Å (%)	3–4 Å (%)	4–5 Å (%)	>5 Å (%)	median RMSD
LpxC
	all poses (n = 458)	22.05	41.27	15.50	4.15	1.31	15.72
	best poses (n = 174)	34.48	47.70	9.20	0.57	0.57	7.47	1.14
	top poses (n = 174)	24.14	49.43	14.37	2.30	1.72	8.05
Urokinase
	all poses (n = 390)	22.31	29.74	22.56	5.38	4.87	15.13
	best poses (n = 137)	35.04	38.69	13.87	3.65	2.92	5.84	1.25
	top poses (n = 137)	24.09	33.58	22.63	5.84	3.65	10.22
Chk1: all 3 ligand series combined
	all poses (n = 1279)	9.38	12.90	5.63	12.90	9.54	49.65
	best poses (n = 477)	16.35	18.24	7.76	17.19	9.01	31.45	3.50
	top poses (n = 477)	12.58	15.09	5.24	13.00	9.43	44.65
Chk1: series 1
	all poses (n = 364)	27.47	22.80	9.89	5.77	3.30	30.77
	best poses (n = 141)	44.68	26.95	8.51	4.96	0.71	14.18	1.08
	top poses (n = 141)	36.88	24.82	8.51	4.26	2.84	22.70
Chk1: series 2
	all poses (n = 442)	1.13	5.43	4.75	18.78	14.93	54.98
	best poses (n = 166)	3.01	10.84	8.43	24.10	15.06	38.55	4.20
	top poses (n = 166)	1.20	4.22	3.61	18.07	15.06	57.83
Chk1: series 3
	all poses (n = 473)	3.17	12.26	3.17	12.90	9.30	59.20
	best poses (n = 170)	5.88	18.24	6.47	20.59	10.00	38.82	3.94
	top poses (n = 170)	3.53	17.65	4.12	15.29	9.41	50.00
Erk2: all 3 ligand series combined
	all poses (n = 1123)	4.54%	9.35	5.43	5.43	8.90	66.34
	best poses (n = 411)	8.76	12.90	7.79	8.27	11.44	50.85	5.03
	top poses (n = 411)	6.33	9.98	5.84	7.06	8.27	62.53
Erk2: series 1
	all poses (n = 186)	8.60	30.65	6.45	9.68	11.83	32.80
	best poses (n = 71)	14.08	40.85	7.04	12.68	9.86	15.49	1.67
	top poses (n = 71)	8.45	43.66	2.82	11.27	14.08	19.72
Erk2: series 2
	all poses (n = 745)	1.74	3.89	4.56	5.50	9.93	74.36
	best poses (n = 276)	4.71	5.80	6.88	9.06	13.77	59.78	5.49
	top poses (n = 276)	3.99	1.45	6.16	7.25	7.97	73.19
Erk2: series 3
	all poses (n = 192)	11.46	9.90	7.81	1.04	2.08	67.71
	best poses (n = 64)	20.31	12.50	12.50	0.00	3.13	51.56	5.06
	top poses (n = 64)	14.06	9.38	7.81	1.56	3.13	64.06
All proteins
	all poses (n = 3250)	11.05	17.69	8.98	8.18	7.60	46.49
	best poses (n = 1199)	18.52	23.02	8.67	10.18	7.92	31.69	3.00
	top poses (n = 1199)	13.43	20.43	8.76	8.59	7.26	41.53

RMSD values greater than 30% are bolded.

RMSD box plot of the best pose for each protein–ligand complex broken down by protein target. The rectangular box indicates the interquartile range (25–75%) and the bars the 1.5× interquartile range. The median is shown by the line in the box, and the diamond denotes the mean and 95% confidence interval around the mean. The red bracket signifies the shortest interval that contains 50% of the data, and outliers are indicated by squares above the bars. RMSD values greater than 30% are bolded. LpxC and Urokinase only have one chemical series, while both Chk1 and Erk2 have three series within the ligand set provided to the participants. It appears that the most prominent reason that groups did not perform as well on Chk1 and Erk2 is because of the multiple chemical series. If the chemical series are broken out, performance across the different protein targets was very comparable when comparing the series that contained a chemically similar ligand to the co-crystal structure utilized for docking. Methods performed much better on series 1 than series 2 or 3 for both Chk1 and Erk2. The crystal structure suggested by the CSAR team for both Chk1 and Erk2 are co-crystallized with a ligand from their respective series 1. Accounting for the conformational changes that can occur within the binding pocket of the protein is a very difficult task.[2,8−13] In co-crystal structures, the prearrangement of the ligand binding site can lead to the cross-docking problem where the protein structure has adapted to bind a particular ligand or class of ligands but is unable to accommodate structurally diverse inhibitors as we found here. Incorporating protein flexibility is recognized as a means to overcome the cross-docking problem; however, not enough groups used protein flexibility to allow us to perform a statistically significant analysis on whether or not it affected the docking results.

How did RMSD correlate with native contacts?

We first asked if the native contact metric agrees with RMSD and if it provides any additional useful information. Figure 3 shows native contact box plots of the best pose for each protein–ligand complex broken up by group–method for %Total, %Het–Het, and %C–C contacts correct. When comparing all series together, native contacts show the same trend as RMSD. Groups performed the best on LpxC and Urokinase; the median %Total was equal to 51% and 52%, respectively. Chk1 and Erk2 appeared to be more difficult; median %Total was equal to 25% and 13%, respectively. However, unlike a RMSD, native contacts can provide additional information on the specific type of contacts that are being made, as demonstrated in Figure 3B and C. When comparing the %Het–Het contacts correct versus %C–C contacts correct, it appears that groups were more successful at predicting the Het–Het contacts in Urokinase (%Het–Het = 77%). However, the C–C contacts groups were more successful at LpxC (%C–C = 54%). Another interesting trend that emerged is that methods had a more difficult time predicting the C–C contacts than the Het–Het contacts between the protein and ligand. In fact, 7.3% of the group–methods were able to predict 100% of the Het–Het contacts, while no group–methods were able to predict 100% of the C–C contacts. Additionally, 13.2% of the group–methods could predict greater than 80% of the Het–Het contacts, but only 1.6% of the group–methods could do the same for C–C.

Figure 3

Native contacts box plot of the best pose for each protein–ligand complex broken down by protein target. The rectangular box indicates the interquartile range (25–75%) and the bars the 1.5× interquartile range. The median is shown by the line in the box, and the diamond denotes the mean and 95% confidence interval around the mean. The red bracket signifies the shortest interval that contains 50% of the data, and outliers are indicated by squares above the bars. (A) %Total contacts correct, (B) %Het–Het contacts correct, and (C) %C–C contacts correct. Figure 4 shows the correlation between the calculated native contact and RMSD values. The data suggests that the values are exponentially correlated (r2 = 0.75); as the ligand moves further away from the protein, contacts are lost at an exponential rate. Kroemer et al. also found that overall RMSD correlates with their interactions-based accuracy classification (with the exception of a few test cases).[45] As demonstrated in Figure 4A, for this data set, it is not possible to have a RMSD better than 3.45 Å when no contacts are being made. Furthermore, at 50% total contacts correct, the RMSD is 1.55 Å (on average) with a range up to ∼3 Å.

Figure 4

(A) %Total contacts correct, (B) %Het–Het contacts correct, and (C) %C–C contacts correct plotted again RMSD. The exponential fit is shown on each graph.

To date, the field uses 2 Å as the cutoff for a successful docking pose. This value was not determined quantitatively but rather through qualitative inspection over many years of evaluating docking programs and the desire to use a round number. Utilizing both native contacts and RMSD provides the researcher with a more complete picture of their docking performance and allows for a more quantitative analysis of the results. Here, we can use the %Total contacts correct at various RMSD cutoff values to examine if 2 Å is an appropriate metric, and if not, what is. At a RMSD cutoff of 2 Å, the %Total contacts correct ranges from 14% to 86% (for 499 data points). The same analysis was conducted at a RMSD cutoff of 1.5, 2.5, 3, and 3.5 Å, and the ranges along with the percentage of %Total contacts that fall within various cut-offs are provided in the Supporting Information. An examination of all data suggests that lowering the value does not gain significant contacts; however, 2.5 Å is just as reasonable of a cutoff as 2 Å for defining a correct pose. (A) %Total contacts correct, (B) %Het–Het contacts correct, and (C) %C–C contacts correct plotted again RMSD. The exponential fit is shown on each graph. The data for %C–C contacts correct (Figure 4C) essentially follows the same trend shown in %Total (Figure 4A). On the %Het–Het contacts correct graph (Figure 4B), there are interesting data points where 0% of the correct contacts are being made at a range of RMSD values (even less than 2 Å). Careful examination of the predicted poses elucidated that these points are all from Chk1. As shown in Figure 5, the ligand is just slightly shifted to the right. Although, the RMSD is equal to 0.702, both of the hinge region hydrogen bonds have been lost. One must be careful in the interpretation of native contacts data because the number of hydrogen bonds is typically much smaller than the number of carbon–carbon interactions, and the number of hydrogen bonds varies significantly from target to target. This data will also be influenced by the size and composition of the ligand.

Figure 5

Predicted docking pose (submission; yellow) overlaid with the experimental co-crystal structure of Chk1–ligand 1 (blue). Dotted lines illustrate two important hydrogen bonds formed between the ligand and the hinge region of the protein backbone. The RMSD between the coordinates of the predicted pose and coordinates of the experimental structure is equal to 0.702, %Het–Het contacts correct is equal to 0%, and %C–C contacts correct is equal to 37%.

Did the scoring functions overpredict or underpredict raw contacts?

Additional information on whether scoring functions overpredict or underpredict contacts can be gathered by analyzing all raw contacts made between the protein and ligand (i.e., not just the percent native contacts correct as previously presented). This is an important question because it highlights the cause of the differences in the poses rather than just assessing whether or not the correct ligand pose was found. Consequently, it emphasizes weaknesses that could be addressed in scoring function development. Table 5 presents the percentage of raw Het–Het, C–C, and packing contacts that were overpredicted, underpredicted, or the same number of contacts (within 10%) for all protein targets combined and broken down by each protein. An important note is that “same-predicted” does not mean the same contacts are being made but rather that the same number of contacts has been predicted. For all proteins combined, the trend emerges that scoring functions have a bias to underpredict Het–Het and packing contacts but both underpredict and overpredict C–C contacts at the same rate. About half of all methods underpredict Het–Het contacts, while only 32% overpredict them. For C–C contacts, the same exact number of methods overpredict and underpredict (∼40%), and there is no general bias seen. We were very surprised to find that Het–Het and packing contacts were biased toward underpredicting contacts as the overall consensus in the field is that scoring functions tend to focus on optimizing both types of contacts and overpredicting the interactions.

Table 5

Percentage of Raw Het–Het, C–C, and Packing Contacts Where the Number Was Overpredicted, Underpredicted, or Had the Same Contacts within 10% for Best Posea

	Hetero–Hetero	Carbon–Carbon	packing
All (n = 1199)
same predicted	17.70 ± 2.17	21.10 ± 2.25	32.53 ± 2.63
underpredicted	50.37 ± 2.79	40.38 ± 2.84	40.71 ± 2.80
overpredicted	31.93 ± 2.63	38.53 ± 2.75	26.60 ± 2.46

LpxC (n = 174)
same predicted	21.26 ± 6.32	29.90 ± 6.90	49.41 ± 7.47
underpredicted	73.00 ± 6.61	19.00 ± 5.75	23.56 ± 6.32
overpredicted	5.75 ± 3.74	51.20 ± 7.47	27.03 ± 6.61

Urokinase (n = 137)
same predicted	27.77 ± 7.30	19.75 ± 6.57	29.90 ± 7.66
underpredicted	37.21 ± 8.03	59.09 ± 8.03	50.37 ± 8.39
overpredicted	35.02 ± 8.03	21.16 ± 6.93	19.75 ± 6.93

Chk1 (n = 477)
same predicted	15.33 ± 3.25	20.13 ± 3.46	31.23 ± 4.19
underpredicted	49.48 ± 4.57	43.61 ± 4.51	46.95 ± 4.51
overpredicted	35.19 ± 4.19	36.26 ± 4.40	21.82 ± 3.67

Erk2 (n = 411)
same predicted	15.53 ± 3.53	19.00 ± 3.77	27.75 ± 4.28
underpredicted	46.19 ± 4.87	39.41 ± 4.74	37.46 ± 4.77
overpredicted	38.27 ± 4.74	41.60 ± 4.87	34.31 ± 4.52

Values greater than 35% are bolded.

Values greater than 35% are bolded. In Figure 6A and B, the RMSD < 1 Å bin and RMSD = 1–2 Å bin for Het–Het contacts are provided, respectively; the population of each value is given by the size of the point. As one would expect, when the RMSD is quite small, the majority of the points are close to the identity line. It is also obvious from these graphs that when the RMSD is small, the trend that scoring functions are underpredicting contacts holds true. As the RMSD becomes larger, the data becomes more spread and moves away from the identity line (data for bins 2–4, 4–10, and >10 Å is provided in the Supporting Information). Furthermore, once the RMSD is greater than 10 Å, almost all of the contacts are off the identity line and being underpredicted. Figure 7A and B show the RMSD < 1 Å bin and RMSD = 1–2 Å bin for C–C contacts and Figure 8A and B for Packing contacts (again data for bins 2–4, 4–10, and >10 Å is provided in the Supporting Information). For C–C contacts, the points are spread almost evenly between underprediction and overprediction. The packing contacts agree with what was shown for Het–Het contacts, and at small RMSD values, the trend that the scoring function is underpredicting contacts remains. Again, this is a very interesting finding as most scoring functions use an additive term for van der Waals packing, and hence, more contacts should result in a better score.

Figure 6

Figure 7

Number of raw C–C contacts in co-crystal versus number of raw C–C contacts in prediction. The solid line illustrates a perfect match, while the dotted lines show a ±10% range. (A) RMSD < 1 Å bin. (B) RMSD = 1–2 Å bin.

Figure 8

Number of raw packing contacts in co-crystal versus number of raw packing contacts in prediction. The solid line illustrates a perfect match, while the dotted lines show a ±10% range. (A) RMSD < 1 Å bin. (B) RMSD = 1–2 Å bin.

Number of raw Het–Het contacts in co-crystal versus number of raw Het–Het contacts in prediction. The solid line illustrates a perfect match, while the dotted lines show a ±10% range. (A) RMSD < 1 Å bin. (B) RMSD = 1–2 Å bin. Number of raw C–C contacts in co-crystal versus number of raw C–C contacts in prediction. The solid line illustrates a perfect match, while the dotted lines show a ±10% range. (A) RMSD < 1 Å bin. (B) RMSD = 1–2 Å bin. Number of raw packing contacts in co-crystal versus number of raw packing contacts in prediction. The solid line illustrates a perfect match, while the dotted lines show a ±10% range. (A) RMSD < 1 Å bin. (B) RMSD = 1–2 Å bin. Table 5 also shows the data broken down by protein target. For LpxC, the Het–Het contacts were significantly underpredicted, while the C–C contacts were overpredicted. The packing contacts were overpredicted and underpredicted at essentially the same rate. LpxC contains a catalytic zinc atom, which was included as part of the active site. Methods had a difficult time predicting these hydrogen bonds. However, the scoring function was still able to rank these predictions correct because it was overcompensating by overpredicting C–C contacts.

Did the docking metrics correlate with ligand descriptors?

When utilizing new metrics for pose prediction, it is always prudent to determine if they correlate with size and chemical properties of the ligand. To assess this, MOE 2010 was utilized to calculate the Pearson (r), Spearman (ρ), and Kendall (τ) correlations between the ligand properties and RMSD, %Het–Het, and %C–C. The results are provided in the Supporting Information for molecular weight, # of atoms, # of heavy atoms, # of hetero atoms, # of hydrophobic atoms, # of acceptors, # of donors, # of acceptors and donors, # of carbons, and # of nitrogens. We found that there was no correlation between the ligand size and the metrics. Furthermore, we found that the metrics were not correlated with chemical properties of the ligand such as the number of hydrogen bond acceptors and donors, among others.

How did the methodology “features” correlate with RMSD?

Each of the 20 groups employs their own protocols for protein and ligand setup and docking methodology. In order to understand how such methodology “features” across multiple groups’ results affected the pose/ranking predictions, we asked each participant to fill out an online questionnaire to gather additional data on the details of their methodology. The pose prediction results were binned by RMSD, and the percentage of time that a particular feature resulted in a pose within the RMSD bin is presented. It is important to note that although we received a 100% response rate, some participants did not answer every question. The data presented is for “all poses” as “best pose” when binned by RMSD did not always have enough data points to be statistically significant. Figure 9 shows how the various components of protein and ligand setup trend with the RMSD. Here, minimizing the protein and correcting the histidine tautomeric state had a positive effect on the docking results, while minimizing the ligand appeared to have a less positive effect. As previously mentioned, many groups that participated in the 2011 Docking and Scoring Challenge also found that optimizing the protein structure prior to docking improved their performance.[27−29] Most likely, this creates an internally consistent environment as the protein is now on the same energy landscape as the scoring function used in docking. Furthermore, scoring functions are typically parametrized for proteins but not for ligands, which may result in unrealistic ligand conformations such as bent aromatic rings. Lastly, pregenerated ligand conformations had better results than those that were generated on the fly. This may suggest an issue with ligand sampling.

Figure 9

Outcome of the online questionnaire on protein and ligand setup for all poses. The pose prediction results were binned by RMSD and plotted as the percentage of time that a particular feature resulted in a pose within the RMSD bin. Distinct trends that are related to docking RMSD are noted with arrows. Figure 10 shows the same analysis for the docking methodology that was employed by each group and how it trends with the RMSD. The data revealed that first training with the native ligand to determine optimal docking parameters significantly improved the docking performance as did using restraints, substructure fitting, and shape fitting. This data implies that results can be enriched when prior information about the system is known. Furthermore, marrying ligand- and structure-based approaches has been an area of active research in the field recently, and these hybrid techniques have been shown to outperform the use of structure-based methods on their own.[65−67] Interestingly, the use of special parameters for the catalytic zinc in LpxC did not improve docking performance (data not shown).

Figure 10

Outcome of the online questionnaire on docking methodology for all poses. The pose prediction results were binned by RMSD and plotted as the percentage of time that a particular feature resulted in a pose within the RMSD bin. Distinct trends that are related to docking RMSD are noted with arrows. The type of scoring function utilized for both sampling and ranking was also analyzed, as shown in Figure 11. Many groups utilize their own scoring function or a combination of the different types, and these would fall into our “other” category. When empirical scoring functions are used as either the sampling or ranking scoring function, they appear to have a positive effect on the docking results as demonstrated by Figure 11A and B. However, the “other” category negatively affected pose prediction when utilized as the sampling or ranking scoring function. In general, the “other” category was primarily hybrid-type scoring functions where two or more types were combined. Knowledge-based scoring functions performed fairly consistent across the RMSD bins for both sampling and ranking scoring functions. Here, consistently means that this particular scoring function did not seem to have a negative or positive effect on the pose prediction performance. The force field-based scoring function was also fairly consistent across the RMSD bins when utilized as the sampling scoring function but appeared to have a positive effect when used as the ranking scoring function. The top three performing groups (best pose) all utilized empirical-based scoring functions for sampling, and two of the three also used empirical-based for ranking (the third was force field-based). One of the bottom three performing groups employed a force field-based scoring function for both sampling and scoring. The second used a shape and functionality-based complementarity, and the third used a crude shape-based complementarity for sampling and knowledge-based for scoring.

Figure 11

Outcome of the online questionnaire on scoring functions for all poses. The percentage of time that a scoring function was utilized is shown by RMSD bin. Distinct trends that are related to docking RMSD are noted with arrows.

How well did methods perform overall on identifying actives from inactives and relative ranking?

Tables 6 and 7 show the Pearson r and Spearman ρ, respectively, assessments of the correlation between the predicted scores and the experimental binding affinity for each group–method, broken down by protein target. Only the 28 group–methods that submitted scores for all ligands of LpxC, Urokinase, Chk1, and Erk2 were included in the analysis to ensure a fair evaluation and are shown in Tables 6 and 7. However, for completeness, all groups that submitted rankings are provided in the Supporting Information as well as the Kendall τ correlation. Pearson is parametric and a measure of the linear relationship between scores and binding affinities, while Spearman ρ and Kendal τ are nonparametric and reflect the correlation of the rank ordering of the data. As r compares the absolute values of each prediction, it is a much more difficult assessment metric than ρ and τ. While all correlations are worthwhile to compare, ρ and τ are more appropriate metrics for relative ranking and r for absolute binding affinities.

Table 6

Pearson r Parametric Correlation between the Predicted Scores and Experimental Binding Affinities by Protein for Group–Methods That Submitted Scores for All Ligands of LpxC, Urokinase, Chk1, and Erk2a

group–method	LpxC (3 lig)	Urokinase (15 lig)	Chk1 (29 lig)	Erk2 (38 lig)	sum r (max = 4.00)
median	0.78	0.50	–0.01	0.37	1.64
molecular wt	0.37	0.42	–0.14	0.40	1.05
SlogP	–0.51	0.06	–0.33	–0.15	–0.93
group A-1	–0.84	–0.28	0.01	0.11	–1.00
group C-1	0.76	0.71	0	0.30	1.77
group C-2	0.73	0.71	–0.12	0.34	1.66
group E-1	0.87	0.45	0.09	0.66	2.07
group E-2	0.87	0.45	0.01	0.57	1.90
group G-1	1.00	0.09	0.16	0.41	1.66
group H-1	0.99	0.32	0.02	0.46	1.79
group H-2	0.92	0.23	0.02	0.47	1.64
group I-1	0.94	0.48	–0.12	0.26	1.56
group I-2	0.92	0.49	–0.09	0.31	1.63
group J-1	1.00	0.35	0.30	–0.02	1.63
group L-1	0.61	0.77	–0.02	–0.02	1.34
group L-2	0.41	0.60	0.02	0.09	1.12
group M-1	0.67	0.29	–0.15	0.23	1.04
group M-2	0.31	0.33	–0.26	0.41	0.79
group M-3	0.40	0.47	–0.13	0.16	0.90
group M-4	0.20	0.43	–0.31	0.33	0.65
group N-1	–0.56	0.22	–0.12	0.03	–0.43
group P-1	0.99	0.36	0	0.41	1.76
group P-2	0.59	0.50	–0.22	0.63	1.50
group P-5	0.99	0.72	–0.03	0.48	2.16
group P-6	0.79	0.58	–0.14	0.55	1.78
group R-1	0.89	0.54	–0.12	0.22	1.53
group R-2	0.84	0.61	–0.11	0.15	1.49
group S-1	0.92	0.57	0.38	0.12	1.99
group S-2	0.77	0.60	0.16	0.44	1.97
group T-1	–0.56	0.17	0.32	–0.07	–0.14
group T-2	0.85	0.37	–0.17	0.40	1.45

Values of r greater than 0.50 and sum r values greater than 2.00 are bolded.

Table 7

Spearman ρ Nonparametric Correlation between the Predicted Scores and Experimental Binding Affinities by Protein for Group–Methods That Submitted Scores for All Ligands of LpxC, Urokinase, Chk1, and Erk2a

group–method	LpxC (3 lig)	Urokinase (15 lig)	Chk1 (29 lig)	Erk2 (38 lig)	sum ρ (max = 4.00)
median	0.50	0.52	–0.03	0.31	1.30
molecular wt	0.50	0.41	–0.14	0.40	1.17
SlogP	–0.50	0.14	–0.29	–0.15	–0.80
group A-1	–0.87	–0.28	0.02	0.06	–1.07
group C-1	0.50	0.64	–0.02	0.32	1.44
group C-2	0.50	0.63	–0.09	0.34	1.38
group E-1	0.50	0.50	0.03	0.67	1.70
group E-2	0.50	0.50	0.05	0.58	1.63
group G-1	1.00	0.29	0.18	0.42	1.89
group H-1	1.00	0.20	0.06	0.45	1.71
group H-2	1.00	0.24	0.01	0.50	1.75
group I-1	1.00	0.56	–0.09	0.22	1.69
group I-2	1.00	0.51	–0.14	0.30	1.67
group J-1	1.00	0.44	0.10	–0.08	1.46
group L-1	0.50	0.76	0.01	–0.04	1.23
group L-2	0.50	0.72	0.08	0.05	1.35
group M-1	0.50	0.31	–0.16	0.21	0.86
group M-2	0.50	0.37	–0.24	0.39	1.02
group M-3	0.50	0.50	–0.05	0.14	1.09
group M-4	0.50	0.50	–0.30	0.28	0.98
group N-1	–0.50	0.22	–0.10	0.17	–0.21
group P-1	1.00	0.38	–0.04	0.41	1.75
group P-2	0.50	0.52	–0.22	0.64	1.44
group P-5	1.00	0.55	–0.02	0.45	1.98
group P-6	0.50	0.49	–0.11	0.56	1.44
group R-1	0.50	0.60	–0.17	0.24	1.17
group R-2	0.50	0.62	–0.15	0.14	1.11
group S-1	1.00	0.63	0.30	0.12	2.05
group S-2	0.50	0.57	0.13	0.40	1.60
group T-1	–0.50	0.32	0.28	–0.02	0.08
group T-2	0.50	0.40	–0.14	0.40	1.16

Values of ρ greater than or equal to 0.50 and sum ρ values greater than 2.00 are bolded.

Overall, most groups did not perform well on relative ranking. This is not surprising as it is well documented that ranking ligands is very difficult.[6,20,33,34] The sum of r and ρ across all proteins is provided as a metric to assess each groups overall performance; a perfect ranking would result in a sum of r or ρ of 4.0 while random would be 0. We interpreted methods with sums of ≥2.0 as having good performance. Molecular weight and SlogP were calculated and used as “yardsticks” or null cases to determine a baseline value.[20] The sum of r was 1.05 for molecular weight and −0.93 of SlogP. We also calculated the F-statistic to determine if the fits found for the group–methods were statistically different from the fits found for molecular weight and also for SlogP. For the majority of the group–methods, the linear fits are statistically significant in their difference from the yardsticks both for methods with good correlation and methods with poor correlation. Values of r greater than 0.50 and sum r values greater than 2.00 are bolded. Values of ρ greater than or equal to 0.50 and sum ρ values greater than 2.00 are bolded. Here, the maximum sum of ρ was 2.05 (group S-1, which used an empirical-based scoring function), and the minimum sum of ρ was −1.07 (group A-1, which used a force field-based scoring function). Only one of the group–methods resulted in a sum of ρ greater than 2.0, and two were anticorrelated. In summary, across all protein systems, most groups were not able to relatively rank the ligands. The maximum sum of r was 2.16 (group P-5, which used a knowledge-based ranking scoring function), and the minimum sum of r was −0.14 (group T-1, which used a force field-based scoring function). Only 2 of the 28 group–methods were able to attain a score of a sum of r greater than 2.0, and 3 of the 28 were anticorrelated. Using the rankings from ρ, two of the top three performing groups used empirical-based scoring functions for sampling and scoring, and the third used a crude shape-based complementarity for sampling and knowledge-based for scoring. One of the bottom three performing groups employed a force field-based scoring function for both sampling and scoring. The second used a shape and functionality-based complementarity, and the third used a combination of knowledge and empirical-based. This finding illuminates another potential issue: Is the experimental data that the computational field considers to be “gold standard” actually correct? As discussed in our data set paper,[40] we have found that different experimental methods (e.g., Thermofluor, ITC, Octet Red, etc.) can give different values for the same protein–ligand complex, and furthermore, even the relative ranking between a chemical series can be dependent on the various experimental method employed. This is a very troublesome finding as it suggests that the best we may be able to do as a computational field is predict whether a compound is active or inactive but not absolute values or ranking between libraries of compounds. Our techniques are only as good as the data being used for development, and if the experiment data does not agree between methods, then we have no gold standard to judge our predictions. Researchers will need to first check the variance of their data and use only targets with the lowest to parametrize and validate their methods. We calculated correlations between the rankings for each group–method in this exercise and found no correlation, suggesting that inaccurate reference data is not the issue here. Most groups truly found ranking to be a difficult task. In our next benchmark exercise, we will try to build a data set to address this issue in more detail. Table 8 shows the results for the enrichment or discriminating actives from inactives portion of the exercise, again only for the 28 group–methods that submitted ranks for all ligands (all group–methods are provided in the Supporting Information). A high AUC indicates that actives were clearly identified over inactives. The sum of the AUC is provided as a metric to assess each groups overall performance; a perfect ranking would result in a sum of 3.00, while random ranking would be 1.50 (there were no inactives for Erk2). Here, the evaluation was conducted again using only group–methods that sent in scores for all ligands of LpxC, Urokinase, and Chk1. The maximum sum AUC was 2.44 (group L-2, which used an empirical-based ranking scoring function), and the minimum sum AUC was 1.14 (group I-1, which used an empirical-based ranking scoring function). Thirteen of the 28 group–methods performance was less than random (sum AUC less than 1.50). All of the top three performing groups utilized empirical-based scoring functions for both sampling and scoring. Two of the bottom three performing groups employed a combination of force field- and empirical-based scoring functions for both sampling and scoring, and the third used empirical-based.

Table 8

AUC Values Derived from ROC Curves by Protein for Group–Methods that submitted Scores for All Ligands of LpxC, Urokinase, and Chk1a

group–method	LpxC (3 active, 8 nonactive)	Urokinase (15 active, 4 nonactive)	Chk1 (30 active, 9 nonactive)	sum AUCs (max = 3.00)	Urokinase + Chk1 sum AUCs (max = 2.00)
median	0.21	0.83	0.56	1.60	1.39
molecular wt	0.83	0.55	0.69	2.08	1.24
SlogP	0.13	0.72	0.49	1.33	1.21
group A-1	0.71	0.25	0.69	1.64	0.94
group C-1	0.04	0.97	0.59	1.60	1.56
group C-2	0.04	0.78	0.60	1.42	1.38
group E-1	0.08	0.83	0.64	1.55	1.47
group E-2	0.08	0.83	0.54	1.46	1.37
group G-1	0.33	0.63	0.41	1.38	1.04
group H-1	0.42	0.83	0.55	1.80	1.38
group H-2	0.25	0.83	0.56	1.64	1.39
group I-1	0	0.68	0.46	1.14	1.14
group I-2	0.33	0.73	0.38	1.44	1.11
group J-1	0.92	0.97	0.44	2.32	1.41
group L-1	0.75	0.98	0.49	2.22	1.47
group L-2	1	0.97	0.47	2.44	1.44
group M-1	0.13	0.80	0.55	1.48	1.35
group M-2	0.42	0.78	0.43	1.63	1.21
group M-3	0.13	0.82	0.47	1.41	1.29
group M-4	0.42	0.78	0.42	1.62	1.21
group N-1	0.79	0.82	0.50	2.11	1.32
group P-1	0.38	0.72	0.62	1.71	1.34
group P-2	0.21	0.75	0.72	1.68	1.47
group P-5	0.33	0.97	0.75	2.05	1.71
group P-6	0.21	0.77	0.74	1.72	1.51
group R-1	0.17	0.53	0.53	1.23	1.06
group R-2	0.13	0.57	0.52	1.21	1.08
group S-1	0.79	0.83	0.47	2.10	1.31
group S-2	0.67	0.90	0.43	1.99	1.33
group T-1	0.92	0.33	0.53	1.78	0.87
group T-2	0	0.72	0.66	1.38	1.38

AUC values greater than 0.50 and sum AUC values greater than 1.50 and 1.00 (for Urokinase and Chk1 alone) are bolded.

AUC values greater than 0.50 and sum AUC values greater than 1.50 and 1.00 (for Urokinase and Chk1 alone) are bolded. We also conducted this analysis without LpxC as Urokinase and Chk1 have more ligands and suffer less from small-number statistical issues. For just Urokinase and Chk1, the sum of the median AUC was 1.39 (here perfect would be 2.00 and random would be 1.00). Only 3 of the 28 group–methods had sum AUCs greater than random, and 3 of the 28 were less than random. The majority of them were close to random. Similar to relative ranking, most methods were not able to do enrichment well across the board. Furthermore, no group–method performed the best in all three categories of pose prediction, discriminating actives from inactives, and relative ranking. The scoring functions utilized in each method have their own strengths and weaknesses that do not appear to be robust across the evaluation exercises.

Which test sets were most challenging for identifying actives from inactives and relative ranking?

When using a Pearson correlation as the evaluation metric, the group–methods performed the best on LpxC (median r was 0.78), as provided in Table 6. However, when applying a nonparametric correlation coefficient (Table 7), the maximum median ρ was 0.52 for Urokinase, while LpxC was very close with a median ρ of 0.50. For this analysis, LpxC stands out as the protein system that group–methods performed the best at, while Chk1 appears to be the hardest test case. For the most part, the correlations found were not any better than random as compared to the null test cases. The three chemical series within the Chk1 and Erk2 ligands were also broken up (data is provided in the Supporting Information). For both Chk1 and Erk2, the correlations do improve within a few of the series but are still essentially not any better than random. Relative ranking is a difficult task and one where much work is needed. When evaluating the group–methods on each protein system independently, the performance by the various group–methods was much better. Here, the median AUC for Urokinase was 0.83, indicating that groups were able to discriminate Urokinase actives from inactives quite well, as shown in Table 8. Chk1 followed with a median AUC of 0.56 (essentially random), and the median AUC of LpxC was 0.21 (worse than random). The three chemical series within the Chk1 ligands were also broken up, and the median AUCs were found to be 0.61 for series 1, 0.69 for series 2, and 0.56 for series 3 (tables are provided in the Supporting Information). Unlike with pose prediction, group–methods were still not able to discriminate actives from inactives within the three chemical series. However, there were cases where actives and inactives within a series could be identified, but overall performance suffered because inactives in one series outranked actives of another series.

Are scoring functions able to identify actives from inactives better than relative ranking or vice versa?

Four protein systems were employed in the exercise. However, LpxC was too small of a data set to draw a statistically significant conclusion, and Erk2 did not have any inactives. As such, Chk1 and Urokinase were examined to determine if scoring functions were better at enrichment or relative ranking in this benchmark exercise. Unfortunately, most methods were not able to do either very well for Chk1 (data not shown). Figure 12 shows the ranking results (Pearson r or Spearman ρ correlations) plotted against enrichment results (AUC) for Urokinase. An AUC of less than 0.50 is considered random, while negative values of r or ρ signify that the rankings were anticorrelated. Here, we see that the trend is the same whether a parametric or nonparametric correlation is applied; most groups are able to predict the active Urokinase ligands from the inactives better than rank the active molecules. At a value of AUC equal to 1.0 (perfect enrichment), there is a large spread of r and ρ. However, we see that at high values of r and ρ, the AUC is close to 1.0. This demonstrates that if a group is able to rank well, then they are usually able to discriminate actives from inactives well but not vice versa. Ranking is typically thought of as the harder task of the two, so if the scoring function is fined tuned enough to rank, then it should also be able to do enrichment.

Figure 12

For the Urokinase test set, the ability to rank active molecules versus enriching hit lists is plotted. An AUC of less than 0.50 is considered random. Negative values of r or ρ signify that the data was anticorrelated. (A) Pearson r parametric correlation versus AUC and (B) Spearman ρ nonparametric correlation versus AUC.

How do predicted poses correlate with ranking? Do the programs typically get the pose correct when the ranking is correct or vice versa?

Docking programs are faced with two major tasks: (1) predicting the pose of the ligand in the presence of the protein and (2) scoring the predicted the pose. Although two separate tasks, they should be correlated, which brings about the question “Are scoring functions getting the rankings correct for the right reasons?”. One would assume that if the pose is ranked the highest, it should be the pose seen in the crystal structure. In Figure 13, the RMSD of the top-ranking pose for molecule X is shown versus the percentage of inactive molecules ranked higher than molecule X for both Urokinase and Chk1 targets (Was the scoring function able to rank the active molecule higher than the inactive molecules, and if not, how many inactive molecules were ranked higher?). Again, only Urokinase and Chk1 were used because of the reasons stated above. While there is a spread in the data suggesting that the scoring function is not always ranking for the correct reason, there also is a large group of molecules (13.6%) near the 0,0 point on the graph. There are multiple reasons that scoring functions may be ranking an incorrect pose higher than the correct pose. First, it may be due to incorrect capturing of the contacts being made between the protein and ligand. Additionally, it may be because of the terms that are not explicitly accounted for in the scoring function (e.g., entropy and/or solvation are typically not included).

Figure 13

RMSD is plotted against the percentage of inactive molecules ranked higher than an active molecule for both Urokinase and Chk1 targets. The insert shows the percentage of ligands that fall with each RMSD bin for two groups: (1) active molecules that have no inactives ranked higher (0%) and (2) active molecules that have one or more inactives ranked higher (all other). The data was also binned by RMSD for two groups: (1) active molecules that have no inactives ranked higher (0%) and (2) active molecules that have one or more inactives ranked higher (all other). Of the “0%” group, 54.2% of the poses had a RMSD of less than 2 Å, the cutoff for a successful docking prediction. For the “all other” group, only 17.4% fall within this cutoff. It appears from this data that the correct pose is not a necessity for ranking correctly, but there is a better chance to be scored correctly if the pose is correct as well.

Conclusion

In this benchmark exercise, participants were asked to compare different improvements for pose prediction, enrichment, and relative ranking of congeneric series of compounds across four protein targets. Here, we have provided a thorough analysis across all groups’ results to determine common limitations to many docking programs to help the field prioritize where effort should be made. Additionally, much emphasis was placed on the pose prediction evaluation metrics to help set standards in the field. When developing computational methods, proper evaluation of the results is just as important as the high quality experimental data used in the data set. Using best pose, the median RMSD across all group–methods was 3.0 Å, and 37% of group–methods had a median RMSD < 2 Å. LpxC and Urokinase had the smallest median RMSD with Chk1 following, and Erk2 was the most challenging. Native contacts are exponentially correlated to RMSD. Additionally, they provide a breakdown of contact types (Het–Het vs C–C) and information on atom packing. For all proteins combined, raw Het–Het and packing contacts were underpredicted and raw C–C contacts were both overpredicted and underpredicted at the same rate. No correlations were found between the pose prediction metrics and chemical properties or size of ligand. For protein and ligand setup, minimizing the protein and correcting the histidine tautomeric state had a positive effect on the docking results, while minimizing the ligand appeared to have a less positive effect. Pregenerated ligand conformations had better results than those that were generated on the fly. Additionally, first training with the native ligand to determine optimal docking parameters significantly improved the docking performance as did using restraints, substructure fitting, and shape fitting. Lastly, for both sampling and ranking scoring functions, the use of the empirical scoring function appeared to have a positive effect on the docking results, while the “other” category negatively affected pose prediction. For the most part, methods were not very successful at relative ranking or enrichment. The sum of the median ρ was 1.28/4.00, r was 1.65/4.00, and the sum of the median AUCs was 1.60/3.00. For relative ranking, group–methods performed the best on LpxC, and Chk1 was found to be the most challenging. In the enrichment study, Urokinase proved to be the most straightforward, while LpxC was the most difficult. Compared to relative ranking, group–methods were able to identify actives from inactives better for Urokinase. However, LpxC was too small of a data set to draw conclusions, and for Chk1, group–methods were not able to do either very well. Lastly, the correct pose is not a necessity for ranking correctly, but there is a better chance to be scored correctly if the pose is correct as well. Future benchmark exercises from CSAR will involve ranking pregenerated poses to separate the two major components of the docking algorithm and, hence, focus only on the ability of the scoring function to correctly rank the “real” binding pose. As always, we will strive to conduct a blinded exercise with as many systems and ligands as possible to not only avoid system dependent insights but also to provide statistical significance to the results. We greatly appreciate the efforts of all of our colleagues in the pharmaceutical industry for the donation of this data and for future benchmark exercises.

60 in total

1. Glide: a new approach for rapid, accurate docking and scoring. 2. Enrichment factors in database screening.

Authors: Thomas A Halgren; Robert B Murphy; Richard A Friesner; Hege S Beard; Leah L Frye; W Thomas Pollard; Jay L Banks
Journal: J Med Chem Date: 2004-03-25 Impact factor: 7.446

2. Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy.

Authors: Richard A Friesner; Jay L Banks; Robert B Murphy; Thomas A Halgren; Jasna J Klicic; Daniel T Mainz; Matthew P Repasky; Eric H Knoll; Mee Shelley; Jason K Perry; David E Shaw; Perry Francis; Peter S Shenkin
Journal: J Med Chem Date: 2004-03-25 Impact factor: 7.446

3. The Uppsala Electron-Density Server.

Authors: Gerard J Kleywegt; Mark R Harris; Jin Yu Zou; Thomas C Taylor; Anders Wählby; T Alwyn Jones
Journal: Acta Crystallogr D Biol Crystallogr Date: 2004-11-26

Review 4. Comparing protein-ligand docking programs is difficult.

Authors: Jason C Cole; Christopher W Murray; J Willem M Nissink; Richard D Taylor; Robin Taylor
Journal: Proteins Date: 2005-08-15

5. Comparative assessment of scoring functions on a diverse test set.

Authors: Tiejun Cheng; Xun Li; Yan Li; Zhihai Liu; Renxiao Wang
Journal: J Chem Inf Model Date: 2009-04 Impact factor: 4.956

6. A blind challenge for computational solvation free energies: introduction and overview.

Authors: J Peter Guthrie
Journal: J Phys Chem B Date: 2009-04-09 Impact factor: 2.991

7. Docking performance of the glide program as evaluated on the Astex and DUD datasets: a complete set of glide SP results and selected results for a new scoring function integrating WaterMap and glide.

Authors: Matthew P Repasky; Robert B Murphy; Jay L Banks; Jeremy R Greenwood; Ivan Tubert-Brohman; Sathesh Bhat; Richard A Friesner
Journal: J Comput Aided Mol Des Date: 2012-05-11 Impact factor: 3.686

8. Species-specific and inhibitor-dependent conformations of LpxC: implications for antibiotic design.

Authors: Chul-Jin Lee; Xiaofei Liang; Xin Chen; Daina Zeng; Sang Hoon Joo; Hak Suk Chung; Adam W Barb; Shauna M Swanson; Robert A Nicholas; Yaoxian Li; Eric J Toone; Christian R H Raetz; Pei Zhou
Journal: Chem Biol Date: 2010-12-16

9. Molecular shape and medicinal chemistry: a perspective.

Authors: Anthony Nicholls; Georgia B McGaughey; Robert P Sheridan; Andrew C Good; Gregory Warren; Magali Mathieu; Steven W Muchmore; Scott P Brown; J Andrew Grant; James A Haigh; Neysa Nevins; Ajay N Jain; Brian Kelley
Journal: J Med Chem Date: 2010-05-27 Impact factor: 7.446

10. CSAR benchmark exercise of 2010: combined evaluation across all submitted scoring functions.

Authors: Richard D Smith; James B Dunbar; Peter Man-Un Ung; Emilio X Esposito; Chao-Yie Yang; Shaomeng Wang; Heather A Carlson
Journal: J Chem Inf Model Date: 2011-08-29 Impact factor: 4.956

51 in total

Review 1. Focusing on shared subpockets - new developments in fragment-based drug discovery.

Authors: Eman M M Abdelraheem; Carlos J Camacho; Alexander Dömling
Journal: Expert Opin Drug Discov Date: 2015-08-21 Impact factor: 6.098

2. Incorporating specificity into optimization: evaluation of SPA using CSAR 2014 and CASF 2013 benchmarks.

Authors: Zhiqiang Yan; Jin Wang
Journal: J Comput Aided Mol Des Date: 2016-02-15 Impact factor: 3.686

3. Homology models of mouse and rat estrogen receptor-α ligand-binding domain created by in silico mutagenesis of a human template: molecular docking with 17ß-estradiol, diethylstilbestrol, and paraben analogs.

Authors: Thomas L Gonzalez; James M Rae; Justin A Colacino; Rudy J Richardson
Journal: Comput Toxicol Date: 2018-11-28

4. Predicting binding poses and affinities for protein - ligand complexes in the 2015 D3R Grand Challenge using a physical model with a statistical parameter estimation.

Authors: Sergei Grudinin; Maria Kadukova; Andreas Eisenbarth; Simon Marillet; Frédéric Cazals
Journal: J Comput Aided Mol Des Date: 2016-10-07 Impact factor: 3.686

5. Biased Docking for Protein-Ligand Pose Prediction.

Authors: Juan Pablo Arcon; Adrián G Turjanski; Marcelo A Martí; Stefano Forli
Journal: Methods Mol Biol Date: 2021

Review 6. Receptor-ligand molecular docking.

Authors: Isabella A Guedes; Camila S de Magalhães; Laurent E Dardenne
Journal: Biophys Rev Date: 2013-12-21

7. Iterative Knowledge-Based Scoring Functions Derived from Rigid and Flexible Decoy Structures: Evaluation with the 2013 and 2014 CSAR Benchmarks.

Authors: Chengfei Yan; Sam Z Grinter; Benjamin Ryan Merideth; Zhiwei Ma; Xiaoqin Zou
Journal: J Chem Inf Model Date: 2015-10-01 Impact factor: 4.956

8. Absolute Binding Free Energies between T4 Lysozyme and 141 Small Molecules: Calculations Based on Multiple Rigid Receptor Configurations.

Authors: Bing Xie; Trung Hai Nguyen; David D L Minh
Journal: J Chem Theory Comput Date: 2017-05-01 Impact factor: 6.006

9. D3R grand challenge 2015: Evaluation of protein-ligand pose and affinity predictions.

Authors: Symon Gathiaka; Shuai Liu; Michael Chiu; Huanwang Yang; Jeanne A Stuckey; You Na Kang; Jim Delproposto; Ginger Kubish; James B Dunbar; Heather A Carlson; Stephen K Burley; W Patrick Walters; Rommie E Amaro; Victoria A Feher; Michael K Gilson
Journal: J Comput Aided Mol Des Date: 2016-09-30 Impact factor: 3.686

10. DOCK 6: Impact of new features and current docking performance.

Authors: William J Allen; Trent E Balius; Sudipto Mukherjee; Scott R Brozell; Demetri T Moustakas; P Therese Lang; David A Case; Irwin D Kuntz; Robert C Rizzo
Journal: J Comput Chem Date: 2015-06-05 Impact factor: 3.376