Literature DB >> 22716043

Directory of useful decoys, enhanced (DUD-E): better ligands and decoys for better benchmarking.

Michael M Mysinger¹, Michael Carchia, John J Irwin, Brian K Shoichet.

Abstract

A key metric to assess molecular docking remains ligand enrichment against challenging decoys. Whereas the directory of useful decoys (DUD) has been widely used, clear areas for optimization have emerged. Here we describe an improved benchmarking set that includes more diverse targets such as GPCRs and ion channels, totaling 102 proteins with 22886 clustered ligands drawn from ChEMBL, each with 50 property-matched decoys drawn from ZINC. To ensure chemotype diversity, we cluster each target's ligands by their Bemis-Murcko atomic frameworks. We add net charge to the matched physicochemical properties and include only the most dissimilar decoys, by topology, from the ligands. An online automated tool (http://decoys.docking.org) generates these improved matched decoys for user-supplied ligands. We test this data set by docking all 102 targets, using the results to improve the balance between ligand desolvation and electrostatics in DOCK 3.6. The complete DUD-E benchmarking set is freely available at http://dude.docking.org.

Entities: CellLine Chemical Disease Gene Species

Mesh：

Substances：
Ligands
Proteins

Year: 2012 PMID： 22716043 PMCID： PMC3405771 DOI： 10.1021/jm300687e

Source DB: PubMed Journal: J Med Chem ISSN： 0022-2623 Impact factor: 7.446

Introduction

While molecular docking screens routinely leverage protein structure to discover new ligands,[1−4] quantitative assessment of their performance remains problematic.[5] Although prospective assessment of docking performance is irreplaceable,[6,7] it is both time-consuming and expensive. Because a general correlation between docking scores and affinities is beyond current methods,[8,9] the field relies on ligand enrichment in docking hit lists to evaluate retrospective performance.[10−14] “Enrichment” measures how known ligands rank versus a background of decoy molecules and so depends not only on the nature of the ligands but also on the background decoys. Thus to compare docking enrichments, a benchmarking set of ligands and decoys is needed. The original Directory of Useful Decoys (DUD) was designed to meet this benchmarking need while controlling for decoy bias on enrichment.[15,16] Given a random drug-like set of decoys, Verdonk et al. showed that targets which bind high molecular weight ligands naturally get higher enrichments due to correlation between larger molecules and better docking scores.[17] In contrast, actual ligand binding affinities correlate with molecular size only for very small molecules.[18] Unable to separate the true correlations of simple molecular properties that aid prospective ligand discovery from the artifical correlations that arise from biases, it is informative to ask what value molecular docking adds beyond these properties. To this end, DUD decoys are matched to the physical chemistry of ligands on a target-by-target basis: by the properties of molecular weight, calculated logP, number of rotatable bonds, and hydrogen bond donors and acceptors. To fulfill their role as negative controls, decoys should not actually bind, so DUD used 2-D similarity fingerprints to minimize the topological similarity between decoys and ligands. In short, DUD decoys were chosen to resemble ligands physically and so be challenging for docking but at the same time be topologically dissimilar to minimize the likelihood of actual binding. Through intense use,[19−26] weaknesses in the original DUD set have appeared in both the ligands and decoys. Good and Oprea noted that a handful of chemotypes dominate many ligand sets, allowing high ranks for one scaffold to cause good overall enrichment.[27] One way to circumvent this problem is using chemotype retrieval metrics,[28] but another is to remove the “analogue bias” from the database by clustering on ligand scaffolds. After clustering the 40 targets, Good’s subset of DUD contains only 13 targets with over 15 ligands, indicating a need for more targets with more ligands. Another important goal is to increase target diversity, for example, by adding membrane domain proteins, none of which are represented in DUD. As there were weaknesses in the DUD ligands, this was also true of the decoys. Several investigators[29−31] observed that despite property matching on logP, net formal charge is still imbalanced in DUD; 42% of all ligands are charged versus only 15% of decoys. Property matching of decoys to ligands could also be tightened by choosing decoys more embedded in ligand property space.[32,33] Despite a 2-D chemical dissimilarity filter to prevent decoys from being active, some original DUD decoys still appear to bind, and these false decoys artificially reduce docking enrichment.[32] Addressing both false decoys and decoy property embedding, Vogel et al. released DEKOIS for the original 40 DUD targets. Gatica and Cavasotto generated ligand and decoy sets for 147 G protein-coupled receptors (GPCRs) while adding net charge to property matching.[34] Very recently, a python GUI application was announced to generate property-matched decoys.[35] By ignoring synthetic feasibility, Wallach and Lilien generate virtual decoy sets for the original DUD targets with tighter property-matching.[33] Instead of generating computational decoys, the MUV set selects decoys for 17 targets that were negative in public high-throughput screens.[36] Instead of generating decoys at all, REPROVIS-DB assembles ligand and database data from earlier successful virtual screens which are deemed reproducible.[37] Here we describe a new version of DUD that addresses these liabilities and develops new functionality. By drawing on ChEMBL09,[38] each DUD-Enhanced (DUD-E) ligand has a measured affinity supported by a literature reference. Though ligands are now typically clustered by Bemis–Murcko atomic frameworks[39] to reduce chemotype bias, there are still on average 224 ligands per target. The target list is expanded from 40 to 102, favoring targets with many ligands and multiple[40] structures. The additions include several drug relevant membrane proteins: five GPCRs, two ion channels, and two cytochrome P450s. Meanwhile, false decoys are reduced by more stringent filtering of topological dissimilarity. Where possible, measured experimental decoys are included. Finally, we consider how DUD-E performs as a benchmark versus the original DUD and explore its use as a tool for evaluating and optimizing molecular docking.

Results

The ideal target for a benchmarking set would be well studied, with many measured ligand affinities and multiple, diverse cocrystal ligand structures. To this end, the enhanced DUD database (DUD-E) is largely based on the intersection of ChEMBL,[38] for ligand annotations and affinities, and the RCSB PDB,[40] for structures. As we sought targets to enlarge the set, the 40 original DUD targets were first priority, 38 of which we included. Platelet-derived growth factor receptor β was dropped, as it was a homology model. Estrogen receptor α (ESR1) is a single target in DUD-E, whereas it was split into agonists and antagonists previously. To enlarge the benchmarking set, we used three main criteria. First, we favored new target classes with pharmacological precedence. Second, we sought targets with many ligands and crystal structures, as they likely reflect a combination of target relevance and ease of study. Third, we preferred targets that could modestly enrich known ligands using fully automated docking, as these may be both easy to prepare and amenable to docking. Conversely, targets with mostly covalent ligands were deprioritized. DUD-E targets are defined by their UniProt[41] gene prefix, with data from each species being combined into a single data set. While ChEMBL annotates ligands to a particular UniProt accession code, the ligand overlap between orthologous targets is surprisingly small. For example, among 1555 unique ligands with affinities below 1 μM for the human dopamine D3 receptor and 744 ligands for the rat orthologue, only 85 ligands are in both sets. These two orthologues share 97% trans-membrane sequence identity (79% overall), so this low overlap suggests to us that ChEMBL ligand annotations are sparse and do not typically reflect species specificity. Therefore, we pooled the data for all species, defining a DUD-E target as a UniProt gene prefix (such as DRD3), and not the full gene_species pair (such as DRD3_HUMAN or P35462). The 102 targets span diverse protein categories, including 26 kinases, 15 proteases, 11 nuclear receptors, five GPCRs, two ion channels, two cytochrome P450s, 36 other enzymes, and five miscellaneous proteins (Figure 1). Altogether 66695 raw ligands, defined as those with annotated affinities better than 1 μM to their target, molecular weights less than 600 and fewer than 20 rotatable bonds were extracted from ChEMBL09 (or the AmpC β-lactamase literature) (Table 1). That is an average of 654 ligands per target with a minimum of 40 and a maximum of 3090. Though negative binding is rarely reported, we also found 9219 experimental decoys (i.e., no measurable affinity up to 30 μM), with a maximum of 1070 for cyclo-oxygenase-1 (PGH1).

Figure 1

DUD-E target classification. Number of the 102 targets that belong to eight broad protein categories.

Table 1

Characteristics of DUD-E

	total	ChEMBL	manual
no. targets	102	101	1

DUD-E target classification. Number of the 102 targets that belong to eight broad protein categories. With targets selected, we chose a single X-ray structure to represent each target in docking studies (Table 2, Supporting Information Table S1). To find the structure most amenable to docking, we used an automated docking campaign to screen 3690 PDB structures against their clustered ligands and property-matched decoys (see below). Preference was given to higher resolution, to higher automated enrichment, and to the human orthologue. We avoided mutant structures, unresolved active site loops, extraneous bound peptides, or structures too constrained for many of that target’s ligands. Where we had domain knowledge, the most representative structure was preferred, for example a DFG-in structure for kinases or an antagonist structure for estrogen receptor α (ESR1). For 57 out of 102 targets, a DOCK Blaster[42] prepared structure was used for DUD-E, directly from the automated tool chain. Another 45 targets required manual intervention, most due to simple errors in automated preparation (e.g., incomplete metal atom preparation, missing cofactors, or nonstandard amino acids). A select few needed expert intervention to arrive at modest enrichment, such as adding crystallographic waters, changing histidine protonation, flipping ambiguous side-chains such as asparagine, or increasing a local dipole moment on a specific residue (a technique we often use prospectively to improve polar complementarity[43,44]). In five targets, we incorporated prior docking preparations used for prospective ligand discovery: adenosine A2A receptor (AA2AR),[44] β1 adrenergic receptor (ADRB1), AmpC β-lactamase (AMPC), C-X-C chemokine receptor type 4 (CXCR4),[3] and dopamine D3 receptor (DRD3).[45]

Table 2

Overview of Representative Targets

target class	gene ID	description	total ligands	clustered ligands	experimental decoys	matched decoys	PDB	LogAUC (%)	ROC EF₁	AUC (%)
cytochrome P450	CP2C9	cytochrome P450 2C9	145	120	176	7450	1R9O	7	3	60
	CP3A4	cytochrome P450 3A4	302	170	267	11800	3NXU	7	2	63

GPCR	AA2AR	adenosine A2a receptor	3057	482	192	31550	3EML	28	22	83
	ADRB1	β-1 adrenergic receptor	648	247	69	15850	2VT4	19	11	76
	CXCR4	C-X-C chemokine receptor type 4	40	40	14	3406	3ODU	36	18	90

ion channel	GRIA2	glutamate receptor ionotropic AMPA 2	476	158	201	11845	3KGC	23	23	71
	GRIK1	glutamate receptor ionotropic kainate 1	136	101	235	6550	1VSO	35	27	86

kinase	AKT1	serine/threonine-protein kinase AKT	585	293	53	16450	3CQW	27	29	72
	MK10	c-Jun N-terminal kinase 3	199	104	23	6600	2ZDT	24	11	82
	MK14	MAP kinase p38 α	2205	578	73	35850	2QD9	17	10	74

miscellaneous	KIF11	kinesin-like protein 1	272	116	29	6850	3CJO	34	35	77
	XIAP	inhibitor of apoptosis protein 3	100	100	7	5150	3HL5	52	55	88

nuclear receptor	ESR1	estrogen receptor α	1297	383	136	20685	1SJ0	18	15	67
	MCR	mineralocorticoid receptor	201	94	2	5150	2AA2	–4	2	36
	THB	thyroid hormone receptor β-1	246	103	29	7450	1Q4X	36	38	79
	PPARD	peroxisome proliferator-activated receptor δ	699	240	79	12250	2ZNP	32	20	89

other enzymes	FNTA	protein farnesyltransferase type I α	1430	592	132	51500	3E37	16	7	76
	HDAC8	histone deacetylase 8	309	170	73	10450	3F07	29	24	80
	HIVINT	HIV type 1 integrase	167	100	268	6650	3NF7	8	2	64
	KITH	thymidine kinase	57	57	68	2850	2B8T	15	0	80
	PARP1	poly (ADP-ribose)polymerase-1	1031	508	12	30050	3L3M	25	21	79
	PUR2	GAR transformylase	50	50	12	2700	1NJS	51	50	92

protease	DPP4	dipeptidyl peptidase IV	1939	533	167	40950	2I78	41	41	87
	FA10	coagulation factor X	3090	537	176	28325	3KL6	39	36	87
	LKHA4	leukotriene A4 hydrolase	343	171	21	9450	3CHP	18	4	82
	MMP13	matrix metallo-proteinase 13	1632	572	26	37200	830C	12	5	71

To increase scaffold diversity and to make smaller, more manageable ligand sets, we clustered the raw ChEMBL ligands by their Bemis–Murcko atomic frameworks.[39] These atom-type based frameworks include ring systems of the molecule and connecting linkers, minus any side fragments. For example, the seventh largest Murcko cluster in kinesin-like protein 1 (KIF11) has seven ligands, all close analogues (Figure 2A). If at least 100 frameworks were present, then we included only the highest affinity ligand from each framework. If fewer were available, we raised the number of ligands selected from each framework until we obtained more than 100 molecules, trading diversity for quantity. Returning to kinesin-like protein 1, we extracted only 70 Murcko frameworks (Figure 2B). Out of 276 raw ligands, the five largest Murcko clusters contained 146 ligands (53%). Selecting the two or three highest affinity ligands from each framework results in 98 and 118 ligands, respectively, so we stopped at three ligands per framework. In the process we still managed to remove 158 lower affinity compounds from highly redundant clusters. In a few targets, more than 600 ligands remained even after clustering, so we reduced the affinity threshold below 1 μM in the sequence (300, 100, 30, 10, and 3 nM), until fewer than 600 frameworks were found. For example, in adenosine A2A receptor, there are 3096 raw ligands resulting in 1099 frameworks at 1 μM, but we can reduce the number of frameworks to 483 using a 30 nM affinity threshold (Figure 2C).

Figure 2

Ligand clustering. (A) The seventh largest Murcko cluster of kinesin-like protein 1 (KIF11), showing both the scaffold (left) and all seven member ligands. (B) Number of ligands in each of the 70 KIF11 Bemis–Murcko atomic frameworks. We removed lower affinity compounds over-represented clusters (above the line), while retaining 100 ligands. (C) Number of adenosine A2A receptor (AA2AR) Murcko clusters is plotted against affinity threshold. Fewer than 600 clusters are present using a 30 nM affinity threshold. To examine the effect of clustering on docking enrichments, we docked the three targets with the highest and lowest fraction of clustered to raw ligands from those with enough ligands to pick one ligand per Murcko cluster. To measure docking performance we used LogAUC, an aggregate metric that gives early enrichment more weight. As described previously,[31] LogAUC is completely analogous to AUC but in the transformed space after you have zoomed in on early enrichment by taking the semilog of the x-axis. In tryptase β1 (TRYB1), the target with the highest clustered fraction, clustering substantially decreases the LogAUC by 6%, whereas in the other five targets clustering increases the LogAUC (Supporting Information Table S2). The mean absolute deviation over the six targets is 3.7% LogAUC, but in all cases the raw and clustered ROC curves have similar shapes (data not shown). Overall, we believe the clustered sets provide a better measure of docking performance with lower docking effort and will be used in the remainder of this work. A key problem with the original DUD decoys was that they sometimes closely resembled the ligands, occasionally even being confirmed as binders. Enforcing 2-D topological dissimilarity between decoys and ligands should eliminate this problem in principle, but in practice critical ligand binding “warheads” often remain in the decoy set selected from ZINC,[46] e.g., amidine groups in factor Xa (FA10). By identifying these warheads in three targets (Figure 3A), we investigated how to eliminate false decoys. In the original DUD, CACTVS fingerprints were used to select decoys with Tanimoto coefficients (Tc) to ligands below 0.9, which is roughly similar to using Daylight fingerprints with Tc below 0.7.[15] In recent work,[31] we used Daylight fingerprints with a more restrictive Tc < 0.5. Using this filter on the enhanced DUD ligand sets, we still saw 39%, 53%, and 96% of possible warhead bearing molecules passing through in factor Xa (FA10), glycinamide ribonucleotide transformylase (PUR2), and thymidine kinase (KITH), respectively (Figure 3B). Using Daylight with Tc < 0.325, we reduced FA10 warheads below 1% but still saw 14% and 34% in PUR2 and KITH. Clearly different targets and even different ligands require different absolute thresholds. To circumvent this, we removed a percentage of the most similar decoys for each ligand, sorted by maximum Tc to any ligand. This allowed the effective absolute threshold to vary. Removing 50% of the decoys with Daylight was better in KITH, while removing 50% with ECFP4 was better in FA10 and PUR2. The final procedure of using ECFP4 fingerprints and removing 75% of the decoys, resulted in 0.2%, 0%, and 5.8% of warheads remaining, substantially reducing the number of false decoys. Having refined the decoy dissimilarity procedure on three targets where we could define a warhead, we then applied it to all generated decoys. To help ensure that the resulting decoys were, in fact, substantially different, topologically, from the ligands, we compared the two by a metric partially orthogonal to topology, asking how many decoy molecules shared the same scaffold as a ligand. Of the 805136 decoy scaffolds over all of DUD-E, only 692 (0.086%) were found among the 25503 ligand scaffolds, consistent with substantial topological differences among the two sets despite their close physical property matching.

Figure 3

Decoy generation. (A) Three key “warhead” groups from factor Xa (FA10), glycinamide ribonucleotide transformylase (PUR2), and thymidine kinase (KITH). (B) Fraction of warheads remaining is plotted against the dissimilarity method. The dissimilarity methods consist of a fingerprint (Daylight or ECFP4) and either a hard cutoff or a fraction of the most dissimilar decoys to be retained. (C) Property distributions of estrogen receptor α (ESR1) for both the 383 ligands (blue) and the 20685 property-matched decoys (red). In addition to reducing false decoys, the DUD-E decoy generation procedure was extensively revised. Each decoy derived from a particular ligand, where decoy property ranges around the ligands properties adjusted to seven possible widths. This adapted to local chemical space around each ligand, allowing more closely matched decoys. Also, net charge was added to the property matching, as it is critical in electrostatics and desolvation. The improved property-matching can be seen in the property histograms for estrogen receptor α (ESR1) (Figure 3C) as well as the averages and standard deviations for all the targets (Supporting Information Table S3). Using ZINC[46] for the potential decoy pool made them purchasable, enabling experimental testing for actual binding to the target. As a result of this work, this enhanced decoy procedure has been fully automated and is available online to generate DUD-E style decoys for any user supplied list of input ligands at http://decoys.docking.org. The original DUD paper[15] showed that a property-matched decoy set is more challenging for docking than a random collection of molecules. Therefore, we compared enrichments using property-matched decoys to those using a random drug-like background, which consisted of all ChEMBL12 ligands with affinities better than 10 μM. Switching from a drug-like background to DUD-E property-matched decoys does reduce average enrichment over the 102 targets, from 26.8% to 24.4% LogAUC (Supporting Information Table S4). Yet for three targets, the property-matched sets unexpectedly led to much better enrichment, by more than 15% LogAUC. In both glutamate receptor ionotropic kainate 1 (GRIK1) and purine nucleoside phosphorylase (PNPH), the ligands have low molecular weights (Supporting Information Table S3) and thus scored poorly against the generally larger ChEMBL12 molecules, just as Verdonk[17] suggests. In urokinase-type plasminogen activator (UROK), the top of the drug-like docking hit list is dominated by decoys with amidine “warheads”. Because these are likely binders, the increased property-matched enrichment resulted from fewer false decoys in that set. Indeed, the 2.4% LogAUC reduction that occurs upon switching to property-matched decoys arises from these two competing factors: property matching the decoys reduces enrichment, and reduction of false decoys increases enrichment. Overall, enrichment as measured by average LogAUC is 1.5 fold higher in DUD-E compared to the original DUD. To understand this, we first isolated the change due to the revised decoy generation procedure. Using the original DUD ligands and target preparations, but switching from original decoys to these revised decoys substantially increased the average enrichment over the 37 directly comparable targets from 14.8% to 19.7% LogAUC (Table 3, Supporting Information Table S5). With the new adaptive property-matching procedure incorporating net charge, the revised decoys might have been expected to lower enrichment, but instead we saw an overall increase. Inspecting the docking hit lists, we observed a dramatic decrease in high scoring decoys that resemble ligands to a degree that they might actually bind. Indeed, all three targets with identifiable warheads that we used to tune the dissimilarity procedure showed large increases in enrichment: FA10 increases from 13% to 28% LogAUC, PUR2 from 40% to 62% LogAUC, and KITH from 1% to 32% LogAUC. If we now isolate the switch from original ligands and revised decoys to both DUD-E ligands and decoys, we see a moderate decrease in average enrichment from 19.7% to 16.4% LogAUC. We attribute this decrease to the larger, more diverse clustered ligand lists in DUD-E. Lastly, switching the target preparation, and the choice of the particular PDB structure used to represent a target, substantially increases enrichment from 16.4% to 22.8% LogAUC between DUD and DUD-E (Supporting Information Table S5). The overall effect with SEV ligand desolvation in DOCK 3.6 is to increase average enrichment from 14.8% LogAUC against DUD to 22.8% LogAUC against the DUD-E benchmark.

Table 3

Decomposition of Enrichment Changes between DUD and DUD-E

incremental change	all original	new style decoys	switch to new ligands	switch target preparation
decoys	DUD	DUD-E	DUD-E	DUD-E
ligands	DUD	DUD	DUD-E	DUD-E
receptor preparation	DUD	DUD	DUD	DUD-E
average LogAUCa	14.8	19.7	16.4	22.8

Over the 37 common targets (target-by-target data in Supporting Information Table S5).

Over the 37 common targets (target-by-target data in Supporting Information Table S5). A central motivation for any benchmarking set is to test, at least retrospectively, new methods. We wanted to explore how our recent context-dependent ligand desolvation method[31] behaved against the DUD-E benchmark. We therefore used it to re-examine the utility of solvent-excluded volume (SEV) ligand desolvation versus using no desolvation term (None) or using the full transfer free energy from water to hexadecane (Full). In our initial study of these terms on the 40 original DUD targets, SEV improved upon None by just 0.7% average LogAUC. Conversely, over the 102 DUD-E targets, SEV substantially outperformed None by 3.8% LogAUC on average, with average LogAUC values of 20.6, 14.3, and 24.4% for None, Full, and SEV desolvation methods, respectively (Figure 4, Supporting Information Table S4). Despite these average trends, ROC curves on individual targets can vary significantly among the various methods (Figure 5). As in the original desolvation analysis, some targets are more amenable to full desolvation, such as catechol O-methyltransferase (COMT) and purine nucleoside phosphorylase (PNPH), while others are more amenable to no desolvation, such as factor X (FA10) and glycinamide ribonucleotide transformylase (PUR2). Against the DUD-E benchmark, SEV desolvation not only outperforms the other methods, but performs well in both types of targets. This suggests that over a more comprehensive set of targets, and what we argue is a better set of ligands and decoys, the advantage of the more physically correct SEV ligand desolvation treatment becomes more pronounced.

Figure 4

Figure 5

Representative ROC plots. ROC plots using no desolvation (None), solvent-excluded volume ligand desolvation (SEV), the thin low-dielectric layer (Thin), or a drug-like background that consists of all ChEMBL12 ligands with affinities better than 10 μM (Drug-like). The black dotted line represents the results expected from docking ligands randomly. LogAUC percentages are reported in the legend text.

Retrospective enrichment comparing ligand desolvation and electrostatics methods. Docking results over DUD-E as measured by LogAUC. “None” has no ligand desolvation term, “SEV” uses solvent-excluded volume ligand desolvation, “Thin” employs a thin low-dielectric layer in the electrostatic calculations. Representative ROC plots. ROC plots using no desolvation (None), solvent-excluded volume ligand desolvation (SEV), the thin low-dielectric layer (Thin), or a drug-like background that consists of all ChEMBL12 ligands with affinities better than 10 μM (Drug-like). The black dotted line represents the results expected from docking ligands randomly. LogAUC percentages are reported in the legend text. Electrostatic interaction with the protein is a large term that opposes ligand desolvation, with their relative balance being critical for binding. Because we do not know the binding pose of putative ligands prior to docking, we need to approximate the region of low dielectric the ligand might occupy to precompute electrostatic grids. Previously, we used the negative image of the receptor (computed by SPHGEN) to construct this low dielectric region, but manual tweaking was often required. In the large open binding pocket of CXCR4, we observed that using a thin layer of low-dielectric around just the edge of the protein allowed ligands to interact with it while reducing the bulk dielectric perturbation at the center of its large binding pocket.[3] Here we explored using an automated thin dielectric layer strategy across the entire DUD-E set. Visually, these new automated thinner dielectric layers are more physically realistic, even in the rare case when they are effectively thicker than the previous layers (due to a water probe being able to penetrate that layer). With these thin low-dielectric layers (Thin), the average LogAUC over the 102 targets improved from 24.4% to 24.9% (Figure 4, Supporting Information Table S4). Six targets used manually prepared dielectric layers (AA2A2, ADRB1, AMPC, CDK2, CXCR4, and DRD3) and thus do not directly reflect the difference between automated dielectric layers. Excluding those six enlarges the average difference from 0.5% to 1.0% LogAUC. Admittedly, these are moderate differences, but they exemplify how DUD-E may be used to test new docking methods and hint that as we progress docking models, enrichment will improve. Here we present three representative targets in greater detail to display a magnified view of DUD-E.

Mineralocorticoid Receptor (MCR)

MCR has the lowest enrichment in DUD-E. Across all 11 automatically docked structures, enrichment of DUD-E ligands to its decoys was negligible. Thus we selected the same PDB structure as the original DUD, 2AA2 at 1.95 Å resolution. While enrichment using the new DUD-E sets was worse than random at −4% LogAUC and 36% AUC (Table 2), using the original DUD ligands and decoys gave 45% LogAUC and 76% AUC. Despite poor enrichment in DUD-E, building and docking the crystal ligand from scratch, ignoring crystallographic information, resulted in good pose agreement (Figure 6A). Taken together, we can rationalize the enrichment differences, as 13 of 15 original ligands shared a polycyclic scaffold with the well-docked crystal ligand, while the 94 new ligands had much more scaffold diversity. Thus the reduced enrichment in DUD-E reflects increased chemotype diversity as a result of including more ligands and clustering them by Bemis–Murcko atomic frameworks. Of the four lowest enriching targets in DUD-E, three are nuclear hormone receptors, with glucocorticoid receptor (GCR) and androgen receptor (ANDR) joining MCR. These receptors all have hydrophobic pockets with flexible binding site residues such as methionine and leucine so that a single rigid receptor may be incapable of docking all of their ligands. Thus these targets may be good tests of flexible receptor docking methods.

Figure 6

Representative docking poses. The crystallographic ligand was rebuilt and docked from scratch. (A–F) The crystal pose (magenta) is compared to the resulting docked pose (green). In (C), more ligand conformations are generated and the redocked pose is also shown (tan). Key hydrogen bonds are shown by black dotted lines, and the partially transparent protein surface is colored by atom type.

Thyroid Hormone Receptor β1 (THB)

THB produced good enrichment when a structure with an open subpocket was selected. Enrichment for the 16 automatically docked structures varies significantly, ranging from 13% (1NQ0) to 37% LogAUC (1Q4X). The lower enriching structures have larger cavities near Arg320 (right side of Figure 6B), opening to solvent in 1NQ0; the higher enriching structures have larger cavities at the other end of the binding site near Met420 (left side), opening to solvent in 1Q4X. We selected the automated preparation of 1Q4X despite its modest 2.80 Å resolution because Thr273 is pushed away by the crystal ligand, making the left subpocket larger. Using SEV desolvation then yields enrichment statistics of 36% LogAUC, 79% AUC, and a receiver operating characteristic curve based enrichment factor at 1% (EF1) of 38 (Table 2). The redocked crystal ligand has excellent pose agreement (Figure 6B).

Serine/Threonine-Protein Kinase AKT (AKT1)

AKT1 is a newly added kinase that demonstrates several considerations during PDB structure selection. Whereas 10 PDB structures were automatically docked, four got worse than random enrichment. All four correspond to structures of the Pleckstrin homology (PH) domain instead of the kinase domain. The structure with the best normal AUC, 3O96, corresponds to an allosteric site at the interface of the PH and kinase domains, not the traditional ATP binding pocket. While the best enriching structure by LogAUC, 3CQW at 2.00 Å, corresponds to the canonical site, its nonstandard phosphothreonine amino acid evades the automated protocol. Preparing that residue manually results in 27% LogAUC, 72% AUC, and 29 EF1 (Table 2). Nevertheless, the redocked ligand (green) fails to generate the crystal ligand pose (magenta) (Figure 6C). The ligand, however, is quite small, with one central rotatable bond, and requires a specific rotation about that bond to fit in the binding site. Lowering the rmsd threshold for ligand conformation generation allows that rotation to be sampled, restoring the correct ligand binding pose (tan) (Figure 6C).

Discussion

Docking continues to be judged by hit rates in prospective studies, and by enrichment in retrospective recall studies, because it cannot now hope to calculate affinities or even monotonic rank order. Like protein structure prediction, docking thus remains an empirical, although we would argue also a pragmatic field. Its reliance on enrichment has driven the development of benchmarking sets, first explored by Rognan[11] and Jain,[12] recently investigated by Boeckler[32] and Cavasotto;[34] the most widely used and cited of these remains the Directory of Useful Decoys (DUD).[15] Despite its widespread adoption, DUD retains serious liabilities, including a lack of ligand diversity, lack of property-matching to net charge, and a substantial number of false decoys. The enhanced DUD (DUD-E) described here was developed to address these shortcomings and to expand the target list to be more reprentative of pharmacologically relevant space.

Balancing Ligands and Decoys for Enrichment

An important problem with DUD arose from the ligands and decoys originally chosen. The former sometimes over-represented in a few chemotypes, and the latter were sometimes not decoys but actually ligands. Moreover, the mapping of specific ligands to their matched decoys had been lost in the released set. In the 102 ligand and decoy sets that comprise DUD-E, ligand diversity in any given set is substantially increased, reducing the bias that can come from a single chemotype ranking well. With at least 40 ligands for every target and a preference to maximize chemotype diversity, DUD-E allows for more representative tests of docking screens. Correspondingly, property-matching decoys to each ligand individually, while more stringently removing false decoys (i.e., ligands), allows investigators to directly match specific ligands to their decoy molecules and reduces what had been artifactually low enrichment for some targets in DUD. Adding net charge as a property to match between ligands and decoys resolves a discrepancy between them in DUD, where the ligands had tended to be more charged, on average, than the decoys, which had the effect of skewing our evaluation of physical forces like desolvation. The impact of these changes on docking performance is substantial and clarifying. In isolation from other effects, clustering the ligands for diversity reduces enrichment, as one might expect because high-performing, over-represented sets have been largely removed. Conversely, the new decoys increase enrichment compared to the DUD performance. At first this seemed counterintuitive, because one imagines that a better-balanced, more stringent decoy set will be a greater challenge for a docking program. However, this is more than balanced by the removal of what had been false decoys (ligands), which artifactually reduced enrichment in DUD because, as ligands, they had often ranked well but counted as decoys they diluted the annotated ligands. Finally, the new target preparations, carefully selected from a docking campaign to over 3500 structures, also increased enrichment. Overall, the increased enrichment in DUD-E should provide more sensitivity for benchmarking docking algorithms, giving it greater responsiveness to modifications that reduce enrichment as well as those that increase it.

Online Tools for Automated Generation of Further Ligand and Decoy Sets

DUD-E is built to be a better platform for refinement and extension of ligand and decoy sets. Targets are independent of one another, both in ligands and decoys, allowing target addition, deletion, or replacement. The protocol to generate decoys for DUD-E is made available online to generate decoys for any target given only a list of ligand structures, which enables extension of DUD-E to new targets of interest by individual investigators. The decoy server pulls directly from a purchasable subset of the ZINC database, inheriting its improvements and purchasing updates.[46] The final decoy selection from the applicable pool of decoys is random where possible, allowing the generation of multiple decoys sets to test overfitting to the canonical DUD-E decoys. Each decoy belongs to one and only one ligand, so if one wants to filter a ligand, then the corresponding decoys can be easily removed. For example, we provide raw ligand and decoy sets before clustering by Bemis–Murcko atomic frameworks. If a different clustering method was desired, which selected a different subset of the raw ligands, then the corresponding decoys could be retained (furthermore, we provide the python script used to generate clustered subsets from raw sets). We also include extra data that allows some design decisions to be altered, for instance, we include the marginal ligands which are active above our 1 μM cutoff.

Applications to Docking Optimization and Testing

DUD-E should provide a more robust benchmarking set for exploring new docking methods, so we were keen to test it against new methods that we had been investigating. When tested against the older DUD set, we had found that a new solvent-excluded volume (SEV) ligand desolvation method had had a disappointingly small effect on enrichment despite what was clearly a better physical model. However, when measured against the DUD-E benchmark, the differential performance between the old and new method increased substantially in the latter’s favor. Similarly, against the DUD-E benchmark, a more physically realistic dielectric layer, used to calculated the electrostatic interaction term from static Poission–Boltzmann maps, also led to improved enrichments that had been largely masked in the DUD set, owing to the problems described above. Certain cautions merit airing. Most importantly, DUD-E is a large data set synthesized from several source databases, each of which is continuously evolving and improving. Thus individual errors are expected, though usually traceable to the source database at the time DUD-E was constructed. Although we only show docking results using DOCK 3.6 with solvent-excluded volume ligand desolvation, DUD-E was designed to be a general benchmarking set. Thus some arbitrary choices and simplifying assumptions were made in the effort to provide one canonical data set useful to compare docking algorithms. For instance, we assume a single PDB code can represent the target, but some targets are highly flexible or they contain both orthosteric and allosteric binding pockets. Fundamentally, DUD and DUD-E are designed to measure value-added screening performance of 3-D methods over simple 1-D molecular properties. Decoys that might bind are removed using 2-D ligand similarity, so DUD-E is inappropriate to test 2-D methods. Through its construction, ligands light up against DUD-E decoys using these 2-D similarity methods, which create an artificially favorable enrichment bias for them. A final caution is that to filter more false decoys in DUD-E, we keep only a quarter of the most highly dissimilar decoys. However, while we show that this increased dissimilarity removes false decoys, it could also contribute to artificial increases in docking enrichment. Notwithstanding these caveats, DUD-E is substantially improved over the original DUD. It is a larger, more diverse data set with better matched decoys that resemble ligands less, correcting many flaws in its predecessor. Although we anticipate that it will be most widely used in the instantiation we describe here, it was developed with the idea that it could be flexibly extended and evolved; the tools to do so are even provided online (http://dude.docking.org). We hope that it and its descendants will provide a useful tool for docking evaluation in the community until such time as a more fundamental measurement of docking performance is possible.

Methods

ChEMBL and RCSB PDB Data Extraction

This enhanced DUD database has been constructed by combining ligand data from ChEMBL[38] and structural data from RCSB PDB[40] (Supporting Information Figure S1A). Ligands assigned to protein targets (ChEMBL confidence score ≥4) with affinities (IC50, EC50, Ki, Kd, and log variants thereof) of 1 μM or better were extracted from the ChEMBL09 database.[38] Similarly, we assigned experimental decoys as molecules with no measurable affinity at 30 μM or higher (greater than relation only). The remaining ligands with affinities above 1 μM, and decoys with no measurable affinity below 30 μM, are included for completeness and dubbed “marginal”. Via ChEMBL, ligands are associated with a particular target sequence by UniProt[41] accession code, and then mapped[47] from UniProt accession codes to protein data bank (PDB) structures (X-ray only) using http://www.uniprot.org/docs/pdbtosp.txt, obtained on February 23, 2011.

Target Selection Docking

Preliminary docking calculations were performed on each PDB structure that mapped to ChEMBL ligands and contained a single, unambiguous cocrystal ligand as prepared by DOCK Blaster.[42] Property-matched computational decoys were generated by the automated decoy generation procedure below, using Daylight fingerprints with a Tanimoto coefficient (Tc) threshold below 0.5. These decoys were docked and compared to their cognate ligands using DOCK 3.6 with solvent-excluded volume (SEV) ligand desolvation.[31] Balancing the parallel goals of diversity, drug relevance, many ligands and structures, and at least modest automated docking enrichment, we selected 119 tentative targets for the new DUD. This list was reduced to the final 102 targets by factors such as ligand and PDB duplication between targets (e.g., FNTB duplicates FNTA), low resolution structures (RAF1), sterically constrained binding sites (NR1H2, THA), or over-representation (MK08, MTOR).

Target Preparation

For each target, we assembled all UniProt accession codes (species) with any raw ChEMBL compounds (ligands, decoys, marginal ligands, or marginal decoys). For only those accession codes, structures were extracted using the ChEMBL to PDB mapping, except P07700 was manually added to ADRB1 to include six more rare structures for that GPCR. This procedure neglects those PDB structures that belong to an accession code having no ChEMBL compounds. For example, 1KIM is the PDB structure of thymidine kinase (KITH) in the original DUD. This KITH structure is from herpes virus (UniProt P03176), an accession code with no raw compounds extracted from ChEMBL, and is thus not included in the ChEMBL/PDB intersection used to construct the new DUD. Still, 5025 PDB codes were sent to an updated DOCK Blaster pipeline for automated docking preparation (Supporting Information Figure S1D). In some cases, an unambiguous ligand could not be found to indicate the binding site, but we were able to assign 565 additional ligands by manually inspecting over 1300 structures. Ultimately, 3692 structures completed input grid preparation, and all but two finished docking and enrichment analysis. Clustered ligands sets were docked to property-matched decoys (both described below) using ECFP4 fingerprints and removing the most similar 75% of queried decoys. DOCK 3.6 was run using SEV ligand desolvation (as below). For each target, enrichment, resolution, and organism were collected and sorted by enrichment in pdb_analyze.txt, available online at http://dude.docking.org. Crude notes on the selection process are recorded in pdb_selection.txt, and the picked structure is listed in pdb_blessed.txt. AA2AR and DRD3 docking preparations were provided by Jens Carlson,[44,45] CXCR4 partially by Dahlia Weiss,[3] ADRB1 by Peter Kolb (personal communication), and AMPC by Sarah Barelier, Oliv Eidam, and Inbar Fish (unpublished results).

Ligand Preparation

To prepare ligand sets for each target, ChEMBL affinities and log variants were first normalized to nM units (Supporting Information Figure S1B). Salts were removed, charges were normalized, and properties were calculated using Molinspiration’s mib package (www.molinspiration.com). Ligands with 600 Da or higher molecular weight or with 20 or more rotatable bonds were removed. Smiles were put in canonical form using OpenEye’s OEChem software.[48] Ligand sets from each species were combined, sorted by ascending normalized affinity, and then made unique based on canonical smiles. The same procedure was used to collate the experimental decoys, marginal ligands, and marginal decoys. For AmpC β-lactamase (AMPC), an original DUD target, the ChEMBL09 ligands are covalent in nature. To identify noncovalent ligands, we manually compiled ligands[6,43,49,50] with affinities below 5 mM and experimental decoys[43,51] from the literature.

Ligand Clustering

To reduce the sometimes large number of ChEMBL ligands down to a manageable size while also increasing scaffold diversity as suggested by Good and Oprea,[27] we clustered the ligands by their Bemis–Murcko atomic frameworks,[39] as generated by Molinspiration’s mib. If there were 100 or more frameworks, we chose only the highest affinity ligand from each. If there were fewer than 100 Murcko frameworks, we increased the number of highest affinity ligands taken from each until we achieved at least 100 ligands (or until all ligands were included). Conversely, if there were more than 600 Murcko frameworks, then we decreased the ligand affinity threshold in the sequence [1 μM, 300 nM, 100 nM, 30 nM, 10 nM, 3 nM] until fewer than 600 frameworks were present, where we then took the highest affinity ligand from each framework. While clustered ligand sets are the default, the full unclustered ligand sets and corresponding decoys are available. The script (subset_decoys.py) used to select the clustered subset given the ligand ids is provided with the full ligand set to enable other clustering algorithms or filtering methods to be substituted.

Automated Decoy Generation

As in the original DUD, we property-matched decoys to ligands using molecular weight, estimated water–octanol partition coefficient (miLogP), rotatable bonds, hydrogen bond acceptors, and hydrogen bond donors, plus we added net charge. We generated all ligand protonation states in pH range 6–8 using Schrödinger’s Epik with arguments “-ph 7.0 -pht 1.0 -tp 0.20” (Supporting Information Figure S1C). Molecular properties were then computed using Molinspiration’s mib. Over all the protonated forms of a given ligand, we kept only those with a unique set of the six physicochemical properties. For each of these unique property sets, we aimed to generate 50 matched decoys. For example, a single input ligand predicted to have two alternate charges would get 50 decoys property-matched to each charge. To accomplish this, a pool of decoys was selected from ZINC[46] using a dynamic protocol that adapted to local chemical space by narrowing or widening windows in seven steps around the six properties. The goal was to return 3000–9000 potential decoys that matched the decoy’s reference protonation state (predicted most prevalent form at pH 7.05). In the final decoy procedure, ECFP4 fingerprints were generated by Scitegic’s Pipeline Pilot for ligands and potential decoys. The decoys were sorted by their maximum Tc to any ligand, and the most dissimilar 25% were retained through this dissimilarity filter. We then remove duplicate decoys from the ligand set by sorting decoys from least to most duplicated and assigned each decoy to the protonated ligand which has the least number of decoys already assigned. This ensures unique decoys were spread across the ligands as evenly as possible. Finally, if available, 50 decoys were picked randomly from this deduplicated list.

Original DUD Comparison

For the original DUD comparison, we downloaded ligands and decoys from dud.docking.org and prepared docking flexibases with our modern ZINC toolchain.[46] The original DUD target preparations were copies of the original, modified to perform SEV desolvation calculations as described previously.[31] We also generated DUD-E style automated decoys and flexibases for the original DUD ligands. The analysis was performed on the 37 directly comparable targets, excluding the original targets PDGFrb, ERagonist, and ERantagonist.

Docking Calculations

Except as noted, docking calculations were performed with DOCK 3.6 and solvent-excluded volume (SEV) ligand desolvation as described previously.[31] Ligand conformations were generated by OpenEye’s Omega.[52] For sampling, the minimum number of graph matching nodes was changed to 3, and ligand overlap was changed to 0.1. Ligands were limited to between 5 and 100 heavy atoms. The timeout for an individual ligand hierarchy was 180 s. We performed 200 steps of simplex minimization, with initial translations of 0.2 Å and initial rotations of 5°. The thin dielectric layer Delphi spheres were created by walking out each DMS (http://www.cgl.ucsf.edu/Overview/ftp/dms.zip) surface normal by 1.8 Å and placing a sphere. This thin sphere layer is then used as input to makespheres1.pl in place of the usual SPHGEN spheres. The random background calculations were performed using SEV desolvation by seeding the DUD-E ligands into the entire ChEMBL12_10 subset of ZINC, which includes 273375 ligands with annotated affinities below 10 μM.

Docking Metrics

The area under the curve (AUC) of the receiver operating characteristic (ROC) is one common metric to measure docking performance. However, ROC plots often use a semilog transformation of the x-axis to zoom in on early changes. As described previously,[31] LogAUC is completely analogous to AUC in this transformed space, measuring the percentage of the unit area under the curve. Formally, we use the adjusted LogAUC0.001 here, which spans three decades of log space and subtracts the LogAUC of the random curve (14.462%) so that random enrichment is 0%. We typically refer to the adjusted LogAUC0.001 as either adjusted LogAUC or simply LogAUC. The ROC-based enrichment factor at 1% (EF1) is the percent of ligands found when 1% of the decoys have been found and is preferred over traditional enrichment factors.[53]

53 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Detailed analysis of scoring functions for virtual screening.

Authors: M Stahl; M Rarey
Journal: J Med Chem Date: 2001-03-29 Impact factor: 7.446

3. Successful virtual screening for novel inhibitors of human carbonic anhydrase: strategy and experimental confirmation.

Authors: Sven Grüneberg; Milton T Stubbs; Gerhard Klebe
Journal: J Med Chem Date: 2002-08-15 Impact factor: 7.446

4. UniProt: the Universal Protein knowledgebase.

Authors: Rolf Apweiler; Amos Bairoch; Cathy H Wu; Winona C Barker; Brigitte Boeckmann; Serenella Ferro; Elisabeth Gasteiger; Hongzhan Huang; Rodrigo Lopez; Michele Magrane; Maria J Martin; Darren A Natale; Claire O'Donovan; Nicole Redaschi; Lai-Su L Yeh
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

Review 5. Approaches to the description and prediction of the binding affinity of small-molecule ligands to macromolecular receptors.

Authors: Holger Gohlke; Gerhard Klebe
Journal: Angew Chem Int Ed Engl Date: 2002-08-02 Impact factor: 15.336

6. Virtual screening using protein-ligand docking: avoiding artificial enrichment.

Authors: Marcel L Verdonk; Valerio Berdini; Michael J Hartshorn; Wijnand T M Mooij; Christopher W Murray; Richard D Taylor; Paul Watson
Journal: J Chem Inf Comput Sci Date: 2004 May-Jun

7. Assessing scoring functions for protein-ligand interactions.

Authors: Philippe Ferrara; Holger Gohlke; Daniel J Price; Gerhard Klebe; Charles L Brooks
Journal: J Med Chem Date: 2004-06-03 Impact factor: 7.446

8. Comparative evaluation of eight docking tools for docking and virtual screening accuracy.

Authors: Esther Kellenberger; Jordi Rodrigo; Pascal Muller; Didier Rognan
Journal: Proteins Date: 2004-11-01

9. Rapid context-dependent ligand desolvation in molecular docking.

Authors: Michael M Mysinger; Brian K Shoichet
Journal: J Chem Inf Model Date: 2010-09-27 Impact factor: 4.956

10. Structure-based discovery of a novel, noncovalent inhibitor of AmpC beta-lactamase.

Authors: Rachel A Powers; Federica Morandi; Brian K Shoichet
Journal: Structure Date: 2002-07 Impact factor: 5.006

464 in total

1. Improving inverse docking target identification with Z-score selection.

Authors: Stephanie S Kim; Melanie L Aprahamian; Steffen Lindert
Journal: Chem Biol Drug Des Date: 2019-01-02 Impact factor: 2.817

2. The Development of Target-Specific Pose Filter Ensembles To Boost Ligand Enrichment for Structure-Based Virtual Screening.

Authors: Jie Xia; Jui-Hua Hsieh; Huabin Hu; Song Wu; Xiang Simon Wang
Journal: J Chem Inf Model Date: 2017-06-01 Impact factor: 4.956

3. Catching a Moving Target: Comparative Modeling of Flaviviral NS2B-NS3 Reveals Small Molecule Zika Protease Inhibitors.

Authors: Szymon Pach; Tim M Sarter; Rafe Yousef; David Schaller; Silke Bergemann; Christoph Arkona; Jörg Rademann; Christoph Nitsche; Gerhard Wolber
Journal: ACS Med Chem Lett Date: 2020-03-03 Impact factor: 4.345

4. Enhancing Virtual Screening Performance of Protein Kinases with Molecular Dynamics Simulations.

Authors: Tavina L Offutt; Robert V Swift; Rommie E Amaro
Journal: J Chem Inf Model Date: 2016-10-03 Impact factor: 4.956

5. FINDSITE^comb2.0: A New Approach for Virtual Ligand Screening of Proteins and Virtual Target Screening of Biomolecules.

Authors: Hongyi Zhou; Hongnan Cao; Jeffrey Skolnick
Journal: J Chem Inf Model Date: 2018-10-16 Impact factor: 4.956

6. Integrated In Silico Fragment-Based Drug Design: Case Study with Allosteric Modulators on Metabotropic Glutamate Receptor 5.

Authors: Yuemin Bian; Zhiwei Feng; Peng Yang; Xiang-Qun Xie
Journal: AAPS J Date: 2017-05-30 Impact factor: 4.009