John J Irwin1, Garrett Gaskins1,2,3,4, Teague Sterling1, Michael M Mysinger1, Michael J Keiser1,2,3,4. 1. Department of Pharmaceutical Chemistry, University of California, San Francisco , Byers Hall, 1700 4th Street, San Francisco, California 94158-2330, United States. 2. Institute for Neurodegenerative Diseases, University of California, San Francisco , 675 Nelson Rising Lane, San Francisco, California 94158, United States. 3. Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco , Byers Hall, 1700 4th Street, San Francisco, California 94158, United States. 4. Institute for Computational Health Sciences, University of California, San Francisco , 550 16th Street, San Francisco, California 94158, United States.
Abstract
Whereas 400 million distinct compounds are now purchasable within the span of a few weeks, the biological activities of most are unknown. To facilitate access to new chemistry for biology, we have combined the Similarity Ensemble Approach (SEA) with the maximum Tanimoto similarity to the nearest bioactive to predict activity for every commercially available molecule in ZINC. This method, which we label SEA+TC, outperforms both SEA and a naïve-Bayesian classifier via predictive performance on a 5-fold cross-validation of ChEMBL's bioactivity data set (version 21). Using this method, predictions for over 40% of compounds (>160 million) have either high significance (pSEA ≥ 40), high similarity (ECFP4MaxTc ≥ 0.4), or both, for one or more of 1382 targets well described by ligands in the literature. Using a further 1347 less-well-described targets, we predict activities for an additional 11 million compounds. To gauge whether these predictions are sensible, we investigate 75 predictions for 50 drugs lacking a binding affinity annotation in ChEMBL. The 535 million predictions for over 171 million compounds at 2629 targets are linked to purchasing information and evidence to support each prediction and are freely available via https://zinc15.docking.org and https://files.docking.org .
Whereas 400 million distinct compounds are now purchasable within the span of a few weeks, the biological activities of most are unknown. To facilitate access to new chemistry for biology, we have combined the Similarity Ensemble Approach (SEA) with the maximum Tanimoto similarity to the nearest bioactive to predict activity for every commercially available molecule in ZINC. This method, which we label SEA+TC, outperforms both SEA and a naïve-Bayesian classifier via predictive performance on a 5-fold cross-validation of ChEMBL's bioactivity data set (version 21). Using this method, predictions for over 40% of compounds (>160 million) have either high significance (pSEA ≥ 40), high similarity (ECFP4MaxTc ≥ 0.4), or both, for one or more of 1382 targets well described by ligands in the literature. Using a further 1347 less-well-described targets, we predict activities for an additional 11 million compounds. To gauge whether these predictions are sensible, we investigate 75 predictions for 50 drugs lacking a binding affinity annotation in ChEMBL. The 535 million predictions for over 171 million compounds at 2629 targets are linked to purchasing information and evidence to support each prediction and are freely available via https://zinc15.docking.org and https://files.docking.org .
The purchasable chemical space has roughly
doubled every two and
a half years since 1990, owing to steady progress in efficient parallel
synthesis[1−8] and the synthesis of new building blocks. There are now over 400
million compounds one can easily purchase using ZINC,[9] which covers 204 commercial catalogs from 145 companies.
Each catalog is categorized by ease of purchase, and each compound
in turn inherits a purchasability level from its catalog membership.
The growth in catalog size is impressive, particularly among the make-on-demand
catalogs. Purchasable compounds in the favored lead-like[10] and fragment-like[11] areas have grown from 3 million and a half million in 2007 to 124
million and 9.2 million today, respectively. Many vendors have incorporated
the lessons of lead- and fragment-likeness in library design,[47] often filtering for PAINS.[48] About 340 million (85%) of these compounds are affordable
enough for the average academic lab to conduct a ligand discovery
project, retaining a price point around $100 per sample or less. A
further 60 million compounds are available at higher building-block
prices, often $400 USD or more and are included here for completeness.
We find that synthesis plus delivery of make-on-demand screening compounds
often takes little more than a month or so, just twice the time to
source many in-stock compounds.The molecular targets (proteins)
that these purchasable compounds
bind and modulate—if any—are rarely known. Fewer than
1 million compounds—less than 0.25%—have been reported
active in a target-specific assay according to public databases such
as ChEMBL[12] or other annotated collections
indexed by ZINC.[13] Investigators searching
for testable ligands might not consider the remaining readily available
compounds, as they are not annotated for targets and the sheer number
of options can be daunting. In the absence of target activity information,
the process of selecting compounds for general purpose screening will
often be target-naïve, relying on chemical or physical-property
diversity to sample chemical and property space, respectively.[14] If information on target bias—the likelihood
that a compound is more disposed to bind to a particular target or
class of targets—were readily available, libraries more likely
to cover biological targets of interest could be designed.Systematically
assaying every commercially available compound against
every target is experimentally impractical, so prioritizing compounds
through computational predictions is a pragmatic alternative. There
are many methods for predicting biological activities by chemical
similarity;[15−36] here, we use two. The Similarity Ensemble Approach (SEA)[37,38] predicts biological targets of a compound based on its resemblance
to ligands annotated in a reference database, such as ChEMBL.[12] SEA relates proteins by their pharmacology by
aggregating chemical similarity among entire sets of ligands. By leveraging
extreme value statistics, SEA filters out unreliable signals and normalizes
the aggregate results against a random chemical background to predict
the significance of pharmacological similarity. SEA has successfully
predicted targets of marketed drugs,[37−39] toxicity targets,[40] and mechanism of action targets for hits in
zebrafish[41] and C. elegans(42) phenotypic screens. We also use the
maximum Tanimoto coefficient[43] at 0.40[44] or better based on ECFP4 fingerprints[45] to inform predictions. Neither method generates
models incorporating discrete chemotypes as do Naïve Bayes
classifiers, for instance, but instead consider the molecule holistically.
This is advantageous because the method can suggest molecules that
do not conform to what has been highly weighted by precedent. Other
methods such as Naïve Bayes[46] can
explicitly weight for chemical substructures that are potentially
important to bioactivity (“warheads”), and thus a future
version might use such an approach to complement this work.To be useful for research, predictions should be accessible, searchable,
and downloadable. An interface should allow access to predictions
for each compound, as well as for each target, vendor, and gene. A
mechanism to select more novel or more conservative predictions would
cater to a wide range of requirements. And libraries should be downloadable
in 2D formats for chemoinformatics as well as in popular 3D formats
for docking screens.The prospective user of such a resource
expects some way to evaluate
the predictions. As one proxy to assess this data set, we performed
a retrospective 5-fold cross-validation on the ChEMBL bioactivity
data set for our method as compared to SEA and a naïve-Bayesian
classifier, at a variety of threshold parameters (Figure ; Supporting Information Figures S1 and S2). Second, in assessing performance,
we reencountered the observation that whereas the canonical targets
of all but a few drugs are known,[47] hundreds
of established drugs and investigational compounds nonetheless lack
their respective target annotations in ChEMBL. We turned this deficit
to our advantage, by testing the method’s prediction of targets
for several such drugs, corroborating our predictions with the literature
when available. Finally, as these predictions are based on protein−ligand
annotations derived from ChEMBL, we expect that this method will be
silent about chemotypes and targets not contained in this approximation
of the public pharmacopeia.
Figure 1
Comparative performance of SEA, SEA+TC, and
a multinomial naive-Bayesian
classifier (NBC) on ChEMBL cross-validation sets. (A) Receiver operating
characteristic (ROC) curves from independent 5-fold cross-validation
runs for each method. Methods are evaluated on independent cross-validation
sets filtered for >5 ligands per ChEMBL protein target (equivalent
analyses at >50 ligands per target reported in Supporting Information Figure S2). Overall performance is
gauged by the area under the ROC curve (AUROC). Note, for SEA+TC cross-validation
sets, ROC curves are the result of stepping a decision threshold across
MaxTc values, while holding a separate pSEA decision threshold at
40 (yellow curve) or 80 (cyan curve) (see Methods). Complementary curves stepping across SEA p-values are available
in Supporting Information Figures S1 and S2. Dotted lines span the distance between a fully stratified classifier
(TPR = 0; FPR = 0) and the minimum point at which both SEA+TC decision
thresholds begin to affect performance. Pink and blue circles indicate
the recommended upper and lower bounds for MaxTc thresholding on their
respective pSEA-threshold curves, respectively (upper = 0.80; lower
= 0.40). (B) Corresponding precision-recall curves (PRCs) for cross-validation
runs described in part A. Positive-class prevalence (dashed red line)
indicates the chance of selecting a positive association from the
data set at random (0.0014). Performance is measured by the area under
the PRC (AUPRC).
Comparative performance of SEA, SEA+TC, and
a multinomial naive-Bayesian
classifier (NBC) on ChEMBL cross-validation sets. (A) Receiver operating
characteristic (ROC) curves from independent 5-fold cross-validation
runs for each method. Methods are evaluated on independent cross-validation
sets filtered for >5 ligands per ChEMBL protein target (equivalent
analyses at >50 ligands per target reported in Supporting Information Figure S2). Overall performance is
gauged by the area under the ROC curve (AUROC). Note, for SEA+TC cross-validation
sets, ROC curves are the result of stepping a decision threshold across
MaxTc values, while holding a separate pSEA decision threshold at
40 (yellow curve) or 80 (cyan curve) (see Methods). Complementary curves stepping across SEA p-values are available
in Supporting Information Figures S1 and S2. Dotted lines span the distance between a fully stratified classifier
(TPR = 0; FPR = 0) and the minimum point at which both SEA+TC decision
thresholds begin to affect performance. Pink and blue circles indicate
the recommended upper and lower bounds for MaxTc thresholding on their
respective pSEA-threshold curves, respectively (upper = 0.80; lower
= 0.40). (B) Corresponding precision-recall curves (PRCs) for cross-validation
runs described in part A. Positive-class prevalence (dashed red line)
indicates the chance of selecting a positive association from the
data set at random (0.0014). Performance is measured by the area under
the PRC (AUPRC).
Results
The ZINC
database contains 400 million commercially available organic
molecules with molecular weight between 50 and 1000 Da, sourced from
204 commercial catalogs published by 145 companies. We have created
a database of predicted biological activities for the 171 million
compounds that had predictions and have made it freely accessible
via ZINC (https://zinc15.docking.org) and our file server (https://files.docking.org). All predictions were computed using a combination of the Similarity
Ensemble Approach (SEA)[37] and Tanimoto
similarity calculations based on compound annotations derived from
ChEMBL Version 21[12] (see Methods). We refer to this combinatorial approach as SEA+TC
throughout the text.To enhance this resource’s applicability
to a broad audience,
we sought to increase the specificity of predictions by using more
stringent criteria for what constitutes an annotated ligand. In prior
work we had used a 10 μM affinity cutoff, but at this scale,
we encountered flawed predictions that appeared to arise from similarity
to weak binders, possible PAINS, or promiscuous aggregator compounds.
Based on our experience with these encounters, we changed the baseline
affinity threshold to 1 μM and further required activities of
at least 100 nM for compounds containing PAINS patterns or being Tc
0.70 to any compound observed to aggregate.[48−50]We adopted
a statistical significance threshold of negative log
SEA p-value[54] (pSEA) ≥ 40 and a
MaxTc cutoff ≥0.40 guided by the work on belief theory from
the Abbvie group.[34] MaxTc is complementary
to pSEA as it provides a single-nearest-neighbor-molecule view of
similarity, compared to SEA’s global view arising from the
ensemble of annotated ligands. To quantify how this bivariate threshold
improves predictive capability, we evaluated the performance of SEA,
SEA+TC, and a Naïve-Bayesian classifier (NBC) via 5-fold cross-validation
of ChEMBL’s bioactivity data set (version 21; Figure ). SEA+TC’s ability
to correctly predict compound−target interactions as either
positive (does bind) or negative (does not bind) outperformed both
SEA and the NBC, as measured by the area under the receiver operating
characteristic (AUROC) curve, (AUROC = 0.995, Figure A). Further, when predicting a compound−target
interaction as positive, SEA+TC was correct in its prediction more
often than SEA or the NBC, as indicated by its area under the precision-recall
(AUPRC) curve (AUPRC = 0.684, Figure B). In performing this analysis, we additionally identified
a more stringent bivariate threshold, which some users may wish to
adopt. At a threshold of MaxTc ≥ 0.80 with pSEA ≥ 80,
the retrospective analyses achieve higher precision than the baseline
threshold (Figure A and B, blue circle) at acceptable recall (pink circle). Users of
the ZINC interface may choose thresholds to suit their needs.In addition to controlling the sensitivity and specificity of predictions,
the significance threshold (i.e., pSEA and MaxTc values)[17] also influences the novelty of the predictions.
Novel compounds can be desirable because they likely have unrelated
off-target effects, which can help establish the signaling and toxicity
role of a receptor, as well as selectively activate downstream signaling,
which is important for many receptors such as GPCRs.[38] Accordingly, we designed the ZINC interface to help users
rapidly identify predictions with their desired precision. The user
can control the MaxTc and pSEA limits, and each prediction can be
compared with the most similar annotated actives (Figure ) allowing side-by-side comparison.
Each SEA prediction is accompanied by a pSEA to the set of actives
and MaxTc to the nearest active. Clicking on the MaxTc value in the
interface performs a real-time search for the most similar ligands
annotated at 10 μM or better for that target.
Figure 2
Predictions supported
by evidence. (A) Here, Bucumolol (ZINC100)
is shown with a SEA prediction for ADRB2 at a pSEA = 33 and MaxTc
to the nearest annotated compound of 0.44. The user may click on the
“44” to go to the URL shown, which lists bucumolol’s
closest-match known ADRB2 ligands in decreasing order of similarity
(the first four are shown). The user may also click on “Run
SEA” to rerun a SEA calculation on the molecule, providing
comprehensive statistics.
Predictions supported
by evidence. (A) Here, Bucumolol (ZINC100)
is shown with a SEA prediction for ADRB2 at a pSEA = 33 and MaxTc
to the nearest annotated compound of 0.44. The user may click on the
“44” to go to the URL shown, which lists bucumolol’s
closest-match known ADRB2 ligands in decreasing order of similarity
(the first four are shown). The user may also click on “Run
SEA” to rerun a SEA calculation on the molecule, providing
comprehensive statistics.To find predictions for a given target using ZINC15 (zinc15.docking.org), the
user may select Genes from the Biological dropdown menu to browse a listing of all genes and predictions (Figure A). In this work,
we use genes and their identifiers as convenient shorthand for their
protein products—or molecular targets. To find a specific gene,
the user may type part of the gene name in the top right search bar,
here SLC6, and click the blue search button on the
top right. To display predictions for this gene, the user clicks on
the link in the predictions column, here for SLC6A1 (Figure B). The
user may for example use the subset selector to specify strong predictions (which we chose to mean pSEA = 80) and purchasability
(Figure C). Some advanced
features are currently only accessible by hand-editing the URL. Here,
the user adds table.html?sort=-maxtc and &maxtc-between=40+45 to display the information in a tabular format, to sort by decreasing
MaxTc, and to select only predictions between MaxTc of 40 and 45,
respectively (Figure D). We plan to make these API-level features available via a point
and click interface soon. Documentation is available via the help
pages https://zinc15.docking.org/genes/help and https://zinc15.docking.org/predictions/help.
Figure 3
Tools to display predictions for a gene and filter and sort them
by MaxTc and pSEA. (A) Gene page showing predictions, with search
bar to locate genes by name, top right. https://zinc15.docking.org/genes. (B) Gene listings for genes matching “SLC6” https://zinc15.docking.org/genes/search?q=SLC6. (C) Strongly predicted ligands for SLC6A11, showing
the popup for subset selections https://zinc15.docking.org/genes/SLC6A11/predictions/subsets/strong. (D) Individual predictions, showing MaxTc and pSEA for each prediction,
sorted by pSEA, with a MaxTc (novelty/similarity) limit specified https://zinc15.docking.org/genes/SLC6A1/predictions/subsets/strong/table.html?sort=-pvalue&maxtc-between=40+45.
Tools to display predictions for a gene and filter and sort them
by MaxTc and pSEA. (A) Gene page showing predictions, with search
bar to locate genes by name, top right. https://zinc15.docking.org/genes. (B) Gene listings for genes matching “SLC6” https://zinc15.docking.org/genes/search?q=SLC6. (C) Strongly predicted ligands for SLC6A11, showing
the popup for subset selections https://zinc15.docking.org/genes/SLC6A11/predictions/subsets/strong. (D) Individual predictions, showing MaxTc and pSEA for each prediction,
sorted by pSEA, with a MaxTc (novelty/similarity) limit specified https://zinc15.docking.org/genes/SLC6A1/predictions/subsets/strong/table.html?sort=-pvalue&maxtc-between=40+45.Predictions are available for
2629 genes[51] (Figure ). The number
of predictions per gene varies substantially, reflecting both the
diversity of annotated ligands for the target as well as how well
these chemotypes are represented in current vendor catalogs. For example,
natural products and their analogs are often difficult to access synthetically
and are therefore generally sparsely represented. At the high end
of predictions per gene, the eukaryotic GPCRs D4 dopamine
receptor (DRD4), C−C chemokine receptor type 3 (CCR3), and
the voltage gated ion channels KCNK3 and KCNK9 each have over 4.8
million purchasable predicted ligands. The number of strong predictions
(pSEA ≥ 80) varies from over 500 000 for KCNK3 to as
few as 9181 for DRD4. Filtering at MaxTc ≥ 0.60 instead, corresponding
to a precision exceeding 0.334 using ECFP4 fingerprints,[44] the predictions for these four genes varied
from as many as 25 728 for DRD4 to as few as 8912 for KCNK9.
At the other extreme of predictions per gene, fungal laccase-2 precursor
(LCC2), human C−C chemokine receptor type 6 (CCR6), voltage-gated
sodium channel Nav1.9 (SCN11A), and fruit fly DNA topoisomerase
2 (TOP2) each had fewer than 50 predicted commercially available ligands.
The small number of predicted ligands can often be explained by a
paucity of reference ligands; here, SCN11A and CCR6 have only 1 ligand
each at 10 μM or better. Another reason for the lack of ligands
is that the knowns are in an area of chemical space that is difficult
to access synthetically, such as natural products for both SCN11A
and CCR6.
Figure 4
Predictions available for 2629 genes. (A) The web interface allows
genes and their predictions to be found by name or gene symbol: https://zinc15.docking.org/genes. Enter the gene name in the search field (1). Click on the predictions
link (2) to display the predicted ligands. (B) Predictions and purchasable
compounds for 2629 genes. The horizontal axis is genes, sorted by
number of predictions. The vertical axis is number of compounds, log
scale, labeled by exponent. Dark gray circles indicate the number
of predicted purchasable compounds for a gene. Green triangles represent
the number of purchasable annotated compounds for the same gene.
Predictions available for 2629 genes. (A) The web interface allows
genes and their predictions to be found by name or gene symbol: https://zinc15.docking.org/genes. Enter the gene name in the search field (1). Click on the predictions
link (2) to display the predicted ligands. (B) Predictions and purchasable
compounds for 2629 genes. The horizontal axis is genes, sorted by
number of predictions. The vertical axis is number of compounds, log
scale, labeled by exponent. Dark gray circles indicate the number
of predicted purchasable compounds for a gene. Green triangles represent
the number of purchasable annotated compounds for the same gene.
Access by Gene Groupings
In addition
to individual
genes, predictions may also be accessed by groups of genes. This could
be helpful if the investigator is looking for new aminergic GPCR ligands
or ligands for voltage gated ion channels or simply wishes to ensure
balanced coverage of major target classes in a library. The interface
offers convenient ways to access gene groupings based on a protein
classification scheme inherited from ChEMBL. There are 15 major target
classes (Figure A)
further organized into 42 target subclasses (Figure B). Thus, there are 67 million predictions
for membrane proteins, of which 1 million are strong (pSEA ≥
80). Considered separately, there are 873,000 less chemically novel
predictions having a Tanimoto coefficient ≥0.60 to an annotated
active. At a higher level of granularity, there are 4.7 million predictions
for epigenetic reader proteins, of which 2.4 million are strong predictions
(pSEA ≥ 80) and 38 000 are highly similar (Tc ≥
0.60). At the organism level (Figure C), 18 million ligands are predicted for specific bacterial
targets, 1.0 million of which are stronger (pSEA ≥ 80) and
92 000 of which are highly similar (Tc ≥ 0.60). The
user may select purchasable compounds based on this classification.
These compounds will resemble precedented bacterial protein inhibitors
far more strongly than compounds selected at random. Ligands predicted
for specific bacterial targets are available to browse interactively
at https://zinc15.docking.org/organisms/bacteria/genes/ or to
download by gene at https://files.docking.org/predictions/current/. A plot of predictions per gene vs annotated ligands per gene shows
a general trend toward more predicted ligands when more known ligands
are available (see Supporting Information Figure S3).
Figure 5
Prediction counts and purchasable compounds. The gray line indicates
the number of predictions, and the green line represents the number
of annotated compounds. (A) By major target class. Data from https://zinc15.docking.org/majorclasses. (B) By target subclass. Most target predictions have a maximum
tanimoto coefficient between 0.30 and 0.39 and 0.40−0.49. Percent
of predictions for each target subclass relative to MaxTc are plotted
in the inset to show the full spread of prediction across bins. (C)
By Kingdom, called organism class in ChEMBL and ZINC. Data from https://zinc15.docking.org/organisms.
Prediction counts and purchasable compounds. The gray line indicates
the number of predictions, and the green line represents the number
of annotated compounds. (A) By major target class. Data from https://zinc15.docking.org/majorclasses. (B) By target subclass. Most target predictions have a maximum
tanimoto coefficient between 0.30 and 0.39 and 0.40−0.49. Percent
of predictions for each target subclass relative to MaxTc are plotted
in the inset to show the full spread of prediction across bins. (C)
By Kingdom, called organism class in ChEMBL and ZINC. Data from https://zinc15.docking.org/organisms.
Benchmarks
We
predicted the targets of established
drugs that nonetheless lack a protein binding affinity annotation
in ChEMBL to benchmark our approach. We found hundreds of drugs, withdrawn
drugs, and investigational compounds with target predictions that
agreed with the literature. Fifty of these were selected and tabulated
as illustration of our predictions (Table ). Thus, the beta blocker bufetolol[52] (ZINC101) is predicted to be a β2 adrenergic
receptor ligand with pSEA = 47 and MaxTc = 0.46 and to be a β1
adrenergic receptor ligand with pSEA = 51and MaxTc = 0.44. Aranidipine[53] (ZINC600803) is predicted for the calcium voltage-gated
ion channel CACNA1C with pSEA = 121 and MaxTc = 0.75. Ancarolol (ZINC39)
illustrates the discriminatory value of the SEA prediction, with pSEA
= 59 and MaxTc = 0.43 for ADRB1: 255 656 purchasable ligands
have higher MaxTc than ancarolol to this target while only 46 753
have a higher pSEA score.
Table 1
Drugs with No Binding
Data in ChEMBL,
Predicted by SEA or MaxTc, Corroborated by the Literature
drug(ref)
ZINC ID
target
pSEA
MaxTc
Acemetacin[63]
601272
PTGS2
40
0.76
Afeletecan[64]
150339966
TOP1
69
0.41
Alclometasone[65]
4172330
NR3C1
15
0.58
Alminoprofen[66]
22
PTGS2
0.47
Amisulpride[67]
1846088
DRD3
22
0.66
Ancarolol[68]
39
ADRB2
42
0.44
ADRB1
59
0.43
ADRB3
29
0.44
Aranidipine[53]
600803
CACNA1C
121
0.75
CACNA1D
132
0.51
Azasetron[69]
4132
HTR3A
25
0.61
Azelnidipine[70]
38141706
CACNA1C
91
0.56
CACNA1D
124
0.57
Azetirelin[71]
3804057
TRHR
95
0.59
TRHR2
0.61
Besifloxacin[72]
3787097
PARC
0.46
Bevantolol[73]
1542891
ADRB1
89
0.51
ADRB2
73
0.58
ADRB3
73
0.53
Bilastine[74]
3822702
HRH1
48
0.51
Binospirone[75]
1999423
HTR1A
0.48
Bufetolol[52]
101
ADRB1
51
0.44
ADRB2
47
0.46
Bunazosin[76]
601249
ADRA1B
52
0.61
Bupranolol[77]
106
ADRB2
45
0.44
ADRB1
19
0.45
Butofilolol[78]
112
ADRB1
50
0.40
ADRB2
34
0.46
Calcifediol[79]
12484926
VDRA
0.79
GC
0.79
Camazepam[80]
2008504
GABARA5
25
0.53
GABARA2
15
0.53
Cellcept[81]
21297660
IMPDH1
0.70
IMPDH2
0.70
Ciprokiren[82]
8214528
REN
178
0.68
Dasotraline[83]
2510873
SLC6A3
25
0.63
SLC6A2
29
0.63
Demecarium[84]
3875376
ACHE
0.71
Dienesterol[85]
4742540
ESR1
26
0.46
ESR2
15
0.46
Edaglitazone[86]
1483899
PPARG
83
0.66
PPARA
83
0.65
Efonidipine[87]
38139973
CACNA1C
81
0.51
CACNA1D
118
0.51
Eptazocine[88]
1846076
OPRD1
30
0.42
OPRK1
30
0.46
OPRM1
32
0.46
Etanterol[89]
263
ADRB1
23
0.47
ADRB2
47
0.40
Ethylmorphine[90]
3629718
OPRD1
28
0.62
OPRK1
24
0.62
OPRM1
32
0.75
OPRL1
0.57
Etomoxir[91]
1851171
CPT1
0.47
Fiduxosin[92]
29747110
ADRA1A
30
0.53
ADRA1B
45
0.53
ADRA1D
38
0.46
Floxacillin[93]
4102187
BLAACC-4
0.80
Flurazepam[94]
537752
GABARA5
28
0.50
GABARA1
17
0.49
Granisetron[95]
347
HTR3A
25
0.75
Halobetasol[96]
4214603
NR3C2
20
0.60
Hexoprenaline[97]
3872806
ADRB2
77
0.52
Ketobemidone[98]
1600
OPRD1
49
0.46
OPRK1
45
0.48
OPRM1
44
0.55
Lercanidipine[99]
19685790
CACNA1B
0.49
CACNA1C
107
0.70
CACNA1D
146
0.63
Lexacalcitol[100]
4474609
VDR
144
0.62
Meptazinol[101]
854
OPRD1
44
0.48
OPRK1
39
0.60
OPRM1
38
0.55
Metipranolol[102]
494
ADRB1
27
0.45
ADRB2
31
0.52
Ormeloxifene[103]
5104028
ESR1
86
0.51
ESR2
58
0.44
Paroxypropione[104]
1890
ESR1
38
0.58
ESR2
30
0.58
Pipenzolate[105]
601314
CHRM1
0.47
CHRM2
30
0.43
CHRM3
57
0.53
CHRM4
35
0.53
CHRM5
40
0.53
Pozanicline[106]
6562
CHRNA2
33
0.57
CHRNA4
0.57
CHRNA10
53
0.55
Propiverine[107]
1530934
CHRM2
24
0.42
CHRM3
50
0.57
Revatropate[108]
4214265
CHRM1
55
0.53
CHRM2
33
0.53
CHRM3
59
0.57
GPM3
0.57
Temazepam[109]
740
GABA5
28
0.59
Udenafil[110]
13916432
PDE5A
74
0.61
Unoprostone[111]
8214703
PTGER1
45
0.57
PTGER2
30
0.40
PTGER3
0.57
PTGDR
52
0.40
PTGFR
85
0.51
Valategrast[112]
72190226
ITGA4
60
0.32
Verubulin[113]
35978229
TUBB3
62
0.51
Among the 535 million predictions
of protein−ligand affinity
we expect numerous false positives and false negatives. These errors
stem from three major classes of problem. (1) Issues with target annotation:
annotated ligands may not be representative for a gene, such as curcumin
(ZINC100067274), which is annotated for 32 genes and is probably artifactual
for many of them.[54] Annotated ligands may
also be mis-annotations in ChEMBL, leading to false positives. For
instance, nicotinamide (CHEMBL1140) is annotated for fatty-acid amide
hydrolase 1 (FAAH), because it shares an abbreviation (NAM) with the
actual ligand, N-arachidonylmaleimide.[55] (2) Errors with the SEA method: We use ECFP4
fingerprints, which have little specificity for certain classes of
molecules, such as peptides and sterols, which share many common features
and thus are not well discriminated using this fingerprint. SEA also
has high variance for small ligand sets and low sensitivity for large,
diverse ligand sets. For instance, SEA fails to predict the well-known
antihistamine drugs chlorcyclizine and propiomazine for histamine
H1 receptor (HRH1), despite their having Tc values of 0.79 and 0.69,
respectively, to the most similar HRH1 ligands. The pSEA values of
11 in each case have been diluted by the 9000 diverse ligands annotated
to this target. A remedy might be to split targets with large number
of ligands, perhaps by chemical clusters, mode of action, or binding
site, if known. Note that Naïve Baysian classifiers can be
trained to correctly predict these activities, as can be seen on ChEMBL’s
ligand detail pages for these compounds. (3) No explicit model of
promiscuity for SEA: We have made some progress here by stringent
filtering of ligands we suspect are promiscuous (both PAINS and aggregator-like),
but we fail to handle frequent hitters such as staurosporine (ZINC3814434,
hits 365 targets in ChEMBL) and its ilk. Our current approach also
performs poorly on sigma nonopioid intracellular receptor 1 (SIGMAR1)
and cytochromes P450 3A4 (CYP3A4), because the ligands annotated to
it are highly diverse. To remedy this problem for targets with many
ligands, we could cluster by chemotype.
Genes Lacking Commercially
Available Ligands
When a
target has purchasable ligands, they can be used to rapidly probe
its biological function without requiring synthetic chemistry expertise.
Yet there are 69 targets with 20 or more annotated ligands in ChEMBL
where none is readily purchasable (Table ). To fill these holes in “target
space”, we have identified purchasable compounds that are predicted
to be active. In one example, voltage dependent calcium channel subunit
alpha-2/delta-2 (CACNA2D2) has 26 ligands in ChEMBL, none of which
is for sale, such as CHEMBL1801206 with a pKi of 7.7. The compound
ZINC36664273, however, is sold by Specs as AO-476/43421055 and has
a pSEA of 132 and a MaxTc of 0.72. Looking at these compounds side
by side (Table ) and
without detailed experimental knowledge of this target, the Specs
compound may be reasonable to try against this target. If successful,
such compounds could become a purchasable control for these targets.
Table 2
Selected Plausible Predictions of
Purchasable Compounds for Genes with No Purchasable Ligands in ChEMBL
Dark Chemical
Matter
Intriguingly, 229 million purchasable
compounds have no prediction at all by either pSEA ≥ 40 or
Tanimoto similarity Tc ≥ 0.40. Some of these will have just
missed our cutoffs, wherever the cutoffs may be drawn. A few will
be known actives, or analogs of actives, that simply lack a direct
binding annotation in ChEMBL. Still, these compounds are generally
interesting because they do not much resemble any direct binding actives
in ChEMBL. Should they be found to be active in an assay, they are
more likely to have fewer off-targets, at least against well-studied
targets, and are less likely to be encumbered by patents. A substantial
body of literature explores the strengths and pitfalls of dark chemical
matter.[56−59] To illustrate what a user of this resource can expect to find in
this underexploited yet commercially available space, we have highlighted
ten compounds (Table ). For each commercially available molecule, we show the nearest
precedented bioactive from public sources available to ZINC, which
may also include compounds not in ChEMBL. Dark chemical matter[56−59] may be browsed online at zinc15.docking.org/substances/having/no-predictions
and downloaded at scale by physical property tranches (https://files.docking.org/dark-matter/current), by vendor catalogs (e.g., for ChemBridge at https://files.docking.org/catalogs/50/chbr/chbr.predict.txt.gz) and by the genes they are predicted to bind (https://files.docking.org/genes//.predictions.txt.gz).
Table 3
Compounds with No Predictions “Chemical
Dark Matter”a
To browse, use: https://zinc15.docking.org/substances/having/no-predictions. To download: https://files.docking.org/special/dark-matter. To browse annotated compounds similar to any compound (e.g., at
least 0.30 similar to ZINC compound 14). https://zinc15.docking.org/substances/having/genes?ecfp4_fp-tanimoto-30=14 or 0.30 similar to SMILES https://zinc15.docking.org/substances/having/genes?ecfp4_fp-tanimoto-30=c1ccccc1NOCOCN. Also try https://zinc15.docking.org/substances/subsets/in-vitro?ecfp4_fp-tanimoto-30={zincorsmiles}. For similarity to natural products, try, https://zinc15.docking.org/substances/subsets/biogenic?ecfp4_fp-tanimoto-30=. Please note: these queries are efficient if there are few matches,
but will time out if too many hits are found. As a general rule, use
tanimoto-50 first, which will be fast, and decrease progressively
to −40 and then −30 only if no matches are found. This
calculation is intensive, and we may limit usage if there are too
many queries that return multiple thousands of hits to allow us to
keep this service freely available.
To browse, use: https://zinc15.docking.org/substances/having/no-predictions. To download: https://files.docking.org/special/dark-matter. To browse annotated compounds similar to any compound (e.g., at
least 0.30 similar to ZINC compound 14). https://zinc15.docking.org/substances/having/genes?ecfp4_fp-tanimoto-30=14 or 0.30 similar to SMILES https://zinc15.docking.org/substances/having/genes?ecfp4_fp-tanimoto-30=c1ccccc1NOCOCN. Also try https://zinc15.docking.org/substances/subsets/in-vitro?ecfp4_fp-tanimoto-30={zincorsmiles}. For similarity to natural products, try, https://zinc15.docking.org/substances/subsets/biogenic?ecfp4_fp-tanimoto-30=. Please note: these queries are efficient if there are few matches,
but will time out if too many hits are found. As a general rule, use
tanimoto-50 first, which will be fast, and decrease progressively
to −40 and then −30 only if no matches are found. This
calculation is intensive, and we may limit usage if there are too
many queries that return multiple thousands of hits to allow us to
keep this service freely available.A new research tool is now available within ZINC15
for public use.
We demonstrate the use of these new tools in four use cases, which
illustrate how to access predictions both interactively and via static
downloads, below.
Use Case One
The user is interested
in a well-studied
target such as the serotonin 2A receptor (HTR2A) and seeks compounds
to purchase that are likely to work but have not been reported active
in ChEMBL21. The user first checks how many ligands are annotated
active at 10 μM or better (5031, interactively at https://zinc15.docking.org/genes/HTR2A/substances or statically downloaded at https://files.docking.org/genes/current/HTR2A/HTR2A.smi). The user then queries how many commercially available ligands
have SEA predictions at an exceptionally strong statistical significance,
with pSEA = 80 (30 952 at https://zinc15.docking.org/genes/HTR2A/predictions/subsets/strong+purchasable). For instance, ZINC462039162 available from Enamine, catalog number
Z1269906839, with a pSEA = 82 and MaxTc = 0.63 (https://zinc15.docking.org/substances/ZINC000462039162/predictions/table.html). Millions of other commercially available molecules can be obtained
in a similar way. All predictions are downloaded immediately using https://files.docking.org/genes/current/HTR2A/HTR2A.predictions.txt.gz, from which compounds may be selected.
Use Case Two
The
user wishes to obtain a screening
library for projects involving several voltage-gated ions channels.
The user wishes to find purchasable compounds that do not seem too
similar, yet are more likely to be ligands than purely random compounds,
i.e., having a high MaxTc between 0.65 and 0.70, corresponding to
an expected precision of 0.35−0.40. The library should be downloaded
in 2D for chemoinformatics and 3D for docking. In ZINC, there are
14 849 already annotated ligands for any such channel in ChEMBL21
at 10 μM or better (https://zinc15.docking.org/subclasses/vgic/substances). Of these, 1108 (7.5%) are purchasable and may be a good starting
point for the library. A further 21 242 purchasable predicted
ligands also are available, such as ZINC629100 (https://zinc15.docking.org/substances/ZINC000000629100/predictions/table.html), which is Tc 0.69 to the nearest annotated active CHEMBL1097858,
active at pKi of 7.7. To obtain the first 1000 ZINC codes for these
molecules, the user accesses: https://zinc15.docking.org/subclasses/vgic/predictions/subsets/purchasable.txt?maxtc-between=65+70&count=1000. To download 3D models of these compounds, please see Obtaining 3D Models, below. A second approach
to download predicted compounds for voltage gated ion channels would
be to first obtain the names of all the genes: https://zinc15.docking.org/subclasses/vgic/genes.txt:name. Then, the user would use this list to download the static predictions
by gene. For example, for the sodium channel protein type 5 subunit
alpha (SCN5A), the predictions are in https://files.docking.org/genes/SCN5A/SCN5A.predictions.txt.gz.
Use Case Three
The user would like to know all predictions
for a particular vendor catalog. Vendors may be interested to know
possible targets of their compounds for marketing purposes. Vendors
may also wish to know which of their make-on-demand compounds might
be prioritized for synthesis based on possible activity. Academic
centers that screen vendor libraries may be interested in individual
vendors because they have negotiated special pricing, or because the
vendor makes plates available at a discount to facilitate the mechanics
of screening. We have been precomputed searches to enable such investigations
to save time. To access them, the user would complete the following
steps:Browse to https://files.docking.org/catalogs to select the catalog of interest.Download the file of predictions. For
instance, for ChemBridge, the code is chbr and the
URL is https://files.docking.org/catalogs/50/chbr/chbr.predict.txt.gz. Each row contains the vendor code, ZINC ID, InChIKey, predicted
gene, MaxTc, and pSEA: one molecule per row.Break the downloaded files into subsets
using Unix command-line tools to filter by MaxTc, pSEA, and predicted
gene.To download these in 3D for docking,
please see Obtaining 3D Models, below.
Use Case Four
The user wishes to download dark chemical
matter screening libraries in 2D or 3D formats. To do so, the user
browses to https://files.docking.org/dark-matter. The compounds have been binned into tranches by physical property
using our standard scheme (http://wiki.docking.org/index.php/Physical_property_space). The 2D files are available as compressed text files organized
by purchasability. Each row contains one molecule with its SMILES,
ZINC ID, physical property tranche, purchasability, and reactivity.
The 3D files will likewise be prepared in future but are meanwhile
available as described in Obtaining 3D Models, below.
Obtaining 3D Models
To download 3D models for a set
of molecules in bulk for one of the above use cases, here is a general
approach that will work for any arbitrary set of ZINC IDs:Obtain
the codes of the molecules to
download using the previous use cases or otherwise and store the codes
in zinc-codes.txt.Select mol2, db, or db2 file formats. mol2 may be converted
to other formats as required. The latter two are
used by the UCSF DOCK 3.x programs only.Download the script getfiles.csh from https://files.docking.org/catalogs/getfiles.csh.Edit the file by hand following the
instructions within.Run the script, with the list of ZINC
codes in the same directory. The 3D files will be downloaded.Please note that 3D models are currently
available for
about 120 million of the 400 million compounds in ZINC. We are continually
building and rebuilding them, prioritizing the popular lead-like and
fragment-like areas best suited to docking. If a 3D model is not available,
the molecule detail page contains a “Request Generation”
button in the 3D representations section. If a 3D model does not exist,
it is either because it fails to build or because it is still on our
action list.
Discussion
Four major results emerge
in this work. First, using ZINC and ChEMBL,
we predict molecular target activities for 171 million commercially
available compounds at 2629 targets and store them in an accessible
database. Second, we create an interface to search, access, and download
the predictions (https://zinc15.docking.org and https://files.docking.org). Predictions can be accessed individually or downloaded in bulk,
and are available in a range of formats ready for both docking and
chemoinformatics, or for purchase. To demonstrate the utility of these
predictions, we perform a retrospective 5-fold cross-validation of
the ChEMBL bioactivity data set. Further, we identify likely targets
of drugs known in the literature where direct binding annotations
are not available in ChEMBL. Finally, this new tool allows us to quantify
predicted target biases of purchasable chemical space. Target bias
predicted by this model is substantial—some genes are represented
by millions of purchasable compounds, others have very few. Nearly
60% of purchasable compounds in ZINC have no prediction at all, allowing
us to offer purchasable “dark chemical matter”. We take
up each of these results in turn.We predict targets for over
40% of the 400 million compounds currently
for sale in ZINC. The number is admittedly arbitrary, as we were obliged
to choose pSEA and Tanimoto similarity cutoffs. Knowing that this
approach would produce false positives and false negatives, we attempted
to strike a useful balance, and equip the user to apply further constraints.
Many compounds with MaxTc as low as 0.40 to the nearest active may
not bind the predicted target−previous work suggests 18% precision
might be a good estimate[44] and this is
consistent with the results we found in Figure (blue circle). Likewise, those with a pSEA
near our chosen threshold of pSEA = 40 may not be active against the
predicted target. Should such chemically novel predictions be confirmed
experimentally, they may represent new starting points for optimization
and could lead to new biology. If the user wishes higher confidence
hits, more stringent cutoffs in pSEA or MaxTc are easily applied.
We refer the reader to the set of thresholds examined in our cross-validation
of the ChEMBL bioactivity data set (Supporting Information Figure S1) for guidance in choosing pSEA and
MaxTc values to optimize the desired output. For the highest rates
of precision at an acceptable recall, we recommend threshold values
at pSEA ≥ 80 and MaxTc ≥ 80 (Figure , pink circle), noting this may reduce the
number of novel compound−target associations that pass the
cutoff.For those wishing to buy a compound that works, the
user might
only consider the most similar compounds, having high Tc to a precedented
bioactive. For those seeking chemical novelty against a target, where
testing 10 or even 50 more novel compounds to find new chemical matter
is acceptable, more novel compounds may be sought. Users of virtual
screening methods such as docking may want particularly novel (low
MaxTc) compounds, because their screening method makes an independent
assessment of each prediction. Some will prefer to pursue the most
novel—and potentially most interesting—the purchasable
chemical dark matter, those compounds that do not seem similar to
any of the annotated compounds used to make these predictions. Whatever
the appetite for risk, investigators are empowered by these tools
to select predictions that are right for their project.Interfacing
the prediction database through ZINC allows predictions
to be searched, grouped, filtered, compared, and downloaded using
the extensive ZINC machinery. Thus 3D models of predicted compounds
may be accessed for molecular docking screens, while SMILES strings
or molecular properties may be downloaded for ligand-based methods.
Predicted compounds for any of 2629 genes may be accessed and downloaded
in any of eight formats. Results may be filtered by prediction statistics
(pSEA, MaxTc), molecular properties (e.g., molecular weight, calculated
logP, polar surface area, fraction sp3) and purchasability (in stock,
make-on-demand, or by vendor). Both 2D and 3D results can be organized
by gene (e.g., ADRB2, SRC), minor class (e.g., GPCR Class B, voltage-gated
ion channel), major class (e.g., transcription factor or membrane
protein), Kingdom (bacterial, eukaryotic, viral), vendor, and physical
property tranche. Attributes of predictions may be downloaded in tabular
form for analysis. A REST API, exemplified in this work, described
previously[9] and documented online,[60] allows automated queries and machine-readable
results, so that this database may be incorporated into third-party
software applications.We examined drugs and investigational
compounds without an established
molecular target annotation in ChEMBL to assess the relevance of the
predictions. The 50 we highlighted exemplify typical results that
can be expected using our approach for the millions of molecules that
have never been assayed (Table ). Whereas an exhaustive analysis is impractical, this result
supports the view that our predictions are often consistent with experimentally
observed binding.A global picture of target bias in commercially
available libraries
emerges. Of the 535 million compound−target predictions, over
500 000 predictions on 400 000 compounds have a MaxTc
better than 0.60 (ECFP4) to a ligand annotated for that target; a
level of similarity that suggests 35% precision.[44] A further 1.6 million predictions on 1.4 million compounds
with MaxTc between 50 and 59 are also strong candidates for experimental
testing. Many of these two million compounds could have been predicted
by pairwise Tanimoto similarity alone, without the help of SEA. The
pSEA adds most value below MaxTc 0.50, where it provides a global
similarity measure to the set of annotated ligands as a group instead
of a single pairwise one. This becomes even more acute below MaxTc
of 0.40, where we only retain predictions with pSEA ≥ 40 as
the Tanimoto coefficient alone becomes too untrustworthy, with precision
falling rapidly below 10%.Our analysis provides additional
resources. We have predicted compounds
for 69 targets[61] for which none of the
20 or more actives is commercially available (Table ). If confirmed experimentally, these genes
could now be represented in screening panels of commercially available
compounds, and these new ligands used as controls or perhaps even
starting points for design. For each of 2629 genes, a range of commercially
available compounds from high-confidence, having high MaxTc, to more-novel-yet-intriguing
at lower MaxTc are now available. For the most studied targets, there
is a deep bench of predictions running into the millions of compounds
each. Massive biases for some targets, such as the dopamine D2 (DRD2)
and beta-2 adrenergic (ADRB2) receptors for instance, echoes our earlier
work[62] that commercial libraries are heavily
biased toward long-studied, important biological targets. Correspondingly,
less-well-studied targets with few ligands often have sparse representation
in commercial libraries, which can occur when the known actives are
natural products or their derivatives. We have also assembled a database
of “dark chemical matter”, 229 million purchasable compounds
that received no target prediction and that generally do not resemble
known bioactives, which is available from our website in 2D and 3D
formats. If these compounds were active in a screen, they would likely
represent new starting points for optimization.Our approach
has other liabilities. Our cutoffs in MaxTc and pSEA
inevitably exclude sensible predictions. Some classes of compounds
such as sterols, peptides, and nucleotides suffer from higher mis-prediction
rates, a subject of continuing research. pKa and explicit charge are poorly treated in our current protocol based
on stereochemistry-naïve ECFP4 fingerprints, making amidenitrogens
and basic amines too much alike, for instance, leading to some obviously
wrong predictions. Massive turnover in the chemical marketplace means
stored predictions may lag the appearance of new compounds in ZINC.
ChEMBL contains artifacts and errors, which this approach can magnify.
The SEA and MaxTc approaches quantify whole-molecule similarities
and are thereby naïve of critical chemical moieties (often
called warheads).Notwithstanding these limitations, our database
of predicted biological
activities for purchasable chemical space is a pragmatic tool that
should be useful to a broad audience. It affords both a retail view—buy
this compound for this target—as well as a wholesale one—this
target is well represented, and here are some compounds for it. Our
predictions can be rapidly tested because the compounds are purchasable.
We intend to continue to update the database as purchasable chemical
space evolves and ChEMBL is enhanced. This database is provided in
the hope that it will be useful, but you must use it at your own risk.
Methods
Library
Preparation
We used CHEMBL21 compounds annotated
for targets better than 10 μM and grouped by Uniprot gene symbol
across eukaryotes, as previously described in ZINC15.[9] Thus in this scheme, DRD2_HUMAN, DRD2_RAT, and DRD2_MOUSE
are all grouped into a single gene annotation DRD2, and predictions
are made against the unified collection for the gene and not the individual
orthologs. In situations where the target is composed of several gene
products, as in some ion channels for instance, we used the ChEMBL
name. When no gene has been formally assigned by Uniprot, we use the
Uniprot accession code itself as the gene name, as in ZINC15.
SEA Reference
Library Construction
We grouped ligands
by affinity. We computed an affinity bin as the negative log of the
molar affinity, which is variously expressed as Ki, IC50, and EC50 among others in
ChEMBL21 and which we refer to as pKi in this work for simplicity.
Thus in this scheme, bin 6 contains all compounds with 1 μM
affinity or better. Lower affinity bins were inclusive of compounds
from all higher affinity bins. We built three SEA libraries as follows.
In the first library, we only proceed if there are at least five distinct
compounds active against a single gene, we only accept activities
of 1 μM or better. We found 1382 such genes, which we defined
as being well described by their ligands. In the second library, we
only predict for those single gene targets that did not qualify for
the first pass, accepting activities as weak as 10 μM, and as
few as one good ligand. We found 1347 of these less-well-described
genes. The third library was an attempt to overcome a statistical
weakness, which diluted the signal of genes having many diverse ligands.
We clustered ligands to describe individual chemotypes of 302 genes
having 300 ligands or more each. For each library we computed a statistical
background for SEA based on the 410 624 annotated compounds.
We computed the pSEA based on an extreme value distribution and the
maximum Tanimoto similarity of the prediction to the annotated compounds
(MaxTc). Throughout we suppressed from the libraries compounds with
PAINS patterns or similarity to a precedented aggregator by 0.70 (ECFP4)
having an affinity worse than 100 nM.[48] This was likely too conservative, but earlier, more permissive attempts
at this library often suffered from excessive erroneous predictions,
likely owing to these fraught compounds.
Database Loading
Predictions were loaded into ZINC.
To minimize ligands whose charge differed sharply from precedent,
we computed the mean and the standard deviation of the average microspecies
charge using ChemAxon’s CXCALC program for each gene. When
loading each prediction, if the charge of a 3D representation at pH
7.4 (reference model) was available, we suppressed loading if the
charge on the molecule fell outside 1.5 standard deviations from the
mean charge for ligands annotated to that gene. This remains an area
of ongoing research. The result was to suppress predictions that we
likely would have thrown out on inspection, in a scalable if incomplete
and imperfect way.
ChEMBL Cross-Validation
We evaluated
the predictive
performance of SEA+TC using ChEMBL’s bioactivity data set (version
21). Receiver operating characteristic curves were generated from
independent 5-fold cross-validation runs for each method examined
(SEA, SEA+TC, NBC). For SEA and NBC cross-validation sets, each point
on the curve represents the average true-positive rate (TPR) and false-positive
rate (FPR) from all 5 folds. TPRs and FPRs along the curve were determined
by stepping a decision threshold across the range of possible SEA
p-values (0.0−1.0), for all predicted compound-target interactions.
To examine the sensitivity of these results to how well the target
is described by ligands, we ran the analysis using targets with a
minimum of 5 ligands and also with 50 ligands.For SEA+TC cross-validation
sets, TPRs and FPRs along the curve were determined by two separate
decision thresholds; one for the SEA p-value and another for the maximum
Tanimoto coefficient (MaxTc). As ROC curves evaluate a binary classifier
using a single discrimination threshold, assessing performance by
simultaneously stepping across both metrics was not ideal. To account
for this, we generated ROC curves by stepping across all possible
values of MaxTc, while holding the pSEA decision threshold constant
(Figure ). Predicted
compound−target associations are therefore positive if their
pSEA or MaxTc passes either of the respective cutoffs. A consequence
of this bivariate thresholding is that the static pSEA threshold prevents
the TPR and FPR from ever reaching zero. To highlight this, the distance
between a fully stratified classifier (TPR = 0; FPR = 0) and the minimum
point at which both decision thresholds begin to affect performance
is shown in dashed lines (Figure ). Performance metrics for a range of pSEA decision
thresholds are shown in Supporting Information Figure S1A and B. Complementary curves stepping across pSEA
while holding a separate MaxTc decision threshold constant are shown
in Supporting Information Figure S1C and D.
Interface
We added support for SEA predictions to the
user interface on the Molecule Detail, Target Detail and Gene Detail
pages of ZINC. The interface classifies each gene by one of 15 major
target classes (e.g., membrane receptor, ion channel, transporter)
and by one of 42 subclasses (e.g., Class A GPCR, voltage gated ion
channel, etc) whose pages also allow access to the SEA predictions.
The results are downloadable in eight formats: SMILES, mol2, SDF,
pdbqt, json, xml, txt, and xls. The predictions may be accessed visually
via a web browser or programmatically using an application program
interface, both located at https://zinc15.docking.org/predictions/home. Static files are accessible via https://files.docking.org/predictions, https://files.docking.org/genes, https://files.docking.org/catalogs, and https://files.docking.org/dark-matter.
Caveats
Vendors often advertise stereochemically ambiguous
molecular descriptions and thus the number of compounds and predictions
strongly depends on how these are treated. Since ZINC is a 3D focused
database, we are obliged to commit to a 3D representation. Where there
is ambiguity, we enumerate up to a maximum of four possible stereoisomers
(R/S and E/Z) and readily admit that this inflates the numbers in
this work.
Authors: Amanda J DeGraw; Michael J Keiser; Joshua D Ochocki; Brian K Shoichet; Mark D Distefano Journal: J Med Chem Date: 2010-03-25 Impact factor: 7.446
Authors: Ágnes Peragovics; Zoltán Simon; László Tombor; Balázs Jelinek; Péter Hári; Pál Czobor; András Málnási-Csizmadia Journal: J Chem Inf Model Date: 2012-12-18 Impact factor: 4.956
Authors: Michael J Marks; Charles R Wageman; Sharon R Grady; Murali Gopalakrishnan; Clark A Briggs Journal: Biochem Pharmacol Date: 2009-05-27 Impact factor: 5.858
Authors: Barbara Malinowska; Katarzyna Kieć-Kononowicz; Karsten Flau; Grzegorz Godlewski; Hanna Kozłowska; Markus Kathmann; Eberhard Schlicker Journal: Br J Pharmacol Date: 2003-08 Impact factor: 8.739
Authors: Andrey V Bogolubsky; Yurii S Moroz; Pavel K Mykhailiuk; Dmitry S Granat; Sergey E Pipko; Anzhelika I Konovets; Roman Doroschuk; Andrey Tolmachev Journal: ACS Comb Sci Date: 2014-04-18 Impact factor: 3.784
Authors: David Xu; Donghui Zhou; Khuchtumur Bum-Erdene; Barbara J Bailey; Kamakshi Sishtla; Sheng Liu; Jun Wan; Uma K Aryal; Jonathan A Lee; Clark D Wells; Melissa L Fishel; Timothy W Corson; Karen E Pollok; Samy O Meroueh Journal: ACS Chem Biol Date: 2020-05-21 Impact factor: 5.100
Authors: Albert J Kooistra; Márton Vass; Ross McGuire; Rob Leurs; Iwan J P de Esch; Gert Vriend; Stefan Verhoeven; Chris de Graaf Journal: ChemMedChem Date: 2018-02-14 Impact factor: 3.466
Authors: Huikun Zhang; Spencer S Ericksen; Ching-Pei Lee; Gene E Ananiev; Nathan Wlodarchak; Peng Yu; Julie C Mitchell; Anthony Gitter; Stephen J Wright; F Michael Hoffmann; Scott A Wildman; Michael A Newton Journal: PLoS Comput Biol Date: 2019-08-05 Impact factor: 4.475
Authors: John A Marwick; Richard J R Elliott; James Longden; Ashraff Makda; Nik Hirani; Kevin Dhaliwal; John C Dawson; Neil O Carragher Journal: SLAS Discov Date: 2021-06-02 Impact factor: 3.341