
ZINC 15--Ligand Discovery for Everyone.

Abstract

Many questions about the biological activity and availability of small molecules remain inaccessible to investigators who could most benefit from their answers. To narrow the gap between chemoinformatics and biology, we have developed a suite of ligand annotation, purchasability, target, and biology association tools, incorporated into ZINC and meant for investigators who are not computer specialists. The new version contains over 120 million purchasable "drug-like" compounds--effectively all organic molecules that are for sale--a quarter of which are available for immediate delivery. ZINC connects purchasable compounds to high-value ones such as metabolites, drugs, natural products, and annotated compounds from the literature. Compounds may be accessed by the genes for which they are annotated as well as the major and minor target classes to which those genes belong. It offers new analysis tools that are easy for nonspecialists yet with few limitations for experts. ZINC retains its original 3D roots--all molecules are available in biologically relevant, ready-to-dock formats. ZINC is freely available at http://zinc15.docking.org.


Substances：
Ligands

Year: 2015 PMID： 26479676 PMCID： PMC4658288 DOI： 10.1021/acs.jcim.5b00559

Source DB: PubMed Journal: J Chem Inf Model ISSN： 1549-9596 Impact factor: 4.956

Introduction

ZINC (ZINC Is Not Commercial) is a public access database and tool set, initially developed to enable ready access to compounds for virtual screening,[1] that has become ever more widely used for virtual screening,[2−9] ligand discovery,[10−13] pharmacophore screens,[14] benchmarking,[15−17] and force field development.[12,18] Increasingly, however, investigators have tried to interrogate it for questions that it was not designed to answer. Simple questions, such as how many endogenous human metabolites are there, which of these are purchasable, or what natural product or drug does a compound most closely resemble, were surprisingly difficult to answer. With a target in mind, investigators often wanted a focused library biased toward ligands for that target. With new compounds discovered, they often wanted to find the most similar ligands already known for that target. To optimize that ligand, they might look to the availability of starting products for synthesis, asking, for instance, how many boronic acids that contain an indole ring may be purchased in preparative quantities and how soon will they arrive.[19] For these and many related questions, we wondered whether we could make a system that obviated the need for a computer expert’s assistance. Here, we describe a new version of ZINC designed to address these questions, while retaining the ease of use of the original tool. ZINC15 is designed to bring together biology and chemoinformatics with a tool that is easy to use for nonexperts, while remaining fully programmable for chemoinformaticians and computational biologists. Our approach has four parts.
1) To integrate and curate biological activity, chemical property, and commercial availability data for small molecules from public sources, supplemented by additional calculated properties, into a chemistry-aware relational database.
2) To build a general query language and report generator that is Web URL compatible.
3) To design a graphical user interface that requires no programming to interrogate the database using this query language.
4) To demonstrate and document the use of this tool to answer previously difficult questions.
This effort has resulted in ZINC 15, a new research tool for ligand discovery that connects biological activities by gene product, drugs, and natural products with commercial availability. We describe the system and demonstrate its use to answer questions about biologically active and purchasable chemical space that were previously not easy for nonexperts to answer.

Results

Previously in ZINC,[2] compounds stood on their own and were both the subject of queries to ZINC and the answers to such queries. An innovation of this version is to identify those molecules that have known biological effects or are of biological origin, such as drugs and natural products, and to link compounds to the proteins and biological processes that they modulate. Correspondingly, one can now interrogate ZINC regarding the ligands that bind to a particular protein or regarding the proteins that a particular molecule is known to modulate. Extending this, one can also ask what biologically active molecules are most like those that bind a particular protein of interest and what proteins such a compound might be predicted to bind, based on chemical similarity to known ligands. In this way, the mission of ZINC is expanded from purely compound-centric to one that links molecules to biological targets, processes, and other bioactive small molecules. The biological annotations (the identification of molecules as metabolites, drugs, and natural products and the identification of molecules as ligands for particular proteins and processes) are all derived from other databases and libraries, such as HMDB,[20] ChEMBL,[21] and DrugBank,[22] for which ZINC is essentially a client and from whose rapid development in the last several years ZINC has benefited. What is new here is that ZINC cross-references this information with purchasability of reagents; this much expands its ability to bring readily available reagents to biological questions. Doing so has demanded optimization of the mechanics of ZINC and its interface. Some of these optimizations have obvious impacts on the user and the questions they can ask, for instance, expanding the number of molecules they can address by 2 orders of magnitude. For convenience, we divide the results into descriptions of the new associations with which purchasable molecules are freighted and examples of their use.

Bioactive Molecules and Their Associations

ZINC15 draws upon third party databases such as ChEMBL, HMDB, DrugBank, and https://ClinicalTrials.gov to annotate high information compounds that are active in, or created by, nature. These include biogenic molecules, such as natural products, metabolites, and FDA approved drugs, among others (Figure 1A).[23] What ZINC brings is not only lists of these molecules, which after all are derived from other sources, but their purchasability and, as we will see, their predicted as well as known associations. For instance, there are 70539 metabolites that have been annotated,[24] 15006 of which may be purchased from vendors.[25] Restricting this to endogenous human metabolites, there are 46941 molecules,[26] of which 8271 are for sale.[27] Properties of catalogs and, in turn, the molecules they contain are ranked in three independent series, thus:
Biogenic: Endogenous human metabolites > nonhuman metabolites > natural products > unknown biogenicity.
Drug-series: FDA approved > World drugs > Investigational > In man > In vivo > In cells > In vitro > unknown bioactivity.
Purchasability: in-stock > via agent > on-demand > boutique > annotated (not for sale).
Within each series, we classify each compound by the highest level it attains by catalog membership. Molecules found in catalogs categorized as containing only endogenous human metabolites,[26] such as oxedrine (ZINC733), for instance, inherit that property. Therefore, oxedrine is in the “endogenous” substance subset.
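The "highest level attained by catalog membership" rule can be sketched in a few lines of Python. This is only an illustration of the ranking logic described above, not ZINC's implementation: the function name `highest_level`, the list representation of each series, and the fallback to the lowest level for a compound with no catalogs are all assumptions.

```python
# Rank orders copied from the text; index 0 is the highest level in a series.
BIOGENIC = ["endogenous human metabolite", "nonhuman metabolite",
            "natural product", "unknown biogenicity"]
PURCHASABILITY = ["in-stock", "via agent", "on-demand", "boutique",
                  "annotated (not for sale)"]


def highest_level(catalog_levels, series):
    """Return the best level in `series` among a compound's catalogs.

    `catalog_levels` holds one level per catalog listing the compound.
    Falling back to the lowest level when the list is empty is an
    assumption made for this sketch.
    """
    if not catalog_levels:
        return series[-1]
    # The smallest index in the series is the highest-ranked level.
    return min(catalog_levels, key=series.index)


# A hypothetical compound listed in three vendor catalogs.
levels = ["on-demand", "in-stock", "boutique"]
best = highest_level(levels, PURCHASABILITY)
print(best)
```

Because each series is ranked independently, the same function would be called once per series for each compound.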

Figure 1

The ZINC15 Web interface. A: Selected ZINC substance subsets showing the number of purchasable bioactive and biogenic subsets. The navigation bar (1) provides access to other resources. Click on the subset name (2) to browse or download a subset. (3) Estimates of the number of each subset. (4) Inset of dropdown menu providing access to other resources. B: Endogenous human metabolites subset page, represented as tiles. The page navigation tool (1), the Get total tool (2), the breadcrumbs (3), the selection tool (4), and the download tool (5). Click on any molecule’s number (6) to view its detail page. (7) Download popup (inset).

Compounds are also annotated with molecular properties that speak to their potential behavior and mechanism. For instance, 16485 compounds[28] have appeared in the literature and have been shown to aggregate,[29] a mechanism that is the origin of the greatest number of artifacts in early drug discovery.[30] Of these, 15072 are purchasable.[31] Caveat emptor! Interestingly, 53 metabolites have been observed to aggregate[32] as have 23 world drugs, i.e., drugs approved by major national regulatory agencies such as the FDA.[33]

We demonstrate the Web site usage by using it to answer questions. For instance, how many endogenous human metabolites are there? The user would browse to http://zinc15.docking.org, click on Substances in the navigation bar (top), click on Subsets (top, left), and select Endogenous from the list (bottom) (Figure 1A). The endogenous human metabolites are displayed as tiles, but often there are too many to count immediately (Figure 1B). To count the number if it is not displayed, the user clicks on the “Get Total” button (top, left) (47319). How many of these are available for immediate delivery? The user selects “now” from the selection tool dropdown menu (top, right) and clicks on the Count button again (now 7190). To download these, the user would click on the Download button (top, right). The same procedure is applicable for drugs, investigational compounds, and biogenic compounds, among others. Thus, the user may ask for drugs that are also metabolites or natural products that have been investigated in clinical trials.

Activities by Organism Class

Having drawn on databases such as ChEMBL and DrugBank to associate compounds with their individual targets, one can organize ligands by organism class (Table 1). For instance, in ZINC15 there are 2737 eukaryotic proteins that have compounds binding at 10 μM or better annotated to them.[34] Over 100,000 distinct compounds hit at least one eukaryotic target at 10 nM or better,[35] rising to over 1/3 of a million at 10 μM.[36] Intriguingly, only 361 bacterial proteins have molecules annotated to them,[37] amounting to only 4283 compounds at 1 μM or better;[38] the numbers for archaeal and viral targets are, notwithstanding intense interest, lower still.

Table 1

Number of Genes, Uniprot Codes, and Ligands by Organism Class and Affinity Bin

			compounds
organism class	gene symbols	Uniprot codes	10 μM	1 μM	100 nM	10 nM	1 nM
eukaryotes	2,752	4,098	356,935	293,391	201,963	100,480	29,611
bacteria	386	515	6,903	4,283	2,467	1,028	262
archaea	3	3	25	25	1	0	0
viruses	69	102	12,584	9,625	6,486	3,473	1,467
totals	3,210	4,718	376,447	307,324	210,917	104,891	31,340

Number of distinct compounds active at five activity thresholds. Each value in the table may be calculated using the Web interface.[39]
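The counts in Table 1 fall from left to right, which is what one expects if a compound active at 10 nM is also counted at every weaker threshold. A minimal Python sketch of such cumulative counting follows. The activity records are made-up toy data, and counting each compound by its best (lowest) reported affinity is an assumption about how "at X or better" is evaluated.

```python
# Thresholds from Table 1, expressed in nM.
THRESHOLDS_NM = {"10 uM": 10_000, "1 uM": 1_000, "100 nM": 100,
                 "10 nM": 10, "1 nM": 1}


def count_by_threshold(activities):
    """Count distinct compounds active at or better than each threshold.

    `activities` is an iterable of (compound_id, affinity_nM) pairs; a
    compound may appear several times with different affinities.
    """
    best = {}
    for compound_id, affinity_nm in activities:
        previous = best.get(compound_id, float("inf"))
        best[compound_id] = min(previous, affinity_nm)
    return {label: sum(1 for value in best.values() if value <= cutoff)
            for label, cutoff in THRESHOLDS_NM.items()}


# Toy records: compound "A" has two measurements, "C" is weaker than 10 uM.
toy = [("A", 5), ("A", 2000), ("B", 500), ("C", 20000)]
counts = count_by_threshold(toy)
print(counts)
```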

Target Focused Libraries

ZINC may be used to acquire focused libraries of ligands annotated to a particular gene. For instance, a user seeking new ligands for the ionotropic glutamate receptor GRIN1 might want the 59 known ligands as controls[40] in 3D SDF format files of the usual relevant forms expected at physiological pH.[41] In the previous example orthologous genes were combined into gene symbols, but some questions require a particular species be specified. ZINC supports both options. Thus, 2025 compounds bind the human beta-2 adrenergic receptor at 10 μM or better,[42] while 2050 distinct compounds bind any of its orthologs,[43] which includes the 2025 above plus 25 additional compounds. Eight distinct Uniprot annotations are available for this gene,[44] and 2021 distinct ligands bind either the rat or human form at 1 μM or better.[45]

Chemical Search

The user may look for molecules either one at a time or in bulk. For a single molecule, the chemical drawing may be seeded with a drug or chemical name, SMILES string, SMARTS pattern, InChI, InChIKey, ZINC ID, or even original catalog IDs such as CHEMBL IDs (Figure 2A). Four buttons below the chemical drawing tool allow for exact, substructure, and two kinds of similarity searches, using either Tanimoto[46] or Dice[47] coefficients, each based on 512 bit ECFP4 fingerprints.[48] To look up many compounds at once, the user may use the Resolver (Figure 2B), specifying one molecule per line, again using SMILES, name, or ID but not SMARTS. We take these options up in turn.
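For readers unfamiliar with the two similarity measures, the sketch below computes Tanimoto and Dice coefficients for fingerprints represented as sets of on-bit positions. It uses plain Python sets rather than a chemistry toolkit, and the two toy fingerprints are invented for illustration; real ECFP4 fingerprints would be generated by software such as RDKit.

```python
def tanimoto(a, b):
    """Tanimoto coefficient: shared on-bits divided by on-bits in either."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dice(a, b):
    """Dice coefficient: twice the shared on-bits over the summed bit counts."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


# Toy fingerprints: positions of the bits that are set (0..511 for 512 bits).
fp_query = {1, 2, 3, 4}
fp_hit = {3, 4, 5, 6}

t = tanimoto(fp_query, fp_hit)
d = dice(fp_query, fp_hit)
print(round(t, 3), round(d, 3))
```

The two measures are monotonically related (Dice = 2T / (1 + T)), so they rank hits identically for a single query but place numerical cutoffs differently.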

Figure 2

Chemical search in ZINC. A: Search using a single molecule. (1) The user may seed the drawing with chemical or drug names, ZINC IDs, InChI, InChIKey, or original catalog numbers. The search options may be edited prior to search (2), which is run using one of four buttons below the drawing (3): exact match, substructure, and two kinds of similarity. B: Search using many molecules in a single operation. Specify one molecule per line in a file (4) or by pasting (5). Additional search options (6). Format of results may be specified (7, inset 8) prior to running the resolver. Buttons to browse subsets or view subsets (9) as well as general free text search.

There is no set limit on the number of molecules that may be returned in a single similarity search calculation. Using the API, it should be possible to download 1 million or more compounds, in any format, based on similarity and/or substructure. However, these queries may take a long time and would likely be run in batch mode. Our wiki contains suggestions for making long-running chemical searches run faster.[49] ZINC is a public resource, and occasionally the most pragmatic solution may be to download a large portion of ZINC and run a chemical search locally. ZINC can comfortably handle queries that return hundreds of thousands of molecules. We look forward to discovering the practical limits of this new technology.

In previous versions of ZINC, we supported similarity searches for multiple molecules, apparently in parallel. Internally, they were run serially, and the results concatenated. Currently, there is no mechanism to run queries with multiple query molecules. There are two workarounds for querying many molecules. 1) Use the API to search each molecule independently. 2) Use the Resolver, which is limited to a minimum similarity of 0.7 (ECFP4).

Find by Chemical Similarity

When a hit is found in a screening campaign, a common next step is to identify, possibly model, and then purchase analogs, often called SAR-by-catalog. How well explored is the annotated or purchasable chemical space around a compound? To investigate analogs of Olaparib (ZINC 40430143), a recently approved PARP inhibitor, the user clicks on Substances in the navigation bar and types Olaparib in the input line above the drawing tool (Figure 2A). At time of writing there were 38885 analogs within a Tanimoto of 50% (ECFP4),[50] 297 of which were in stock for immediate delivery.[51]

ZINC can also answer questions about the novelty of a newly discovered ligand for a particular target. For instance, the drug cariprazine is known to hit DRD2, but what are the most similar ligands that hit DRD3 or DRD4? To investigate this, the user would click on Substances in the navigation bar and enter cariprazine in the Draw/Search Structure field above the drawing tool (the molecule appears) (Figure 2A). The user clicks on Tanimoto to find similar ligands. On the results page, the user selects the relations selector (the chain link icon), selects gene from the popup, types DRD3 as the resource name, and clicks on the blue chain link icon to apply this constraint.[52]

Find Compounds by Substructure

ZINC supports full SMARTS using RDKit,[53] enabling complex chemical patterns to be matched. The same search tool used for similarity search may be used, in conjunction with the Substructure button. SMARTS pattern searching can be slow, and thus many of these queries will probably end up being run in batch mode. For instance, to find benzylamines, the user would click on Substances in the navigation bar, enter the SMARTS c1ccccc1CN in the Draw/Search Structure bar above the drawing tool, and click on the Substructure button below the drawing tool. To select only compounds available in preparative quantities, the user would click on the subsets popup (the label icon) and click on BB (building blocks).[54] To select only compounds available for immediate delivery, the user would select Now from the same popup.[55]

Multiple Compound Lookup

If the user needs to look up more than a few molecules, a bulk facility can simplify this task. To do this, the user selects Substances from the navigation bar and, using the Resolver (Figure 2B), either selects a file containing molecules to look up or specifies them in the Paste SMILES field. In either case, there should be one molecule per line, which may be a SMILES, CAS number, name, ZINC ID, or an original catalog ID, such as a CHEMBL ID. Options include allowing close matches, looking for analogs, and whether a single match or multiple matches should be returned per input line. The output may be to a Web page or a downloadable file in 11 supported formats (Figure 3A). When ready, the user clicks “Resolve File” to start the process. If more than 300 molecules are specified, the job is automatically run in batch mode.

Figure 3

ZINC reference information. A. ZINC results may be accessed in 11 formats plus the Web pages. Three line-oriented formats are easy to parse for both people and computers. Three machine readable formats provide for rich and flexible data interchange between programs. Five formats provide molecule structures for docking or modeling. Each format is also available compressed using a .gz suffix. B. Compounds are classified by how they may be purchased based on their current catalog membership. There are five primary levels and three derived levels. C. We classify substances into six levels[118] by the most reactive group they possess, based on SMARTS patterns.[56] D. 3D representations are associated progressively with the pH range at which they become relevant for docking.

Chemical Patterns

ZINC calculates and stores over 500 chemical patterns. These patterns support features such as computing a “reactivity” attribute for each molecule and accelerating substructure search. We have calculated matches for 480 PAINS patterns using the RDKit version of the Guha translation[57] of the original SLN format SMARTS.[58] We also include 40 filtering patterns used in the previous version of ZINC for backward compatibility. We have calculated statistics on the prevalence of these functional groups,[59] allowing the most popular and the least popular to be easily identified. We compute a reactivity score (Figure 3C), which classifies each molecule by the “worst” functionality it contains, enabling queries as well as subsets that follow community opinion in the Tranche Browser. Precalculated patterns allow near instant substructure searches. For instance, which PAINS patterns are most common among drugs? To answer this, the user would select Patterns from the navigation bar, click Browse, and then click on the Drugs column heading twice to sort it in descending order. To find purchasable sulfonyl halides, the user would select Patterns from the More menu in the navigation bar and then click Browse. In the lookup field, the user would type “sulfonyl halide” and click on the blue “go” button. The user can see that there are 85487 sulfonyl halides for sale and clicks on the number to view the substances. The user may add the building block or now subsets to narrow the query further.

Rings

Rings are a popular organizing concept in medicinal chemistry. ZINC offers rings as a resource to rapidly identify molecules that contain them. The statistics of occurrence document the popularity of rings by subset. To browse these, the user would select Rings from the navigation bar followed by Browse.[60] Rings may be ordered by their frequency of occurrence in drugs, natural products, or purchasable subsets by clicking on the column heading. The interface allows the user to, for instance, rapidly identify all compounds containing indole rings available in preparative quantities,[61] all investigational compounds containing quinazoline rings,[62] or all compounds containing both pyrimidine and morpholine rings.[63] The latter is interpreted as “all substances containing a morpholine ring, and among these, those where any ring in the compound is pyrimidine”.

Interesting questions may be answered from the molecule detail page alone (Figure 4). To look up an individual drug by name or ZINC ID, click on Substances in the navigation bar, enter its name or code into the text input field above the molecular drawing tool (here we use isoniazid, ZINC1590, as the example), and click Exact (below, left). The molecule detail page contains purchasing information, annotated catalog membership (Figure 4A), biological activity data derived from ChEMBL, biological activity predictions from SEA and ChEMBL, similar interesting molecules (Figure 4B), publications from ChEMBL, chemical patterns, rings, clinical trial information, and more.

Molecules may be downloaded in either 2D or 3D in 11 formats (Figure 3A). Large subsets and slow-to-download ones will be queued and run in batch mode when resources permit. A special class of downloads are subsets of physical property space such as the widely used “lead-like” and “fragment-like” subsets, for which we recommend the Tranche Browser, accessed from the Tranches button in the navigation bar (Figure 5). The Tranche Browser divides physical property space into 121 tranches based on two properties in 11 bins each: the horizontal axis is size (molecular weight) and the vertical axis is polarity (logP). The Tranche Browser allows the user to select the characteristics of the database subset required and then download it, in 2D (SMILES) for chemoinformatics or 3D formats for docking. The user may select or deselect individual tranches by clicking on them or use the Presets selector (top right) to choose a popular subset. For 2D databases, the user may also specify purchasability (Figure 3B) and reactivity (Figure 3C) and two downloading options, format (Figure 3A) and download method. The prospective downloader of 3D databases is faced with further choices. In addition to purchasability and reactivity, 3D users may specify restrictions on net molecular charge and pH range (Figure 3D).
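The 11 × 11 tranche grid can be pictured as two independent binning operations. The Python sketch below assigns a molecule to a (size, polarity) tranche using the standard-library `bisect` module. The bin edges are placeholders chosen for illustration, since the text does not list the edges ZINC uses; only the 11-by-11 layout comes from the description above.

```python
from bisect import bisect_left

# Ten edges give eleven bins. These values are illustrative assumptions,
# not the edges used by the Tranche Browser.
MW_EDGES = [200, 250, 300, 325, 350, 375, 400, 425, 450, 500]
LOGP_EDGES = [-1, 0, 1, 2, 2.5, 3, 3.5, 4, 4.5, 5]


def tranche(mol_weight, logp):
    """Return (size_bin, polarity_bin), each in 0..10.

    A value equal to an edge falls in the lower bin; anything above the
    last edge falls in bin 10.
    """
    return bisect_left(MW_EDGES, mol_weight), bisect_left(LOGP_EDGES, logp)


n_tranches = (len(MW_EDGES) + 1) * (len(LOGP_EDGES) + 1)
print(n_tranches)            # 121 tranches in total
print(tranche(310.0, 2.2))   # a mid-sized, moderately lipophilic molecule
```

A preset such as “lead-like” would then correspond to a rectangular selection of these bins.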

Figure 4

Molecule detail page. A. Showing (1) ZINC ID, name if known, subset membership, (2) properties and 2D depiction, (3) 3D representations if available, (4) purchasing information, and (5) annotated catalog membership, (6) breadcrumbs indicating current location, (7) selection tool for refinement of query, and (8) download tool. B. Interesting bioactive and biogenic analogs section of molecule detail page: (1) similar biogenic compounds, (2) similar bioactive compounds, (3) compounds with a shared scaffold, (4) similar aggregators, and (5) similar purchasable compounds (currently too slow to calculate). A find more button in each case will find more of the same.

Figure 5

Tranche browser for selection and download of chemical libraries. Physical-chemical space has been divided into 11 bins of polarity-hydrophobicity (calculated logP, vertical) (1) and size (molecular weight, horizontal) (2). Subsets of this space may be selected in 2D (SMILES) or 3D (Figure 3A) (3). Purchasability level (Figure 3B) (4) and reactivity (Figure 3C) (5) may also be specified. For 3D only, net molecular charge (6) and pH range (Figure 3D) (7) may also be specified. Presets (8) correspond to community practice (e.g., “lead-like”, “fragment-like”), and Download (9) provides for five methods to access the selected molecules.

Similarity to Interesting Compounds

Knowing that a molecule resembles a drug or natural product can help generate hypotheses of mechanism of action, while the absence of similar annotated compounds might suggest biological novelty. Each substance detail page includes an Analogs section, in which the nearest endogenous human metabolite, any metabolite, natural product, drug, in man compounds, and bioactives are shown, if there is one within a Tanimoto index of 0.6 (512 bit ECFP4 fingerprints). A Find more button may be used to find more. We have already seen earlier how the most similar compounds annotated for an individual gene target, its major target or subclass, or from a particular catalog may also be found.

Download Docking Library

Downloading a 3D screening library for docking is now more flexible yet remains straightforward. For instance, to download the “lead-like” subset for docking, the user selects “Tranches” from the navigation bar and then clicks on 3D (top left) to switch to the 3D downloading tool (Figure 5). From the Preset popup (top right) the user chooses “lead-like”, which sets the molecular weight and logP range of tranches. Individual tranches may be selected or deselected by clicking on them. By default, the purchasability selector (Figure 3B) is set to Wait-OK, which includes both in stock and on demand compounds. The reactivity selector (Figure 3C) is set to mild by default, which includes PAINS and weakly reactive compounds. To obtain a script that downloads the database, the user clicks on the Download icon (top right), where two further choices are possible: the format (Figure 3A) and the download method. The user downloads the script, which may then be run to download the database tranches.

Genes by Target Class

Target classes such as membrane receptors, ion channels, transporters, and enzymes group proteins by function. We have adapted the top two classification tiers in ChEMBL to assign one of 15 major[64] and 42 target subclasses[65] to all genes (Table 2). ZINC may be used to select compounds that are known or predicted to hit particular classes of targets. Thus, for instance, there are 35080 commercially available ligands for Class A GPCRs, and for epigenetic regulator targets, there are 1286 purchasable annotated ligands.[66]

Table 2

Genes, Their Ligands, and Their Purchasability by Major Target Class

		genes		no. of compounds (≤10 μM)
major target class	minor classes	total	1+ for sale (%)	for sale (%)	not for sale
membrane receptor	7	277	214 (77)	7,084 (5)	122,555
transcription factor	2	48	41 (85)	931 (6)	13,552
transporter	4	89	65 (73)	1,525 (8)	16,614
ion channel	3	147	117 (79)	1,180 (8)	13,467
epigenetic regulator	3	86	48 (56)	258 (11)	2,152
enzyme	13	1950	1280 (65)	12,704 (7)	166,195
other	10	613	335 (54)	3,984 (9)	38,546
totals	42	3210	2100 (65)	27,666 (7)	373,081

The number of genes total as well as the number of genes for which one or more compounds active at 10 μM or better was for sale at the time of publication. The number of distinct compounds that are for sale and not for sale for each target class. Cutoff for activity is 10 μM.

ZINC Continues To Grow

ZINC encompasses more of the purchasable and annotated chemical space as the catalogs it includes grow in size and number, currently adding around 50 new vendor and annotated catalogs each year. ZINC is updated continuously. The date each catalog was last updated is available in tabular form[67] and on the detail page of each catalog.[68] Two-dimensional SMILES have been decoupled from 3D models, allowing us to load molecules for which we do not build noncovalent 3D docking models, such as boronic acids and tin- and silicon-containing compounds. ZINC now includes molecules with molecular weight of up to 1000 Da, offering more complete coverage of commercially available chemical space, currently with 220 million molecules, over half of which are for sale. ZINC is now a good way to find molecules even if they are too big to be practical for docking, as long as they are available for purchase. Building blocks available in preparative quantities are now included and are easier to find.

Better 3D Representations for Better Docking

Molecular structures in 3D have been improved in four ways. We now use ChemAxon’s JChem to protonate and prepare biologically relevant tautomers,[69] resulting in an average of 2.2 biologically relevant forms per molecule at physiological pH. ZINC now uses the latest version of Omega (OpenEye Scientific Software, http://eyesopen.com) for improved 3D conformations for docking. We have added PDBQT to SDF, mol2, and our own flexibase formats, for both DOCK37[70] and previous versions. ZINC15 will now generate DUDE-style decoys[71] for any molecule directly from the database.

InChI and InChIKey for Linking to Other Databases

ZINC now provides IUPAC InChI and InChIKeys for every molecule to allow better interoperability with other resources such as Wikipedia, ChEMBL,[11] PubChem,[72] ChemSpider,[73] and UniChem.[74,75] The first part of an InChIKey, which specifies the framework without stereochemistry or protonation, offers a reliable way to find stereoisomers (e.g., for Praziquantel).[76] The canonicalization required by InChIs has reduced molecule duplication in ZINC so that search results are better estimates of their true values.
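As an illustration of the stereoisomer lookup, the sketch below groups InChIKeys by their first hyphen-separated block (14 characters in a standard InChIKey). The keys shown are made-up placeholders with the right shape, not the keys of real compounds, and the helper names are hypothetical.

```python
from collections import defaultdict


def skeleton_block(inchikey):
    """Return the first block of an InChIKey (the part before the first hyphen)."""
    return inchikey.split("-")[0]


def group_by_skeleton(inchikeys):
    """Group keys that share a first block, i.e. candidate stereoisomers."""
    groups = defaultdict(list)
    for key in inchikeys:
        groups[skeleton_block(key)].append(key)
    return dict(groups)


# Placeholder keys: the first two differ only in the second block.
keys = [
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "AAAAAAAAAAAAAA-CCCCCCCCSA-N",
    "ZZZZZZZZZZZZZZ-UHFFFAOYSA-N",
]
groups = group_by_skeleton(keys)
print(groups)
```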

WHO Drug Classification

The Anatomical Therapeutic Chemical Classification System, curated by the World Health Organization, organizes drugs by their therapeutic, pharmacological, and chemical properties. ZINC acquires ATC codes of drugs via ChEMBL20 and DrugBank, allowing drugs to be selected by anatomical, therapeutic, and chemical classes.[77] For instance, the user may find all purchasable drugs for cancer,[78] all dermatologicals of biological origin,[79] and opioid anesthetics such as fentanyl.[80] Integration of ATC codes also allows for more consistent identification of WHO-assigned names for substances.

Clinical Trials

We load clinical trials information from https://clinicaltrials.gov, enabling ZINC to answer questions about these important compounds. For instance, to see the current clinical trials, the user would click on Clinical Trials in the navigation bar and select Current from the selection tool (top right). It is possible to ask to see which compounds are in clinical trials that hit a eukaryotic target at 10 nM or better[81] and also to ask to see compounds in clinical trials for cancer that are sold by Cayman Chemicals.[82]

Literature Links

Publications information from ChEMBL enables the literature to be browsed, active compounds for any paper to be displayed, papers to be found based on which compounds they report on, and many other questions. The user need only enter the PMID of a paper to retrieve all the active structures it reports.[83] Suppose the user would like the active compounds from a paper in 2D or 3D form. If the paper was indexed by ChEMBL, e.g. J. Med. Chem. 2014, 57, 9, the user would click on Publication in the navigation bar and enter either the PMID (here 24684293) or select the journal, year, volume, and page number from the selectors. The user clicks Search to arrive at the page where the two genes and the first five compounds that bind them at 10 μM or better as described in that paper are shown.[84] To download these compounds, use the Download selector (top, center). To find out whether any of the compounds reported in this paper can be purchased, the user would click on Purchasable in the selection tool (top, right). For example, it turns out that the subnanomolar ligand for protein kinase C, ZINC4096162, is available, in both screening and preparative quantities. This compound, in turn, is reported in 15 papers, which are summarized at the bottom of its detail page[85] or listed at the references relation to ZINC4096162.[86]

Application Program Interface (API)

A new API that is almost identical to the Web page URL structure allows ZINC to be scripted and integrated into third-party applications. The API supports 11 formats (Figure A) against 20 resources.[87] Documentation for both the Web page and the API is available using the help[88] and examples[89] endpoints, and the API (URL) syntax is described on our wiki.[90] Machine-readable formats such as JSON, XML, sdf, csv, and txt may be read directly into third party client programs such as Knime,[91] PipelinePilot, Cytoscape,[92] DataWarrior,[93] InstantJChem,[94] and iPython Notebook via Pandas. In many cases, the content of the help pages may also be retrieved in machine-readable format for dynamic scripting. Each resource supports up to ten endpoints, including the relation endpoint, which intersects one relation with another. The subsets endpoint allows the curators to define popular subsets of the resource, which can help simplify query syntax. The supported subsets for each resource are available at the respective subsets endpoint.[95]
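As a rough illustration of how an API that mirrors the Web page URLs might be scripted, the sketch below composes request URLs from a resource, an optional subset, and a format suffix. The path layout (`/<resource>/subsets/<subset>.<format>`), the `count` parameter, and the helper name `build_zinc_url` are assumptions made for illustration and are not taken from the text; the help and examples endpoints describe the actual syntax.

```python
from urllib.parse import urlencode

# Site address as given in the abstract.
BASE_URL = "http://zinc15.docking.org"


def build_zinc_url(resource, subset=None, fmt="json", **params):
    """Compose a listing URL for a resource (hypothetical path layout).

    resource -- resource name, e.g. "substances"
    subset   -- optional curator-defined subset of that resource
    fmt      -- machine-readable format suffix, e.g. "json", "csv", "sdf"
    params   -- extra query parameters (names here are assumptions)
    """
    path = f"/{resource}"
    if subset is not None:
        path += f"/subsets/{subset}"
    url = f"{BASE_URL}{path}.{fmt}"
    if params:
        # Sort for a stable, reproducible query string.
        url += "?" + urlencode(sorted(params.items()))
    return url


if __name__ == "__main__":
    print(build_zinc_url("substances", subset="fda", fmt="json", count=10))
```

Keeping URL construction in one small function means that a client script only has to change the format suffix to switch between, say, JSON for parsing and SDF for downstream modeling.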

Discussion

Three themes emerge from this work. First, a new research tool, ZINC15, is now available. It enables chemists and biologists to answer questions that previously would have required the assistance of a chemoinformatician. Second, ZINC has also been improved for experts, enabling them to integrate its features into their applications using the new API. Third, ZINC has undergone a wide variety of improvements for its original constituency, molecular docking. We take these points in turn.

Nonspecialists may now use ZINC to answer formerly complex questions. This required the design of a new database, the use of new software such as RDKit, a new URL-compatible query language and report generator, and new Web pages designed to simplify complex tasks. The data are structured in 20 resources, which are further divided into subsets. Questions may now be asked not only about molecules but also about genes and their target classes, catalogs, chemical patterns and rings, publications, and clinical trials as well as individual activity data points. Orthologous targets from ChEMBL are now grouped by gene symbol and organized by major classes and subclasses. The system offers focused libraries of known compounds, purchasable or otherwise, organized by gene, subclass, or major target class. ZINC answers questions about chemical novelty and similarity to known drugs, bioactives, and natural products.

ZINC is a platform for research tool development. Chemoinformaticians may now embed ZINC and its tools into their own applications. For instance, ZINC has been integrated into Cytoscape (ZINCytoscape) and R (spelteR). The new API offers a modular interface using industry-standard formats like XML and JSON, and reports are flexible in format and content. Resources and their attributes are fully documented and may also be retrieved in a machine-readable format, allowing the creation of rich and dynamic tools. The interface accepts molecular queries represented not only as SMILES and SMARTS but also as InChI, InChIkey, and even binary fingerprints.

Finally, the virtual screening community can benefit from many innovations and improvements here. Among these are new vendors, new annotated catalogs, improved 3D representations, tranches for more efficient physical property subsets, less duplication, faster and more comprehensive searches by similarity and substructure, annotations grouped by gene and organism class, and search results that may be hundreds or thousands of times larger than before.

ZINC provides a view of biologically precedented and commercially available chemical space organized by genes and the major classes and subclasses to which they belong. For over one-third of genes that have ligands reported in the literature, not a single ligand is for sale, underscoring an urgent need for synthesis to fill gaps in screening libraries.

ZINC retains important limitations. It inherits errors and ambiguity from the catalogs it incorporates, including stereochemical ambiguity, an ongoing challenge with few solutions that are not labor intensive. Whereas our goal is to make the interface capable of creating every query without programming expertise, the ZINC query and report language (API) allows many options for which we have not yet been able to build a point-and-click interface.[88−90] Due to its size and to the generality of questions supported, some queries will take a long time and must be run in batch. Batch mode, meant to handle long-running queries, will not be released until December 2015. ZINC remains a work in progress.

Notwithstanding these limitations, ZINC should interest a broad audience. For vendors, ZINC allows compounds to be annotated for bioactivity and biogenicity, adding value to their library. Synthetic organic chemists may use ZINC to identify neglected metabolites or drugs for synthesis or other bioactives that are not currently purchasable, as well as the building blocks with which to make them. Curators of annotated libraries such as ChEMBL and DrugBank may use ZINC to enhance their offerings with supporting information such as purchasability and biogenicity. Dockers and chemoinformaticians may download commercially available libraries for screening, in 2D or 3D, as well as sets of known actives as controls. Medicinal chemists may use ZINC to compare their discoveries to what is known publicly and then to find purchasable analogs or building blocks to make new libraries. We expect the ZINC tools and libraries to have broad utility in the community.

Methods

ZINC was ported to PostgreSQL version 9.4. The database schema was modified to support new features. New software was written for loading, curating, and querying the database in Python using the chemoinformatics software system RDKit 2014_09_01,[53] the Python structured query language toolkit and object relational mapper SQLAlchemy version 0.9.8,[96] and the Python Web framework Flask version 0.10.1.[97]

Source Catalogs

We loaded catalogs from over 266 commercial vendors and 122 annotated catalogs. Some sources such as HMDB and DrugBank were loaded as several distinct catalogs in ZINC, allowing us to leverage the curation of metabolite origin, such as plant metabolites in HMDB, or drug status, such as investigational drugs in DrugBank. All catalogs in ZINC are categorized by their biogenic and bioactivity status, if any.[95] Only descriptions that characterized the entire catalog contents were applied. For instance, the “Approved” subset of DrugBank was categorized as “World Drugs” since it contains over 100 drugs approved in other countries but not by the FDA, and the “Endogenous” subset of HMDB was categorized as having a biogenic type of “endogenous human metabolite”. Molecules inherit biogenic and bioactive properties from the catalogs they are found in. These values are computed and stored and are accessible in the interface as molecular features. There are four biogenic catalog levels: 1) Endogenous human metabolites, i.e., compounds that are synthesized in man. Interestingly, this may include compounds produced by our bacterial flora; 2) Metabolites of any species, i.e., small molecules that are involved in metabolism, development, and reproduction but not metabolites of xenobiotics; 3) Biogenic compounds, often called natural products; 4) Unknown biogenic status. Likewise, ZINC supports seven levels of bioactivity annotation as follows: 1) FDA approved; 2) World drugs; 3) Investigational, compounds reported to be used in clinical trials; 4) In Man, which includes nutraceuticals, for instance; 5) In vivo, which includes DrugBank experimental compounds that have been in animals; 6) In cells, which includes compounds reported active in cell-based assays; 7) In vitro, compounds active or assumed active at 10 μM or better in a direct binding assay. All other catalogs are marked as having unknown biological activity. 
The categories are ordered to be progressively inclusive within each series, thus all FDA approved drugs are also world drugs and all compounds active in cells are also active in vitro. We annotate as building blocks those catalogs of compounds available in preparative quantities, typically 250 mg or more. Commercial vendors are categorized by the speed and cost of compound acquisition, allowing the best purchasability of every compound to be computed based on its current catalog membership. Catalog categorizations are refined continually by purchasing experience in our lab and reports from colleagues, as follows:[95] 1) In stock, delivery in under 2 weeks, 95% typical acquisition success rate; 2) Procurement agent, in stock, delivery in 2 weeks, 95% typical acquisition success rate; 3) Make-on-demand, delivery typically within 8 to 10 weeks, 70% typical acquisition success rate; 4) Boutique, where the cost may be high but still likely cheaper than making it yourself, 70% typical acquisition success rate.
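The two ordered series described above, progressively inclusive bioactivity levels and purchasability categories ranked by speed and reliability of acquisition, can be sketched as ordered lists. This is a minimal illustration, not the ZINC implementation: the short level labels and the function names are hypothetical, and only the ordering is taken from the text.

```python
# Bioactivity levels, most stringent first, in the order listed in the text.
# The short labels are hypothetical names chosen for this sketch.
BIOACTIVITY_LEVELS = [
    "fda", "world", "investigational", "in-man",
    "in-vivo", "in-cells", "in-vitro",
]

# Purchasability categories, best first, in the order listed in the text.
PURCHASABILITY_LEVELS = ["in-stock", "agent", "on-demand", "boutique"]


def implied_bioactivity(level):
    """Return the given level plus every less stringent level it implies.

    Because the series is progressively inclusive, an FDA-approved drug
    is also a world drug, and a compound active in cells is also counted
    as active in vitro.
    """
    start = BIOACTIVITY_LEVELS.index(level)
    return BIOACTIVITY_LEVELS[start:]


def best_level(catalog_levels, ordering):
    """Pick the best level among a molecule's catalog memberships.

    Returns None when no catalog carries a known level, which corresponds
    to the "unknown" status in the text.
    """
    known = [lvl for lvl in catalog_levels if lvl in ordering]
    return min(known, key=ordering.index) if known else None


if __name__ == "__main__":
    print(implied_bioactivity("in-cells"))
    print(best_level(["on-demand", "in-stock"], PURCHASABILITY_LEVELS))
```

Storing only the best level per molecule, as the text describes for computed molecular features, keeps interactive queries cheap because the inclusive lower levels can be derived from the ordering on demand.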

Catalog Processing

Source catalogs are processed and loaded into the database (2D only) as follows. We harvest tagged values in selected source SDF files. Names and CAS numbers are loaded into a synonyms table, while selected bioactivity and other data are stored in a provided_values table. We convert SDF to SMILES[98] using RDKit and take the largest organic part of the compound (desalting), enumerating up to four stereoisomers from stereochemically ambiguous SMILES using OEChem TK version 1.7 (OpenEye Scientific Software, Santa Fe, NM). Because of the combinatorial problem of ambiguous stereocenters in sterols, we used SMARTS filters to prioritize the most probable implied stereoisomers based on biosynthetic pathways (Prof. Leslie Kuhn, private communication).[99] The SMILES are neutralized with mitools (http://molinspiration.com), which also filters out incorrectly coded molecules. Molecules are loaded using Python/RDKit scripts by attempting to map them to existing ZINC IDs or creating new ZINC substances as necessary, along with any additional required data structures. InChI and InChIkeys are calculated on loading, and the InChIkey is used as a unique constraint in the database. 512 bit Morgan fingerprints with radius 2 (effectively ECFP4) are calculated for each molecule using RDKit.[99]

Model Building

The 3D molecule processing pipeline is now disconnected from the 2D loading process described above. We now use ChemAxon’s package and the command line tool CXCALC to calculate protonation states and tautomers at or near physiologically relevant pH[69] in three pH tranches. These are physiological, covering roughly pH 6.4 to 8.4; high, roughly pH 8.4 to 9.0; and low, roughly pH 5.8 to 6.4. Each protomer is rendered into 3D using JChem’s molconvert (ChemAxon, Budapest, Hungary) and conformationally sampled using Omega[100] (OpenEye Scientific Software, Santa Fe, NM).[101] Atomic charges and desolvation penalties are calculated using AMSOL 7.1[102] and our previously published protocol.[103] Files are formatted for docking as flexibase files,[70,104] mol2,[105] sdf,[106] and pdbqt.[107]
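The three pH tranches can be expressed as a simple lookup. The sketch below assumes the approximate boundaries quoted in the text are treated as hard cutoffs and that the shared boundary values (6.4 and 8.4) fall in the physiological tranche; both choices, and the function name, are assumptions for illustration.

```python
def ph_tranche(ph):
    """Map a pH value to one of the three tranches named in the text.

    Boundaries are the approximate values from the text, treated here as
    hard cutoffs (an assumption). Returns None outside pH 5.8 to 9.0.
    """
    if 6.4 <= ph <= 8.4:
        return "physiological"
    if 8.4 < ph <= 9.0:
        return "high"
    if 5.8 <= ph < 6.4:
        return "low"
    return None


if __name__ == "__main__":
    for value in (6.0, 7.4, 8.8, 10.0):
        print(value, ph_tranche(value))
```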

ChEMBL and Uniprot

We loaded ChEMBL20 into ZINC as follows. We only used targets of type SINGLE PROTEIN and PROTEIN COMPLEX. We process activity annotations for molecular targets, not for whole organisms. We normalized pKi, IC50, EC50, AC50, and pIC50 to a single standard pKi value, which we rounded to two decimal places.[108] We filtered out data flagged with the data_validity_comment field indicating possible problems in the source document. We associate compounds annotated for protein complexes to each of the genes involved in that complex. Two common areas of biology where multigene complexes are observed are the cell surface receptor integrins and the ligand-gated ion channels such as the nicotinic acetylcholine receptor (nAChR). For instance, integrin VLA-1 is an alpha-1/beta-1 heterodimer of two genes, ITGA1 and ITGB1, respectively. Likewise, an nAChR such as (alpha-3)2(beta-4)3 is a heteropentamer of two genes, CHRNA3 and CHRNB4, respectively. In such cases, compounds annotated for the complex are associated with each of the constituent genes. For single proteins, we used Uniprot gene symbols[109] based on the Uniprot accession codes in ChEMBL. Orthologs in the TrEMBL part of Uniprot often did not have assigned gene symbols, in which case we used the Uniprot accession code as a provisional gene name.
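Two steps of this loading procedure lend themselves to a short sketch: putting activities on a common negative-log scale and fanning out complex annotations to constituent genes. The code below assumes a plain −log10 conversion of a molar concentration with no assay-specific correction, which may differ from the normalization actually used (see ref 108); the compound identifiers, dictionary layout, and function names are hypothetical.

```python
import math


def p_value_from_nM(value_nM):
    """Convert a concentration in nM to -log10(molar), rounded to 2 places.

    For example, 10 uM (10000 nM) maps to 5.0 and 10 nM maps to 8.0.
    """
    if value_nM <= 0:
        raise ValueError("concentration must be positive")
    return round(9.0 - math.log10(value_nM), 2)


# Complexes and their constituent genes, using the examples from the text.
COMPLEX_GENES = {
    "integrin VLA-1": ["ITGA1", "ITGB1"],
    "nAChR (alpha-3)2(beta-4)3": ["CHRNA3", "CHRNB4"],
}


def expand_to_genes(annotations, complex_genes):
    """Associate each annotation with every gene of its target.

    annotations   -- iterable of (compound_id, target, p_value) tuples
    complex_genes -- mapping from complex name to constituent gene symbols
    A target that is not a known complex is treated as a single gene.
    """
    rows = []
    for compound_id, target, p_value in annotations:
        for gene in complex_genes.get(target, [target]):
            rows.append((compound_id, gene, p_value))
    return rows


if __name__ == "__main__":
    data = [
        ("CMPD-1", "integrin VLA-1", p_value_from_nM(250)),  # hypothetical ID
        ("CMPD-2", "PRKCA", p_value_from_nM(10)),            # hypothetical ID
    ]
    for row in expand_to_genes(data, COMPLEX_GENES):
        print(row)
```

A consequence of the fan-out is that a single complex annotation contributes one row per constituent gene, so per-gene ligand counts for integrins and ligand-gated ion channels include compounds measured only against the assembled complex.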

Major Classes and Subclasses

We assigned target classes and subclasses based on the first two subfields of the protein_class field of the protein_classification table of ChEMBL. In this version of ZINC there are 42 subclasses grouped into 15 major target classes: membrane receptor, ion channel, transporter, transcription factor, enzyme, epigenetic, and 9 other catch-all classes for the few cases when none of these fit.

Fingerprints

We computed 512 bit fingerprints using the Morgan algorithm with radius 2 as implemented in RDKit. Stereoisomers and some very near neighbors have identical fingerprints, resulting in approximately 50% fewer fingerprints than substances, on average. We grouped fingerprints into three classes, interesting, current, and benched for faster searching. Queries that limit their results to annotated compounds need only search fingerprints in the interesting subsets, while benched fingerprints, corresponding to compounds not in any current catalog, are never searched.

Parallel Similarity Search

We have implemented a very general chemical search API that automatically parallelizes chemical search queries using Python green threads to search in parallel increments of 1 million molecules at a time. Executed on a 64-core computer, we often see full database SMARTS searches completing in 30 s or less, although SMARTS can be of almost unlimited complexity, and some queries will certainly take far longer. Similarity searches also often only take a few seconds of wall clock time. Those over interesting compounds often only take a second. We support both Dice and Tanimoto coefficients as implemented in RDKit.
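The idea of searching the database in fixed-size increments and scoring each candidate with a Tanimoto or Dice coefficient can be sketched in pure Python. This is an illustration under stated assumptions, not the ZINC code: fingerprints are modeled as Python integers used as bit sets rather than RDKit objects, a standard-library thread pool stands in for the green threads mentioned in the text, and the tiny chunk size replaces the 1 million molecule increments.

```python
from concurrent.futures import ThreadPoolExecutor


def popcount(bits):
    """Number of set bits in an integer used as a bit-vector fingerprint."""
    return bin(bits).count("1")


def tanimoto(a, b):
    """Tanimoto coefficient: |A and B| / |A or B|."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def dice(a, b):
    """Dice coefficient: 2 |A and B| / (|A| + |B|)."""
    total = popcount(a) + popcount(b)
    return 2 * popcount(a & b) / total if total else 0.0


def search_chunk(query, chunk, threshold, metric):
    """Score one chunk of (name, fingerprint) pairs against the query."""
    hits = []
    for name, fingerprint in chunk:
        score = metric(query, fingerprint)
        if score >= threshold:
            hits.append((name, score))
    return hits


def parallel_search(query, library, threshold=0.5, metric=tanimoto,
                    chunk_size=2, workers=4):
    """Split the library into chunks, search them concurrently, merge hits."""
    chunks = [library[i:i + chunk_size]
              for i in range(0, len(library), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(
            lambda chunk: search_chunk(query, chunk, threshold, metric),
            chunks))
    hits = [hit for part in parts for hit in part]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    library = [("m1", 0b1111), ("m2", 0b0111), ("m3", 0b10000000)]
    print(parallel_search(0b1111, library))
```

In CPython a thread pool does not speed up this CPU-bound scoring; the sketch only shows the chunk-and-merge structure, which pays off when each chunk is dispatched to a database or worker that does the heavy lifting.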

Molecular Features

Features label molecules with computed properties, often derived from catalog membership, that would be prohibitive to calculate interactively. There are four biogenic class annotations: biogenic (natural products), metabolites, endogenous human metabolites, and unknown. There are eight bioactivity classes, which include drugs: FDA approved, world drugs, investigational, in man, in vivo, in cells, in vitro, and unknown. ZINC also supports aggregators as an annotation.

PAINS and Other Patterns

There has been considerable interest in pan-assay interference (PAINS)[58] SMARTS patterns recently. We used the RDKit version[53] of the Guha translation[57] of the original 480 PAINS expressed in Sybyl Line Notation (SLN).[58] All molecules in ZINC have been annotated and are searchable by PAINS and other SMARTS patterns. We compute a reactivity molecular property from the pattern membership of each molecule. The reactivity categories are A) anodyne; B) clean (PAINS-ok); C) mild (weakly reactive, typically as a nucleophile or electrophile); D) reactive; E) unstable or irrelevant for screening. For E, we do not build molecular models for noncovalent docking (e.g., boronic acids). We also curated 40 patterns used by the prior version of ZINC. SMARTS patterns are rendered using SMARTSViewer.[110]

Interface

The Web site and API were coded in Python using RDKit,[53] SQLAlchemy,[96] and Flask.[97] The RDKit to SQLAlchemy interface was inspired by Razi.[111] Celery[112] was used and adapted with our own code for job scheduling. A curator’s tool (zincmanage) is used to load, update, and curate the database. A command line interface (zinccli) provides a Unix-shell-like interface for additional testing and curation.

Sterol Rings

The stereochemical ambiguity problem is particularly acute in sterol rings, but since these are almost always biological in origin and are derived from the sterol biosynthetic pathway, sensible guesses of stereochemistry are reasonable. We have therefore created a special sterol processing pipeline for loading molecules into ZINC. We thank Prof. Leslie Kuhn for drawing our attention to this and for providing SMARTS patterns and advice. We used mitools (http://molinspiration.com) to extract ring systems from every ZINC molecule and loaded them into the database. We calculate static counts of the number of molecules that have biogenic or bioactivity annotations, allowing rapid reports of approximate counts of numbers of qualifying compounds per ring.

References

We built an interface to the docs table in ChEMBL20 and integrated it into the docs resource on the molecule detail, gene detail, and target detail pages.

Substance Names

We attempt to identify names for substances in ZINC from WHO-assigned names via the ATC, ChEMBL molecule names, and synonyms extracted from HMDB, DrugBank, TTD, and other catalogs. We loaded all clinical trials from https://clinicaltrials.gov. Interventions and conditions are also loaded as linked resources. All drug or dietary supplement interventions are then queried using a free text search against substance names to map the indications to the corresponding substances in ZINC.

API Design

The Web site may be formally described as an ensemble of endpoints.[113] There are five classes of endpoints, comprising ten static endpoints and an almost unlimited number of dynamic ones. The five endpoint classes are list, detail, relation, subsets, and special. Thus, for example, the substances help endpoint[88] provides guidance on how to find substances of interest, a table provides a list of available catalogs in ZINC and the time of their last update,[114] and another shows the available subsets of genes in ZINC per ChEMBL20.[115] The major classes home endpoint[116] provides an overview of target classes in ZINC and gives examples of how to query and select individual observations of compound-target affinities from ChEMBL, with or without additional purchasability constraints.[117]
