Christopher A Lipinski, Nadia K Litterman, Christopher Southan, Antony J Williams, Alex M Clark, Sean Ekins.
Abstract
The availability of structures and linked bioactivity data in databases is powerfully enabling for drug discovery and chemical biology. However, we now review some confounding issues with the divergent expansions of public and commercial sources of chemical structures. These are associated not only with expanding patent extraction but also with increasingly large vendor collections amassed via different selection criteria between SciFinder from Chemical Abstracts Service (CAS) and major public sources such as PubChem, ChemSpider, UniChem, and others. These increasingly massive collections may include both real and virtual compounds, as well as so-called prophetic compounds from patents. We address a range of issues raised by the challenges faced in resolving the NIH probe compounds. In addition, we highlight the confounding of prior-art searching by virtual compounds that could impact the composition-of-matter patentability of a new medicinal chemistry lead. Finally, we propose some potential solutions.
Year: 2014 PMID: 25415348 PMCID: PMC4360371 DOI: 10.1021/jm5011308
Source DB: PubMed Journal: J Med Chem ISSN: 0022-2623 Impact factor: 7.446
Summary Statistics for the Public and Commercial Chemistry Databases above or near Half a Million Structures (at the Time of Writing), Most of Which Include Linkages to Bioactivity and Biological Data
| name | total (million) | URL | notes |
|---|---|---|---|
| GDB13 | 977 | | Virtual compounds, no bioactivity data |
| SciFinder | 89 | | Includes 28 million vendor compounds |
| UniChem | 71 | | Includes 15 million SureChEMBL from patents |
| PubChem | 53 | | Includes 42 million vendor compounds and 15 million from patents |
| CSLS | 46 | | Update status unclear |
| ChemSpider | 32 | | Includes 12 million vendor compounds |
| Reaxys | 25 | | 5.1 million medicinal chemistry data |
| ZINC | 23 | | All vendor compounds, 8.1 million in PubChem |
| GOSTAR | 6.3 | | Activity linked |
| Thomson Pharma | 4.3 | | Counted inside PubChem |
| Liceptor | 3.2 | | |
| ChEMBL | 1.4 | | 0.94 million inside PubChem |
| BindingDB | 0.45 | | |
Note that apart from the three sources that have update cycles within PubChem (Thomson Pharma, ChEMBL, and BindingDB) all the others are likely to have at least a proportion of unique content (e.g., extractions from different journal articles).
Figure 1. The “usual suspects” lineup, representing molecules of different classes from public and commercial databases, illustrating the difficulty of selecting desirable ones. From left to right, the documented probe is ML010 (CID 17757274), the drug is valsartan (CID 60846), a prophetic compound is from CAS 1164083-19-5 from WO 2001056358 (not in PubChem or ChemSpider),[42] a text extracted compound is from US20120040982[17] (CID 57498937), and one of the probes with incomplete data linkage is ML160 (CID 824820).
Figure 2. Chemical structures for 322 NIH MLP probes (http://molsync.com/demo/probes.php) have been clustered into 44 groups for visualization purposes, using ECFP_6 fingerprints[58] and a Tanimoto similarity threshold of >0.11 for cluster membership. The threshold was chosen empirically in order to show a representative selection of the kinds of molecules found within the set of probes. For each cluster, a representative molecule is shown (selected by picking the structure within the cluster with the highest average similarity to other structures in the same cluster). The clusters are decorated with semicircles, which are colored blue for compounds that were considered high confidence based on our medicinal chemistry due diligence analysis and red for those that were not. This analysis suggests that there is no obvious correlation between structural composition and whether a compound passes the medicinal chemist’s logic.[30] Circle area is proportional to cluster size, and singletons are represented as a dot.
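The clustering and representative-selection steps described in the Figure 2 caption can be illustrated with a minimal Python sketch. Several things here are assumptions rather than the authors' method: the caption does not name the clustering algorithm, so single-linkage grouping (any pair above the threshold joins the same cluster) is used purely for illustration; the function names `tanimoto`, `cluster_by_threshold`, and `representative` are hypothetical; and the toy integer feature sets stand in for real ECFP_6 fingerprints, which would normally be generated by a cheminformatics toolkit.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity of two feature sets: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def cluster_by_threshold(fps: dict, threshold: float) -> list:
    """Group molecules by single linkage (an assumed algorithm):
    any pair with similarity > threshold ends up in the same cluster."""
    parent = {name: name for name in fps}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(fps, 2):
        if tanimoto(fps[a], fps[b]) > threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in fps:
        groups.setdefault(find(name), []).append(name)
    # Largest clusters first; names sorted for a stable ordering.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


def representative(cluster: list, fps: dict) -> str:
    """Pick the member with the highest average similarity to the
    other members of the same cluster, as the caption describes."""
    if len(cluster) == 1:
        return cluster[0]

    def mean_similarity(name):
        others = [m for m in cluster if m != name]
        return sum(tanimoto(fps[name], fps[m]) for m in others) / len(others)

    return max(cluster, key=mean_similarity)


# Hypothetical toy fingerprints (sets of feature identifiers).
toy_fps = {
    "mol_a": frozenset({1, 2, 3, 4}),
    "mol_b": frozenset({1, 2, 3, 5}),
    "mol_c": frozenset({1, 2, 3, 4, 5}),
    "mol_d": frozenset({10, 11, 12}),
}

clusters = cluster_by_threshold(toy_fps, threshold=0.11)
reps = [representative(c, toy_fps) for c in clusters]
```

With these toy sets, `mol_a`, `mol_b`, and `mol_c` form one cluster with `mol_c` as its representative, while `mol_d` is a singleton. A threshold as low as 0.11 links even weakly similar structures, which is consistent with the caption's note that it was chosen empirically for visualization rather than as a measure of chemical relatedness.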
CIDs from Selected Sources without Exact Structure Matches in SciFinder (November 2014)
| CID | source |
|---|---|
| CID 56593118 | ML226 probe inhibitor of lysophospholipase 1 [AID: 2202] |
| CID 46905036 | ML233 probe agonist of the APJ receptor [AID: 2580] |
| CID 53301938 | ML258 probe inhibitor of Bcl-B [AID: 720677] |
| CID 45100448 | ML179 probe inverse agonist of LRH-1 [AID: 504933] |
| CID 70789094 | ML353 probe modulator of mGlu5 [AID: 686927] |
| CID 71819646 | |
| CID 71819647 | |
| CID 77014274 | |
| CID 78243694 | |