
Parallel worlds of public and commercial bioactive chemistry data.

Christopher A Lipinski, Nadia K Litterman, Christopher Southan, Antony J Williams, Alex M Clark, Sean Ekins.

Abstract

The availability of structures and linked bioactivity data in databases is powerfully enabling for drug discovery and chemical biology. However, we now review some confounding issues with the divergent expansions of public and commercial sources of chemical structures. These are associated with not only expanding patent extraction but also increasingly large vendor collections amassed via different selection criteria between SciFinder from Chemical Abstracts Service (CAS) and major public sources such as PubChem, ChemSpider, UniChem, and others. These increasingly massive collections may include both real and virtual compounds, as well as so-called prophetic compounds from patents. We address a range of issues raised by the challenges faced resolving the NIH probe compounds. In addition we highlight the confounding of prior-art searching by virtual compounds that could impact the composition of matter patentability of a new medicinal chemistry lead. Finally, we propose some potential solutions.

Year:  2014        PMID: 25415348      PMCID: PMC4360371          DOI: 10.1021/jm5011308

Source DB:  PubMed          Journal:  J Med Chem        ISSN: 0022-2623            Impact factor:   7.446


Chemistry and Bioactivity Data: From Famine to Feast to Overload

It is hard to imagine now that in the early 2000s there was a dearth of publicly accessible chemistry and bioactivity data. Yet in the decade since the appearance of the large publicly accessible PubChem[1] and ChEBI databases[2] we are arguably approaching an era of drug-discovery-related data overload, as high-throughput data generation populates increasingly large databases.[3] Having just passed 53 million compounds, PubChem[4] has undoubtedly made the largest aggregated contribution to public or open chemistry and biology data, collating thousands of assay results against cells or biological targets for 2 million compounds. This will soon be complemented not only by the European Lead Factory,[5] which will focus on high throughput screening (HTS) and data generation, but also by a Knowledge Management Center that will capture data from the National Institutes of Health (NIH) “Illuminating the Druggable Genome” (IDG) program.[6] When we consider the availability of additional large chemical or biology related databases such as ChemSpider,[7] ChEMBL,[8] UniChem,[9] BindingDB,[10] and BARD,[11] as well as the emergence of Google as a de facto merged chemistry source,[12] two aspects come into focus. The first is that the era of “multistop datashops” (and the essential big data integration challenges it presents) is here to stay. The second is that public and commercial chemistry and bioactivity data sources will increasingly diverge. Users thus need to compare content among the public sources but can only guess the proportion of unique structures in the commercial ones (since, by definition, the commercial sources do not openly benchmark themselves against each other or against the public sources). Consequently, it is our view that commercial chemistry databases like SciFinder[13] from Chemical Abstracts Service (CAS) will be unable to keep pace with the totality of public chemistry data.
It should be noted, however, that the commercial databases ensure high curation quality[14] of their largely manually extracted data, with the assistance of software tools. The public domain resources, beyond their submission filtration pipelines, depend on the quality of the depositing sources (analogous to the case with GenBank for primary sequence data). Multiple reviews of public domain data sources indicate that, in the main, data quality issues arise that are independent of the submitter.[15] Logically, comparative database quality metrics need to be generated and reported by a completely independent party. This would need a sampling strategy agreed by all those sources (commercial and public) prepared to participate in such a benchmarking study. Our view that commercial chemistry databases will be unable to keep pace is especially supported as patent data continue to become openly available. For example, IBM has deposited 2.6 million extracted patent compounds, SCRIPDB 6.6 million, and SureChem 9.4 million into PubChem. The European Bioinformatics Institute (EBI) recently acquired the SureChem operation and will expand the extraction pipeline to populate SureChEMBL at EBI.[16] Efforts by a number of groups to extract chemical structures and content from patents and uncurated scientific papers[17] open up even more automated data flows whose scale precludes human verification.[18] In addition, there are close to a billion virtual molecules in databases like the chemical universe database, GDB,[19] and at the other end of the scale are relatively small repositories of real molecules that may never appear in the pages of a journal.[20] A basic survey of some of the larger chemistry and biology data resources we are aware of is shown in Table 1 and highlights some of the differences in content (vendor compounds, virtual compounds, etc.).
It is important to put the scale of these big databases into context by considering that we are likely far from having a database of all possible chemistry, since a single simple empirical formula could potentially result in hundreds of millions of molecules with the same atomic composition.[21]
Table 1

Summary Statistics for the Public and Commercial Chemistry Databases above or near Half a Million Structures (at the Time of Writing), Most of Which Include Linkages to Bioactivity and Biological Data [a]

name | total (million) | URL | notes
GDB13 | 977 | http://www.gdb.unibe.ch/gdb/home.html | Virtual compounds, no bioactivity data
SciFinder | 89 | http://www.cas.org/products/scifinder | Includes 28 million vendor compounds
UniChem | 71 | https://www.ebi.ac.uk/unichem/ | Includes 15 million SureChEMBL from patents
PubChem | 53 | https://pubchem.ncbi.nlm.nih.gov/ | Includes 42 million vendor compounds and 15 million from patents
CSLS | 46 | http://cactus.nci.nih.gov/cgi-bin/lookup/search | Update status unclear
ChemSpider | 32 | http://www.chemspider.com/ | Includes 12 million vendor compounds
Reaxys | 25 | http://www.elsevier.com/online-tools/reaxys | 5.1 million medicinal chemistry data
ZINC | 23 | http://zinc.docking.org/ | All vendor compounds, 8.1 million in PubChem
GOSTAR | 6.3 | https://gostardb.com/gostar/ | Activity linked
Thomson Pharma | 4.3 | http://www.thomson-pharma.com/ | Counted inside PubChem
Liceptor | 3.2 | http://www.evolvus.com/products/databases/liceptordatabase.html |
ChEMBL | 1.4 | https://www.ebi.ac.uk/chembl/ | 0.94 million inside PubChem
BindingDB | 0.45 | http://www.bindingdb.org/bind/index.jsp |

[a] Note that apart from the three sources that have update cycles within PubChem (Thomson Pharma, ChEMBL, and BindingDB) all the others are likely to have at least a proportion of unique content (e.g., extractions from different journal articles).

A previous study has compared the content of several public and commercial databases.[22] This showed that the commercial databases captured a significant proportion of unique content and suggested they were complementary. However, even with the massive amount of public and commercial chemistry and bioactivity data now available in the various databases, finding the necessary information effectively remains difficult. For example, it is challenging to use molecular structures alone to ascertain whether biology or screening data are already associated with them, whether they are desirable as chemical probes or lead compounds, and whether they are novel for the purposes of patent claims.[23] Differentiating between those molecules (the “usual suspects”) known to have liability or reactivity issues,[24] approved drugs,[25] useful probes,[26] prophetic compounds,[27] text abstracted compounds, and nominal probes with no provenance links in database records is certainly now more complicated (Figure 1). This difficulty can be seen for even small defined sets of compounds, such as the National Center for Advancing Translational Sciences (NCATS) molecules for repurposing[28] or the NIH Molecular Libraries Program (MLP) probes.[29] The NIH MLP probes were initially the subject of a crowdsourcing analysis in which 11 scientists scored an initial set of 64 probes as acceptable or not based on their own criteria.[26] This work has recently been greatly extended to 322 NIH MLP probes (at the time of this study) using the selection criteria of a single medicinal chemist.[30]
Figure 1

The “usual suspects” lineup, representing molecules of different classes from public and commercial databases, illustrating the difficulty of selecting desirable ones. From left to right, the documented probe is ML010 (CID 17757274), the drug is valsartan (CID 60846), a prophetic compound is from CAS 1164083-19-5 from WO 2001056358 (not in PubChem or ChemSpider),[42] a text extracted compound is from US20120040982[17] (CID 57498937), and one of the probes with incomplete data linkage is ML160 (CID 824820).

The data needs for medicinal chemistry differ from those of biology. While both medicinal chemists and biologists seek high quality biology data to support their target choices, medicinal chemists also require information on freedom to operate by searching the literature for compounds identical to, similar to, or that are substructures of their leads, a search process we call “medicinal chemistry due diligence”. The NIH MLP probe reports are stored on PubChem and describe one or several probes with detailed, rich biology but lack sufficient information for the medicinal chemist. We explored what a medicinal chemist might do in the early stages of medicinal chemistry due diligence using, as an example, the NIH MLP probes. Currently the most widely used and complete source of literature relevant to a putative lead resides in the CAS databases, very often accessed through the SciFinder software. We uncovered significant obstacles that a medicinal chemist would face trying to translate public sector probe discovery into a typical medicinal chemistry due diligence search. Attempting to track the status and provenance of this set of NIH MLP probes[30] (which we have clustered to simplify the visualization, Figure 2) exemplifies the complexity of linking current biology and chemistry data[30] and led directly to this review. For this reason we have used this set to discern trends that are reflected in the wider database “ecosystem”, which will now be described as examples.
For readers who are not seasoned explorers of these databases, we hope that this will be enlightening before any future quest to find relevant information. For readers who have encountered these same issues, we hope that this increased attention will raise awareness among those involved in curating and funding such databases and that solutions will follow in due course.
Figure 2

Chemical structures for 322 NIH MLP probes (http://molsync.com/demo/probes.php) have been clustered into 44 groups for visualization purposes, using ECFP_6 fingerprints[58] and a Tanimoto similarity threshold of >0.11 for cluster membership. The threshold was chosen empirically in order to show a representative selection of the kinds of molecules found within the set of probes. For each cluster, a representative molecule is shown (selected by picking the structure within the cluster with the highest average similarity to other structures in the same cluster). The clusters are decorated with semicircles, colored blue for compounds that were considered high confidence based on our medicinal chemistry due diligence analysis and red for those that were not. This analysis suggests that there is no obvious correlation between structural composition and whether compounds pass the medicinal chemist’s logic.[30] Circle area is proportional to cluster size, and singletons are represented as a dot.
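
The clustering described in this caption can be illustrated with a minimal sketch. It assumes each fingerprint is already available as a set of integer feature identifiers; the toy sets below are invented and are not real ECFP_6 output. The greedy grouping against each cluster's first member is also an assumption, since the caption does not state which clustering algorithm was used. Only the Tanimoto coefficient, the >0.11 threshold, and the highest-average-similarity rule for the representative come from the caption.

```python
from typing import Dict, FrozenSet, List

Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared features divided by all distinct features."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def greedy_clusters(fps: Dict[str, Fingerprint], threshold: float = 0.11) -> List[List[str]]:
    """Put each molecule in the first cluster whose seed (first member) it
    resembles with similarity > threshold; otherwise start a new cluster.
    The greedy scheme is an illustrative assumption."""
    clusters: List[List[str]] = []
    for name, fp in fps.items():
        for members in clusters:
            if tanimoto(fp, fps[members[0]]) > threshold:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def representative(members: List[str], fps: Dict[str, Fingerprint]) -> str:
    """Pick the member with the highest average similarity to the other members."""
    if len(members) == 1:
        return members[0]

    def mean_similarity(name: str) -> float:
        others = [m for m in members if m != name]
        return sum(tanimoto(fps[name], fps[o]) for o in others) / len(others)

    return max(members, key=mean_similarity)


# Toy fingerprints with hypothetical feature identifiers.
toy = {
    "mol_a": frozenset({1, 2, 3, 4}),
    "mol_b": frozenset({1, 2, 3, 4, 5}),
    "mol_c": frozenset({2, 3, 4, 5, 6}),
    "mol_d": frozenset({10, 11, 12}),
}
clusters = greedy_clusters(toy)
representatives = [representative(c, toy) for c in clusters]
```

With a threshold as permissive as 0.11, molecules sharing only a small fraction of features fall into the same group, which is consistent with the caption's note that the value was chosen for visualization and not for tight structural similarity.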


Example 1. Complexities in Finding the NIH MLP Probes in PubChem

With just a few exceptions as we shall describe, NIH MLP probe compounds can be identified from NIH’s PubChem Web-based book[29] summarizing 5 years of probe discovery efforts. A probe compound is defined as essentially an excellent lead compound: very high binding affinity and ideally a well understood binding mode, high selectivity, good solubility, and low toxicity.[26] MLP probes are identified by a Molecular Libraries (ML) number and by a PubChem compound identification (CID) number that can be readily found by searching the NIH probe book.[29] Compared to many peer reviewed published formats, the NIH probe book is exemplary in being concise, but also information rich in both chemistry and biology. Subheadings across probe reports illuminate the importance and utility of each compound. Extensive out-linking (provided these do not decay) also adds to the user-friendliness of the reports. However, while some reports cover the medicinal chemistry aspects well, others are only designated by the PubChem substance identification (SID) number, which requires added effort to find the salient chemistry details. In this case, the probe is primarily characterized by a biological activity and SID link. Also, it was found that searching certain ML numbers listed in the book would not retrieve a CID in PubChem. In addition, a detailed Excel spreadsheet summary (WebTable 121012.xlsx) found on the NIH MLP Web site[31] contained data on only about two-thirds of the probes.[32] It appears that the concise organization found in the more recent probe reports may have been lacking at the outset. A few compounds were also identified that were originally described as NIH MLP probes but for which there is no probe report. 
We have recently compiled and shared the available information on the 322 NIH MLP probes we were able to resolve in an easily searchable collection available on Collaborative Drug Discovery (CDD)’s public database[33] as a free resource for the community[34] as well as elsewhere.[35]

Example 2. Identifier and Structure Searches in SciFinder Reveal an Extreme Disclosure

As we move beyond the NIH MLP probes to other databases to find more data on these or other compounds, we encounter further issues. The process of converting CID identifiers to CAS registry numbers can be used to obtain a summary of the number of literature references in SciFinder, and this identifier conversion is essential to medicinal chemistry due diligence. For example, when a high throughput screening (HTS) hit becomes of potential value in lead optimization, it is essential to conduct exact, substructure, and similarity searches on it. Literature descriptions of structure–function relationships are of value even if the prior literature report on the chemistry is in a very different field of biology to the current interest. There is a fundamental explanation for this observation, as diverse targets are under evolutionary pressure to interact with common signaling ligands.[36] In this sense ligand chemistry (at least for orthosteric ligands) is more conserved than target structure. This finding, coupled with the known conservation of biology target motifs,[37] is consistent with the knowledge that similar chemistry motifs tend to recur across varied biology. Computationally, this observation is also found in the RECAP technology[38] in which known drugs are fragmented and chemistry motifs are reconnected in new patterns to give new and often unexpected biological activities. These connected observations are also relevant to the behavior of medicinal chemists, who have been characterized, we think incorrectly, as conservative because they often tend to use and reuse the same chemical motifs in the compounds they make.[39] Rather, we think this medicinal chemistry behavior is better characterized as pragmatic as professional survival depends on creating compounds to meet project goals, and the use and reuse of chemical motifs previously shown to have useful biological activity are a proven successful strategy. 
SciFinder’s use of SMILES input rather than InChI or InChIKey preserves tautomeric structure information, which could be important for medicinal chemistry analysis and patent law, where tautomer structure can be critical. It is interesting that the SciFinder choice is consonant with the same selection for the Journal of Medicinal Chemistry digital structure capture.[40] Structure searches within SciFinder are subject to the well-known issues more broadly associated with chemistry structure drawing, including problems with stereochemical depiction, unclear double bond geometry, and unclear links between free base and salt forms.[15,41] When SciFinder refuses a structure search because of stereo bond depiction problems, the structure can be edited to remove stereo information from the offending bonds, and the correct structure must then be deduced from the pattern of literature citations. If the structure search within SciFinder fails to find a CAS registry number, the search can be repeated as a similarity search to ensure that the registry number was not missed because of a salt form. Once the CAS registry number is found, the total number of literature references with biological activity captured in SciFinder can be retrieved. It is at this point that any reference to the 2009 Goldfarb U.S. patent application on life extension in eukaryotic organisms (US 20090163545[42]) should be noted. US20090163545[42] contains a data table (Figure 16 of the patent[42]) on 499 compounds with PubChem substance IDs. However, SciFinder abstracts 6018 substances. How can this be? The patent includes the phrase, referring explicitly to PubChem assay ID (AID) 775, “the contents of which are herein incorporated in their entirety by reference”. This is full data disclosure taken to an extreme via subsumption of public HTS data into a patent by reference.
While only 5796 substances from the HTS were referenced as “use” substances in SciFinder, 132 781 compounds were specified in the HTS (i.e., 32% of the entire Molecular Libraries screening collection, MLSMR). Thus, while this may be an exceptional patent abstraction example in SciFinder, it nonetheless illustrates how intellectual property (IP) due diligence searching can be confounded. Across the set of 322 NIH MLP probes, 72 intersect with the CIDs from AID 775, so a significant proportion will also intersect with the US20090163545 exemplifications. We were initially worried that a reference to this patent application was somehow an indicator for a flawed or promiscuous compound. We now believe the prevalence of references to this single patent application is an example of how complete data disclosure can lead to unexpected and potentially harmful consequences when performing IP due diligence.
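
The overlap reported here is a plain set intersection on compound identifiers. A minimal sketch follows, assuming the probe CIDs and the AID 775 CIDs have each been exported to a Python set. The function name and the tiny example sets are hypothetical; only the counts 72 and 322 come from the text.

```python
from typing import Set, Tuple


def overlap(probe_cids: Set[int], assay_cids: Set[int]) -> Tuple[int, float]:
    """Return how many probes also appear in the assay, and that count
    as a fraction of all probes."""
    shared = probe_cids & assay_cids
    fraction = len(shared) / len(probe_cids) if probe_cids else 0.0
    return len(shared), fraction


# Invented example: 2 of 4 probe CIDs also appear in the assay.
count, fraction = overlap({101, 102, 103, 104}, {102, 104, 999})

# Using the counts reported in the text: 72 of 322 probes intersect AID 775.
reported_percent = round(100 * 72 / 322, 1)
```

With the reported counts this comes to roughly 22%, so about one probe in five would retrieve the patent application in a SciFinder reference search.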

Example 3. The Parallel Worlds of Commercial and Public Database Disclosure Do Not Completely Intersect

We expected that the chemical structures of all the NIH MLP probes would be abstracted by SciFinder. This proved not to be the case, raising the possibility of two parallel worlds of disclosure: the proprietary commercial database world of chemistry data abstracted by SciFinder and another data rich world of publicly available and predominantly NIH funded chemistry and biology screening data, largely in Web format but not abstracted by SciFinder. Three CID examples are provided in Table 2, including one of the NIH MLP probes and two Web-only provenanced bioactive structures.
Table 2

CIDs from Selected Sources without Exact Structure Matches in SciFinder (November 2014)

CID | source
CID 56593118 | ML226 probe inhibitor of lysophospholipase 1 [AID: 2202]
CID 46905036 | ML233 probe agonist of the APJ receptor [AID: 2580]
CID 53301938 | ML258 probe inhibitor of Bcl-B [AID: 720677]
CID 45100448 | ML179 probe inverse agonist of LRH-1 [AID: 504933]
CID 70789094 | ML353 probe modulator of mGlu5 [AID: 686927]
CID 71819646 | http://opensourcemalaria.org/, open source antimalarial active
CID 71819647 | http://opensourcemalaria.org/, open source antimalarial active
CID 77014274 | http://www.chemotion.net/, open chemistry publishing
CID 78243694 | http://www.chemotion.net/, open chemistry publishing

If this trend were to continue, intellectual property due diligence would be rendered even more difficult, requiring searching of multiple parallel disclosure formats at the same time.[23] Other intellectual property/legal due diligence issues may arise from the parallel worlds of public and commercial data. Much of the data input into public chemistry databases comes from deposition of massive numbers of compounds from chemical vendors (previously termed “vendor dilution effect” because only a minority of these compounds can be linked to bioactivity data[43]), many of them suffering from significant quality issues in structure representation as evidenced by our experiences. For example the ChemSpider[7,44] database required processing millions of chemical compounds for deposition, some of which had quality issues that required removal. Such data quality issues continue to plague chemistry databases and require vigilance.[15,41b] From previous work with a chemistry compound vendor, we estimate that at least half of “commercially available” compounds have never been made but rather are compounds that suppliers think can be made and that are listed as available in an attempt to elicit customer interest. These are commonly known as “make-on-demand” (MOD) compounds and are segregated in databases such as the ZINC database.[45] Most such compounds are identified by a chemical structure depiction and are annotated with some type of database identifier, but no other experimental data on the chemical depicted by the chemical structure drawing exist. On the basis of spot checks, about one-third of such low data value compounds found in PubChem do not appear in the CAS registry system.[46] For low data value compounds, the lack of abstraction by CAS can be viewed in a positive light, since abstracting such compounds could dilute the value of those abstracted real compounds, which are associated with experimental data. 
SciFinder had previously initiated abstraction of data from the ChemSpider database and had deposited over 300 000 chemicals from the database into the registry,[47] but this was discouraged by the hosts of ChemSpider because they had no way of distinguishing MOD compounds from synthesized and fully characterized chemicals. To our knowledge, CAS has not taken any ChemSpider data (at least none credited as such in SciFinder) since the Royal Society of Chemistry (RSC) acquisition in 2008, and there has not been any agreement between RSC and CAS regarding ChemSpider data.

Example 4. Integration and Intersections of Databases and the Need for Bioassay Ontology Adoption

Understanding associations between chemical structures and biological assays is a further challenge, since there is essentially no standardization for describing the protocols for obtaining activity metrics (IC50, Ki, Kd, etc.) against a biological target, besides plain English text with scientific jargon. Because this form is intractable to software, it is impossible to determine whether two measurements of activity from different research groups are comparable, other than to have an expert read the full text for both descriptions. The use of a standard ontology, such as BioAssay Ontology,[48] across such databases would be helpful. This would enable enhanced searching and comparison, allowing for the automated aggregation and organization of assays to do sophisticated structure–activity relationship analysis and identify artifacts. Despite the benefits to the community, it currently requires substantial time and expert ontology understanding to correctly annotate each bioassay, so it has not been widely adopted. Efforts are currently underway to design a hybrid manual/automated method for making it relatively fast and easy for scientists to add semantic annotations to their bioassay protocols, which could improve the current situation.[49] This discussion leads us to ask whether compounds in databases without any experimental data and without any link to potential utility should be considered as prior art. This class of compounds is growing dramatically, especially in the public databases, and the utility is arguably markedly less than for prophetic compounds (defined in the Glossary) in patents, which may not be real compounds in an experimental sense but for which the relationship to experimentally tested compounds is at least clear. Such prophetic compounds have been abstracted in SciFinder since December 2007. 
As we have described earlier, the days when one could assume SciFinder had captured everything relevant to the entire global realm of bioactive chemistry are perhaps well past. By definition, no quantitative assessment (such as the statistics of structure matching) across databases is possible without access to all of them, and to our knowledge this has not been undertaken to date. As the largest commercial source (Table 1), SciFinder contains organics, inorganics, organometallics, and “tabular inorganics”. Its reported (September 2014) total of 89 million substances would merge to a smaller collection of unique organic molecules if converted to InChIKeys followed by tautomer collapse (i.e., using just the 14-character connectivity layer). We can also estimate that somewhere between 50 and 60 million InChIKeys are “in the wild”,[12] mainly via the Google indexing of PubChem and ChemSpider, but there could be other sources of unique structures (including virtual compounds as described earlier). The intersections and differentials between SciFinder, PubChem, ChemSpider, and other databases (Table 1) are, to date, unknown and require quantification. In the future, with SciFinder opening up an STN application programming interface (API) for pharmas,[50] assessment of this overlap may become feasible. Other databases such as SureChEMBL may also overlap with PubChem (12.5 million compounds, of which 9.4 million are in PubChem). The ContentMine initiative[51] extracting molecules from documents could also further emphasize that SciFinder is perhaps no longer the definitive site for chemistry prior art checking. Because SciFinder is based primarily on abstraction of compounds from the chemical literature and patents, whereas the public compound databases can host data that may never be published, the two will continue to diverge until the commercial databases determine how to extract quality data from the public platforms.
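
The quantification called for here reduces to set arithmetic once every source can expose InChIKeys. A minimal sketch follows, assuming each database can be exported as a list of standard InChIKeys. The keys and database contents below are invented placeholders and not real records, and the function names are hypothetical. Following the text, records are merged on the first 14-character block of the key.

```python
from typing import Dict, Iterable, Set


def connectivity_block(inchikey: str) -> str:
    """Return the first 14-character block of an InChIKey."""
    block = inchikey.split("-")[0]
    if len(block) != 14:
        raise ValueError(f"not a valid InChIKey: {inchikey!r}")
    return block


def collapse(inchikeys: Iterable[str]) -> Set[str]:
    """Merge records that share the same first block."""
    return {connectivity_block(key) for key in inchikeys}


def unique_content(sources: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each source, keep only the blocks found in no other source."""
    result: Dict[str, Set[str]] = {}
    for name, blocks in sources.items():
        others = set().union(*(b for n, b in sources.items() if n != name))
        result[name] = blocks - others
    return result


# Placeholder keys (invented): the two "A" records differ only after the first block.
commercial = collapse([
    "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    "AAAAAAAAAAAAAA-XXXXXXXXSA-N",
    "BBBBBBBBBBBBBB-UHFFFAOYSA-N",
])
public = collapse([
    "BBBBBBBBBBBBBB-UHFFFAOYSA-N",
    "CCCCCCCCCCCCCC-UHFFFAOYSA-N",
])

shared = commercial & public
unique = unique_content({"commercial": commercial, "public": public})
```

In this toy case three commercial records collapse to two structures, one of which is shared with the public set. Applied to real exports, the same two operations would give the intersections and differentials that are currently unknown.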

Conclusions

From our own observations, we have identified a number of barriers to performing medicinal chemistry due diligence that arise due to the lack of integration between public and private data repositories. Even obtaining structures and associated data from well-funded public efforts like the NIH MLP probes and the NCATS molecules for repurposing[28] in PubChem or elsewhere is profoundly challenging. A medicinal chemist can hardly avoid being exposed to the debate calling for more data sharing and as much public exposure to primary data as possible. A rational response is enhanced by case studies of what can go wrong. For example, in our work on the NIH MLP probes, we discovered a confounding case where the nominal subsumption of a public HTS screen into a patent application impacts over 20% of probes from a range of institutions. In addition, prophetic compounds in SciFinder and vendor molecules deposited in many public databases that include some proportion of probable MOD compounds complicate prior art designations. While we propose some more modest solutions for the highlighted issues, the one with the biggest potential impact would be for SciFinder to generate and search-index the InChI identifiers (strings and keys), now effectively universally adopted by public chemical databases.[52] This would need to be in addition to using SMILES, which retain the tautomeric structure of value to medicinal chemists and patent lawyers alike (as described earlier). The “multistop datashop” database challenge can be highlighted by the hypothetical novelty checking requirement for a new chemical structure proposal from a medicinal chemist or chemical biologist.
This is equally important for someone in open source drug discovery who simply wants an answer to “what is out there that is similar” and who may even eschew IP on principle (e.g., their first response to a similarity match might be to make collaborative contact).[53] Those who seek to stake an IP position need exactly the same answer but in the different context of prior art and freedom to operate, i.e., the competitive landscape in structure terms. The issue for both of them is that all of the big four databases (SciFinder, PubChem, ChemSpider, and UniChem) have at least some unique content via differential source selectivity (as defined by an InChI not in the other three). Ipso facto all four databases need to be searched (although currently UniChem can only be interrogated for exact matches). Add to this the many open source (online) lab notebooks on the Web, and the increasing implausibility of being able to check everything “out there” becomes clear. Perhaps what is also needed is a shift toward more collaboration or openness in terms of availability of chemistry and biology data.[53,54] At the very least there needs to be increased communication between the various public and proprietary databases in order to ensure the gulf does not widen further. Additionally, they need to address some of the issues raised here. This would help to resolve the discrepancies we have highlighted and to streamline analyses of what data exist for compounds. For example, while this article was in review, an article by Antolin and Mestres described 178 MLP chemical probes[55] that overlapped with our description of over 300 MLP chemical probes.[30] We think a meeting or discussion should be convened with all interested database parties. It could very well be conducted at a future American Chemical Society National Meeting or elsewhere.
From previous public efforts to collate melting point and solubility data, significant differences between published studies[56] have been described for the same compounds. Recent efforts mining patents have also shown differences in biological data for the same compounds, based on the method of dispensing used.[57] These limited examples suggest there are benefits to making chemistry, biology screening, and other molecule-related property data accessible because it promotes new analyses and re-evaluation, which ultimately benefits science. We should note that despite our hopes that such a meshing of data is possible and would be of high value to the community, major hurdles will prevent this from happening in the short to medium term, because the proprietary databases still hold too much commercial value for their hosts. We hope our experiences encourage the scientific community to develop creative solutions to enable a more comprehensive analysis of chemistry and related biological screening data. Clearly CAS and the other commercial vendors have to take notice and respond to the current rapidly evolving chemistry database situation; otherwise, their market may be rapidly eroded by these growing public efforts.
References (10 of 43 shown)

1.  ZINC--a free database of commercially available compounds for virtual screening.

Authors:  John J Irwin; Brian K Shoichet
Journal:  J Chem Inf Model       Date:  2005 Jan-Feb       Impact factor: 4.956

2.  Public molecules: small, but perfectly formed.

Authors:  David Bradley
Journal:  Nat Rev Drug Discov       Date:  2004-12       Impact factor: 84.694

3.  Enhancement of chemical rules for predicting compound reactivity towards protein thiol groups.

Authors:  James T Metz; Jeffrey R Huth; Philip J Hajduk
Journal:  J Comput Aided Mol Des       Date:  2007-03-06       Impact factor: 3.686

4.  New substructure filters for removal of pan assay interference compounds (PAINS) from screening libraries and for their exclusion in bioassays.

Authors:  Jonathan B Baell; Georgina A Holloway
Journal:  J Med Chem       Date:  2010-04-08       Impact factor: 7.446

5.  Distant polypharmacology among MLP chemical probes.

Authors:  Albert A Antolín; Jordi Mestres
Journal:  ACS Chem Biol       Date:  2014-11-06       Impact factor: 5.100

6.  Rules for identifying potentially reactive or promiscuous compounds.

Authors:  Robert F Bruns; Ian A Watson
Journal:  J Med Chem       Date:  2012-10-25       Impact factor: 7.446

7.  A pharmacological organization of G protein-coupled receptors.

Authors:  Henry Lin; Maria F Sassano; Bryan L Roth; Brian K Shoichet
Journal:  Nat Methods       Date:  2013-01-06       Impact factor: 28.547

8.  PubChem: a public information system for analyzing bioactivities of small molecules.

Authors:  Yanli Wang; Jewen Xiao; Tugba O Suzek; Jian Zhang; Jiyao Wang; Stephen H Bryant
Journal:  Nucleic Acids Res       Date:  2009-06-04       Impact factor: 16.971

9.  UniChem: a unified chemical structure cross-referencing and identifier tracking system.

Authors:  Jon Chambers; Mark Davies; Anna Gaulton; Anne Hersey; Sameer Velankar; Robert Petryszak; Janna Hastings; Louisa Bellis; Shaun McGlinchey; John P Overington
Journal:  J Cheminform       Date:  2013-01-14       Impact factor: 5.514

Review 10.  Open source drug discovery - a limited tutorial.

Authors:  Murray N Robertson; Paul M Ylioja; Alice E Williamson; Michael Woelfle; Michael Robins; Katrina A Badiola; Paul Willis; Piero Olliaro; Timothy N C Wells; Matthew H Todd
Journal:  Parasitology       Date:  2013-08-28       Impact factor: 3.234

