George Papadatos, Mark Davies, Nathan Dedman, Jon Chambers, Anna Gaulton, James Siddle, Richard Koks, Sean A Irvine, Joe Pettersson, Nicko Goncharoff, Anne Hersey, John P Overington.
Abstract
SureChEMBL is a publicly available large-scale resource containing compounds extracted from the full text, images and attachments of patent documents. The data are extracted from the patent literature daily by an automated text- and image-mining pipeline. SureChEMBL provides access to a previously unavailable, open and timely set of annotated compound-patent associations, complemented with sophisticated combined structure and keyword-based search capabilities against the compound repository and patent document corpus. Given the wealth of knowledge hidden in patent documents, analysis of SureChEMBL data has immediate applications in drug discovery, medicinal chemistry and other commercial areas of chemical science. Currently, the database contains 17 million compounds extracted from 14 million patent documents. Access is available through a dedicated web-based interface and data downloads at https://www.surechembl.org/.
Year: 2015 PMID: 26582922 PMCID: PMC4702887 DOI: 10.1093/nar/gkv1253
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Overview of the SureChEMBL data pipeline from the raw patent feed to the standardized compounds in the database.
The patent content and coverage of the SureChEMBL database
| Collection | Data | Description and languages | Years |
|---|---|---|---|
| EP applications | Bib. data | DocDB + Original | From 1978 |
| | Full text | Original (EN, DE, FR) | |
| EP granted | Bib. data | DocDB + Original | From 1980 |
| | Full text | Original (EN, DE, FR) | |
| WO applications | Bib. data | DocDB + Original | From 1978 |
| | Full text | Original (EN, DE, FR, ES, RU) | From 1978 |
| US applications | Bib. data | DocDB + Original | From 2001 |
| | Full text | Original (EN) | From 2001 |
| US granted | Bib. data | DocDB + Original | From 1920 |
| | Full text | Original (EN) | From 1976 |
| JP applications | Bib. data | DocDB | From 1973 |
| | English abstracts/titles | | From 1976 |
| JP granted | Bib. data | DocDB | From 1994 |
| 90+ countries | Bib. data | DocDB | From 1920 |
Note that the WIPO does not grant patents, as this is a prerogative of the national or regional patent authorities. DocDB refers to a database provided by the EPO, containing comprehensive bibliographic information for patent documents released worldwide.
Figure 2. (A) Field keyword-based search against full text and patent bibliographic metadata. (B) The equivalent search using the Lucene query fields syntax.
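As an illustration of the fielded syntax referred to in Figure 2B, the sketch below assembles a Lucene-style query string from field/value pairs. It relies only on the general Lucene conventions of `field:term`, quoted phrases and Boolean operators. The field names `title` and `assignee` are hypothetical placeholders, not SureChEMBL's actual field codes, and Lucene special characters other than double quotes are not escaped here.

```python
def lucene_term(field, value):
    """Render one field:value clause, quoting multi-word values as a phrase."""
    value = value.replace('"', '\\"')  # escape embedded double quotes
    if " " in value:
        value = f'"{value}"'           # phrase query for multi-word values
    return f"{field}:{value}"


def lucene_query(clauses, operator="AND"):
    """Join (field, value) pairs with a Boolean operator such as AND or OR."""
    return f" {operator} ".join(lucene_term(f, v) for f, v in clauses)


# Hypothetical field names, chosen for illustration only.
query = lucene_query([("title", "kinase inhibitor"), ("assignee", "pfizer")])
print(query)  # title:"kinase inhibitor" AND assignee:pfizer
```

A string built this way could be pasted into a query box that accepts Lucene syntax, provided the field names are replaced with the ones the interface documents.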
Figure 3. Similarity search for the near neighbours of the approved drug donepezil. The search results will have a molecular weight range of 300–800. Furthermore, only compounds that are extracted from the claims or description sections and images will be retrieved.
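The kind of filtered similarity search shown in Figure 3 can be sketched in a few lines. This is a toy model under stated assumptions, not SureChEMBL's implementation: fingerprints are represented as plain sets of bit indices, similarity is taken to be the Tanimoto coefficient, and the 0.7 threshold, the record layout and the compound IDs are all hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_search(query_fp, compounds, threshold=0.7,
                      mw_range=(300.0, 800.0),
                      sections=frozenset({"claims", "description", "image"})):
    """Return (id, score) hits that pass the MW, section and similarity filters."""
    hits = []
    for c in compounds:
        if not (mw_range[0] <= c["mw"] <= mw_range[1]):
            continue                      # outside the molecular weight window
        if not (c["sections"] & sections):
            continue                      # not found in an accepted section
        score = tanimoto(query_fp, c["fp"])
        if score >= threshold:
            hits.append((c["id"], score))
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Toy data: every value below is invented for illustration.
query_fp = {1, 2, 3, 4}
compounds = [
    {"id": "A", "fp": {1, 2, 3, 4, 5}, "mw": 379.5, "sections": {"claims"}},
    {"id": "B", "fp": {1, 2, 3, 4}, "mw": 250.0, "sections": {"description"}},
    {"id": "C", "fp": {1, 2, 3, 4}, "mw": 400.0, "sections": {"title"}},
    {"id": "D", "fp": {1, 9}, "mw": 500.0, "sections": {"image"}},
]
hits = similarity_search(query_fp, compounds)
print(hits)  # [('A', 0.8)]
```

Applying the cheap property and section filters before computing similarity avoids scoring compounds that would be discarded anyway.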
Figure 4. The results of a search can be either a list of documents (A) or compounds (B). The former is sorted in reverse chronological order and provides a preview of each document by means of patent ID, publication date, assignee, classification code(s), title and language. Moreover, for each document, members of the same patent family (i.e. a number of patent documents by the same inventors describing the same invention filed in multiple countries) across different patent authorities may be retrieved (listed in dark background). Finally, the chemistry annotated in each document can be exported and downloaded. In the case of compound hits (B), a report card may be viewed for each hit (e.g. https://www.surechembl.org/chemical/SCHEMBL16354556 and Supplementary Figure S2). Additionally, users may choose a number of these search hits and retrieve the patent documents associated with their selection.
Figure 5. The export chemistry modal window allows users to filter compounds based on calculated physicochemical and related properties, simple counts and frequency of occurrence.
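The filtering described for Figure 5 amounts to range checks on per-compound properties plus a minimum occurrence count. The sketch below shows one way this could look on exported records. The property names (`mw`, `logp`, `ring_count`, `occurrences`), the cut-off values and the records themselves are hypothetical and are not taken from the SureChEMBL export format.

```python
def passes(compound, ranges, min_occurrences=1):
    """True if the compound meets the occurrence cut-off and every property range."""
    if compound["occurrences"] < min_occurrences:
        return False
    return all(lo <= compound[prop] <= hi for prop, (lo, hi) in ranges.items())


# Hypothetical records and thresholds, for illustration only.
compounds = [
    {"id": "X1", "mw": 350.0, "logp": 2.1, "ring_count": 3, "occurrences": 4},
    {"id": "X2", "mw": 650.0, "logp": 3.0, "ring_count": 4, "occurrences": 5},
    {"id": "X3", "mw": 300.0, "logp": 1.0, "ring_count": 2, "occurrences": 1},
]
filters = {"mw": (200, 500), "logp": (-1, 5), "ring_count": (1, 6)}

exported = [c["id"] for c in compounds if passes(c, filters, min_occurrences=2)]
print(exported)  # ['X1']
```

Here X2 fails the molecular weight range and X3 appears too rarely, so only X1 is kept.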