Stefan Senger, Luca Bartek, George Papadatos, Anna Gaulton.
Abstract
BACKGROUND: The first public disclosure of new chemical entities often takes place in patents, which makes them an important source of information. However, with an ever-increasing number of patent applications, manual processing and curation on such a large scale becomes increasingly challenging. An alternative approach better suited to this large corpus of documents is the automated extraction of chemical structures. A number of patent chemistry databases generated with this automated approach are now available, but little is known that can help to manage expectations when using them. This study aims to address this by comparing two such freely available sources, SureChEMBL and IBM SIIP (IBM Strategic Intellectual Property Insight Platform), with manually curated commercial databases.
Keywords: IBM SIIP; Patent chemistry databases; Patents; SureChEMBL
Year: 2015 PMID: 26457120 PMCID: PMC4594083 DOI: 10.1186/s13321-015-0097-z
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 Distribution of SureChEMBL-SciFinder overlap. Number of patents where the percentage of hits from SciFinder that are also present in SureChEMBL is within a given range
Fig. 2 SureChEMBL Chemical Corpus Count of Compounds. Number of compounds from the set of 35 patents with the SciFinder annotation “Biological Studies” within eight ranges of SureChEMBL chemical corpus count
Fig. 3 Distribution of IBM-SciFinder overlap. Number of patents where the percentage of hits from SciFinder that are also present in IBM SIIP is within a given range
Table 1 List of patents where the percentage of compounds found in SureChEMBL or IBM SIIP is low
| Patent number | Number of compounds in SciFinder | % SciFinder compounds in IBM SIIP | % SciFinder compounds in SureChEMBL |
|---|---|---|---|
| US4231938 | 2 | 0.0 | 0.0 |
| US4472305 | 277 | 0.4 | 0.4 |
| WO2009126123 | 169 | 5.3 | |
| EP1481667 | 47 | 14.9 | |
| WO2008129280 | 62 | 17.7 | 19.4 |
| US4847265 | 5 | 20.0 | 20.0 |
| US6323216 | 20 | 20.0 | |
| WO2006083612 | 68 | 23.5 | 27.9 |
| US4738974 | 4 | 25.0 | 25.0 |
| US4572909 | 60 | 28.3 | |
| US6156756 | 81 | | 27.2 |
List of patents where the percentage of compounds in SciFinder that were also found in SureChEMBL or IBM SIIP is equal to or less than 30 %. Percentages above 30 % are marked in italics
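The percentages in the table above express how many of the compounds that SciFinder lists for a patent were also found in an automatically generated database. A minimal Python sketch of that calculation follows. It assumes each compound is represented by some comparable identifier; the function name `percent_overlap` and the `cpd-…` identifiers are placeholders for illustration, and how compounds were actually matched in the study is not described in this record.

```python
def percent_overlap(reference, candidate):
    """Percentage of reference compounds that also occur in candidate.

    Both arguments are iterables of hashable compound identifiers.
    The identifier scheme is an assumption made for this sketch.
    """
    reference = set(reference)
    if not reference:
        raise ValueError("reference set is empty")
    shared = reference & set(candidate)
    return round(100 * len(shared) / len(reference), 1)


# Hypothetical identifiers: five reference compounds, one of which
# is also present in the candidate database.
scifinder_hits = {"cpd-1", "cpd-2", "cpd-3", "cpd-4", "cpd-5"}
database_hits = {"cpd-1", "cpd-9"}

print(percent_overlap(scifinder_hits, database_hits))
```

With one of five reference compounds recovered, the sketch gives 20.0, the same figure the table reports for a patent with five SciFinder compounds such as US4847265.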
Fig. 4 Distribution of compounds found by one or both of the two automatically curated databases. Percentage of compounds that are in SciFinder for the set of 45 patents and are only in SureChEMBL (orange), only in IBM SIIP (blue), or in both (green). US4231938 is not included since none of the compounds in SciFinder were found in either SureChEMBL or IBM SIIP (cf. Table 1). The results used to generate this visualisation can be found in Additional file 3
Fig. 5 Bar charts showing the number of substances found in the databases. Bar charts with the number of substances found (green) or not found (red) in at least one of the two patent databases depending on the number of patents in Reaxys (left) and the number of components (right)
Results from the search for 1740 compound-patent pairs in SureChEMBL (SC) and IBM SIIP (IBM)
| Country code | Total | Not found | Total found | % Found | IBM only | SC only | Both |
|---|---|---|---|---|---|---|---|
| EP | 172 | 57 | 115 | 66.9 | 27 | 21 | 67 |
| US | 873 | 222 | 651 | 74.6 | 70 | 102 | 479 |
| WO | 695 | 228 | 467 | 67.2 | 64 | 78 | 325 |
| All | 1740 | 507 | 1233 | 70.9 | 161 | 201 | 871 |
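The derived columns of the table above follow directly from the raw counts: "Total found" is the sum of "IBM only", "SC only" and "Both", and "% Found" is that sum divided by "Total". The short Python sketch below recomputes them from the per-country rows as a consistency check. The counts are copied from the table; the names `rows` and `summarise` are chosen for this illustration only.

```python
# Counts copied from the table: (total, not_found, ibm_only, sc_only, both)
rows = {
    "EP": (172, 57, 27, 21, 67),
    "US": (873, 222, 70, 102, 479),
    "WO": (695, 228, 64, 78, 325),
}


def summarise(total, not_found, ibm_only, sc_only, both):
    """Return (total found, % found) and check the row adds up."""
    found = ibm_only + sc_only + both
    assert found + not_found == total, "row counts are inconsistent"
    return found, round(100 * found / total, 1)


summary = {code: summarise(*counts) for code, counts in rows.items()}

# Column-wise sums over the three country codes give the "All" row.
all_counts = tuple(sum(column) for column in zip(*rows.values()))
summary["All"] = summarise(*all_counts)

for code, (found, pct) in summary.items():
    print(f"{code}: {found} found ({pct} %)")
```

The recomputed values agree with the table, including the "All" row of 1233 pairs found out of 1740 (70.9 %).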