Saber A Akhondi, Hinnerk Rey, Markus Schwörer, Michael Maier, John Toomey, Heike Nau, Gabriele Ilchmann, Mark Sheehan, Matthias Irmer, Claudia Bobach, Marius Doornenbal, Michelle Gregory, Jan A Kors.
Abstract
In commercial research and development projects, public disclosure of new chemical compounds often takes place in patents. Only a small proportion of these compounds are published in journals, usually a few years after the patent. Patent authorities make the patents available but do not provide systematic, continuous chemical annotations. Content databases such as Elsevier's Reaxys provide such services, mostly based on manual excerptions, which are time-consuming and costly. Automatic text-mining approaches help overcome some of the limitations of the manual process. Different text-mining approaches exist to extract chemical entities from patents. The majority of them have been developed using sub-sections of patent documents and focus on mentions of compounds. Less attention has been given to the relevancy of a compound in a patent. The relevancy of a compound to a patent is based on the patent's context: a relevant compound plays a major role within the patent. Identification of relevant compounds reduces the size of the extracted data and improves the usefulness of patent resources (e.g. it supports identifying the main compounds). Annotators of databases like Reaxys only annotate relevant compounds. In this study, we design an automated system that extracts chemical entities from patents and classifies their relevance. The gold-standard set contained 18 789 chemical entity annotations. Of these, 10% were relevant compounds, 88% were irrelevant and 2% were equivocal. Our compound recognition system was based on proprietary tools. The performance (F-score) of the system on compound recognition was 84% on the development set and 86% on the test set. The relevancy classification system had an F-score of 86% on the development set and 82% on the test set. Our system can extract chemical compounds from patents and classify their relevance with high performance. This enables the extension of the Reaxys database by means of automation.
Year: 2019 PMID: 30698776 PMCID: PMC6351730 DOI: 10.1093/database/baz001
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Workflow of the relevancy classification.
Figure 2. Patent corpus development.
Figure 3. Annotations in a patent snippet with the brat annotation tool.
Table 1. Number of annotations in the gold-standard set

| Group | Annotation type | Total | Relevant | Equivocal | Irrelevant |
|---|---|---|---|---|---|
| Compounds | Mono Component | 13 564 | 883 | 362 | 12 319 |
| | Mixture part | 1010 | 0 | 0 | 1010 |
| | Prophetic | 625 | 625 | 0 | 0 |
| Classes | Chemical class | 1848 | 249 | 30 | 1569 |
| | Biomolecule | 1039 | 0 | 0 | 1039 |
| | Markush | 17 | 17 | 0 | 0 |
| | Mixture | 286 | 0 | 0 | 286 |
| | Mixture part | 174 | 0 | 0 | 174 |
| | Polymer | 226 | 0 | 0 | 226 |
| Total chemical entities | | 18 789 | 1774 | 392 | 16 623 |
| Other | Suffix and prefix | 628 | - | - | - |
| | Relation | 1848 | - | - | - |

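
The class shares quoted in the abstract can be checked against the "Total chemical entities" row. The sketch below assumes that the three counts after the overall total are, in order, the relevant, equivocal and irrelevant annotations; that reading is an inference from the percentages in the abstract, and the names `counts` and `shares` are purely illustrative.

```python
# Counts taken from the "Total chemical entities" row of the gold-standard table.
# Assumption: the order after the overall total is relevant, equivocal, irrelevant.
counts = {"relevant": 1774, "equivocal": 392, "irrelevant": 16623}


def shares(label_counts):
    """Return each label's share of all annotations, in percent."""
    total = sum(label_counts.values())
    return {label: 100.0 * n / total for label, n in label_counts.items()}


total = sum(counts.values())
pct = shares(counts)

print(f"total annotations: {total}")
for label, value in pct.items():
    print(f"{label}: {value:.1f}%")
```

Under this reading the three counts add up to the reported 18 789 annotations, and the irrelevant and equivocal shares round to the 88% and 2% given in the abstract. The relevant share comes out slightly below the rounded 10% quoted there.
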
Table 2. Performance of the ensemble system on compound recognition for different confidence score thresholds

| Confidence score threshold | Development precision (%) | Development recall (%) | Development F-score (%) | Test precision (%) | Test recall (%) | Test F-score (%) |
|---|---|---|---|---|---|---|
| 0.0 | 88.5 | 79.3 | 83.6 | 86.5 | 82.3 | 84.3 |
| 0.1 | 88.6 | 79.1 | 83.6 | 89.1 | 82.3 | 85.6 |
| 0.2 | 89.1 | 78.9 | 83.7 | 90.1 | 82.3 | 86.2 |
| 0.3 | 89.1 | 78.6 | 83.5 | 90.1 | 81.6 | 85.7 |
| 0.4 | 89.1 | 78.4 | 83.4 | 90.1 | 81.5 | 85.6 |
| 0.5 | 89.1 | 78.4 | 83.4 | 90.1 | 81.5 | 85.6 |
| 0.6 | 89.1 | 78.4 | 83.4 | 90.1 | 81.3 | 85.5 |
| 0.7 | 87.2 | 60.6 | 71.5 | 90.7 | 69.4 | 78.6 |
| 0.8 | 82.0 | 36.2 | 50.3 | 96.2 | 39.8 | 56.3 |
| 0.9 | 100.0 | 0.1 | 0.2 | 96.4 | 0.8 | 1.7 |
| 1.0 | 100.0 | 0.1 | 0.2 | 97.2 | 0.8 | 1.7 |

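
The F-scores in this table are consistent with the usual harmonic mean of precision and recall. As a minimal sketch, the snippet below recomputes the F-score for the first three numeric columns after the threshold and picks the threshold with the highest value. It assumes those columns are precision and recall on the development set, and that the threshold is chosen by maximising the development F-score; neither the column reading nor the selection rule is stated in this excerpt, and `f_score` and `dev_rows` are illustrative names.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (threshold, precision %, recall %) copied from the table.
# Assumption: these are the development-set columns.
dev_rows = [
    (0.0, 88.5, 79.3), (0.1, 88.6, 79.1), (0.2, 89.1, 78.9),
    (0.3, 89.1, 78.6), (0.4, 89.1, 78.4), (0.5, 89.1, 78.4),
    (0.6, 89.1, 78.4), (0.7, 87.2, 60.6), (0.8, 82.0, 36.2),
    (0.9, 100.0, 0.1), (1.0, 100.0, 0.1),
]

# Recompute the F-score for every threshold and keep the best one.
scored = [(t, f_score(p, r)) for t, p, r in dev_rows]
best_threshold, best_f = max(scored, key=lambda pair: pair[1])

print(f"best threshold: {best_threshold}, F-score: {best_f:.1f}")
```

Because the inputs are already rounded to one decimal, a recomputed value can differ from the tabulated one in the last digit for some rows.
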
Figure 4. The performance of the relevance system based on precision, recall and F-score.
Table 3. The added value of individual features based on "leave-one-out" methodology

| Feature left out | Relevance-score threshold | Precision (%) | Recall (%) | F-score (%) | F-score decrease (percentage points) |
|---|---|---|---|---|---|
| None (all features) | 0.53 | 84.8 | 86.8 | 85.8 | - |
| A: Compound frequency | 0.47 | 82.8 | 86.2 | 84.5 | 1.3 |
| B: Compound section | 0.40 | 95.5 | 70.0 | 80.8 | 5.0 |
| C: Compound length | 0.40 | 75.9 | 75.5 | 75.7 | 10.1 |
| D: Surrounding characters | 0.53 | 85.1 | 82.9 | 84.0 | 1.8 |
| E: Compound section uniqueness | 0.53 | 84.8 | 82.9 | 83.9 | 1.9 |
| F: Compound without solvent | 0.53 | 85.1 | 82.9 | 84.0 | 1.8 |
| G: Compound wide usage | 0.53 | 83.9 | 76.4 | 80.0 | 5.8 |

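
The leave-one-out procedure behind this table can be sketched as a loop that removes one feature at a time and records how far the F-score falls below the all-features baseline. This is only an illustration, not the authors' implementation: `leave_one_out` and `lookup_f_score` are hypothetical names, and instead of retraining a classifier the `evaluate` callback simply looks up the F-scores reported in the table (assuming the last-but-one column is the F-score).

```python
from typing import Callable, Dict, FrozenSet, Sequence


def leave_one_out(
    features: Sequence[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """For each feature, return the F-score drop when only that feature is removed."""
    full = frozenset(features)
    baseline = evaluate(full)
    return {f: round(baseline - evaluate(full - {f}), 1) for f in features}


# F-scores (%) from the table, keyed by the removed feature ("" = nothing removed).
REPORTED_F = {
    "": 85.8, "A": 84.5, "B": 80.8, "C": 75.7,
    "D": 84.0, "E": 83.9, "F": 84.0, "G": 80.0,
}


def lookup_f_score(active: FrozenSet[str]) -> float:
    """Stand-in for training and scoring a model on the active feature set."""
    removed = "".join(sorted(set("ABCDEFG") - active))
    return REPORTED_F[removed]


drops = leave_one_out("ABCDEFG", lookup_f_score)
ranked = sorted(drops, key=drops.get, reverse=True)

for feature in ranked:
    print(f"{feature}: -{drops[feature]} points")
```

With the reported values, removing feature C (compound length) costs the most, followed by G (compound wide usage) and B (compound section). In a real experiment `evaluate` would retrain and score the classifier for each reduced feature set.
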
Figure 5. Performance of the relevancy classification system as a function of the relevance-score threshold when one of the relevancy features A-G is removed (see Table 3 for feature legend).