Saber A Akhondi, Alexander G Klenner, Christian Tyrchan, Anil K Manchala, Kiran Boppana, Daniel Lowe, Marc Zimmermann, Sarma A R P Jagarlapudi, Roger Sayle, Jan A Kors, Sorel Muresan.
Abstract
Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Manual extraction of chemical and biological entities from patents by expert curators can take a substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups, each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line breaks due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at www.biosemantics.org.
Year: 2014 PMID: 25268232 PMCID: PMC4182036 DOI: 10.1371/journal.pone.0107477
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Target class distribution of the 8,066 patents from which the final set was drawn.
| Target class | Number of patents | Final selection |
| GPCR | 3,569 | 20 |
| Protease | 1,093 | 17 |
| Kinase | 1,046 | 12 |
| Ion-Channel | 433 | 14 |
| Oxidoreductase | 404 | 17 |
| Hydrolase | 364 | 15 |
| NHR | 349 | 15 |
| Transporters | 323 | 18 |
| Other | 218 | 11 |
| Transferase | 152 | 12 |
| Phosphatase | 65 | 17 |
| Drugs from Sayle et al. | 50 | 32 |
| Total | 8,066 | 200 |
Figure 1. Example patent text with pre-annotations as shown by the Brat annotation tool.
Number of annotated terms and unique terms within the harmonized set prior to disambiguation.
| Entity type | Annotated terms | Unique terms |
| IUPAC | 14,423 | 5,365 |
| Generic | 7,959 | 880 |
| Disease | 3,777 | 1,257 |
| Target | 3,227 | 705 |
| Trademark | 2,273 | 987 |
| Abbreviation | 1,460 | 153 |
| Formula | 1,069 | 171 |
| MOA | 1,014 | 211 |
| Registry Number | 108 | 90 |
| SMILES | 21 | 21 |
| CAS | 6 | 5 |
| InChI | 0 | 0 |
| Total | 35,337 | 9,845 |
Inter-annotator agreement (F-score) without ambiguity resolution.
| | AstraZeneca | Fraunhofer | GVK BIO | NextMove |
| Fraunhofer | 0.42 | | | |
| GVK BIO | 0.60 | 0.39 | | |
| NextMove | 0.50 | 0.69 | 0.52 | |
| Harmonized | 0.78 | 0.64 | 0.74 | 0.72 |
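A pairwise agreement score of this kind can be sketched as follows. This is a minimal illustration, not the authors' scoring procedure: it assumes that two annotations agree only when document, character offsets, and entity type all match exactly, and the names `Annotation` and `pairwise_f_score` are hypothetical.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """One annotated span (hypothetical structure, not the corpus schema)."""
    doc_id: str
    start: int
    end: int
    entity_type: str


def pairwise_f_score(group_a: set, group_b: set) -> float:
    """F-score between two annotation sets under exact matching.

    Assumption: an annotation counts as agreed only if both groups
    produced an identical (doc_id, start, end, entity_type) tuple.
    Treating group_a as the reference, precision = matches / |B| and
    recall = matches / |A|, so F = 2 * matches / (|A| + |B|), which
    is symmetric in the two groups.
    """
    matches = len(group_a & group_b)
    if matches == 0:
        # Also covers the case where both sets are empty.
        return 0.0
    return 2 * matches / (len(group_a) + len(group_b))


# Toy data: the groups share two of their three annotations.
group_a = {
    Annotation("patent-1", 0, 4, "IUPAC"),
    Annotation("patent-1", 10, 15, "Disease"),
    Annotation("patent-1", 20, 25, "Target"),
}
group_b = {
    Annotation("patent-1", 0, 4, "IUPAC"),
    Annotation("patent-1", 10, 15, "Disease"),
    Annotation("patent-1", 30, 35, "MOA"),
}
score = pairwise_f_score(group_a, group_b)
```

Because the score is symmetric, only one triangle of the pairwise matrix needs to be reported, as in the table above. A relaxed matching rule (for example, overlapping spans) would give higher scores than this exact-match sketch.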
The effect of the disambiguation process on the annotations.
| Rules | Type | Affected Terms | Affected Annotations |
| Add annotation | IUPAC | 52 | 2,275 |
| Add annotation | Abbreviation | 29 | 1,631 |
| Add annotation | Generic | 67 | 976 |
| Add annotation | Trademark | 71 | 442 |
| Add annotation | Disease | 4 | 387 |
| Add annotation | MOA | 2 | 203 |
| Add annotation | Formula | 25 | 177 |
| Add annotation | Registry Number | 28 | 111 |
| Add annotation | Target | 19 | 32 |
| Remove annotation | Elements | 23 | 2,499 |
| Remove annotation | IUPAC | 7 | 103 |
| Remove annotation | Trademark | 3 | 101 |
| Remove annotation | Generic | 2 | 67 |
| Remove annotation | Target | 1 | 1 |
| Total | | 333 | 9,005 |
Inter-annotator agreement after ambiguity resolution.
| | AstraZeneca | Fraunhofer | GVK BIO | NextMove | Harmonized |
| AstraZeneca | | +0.04 | +0.09 | +0.08 | +0.06 |
| Fraunhofer | 0.46 | | +0.05 | +0.03 | +0.01 |
| GVK BIO | 0.69 | 0.44 | | +0.06 | +0.05 |
| NextMove | 0.58 | 0.72 | 0.58 | | +0.03 |
| Harmonized | 0.84 | 0.65 | 0.79 | 0.75 | |
The lower left triangle presents the inter-annotator agreement scores (F-score). The upper right triangle shows the improvement gained through disambiguation.
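The improvement values can be reproduced by subtracting the scores reported without ambiguity resolution from those reported after it. The sketch below only re-derives numbers transcribed from the two agreement tables; the variable names are illustrative.

```python
# Pairwise F-scores keyed by (row group, column group), transcribed
# from the agreement tables before and after ambiguity resolution.
before = {
    ("Fraunhofer", "AstraZeneca"): 0.42,
    ("GVK BIO", "AstraZeneca"): 0.60,
    ("GVK BIO", "Fraunhofer"): 0.39,
    ("NextMove", "AstraZeneca"): 0.50,
    ("NextMove", "Fraunhofer"): 0.69,
    ("NextMove", "GVK BIO"): 0.52,
    ("Harmonized", "AstraZeneca"): 0.78,
    ("Harmonized", "Fraunhofer"): 0.64,
    ("Harmonized", "GVK BIO"): 0.74,
    ("Harmonized", "NextMove"): 0.72,
}
after = {
    ("Fraunhofer", "AstraZeneca"): 0.46,
    ("GVK BIO", "AstraZeneca"): 0.69,
    ("GVK BIO", "Fraunhofer"): 0.44,
    ("NextMove", "AstraZeneca"): 0.58,
    ("NextMove", "Fraunhofer"): 0.72,
    ("NextMove", "GVK BIO"): 0.58,
    ("Harmonized", "AstraZeneca"): 0.84,
    ("Harmonized", "Fraunhofer"): 0.65,
    ("Harmonized", "GVK BIO"): 0.79,
    ("Harmonized", "NextMove"): 0.75,
}

# Improvement per pair, rounded to the two decimals used in the tables.
improvement = {pair: round(after[pair] - before[pair], 2) for pair in before}

# Pair that gained the most from disambiguation.
largest_gain = max(improvement, key=improvement.get)
```

Every pair improves, and the largest gain in these figures is the 0.09 increase between GVK BIO and AstraZeneca.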
Inter-annotator agreement (F-score) between the harmonized set and the annotator groups for the main entity types.
| | AstraZeneca vs. Harmonized | Fraunhofer vs. Harmonized | GVK BIO vs. Harmonized | NextMove vs. Harmonized |
| Overall | 0.84 | 0.65 | 0.79 | 0.75 |
| Chemicals | 0.89 | 0.65 | 0.78 | 0.75 |
| Systematic | 0.94 | 0.81 | 0.91 | 0.93 |
| Non-systematic | 0.85 | 0.38 | 0.68 | 0.56 |
| Disease | 0.47 | 0.82 | 0.87 | 0.86 |
| Targets | 0.76 | 0.57 | 0.81 | 0.86 |
| MOA | 0.65 | 0.29 | 0.67 | 0.17 |
Number of annotated terms and unique terms in the harmonized set and in the full patent set of the gold standard corpus after disambiguation.
| | Harmonized set (47 Patents) | | Full set (198 Patents) | |
| | Unique terms | Annotated terms | Unique terms | Annotated terms |
| IUPAC | 5,325 | 14,377 | 50,893 | 135,603 |
| Generic | 881 | 8,384 | 14,305 | 169,133 |
| Disease | 1,256 | 3,776 | 4,503 | 20,229 |
| Target | 703 | 3,235 | 3,514 | 14,398 |
| Trademark | 994 | 2,366 | 3,365 | 9,574 |
| Abbreviation | 153 | 2,088 | 778 | 21,087 |
| Formula | 169 | 1,127 | 3,108 | 25,716 |
| MOA | 210 | 1,017 | 110 | 3,837 |
| Registry Number | 96 | 140 | 188 | 329 |
| SMILES | 21 | 21 | 166 | 166 |
| CAS | 5 | 6 | 47 | 53 |
| InChI | 0 | 0 | 0 | 0 |
| Total | 9,813 | 36,537 | 80,977 | 400,125 |