Robert Leaman, Chih-Hsuan Wei, Cherry Zou, Zhiyong Lu.
Abstract
The significant amount of medicinal chemistry information contained in patents makes them an attractive target for text mining. In this manuscript, we describe systems for named entity recognition (NER) of chemicals and genes/proteins in patents, using the CEMP (for chemicals) and GPRO (for genes/proteins) corpora provided by the CHEMDNER task at BioCreative V. Our chemical NER system is an ensemble of five open systems, including both versions of tmChem, our previous work on chemical NER. Their output is combined using a machine learning classification approach. Our chemical NER system obtained 0.8752 precision and 0.9129 recall, for 0.8937 f-score on the CEMP task. Our gene/protein NER system is an extension of our previous work for gene and protein NER, GNormPlus. This system obtained a performance of 0.8143 precision and 0.8141 recall, for 0.8137 f-score on the GPRO task. Both systems achieved the highest performance in their respective tasks at BioCreative V. We conclude that an ensemble of independently-created open systems is sufficiently diverse to significantly improve performance over any individual system, even when they use a similar approach.
Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/
Published by Oxford University Press 2016. This work is written by US Government employees and is in the public domain in the US.
Year: 2016 PMID: 27173521 PMCID: PMC4865327 DOI: 10.1093/database/baw065
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. An example sentence from the CHEMDNER training corpus with chemicals and genes/proteins in Patent ID: CA2119782C.
Description of the training, development and test sets for the BioCreative V CHEMDNER task, including mentions for the chemical named entity recognition (CEMP) and gene and protein related object identification (GPRO) subtasks
| Count description | Training set | Development set | Test set |
|---|---|---|---|
| Patent abstracts | 7000 | 7000 | 7000 |
| CEMP mentions | 33543 | 32142 | 33949 |
| GPRO mentions | 6876 | 6263 | 7093 |
| GPRO mentions (Type 1 only) | 4396 | 3934 | 4093 |
Figure 2. Architecture of the ensemble chemical named entity recognition system for the CEMP task.
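The abstract describes the ensemble as five systems whose output is combined by a machine learning classifier. The following is a minimal sketch of one way such a combination could work, not the authors' implementation. It assumes each candidate span receives one binary feature per system and is scored by a logistic model. The three `system_*` names, the weights, the bias and the threshold are all hypothetical; only the two tmChem models are named in the source.

```python
import math

# Hypothetical system names; only the two tmChem models are named in the source.
SYSTEMS = ["tmChem.M1", "tmChem.M2", "system_3", "system_4", "system_5"]

def candidate_features(outputs):
    """Map each candidate span (start, end) to one 0/1 feature per system."""
    candidates = set().union(*outputs.values())
    return {
        span: [1 if span in outputs.get(name, set()) else 0 for name in SYSTEMS]
        for span in candidates
    }

def logistic_score(features, weights, bias):
    """Probability that a candidate is a true mention under a logistic model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def ensemble(outputs, weights, bias, threshold=0.5):
    """Keep the candidate spans whose score reaches the threshold."""
    feats = candidate_features(outputs)
    return sorted(
        span for span, x in feats.items()
        if logistic_score(x, weights, bias) >= threshold
    )

# Illustrative values only (assumed, not trained and not from the paper).
WEIGHTS = [1.2, 1.0, 0.8, 0.8, 0.6]
BIAS = -1.5

outputs = {
    "tmChem.M1": {(0, 7), (20, 28)},
    "tmChem.M2": {(0, 7)},
    "system_3": {(0, 7), (40, 45)},
    "system_4": {(20, 28)},
    "system_5": set(),
}
accepted = ensemble(outputs, WEIGHTS, BIAS)
print(accepted)
```

In a real setting the weights would be learned from annotated data, and lowering the threshold would trade precision for recall.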
Detailed description of the five models for GPRO task
| Model | Details |
|---|---|
| M1 | Use both Type 1 and 2 mentions, treat them as a single class |
| M2 | Use both Type 1 and 2 mentions, but treat as two distinct classes |
| M3 | Use only Type 1 mentions, ignore Type 2 mentions |
| M4 | Like M1, but with the additional GNormPlus feature |
| M5 | Like M3, but with the additional GNormPlus feature |
The additional GNormPlus feature refers to the results of the default GNormPlus model, trained on PubMed abstracts.
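The difference between M1, M2 and M3 is how Type 1 and Type 2 mentions become training labels. The sketch below illustrates that mapping under assumed names: the scheme identifiers and label strings are hypothetical and are not taken from the source.

```python
def training_label(mention_type, scheme):
    """Return the class label for a GPRO mention, or None if the mention is ignored.

    mention_type: 1 or 2 (the annotated GPRO type).
    scheme: hypothetical identifier for the labeling strategy.
    """
    if mention_type not in (1, 2):
        raise ValueError("mention_type must be 1 or 2")
    if scheme == "single_class":   # M1: both types, one class
        return "GPRO"
    if scheme == "two_classes":    # M2: both types, two distinct classes
        return f"GPRO_TYPE{mention_type}"
    if scheme == "type1_only":     # M3: Type 2 mentions are ignored
        return "GPRO" if mention_type == 1 else None
    raise ValueError(f"unknown scheme: {scheme}")

# One line per scheme, showing the labels for a Type 1 and a Type 2 mention.
for scheme in ("single_class", "two_classes", "type1_only"):
    print(scheme, [training_label(t, scheme) for t in (1, 2)])
```

M4 and M5 would reuse the `single_class` and `type1_only` mappings respectively, and add the output of the default GNormPlus model as an extra input feature.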
Results for tmChem model 1 and model 2 on the CEMP task in two training configurations
| System | Training | Precision | Recall | F-score |
|---|---|---|---|---|
| tmChem.M1 | Patent | 0.8088 | 0.8437 | |
| tmChem.M2 | Patent | 0.8721 | 0.7953 | 0.8319 |
| tmChem.M1 | Both | 0.8741 | ||
| tmChem.M2 | Both | 0.8711 | 0.8159 | 0.8426 |
Each measure is averaged between the two evaluation sets. The highest value is shown in bold.
Results for our ensemble systems on the CEMP task as measured by precision (P), recall (R) and f-score (F)
| System | Evaluation sets P | Evaluation sets R | Evaluation sets F | Test set P | Test set R | Test set F |
|---|---|---|---|---|---|---|
| Logistic | 0.8867 | 0.8979 | | 0.8752 | 0.9129 | 0.8937 |
| Huber SVM | 0.9091 | 0.8626 | 0.8853 | 0.8908 | 0.8918 | 0.8913 |
| libsvm | 0.8753 | 0.8901 | 0.8822 | 0.8896 | ||
| High recall | 0.6732 | 0.9562 | 0.7901 | 0.7967 | 0.9314 | 0.8588 |
| Higher recall | 0.5922 | 0.7331 | 0.5202 | 0.6787 | ||
The internal evaluation values are averaged between the two evaluation sets. The highest value is shown in bold.
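The f-score is the harmonic mean of precision and recall, so the complete test-set rows can be checked directly. The sketch below does this for two rows. It skips the evaluation-set columns because those values are averages over two sets, and an averaged f-score need not equal the harmonic mean of the averaged precision and recall.

```python
def f_score(precision, recall):
    """Balanced f-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported f-score) from the test-set columns of the table.
test_rows = {
    "Huber SVM": (0.8908, 0.8918, 0.8913),
    "High recall": (0.7967, 0.9314, 0.8588),
}

for name, (p, r, reported) in test_rows.items():
    computed = f_score(p, r)
    print(f"{name}: computed {computed:.4f}, reported {reported:.4f}")
```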
Micro-averaged results for each model on the GPRO task test set, as measured by precision (P), recall (R) and f-score (F).
| Methods | Precision | Recall | F-score |
|---|---|---|---|
| M1 | 0.7835 | 0.8302 | 0.8062 |
| M2 | 0.7852 | 0.8034 | |
| M4 | 0.7677 | 0.8069 | |
| Majority voting based on M1–M4 | 0.8059 | 0.7982 | 0.8020 |
| Majority voting based on M1–M5 | 0.8143 | 0.8141 | 0.8137 |
The highest value is shown in bold.
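The last two rows combine the individual models by majority voting. The source does not say how spans are matched or how ties are resolved, so the sketch below makes two assumptions: votes are counted on exact span matches, and a span is kept only when a strict majority of models predict it.

```python
from collections import Counter

def majority_vote(predictions):
    """Keep spans predicted by more than half of the models.

    predictions: one collection of (start, end) spans per model.
    Assumes exact span matching and a strict majority (both are assumptions).
    """
    votes = Counter(span for spans in predictions for span in set(spans))
    needed = len(predictions) // 2 + 1
    return sorted(span for span, n in votes.items() if n >= needed)

# Hypothetical output of five models on one document.
model_outputs = [
    {(0, 3), (10, 14)},
    {(0, 3)},
    {(0, 3), (10, 14)},
    {(10, 14), (20, 25)},
    {(0, 3)},
]
voted = majority_vote(model_outputs)
print(voted)
```

With four models the same rule requires three votes, so an even split rejects the span.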
The error analysis on 5-fold cross validation of the GPRO development set.
| Error type | Example | FPs | FNs |
|---|---|---|---|
| Incorrect boundary | ‘NPY1 receptors’ | 383 (38.96%) | 405 (51.92%) |
| Gene/family/domain confusion | ‘progesterone receptors’ | 188 (19.13%) | 43 (5.51%) |
| Not a gene mention | ‘MRSA’ | 226 (22.99%) | |
| Missed gene mention | ‘CB1’ | | 175 (22.44%) |
| Annotation inconsistency | ‘Alk1’ | 186 (18.92%) | 157 (20.13%) |
| Others | | 1 (0.10%) | 5 (0.64%) |
| Total | | 983 (100.00%) | 780 (100.00%) |
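Incorrect boundaries appear in both the FP and FN columns because a prediction that overlaps a gold mention without matching it exactly produces one error of each kind. The sketch below shows how such cases could be separated automatically from other errors. The offsets are half-open character positions, and the predicted span is a hypothetical example: a tagger marks only ‘NPY1’ where the gold mention is ‘NPY1 receptors’. The remaining categories in the table need manual inspection, so they are not modeled here.

```python
def overlaps(a, b):
    """True if two half-open (start, end) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

def categorize(gold, predicted):
    """Split errors into boundary errors and other errors, under exact matching."""
    gold, predicted = set(gold), set(predicted)
    exact = gold & predicted
    result = {
        "exact": sorted(exact),
        "boundary_fp": [], "other_fp": [],
        "boundary_fn": [], "other_fn": [],
    }
    for p in sorted(predicted - exact):
        key = "boundary_fp" if any(overlaps(p, g) for g in gold) else "other_fp"
        result[key].append(p)
    for g in sorted(gold - exact):
        key = "boundary_fn" if any(overlaps(g, p) for p in predicted) else "other_fn"
        result[key].append(g)
    return result

# Hypothetical offsets: gold 'NPY1 receptors' at (0, 14), a second gold mention at (30, 33).
gold_spans = [(0, 14), (30, 33)]
predicted_spans = [(0, 4), (50, 54)]   # 'NPY1' only, plus a spurious span
errors = categorize(gold_spans, predicted_spans)
print(errors)
```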