Ling Luo, Zhihao Yang, Pei Yang, Yin Zhang, Lei Wang, Jian Wang, Hongfei Lin.
Abstract
In biomedical research, patents contain a significant amount of information, and biomedical text mining in patents has recently received much attention. To accelerate the development of biomedical text mining for patents, the BioCreative V.5 challenge organized three tracks focused on biomedical entity recognition in patents: chemical entity mention recognition (CEMP), gene and protein related object recognition (GPRO), and technical interoperability and performance of annotation servers. This paper describes our neural network approach for the CEMP and GPRO tracks. In this approach, a bidirectional long short-term memory network with a conditional random field layer (BiLSTM-CRF) is employed to recognize biomedical entities in patents. To improve performance, we explored the effect of additional features (i.e., part-of-speech, chunking and named entity recognition features generated by the GENIA tagger) on the neural network model. In the official results, our best runs achieved the highest performance among all participating teams in both tracks: a precision of 88.32%, a recall of 92.62%, and an F-score of 90.42% in the CEMP track, and a precision of 76.65%, a recall of 81.91%, and an F-score of 79.19% in the GPRO track.
Keywords: Biomedical entity recognition; Conditional random field; Deep learning; Long short-term memory; Patents
Year: 2018 PMID: 30564940 PMCID: PMC6755562 DOI: 10.1186/s13321-018-0318-3
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 The processing flowchart of our system
An example of all features
| Input | Substituted | piperidines | with | selective | binding | to | histamine | h3 | – | receptor | . |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Word | substituted | piperidines | with | selective | binding | to | histamine | h3 | – | receptor | . |
| Character | s u b s t i t u t e d | p i p e r i d i n e s | w i t h | s e l e c t i v e | b i n d i n g | t o | h i s t a m i n e | h 3 | – | r e c e p t o r | . |
| Cap | firstCaps | lower | lower | lower | lower | lower | lower | lower | lower | lower | lower |
| POS | VBN | NNS | IN | JJ | NN | TO | NN | NN | HYPH | NN | . |
| Chunk | B-NP | I-NP | B-PP | B-NP | I-NP | B-PP | B-NP | I-NP | B-NP | I-NP | O |
| NER | O | O | O | O | O | O | B-protein | I-protein | I-protein | I-protein | O |
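The Word, Character and Cap rows of the table above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, and the `allCaps` and `mixedCaps` classes are assumptions added for illustration, since the table itself only shows `firstCaps` and `lower`. The POS, Chunk and NER rows come from the GENIA tagger and are not reproduced here.

```python
def cap_feature(token: str) -> str:
    """Map a token to a coarse capitalization class.

    'firstCaps' and 'lower' appear in the table; 'allCaps' and
    'mixedCaps' are hypothetical extra classes (assumptions).
    """
    if len(token) > 1 and token.isupper():
        return "allCaps"    # assumption: not shown in the table
    if token[:1].isupper() and not any(c.isupper() for c in token[1:]):
        return "firstCaps"
    if any(c.isupper() for c in token):
        return "mixedCaps"  # assumption: not shown in the table
    return "lower"          # digits and punctuation also land here, as in the table


def token_features(tokens):
    """Build word, character and capitalization features per token."""
    features = []
    for token in tokens:
        word = token.lower()  # the Word row is the lowercased input
        features.append({
            "word": word,
            "chars": list(word),  # following the table, taken from the lowercased word
            "cap": cap_feature(token),
        })
    return features


sentence = ["Substituted", "piperidines", "with", "selective", "binding",
            "to", "histamine", "h3", "–", "receptor", "."]
for feats in token_features(sentence):
    print(feats["word"], feats["cap"], " ".join(feats["chars"]))
```

Keeping the capitalization class as a separate feature lets the word embedding lookup work on lowercased tokens without discarding the case information.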
Fig. 2 The architecture of the BiLSTM-CRF model
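To illustrate what a CRF layer on top of a BiLSTM contributes, the sketch below decodes the best tag sequence with the Viterbi algorithm in plain Python. It is a generic illustration, not the authors' implementation: the emission scores stand in for per-token BiLSTM outputs, and all numbers, the tag set and the transition penalty are made up for the example.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions:   list of {tag: score}, one dict per token
                 (stand-in for BiLSTM outputs; values are illustrative)
    transitions: {(prev_tag, cur_tag): score}; missing pairs score 0.0
    """
    best = [{t: emissions[0][t] for t in tags}]
    back = []
    for em in emissions[1:]:
        scores, pointers = {}, {}
        for cur in tags:
            # Best previous tag for reaching `cur` at this position.
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = best[-1][prev] + transitions.get((prev, cur), 0.0) + em[cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


TAGS = ["O", "B", "I"]
# Hypothetical learned constraint: an entity cannot start with I after O.
TRANSITIONS = {("O", "I"): -1e4}
EMISSIONS = [
    {"O": 2.0, "B": 0.5, "I": 0.0},
    {"O": 0.1, "B": 1.0, "I": 1.2},
    {"O": 0.2, "B": 0.1, "I": 1.5},
]

greedy = [max(TAGS, key=lambda t: em[t]) for em in EMISSIONS]
decoded = viterbi(EMISSIONS, TRANSITIONS, TAGS)
print(greedy)   # per-token argmax, ignores tag dependencies
print(decoded)  # sequence-level decoding with transition scores
```

In this toy case the per-token argmax yields the invalid sequence O, I, I, while sequence-level decoding yields O, B, I, which is the kind of label dependency a CRF layer is meant to capture.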
The main hyper-parameters of our model
| Hyper-parameter | Value | Values tested |
|---|---|---|
| Word embedding dimension | 100 | 50, 100, 200 |
| Character embedding dimension | 25 | 25, 50 |
| Character-level BiLSTM state size | 25 | 25, 50 |
| Capitalization embedding dimension | 5 | 5, 10 |
| POS embedding dimension | 25 | 25, 50 |
| Chunking embedding dimension | 10 | 10, 20 |
| NER embedding dimension | 5 | 5, 10 |
| Word-level BiLSTM state size | 100 | 50, 100, 200 |
| SGD learning rate | 0.001 | 0.01, 0.005, 0.001 |
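The selected values can be collected into a small configuration, from which the input size of the word-level BiLSTM follows. The sketch below rests on two assumptions that the table does not state: the per-token features are concatenated, and the character-level BiLSTM contributes its forward and backward states (2 × state size). The dictionary keys and the helper name are hypothetical.

```python
# Selected values from the "Value" column of the table above.
HYPER = {
    "word_emb": 100,
    "char_emb": 25,
    "char_lstm_state": 25,
    "cap_emb": 5,
    "pos_emb": 25,
    "chunk_emb": 10,
    "ner_emb": 5,
    "word_lstm_state": 100,
    "sgd_lr": 0.001,
}


def word_lstm_input_dim(h, extra=()):
    """Per-token input size, assuming simple concatenation of features."""
    # Assumption: forward + backward character states are concatenated.
    dim = h["word_emb"] + 2 * h["char_lstm_state"] + h["cap_emb"]
    # Optional additional features (POS, chunking, NER embeddings).
    dim += sum(h[name] for name in extra)
    return dim


baseline_dim = word_lstm_input_dim(HYPER)
all_features_dim = word_lstm_input_dim(HYPER, extra=("pos_emb", "chunk_emb", "ner_emb"))
print(baseline_dim, all_features_dim)
```

Under these assumptions the baseline input has 155 dimensions per token, and adding all three GENIA-derived features raises it to 195.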
CEMP and GPRO corpora overview
| | Training set | Test set | Entire corpus |
|---|---|---|---|
| Patent abstracts | 21,000 | 9000 | 30,000 |
| CEMP mentions | 99,632 | 44,486 | 144,188 |
| GPRO mentions | 17,751 | 8998 | 26,749 |
| GPRO type 1 mentions | 12,422 | 5330 | 17,752 |
| GPRO type 2 mentions | 5329 | 3668 | 8997 |
| Tokens | 1,770,836 | 767,599 | 2,538,435 |
The effect of the different ratios of positive and negative documents
| Ratio (positive:negative) | CEMP Dev Precision | CEMP Dev Recall | CEMP Dev F-score | GPRO Dev Precision | GPRO Dev Recall | GPRO Dev F-score |
|---|---|---|---|---|---|---|
| 1:0 | 87.58 | 92.20 | 89.83 | 60.90 | 88.27 | 72.07 |
| 1:0.5 | – | – | – | 66.06 | 85.76 | 74.63 |
| 1:1 | – | – | – | 67.97 | 86.06 | *75.95* |
| 1:2 | – | – | – | 70.03 | 77.79 | 73.71 |
| All training set | 87.58 | 92.50 | *89.97* | 68.32 | 82.44 | 74.72 |
On the CEMP corpus, only the 1:0 ratio and the full training set were tested, since positive documents outnumber negative documents.
Italic values denote the highest values
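A training subset at a given positive:negative ratio can be built by keeping all positive documents and randomly sampling negatives. The following is a minimal sketch under assumptions: "positive" is taken to mean a document with at least one annotated mention, and the function name, the seed and the toy document IDs are hypothetical. The source does not say how the negative documents were selected.

```python
import random


def sample_by_ratio(positive_docs, negative_docs, neg_per_pos, seed=0):
    """Keep all positives and sample negatives at a 1:neg_per_pos ratio."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    wanted = round(len(positive_docs) * neg_per_pos)
    # Never ask for more negatives than are available.
    n_neg = min(len(negative_docs), wanted)
    return list(positive_docs) + rng.sample(list(negative_docs), n_neg)


# Toy document IDs for illustration only.
positives = [f"pos{i}" for i in range(4)]
negatives = [f"neg{i}" for i in range(10)]

for ratio in (0, 0.5, 1, 2):
    subset = sample_by_ratio(positives, negatives, ratio)
    print(f"1:{ratio}", len(subset))
```

A ratio of 1:0 trains on positive documents only, which in the GPRO results above gives the highest recall and the lowest precision.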
The effect of our baseline components on our development sets
| Model | CEMP Dev Precision | CEMP Dev Recall | CEMP Dev F-score | CEMP Dev △ | GPRO Dev Precision | GPRO Dev Recall | GPRO Dev F-score | GPRO Dev △ |
|---|---|---|---|---|---|---|---|---|
| Baseline | 87.58 | 92.50 | *89.97* | – | 67.97 | 86.06 | *75.95* | – |
| − Character embedding | 86.27 | 90.98 | 88.56 | −1.41 | 66.67 | 83.69 | 74.22 | −1.73 |
| − Capitalization feature | 87.99 | 91.42 | 89.67 | −0.30 | 68.07 | 84.94 | 75.58 | −0.37 |
| − CRF layer | 84.84 | 88.55 | 86.66 | −3.31 | 62.81 | 79.41 | 70.14 | −5.81 |
| − Post-processing | 87.30 | 92.28 | 89.72 | −0.25 | 68.04 | 85.61 | 75.82 | −0.13 |
Italic values denote the highest values
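The F-score and △ columns follow directly from precision and recall. The short check below assumes the F-score is the balanced F1 (the harmonic mean of precision and recall), which is consistent with the reported numbers, and that △ is the difference from the baseline F-score. The function name is hypothetical.

```python
def f_score(precision, recall):
    """Balanced F-score (harmonic mean), inputs and output in percent."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# CEMP Dev precision and recall taken from the table above.
baseline = round(f_score(87.58, 92.50), 2)   # Baseline row
no_char = round(f_score(86.27, 90.98), 2)    # "− Character embedding" row
delta = round(no_char - baseline, 2)         # the △ column

print(baseline, no_char, delta)
```

The computed values reproduce the baseline CEMP F-score of 89.97 and the drop of 1.41 points when character embeddings are removed.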
The effect of additional features on our development sets
| Model | CEMP Dev Precision | CEMP Dev Recall | CEMP Dev F-score | CEMP Dev △ | GPRO Dev Precision | GPRO Dev Recall | GPRO Dev F-score | GPRO Dev △ |
|---|---|---|---|---|---|---|---|---|
| Baseline | 87.58 | 92.50 | *89.97* | – | 67.97 | 86.06 | 75.95 | – |
| + POS feature | 88.12 | 91.70 | 89.87 | −0.10 | 68.72 | 85.46 | 76.18 | +0.23 |
| + Chunking feature | 87.21 | 92.58 | 89.81 | −0.16 | 67.21 | 87.45 | 76.01 | +0.06 |
| + NER feature | 87.57 | 91.81 | 89.64 | −0.33 | 69.32 | 84.72 | 76.25 | +0.30 |
| + All features | 87.97 | 91.39 | 89.65 | −0.32 | 70.84 | 83.76 | *76.76* | +0.81 |
Italic values denote the highest values
Performance comparison with other participants on the test sets (the best runs per team)
| Row | CEMP Team | CEMP Precision | CEMP Recall | CEMP F-score | CEMP SD (%) | GPRO Team | GPRO Precision | GPRO Recall | GPRO F-score | GPRO SD (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| A | 121 (ours) | 88.32 | 92.62 | *90.42* | 0.25 | 121 (ours) | 76.65 | 81.91 | *79.19* | 0.10 |
| B | 112 | 88.97 | 91.82 | 90.37 | 0.27 | 112 | 75.23 | 77.49 | 76.34 | 0.08 |
| C | 107 | 90.02 | 90.62 | 90.32 | 0.27 | 153 | 72.06 | 80.68 | 76.13 | 0.10 |
| D | 153 | 88.02 | 90.28 | 89.14 | 0.30 | 133 | 66.53 | 82.68 | 73.73 | 0.10 |
| E | 116 | 84.39 | 92.97 | 88.47 | 0.23 | 142 | 74.79 | 71.63 | 73.18 | 0.15 |
Italic values denote the highest values
Examples of gene/protein named entity recognition errors
| Error type | Example |
|---|---|
| Incorrect boundary | And in the treatment of diseases and conditions that are mediated by |
| Missing gene/protein mention | Combination of |
| Not a gene/protein mention | Application of tumor inhibitor |
The correct entity mentions are underlined, while the misrecognized entity mentions are italicized
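The three error types can be separated automatically by comparing predicted and gold mention spans. The sketch below is one possible way to do this and is not taken from the paper: spans are assumed to be `(start, end)` character offsets with an exclusive end, a prediction that overlaps a gold mention without matching it exactly counts as an incorrect boundary, and the function names and example offsets are hypothetical.

```python
def overlaps(a, b):
    """True if two (start, end) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def categorize(gold, predicted):
    """Group predictions and gold mentions by error type."""
    gold, predicted = set(gold), set(predicted)
    report = {
        "correct": [],
        "incorrect boundary": [],
        "missing": [],
        "not a mention": [],
    }
    for p in sorted(predicted):
        if p in gold:
            report["correct"].append(p)
        elif any(overlaps(p, g) for g in gold):
            report["incorrect boundary"].append(p)
        else:
            report["not a mention"].append(p)  # false positive
    for g in sorted(gold):
        # A gold mention with no overlapping prediction was missed entirely.
        if not any(overlaps(g, p) for p in predicted):
            report["missing"].append(g)
    return report


# Made-up offsets for illustration only.
gold_spans = [(0, 5), (10, 20), (30, 35)]
predicted_spans = [(0, 5), (12, 20), (40, 45)]
report = categorize(gold_spans, predicted_spans)
for error_type, spans in report.items():
    print(error_type, spans)
```

Under exact-match evaluation an incorrect boundary costs both precision and recall, because the prediction is a false positive and the gold mention remains unmatched.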