Yuanhe Tian, Wang Shen, Yan Song, Fei Xia, Min He, Kenli Li.
Abstract
BACKGROUND: Biomedical named entity recognition (BioNER) is an important task for understanding biomedical texts, and it can be challenging due to the lack of large-scale labeled training data and domain knowledge. To address this challenge, in addition to using powerful encoders (e.g., biLSTM and BioBERT), one possible method is to leverage extra knowledge that is easy to obtain. Previous studies have shown that automatically processed syntactic information can be a useful resource for improving model performance, but their approaches are limited to directly concatenating the embeddings of syntactic information to the input word embeddings. Such syntactic information is therefore leveraged in an inflexible way, where inaccurate information may hurt model performance.
Keywords: Key-value memory networks; Named entity recognition; Neural networks; Syntactic information; Text mining
Year: 2020 PMID: 33238875 PMCID: PMC7687711 DOI: 10.1186/s12859-020-03834-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 An example sentence, in which the object noun phrase (“Huntington disease”) is a named entity. The labels under the words are BIO tags
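The BIO scheme illustrated in Fig. 1 marks each token as B(egin), I(nside), or O(utside) of an entity; a contiguous B/I run forms one entity mention. As a minimal sketch (the sentence below is illustrative, not necessarily the one in the figure), entity spans can be decoded from BIO tags like this:

```python
# Minimal sketch of decoding entity spans from BIO tags (Fig. 1).
# The example sentence is hypothetical, chosen only to contain the
# "Huntington disease" entity mentioned in the caption.

def decode_bio(tokens, tags):
    """Extract (entity_text, start, end) spans from BIO-tagged tokens."""
    entities, start = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("B"):
            if start is not None:  # close the previous entity
                entities.append((" ".join(tokens[start:i]), start, i))
            start = i
        elif tag == "O":
            if start is not None:
                entities.append((" ".join(tokens[start:i]), start, i))
            start = None
        # "I" tags simply extend the current entity
    if start is not None:
        entities.append((" ".join(tokens[start:]), start, len(tokens)))
    return entities

tokens = ["The", "patient", "has", "Huntington", "disease", "."]
tags   = ["O",   "O",       "O",   "B-Disease", "I-Disease", "O"]
print(decode_bio(tokens, tags))  # [('Huntington disease', 3, 5)]
```

The same decoding applies to the model's predicted tag sequence at evaluation time, which is how token-level predictions become the entity-level counts reported in the tables below.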
Statistics of the six benchmark datasets
| Datasets | Entity type | Split | Token # | Sent. # | Entity # |
|---|---|---|---|---|---|
| BC2GM | Gene/protein | Train | 355.4k | 12.5k | 15.1k |
| | | Dev | 71.0k | 2.5k | 3.0k |
| | | Test | 143.4k | 5.0k | 6.3k |
| JNLPBA | Gene/protein | Train | 443.6k | 14.6k | 32.1k |
| | | Dev | 117.2k | 3.8k | 8.5k |
| | | Test | 114.7k | 3.8k | 6.2k |
| BC5CDR-chemical | Chemical | Train | 118.1k | 4.5k | 5.2k |
| | | Dev | 117.4k | 4.5k | 5.3k |
| | | Test | 124.7k | 4.7k | 5.3k |
| NCBI-disease | Disease | Train | 135.7k | 5.4k | 5.1k |
| | | Dev | 23.9k | 923 | 787 |
| | | Test | 24.4k | 940 | 960 |
| LINNAEUS | Species | Train | 281.2k | 11.9k | 2.1k |
| | | Dev | 93.8k | 4.0k | 711 |
| | | Test | 165.0k | 7.1k | 1.4k |
| Species-800 | Species | Train | 147.2k | 5.7k | 2.5k |
| | | Dev | 22.2k | 830 | 384 |
| | | Test | 42.2k | 1.6k | 767 |
“Token #”, “Sent. #”, and “Entity #” denote the number of tokens, sentences, and entities, respectively
Experimental results of models on six benchmark datasets
| Methods | BC2GM | | JNLPBA | | BC5CDR-chemical | | NCBI-disease | | LINNAEUS | | Species-800 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | F1 | SD | F1 | SD | F1 | SD | F1 | SD | F1 | SD | F1 | SD |
| Base | 84.61 | 0.21 | 76.85 | 0.31 | 93.50 | 0.10 | 88.63 | 0.71 | 88.27 | 0.32 | 74.97 | 0.46 |
| + PL (DC) | 84.47 | 0.15 | | 0.45 | 93.66 | 0.15 | 89.09 | 0.55 | 88.36 | 0.16 | 75.04 | 0.46 |
| + PL (KVMN) | | 0.10 | 77.06 | 0.05 | | 0.19 | | 0.56 | | 0.30 | | 0.41 |
| + SC (DC) | 84.45 | 0.19 | 76.80 | 0.45 | 93.68 | 0.13 | 89.18 | 0.26 | 88.23 | 0.33 | 75.37 | 0.51 |
| + SC (KVMN) | | 0.21 | | 0.16 | | 0.11 | | 0.52 | | 0.30 | | 0.50 |
| + DR (DC) | 84.33 | 0.30 | 77.01 | 0.28 | 93.66 | 0.15 | 89.05 | 0.23 | 88.43 | 0.19 | 75.12 | 0.52 |
| + DR (KVMN) | | 0.27 | | 0.35 | | 0.18 | | 0.60 | | 0.15 | | 0.71 |
| Large | 84.89 | 0.17 | 77.29 | 0.19 | 93.90 | 0.31 | 88.65 | 0.59 | 88.87 | 0.65 | 74.98 | 0.59 |
| + PL (DC) | 85.06 | 0.08 | | 0.18 | 93.90 | 0.16 | 88.74 | 0.26 | 88.65 | 0.39 | 74.92 | 0.86 |
| + PL (KVMN) | | 0.12 | 77.50 | 0.19 | | 0.23 | | 0.29 | | 0.31 | | 0.95 |
| + SC (DC) | 85.12 | 0.13 | 77.56 | 0.12 | 93.95 | 0.09 | 88.78 | 0.54 | | 0.28 | | 0.29 |
| + SC (KVMN) | | 0.15 | | 0.19 | | 0.13 | | 0.37 | 88.92 | 0.35 | 75.08 | 0.68 |
| + DR (DC) | 85.01 | 0.12 | 77.58 | 0.10 | 93.97 | 0.17 | | 0.30 | 88.99 | 0.22 | 75.01 | 0.83 |
| + DR (KVMN) | | 0.10 | | 0.11 | | 0.10 | 88.81 | 0.51 | | 0.27 | | 0.91 |
The experimental results are reported in terms of average F1 scores (F1) and standard deviations (SD). The methods in the “Base” and “Large” groups refer to the baselines with the BioBERT-Base and BioBERT-Large encoders and to our methods with KVMN. “DC” refers to the baseline method using direct concatenation to incorporate syntactic information. “PL”, “SC”, and “DR” stand for POS labels, syntactic constituents, and dependency relations, respectively
Comparison with previous deep learning based methods
| Methods | BC2GM | JNLPBA | BC5CDR-chemical | NCBI-disease | LINNAEUS | Species-800 |
|---|---|---|---|---|---|---|
| biLSTM + pre-trained embeddings | 78.57 | 77.25 | 91.05 | 84.64 | 73.11 |
| biLSTM + attentions | – | – | 92.57 | – | – | – |
| biLSTM + multi-task learning | 80.74 | 73.52 | – | 86.14 | – | – |
| biLSTM + pre-training | 81.69 | 75.03 | – | 87.34 | – | – |
| biLSTM + transfer learning | 78.66 | – | 91.64 | 84.72 | 93.54 | 74.98 |
| biLSTM + model ensemble | 79.73 | 93.31 | 86.36 | – | – |
| SciBERT | – | 77.28 | – | 88.57 | – | – |
| BERT | 81.79 | 74.94 | 91.16 | 85.63 | 87.60 | 71.63 |
| BioBERT (Base) [19] | 84.72 | 77.49 | 93.47 | 89.71 | 88.24 | 75.31 |
| BioBERT (Large) [19] | 85.01 | – | – | 88.79 | – | – |
| BioBERT (Base) + DR (KVMN) | 84.92 | 77.72 | 94.00 | 88.79 | 76.21 |
| BioBERT (Large) + DR (KVMN) | 77.83 | 89.63 | 89.24 |
The results (F1 scores) of our method on each dataset come from the best-performing model. The results for the Base and Large versions of BioBERT [19] are from their paper and GitHub repository
We report the results of their version 1.1, which is identical to the BioBERT version used in our experiments
Results of the syntactic information ensemble on BC5CDR-chemical dataset
| Ensemble strategies | PL | SC | DR | BioBERT-Base | | BioBERT-Large | |
|---|---|---|---|---|---|---|---|
| | | | | F1 | SD | F1 | SD |
| Baseline | | | | 93.50 | 0.10 | 93.90 | 0.31 |
| Sum | | | | 93.66 | 0.17 | 94.20 | 0.15 |
| | | | | 93.76 | 0.16 | 94.10 | 0.15 |
| | | | | 93.81 | 0.15 | 94.12 | 0.14 |
| | | | | 93.78 | 0.25 | 94.26 | 0.16 |
| Concatenation | | | | 93.75 | 0.23 | 94.25 | 0.12 |
| | | | | 93.80 | 0.26 | 94.22 | 0.16 |
| | | | | 93.83 | 0.20 | 94.31 | 0.08 |
| | | | | | 0.26 | | 0.25 |
The three types of syntactic information used for the ensemble are POS labels (PL), syntactic constituents (SC), and dependency relations (DR). The results are reported in terms of average F1 scores and standard deviations (SD). Sum and concatenation are the two ensemble strategies applied to our method
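The two ensemble strategies in the table above differ only in how the per-token outputs of the three KVMNs are merged. As a minimal sketch (the vectors and the hidden size d = 8 are placeholders, not values from the paper), the contrast can be shown with numpy:

```python
import numpy as np

# Hypothetical per-token outputs of three KVMNs, one per syntactic
# information type (PL, SC, DR); the hidden size d is an assumption.
rng = np.random.default_rng(0)
d = 8
o_pl = rng.standard_normal(d)
o_sc = rng.standard_normal(d)
o_dr = rng.standard_normal(d)

# "Sum" strategy: element-wise addition keeps the hidden size at d.
ensemble_sum = o_pl + o_sc + o_dr                  # shape (d,)

# "Concatenation" strategy: stacking triples the hidden size to 3*d,
# so the downstream decoder must expect the larger input.
ensemble_cat = np.concatenate([o_pl, o_sc, o_dr])  # shape (3*d,)

print(ensemble_sum.shape, ensemble_cat.shape)  # (8,) (24,)
```

Sum preserves the model width at the cost of mixing the three signals; concatenation keeps them separable but enlarges the decoder input, which matches the table's treatment of the two strategies as alternatives rather than a single fixed design.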
Results of using different NLP toolkits on the BC5CDR-chemical dataset
| Methods | BioBERT-Base | | BioBERT-Large | |
|---|---|---|---|---|
| | F1 | SD | F1 | SD |
| Baseline | 93.50 | 0.10 | 93.90 | 0.31 |
| Stanford CoreNLP Toolkits | | | | |
| PL (KVMN) | 93.73 | 0.19 | 94.05 | 0.23 |
| DR (KVMN) | | 0.18 | 94.05 | 0.10 |
| spaCy | | | | |
| PL (KVMN) | 93.69 | 0.12 | | 0.10 |
| DR (KVMN) | 93.71 | 0.12 | 93.97 | 0.13 |
The experimental results [average F1 scores and standard deviations (SD)] of our method with KVMN using different NLP toolkits (i.e., Stanford CoreNLP and spaCy) to obtain POS labels (PL) and dependency relations (DR). The results of the baseline methods without any syntactic information are also reported for reference
Fig. 2 Case study. Panels a and b show two examples of syntactic information (i.e., syntactic constituents and dependency relations) and the context features for “SEP” and “dystrophy”, respectively. The weights for syntactic information learned from the memories are highlighted, with darker colors indicating greater values
Fig. 3 The overall architecture of BioKMNER. The top part of the figure shows the syntactic information extraction process: for the input word sequence, we first use off-the-shelf NLP toolkits to obtain its syntactic information (e.g., a syntax tree), then map the context features and the syntactic information into keys and values, and finally convert them into embeddings. The bottom part is our sequence-labeling-based BioNER tagger, which uses BioBERT [19] as the encoder and a softmax layer as the decoder. Between the encoder and decoder are the key-value memory networks (KVMN), which weight the syntactic information (values) according to the importance of the context features (keys). The output of the KVMN is fed into the decoder to predict the output labels
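The memory-addressing step described in Fig. 3 can be sketched in a few lines of numpy. This is a simplified illustration, not the paper's exact formulation: all embeddings are random placeholders, the similarity is a plain dot product, and only the key-value aggregation is shown (in the full model the aggregated vector is combined with the encoder output before decoding).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kvmn(h, keys, values):
    """Key-value memory sketch: weight the value (syntactic information)
    embeddings by the similarity between the encoder hidden state h and
    the key (context feature) embeddings, then return the weighted sum.
    h: (d,) hidden state; keys, values: (m, d) for m memory slots."""
    weights = softmax(keys @ h)  # attention distribution over the m keys
    return weights @ values      # (d,) aggregated syntactic vector

rng = np.random.default_rng(1)
d, m = 8, 5                          # assumed sizes for illustration
h = rng.standard_normal(d)           # e.g., BioBERT output for one token
keys = rng.standard_normal((m, d))   # context-feature embeddings
values = rng.standard_normal((m, d)) # syntactic-information embeddings

o = kvmn(h, keys, values)
print(o.shape)  # (8,)
```

Because the weights form a softmax distribution, inaccurate syntactic slots can be down-weighted by the context rather than contributing with a fixed magnitude, which is the flexibility the abstract contrasts with direct concatenation.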
Fig. 4 Syntactic information extraction. Three types of syntactic information extracted for an example sentence in the biomedical domain, “Dihydropyrimidine dehydrogenase deficiency is an autosomal recessive disease”. The context features and their corresponding POS labels, syntactic constituents, and dependency relations for the word “deficiency” are highlighted in parts a, b, and c, respectively
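For the dependency-relation case in Fig. 4c, the key-value pairs for a token are built from the words it is syntactically linked to (keys) and the relations on those links (values). The sketch below uses a hand-written, simplified parse of the figure's example sentence (omitting the determiner and adjectives); an actual toolkit such as Stanford CoreNLP may produce different attachments.

```python
# Sketch of building KVMN key-value pairs from a dependency parse for
# "deficiency" (Fig. 4). The parse is hand-written for illustration and
# not the output of any particular toolkit.

# (token, head index, dependency relation); -1 marks the root.
parse = [
    ("Dihydropyrimidine", 2, "compound"),
    ("dehydrogenase",     2, "compound"),
    ("deficiency",        3, "nsubj"),
    ("is",               -1, "root"),
    ("disease",           3, "attr"),
]

def context_pairs(parse, i):
    """Keys are the words syntactically linked to token i (its head and
    its dependents); values are the corresponding dependency relations."""
    pairs = []
    head = parse[i][1]
    if head >= 0:
        pairs.append((parse[head][0], parse[i][2]))  # head word, own relation
    for word, h, rel in parse:
        if h == i:
            pairs.append((word, rel))                # dependent words
    return pairs

print(context_pairs(parse, 2))  # pairs for "deficiency"
```

Each resulting (word, relation) pair is then embedded as one key-value memory slot, matching the key/value mapping step at the top of Fig. 3.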