S K Hong, Jae-Gil Lee.
Abstract
BACKGROUND: Biomedical named-entity recognition (BioNER) is widely modeled with conditional random fields (CRF) by treating it as a sequence labeling problem. CRF-based methods yield structured label outputs by imposing connectivity between the labels. Recent BioNER studies have reported state-of-the-art performance by combining deep learning-based models (e.g., bidirectional Long Short-Term Memory) with CRF. In these CRF-based methods, the deep learning-based models are dedicated to estimating individual labels, whereas the relationships between connected labels are described as static numbers; as a result, the context of a given input sentence cannot be reflected when generating the most plausible label-label transitions. At the same time, correctly segmenting entity mentions in biomedical texts is challenging because biomedical terms are often descriptive and long compared with general terms. Therefore, restricting the label-label transitions to static numbers is a bottleneck for improving BioNER performance.
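To make the contrast between static and context-dependent label-label transitions concrete, the following is a minimal sketch of Viterbi decoding in which the transition scores may differ at every position. It is an illustration under stated assumptions, not the authors' implementation: the function name `viterbi`, the three-label set (0 = O, 1 = B, 2 = I), and all scores are hypothetical toy values.

```python
from typing import List

def viterbi(unary: List[List[float]],
            pairwise: List[List[List[float]]]) -> List[int]:
    """Return the highest-scoring label sequence.

    unary[t][y]       : score of label y at position t
    pairwise[t][a][b] : score of moving from label a at position t
                        to label b at position t + 1
    """
    n, k = len(unary), len(unary[0])
    best = list(unary[0])          # best path score ending in each label
    back = []                      # back-pointers for positions 1..n-1
    for t in range(1, n):
        prev, best, ptr = best, [], []
        for b in range(k):
            a_star = max(range(k), key=lambda a: prev[a] + pairwise[t - 1][a][b])
            ptr.append(a_star)
            best.append(prev[a_star] + pairwise[t - 1][a_star][b] + unary[t][b])
        back.append(ptr)
    y = max(range(k), key=lambda b: best[b])
    path = [y]
    for ptr in reversed(back):     # follow back-pointers to the start
        y = ptr[y]
        path.append(y)
    return path[::-1]

# Toy (hypothetical) scores for a three-token sentence; labels: 0=O, 1=B, 2=I.
unary = [[0.0, 1.0, 0.0], [1.2, 0.0, 0.5], [0.5, 0.0, 0.6]]

# Static transitions: one matrix reused at every position (O -> I is penalized).
static_A = [[0.0, 0.0, -10.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
static = [static_A, static_A]

# Context-dependent transitions: a different matrix per position, here
# favoring B -> I after the first token and I -> I after the second.
dyn0 = [row[:] for row in static_A]
dyn0[1][2] = 1.0
dyn1 = [row[:] for row in static_A]
dyn1[2][2] = 1.0
dynamic = [dyn0, dyn1]

static_path = viterbi(unary, static)    # [1, 0, 0], i.e., B O O
dynamic_path = viterbi(unary, dynamic)  # [1, 2, 2], i.e., B I I
```

With the same unary scores, the static matrix cuts the mention after its first token, whereas the position-specific transition scores keep the three tokens together as one entity.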
Keywords: Bioinformatics; Data mining; Named entity recognition; Neural network
Year: 2020 PMID: 32046638 PMCID: PMC7014657 DOI: 10.1186/s12859-020-3393-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The overall architecture of the proposed framework DTranNER. (a) As a CRF-based framework, DTranNER comprises two separate underlying deep learning-based networks, Unary-Network and Pairwise-Network, which are arranged to yield agreed label sequences in the prediction stage. These underlying networks are trained via two separate CRFs: Unary-CRF and Pairwise-CRF. (b) The architecture of Unary-CRF, which is dedicated to training Unary-Network. (c) The architecture of Pairwise-CRF, which is dedicated to training Pairwise-Network. A token embedding layer is shared by Unary-Network and Pairwise-Network. Each token embedding is built by concatenating the token's traditional word embedding (denoted as “W2V”) and its contextualized token embedding (denoted as “ELMo”)
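The concatenation of the two embeddings mentioned in the caption can be illustrated with toy vectors. This is only a sketch: the function name `token_embedding`, the dictionary lookup, the zero vector for out-of-vocabulary tokens, and the tiny dimensions are all assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, List

def token_embedding(token: str,
                    w2v: Dict[str, List[float]],
                    contextual: List[float],
                    w2v_dim: int) -> List[float]:
    """Concatenate a context-independent vector with a contextualized one.

    w2v        : hypothetical lookup table of static word vectors ("W2V")
    contextual : vector for this token occurrence from a contextual
                 encoder ("ELMo" in the caption), computed elsewhere
    """
    # Assumption: unknown tokens fall back to a zero vector.
    static_vec = w2v.get(token, [0.0] * w2v_dim)
    return static_vec + contextual

# Toy usage with made-up 2-d static and 3-d contextual vectors.
toy_w2v = {"aspirin": [0.1, 0.2]}
known = token_embedding("aspirin", toy_w2v, [0.3, 0.4, 0.5], w2v_dim=2)
unknown = token_embedding("xyz", toy_w2v, [0.3, 0.4, 0.5], w2v_dim=2)
```

The resulting vector has dimension equal to the sum of the two input dimensions, so downstream layers see both representations side by side.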
BioNER corpora used in the experiments
| Datasets | Number of Sentences | Entity Types | Entity Counts | Max Entity Length | Average Entity Length |
|---|---|---|---|---|---|
| BC2GM | 20128 | Gene/Protein | 24583 | 26 tokens | 2.44 tokens |
| BC4CHEMD | 87682 | Chemical/Drug | 84310 | 137 tokens | 2.19 tokens |
| BC5CDR-Chemical | 13935 | Chemical/Drug | 15935 | 56 tokens | 1.33 tokens |
| BC5CDR-Disease | 13935 | Disease | 12852 | 19 tokens | 1.65 tokens |
| NCBI-Disease | 7284 | Disease | 6881 | 22 tokens | 2.21 tokens |
Performance values in terms of the precision (%), recall (%) and F1-score (%) for the state-of-the-art methods and the proposed model DTranNER
| Model | BC2GM P | BC2GM R | BC2GM F1 | BC4CHEMD P | BC4CHEMD R | BC4CHEMD F1 | BC5CDR-Chemical P | BC5CDR-Chemical R | BC5CDR-Chemical F1 | BC5CDR-Disease P | BC5CDR-Disease R | BC5CDR-Disease F1 | NCBI-Disease P | NCBI-Disease R | NCBI-Disease F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Att-BiLSTM-CRF (2017) | - | - | - | | 90.01 | 91.14 | 93.49 | 91.68 | 92.57 | - | - | - | - | - | - |
| D3NER (2018) | - | - | - | - | - | - | 93.73 | 92.56 | 93.14 | 83.98 | 85.40 | 84.68 | 85.03 | 83.80 | 84.41 |
| Collabonet (2018) | 80.49 | 78.99 | 79.73 | 90.78 | 87.01 | 88.85 | 94.26 | 92.38 | 93.31 | 85.61 | 82.61 | 84.08 | 85.48 | 87.27 | 86.36 |
| Wang et al. (2018) | 82.10 | 79.42 | 80.74 | 91.30 | 87.53 | 89.37 | 93.56 | 92.48 | 93.03 | 84.14 | 85.76 | 84.95 | 85.86 | 86.42 | 86.14 |
| BioBERT (2019) | | 83.65 | 84.40 | 92.23 | 90.61 | 91.41 | 93.27 | 93.61 | 93.44 | 85.86 | 87.27 | 86.56 | | | |
| DTranNER | 84.21 | | 84.56 | 91.94 | | | | | 94.16 | | | 87.22 | 88.21 | 89.04 | 88.62 |
Note: The highest performance in each corpus is highlighted in bold. We quote the published scores for the other models. For Wang et al. [11], we conducted additional experiments to obtain the performance scores for two corpora (i.e., BC5CDR-Chemical and BC5CDR-Disease) using the software in their open-source repository [45]
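For reference, the F1-score reported in the table is the harmonic mean of precision and recall. The short sketch below recomputes one published entry from the table (Collabonet on BC2GM); the function name `f1_score` is chosen here for illustration only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Collabonet (2018) on BC2GM: P = 80.49, R = 78.99
collabonet_bc2gm_f1 = round(f1_score(80.49, 78.99), 2)  # 79.73
```

The recomputed value agrees with the F1 column of that row, which is a quick consistency check when reading precision, recall, and F1 triplets.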
Impact of Unary-Network and Pairwise-Network in terms of the F1-score (%)
| Settings | BC5CDR-Chemical | BC5CDR-Disease | NCBI-Disease |
|---|---|---|---|
| Unary-CRF | 93.01 | 86.14 | 86.94 |
| Pairwise-CRF | 93.27 | 86.05 | 86.71 |
| Unary+Pairwise ensemble | 93.25 | 86.78 | 87.09 |
| DTranNER | 94.16 | 87.22 | 88.62 |
Note: “Unary-CRF” denotes a variant model excluding Pairwise-Network from DTranNER, “Pairwise-CRF” denotes a variant model excluding Unary-Network from DTranNER, and “Unary+Pairwise ensemble” is an ensemble model of “Unary-CRF” and “Pairwise-CRF.” In the ensemble model, “Unary-CRF” and “Pairwise-CRF” were independently trained, and they voted over the sequence predictions by their prediction scores
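One plausible reading of the ensemble described in the note is that each independently trained model proposes a label sequence together with a prediction score, and the sequence with the higher score is kept. The sketch below illustrates that reading only; the source does not spell out the voting rule, so the function name `vote_by_score`, the score scale, and the example values are assumptions.

```python
from typing import List, Tuple

# A prediction is a label sequence plus the score its model assigned to it.
Prediction = Tuple[List[str], float]

def vote_by_score(candidates: List[Prediction]) -> List[str]:
    """Keep the candidate label sequence with the highest prediction score."""
    best_labels, _ = max(candidates, key=lambda c: c[1])
    return best_labels

# Hypothetical outputs for one sentence (scores as log-probabilities).
unary_crf_pred = (["B", "I", "O"], -1.2)
pairwise_crf_pred = (["B", "I", "I"], -0.7)
chosen = vote_by_score([unary_crf_pred, pairwise_crf_pred])
```

Under this reading the two models never exchange information during training, which is the key difference from DTranNER, where Unary-Network and Pairwise-Network are combined to yield agreed label sequences.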
Impact of separate BiLSTM layers in terms of the F1-score (%)
| Settings | BC2GM | BC5CDR-Chemical | BC5CDR-Disease | NCBI-Disease |
|---|---|---|---|---|
| DTranNER-shared | 83.69 | 93.57 | 86.75 | 88.01 |
| DTranNER | 84.56 | 94.16 | 87.22 | 88.62 |
Note: “DTranNER-shared” is a variant model that shares the BiLSTM layer in “Unary-Network” and “Pairwise-Network.”
Impact of each component in the token embedding composition in terms of the F1-score (%)
| Settings | BC2GM | BC5CDR-Chemical | BC5CDR-Disease | NCBI-Disease |
|---|---|---|---|---|
| W2V | 82.03 | 92.64 | 85.17 | 84.88 |
| ELMo | 83.41 | 93.78 | 86.76 | 88.27 |
| ELMo + W2V(=DTranNER) | 84.56 | 94.16 | 87.22 | 88.62 |
Note: “W2V” is a variant model of DTranNER whose embedding layer uses only traditional context-independent token vectors (i.e., Wiki-PubMed-PMC [25]), “ELMo” is another variant model of DTranNER whose embedding layer uses only ELMo, and “ELMo + W2V” is equivalent to DTranNER
Case study of the label sequence prediction performed by DTranNER and Unary-CRF
| Case | Model | Sentence |
|---|---|---|
| Diseases/Chemicals | | |
| Case 1 | Unary-CRF | to enable diagnosis of |
| | DTranNER | to enable diagnosis of |
| Case 2 | Unary-CRF | The present study was designed to investigate whether nociceptin / |
| | DTranNER | The present study was designed to investigate whether |
| Case 3 | Unary-CRF | We report the case of a female patient with |
| | DTranNER | We report the case of a female patient with |
| Case 4 | Unary-CRF | Reduced |
| | DTranNER | Reduced |
| Genes/Proteins | | |
| Case 5 | Unary-CRF | The MIC90 of ABK against coagulase type IV strains was rather high, 12.5 micrograms/ml |
| | DTranNER | The MIC90 of ABK against |
| Case 6 | Unary-CRF | subtle differences between individual subunits that lead to species - specific properties of |
| | DTranNER | subtle differences between individual subunits that lead to species - specific properties of |
| Case 7 | Unary-CRF | |
| | DTranNER | |
Note: Unary-CRF is the purpose-built model excluding Pairwise-Network from DTranNER. The named entities inferred by each model are underlined in sentences