Chen Lyu, Bo Chen, Yafeng Ren, Donghong Ji.
Abstract
BACKGROUND: Biomedical named entity recognition (BNER) is a crucial initial step of information extraction in the biomedical domain. The task is typically modeled as a sequence labeling problem. Various machine learning algorithms, such as Conditional Random Fields (CRFs), have been successfully used for this task. However, these state-of-the-art BNER systems largely depend on hand-crafted features.
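To make the sequence labeling framing concrete, the sketch below decodes a per-token tag sequence into entity spans. It assumes the common BIO tagging scheme (`B-` begins an entity, `I-` continues it, `O` is outside); the tagging scheme, the `GENE` label, the example sentence, and the function name `bio_to_spans` are illustrative assumptions, not details taken from this record.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (start, end, type) spans, end exclusive."""
    spans = []
    start, etype = None, None
    # A trailing "O" sentinel flushes an entity that ends on the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if continues:
            continue
        # The current entity (if any) ends here.
        if etype is not None:
            spans.append((start, i, etype))
        start, etype = None, None
        # "B-X" opens a new entity; a stray "I-X" is also treated as an opening.
        if tag.startswith("B-") or tag.startswith("I-"):
            start, etype = i, tag[2:]
    return spans


# Hypothetical example sentence and labels.
tokens = ["The", "tumor", "necrosis", "factor", "gene", "is", "expressed"]
tags = ["O", "B-GENE", "I-GENE", "I-GENE", "O", "O", "O"]

for start, end, etype in bio_to_spans(tags):
    print(etype, " ".join(tokens[start:end]))
```

A tagger, whether a CRF or an LSTM-based model, then only has to predict one tag per token; entity boundaries and types are recovered by a decoding step of this kind.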
Keywords: Biomedical named entity recognition; Character representation; LSTM; Recurrent neural network; Word embeddings
Year: 2017 PMID: 29084508 PMCID: PMC5663060 DOI: 10.1186/s12859-017-1868-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The model architecture
Statistics of the datasets
| | Training | Dev | Test |
|---|---|---|---|
| GM | | | |
| Sentences | 13500 | 1500 | 5000 |
| One-word Entities | 7051 | 805 | 2831 |
| Multi-word Entities | 9355 | 1047 | 3494 |
| Total Entities | 16406 | 1852 | 6325 |
| JNLPBA | | | |
| Sentences | 16691 | 1855 | 3856 |
| One-word Entities | 19476 | 2170 | 3466 |
| Multi-word Entities | 26765 | 2890 | 5196 |
| Total Entities | 46241 | 5060 | 8662 |
Results for various LSTM-RNNs and word embeddings on the GM and JNLPBA data sets
| Systems | Dim. | GM (P/R/F1 score) | JNLPBA (P/R/F1 score) |
|---|---|---|---|
| LSTM-RNN | | | |
| +SENNA | 50 | 83.87/80.46/82.13 | 67.50/72.52/69.92 |
| +Biomedical | 50 | 85.85/84.09/84.96 | 70.69/74.80/72.69 |
| | 300 | 83.90/82.80/83.35 | 69.19/72.56/70.83 |
| +Biomedical | 300 | 86.66/85.58/86.12 | 70.34/74.96/72.58 |
| +Random | 300 | 83.63/76.56/79.94 | 66.96/71.46/69.13 |
| BLSTM-RNN | | | |
| +SENNA | 50 | 84.29/79.83/82.00 | 67.00/71.60/69.22 |
| +Biomedical | 50 | 88.42/82.63/85.43 | 71.04/74.45/72.71 |
| | 300 | 85.02/82.04/83.50 | 68.59/73.99/71.19 |
| +Biomedical | 300 | 87.85/85.29/86.55 | 71.24/76.53/73.79 |
| +Random | 300 | 82.87/77.65/80.18 | 68.43/70.98/69.68 |
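
Each cell in the results table reports precision, recall, and F1 score. As a quick sanity check, F1 can be recomputed as the harmonic mean of precision and recall. The snippet below is a minimal sketch; the function name `f1_score` is illustrative, and because the inputs are the rounded table values, a recomputed score may occasionally differ from a reported one in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# LSTM-RNN +SENNA (50 dim.) on GM: P = 83.87, R = 80.46
print(round(f1_score(83.87, 80.46), 2))  # 82.13

# BLSTM-RNN +Biomedical (300 dim.) on GM: P = 87.85, R = 85.29
print(round(f1_score(87.85, 85.29), 2))  # 86.55
```
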
Effects of fine-tuning word embeddings in LSTM-RNN and BLSTM-RNN
| Systems | Dim. | GM (+tune) | GM (-tune) | JNLPBA (+tune) | JNLPBA (-tune) |
|---|---|---|---|---|---|
| LSTM-RNN | | | | | |
| +SENNA | 50 | 85.69 | 82.13 | 70.56 | 69.92 |
| +Biomedical | 50 | 85.33 | 84.96 | 71.78 | 72.69 |
| | 300 | 85.65 | 83.35 | 71.13 | 70.83 |
| +Biomedical | 300 | 84.56 | 86.12 | 72.04 | 72.58 |
| +Random | 300 | 84.74 | 79.94 | 71.10 | 69.13 |
| BLSTM-RNN | | | | | |
| +SENNA | 50 | 86.81 | 82.00 | 72.09 | 69.22 |
| +Biomedical | 50 | 85.24 | 85.43 | 72.28 | 72.71 |
| | 300 | 86.52 | 83.50 | 73.03 | 71.19 |
| +Biomedical | 300 | 84.53 | 86.55 | 73.44 | 73.79 |
| +Random | 300 | 84.94 | 80.18 | 71.81 | 69.68 |
Fig. 2 Effects of character representation. +Char: with character representation; -Char: without character representation. (a) LSTM-RNN, (b) BLSTM-RNN
Comparison of systems with and without the CRF layer
| Systems | GM | JNLPBA |
|---|---|---|
| BLSTM-RNN | 82.64 | 71.93 |
| BLSTM-RNN+CRF | 86.55 | 73.79 |
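
One way to see why a CRF layer can help is that it scores whole tag sequences, including tag-to-tag transitions, instead of picking the best tag for each token independently. The sketch below is a generic Viterbi decoder over per-token emission scores and a transition matrix. It is not the authors' implementation; the three-tag set (`O`, `B`, `I`) and all score values are made up for illustration.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag index sequence.

    emissions[t][j]   : score of tag j at token t
    transitions[i][j] : score of moving from tag i to tag j
    """
    n_tags = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for reaching tag j at this token.
            best_i = max(range(n_tags), key=lambda i: scores[i] + transitions[i][j])
            pointers.append(best_i)
            new_scores.append(scores[best_i] + transitions[best_i][j] + emit[j])
        scores = new_scores
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    best = max(range(n_tags), key=lambda j: scores[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


TAGS = ["O", "B", "I"]
# Hypothetical emission scores for three tokens.
emissions = [[1.0, 0.2, 0.0],
             [0.1, 0.3, 1.0],
             [1.0, 0.0, 0.0]]
# Hypothetical transitions: O -> I is heavily penalized (an entity cannot start with I).
transitions = [[0.0, 0.0, -10.0],
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]

greedy = [max(range(3), key=lambda j: e[j]) for e in emissions]
print([TAGS[j] for j in greedy])                            # ['O', 'I', 'O']
print([TAGS[j] for j in viterbi(emissions, transitions)])   # ['O', 'B', 'O']
```

In this toy case, independent per-token decoding produces the invalid sequence `O I O`, while decoding with transition scores yields `O B O`.
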
Fig. 3 Feature representation of our model. Each column shows the feature representation produced by the BLSTM for one token, and each cell in a column corresponds to one dimension of that representation. The feature representation has 100 dimensions
Fig. 4 Feature representation of the word “factor”. “factor1” is the word in the first sentence. “factor2” and “factor3” are the corresponding words in the second sentence. Each vertical bar indicates one dimension of the feature representation for the corresponding word
Results of our model on the GM corpus, together with top-performance systems
| Systems | P/R/F1 |
|---|---|
| BLSTM + Biomedical (300 dim.) | |
| AIIAGMT | |
| IBM | 88.48/85.97/87.21 |
| Gimli | 90.22/84.82/87.17 |
| BANNER | 88.66/84.32/86.43 |
| NERSuite | 88.81/82.34/85.45 |
| Li et al. (2015) | 83.29/80.50/81.87 |
| NERBio | 92.67/68.91/79.05 |
Results of our model on the JNLPBA corpus, together with top-performance systems
| Systems | P/R/F1 |
|---|---|
| BLSTM + Biomedical (300 dim.) | |
| NERBio | |
| Infocomm | 69.42/75.99/72.55 |
| Gimli | 72.85/71.62/72.23 |
| NERSuite | 69.95/72.41/71.16 |
Error analysis on JNLPBA test set
| | Error type | % |
|---|---|---|
| FP | Boundary errors | 49.31 |
| FP | Type errors | 7.52 |
| FN | Boundary errors | 35.66 |
| FN | Type errors | 7.52 |
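
A breakdown like the one above can be produced by comparing predicted and gold entity spans. The sketch below shows one possible way to do this. The definitions are assumptions, since this record does not state the ones the authors used: a type error is a span with the same boundaries as a span on the other side but a different type, and a boundary error is a span that overlaps one without matching its boundaries. The extra `other` bucket (no overlap at all), the function name `categorize_errors`, and the example spans are also illustrative.

```python
def categorize_errors(gold, pred):
    """Count FP/FN errors by kind. Spans are (start, end, type), end exclusive."""

    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    def kind(span, others):
        # Assumed definition: same boundaries, different type.
        if any(o[:2] == span[:2] and o[2] != span[2] for o in others):
            return "type"
        # Assumed definition: overlapping span with mismatched boundaries.
        if any(overlaps(span, o) for o in others):
            return "boundary"
        return "other"

    counts = {}
    for span in pred - gold:  # false positives: predicted but not in gold
        key = ("FP", kind(span, gold))
        counts[key] = counts.get(key, 0) + 1
    for span in gold - pred:  # false negatives: in gold but not predicted
        key = ("FN", kind(span, pred))
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical gold and predicted spans for one sentence.
gold = {(1, 4, "protein"), (6, 7, "DNA")}
pred = {(2, 4, "protein"), (6, 7, "RNA")}
print(categorize_errors(gold, pred))
```

Under these assumed definitions, a single mistyped or mis-bounded prediction is counted once as a false positive and once as a false negative.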