Buzhou Tang, Hongxin Cao, Xiaolong Wang, Qingcai Chen, Hua Xu.
Abstract
Biomedical Named Entity Recognition (BNER), which extracts important entities such as genes and proteins, is a crucial natural language processing step in the biomedical domain. Various machine learning-based approaches have been applied to BNER tasks and shown good performance. In this paper, we systematically investigated three different types of word representation (WR) features for BNER: clustering-based representations, distributional representations, and word embeddings. We selected one algorithm from each of the three types of WR features and applied them to the JNLPBA and BioCreAtIvE II BNER tasks. Our results showed that all three WR algorithms were beneficial to machine learning-based BNER systems. Moreover, combining these different types of WR features further improved BNER performance, indicating that they are complementary to each other. By combining all three types of WR features, the improvements in F-measure on the BioCreAtIvE II GM and JNLPBA corpora were 3.75% and 1.39%, respectively, compared with systems using baseline features. To the best of our knowledge, this is the first study to systematically evaluate the effect of three different types of WR features for BNER tasks.
Year: 2014 PMID: 24729964 PMCID: PMC3963372 DOI: 10.1155/2014/240403
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Counts of different types of entities in two corpora used in this study.
| Corpus | Gene/protein (GM) | Total (GM) | Protein (JNLPBA) | DNA (JNLPBA) | RNA (JNLPBA) | Cell line (JNLPBA) | Cell type (JNLPBA) | Total (JNLPBA) |
|---|---|---|---|---|---|---|---|---|
| Training | 18,265 | 18,265 | 30,269 | 9,534 | 951 | 3,830 | 6,718 | 51,301 |
| Test | 6,331 | 6,331 | 5,067 | 1,056 | 118 | 500 | 1,921 | 8,662 |
Examples of named entities represented by BIO labels. The first sentence comes from the JNLPBA corpus and the second sentence comes from the BioCreAtIvE II GM corpus.
Example 1 (JNLPBA):

| Token | IL-2 | gene | expression | and | NF-kappa | B | activation |
|---|---|---|---|---|---|---|---|
| Label | B-DNA | I-DNA | O | O | B-protein | I-protein | O |

Example 2 (BioCreAtIvE II GM):

| Token | Comparison | with | alkaline | phosphatases | and | 5 | — | nucleotidase |
|---|---|---|---|---|---|---|---|---|
| Label | O | O | B-GM | I-GM | O | B-GM | I-GM | I-GM |
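The BIO labels in the examples above can be produced mechanically from token-level entity spans. A minimal sketch (the `to_bio` helper and the span input format are illustrative assumptions, not from the paper):

```python
def to_bio(tokens, entities):
    """Convert token-level entity spans to BIO labels.

    tokens:   list of token strings
    entities: list of (start, end, type) spans, end exclusive
    """
    labels = ["O"] * len(tokens)
    for start, end, etype in entities:
        labels[start] = f"B-{etype}"           # first token of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"           # continuation tokens
    return labels

# Example 1 from the JNLPBA corpus
tokens = ["IL-2", "gene", "expression", "and", "NF-kappa", "B", "activation"]
entities = [(0, 2, "DNA"), (4, 6, "protein")]
print(to_bio(tokens, entities))
# → ['B-DNA', 'I-DNA', 'O', 'O', 'B-protein', 'I-protein', 'O']
```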
Figure 1: A hierarchical structure fragment generated by Brown clustering for 7 words from the JNLPBA corpus.
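Brown clustering assigns each word a bit-string path in a binary hierarchy like the one in Figure 1, and prefixes of that path at several lengths are commonly used as discrete features. A sketch under that assumption (the bit strings below are made up for illustration, not taken from the paper's clustering output):

```python
# Hypothetical cluster paths; real paths come from running Brown
# clustering on a large corpus.
BROWN_PATHS = {"IL-2": "110100", "IL-4": "110101", "gene": "011100"}

def brown_features(word, prefix_lengths=(2, 4, 6)):
    """Emit cluster-path prefix features at several granularities."""
    path = BROWN_PATHS.get(word)
    if path is None:
        return []                              # unknown word: no feature
    return [f"brown{n}={path[:n]}" for n in prefix_lengths]

print(brown_features("IL-2"))
# → ['brown2=11', 'brown4=1101', 'brown6=110100']
```

Shorter prefixes group more words together, so the feature fires at coarse and fine granularities at once.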
A fragment of the semantic thesaurus of 3 words in the JNLPBA corpus, after running random indexing.
| zymosan-tr | zymogen | ym268 |
|---|---|---|
| interferon-tr: 0.276595744681 | monocyte/b-cell-specif: 0.359477124183; tubulointerstitium: 0.314720812183 | jak-l: 0.272425249169 |
Word embeddings of 4 words in the JNLPBA corpus. Each number denotes the feature value in a latent semantic/syntactic space.
| Word | Embedding (first 10 dimensions shown) |
|---|---|
| the | 0.067476 −0.017934 0.036855 0.348073 0.063362 −0.138005 −0.144527 −0.014324 0.161269 0.152643 … |
| of | 0.067905 −0.074922 0.012121 0.050542 0.327945 0.098191 −0.087244 0.194758 0.218592 −0.115941 … |
| gene | −0.254542 0.100417 −0.124032 0.084818 −0.279409 0.081752 −0.378949 −0.068434 −0.050847 0.142284 … |
| transcript | −0.157966 −0.303626 0.010010 −0.081133 −0.111763 −0.088829 −0.160671 0.185505 0.097515 −0.014036 … |
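Real-valued embeddings like these cannot be used directly as discrete CRF features; one common workaround is to bin each dimension into a small set of labels. A minimal sketch of that idea (the binning scheme and `binned_features` helper are assumptions for illustration; the paper may have discretized differently):

```python
def binned_features(word, vectors, step=0.1):
    """Discretize each embedding dimension into a bin label, e.g. emb0=-0.3."""
    vec = vectors.get(word)
    if vec is None:
        return ["emb=OOV"]                     # out-of-vocabulary marker
    return [f"emb{i}={round(v / step) * step:+.1f}" for i, v in enumerate(vec)]

# first three dimensions of "gene" from the table above
vectors = {"gene": [-0.254542, 0.100417, -0.124032]}
print(binned_features("gene", vectors))
# → ['emb0=-0.3', 'emb1=+0.1', 'emb2=-0.1']
```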
Performance of CRF-based BNER systems when different types of WR features were used.
| System | Precision (GM, %) | Recall (GM, %) | F-measure (GM, %) | Precision (JNLPBA, %) | Recall (JNLPBA, %) | F-measure (JNLPBA, %) |
|---|---|---|---|---|---|---|
| Baseline | 87.31 | 69.20 | 77.21 | 71.37 | 68.68 | 70.00 |
| Baseline + WR1 | 86.55 | 73.18 | 79.31 | 70.96 | 71.44 | 71.20 |
| Baseline + WR2 | 87.34 | 73.91 | 80.07 | 71.59 | 69.55 | 70.55 |
| Baseline + WR3 | 86.56 | 72.22 | 78.74 | 71.11 | 69.88 | 70.49 |
| Baseline + WR1 + WR2 | 86.56 | 75.39 | 80.59 | 70.99 | 71.77 | 71.38 |
| Baseline + WR1 + WR3 | 85.77 | 74.65 | 79.82 | 70.77 | 71.87 | 71.31 |
| Baseline + WR2 + WR3 | 87.03 | 74.90 | 80.51 | 71.19 | 70.41 | 70.80 |
| Baseline + WR1 + WR2 + WR3 | 86.54 | 76.05 | 80.96 | 70.78 | 72.00 | 71.39 |
*WR1, WR2, and WR3 denote three different types of word representation features: clustering-based, distributional, and word embeddings features, respectively.
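The F-measure columns are the standard balanced F1, the harmonic mean F = 2PR/(P + R) of precision and recall; for example, the baseline row's P/R pairs reproduce its reported 77.21 and 70.00:

```python
def f_measure(precision, recall):
    """Balanced F1: harmonic mean of precision and recall (both in %)."""
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(87.31, 69.20), 2))  # baseline, BioCreAtIvE II GM
print(round(f_measure(71.37, 68.68), 2))  # baseline, JNLPBA
# → 77.21 and 70.0
```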