Tsendsuren Munkhdalai, Meijing Li, Khuyagbaatar Batsuren, Hyeon Ah Park, Nak Hyeon Choi, Keun Ho Ryu.
Abstract
BACKGROUND: Chemical and biomedical Named Entity Recognition (NER) is an essential preprocessing task for effective text mining of biochemical text. Exploiting unlabeled text to improve system performance has been an active and challenging research topic in text mining because of the recent growth in the amount of biomedical literature. We present a semi-supervised learning method that efficiently exploits unlabeled data to incorporate domain knowledge into a named entity recognition model and improve its performance. The proposed method combines Natural Language Processing (NLP) tasks for text preprocessing, word representation features learned from a large amount of text data for feature extraction, and conditional random fields for token classification. Apart from free text in the domain, the proposed method relies on no lexicon or dictionary, which keeps the system applicable to other NER tasks in biomedical text.
Keywords: Conditional Random Fields; Feature Representation Learning; Named Entity Recognition; Semi-Supervised Learning
Year: 2015 PMID: 25810780 PMCID: PMC4331699 DOI: 10.1186/1758-2946-7-S1-S9
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Figure 1. System design for chemical and drug Named Entity Recognition. The solid lines represent the flow of labeled data, and the dotted lines represent the flow of unlabeled data.
The baseline features.
| Feature description | Note/Regular expression |
|---|---|
| Roman number | [ivxdlcm]+\|[IVXDLCM]+ |
| Punctuation | [,\.;:?!] |
| Start with dash | -.* |
| Nucleotide sequence | [atgcu]+ |
| Number | [0-9]+ |
| Capitalized | [A-Z][a-z]* |
| Quote | [\"`'] |
| The lemma for the current token | Provided by BioLemmatizer |
| 2, 3 and 4-character prefixes and suffixes | |
| 2 and 3 character n-grams | Token start or end indicators are included |
| 2 and 3 word n-grams | |
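
The orthographic and affix features in the table can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each regular expression must match the whole token (`fullmatch`), uses `^` and `$` as hypothetical token start and end indicators for the character n-grams, and leaves out the lemma and word n-gram features, which need a lemmatizer and sentence context. The function names are invented for this example.

```python
import re

# Orthographic patterns taken from the baseline feature table.
# Requiring a whole-token match (fullmatch) is an assumption.
PATTERNS = {
    "roman_number": re.compile(r"[ivxdlcm]+|[IVXDLCM]+"),
    "punctuation": re.compile(r"[,\.;:?!]"),
    "starts_with_dash": re.compile(r"-.*"),
    "nucleotide_sequence": re.compile(r"[atgcu]+"),
    "number": re.compile(r"[0-9]+"),
    "capitalized": re.compile(r"[A-Z][a-z]*"),
    "quote": re.compile(r"[\"`']"),
}


def char_ngrams(token, n):
    """Character n-grams; '^' and '$' are hypothetical start/end indicators."""
    padded = "^" + token + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def token_features(token):
    """Return a dict of baseline-style features for a single token."""
    features = {name: bool(p.fullmatch(token)) for name, p in PATTERNS.items()}
    # 2-, 3- and 4-character prefixes and suffixes (skipped for shorter tokens).
    for n in (2, 3, 4):
        if len(token) >= n:
            features[f"prefix{n}"] = token[:n]
            features[f"suffix{n}"] = token[-n:]
    # 2- and 3-character n-grams, including the boundary indicators.
    for n in (2, 3):
        for gram in char_ngrams(token, n):
            features[f"char{n}gram={gram}"] = True
    return features
```

Calling `token_features("Aspirin")` yields `capitalized` set to true, `prefix2` equal to `"As"`, and one boolean entry per character n-gram; a CRF toolkit would typically consume one such dictionary per token.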
CDI subtask evaluation results of different runs with varied features.
| Features | Development set: Pre | Development set: Rec | Development set: F-scr | Testing set: Pre | Testing set: Rec | Testing set: F-scr |
|---|---|---|---|---|---|---|
| BANNER setup | 82.83 | 78.71 | 80.72 | 85.36 | 85.29 | 85.32 |
| Baseline | 81.71 | 82.3 | 82 | 75.87 | 70.55 | 73.11 |
| Baseline + Brown 300 | 82.2 | 82.96 | 82.58 | 86.03 | 85.45 | 85.74 |
| Baseline + Brown 1000 | 81.96 | 83.24 | 82.59 | 86.04 | 85.60 | 85.82 |
| Baseline + Brown 1000 + WVC 1000 | 82.73 | 83.89 | 83.31 | 86.23 | 85.37 | 85.8 |
| Baseline + Brown 1000 + Brown 300 | 82.1 | 83.42 | 82.76 | 86.46 | 85.63 | 86.04 |
| Baseline + Brown 1000 + WVC 300 | 82.43 | 83.82 | 83.12 | 86.06 | 86.06 | 86.06 |
| Baseline + Brown 1000 + WVC 500 | 82.78 | 83.56 | 83.17 | 86.12 | 86.2 | 86.16 |
| Baseline + Brown 1000 + WVC 500 + WVC 300 | | 83.78 | | 86.10 | 86.31 | 86.2 |
| Baseline + Brown 1000 + WVC 500 + WVC 1000 | 82.78 | 83.76 | 83.27 | 86.19 | 86.4 | 86.28 |
| Baseline + Brown 1000 + WVC 500 + WVC 300 + WVC 1000 | 82.3 | 83.16 | ||||
Feature groups are separated by (+). The parameters following Brown and WVC are the numbers of classes induced in each model. Pre: Precision, Rec: Recall, F-scr: F-score.
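
The relationship between the three reported metrics can be checked with a short calculation. The sketch below assumes the F-score is the balanced F1 measure (the harmonic mean of precision and recall); the source does not state the weighting, but this assumption reproduces the complete rows of the table, such as the BANNER setup on the development set.

```python
def f_score(precision, recall):
    """Balanced F-score (harmonic mean), in the same units as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# BANNER setup, development set: Pre 82.83, Rec 78.71
print(round(f_score(82.83, 78.71), 2))  # 80.72
```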
CEM subtask evaluation results of different runs with varied features.
| Features | Development set: Pre | Development set: Rec | Development set: F-scr | Testing set: Pre | Testing set: Rec | Testing set: F-scr |
|---|---|---|---|---|---|---|
| BANNER setup | 85.59 | 72.74 | 78.64 | 88.2 | 80.74 | 84.31 |
| Baseline | 84.40 | 77.34 | 80.71 | 79.81 | 63.16 | 70.51 |
| Baseline + Brown 300 | 84.6 | 78.47 | 81.42 | 88.67 | 81.17 | 84.75 |
| Baseline + Brown 1000 | 84.6 | 79.34 | 81.89 | 88.71 | 81.39 | 84.89 |
| Baseline + Brown 1000 + WVC 1000 | 85.25 | 80.3 | 82.7 | 88.79 | 81.45 | 84.96 |
| Baseline + Brown 1000 + Brown 300 | 84.76 | 79.46 | 82.03 | 89.1 | 81.54 | 85.2 |
| Baseline + Brown 1000 + WVC 300 | 84.98 | 80.07 | 82.45 | 88.65 | 82.13 | 85.26 |
| Baseline + Brown 1000 + WVC 500 | 85.32 | 79.92 | 82.53 | 88.77 | 82.42 | 85.48 |
| Baseline + Brown 1000 + WVC 500 + WVC 300 | | 80.1 | | 88.57 | 82.6 | 85.48 |
| Baseline + Brown 1000 + WVC 500 + WVC 1000 | 85.28 | 80.28 | 82.7 | 88.8 | 82.6 | 85.59 |
| Baseline + Brown 1000 + WVC 500 + WVC 300 + WVC 1000 | 84.89 | 82.56 | ||||
Feature groups are separated by (+). The parameters following Brown and WVC are the numbers of classes induced in each model. Pre: Precision, Rec: Recall, F-scr: F-score.
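
One way to picture how feature groups such as "Baseline + Brown 1000 + WVC 500" combine is as a merge of per-token feature dictionaries, where each class-induction model contributes the class ID it assigns to the token. The sketch below is illustrative only: the lookup-table format, the feature names, the lower-casing of tokens, and the toy class IDs are all assumptions, and the tables stand in for models that would be induced from unlabeled text.

```python
def cluster_features(token, cluster_models):
    """Add one feature per model: the class ID assigned to the token.

    cluster_models maps a feature-group name (e.g. "Brown1000") to a dict
    from lower-cased word to class ID. Names and format are hypothetical.
    """
    features = {}
    key = token.lower()  # lower-casing before lookup is an assumption
    for name, word_to_class in cluster_models.items():
        class_id = word_to_class.get(key)
        if class_id is not None:  # words unseen by a model get no feature
            features[f"{name}_class"] = class_id
    return features


# Toy lookup tables standing in for models induced from unlabeled text.
toy_models = {
    "Brown1000": {"aspirin": "c17", "ibuprofen": "c17", "patient": "c403"},
    "WVC500": {"aspirin": 42, "ibuprofen": 42},
}

baseline = {"capitalized": True}  # stand-in for the baseline feature group
combined = {**baseline, **cluster_features("Aspirin", toy_models)}
```

Because words that share a class receive the same feature value, a classifier trained on labeled mentions of one word can generalize to other words in the same class, which is how unlabeled text contributes to the model.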
BioCreative II gene mention evaluation results of different runs with varied features.
| Features | Testing set: Pre | Testing set: Rec | Testing set: F-scr |
|---|---|---|---|
| Baseline + Brown 300 | 86.49 | 83.79 | 85.12 |
| Baseline | 86.88 | 84.09 | 85.47 |
| Baseline + Brown 1000 | 86.82 | 84.27 | 85.53 |
| Baseline + Brown 1000 + WVC 500 + WVC 300 | 87.09 | 84.98 | 86.02 |
| Baseline + Brown 1000 + WVC 300 | 87.95 | 84.27 | 86.07 |
| Baseline + Brown 1000 + WVC 500 + WVC 1000 | 87.92 | 85.49 | 86.69 |
| Baseline + Brown 1000 + WVC 1000 | 85.58 | 86.84 | |
| Baseline + Brown 1000 + WVC 500 | 88.02 | ||
Feature groups are separated by (+). The parameters following Brown and WVC are the numbers of classes induced in each model. Pre: Precision, Rec: Recall, F-scr: F-score.
Comparison of different systems on the BioCreative II testing set.
| System or author | BioCreative II rank | Pre | Rec | F-scr |
|---|---|---|---|---|
| Ando | 1 | 88.48 | 85.97 | 87.21 |
| BANNER-CHEMDNER | - | 88.02 | 86.08 | 87.04 |
| Kuo et al. | 2 | 89.3 | 84.49 | 86.83 |
| Huang et al. | 3 | 84.93 | 88.28 | 86.57 |
| BANNER | - | 88.66 | 84.32 | 86.43 |