Yutaka Sasaki, Yoshimasa Tsuruoka, John McNaught, Sophia Ananiadou.
Abstract
BACKGROUND: When term ambiguity and variability are very high, dictionary-based Named Entity Recognition (NER) is not an ideal solution, even though large-scale terminological resources are available. Much research on statistical NER has tried to cope with these problems. However, it is not straightforward to exploit existing and additional Named Entity (NE) dictionaries in statistical NER. Presumably, adding NEs to an NE dictionary leads to better performance; in reality, however, retraining of the NER models is required to achieve this. We chose protein name recognition as a case study because it suffers most from heavy term variation and ambiguity.
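As a minimal illustration of the dictionary-lookup step that lexicon-based NER relies on (a sketch, not the authors' implementation; the lexicon entries below are hypothetical), a greedy longest-match tagger over a protein-name lexicon could look like this:

```python
def dictionary_tag(tokens, lexicon, max_len=5):
    """Greedy longest-match lookup of multi-word lexicon entries.

    tokens  : list of word tokens in the sentence
    lexicon : set of known protein names (lower-cased, space-joined)
    Returns a list of (start, end) token spans found in the lexicon.
    """
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in lexicon:
                match = (i, i + n)
                break
        if match:
            spans.append(match)
            i = match[1]
        else:
            i += 1
    return spans

# Hypothetical toy lexicon; real systems draw on large terminological resources.
lexicon = {"nf-kappa b", "ap-1", "protein tyrosine kinase"}
tokens = "transcription factors NF-kappa B and AP-1".split()
print(dictionary_tag(tokens, lexicon))  # [(2, 4), (5, 6)]
```

Pure lookup like this is exactly what breaks down under the term variation and ambiguity the abstract describes, which is why the paper combines the lexicon with statistical tagging.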
Year: 2008 PMID: 19025691 PMCID: PMC2586754 DOI: 10.1186/1471-2105-9-S11-S5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Block diagram of lexicon-based statistical NER.
Figure 2. Block diagram of tagging and labelling model generation.
Figure 3. Example of lexicon-based POS/Protein tagging.
Protein name recognition performance
| Method | Boundary | R | P | F |
| --- | --- | --- | --- | --- |
| (a) POS/PROTEIN tagging | Full | 52.91 | 43.85 | 47.96 |
| | Left | 61.48 | 50.95 | 55.72 |
| | Right | 61.38 | 50.87 | 55.63 |
| Sequential labelling | | | | |
| (b) Word feature | Full | 63.23 | 70.39 | 66.62 |
| | Left | 68.15 | 75.86 | 71.80 |
| | Right | 69.88 | 77.79 | 73.63 |
| (c) (b) + orthographic feature | Full | 77.17 | 67.52 | 72.02 |
| | Left | 82.51 | 72.20 | 77.01 |
| | Right | 84.29 | 73.75 | 78.67 |
| (d) (c) + POS feature | Full | 76.46 | 68.41 | 72.21 |
| | Left | 81.94 | 73.32 | 77.39 |
| | Right | 83.54 | 74.75 | 78.90 |
| (e) (d) + PROTEIN feature | Full | 77.58 | 69.18 | 73.14 |
| | Left | 82.69 | 73.74 | 77.96 |
| | Right | 84.37 | 75.24 | 79.54 |
| (f) (e) after adding protein names in the training set to the lexicon | Full | | | |
| | Left | 84.82 | 72.85 | 78.38 |
| | Right | 86.60 | 74.37 | 80.02 |
Protein name recognition performance of the proposed method, evaluated by recall (R), precision (P), and F-measure (F). The left boundary (Left), the right boundary (Right), and both boundaries (Full) recognition performance were measured. (a) the performance of POS/PROTEIN tagging. (b) the performance of sequential labelling when using the word feature only. (c) the performance of sequential labelling when using the word and orthographic features. (d) the performance of sequential labelling when using the word, orthographic, and POS features. (e) the performance of sequential labelling when using the word, orthographic, POS, and PROTEIN name features. (f) the performance of sequential labelling with the features used in (e) after adding protein names appearing in the training set to the lexicon. NB: no retraining was conducted.
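The F column is the balanced F-measure, i.e. the harmonic mean of recall and precision; the tabulated values can be reproduced from R and P. For example, checking row (e), Right boundary:

```python
def f_measure(recall, precision):
    # Balanced F-measure (F1): harmonic mean of recall and precision.
    return 2 * precision * recall / (precision + recall)

# Row (e), Right boundary: R = 84.37, P = 75.24 -> F = 79.54, as tabulated.
print(round(f_measure(84.37, 75.24), 2))  # 79.54
```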
Upper bound protein name recognition performance after ideal lexicon enrichment
| Method | Boundary | R | P | F |
| --- | --- | --- | --- | --- |
| Tagging (+ test set protein names) | Full | 79.02 | 61.87 | 69.40 |
| | Left | 82.28 | 64.42 | 72.26 |
| | Right | 80.96 | 63.38 | 71.10 |
| Labelling (+ test set protein names) | Full | | | |
| | Left | 89.58 | 75.40 | 81.88 |
| | Right | 90.23 | 75.95 | 82.47 |
The upper bound performance on the JNLPBA-2004 test set obtained by enriching the lexicon with protein names appearing in the test set. NB: only the lexicon was modified; the tagging and sequential labelling models were not retrained on the test set. The first block shows the performance of POS/PROTEIN tagging after adding protein names appearing in the test set to the dictionary. Since many protein names overlap with general English words, protein names in sentences are sometimes not recognized as protein names. The second block shows the performance of sequential labelling based on the tagging output.
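The key point of this experiment is that the lexicon feeds the trained model only through features computed at tagging time, so enlarging the lexicon changes the model's input without touching its weights. A minimal sketch of such a lexicon feature (the feature name and entries are hypothetical, not the authors' exact feature set):

```python
def protein_feature(token, lexicon):
    """Binary lexicon-membership feature fed to a fixed, already-trained tagger.

    Because the feature is recomputed from the lexicon at tagging time,
    enriching the lexicon changes the features the model sees without
    any retraining of the model's weights.
    """
    return "IN_LEXICON" if token.lower() in lexicon else "OOV"

lexicon = {"pcna"}                       # hypothetical initial lexicon
print(protein_feature("PCNA", lexicon))  # IN_LEXICON
print(protein_feature("STAT", lexicon))  # OOV

lexicon.add("stat")                      # enrich the lexicon only; no retraining
print(protein_feature("STAT", lexicon))  # IN_LEXICON
```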
Conventional results for protein name recognition
| Authors | R | P | F |
| --- | --- | --- | --- |
| Tsai et al. | 71.31 | 79.36 | 75.12 |
| Zhou and Su | 69.01 | 79.24 | 73.77 |
| Kim and Yoon | 75.82 | 71.02 | 73.34 |
| Okanohara et al. | 77.74 | 68.92 | 73.07 |
| Tsuruoka | 81.41 | 65.82 | 72.79 |
| Finkel et al. | 77.40 | 68.48 | 72.67 |
| Settles | 76.1 | 68.2 | 72.0 |
| Song et al. | 65.50 | 73.04 | 69.07 |
| Rössler | 72.9 | 62.0 | 67.0 |
| Park et al. | 69.71 | 59.37 | 64.12 |
Conventional scores on the test set of the JNLPBA-2004 shared task.
Error analysis
| False positives | |||
| Cause | Correct extraction | Identified term | |
| 1 | lexicon | - | protein, binding sites |
| 2 | prefix word | trans-acting factor | common trans-acting factor |
| 3 | unknown word | - | ATTTGCAT |
| 4 | sequential labelling error | - | additional proteins |
| 5 | test set error | - | Estradiol receptors |
| False negatives | |||
| Cause | Correct extraction | Identified term | |
| 1 | anaphoric | ( | - |
| 2 | coordination (and, or) | transcription factors NF-kappa B and AP-1 | transcription factors NF-kappa B |
| 3 | prefix word | activation protein-1 | protein-1 |
| catfish STAT | STAT | ||
| 4 | postfix word | nuclear factor kappa B complex | nuclear factor kappa B |
| 5 | plural | protein tyrosine kinase(s) | protein tyrosine kinase |
| 6 | family name, biding site, and domain | T3 binding sites | - |
| residues 639–656 | - | ||
| 7 | sequential labelling error | PCNA | - |
| Chloramphenicol acetyltransferase | - | ||
| 8 | test set error | superfamily member | - |
Error analysis of the results of the dictionary-based statistical approach.