Yoshimasa Tsuruoka, Jun'ichi Tsujii, Sophia Ananiadou.
Abstract
BACKGROUND: Previous studies of named entity recognition have shown that a reasonable level of recognition accuracy can be achieved by using machine learning models such as conditional random fields or support vector machines. However, the lack of training data (i.e., annotated corpora) makes it difficult for machine learning-based named entity recognizers to be used in building practical information extraction systems.
Year: 2008 PMID: 19025694 PMCID: PMC2586757 DOI: 10.1186/1471-2105-9-S11-S8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Annotating named entities by dynamic sentence selection.
N-best sequences output by the CRF tagger
| Probability | Transcription | factor | GATA-1 | and | the | estrogen | receptor |
| 0.677 | B | I | O | O | O | O | O |
| 0.242 | B | I | O | O | O | B | I |
| 0.035 | O | O | O | O | O | O | O |
| 0.012 | B | I | I | O | O | O | O |
| 0.009 | B | I | I | O | O | B | I |
| : | : | : | : | : | : | : | : |
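To make the N-best table concrete, the following is a minimal sketch of how such a list could be turned into entity spans and sentence-level scores. The function names (`bio_to_spans`, `prob_contains_entity`, `expected_entity_count`) are hypothetical, and summing over only the listed sequences is an assumption made for illustration; the record does not state how the tagger's output is aggregated.

```python
from typing import List, Tuple

def bio_to_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Convert a B/I/O tag sequence into (start, end) token spans, end exclusive."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:      # a new entity begins right after another
                spans.append((start, i))
            start = i
        elif tag == "I":
            if start is None:          # tolerate an I with no preceding B
                start = i
        else:                          # "O" closes any open entity
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans

def prob_contains_entity(nbest) -> float:
    """Sum the probabilities of listed sequences with at least one entity."""
    return sum(p for p, tags in nbest if bio_to_spans(tags))

def expected_entity_count(nbest) -> float:
    """Probability-weighted number of entities over the listed sequences."""
    return sum(p * len(bio_to_spans(tags)) for p, tags in nbest)

# The five sequences shown in the table above (the list is truncated there).
TOKENS = ["Transcription", "factor", "GATA-1", "and", "the", "estrogen", "receptor"]
NBEST = [
    (0.677, list("BIOOOOO")),
    (0.242, list("BIOOOBI")),
    (0.035, list("OOOOOOO")),
    (0.012, list("BIIOOOO")),
    (0.009, list("BIIOOBI")),
]

if __name__ == "__main__":
    for prob, tags in NBEST:
        entities = [" ".join(TOKENS[s:e]) for s, e in bio_to_spans(tags)]
        print(prob, entities)
    print(round(prob_contains_entity(NBEST), 3))
    print(round(expected_entity_count(NBEST), 3))
```

Because the table lists only the top five sequences, both scores are lower bounds on what the full distribution would give.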
Feature templates used in the CRF tagger
| Word Unigram | & | |
| POS Unigram | & | |
| Prefix, Suffix | prefixes of | & |
| | suffixes of | & |
| | (up to length 3) | |
| Normalized Word | N( | & |
| Word Shape | S( | & |
| Tag Bi-gram | true | & |
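As an illustration of the token-level templates named in the table, here is a rough sketch of prefix/suffix, normalized-word, and word-shape features. Only the "up to length 3" limit comes from the table. The definitions of `normalize` (lowercase, digits mapped to `0`) and `shape` (character classes mapped to `A`, `a`, `0`) are assumptions, since the symbols in the table did not survive extraction, and all function names are hypothetical.

```python
import re
from typing import Dict, List

def affixes(word: str, max_len: int = 3) -> Dict[str, List[str]]:
    """Prefixes and suffixes of the word, up to max_len characters."""
    n = min(max_len, len(word))
    return {
        "prefixes": [word[:k] for k in range(1, n + 1)],
        "suffixes": [word[-k:] for k in range(1, n + 1)],
    }

def normalize(word: str) -> str:
    """ASSUMPTION: lowercase the word and map every digit to '0'."""
    return re.sub(r"\d", "0", word.lower())

def shape(word: str) -> str:
    """ASSUMPTION: uppercase -> 'A', lowercase -> 'a', digit -> '0', rest kept."""
    out = []
    for ch in word:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("0")
        else:
            out.append(ch)
    return "".join(out)

def token_features(word: str) -> Dict[str, object]:
    """Bundle the per-token features for one word."""
    feats: Dict[str, object] = {"word": word}
    feats.update(affixes(word))
    feats["normalized"] = normalize(word)
    feats["shape"] = shape(word)
    return feats

if __name__ == "__main__":
    print(token_features("GATA-1"))
```

In a CRF, each of these values would be paired with the output tag to form a feature, which is what the "&" column in the table appears to indicate.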
Statistics of named entities
| | # Entities | Sentences (%) |
| CoNLL: LOC | 7,140 | 5,127 (36.5%) |
| CoNLL: MISC | 3,438 | 2,698 (19.2%) |
| CoNLL: ORG | 6,321 | 4,587 (32.7%) |
| CoNLL: PER | 6,600 | 4,373 (31.1%) |
| GENIA: DNA | 2,017 | 5,251 (28.3%) |
| GENIA: RNA | 225 | 810 (4.4%) |
| GENIA: cell_line | 835 | 2,880 (15.5%) |
| GENIA: cell_type | 1,104 | 5,212 (28.1%) |
| GENIA: protein | 5,272 | 13,040 (70.3%) |
Figure 2. Annotation of LOC in the CoNLL corpus.
Figure 5. Annotation of PER in the CoNLL corpus.
Figure 3. Annotation of MISC in the CoNLL corpus.
Figure 6. Annotation of DNA in the GENIA corpus.
Figure 9. Annotation of cell_type in the GENIA corpus.
Coverage achieved when the estimated coverage reached 99%
| | Coverage | # Sentences Annotated | Percentage in the Corpus |
| CoNLL: LOC | 99.1% | 7,600 | 54.1% |
| CoNLL: MISC | 96.9% | 5,400 | 38.5% |
| CoNLL: ORG | 99.7% | 8,900 | 63.4% |
| CoNLL: PER | 98.0% | 6,200 | 44.2% |
| GENIA: DNA | 99.8% | 11,900 | 64.2% |
| GENIA: RNA | 99.2% | 2,500 | 13.5% |
| GENIA: cell_line | 99.6% | 9,400 | 50.7% |
| GENIA: cell_type | 99.3% | 8,600 | 46.4% |
| Average | 99.0% | - | 46.9% |
Time elapsed when the estimated coverage reached 99%
| | Cumulative Time (seconds) | Last Interval (seconds) |
| CoNLL: LOC | 3,362 | 92 |
| CoNLL: MISC | 1,818 | 61 |
| CoNLL: ORG | 5,201 | 104 |
| CoNLL: PER | 2,300 | 75 |
| GENIA: DNA | 33,464 | 443 |
| GENIA: RNA | 822 | 56 |
| GENIA: cell_line | 15,870 | 284 |
| GENIA: cell_type | 13,487 | 295 |
Detailed results of the annotation process for GENIA:RNA
| Iteration | Coverage | Estimated Coverage | Relevant Sentences | Coverage of Suggested Annotation | Average Rank of Suggested Annotation |
| 1 | 0.4% | 17.4% | 85% | 86% | 2.64 |
| 2 | 11.8% | 15.9% | 90% | 82% | 2.12 |
| 3 | 27.9% | 21.1% | 58% | 83% | 2.54 |
| 4 | 35.6% | 37.3% | 87% | 94% | 1.48 |
| 5 | 45.4% | 49.4% | 89% | 96% | 1.50 |
| 6 | 55.6% | 56.3% | 79% | 96% | 1.65 |
| 7 | 64.2% | 63.7% | 74% | 98% | 1.60 |
| 8 | 72.6% | 72.1% | 55% | 95% | 2.00 |
| 9 | 78.7% | 80.9% | 56% | 98% | 1.78 |
| 10 | 84.8% | 84.1% | 36% | 99% | 1.54 |
| 11 | 88.6% | 88.1% | 18% | 99% | 1.48 |
| 12 | 90.7% | 92.5% | 21% | 98% | 1.31 |
| 13 | 93.2% | 92.9% | 12% | 98% | 1.21 |
| 14 | 94.6% | 94.3% | 12% | 100% | 1.24 |
| 15 | 96.0% | 96.4% | 12% | 99% | 1.27 |
| 16 | 97.5% | 97.2% | 4% | 99% | 1.03 |
| 17 | 97.9% | 97.8% | 5% | 99% | 1.11 |
| 18 | 98.6% | 96.6% | 2% | 100% | 1.15 |
| 19 | 98.8% | 98.2% | 3% | 99% | 1.02 |
| 20 | 99.2% | 98.4% | 0% | 100% | 1.00 |
| 21 | 99.2% | 98.6% | 0% | 100% | 1.00 |
| 22 | 99.2% | 98.8% | 0% | 100% | 1.00 |
| 23 | 99.2% | 98.9% | 0% | 100% | 1.00 |
| 24 | 99.2% | 99.0% | 0% | 100% | 1.00 |
| 25 | 99.2% | 99.1% | 0% | 100% | 1.00 |
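The "Estimated Coverage" column implies a stopping rule that can be evaluated without knowing the true number of entities. A minimal sketch of one plausible estimator is shown below. It assumes coverage is estimated as the number of entities found so far divided by that number plus the tagger's expected entity counts for the sentences not yet annotated. This formula, the 0.99 default, and the function names are assumptions for illustration; the record itself does not give the estimator.

```python
from typing import Sequence

def estimated_coverage(found: int, expected_remaining: Sequence[float]) -> float:
    """ASSUMED estimator: found / (found + expected entities still unannotated).

    found              -- entities annotated so far
    expected_remaining -- per-sentence expected entity counts from the tagger
                          for every sentence that has not been annotated yet
    """
    denom = found + sum(expected_remaining)
    if denom == 0:
        return 1.0  # nothing found and nothing expected: treat as fully covered
    return found / denom

def should_stop(found: int, expected_remaining: Sequence[float],
                target: float = 0.99) -> bool:
    """Stop annotating once the estimated coverage reaches the target."""
    return estimated_coverage(found, expected_remaining) >= target

if __name__ == "__main__":
    # Hypothetical numbers: 90 entities found, about 10 still expected.
    print(estimated_coverage(90, [5.0, 5.0]))
    print(should_stop(90, [5.0, 5.0]))
    # 199 found, about 1 still expected.
    print(should_stop(199, [0.5, 0.5]))
```

Under this reading, the gap between the "Coverage" and "Estimated Coverage" columns in early iterations would reflect how unreliable the tagger's expected counts are before it has seen many annotated examples.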
Figure 7. Annotation of RNA in the GENIA corpus.
Coverage achieved when the estimated coverage reached 99% (assuming the named entities of the other categories are already annotated in the corpus)
| | Coverage | # Sentences Annotated | Percentage in the Corpus |
| CoNLL: LOC | 98.5% | 5,500 | 39.2% |
| CoNLL: MISC | 95.0% | 3,200 | 22.8% |
| CoNLL: ORG | 99.0% | 5,400 | 38.5% |
| CoNLL: PER | 97.9% | 4,700 | 33.5% |
| GENIA: DNA | 99.6% | 8,200 | 44.2% |
| GENIA: RNA | 99.5% | 1,800 | 9.7% |
| GENIA: cell_line | 99.3% | 5,000 | 27.0% |
| GENIA: cell_type | 99.2% | 7,000 | 37.7% |
| Average | 98.5% | - | 31.6% |