Jenny Finkel, Shipra Dingare, Christopher D Manning, Malvina Nissim, Beatrice Alex, Claire Grover.
Abstract
BACKGROUND: Good automatic information extraction tools offer hope for automatic processing of the exploding biomedical literature, and successful named entity recognition is a key component for such tools.
Year: 2005 PMID: 15960839 PMCID: PMC1869019 DOI: 10.1186/1471-2105-6-S1-S5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Features Used. Description of the Full Feature Set Used in the Closed Section Submission.
| Word Features | w |
| | w |
| | w |
| | Last "real" word |
| | Next "real" word |
| | Disjunction of 4 previous words |
| | Disjunction of 4 next words |
| Bigrams | w |
| | w |
| TnT POS | POS |
| | POS |
| | POS |
| Character Substrings | Up to a length of 6 |
| Abbreviations | abbr |
| | abbr |
| | abbr |
| | abbr |
| Word Shape | shape |
| | shape |
| | shape |
| | shape |
| | shape |
| | shape |
| Previous NE | NE |
| | NE |
| Previous NE + Word | NE |
| Previous NE + POS | NE |
| | NE |
| Previous NE + Shape | NE |
| | NE |
| | NE |
| | NE |
| Paren-Matching | A feature that signals when one parenthesis in a pair has been assigned a different tag than the other within a window of 4 words |
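A few of the feature templates in the table can be sketched in Python. This is an illustrative sketch only, not the authors' implementation: the function names, the exact word-shape mapping, the `O`/`GENE` tag labels, and the reading of the paren-matching feature as a look-back from a closing parenthesis are all assumptions made for the example.

```python
import re


def word_shape(token: str) -> str:
    """Map a token to a coarse shape string (assumed mapping, not the paper's)."""
    shape = re.sub(r"[A-Z]", "X", token)   # uppercase letters -> X
    shape = re.sub(r"[a-z]", "x", shape)   # lowercase letters -> x
    shape = re.sub(r"[0-9]", "d", shape)   # digits -> d
    # Collapse runs of the same symbol so long tokens share a shape.
    return re.sub(r"(.)\1+", r"\1", shape)


def char_substrings(token: str, max_len: int = 6) -> set:
    """All character substrings of the token up to max_len characters."""
    return {
        token[i:i + n]
        for n in range(1, max_len + 1)
        for i in range(len(token) - n + 1)
    }


def disjunction_features(tokens: list, i: int, window: int = 4) -> set:
    """Words seen anywhere in the previous/next `window` positions, ignoring order."""
    prev_words = set(tokens[max(0, i - window):i])
    next_words = set(tokens[i + 1:i + 1 + window])
    return ({f"prev_any={w}" for w in prev_words}
            | {f"next_any={w}" for w in next_words})


def paren_mismatch(tokens: list, tags: list, i: int, window: int = 4) -> bool:
    """True if tokens[i] is ')' and the nearest '(' within `window` words
    carries a different tag (hypothetical reading of the feature)."""
    if tokens[i] != ")":
        return False
    for j in range(i - 1, max(-1, i - window - 1), -1):
        if tokens[j] == "(":
            return tags[j] != tags[i]
    return False


tokens = ["zinc-finger", "protein", "(", "THZif-1", ")"]
print(word_shape("THZif-1"))
print(sorted(disjunction_features(tokens, 3)))
print(paren_mismatch(tokens, ["O", "O", "GENE", "GENE", "O"], 4))
```

Collapsing repeated shape symbols is one design choice among several; keeping the full-length shape would separate tokens such as `AP-1` and `USH1C` more finely at the cost of sparser features.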
Development set results. System Results on Cross-Validated Training/Dev Data.
| | Precision | Recall | F-Score |
| Open | 0.813 | 0.861 | 0.836 |
| Closed | 0.784 | 0.852 | 0.817 |
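The F-Score column is consistent with the balanced F-measure, the harmonic mean of precision and recall. The sketch below assumes that definition (the page itself does not state the formula) and recomputes the development-set scores from the rounded precision and recall values in the table; `f_score` is a name chosen for this example.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the development-set table.
dev_results = {"Open": (0.813, 0.861), "Closed": (0.784, 0.852)}

for system, (p, r) in dev_results.items():
    print(f"{system}: F = {f_score(p, r):.3f}")
# Open: F = 0.836
# Closed: F = 0.817
```

Because the inputs are already rounded to three decimals, a recomputed score can differ from a published one in the last digit.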
Test set results. System Results on Evaluation Data.
| | Precision | Recall | F-Score |
| Open | 0.828 | 0.835 | 0.832 |
| Closed | 0.792 | 0.854 | 0.822 |
Lesion study results. Results on Cross-Validated Training and Development Data With One Feature Removed at a Time.
| | Precision | Recall | F-Score | Δ F |
| Abbreviations | 0.813 | 0.860 | 0.836 | -0.05% |
| Abgene | 0.810 | 0.861 | 0.834 | -0.18% |
| Abstract | 0.811 | 0.855 | 0.832 | -0.39% |
| Gazette | 0.807 | 0.857 | 0.831 | -0.51% |
| Genia | 0.806 | 0.857 | 0.831 | -0.55% |
| Substrings | 0.814 | 0.852 | 0.833 | -0.37% |
| POS | 0.814 | 0.860 | 0.836 | -0.03% |
| Google Web | 0.807 | 0.864 | 0.835 | -0.17% |
| Word Shape | 0.815 | 0.862 | 0.838 | +0.13% |
| Zero Order | 0.741 | 0.799 | 0.770 | -6.66% |
| First Order | 0.818 | 0.853 | 0.835 | -0.15% |
| Second Order | 0.814 | 0.861 | 0.837 | +0.06% |
| Third Order | 0.814 | 0.863 | 0.837 | +0.07% |
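One way to read the lesion table is to rank its rows by Δ F. The sketch below copies the Δ F column from the table and sorts it; it assumes that a negative Δ F means the lesioned system scored below the full system, and the variable names are illustrative.

```python
# Delta F values copied from the lesion study table.
delta_f = {
    "Abbreviations": -0.05,
    "Abgene": -0.18,
    "Abstract": -0.39,
    "Gazette": -0.51,
    "Genia": -0.55,
    "Substrings": -0.37,
    "POS": -0.03,
    "Google Web": -0.17,
    "Word Shape": 0.13,
    "Zero Order": -6.66,
    "First Order": -0.15,
    "Second Order": 0.06,
    "Third Order": 0.07,
}

# Most negative first: the rows whose lesion cost the most F-score.
ranked = sorted(delta_f.items(), key=lambda item: item[1])

hurt = [name for name, delta in ranked if delta < 0]
helped = [name for name, delta in ranked if delta > 0]

print("Largest losses:", ranked[:3])
print("Rows with a small gain:", helped)
```

Apart from the Zero Order row, every change is well under one point, so the ordering among the remaining rows should be read with caution.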
Figure 1. Learning curve. Learning curve for the performance of the "open" NER system on development data.
Examples of Errors. Examples of false positives (FPs), false negatives (FNs) and boundary errors. In some of the examples, square brackets are used to indicate the differences between the classifier's output (CL) and the annotation in the gold standard (GS).
| Error type | Classifier output (CL) | Gold standard (GS) |
| General Words | homolog gene | - |
| Measures | kat/L | - |
| Possible Errors in GS | [ssDNA-] and [RNA-binding protein] | ssDNA- and [RNA-binding protein] |
| Coordination | [YAP2 uORF1] and uORF2 | [YAP2 uORF1] and [uORF2] |
| Missing Expansion | zinc-finger protein ([THZif-1]) | [zinc-finger protein] ([THZif-1]) |
| GS NE contains CL NE(s) | AP-1 complexes | high mobility AP-1 complexes |
| | USH1C | USH1C disease gene |
| | partner of [Rac] | [partner of Rac] |
| CL NE contains GS NE(s) | regulator virF | virF |
| | Wnt pathway | Wnt |
| CL and GS Overlap | Serum [Fibrin Degradation Products] | [Serum Fibrin] Degradation Products |
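The boundary-error categories in the table can be expressed as relations between a classifier span and a gold-standard span. The following is a minimal sketch under stated assumptions: spans are (start, end) token offsets with an exclusive end, whitespace tokenization is used for the examples, and `span_relation` is a hypothetical helper rather than part of the evaluation software.

```python
def span_relation(cl: tuple, gs: tuple) -> str:
    """Classify how a classifier span (CL) relates to a gold span (GS).

    Spans are (start, end) token offsets with `end` exclusive.
    """
    cl_start, cl_end = cl
    gs_start, gs_end = gs
    if cl == gs:
        return "exact match"
    if cl_end <= gs_start or gs_end <= cl_start:
        return "disjoint"
    if gs_start <= cl_start and cl_end <= gs_end:
        return "GS contains CL"
    if cl_start <= gs_start and gs_end <= cl_end:
        return "CL contains GS"
    return "overlap"


# "high mobility AP-1 complexes": CL = "AP-1 complexes", GS = whole phrase
print(span_relation((2, 4), (0, 4)))  # GS contains CL
# "regulator virF": CL = whole phrase, GS = "virF"
print(span_relation((0, 2), (1, 2)))  # CL contains GS
# "Serum Fibrin Degradation Products": CL = last three tokens, GS = first two
print(span_relation((1, 4), (0, 2)))  # overlap
```

Under exact-match scoring, each of the three non-exact relations would count against the classifier as both a false positive and a false negative, which is why boundary errors weigh heavily on the reported scores.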