Katrin Fundel, Daniel Güttler, Ralf Zimmer, Joannis Apostolakis.
Abstract
BACKGROUND: Significant parts of biological knowledge are available only as unstructured text in articles of biomedical journals. By automatically identifying gene and gene product (protein) names and mapping these to unique database identifiers, it becomes possible to extract and integrate information from articles and various data sources. We present a simple and efficient approach that identifies gene and protein names in texts and returns database identifiers for matches. It has been evaluated in the recent BioCreAtIvE entity extraction and mention normalization task by an independent jury.
Year: 2005 PMID: 15960827 PMCID: PMC1869007 DOI: 10.1186/1471-2105-6-S1-S15
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Results in BioCreAtIvE Task 1B: Our results compared with the results that achieved the highest overall F-measure. For mouse and fly, the highest F-measure was achieved by ProMiner. Our yeast result was obtained by exact matching of the curated list; no post filter was applied. Mouse (1) is the exact search with the curated list. Mouse (2) was additionally filtered with the rule-based post filter. Our fly results were obtained as post-evaluation by exact matching of the curated list and application of the SVM-based post filter.
| | Yeast | Yeast max. | Mouse (1) | Mouse (2) | Mouse max. | Fly (post-eval.) | Fly max. |
| F-measure | 0.897 | 0.921 | 0.764 | 0.773 | 0.790 | 0.768 | 0.815 |
| Precision | 0.917 | 0.950 | 0.735 | 0.764 | 0.766 | 0.802 | 0.831 |
| Recall | 0.878 | 0.894 | 0.796 | 0.781 | 0.814 | 0.737 | 0.800 |
| TP | 538 | 548 | 433 | 425 | 443 | 316 | 343 |
| FP | 49 | 29 | 156 | 131 | 135 | 78 | 70 |
| FN | 75 | 65 | 111 | 119 | 101 | 113 | 86 |
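The precision, recall, and F-measure rows can be recomputed from the TP, FP, and FN counts. The sketch below assumes the balanced F-measure (harmonic mean of precision and recall), which is consistent with the values in the table; the function name `prf` is hypothetical and is not taken from the paper.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F-measure) from raw match counts.

    Assumes the balanced F-measure, i.e. the harmonic mean of
    precision and recall.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# (TP, FP, FN) counts copied from the table above.
counts = {
    "Yeast": (538, 49, 75),
    "Mouse (1)": (433, 156, 111),
    "Mouse (2)": (425, 131, 119),
    "Fly (post-eval.)": (316, 78, 113),
}

for name, (tp, fp, fn) in counts.items():
    p, r, f = prf(tp, fp, fn)
    print(f"{name}: P={p:.3f} R={r:.3f} F={f:.3f}")
```

Rounded to three decimals, the printed values agree with the corresponding table columns, for example P=0.917, R=0.878, F=0.897 for yeast.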
Figure 1. Yeast results. BioCreAtIvE task 1B results for yeast and the impact of curation of the synonym list. The submitted result was obtained with the fully curated synonym list.
Figure 2. Mouse results. BioCreAtIvE task 1B results for mouse and the impact of curation on exact search and the ProMiner approach. ES: exact search; PM: ProMiner; cur.: fully curated synonym list; orig. syn. list: original synonym list as provided by the organizers. For the exact search, the submitted results were obtained with the fully curated synonym list, and with the fully curated synonym list followed by the rule-based post filter. The results of exact matching of the original synonym list and of the lists obtained from the two intermediate curation steps are also shown. For ProMiner, the results of the approximate search alone (PM search) and the results of the ProMiner framework (i.e. approximate search plus filtering and disambiguation) with the optimal parameter setting are shown. The submitted results (PM results) were obtained with the entire ProMiner framework, the same fully curated synonym list, and different sets of parameters [12]. The fully curated synonym lists used for exact search and the ProMiner approach were the same except for ambiguous synonyms.
Figure 3. Fly results. Results for fly, obtained as post-evaluation of the BioCreAtIvE assessment. The figure shows the results of exact matching of the synonym list as provided by the ProMiner team (Exact search, orig. syn. list), exact matching of the curated synonym list, and exact matching of the curated synonym list with subsequent application of the SVM-based post filter. All submitted ProMiner results were obtained with one synonym list, which we refer to here as the 'original synonym list', but with different parameter settings [12].
False positive matches: Types of errors and samples. Synonyms are marked in italics. The synonyms and matches are correct but the context reveals that they should not have been reported for BioCreAtIvE task 1B.
| Type of error | Examples |
| overlap with English Words |
Rule-based post filter: Samples of false positive matches, mostly short names and abbreviations of protein names which have different meanings, and the effect of the rule-based post filter on these matches.
| Synonym | Context | Other synonym for wrongly identified object | Removed by post filter |
| P21 | Chromosome 2p16-p21 | cyclin-dependent kinase inhibitor 1A (P21) | no |
| FACS | fluorescence-activated cell sorter (FACS) | fatty acid Coenzyme A ligase, long chain 2 | yes |
| PCR1 | E. coli plasmid pCR1 | mannosidase 1, alpha | no |
| CA1 | area CA1 of the hippocampus | carbonic anhydrase 1 | no |
| HEK | HEK cells | Eph receptor A3 | yes |
| NT2 | NTera 2(NT2) cell line | zinc finger protein 263 | yes |
| Eph | Eph family of receptors | Eph receptor A1 | no |
| PMN | polymorphonuclear (PMN) infiltration | progressive motor neuropathy | yes |
| all-trans | All-trans retinoic acid | retinol dehydrogenase 2 | no |
| S1P | sphingosine 1-phosphate receptor genes | site-1 protease | no |
| Den | diethylnitrosamine (DEN) | denuded | yes |
Samples of false negative matches: closest synonyms in synonym list, occurrence in text, and type of error.
| Synonym(s) | Occurrence in text | Type of error |
| Lpa1, Lpa2, Lpa3 | lpa(1-3) | enumeration |
| Pkcb, Pkce | PKC beta, PKC-epsilon | different spelling |
| retinoic acid receptor, alpha | retinoic acid receptor-alpha | different spelling |
| interferon gamma | gamma-interferon | inversion |
| Braf2, Braf-rs1 | Braf | ambiguity |
| peroxisome proliferator activated receptor gamma | peroxisome proliferating antigen receptor gamma | not evident |
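To illustrate the 'enumeration' error type in the table above, the following minimal sketch expands a numeric range such as "lpa(1-3)" into individual names that could then be looked up in a synonym list. It is not the authors' method: the regular expression `ENUM_RANGE`, the function `expand_enumerations`, and the example sentence are all hypothetical.

```python
import re

# Hypothetical pattern: an alphabetic name stem followed by a
# parenthesised numeric range, e.g. "lpa(1-3)".
ENUM_RANGE = re.compile(r"\b([A-Za-z]+)\((\d+)-(\d+)\)")


def expand_enumerations(text: str) -> list[str]:
    """Expand each 'stem(a-b)' occurrence into stem+a, ..., stem+b."""
    names: list[str] = []
    for stem, start, end in ENUM_RANGE.findall(text):
        first, last = int(start), int(end)
        if first <= last:  # ignore malformed ranges such as "(3-1)"
            names.extend(f"{stem}{i}" for i in range(first, last + 1))
    return names


# Made-up sentence containing the occurrence from the table.
print(expand_enumerations("expression of lpa(1-3) in mouse tissue"))
# ['lpa1', 'lpa2', 'lpa3']
```

The expanded names keep the casing found in the text, so matching them against synonyms such as Lpa1, Lpa2, and Lpa3 would additionally require a case-insensitive comparison.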