Jeremiah Crim, Ryan McDonald, Fernando Pereira.
Abstract
BACKGROUND: Document gene normalization is the problem of creating a list of unique identifiers for genes that are mentioned within a document. Automating this process has many potential applications in both information extraction and database curation systems. Here we present two separate solutions to this problem. The first is primarily based on standard pattern matching and information extraction techniques. The second and more novel solution uses a statistical classifier to recognize valid gene matches from a list of known gene synonyms.
Year: 2005 PMID: 15960825 PMCID: PMC1771968 DOI: 10.1186/1471-2105-6-S1-S13
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Example of partial entries from the fly synonym list.
| Normalized Form | Possible Synonyms |
| FBgn0003943 | CG11624 Ub, Ubi p, Ubi63E, polyubiquitin |
| FBgn0003944 | CG10388 Cbx, DmUbx, Hm, Ubx, abx, bithorax |
| FBgn0003945 | Udg, Uracil DNA glycosylase |
| FBgn0004837 | Suppressor of Hairless, br7, C: Group C, RBP JKappa, lethal 7 in the black-reduced region |
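A synonym list of this shape is typically inverted before matching, so that a string found in text can be looked up directly. The following is a minimal sketch, assuming the synonym field is comma-delimited; the function name `build_synonym_index` is hypothetical, and only a subset of the entries above is used.

```python
from collections import defaultdict

# A subset of the partial entries shown above. Splitting the synonym field
# on commas is an assumption about how the list is delimited.
SYNONYM_LIST = {
    "FBgn0003944": "Ubx, abx, bithorax",
    "FBgn0003945": "Udg, Uracil DNA glycosylase",
}


def build_synonym_index(entries):
    """Map each lower-cased synonym to the set of identifiers that list it."""
    index = defaultdict(set)
    for gene_id, field in entries.items():
        for synonym in field.split(","):
            synonym = synonym.strip()
            if synonym:
                index[synonym.lower()].add(gene_id)
    return index


index = build_synonym_index(SYNONYM_LIST)
print(sorted(index["ubx"]))
```

Mapping to a set rather than a single identifier keeps ambiguous synonyms visible: a string shared by several genes returns all of them, and a later step can decide which, if any, is valid.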
Performance of the pattern matching system on development data. Precision and recall numbers for the pattern matching system using: A) simple direct matching of synonyms to text; B) direct matching restricted to informative synonyms; C) same as B, with the added restriction that a synonym must be in a document's candidate list for a match to be valid; D) same as C, with all tokens stemmed before matching. Numbers are reported for both fly and mouse.
| System | Precision | Recall | F-measure |
| A. basic matching | 0.033 | 0.861 | 0.063 |
| B. informative syns | 0.458 | 0.727 | 0.562 |
| C. candidate list | 0.709 | 0.667 | 0.687 |
| D. stemming | 0.713 | 0.690 | 0.701 |
| A. basic matching | 0.151 | 0.583 | 0.240 |
| B. informative syns | 0.478 | 0.548 | 0.511 |
| C. candidate list | 0.739 | 0.505 | 0.600 |
| D. stemming | 0.716 | 0.656 | 0.685 |
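The first three configurations can be sketched as successive filters on a direct lexicon lookup. This is an illustration under stated assumptions, not the system evaluated above: `match_genes`, the `COMMON_WORDS` stop list used as a stand-in for "uninformative" synonyms, and the `FBgn_hypothetical` entry are all made up for the example, and the candidate list is treated as an externally supplied set of synonym strings because its construction is not described in this record.

```python
import re

# Toy lexicon: synonym -> identifier. Real entries come from the fly synonym
# table; "we" and its identifier are invented to show an ambiguous synonym.
LEXICON = {
    "ubx": "FBgn0003944",
    "bithorax": "FBgn0003944",
    "uracil dna glycosylase": "FBgn0003945",
    "we": "FBgn_hypothetical",
}

# Hypothetical definition of "uninformative": common English words.
COMMON_WORDS = {"we", "a", "the", "to", "in"}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_genes(text, lexicon, informative_only=False, candidates=None, max_len=4):
    """Return identifiers whose synonyms occur in the text.

    informative_only: skip synonyms in COMMON_WORDS (configuration B).
    candidates: if given, keep only synonyms in this set (configuration C).
    """
    tokens = tokenize(text)
    found = set()
    for start in range(len(tokens)):
        last = min(start + max_len, len(tokens))
        for end in range(start + 1, last + 1):
            phrase = " ".join(tokens[start:end])
            if phrase not in lexicon:
                continue
            if informative_only and phrase in COMMON_WORDS:
                continue
            if candidates is not None and phrase not in candidates:
                continue
            found.add(lexicon[phrase])
    return found


text = "We show that Ubx and uracil DNA glycosylase are expressed."
print(match_genes(text, LEXICON))                         # configuration A
print(match_genes(text, LEXICON, informative_only=True))  # configuration B
```

In this toy case the unfiltered lookup also returns the identifier attached to "we", which is the kind of spurious match that depresses precision under basic matching. Stemming (configuration D) would be applied inside `tokenize` and to the lexicon keys, and is omitted here.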
Pattern matching performance on evaluation data.
| Organism | Precision | Recall | F-measure |
| fly | 0.638 | 0.695 | 0.665 |
| mouse | 0.830 | 0.673 | 0.743 |
| yeast | 0.950 | 0.894 | 0.921 |
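The three figures in each row are consistent with precision, recall, and their harmonic mean (the balanced F-measure). The check below is a small sketch of that arithmetic using the rows above; the function name `f_measure` is chosen for illustration.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported third column) for each organism.
ROWS = {
    "fly": (0.638, 0.695, 0.665),
    "mouse": (0.830, 0.673, 0.743),
    "yeast": (0.950, 0.894, 0.921),
}

for organism, (p, r, reported) in ROWS.items():
    print(organism, round(f_measure(p, r), 3), reported)
```

Because the published precision and recall are themselves rounded, a recomputed value can differ from the reported one in the last decimal place.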
Maximum entropy classification performance on evaluation data.
| Organism | Precision | Recall | F-measure |
| fly | 0.704 | 0.783 | 0.742 |
| mouse | 0.787 | 0.732 | 0.758 |
| yeast | 0.956 | 0.881 | 0.917 |
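With two classes (valid match, invalid match), a maximum entropy classifier has the same form as logistic regression over binary features. The sketch below illustrates that form only: the feature names, the toy training examples, and the plain gradient training loop are all assumptions made for the example, since this record does not describe the features or the training procedure actually used.

```python
import math


def predict(weights, features):
    """Probability that a candidate match is valid, given its active features."""
    score = sum(weights.get(name, 0.0) for name in features)
    return 1.0 / (1.0 + math.exp(-score))


def train(examples, epochs=200, rate=0.5):
    """Fit feature weights by stochastic gradient ascent on the log-likelihood."""
    weights = {}
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, features)
            for name in features:
                weights[name] = weights.get(name, 0.0) + rate * error
    return weights


# Hypothetical training data: each example is the set of active binary
# features for one synonym match, with label 1 for a valid gene mention.
EXAMPLES = [
    ({"bias", "long_synonym", "context:gene"}, 1),
    ({"bias", "long_synonym"}, 1),
    ({"bias", "common_word"}, 0),
    ({"bias", "common_word", "context:gene"}, 0),
]

weights = train(EXAMPLES)
print(predict(weights, {"bias", "long_synonym"}) > 0.5)
print(predict(weights, {"bias", "common_word"}) > 0.5)
```

A practical system would add regularization and a much richer feature set; the point here is only that each candidate match is scored from its features and accepted or rejected by thresholding the resulting probability.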