Ryan McDonald, Fernando Pereira.
Abstract
BACKGROUND: We present a model for tagging gene and protein mentions in text using the probabilistic sequence tagging framework of conditional random fields (CRFs). Conditional random fields directly model the probability P(t|o) of a tag sequence t given an observation sequence o, and have previously been employed successfully for other tagging tasks. The mechanics of CRFs and their relationship to maximum entropy are discussed in detail.
Year: 2005 PMID: 15960840 PMCID: PMC1869020 DOI: 10.1186/1471-2105-6-S1-S6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Gene identification as a tagging problem. A sample tagging of a sentence using the beginning, inside and outside tag labels. The sentence has two gene mentions, Varicella-zoster virus (VZV) glycoprotein gI and type 1 transmembrane glycoprotein.
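The beginning/inside/outside encoding described in the caption can be sketched as follows. This is a minimal illustration, not the authors' code: the tag names `B`, `I` and `O`, the tokenization, and the filler words "is a" between the two mentions are all assumptions made for the example.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Encode non-overlapping mention spans as one B/I/O tag per token."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        tags[start] = "B"              # first token of a mention
        for i in range(start + 1, end):
            tags[i] = "I"              # remaining tokens of the mention
    return tags


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Decode a B/I/O tag sequence back into mention spans."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag != "I" and start is not None:
            spans.append((start, i))   # the open mention ends before token i
            start = None
        if tag == "B":
            start = i
        # An "I" with no open mention is ignored in this sketch.
    if start is not None:
        spans.append((start, len(tags)))
    return spans


# Hypothetical tokenization covering the two mentions named in the caption.
tokens = ["Varicella-zoster", "virus", "(", "VZV", ")", "glycoprotein", "gI",
          "is", "a", "type", "1", "transmembrane", "glycoprotein"]
mentions = [(0, 7), (9, 13)]
tags = spans_to_bio(len(tokens), mentions)
```

Under this encoding, recovering gene mentions from a predicted tag sequence is a purely mechanical step, which is what lets mention identification be treated as per-token tagging.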
Orthographic features.
| Orthographic Feature | Reg. Exp. |
| Init Caps | [A-Z].* |
| Init Caps Alpha | [A-Z][a-z]* |
| All Caps | [A-Z]+ |
| Caps Mix | [A-Za-z]+ |
| Has Digit | .*[0-9].* |
| Single Digit | [0-9] |
| Double Digit | [0-9][0-9] |
| Natural Number | [0-9]+ |
| Real Number | [-0-9]+[.,]+[0-9.,]+ |
| Alpha-Num | [A-Za-z0-9]+ |
| Roman | [ivxdlcm]+ or [IVXDLCM]+ |
| Has Dash | .*-.* |
| Init Dash | -.* |
| End Dash | .*- |
| Punctuation | [,.;:?!-+'"'] |
This defines the complete set of orthographic predicates used by the system. The observation list for each token includes a predicate for every regular expression that the token matches.
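As a rough sketch of how such predicates could be computed, the snippet below tests a token against a subset of the table's patterns. The dictionary and function names are hypothetical, and treating each expression as a match against the whole token (`re.fullmatch`) is an assumption; the source does not say how the expressions are anchored.

```python
import re
from typing import List

# Subset of the table's rows; predicate names are taken from the table.
ORTHOGRAPHIC_PATTERNS = {
    "Init Caps": r"[A-Z].*",
    "All Caps": r"[A-Z]+",
    "Caps Mix": r"[A-Za-z]+",
    "Has Digit": r".*[0-9].*",
    "Single Digit": r"[0-9]",
    "Double Digit": r"[0-9][0-9]",
    "Natural Number": r"[0-9]+",
    "Alpha-Num": r"[A-Za-z0-9]+",
    "Roman": r"[ivxdlcm]+|[IVXDLCM]+",
    "Has Dash": r".*-.*",
    "Init Dash": r"-.*",
    "End Dash": r".*-",
}
COMPILED = {name: re.compile(p) for name, p in ORTHOGRAPHIC_PATTERNS.items()}


def orthographic_predicates(token: str) -> List[str]:
    """Return the name of every pattern that matches the whole token."""
    return [name for name, rx in COMPILED.items() if rx.fullmatch(token)]


vzv = orthographic_predicates("VZV")
hyphenated = orthographic_predicates("Varicella-zoster")
digit = orthographic_predicates("1")
```

Here `vzv` is `["Init Caps", "All Caps", "Caps Mix", "Alpha-Num"]`, which shows that a single token typically fires several overlapping predicates.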
Effect of system components on development data
| System | Precision | Recall | F-Measure |
| A. No Lex, No Feat. Ind. | 0.793 | 0.731 | 0.761 |
| B. No Lexicons | 0.807 | 0.744 | 0.774 |
| C. Trigrams | 0.811 | 0.759 | 0.784 |
| D. Non-gene Lexicons | 0.818 | 0.743 | 0.778 |
| E. Gene Lexicons | 0.812 | 0.775 | 0.793 |
| F. All Lexicons | 0.817 | 0.782 | 0.799 |
A) System that contains no lexicon features and does not use feature induction. B) Same as A, except feature induction is used. C) Same as B, except features using the infrequent trigram lexicon are used. D) Same as B, except features using the non-gene lexicons are used. E) Same as B, except features using the gene lexicon are used. F) Same as B, except features using all lexicons are used.
Precision and recall numbers for the system on the unseen evaluation data
| System | Precision | Recall | F-Measure |
| No Lexicons | 0.830 | 0.773 | 0.801 |
| Lexicons | 0.864 | 0.787 | 0.824 |
Precision is measured as the fraction of predicted gene mentions that are correct, and recall as the fraction of actual gene mentions that were identified. Results are given for two systems. The No Lexicons row is the system that contains only features extracted from the training data. The Lexicons row is the system that also contains features extracted from external lexicons.
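The definitions above can be made concrete with a short sketch. Two points are assumptions rather than statements from the source: a predicted mention counts as correct only if its span exactly matches a gold mention, and F-Measure is the balanced harmonic mean of precision and recall. The second assumption is consistent with the Lexicons row, where 0.864 and 0.787 give roughly 0.824. The mention sets are toy data, not results from the paper.

```python
from typing import Set, Tuple

Mention = Tuple[int, int, int]  # (sentence id, start token, end token)


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(predicted: Set[Mention], gold: Set[Mention]) -> Tuple[float, float, float]:
    """Score predictions against gold mentions using exact span match."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# Toy example: 4 predictions, 3 gold mentions, 2 exact matches.
gold = {(0, 0, 7), (0, 9, 13), (1, 2, 4)}
predicted = {(0, 0, 7), (0, 10, 13), (1, 2, 4), (1, 6, 7)}
p, r, f = evaluate(predicted, gold)
```

In the toy example the prediction `(0, 10, 13)` overlaps a gold mention but misses its first token, so under exact matching it lowers precision and does not contribute to recall.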