Chun-Nan Hsu, Yu-Ming Chang, Cheng-Ju Kuo, Yu-Shi Lin, Han-Shen Huang, I-Fang Chung.
Abstract
MOTIVATION: Tagging gene and gene product mentions in scientific text is an important initial step of literature mining. In this article, we describe in detail our gene mention tagger, which participated in the BioCreative 2 challenge, and analyze what contributes to its good performance. Our tagger is based on the conditional random fields (CRF) model, the most prevalent method for the gene mention tagging task in BioCreative 2. Our tagger is notable because it achieved the highest F-scores among CRF-based methods and the second highest overall. Moreover, we obtained our results mostly by applying open source packages, making it easy to duplicate our results.
Year: 2008 PMID: 18586726 PMCID: PMC2718659 DOI: 10.1093/bioinformatics/btn183
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. An example of gene mention tagging and forward and backward parsing.
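In backward parsing, the tagger processes each sentence right-to-left, so the last token of a mention is labeled first. A minimal sketch of converting a forward BIO tag sequence into its backward counterpart (the `to_backward` helper is illustrative, not from the paper's code):

```python
def to_backward(tags):
    """Mirror a forward BIO tag sequence for a right-to-left tagger:
    the last token of each mention becomes the 'B' token of the
    reversed sequence. Applying the function twice is the identity."""
    # collect (start, end) spans of mentions in the forward sequence
    spans, i = [], 0
    while i < len(tags):
        if tags[i] == "B":
            j = i + 1
            while j < len(tags) and tags[j] == "I":
                j += 1
            spans.append((i, j))
            i = j
        else:
            i += 1
    n = len(tags)
    rev = ["O"] * n
    for start, end in spans:
        rev[n - end] = "B"          # old mention tail leads in reversed order
        for k in range(n - end + 1, n - start):
            rev[k] = "I"
    return rev
```

The same model and feature set can then be trained on the mirrored sequences, and the output mapped back with the same function.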
Categories of predicates on observed tokens
| Predicate | Example | Predicate | Example | Predicate | Example |
|---|---|---|---|---|---|
| Word | | Hyphen | | Nucleoside | |
| StemmedWord | | BackSlash | | Nucleotide | |
| PartOfSpeech | | OpenSqure | | Roman | |
| InitCap | | CloseSqure | | MorphologyTypeI | |
| EndCap | | Colon | | MorphologyTypeII | |
| AllCaps | | SemiColon | | MorphologyTypeIII | |
| LowerCase | | Percent | | WordLength | |
| MixCase | | OpenParen | | N-grams(2-4) | |
| SingleCap | | CloseParen | | ATCGUsequece | |
| TwoCap | | Comma | | Greek | |
| ThreeCap | | FullStop | | NucleicAcid | |
| MoreCap | | Apostrophe | | AminoAcidLong | |
| SingleDigit | | QuotationMark | ‘,’ | AminoAcidShort | |
| TwoDigit | | Star | | AminoAcid+Position | |
| FourDigit | | Equal | | | |
| MoreDigit | | Plus | | | |
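Many of the predicates above are simple surface tests on a token. A hedged sketch of a few of them in Python (the regular expressions and the partial Greek-letter list are my guesses at the intended definitions, not the authors' implementation):

```python
import re

def surface_predicates(token):
    """Boolean surface predicates in the spirit of the table above.
    The exact definitions in the paper's feature extractor may differ."""
    return {
        "InitCap": bool(re.fullmatch(r"[A-Z][a-z].*", token)),
        "AllCaps": bool(re.fullmatch(r"[A-Z]+", token)),
        "MixCase": bool(re.search(r"[a-z]", token))
                   and bool(re.search(r"[A-Z]", token)),
        "SingleDigit": bool(re.fullmatch(r"\d", token)),
        "TwoDigit": bool(re.fullmatch(r"\d{2}", token)),
        # DNA/RNA-like letter runs (threshold of 3 is an assumption)
        "ATCGUsequece": bool(re.fullmatch(r"[ATCGU]+", token)) and len(token) >= 3,
        # spelled-out Greek letters (partial list for illustration)
        "Greek": token.lower() in {"alpha", "beta", "gamma", "delta", "kappa"},
        "Hyphen": token == "-",
    }
```

Each predicate that fires for a token becomes a binary feature in the CRF.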
Inside test results
| Method | Precision | Recall | F-score |
|---|---|---|---|
| Forward | 86.60 | 80.77 | 83.59 |
| Backward | 87.33 | 81.18 | 84.14 |
| Union | 83.49 | 85.78 | 84.62 |
| Intersection | 90.76 | 71.86 | 80.21 |
| Top 10+HUGO | 87.73 | 82.63 | 85.10 |
These models were trained on 10 000 example sentences selected at random and tested on the remaining 5000 examples.
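The Union and Intersection rows merge the span sets proposed by the two models before scoring. A minimal sketch of that combination and of the exact-match precision/recall/F computation, on toy spans rather than BioCreative data:

```python
def prf(predicted, gold):
    """Exact-match precision, recall, F-score over sets of (start, end) spans."""
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

fw = {(0, 2), (5, 6)}    # spans from the forward model (toy example)
bw = {(0, 2), (8, 9)}    # spans from the backward model
gold = {(0, 2), (8, 9)}

# Union trades precision for recall; intersection does the opposite.
union, inter = fw | bw, fw & bw
```

This mirrors the pattern in the table: Union has the highest recall, Intersection the highest precision.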
Fig. 2. An example illustrating the final model-integration step. fw_score and bw_score indicate the output scores of Mallet obtained by forward and backward parsing, respectively, and the highlighted row indicates the tag sequence pair selected by the integrated model.
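Fig. 2 suggests that the integration step scores candidate tag sequences from both parsers and selects the best pair. A minimal sketch under that reading (the agreement-based selection rule and the `integrate` function are illustrative assumptions, not the paper's exact algorithm):

```python
def integrate(fw_candidates, bw_candidates):
    """Pick the (forward, backward) candidate pair with the highest combined
    score, preferring pairs whose tag sequences agree.
    Each candidate is (tag_sequence_tuple, score)."""
    best, best_score = None, float("-inf")
    for fw_tags, fw_score in fw_candidates:
        for bw_tags, bw_score in bw_candidates:
            if fw_tags != bw_tags:
                continue  # only agreeing pairs are considered in this sketch
            if fw_score + bw_score > best_score:
                best, best_score = fw_tags, fw_score + bw_score
    if best is None:
        # no pair agrees: fall back to the single highest-scoring candidate
        pool = list(fw_candidates) + list(bw_candidates)
        best = max(pool, key=lambda c: c[1])[0]
    return best
```

With n-best lists from Mallet, the combined score fw_score + bw_score plays the role shown in the highlighted row of the figure.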
Final BioCreative 2 result
| Method | Precision | Recall | F-score | Alt^a |
|---|---|---|---|---|
| Backward | 89.30 | 83.83 | 86.48 | − |
| Union | 86.10 | 87.08 | 86.58 | − |
| Top 10+HUGO | 89.30 | 84.49 | 86.83 | 14.02 |
| Rank 1 | 88.48 | 85.97 | 87.21 | 32.48 |
^a The percentage of true positives (TP) that matched alternative annotations.
Comparing forward and backward parsing models
| Method | Forward Precision | Forward Recall | Forward F-score | Backward Precision | Backward Recall | Backward F-score | TP diff. | FP diff. | F-score diff.^a |
|---|---|---|---|---|---|---|---|---|---|
| Mallet | 88.88 | 83.57 | 86.14 | 89.34 | 84.05 | 86.61 | 1028 | 477 | +0.47 |
| CRF++:HMM-style | 90.15 | 84.28 | 87.12 | 90.22 | 84.13 | 87.07 | 100 | 46 | −0.05 |
| CRF++:Mallet-style | 89.48 | 82.45 | 85.81 | 89.89 | 83.54 | 86.60 | 646 | 401 | +0.79 |
^a Columns under ‘Difference’ show the numbers of differing TP and FP in the tagging results and the difference in F-scores between the backward and forward parsing models. These models were trained on 15 000 example sentences and tested on 5000 examples with the features described in Section 2.1. The split of training/test sets is provided by BioCreative 2. The total number of gold standard true gene entities in the test set is 6331.
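Each F-score above is the harmonic mean of the corresponding precision and recall, so the table can be checked directly; e.g. the Mallet rows:

```python
def f_score(p, r):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * p * r / (p + r)

# Mallet rows from the table above
fw_f = f_score(88.88, 83.57)   # forward:  86.14
bw_f = f_score(89.34, 84.05)   # backward: 86.61, giving the +0.47 difference
```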
Confusion matrices of tag bigrams for forward and backward parsing models
| Forward | BB | BI | BO | IB | II | IO | OB | OI | OO | r |
|---|---|---|---|---|---|---|---|---|---|---|
| BB | 3 | 19 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 12.00 |
| BI | 1 | 2701 | 58 | 0 | 168 | 2 | 112 | 0 | 458 | 77.17 |
| BO | 0 | 178 | 2057 | 0 | 75 | 105 | 0 | 0 | 379 | 73.62 |
| IB | 0 | 1 | 0 | 6 | 40 | 3 | 3 | 0 | 2 | 10.91 |
| II | 0 | 139 | 24 | 2 | 3845 | 115 | 136 | 0 | 994 | 73.17 |
| IO | 0 | 3 | 84 | 0 | 160 | 2682 | 1 | 0 | 514 | 77.87 |
| OB | 0 | 75 | 1 | 3 | 215 | 1 | 5004 | 0 | 941 | 80.19 |
| OI | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | - |
| OO | 0 | 278 | 161 | 0 | 651 | 475 | 514 | 0 | 119999 | 98.30 |
| p | 75.00 | 79.58 | 86.21 | 54.55 | 74.57 | 79.28 | 86.72 | - | 97.33 | |
| F | 20.69 | 78.36 | 79.42 | 18.18 | 73.86 | 78.57 | 83.33 | 0 | 97.81 | |
| Backward | BB | BI | BO | IB | II | IO | OB | OI | OO | r |
|---|---|---|---|---|---|---|---|---|---|---|
| BB | 8 | 14 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 32.00 |
| BI | 0 | 2713 | 68 | 0 | 165 | 1 | 91 | 0 | 462 | 77.51 |
| BO | 2 | 150 | 2110 | 0 | 72 | 85 | 0 | 0 | 376 | 75.49 |
| IB | 0 | 1 | 0 | 8 | 38 | 4 | 3 | 0 | 1 | 14.55 |
| II | 0 | 118 | 23 | 6 | 3886 | 100 | 113 | 0 | 1009 | 73.95 |
| IO | 0 | 0 | 68 | 1 | 145 | 2694 | 0 | 0 | 537 | 78.20 |
| OB | 0 | 76 | 0 | 2 | 194 | 0 | 4841 | 0 | 878 | 80.80 |
| OI | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | - |
| OO | 0 | 294 | 170 | 0 | 647 | 465 | 514 | 0 | 120235 | 98.29 |
| p | 80.00 | 80.60 | 86.48 | 47.06 | 75.49 | 80.44 | 87.02 | - | 97.36 | |
| F | 45.71 | 79.03 | 80.61 | 22.22 | 74.71 | 79.31 | 83.80 | 0 | 97.82 |
Rows give true tag bigrams and columns give predicted tag bigrams. All tag bigrams are given in their original (forward) order. The special tags ‘O’ attached to heads and tails of sentences in forward and backward parsing, respectively, were counted. Legend: r=recall, p=precision, F=F-score.
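A confusion matrix like these can be built by sliding over aligned true and predicted tag sequences and counting bigram pairs. A minimal sketch (here padding both sentence boundaries with ‘O’ for simplicity, whereas the text pads heads for forward and tails for backward parsing):

```python
from collections import Counter

def bigram_confusion(true_tags, pred_tags):
    """Count (true_bigram, predicted_bigram) pairs over aligned sequences.
    Both boundaries are padded with 'O' in this simplified sketch."""
    matrix = Counter()
    t = ["O"] + list(true_tags) + ["O"]
    p = ["O"] + list(pred_tags) + ["O"]
    for i in range(len(t) - 1):
        matrix[(t[i] + t[i + 1], p[i] + p[i + 1])] += 1
    return matrix

# toy sentence: true mention of two tokens, tagger cuts it short
m = bigram_confusion(["B", "I", "O"], ["B", "O", "O"])
```

Per-bigram recall, precision and F then follow from the row and column sums, as in the r, p and F entries of the matrices.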
Results of bi-directional parsing models
| Method | Precision | Recall | F-score |
|---|---|---|---|
| Order 0 | |||
| Forward | 79.98 | 73.18 | 76.43 |
| Backward | 81.01 | 75.00 | 77.89 |
| Order 1 | |||
| Forward | 88.79 | 82.99 | 85.79 |
| Backward | 88.76 | 83.72 | 86.16 |
| Order 2 | |||
| Forward | 87.00 | 82.58 | 84.73 |
| Backward | 88.67 | 83.65 | 86.09 |
| Order 3 | |||
| Forward | 85.18 | 80.08 | 82.55 |
| Backward | 87.33 | 81.68 | 84.41 |
| CRF++default | 90.15 | 84.28 | 87.12 |
| ctjpgis | 90.60 | 82.96 | 86.61 |
| Intersection | 92.56 | 77.60 | 84.42 |
| 0+1+2+3 | 90.06 | 84.68 | 87.28 |
| 1+2+3 | 90.63 | 84.82 | 87.63 |
| 1+2+3+lc | 88.95 | | 87.65 |
‘0+1+2+3’ is the integration of the bi-directional models with Order 0–3. ‘1+2+3’ is the integration of models with Order 1–3. 1+2+3+lc is the union of ‘1+2+3’ and the intersection of two CRF++ models. These models were trained and tested with the corpora from BioCreative 2. The feature sets used here were the same for all models but they were slightly different from the set that we used in BioCreative 2 because we removed some minor bugs and redundancies from the original feature extraction program.