Xinglong Wang, Michael Matthews.
Abstract
BACKGROUND: Term identification is the task of grounding ambiguous mentions of biomedical named entities in text to unique database identifiers. Previous work on term identification has focused on studying species-specific documents. However, full-length articles often describe entities across a number of species, in which case resolving the ambiguity of model organisms in entities is critical to achieving accurate term identification.
Year: 2008 PMID: 19025692 PMCID: PMC2586755 DOI: 10.1186/1471-2105-9-S11-S6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Ambiguity in protein names. Ambiguity in protein entities, with and without species information, in EPPI and TE datasets.
| Protein Cnt | ID Cnt | Ambiguity | |
| 6,955 | 184,633 | 26.55 | |
| 6,955 | 17,357 | 2.50 | |
| 8,539 | 103,016 | 12.06 | |
| 8,539 | 12,705 | 1.49 | |
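The Ambiguity column is numerically consistent with the ratio of ID Cnt to Protein Cnt, that is, the average number of candidate identifiers per protein entity. The sketch below recomputes it from the counts in the table. The ratio interpretation is inferred from the numbers rather than stated in the caption, and the function name `ambiguity` is illustrative only.

```python
def ambiguity(id_count: int, protein_count: int) -> float:
    """Average number of candidate database IDs per protein entity.

    Assumption: Ambiguity = ID Cnt / Protein Cnt (inferred from the table).
    """
    return id_count / protein_count


# (Protein Cnt, ID Cnt) pairs copied from the table rows, in order.
rows = [(6955, 184633), (6955, 17357), (8539, 103016), (8539, 12705)]

ratios = [round(ambiguity(ids, proteins), 2) for proteins, ids in rows]
print(ratios)  # [26.55, 2.5, 12.06, 1.49]
```

Under this reading, adding species information reduces the candidate set by roughly an order of magnitude in both pairs of rows.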
Results (%) of the rule-based species tagger
| | P | R | F1 | P | R | F1 |
| PreWd | 81.88 | 1.87 | 3.65 | 91.49 | 1.63 | 3.21 |
| PreWd + Spread | 63.85 | 14.17 | 23.19 | 77.84 | 17.97 | 29.20 |
| PreWd Sent | 60.79 | 5.16 | 9.52 | 56.16 | 7.76 | 13.64 |
| PreWd Sent + Spread | 39.74 | 50.54 | 44.49 | 31.71 | 46.68 | 37.76 |
| Prefix | 98.98 | 3.07 | 5.96 | 77.93 | 2.97 | 5.72 |
| PreWd + Prefix | 91.95 | 4.95 | 9.40 | 82.27 | 4.62 | 8.75 |
| PreWd + Prefix + Spread | 68.46 | 17.49 | 27.87 | 77.77 | 21.26 | 33.39 |
| Majority Vote | 44.10 | 44.10 | 44.10 | 49.87 | 49.87 | 49.87 |
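The F1 columns above follow the usual harmonic mean of precision (P) and recall (R). The following is a minimal sketch, assuming the standard balanced F-measure; the helper name `f1` is illustrative. Because the table reports rounded P and R, a recomputed F1 can differ from the printed value in the last digit for some rows.

```python
def f1(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows taken from the first P/R/F1 column group of the table.
print(round(f1(63.85, 14.17), 2))  # PreWd + Spread       -> 23.19
print(round(f1(39.74, 50.54), 2))  # PreWd Sent + Spread  -> 44.49
print(round(f1(44.10, 44.10), 2))  # Majority Vote        -> 44.1
```

When P and R are equal, as in the Majority Vote row, F1 equals both. One plausible reading is that this tagger assigns a species to every entity, so the numbers of predictions and gold labels coincide.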
Results (%) of the machine-learning and hybrid species taggers. Accuracy (%) of the machine-learning based species tagger and the hybrid species tagger as tested on the EPPI and TE devtest datasets. An 'Overall' score is the micro-average of a system's accuracy on both datasets.
| | BL | | Combined Model | | Combined Model+Rules | |
| | 60.56 | 73.03 | 58.67 | 72.28 | 59.67 | 73.77 |
| | 30.22 | 67.15 | 69.82 | 67.20 | 67.53 | 67.47 |
| Overall | 48.88 | 70.77 | 62.96 | 70.33 | 63.70 | 71.34 |
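The caption defines the 'Overall' score as a micro-average over both datasets. A micro-average pools the individual decisions before dividing, so the larger dataset carries more weight than in a plain mean of the two accuracies. The sketch below illustrates the difference. The counts are hypothetical and are not taken from the paper, and the function name is an assumption.

```python
def micro_average_accuracy(results):
    """Pooled accuracy (%) over datasets.

    results: iterable of (num_correct, num_total) pairs, one per dataset.
    """
    results = list(results)
    correct = sum(c for c, _ in results)
    total = sum(t for _, t in results)
    return 100.0 * correct / total


# Hypothetical counts for illustration only (not the paper's data).
dataset_a = (600, 1000)  # 60% accuracy on the larger dataset
dataset_b = (150, 500)   # 30% accuracy on the smaller dataset

micro = micro_average_accuracy([dataset_a, dataset_b])
macro = (60.0 + 30.0) / 2  # unweighted mean of the two accuracies

print(micro)  # 50.0
print(macro)  # 45.0
```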
Results (%) of TI on the EPPI dataset. All figures, except 'Avg. Rank', are percentages. This evaluation was carried out on protein entities only.
| Method | Prec@1 | Prec@5 | Prec@10 | Prec@15 | Prec@20 | Avg. Rank |
| Baseline | 54.31 | 73.45 | 76.44 | 77.90 | 78.51 | 5.82 |
| Gold Species | 73.52 | 79.36 | 80.75 | 80.75 | 80.99 | 1.62 |
| Rule | 54.99 | 73.72 | 76.45 | 77.91 | 78.52 | 5.79 |
| ML(human) | 65.66 | 76.36 | 78.82 | 79.78 | 80.03 | 2.58 |
| ML( | 65.24 | 76.82 | 79.01 | 79.93 | 80.29 | 2.39 |
| ML( | 80.30 | |||||
| ML( | 55.87 | 75.14 | 78.69 | 79.85 | 80.30 | 2.86 |
| ML( | 56.54 | 75.47 | 78.70 | 79.86 | 80.31 | 2.83 |
| ML( | 64.55 | 76.48 | 78.53 | 79.83 | 80.38 | 2.49 |
| ML( | 65.03 | 76.62 | 78.55 | 79.84 | 2.46 | |
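The Prec@k and Avg. Rank columns can be read as ranking metrics over the candidate identifiers returned for each entity. The sketch below assumes that Prec@k is the percentage of entities whose correct identifier appears among the top k candidates, and that Avg. Rank is the mean rank of the correct identifier over the entities where it is found at all. Both definitions are assumptions for illustration and are not confirmed by the captions. All names and the toy data are hypothetical.

```python
from typing import List, Optional, Sequence


def gold_rank(candidates: Sequence[str], gold: str) -> Optional[int]:
    """1-based rank of the gold ID in a ranked candidate list, or None."""
    try:
        return candidates.index(gold) + 1
    except ValueError:
        return None


def precision_at_k(ranks: List[Optional[int]], k: int) -> float:
    """Percentage of entities whose gold ID is ranked within the top k."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return 100.0 * hits / len(ranks)


def average_rank(ranks: List[Optional[int]]) -> float:
    """Mean rank of the gold ID over entities where it was retrieved."""
    found = [r for r in ranks if r is not None]
    return sum(found) / len(found)


# Hypothetical ranked candidate lists paired with their gold identifiers.
entities = [
    (["P1", "P2", "P3"], "P1"),  # gold at rank 1
    (["Q5", "Q7"], "Q7"),        # gold at rank 2
    (["R2", "R9", "R4"], "R4"),  # gold at rank 3
    (["S1"], "S8"),              # gold not retrieved
]
ranks = [gold_rank(cands, gold) for cands, gold in entities]

print(precision_at_k(ranks, 1))  # 25.0
print(precision_at_k(ranks, 5))  # 75.0
print(average_rank(ranks))       # 2.0
```

Under these assumptions, Prec@k can only grow with k and is bounded by the fraction of entities whose correct identifier is retrieved at all, which matches the way the table's values level off near 80%.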
Results (%) of TI on the TE dataset. All figures, except 'Avg. Rank', are percentages. There are four entity types in the TE data, i.e., protein, gene, mRNAcDNA and GOMOP. The evaluation was carried out on all entity types.
| Method | Prec@1 | Prec@5 | Prec@10 | Prec@15 | Prec@20 | Avg. Rank |
| Baseline | 63.24 | 76.20 | 77.30 | 77.94 | 78.25 | 1.72 |
| Gold Species | 71.82 | 78.03 | 78.34 | 78.40 | 78.41 | 1.29 |
| Rule | 63.45 | 76.21 | 77.30 | 1.72 | ||
| ML(mouse) | 58.76 | 75.40 | 77.25 | 77.92 | 78.24 | 1.90 |
| ML( | 66.59 | 76.53 | 77.23 | 77.76 | 78.12 | 1.68 |
| ML( | 77.24 | 77.76 | 78.12 | |||
| ML( | 66.12 | 76.25 | 77.32 | 77.81 | 78.11 | 1.70 |
| ML( | 66.37 | 76.25 | 77.81 | 78.11 | 1.70 | |
| ML( | 65.78 | 76.14 | 77.28 | 77.84 | 78.12 | 1.71 |
| ML( | 66.03 | 76.14 | 77.29 | 77.84 | 78.12 | 1.70 |
Results (%) of species tagging on the BioCreAtIvE joint dataset. Accuracy (%) of the species disambiguation systems as tested on the BioCreAtIvE I & II test data. The 'BC model' was trained on the BioCreAtIvE devtest data, the 'TXM model' was trained on the TXMEPPI and TE training data, and the 'Majority Vote' was the default species tagging system in the TI system.
| | human | fly | mouse | yeast |
| Majority Vote | 82.35 | 78.43 | 71.69 | 85.12 |
| BC model | 70.23 | 89.24 | 75.41 | 87.64 |
| TXM model | 93.35 | 3.27 | 31.89 | 3.49 |
Results (%) of TI on the BioCreAtIvE joint dataset. Performance of TI with or without the automatically predicted species on the joint BioCreAtIvE GN test dataset.
| System | Precision | Recall | F1 |
| Gold | 70.1 | 63.3 | 66.5 |
| Majority Vote | 46.7 | 56.3 | 51.0 |
| TXM model | 37.8 | 46.5 | 41.7 |
| BC model | 45.8 | 56.1 | 50.4 |
# of species per document in the TXM data
| # Species | # of Docs | % of Docs |
| 1 | 96 | 26.20 |
| 2 | 121 | 32.79 |
| 3+ | 153 | 41.19 |