Richard Tzong-Han Tsai, Po-Ting Lai.
Abstract
BACKGROUND: Gene normalization (GN) is the task of identifying the unique database IDs of genes and proteins in literature. The best-known public competition of GN systems is the GN task of the BioCreative challenge, which has been held four times since 2003. The last two BioCreatives, II.5 & III, had two significant differences from earlier tasks: firstly, they provided full-length articles in addition to abstracts; and secondly, they included multiple species without providing species ID information. Full papers introduce more complex targets for GN processing, while the inclusion of multiple species vastly increases the potential size of dictionaries needed for GN. BioCreative III GN uses Threshold Average Precision at a median of k errors per query (TAP-k), a new measure closely related to the well-known average precision, but also reflecting the reliability of the score provided by each GN system.
Year: 2011 PMID: 22151087 PMCID: PMC3269942 DOI: 10.1186/1471-2105-12-S8-S7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. System workflow
Rule-based classifiers
| Classifier | |
|---|---|
| Species a | |
| Cell b | |
| PPI c | |
| History | |
| Full name/Acronym | |
| Tissue c | |
| Domain d | |
| Family d | |
| MASS d | |
| Gene Ontology | |
| Chromosome Location e | |
| Sequence Length d | |
| RS Number d | |
| The | |
| The | |
a Information collected from NCBI Taxonomy [23], HyperCLDB [24] and Invitrogen [25]
c Information collected from Human Protein Reference Database (HPRD)
d Information collected from UniProt database
e Information collected from EntrezGene database
Figure 2. Candidate selection algorithm
Location features
| Location in full text article |
|---|
| Title |
| Abstract |
| Among the last |
| The first section (usually the introduction section) |
| Among the last |
| The Results section |
| The other sections |
| The last section (usually the conclusion section) |
| Section, sub-section or paragraph titles |
| Appendix |
| Figure captions |
| Table captions |
In our configuration, n1 and n2 are set to 3 and 5, respectively.
Known information feature sets.
| Feature type | Description |
|---|---|
| Keyword match | A Boolean feature which indicates whether or not the identifier’s gene name matches keywords. |
| Full name/abbreviation match | A Boolean feature which indicates whether or not the identifier’s gene name matches full names or abbreviations. |
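The two Boolean features above can be illustrated with a minimal sketch. The table does not say how a "match" is decided, so the normalization rule here (lowercasing and dropping non-alphanumeric characters) is an assumption, and the function names, argument names, and example inputs are hypothetical.

```python
def normalize(name):
    """Lowercase and drop non-alphanumerics (an assumed matching rule)."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def known_information_features(gene_names, keywords, full_names_and_abbreviations):
    """Return the two Boolean features for one candidate identifier.

    gene_names: names associated with the identifier (hypothetical input).
    keywords: keywords attached to the article (hypothetical input).
    full_names_and_abbreviations: full names or abbreviations found in the text.
    """
    names = {normalize(n) for n in gene_names}
    keyword_set = {normalize(k) for k in keywords}
    full_abbr_set = {normalize(f) for f in full_names_and_abbreviations}
    return {
        "keyword_match": bool(names & keyword_set),
        "full_name_abbreviation_match": bool(names & full_abbr_set),
    }


features = known_information_features(
    gene_names=["BRCA1", "breast cancer 1"],
    keywords=["Brca-1"],
    full_names_and_abbreviations=["p53"],
)
print(features)
```

With these illustrative inputs, only the keyword feature fires, because "Brca-1" and "BRCA1" normalize to the same string.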
Species distribution across data sets
| # | Training Set (32 articles) | Test Set (50 articles) | Test Set (507 articles) |
|---|---|---|---|
| 1 | | | |
| 2 | | | |
| 3 | | | |
| 4 | S. pneumoniae TIGR4 (9%) | | |
| 5 | S. scrofa (5%) | | |
| 6 | M. oryzae 70-15 (4%) | | |
| 7 | | | |
| 8 | | | |
| 9 | S. pneumoniae TIGR4 (2%) | | |
| 10 | G. gallus (2%) | E. histolytica HM-1 (2%) | S. scrofa (1%) |
| 11 | Other 18 species (9%) | Other 65 species (23%) | Other 91 species (7%) |
Figure 3. Example list returned by a GN system with correct (C) and incorrect (I) IDs, illustrating the j(E0)-th correct ID, TPIIs and the sentinel ID
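The abstract describes TAP-k as closely related to average precision. As background for a ranked list of correct (C) and incorrect (I) IDs like the one in Figure 3, the sketch below computes plain average precision only. It does not implement the TAP-k threshold, the sentinel ID, or TPIIs, and the function name and example list are hypothetical.

```python
def average_precision(labels, total_correct=None):
    """Plain average precision over a ranked list of 'C'/'I' labels.

    labels: ranked best-first; 'C' marks a correct ID, 'I' an incorrect one.
    total_correct: number of correct IDs for the query; defaults to the
    number of 'C' labels actually present in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(labels, start=1):
        if label == "C":
            hits += 1
            precision_sum += hits / rank  # precision at this correct ID
    denom = total_correct if total_correct is not None else hits
    return precision_sum / denom if denom else 0.0


ranked = ["C", "I", "C", "C", "I"]
ap = average_precision(ranked)
print(round(ap, 4))  # (1/1 + 2/3 + 3/4) / 3 -> 0.8056
```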
Our strategies vs. BioCreative III participant average on gold-50 test set
| Configuration | TAP5 | ∆ | Relative improvement | TAP10 | ∆ | Relative improvement | TAP20 | ∆ | Relative improvement |
|---|---|---|---|---|---|---|---|---|---|
| BioCreative III | 0.1421 | - | - | 0.1643 | - | - | 0.1764 | - | - |
| Static strategy | 0.1773 | +0.0352 | +24.77% | 0.2096 | +0.0453 | +27.57% | 0.2374 | +0.0610 | +34.58% |
| Article-wide species | 0.2012 | +0.0591 | +41.59% | 0.2312 | +0.0669 | +40.72% | 0.2480 | +0.0716 | +40.59% |
| Section-wide species | 0.2007 | +0.0586 | +41.24% | 0.2319 | +0.0676 | +41.14% | 0.2480 | +0.0716 | +40.59% |
| Optimal Dynamic Dictionary | 0.2708 | +0.1287 | +90.57% | 0.3136 | +0.1493 | +90.87% | 0.3140 | +0.1376 | +78.00% |
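The ∆ and relative-improvement columns follow from the TAP scores by simple arithmetic against the BioCreative III participant average. The short sketch below reproduces two TAP5 entries from the table; the function and variable names are illustrative only.

```python
BASELINE_TAP5 = 0.1421  # BioCreative III participant average, from the table


def improvement(score, baseline):
    """Return (absolute delta, relative improvement in percent)."""
    delta = score - baseline
    return delta, 100.0 * delta / baseline


for name, score in [("Static strategy", 0.1773),
                    ("Optimal Dynamic Dictionary", 0.2708)]:
    delta, rel = improvement(score, BASELINE_TAP5)
    print(f"{name}: delta={delta:+.4f}, relative={rel:+.2f}%")
# Static strategy: delta=+0.0352, relative=+24.77%
# Optimal Dynamic Dictionary: delta=+0.1287, relative=+90.57%
```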
BioCreative III average vs. Static vs. Section-wide vs. BioCreative III top systems on silver test set
| Configuration | Silver 50: TAP5 | Silver 50: TAP10 | Silver 50: TAP20 | Silver 507: TAP5 | Silver 507: TAP10 | Silver 507: TAP20 |
|---|---|---|---|---|---|---|
| BioCreative III Average (Baseline) | 0.2175 | 0.2499 | 0.2690 | 0.2930 | 0.3062 | 0.3109 |
| Static strategy | 0.3506 | 0.3942 | 0.3942 | 0.4351 | 0.4351 | 0.4351 |
| Dynamic strategy: | 0.3532 | 0.4024 | 0.4401 | 0.4401 | ||
| Team_74_R3 | 0.3747 | 0.3747 | 0.4555 | 0.4555 | 0.4555 | |
| Team_98_R3 | 0.3576 | 0.3953 | 0.4086 | 0.4511 | ||
| Team_83_R1 | 0.3498 | 0.3531 | 0.3531 | 0.4581 | 0.4581 | |