Xinglong Wang, Jun'ichi Tsujii, Sophia Ananiadou.
Abstract
MOTIVATION: Text mining technologies have been shown to reduce the laborious work involved in organizing the vast amount of information hidden in the literature. One challenge in text mining is linking ambiguous word forms to unambiguous biological concepts. This article reports on a comprehensive study on resolving the ambiguity in mentions of biomedical named entities with respect to model organisms and presents an array of approaches, with a focus on methods utilizing natural language parsers.
Year: 2010 PMID: 20053840 PMCID: PMC2828111 DOI: 10.1093/bioinformatics/btq002
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Parsers and their input and output format
| Parser | Input | Output |
|---|---|---|
| C&C (Clark and Curran, | POS-tagged | GR |
| ENJU (Miyao and Tsujii, | POS-tagged | PAS |
| ENJU-Genia (Hara | POS-tagged | PAS |
| Minipar (Lin, | Sentence-detected | Minipar |
| Stanford (Klein and Manning, | POS-tagged | SD |
| Stanford-Genia | POS-tagged | SD |
Fig. 1. Predicate-argument structure.
Averaged 5-fold cross-validation evaluation results
| Method | micro-avg. (P / R / F1) | macro-avg. (P / R / F1) |
|---|---|---|
| R | 72.20 / 62.39 / 66.94 | 27.77 / 46.67 / 29.32 |
| R | 74.09 / 64.03 / 68.69 | 29.77 / 53.81 / 32.20 |
| R | 72.94 / 63.03 / 67.63 | 30.22 / 54.76 / 32.93 |
| C&C | 73.82 / 63.79 / 68.44 | 30.51 / 53.59 / 33.43 |
| ENJU | 72.98 / 63.06 / 67.66 | 31.35 / 55.00 / 34.61 |
| ENJU-Genia | 73.00 / 63.08 / 67.68 | 30.11 / 53.42 / 32.97 |
| Minipar | 73.02 / 63.10 / 67.69 | 30.19 / 53.56 / 33.10 |
| Stanford | 73.67 / 63.66 / 68.30 | 31.17 / 56.35 / 34.35 |
| Stanford-Genia | 73.48 / 63.50 / 68.13 | 30.61 / 55.61 / 33.78 |
| ML | 82.69 / 82.69 / 82.69 | 27.01 / 27.84 / 27.37 |
| R | 75.24 / 63.99 / 69.16 | 31.97 / 55.61 / 34.80 |
| H | 83.80 / 83.80 / 83.80 | 57.56 / 49.72 / 49.90 |
Precision/recall/F1-score, in %.
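The distinction between the micro- and macro-averaged columns can be illustrated with a minimal sketch. It assumes that micro-averaging pools true positives, false positives and false negatives over all species before computing the scores, and that macro-averaging computes precision, recall and F1 per species and then takes the unweighted mean of each; the source does not spell out its exact procedure. The function names and the toy labels below are hypothetical and are not data from the study.

```python
from collections import Counter


def prf(tp, fp, fn):
    """Return (precision, recall, F1) in percent from raw counts."""
    p = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    r = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def micro_macro(gold, pred):
    """Score parallel label lists; a pred entry of None means 'no prediction'."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if p == g:
            tp[g] += 1
        else:
            fn[g] += 1          # the gold species was missed
            if p is not None:
                fp[p] += 1      # a wrong species was asserted
    labels = sorted(set(gold) | {p for p in pred if p is not None})
    # Micro: pool the counts over all species, then score once.
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: score each species separately, then average each measure.
    per_label = [prf(tp[l], fp[l], fn[l]) for l in labels]
    macro = tuple(sum(s[i] for s in per_label) / len(per_label) for i in range(3))
    return micro, macro


# Hypothetical toy annotations (not taken from the paper's corpus).
gold = ["human", "human", "human", "human", "mouse", "mouse", "yeast"]
pred = ["human", "human", "human", "mouse", "mouse", None, "human"]
micro, macro = micro_macro(gold, pred)
print("micro P/R/F1:", [round(x, 2) for x in micro])
print("macro P/R/F1:", [round(x, 2) for x in macro])
```

Under these assumptions a rare species that is never predicted correctly barely moves the micro-average but pulls the macro-average down sharply, which is the pattern seen in the table. The macro F1 values reported there are not the harmonic mean of the macro precision and recall (for the first row, 27.77 and 46.67 would give about 34.8 rather than 29.32), which is consistent with averaging per-species F1 scores as sketched here.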
Fig. 2. A syntactic feature obtained from the ENJU parser.
Data sources
| Main Organism | Source | Abstracts |
|---|---|---|
| | BC1 Devtest | 108 |
| | BC1 Devtest | 250 |
| | BC1 Devtest | 110 |
| | BC2 Test | 262 |
The percentage of the species and the micro-averaged F1 scores (%) of ML, Relation and Hybrid with respect to each species
| Species Name (TaxonID) | Pct (%) | ML | R | H |
|---|---|---|---|---|
| | 50.30 | 85.60 | 70.51 | 86.48 |
| | 26.70 | 79.38 | 78.17 | 80.41 |
| | 10.01 | 87.07 | 79.53 | 87.37 |
| | 7.79 | 82.66 | 74.13 | 84.64 |
| Other | 1.01 | 0.00 | 18.56 | 25.00 |
| | 0.78 | 48.42 | 33.77 | 59.41 |
| | 0.28 | 0.00 | 0.00 | 0.00 |
| | 0.12 | 0.00 | 7.50 | 36.36 |
| | 0.11 | 0.00 | 38.71 | 22.22 |
| | 0.04 | 0.00 | 50.00 | 100.00 |
| | 0.03 | 0.00 | 22.22 | 66.67 |
Results of statistical significance tests between pairs of methods
| | ENJU-Genia | C&C | ML | R | H |
|---|---|---|---|---|---|
| R | +/+/+/ − / N / − | +/+/+/ − / N / − | − / − / − /+/+/+ | − / − / − / − / − / − | − / − / − / − / − / − |
| ENJU-Genia | | − / N / − / N / N / − | − / − / − /+/+/+ | − / − / − / − / − / − | − / − / − / − / − / − |
| C&C | | | − / − / − /+/+/+ | − / − / − / − / − / − | − / − / − / − / − / − |
| ML | | | | +/+/+/ − / − / − | − / − / − / − / − / − |
| R | | | | | − / − / − / − /+/ − |
Fig. 3. Performance of ML, Relation, Hybrid over individual organisms.