Dolf Trieschnigg, Piotr Pezik, Vivian Lee, Franciska de Jong, Wessel Kraaij, Dietrich Rebholz-Schuhmann.
Abstract
MOTIVATION: Controlled vocabularies such as the Medical Subject Headings (MeSH) thesaurus and the Gene Ontology (GO) provide an efficient way of accessing and organizing biomedical information by reducing the ambiguity inherent to free-text data. Different methods of automating the assignment of MeSH concepts have been proposed to replace manual annotation, but they are either limited to a small subset of MeSH or have only been compared with a limited number of other systems.
Year: 2009 PMID: 19376821 PMCID: PMC2682526 DOI: 10.1093/bioinformatics/btp249
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Retrieval performance on document index based on KNN
| Method | 2006 MAP | 2006 P10 | 2007 MAP | 2007 P10 |
|---|---|---|---|---|
| KNN+ | 0.411 (+13%) | 0.504 (+10%) a | 0.280 (+6%) b | 0.472 (+5%) |
| EAGL+ | 0.374 (+3%) | 0.462 (+1%) | 0.273 (+3%) | 0.458 (+2%) |
| MetaMap+ | 0.372 (+2%) | 0.458 (0%) | 0.268 (+2%) | 0.456 (+1%) |
| MTI+ | 0.367 (+1%) | 0.454 (−1%) | 0.277 (+5%) | 0.464 (+3%) |
| baseline | 0.363 | 0.458 | 0.264 | 0.450 |
| CLM+ | 0.363 (0%) | 0.458 (0%) | 0.264 (0%) | 0.464 (+3%) |
| BM25+ | 0.363 (0%) | 0.458 (0%) | 0.264 (−0%) | 0.453 (+1%) |
‘a’ and ‘b’ indicate a significant difference from the baseline (P < 0.05 and P < 0.005, respectively).
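For readers unfamiliar with the two measures reported in these tables, the sketch below shows the standard textbook definitions of precision at 10 (P10) and mean average precision (MAP). It is only an illustration: the function names and the toy ranking are hypothetical, and the evaluation tooling actually used for these results is not described here.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(ranking: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    top = ranking[:k]
    return sum(1 for doc in top if doc in relevant) / k


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average of average_precision over all (ranking, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy example (made-up document identifiers, not data from the tables).
toy_ranking = ["d1", "d2", "d3", "d4"]
toy_relevant = {"d1", "d3"}
toy_ap = average_precision(toy_ranking, toy_relevant)
toy_p2 = precision_at_k(toy_ranking, toy_relevant, k=2)
```

In the toy example the relevant documents sit at ranks 1 and 3, so the average precision is the mean of 1/1 and 2/3.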
Retrieval performance on TREC Genomics collections
| Method | 2004 MAP | 2004 P10 | 2005 MAP | 2005 P10 | 2006 MAP | 2006 P10 | 2007 MAP | 2007 P10 |
|---|---|---|---|---|---|---|---|---|
| KNN+ | 0.379 (+12%) b | 0.584 (+11%) a | 0.224 (+15%) b | 0.351 (+10%) a | 0.405 (+11%) a | 0.465 (+2%) | 0.291 (+10%) a | 0.469 (+4%) |
| MTI+ | 0.352 (+4%) a | 0.542 (+3%) | 0.208 (+7%) | 0.333 (+4%) | 0.381 (+5%) | 0.442 (−3%) | 0.274 (+4%) | 0.475 (+6%) |
| CLM+ | 0.345 (+2%) | 0.520 (−1%) | 0.199 (+2%) | 0.298 (−6%) | 0.364 (0%) | 0.458 (0%) | 0.267 (+1%) | 0.461 (+2%) |
| BM25+ | 0.342 (+1%) | 0.512 (−3%) | 0.195 (0%) | 0.318 (0%) | 0.363 (0%) | 0.458 (0%) | 0.268 (+1%) | 0.467 (+4%) |
| MetaMap+ | 0.341 (+1%) | 0.526 (0%) | 0.200 (+2%) a | 0.318 (0%) | 0.364 (0%) | 0.442 (−3%) | 0.265 (0%) | 0.450 (0%) |
| EAGL+ | 0.341 (+1%) | 0.520 (−1%) | 0.198 (+2%) | 0.312 (−2%) | 0.365 (+1%) | 0.477 (+4%) | 0.268 (+2%) | 0.453 (+1%) |
| Baseline | 0.339 | 0.526 | 0.195 | 0.318 | 0.363 | 0.458 | 0.264 | 0.450 |
‘a’ and ‘b’ indicate a significant difference from the baseline (P < 0.05 and P < 0.005, respectively).
MeSH classification performance on 1000 random MEDLINE citations, using title and abstract as input
| Method | Document: MAP | Document: P10 | Category: F1 | Decision: micro F1 |
|---|---|---|---|---|
| MTI | 0.2536 | 0.3200 | 0.4503 | 0.4415 |
| BM25 | 0.0912 (−64%) | 0.1021 (−68%) | 0.2251 (−50%) | 0.1972 (−55%) |
| MetaMap | 0.1623 (−36%) | 0.1910 (−40%) | 0.3187 (−29%) | 0.2968 (−33%) |
| CLM | 0.1783 (−30%) | 0.1748 (−45%) | 0.3429 (−24%) | 0.2982 (−32%) |
| EAGL | 0.1976 (−22%) | 0.2119 (−34%) | 0.2987 (−34%) | 0.2977 (−33%) |
| KNN | 0.5052 (+99%) | 0.4515 (+41%) | 0.4074 (−10%) | 0.4963 (+12%) |
All differences in MAP and P10 are significant at P < 0.005, based on Fisher's randomization test.
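A paired randomization test of the kind named above can be sketched as follows. This is a minimal Monte Carlo approximation under stated assumptions, not the authors' implementation: it takes two per-document (or per-query) score lists, randomly flips the sign of each paired difference, and reports how often the shuffled mean difference is at least as large as the observed one. The function name, the number of trials, and the fixed seed are all hypothetical choices.

```python
import random
from typing import Sequence


def randomization_test(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    trials: int = 10000,
    seed: int = 0,
) -> float:
    """Two-sided paired randomization test; returns an approximate p-value."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be paired and of equal length")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    extreme = 0
    for _ in range(trials):
        # Under the null hypothesis the two system labels are exchangeable,
        # so each paired difference is equally likely to have either sign.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total / n) >= observed - 1e-12:
            extreme += 1
    return extreme / trials


# Toy inputs (made-up scores, not data from the table).
p_same = randomization_test([0.5] * 10, [0.5] * 10)
p_diff = randomization_test([0.6] * 20, [0.4] * 20)
```

With identical score lists every shuffle ties the observed difference, so the p-value is 1; with a consistent gap across 20 pairs almost no shuffle reaches the observed mean, so the p-value is close to 0.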
Results from the analysis of false positives
| Judgment | True positives | False positives: MetaMap | False positives: CLM | False positives: KNN |
|---|---|---|---|---|
| Very relevant | 94 (75%) | 40 (29%) | 44 (24%) | 37 (20%) |
| Relevant | 17 (13%) | 39 (29%) | 26 (14%) | 27 (14%) |
| Undecided | 12 (10%) | 20 (15%) | 66 (35%) | 49 (26%) |
| Irrelevant | 1 (1%) | 33 (24%) | 35 (19%) | 58 (31%) |
| Incorrect | 2 (2%) | 4 (3%) | 16 (9%) | 17 (9%) |
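The percentages in this table appear to be each count's share of its column total. The short check below recomputes them from the counts under that assumption; the variable and function names are illustrative only.

```python
from typing import Dict, List

# Counts per judgment (Very relevant, Relevant, Undecided, Irrelevant, Incorrect),
# copied from the table above.
counts: Dict[str, List[int]] = {
    "True positives": [94, 17, 12, 1, 2],
    "MetaMap": [40, 39, 20, 33, 4],
    "CLM": [44, 26, 66, 35, 16],
    "KNN": [37, 27, 49, 58, 17],
}


def column_shares(values: List[int]) -> List[int]:
    """Each count as a whole-number percentage of the column total."""
    total = sum(values)
    return [round(100 * v / total) for v in values]


shares = {name: column_shares(values) for name, values in counts.items()}
```

Under this reading the recomputed shares agree with the percentages listed in the table, which suggests that each column was normalized independently.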