Dimitra Alexopoulou, Bill Andreopoulos, Heiko Dietze, Andreas Doms, Fabien Gandon, Jörg Hakenberg, Khaled Khelif, Michael Schroeder, Thomas Wächter.
Abstract
BACKGROUND: Ontology term labels can be ambiguous and have multiple senses. While this is no problem for human annotators, it is a challenge for automated methods that identify ontology terms in text. Classical approaches to word sense disambiguation use co-occurring words or terms. However, most treat ontologies as simple terminologies, without making use of the ontology structure or the semantic similarity between terms. Another useful source of information for disambiguation is metadata. Here, we systematically compare three approaches to word sense disambiguation, which use ontologies and metadata, respectively.
Year: 2009 PMID: 19159460 PMCID: PMC2663782 DOI: 10.1186/1471-2105-10-28
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Algorithms for Word Sense Disambiguation.
| [ | gene definition & abstract vector | 5 human gen. dbs & MeSH | cosine similarity | 52,529 Medline abstracts, 690 human gene symbols | 92.7% | |
| [ | free text | UMLS, Journal Descriptors | Journal Descriptor Indexing (JDI) | 45 ambiguous UMLS terms (NLM WSD Collection) | 78.7% | |
| [ | Medline abstracts | BioCreative-2 GN lexicon & text, EntrezGene, UniProt, GOA | motifs from multiple sequence alignments | BioCreative-2 GN challenge | 81% | |
| [ | Medline abstracts | list of gene senses, EntrezGene | inverse co-author graph | BioCreative GN challenge | 97%P | |
| [ | XML tagged abstracts, positional info, PoS | - | naive Bayes, decision trees, inductive rule training | protein/gene/mRNA assignment: 9 million words (mol. biol. journals) | 85% | |
| [ | text | - | word count, word cooc | - | 86.5% | |
| [ | Medline abstracts | UMLS terms | UMLS term cooc | 35 biomedical abbreviations | 93%P | |
| [ | abbreviations in Medline abstracts | - | SVM | build dictionary, use for abbreviations occurring with their long forms | 98.5% | |
| [ | gene symbol context ( | - | SVM | - | 85% | |
| [ | document | - | LSA/LSI, 2 | 170,000 documents, 1013 terms (TREC-1) (Wall Street Journal) | ↑ 7–14% | |
| [ | word cooc, PoS tags | WordNet | average link clustering | 13 words, ACL/DCI | 73.4% | |
| [ | Wall Street Journal Corpus | |||||
| [ | - | - | 1 | 24 Senseval-2 words, | 44% | |
| [ | text | few tagged data, WordNet | co-training, collocations | 12 common Engl. words × 4000 instances | 96.5% | |
| [ | - | - | co-training & majority voting | Senseval-2 generic English | ↑ 9.8% | |
| [ | - | WordNet | noun coocs, Markov clustering | - | - | |
Figure 1. Three disambiguation approaches for one term. Thrush is an ambiguous term, as its senses include songbird or oral candidiasis. This figure shows the possibilities for disambiguating 'thrush'. Solid edges are is_a relationships.
Figure 2. Subtype-aware signature calculation. The figure shows the path between the UMLS terms 'Body_Part_Organ_or_Organ_Component' and 'Amino_Acid_Peptide_or_Protein'. The edges describe relations between entities (in our case, the subtype-aware signature and its sub-properties), and the nodes are classes and relations of the ontology. 'Body_Part_Organ_or_Organ_Component' is subsumed by 'Fully_Formed_Anatomical_Structure', which belongs to the signature of the relation 'produces'. This relation has the range 'Organic_Chemical', which is a super-class of 'Amino_Acid_Peptide_or_Protein'. The length of this path is 4.
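The path length stated in the Figure 2 caption can be checked with a small breadth-first search. The sketch below is only an illustration: it encodes the four edges named in the caption as a toy graph, and it assumes that every edge counts as one step and that edges can be traversed in either direction. The function names `build_graph` and `path_length` are hypothetical, and the distance actually used by the Closest Sense approach may weigh edges differently.

```python
from collections import deque

# Toy fragment built only from the edges named in the Figure 2 caption.
# Assumption: each edge has length 1 and is traversable in both directions.
EDGES = [
    ("Body_Part_Organ_or_Organ_Component", "Fully_Formed_Anatomical_Structure"),  # subsumption
    ("Fully_Formed_Anatomical_Structure", "produces"),  # signature of the relation
    ("produces", "Organic_Chemical"),  # range of the relation
    ("Organic_Chemical", "Amino_Acid_Peptide_or_Protein"),  # subsumption
]


def build_graph(edges):
    """Build an undirected adjacency map from a list of (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def path_length(graph, start, goal):
    """Return the number of edges on a shortest path, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


GRAPH = build_graph(EDGES)
print(path_length(GRAPH,
                  "Body_Part_Organ_or_Organ_Component",
                  "Amino_Acid_Peptide_or_Protein"))  # prints 4
```

On this toy graph the search returns 4, matching the path length given in the caption.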
Ambiguous terms and their senses in the WSD datasets collected.
| Ontology | Term | Senses |
| GO | Development | biological process of maturation (GO); development of a syndrome/disease/treatment; cataract development; colony development; development of a method; staff/economic development; software/algorithm development |
| | Spindle | mitotic spindle (GO); sleep spindles; muscle spindle; spindle-shaped cells |
| | Nucleus | cell nucleus (GO); body structure (UMLS, subthalamic/cochlear/caudate nucleus); aromatic nucleus |
| | Transport | directed movement of substances into/out of/within/between cells (GO); patient transport (UMLS); transport by air; transport of virus cultures; maternal transport |
| MeSH | Thrush | Oral Candidiasis (MeSH); songbird (e.g. thrush nightingale) |
| | Lead | heavy metal (MeSH); lead measurement (UMLS); to result in |
| | Inhibition | psychological/behavioral inhibition (MeSH); metabolic inhibition (UMLS); % inhibition (SNOMED) |
Examples of the senses (in and out of the taxonomies) per ambiguous term included in the benchmark dataset collected.
Benchmark datasets for WSD.
| Ontology | Term | High quality/low quantity: False | High quality/low quantity: True | Medium quality/medium quantity: False | Medium quality/medium quantity: True | Low quality/high quantity: False | Low quality/high quantity: True |
| GO | Development | 98 | 111 | 271 | 56 | 2296 | 715 |
| | Spindle | 50 | 48 | 70 | 48 | 519 | 599 |
| | Nucleus | 99 | 100 | 25 | 61 | 131 | 1336 |
| | Transport | 102 | 91 | 102 | 56 | 1043 | 699 |
| MeSH | Thrush | 17 | 83 | 45 | 7 | 35 | 1131 |
| | Lead | 71 | 27 | 202 | 22 | 1564 | 735 |
| | Inhibition | 98 | 100 | 454 | 79 | 5247 | 553 |
The above datasets contain PubMed articles collected manually by one expert (high quality/low quantity), articles curated manually by a group of non-experts (medium quality/medium quantity), and articles collected semi-automatically (low quality/high quantity). See Datasets section for details.
Results (% f-measure) for the baseline (bME) and the three methods (Closest Sense, Term Cooc, MetaData) for 7 ambiguous terms, tested on a high quality/low quantity corpus (manually annotated by expert).
| Term | CS1 | CS2 | TC1 | TC2 | TC3 | TC4 | bME | MD1 | MD2 | MD3 | Avg |
| Development | 87 | 86 | 74 | 71 | 57 | 79 | 90 | 96 | 80 | 80 | 80 |
| Spindle | 70 | 79 | 90 | 80 | 95 | 98 | 98 | 100 | 77 | 78 | 87 |
| Nucleus | 89 | 94 | 81 | 78 | 75 | 95 | 97 | 99 | 91 | 77 | 88 |
| Transport | 83 | 71 | 90 | 89 | 88 | 94 | 89 | 98 | 91 | 88 | 88 |
| Thrush | 88 | 94 | 87 | 82 | 78 | 81 | 82 | 94 | 94 | 58 | 84 |
| Lead | 36 | 53 | 89 | 49 | 93 | 81 | 85 | 85 | 36 | 14 | 62 |
| Inhibition | 66 | 84 | 77 | 62 | 85 | 58 | 92 | 100 | 95 | 97 | 82 |
| Avg | 74 | 80 | 84 | 73 | 82 | 84 | 90 | 96 | 81 | 70 | 81 |
The CS1 column contains the results (% f-measure) for the Closest Sense (CS) approach using the classic distance (subsumption only). The CS2 column contains the results for the CS approach using the optimized signature together with the subsumption distance. TC1 and TC2 contain the results of the Term Cooc (TC) approach when co-occurrences or inferred co-occurrences are used, respectively. TC3 contains the results for the TC approach with co-occurrences and support vector machines (SVMs), and TC4 the results when inferred co-occurrences and SVMs are used. The bME column contains the results for the baseline method (classical Maximum Entropy modelling of stems without metadata or hierarchical information), trained and tested on the high quality corpus in a 5-fold cross validation. MD1 is the MetaData approach, trained and tested on the high quality corpus in a 5-fold cross validation. MD2 was trained on the medium quality/quantity corpus and tested on the high quality one. MD3 was trained on the low quality/high quantity corpus and tested on the high quality corpus. Some terms (spindle, nucleus, transport) are easier to disambiguate than others (development, lead). Overall, all methods perform well, between 73–96% f-measure. The f-measure F is the equally weighted harmonic mean of precision P and recall R: F = 2 × P × R/(P + R).
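The f-measure formula given above, F = 2 × P × R/(P + R), can be written as a short function. This is a minimal sketch: the function name `f_measure` and the example precision and recall values are illustrative assumptions, not figures from the results table, and returning 0.0 when both inputs are zero is a common convention rather than something the source specifies.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall: F = 2PR / (P + R)."""
    if precision + recall == 0:
        # Convention assumed here: F is 0 when both P and R are 0.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values chosen only to illustrate the formula.
example = f_measure(0.9, 0.6)
print(round(example, 2))  # prints 0.72
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a method with high precision but low recall (or the reverse) scores lower than its arithmetic mean would suggest.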
Figure 3. Term Cooc classification over time. The x-axis shows the TC classification over time. Left-most articles are classified early, since they have the highest or lowest co-occurrences with the ambiguous term. Red crosses mark wrong predictions. Almost none of the early classified articles are errors.