Antonio J Jimeno-Yepes, Bridget T McInnes, Alan R Aronson.
Abstract
BACKGROUND: Evaluation of Word Sense Disambiguation (WSD) methods in the biomedical domain is difficult because the available resources are either too small or too focused on specific types of entities (e.g. diseases or genes). We present a method that can be used to automatically develop a WSD test collection using the Unified Medical Language System (UMLS) Metathesaurus and the manual MeSH indexing of MEDLINE. We demonstrate the use of this method by developing such a data set, called MSH WSD.
Year: 2011 PMID: 21635749 PMCID: PMC3123611 DOI: 10.1186/1471-2105-12-223
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Distribution of ambiguous terms per semantic group
| Group | Description | Distinct | Ambiguous | % ambiguous | MeSH |
|---|---|---|---|---|---|
| ACTI | Activities & Behaviors | 7652 | 236 | 3.08 | 12 |
| ANAT | Anatomy | 183049 | 1328 | 0.73 | 182 |
| CHEM | Chemicals & Drugs | 1043202 | 15015 | 1.44 | 503 |
| CONC | Concepts & Ideas | 49701 | 3482 | 7.01 | 197 |
| DEVI | Devices | 40454 | 548 | 1.35 | 25 |
| DISO | Disorders | 230779 | 4574 | 1.98 | 354 |
| GENE | Genes & Molecular Sequences | 183096 | 15724 | 8.59 | 302 |
| GEOG | Geographic Areas | 1835 | 445 | 24.25 | 190 |
| LIVB | Living Beings | 433254 | 2475 | 0.57 | 141 |
| OBJC | Objects | 11658 | 577 | 4.95 | 36 |
| OCCU | Occupations | 3559 | 240 | 6.74 | 16 |
| ORGA | Organizations | 3939 | 175 | 4.44 | 18 |
| PHEN | Phenomena | 9903 | 240 | 2.42 | 18 |
| PHYS | Physiology | 307357 | 4437 | 1.44 | 80 |
| PROC | Procedures | 327686 | 1760 | 0.54 | 155 |
This table shows the term ambiguity per semantic group. The column "Distinct" denotes the number of unique terms that belong to any concept assigned to the given semantic group. The column "Ambiguous" denotes the number of unique terms that are assigned to more than one concept of the given semantic group.
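The "% ambiguous" column appears to be the ratio of ambiguous to distinct terms expressed as a percentage. The following is a minimal Python sketch that reproduces three rows of the table under that assumption; the function name `percent_ambiguous` is illustrative and does not come from the paper.

```python
def percent_ambiguous(distinct: int, ambiguous: int) -> float:
    """Share of distinct terms that are ambiguous, as a percentage (2 decimals)."""
    if distinct <= 0:
        raise ValueError("distinct must be positive")
    return round(100.0 * ambiguous / distinct, 2)


# (distinct, ambiguous) pairs copied from the table above.
rows = {
    "ACTI": (7652, 236),
    "GENE": (183096, 15724),
    "GEOG": (1835, 445),
}

for group, (distinct, ambiguous) in rows.items():
    print(group, percent_ambiguous(distinct, ambiguous))
```

The values printed for these rows agree with the "% ambiguous" column, which supports reading that column as ambiguous divided by distinct.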
Intra-semantic group ambiguity
| | ACTI | ANAT | CHEM | CONC | DEVI | DISO | GENE | GEOG | LIVB | OBJC | OCCU | ORGA | PHEN | PHYS | PROC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ACTI | 7652 | 3 | 13 | 158 | 3 | 75 | 17 | 3 | 12 | 10 | 2 | 4 | 14 | 29 | 48 |
| ANAT | | 183049 | 307 | 311 | 37 | 182 | 181 | 38 | 56 | 24 | 6 | 8 | 7 | 48 | 121 |
| CHEM | | | 1043202 | 425 | 235 | 384 | 9674 | 124 | 808 | 224 | 9 | 25 | 45 | 827 | 518 |
| CONC | | | | 49701 | 144 | 374 | 366 | 214 | 397 | 697 | 122 | 75 | 81 | 599 | 589 |
| DEVI | | | | | 40454 | 53 | 36 | 3 | 12 | 54 | 0 | 1 | 2 | 17 | 67 |
| DISO | | | | | | 230779 | 2077 | 89 | 148 | 38 | 14 | 24 | 55 | 372 | 237 |
| GENE | | | | | | | 183096 | 186 | 111 | 56 | 12 | 31 | 24 | 461 | 308 |
| GEOG | | | | | | | | 1835 | 45 | 9 | 0 | 8 | 8 | 36 | 51 |
| LIVB | | | | | | | | | 433254 | 140 | 70 | 32 | 6 | 40 | 82 |
| OBJC | | | | | | | | | | 11658 | 8 | 931 | 5 | 19 | 33 |
| OCCU | | | | | | | | | | | 3559 | 11 | 6 | 14 | 47 |
| ORGA | | | | | | | | | | | | 3939 | 1 | 17 | 28 |
| PHEN | | | | | | | | | | | | | 9903 | 82 | 34 |
| PHYS | | | | | | | | | | | | | | 307357 | 316 |
| PROC | | | | | | | | | | | | | | | 327686 |
Example of CUIs assigned to the string lens
| CUI | STR | Example of UMLS preferred term |
|---|---|---|
| C0023308 | lens | Lens Diseases |
| C0023317 | Lens | Lens, Crystalline |
| C0023318 | Lens | Lenses |
| C0996842 | Lens | Genus Lens |
Example of CUIs assigned to the term lens
| CUI | STR | SAB | TTY |
|---|---|---|---|
| C0023308 | Lens Diseases | MSH | MH |
| C0023317 | Lens, Crystalline | MSH | MH |
| C0023318 | Lenses | MSH | MH |
The information is taken from fields of the MRCONSO table and shows the CUIs linked to the term lens together with the MeSH Headings linked to them. CUI is the concept identifier in the Metathesaurus. STR is the MeSH Heading linked to the concept. SAB indicates the source of the string, which is MeSH in this case. TTY indicates that the strings in the table are MeSH Headings.
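The lookup described above can be sketched in Python as a filter over pipe-delimited MRCONSO.RRF rows, keeping only rows whose SAB is `MSH` and whose TTY is `MH`. This is an illustration, not the authors' code: the helper names `make_row` and `mesh_headings_for` are hypothetical, the sample rows leave every other field empty, and the non-MeSH row (SAB `OTHER`, TTY `PT`) is a made-up placeholder used only to show the filter rejecting it.

```python
from typing import Iterable, List, Tuple

# 0-based positions of the fields used here in a MRCONSO.RRF row.
CUI, SAB, TTY, STR = 0, 11, 12, 14
N_FIELDS = 18


def make_row(cui: str, sab: str, tty: str, string: str) -> str:
    """Build a sample MRCONSO-style row; unused fields are left empty."""
    fields = [""] * N_FIELDS
    fields[CUI], fields[SAB], fields[TTY], fields[STR] = cui, sab, tty, string
    return "|".join(fields) + "|"


def mesh_headings_for(cuis: Iterable[str], rows: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (CUI, MeSH Heading) pairs for the given CUIs."""
    wanted = set(cuis)
    found = []
    for row in rows:
        fields = row.rstrip("\n").split("|")
        if len(fields) <= STR:
            continue  # skip malformed rows
        if fields[CUI] in wanted and fields[SAB] == "MSH" and fields[TTY] == "MH":
            found.append((fields[CUI], fields[STR]))
    return found


sample_rows = [
    make_row("C0023308", "MSH", "MH", "Lens Diseases"),
    make_row("C0023317", "MSH", "MH", "Lens, Crystalline"),
    make_row("C0023318", "MSH", "MH", "Lenses"),
    make_row("C0996842", "OTHER", "PT", "Genus Lens"),  # placeholder source
]

lens_cuis = ["C0023308", "C0023317", "C0023318", "C0996842"]
for cui, heading in mesh_headings_for(lens_cuis, sample_rows):
    print(cui, heading)
```

Only the three concepts that carry a MeSH Heading survive the filter, matching the table above, where C0996842 does not appear.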
Figure 1. Example query for one of the senses of the term lens. PubMed query used to retrieve citations that contain the term lens when it is related to lens diseases. The retrieved citations should have been indexed with the MeSH Heading Lens Diseases and should not be indexed with Lens, Crystalline or Lenses.
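The query in Figure 1 is not reproduced in this record, so the following Python sketch only illustrates the logic of the caption: require the term in the citation text, require the MeSH Heading of the target sense, and exclude the MeSH Headings of the other senses. The function name `sense_query` is hypothetical, and the use of the PubMed field tags `[tiab]` (title/abstract) and `[mh]` (MeSH terms) is an assumption about how such a query might be written, not the authors' exact syntax.

```python
from typing import Iterable


def sense_query(term: str, target_heading: str, other_headings: Iterable[str]) -> str:
    """Build a PubMed-style query for one sense of an ambiguous term."""
    # The term must occur in the title or abstract, and the citation
    # must be indexed with the heading of the sense we want.
    query = f'"{term}"[tiab] AND "{target_heading}"[mh]'
    # Citations indexed with any competing sense are excluded.
    for heading in other_headings:
        query += f' NOT "{heading}"[mh]'
    return query


print(sense_query("lens", "Lens Diseases", ["Lens, Crystalline", "Lenses"]))
```

Excluding citations indexed with a competing heading keeps each retrieved citation tied to a single sense, which is what allows the MeSH indexing to serve as the sense label.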
Figure 2. WSD example for the term cold in ARFF format. The @RELATION line contains the list of concepts from the Metathesaurus. Each data line has the PMID of the citation, the text where the ambiguous term appears and the sense number.
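Since the figure itself is not included here, the layout described in the caption can be approximated with a short Python sketch. Everything beyond the caption is an assumption: the attribute names (`PMID`, `citation`, `class`), the `M1`, `M2`, ... class labels, the underscore-joined relation name, and the placeholder concept identifiers `CUI_A` and `CUI_B` are illustrative only, as are the sample PMIDs and texts.

```python
from typing import Iterable, List, Tuple


def to_arff(concepts: List[str], instances: Iterable[Tuple[int, str, int]]) -> str:
    """Serialize WSD instances as ARFF text.

    concepts  -- concept identifiers, one per sense (placeholders here)
    instances -- (pmid, citation_text, sense_number) with sense_number starting at 1
    """
    labels = [f"M{i}" for i in range(1, len(concepts) + 1)]
    lines = [
        "@RELATION " + "_".join(concepts),  # relation name lists the concepts
        "",
        "@ATTRIBUTE PMID integer",
        "@ATTRIBUTE citation string",
        "@ATTRIBUTE class {" + ",".join(labels) + "}",
        "",
        "@DATA",
    ]
    for pmid, text, sense in instances:
        # Escape backslashes and double quotes inside the quoted string.
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{pmid},"{escaped}",M{sense}')
    return "\n".join(lines)


# Hypothetical instances; PMIDs and texts are invented for illustration.
example = to_arff(
    ["CUI_A", "CUI_B"],
    [(1, "exposure to cold water", 1), (2, "symptoms of the common cold", 2)],
)
print(example)
```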
Top semantic types by frequency in the NLM WSD and our data set
| NLM WSD semantic type | Description | Frequency | MSH WSD semantic type | Description | Frequency |
|---|---|---|---|---|---|
| T061 | Therapeutic or Preventive Procedure | 9 | T047 | Disease or Syndrome | 73 |
| T040 | Organism Function | 7 | T116 | Amino Acid, Peptide, or Protein | 50 |
| T032 | Organism Attribute | 7 | T121 | Pharmacologic Substance | 44 |
| T098 | Population Group | 6 | T123 | Biologically Active Substance | 32 |
| T070 | Natural Phenomenon or Process | 6 | T023 | Body Part, Organ, or Organ Component | 29 |
| T041 | Mental Process | 6 | T109 | Organic Chemical | 26 |
| T081 | Quantitative Concept | 6 | T083 | Geographic Area | 24 |
| T080 | Qualitative Concept | 6 | T129 | Immunologic Factor | 17 |
| T059 | Laboratory Procedure | 5 | T191 | Neoplastic Process | 15 |
| T170 | Intellectual Product | 5 | T114 | Nucleic Acid, Nucleoside, or Nucleotide | 11 |
For each set, the semantic type, description and frequency in the set are shown.
NLM WSD term frequency
| Term | Frequency |
|---|---|
| single | 830940 |
| growth | 780721 |
| evaluation | 626911 |
| surgery | 602878 |
| reduction | 547831 |
| inhibition | 525793 |
| pressure | 492250 |
| support | 470918 |
| weight | 470011 |
| frequency | 460948 |
| sensitivity | 410728 |
| failure | 375471 |
| culture | 365909 |
| resistance | 355190 |
| degree | 338131 |
| determination | 307813 |
| energy | 281706 |
| lead | 280893 |
| glucose | 265023 |
| scale | 263109 |
| strains | 255978 |
| sex | 255545 |
| condition | 251454 |
| fluid | 249806 |
| variation | 228733 |
| secretion | 222020 |
| transport | 219625 |
| man | 205108 |
| radiation | 199449 |
| blood pressure | 181752 |
| transient | 175823 |
| white | 174704 |
| depression | 165689 |
| repair | 158033 |
| pathology | 146981 |
| fat | 133861 |
| extraction | 121110 |
| ultrasound | 115408 |
| discharge | 89344 |
| implantation | 87057 |
| nutrition | 80029 |
| adjustment | 71935 |
| japanese | 67796 |
| cold | 67218 |
| fit | 55692 |
| ganglion | 42474 |
| immunosuppression | 32835 |
| mosaic | 19621 |
| mole | 12947 |
Frequencies of the terms in MEDLINE as of 23rd July 2010.
Overall accuracy on the data set
| Data set | NB | AEC | JDI | MRD | 2-MRD |
|---|---|---|---|---|---|
| Abbreviation Set | 0.9716 | 0.9090 | 0.8759 | 0.8501 | |
| Abbreviation Subset | 0.9218 | 0.6725 | 0.8838 | 0.8725 | |
| Term Set | 0.8980 | 0.7462 | 0.7148 | 0.6773 | |
| Term Subset | 0.7448 | 0.6209 | 0.7132 | 0.6609 | |
| Term/Abbreviation Set | 0.8879 | 0.8801 | 0.9356 | ||
| Term/Abbreviation Subset | 0.9360 | 0.9026 | 0.6899 | 0.8715 | 0.9350 |
| Overall MSH WSD Set | 0.9386 | 0.8383 | 0.8070 | 0.7799 | |
| Overall MSH WSD Subset | 0.8448 | 0.6551 | 0.8118 | 0.7837 | |
| NLM WSD | 0.8830 | 0.6836 | 0.6389 | 0.5500 | |
| NLM WSD Subset | 0.9063 | 0.6932 | 0.7475 | 0.6526 | 0.5800 |
NB stands for Naïve Bayes, AEC stands for Automatic Extracted Corpus, MRD stands for Machine Readable Dictionary, 2-MRD stands for 2nd Order Co-occurrence MRD, and JDI stands for Journal Descriptor Indexing. "Set" denotes all the ambiguous words in the category, while "Subset" indicates that only the words that the JDI method can use are considered. Results on the NLM WSD set have been included.
Distribution of semantic groups in the MSH WSD and NLM WSD datasets
| Semantic Group(s) | NLM WSD count | NLM WSD proportion | MSH WSD count | MSH WSD proportion |
|---|---|---|---|---|
| Activities & Behaviors | 7 | 0.0619 | 5 | 0.0121 |
| Anatomy | 4 | 0.0354 | 44 | 0.1063 |
| Chemicals & Drugs | 3 | 0.0265 | 118 | 0.2850 |
| Concepts & Ideas | 24 | 0.2124 | 10 | 0.0242 |
| Devices | 1 | 0.0088 | 6 | 0.0145 |
| Disorders | 13 | 0.1150 | 100 | 0.2415 |
| Living Beings | 7 | 0.0619 | 39 | 0.0942 |
| Objects | 2 | 0.0177 | 3 | 0.0072 |
| Occupations | 3 | 0.0265 | 3 | 0.0072 |
| Phenomena | 9 | 0.0796 | 4 | 0.0097 |
| Physiology | 20 | 0.1770 | 15 | 0.0362 |
| Procedures | 20 | 0.1770 | 28 | 0.0676 |
| Genes & Molecular Sequences | 0 | 0 | 8 | 0.0193 |
| Geographic Areas | 0 | 0 | 23 | 0.0556 |
| Organizations | 0 | 0 | 5 | 0.0121 |
| Chemicals & Drugs/Objects | 0 | 0 | 2 | 0.0048 |
| Objects/Organizations | 0 | 0 | 1 | 0.0024 |
Accuracy per ambiguous word MEDLINE frequency range
| Q | Frequency range | NB | AEC | MRD | 2-MRD | JDI |
|---|---|---|---|---|---|---|
| Q1 | 1,903,168 - 40,499 | | 0.7708 | 0.7427 | 0.7206 | 0.6505 |
| Q2 | 40,425 - 11,033 | | 0.8591 | 0.8199 | 0.7812 | 0.6458 |
| Q3 | 10,817 - 3,482 | | 0.8928 | 0.8490 | 0.8192 | 0.6618 |
| Q4 | 3,427 - 59 | | 0.8309 | 0.8160 | 0.7974 | 0.6623 |
NB stands for Naïve Bayes, AEC stands for Automatic Extracted Corpus, MRD stands for Machine Readable Dictionary, 2-MRD stands for 2nd Order Co-occurrence MRD, and JDI stands for Journal Descriptor Indexing.