Antonio J Jimeno-Yepes, Alan R Aronson.
Abstract
BACKGROUND: Word sense disambiguation (WSD) algorithms attempt to select the proper sense of ambiguous terms in text. Resources like the UMLS provide a reference thesaurus that can be used to annotate the biomedical literature. Statistical learning approaches have produced good results, but the size of the UMLS makes it infeasible to produce training data that covers the whole domain.
Year: 2010 PMID: 21092226 PMCID: PMC3001745 DOI: 10.1186/1471-2105-11-569
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Top 10 most ambiguous semantic types
| Frequency | Type | Description |
|---|---|---|
| 8,688 | T028 | Gene or Genome |
| 4,089 | T116 | Amino Acid, Peptide, or Protein |
| 3,534 | T201 | Clinical Attribute |
| 2,189 | T200 | Clinical Drug |
| 1,969 | T047 | Disease or Syndrome |
| 1,691 | T123 | Biologically Active Substance |
| 1,408 | T170 | Intellectual Product |
| 1,278 | T121 | Pharmacologic Substance |
| 1,252 | T126 | Enzyme |
| 1,218 | T129 | Immunologic Factor |
Top 10 most ambiguous terms in MEDLINE
| Frequency | Term | Amb. level |
|---|---|---|
| 3,215,158 | study | 6 |
| 2,122,371 | treatment | 4 |
| 2,064,598 | all | 6 |
| 1,955,592 | 2 | 5 |
| 1,945,251 | 1 | 5 |
| 1,872,536 | other | 44 |
| 1,795,137 | had | 2 |
| 1,762,387 | effect | 2 |
| 1,757,672 | can | 11 |
| 1,755,725 | cell | 4 |
Figure 1. Example of a dictionary file used in the Page Rank method. The first field is a word in the dictionary. The remaining fields are the possible identifiers in the Metathesaurus. If more than one identifier appears, the word is ambiguous and the identifiers are the possible candidates.
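The dictionary layout described in the Figure 1 caption can be read with a few lines of code. The sketch below assumes whitespace-separated fields, which the caption does not state; the function names and the sample identifiers are hypothetical placeholders, not entries from the actual file.

```python
from typing import Dict, List


def parse_dictionary(lines: List[str]) -> Dict[str, List[str]]:
    """Map each dictionary word to its candidate Metathesaurus identifiers.

    Assumption: fields are separated by whitespace, word first.
    """
    entries: Dict[str, List[str]] = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        entries[fields[0]] = fields[1:]
    return entries


def is_ambiguous(entries: Dict[str, List[str]], word: str) -> bool:
    """A word is ambiguous when more than one identifier is listed for it."""
    return len(entries.get(word, [])) > 1


# Hypothetical sample lines; the identifiers are placeholders.
sample = [
    "cold ID_A ID_B ID_C",
    "glucose ID_D",
]
dictionary = parse_dictionary(sample)
print(is_ambiguous(dictionary, "cold"), is_ambiguous(dictionary, "glucose"))
```

Words with a single identifier need no disambiguation, so a pipeline built on this would only pass the ambiguous ones to the ranking step.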
Figure 2. Example of a relation file used in the Page Rank method. The u and v fields indicate the origin and destination of the link, respectively. The s field indicates the relation as it appears in the MRREL file [26] (AQ: allowed qualifier; RB: has a broader relationship; RO: has a relationship other than synonymous, narrower, or broader).
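As a rough illustration of how such a relation file could be loaded into a graph, the sketch below assumes each line has the form `u:<id> v:<id> s:<rel>` with whitespace between fields. That line syntax, the function names, and the sample identifiers are assumptions made for the example; only the meaning of the u, v and s fields comes from the caption.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def parse_relation(line: str) -> Tuple[str, str, str]:
    """Split one 'u:<id> v:<id> s:<rel>' line into (origin, destination, relation)."""
    fields = dict(token.split(":", 1) for token in line.split())
    return fields["u"], fields["v"], fields["s"]


def build_graph(lines: List[str]) -> Dict[str, List[Tuple[str, str]]]:
    """Adjacency list: origin -> list of (destination, relation label)."""
    graph: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        u, v, s = parse_relation(line)
        graph[u].append((v, s))
    return dict(graph)


# Hypothetical sample lines; the identifiers are placeholders.
sample = [
    "u:ID_A v:ID_B s:RB",
    "u:ID_A v:ID_C s:RO",
    "u:ID_B v:ID_C s:AQ",
]
graph = build_graph(sample)
print(graph["ID_A"])
```

Keeping the relation label next to each edge makes it possible to filter or weight links by relation type later, if that is wanted.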
Figure 3. Example of an instance to be disambiguated by the Page Rank method. The first line identifies the citation (PMID), the ambiguous word, and the correct sense of the ambiguous word. In the second line, which shows only part of the citation, each word is composed of the stemmed term, a part of speech (ignored in our experiments), the word count, and a flag that indicates whether the word should be disambiguated (0: do not disambiguate; 1: disambiguate).
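The second line of the instance described in the Figure 3 caption can be unpacked as follows. This is a minimal sketch: the `#` separator between the four parts of a word, the whitespace between words, and the sample values are all assumptions, since the caption only names the four components.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    stem: str           # stemmed term
    pos: str            # part of speech (ignored in the experiments)
    count: str          # word count field, kept as an opaque string
    disambiguate: bool  # True when the flag is 1


def parse_context(line: str, sep: str = "#") -> List[Token]:
    """Parse a context line into tokens.

    Assumption: words are whitespace-separated and each word is
    'stem<sep>pos<sep>count<sep>flag'.
    """
    tokens = []
    for item in line.split():
        stem, pos, count, flag = item.split(sep)
        tokens.append(Token(stem, pos, count, flag == "1"))
    return tokens


# Hypothetical context line, not taken from the data set.
sample = "cold#n#w1#1 water#n#w2#0 exposur#n#w3#0"
targets = [t.stem for t in parse_context(sample) if t.disambiguate]
print(targets)
```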
Figure 4. Query example for the term repair using synonyms and related concepts.
NLM WSD results: method comparison
| 1999 | Accuracy all | Accuracy JDI set |
|---|---|---|
| MRD | 0.6389 | 0.6526 |
| PPR s | 0.5826 | 0.5867 |
| AEC | 0.6836 | 0.6932 |
| JDI | | 0.7475 |
| CombSW | 0.7626 | 0.7794 |
| CombV | 0.7601 | 0.7739 |
| MFS | 0.8550 | 0.8669 |
| NB | 0.8830 | 0.9063 |
MRD stands for Machine Readable Dictionary, PPR s stands for Page Rank and MetaMap with Strict model, AEC stands for Automatic Extracted Corpus, JDI stands for Journal Descriptor Indexing, CombSW stands for weighted linear combination, CombV stands for voting combination, MFS stands for Maximum Frequency Sense and NB stands for Naïve Bayes.
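To make the two combination schemes named in the legend concrete, here is a generic sketch of majority voting and of a weighted linear combination of per-sense scores. It is one plausible reading of those names rather than the authors' implementation: the tie-breaking rule, the weights, and the score values are assumptions chosen for the example.

```python
from collections import Counter
from typing import Dict, List


def vote(predictions: List[str]) -> str:
    """Return the sense predicted by most methods.

    Assumption: ties go to the sense that appears first in the input.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    return next(p for p in predictions if counts[p] == best)


def weighted_combination(scores: List[Dict[str, float]],
                         weights: List[float]) -> str:
    """Sum weight * score per sense over all methods and return the top sense."""
    total: Dict[str, float] = {}
    for method_scores, weight in zip(scores, weights):
        for sense, score in method_scores.items():
            total[sense] = total.get(sense, 0.0) + weight * score
    return max(total, key=total.get)


# Hypothetical outputs of three methods for one ambiguous instance.
print(vote(["M1", "M2", "M1"]))
method_scores = [{"M1": 0.6, "M2": 0.4}, {"M1": 0.2, "M2": 0.8}]
print(weighted_combination(method_scores, [0.5, 0.5]))
print(weighted_combination(method_scores, [0.9, 0.1]))
```

Voting only needs each method's top choice, whereas the weighted variant needs comparable scores from every method, so some normalization step would usually precede it.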
Accuracy results per word
| Word | F Max | Total | MFS | NB | AEC | MRD | PPR s | JDI | CombSW | CombV |
|---|---|---|---|---|---|---|---|---|---|---|
| adjustment | 62 | 93 | 0.6667 | 0.7634 | 0.6237 | 0.2308 | 0.3226 | 0.6923 | 0.6882 | 0.5269 |
| blood pressure | 54 | 100 | 0.5400 | 0.5700 | 0.3700 | 0.4343 | 0.4600 | 0.2020 | 0.3838 | 0.4444 |
| cold | 86 | 95 | 0.9053 | 0.9263 | 0.3895 | 0.6044 | 0.9158 | | 0.3895 | 0.7895 |
| condition | 90 | 92 | 0.9783 | 0.9783 | 0.7065 | 0.3370 | 0.9121 | 0.8370 | 0.7802 | 0.6923 |
| culture | 89 | 100 | 0.8900 | 0.9300 | 0.6000 | 0.8200 | 0.1212 | 0.9700 | 1.0000 | 0.5455 |
| degree | 63 | 65 | 0.9692 | 0.9692 | 0.8923 | 0.4923 | 0.9692 | 0.7077 | 0.8769 | 0.8154 |
| depression | 85 | 85 | 1.0000 | 1.0000 | 0.9529 | 0.9941 | 0.9294 | 0.9176 | 0.9647 | 0.9882 |
| determination | 79 | 79 | 1.0000 | 1.0000 | 0.1392 | 0.9936 | 0.0000 | 1.0000 | 0.9620 | 0.1392 |
| discharge | 74 | 78 | 0.9487 | 0.9867 | 0.7067 | 0.9861 | 0.8800 | 0.5556 | 0.7067 | 0.9600 |
| energy | 99 | 100 | 0.9900 | 0.9900 | 0.4000 | 0.4536 | 0.1000 | 0.7732 | 0.4600 | 0.5400 |
| evaluation | 50 | 100 | 0.5000 | 0.7800 | 0.5000 | 0.5800 | 0.5000 | 0.5800 | 0.5200 | 0.5000 |
| extraction | 82 | 87 | 0.9425 | 0.9425 | 0.7471 | 0.2907 | 0.5747 | 0.9535 | 0.9770 | 0.8621 |
| failure | 25 | 29 | 0.8621 | 0.8621 | 0.8621 | 0.5862 | 0.8621 | 1.0000 | 0.8621 | 1.0000 |
| fat | 71 | 73 | 0.9726 | 0.9726 | 0.8356 | 0.9718 | 0.2029 | 0.9296 | 0.9130 | 0.8406 |
| fit | 18 | 18 | 1.0000 | 0.8330 | 0.8889 | 0.8387 | 1.0000 | 1.0000 | 0.8889 | 1.0000 |
| fluid | 100 | 100 | 1.0000 | 0.9710 | 0.4800 | 0.6082 | 0.5354 | 0.3608 | 0.4848 | 0.3535 |
| frequency | 94 | 94 | 1.0000 | 0.9690 | 0.6064 | 0.9362 | 1.0000 | 0.1809 | 0.6277 | 0.8085 |
| ganglion | 93 | 100 | 0.9300 | 0.9500 | 0.8600 | 0.9565 | 0.3838 | 0.9130 | 0.8788 | 0.8586 |
| glucose | 91 | 100 | 0.9100 | 0.9100 | 0.7800 | 0.2755 | 0.9200 | 0.7347 | 0.7800 | 0.3900 |
| growth | 63 | 100 | 0.6300 | 0.7300 | 0.3700 | 0.6700 | 0.3700 | 0.6500 | 0.5500 | 0.6600 |
| immunosuppression | 59 | 100 | 0.5900 | 0.7900 | 0.5700 | 0.4896 | 0.4646 | 0.7083 | 0.5960 | 0.6465 |
| implantation | 81 | 98 | 0.8265 | 0.9796 | 0.9490 | 0.8316 | 0.8367 | 0.9053 | 0.9388 | 0.9694 |
| inhibition | 98 | 99 | 0.9899 | 0.9899 | 0.8384 | 0.9697 | 0.0101 | 0.9899 | 0.9697 | 0.8283 |
| japanese | 73 | 79 | 0.9241 | 0.9241 | 0.6329 | 0.9211 | 0.1646 | 0.8947 | 0.6329 | 0.9367 |
| lead | 27 | 29 | 0.9310 | 0.9310 | 0.8276 | 0.3793 | 0.9310 | 0.1724 | 0.8276 | 0.8621 |
| man | 58 | 92 | 0.6304 | 0.8696 | 0.6522 | 0.3187 | 0.3187 | | 0.6484 | 0.4176 |
| mole | 83 | 84 | 0.9881 | 0.9881 | 0.4405 | 0.8916 | 0.8889 | 0.9398 | 1.0000 | 1.0000 |
| mosaic | 57 | 97 | 0.5876 | 0.8247 | 0.8144 | 0.5795 | 0.5979 | 0.7273 | 0.8454 | 0.7216 |
| nutrition | 45 | 89 | 0.5056 | 0.5506 | 0.3708 | 0.3933 | 0.5056 | 0.4719 | 0.4545 | 0.4318 |
| pathology | 85 | 99 | 0.8586 | 0.8586 | 0.6061 | 0.3939 | 0.8404 | 0.8182 | 0.7553 | 0.8298 |
| pressure | 96 | 96 | 1.0000 | 0.9580 | 0.5208 | 0.9836 | 0.1042 | 0.8172 | 0.6354 | 0.8750 |
| radiation | 61 | 98 | 0.6224 | 0.8367 | 0.7449 | 0.6979 | 0.7551 | 0.7917 | 0.7653 | 0.7653 |
| reduction | 9 | 11 | 0.8182 | 0.8182 | 0.9091 | 0.8182 | 0.9091 | 0.8182 | 1.0000 | 0.8182 |
| repair | 52 | 68 | 0.7647 | 0.9559 | 0.8529 | 0.8358 | 0.6618 | 0.8358 | 0.8676 | 0.8824 |
| resistance | 3 | 3 | 1.0000 | 1.0000 | 1.0000 | 0.3333 | 0.0000 | 1.0000 | 1.0000 | 1.0000 |
| scale | 65 | 65 | 1.0000 | 1.0000 | 0.7231 | 0.0615 | 1.0000 | 0.0615 | 0.6875 | 0.6563 |
| secretion | 99 | 100 | 0.9900 | 0.9900 | 0.4600 | 0.3535 | 0.9451 | 0.9798 | 0.5814 | 0.9651 |
| sensitivity | 49 | 51 | 0.9608 | 0.9608 | 0.7255 | 0.8431 | 0.0196 | 0.2745 | 0.9216 | 0.7255 |
| sex | 80 | 100 | 0.8000 | 0.8400 | 0.6000 | 0.5455 | 0.2700 | | 0.6000 | 0.5300 |
| single | 99 | 100 | 0.9900 | 0.9900 | 0.8900 | 0.0400 | 0.9700 | 0.9300 | 0.8900 | 0.9500 |
| strains | 92 | 93 | 0.9892 | 0.9892 | 0.9570 | 0.9780 | 0.6129 | 1.0000 | 0.9892 | 0.9570 |
| support | 8 | 10 | 0.8000 | 0.8000 | 1.0000 | 0.3000 | 0.2000 | 0.9000 | 1.0000 | 0.9000 |
| surgery | 98 | 100 | 0.9800 | 0.9800 | 0.1900 | 0.9394 | 0.6900 | 0.8990 | 0.4300 | 0.9600 |
| transient | 99 | 100 | 0.9900 | 0.9900 | 0.9100 | 0.9900 | 0.9899 | 0.9600 | 0.9485 | 0.9691 |
| transport | 93 | 94 | 0.9894 | 1.0000 | 1.0000 | 0.9780 | 0.0106 | 1.0000 | 1.0000 | 0.9787 |
| ultrasound | 84 | 100 | 0.8400 | 0.8500 | 0.7400 | 0.6667 | 0.8500 | 0.7813 | 0.8100 | 0.8300 |
| variation | 80 | 100 | 0.8000 | 0.9100 | 0.6900 | 0.7600 | 0.8586 | 0.3500 | 0.6465 | 0.8586 |
| weight | 29 | 53 | 0.5472 | 0.8491 | 0.6604 | 0.4717 | 0.6444 | | 0.6591 | 0.6818 |
| white | 49 | 90 | 0.5444 | 0.8111 | 0.5111 | 0.4831 | 0.5393 | 0.6517 | 0.5730 | 0.5843 |
| Accuracy all | 68.96 | 81.35 | 0.8550 | 0.8830 | 0.6836 | 0.6389 | 0.5826 | | 0.7626 | 0.7601 |
| Accuracy JDI set | 69.47 | 81.02 | 0.8669 | 0.9063 | 0.6932 | 0.6526 | 0.5867 | 0.7475 | 0.7794 | 0.7739 |
F Max is the number of instances of the most frequent sense, Total is the total number of instances, MFS stands for Maximum Frequency Sense, NB stands for Naïve Bayes, AEC stands for Automatic Extracted Corpus, MRD stands for Machine Readable Dictionary, PPR s stands for Page Rank and MetaMap with Strict model, JDI stands for Journal Descriptor Indexing, CombSW stands for weighted linear combination and CombV stands for voting combination.
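The per-word MFS values in the table follow directly from the F Max and Total columns: always choosing the most frequent sense is correct for F Max out of Total instances. The short check below recomputes three rows; the function name is illustrative.

```python
def mfs_accuracy(f_max: int, total: int) -> float:
    """Accuracy of always choosing the most frequent sense."""
    if total <= 0:
        raise ValueError("total must be positive")
    return f_max / total


# (F Max, Total) pairs copied from the table above.
rows = {
    "adjustment": (62, 93),
    "blood pressure": (54, 100),
    "evaluation": (50, 100),
}
for word, (f_max, total) in rows.items():
    print(f"{word}: {mfs_accuracy(f_max, total):.4f}")
```

These reproduce the MFS entries 0.6667, 0.5400 and 0.5000 reported for the three words.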
Scale top tf × idf terms for the senses M1, M2, M3
| M1 term | tf × idf | M2 term | tf × idf | M3 term | tf × idf |
|---|---|---|---|---|---|
| scale | 19.06 | scale | 68.75 | scale | 55.30 |
| integumentari | 8.39 | interv | 25.17 | weight | 46.06 |
| | | seri | 24.74 | measur | 41.91 |
| | | loinc | 22.52 | compon | 33.80 |
| | | sequenc | 21.38 | devic | 31.98 |
Nutrition top tf × idf terms for the senses M1, M2, M3
| M1 term | tf × idf | M2 term | tf × idf | M3 term | tf × idf |
|---|---|---|---|---|---|
| nutrit | 1519.81 | nutrit | 1318.84 | nutrit | 158.13 |
| physiolog | 548.57 | scienc | 453.13 | scienc | 81.95 |
| avail | 205.97 | health | 433.43 | statu | 35.07 |
| statu | 182.38 | physiolog | 351.14 | regim | 13.48 |
| phenomena | 131.35 | food | 311.01 | outcom | 10.17 |
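Rankings like the ones in these two tables can be produced by weighting each stemmed term in a sense profile by tf × idf and keeping the top entries. The sketch below uses the textbook form, raw term count times log(N / df); the exact weighting variant, the document-frequency source, and all numbers in the example are assumptions, not values from the paper.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def top_tf_idf(sense_terms: List[str],
               doc_freq: Dict[str, int],
               n_docs: int,
               k: int = 5) -> List[Tuple[str, float]]:
    """Rank the terms of one sense profile by tf * log(n_docs / df).

    Assumption: doc_freq comes from some background collection of n_docs
    documents; terms without a document frequency are skipped.
    """
    tf = Counter(sense_terms)
    weights = {
        term: count * math.log(n_docs / doc_freq[term])
        for term, count in tf.items()
        if doc_freq.get(term, 0) > 0
    }
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical sense profile and background statistics.
profile = ["scale", "scale", "weight", "measur"]
background = {"scale": 10, "weight": 100, "measur": 1000}
for term, weight in top_tf_idf(profile, background, n_docs=1000):
    print(f"{term}\t{weight:.2f}")
```

A term that occurs in every background document gets weight zero under this form, so it sinks to the bottom of the ranking regardless of how often it appears in the profile.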