Mark Stevenson, Yikun Guo, Robert Gaizauskas, David Martinez.
Abstract
BACKGROUND: Like text in other domains, biomedical documents contain a range of terms with more than one possible meaning. These ambiguities form a significant obstacle to the automatic processing of biomedical texts. Previous approaches to resolving this problem have made use of various sources of information, including linguistic features of the context in which the ambiguous term is used and domain-specific resources, such as UMLS.
Year: 2008 PMID: 19025693 PMCID: PMC2586756 DOI: 10.1186/1471-2105-9-S11-S7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The NLM-WSD test set and some of its subsets. The 12 terms which Weeber et al. [2] described as "problematic" due to low levels of agreement between annotators are shown in italics. The test set used by Joshi et al. [16] comprises the set union of the terms used by Liu et al. [14] and Leroy and Rindflesch [15], while the "common subset" is formed from their intersection.
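The relationship between the subsets described in the Figure 1 caption can be sketched with Python sets. This is a minimal illustration only: the term names below are hypothetical placeholders, not the actual term lists used by Liu et al. or Leroy and Rindflesch, which are not reproduced here.

```python
# Hypothetical placeholder terms; NOT the real term lists from the cited papers.
liu_terms = {"term_a", "term_b", "term_c", "term_d"}
leroy_terms = {"term_c", "term_d", "term_e"}

# Joshi subset: set union of the Liu and Leroy term lists.
joshi_terms = liu_terms | leroy_terms

# Common subset: set intersection of the same two lists.
common_subset = liu_terms & leroy_terms

print(sorted(joshi_terms))
print(sorted(common_subset))
```

By construction the common subset is contained in both the Liu and Leroy subsets, and each of those is contained in the Joshi subset.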
Results from WSD system. Results from WSD system applied to various sections of the NLM-WSD data set using a variety of features and machine learning algorithms. The best results obtained by our system are highlighted in bold font. Results from baseline and previously published approaches are included for comparison.
| Features | |||||||
| Data sets | Linguistic | CUI | MeSH | CUI+MeSH | Linguistic+MeSH | Linguistic+CUI | Linguistic+MeSH+CUI |
| Vector space model | |||||||
| All words | 87.0 | 85.8 | 81.9 | 86.9 | 87.3 | 87.5 | |
| Joshi subset | 82.1 | 79.6 | 76.6 | 81.4 | 82.4 | 82.8 | |
| Leroy subset | 77.5 | 74.4 | 70.4 | 75.8 | 78.7 | 78.9 | |
| Liu subset | 84.0 | 81.3 | 78.3 | 83.4 | 83.9 | 84.2 | |
| Common subset | 79.1 | 75.1 | 70.4 | 76.9 | 80.0 | 79.7 | |
| Naive Bayes | |||||||
| All words | 86.4 | 81.2 | 85.7 | 81.1 | 86.4 | 81.7 | 81.8 |
| Joshi subset | 80.9 | 73.4 | 80.1 | 73.7 | 81.1 | 74.1 | 74.5 |
| Leroy subset | 76.9 | 66.1 | 74.6 | 65.9 | 77.5 | 66.5 | 67.2 |
| Liu subset | 82.1 | 75.4 | 81.7 | 75.3 | 82.7 | 76.3 | 76.6 |
| Common subset | 77.2 | 66.1 | 74.7 | 65.8 | 79.0 | 66.7 | 67.4 |
| Support Vector Machine | |||||||
| All words | 85.9 | 83.5 | 85.3 | 84.5 | 86.2 | 85.3 | 86.0 |
| Joshi subset | 80.1 | 76.4 | 79.5 | 78.0 | 80.9 | 79.1 | 80.3 |
| Leroy subset | 75.5 | 69.7 | 72.6 | 72.0 | 77.1 | 74.5 | 76.3 |
| Liu subset | 81.7 | 78.2 | 81.0 | 80.0 | 82.3 | 80.6 | 81.7 |
| Common subset | 76.3 | 69.8 | 71.6 | 73.0 | 78.1 | 75.1 | 76.9 |
| Previous Approaches | |||||||
| Per-term | Global | ||||||
| MFS baseline | Liu | Joshi | Leroy and Rindflesch (2005) | Joshi | McInnes | ||
| All words | 78.0 | - | - | - | 86.2 | 85.3 | |
| Joshi subset | 66.9 | - | 82.5 | - | 80.9 | 80.0 | |
| Leroy subset | 55.3 | - | 77.4 | 65.5 | 75.7 | 74.5 | |
| Liu subset | 69.9 | 78.0 | 84.9 | - | 83.3 | 81.9 | |
| Common subset | 54.9 | - | 79.8 | 68.8 | 78.1 | 75.6 | |
Per-word performance of best reported systems
| MFS baseline | Leroy and Rindflesch (2005) | Joshi et al. | McInnes | Reported system | |
| adjustment | 62 | 57 | 71 | 70 | |
| association | 100 | - | 97 | ||
| 54 | 46 | 50 | 46 | ||
| cold | 86 | - | 88 | ||
| 90 | - | 89 | |||
| culture | 89 | - | 94 | 95 | |
| degree | 63 | 68 | 89 | 79 | |
| depression | 85 | - | 84 | 81 | |
| determination | 79 | - | 85 | 81 | |
| discharge | 74 | - | 95 | 94 | |
| 99 | - | 98 | |||
| evaluation | 50 | 57 | 67 | 73 | |
| extraction | 82 | - | 84 | 85 | |
| failure | 71 | - | 69 | ||
| fat | 71 | - | 77 | ||
| fit | 82 | - | 81 | 87 | |
| fluid | 100 | - | 99 | ||
| frequency | 94 | - | 94 | 94 | |
| ganglion | 93 | - | 95 | 94 | |
| glucose | 91 | - | 90 | 91 | |
| growth | 63 | 62 | 69 | 69 | |
| immunosuppression | 59 | 61 | 79 | 75 | |
| implantation | 81 | - | 92 | 91 | |
| inhibition | 98 | - | |||
| japanese | 73 | - | 76 | 76 | |
| lead | 71 | - | 88 | 90 | |
| man | 58 | 80 | 80 | 86 | |
| mole | 83 | - | 87 | 88 | |
| mosaic | 52 | 66 | 75 | 85 | |
| nutrition | 45 | 48 | 52 | 49 | |
| pathology | 85 | - | 85 | 84 | |
| 96 | - | 91 | 93 | ||
| radiation | 61 | 72 | 81 | 81 | |
| reduction | 89 | - | 91 | 88 | |
| repair | 52 | 81 | 87 | 86 | |
| resistance | 97 | - | 96 | ||
| scale | 65 | 84 | 76 | 83 | |
| secretion | 99 | - | |||
| sensitivity | 49 | 70 | 85 | 92 | |
| sex | 80 | - | |||
| single | 99 | - | 98 | ||
| strains | 92 | - | 92 | ||
| support | 90 | - | 89 | 90 | |
| 98 | - | 94 | 97 | ||
| transient | 99 | - | 98 | ||
| transport | 93 | - | 93 | 93 | |
| ultrasound | 84 | - | 87 | 85 | |
| variation | 80 | - | 88 | 91 | |
| weight | 47 | 68 | 79 | 82 | |
| white | 49 | 62 | 71 | 74 |
Contribution of linguistic features. Results from various combinations of types of linguistic features, as described in the Features section, combined using the Vector Space Model learning algorithm. LC = Local Collocations, SB = Salient Bigrams and U = Unigrams.
| Features | ||||||
| Data sets | LC | SB | U | SB+U | LC+SB | LC+U |
| All words | 79.2 | 82.0 | 86.9 | 85.9 | 86.3 | 86.9 |
| Joshi subset | 72.6 | 74.4 | 81.6 | 82.3 | 81.0 | 82.0 |
| Leroy subset | 66.2 | 66.9 | 76.7 | 77.5 | 76.5 | 77.3 |
| Liu subset | 75.7 | 76.2 | 83.4 | 84.3 | 82.7 | 83.9 |
| Common subset | 69.6 | 77.7 | 79.3 | 79.1 | 77.6 | 78.8 |