Jung-Wei Fan, Hua Xu, Carol Friedman.
Abstract
BACKGROUND: Biomedical ontologies are critical for integration of data from diverse sources and for use by knowledge-based biomedical applications, especially natural language processing as well as associated mining and reasoning systems. The effectiveness of these systems is heavily dependent on the quality of the ontological terms and their classifications. To assist in developing and maintaining the ontologies objectively, we propose automatic approaches to classify and/or validate their semantic categories. In previous work, we developed an approach using contextual syntactic features obtained from a large domain corpus to reclassify and validate concepts of the Unified Medical Language System (UMLS), a comprehensive resource of biomedical terminology. In this paper, we introduce another classification approach based on words of the concept strings and compare it to the contextual syntactic approach.
Year: 2007 PMID: 17650333 PMCID: PMC2014782 DOI: 10.1186/1471-2105-8-264
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Error rates of the experiments
| | Error rate | MRR | Error rate | MRR | Error rate | MRR |
| Syntactic dependencies | -- | -- | 0.315 | 0.792 | 0.187 | |
| Strings without annotations* | 0.191 | 0.881 | 0.221# | 0.862# | 0.247# | 0.853# |
| Strings with annotations* | 0.143 | | 0.164# | 0.895# | 0.192# | 0.886# |
* Note that the string-based approach is independent of the number of available syntactic features used by the distributional approach; the aligned values (#) are shown for comparison purposes only.
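The two measures reported in this table can be sketched as follows. This is a minimal illustration, assuming that the error rate is the number of misclassified concepts divided by the test-set size and that MRR is the usual mean reciprocal rank of the correct class in each ranked prediction list. The function names are hypothetical, and the counts (17, 17.5, 22.5 out of N = 91) are taken from the captions and tables of this record.

```python
from typing import Sequence


def error_rate(misclassified: float, total: int) -> float:
    """Fraction of test concepts that were misclassified."""
    return misclassified / total


def mean_reciprocal_rank(rankings: Sequence[Sequence[str]],
                         gold: Sequence[str]) -> float:
    """Average of 1/rank of the correct class (0 if it is not ranked)."""
    total = 0.0
    for ranked, correct in zip(rankings, gold):
        if correct in ranked:
            total += 1.0 / (ranked.index(correct) + 1)
    return total / len(gold)


# Misclassification counts reported for the N = 91 test set.
reported = [
    ("distributional", 17),
    ("strings with annotations", 17.5),
    ("strings without annotations", 22.5),
]
for name, errors in reported:
    print(name, round(error_rate(errors, 91), 3))  # 0.187, 0.192, 0.247
```

The three printed values agree with the entries in the last Error rate column of the table, which suggests that column corresponds to the N = 91 test set.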
Confusion matrix for the 17 misclassifications by the distributional approach on the N = 91 test set
| | 1) | 2) | 3) | 4) | 5) | 6) | 7) | 8) |
| 1) anatomy | 2 | |||||||
| 2) behavior | ||||||||
| 3) biologic function | 1 | 2 | 2 | 1 | ||||
| 4) disorder | 3 | |||||||
| 5) gene or protein | ||||||||
| 6) microorganism | ||||||||
| 7) procedure | 1 | 1 | 1 | |||||
| 8) substance | 1 | 1 |
Confusion matrix for the 17.5 misclassifications by the string-based approach (with parenthesized annotations) on the N = 91 test set
| | 1) | 2) | 3) | 4) | 5) | 6) | 7) | 8) |
| 1) anatomy | 1 | 3 | 1 | |||||
| 2) behavior | 1 | |||||||
| 3) biologic function | 3 | 1 | ||||||
| 4) disorder | 1 | 1 | ||||||
| 5) gene or protein | 1 | |||||||
| 6) microorganism | ||||||||
| 7) procedure | 1 | |||||||
| 8) substance | 3.5 |
Misclassifications by one approach (rows) that are classified correctly by another approach (columns)
| | Distributional dependencies | Strings without annotations | Strings with annotations |
| Distributional dependencies | -- | 10/17 | 12/17 |
| Strings without annotations | 15.5/22.5 | -- | 6/22.5 |
| Strings with annotations | 12.5/17.5 | 1/17.5 | -- |
Error rates of the combined classifier
| | Error rate | MRR | Error rate | MRR | Error rate | MRR |
| Syntactic dependencies + strings without annotations | 0.191 | 0.881 | 0.167 | 0.900 | 0.110 | 0.935 |
| Syntactic dependencies + strings with annotations | 0.143 | 0.907 | 0.143 | 0.912 | 0.055 | 0.969 |
* Note that in the first column we set w = 0 because the distributional approach is not applicable.
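This record does not give the formula behind the weight w. As a rough sketch only, one common way to combine two classifiers is a linear interpolation of their per-class scores. Under that assumption, w = 0 reduces the combined classifier to the string-based approach alone, consistent with the note above. The function name `combine_scores` and the example scores are hypothetical.

```python
from typing import Dict, List


def combine_scores(dist_scores: Dict[str, float],
                   string_scores: Dict[str, float],
                   w: float) -> List[str]:
    """Rank classes by w * distributional + (1 - w) * string-based score.

    ASSUMPTION: linear interpolation; the source only says that a weight w
    is used and that w = 0 disables the distributional approach.
    """
    classes = set(dist_scores) | set(string_scores)
    combined = {
        c: w * dist_scores.get(c, 0.0) + (1 - w) * string_scores.get(c, 0.0)
        for c in classes
    }
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(combined, key=lambda c: (-combined[c], c))


# Made-up scores for a single concept, for illustration only.
dist = {"disorder": 0.1, "procedure": 0.9}
strings = {"disorder": 0.7, "procedure": 0.2}
print(combine_scores(dist, strings, w=0.0))  # string-based ranking only
print(combine_scores(dist, strings, w=0.5))
```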
Figure 1. Hierarchical illustration of the eight semantic classes (with some of the constituent SN types and CUIs omitted for conciseness).
Example strings of the concept C0079380
| S6441909 | Frameshift Mutation function |
| S0042786 | Frameshift Mutation |
| S6148830 | Reading Frame Shift Mutation |
| S6364685 | Out-of-Frame Mutation |
Words and frequencies obtained from the string texts of Table 6
| frameshift | 2 |
| mutation | 4 |
| function | 1 |
| reading | 1 |
| frame | 2 |
| shift | 1 |
| out | 1 |
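The word counts above can be reproduced from the four example strings of C0079380 with a short sketch. It assumes lowercasing and splitting on non-alphanumeric characters, so that "Out-of-Frame" yields "out", "of", and "frame". The table as shown here does not list "of"; the sketch assumes it is dropped as a stop word, which is a guess rather than something this record states. The names `STOP_WORDS` and `word_frequencies` are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Example strings of concept C0079380, as listed in the table above.
STRINGS = [
    "Frameshift Mutation function",
    "Frameshift Mutation",
    "Reading Frame Shift Mutation",
    "Out-of-Frame Mutation",
]

# ASSUMPTION: "of" is treated as a stop word; the source does not say so.
STOP_WORDS = {"of"}


def word_frequencies(strings: Iterable[str]) -> Counter:
    """Count lowercased alphanumeric tokens across all strings."""
    counts: Counter = Counter()
    for text in strings:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts


for word, freq in word_frequencies(STRINGS).items():
    print(word, freq)
```

With these assumptions the counts match the table: "mutation" appears 4 times, "frameshift" and "frame" twice each, and the remaining words once.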