Alina Petrova1, Yue Ma2, George Tsatsaronis1, Maria Kissa1, Felix Distel2, Franz Baader2, Michael Schroeder1.
Abstract
BACKGROUND: Ontologies play a major role in life sciences, enabling a number of applications, from new data integration to knowledge verification. SNOMED CT is a large medical ontology that is formally defined, which ensures global consistency and supports complex reasoning tasks. Most biomedical ontologies and taxonomies, on the other hand, define concepts only textually, without the use of logic. Here, we investigate how to automatically generate formal concept definitions from textual ones. We develop a method that uses machine learning in combination with several types of lexical and semantic features and outputs formal definitions that follow the structure of SNOMED CT concept definitions.
Keywords: Biomedical ontologies; Formal definitions; MeSH; Relation extraction; SNOMED CT
Year: 2015 PMID: 25949785 PMCID: PMC4422531 DOI: 10.1186/s13326-015-0015-3
Source DB: PubMed Journal: J Biomed Semantics
Textual and formal definitions of
| Textual definition “ | |
|---|---|
| is caused by long-term exposure to | |
| Formal definition | |
| ∃ |
Figure 1. Overview of the main aspects related to automated extraction of formal concept definitions, via a simple example of the definition of “Baritosis”. The figure illustrates an established text mining workflow based on supervised machine learning to address the task. In this work we analyze the impact of the different aspects on overall performance, namely: modeling (selection of corpora and relation sets), feature engineering (selection of lexical and semantic features) and machine learning (selection of classifiers and number of training examples).
Figure 2. The distribution of relations in the SemRep corpus.
Sizes of the explicit and inferred relationships for the relations: Associated_morphology, Causative_agent, and Finding_site
| | Associated_morphology | Causative_agent | Finding_site |
|---|---|---|---|
| InfRB | 503,306 | 91,794 | 1,306,354 |
| ExpRB | 32,454 | 13,225 | 43,079 |
Example alignment between sentences and relationships via semantic annotation, and lexical and semantic features extracted from the alignment
| Sentence | “Baritosis is pneumoconiosis caused by barium dust”. | ||||||
|---|---|---|---|---|---|---|---|
| Annotated Sentence | “ | ||||||
| Baritosis_(disorder) | Barium_Dust_(substance) | ||||||
| SNOMED CT relationship | Baritosis_(disorder) | Causative_agent | Barium_Dust_(substance) | ||||||
| Semantic Features | left type | between-words | right type | ||||
| disorder | “is pneumoconiosis caused by” | organism | |||||
| BoW | {is, pneumoconiosis, caused, by} | ||||||
| Word 2-grams | {is pneumoconiosis, pneumoconiosis caused, caused by} | ||||||
| Char. 3-grams | {is␣, s␣p, ␣pn, pne, neu, eum, umo, moc, oco, con, oni, nio, ios, osi, sis, is␣, s␣c, ␣ca, cau, aus, use, sed, ed␣, d␣b, ␣by} | ||||||
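The lexical features in the table above can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names (`bag_of_words`, `word_ngrams`, `char_ngrams`) are hypothetical, tokenization is assumed to be plain whitespace splitting, and spaces are rendered as `␣` only to match the table's notation.

```python
def bag_of_words(text):
    """Return the set of whitespace-separated tokens (assumed tokenization)."""
    return set(text.split())


def word_ngrams(text, n=2):
    """Return the list of contiguous word n-grams, in order of appearance."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n=3, space="␣"):
    """Return the list of contiguous character n-grams.

    Spaces are replaced by a visible marker so that word boundaries
    remain distinguishable inside the n-grams.
    """
    marked = text.replace(" ", space)
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


# The between-words span from the Baritosis example.
between_words = "is pneumoconiosis caused by"

bow = bag_of_words(between_words)
bigrams = word_ngrams(between_words, n=2)
trigrams = char_ngrams(between_words, n=3)

print(sorted(bow))
print(bigrams)
print(trigrams)
```

The character 3-gram list keeps duplicates (here `is␣` occurs twice, as in the table); whether repeated n-grams are counted or collapsed in the actual feature space is not stated in this excerpt.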
Examples of highly weighted lexical features for the three SNOMED CT roles: AM (Associated_morphology), CA (Causative_agent), and FS (Finding_site)
| Role | Highly weighted lexical features |
|---|---|
| AM | “displacement of”, “medical condition characterized” |
| CA | “caused”, “cause”, “from the”, “by a”, “agent of”, “an infection of” |
| FS | “of”, “in”, “affects only”, “infection of” |
Description of the setup of the three experiments
| | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| | MeSH | SNOMED CT | InfRB | 3-grams | — | 424 | 74% | 99.1% |
| | SemRep | SemRep | SemRep | 3-grams | UMLS | 1,357 | 51%–54% | 94% |
| | WIKI+D4D | SNOMED CT | InfRB | 3-grams | SNOMED CT | 9,292 | 58%–70% | 100% |
In all experiments Support Vector Machines are used.
The performance of the multi-class relational classifier across three different SemRep datasets
| | | | |
|---|---|---|---|
| F-measure (with Types) | 94% | 89.1% | 82.7% |
| F-measure (only Types) | 93.5% | 79.2% | 65.5% |
| Size | 860 (63%) | 1,144 (84%) | 1,357 (100%) |
The size of each dataset is specified by the absolute number of instances and by the percentage of instances covered by the respective set of relations. The table reports F-Measure for two settings: including semantic types in the feature space, and excluding them.
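For readers unfamiliar with the metric, the following sketch shows one common way to compute an F-measure for a multi-class relation classifier. It rests on several assumptions: the excerpt does not say whether scores are macro- or micro-averaged, so macro-averaging is used purely for illustration; the function names are hypothetical; and the gold and predicted labels are toy data, not results from the paper.

```python
from collections import Counter


def f_measure_per_class(gold, predicted):
    """Compute the F1 score of each label from parallel label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    scores = {}
    for label in set(gold) | set(predicted):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


def macro_f_measure(gold, predicted):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    scores = f_measure_per_class(gold, predicted)
    return sum(scores.values()) / len(scores)


# Toy labels for illustration only; not data from the paper.
gold = ["CA", "CA", "FS", "FS", "AM", "AM"]
predicted = ["CA", "FS", "FS", "FS", "AM", "AM"]

per_class = f_measure_per_class(gold, predicted)
macro = macro_f_measure(gold, predicted)
print(per_class, macro)
```

With macro-averaging, every relation contributes equally regardless of how many instances it has, which matters when the relation distribution is skewed.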
Main results on the web corpora WIKI and D4D, where the lexical feature is character 3-grams and type is the SNOMED CT semantic type as discussed in Section ‘Feature engineering’
| | | |
|---|---|---|
| WIKI | 58% | 100% |
| D4D | 70% | 100% |