Spiro P Pantazatos, Jianrong Li, Paul Pavlidis, Yves A Lussier.
Abstract
An approach towards heterogeneous neuroscience dataset integration is proposed that uses Natural Language Processing (NLP) and a knowledge-based phenotype organizer system (PhenOS) to link ontology-anchored terms to underlying data from each database, and then maps these terms based on a computable model of disease (SNOMED CT®). The approach was implemented using sample datasets from fMRIDC, GEO, The Whole Brain Atlas and Neuronames, and allowed for complex queries such as "List all disorders with a finding site of brain region X, and then find the semantically related references in all participating databases based on the ontological model of the disease or its anatomical and morphological attributes". Precision of the NLP-derived coding of the unstructured phenotypes in each dataset was 88% (n=50), and precision of the semantic mapping between these terms across datasets was 98% (n=100). To our knowledge, this is the first example of the use of both semantic decomposition of disease relationships and hierarchical information found in ontologies to integrate heterogeneous phenotypes across clinical and molecular datasets.
Year: 2009 PMID: 20495688 PMCID: PMC2874327 DOI: 10.4137/cin.s1046
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
Figure 1. Overall scheme for heterogeneous database integration. Natural Language Processing and Coding (PhenOS) was first used to assign terms (and their corresponding SNOMED codes) to underlying data (Primary data) for each of the participating databases. These were organized into tables (Secondary data) whose fields were then related and mapped using ancestor-descendant and translation tables generated from SNOMED (Data mapping).
Figure 2. Model-theoretic query using hierarchical information as well as semantic decomposition of diseases. The SNOMED ontology model extends along two axes: (i) the ‘hierarchical axis’ (diagonal axis or y-axis), where subsumption-type relationships can be derived between ancestor and descendant concepts of the same semantic type (e.g. astrocytoma of brain is an intracranial glioma), and (ii) the semantic model of diseases (horizontal axis or x-axis), in which Diseases (D) are decomposed into their attributes: Anatomical attributes (A) and Abnormal Morphologies (M). While the SNOMED semantic model of diseases also supports functional and etiological attributes, only anatomies and morphologies were used in this proof-of-concept. Participating databases extend down along the ‘vertical axis’ (z-axis). Each axis can be extended further: extension down the ‘y-axis’ is accomplished as more specific terms are added to SNOMED in upcoming revisions; relatable semantic classes could be added along the ‘x-axis’ (e.g. Disease can also be related to the class ‘Organism’ through the linkage concept “causative agent”); and more heterogeneous databases can be added along the ‘z-axis’.
Figure 3. Schematic of the fMRIDC_AMtoD_Brain_subsumed select ‘mapping query’ setup in MS Access 2003 (A). This query creates a table of pair-wise mappings in which the terms in the fMRIDC table are either an Anatomical Structure or Abnormal Morphology and terms in the Brain table are Diseases that are subsumed by diseases that have as finding site or associated morphology the term in the fMRIDC table. This is symbolized by ‘fMRIDC to Brain A,M→D↓’. Users can also specify their own term in a class query, exemplified by an AMtoD_fMRIDC class-based query setup in MS Access (B). An instance of this type of query is shown in Figure 4: “List all diseases with Finding Site ‘temporal lobe’ and then find references to these diseases (identical or subsuming) in all participating databases.” Sample results tables generated from both of these queries are depicted in C.
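The ‘fMRIDC to Brain A,M→D↓’ mapping query described in this caption can be sketched as a three-way join. The original was built in MS Access 2003; the version below uses Python's built-in `sqlite3` as a stand-in. The table and column names (`fmridc`, `brain`, `disease2anamorph`, `ancestor_descendant`) and the `demo-*` accessions are assumptions made for illustration, and the negative SNOMED IDs are placeholders, not real codes. Only the Alzheimer Disease / Cerebral structure row is taken from the translation table example.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with toy rows (illustrative only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE fmridc (accession TEXT, term TEXT, sid INTEGER);
        CREATE TABLE brain (accession TEXT, term TEXT, sid INTEGER);
        CREATE TABLE disease2anamorph (
            disease_sid INTEGER, anamorph_sid INTEGER, linkage INTEGER);
        CREATE TABLE ancestor_descendant (
            descendant_sid INTEGER, ancestor_sid INTEGER, distance INTEGER);
    """)
    # An anatomy term referenced in the (hypothetical) fMRIDC table.
    conn.execute(
        "INSERT INTO fmridc VALUES ('demo-f1', 'Cerebral structure', 83678007)")
    # Alzheimer Disease (26929004) has finding site Cerebral structure.
    conn.execute(
        "INSERT INTO disease2anamorph VALUES (26929004, 83678007, 363698007)")
    # Placeholder disease -1 is a direct descendant of Alzheimer Disease.
    conn.execute("INSERT INTO ancestor_descendant VALUES (-1, 26929004, 1)")
    conn.executemany("INSERT INTO brain VALUES (?, ?, ?)", [
        ("demo-b1", "Hypothetical Alzheimer subtype", -1),
        ("demo-b2", "Unrelated hypothetical term", -2),
    ])
    return conn


# A,M -> D (down): anatomy/morphology term -> diseases with that finding site
# or associated morphology -> descendants of those diseases found in Brain.
AM_TO_D_SUBSUMED = """
    SELECT DISTINCT f.term, b.term
    FROM fmridc AS f
    JOIN disease2anamorph AS t ON t.anamorph_sid = f.sid
    JOIN ancestor_descendant AS ad ON ad.ancestor_sid = t.disease_sid
    JOIN brain AS b ON b.sid = ad.descendant_sid
    ORDER BY f.term, b.term
"""


def am_to_d_subsumed(conn):
    """Return (fMRIDC term, Brain term) pairs for the A,M->D (down) mapping."""
    return conn.execute(AM_TO_D_SUBSUMED).fetchall()


mappings = am_to_d_subsumed(build_demo_db())
```

With the toy rows, `mappings` holds a single pair linking ‘Cerebral structure’ to the placeholder Alzheimer subtype; the unrelated Brain term is not returned because no disease path reaches it.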
Total numbers of pair-wise mappings of concepts generated through PhenOS from each of the four databases to the others, according to 7 types of relationships. 1) Identity: number of unique pair-wise mappings in which the terms are identical or similar between the row and column database. 2) Subsuming: number of unique pair-wise mappings in which terms in the row database subsume terms in the column database. 3) Subsumed: number of unique pair-wise mappings in which the terms in the row database are subsumed by terms in the column database. 4) A,M→D↑: number of unique pair-wise mappings in which the terms in the row database are either an Anatomical Structure or Abnormal Morphology and terms in the column database are Diseases that subsume diseases that have as finding site or associated morphology the term in the row database. 5) A,M→D↓: number of unique pair-wise mappings in which the terms in the row database are either an Anatomical Structure or Abnormal Morphology and terms in the column database are Diseases that are subsumed by diseases that have as finding site or associated morphology the term in the row database. 6) D→A,M↑: number of unique pair-wise mappings in which the terms in the row database are Diseases and terms in the column database are either an Anatomical Structure or Abnormal Morphology that subsume finding sites or associated morphologies of terms in the row database. 7) D→A,M↓: number of unique pair-wise mappings in which the terms in the row database are Diseases and terms in the column database are either an Anatomical Structure or Abnormal Morphology that are subsumed by finding sites or associated morphologies of terms in the row database. Entries along the diagonal are the number of unique terms in the tables for each database linking terms with accession numbers. (Note: NN = Neuronames)
| From | Relationship | To fMRIDC | To GEO | To Brain | To Neuronames |
|---|---|---|---|---|---|
| fMRIDC | Identity | 100 unique terms | 11 | 10 | 14 |
| fMRIDC | Subsuming (↑) | | 48 | 46 | 48 |
| fMRIDC | Subsumed (↓) | | 32 | 9 | 348 |
| fMRIDC | A,M→D↑ | | 12 | 104 | N/A |
| fMRIDC | A,M→D↓ | | 1 | | N/A |
| fMRIDC | D→A,M↑ | | 2 | 1 | 2 |
| fMRIDC | D→A,M↓ | | 47 | 1 | 475 |
| GEO | Identity | 11 | 142 unique terms | 8 | 18 |
| GEO | Subsuming (↑) | 32 | | 29 | 370 |
| GEO | Subsumed (↓) | 48 | | 146 | 50 |
| GEO | A,M→D↑ | 7 | | 194 | N/A |
| GEO | A,M→D↓ | 2 | | 13 | N/A |
| GEO | D→A,M↑ | 0 | | 1 | 0 |
| GEO | D→A,M↓ | 17 | | 0 | 205 |
| Brain | Identity | 10 | 8 | 251 unique terms | 0 |
| Brain | Subsuming (↑) | 9 | 146 | | 0 |
| Brain | Subsumed (↓) | 46 | 29 | | 0 |
| Brain | A,M→D↑ | 0 | 6 | | N/A |
| Brain | A,M→D↓ | 0 | 0 | | N/A |
| Brain | D→A,M↑ | 9 | 9 | | 10 |
| Brain | D→A,M↓ | 209 | 229 | | 2463 |
| Neuronames | Identity | 14 | 18 | 0 | 221 unique terms |
| Neuronames | Subsuming (↑) | 348 | 50 | 0 | |
| Neuronames | Subsumed (↓) | 48 | 370 | 0 | |
| Neuronames | A,M→D↑ | 8 | 26 | 241 | |
| Neuronames | A,M→D↓ | 2 | 1 | 13 | |
| Neuronames | D→A,M↑ | N/A | N/A | N/A | |
| Neuronames | D→A,M↓ | N/A | N/A | N/A | |
=corresponds to mappings generated by the example query depicted in Fig. 4.
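As a rough illustration of how the first three relationship types (Identity, Subsuming, Subsumed) could be counted between a row database and a column database, the sketch below classifies each unique term pair using a lookup of ancestors. The function names, the `ancestors_of` structure and the toy terms are assumptions for illustration; the sketch treats only exact matches as Identity, whereas the caption also allows "similar" terms.

```python
from collections import Counter


def relation(row_term, col_term, ancestors_of):
    """Classify how a row-database term relates to a column-database term."""
    if row_term == col_term:
        return "Identity"
    if row_term in ancestors_of.get(col_term, set()):
        return "Subsuming"  # the row term is an ancestor of the column term
    if col_term in ancestors_of.get(row_term, set()):
        return "Subsumed"   # the row term is a descendant of the column term
    return None


def count_mappings(row_terms, col_terms, ancestors_of):
    """Count unique pair-wise mappings per relationship type."""
    counts = Counter()
    for row_term in set(row_terms):
        for col_term in set(col_terms):
            kind = relation(row_term, col_term, ancestors_of)
            if kind is not None:
                counts[kind] += 1
    return counts


# Toy ontology fragment: each term maps to the set of all its ancestors.
ancestors_of = {"temporal lobe epilepsy": {"epilepsy"}}

db_a = ["epilepsy", "aphasia"]
db_b = ["temporal lobe epilepsy", "aphasia"]

a_to_b = count_mappings(db_a, db_b, ancestors_of)
b_to_a = count_mappings(db_b, db_a, ancestors_of)
```

Under this definition, a Subsuming count from one database to another equals the Subsumed count in the opposite direction, since both describe the same set of term pairs.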
Figure 4. Graphical depiction of the class-based query: “List all diseases with Finding Site ‘temporal lobe’ and then find references to these diseases (identical or subsuming) in all participating databases.” In this example, ‘temporal lobe epilepsy’ is directly referenced in GEO, but must be expanded to the subsuming ancestor term ‘epilepsy’ to find the closest match in fMRIDC, and ‘progressive aphasia in Alzheimer’s disease’ must be expanded to the subsuming ancestor term ‘Alzheimer’s disease’ to find matches in both GEO and fMRIDC.
Figure 5. ‘Close-up’ depiction of the semantic navigation path through the SNOMED ontology for one result in answering the class-based query “List all diseases with Finding Site ‘temporal lobe’ and then find references to these diseases (identical or subsuming) in all participating databases.” Solid arrows are the query navigation path, and dashed arrows are SNOMED directed relationships (“has finding site” and “is a”). “Temporal lobe epilepsy” is found to be referenced in GEO, whereas only the more general term “epilepsy” was found in fMRIDC.
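The expansion step in Figures 4 and 5 (use the disease itself if a database references it, otherwise climb to the nearest subsuming ancestor that is referenced) can be sketched as a breadth-first walk up “is a” links. The function name and data structures are assumptions for illustration; the toy “is a” fragment is simplified to single direct links, and the database contents only reflect what the two captions state.

```python
from collections import deque

# Simplified "is a" fragment: child term -> direct parent terms (toy data).
IS_A = {
    "temporal lobe epilepsy": ["epilepsy"],
    "progressive aphasia in Alzheimer's disease": ["Alzheimer's disease"],
}

# Terms referenced in each database, as described in the figure captions.
DATABASE_TERMS = {
    "GEO": {"temporal lobe epilepsy", "Alzheimer's disease"},
    "fMRIDC": {"epilepsy", "Alzheimer's disease"},
}


def closest_match(term, db_terms, is_a):
    """Return (matched term, distance) for the nearest identical or
    subsuming term referenced in a database, or None if there is none."""
    queue = deque([(term, 0)])
    seen = {term}
    while queue:
        current, distance = queue.popleft()
        if current in db_terms:
            return current, distance
        for parent in is_a.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, distance + 1))
    return None


geo_hit = closest_match("temporal lobe epilepsy", DATABASE_TERMS["GEO"], IS_A)
fmridc_hit = closest_match("temporal lobe epilepsy", DATABASE_TERMS["fMRIDC"], IS_A)
```

Here `geo_hit` is a direct match at distance 0, while `fmridc_hit` resolves to the ancestor ‘epilepsy’ at distance 1, mirroring the navigation path in the figure.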
Most frequent types of errors in precision are shown along with examples.
| Mapping | Example error | Reason | Count |
|---|---|---|---|
| fMRIDC to Brain: Subsuming (↑) | Animals / cyst | Homonymy | 2 |
| NN to Brain: Subsuming (↑) | Brain / brain | Ontology | 1 |
| Accession / term | Photic stimulation | Incorrect relation | 5 |
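For reference, precision here is the fraction of evaluated items judged correct, TP / (TP + FP). The short sketch below reproduces the two figures reported in the abstract. The true-positive and false-positive counts are not given in the text; they are back-computed from the reported percentages and sample sizes (88% of 50 and 98% of 100), so treat them as implied values rather than reported ones.

```python
def precision(true_positives, false_positives):
    """Precision = TP / (TP + FP)."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("precision is undefined when nothing was evaluated")
    return true_positives / total


# Counts implied by the reported percentages (assumed, not stated in the text).
coding_precision = precision(true_positives=44, false_positives=6)    # n = 50
mapping_precision = precision(true_positives=98, false_positives=2)   # n = 100
```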
Delineation of possible queries (navigation paths of query model) and their general potential utilities.
| Query symbol | Query description | General utility | Example query |
|---|---|---|---|
| A↓and/or↑ | Find data entries that reference anatomies subsumed by and/or subsuming A. | Query expansion. | “Find all structures that are part of ‘limbic system’” |
| D↓and/or↑ | Find data entries that reference diseases subsumed by and/or subsuming D. | Query expansion. | “Find subsuming diseases of ‘Argyrophilic brain disease’” |
| M↓and/or↑ | Find data entries that reference abnormal morphologies subsumed by and/or subsuming M. | Query expansion. | “Find all variants and subtypes of ‘inflammation’” |
| A → D | Find data entries that reference all diseases with Finding Site (FS) A. | Compare tissues according to diseases that affect them. | “Find diseases with finding site ‘temporal lobe’” |
| A → D → M | Find data entries that reference abnormal morphologies associated with all diseases with FS A. | Compare tissues according to abnormal morphologies that affect them. | “Find all abnormal morphologies that occur in ‘hypothalamus’” |
| D → A | Find data entries that reference anatomies that are a FS for D. | Compare diseases according to tissues they affect. | “Find regions affected by ‘limbic encephalitis’” |
| D → M | Find data entries that reference abnormal morphologies associated with D. | Compare diseases according to their associated morphologies. | “Find known associated morphologies of ‘prion’ diseases” |
| M → D | Find data entries that reference diseases with associated morphology (AM) M. | Compare abnormal morphologies according to diseases they associate with. | “Find brain diseases known to exhibit ‘inflammation’” |
| M → D → A | Find data entries that reference anatomies that are a FS for diseases with associated morphology (AM) M. | Compare abnormal morphologies according to tissues they affect. | “Find regions known to be affected by ‘inflammation’” |
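The multi-step navigation paths in this table (for example A → D → M) are compositions of single-step lookups over disease attributes. The sketch below shows one way to express that, assuming a flat list of (disease, linkage, attribute) triples. The function names and the second disease are hypothetical; the Alzheimer Disease rows follow the translation table example.

```python
# (disease, linkage, attribute) triples. The first two rows follow the
# translation table example; "Hypothetical disease X" is invented.
DISEASE_ATTRIBUTES = [
    ("Alzheimer Disease", "finding_site", "Cerebral structure"),
    ("Alzheimer Disease", "associated_morphology", "Degeneration"),
    ("Hypothetical disease X", "finding_site", "Cerebral structure"),
    ("Hypothetical disease X", "associated_morphology", "Inflammation"),
]


def diseases_with(linkage, attribute):
    """A -> D or M -> D: diseases that have the given attribute."""
    return {d for d, l, a in DISEASE_ATTRIBUTES if l == linkage and a == attribute}


def attributes_of(disease, linkage):
    """D -> A or D -> M: attributes of one disease for the given linkage."""
    return {a for d, l, a in DISEASE_ATTRIBUTES if d == disease and l == linkage}


def a_to_d_to_m(anatomy):
    """A -> D -> M: morphologies associated with diseases found at an anatomy."""
    morphologies = set()
    for disease in diseases_with("finding_site", anatomy):
        morphologies |= attributes_of(disease, "associated_morphology")
    return morphologies


def m_to_d_to_a(morphology):
    """M -> D -> A: anatomies affected by diseases with a given morphology."""
    anatomies = set()
    for disease in diseases_with("associated_morphology", morphology):
        anatomies |= attributes_of(disease, "finding_site")
    return anatomies


morphologies_in_cerebrum = a_to_d_to_m("Cerebral structure")
```

Hierarchical expansion (the ↓ and ↑ variants) would be applied to the input or output terms of these lookups; it is left out here to keep the composition itself visible.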
| SNOMED ID | Words | NUM | DEFINITION |
|---|---|---|---|
| 115240006 | Glioma (morphologic abnormality) | 3 | Fully specified Name |
| 115240006 | Glioma | 1 | Preferred |
| 115240006 | [M] Gliomas | 2 | Synonym |
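A single SNOMED ID carries several descriptions, distinguished by a type (fully specified name, preferred, synonym). A minimal sketch of looking up the preferred wording for an ID from rows shaped like the ones above follows; the tuple layout and function names are assumptions for illustration.

```python
# Rows copied from the table above: (SNOMED ID, words, NUM, description type).
DESCRIPTIONS = [
    (115240006, "Glioma (morphologic abnormality)", 3, "Fully specified Name"),
    (115240006, "Glioma", 1, "Preferred"),
    (115240006, "[M] Gliomas", 2, "Synonym"),
]


def preferred_term(sid, descriptions=DESCRIPTIONS):
    """Return the 'Preferred' wording for a SNOMED ID, or None if absent."""
    for row_sid, words, _num, kind in descriptions:
        if row_sid == sid and kind == "Preferred":
            return words
    return None


def all_wordings(sid, descriptions=DESCRIPTIONS):
    """Return every wording recorded for a SNOMED ID, ordered by NUM."""
    rows = [(num, words) for row_sid, words, num, _ in descriptions if row_sid == sid]
    return [words for _num, words in sorted(rows)]


label = preferred_term(115240006)
```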
Table of acronyms used in the primary text
| Acronym | Full term |
|---|---|
| DL | Description Logic |
| SNOMED-CT | Systematized Nomenclature of Medicine—Clinical Terms |
| UMLS | Unified Medical Language System |
| DAG | Directed Acyclic Graph |
| fMRI | Functional Magnetic Resonance Imaging |
| PhenOS | Knowledge-based Phenotype Organizer System |
| NLP | Natural Language Processing |
| GEO | Gene Expression Omnibus |
| A | Anatomical Structure |
| M | Abnormal Morphology |
| D | Disease |
| TP | True Positive |
| FP | False Positive |
| FS | Finding Site |
Example entries of tables created through PhenOS for two (fMRIDC and GEO) participating databases.
| Database | Accession | Term | SNOMED ID |
|---|---|---|---|
| fMRIDC | 2-2002-112R1 | Aphasia | 229654003 |
| GEO | GDS 462 | Cancer | 86049000 |
Example entry from the Ancestor-Descendant Table. (SID = SNOMED ID code).
| Descendant (SID) | Ancestor (SID) | Distance |
|---|---|---|
| 109006 | 74732009 | 2 |
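An ancestor-descendant table of this shape can be derived from direct “is a” links by walking upward from every concept and recording how many steps each ancestor is away. The sketch below is one possible way to do this, not the procedure used to build the original table. It assumes that Distance means the shortest number of “is a” steps, and the intermediate concept `-1` is a placeholder invented so that the example row (109006, 74732009, 2) comes out of the toy input.

```python
from collections import deque


def ancestor_descendant_rows(is_a):
    """Expand direct is-a links (child -> parents) into sorted
    (descendant, ancestor, shortest distance) rows."""
    rows = []
    for start in is_a:
        distance_to = {}
        queue = deque((parent, 1) for parent in is_a.get(start, ()))
        while queue:
            node, distance = queue.popleft()
            if node in distance_to:
                continue  # already reached by a path that is no longer
            distance_to[node] = distance
            for parent in is_a.get(node, ()):
                if parent not in distance_to:
                    queue.append((parent, distance + 1))
        rows.extend((start, ancestor, d) for ancestor, d in distance_to.items())
    return sorted(rows)


# Toy input: -1 is a placeholder for an unnamed intermediate concept.
IS_A = {
    109006: [-1],
    -1: [74732009],
    74732009: [],
}

rows = ancestor_descendant_rows(IS_A)
```

Precomputing these rows trades storage for query speed: a subsumption test becomes a single table lookup instead of a graph traversal at query time.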
Example entries from a translation table (Disease2Anatomy_Morphology) mapping diseases to anatomies or morphologies.
| Disease name | Disease SID | AnaMorph SID | AnaMorph name | Linkage |
|---|---|---|---|---|
| Alzheimer Disease | 26929004 | 83678007 | Cerebral structure (body structure) | 363698007 |
| Alzheimer Disease | 26929004 | 33359002 | Degeneration (morphologic abnormality) | 116676008 |
Example entry from SID to term translation table.
| SNOMED code (SID) | SNOMED code description |
|---|---|
| 2470005 | Brain damage (disorder) |
100 randomly selected pairwise mappings (25 to each data source, top) from Table 6 and 50 randomly selected codings (12–13 for each data source, bottom) generated through PhenOS and NLP (Stage 1).
| From | To fMRIdc | From | To GDS | ||||
|---|---|---|---|---|---|---|---|
| NN | Subsuming (↑) | Brain | flocculus | Brain | Subsuming (↑) | disease | hepatoma |
| Brain | D→A,M↓ | Brain | D→A,M↓ | ||||
| Brain | D→A,M↓ | NN | Subsumed (↓) | ||||
| Brain | D→A,M↓ | fMRIdc | Subsumed (↓) | ||||
| NN | Subsuming (↑) | Brain | Subsuming (↑) | ||||
| GDS | Subsuming (↑) | NN | Subsuming (↑) | ||||
| NN | Subsuming (↑) | fMRIdc | D→A,M↓ | ||||
| NN | Subsuming (↑) | NN | Subsuming (↑) | ||||
| NN | Subsuming (↑) | NN | Subsumed (↓) | ||||
| NN | Subsuming (↑) | fMRIdc | Subsumed (↓) | ||||
| Brain | D→A,M↓ | Brain | Subsuming (↑) | ||||
| NN | Subsuming (↑) | NN | Subsumed (↓) | ||||
| NN | Subsuming (↑) | NN | Subsumed (↓) | ||||
| NN | Subsumed (↓) | Brain | D→A,M↓ | ||||
| GDS | Subsuming (↑) | Brain | D→A,M↓ | ||||
| NN | Subsuming (↑) | NN | Subsumed (↓) | ||||
| NN | Subsuming (↑) | Brain | Subsumed (↓) | ||||
| GDS | Subsuming (↑) | NN | Subsumed (↓) | ||||
| Brain | D→A,M↓ | NN | Subsumed (↓) | ||||
| Brain | D→A,M↓ | NN | Subsumed (↓) | ||||
| NN | Subsumed (↓) | Brain | D→A,M↓ | ||||
| Brain | D→A,M↓ | Brain | Subsuming (↑) | ||||
| NN | Subsuming (↑) | NN | Identity | ||||
| Brain | D→A,M↓ | NN | A,M→D↑ | ||||
| NN | Subsuming (↑) | NN | Subsumed (↓) | ||||
| fmridc | Subsuming (↑) | Brain | D→A,M↓ | ||||
| fmridc | A,M→D↑ | Brain | D→A,M↓ | ||||
| fmridc | Subsuming (↑) | Brain | D→A,M↓ | ||||
| NN | A,M→D↑ | Brain | D→A,M↓ | ||||
| GDS | A,M→D↑ | Brain | D→A,M↓ | ||||
| GDS | A,M→D↑ | Brain | D→A,M↓ | ||||
| NN | A,M→D↑ | fmridc | D→A,M↓ | ||||
| GDS | Subsumed (↓) | fmridc | D→A,M↓ | ||||
| NN | A,M→D↑ | Brain ? | D→A,M↓ | ||||
| GDS | Subsumed (↓) | fmridc | D→A,M↓ | ||||
| NN | A,M→D↑ | Brain | D→A,M↓ | ||||
| NN | A,M→D↑ | fmridc | Subsumed (↓) | ||||
| GDS | A,M→D↑ | Brain | D→A,M↓ | ||||
| fmridc | Identity | Brain | D→A,M↓ | ||||
| GDS | Subsuming (↑) | Brain | D→A,M↓ | ||||
| GDS | Subsumed (↓) | Brain | D→A,M↓ | ||||
| fmridc | A,M→D↑ | fmridc | Subsumed (↓) | ||||
| fmridc | Subsuming (↑) | fmridc | Subsumed (↓) | ||||
| GDS | A,M→D↑ | fmridc | D→A,M↓ | ||||
| GDS | Subsumed (↓) | Brain | D→A,M↓ | ||||
| GDS | Subsuming (↑) | Brain | D→A,M↓ | ||||
| NN | A,M→D↑ | Brain | D→A,M↓ | ||||
| NN | A,M→D↑ | Brain | D→A,M↓ | ||||
| fmridc | A,M→D↑ | Brain | D→A,M↓ | ||||
| GDS | Subsumed (↓) | Brain | D→A,M↓ | ||||
| 279394006 | 2-2003-113NF | 113091000 | |||||
| 181431007 | 2-2001-111G6 | 47078008 | |||||
| 36615005 | 2-2001-112D3 | 19225000 | |||||
| 35039007 | 2-2002-1135N | 248152002 | |||||
| 75573002 | 2-2001-111YA | 252741004 | |||||
| 181464007 | 2-2002-1132M | 258061005 | |||||
| 363406005 | 2-2001-1123Y | 67822003 | |||||
| 12915002 | 2-2003-113NF | 1086007 | |||||
| 13645005 | 2-2002-112KP | 278412004 | |||||
| 11000004 | 2-2000-1115T | 25221002 | |||||
| 279297002 | 2-2001-111XN | 180923002 | |||||
| 256373005 | 2-2002-1135N | 50360004 | |||||
| 279185008 | 74848003 | ||||||
| 49243008 | 187192000 | ||||||
| 369205003 | 33336008 | ||||||
| 113303003 | 9546005 | ||||||
| 33960001 | 39821008 | ||||||
| 279290000 | 70819003 | ||||||