Nigam H Shah, Clement Jonquet, Annie P Chiang, Atul J Butte, Rong Chen, Mark A Musen.
Abstract
The volume of publicly available genomic-scale data is increasing. Genomic datasets in public repositories are annotated with free-text fields describing the pathological state of the studied sample. These annotations are not mapped to concepts in any ontology, making it difficult to integrate these datasets across repositories. We have previously developed methods to map text annotations of tissue microarrays to concepts in the NCI Thesaurus and SNOMED-CT. In this work, we generalize our methods to map text annotations of gene expression datasets to concepts in the UMLS. We demonstrate the utility of our methods by processing annotations of datasets in the Gene Expression Omnibus. We show that our methods enable ontology-based querying and integration of tissue and gene expression microarray data, as well as identification of datasets on specific diseases across both repositories. Our approach provides the basis for ontology-driven data integration for translational research on gene and protein expression data. Based on this work, we have built a prototype system for ontology-based annotation and indexing of biomedical data. The system processes the text metadata of diverse resource elements, such as gene expression datasets, descriptions of radiology images, clinical-trial reports, and PubMed article abstracts, to annotate and index them with concepts from appropriate ontologies. The key functionality of this system is to enable users to locate biomedical data resources related to particular ontology concepts.
Year: 2009 PMID: 19208184 PMCID: PMC2646250 DOI: 10.1186/1471-2105-10-S2-S1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Architecture of our prototype system. The figure shows the architecture of the prototype, which consists of different levels. The resource level provides an abstraction for elements (E1...En) of public biomedical resources (such as GEO and PubMed). The annotation level uses a concept recognition tool called mgrep (developed by the University of Michigan) to annotate (or tag) resource elements with terms from a dictionary, which is constructed by including all the concept names and synonyms from a set of ontologies accessible to the ontology level. The annotation tables contain information of the form "element E was annotated with concept T in context C". At the index level, a global index combines all the annotations, indexes them by ontology concept, and provides information of the form "Concept T annotates elements E1, E2,..., En". See main text for details.
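The relationship between the annotation tables and the global index can be illustrated with a minimal sketch. It assumes annotations are available as (element, concept, context) triples; the `Annotation` type, the `build_index` function, and the identifiers `E1`, `T1`, and so on are hypothetical and are not taken from the prototype itself.

```python
from collections import defaultdict
from typing import NamedTuple


class Annotation(NamedTuple):
    """One row of an annotation table (hypothetical layout)."""
    element: str  # identifier of a resource element, e.g. a GEO dataset
    concept: str  # identifier of the ontology concept that matched
    context: str  # where in the element's metadata the match occurred


def build_index(annotations):
    """Invert 'element E was annotated with concept T in context C'
    into 'concept T annotates elements E1, E2, ..., En'."""
    index = defaultdict(set)
    for a in annotations:
        index[a.concept].add(a.element)
    # Sort the element lists so the result is deterministic.
    return {concept: sorted(elements) for concept, elements in index.items()}


annotations = [
    Annotation("E1", "T1", "title"),
    Annotation("E2", "T1", "description"),
    Annotation("E2", "T2", "title"),
    Annotation("E1", "T1", "description"),  # same pair, different context
]
index = build_index(annotations)
```

In this sketch the context is dropped when building the index, so an element that matches a concept in several contexts is listed once per concept.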
Categorization of GEO datasets according to the Semantic type after excluding matches to high level ontology terms.
| Semantic type | Number of GDS |
| Neoplastic Process | 109 |
| Disease or Syndrome | 97 |
| Injury or Poisoning | 8 |
| Mental or Behavioral Dysfunction | 3 |
Overview of the number of GEO datasets for concepts in the neoplastic process and disease or syndrome category
| Number of GDS | Concept name | CUI | Semantic type |
| Examples of cancers with many GEO datasets | | | |
| 26 | Breast cancer | C0006142 | Neoplastic Process |
| 11 | Acute myeloid leukemia | C0023467 | Neoplastic Process |
| 5 | Acute lymphoblastic leukemia | C0023449 | Neoplastic Process |
| Examples of cancers with few GEO datasets | | | |
| 1 | Kaposi's sarcoma | C0036220 | Neoplastic Process |
| 1 | Acute promyelocytic leukemia | C0023487 | Neoplastic Process |
| 1 | Pleural mesothelioma | C1377913 | Neoplastic Process |
| Examples of diseases with many GEO datasets | | | |
| 13 | Duchenne dystrophy | C0013264 | Disease or Syndrome |
| 6 | Arthritis | C0003864 | Disease or Syndrome |
| 4 | Chronic obstructive pulmonary disease | C0024117 | Disease or Syndrome |
| Examples of diseases with few GEO datasets | | | |
| 1 | Open-angle glaucoma | C0017612 | Disease or Syndrome |
| 1 | Purpura thrombocytopenic | C0857305 | Disease or Syndrome |
| 1 | Corneal dystrophy | C0010035 | Disease or Syndrome |
Diseases for which there are both gene expression and tissue microarray datasets
| Disease | GEO datasets | GEO samples | TMAD samples |
| Acute myeloid leukemia | 11 | 366 | 3 |
| Malignant melanoma | 3 | 47 | 43 |
| B-cell lymphoma | 3 | 133 | 27 |
| Prostate cancer | 3 | 47 | 15 |
| Renal carcinoma | 2 | 34 | 185 |
| Carcinoma squamous | 2 | 105 | 175 |
| Multiple myeloma | 2 | 225 | 169 |
| Clear cell carcinoma | 2 | 34 | 63 |
| Renal cell carcinoma | 2 | 34 | 9 |
| Breast carcinoma | 2 | 3 | 1277 |
| Hepatocellular carcinoma | 1 | 80 | 163 |
| Carcinoma lung | 1 | 91 | 66 |
| Cutaneous malignant melanoma | 1 | 38 | 41 |
| T-cell lymphoma | 1 | 29 | 31 |
| Lymphoblastic lymphoma | 1 | 29 | 30 |
| Uterine fibroid | 1 | 10 | 19 |
| Medulloblastoma | 1 | 46 | 9 |
| Clear cell sarcoma | 1 | 35 | 8 |
| Leiomyosarcoma | 1 | 24 | 5 |
| Mesothelioma | 1 | 54 | 5 |
| Kaposi's sarcoma | 1 | 4 | 3 |
| Cardiomyopathy | 1 | 14 | 2 |
| Dilated cardiomyopathy | 1 | 14 | 2 |
Accuracy of identifying disease related datasets
| Accuracy in identifying disease related datasets | | | |
| | Correct | Incorrect | Total |
| Positive | 202 (TP) | 39 (FP) | 241 |
| Negative | 97 (TN) | 31 (FN) | 128 |
| Precision = 83.8% | Recall = 86.6% | | |
| Accuracy in identifying disease related datasets after limiting high level matches | | | |
| | Correct | Incorrect | Total |
| Positive | 188 (TP) | 21 (FP) | 209 |
| Negative | 115 (TN) | 45 (FN) | 160 |
| Precision = 89.9% | Recall = 80.6% | | |
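The precision and recall figures in the table follow from the TP, FP, and FN counts using the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). The sketch below recomputes them. The function names are illustrative, and the truncation to one decimal place is an assumption made here because it reproduces the reported values (rounding would give 86.7%, 90.0%, and 80.7% for three of the four figures).

```python
import math


def precision_recall(tp, fp, fn):
    """Standard definitions: precision = TP/(TP+FP), recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)


def truncate_pct(fraction, decimals=1):
    """Express a fraction as a percentage, truncated (not rounded)."""
    factor = 10 ** decimals
    return math.floor(fraction * 100 * factor) / factor


# Counts copied from the table above.
runs = {
    "all matches": (202, 39, 31),
    "high level matches limited": (188, 21, 45),
}
for label, (tp, fp, fn) in runs.items():
    p, r = precision_recall(tp, fp, fn)
    print(label, truncate_pct(p), truncate_pct(r))
# prints:
# all matches 83.8 86.6
# high level matches limited 89.9 80.6
```

Limiting high level matches removes 18 false positives at the cost of 14 true positives, which is why precision rises while recall falls.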
Number of elements annotated from each resource in the current prototype
| Number of elements | Resource local size (Mb) | Number of direct annotations (mgrep results) | Total number of 'useful' annotations¹ | Average number of annotating concepts | |
| 1050000 | 146.1 | 30822190 | 174840027 | 763 | |
| 3371 | 3.6 | 502122 | 1849224 | 525 | |
| 50303 | 99 | 16108580 | 48796501 | 824 | |
| 2085 | 0.7 | 165539 | 772608 | 359 | |
| 1155 | 0.5 | 134229 | 662687 | 564 | |
| 1106914 | 249.9 | 47732660 | 226921047 | (avg) 461.5 |
¹ We do not add a closure annotation between an element and a concept in the index if the given element was already directly annotated with the given concept.
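The rule in the footnote can be sketched as follows. The sketch assumes that closure annotations are derived by following parent (is-a) links upward from each directly annotated concept. The `parents` map, the function names, and the three-concept hierarchy are toy examples chosen for illustration, not the prototype's data structures or actual ontology content.

```python
def ancestors(concept, parents):
    """All concepts reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(concept, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def closure_annotations(direct, parents):
    """For each element, collect ancestor concepts of its direct
    annotations, skipping concepts it is already directly annotated with."""
    closure = {}
    for element, concepts in direct.items():
        inherited = set()
        for concept in concepts:
            inherited |= ancestors(concept, parents)
        closure[element] = inherited - set(concepts)
    return closure


# Toy hierarchy (assumed for illustration only).
parents = {
    "Hepatocellular carcinoma": ["Carcinoma"],
    "Carcinoma": ["Neoplasm"],
}
direct = {
    "E1": {"Hepatocellular carcinoma", "Neoplasm"},
    "E2": {"Carcinoma"},
}
closure = closure_annotations(direct, parents)
```

Here `E1` receives a closure annotation only for `Carcinoma`, because it is already directly annotated with `Neoplasm`, while `E2` receives one for `Neoplasm`.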
Figure 2. User interface within BioPortal. The figure shows the view seen by a user browsing the NCIT in BioPortal and selecting an ontology concept (in this case, Hepatocellular carcinoma). The user can see the numbers of online resource elements that relate directly to that concept (and the concepts that it subsumes). The interface allows the user to directly access the original elements that are associated with Hepatocellular carcinoma for each of the indexed resources.