Jamie P McCusker, Joshua A Phillips, Alejandra González Beltrán, Anthony Finkelstein, Michael Krauthammer.
Abstract
The National Cancer Institute (NCI) is developing caGrid as a means for sharing cancer-related data and services. As more data sets become available on caGrid, we need effective ways of accessing and integrating this information. Although the data models exposed on caGrid are semantically well annotated, it is currently up to the caGrid client to infer relationships between the different models and their classes. In this paper, we present a Semantic Web-based data warehouse (Corvus) for creating relationships among caGrid models. This is accomplished through the transformation of semantically annotated caBIG Unified Modeling Language (UML) information models into Web Ontology Language (OWL) ontologies that preserve those semantics. We demonstrate the validity of the approach by Semantic Extraction, Transformation and Loading (SETL) of data from two caGrid data sources, caTissue and caArray, as well as alignment and query of those sources in Corvus. We argue that semantic integration is necessary for integration of data from distributed web services and that Corvus is a useful way of accomplishing this. Our approach is generalizable and of broad utility to researchers facing similar integration challenges.
Year: 2009 PMID: 19796399 PMCID: PMC2755823 DOI: 10.1186/1471-2105-10-S10-S2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Figure 1. Principal components analysis. Projection of the first two principal components of gene expression microarray experiment GSE5949 from GEO. The clinical diagnoses for the biological source cell line were extracted from caTissue and joined using Corvus.
Load and query performance. Load and query times for the operations used. The compute environment used an Intel Core 2 Quad @ 2.40 GHz and 4 GB of memory. The repository was single-threaded.
| Stage | Data Size (Entities) | Data Size (Statements) | Processing Time (s) |
|---|---|---|---|
| | 57,526 | 88,654 | 473 |
| | 14,003 | 607,532 | 910 |
| | - | - | 3.24 |
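For a rough sense of the load rates these figures imply, the statement counts can be divided by the processing times. The snippet below is only an illustrative calculation from the two load rows of the table. The row labels are hypothetical, since the stage names are not present in the table, and the helper name `statements_per_second` is likewise an assumption.

```python
# Figures copied from the two load rows of the table: (statements, seconds).
# The dictionary keys are hypothetical labels; the table's stage names are missing.
loads = {
    "first load row": (88_654, 473),
    "second load row": (607_532, 910),
}


def statements_per_second(statements: int, seconds: float) -> float:
    """Average load throughput for one stage."""
    return statements / seconds


throughput = {
    name: statements_per_second(statements, seconds)
    for name, (statements, seconds) in loads.items()
}

for name, rate in throughput.items():
    print(f"{name}: {rate:.1f} statements/s")
```

Under these numbers the two loads average roughly 187 and 668 statements per second on the single-threaded repository described in the caption.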
Figure 2. Corvus ETL process.
Figure 3. Modules diagram. Package diagram showing the import and dependency relationships between ontology modules.
Figure 4. Annotated UML class OWL representation. Representation of the Hybridization class from caArray in OWL format. Hybridization is annotated with the NCIt term "Nucleic Acid Hybridization" and has two attributes: "name" and "amountOfMaterial", which in turn have their NCIt annotations. The values of these attributes are maintained via the dm:datatype property.
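The structure described in this caption can be sketched as a small set of subject-predicate-object triples. This is a minimal illustration in plain Python, not the encoding Corvus actually uses: only the class name, the attribute names, the NCIt term "Nucleic Acid Hybridization" and the `dm:datatype` property come from the caption. The `caarray:` and `ex:` prefixes, the `ex:annotatedWith` and `ex:hasAttribute` properties, the attribute term placeholders and the instance `ex:hyb1` are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical property names; only dm:datatype appears in the caption.
HAS_ATTRIBUTE = "ex:hasAttribute"
ANNOTATED_WITH = "ex:annotatedWith"


def class_to_triples(cls: str, ncit_term: str, attributes: Dict[str, str]) -> List[Triple]:
    """Describe one annotated UML class and its annotated attributes as triples."""
    subject = f"caarray:{cls}"
    triples: List[Triple] = [
        (subject, "rdf:type", "owl:Class"),
        (subject, ANNOTATED_WITH, f'"{ncit_term}"'),
    ]
    for attr, attr_term in attributes.items():
        attr_node = f"caarray:{cls}.{attr}"
        triples.append((subject, HAS_ATTRIBUTE, attr_node))
        triples.append((attr_node, ANNOTATED_WITH, f'"{attr_term}"'))
    return triples


def attribute_value(instance: str, cls: str, attr: str, value: str) -> List[Triple]:
    """Attach a literal value to one attribute of an instance via dm:datatype."""
    value_node = f"{instance}.{attr}"
    return [
        (instance, HAS_ATTRIBUTE, value_node),
        (value_node, "rdf:type", f"caarray:{cls}.{attr}"),
        (value_node, "dm:datatype", f'"{value}"'),
    ]


schema = class_to_triples(
    "Hybridization",
    "Nucleic Acid Hybridization",
    {
        # Placeholder annotations: the caption does not name the attribute terms.
        "name": "placeholder NCIt term for name",
        "amountOfMaterial": "placeholder NCIt term for amountOfMaterial",
    },
)
data = attribute_value("ex:hyb1", "Hybridization", "name", "sample hybridization")

for s, p, o in schema + data:
    print(f"{s} {p} {o} .")
```

Keeping the attribute as its own node, rather than a bare literal on the class, is what lets both the attribute and its value carry separate annotations in this sketch.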
Figure 5. caArray query paths. This graph is needed to extract all information about an experiment.