BioEMR: an integrative framework for cancer research with multiple genomic technologies.

Yu Rang Park, Yun Jung Bae, Ju Han Kim.

Abstract

The rapid development of omic technologies enables cancer researchers to apply multiple genomic technologies simultaneously. Because cancer biology is complex, tools for data integration are needed. To address the complexity of managing multiple technologies and dataset formats, several projects have been introduced, including the cancer Biomedical Informatics Grid (caGRID) and the Biomedical Research Integrated Domain Group (BRIDG), but their applicability has been limited. We introduce an object-oriented data model, the Cancer Genomics Object Model (CaGe-OM), for multiple genomics data, and Xperanto-CaGe, a web-based application that uses CaGe-OM with a hybrid object-relational mapping technique. The hybrid approach extends object-relational mapping with the Entity-Attribute-Value (EAV) model to accommodate dynamic structure. CaGe-OM and Xperanto-CaGe are an attempt to establish a comprehensive framework for integrated storage and interpretation of clinical and multiple genomics data and to facilitate model-level integration of newly emerging data types. A pilot implementation of the integrated clinical, histopathological and genomic information system is introduced.

Year: 2008 PMID: 21347128 PMCID: PMC3041523

Source DB: PubMed Journal: Summit Transl Bioinform ISSN: 2153-6430

Background

The emergence of a variety of high-throughput technologies has produced an overwhelming amount of heterogeneous genomic data in the quest to measure multiple parts of a biological system simultaneously (mRNA, proteins, metabolites, etc.) [1]. To manage and represent these genomics data, several technology-specific data models have been proposed, including MAGE-OM for transcriptomics [2], PEDRo for proteomics [3], SMAR [4], ArMET [5] and MIAMET [6] for metabolomics, and the Tissue MicroArray-Object Model (TMA-OM) [7] for tissue microarrays. Despite the increasing number of cancer studies that use multiple genomic technologies, there is no integrated data model covering both multiple functional genomics experiments and clinical data. Several initial efforts have addressed this problem. The applications provided by the National Cancer Institute (NCI) cancer Biomedical Informatics Grid (caGRID) and the Biomedical Research Integrated Domain Group (BRIDG) are not yet complete, and their large-scale architectures and the interdependencies between their modules can be prohibitively costly for a real-world application with limited purposes [8, 9]. The Chemical Effects in Biological Systems (CEBS) focuses on functional genomic data in the toxicology domain and does not include cancer genomics data [10]. We previously proposed the Cancer Genomics Object Model (CaGe-OM) for representing data from multiple omics technologies and the clinico-histopathological domain in cancer research [11], along with TMA-OM [7]. In the present study, we implemented a web-based application, Xperanto-CaGe, using a hybrid object-relational mapping technique, in an attempt to establish a comprehensive framework with a flexible design for integrated storage and interpretation of clinical and multiple genomics data types.

Result

Object model

To design CaGe-OM as an integrated data model for multiple functional genomics data in cancer research, we referenced four experimental data models (i.e., FuGe-OM, MAGE-OM, PEDRo and TMA-OM). To model clinical and histopathological data, we analyzed the cancer management workflow and referenced document models of clinical and histopathological information such as the College of American Pathologists (CAP) Cancer Protocols (CPs) and the National Cancer Institute (NCI) Common Data Elements (CDEs) [11]. CaGe-OM contains 183 classes grouped into 25 packages (Fig. 1). Most packages are categorized into three namespaces: CommonBioData, ClinicalData and TechnologySpecificData. The remaining six packages are reused from the corresponding MAGE-OM packages. CaGe-OM is expressed as a class diagram in the Unified Modeling Language (UML), a standard notation for representing and visualizing system design.

Fig. 1

The relationships of the 25 packages in CaGe-OM. Most packages in this model are categorized into three namespaces: CommonBioData (in yellow), ClinicalData (in pink) and TechnologySpecificData (in blue). Six packages (in gray) are adopted from MAGE-OM and serve general purposes.

Database design

The relational database schema of Xperanto-CaGe was derived from CaGe-OM using a hybrid object-relational mapping approach. There are three fundamental object-mapping rules [12]: (1) one table per entire class hierarchy, in which all the attributes of all the classes in the hierarchy are stored in a single table; (2) one table per concrete class, in which each table includes both the specific attributes of a class and any attributes it inherits; and (3) one table per class, including abstract superclasses, which supports polymorphism and represents each attribute in the class inheritance tree exactly once in a table. We chose the second rule because it provides efficient ad hoc reporting and does not waste space: all the attributes of any single class are stored in one table. We also applied the Entity-Attribute-Value (EAV) model to address the storage of sparse attributes, attribute heterogeneity and the need for flexibility in adding new attributes to a class. The EAV model, also called row modeling, stores each attribute value as a row, with the name of the attribute in another column. The EAV model has been widely implemented in management systems for heterogeneous data sets [13, 14]. The detailed mapping process is as follows. Each concrete class is mapped to a table that includes both its specific attributes and any attributes it inherits. Associations are mapped to one of the normal forms according to the multiplicity between classes. For instance, a one-or-more (1..*) multiplicity is represented in second normal form in the relational database. Classes in the ClinicalData namespace that have sparse and heterogeneous attributes are mapped onto a table based on the EAV model. Abstract classes are not captured as tables; their associations are passed on to their subclasses. Further information is available through the supplementary web site (http://www.snubi.org/software/cage_om/).
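The hybrid mapping described above can be illustrated with a minimal sketch, assuming SQLite as the database. The class, table and column names (BioSample, tissue_sample, cell_line_sample, clinical_eav) are hypothetical and are not taken from CaGe-OM or the Xperanto-CaGe schema. The sketch shows the one-table-per-concrete-class rule, where inherited attributes are repeated in each subclass table and the abstract superclass has no table, alongside a single EAV table for sparse clinical attributes.

```python
import sqlite3


def build_schema(conn):
    """Create illustrative tables: one per concrete class, plus one EAV table."""
    conn.executescript("""
        -- Hypothetical abstract superclass BioSample has no table of its own.
        -- Its attributes (id, name) are repeated in each concrete subclass table.
        CREATE TABLE tissue_sample (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,        -- inherited from BioSample
            tissue_type TEXT           -- specific to TissueSample
        );
        CREATE TABLE cell_line_sample (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,        -- inherited from BioSample
            passage_number INTEGER     -- specific to CellLineSample
        );
        -- EAV table: one row per (entity, attribute) pair, so sparse or
        -- newly defined attributes need no schema change.
        CREATE TABLE clinical_eav (
            entity_id INTEGER NOT NULL,
            attribute TEXT NOT NULL,
            value TEXT,
            PRIMARY KEY (entity_id, attribute)
        );
    """)


def add_clinical_value(conn, entity_id, attribute, value):
    """Store one attribute value as a row; values are kept as text."""
    conn.execute(
        "INSERT OR REPLACE INTO clinical_eav VALUES (?, ?, ?)",
        (entity_id, attribute, str(value)),
    )


def clinical_record(conn, entity_id):
    """Pivot the EAV rows of one entity back into an attribute dictionary."""
    rows = conn.execute(
        "SELECT attribute, value FROM clinical_eav "
        "WHERE entity_id = ? ORDER BY attribute",
        (entity_id,),
    ).fetchall()
    return dict(rows)
```

In this sketch, recording a previously unseen clinical attribute is only a call such as add_clinical_value(conn, 1, "er_status", "positive"), whereas adding an attribute to a concrete class table would require altering that table. The cost of the EAV side is that values lose their column types and must be pivoted back into records when read.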

Developing clinical pilot system: BioEMR

To evaluate the practical utility of integrated clinical, histopathological and high-throughput biological data in real clinical settings, we are establishing a pilot information system, named BioEMR, with Xperanto-CaGe in the Breast Cancer Center at Seoul National University Hospital (Fig. 2).

Fig. 2

BioEMR. Architecture of the pilot information system of integrated clinical, histo-pathological and genomic information.

Most of the clinical data from the legacy clinical information system can be represented in XML-based standards, including HL-7, LOINC, DICOM and CDA. The biological data standards for data from the genomics laboratory include BSML, MAGE-ML, MIAPE and TMA-OM. Both kinds of data are extracted as XML files and deposited in an integrated document repository (IDR) after layers of data processing. The IDR is supported by a clinical research and clinical trial knowledge database and is analyzed by a set of analytical modules. Currently, pilot application modules use this secondary information system, the IDR, in parallel with the primary real-time hospital legacy system.
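The deposit step described above can be sketched as follows. This is an illustration under stated assumptions, not the BioEMR implementation: the Document and DocumentRepository classes, the in-memory storage and the patientId attribute used to link documents are all hypothetical, and the XML in the usage note is a toy stand-in rather than valid CDA or MAGE-ML.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Document:
    patient_id: str
    source: str      # "clinical" or "genomic"
    standard: str    # label of the originating standard, e.g. "CDA"
    root: ET.Element


@dataclass
class DocumentRepository:
    """Hypothetical in-memory stand-in for an integrated document repository."""
    documents: list = field(default_factory=list)

    def deposit(self, xml_text, source, standard):
        """Parse one extracted XML file and store it with its provenance."""
        root = ET.fromstring(xml_text)
        # Assumption: every extracted document carries a patientId attribute
        # on its root element, which is used to link the two data sources.
        patient_id = root.get("patientId")
        if patient_id is None:
            raise ValueError("document has no patientId attribute")
        doc = Document(patient_id, source, standard, root)
        self.documents.append(doc)
        return doc

    def for_patient(self, patient_id):
        """Return all clinical and genomic documents of one patient."""
        return [d for d in self.documents if d.patient_id == patient_id]
```

For example, depositing '<ClinicalNote patientId="P1"/>' as a clinical document and '<Expression patientId="P1"/>' as a genomic document lets for_patient("P1") return both, which is the kind of joined view that analytical modules working on the repository would need.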

Integrating external resources

We use the MGED Ontology to describe common experimental procedures and array information. To describe TMA-specific and clinico-histopathological data, we use the controlled vocabulary defined in TMA-OM [7]. We also implemented an interface for adding new user-defined terms. Analyzing high-throughput functional genomic data requires integration with statistical analysis tools. Xperanto-CaGe is linked to a statistical analysis package, BioChip Analysis and Data Integration (BioCANDI), which pipelines genomic data analysis modules implemented in the R statistical language [15]. BioCANDI is composed of 15 normalization and 54 high-level analysis protocols. After statistical analysis with BioCANDI, genes with significant expression changes are presented with integrated annotation through the Genome Research Informatics Pipeline (GRIP) system. GRIP is an integrated annotation database of genes that includes genomics and proteomics as well as ontology and disease information [16]. Integrating data annotation and analysis systems helps cancer researchers discover biomarkers from multiple heterogeneous genomic datasets. Fig. 3 shows the overall structure of Xperanto-CaGe.

Fig. 3

System architecture of Xperanto-CaGe.

Conclusion

We developed Xperanto-CaGe, based on CaGe-OM, to represent and manage clinical and histopathological data as well as high-throughput biological experimental data covering most cancer types. Both were developed with extensibility to newly emerging data types in mind. CaGe-OM and Xperanto-CaGe are attempts to establish a comprehensive framework for integrated storage and analysis of clinical and multiple genomic data and to facilitate model-level integration of as yet unseen data types. The pilot system for breast cancer research (Fig. 2), built with CaGe-OM and Xperanto-CaGe, may serve as a test platform for future clinical genomics research and many translational bioinformatics applications.

14 in total

1. An approach to object-relational mapping in bioscience domains.

Authors: David Tuck; Ryan O'Connell; Peter Gershkovich; James Cowan
Journal: Proc AMIA Symp Date: 2002

2. Metabolomics Standards Workshop and the development of international standards for reporting metabolomics experimental results.

Authors: Arthur L Castle; Oliver Fiehn; Rima Kaddurah-Daouk; John C Lindon
Journal: Brief Bioinform Date: 2006-04-24 Impact factor: 11.622

3. Design aspects of a distributed clinical trials information system.

Authors: A Gouveia Oliveira; Nuno C Salgado
Journal: Clin Trials Date: 2006 Impact factor: 2.486

4. Chemical effects in biological systems (CEBS) object model for toxicology data, SysTox-OM: design and application.

Authors: Sandhya Xirasagar; Scott F Gustafson; Cheng-Cheng Huang; Qinyan Pan; Jennifer Fostel; Paul Boyer; B Alex Merrick; Kenneth B Tomer; Denny D Chan; Kenneth J Yost; Danielle Choi; Nianqing Xiao; Stanley Stasiewicz; Pierre Bushel; Michael D Waters
Journal: Bioinformatics Date: 2006-01-12 Impact factor: 6.937

5. The tissue microarray object model: a data model for storage, analysis, and exchange of tissue microarray experimental data.

Authors: Hye Won Lee; Yu Rang Park; Jaehyun Sim; Rae Woong Park; Woo Ho Kim; Ju Han Kim
Journal: Arch Pathol Lab Med Date: 2006-07 Impact factor: 5.534

6. caGrid: design and implementation of the core architecture of the cancer biomedical informatics grid.

Authors: Joel Saltz; Scott Oster; Shannon Hastings; Stephen Langella; Tahsin Kurc; William Sanchez; Manav Kher; Arumani Manisundaram; Krishnakant Shanbhag; Peter Covitz
Journal: Bioinformatics Date: 2006-06-09 Impact factor: 6.937

7. Cancer genomics object model: an object model for multiple functional genomics data for cancer research.

Authors: Yu Rang Park; Hye Won Lee; Sung Bum Cho; Ju Han Kim
Journal: Stud Health Technol Inform Date: 2007

8. Potential of metabolomics as a functional genomics tool.

Authors: Raoul J Bino; Robert D Hall; Oliver Fiehn; Joachim Kopka; Kazuki Saito; John Draper; Basil J Nikolau; Pedro Mendes; Ute Roessner-Tunali; Michael H Beale; Richard N Trethewey; B Markus Lange; Eve Syrkin Wurtele; Lloyd W Sumner
Journal: Trends Plant Sci Date: 2004-09 Impact factor: 18.313

9. A proposed framework for the description of plant metabolomics experiments and their results.

Authors: Helen Jenkins; Nigel Hardy; Manfred Beckmann; John Draper; Aileen R Smith; Janet Taylor; Oliver Fiehn; Royston Goodacre; Raoul J Bino; Robert Hall; Joachim Kopka; Geoffrey A Lane; B Markus Lange; Jang R Liu; Pedro Mendes; Basil J Nikolau; Stephen G Oliver; Norman W Paton; Sue Rhee; Ute Roessner-Tunali; Kazuki Saito; Jørn Smedsgaard; Lloyd W Sumner; Trevor Wang; Sean Walsh; Eve Syrkin Wurtele; Douglas B Kell
Journal: Nat Biotechnol Date: 2004-12 Impact factor: 54.908

10. Design and implementation of microarray gene expression markup language (MAGE-ML).

Authors: Paul T Spellman; Michael Miller; Jason Stewart; Charles Troup; Ugis Sarkans; Steve Chervitz; Derek Bernhart; Gavin Sherlock; Catherine Ball; Marc Lepage; Marcin Swiatek; W L Marks; Jason Goncalves; Scott Markel; Daniel Iordan; Mohammadreza Shojatalab; Angel Pizarro; Joe White; Robert Hubley; Eric Deutsch; Martin Senger; Bruce J Aronow; Alan Robinson; Doug Bassett; Christian J Stoeckert; Alvis Brazma
Journal: Genome Biol Date: 2002-08-23 Impact factor: 13.583
