Vadim Y Bichutskiy, Richard Colman, Rainer K Brachmann, Richard H Lathrop.
Abstract
Complex problems in life science research give rise to multidisciplinary collaboration, and hence, to the need for heterogeneous database integration. The tumor suppressor p53 is mutated in close to 50% of human cancers, and a small drug-like molecule with the ability to restore native function to cancerous p53 mutants is a long-held medical goal of cancer treatment. The Cancer Research DataBase (CRDB) was designed in support of a project to find such small molecules. As a cancer informatics project, the CRDB involved small molecule data, computational docking results, functional assays, and protein structure data. As an example of the hybrid strategy for data integration, it combined the mediation and data warehousing approaches. This paper uses the CRDB to illustrate the hybrid strategy as a viable approach to heterogeneous data integration in biomedicine, and provides a design method for those considering similar systems. More efficient data sharing implies increased productivity, and, hopefully, improved chances of success in cancer research. (Code and database schemas are freely downloadable at http://www.igb.uci.edu/research/research.html.)
Keywords: Cancer; Data Warehousing; Heterogeneous Database Integration; Hybrid Database Integration; Mediation; p53
Year: 2007 PMID: 19458771 PMCID: PMC2675489
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
CRDB design principles.
| Mediation approach should be used for data that changes often. Data warehousing approach should be used for data that changes periodically. |
| Mediation approach should be used for larger data. Data warehousing approach should be used for smaller data. |
| Mediation approach should be used for sources that are always available. Data warehousing approach should be used for sources that are often unavailable. |
| Mediation approach should be used for data with flexible timing constraints. Data warehousing approach should be used for data with stringent timing constraints. |
| Predictable queries on data that does not change often should be written in advance with the results stored in the warehouse. In addition, knowledge of the queries to be performed on the system should be taken into consideration during global schema design. |
CRDB design method. A. Data sources were classified based on their characteristics. B. Each characteristic was converted to the data integration approach that best implements it. C. Each approach was assigned a numerical value with warehouse = 1 and mediation = 0, and a design score was calculated for each data source by summing across the characteristics.
A

| Data Source | Data Changes | Data Size | Source Availability | Timing Constraints | Query Predictability |
|---|---|---|---|---|---|
| Docking | often | large | sometimes | flexible | low |
| Small Molecules | often | large | sometimes | flexible | high |
| Functional Assays | periodically | small | sometimes | flexible | high |
| Structural Assays | periodically | small | sometimes | flexible | low |
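The three steps of the design method (classify, convert, score) can be sketched in a few lines of Python. This is an illustration only, not the authors' code. The characteristic-to-approach mapping is read off the design principles listed earlier, with two assumptions: a source that is only "sometimes" available is counted toward the warehouse, and "high" query predictability is counted toward the warehouse because predictable query results are meant to be stored there. The names `APPROACH`, `VALUE`, `SOURCES`, and `design_score` are hypothetical.

```python
# Step B: characteristic value -> integration approach.
# ASSUMPTION: "sometimes" is not covered by the design principles; it is
# treated here as favouring the warehouse. The "predictability" mapping is
# likewise our reading of the last principle, not a stated rule.
APPROACH = {
    "changes": {"often": "mediation", "periodically": "warehouse"},
    "size": {"large": "mediation", "small": "warehouse"},
    "availability": {"always": "mediation", "sometimes": "warehouse"},
    "timing": {"flexible": "mediation", "stringent": "warehouse"},
    "predictability": {"low": "mediation", "high": "warehouse"},
}

# Step C: numerical value of each approach, as given in the caption.
VALUE = {"warehouse": 1, "mediation": 0}

# Step A: classification of each data source (panel A), in the same
# column order as the keys of APPROACH.
SOURCES = {
    "Docking": ("often", "large", "sometimes", "flexible", "low"),
    "Small Molecules": ("often", "large", "sometimes", "flexible", "high"),
    "Functional Assays": ("periodically", "small", "sometimes", "flexible", "high"),
    "Structural Assays": ("periodically", "small", "sometimes", "flexible", "low"),
}


def design_score(values):
    """Sum the approach values across all characteristics of one source."""
    return sum(
        VALUE[APPROACH[characteristic][value]]
        for characteristic, value in zip(APPROACH, values)
    )


scores = {name: design_score(values) for name, values in SOURCES.items()}

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: {score}")
```

Under these assumptions the two assay sources score higher than docking and small molecules, so they lean toward the warehouse. The source does not state a cut-off score, so none is applied here.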
Figure 1. CRDB global database schema. The schema follows a design pattern with “Condition” tables (Molecules, Mutants, Experiments) related to “Results” tables (DockingResults and AssayResults). PK denotes a primary key.
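The "Condition"/"Results" pattern named in the Figure 1 caption can be illustrated with a small in-memory SQLite sketch. Only the five table names come from the caption. Every column name, the choice of which condition tables each results table references, and the placeholder rows are assumptions made for illustration; the actual CRDB schema is in the authors' downloadable files.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default

# Table names follow the caption; all columns and FK links are hypothetical.
conn.executescript("""
CREATE TABLE Molecules   (molecule_id   INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Mutants     (mutant_id     INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Experiments (experiment_id INTEGER PRIMARY KEY, description TEXT);

-- "Results" tables point back to the "Condition" rows they were obtained under.
CREATE TABLE DockingResults (
    docking_id  INTEGER PRIMARY KEY,
    molecule_id INTEGER NOT NULL REFERENCES Molecules(molecule_id),
    mutant_id   INTEGER NOT NULL REFERENCES Mutants(mutant_id),
    score       REAL
);
CREATE TABLE AssayResults (
    assay_id      INTEGER PRIMARY KEY,
    experiment_id INTEGER NOT NULL REFERENCES Experiments(experiment_id),
    mutant_id     INTEGER NOT NULL REFERENCES Mutants(mutant_id),
    value         REAL
);
""")

# Placeholder rows, not real data.
conn.execute("INSERT INTO Molecules VALUES (1, 'molecule-1')")
conn.execute("INSERT INTO Mutants VALUES (1, 'mutant-1')")
conn.execute("INSERT INTO DockingResults VALUES (1, 1, 1, -7.5)")

# A result is read together with the conditions that produced it.
rows = conn.execute("""
    SELECT m.name, p.name, d.score
    FROM DockingResults AS d
    JOIN Molecules AS m ON m.molecule_id = d.molecule_id
    JOIN Mutants   AS p ON p.mutant_id   = d.mutant_id
""").fetchall()
print(rows)
```

Keeping conditions and results in separate tables means a result row cannot be stored without the molecule, mutant, or experiment it refers to, which is what the foreign keys in this sketch enforce.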
Figure 2. System architecture and the hybrid strategy for data integration. Docking and small molecule data use the mediation approach, while the functional and structural assay data use the data warehousing approach. The CRDB is both a mediator and a data warehouse. “Mutants” and “Molecular” are data marts of the warehouse. The ODBC drivers are wrappers in the mediation approach. Dashed lines indicate integration planned in the future.