Tahsin Kurc, Daniel A Janies, Andrew D Johnson, Stephen Langella, Scott Oster, Shannon Hastings, Farhat Habib, Terry Camerlengo, David Ervin, Umit V Catalyurek, Joel H Saltz.
Abstract
Diverse data sets have become key building blocks of translational biomedical research. Data types captured and referenced by sophisticated research studies include high-throughput genomic and proteomic data, laboratory data, data from imagery, and outcome data. In this paper, the authors present the application of an XML-based data management system to support integration of data from disparate data sources and large data sets. This system facilitates management of XML schemas and on-demand creation and management of XML databases that conform to these schemas. They illustrate the use of this system in an application for genotype-phenotype correlation analyses. This application implements a method of phenotype-genotype correlation based on phylogenetic optimization of large data sets of mouse SNPs and phenotypic data. The application workflow requires the management and integration of genomic information and phenotypic data from external data repositories and from the results of phenotype-genotype correlation analyses. The implementation supports the process of carrying out a complex workflow that includes large-scale phylogenetic tree optimizations and application of Maddison's concentrated changes test to large phylogenetic tree data sets. The data management system also allows collaborators to share data in a uniform way and supports complex queries that target data sets.
Year: 2006 PMID: 16501185 PMCID: PMC1513665 DOI: 10.1197/jamia.M1848
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Figure 1. (Top) The data model for the output from POY, a program for phylogenetic tree optimization. The rectangles and arrows represent the element types and parent-child relationships. The contents of the square brackets denote the attributes associated with a node type. For example, a branch element type has attributes for the ancestor node, the descendant node, and the minimum and maximum length as computed by the POY program. (Bottom) An annotated example of a section of tabular POY output depicting inferred transformations in three characters on a branch of a phylogenetic tree. Character 1 is a phenotype that changed from state 1 in the ancestor (HTU6) to state 2 in the descendant (strain C57BL10J). This change could be from normal to elevated non–high-density lipoprotein cholesterol plasma levels. Character 2 is a single nucleotide polymorphism (SNP) in which a change could have occurred along this branch, but the inference is ambiguous due to missing data. Character 3 is an SNP in which a transition mutation occurs. This could be SNP rs3023213 as depicted in ▶.
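The branch element described in the Figure 1 caption can be sketched in code. The following is a minimal illustration only, assuming a hypothetical XML serialization: the element names (`tree`, `branch`, `change`), the attribute names (`minLength`, `maxLength`, `character`, `from`, `to`), and the length and state values are all placeholders, not the schema or data used by the authors. Only the node names HTU6 and C57BL10J and the state change 1 to 2 come from the caption.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Hypothetical XML for one branch; names and numeric values are illustrative
# assumptions, not the schema or output described in the paper.
SAMPLE = """
<tree>
  <branch ancestor="HTU6" descendant="C57BL10J" minLength="2" maxLength="3">
    <change character="1" from="1" to="2"/>
    <change character="3" from="C" to="T"/>
  </branch>
</tree>
"""


@dataclass
class Change:
    character: int   # index of the character (phenotype or SNP)
    from_state: str  # inferred state in the ancestor
    to_state: str    # inferred state in the descendant


@dataclass
class Branch:
    ancestor: str
    descendant: str
    min_length: int
    max_length: int
    changes: list = field(default_factory=list)


def parse_branches(xml_text):
    """Read every <branch> element and its <change> children into dataclasses."""
    root = ET.fromstring(xml_text.strip())
    branches = []
    for b in root.iter("branch"):
        changes = [
            Change(int(c.get("character")), c.get("from"), c.get("to"))
            for c in b.findall("change")
        ]
        branches.append(
            Branch(
                ancestor=b.get("ancestor"),
                descendant=b.get("descendant"),
                min_length=int(b.get("minLength")),
                max_length=int(b.get("maxLength")),
                changes=changes,
            )
        )
    return branches


branches = parse_branches(SAMPLE)
for br in branches:
    print(br.ancestor, "->", br.descendant, [c.character for c in br.changes])
```

Keeping the ancestor and descendant as attributes of the branch, rather than nesting nodes inside one another, mirrors the parent-child layout the caption describes and makes each branch independently queryable.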
Figure 2. Two views of the same phylogenetic tree of females of mouse strains displaying correlated changes of a phenotype and a genotype across 15 mouse strains. The right tree depicts phenotypic change in non–high-density lipoprotein (non-HDL) cholesterol plasma levels in female mice after six weeks of atherogenic diet. Black branches indicate strains (C57BL/6J and CAST/EiJ) with non-HDL levels greater than one standard deviation (sd) above the mean after treatment. Genotype observations for each strain for the SNP of interest (rs3023213; T or C) are indicated on the left tree. Boxes at the terminal branches of the trees indicate genotype or phenotype observations in databases for those strains. Concentrated changes test results for this phenotype–genotype correlation differ for females (p = 0.004) and males (p = 0.088) (not shown).
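The cutoff used in Figure 2 to mark a strain as elevated (a level more than one standard deviation above the mean) can be illustrated with a short sketch. The strain names and values below are invented placeholders, not measurements from the paper, and the use of the sample standard deviation is an assumption, since the caption does not say which estimator was applied.

```python
from statistics import mean, stdev


def strains_above_one_sd(levels):
    """Return names of strains whose level exceeds mean + 1 sd.

    `levels` maps strain name to a phenotype value. The sample standard
    deviation (n - 1 denominator) is an assumption made for this sketch.
    """
    values = list(levels.values())
    cutoff = mean(values) + stdev(values)
    return sorted(name for name, value in levels.items() if value > cutoff)


# Hypothetical placeholder values; not data from the study.
levels = {
    "strain_A": 50.0,
    "strain_B": 55.0,
    "strain_C": 60.0,
    "strain_D": 65.0,
    "strain_E": 150.0,
}

elevated = strains_above_one_sd(levels)
print(elevated)
```

Coding each strain as elevated or not elevated in this way yields the kind of two-state phenotype character that can then be mapped onto the branches of the tree.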
List of Data Sets Currently Managed by the Data Management and Integration System
| Data Set | Explanation |
|---|---|
| Mpd146, Paigen 2, GNF2 | Mouse phenome database |
| MGI_Coordinate.rpt | MGI sequence coordinates |
| Gene_association.mgi | Gene ontology (GO) annotations of mouse markers |
| Go_terms.mgi | GO terms and GO IDs |
| HMD_HGNC_Accession.rpt | Human and mouse orthology |
| HMD_HumanSequence.rpt | Human and mouse orthology with sequence information |
| HMD_OMIM.rpt | Human and mouse orthology with human OMIM IDs |
These data sets are obtained from the Jackson Laboratory [4, 50].
Figure 3. Data models for the HMD Human Sequence (left) and mpd146 (right) data sets.
Figure 4. A view of the Client Interface after a query has been executed for a single nucleotide polymorphism (rs3023213) correlated with higher non–high-density lipoprotein cholesterol levels (see also ▶). Within this block of genetic similarity between the query strains (C57BL/6J and CAST/EiJ), a candidate gene, NNMT (nicotinamide N-methyltransferase), for the trait was noted.
Figure 5. The execution time of the data set extraction and loading step for the gene association, mpd146, and coordinate data sets.
Figure 6. Query execution time as the query size is scaled.