Francisco Prosdocimi, Brandon Chisham, Enrico Pontelli, Julie D Thompson, Arlin Stoltzfus.
Abstract
Comparative analysis is used throughout biology. When entities under comparison (e.g. proteins, genomes, species) are related by descent, evolutionary theory provides a framework that, in principle, allows N-ary comparisons of entities, while controlling for non-independence due to relatedness. Powerful software tools exist for specialized applications of this approach, yet it remains under-utilized in the absence of a unifying informatics infrastructure. A key step in developing such an infrastructure is the definition of a formal ontology. The analysis of use cases and existing formalisms suggests that a significant component of evolutionary analysis involves a core problem of inferring a character history, relying on key concepts: "Operational Taxonomic Units" (OTUs), representing the entities to be compared; "character-state data", representing the observations compared among OTUs; "phylogenetic tree", representing the historical path of evolution among the entities; and "transitions", the inferred evolutionary changes in states of characters that account for observations. Using the Web Ontology Language (OWL), we have defined these and other fundamental concepts in a Comparative Data Analysis Ontology (CDAO). CDAO has been evaluated for its ability to represent token data sets and to support simple forms of reasoning. With further development, CDAO will provide a basis for tools (for semantic transformation, data retrieval, validation, integration, etc.) that make it easier for software developers and biomedical researchers to apply evolutionary methods of inference to diverse types of data, so as to integrate this powerful framework for reasoning into their research.
Keywords: character data; comparative method; evolution; ontology; phylogeny
Year: 2009 PMID: 19812726 PMCID: PMC2747124 DOI: 10.4137/ebo.s2320
Source DB: PubMed Journal: Evol Bioinform Online ISSN: 1176-9343 Impact factor: 1.625
Figure 1. Ontology development strategy. The strategy for development of CDAO was modified from that suggested by Stevens et al. [32]. We began by studying use cases. After deciding on a representation system, we conceptualized domain knowledge by identifying, defining, and classifying terms for key concepts and relations. These concepts and relations were formalized, and then subjected to evaluation as described in the text.
Some related artefacts from the domain of evolutionary analysis.
| Name | Type (language) | Coverage | Reference |
|---|---|---|---|
| NEXUS | File format (text) | Character data (various types), trees, assumptions, sets, notes | |
| NeXML | File format (XML Schema) | Character data (various types), trees, models, meta-data | |
| CHADO | DB schema (SQL) | Sequences, genotypes, phenotypes, phylogenies | |
| TreeBASE | DB schema (SQL) | Character data (various types), trees, meta-info on analyses | |
| MAO | Ontology (OBO) | Multiple alignments of DNA, RNA and protein sequences | bips.u-strasbg.fr/LBGI/mAo/mao.html |
| NCBI Taxonomy | DB schema (SQL) | Organismal classification using the Linnean system | |
| NCBI data model | Object model (ASN.1) | DNA, RNA and protein sequences, features, and alignments | |
| PATO | Ontology (OBO, OWL) | Phenotypic and trait ontology | |
| GO | Ontology (OBO, OWL) | Terms for molecular function, biological process, cellular location | |
| SO | Ontology (OBO, OWL) | Sequence features and attributes, similarity, gene models | |
| PRO | Ontology (OBO) | Protein entities, their structural parts, isoforms and modifications | pir.georgetown.edu/pro |
| PO Protein ontology | Ontology (OWL) | Protein attributes other than sequence | proteinontology.info/ |
| RNAO | Ontology (OBO) | RNA sequence, structure, motifs, alignments | roc.bgsu.edu/ |
Figure 2. Illustration of some key concepts in evolutionary analysis. These data on a hypothetical family of proteins may be used to illustrate various concepts that are familiar in the domain of comparative evolutionary analysis. Phylogenetic trees have tips that typically represent currently existing biological entities (here proteins) that are referred to as OTUs, and that are associated with character-state data. The tips of the tree are linked to their ancestors (internal nodes) by branches or edges. Aligned sites in a protein-coding sequence are a type of character with a coordinate system (1 … 10) and with discrete states comprising nucleotides (A, T, C, G) or an alignment gap (−). Individual characters can be combined to form a compound character, e.g. 3 consecutive base-pairs combined to represent a single codon. The cellular location represented by a Gene Ontology (GO) term is also a discrete character that can be analyzed using the comparative evolutionary approach. An example of a continuous character would be the response of the protein to a chemical inhibitor (here shown as an IC50 value in micromolar). ND indicates that the state of a character is unknown for a given OTU.
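The character-state concepts in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the OTU names, sequences, and IC50 values are invented, and the helper names (`column`, `codons`, `known_states`) are hypothetical and not part of CDAO.

```python
# Hypothetical character-state matrix: one row per OTU, one column per aligned site.
# All names, sequences, and values below are invented for illustration.
alignment = {
    "protA": "ATGGCA",
    "protB": "ATGGCC",
    "protC": "ATG---",  # '-' is an alignment gap
}

def column(matrix, site):
    """Return the state of one character (1-based site) for every OTU."""
    return {otu: seq[site - 1] for otu, seq in matrix.items()}

def codons(seq):
    """Combine each run of 3 consecutive sites into one compound character."""
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return [seq[i:i + 3] for i in range(0, usable, 3)]

# A continuous character; None plays the role of ND (state unknown).
ic50_micromolar = {"protA": 0.5, "protB": 12.0, "protC": None}

def known_states(character):
    """Drop OTUs whose state for this character is unknown."""
    return {otu: state for otu, state in character.items() if state is not None}
```

Here a discrete character is one column of the matrix, a compound character is a slice of three columns, and a continuous character is a separate mapping from OTU to a number.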
Figure 3. Some key concepts and relations formalized in CDAO. Domain-specific terms in CDAO represent either classes, shown by ovals and boxes, or properties (also called “relations”), shown by lines with arrows. The subsumption property “is_a” relates a class to its superclass (solid lines). Other properties are defined in CDAO and discussed in the text (dashed lines).
Figure 4. Annotation of rooted and unrooted evolutionary trees using CDAO concepts and relations. a) An example of a rooted tree showing how the concepts and relations defined in CDAO can be used to represent the topology of the tree and associated data. In particular, important evolutionary concepts, such as the Most Recent Common Ancestor (MRCA), can be specified. In the case of a rooted tree, the edges (or branches) of the tree are directed and the relations has_parent_node and has_child_node are used. b) The representation of an unrooted tree using CDAO. Here, the direction of the edges is unknown and the relations has_Left_node and has_Right_node are used. Unrooted trees may contain subtrees for which the ancestor node is known, and in this case a rooted subtree can be specified using the has_Root relation.
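As a rough sketch of the rooted case, directed edges can be stored as child-to-parent links, and the MRCA of two nodes found by walking both paths toward the root. The tree and node names below are invented, the functions are hypothetical helpers, and plain Python dictionaries stand in for the CDAO relations; this is not how CDAO itself encodes a tree in OWL.

```python
# Hypothetical rooted tree stored as child -> parent links (a stand-in for the
# parent-node relation on directed edges). Node names are invented.
parent = {
    "A": "n1", "B": "n1",
    "C": "n2", "n1": "n2",
    "D": "root", "n2": "root",
}

def ancestors(node):
    """Return the path from a node up to the root, starting with the node itself."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def children(node):
    """Invert the parent links (a stand-in for the child-node relation)."""
    return sorted(child for child, p in parent.items() if p == node)

def mrca(a, b):
    """Return the first node on b's path to the root that also lies on a's path."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same rooted tree")
```

In this sketch a node counts as its own ancestor, so `mrca("A", "n1")` returns `"n1"`. An unrooted tree has no such direction, which is why the caption describes a separate pair of relations for the two ends of an edge.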
Figure 5. An example of instance data in the NEXUS format used commonly in phylogenetics.
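The figure itself is not reproduced here. For orientation only, the sketch below embeds a toy NEXUS document in a Python string and reads its character matrix. The taxa and sequences are invented and are not the data from the figure. The reader function `read_matrix` is a hypothetical helper that assumes a non-interleaved matrix with one unquoted taxon name per row and no bracketed comments.

```python
# A toy NEXUS document; the taxa and sequences are invented for illustration.
NEXUS_TEXT = """#NEXUS
BEGIN DATA;
  DIMENSIONS NTAX=3 NCHAR=6;
  FORMAT DATATYPE=DNA MISSING=? GAP=-;
  MATRIX
    protA ATGGCA
    protB ATGGCC
    protC ATG---
  ;
END;
"""

def read_matrix(text):
    """Collect 'name states' rows between the MATRIX keyword and the closing ';'."""
    rows, in_matrix = {}, False
    for line in text.splitlines():
        token = line.strip()
        if token.upper() == "MATRIX":
            in_matrix = True
        elif in_matrix and token.startswith(";"):
            break  # end of the matrix command
        elif in_matrix and token:
            name, states = token.split(None, 1)
            rows[name] = states.replace(" ", "")
    return rows
```

Real NEXUS files allow interleaved matrices, quoted names, and comments, so a production parser would need considerably more than this.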