Rutger A Vos1, James P Balhoff, Jason A Caravas, Mark T Holder, Hilmar Lapp, Wayne P Maddison, Peter E Midford, Anurag Priyam, Jeet Sukumaran, Xuhua Xia, Arlin Stoltzfus.
Abstract
In scientific research, integration and synthesis require a common understanding of where data come from, how much they can be trusted, and what they may be used for. To make such an understanding computer-accessible requires standards for exchanging richly annotated data. The challenges of conveying reusable data are particularly acute in regard to evolutionary comparative analysis, which comprises an ever-expanding list of data types, methods, research aims, and subdisciplines. To facilitate interoperability in evolutionary comparative analysis, we present NeXML, an XML standard (inspired by the current standard, NEXUS) that supports exchange of richly annotated comparative data. NeXML defines syntax for operational taxonomic units, character-state matrices, and phylogenetic trees and networks. Documents can be validated unambiguously. Importantly, any data element can be annotated, to an arbitrary degree of richness, using a system that is both flexible and rigorous. We describe how the use of NeXML by the TreeBASE and Phenoscape projects satisfies user needs that cannot be satisfied with other available file formats. By relying on XML Schema Definition, the design of NeXML facilitates the development and deployment of software for processing, transforming, and querying documents. The adoption of NeXML for practical use is facilitated by the availability of (1) an online manual with code samples and a reference to all defined elements and attributes, (2) programming toolkits in most of the languages used commonly in evolutionary informatics, and (3) input-output support in several widely used software applications. An active, open, community-based development process enables future revision and expansion of NeXML.
Year: 2012 PMID: 22357728 PMCID: PMC3376374 DOI: 10.1093/sysbio/sys025
Source DB: PubMed Journal: Syst Biol ISSN: 1063-5157 Impact factor: 15.683
Figure. Data modeling in evolutionary informatics. Nodes in phylogenetic trees (shown left) and character-state data (right) can be conceptualized as forming a nexus at the center of which are operational taxonomic units (OTUs). Under this model, any number of trees and character-state data sets (the latter themselves following an entity–attribute–value model) are represented as data that apply to OTUs, which in principle can also be decorated with additional metadata such as taxonomy database record identifiers. This conceptualization is implicit in the NEXUS format and applications that build on it such as Mesquite (Maddison and Maddison 2011) and has been reused in NeXML. (Figure modified from Hladish et al. 2007).
Figure. NeXML syntax example: TreeBASE OTU annotations. This example shows a single container of OTUs (the otus element) with a single OTU (the otu element) that was submitted to the database with the label Zenodorus cf. orbiculatus. Matching this label to the uBio web service returned a close match with the record for Zenodorus orbiculatus (with the namebank identifier 3546132), which uBio describes as matching the NCBI taxonomy record for Zenodorus cf. orbiculatus d008 (with taxon identifier 393215). The normalized OTU label was defined within the context of TreeBASE study S1787.
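
The figure's markup is not reproduced in this record, so the following is only a minimal sketch of how such an annotated OTU might be read with Python's standard library. The otus and otu element names, the label, and the three identifiers come from the caption. The meta element layout, the skos:closeMatch and rdfs:isDefinedBy predicates, and all example.org URLs are assumptions made for illustration, not the TreeBASE originals.

```python
import xml.etree.ElementTree as ET

# Namespace map for ElementTree lookups (NeXML namespace URI).
NEXML_NS = {"nex": "http://www.nexml.org/2009"}

# Hypothetical fragment: the hrefs are example.org placeholders and the
# rel predicates are assumptions, not copied from the original figure.
OTUS_FRAGMENT = """\
<otus xmlns="http://www.nexml.org/2009"
      xmlns:nex="http://www.nexml.org/2009"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      id="otus1">
  <otu id="otu1" label="Zenodorus cf. orbiculatus">
    <meta xsi:type="nex:ResourceMeta" rel="skos:closeMatch"
          href="http://example.org/ubio/namebank/3546132"/>
    <meta xsi:type="nex:ResourceMeta" rel="skos:closeMatch"
          href="http://example.org/ncbi/taxonomy/393215"/>
    <meta xsi:type="nex:ResourceMeta" rel="rdfs:isDefinedBy"
          href="http://example.org/treebase/study/S1787"/>
  </otu>
</otus>
"""


def otu_annotations(xml_text):
    """Map each OTU label to its list of (predicate, target) annotations."""
    otus = ET.fromstring(xml_text)  # root element is the otus container
    annotations = {}
    for otu in otus.findall("nex:otu", NEXML_NS):
        links = [
            (meta.get("rel"), meta.get("href"))
            for meta in otu.findall("nex:meta", NEXML_NS)
        ]
        annotations[otu.get("label")] = links
    return annotations
```

Calling `otu_annotations(OTUS_FRAGMENT)` returns a dictionary keyed by OTU label, which shows how annotations attach to the data element itself and need no separate side file.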
Figure. NeXML syntax example: Phenoscape character states. This code fragment shows how the Phenoscape project uses the NeXML-compatible application Phenex to annotate character states. A character, identified by “char01,” is defined as able to occupy any of the states from state set “states01.” Within that state set, in this instance, there is only the state “state0102.” That state is annotated with an EQ statement (here expressed in a Phenex-specific XML dialect) that identifies a morphological feature called the “antorbital” and qualifies it as being absent. (In a complete NeXML document, the format element occurs within a characters element, which is preceded by a container of OTUs, i.e., an otus element, here omitted for clarity.)
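
The relationship the caption describes, in which a character points to a state set that in turn lists its states, can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the identifiers char01, states01, and state0102 come from the caption, whereas the symbol and label attribute values are hypothetical, and the Phenex-specific EQ annotation is omitted because its dialect is not shown in this record.

```python
import xml.etree.ElementTree as ET

# Namespace map for ElementTree lookups (NeXML namespace URI).
NEXML_NS = {"nex": "http://www.nexml.org/2009"}

# Hypothetical fragment: ids follow the caption; symbol and label values
# are assumptions, and the EQ annotation on the state is left out.
FORMAT_FRAGMENT = """\
<format xmlns="http://www.nexml.org/2009">
  <states id="states01">
    <state id="state0102" symbol="1" label="absent"/>
  </states>
  <char id="char01" states="states01" label="antorbital"/>
</format>
"""


def states_by_character(xml_text):
    """Resolve each character id to the ids of the states it may occupy."""
    fmt = ET.fromstring(xml_text)  # root element is the format element
    # Index every state set by its id.
    state_sets = {
        states.get("id"): [s.get("id") for s in states.findall("nex:state", NEXML_NS)]
        for states in fmt.findall("nex:states", NEXML_NS)
    }
    # Follow each character's "states" reference into that index.
    return {
        char.get("id"): state_sets[char.get("states")]
        for char in fmt.findall("nex:char", NEXML_NS)
    }
```

Because the character refers to its state set by id, several characters can share one state set, and a lookup such as the one above resolves the reference in a single pass.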
| ASN.1: | Abstract Syntax Notation One (ASN.1) is an object representation language well suited to highly structured data (see |
| CDAO: | the Comparative Data Analysis Ontology ( |
| DNS: | Domain Name System, a hierarchical distributed naming system for resources connected to the Internet. DNS is used to translate human-readable names (e.g., |
| EvoInfo: | The Evolutionary Informatics Working Group supported by NESCent from 2006 to 2009 ( |
| GraphML: | a file format for graphs ( |
| GUID: | Globally Unique Identifier, an identifier, that is, a string of text, intended to identify one and only one object (e.g., a concept, a species, a publication). Different schemes have been devised for this, among which are LSIDs, DOIs, and HTTP URIs. A characteristic shared by a number of GUID schemes is that they are frequently a combination of a (sometimes DNS-based) “naming authority” part and a local identifier that is managed by the naming authority. |
| HTTP: | HyperText Transfer Protocol, the data transfer protocol used on the World Wide Web. HTTP can be used as a technology upon which GUID schemes can be built because it, in turn, builds on a scheme for uniquely identifying addresses (DNS) and because it defines a mechanism for resolving those addresses and returning content, such that information about an object that is identified using an HTTP-based GUID can be looked up. |
| JSON: | JavaScript Object Notation ( |
| LSID: | Life Science Identifier, a means to identify a piece of biological data using a URN scheme (see URI, below) composed of an authority, a namespace, an object identifier, and an optional version number. HTTP URIs serve the same function and are more widely used and supported. |
| MIAPA: | Minimum Information for a Phylogenetic Analysis, a draft proposal for a MIBBI (Minimum Information for Biological and Biomedical Investigations) standard, specifying the key information for authors to include in a phylogenetic record in order to facilitate the reuse of the phylogenetic data and validation of phylogenetic results. |
| NCBI: | the National Center for Biotechnology Information ( |
| OBO: | Open Biological and Biomedical Ontologies ( |
| OWL: | Web Ontology Language, a knowledge representation metalanguage for authoring the formal semantics of ontologies commonly serialized as RDF/XML. |
| RDF: | Resource Description Framework ( |
| RDFa: | Resource Description Framework in attributes, which extends XHTML and other XML formats to allow data described in RDF to be rendered into well-formed XML documents. RDFa therefore bridges RDF to the XML-based web and database world. |
| RDFS: | RDF Schema, a semantic extension of RDF that defines a set of classes and properties using the RDF language. These classes and properties provide basic elements for the description of RDF vocabularies or ontologies. |
| RDF/XML: | RDF serialized as XML. |
| SKOS: | Simple Knowledge Organization System ( |
| uBio: | the Universal Biological Indexer and Organizer, |
| URI: | Uniform Resource Identifier, which can take two forms, the uniform resource name (URN) and the uniform resource locator (URL). A digital object identifier (DOI) is an example of a URN; for example, a journal paper can have a URN as |
| W3C: | the World Wide Web Consortium, a standards body that publishes “recommendations” that formally describe technologies used on the World Wide Web, including, for our purposes, OWL, RDF, RDFa, RDFS, RDF/XML, SKOS, XHTML, XML, XPath, XQuery, XSD, and XSLT. |
| XHTML: | Extensible HyperText Markup Language, an XML-based, stricter version of HTML, the markup language in which pages on the World Wide Web are authored. |
| XML: | Extensible Markup Language (XML), a metalanguage consisting of a set of rules for encoding data in machine-readable form in user-defined, customized domain languages, of which NeXML is an example. |
| XPath: | the XML Path Language, a query language for selecting nodes from an XML document, which is represented as a hierarchical, multifurcating tree. It facilitates tree traversal by allowing specific nodes to be selected through a variety of criteria. It is used in XML parsers and other software that processes XML documents. |
| XQuery: | a query and functional programming language intended to integrate the web and the database seamlessly: when both are based on XML, they can be accessed and processed in the same way. XPath is a component of XQuery. |
| XSD: | XML Schema Definition, a language for describing the syntax and grammar of an XML-based domain language such as NeXML (see |
| XSLT: | Extensible Stylesheet Language Transformations, which can take an XML document and convert it either into another XML document or into a non-XML document containing either the same or a subset of the information in the original XML document. It does this by applying transformation templates to the patterns that XPath expressions select in a source XML document. For example, a mitochondrial genomic sequence stored in XML format in GenBank can be rendered by XSLT into another sequence format (e.g., FASTA or HTML for web display) or into another XML file containing a subset of the information (e.g., containing only coding sequences in the genome). |
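
The GUID and LSID entries in the glossary above describe identifiers as a combination of a naming authority and a locally managed part. A minimal sketch of that decomposition is given below. The function and class names are hypothetical, the LSID in the usage note is only an illustrative string built from the namebank identifier mentioned in the figure caption, and the example.org URI is a placeholder.

```python
from typing import NamedTuple, Optional, Tuple
from urllib.parse import urlparse


class Lsid(NamedTuple):
    """The four LSID parts named in the glossary entry."""
    authority: str
    namespace: str
    object_id: str
    version: Optional[str]


def parse_lsid(text: str) -> Lsid:
    """Split urn:lsid:<authority>:<namespace>:<object>[:<version>]."""
    parts = text.split(":")
    has_prefix = [p.lower() for p in parts[:2]] == ["urn", "lsid"]
    if not has_prefix or len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if not all(parts[2:5]):
        raise ValueError(f"empty LSID component in: {text!r}")
    version = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], version)


def split_http_guid(uri: str) -> Tuple[str, str]:
    """Return (DNS-based naming authority, local identifier) of an HTTP URI."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an HTTP URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")
```

For instance, `parse_lsid("urn:lsid:ubio.org:namebank:3546132")` yields the authority `ubio.org`, the namespace `namebank`, and the object identifier `3546132` with no version, while `split_http_guid("http://example.org/taxon/393215")` separates the host from the path. Only the HTTP form comes with a built-in resolution mechanism, which is the advantage the HTTP entry points out.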