Anne E Thessen, Ramona L Walls, Lars Vogt, Jessica Singer, Robert Warren, Pier Luigi Buttigieg, James P Balhoff, Christopher J Mungall, Deborah L McGuinness, Brian J Stucky, Matthew J Yoder, Melissa A Haendel.
Abstract
The rapidly decreasing cost of gene sequencing has resulted in a deluge of genomic data from across the tree of life; however, outside a few model organism databases, genomic data are limited in their scientific impact because they are not accompanied by computable phenomic data. The majority of phenomic data are contained in countless small, heterogeneous phenotypic data sets that are very difficult or impossible to integrate at scale because of variable formats, lack of digitization, and linguistic problems. One powerful solution is to represent phenotypic data using data models with precise, computable semantics, but adoption of semantic standards for representing phenotypic data has been slow, especially in biodiversity and ecology. Some phenotypic and trait data are available in a semantic language from knowledge bases, but these are often not interoperable. In this review, we compare and contrast existing ontology and data models, focusing on nonhuman phenotypes and traits. We discuss barriers to integration of phenotypic data and make recommendations for developing an operationally useful, semantically interoperable phenotypic data ecosystem.
Year: 2020 PMID: 33232313 PMCID: PMC7685442 DOI: 10.1371/journal.pcbi.1008376
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.779
Fig 1. Phenotypic data integration challenges.
(A) The many names for the mountain gorilla, Gorilla beringei, resulted from years of nomenclatural acts, misspellings, and the quirks of human language and popular culture. (B) The term “paramere” has been ambiguously used to describe 5 different parts of the male genitalia of a gasteruptiid wasp (red). (C) The end-of-season height of a wheat plant can be described by an exact measurement or relative to a “wild type.” (D) With the exception of microorganisms, measurements are collected from specimens but are sometimes represented as a single value representing an entire population or taxon. All 4 of these panels represent 1 or more challenges to phenotypic data integration. Image credit: Panel A by David J. Patterson, used with permission.
Semantic knowledge bases containing information about organism characteristics.
| Name | Description or Scope | Format | Pattern |
|---|---|---|---|
| Biodiversity | | | |
| Phenoscape | Vertebrate morphology | OWL in RDF Blazegraph triplestore | EQ |
| EOL TraitBank | Internet aggregator of data about species | Neo4j | Character/Character State |
| Microbial Phenotypes Wiki | Web-based community resource designed to display microbial phenotypes and the methods used to study them | MediaWiki | Tabular, uses OMP |
| PolyTraits | Database on biological traits of polychaetes | Relational database | Character/Character State |
| TRY | Global database of curated plant traits | Relational database | Map traits to TOP (EQ) |
| FuTRES | Functional traits of vertebrates | OWL in RDF triplestore | Measurement-Based quantitative data, trait definitions follow EQ pattern from OBAEQ |
| Planteome | Plant genomics and phenomics | GAF and SOLR | EQ and DOS-DP |
| Global Plant Phenology | Aggregator of plant phenological data | OWL and JSON | Measurement-Based quantitative and presence/absence data; EQ model |
| Semantic Morph·D·Base | Repository for morphological data | OWL in RDF triplestore | Measurement-Based with connection to TBox: Phenotype Knowledge Graphs |
| TaxonWorks | Web-based workbench for taxonomists and biodiversity scientists | PostgreSQL (relational database) | Class (OTU) or Measurement-Based (collection object). Qualitative, quantitative, statistical, media, gene, text, presence/absence, arbitrary triples (data attributes). |
| World Register of Marine Species | Authoritative classification and catalogue of marine species | MS SQL relational database with trait module | Character/Character State |
| Agriculture | | | |
| Gramene | Comparative functional genomics in crops and model plant species | MongoDB | JSON-like, using PO |
| Sol Genomics Network | Clade-oriented database dedicated to the biology of the Solanaceae family | Relational database (chado) | Tabular, dbxref to PO |
| GrainGenes | Comprehensive resource for molecular and phenotypic information for wheat, barley, rye, and other related species, including oat | Relational database (chado) | Tabular, using Plant TO |
| Annex | Cereals ontology | OWL | Measurement and Class-based |
| CassavaBase | Genomic and phenomic resource for cassava | Relational database (chado) | Tabular, uses CO |
| AgroLD | Integrated data about commercially important plants | RDF triples | EQ and DOS-DP |
| Biomedicine and Model Organisms | | | |
| Monarch Initiative, uPheno, and Human Phenotype Ontology | Integrator of cross species genotype-phenotype data including human phenotypes and their relationship to diseases | OWL | EQ and DOS-DP |
| MGI | Mouse genomic and phenomic resource | OWL and OBO | EQ and DOS-DP |
| WormBase | Nematode genomic and phenomic resource | OWL and OBO | EQ and DOS-DP |
| TAIR | | OWL and OBO | EQ and DOS-DP |
| FlyBase | Fruit fly genomic and phenomic resource | OWL and OBO | EQ and DOS-DP |
| XenBase | | OWL and OBO | EQ and DOS-DP |
| ZFIN | Zebrafish genomic and phenomic resource | OWL and OBO | EQ and DOS-DP |
| Saccharomyces Genome Database | Comprehensive integrated biological information for the budding yeast | PostgreSQL | Tabular, uses APO |
| RGD | Structured and standardized data for 8 species (rat, mouse, human, chinchilla, bonobo, 13-lined ground squirrel, dog, and pig) | Relational database (chado), GAF, and OBO | Qualitative, links QTLs to multiple OBO phenotype ontologies |
*To be included in this table, a resource must contain annotations linking traits to organisms, use a phenotype ontology, and not require login credentials.
†Includes phenotype data reported at the individual specimen level.
‡Includes phenotype data reported at the group level.
APO, Ascomycete Phenotype Ontology; CO, Crop Ontology; DOS-DP, Dead Simple Ontology Design Pattern; EQ, Entity–Quality; GAF, GO Annotation File format; OBAEQ, Ontology of Biological Attributes-Entity Quality; OBO, Open Biomedical Ontologies format; OMP, Ontology of Microbial Phenotypes; OWL, Web Ontology Language; OTU, Operational Taxonomic Units; PO, Plant Ontology; QTL, Quantitative Trait Locus; RDF, Resource Description Framework; RGD, Rat Genome Database; TO, Trait Ontology; TOP, Thesaurus of Plant Characteristics.
Fig 2. TBox versus ABox.
The TBox (A) includes classes (kinds of things), properties (the possible relationships between classes and instances of the classes), and assertions about the classes and properties. The ABox (B) represents instances of the classes represented in the TBox and assertions about those instances. For example, an instance of femur in a frog specimen is 1.2 cm long. Image credit: Photo from National Museum of Natural History, Washington DC.
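The TBox/ABox split described in this caption can be sketched in a few lines of plain Python. This is only an illustration, not how any of the knowledge bases in the table store their data: the class names, property names, and instance identifiers below are hypothetical, and treating a property's domain and range as a closed-world check is a simplification of OWL semantics.

```python
# TBox: classes and the properties that may relate them (the terminology).
# All names here are hypothetical placeholders, not real ontology terms.
TBOX = {
    "classes": {"Specimen", "Femur", "Length"},
    # property name -> (class of the subject, class of the object)
    "properties": {
        "has_part": ("Specimen", "Femur"),
        "has_quality": ("Femur", "Length"),
    },
}

# ABox: instances typed by TBox classes, plus assertions about those instances.
ABOX_TYPES = {
    "frog_specimen_1": "Specimen",
    "femur_1": "Femur",
    "length_1": "Length",
}
ABOX_ASSERTIONS = [
    ("frog_specimen_1", "has_part", "femur_1"),
    ("femur_1", "has_quality", "length_1"),
]
# Literal value attached to an instance: the femur is 1.2 cm long.
ABOX_VALUES = {"length_1": (1.2, "cm")}


def violations(tbox, types, assertions):
    """Return the ABox assertions whose subject/object types do not
    match the classes the TBox declares for that property."""
    bad = []
    for subj, prop, obj in assertions:
        expected = tbox["properties"].get(prop)
        actual = (types.get(subj), types.get(obj))
        if expected != actual:
            bad.append((subj, prop, obj))
    return bad


print(violations(TBOX, ABOX_TYPES, ABOX_ASSERTIONS))  # prints []
```

The point of the separation is that the same TBox can be reused to check or interpret many ABoxes, for example one per specimen collection.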
Fig 3. EQ Formalism for categorical phenotypes versus character states.
From [112]. The EQ Formalism uses ontology terms from an anatomy ontology (green) and a trait ontology (blue) to represent a phenotype and maps to the Character/Character State model (gray). EQ, Entity–Quality.
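As a rough illustration of the mapping in this figure, a character/character state pair can be translated into an Entity–Quality pair by looking up each half in the appropriate ontology. The sketch below assumes hypothetical lookup tables and placeholder identifiers (`ANATOMY:…`, `QUALITY:…`); they are not real ontology terms, and real curation involves more than a dictionary lookup.

```python
from typing import NamedTuple


class EQ(NamedTuple):
    entity: str   # identifier of a term from an anatomy ontology
    quality: str  # identifier of a term from a trait/quality ontology


# Hypothetical lookups; the identifiers are placeholders, not real term IDs.
CHARACTER_TO_ENTITY = {"dorsal fin shape": "ANATOMY:0000001"}
STATE_TO_QUALITY = {"round": "QUALITY:0000001", "pointed": "QUALITY:0000002"}


def character_state_to_eq(character: str, state: str) -> EQ:
    """Map a character and one of its states to an Entity-Quality pair."""
    return EQ(CHARACTER_TO_ENTITY[character], STATE_TO_QUALITY[state])


eq = character_state_to_eq("dorsal fin shape", "round")
print(eq.entity, eq.quality)
```

Because both halves resolve to ontology identifiers, two data sets that use different free-text labels for the same character state can end up with the same EQ pair.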
Fig 4. Darwin Core star schema with traits.
Phenotypes can be represented in the Darwin Core star schema that consists of separate tabular files (blue) linked together by unique identifiers for taxa, occurrences, and measurements (green).
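The linkage between the tabular files can be sketched with the standard library alone. The column names below (`taxonID`, `occurrenceID`, `measurementID`, `measurementType`, `measurementValue`, `measurementUnit`, `scientificName`) are Darwin Core terms, but the file layout is simplified and every row value is an invented placeholder, not real data.

```python
import csv
import io

# Three small tables standing in for separate files; all values are made up.
TAXA = """taxonID,scientificName
t1,Gorilla beringei
"""
OCCURRENCES = """occurrenceID,taxonID
o1,t1
o2,t1
"""
MEASUREMENTS = """measurementID,occurrenceID,measurementType,measurementValue,measurementUnit
m1,o1,body mass,150,kg
m2,o2,body mass,160,kg
"""


def read(text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def join_star(taxa, occurrences, measurements):
    """Attach occurrence and taxon fields to each measurement row
    by following the occurrenceID and taxonID identifiers."""
    taxon_by_id = {t["taxonID"]: t for t in taxa}
    occ_by_id = {o["occurrenceID"]: o for o in occurrences}
    rows = []
    for m in measurements:
        occ = occ_by_id[m["occurrenceID"]]
        taxon = taxon_by_id[occ["taxonID"]]
        rows.append({**taxon, **occ, **m})
    return rows


for row in join_star(read(TAXA), read(OCCURRENCES), read(MEASUREMENTS)):
    print(row["scientificName"], row["measurementType"],
          row["measurementValue"], row["measurementUnit"])
```

Keeping measurements in their own table means a single occurrence can carry any number of trait values without changing the occurrence file's columns.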
Fig 5. Measurement-Based phenotype data models.
(A) Semantic Morph·D·Base. Pink-bordered boxes: instances; yellow-bordered boxes: classes; gray-bordered boxes: literals (labels or values); boxes with dashed borders: named graphs.
(B) TaxonWorks. The underlying goal is to let scientists assert phenotype observations as required for their research. Assertions are persisted in Descriptor–Observation format where subclasses of descriptor (e.g., qualitative, quantitative, statistical, gene, free-text, and media) classify/define observations. Descriptor types anticipate downstream serialization into computable formats, semantic or otherwise. Phenotype assertions are at the class (= Taxon concept, an “OTU” in TaxonWorks) or instance (= Collection object) level (“Entity”). Ultimately, both levels will permit anatomical part assertions. While the approach includes improvements to the overall semantics, it still lacks specifics used in other models (e.g., Fig 5A and 5C); however, the typed descriptor approach provides a flexible software design, whereby incremental improvements to semantics are possible. All data are highly annotatable. Dashed boxes are features in progress.
(C) Global Plant Phenological Database. Rounded rectangles represent classes, and hexagons represent instances. The original data set (bottom of figure) indicates that there is an instance of the class/phenophase “open flower presence,” which is a quality of an instance of “whole plant” from the PO. Because the value of the instance of measurement datum is >0, the ontology infers that open flowers are present. Due to the subsumption hierarchy of the PO (left side of figure), the ontology can also infer that nonsenesced flowers, flowers, and plant structures are present.
IAO, Information Artifact Ontology; PATO, Phenotype and Trait Ontology; PO, Plant Ontology; OBI, Ontology for Biomedical Investigations; OTU, Operational Taxonomic Unit; RDF, Resource Description Framework; RO, Relations Ontology; UO, Unit Ontology.
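The inference described for panel C (a count above zero implies presence, and presence propagates up the subsumption hierarchy) can be mimicked with a small parent map. This is a minimal sketch, not the reasoner the database uses: the single-parent chain below is an assumption that simplifies the PO hierarchy, and the function name is hypothetical.

```python
# Simplified, single-parent subsumption chain (child -> parent).
# The real Plant Ontology hierarchy may contain additional classes.
SUBCLASS_OF = {
    "open flower": "nonsenesced flower",
    "nonsenesced flower": "flower",
    "flower": "plant structure",
}


def inferred_present(structure, count):
    """Return every structure inferred to be present, most specific first.

    A measurement value greater than zero means the structure is present,
    and so is every class that subsumes it.
    """
    if count <= 0:
        return []
    present = [structure]
    while present[-1] in SUBCLASS_OF:
        present.append(SUBCLASS_OF[present[-1]])
    return present


print(inferred_present("open flower", 3))
# prints ['open flower', 'nonsenesced flower', 'flower', 'plant structure']
```

A value of zero yields an empty list here, which only means that nothing is inferred present; it does not assert absence of the more general structures.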