Lisa Harper1, Jacqueline Campbell2, Ethalinda K S Cannon1,2, Sook Jung3, Monica Poelchau4, Ramona Walls5, Carson Andorf1,2, Elizabeth Arnaud6, Tanya Z Berardini7, Clayton Birkett8, Steve Cannon1, James Carson9, Bradford Condon10, Laurel Cooper11, Nathan Dunn12, Christine G Elsik13, Andrew Farmer14, Stephen P Ficklin3, David Grant1, Emily Grau14, Nic Herndon15, Zhi-Liang Hu16, Jodi Humann3, Pankaj Jaiswal11, Clement Jonquet17, Marie-Angélique Laporte6, Pierre Larmande18, Gerard Lazo19, Fiona McCarthy20, Naama Menda21, Christopher J Mungall22, Monica C Munoz-Torres22, Sushma Naithani11, Rex Nelson1, Daureen Nesdill23, Carissa Park16, James Reecy16, Leonore Reiser7, Lacey-Anne Sanderson24, Taner Z Sen19, Margaret Staton10, Sabarinath Subramaniam7, Marcela Karey Tello-Ruiz25, Victor Unda3, Deepak Unni12, Liya Wang25, Doreen Ware8,25, Jill Wegrzyn15, Jason Williams26, Margaret Woodhouse27, Jing Yu3, Doreen Main3.
Abstract
The future of agricultural research depends on data. The sheer volume of agricultural biological data being produced today makes excellent data management essential. Governmental agencies, publishers and science funders require data management plans for publicly funded research. Furthermore, the value of data increases exponentially when they are properly stored, described, integrated and shared, so that they can be easily utilized in future analyses. AgBioData (https://www.agbiodata.org) is a consortium of people working at agricultural biological databases, data archives and knowledgebases who strive to identify common issues in database development, curation and management, with the goal of creating database products that are more Findable, Accessible, Interoperable and Reusable. We strive to promote authentic, detailed, accurate and explicit communication between all parties involved in scientific data. As a step toward this goal, we present the current state of biocuration, ontologies, metadata and persistence, database platforms, programmatic (machine) access to data, communication and sustainability with regard to data curation. Each section describes challenges and opportunities for these topics, along with recommendations and best practices.
Year: 2018 PMID: 30239679 PMCID: PMC6146126 DOI: 10.1093/database/bay088
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Survey results for ontology use in databases for each data type (from 29 respondents)
| Ontology | Sequence | Marker | QTL | Germplasm | Phenotype | Genotype |
|---|---|---|---|---|---|---|
| GO | 17 | 1 | 2 | 1 | ||
| SO | 10 | 1 | 2 | |||
| PO | 4 | 2 | 2 | 2 | 1 | |
| Trait ontologies: TO/VT/LPT | 1 | 1 | 7 | 2 | 3 | |
| CO | 1 | 3 | ||||
| Other ref ontology | 1 (MI) | 1 (LBO, CMO) | 1 | 1 PATO | ||
| In-house | 1 | 3 |
List of ontologies, CVs and thesaurus of interest for AgBioData member databases
| Name | Domain | ID space | URL |
|---|---|---|---|
| Amphibian Gross Anatomy Ontology | Amphibian anatomy | AAO | |
| Agronomy Ontology | agronomy trials | AGRO | |
| AGROVOC | a controlled vocabulary covering all areas of interest of the Food and Agriculture Organization | AGROVOC | |
| Animal Trait Ontology for Livestock | phenotypic traits of farm animals | ATOL | |
| CAB Thesaurus | bibliographic databases of CABI (Centre for Agriculture and Biosciences International) | CABT | |
| Cephalopod Ontology | cephalopod anatomy and development | CEPH | |
| Chemical Entities of Biological Interest | molecular entities | CHEBI | |
| Cell Ontology | Metazoan (not plant) cell types | CL | |
| Clinical Measurement Ontology | morphological and physiological measurement records generated from clinical and model organism research and health programs | CMO | |
| Crop Ontology | a collection of vocabularies that describe breeders’ traits for agriculturally important plants: banana, barley, beets, Brachiaria, brassica, cassava, castor bean, chickpea, common bean, cowpea, grapes, groundnut, lentil, maize, mung bean, pearl millet, pigeon pea, potato, rice, sorghum, soybean, sugar kelp, sweet potato, wheat, woody plant and yam | CO | |
| Drosophila Phenotype Ontology | Drosophila phenotypes | DPO | |
| Evidence and Conclusion Ontology | types of scientific evidence | ECO | |
| Experimental Factor Ontology | anatomy, disease and chemical compounds | EFO | |
| Environment Ontology | biomes, environmental features and environmental materials | ENVO | |
| Feature Annotation Location Description Ontology | a simple ontology to describe sequence feature positions and regions as found in GFF3, DDBJ, EMBL, GenBank files, UniProt and many other bioinformatics resources | FALDO | |
| Drosophila Gross Anatomy Ontology | Drosophila melanogaster anatomy | FB-BT | |
| Flora Phenotype Ontology | traits and phenotypes of flowering plants occurring in digitized floras | FLOPO | |
| Gene Ontology | gene function, biological processes and cellular components | GO | |
| Hymenoptera Anatomy Ontology | anatomy of Hymenoptera | HAO | |
| Infectious Disease Ontology | infectious diseases | IDO | |
| Dengue fever | disease ontology for Dengue fever | IDODEN | |
| Malaria | disease ontology for malaria | IDOMAL | |
| Livestock Breed Ontology | buffalo, cattle, chicken, goat, horse, pig and sheep breeds | LBO | |
| Livestock Product trait Ontology | traits of products from agricultural animals or birds | LPT | |
| Mammalian Feeding Muscle Ontology | an anatomy ontology for the muscles of the head and neck that participate in feeding, swallowing and other oral-pharyngeal behaviors | MFMO | |
| Molecular interactions | protein–protein interactions | MI | |
| Mosquito Insecticide Resistance | mosquito insecticide resistance | MIRO | |
| MONDO Disease Ontology | diseases (currently mostly human but also animal diseases) | MONDO | |
| Mammalian phenotype | mammalian phenotypes | MP | |
| Mouse Pathology Ontology | mutant and transgenic mouse pathological lesions and processes | MPATH | |
| National Agricultural Library Thesaurus | vocabulary tools of agricultural terms | NALT | |
| Neuro Behavior Ontology | behavior terms | NBO | |
| Ontology of Arthropod Circulatory Systems | arthropod circulatory system | OARCS | |
| Ontology of Biological Attributes | traits (all species) | OBA | |
| Ontology of Host-Microbe interactions | host–microbe interactions | OHMI | |
| Ontology of Microbial Phenotypes | microbial phenotypes | OMP | |
| Ontology for Parasite Lifecycle | parasite life cycle stages | OPL | |
| Phenotype and Trait Ontology | phenotypic qualities | PATO | |
| Population and Community Ontology | populations and communities | PCO | |
| Plant Experimental Conditions Ontology | plant treatments, growing conditions and/or study types | PECO | |
| Plant Ontology | plant anatomy and growth stages | PO | |
| Protein Ontology | protein-related entities | PR | |
| Social Insect Behavior Ontology | chemical, anatomy and behavior of social insects | SIBO | |
| Sequence Ontology | sequence types and features | SO | |
| SOY Ontology | soybean traits, growth and development | SOY | |
| Spider anatomy and behavior ontology | spider anatomy, behavior and products | SPD | |
| Tick anatomy | Tick gross anatomy | TADS | |
| Mosquito anatomy | Mosquito gross anatomy | TGMA | |
| Plant Trait Ontology | plant traits | TO | |
| Tribolium Ontology | anatomy of the red flour beetle | TRON | |
| Teleost Taxonomy Ontology | Teleost phenotypes specifically for zebrafish | TTO | |
| Uberon multispecies anatomy ontology | animal anatomical structures | Uberon | |
| Variation Ontology | variations in DNA, RNA and/or protein | VARIO | |
| VectorBase controlled vocabulary | controlled vocabulary for vector biology | VBCV | |
| Vertebrate Trait Ontology | morphology, physiology or development of vertebrates | VT | |
| Xenopus anatomy and development ontology | anatomy and development of Xenopus sp. | XAO | |
| Zebrafish Anatomy and Development Ontology | Zebrafish anatomy and development | ZFA | |
| Zebrafish Developmental Stages | Zebrafish (Danio rerio) developmental stages | ZFS | |
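For many of the ontologies in the table above, the ID space appears as the prefix of each term identifier (for example `GO:0008150`). As a rough illustration of how a database might check incoming annotations against the ID spaces it expects, here is a minimal Python sketch. The `PREFIX:LOCAL` identifier shape, the function names and the chosen subset of ID spaces are assumptions made for this example, not part of the original survey; vocabularies such as AGROVOC or NALT may use different identifier schemes.

```python
import re

# Subset of the ID spaces listed in the table above; extend as needed.
KNOWN_ID_SPACES = {"GO", "SO", "PO", "TO", "PATO", "ENVO", "CHEBI"}

# Assumed identifier shape: "<ID space>:<local id>", e.g. "GO:0008150".
_TERM_ID = re.compile(r"^([A-Za-z][A-Za-z0-9_-]*):(\S+)$")


def split_term_id(term_id):
    """Split 'PREFIX:LOCAL' into (prefix, local id); raise ValueError otherwise."""
    match = _TERM_ID.match(term_id.strip())
    if match is None:
        raise ValueError(f"not a prefixed term identifier: {term_id!r}")
    return match.group(1), match.group(2)


def has_known_id_space(term_id, id_spaces=KNOWN_ID_SPACES):
    """Return True when the identifier's prefix is one of the expected ID spaces."""
    try:
        prefix, _ = split_term_id(term_id)
    except ValueError:
        return False
    return prefix.upper() in id_spaces
```

This only checks the prefix. Confirming that a term actually exists, or is not obsolete, would require looking it up in the ontology itself.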
List of tools for data curation with ontologies, annotation data exchange format and tools for ontology editing
| Use | Tool | Summary | Reference/URL |
|---|---|---|---|
| Data curation/annotation | Noctua | web-based tool for collaborative editing of models of biological processes | |
| | PubSearch | TAIR in-house literature curation tool | |
| | Protein2GO | EBI’s GO annotation tool | |
| | TOAST | community curation tool for GO and PO annotations | |
| | CANTO | web-based literature curation tool | |
| | Textpresso Central | web-based text mining and literature curation (with plug-ins for Noctua) | doi: 10.1186/s12859-018-2103-8 |
| | CACAO | community annotation tool used in undergraduate competitions | |
| | Table Editor | application for easily editing spreadsheet-formatted data with associated ontologies | |
| | PhenoteFX | phenotype curation | |
| Annotation data exchange formats | GAF2 | file format for ontology annotation data exchange | |
| | RDF | Resource Description Framework | |
| | Phenopackets | an extensible data model and data exchange format for phenotype data | |
| | BioLink model | schema for biological data and associations | |
| Ontology editors | Protégé | ontology editing tool | |
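To make the GAF2 entry in the table above more concrete, the following is a minimal Python sketch of reading annotation lines in that format. It assumes the GAF 2.x layout of tab-separated columns with comment lines starting with `!`, and it keeps only a handful of columns. The `GafAnnotation` type, the `parse_gaf` function and the example record are hypothetical and invented for illustration; consult the format specification before relying on the column positions.

```python
from typing import Iterable, Iterator, List, NamedTuple


class GafAnnotation(NamedTuple):
    db: str
    object_id: str
    symbol: str
    qualifier: str
    term_id: str
    references: List[str]
    evidence_code: str
    aspect: str
    taxon: str
    assigned_by: str


def parse_gaf(lines: Iterable[str]) -> Iterator[GafAnnotation]:
    """Yield one annotation per data line, skipping blanks and '!' comments."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("!"):
            continue
        cols = line.split("\t")
        # Columns 16 and 17 are optional, so require at least the first 15.
        if len(cols) < 15:
            raise ValueError(f"expected at least 15 columns, got {len(cols)}")
        yield GafAnnotation(
            db=cols[0],
            object_id=cols[1],
            symbol=cols[2],
            qualifier=cols[3],
            term_id=cols[4],
            references=cols[5].split("|"),  # multiple references are pipe-separated
            evidence_code=cols[6],
            aspect=cols[8],
            taxon=cols[12],
            assigned_by=cols[14],
        )


# Made-up record for illustration only; not taken from any real database.
EXAMPLE_LINES = [
    "!gaf-version: 2.0",
    "\t".join([
        "ExampleDB", "EX:0001", "geneA", "", "GO:0005634", "PMID:0000000",
        "IDA", "", "C", "", "", "gene", "taxon:3702", "20180101",
        "ExampleDB", "", "",
    ]),
]

annotations = list(parse_gaf(EXAMPLE_LINES))
```

Returning a typed record rather than a raw list of columns keeps downstream code readable and makes it harder to mix up column positions.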
Trait-related ontologies for describing data from livestock, arthropods and other animals. A key to the ontologies is available in Table 2
| Data type | Domain | Ontology (see Table 2) |
|---|---|---|
| Phenotype | cattle, sheep, goats, pig | ATOL, LBO, LPT, VT |
| | other animals | MP, VT, ATOL, ABO |
| Anatomy | cattle, sheep, goats, pig | UBERON, CARO |
| | other animals | XAO, ZFA, CARO, CEPH, MFMO, TTO, UBERON |
| | arthropods | DPO, FBBT, HAO, OARCS, SPD, TGMA, AAO, TRON |
| Growth and development | cattle, sheep, goats, pig | ATOL, VT |
| | other animals | ZFS, CEPH, ATOL |
| | arthropods | FDdv |
| Behavior | livestock/other animals | NBO, ATOL, VT |
| | arthropods | SIBO |
| Disease | growth and development related disease | IDO, OHMI, OPL |
| | other disease | MPATH, OMP, MONDO |
Data types and their support by each GGB platform. The following table rates support for each data type using the following scale: (x) No Support, (1) Schema Support Only, (2) Extension Module Support and (3) Core Interface Support. Core interface support implies the core application supports this data by providing loaders and front-end visualization. Extension module support implies this functionality has been added by some groups using the application and is now available in a sharable format (extension module or detailed tutorial). Schema Support Only implies the database schema can store this type of data but as of yet, no loaders or front-end visualizations are available.
| Data category | Data type | Chado & PostgreSQL | PostgreSQL | MySQL | MySQL | Chado & PostgreSQL | MySQL | GraphDb (neo4J) |
|---|---|---|---|---|---|---|---|---|
| | Assembly | 3 | 3 | x | 3 | 3 | 2 | x |
| | Gene annotation | 3 | 3 | x | 3 | 2 | 2 | x |
| | Gene–gene interactions | 1 | 3 | x | x | 2 | x | x |
| | Protein domains | 1 | 3 | x | 3 | 2 | x | x |
| | HTS Gene expression data | 2 | 2 | x | 3 | 3 | x | x |
| | Array-based gene expression data | 1 | 2 | x | x | x | x | x |
| | RNA-seq | 2 | 2 | x | 3 | 2 | x | x |
| | Genomic variation (copy #, translocations, SNPs) | 1 | 2 | x | 3 | 3 | 3 | x |
| | Genotypic data (alleles, polymorphisms) | 2 | 3 | 3 | 3 | 3 | 3 | x |
| | Phenotypic data | 2 | 2 | 3 | 3 | 3 | 3 | x |
| | QTL | 3 | 2 | x | x | 3 | 2 | x |
| | Mutants (fast neutron, transposon, Ethyl methanesulfonate (EMS)) | x | x | x | x | 3 | x | x |
| | Phylogeny | 3 | 2 | x | 3 | 2 | x | x |
| | Comparative analyses | 1 | 1 | x | 3 | 3 | x | x |
| | Germplasm Stocks | 3 | 3 | x | x | 3 | x | x |
| | Germplasm (Varieties, Landraces etc.) | 3 | 2 | 3 | x | 3 | 3 | x |
| | Germplasm pedigrees | 2 | 2 | 3 | x | 3 | 3 | x |
| | Genetic maps | 2 | 2 | 3 | x | 3 | 3 | x |
| | Field trial Data | 1 | 1 | 3 | x | 3 | 3 | x |
| | Pathways | x | 3 | x | 2 | 2 | 2 | 3 |
| | Images | 2 | 2 | 3 | x | 3 | x | x |
| | Ontology | 3 | 3 | x | 2 | 3 | 3 | x |
| | Ontology-based annotations | 3 | 3 | x | 3 | 3 | 3 | x |
Feature sets supported by each GGB platform. The following table rates support for a comprehensive list of features with the following scale: (x) no support, (1) extension module support and (2) supported by core. You should highlight which features are most important to your user group to ensure that you choose the platform best suited to your needs. Core interface support implies the core application supports this data by providing loaders and front-end visualization. Extension module support implies this functionality has been added by some groups using the application and is now available in a sharable format (extension module or detailed tutorial).
| Feature category | Feature type | Chado & PostgreSQL | PostgreSQL | MySQL | MySQL | Chado & PostgreSQL | MySQL | GraphDb (neo4J) |
|---|---|---|---|---|---|---|---|---|
| | Simple full site keyword search | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| | Question-based searches (set up by administrator) | 2 | 2 | 2 | x | x | 2 | x |
| | Simple advanced searches (multiple fixed filter criteria) | 2 | 2 | 2 | x | 2 | 2 | 2 |
| | Query-builder advanced search | x | 2 | 2 | x | 2 | 2 | 2 |
| | List support | 1 | 2 | 2 | x | 2 | 2 | x |
| | Genomic region search/overlap queries | 1 | 2 | 1 | 2 | 1 | 1 | x |
| | Basic browsing | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| | Data information pages (e.g. gene pages) | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| | Data type-specific summaries | 1 | 2 | 2 | 2 | 2 | 2 | 2 |
| | Forum | 1 | x | 1 | x | 2 | 1 | x |
| | Conference pages | 2 | 1 | 1 | x | x | x | x |
| | Community news | 2 | 2 | 2 | x | 2 | 2 | x |
| | Community curation | 2 | x | x | x | 2 | x | x |
| | Community information pages | 2 | 1 | 2 | x | 2 | x | x |
| | Web services | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| | Query across sites | 2 | 2 | 1 | x | 2 | 1 | 2 |
| | Reference sister sites | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| | Genome browser integration | 1 | 2 | 1 | 2 | 2 | 1 | x |
| | BLAST integration | 1 | 1 | 1 | 2 | 2 | 1 | x |
| | Access control (login system) | 2 | 2 | 2 | 2 | 2 | 2 | x |
| | Data analysis tools | 1 | 1 | 2 | 2 | 2 | 2 | 2 |
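The caption's advice to weigh the features that matter most to your users can be turned into a simple weighted score per platform column. The Python sketch below is one possible way to do this, not a method from the original article. It copies three rows from the table above and assumes that `1` means extension module support and `2` means core support. Platforms are referred to by column position because the platform names are not shown in the table, and the weights are hypothetical.

```python
# Numeric value assumed for each rating in the table above.
SUPPORT_SCORE = {"x": 0, "1": 1, "2": 2}

# Three rows copied from the feature table; one rating per platform column.
FEATURE_ROWS = {
    "Query-builder advanced search": ["x", "2", "2", "x", "2", "2", "2"],
    "Community curation": ["2", "x", "x", "x", "2", "x", "x"],
    "BLAST integration": ["1", "1", "1", "2", "2", "1", "x"],
}


def score_platforms(rows, weights):
    """Return one weighted total per platform column."""
    n_platforms = len(next(iter(rows.values())))
    totals = [0] * n_platforms
    for feature, weight in weights.items():
        for column, rating in enumerate(rows[feature]):
            totals[column] += weight * SUPPORT_SCORE[rating]
    return totals


# Hypothetical priorities for a user group that cares most about community curation.
weights = {
    "Query-builder advanced search": 1,
    "Community curation": 3,
    "BLAST integration": 2,
}

totals = score_platforms(FEATURE_ROWS, weights)
best_column = max(range(len(totals)), key=totals.__getitem__)  # zero-based index
```

A linear score like this is only a starting point. A feature that is an absolute requirement is better handled by first filtering out the columns rated `x` for it.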
Requirements for programmatic access to data in the genetics, genomics and breeding community
| Theme | Requirements |
|---|---|
| Discovery | 1. Web services for discovery of available resources; 2. A way to search data across many resources; 3. Good API documentation describing programmatic access |
| Data and metadata | 4. Common file formats; 5. Common classification systems (e.g. consistent use of same gene families and ontology terms); 6. Ability to access and combine data, retaining provenance and metadata such as species of origin that will be of interest in the aggregated context; 7. Machine readable metadata |
| Authentication | 8. Shared authentication protocols; 9. Authentication through use of keys |
| Data exchange/transfer | 10. Web services to extract data from any compatible database; 11. Services to deliver data to another database or end users; 12. Easy data transfer from NCBI (currently requires installation of a specialized tool); 13. Data provenance tracking; 14. Data usage tracking via web services; 15. Data management support for distributed data |
| Remote analyses | 16. Data staging (temporary storage) for analysis platforms; 17. Access to computing resources; 18. Request status polling (mechanisms to automatically report the status of an operation) |
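Requirement 18 (request status polling) can be illustrated with a small client-side loop. The Python sketch below assumes a remote analysis service that reports a job's state as a string. The state names (`"done"`, `"failed"`), the `poll_until_done` function and the injected `fetch_status` callable are all hypothetical. In practice `fetch_status` would wrap an HTTP request to whatever status endpoint a given platform provides.

```python
import time

# Hypothetical terminal states reported by the remote service.
TERMINAL_STATES = ("done", "failed")


def poll_until_done(fetch_status, interval=2.0, timeout=60.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() repeatedly until it returns a terminal state.

    Raises TimeoutError if no terminal state is seen within `timeout` seconds.
    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} seconds")
        sleep(interval)


# Usage with a stand-in for a real status request.
_states = iter(["queued", "running", "done"])
final_state = poll_until_done(lambda: next(_states), interval=0.0,
                              sleep=lambda seconds: None)
```

A fixed interval keeps the example short. A service with many clients might prefer exponential backoff or server-sent notifications to reduce load.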