Philippa C Griffin, Jyoti Khadake, Kate S LeMay, Suzanna E Lewis, Sandra Orchard, Andrew Pask, Bernard Pope, Ute Roessner, Keith Russell, Torsten Seemann, Andrew Treloar, Sonika Tyagi, Jeffrey H Christiansen, Saravanan Dayalan, Simon Gladman, Sandra B Hangartner, Helen L Hayden, William W H Ho, Gabriel Keeble-Gagnère, Pasi K Korhonen, Peter Neish, Priscilla R Prestes, Mark F Richardson, Nathan S Watson-Haigh, Kelly L Wyres, Neil D Young, Maria Victoria Schneider.
Abstract
Throughout history, the life sciences have been revolutionised by technological advances; in our era this is manifested by advances in instrumentation for data generation, and consequently researchers now routinely handle large amounts of heterogeneous data in digital formats. The simultaneous transitions towards biology as a data science and towards a 'life cycle' view of research data pose new challenges. Researchers face a bewildering landscape of data management requirements, recommendations and regulations, without necessarily being able to access data management training or possessing a clear understanding of practical approaches that can assist in data management in their particular research domain. Here we provide an overview of best practice data life cycle approaches for researchers in the life sciences/bioinformatics space with a particular focus on 'omics' datasets and computer-based data processing and analysis. We discuss the different stages of the data life cycle and provide practical suggestions for useful tools and resources to improve data management practices.
Keywords: bioinformatics; data management; data sharing; open science; reproducibility
Year: 2017 PMID: 30109017 PMCID: PMC6069748 DOI: 10.12688/f1000research.12344.2
Source DB: PubMed Journal: F1000Res ISSN: 2046-1402
Figure 1. The Data Life Cycle framework for bioscience, biomedical and bioinformatics data that is discussed throughout this article.
Black arrows indicate the ‘traditional’, linear view of research data; the green arrows show the steps necessary for data reusability. This framework is likely to be a simplified representation of any given research project, and in practice there would be numerous ‘feedback loops’ and revisiting of previous stages. In addition, the publishing stage can occur at several points in the data life cycle.
Overview of some representative databases, registries and other tools to find life science data.
A more complete list can be found at FAIRsharing.
| Database/ | Name | Description | Datatypes | URL |
|---|---|---|---|---|
| Database | Gene Ontology | Repository of functional roles of gene products, | Functional roles as determined experimentally or | |
| Database | Kyoto | Repository for pathway relationships of | Protein, gene, cell, and genome pathway | |
| Database | OrthoDB | Repository for gene ortholog information | Protein sequences and orthologous group | |
| Database | eggNOG | Repository for gene ortholog information with | Protein sequences, orthologous group | |
| Database | European | Repository for nucleotide sequence information | Raw next-generation sequencing data, genome | |
| Database | Sequence Read | Repository for nucleotide sequence information | Raw high-throughput DNA sequencing and | |
| Database | GenBank | Repository for nucleotide sequence information | Annotated DNA sequences | |
| Database | ArrayExpress | Repository for genomic expression data | RNA-seq, microarray, ChIP-seq, Bisulfite-seq and | |
| Database | Gene | Repository for genetic/genomic expression data | RNA-seq, microarray, real-time PCR data on | |
| Database | PRIDE | Repository for proteomics data | Protein and peptide identifications, post-translational | |
| Database | Protein Data | Repository for protein structure information | 3D structures of proteins, nucleic acids and | |
| Database | MetaboLights | Repository for metabolomics experiments and | Metabolite structures, reference spectra and | |
| Ontology/ | ChEBI | Ontology and repository for chemical entities | Small molecule structures and chemical | |
| Database | Taxonomy | Repository of taxonomic classification information | Taxonomic classification and nomenclature data | |
| Database | BioStudies | Repository for descriptions of biological studies, | Study descriptions and supplementary files | |
| Database | Biosamples | Repository for information about biological | Sample descriptions | |
| Database | IntAct | Repository for molecular interaction information | Molecular interactions and evidence type | |
| Database | UniProtKB | Repository for protein sequence and function | Protein sequences, protein function and | |
| Database | European | Controlled-access repository for sequence and | Raw, processed and/or analysed sequence and | |
| Database | EBI | Repository and analysis service for | Next-generation sequencing metagenomic | |
| Database | MG-RAST | Repository and analysis service for | Next-generation sequencing metagenomic and | |
| Registry | Omics DI | Registry for dataset discovery that currently | Genomic, transcriptomic, proteomic and | |
| Registry | DataMed | Registry for biomedical dataset discovery that | Genomic, transcriptomic, proteomic, | |
| Registry | Biosharing | Curated registry for biological databases, data | Information on databases, standards and | |
| Registry | re3data | Registry for research data repositories across | Information on research data repositories, terms | |
Useful ontology tools to assist in metadata collection.
| Tool | Task | URL |
|---|---|---|
| Ontology Lookup | Discover different ontologies and their contents | |
| OBO Foundry | Table of open biomedical ontologies with information | |
| Zooma | Assign ontology terms using curated mapping | |
| Webulous | Create new ontology terms easily | |
| Ontobee | A linked data server that facilitates ontology data | |
Overview of common standard data formats for ‘omics data.
A more complete list can be found at FAIRsharing.
| Data type | Format name | Description | Reference or URL for format specification | URLs for repositories |
|---|---|---|---|---|
| Raw DNA/RNA | FASTA | FASTA is a common text format to store DNA/RNA/Protein | | |
| Assembled | FASTA | Assemblies without annotation are generally stored in | | |
| Aligned DNA | SAM/BAM | Sequences aligned to a reference are represented in | | |
| Gene model or | GTF/GFF/ | General feature format or general transfer format are | | |
| Gene functional | GAF | A GAF file is a GO Annotation File containing annotations | | |
| Genetic/genomic | VCF | A tab-delimited text format to store meta-information as | | |
| Interaction data | PSI-MI XML | Data formats developed to exchange molecular interaction | | |
| Raw metabolite | mzML | XML based data formats that define mass spectrometry | | |
| Protein sequence | FASTA | A text-based format for representing nucleotide sequences | | |
| Raw proteome | mzML | A formally defined XML format for representing mass | | |
| Organisms and | Darwin Core | The Darwin Core (DwC) standard facilitates the exchange | | |
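
As an illustration of how simple the text-based formats in the table above are to read, the following is a minimal sketch of a FASTA reader in Python. It assumes the common convention that a header line starts with `>` and that the first whitespace-delimited token after it is the record identifier; the function name `parse_fasta` and the example records are hypothetical and chosen only for illustration. For real projects an established parser from a maintained bioinformatics library is likely to be a better choice.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return a mapping of record identifier -> concatenated sequence."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Assumption: the identifier is the first token after '>'.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped, so append each one.
            records[current_id] += line
    return records


# Hypothetical example input, not taken from any real dataset.
example = """>seq1 hypothetical example record
ACGTAC
GTTA
>seq2
MKV
"""
records = parse_fasta(example.splitlines())
print(records)
```

Sequence lines that appear before any header are ignored in this sketch; a stricter reader might raise an error instead.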
Some community-designed minimum information criteria for metadata specifications in life sciences.
A more complete list can be found at FAIRsharing.
| Name | Description | Examples of projects/databases that | URL |
|---|---|---|---|
| MINSEQE | Minimum Information about a high- | Developed by the Functional Genomics | |
| MIxS - MIGS/MIMS | Minimum Information about a | Developed by the Genomic Standards | |
| MIMARKS | Minimum Information about a | Developed by the Genomic Standards | |
| MIMIx | Minimum Information about a | Developed by the Proteomics Standards | |
| MIAPE | Minimum Information About a | Developed by the Proteomics Standards | |
| Metabolomics | Minimal reporting structures that | Developed by the Metabolomics | |
| MIRIAM | Minimal Information Required | Initiated by the BioModels.net effort. | |
| MIAPPE | Minimum Information About a Plant | Adopted by the Plant Phenomics and | |
| MDM | Minimal Data for Mapping for | Developed by the Global Microbial | |
| FAANG sample | Metadata specification for biological | Developed and used by the Functional | |
| FAANG experimental | Metadata specification for | Developed and used by the Functional | |
| FAANG analysis | Metadata specification for analysis | Developed and used by the Functional | |
| SNOMED-CT | Medical terminology and | Commercial but collaboratively-designed | |
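
Minimum information criteria such as those listed above are, in practice, checklists of fields that a metadata record should contain. The sketch below shows one way such a check might be automated in Python. The field names in `REQUIRED_FIELDS` are hypothetical placeholders and are not the actual fields of MINSEQE, MIxS or any other standard named in the table; the real checklist should be taken from the relevant specification.

```python
from typing import Dict, List, Set

# Hypothetical checklist for illustration only. These are NOT the fields
# required by any real minimum information standard.
REQUIRED_FIELDS: Set[str] = {
    "sample_id",
    "organism",
    "collection_date",
    "assay_type",
}


def missing_fields(
    record: Dict[str, str], required: Set[str] = REQUIRED_FIELDS
) -> List[str]:
    """Return the required fields that are absent or empty, sorted by name."""
    return sorted(f for f in required if not record.get(f, "").strip())


# Hypothetical sample record with one empty and one absent field.
record = {
    "sample_id": "S001",
    "organism": "Mus musculus",
    "collection_date": "",
}
print(missing_fields(record))
```

Running a check like this at the point of data collection, rather than at submission, may make it easier to fill gaps while the information is still available.
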
Figure 2. Flowchart of the data life cycle stages applied to an example research project.
Bold text indicates new data, software or workflow objects created during the project. Solid thin arrows indicate movement of objects from creation to storage and sharing. Dashed thin arrows indicate where downstream entities should influence decisions made at a given step. (For example, the choice of format, granularity, metadata content and structure of new data collected may be influenced by existing software requirements, existing data characteristics and requirements of the archive where the data will be deposited.) Purple stars indicate objects for which the FAIR principles [9] can provide further guidance. Dotted thin arrows indicate citation of an object using its unique persistent identifier. Brown stars indicate where FAIRsharing can help identify appropriate archives for storing and sharing.
Identifiers throughout the data life cycle.
| Name | Relevant stage of | Description | URL |
|---|---|---|---|
| Digital Object Identifier (DOI) | Publishing, Sharing, | A unique identifier for a digital (or physical or | |
| Open Researcher and | Publishing | An identifier for a specific researcher that | |
| Repository accession | Finding, Processing/ | A unique identifier for a record within a | For example, |
| PubMed ID (PMID) | Publishing | An example of a repository-specific unique | |
| International Standard | Publishing | A unique identifier for a journal, magazine or | |
| International Standard Book | Publishing | A unique identifier for a book, specific to the | |
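
Several of the identifiers above can be checked or resolved programmatically. As a minimal sketch in Python, the code below validates an ISSN using the standard check-digit scheme (the first seven digits are weighted 8 down to 2, summed modulo 11, with `X` standing for a check value of 10) and turns a DOI into a resolvable link through the `https://doi.org/` resolver. The function names are hypothetical; the example values are the ISSN and DOI given in this record's metadata.

```python
def issn_check_digit(first_seven: str) -> str:
    """Compute the ISSN check character from the first seven digits."""
    # Weights run from 8 (first digit) down to 2 (seventh digit).
    total = sum(int(d) * w for d, w in zip(first_seven, range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def is_valid_issn(issn: str) -> bool:
    """Return True if the ISSN has eight characters and a correct check digit."""
    compact = issn.replace("-", "").upper()
    body = compact[:7]
    if len(compact) != 8 or not (body.isascii() and body.isdigit()):
        return False
    return issn_check_digit(body) == compact[7]


def doi_to_url(doi: str) -> str:
    """Build a resolvable link for a DOI via the doi.org resolver."""
    return f"https://doi.org/{doi}"


# Values taken from this record's own metadata.
print(is_valid_issn("2046-1402"))
print(doi_to_url("10.12688/f1000research.12344.2"))
```

A check like this only confirms that an identifier is well formed; it does not confirm that the identifier is registered or points to the intended object.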