Petra Ten Hoopen, Robert D Finn, Lars Ailo Bongo, Erwan Corre, Bruno Fosso, Folker Meyer, Alex Mitchell, Eric Pelletier, Graziano Pesole, Monica Santamaria, Nils Peder Willassen, Guy Cochrane.
Abstract
Metagenomics data analyses from independent studies can only be compared if the analysis workflows are described in a harmonized way. In this overview, we have mapped the landscape of data standards available for the description of essential steps in metagenomics: (i) material sampling, (ii) material sequencing, (iii) data analysis, and (iv) data archiving and publishing. Taking examples from marine research, we summarize essential variables used to describe material sampling processes and sequencing procedures in a metagenomics experiment. These aspects of metagenomics dataset generation have been to some extent addressed by the scientific community, but greater awareness and adoption are still needed. We emphasize the lack of standards for reporting how metagenomics datasets are analysed and how the metagenomics data analysis outputs should be archived and published. We propose best practice as a foundation for a community standard to enable reproducibility and better sharing of metagenomics datasets, leading ultimately to greater metagenomics data reuse and repurposing.
Keywords: best practice; data analysis; metadata; metagenomics; sampling; sequencing; standard
Year: 2017 PMID: 28637310 PMCID: PMC5737865 DOI: 10.1093/gigascience/gix047
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1: A generalized metagenomics data analysis workflow in the context of other “omics” approaches.
Figure 2: A common data model for read data and associated metadata.
Checklist of MIMS mandatory descriptors for a sample taken from an aquatic environment and associated with a metagenomic sequencing experiment.
| MIMS-mandatory water sample provenance descriptors | Descriptor format |
|---|---|
| Submitted to INSDC | Boolean |
| Project name | Text |
| Investigation type | Fixed value: “metagenome” |
| Geographic location (latitude and longitude) | Decimal degrees in WGS84 system |
| Depth | Metres; positive below the sea surface |
| Geographic location (country and/or sea region) | INSDC country list |
| Collection date | ISO8601 date and time |
| Environment (biome) | ENVO class |
| Environment (feature) | ENVO class |
| Environment (material) | ENVO class |
| Environment package | MIxS controlled vocabulary |
ENVO: Environment Ontology.
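
The descriptor formats in the checklist above lend themselves to simple automated checks before submission. The following is a minimal sketch, not part of MIxS or any INSDC tool: the function name, the dictionary keys (`investigation_type`, `latitude`, `longitude`, `depth_m`, `collection_date`) and the rule that a water sample depth is non-negative are all assumptions made for illustration, and only a subset of the mandatory descriptors is covered.

```python
from datetime import datetime


def validate_mims_water_sample(record):
    """Return a list of problems found in a MIMS-style sample record.

    The key names are hypothetical; they are not defined by MIxS.
    """
    problems = []

    # Investigation type is a fixed value in the checklist.
    if record.get("investigation_type") != "metagenome":
        problems.append("investigation_type must be 'metagenome'")

    # Latitude and longitude are decimal degrees (WGS84).
    lat = record.get("latitude")
    lon = record.get("longitude")
    if not isinstance(lat, (int, float)) or not -90 <= lat <= 90:
        problems.append("latitude must be decimal degrees in [-90, 90]")
    if not isinstance(lon, (int, float)) or not -180 <= lon <= 180:
        problems.append("longitude must be decimal degrees in [-180, 180]")

    # Depth is in metres, positive below the sea surface.
    # Assumption: a water sample is taken at or below the surface.
    depth = record.get("depth_m")
    if not isinstance(depth, (int, float)) or depth < 0:
        problems.append("depth_m must be a non-negative number of metres")

    # Collection date must parse as an ISO 8601 date or date-time.
    try:
        datetime.fromisoformat(str(record.get("collection_date")))
    except ValueError:
        problems.append("collection_date must be ISO 8601")

    return problems


sample = {
    "investigation_type": "metagenome",
    "latitude": 43.68,
    "longitude": 7.32,
    "depth_m": 5.0,
    "collection_date": "2013-06-21T10:30:00",
}
print(validate_mims_water_sample(sample))
```

Returning a list of messages rather than raising on the first failure lets a submitter see every problem in a record at once.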
Checklist of M2B3 mandatory descriptors for a microbial sample taken from a saline water environment and associated with a metagenomic sequencing experiment.
| M2B3-mandatory saline water sample provenance descriptors | Descriptor format |
|---|---|
| INVESTIGATION_campaign | Text |
| INVESTIGATION_site | Text |
| INVESTIGATION_platform | SDN: L06 controlled vocabulary |
| EVENT_latitude | Decimal degrees in WGS84 system |
| EVENT_longitude | Decimal degrees in WGS84 system |
| EVENT_date/time | ISO8601 date and time in UTC |
| SAMPLE_title | Text |
| SAMPLE_protocol label | Text |
| SAMPLE_depth | Metres; positive below the sea surface |
| ENVIRONMENT_environment (biome) | ENVO class |
| ENVIRONMENT_environment (feature) | ENVO class |
| ENVIRONMENT_environment (material) | ENVO class |
| ENVIRONMENT_temperature | SDN: P02 |
| ENVIRONMENT_salinity | SDN: P02, SDN: P06 controlled vocab. |
ENVO: Environment Ontology; SDN: SeaDataNet; UTC: coordinated universal time.
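
M2B3 asks for the event date and time in UTC, whereas field logs are often kept in local time. A small sketch of the conversion follows, assuming the recorded timestamp is an ISO 8601 string that carries an explicit UTC offset; the function name `to_utc_iso8601` is hypothetical.

```python
from datetime import datetime, timezone


def to_utc_iso8601(local_timestamp):
    """Convert an ISO 8601 timestamp with a UTC offset to UTC."""
    parsed = datetime.fromisoformat(local_timestamp)
    # Without an offset the instant is ambiguous, so refuse to guess.
    if parsed.tzinfo is None:
        raise ValueError("timestamp needs an explicit UTC offset")
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(to_utc_iso8601("2013-06-21T12:30:00+02:00"))  # 2013-06-21T10:30:00Z
```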
Selection of nonmandatory MIxS and M2B3 descriptors (column B) and formats (column D).
| A: group | B: nonmandatory sample provenance descriptors | C: standard | D: descriptor format | E: value for analysis (H/M/L) |
|---|---|---|---|---|
| 1 | Sample collection device or method | MIxS (MIMS) | Text | H |
| 1 | EVENT_device | M2B3 | Text | H |
| 1 | EVENT_method | M2B3 | Text | H |
| 2 | Sample material processing | MIxS (MIMS) | Text | H |
| 3 | Amount or size of sample collected | MIxS (MIMS) | Numeric & unit | H |
| 3 | SAMPLE_quantity (e.g., length, mass) | M2B3 | Text | H |
| 4 | Sample storage location | MIxS (water) | Text | L |
| 4 | SAMPLE_container (e.g., storage container) | M2B3 | Text | L |
| 5 | Sample storage duration | MIxS (water) | Interval | H |
| 6 | Sample storage temperature | MIxS (water) | Numeric & unit | H |
| 6 | SAMPLE_treatment_storage (e.g., temperature) | M2B3 | Text | H |
| 7 | Chemical administration | MIxS (water) | CHEBI ontology | M |
| 7 | SAMPLE_treatment_chemicals | M2B3 | CHEBI ontology | M |
| 8 | SAMPLE_size_fraction_upper_threshold | M2B3 | Text | H |
| 8 | SAMPLE_size_fraction_lower_threshold | M2B3 | Text | H |
| 9 | SAMPLE_content (e.g., 0.22 μm filter, 20 mL water) | M2B3 | Text | H |
| 10 | Concentration of chlorophyll | MIxS (water) | Numeric & unit | HM |
| 10 | ENVIRONMENT_ecosystem_pigment concentration | M2B3 | SDN: P02, SDN: P06 controlled vocab. | HM |
| 11 | Fluorescence | MIxS (water) | Numeric & unit | HM |
| 11 | ENVIRONMENT_ecosystem_fluorescence | M2B3 | SDN: P02, SDN: P06 controlled vocab. | HM |
| 12 | Density | MIxS (water) | Numeric & unit | M |
| 13 | Organism count | MIxS (water) | Numeric & unit | ML |
| 13 | ENVIRONMENT_ecosystem_picoplankton (flow cytometry) abundance | M2B3 | SDN: P02, SDN: P06 controlled vocab. | ML |
| 13 | ENVIRONMENT_ecosystem_nano/microplankton abundance | M2B3 | SDN: P02, SDN: P06 controlled vocab. | ML |
| 13 | ENVIRONMENT_ecosystem_meso/macroplankton abundance | M2B3 | SDN: P02, SDN: P06 controlled vocab. | ML |
| 14 | Primary production | MIxS (water) | Numeric & unit | M |
| 14 | ENVIRONMENT_ecosystem_primary production | M2B3 | SDN: P02, SDN: P06 controlled vocab. | M |
| 15 | Bacterial production | MIxS (water) | Numeric & unit | M |
| 15 | ENVIRONMENT_ecosystem_bacterial production | M2B3 | SDN: P02, SDN: P06 controlled vocab. | M |
| 16 | Biomass | MIxS (water) | Numeric & unit | ML |
| 16 | ORGANISM_biomass | M2B3 | Numeric & unit & method | ML |
| 17 | ORGANISM_biovolume | M2B3 | Numeric & unit & method | L |
| 18 | ORGANISM_size | M2B3 | Numeric & unit & method | L |
| 19 | INVESTIGATION_authors | M2B3 | Text | M |
| 20 | Host taxid | MIxS (host associated) | NCBI taxonomy identifier | M |
These descriptors cover such areas as the structure or viability of the community under investigation and sample pooling procedures. Column A groups descriptors that are related conceptually (1 – sample collection method & device, 2 – sample processing, 3 – sample quantity, 4 – storage container, 5 – storage duration, 6 – storage temperature, 7 – chemical treatment, 8 – microbial fraction thresholds, 9 – sample content, 10 – pigment concentration, 11 – fluorescence, 12 – density, 13 – organism abundance, 14 – primary production, 15 – bacterial production, 16 – organism biomass, 17 – organism biovolume, 18 – organism size, 19 – investigation contributors, 20 – unique taxonomic index identifier for organism host). Column C names the contextual data reporting standard, suitable for marine metagenomic data, to which each descriptor belongs. Column E suggests the descriptor's importance for metagenomic data analysis (H – high relevance, M – medium relevance, L – low relevance).
CHEBI: Chemical Entities of Biological Interest; SDN: SeaDataNet.
Mandatory descriptors for sequencing.
| Mandatory descriptors of sequencing provenance | Descriptor format |
|---|---|
| Instrument platform | Controlled vocabulary [Illumina, Oxford Nanopore, PacBio SMRT, Ion Torrent, LS454, Complete Genomics, Capillary] |
| Instrument model | Controlled vocabulary |
| Library source | Controlled vocabulary |
| Library strategy | Controlled vocabulary |
| Library selection | Controlled vocabulary |
| Library layout | Controlled vocabulary [single, paired] |
| Read file name | Text |
| Read file md5 checksum | 32-digit hexadecimal number |
| Second read file name (for paired Fastq files) | Text |
| Second read file md5 checksum (for paired Fastq files) | 32-digit hexadecimal number |
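
The md5 checksum descriptors above can be produced with standard tooling. The sketch below uses Python's `hashlib` to compute the checksum of a read file in chunks and a regular expression to check that a reported value is a 32-digit hexadecimal number; the helper names `md5_of_file` and `looks_like_md5` are illustrative, not part of any archive's submission software.

```python
import hashlib
import re


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file without loading it whole."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def looks_like_md5(value):
    """True if value is exactly 32 hexadecimal digits."""
    return re.fullmatch(r"[0-9a-fA-F]{32}", value) is not None
```

Reading in chunks keeps memory use constant, which matters for read files that run to many gigabytes.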
Nonmandatory sequencing descriptors (column A) and formats (column B); column C suggests the descriptor's potential importance for metagenomic data analysis (H – high relevance, M – medium relevance, L – low relevance).
| A: nonmandatory descriptors of sequencing provenance | B: descriptor format | C: value for analysis (H/M/L) |
|---|---|---|
| Sequencing centre contact | Text | M |
| Sequencing experiment name | Text | L |
| Library name | Text | L |
| Library description | Text | L |
| Library construction protocol | Text | M |
| Library construction method (MIMS) | Text | M |
| Library size (MIMS) | Numeric | M |
| Library reads sequenced (MIMS) | Numeric | M |
| Library vector (MIMS) | Text | M |
| Library screening strategy (MIMS) | Text | M |
| Insert size (for paired read files) | Numeric | M |
| Spot layout (for SFF read files) | Controlled vocabulary (single, paired FF, paired FR) | M |
| Linker sequence (for SFF read files) | Sequence of nucleotides | H |
| Multiplex identifiers (MIMS) | Sequence of nucleotides | H |
| Adapters (MIMS) | Sequence of nucleotides | H |
| Quality scoring system (for Fastq files) | Controlled vocabulary (phred, log-odds) | H |
| Quality encoding (for Fastq files) | Controlled vocabulary (ASCII, decimal, hexadecimal) | H |
| ASCII offset (for Fastq files) | Controlled vocabulary (!, @) | H |
| Nucleic acid extraction SOP (MIMS) | Text | H |
| Nucleic acid amplification SOP (MIMS) | Text | H |
| Sequencing coverage | Numeric | H |
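
The quality scoring system, quality encoding and ASCII offset descriptors are rated highly because a Fastq quality string cannot be interpreted without them. As a sketch under stated assumptions, the code below decodes an ASCII-encoded, phred-scaled quality string for either offset listed in the table (`!`, code 33, or `@`, code 64). The log-odds scoring system is not handled, and the function names are hypothetical.

```python
def decode_phred(quality_string, ascii_offset="!"):
    """Turn an ASCII-encoded quality string into integer phred scores."""
    offset = ord(ascii_offset)
    scores = [ord(symbol) - offset for symbol in quality_string]
    # A negative score means the declared offset does not match the data.
    if any(score < 0 for score in scores):
        raise ValueError("quality string is inconsistent with the offset")
    return scores


def error_probability(phred_score):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-phred_score / 10)


print(decode_phred("II5!", ascii_offset="!"))  # [40, 40, 20, 0]
print(decode_phred("hh", ascii_offset="@"))    # [40, 40]
```

The same symbols give different scores under the two offsets, so an analysis that assumes the wrong offset will silently mis-filter reads unless the metadata states it.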
Figure 3: Schematic overview of best practice for analysis metadata collection with example fields. A) Overarching metadata; B) Analysis component; C) Workflow.