Emma J Griffiths, Ruth E Timme, Catarina Inês Mendes, Andrew J Page, Nabil-Fareed Alikhan, Dan Fornika, Finlay Maguire, Josefina Campos, Daniel Park, Idowu B Olawoye, Paul E Oluniyi, Dominique Anderson, Alan Christoffels, Anders Gonçalves da Silva, Rhiannon Cameron, Damion Dooley, Lee S Katz, Allison Black, Ilene Karsch-Mizrachi, Tanya Barrett, Anjanette Johnston, Thomas R Connor, Samuel M Nicholls, Adam A Witney, Gregory H Tyson, Simon H Tausch, Amogelang R Raphenya, Brian Alcock, David M Aanensen, Emma Hodcroft, William W L Hsiao, Ana Tereza R Vasconcelos, Duncan R MacCannell.
Abstract
BACKGROUND: The Public Health Alliance for Genomic Epidemiology (PHA4GE) (https://pha4ge.org) is a global coalition that is actively working to establish consensus standards, document and share best practices, improve the availability of critical bioinformatics tools and resources, and advocate for greater openness, interoperability, accessibility, and reproducibility in public health microbial bioinformatics. In the face of the current pandemic, PHA4GE has identified a need for a fit-for-purpose, open-source SARS-CoV-2 contextual data standard.
Keywords: SARS-CoV-2; bioinformatics; data standards; genomics; metadata
Year: 2022 PMID: 35169842 PMCID: PMC8847733 DOI: 10.1093/gigascience/giac003
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1: Contextual data flow. Contextual data can be captured and structured using the PHA4GE specification so that they can be more easily harmonized across different data sources and providers. Different subsets of the harmonized data can be (i) shared with public repositories, e.g., GISAID and INSDC; (ii) shared with trusted partners, e.g., national sequencing consortia, public health partners; and (iii) kept private and retained locally with the potential for sharing in the future for particular surveillance or research activities. While fields have been colour-coded in the template to indicate whether they are considered “required,” “strongly recommended,” or “optional,” how the specification is implemented and whether any of the data are shared is ultimately at the discretion of the user. Box 1 describes the information types covered in the full specification.
Ontologies implemented in the PHA4GE SARS-CoV-2 specification
| Ontology | Link |
|---|---|
| BRENDA Tissue Ontology (BTO) | |
| Cell Line Ontology (CLO) | |
| Environmental conditions, treatments and exposures ontology (ECTO) | |
| Environment Ontology (ENVO) | |
| Food Ontology (FoodOn) | |
| Gazetteer Ontology (GAZ) | |
| Gender, Sex, and Sexual Orientation Ontology (GSSO) | |
| Genomic Epidemiology Ontology (GenEpiO) | |
| Genomics Cohorts Knowledge Ontology (GECKO) | |
| Human Disease Ontology (DOID) | |
| Human Phenotype Ontology (HP) | |
| Mammalian Phenotype Ontology (MP) | |
| Measurement Method Ontology (MMO) | |
| Mondo Disease Ontology (MONDO) | |
| Mouse Pathology Ontology (MPATH) | |
| National Cancer Institute Thesaurus (NCIT) | |
| NCBI Taxonomy Ontology (NCBITaxon) | |
| Neuro Behaviour Ontology (NBO) | |
| Ontology for Biomedical Investigations (OBI) | |
| Ontology of Medically Related Social Entities (OMRSE) | |
| Population and Community Ontology (PCO) | |
| UBERON Multi-species Anatomy Ontology (UBERON) | |
| Unit Ontology (UO) | |
| Vaccine Ontology (VO) | |
Vocabulary for fields and terms in the specification has been sourced from or mapped to OBO Foundry domain and application ontologies, which are listed in this table. New fields and terms for which there were no existing equivalents have been developed and submitted to these ontologies, expanding these community resources.
Resources that form the PHA4GE SARS-CoV-2 contextual data specification package [55]
| Resource | Description | Link |
|---|---|---|
| Collection template and controlled vocabulary pick lists | Spreadsheet-based collection form containing different fields (identifiers and accessions, sample collection and processing, host information, host exposure, vaccination and reinfection information, lineage and variant information, sequencing, bioinformatics and quality control metrics, diagnostic testing information, author acknowledgements). Fields are colour-coded to indicate required, recommended, or optional status. Many fields offer pick lists of controlled vocabulary. Vocabulary lists are also available in a separate tab | |
| Reference guides | Field and term definitions, guidance, and examples are provided as separate tabs in the collection template .xlsx file (see Term Reference Guide and Field Reference Guide) | |
| Curation protocol on protocols.io | Step-by-step instructions for using the collection template are provided in an SOP. Ethical, practical, and privacy considerations are also discussed. Examples and instructions for structuring sample descriptions as well as sourcing additional standardized terms (outside those provided in pick lists) are also discussed | dx.doi.org/10.17504/protocols.io.btpznmp6 |
| Mapping file of PHA4GE fields to metadata standards | PHA4GE fields are mapped to existing metadata standards such as the Sample Application Standard, MIxS 5.0, and the MIGS Virus Host-associated attribute package. Mappings are available in the Reference guide tab. Mappings highlight which fields of these standards are considered useful for SARS-CoV-2 public health surveillance and investigations, and which fields are considered out of scope | |
| Mapping of PHA4GE fields to WHO metadata recommendations | PHA4GE fields are mapped to corresponding contextual data elements recommended by the World Health Organization | |
| Mapping file of PHA4GE fields to EMBL-EBI, NCBI, and GISAID submission requirements | Many PHA4GE fields have been sourced from public repository submission requirements. The different repositories have different requirements and field names. Repository submission fields have been mapped to PHA4GE fields to demonstrate equivalencies and divergences | |
| Data submission protocol (NCBI) on protocols.io | The SARS-CoV-2 submission protocol for NCBI provides step-by-step instructions and recommendations aimed at improving interoperability and consistency of submitted data | dx.doi.org/10.17504/protocols.io.bui7nuhn |
| Data submission protocol (EMBL-EBI) on protocols.io | The SARS-CoV-2 submission protocol for ENA provides step-by-step instructions and recommendations aimed at improving interoperability and consistency of submitted data | dx.doi.org/10.17504/protocols.io.buqnnvve |
| Data submission protocol (GISAID) on protocols.io | The SARS-CoV-2 submission protocol for GISAID provides step-by-step instructions and recommendations aimed at improving interoperability and consistency of submitted data | dx.doi.org/10.17504/protocols.io.bumknu4w |
| JSON structure of PHA4GE specification | A JSON structure of the PHA4GE specification has been provided for easier integration into software applications | |
| PHA4GE template in the DataHarmonizer | JavaScript application enabling standardized data entry, validation, and export of contextual data as submission-ready forms for GISAID and NCBI. The SOP for using the software can be found at | |
The resources that form the PHA4GE SARS-CoV-2 contextual data specification package are described in the table. The package has been compiled to support user implementation and data sharing, with integration into workflows and new software applications in mind. SOP: standard operating procedure.
Minimal (required) contextual data fields
| Field name | Definition | Guidance |
|---|---|---|
| specimen collector sample ID | The user-defined name for the sample | Every Sample ID from a single submitter must be unique. It can have any format, but we suggest that you make it concise, unique, and consistent within your laboratory, and as informative as possible |
| sample collected by | The name of the agency that collected the original sample | The name of the agency should be written out in full (with minor exceptions) and consistent across multiple submissions |
| sequence submitted by | The name of the agency that generated the sequence | The name of the agency should be written out in full (with minor exceptions) and be consistent across multiple submissions |
| sample collection date | The date on which the sample was collected | Record the collection date accurately in the template. Required granularity includes year, month, and day. Before sharing these data, ensure that this date is not considered identifiable information. If this date is considered identifiable, it is acceptable to add “jitter” to the collection date by adding or subtracting calendar days. Do not change the collection date in your original records. Alternatively, “received date” may be used as a substitute in the data you share. The date should be provided in ISO 8601 standard format “YYYY-MM-DD” |
| geo_loc name (country) | Country of origin of the sample | Provide the country name from the pick list in the template |
| geo_loc name (state/province/region) | State/province/region of origin of the sample | Provide the state/province/region name from the GAZ geography ontology. Search for geography terms at |
| Organism | Taxonomic name of the organism | Use “Severe acute respiratory syndrome coronavirus 2” |
| Isolate | Identifier of the specific isolate | This identifier should be a unique, indexed, alphanumeric ID within your laboratory. If submitted to the INSDC, the “isolate” name is propagated throughout different databases. As such, structure the “isolate” name to be ICTV/INSDC compliant in the following format: “SARS-CoV-2/host/country/sampleID/date” |
| host (scientific name) | The taxonomic, or scientific name of the host | Common name or scientific name is required if there was a host. Scientific name example: |
| host disease | The name of the disease experienced by the host | This field is only required if there was a host. If the host was a human, select “COVID-19” from the pick list. If the host was asymptomatic, this can be recorded under “host health state details.” “COVID-19” should still be provided if the patient is asymptomatic. If the host is not human, and the disease state is not known or the host appears healthy, put “not applicable.” |
| purpose of sequencing | The reason that the sample was sequenced | The reason why a sample was originally collected may differ from the reason why it was selected for sequencing. The reason a sample was sequenced may provide information about potential biases in sequencing strategy. Provide the purpose of sequencing from the pick list in the template. The reason for sample collection should be indicated in the “purpose of sampling” field |
| sequencing instrument | The model of the sequencing instrument used | Select a sequencing instrument from the pick list provided in the template |
| consensus sequence software name | The name of software used to generate the consensus sequence | Provide the name of the software used to generate the consensus sequence |
| consensus sequence software version | The version of the software used to generate the consensus sequence | Provide the version of the software used to generate the consensus sequence |
Through consultation and consensus, 14 fields were prioritized for SARS-CoV-2 surveillance and are considered required in the specification. Field names, definitions, and guidance are presented.
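The guidance in the table can be checked mechanically before data are shared. The following is a minimal sketch in Python, not part of the PHA4GE package: the function names (`missing_required`, `is_iso_date`, `jitter_date`, `isolate_name`) are hypothetical, the record is assumed to be a plain dictionary keyed by the field names above, and the date component of the isolate name is passed through exactly as supplied.

```python
import re
from datetime import date, timedelta

# The 14 required field names, copied from the table above.
REQUIRED_FIELDS = [
    "specimen collector sample ID",
    "sample collected by",
    "sequence submitted by",
    "sample collection date",
    "geo_loc name (country)",
    "geo_loc name (state/province/region)",
    "Organism",
    "Isolate",
    "host (scientific name)",
    "host disease",
    "purpose of sequencing",
    "sequencing instrument",
    "consensus sequence software name",
    "consensus sequence software version",
]


def missing_required(record):
    """Return the required field names that are absent or blank in a record."""
    return [f for f in REQUIRED_FIELDS if not str(record.get(f, "")).strip()]


def is_iso_date(value):
    """True if value is a real calendar date written as YYYY-MM-DD."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def jitter_date(value, offset_days):
    """Return a shifted copy of a collection date for sharing.

    The input string is left untouched, in line with the guidance not to
    change the collection date in the original records.
    """
    shifted = date.fromisoformat(value) + timedelta(days=offset_days)
    return shifted.isoformat()


def isolate_name(host, country, sample_id, collected):
    """Assemble an isolate name as SARS-CoV-2/host/country/sampleID/date."""
    return "/".join(["SARS-CoV-2", host, country, sample_id, collected])
```

A record that passes `missing_required` with an empty list and `is_iso_date` on its collection date meets the structural part of the minimal requirements; whether a date needs jitter, and by how much, remains a local privacy decision.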
Figure 2: The PHA4GE specification is being implemented in CanCOGeN to harmonize contextual data across jurisdictions. (A) CanCOGeN is Canada's SARS-CoV-2 national genomic surveillance initiative. Canada has a decentralized health system, with one federal and 13 provincial/territorial public health jurisdictions. Provinces/Territories have authority over how data are collected, stored, and shared. Every Canadian public health jurisdiction uses different collection instruments (e.g., case report forms), different data management systems, and different pipelines and software to perform bioinformatic analyses. Provinces/Territories share sequencing data and accompanying contextual data with the National Microbiology Lab's national SARS-CoV-2 genomics database (starred) according to a version of the PHA4GE specification for national surveillance activities. (B) Excerpts from two different province-specific case collection forms. Sample type information is collected in data collection instruments using different fields, different terms, different levels of granularity, and different abbreviations and formats. BAL: bronchoalveolar lavage; NPS: nasopharyngeal swab; UTM: universal transport medium. (C) An anonymized example of how the standard consistently structures contextual information and how it is being used for data sharing. The contextual data specification provides a wide variety of fields and pick lists of terms. In the example, the full set of standardized information shown would be shared by the province with the national database. Standardized information in boldface would be shared with public repositories; however, select data elements (underscored) would be withheld according to jurisdictional data sharing policies. The specification enables users to harmonize and integrate data provenance, sampling strategy criteria, epidemiological information, and methods.
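The problem shown in panel B, where forms record the same sample type under different abbreviations, can be illustrated with a small lookup. This is a minimal sketch in Python under stated assumptions: the `ABBREVIATIONS` table and the `harmonize_term` function are hypothetical, the table contains only the three abbreviations expanded in the caption, and the expanded strings are not claimed to be the exact pick-list terms of the specification.

```python
# Hypothetical lookup built only from the abbreviations defined in the caption.
ABBREVIATIONS = {
    "bal": "bronchoalveolar lavage",
    "nps": "nasopharyngeal swab",
    "utm": "universal transport medium",
}


def harmonize_term(raw):
    """Expand a known abbreviation, ignoring case and surrounding spaces.

    Returns (term, matched). Unknown values come back unchanged with
    matched=False so that a curator can review them by hand.
    """
    cleaned = raw.strip()
    expanded = ABBREVIATIONS.get(cleaned.lower())
    if expanded is not None:
        return expanded, True
    return cleaned, False
```

Returning the unmatched value together with a flag, rather than guessing, keeps local terms visible for manual curation.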
Figure 3: Overview of how the PHA4GE SARS-CoV-2 contextual data specification can be integrated into public repository submission. The PHA4GE collection template provides a one-stop shop for different data types that are important for global surveillance. The protocols provided as part of the specification package describe how PHA4GE fields can be mapped to different repository submission forms. Consensus sequences (FASTA), accompanied by a subset of PHA4GE fields, can be submitted to the GISAID EpiCoV database (A). Consensus sequences (FASTA) (B) as well as raw/processed data (FASTQ, BAM) (C, D) can be submitted to INSDC databases (e.g., GenBank, SRA) with different subsets of PHA4GE fields as part of a BioSample record. BioSamples are propagated throughout INSDC databases.
A selection of accession numbers of harmonized contextual data records submitted to different public repositories
| Data contributor | Repository | Accession No. |
|---|---|---|
| African Centre of Excellence for Genomics of Infectious Diseases (Nigeria) | GISAID | EPI_ISL_1035827 |
| | | EPI_ISL_1035826 |
| | | EPI_ISL_1035825 |
| COVID-19 Genomic Surveillance Regional Network (Latin America) | GISAID | EPI_ISL_2158821 |
| | | EPI_ISL_2158802 |
| | | EPI_ISL_2158810 |
| COVID-19 Genomic Surveillance Regional Network (Latin America) | EMBL-EBI | SAMEA8968916 |
| Rhode Island Department of Health/Broad Institute (SPHERES) | NCBI | SAMN18306978 |
| Massachusetts General Hospital/Broad Institute (SPHERES) | NCBI | SAMN18309294 |
| Flow Health/Broad Institute (SPHERES) | NCBI | SAMN18308763 |
| New Brunswick Diagnostic Virology Reference Center/Public Health Agency of Canada (CanCOGeN) | NCBI | SAMN16784832 |
| Toronto Invasive Bacterial Diseases Network/McMaster University (CanCOGeN) | NCBI | SAMN17505317 |
| Bat coronavirus phylogeography: Université de La Réunion, UMR Processus Infectieux en Milieu Insulaire Tropical (PIMIT) and Field Museum of Natural History | NCBI | SAMN20400589 |
| | | SAMN20400588 |