Ruth E Timme, William J Wolfgang, Maria Balkey, Sai Laxmi Gubbala Venkata, Robyn Randolph, Marc Allard, Errol Strain.
Abstract
The holistic approach of One Health, which sees human, animal, plant, and environmental health as a unit rather than as discrete parts, requires not only interdisciplinary cooperation but also standardized methods for communicating and archiving data, so that participants can easily share what they have learned and others can build upon their findings. Ongoing work by NCBI and the GenomeTrakr project illustrates how open data platforms can help meet the needs of federal and state regulators, public health laboratories, departments of agriculture, and universities. Here we describe how microbial pathogen surveillance can be transformed by an open access database combined with Best Practices for contributors to follow. First, we describe the open pathogen surveillance framework hosted on the NCBI platform. We cover the current community standards for WGS quality, provide an SOP for assessing your own sequence quality, and recommend QC thresholds for all submitters to follow. We then provide an overview of NCBI data submission along with step-by-step details. Finally, we provide curation guidance and an SOP for keeping your public data current within the database. These Best Practices can serve as models for other open data projects, thereby advancing the One Health goals of Findable, Accessible, Interoperable and Re-usable (FAIR) data.
Keywords: GenomeTrakr; Genomic epidemiology; Microbial pathogen surveillance; NCBI submission; One health; QA/QC; Whole genome sequencing
Year: 2020 PMID: 33103064 PMCID: PMC7568946 DOI: 10.1186/s42522-020-00026-3
Source DB: PubMed Journal: One Health Outlook ISSN: 2524-4655
Fig. 1 INSDC hub showing how genomic data in public databases are analyzed by many different software platforms for different purposes. Included in this figure are most of the open source analysis platforms related to genomic epidemiology that were available in March of 2020, and one private software tool, BioNumerics. BioNumerics is also the only platform with submission capability
Quality control threshold guidelines for enteric pathogens collected for GenomeTrakr
| Quality metric | ||||||
|---|---|---|---|---|---|---|
| Average read quality Q score for R1 and R2 | >= 30 | >= 30 | >= 30 | >= 30 | >= 30 | >= 30 |
| Average coverage | >= 30X | >= 20X | >= 40X | >= 40X | >= 20X | >= 40X |
| De novo assembly: Seq. length (Mbp) | ~ 4.3–5.2 | ~ 2.7–3.2 | ~ 4.5–5.9 | ~ 4.0–5.0 | ~ 1.5–1.9 | ~ 4.8–5.5 |
| De novo assembly: no. contigs | <= 300 | <= 300 | <= 500 | <= 650 | <= 300 | <= 300 |
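
As an illustration of how a laboratory might apply these thresholds before submission, the following is a minimal Python sketch. The class and function names (`QCThresholds`, `qc_failures`) are hypothetical and not part of any GenomeTrakr or NCBI tool. The example values are copied from the first data column of the table above; the organism label for that column is not shown in this excerpt, so none is assumed here. The table gives the assembly length as an approximate range, which this sketch treats as hard bounds for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QCThresholds:
    """One column of the QC table (hypothetical container, for illustration)."""
    min_mean_q: float      # average read quality Q score, applied to R1 and R2
    min_coverage: float    # average coverage (X)
    min_length_mbp: float  # lower end of the de novo assembly length range
    max_length_mbp: float  # upper end of the de novo assembly length range
    max_contigs: int       # maximum number of contigs in the de novo assembly


# Values from the first data column of the table above.
FIRST_COLUMN = QCThresholds(
    min_mean_q=30, min_coverage=30,
    min_length_mbp=4.3, max_length_mbp=5.2, max_contigs=300,
)


def qc_failures(t, q_r1, q_r2, coverage, length_mbp, contigs):
    """Return the names of the metrics that miss their threshold (empty if all pass)."""
    failures = []
    if q_r1 < t.min_mean_q:
        failures.append("R1 average Q score")
    if q_r2 < t.min_mean_q:
        failures.append("R2 average Q score")
    if coverage < t.min_coverage:
        failures.append("average coverage")
    # The table lists an approximate ("~") range; treated as strict bounds here.
    if not (t.min_length_mbp <= length_mbp <= t.max_length_mbp):
        failures.append("assembly length")
    if contigs > t.max_contigs:
        failures.append("number of contigs")
    return failures
```

Returning the list of failed metrics, rather than a single pass/fail flag, makes it easier to decide whether an isolate needs resequencing (low coverage) or closer inspection for contamination (unexpected assembly length).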
The minimum set of metadata fields recommended by GenomeTrakr for BioSample submission of bacterial pathogens. Consult the “Populating the NCBI Pathogen metadata template protocol” [32] for expanded, up-to-date guidance
| Required fields | Description |
|---|---|
| strain | This is the authoritative ID used within NCBI Pathogen Detection and for the PulseNet/GenomeTrakr networks. Although the Strain ID can have any format, we suggest that it be unique, concise, and consistent within your laboratory (e.g. CFSAN123456). There are downstream advantages to the name being entirely alpha-numeric, so avoid special characters if possible. |
| sample_name | Sample Name is another unique identifier for the pure culture isolate and is required by NCBI for BioSample submission (it cannot be left blank). It can have any format, but we suggest that it be the same as the strain name or contain another identifier important to the isolate or submitting laboratory. NCBI validates this attribute for uniqueness, so you cannot use “missing” or “not collected”. This identifier is NOT available in NCBI-PD. |
| organism | The organism name should include the most descriptive information you have at time of submission, adhering to proper nomenclature in NCBI taxonomy database: |
| collected_by | Name of laboratory that sequenced the isolate (or institute that collected the sample). Abbreviations are ok if they are well-known in the community (e.g. FDA or CDC). |
| attribute_package | This field provides the pathogen type (or “isolation type”). Allowed values are “Pathogen.cl” (for human clinical pathogens) or “Pathogen.env” (for environmental, food, or animal clinical isolates). The value provided in this field drives validation of other fields and cannot be left blank. |
| collection_date | Date of sampling in ISO 8601 standard: “YYYY-mm-dd”, “YYYY-mm” or “YYYY” (e.g., 1990-10-30, 1990-10, or 1990). |
| geo_loc_name | Geographical origin of the sample using controlled vocabulary: |
| isolation_source | Describes the physical, environmental and/or local geographical sample from which the organism was derived. Avoid generic terms such as patient isolate, sample, food, surface, clinical, product, source, environment. |
| host | ᵃ For Pathogen.cl only: “ |
| host_disease | ᵃ For Pathogen.cl only: Name of relevant disease, e.g., Salmonella gastroenteritis. This field must use controlled vocabulary provided at: |
| bioproject_accession | The accession number of the BioProject(s) to which the BioSample belongs (PRJNAxxxxxx). |
| lat_lon | Provide latitude and longitude to support “geo_loc_name”. NCBI requires this field to be populated. However, if this level of detail is not available, GenomeTrakr recommends entering “missing” or “not collected” here. |
ᵃ “For Pathogen.cl only”: These fields are mandatory ONLY if the isolate is from a human clinical sample. If the isolate was collected from food/water/env or animal sources, these fields should be left blank.
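
The rules in this table lend themselves to an automated pre-submission check. The sketch below is one possible approach, not NCBI's own validation: the function name `check_biosample` and the example record are hypothetical, and only rules stated in the table are tested (non-blank required fields, allowed `attribute_package` values, no placeholder in `sample_name`, the shape of `collection_date`, and the Pathogen.cl-only fields). Controlled vocabularies are not checked because their sources are not reproduced here.

```python
import re

REQUIRED = [
    "strain", "sample_name", "organism", "collected_by", "attribute_package",
    "collection_date", "geo_loc_name", "isolation_source",
    "bioproject_accession", "lat_lon",
]
CLINICAL_ONLY = ["host", "host_disease"]
ALLOWED_PACKAGES = {"Pathogen.cl", "Pathogen.env"}
PLACEHOLDERS = {"missing", "not collected"}
# Checks the shape YYYY, YYYY-mm or YYYY-mm-dd only, not calendar validity.
DATE_RE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")


def check_biosample(record):
    """Return a list of problems found in one metadata record (a dict of strings)."""
    problems = []
    for field in REQUIRED:
        if not record.get(field, "").strip():
            problems.append(f"{field}: required, cannot be blank")
    package = record.get("attribute_package", "")
    if package and package not in ALLOWED_PACKAGES:
        problems.append("attribute_package: must be Pathogen.cl or Pathogen.env")
    if record.get("sample_name", "").strip().lower() in PLACEHOLDERS:
        problems.append("sample_name: must be a unique identifier, not a placeholder")
    date = record.get("collection_date", "")
    if date and not DATE_RE.match(date):
        problems.append("collection_date: expected YYYY-mm-dd, YYYY-mm or YYYY")
    for field in CLINICAL_ONLY:
        filled = bool(record.get(field, "").strip())
        if package == "Pathogen.cl" and not filled:
            problems.append(f"{field}: required for Pathogen.cl")
        elif package == "Pathogen.env" and filled:
            problems.append(f"{field}: leave blank for Pathogen.env")
    return problems


# Illustrative record; geo_loc_name and isolation_source values are made up,
# and the BioProject accession is the placeholder pattern from the table.
example = {
    "strain": "CFSAN123456", "sample_name": "CFSAN123456",
    "organism": "Salmonella enterica", "collected_by": "FDA",
    "attribute_package": "Pathogen.cl", "collection_date": "1990-10-30",
    "geo_loc_name": "USA", "isolation_source": "stool",
    "host": "Homo sapiens", "host_disease": "Salmonella gastroenteritis",
    "bioproject_accession": "PRJNAxxxxxx", "lat_lon": "missing",
}
```

Calling `check_biosample(example)` on the record above yields an empty list; changing `collection_date` to a value such as `30/10/1990` yields one problem for that field.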
Fig. 2 Screenshot of a cluster within the NCBI-PD browser showing harmonized metadata submissions across five different submitting laboratories (PulseNet, GenomeTrakr, Public Health England, Israel Ministry of Health, and CA Food Inspection Agency). URL: https://www.ncbi.nlm.nih.gov/Structure/tree/#!/tree/Salmonella/PDG000000002.1922/PDS000025876.12?treelabel=sra_center,strain,epi_type,collection_date,geo_loc_name,isolation_source
Fig. 3 Density plot showing the distribution of genome lengths for a random sample of isolates with Illumina sequence data available from the NCBI Pathogen Detection portal (n = 10,000 for all species except V. parahaemolyticus, where n = 1414 due to the smaller number of samples). Sequences were assembled using SKESA 2.2, and the bars indicate ±3 standard deviations from the mean. Mbp = mega base pairs
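
The ±3 standard deviation bars in this figure correspond to a simple calculation on the assembled genome lengths. A minimal sketch follows, assuming the sample standard deviation is used (the caption does not say whether the sample or population form was applied); the function name `three_sigma_range` is hypothetical.

```python
from statistics import mean, stdev


def three_sigma_range(lengths_mbp):
    """Return (low, high) = mean -/+ 3 sample standard deviations, in Mbp."""
    m = mean(lengths_mbp)
    s = stdev(lengths_mbp)  # sample standard deviation (n - 1 denominator)
    return m - 3 * s, m + 3 * s
```

An assembly whose length falls outside the returned range would sit beyond the bars drawn in the density plot.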
Fig. 4 Plot of mean coverage (as reported by SKESA v. 2.2) vs number of contigs for a random sample of isolates with Illumina sequence data available from the NCBI Pathogen Detection portal (n = 10,000 for all species except V. parahaemolyticus, where n = 1414 due to the smaller number of samples). The smoothed line was generated using generalized additive smoothing in R. Assembly quality, as measured by a decrease in the number of contigs, generally increases with increasing coverage
Fig. 5 Overview of the database structure at NCBI showing an example Salmonella umbrella BioProject with three linked laboratory data BioProjects, each with its own BioSamples and associated sequence data