Ruth E Timme, Hugh Rand, Maria Sanchez Leon, Maria Hoffmann, Errol Strain, Marc Allard, Dwayne Roberson, Joseph D Baugher.
Abstract
Pathogen monitoring is becoming more precise as sequencing technologies become more affordable and accessible worldwide. This transition is especially apparent in the field of food safety, which has demonstrated how whole-genome sequencing (WGS) can be used on a global scale to protect public health. GenomeTrakr coordinates the WGS performed by public-health agencies and other partners by providing a public database with real-time cluster analysis for foodborne pathogen surveillance. Because WGS is being used to support enforcement decisions, it is essential to have confidence in the quality of the data being used and the downstream data analyses that guide these decisions. Routine proficiency tests, such as the one described here, have an important role in ensuring the validity of both data and procedures. In 2015, the GenomeTrakr proficiency test distributed eight isolates of common foodborne pathogens to participating laboratories, which were required to follow a specific protocol for performing WGS. The resulting sequence data were evaluated for several metrics, including proper labelling, sequence quality and new single nucleotide polymorphisms (SNPs). Illumina MiSeq sequence data collected for the same set of strains across 21 different laboratories exhibited high reproducibility, while revealing a narrow range of technical and biological variance. The numbers of SNPs reported for sequencing runs of the same isolates across multiple laboratories support the robustness of our cluster analysis pipeline: an individual isolate cultured and resequenced multiple times in multiple places remains easily identifiable as originating from the same source.
Keywords: GenomeTrakr; foodborne pathogen; molecular epidemiology; outbreak detection; proficiency test; whole-genome sequencing
Year: 2018 PMID: 29906258 PMCID: PMC6113870 DOI: 10.1099/mgen.0.000185
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
Summary information for the eight PT strains distributed as part of the exercise
NCBI accession numbers are listed for both the reference genomes and the raw data collected from the PT exercise.
| Organism | CFSAN ID | Genome size (bp) | No. of plasmids | G+C content (mol%) | Reference genome BioSample | Reference genome BioProject | Reference genome assembly | PT raw data BioSample | PT raw data BioProject |
|---|---|---|---|---|---|---|---|---|---|
| | CFSAN000255 | 4 694 375 | 0 | 53.4 | SAMN00710608 | PRJNA186035 | ASM18895v5 | SAMN07210937 | PRJNA386310 |
| | CFSAN000318 | 4 951 478 | 3 | 53.4 | SAMN01088008 | PRJNA186035 | ASM43010v1 | SAMN07210938 | PRJNA386310 |
| | CFSAN001178 | 3 032 269 | 0 | 38.1 | SAMN01816124 | PRJNA215355 | ASM19539v5 | SAMN07210944 | PRJNA386310 |
| | CFSAN002236 | 5 045 919 | 1 | 52.3 | SAMN02147037 | PRJNA230969 | ASM46495v2 | SAMN07210945 | PRJNA386310 |
| | CFSAN008100 | 3 108 121 | 1 | 38.1 | SAMN02689388 | PRJNA215355 | ASM100592v1 | SAMN07210931 | PRJNA386310 |
| | CFSAN030807 | 5 062 953 | 8 | 52.3 | SAMN03612247 | PRJNA273284 | ASM244253v1 | SAMN07210932 | PRJNA386310 |
| | CFSAN032805 | 1 750 173 | 2 | 31 | SAMN03580886 | PRJNA309864 | ASM240714v1 | SAMN07210933 | PRJNA386310 |
| | CFSAN032806 | 1 782 911 | 1 | 31 | SAMN03580887 | PRJNA309864 | ASM240712v1 | SAMN07210930 | PRJNA386310 |
Participating laboratories in the 2015 GTPT exercise
| Arizona State Public Health Laboratory, TGen, USA |
| California Department of Public Health, USA |
| Centers for Disease Control and Prevention, Enteric Diseases Laboratory Branch, USA |
| FDA Arkansas Laboratory, USA |
| FDA Denver Laboratory, USA |
| FDA Northeast Food and Feed Laboratory, USA |
| FDA Pacific Northwest Laboratory, USA |
| FDA Pacific Southwest Laboratory, USA |
| FDA San Francisco Laboratory, USA |
| FDA Southeast Food and Feed Laboratory, USA |
| FDA Winchester Engineering Analytical Center, USA |
| FDA/CFSAN/Office of Applied Research and Safety Assessment, USA |
| FDA/CFSAN/Office of Regulatory Science, USA |
| Florida Department of Agriculture and Consumer Services, USA |
| Food Safety Laboratory, New Mexico State University, USA |
| Hawaii State Department of Public Health, USA |
| IEH Laboratories and Consulting Group, Lake Forest Park, WA, USA |
| Minnesota Department of Health, USA |
| Nestlé Research Center - Institute of Food Safety and Analytical Science, USA |
| New York State Department of Health – Wadsworth Center, USA |
| SENASICA – Servicio Nacional de Sanidad, Inocuidad y Calidad Agroalimentaria, Mexico |
| State of Alaska Public Health Laboratory, USA |
| Texas Department of State Health Service, USA |
| United States Department of Agriculture – Food Safety Inspection Service, USA |
| Virginia Division of Consolidated Laboratory Services, USA |
| Washington State Department of Health Public Health Laboratory, USA |
Fig. 1. Box and whisker plots summarizing the mean read quality across Read1 and Read2 for each of the eight PT strains. The box defines the median value, as well as the lower and upper quartiles (25 and 75 %). The whiskers extend to the most extreme data point that is no more than 2.5 times the interquartile range from the median. Outliers are shown as black dots. The colours are genus specific.
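The whisker rule stated in the figure captions can be illustrated with a minimal Python sketch. It follows the caption wording literally (limits measured from the median, using 2.5 times the interquartile range) and is not the plotting code used for the figures. The function name `box_summary`, the choice of the inclusive quartile method and the example Q scores are all assumptions made for illustration.

```python
import statistics


def box_summary(values, whisker_factor=2.5):
    """Summarize values following the caption's box-and-whisker rule.

    Assumption: quartiles use the 'inclusive' method; the plotting
    software behind the figures may compute them differently.
    """
    data = sorted(values)
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    median = statistics.median(data)
    iqr = q3 - q1
    # Limits are measured from the median, as the caption states.
    low_limit = median - whisker_factor * iqr
    high_limit = median + whisker_factor * iqr
    inside = [v for v in data if low_limit <= v <= high_limit]
    outliers = [v for v in data if v < low_limit or v > high_limit]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical mean read-quality values (Q scores) for one strain.
example_q_scores = [30, 31, 32, 33, 34, 35, 36, 60]
summary = box_summary(example_q_scores)
print(summary)
```

In this hypothetical input, the value 60 lies more than 2.5 interquartile ranges from the median and would be drawn as an outlier dot rather than as a whisker end.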
Fig. 2. Box and whisker plots summarizing two different coverage statistics: (a) mean read depth and (b) percentage of reads mapped to the reference. The box defines the median value, as well as the lower and upper quartiles (25 and 75 %). The whiskers extend to the most extreme data point that is no more than 2.5 times the interquartile range from the median. Outliers are shown as black dots. The colours are genus specific.
Summary of SNPs detected in the raw PT data when compared to each respective reference genome
Five columns are included (0–4 SNPs) with counts of genomes submitted for that number of SNPs.
| CFSAN ID | 0 SNPs | 1 SNP | 2 SNPs | 3 SNPs | 4 SNPs |
|---|---|---|---|---|---|
| CFSAN000255 | 20 | 6 | 0 | 0 | 0 |
| CFSAN000318 | 22 | 4 | 2 | 0 | 0 |
| CFSAN001178 | 16 | 9 | 1 | 2 | 1 |
| CFSAN002236 | 22 | 2 | 2 | 1 | 1 |
| CFSAN008100 | 24 | 1 | 2 | 0 | 0 |
| CFSAN030807 | 28 | 1 | 0 | 0 | 0 |
| CFSAN032805 | 17 | 1 | 0 | 0 | 0 |
| CFSAN032806 | 0 | 16 | 2 | 0 | 0 |
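As a reading aid, the SNP table can be tallied per strain with a short Python sketch. The counts are copied from the table; the helper name `summarize_snp_counts` and the choice of summary values (genomes submitted, total SNPs, largest SNP count observed) are illustrative assumptions, not statistics reported by the authors.

```python
# Counts of submitted genomes with 0, 1, 2, 3 and 4 SNPs, copied from the table.
snp_counts = {
    "CFSAN000255": [20, 6, 0, 0, 0],
    "CFSAN000318": [22, 4, 2, 0, 0],
    "CFSAN001178": [16, 9, 1, 2, 1],
    "CFSAN002236": [22, 2, 2, 1, 1],
    "CFSAN008100": [24, 1, 2, 0, 0],
    "CFSAN030807": [28, 1, 0, 0, 0],
    "CFSAN032805": [17, 1, 0, 0, 0],
    "CFSAN032806": [0, 16, 2, 0, 0],
}


def summarize_snp_counts(counts):
    """Return (genomes submitted, total SNPs, largest SNP count observed)."""
    genomes = sum(counts)
    # Index in the list is the number of SNPs; value is the number of genomes.
    total_snps = sum(n_snps * n_genomes for n_snps, n_genomes in enumerate(counts))
    max_snps = max(n_snps for n_snps, n_genomes in enumerate(counts) if n_genomes > 0)
    return genomes, total_snps, max_snps


for strain, counts in snp_counts.items():
    genomes, total_snps, max_snps = summarize_snp_counts(counts)
    print(f"{strain}: {genomes} genomes, {total_snps} SNPs in total, max {max_snps} per genome")
```

No submitted genome in the table differs from its reference by more than four SNPs, which is the pattern the abstract refers to when it describes resequenced isolates as easily identifiable.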
Fig. 3. Box and whisker plots summarizing two different assembly statistics: (a) number of contigs and (b) NG50, a measure of contig length. The box defines the median value, as well as the lower and upper quartiles (25 and 75 %). The whiskers extend to the most extreme data point that is no more than 2.5 times the interquartile range from the median. Outliers are shown as black dots. The colours are genus specific.
Fig. 4. Box and whisker plots summarizing six different MiSeq run metrics reported by the participating laboratories: (a) cluster density, (b) clusters passing filter, (c) number of reads collected, (d) per cent of reads passing filter, (e) total data yield in Gbp, and (f) percentage of Q scores greater than or equal to Q30. The box defines the median value, as well as the lower and upper quartiles (25 and 75 %). The whiskers extend to the most extreme data point that is no more than 2.5 times the interquartile range from the median. Outliers are shown as black dots.
Abbreviations for isolate metrics: Reads_M, number of sequencing reads passing filter for a given isolate; SeqLength, range of sequencing read lengths; MeanR1, Q score representing the mean of the mean read quality of R1 (forward) reads; MeanR2, Q score representing the mean of the mean read quality of R2 (reverse) reads; PercMapped, percentage of reads that could be mapped to the reference genome; MeanDepth, mean depth (coverage) of reads mapped to the reference genome; SNPs, SNPs reported by the CFSAN SNP Pipeline; MeanInsert, mean insert size, defined as the length of the sequence between the adapters; GenomeFraction, total number of aligned bases in the reference, divided by the genome size (a base in the reference genome is counted as aligned if at least one contig has an alignment to the base; contigs from repeat regions may map to multiple places and, thus, may be counted multiple times); NG50, contig length such that using contigs of equal or longer length produces 50 % of the length of the reference genome, allowing for comparisons between different genomes (larger NG50 values generally correlate with a higher quality assembly); Contigs, total number of contigs in the assembly (fewer contigs generally correlate with a higher quality assembly).
Abbreviations for run metrics: ClusterDensity, density of clusters (K mm−2); PercClustersPF, percentage of clusters which passed filtering; Reads_M, number of reads (clusters) in millions; ReadsPF_M, number of reads (clusters) that passed filtering (millions); Yield, number of gigabases that passed filtering; Q30, percentage of bases with a Q score ≥30.
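The NG50 definition in the isolate-metric abbreviations can be sketched in a few lines of Python. This is an illustration of the definition only, not the assembly-evaluation software used in the study; the function name `ng50` and the example contig lengths are hypothetical.

```python
def ng50(contig_lengths, reference_genome_size):
    """Return the NG50 of an assembly, or None if it cannot be reached.

    NG50 is the length of the shortest contig in the set of longest
    contigs that together cover at least half of the reference genome.
    """
    target = reference_genome_size / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    # The whole assembly covers less than half of the reference genome.
    return None


# Hypothetical assembly of four contigs against a 1000 bp reference.
example_contigs = [400, 300, 200, 100]
print(ng50(example_contigs, 1000))
```

Because the target is half of the reference genome length rather than half of the assembly length, the value stays comparable between assemblies of the same strain that differ in total size.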
Fig. 5. Pearson correlation table of run metrics and summary statistics. Significant positive and negative correlations are represented as + and −, respectively. The size of the symbols and font boldness represent the degree of correlation.
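For readers unfamiliar with the statistic behind Fig. 5, the Pearson correlation coefficient between two metrics can be computed as in the following Python sketch. The function name `pearson_r` and the example values are hypothetical, and the record does not state which software the authors used or how significance was assessed.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical run metrics for five runs (not data from the study).
cluster_density = [800, 950, 1100, 1250, 1400]  # K mm^-2
q30_percent = [92.0, 90.5, 88.0, 85.5, 82.0]    # % of bases with Q >= 30

print(round(pearson_r(cluster_density, q30_percent), 3))
```

A value close to +1 or −1 corresponds to the larger, bolder + or − symbols described in the caption, and a value near 0 indicates little linear relationship.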