Ruth E Timme, Hugh Rand, Martin Shumway, Eija K Trees, Mustafa Simmons, Richa Agarwala, Steven Davis, Glenn E Tillman, Stephanie Defibaugh-Chavez, Heather A Carleton, William A Klimke, Lee S Katz.
Abstract
BACKGROUND: As next-generation sequence technology has advanced, there have been parallel advances in genome-scale analysis programs for determining evolutionary relationships as proxies for epidemiological relationships in public health. Most new programs skip traditional steps of ortholog determination and multi-gene alignment, instead identifying variants across a set of genomes, then summarizing results in a matrix of single-nucleotide polymorphisms or alleles for standard phylogenetic analysis. However, public health authorities need to document the performance of these methods with appropriate and comprehensive datasets so they can be validated for specific purposes, e.g., outbreak surveillance. Here we propose a set of benchmark datasets to be used for comparison and validation of phylogenomic pipelines.
Keywords: Benchmark datasets; E. coli; Food safety; Foodborne outbreak; Listeria; Phylogenomics; Salmonella; Validation; WGS
Year: 2017 PMID: 29372115 PMCID: PMC5782805 DOI: 10.7717/peerj.3893
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Table 1. Metadata table header.
Available key/value pairs that describe the entire dataset. Organism and source are required but other key/value pairs are optional.
| Key | Description | Example |
| Organism | The genus, species, or other taxonomic description | |
| Outbreak | Usually the PulseNet outbreak code, but any other descriptive word with no spaces | 1408MLGX6-3WGS |
| PMID | The PubMed identifier of a related publication | 25789745 |
| Tree | The URL to a Newick-formatted tree | |
| Source | A person who can be contacted about this dataset | Cheryl Tarr |
| DataType | Either empirical or simulated | Empirical |
| IntendedUse | Why this dataset might be useful for someone in bioinformatics testing | Epidemiologically and laboratory confirmed outbreak with outgroups |
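The requirement that Organism and Source be present can be checked mechanically. The following is a minimal sketch, assuming the header has already been read into a Python dictionary; the function name `missing_header_keys` is hypothetical, and matching header keys case-insensitively is an assumption (the source states case insensitivity only for the body field names).

```python
# Header keys that the table description marks as required.
REQUIRED_HEADER_KEYS = {"organism", "source"}


def missing_header_keys(header):
    """Return the required header keys that are absent or have an empty value.

    `header` maps key names to string values. Keys are compared in lower
    case, which is an assumption made for this sketch.
    """
    present = {
        key.strip().lower()
        for key, value in header.items()
        if value and value.strip()
    }
    return sorted(REQUIRED_HEADER_KEYS - present)


# Values taken from the example column of the header table; Organism is
# left out on purpose to show what a failed check looks like.
example_header = {
    "Outbreak": "1408MLGX6-3WGS",
    "PMID": "25789745",
    "Source": "Cheryl Tarr",
    "DataType": "Empirical",
}
print(missing_header_keys(example_header))  # ['organism']
```

An empty list means both required keys are present and non-empty.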
Table 2. Metadata table body.
Fields included in the body of the metadata table that describe the individual sequences included in the dataset. The required fields are biosample_acc, strain, and sra_acc. Any optional field can be blank or contain a dash (−) if no value is given. Field names are case insensitive.
| Field | Description | Required | Example |
| biosample_acc | The identifier found in the NCBI BioSample database. This usually starts with SAMN or SAME. | Yes | |
| Strain | The name of the isolate | Yes | CFSAN002349 |
| genBankAssembly | The GenBank assembly identifier | No | |
| SRArun_acc | The Sequence Read Archive identifier | Yes | |
| outbreak | If the isolate is associated with the outbreak or recall, list the PulseNet outbreak code, or other event identifier here. | No | 1408MLGX6-3WGS outgroup |
| datasetname | To which dataset this isolate belongs | Yes | 1408MLGX6-3WGS |
| suggestedReference | For reference-based pipelines, a dataset can suggest which reference assembly to use | Yes | TRUE |
| sha256sumAssembly | The sha256 checksum of the genome assembly. This will help assure that the download is successful. | Yes | 9b926bc0adbea331a0a71f7bf18f6c7a62ebde7dd7a52fabe602ad8b00722c56 |
| sha256sumRead1 | The sha256 checksum of the forward read | Yes | c43c41991ad8ed40ffcebbde36dc9011f471dea643fc8f715621a2e336095bf5 |
| sha256sumRead2 | The sha256 checksum of the reverse read | Yes | 4d12ed7e34b2456b8444dd71287cbb83b9c45bd18dc23627af0fbb6014ac0fca |
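The sha256sum fields let a user confirm that an assembly or read file was downloaded intact. The following is a minimal sketch of such a check using Python's standard `hashlib` module; the function names and the chunk size are illustrative choices, not part of the dataset specification.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks
    so that large read files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_hex):
    """Compare a file's digest with the value listed in the metadata table.
    Surrounding whitespace and letter case in the expected value are ignored."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A call such as `checksum_matches("assembly.fasta", row["sha256sumassembly"])` would return `True` for a successful download; the file name and the `row` dictionary here are hypothetical.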
Table 3. Example dataset.
This is an example metadata table for a hypothetical single-isolate dataset, combining the header and body from Tables 1 and 2.
| Outbreak | 1408MLGX6-3WGS |
| PMID | 25789745 |
| Tree | |
| Source | Cheryl Tarr |
| DataType | Empirical |
| IntendedUse | Epi-validated outbreak |
| biosample_acc | Strain | genBankAssembly | SRArun_acc | outbreak | datasetname | suggestedReference | sha256sumAssembly | sha256sumRead1 | sha256sumRead2 |
| | CFSAN002349 | GCA_001257675.1 | | 1408MLGX6-3WGS | 1408MLGX6-3WGS | TRUE | 9b926bc0adbea331a0a71f7bf18f6c7a62ebde7dd7a52fabe602ad8b00722c56 | c43c41991ad8ed40ffcebbde36dc9011f471dea643fc8f715621a2e336095bf5 | |
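A combined table of this shape can be split back into its header and body programmatically. The sketch below assumes the table is stored as tab-delimited text and that the body begins at the row whose first cell is `biosample_acc`; both are assumptions made for illustration, as is the function name `parse_dataset_table`. It follows the stated rules that field names are case insensitive and that a blank or a dash marks a missing value.

```python
import csv
import io

# Values that stand for "no value given": blank, hyphen, or minus sign.
MISSING = {"", "-", "\u2212"}


def parse_dataset_table(text):
    """Split a tab-delimited dataset table into (header, rows).

    header: dict of key/value pairs describing the whole dataset.
    rows:   list of dicts, one per isolate, keyed by lower-cased field name.
    """
    header, rows, columns = {}, [], None
    for cells in csv.reader(io.StringIO(text), delimiter="\t"):
        cells = [cell.strip() for cell in cells]
        if not any(cells):
            continue  # skip blank lines
        if columns is None:
            if cells[0].lower() == "biosample_acc":
                # Field names are case insensitive, so normalize them.
                columns = [cell.lower() for cell in cells]
            else:
                header[cells[0]] = cells[1] if len(cells) > 1 else ""
            continue
        values = [None if cell in MISSING else cell for cell in cells]
        values += [None] * (len(columns) - len(values))  # pad short rows
        rows.append(dict(zip(columns, values)))
    return header, rows


# SAMN_PLACEHOLDER and SRR_PLACEHOLDER are hypothetical stand-ins for the
# accessions, which are not shown in the example table.
example = "\n".join([
    "Outbreak\t1408MLGX6-3WGS",
    "PMID\t25789745",
    "Source\tCheryl Tarr",
    "DataType\tEmpirical",
    "biosample_acc\tStrain\tgenBankAssembly\tSRArun_acc\toutbreak\tdatasetname",
    "SAMN_PLACEHOLDER\tCFSAN002349\tGCA_001257675.1\tSRR_PLACEHOLDER\t-\t1408MLGX6-3WGS",
])
header, rows = parse_dataset_table(example)
print(header["Source"], rows[0]["strain"], rows[0]["outbreak"])
```

Lower-casing the field names at parse time keeps later lookups simple, at the cost of discarding the original capitalization.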
Figure 1. The “true” phylogeny included for each dataset.
The outbreak or event-related taxa are colored red. (A) Listeria monocytogenes, (B) Escherichia coli, (C) Salmonella enterica, (D) Campylobacter jejuni, (E) simulated dataset.
Table 4. Benchmark datasets.
The key features of the four empirical and one simulated dataset are summarized in this table.
| Dataset | Number of isolates | Epidemiologically linked isolates | Reference genome | Dataset type | Reference |
| Stone Fruit Food recall | 31 | 28 | CFSAN023463 | Empirical | PMID: 27694232 |
| Spicy Tuna outbreak | 23 | 18 | CFSAN000189 | Empirical | PMID: 25995194 |
| Raw Milk Outbreak | 22 | 14 | D7331 | Empirical | |
| Sprouts Outbreak | 10 | 3 | 2011C-3609 | Empirical | |
| Simulated outbreak | 23 | 18 | CFSAN000189 | Synthetic | Simulated dataset based off the |
Notes.
Number of Isolates: total number of isolates in the dataset.
Epidemiologically linked isolates: number of isolates implicated in the recall or outbreak.
Reference genome: suggested reference genome for SNP analysis.