Lee S Katz, Taylor Griswold, Amanda J Williams-Newkirk, Darlene Wagner, Aaron Petkau, Cameron Sieffert, Gary Van Domselaar, Xiangyu Deng, Heather A Carleton.
Abstract
Modern epidemiology of foodborne bacterial pathogens in industrialized countries relies increasingly on whole genome sequencing (WGS) techniques. As opposed to profiling techniques such as pulsed-field gel electrophoresis, WGS requires a variety of computational methods. Since 2013, United States agencies responsible for food safety, including the CDC, FDA, and USDA, have been performing WGS on all Listeria monocytogenes found in clinical, food, and environmental samples. Each year, more genomes of other foodborne pathogens such as Escherichia coli, Campylobacter jejuni, and Salmonella enterica are being sequenced. Comparing thousands of genomes across an entire species requires a fast method with coarse resolution; however, capturing the fine details of highly related isolates requires a computationally heavy and sophisticated algorithm. Most L. monocytogenes investigations employing WGS depend on being able to identify an outbreak clade whose inter-genomic distances are less than an empirically determined threshold. When a difference of only a few single nucleotide polymorphisms (SNPs) can help distinguish between genomes that are likely outbreak-associated and those that are less likely to be associated, we require a fine-resolution method. To achieve this level of resolution, we have developed Lyve-SET, a high-quality SNP pipeline. We evaluated Lyve-SET by retrospectively investigating 12 outbreak data sets along with four other SNP pipelines that have been used in outbreak investigation or similar scenarios. To compare these pipelines, several distance and phylogeny-based comparison methods were applied, which collectively showed that multiple pipelines were able to identify most outbreak clusters and strains.
Currently in the US PulseNet system, whole genome multi-locus sequence typing (wgMLST) is the preferred primary method for foodborne WGS cluster detection and outbreak investigation due to its ability to name standardized genomic profiles, its central database, and its ability to be run in a graphical user interface. However, creating a functional wgMLST scheme requires extended up-front development and subject-matter expertise. When a scheme does not exist or when the highest resolution is needed, SNP analysis is used. Using three Listeria outbreak data sets, we demonstrated the concordance between Lyve-SET SNP typing and wgMLST. Availability: Lyve-SET can be found at https://github.com/lskatz/Lyve-SET.
Keywords: SNP pipeline; bacterial pathogen; foodborne; genomic epidemiology; outbreak; wgMLST
Year: 2017 PMID: 28348549 PMCID: PMC5346554 DOI: 10.3389/fmicb.2017.00375
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
Features of Lyve-SET.
| Feature | Description | Lyve-SET | kSNP | RealPhy | Snp-Pipeline | SNVPhyl |
| Repeat detection | Detection of repeat elements that could confound SNP results | 0 | 0 | 0 | 0 | 1 |
| Auto-choose reference or reference-free | Independence of a reference genome or a user-defined reference genome to find SNPs | 0 | 1 | 1 | 0 | 0 |
| Removal of distant genomes | Removal of genomes from analysis when they are greater than a certain threshold of SNPs | 0 | 0 | 0 | 1 | 0 |
| Phage detection | Detection and masking of phages | 1 | 0 | 0 | 0 | 0 |
| Cliff detection | Detection and masking of cliffs | 1 | 0 | 0 | 0 | 0 |
| SNP cluster detection | Detection and masking of clustered SNPs | 1 | 0 | 0 | 1 | 1 |
| Read cleaning | Cleaning and trimming of raw reads | 1 | 0 | 0 | 0 | 0 |
| BAM file for each individual genome | Standardized BAM files that describe the locations of mapped reads | 1 | 0 | 1 | 1 | 1 |
| VCF file for each individual genome | Standardized VCF files that describe the locations of SNPs and evidence supporting them | 1 | 1 | 1 | 1 | 1 |
| Pooled VCF file | Standardized VCF file that describes the locations of all SNPs for all genomes in a single file. This file is created with the | 1 | 0 | 0 | 1 | 0 |
| Fasta alignment of all sites | Standardized fasta file of all sites across the reference genome, whether they are invariant or SNP sites | 1 | 0 | 1 | 0 | 1 |
| Fasta alignment of SNPs | Standardized fasta file of SNP sites | 1 | 1 | 1 | 1 | 1 |
| Standardized tree file | File representing the phylogeny in a standardized format, e.g., Newick | 1 | 1 | 1 | 0 | 1 |
| Settings for different species | Does the pipeline have customizable settings for different species? Lyve-SET has customized settings using the– – | 1 | 0 | 0 | 0 | 0 |
| Audit trail: repeatability | Displays the path to the SNP pipeline installation and the exact command to repeat the analysis. Lyve-SET provides the command and all explicit and implicit options | 1 | 0 | 0 | 1 | 1 |
| Automated quality control | Reviews the analysis results and describes low-quality results. This quality control can be a review of the length of the multiple sequence alignment, the number of positions masked in each genome, or simply reviewing something minor like the insert length of each genome. Lyve-SET encompasses this quality control step in | 1 | 0 | 0 | 1 | 1 |
Although Lyve-SET does not have repeat detection, it does not allow the short-read mapper to place reads where they map equally well in two locations, i.e., repeat regions. SNVPhyl can perform the same function but also straightforwardly identifies repeat regions in the reference genome.
Although kSNP does not have SNP cluster detection directly, its fundamental algorithm prohibits any two SNPs from occurring within k-1 bp of each other, where k is the length of the kmer. For example, with a kmer length of 5, two SNPs must occur at least 4 bp from each other.
Features of Lyve-SET are shown alongside those of the other SNP pipelines compared in this study. “1” indicates the feature is present; “0” indicates that the feature is absent. A comparison of software-level features, e.g., command-line vs. web interface, has already been performed in Petkau et al.
Figure 1. The Lyve-SET workflow. Starting from the top left, reads are generated from a single query genome and then compared against a reference genome. Starting from the top right, other genomes are being generated and compared against the reference genome simultaneously. The order is (1) sequence query genome; (2) obtain a reference genome; (3) discover SNPs in a comparison against the reference genome; (4) combine SNP profiles into (5) a SNP matrix. In the bottom portion, the SNP matrix is interrogated for low-quality sites including those that are invariant or semi-invariant (those with masked or reference alleles). The matrix is also interrogated for clustered SNPs, i.e., those that appear too close to each other. After the SNP matrix is queried and filtered, Lyve-SET obtains high-quality SNPs which are then used for creating a phylogeny. The larger, unfiltered multiple sequence alignment is used to calculate pairwise distances which can be used in a comparison, e.g., a heat map.
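The filtering stage described in the workflow can be illustrated with a minimal Python sketch. It is not Lyve-SET's implementation: the function names, the use of "N" as the masked allele, the `min_gap` parameter, and the toy sequences are all assumptions made for illustration. It drops invariant and semi-invariant sites, removes SNPs that sit too close to a neighbor, and counts pairwise differences on the unfiltered alignment.

```python
from typing import Dict, List

MASKED = "N"  # assumption: masked (low-quality) alleles are written as "N"


def informative_sites(reference: str, genomes: Dict[str, str]) -> List[int]:
    """Return 0-based positions where at least one genome carries an
    unmasked, non-reference allele (drops invariant and semi-invariant sites)."""
    sites = []
    for pos, ref_base in enumerate(reference):
        alleles = {seq[pos] for seq in genomes.values()}
        if alleles - {ref_base, MASKED}:
            sites.append(pos)
    return sites


def drop_clustered(sites: List[int], min_gap: int) -> List[int]:
    """Remove every site that lies closer than min_gap bp to a neighboring site."""
    kept = []
    for i, pos in enumerate(sites):
        near_prev = i > 0 and pos - sites[i - 1] < min_gap
        near_next = i + 1 < len(sites) and sites[i + 1] - pos < min_gap
        if not (near_prev or near_next):
            kept.append(pos)
    return kept


def pairwise_distance(a: str, b: str) -> int:
    """Count differing positions between two aligned sequences, skipping masked alleles."""
    return sum(1 for x, y in zip(a, b) if x != y and MASKED not in (x, y))


# Hypothetical toy alignment, already expressed in reference coordinates.
reference = "ACGTACGTAC"
genomes = {
    "g1": "ACGTACGTAC",  # identical to the reference
    "g2": "ATGTACGTAC",  # SNP at position 1
    "g3": "ACGTANGTAT",  # masked at position 5, SNP at position 9
    "g4": "ACGTACCAAC",  # adjacent SNPs at positions 6 and 7
}

sites = informative_sites(reference, genomes)
hq_sites = drop_clustered(sites, min_gap=2)
print(sites, hq_sites)
```

In this toy case position 5 is discarded as semi-invariant because it holds only reference and masked alleles, and positions 6 and 7 are discarded as a cluster. The threshold that counts as "too close" is a tunable choice in this sketch, not a value taken from the source.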
Presets for Lyve-SET.
| lambda | |
| vibrio_cholerae | |
| listeria_monocytogenes | |
| salmonella_enterica | |
| escherichia_coli | |
| clostridium_botulinum |
Some settings that have been empirically determined are stored in a configuration file in the Lyve-SET package. Individual users can revise these settings in the file presets.conf. For many species, such as Campylobacter jejuni, we have not yet determined optimal preset options. In the future, however, these settings could be extended or revised following observations made in the course of outbreak investigations. In each Lyve-SET run, these settings and their values are displayed in the log file, whether or not they were explicitly defined and whether or not the preset configurations were explicitly called.
List of outbreaks.
| Outbreak | Isolates (outbreak-associated, non-outbreak-associated, unknown status) | Reference | |
| 1308MDGX6-1 | 39, 7, 0 | Chen et al., submitted | |
| 1408MLGX6-3WGS | 19, 64, 1 | Jackson et al., | |
| 1411MLGX6-1WGS | 28, 16, 0 | CDC, | |
| 1504MLEXH-1 | 17, 2, 0 | Tataryn et al., | |
| 1405WAEXK-1 | 6, 4, 4 | CDC, | |
| 1407MNEXD-1 | 6, 10, 1 | Health MDo, | |
| 1203NYJAP-1 | 55, 8, 0 | Hoffmann et al., | |
| 1409MLJN6-1 | 9, 29, 0 | N/A | |
| 1410MLJBP-1 | 5, 10, 0 | N/A | |
| 0810PADBR-1 | 14, 111, 0 | Marler-Clark, | |
| 1509VTDBR-1 | 8, 8, 0 | N/A | |
| 1602VTDBR-1 | 6, 10, 0 | N/A |
Isolate counts are given as the number of isolates associated with the outbreak, the number of isolates not associated with the outbreak, and the number of isolates with unknown status. Those with unknown status were not used in calculations for tree sensitivity and specificity.
Each outbreak is shown with counts of outbreak-associated and non-outbreak-associated isolates.
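As a rough illustration of how tree sensitivity and specificity could be computed from such counts, the Python sketch below uses the standard definitions, Sn = TP / (TP + FN) and Sp = TN / (TN + FP), and treats membership in a candidate outbreak clade as the prediction. The exact definitions used in the study are not given in this record, so the function `tree_sn_sp`, the status labels, and the isolate names are assumptions. Isolates with unknown status are skipped, as the table note states.

```python
from typing import Dict, Set, Tuple


def tree_sn_sp(status: Dict[str, str], clade: Set[str]) -> Tuple[float, float]:
    """Compute (sensitivity, specificity) for a candidate outbreak clade.

    status maps isolate name -> "outbreak", "non-outbreak", or "unknown";
    clade is the set of isolates that the tree places in the outbreak clade.
    """
    tp = fn = tn = fp = 0
    for isolate, label in status.items():
        if label == "unknown":
            continue  # unknown-status isolates are excluded from the calculation
        in_clade = isolate in clade
        if label == "outbreak":
            tp += in_clade
            fn += not in_clade
        else:
            fp += in_clade
            tn += not in_clade
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical example: 3 outbreak, 4 non-outbreak, and 1 unknown isolate.
status = {
    "iso1": "outbreak", "iso2": "outbreak", "iso3": "outbreak",
    "iso4": "non-outbreak", "iso5": "non-outbreak",
    "iso6": "non-outbreak", "iso7": "non-outbreak",
    "iso8": "unknown",
}
clade = {"iso1", "iso2", "iso3", "iso4", "iso8"}  # one non-outbreak isolate falls inside

sn, sp = tree_sn_sp(status, clade)
print(f"Sn = {sn:.1%}, Sp = {sp:.1%}")
```

The sketch assumes at least one outbreak and one non-outbreak isolate with known status; otherwise the divisions would fail.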
Summary of 12 pipeline comparisons.
| Measure | Lyve-SET | kSNP | RealPhy | Snp-Pipeline | SNVPhyl | wgMLST |
| Tree sensitivity (Sn) | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| Tree specificity (Sp) | 100.0% | 90.2% | 100.0% | 100.0% | 100.0% | 100.0% |
| Average of Sn and Sp | 100.0% | 95.1% | 100.0% | 100.0% | 100.0% | 100.0% |
| Kendall-Colijn (λ = 0) | – | 1.26E-02 | 7.51E-03 | 9.28E-03 | 9.15E-02 | 1.00E-04 |
| Robinson-Foulds | – | 3.16E-69 | 6.79E-40 | 5.39E-74 | 9.61E-49 | 1.55E-147 |
| Mantel | – | 0.60 | 0.77 | 0.77 | 0.79 | 0.74 |
| SNP ratio | – | 0.53, 0.78 | 0.97, 0.84 | 1.61, 1.75 | 0.67, 0.84 | 0.69, 0.72 |
| Goodness-of-fit (R²) | – | 0.46, 0.42 | 0.7, 0.75 | 0.77, 0.3 | 0.83, 0.68 | 0.75, 0.72 |
| Genome analyzed | 25.9% | 0.1% | 84.8% | 0.3% | 82.1% | 88.2% |
Average percentage from 11 outbreaks. The S. enterica outbreak 1203NYJAP-1 was removed from the Sn and Sp calculations as an outlier because all pipelines except wgMLST produced errors when grouping outbreak vs. non-outbreak isolates.
Geometric mean.
Number of SNPs per Lyve-SET SNP, averaged across 12 outbreaks. For wgMLST, this is the number of alleles per Lyve-SET SNP.
The average for 12 outbreaks. First value is for all data points; second value is for distances between only outbreak-associated genomes.
The average for 12 outbreaks. Percentage of the reference genome included for analysis. For wgMLST, the average percentage was calculated by obtaining each GenBank-formatted file with annotated wgMLST loci and calculating the breadth of coverage for all loci.
More information can be found in Data Sheets .
Figure 2. Scatterplot of all pairwise distances. Regression analysis of all pipelines compared with Lyve-SET. Outbreaks are shown in clockwise order from the top-left as those caused by L. monocytogenes, S. enterica, C. jejuni, and E. coli. Pairwise distances between genomes are plotted for Lyve-SET (x-axis) and other pipelines (y-axis). For each species, three outbreaks have been combined into one scatterplot. A trend line was calculated using regression analysis, and a y = mx+b formula is displayed accordingly with the goodness-of-fit (R²) value. The y = mx+b formula describes the trend line, where the slope m is the number of hqSNPs per Lyve-SET hqSNP and the intercept b is the number of hqSNPs when there are no Lyve-SET hqSNPs. All four pipelines are compared against Lyve-SET, and each panel shows one of the four species.
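The trend line and goodness-of-fit value in the caption can be reproduced in principle with ordinary least squares. The following Python sketch is an assumption about the kind of regression used, not the authors' analysis code; the function name `fit_line` and the distance values are hypothetical.

```python
from typing import List, Tuple


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float, float]:
    """Ordinary least-squares fit of y = m*x + b; returns (m, b, R^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    m = sxy / sxx                 # slope: other-pipeline SNPs per Lyve-SET SNP
    b = mean_y - m * mean_x       # intercept: distance when Lyve-SET reports zero
    ss_res = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot      # coefficient of determination
    return m, b, r2


# Hypothetical pairwise distances for the same genome pairs in two pipelines.
lyve_set_dist = [0, 2, 4, 10, 20]
other_dist = [1, 2, 3, 6, 11]     # constructed to lie on y = 0.5x + 1

m, b, r2 = fit_line(lyve_set_dist, other_dist)
print(f"y = {m:.2f}x + {b:.2f}, R^2 = {r2:.2f}")
```

A slope below 1 would mean the other pipeline reports fewer differences than Lyve-SET for the same genome pairs, which is how the SNP ratio row of the summary table can be read.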
wgMLST compared against Lyve-SET for outbreaks of L. monocytogenes.
| Kendall-Colijn (λ = 0) | 1E-999 | 1E-999 | 1E-4 |
| Robinson-Foulds | 5.53E-108 | 1.76E-263 | 3.79E-71 |
| Mantel | 0.74 | 0.73 | 0.75 |
| Correlation coefficient | 0.64 | 0.73 | 0.70 |
| Trend line | 0.64 | 0.77 | 0.84 |
Figure 3. Scatterplot of wgMLST against Lyve-SET. As in Figure 2, a scatterplot was generated using all allelic distances from wgMLST and SNP distances from Lyve-SET, but only for the three L. monocytogenes outbreak clusters. The top-left plot shows all pairwise distances; the top-right limits the data points to those with <255 SNPs; the bottom-left limits the data points to those with <100 SNPs. For this analysis, in cluster 1408MLGX6-3WGS, PNUSAL001994 was removed as an outlier because most of its data points are zero hqSNPs in contrast to >30 alleles.