Literature DB >> 34730485

Evaluation of whole-genome sequencing-based subtyping methods for the surveillance of Shigella spp. and the confounding effect of mobile genetic elements in long-term outbreaks.

Isabelle Bernaquez¹, Christiane Gaudreau^2,3, Pierre A Pilon^4,5, Sadjia Bekal^1,3.

Abstract

Entities: Chemical

Keywords: SNVPhyl; Shigella; mobile genetic elements; outbreak; whole-genome sequencing; whole/core genome MLST

Mesh：

Year: 2021 PMID： 34730485 PMCID： PMC8743557 DOI： 10.1099/mgen.0.000672

Source DB: PubMed Journal: Microb Genom ISSN： 2057-5858

× No keyword cloud information.

Data Summary

The fastq reads from all sequences analysed in this article are available at the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA), available at https://www.ncbi.nlm.nih.gov/sra under BioProject PRJNA639996. BioSample accession numbers for and can be found in Table S1 (available in the online version of this article). The implementation of routine whole-genome sequencing (WGS) for bacterial pathogens has significantly improved international surveillance and outbreak detection in public health laboratories. Continuous efforts have been directed towards subtyping method improvement and development, as well as genomic interpretation guideline definitions for cluster detection. However, most studies have focused on foodborne pathogens and outbreaks, limiting the use of the established WGS interpretation guidelines for long-term outbreaks (i.e. continuous person-to-person transmissions). As these types of outbreaks were the dominant sources of and cases in the province of Quebec, this study provides an insight into the possible variability of interpretation criteria recommended for whole-genome multi-locus sequence typing (wgMLST). Our evaluation of wgMLST, which considers genetic variations in the accessory genome, against a robust core SNV typing pipeline and alternative core genome multi-locus sequence typing-based methods, highlights the need of validating species-specific subtyping methods based on microbial genome plasticity and outbreak dynamics in addition to the importance of filtering confounding mobile genetic elements (MGEs) for cluster detection. We also demonstrate the use of minimum spanning genetic distances as an alternative measure for WGS interpretation guidelines that can be applied to outbreaks of different transmission natures.

Data statement

The authors confirm that all supporting data, code and protocols have been provided within the article or through supplementary data files. Two supplementary tables and five supplementary figures are available with the online version of this article and can be found at https://doi.org/10.6084/m9.figshare.14757978.v1 [1].

Introduction

spp. are facultative intracellular pathogens transmitted solely through the faecal–oral route and are the causative agents of shigellosis. While the rate of reported shigellosis cases in Canada has been steadily decreasing over time [2], and cases in large cities have become prevalent due to extensive sexual transmission networks involving men who have sex with men (MSM) [3-7], in addition to periodic food-, water- and travel-associated outbreaks in the general population [8-12]. Similar MSM-related outbreaks have also been documented globally and have been described as potential public health concerns due to their widespread resistances to antimicrobial treatments [13-18], occurrence of life-threatening invasive infections in patients co-infected with HIV [5, 6] and potential to spill into other at-risk communities. At any given time, several strains of the same pathogen can circulate through a population, making outbreak investigations more challenging. Standardized, epidemiologically concordant, and highly discriminatory subtyping methods are thus needed to differentiate between sporadic and outbreak-related cases and to detect source attributions, amongst other interests [19]. In fact, PulseNet International (PNI), an international public health laboratory network [20], has started implementing whole-genome sequencing (WGS) in order to enhance their surveillance power, as it can provide the absolute maximum resolution profile possible for any organism [21]. Currently, a gene-by-gene approach, specifically, a whole-genome multi-locus sequence typing (wgMLST) method has been chosen as the official replacement for standardized foodborne infectious disease surveillance and outbreak response by PNI [21]. The wgMLST method detects allelic variations (e.g. point mutations, indels and small translocations) and infers phylogeny based on a large pan-genomic scheme of 17 380 loci, where each allele combination profile corresponds to a unique sequence type (ST). However, other high-quality WGS-based subtyping methods exist, by which public health laboratories may find useful to support challenging investigations. Such methods include core genome MLST-based approaches [e.g. EnteroBase core genome MLST (coreMLST) and core genome MLST (cgMLST)] and single-nucleotide variant (SNV)-based approaches (e.g. high-quality core genome SNV typing by SNVPhyl) [22-26]. cgMLST uses the same pan-genomic loci scheme as wgMLST but only infers phylogeny based on loci present in a user-defined percentage (e.g.≥95 %) of the isolates under comparison. A second core genome MLST method was included in this study and differs from cgMLST by its use of the curated EnteroBase scheme (https://enterobase.warwick.ac.uk) consisting of 2513 highly conserved / loci [27]. Similar core genome schemes have been shown to be standardized and scalable for inter-laboratory comparisons for other enteric pathogens [25, 28, 29]. Alternatively, SNVPhyl detects common high-quality SNVs from a reference-based mapping approach, which stands in contrast to the MLST-based subtyping methods through its capacity to include intergenic sequence variants [30]. Similar core genome SNV-based approaches are currently used for routine surveillance by Public Health England and the Centers for Disease Control and Prevention in the USA. According to PNI, the wgMLST approach was selected based on its ability to integrate into routine public health surveillance, and to maintain real-time outbreak detection and inter-laboratory comparability [21]. However, an internal validation is required to verify the performance of any new method using a well-defined set of isolates and outbreaks. PulseNet Canada (PNC) has made progress towards the implementation of WGS for foodborne pathogen surveillance [22-26]. Studies on the top priority pathogens – , O157:H7 and common serovars (e.g. Heidelberg, Enteritidis and Typhimurium) – have determined that most single-strain outbreaks are contained within a 0–10 allele and/or high-quality core SNV (hqSNV) diversity [23–25, 31, 32]. Similar performance assessment studies are now needed to determine the genetic interpretation guidelines for spp. outbreak detections. However, while spp. are phylogenetically considered [33], we hypothesize that existing WGS-based interpretation criteria might not be suitable for all -related outbreaks due to differences in epidemiology, environmental pressures and genomic plasticity. This diversity is thought to be due to the higher level of genetic variation caused by prolonged continuous transmission patterns, bottleneck-derived stochastic effects due to a low infectious dose [34, 35] and frequent use of antibiotics for other sexually transmitted diseases by MSM [13, 36], in addition to an inherent higher incidence of mobile genetic elements [37, 38], facility of horizontal gene transfer [39, 40] and dynamic prophage acquisition and loss [41] through time and across hosts. In this study, four WGS-based subtyping methods (SNVPhyl, coreMLST, cgMLST and wgMLST) were evaluated based on discriminatory power, epidemiological concordance, compatibility, robustness and ability to generate stable WGS interpretation criteria for outbreak investigations using 9 foodborne-, travel- and MSM-related retrospective outbreaks of a collection of 91 . and 232 . isolates collected from the province of Quebec. We also aimed to define the number of allele or hqSNV differences regularly observed between outbreak-related isolates from different transmission natures (e.g. point source vs person-to-person transmission). These genetic differences will offer a preliminary guideline for public health officials to delimit outbreaks based on WGS data and to conduct effective routine surveillance of and in different contexts. The location of the genetic differences identified was also explored to determine the extent of genetic polymorphisms detected on mobile genetic elements (MGEs). This study therefore contributes to the ongoing validation of WGS-based subtyping methods for bacterial pathogen surveillance in Canada.

Methods

Isolate characterization and dataset selection

Bacterial isolates from spp. cases in Quebec were sent to the Laboratoire de santé publique du Québec (LSPQ) in Sainte-Anne-de-Bellevue, QC for serotyping and molecular typing. Species identification and serotyping was performed at the LSPQ using slide agglutination (Denka Seiken Co., Ltd, Coventry, UK). species were also confirmed by bioinformatic analysis through the ecTyper pipeline (https://github.com/phac-nml/ecoli_serotyping). Clinical cases of and were selected for analysis from the province of Quebec. Nine outbreaks involving three or more epidemiologically supported cases with a suspected source of infection were chosen for further analysis. Some subclusters were defined inside the large outbreaks to feature hypothetical transmission routes. Outbreaks and subclusters were delimited based on the SNVPhyl phylogenetic tree topology and distance metrics due to the recognition of SNV-based subtyping methods as strong subtyping tools in public health [22, 42], along with epidemiological evidence and temporality. Epidemiological curves for each outbreak clade can be observed in Fig. S1. Isolates with no epidemiological evidence available were designated as non-documented cases (NDCs). NDCs were included in the outbreaks when they clustered concordantly with the other epidemiologically supported isolates. Isolates clearly clustering outside of the nine well-characterized outbreaks were considered sporadic. Sporadic cases distanced by more than 100 alleles of the selected outbreaks via wgMLST were excluded for tree simplicity. The datasets included four outbreaks (SF-01 to SF-04), where two large MSM-related outbreaks (SF-01 and SF-02) were also further dissected into subclusters designated with letters (a through d). The dataset included five outbreaks (SS-01 to SS-05), where two large MSM-related outbreaks (SS-03 and SS-05) were also further divided into subclusters (a through d). Outbreaks and subclusters were characterized as ‘MSM-related’ when most cases were adult male with the presence of male cases with epidemiological data describing recent same-sex sexual relationships. All other outbreaks and subclusters are sometimes referred to as ‘non-MSM’ in this study. The final isolate collections of (n=91) and (n=232) used in this study are listed in Table S1 along with their BioProject and BioSample accession numbers.

WGS and quality control

spp. isolate sequencing was performed at the National Microbiology Laboratory (NML) in Winnipeg, MB. Sequencing protocols were followed as described previously [24]. The quality of the raw sequence reads and draft assemblies was evaluated using FastQC and BioNumerics v7.6. FastQC was incorporated within the Integrated Rapid Infectious Disease Analysis (IRIDA) platform to determine the average genome coverage and to check the basic sequencing quality metrics [43]. Only genomes with ≥40× coverage based on a 5 000 000 bp genome size were used in this study. The reads were then de novo-assembled using the SPAdes assembler (min contig size=1000) in BioNumerics v7.6 software [44]. Six quality metrics were also verified following -specific PNC guidelines: AvgQuality ≥30, N50 ≥18 000, NrContigs <500, length 4.4 to 5.0 Mb, NrConsensus (congruent allele calls by both assembly-based and assembly-free algorithms) >3300 and CorePercent ≥90 %.

Phylogenomic analysis

SNV-based subtyping

Reference-based core genome SNV detection was performed using the Single Nucleotide Variant Phylogenomics (SNVPhyl) pipeline v.1.1 [30] integrated within the NML Galaxy platform [45]. Reference chromosomes of serotype 2 a strain 301 (NC_004337.2) and Ss046 (NC_007384.1) were used for mapping the reads of all and isolates, respectively. PHASTER [46] and Island Viewer 4 [47] were used to identify prophage and genomic island sequences in the reference chromosomes for mapping exclusion. Pipeline parameters were followed as per the PNC guidelines and as described elsewhere [24]. Both vcf2core outputs were verified for ≥85 % of valid and included positions in the core genome prior to PHASTER and IslandViewer 4 masking. A script [48] was used to supplement the maximum-likelihood (ML) phylogenetic tree outputs from SNVPhyl with the number of genetic differences associated for each branch for its visualization in FigTree v1.4.3 [49]. A second script [50] was used to generate the proper input files to visualize the clusters generated by SNVPhyl by constructing minimum-spanning (MS) trees in PHYLOViZ v2.0 with the goeburst algorithm [51]. This script generates a profile table of each unique SNV site combination and assigns a number for each profile (i.e. sequence type) and also generates a table summarizing the profiles identified for each input isolate. The pairwise hqSNV distances were available in the snvMatrix output.

Allele-based subtyping

Three multi-locus sequence typing (MLST) methods were performed by the BioNumerics v7.6 software as described elsewhere [24]. MS and ML trees for categorical data were also constructed as follows. (a) EnteroBase core genome MLST (coreMLST) was performed by using the E. coli/Shigella-specific EnteroBase scheme consisting of 2513 core loci (https://enterobase.warwick.ac.uk). (b) Core genome MLST (cgMLST) was performed using the complete E. coli/Shigella-specific loci scheme consisting of 17 380 loci. However, loci used for analysis were required to be present in ≥95 % of the isolates under comparison. (c) Whole-genome MLST (wgMLST) was also performed using the E. coli/Shigella-specific loci scheme consisting of 17 380 loci. Any loci present in at least one isolate were included in the wgMLST analysis. When multiple consensus allele calls were identified for a single locus, only the allele call with the lowest allele identification number was tabulated via BioNumerics' default settings [27]. Hierarchical clustering was performed using the categorical (values) similarity coefficient measured as distances, which summarizes the pairwise allele differences between isolates by the following equation: categorical (values)=(n total/n common)×n differences, where n total is the total number of loci present in the comparison of the total 91 . or 232 . isolates under study, n common is the number of loci that are present in the pair of isolates under examination and n differences is the number of common loci with different allele calls in the same pair of isolates. All MLST distances reported in this study were the result of this equation and were rounded to their closest integer unless stated otherwise.

Discriminatory power and statistical analysis

Each subtyping method’s ability to differentiate isolates and consequently assign different subtypes was evaluated using Simpson’s diversity index (DI) [52]. The STs generated by SNVPhyl were resolved through the same script described in the previous section [50]. All STs from the MLST-based subtyping methods were identified through visualization of the MS trees, where all isolates included in a single node were given the same ST. The 95 % confidence intervals and P-values were calculated using the jackknife pseudo-values method to indicate significant differences in discriminatory power [53]. All indices, intervals and P-values were calculated using the Comparing Partitions website (www.comparingpartitions.info). A P-value ≤0.05 was considered as statistically significant.

Genetic distance comparison and linear regressions

Pairwise MLST-based allelic differences (y-axis) for each outbreak were plotted against their respective hqSNV difference (x-axis). Pairwise differences involving sporadic isolates and NDCs were excluded in addition to inter-outbreak differences. A simple linear regression analysis was added for each MLST-based subtyping method to produce a best-fit line, its formula (y=mx+b) and its variability (R 2) via GraphPad Prism 8. The slope of the trendline (m) is indicative of the number of MLST-based allele differences found per hqSNV identified by SNVPhyl. Pearson correlation coefficients were also calculated via GraphPad Prism 8 in order to compare genetic distance correlations between all approaches for both species.

High-quality SNV and MLST loci locations and classification

The location of the hqSNVs identified by the SNVPhyl pipeline was determined by a third script [54], which mapped the position of the hqSNVs to the GenBank file of the reference genome used by SNVPhyl and then generated a table summarizing the features (e.g. CDS, miscellaneous feature, intergenic, mobile element, RNA, etc.) and gene products associated with each hqSNV position. Miscellaneous features were only present in the reference genome (NC_004337.2). IS elements were identified via the presence of transposase genes. The loci corresponding to allele differences in BioNumerics were resolved by exporting the character data matrix for each dataset. The BioNumerics loci were divided into six categories according to their genome location, function, number of detected alleles per locus and number of mutations between different allele calls: chromosome, chromosome repeats, plasmid mutations, plasmid copies, IS element-related and prophage-related. The wgMLST quality assessment tool in BioNumerics enabled the identification of repeated loci and the number of mutations between allele calls. Loci with multiple allele calls for a single locus found in contigs of chromosomal origin were considered to be ‘chromosome repeats’. Plasmid-related loci were classified on a case-by-case basis. MOB-recon was used to extract and type the plasmids from the BioNumerics SPAdes draft assemblies of key outbreak-related isolates. Each contig of plasmid origin is given a cluster code based on its similarity to a reference plasmid ‘type’ assigned by MOB-cluster. MOB-typer was used to predict conjugation potential [55]. The default minimum e-value threshold (1e-05) and minimum sequence identity (80%) were used for blastn. Generally, loci considered as ‘plasmid mutations’ represent allele differences with ≤2 mutations encoded on the same plasmid cluster, which would be indicative of the vertical evolution of plasmid sequences. Loci with >2 mutations were considered ‘plasmid copies’ whether or not these loci were present on the same plasmid cluster and were indicative of HGT events. Loci with multiple allele calls were also found in contigs of plasmid origin, however, these were not easily distinguishable from ‘plasmid copies’ and were thus grouped according to the number of mutations between allele calls as mentioned above. IS elements were identified via the presence of transposase genes. PHASTER [46] was used to identify prophage-related sequences in the draft assemblies, but only the loci mapped to intact prophages were categorized as ‘prophage-related’. MLST locus locations were not mutually exclusive. Therefore, a binning approach was followed based on the average sequence length and mobility range (chromosome

Robustness comparisons

The variability of all four WGS-based subtyping methods was evaluated by comparing the intra-outbreak genetic distances found through the use of the complete collection of species-specific isolates and then isolating each outbreak and subcluster as distinct inputs. These two analyses should represent both extremes of distance range variability and therefore indicate their potential effect on future interpretation guidelines. Box and Tukey whisker plots of the pairwise genetic distances for all outbreaks and subclusters from both datasets were generated via GraphPad Prism 8. Phylogenetic trees were visually compared to detect major differences in clustering.

Results

Phylogenomic comparisons between SNVPhyl and MLST-based subtyping methods

The ML trees generated via SNVPhyl for and and used as the basis of outbreak and subcluster delimitations are shown in Figs S2 and S3, respectively. Visual comparisons of the MS trees shown in Figs S4 and S5 generated by the four WGS-based subtyping methods for and , respectively, concluded that all outbreaks (indicated by 1 through 5) and most subclusters (indicated by a through d) delimited via SNVPhyl were similarly grouped by the MLST-based approaches. However, several single isolates seem to cluster separately from their other subcluster-related isolates (SS-03b and SS-05b) via wgMLST with relatively large distances and are indicated by W, X, Y and Z in Fig. S5. Following these phylogenetic discrepancies observed via wgMLST, the extent of wgMLST’s correlation to the core genome-based subtyping methods was evaluated. For each spp. dataset, all outbreak-related pairwise genetic distances of the three MLST-based subtyping methods were combined and plotted against their respective hqSNV distances in the scatter plot illustrated in Fig. 1. The amount of allele differences identified by all MLST-based subtyping methods for was consistent in comparison to the number of hqSNVs located via SNVPhyl as the R 2 were all above 0.90. These results correlate with the phylogenetic tree similarities for in Fig. S4. The slopes of the linear regressions were given as follows: coreMLST (0.74)

Fig. 1.

Scatterplot of all outbreak-related pairwise differences of coreMLST, cgMLST and wgMLST against SNVPhyl for (a) and (b).

Comparison of the subtyping methods’ discriminatory power

Visual comparison of the MS trees also clearly indicates an increase in discriminatory power due to the reduced size of the nodes notably observed for cgMLST and wgMLST, increasing the number of distinct STs. To assess the capacity of each subtyping method to differentiate between the isolates, the discriminatory powers were measured using Simpson’s DI on the STs generated. This index calculates the average probability that a typing system will assign a different type to two randomly selected isolates in the collection under study. Table 1 indicates this measure along with ST data and P-values between confidence intervals (CIs).

Table 1.

Summary of the SNVPhyl, coreMLST, cgMLST and wgMLST subtyping results, diversity indexes and statistical significances for the complete 91 . and 232 . isolate collections

Species	Method	No. of hqSNV locations/loci*	No. of ST	Size of largest ST	Simpson’s DI	Jackknife pseudo-values 95 % CIs	P-values between jackknife pseudo-values 95 % CIs of Simpson’s DIs†
Species	Method	No. of hqSNV locations/loci*	No. of ST	Size of largest ST	Simpson’s DI	Jackknife pseudo-values 95 % CIs	SNVPhyl	coreMLST	cgMLST
S. flexneri (n=91)	SNVPhyl	2717	76	5	0.995	(0.990–0.999)
	coreMLST	2511	63	5	0.987	(0.980–0.994)	0.039
	cgMLST	3398	86	2	0.999	(0.997–1.000)	0.060	<0.001
	wgMLST	4051	89	2	1.000	(0.999–1.000)	0.036	<0.001	0.213
S. sonnei (n=232)	SNVPhyl	1072	128	35	0.970	(0.957–0.984)
	coreMLST	2511	110	28	0.970	(0.960–0.980)	0.976
	cgMLST	3613	196	4	0.998	(0.997–0.999)	<0.001	<0.001
	wgMLST	4426	209	4	0.999	(0.998–1.000)	<0.001	<0.001	0.013

*Number of sites used to generate phylogeny by SNVPhyl. Number of alleles included in the exported categorical data matrixes from BioNumerics.

†Significant differences in discriminatory power are indicated in green (P-value <0.05 and non-overlapping CIs).

CG, core genome; CI, confidence interval; DI, diversity index; hqSNV, high-quality single-nucleotide variant; MLST, multi-locus sequence typing; SNVPhyl, single-nucleotide phylogenomics pipeline; ST, sequence type; WG, whole genome.

Summary of the SNVPhyl, coreMLST, cgMLST and wgMLST subtyping results, diversity indexes and statistical significances for the complete 91 . and 232 . isolate collections Species Method No. of hqSNV locations/loci* No. of ST Size of largest ST Simpson’s DI Jackknife pseudo-values 95 % CIs P-values between jackknife pseudo-values 95 % CIs of Simpson’s DIs† SNVPhyl coreMLST cgMLST (n=91) SNVPhyl 2717 76 5 0.995 (0.990–0.999) coreMLST 2511 63 5 0.987 (0.980–0.994) 0.039 cgMLST 3398 86 2 0.999 (0.997–1.000) 0.060 <0.001 wgMLST 4051 89 2 1.000 (0.999–1.000) 0.036 <0.001 0.213 (n=232) SNVPhyl 1072 128 35 0.970 (0.957–0.984) coreMLST 2511 110 28 0.970 (0.960–0.980) 0.976 cgMLST 3613 196 4 0.998 (0.997–0.999) <0.001 <0.001 wgMLST 4426 209 4 0.999 (0.998–1.000) <0.001 <0.001 0.013 *Number of sites used to generate phylogeny by SNVPhyl. Number of alleles included in the exported categorical data matrixes from BioNumerics. †Significant differences in discriminatory power are indicated in green (P-value <0.05 and non-overlapping CIs). CG, core genome; CI, confidence interval; DI, diversity index; hqSNV, high-quality single-nucleotide variant; MLST, multi-locus sequence typing; SNVPhyl, single-nucleotide phylogenomics pipeline; ST, sequence type; WG, whole genome. As expected, the discriminatory power followed suit with the number of loci detected from the MLST-based subtyping methods as they were arranged as follows: coreMLST

Inter-outbreak genetic distances and relatedness to sporadic cases

All outbreaks were easily distinguished by all four WGS-based subtyping methods. The minimum inter-outbreak distances separating the distinct and outbreaks were 215 hqSNVs/143 core alleles/198 core genome (CG) alleles/211 WG alleles and 42 hqSNVs/30 core alleles/40 CG alleles/51 WG alleles, respectively. Not many sporadic isolates clustered closely (<100 WG alleles) to the outbreaks under study. The minimum distances between the and outbreak-related isolates to any isolate considered sporadic were 44 hqSNVs/32 core alleles/41 CG alleles/70 WG alleles and 31 hqSNVs/20 core alleles/27 CG alleles/36 WG alleles, respectively. NDCs not considered as outbreak-related were separated by at least 11 hqSNVs/5 core alleles/9 CG alleles/11 WG alleles and 3 hqSNVs/1 core allele/5 CG alleles/6 WG alleles from the and outbreaks under study, respectively. Additional epidemiological evidence would have been needed to support their inclusion as outbreak-related cases or exclusion as sporadic cases.

Intra-outbreak/subcluster diversity and WGS-based interpretation guidelines

Table 2 summarizes the characteristics of the outbreaks and subclusters under study along with their intra-outbreak distance ranges and largest minimum spanning distances. The intra-outbreak/subcluster distances indicate the range of pairwise genetic differences (hqSNVs or allele variants) observed between outbreak- and subcluster-related isolates, while the superior limits give a sense of the overall diversity observed within the outbreaks/subclusters. The largest minimum spanning distance illustrates the largest distance by which a single isolate differs from its neighbouring outbreak- or subcluster-related isolates according to the minimum spanning trees. For example, the three SS-05c-related isolates form a distinct subcluster in which they differ by at least 9 hqSNVs/6 core alleles/10 CG alleles/19 WG alleles from the other 117 isolates included in the SS-05 outbreak. This distance metric was included in this study in order to test an alternative interpretation criterion independent of the transmission nature, effective population size and duration of an outbreak. This was important since, naturally, prolonged outbreaks have more diversity caused by the constant accumulation of mutations in the outbreak strain, creating larger maximum intra-outbreak genetic distances, which has frequently been observed in the MSM community. Using a minimum spanning distance as a measure of relatedness would avoid such large distances between the first and last outbreak-related isolates and thus create a more stable guideline for public health decisions, assuming that the genetic variation correlates with the time of isolation from the index case and that a constant rate of cases is reported.

Table 2.

Species	Outbreaks/ Subclusters		Year	Serotypes	No. of sequenced isolates	Duration (days)*	Sex (M/F)	Age range	Epidemiology (frequency)	Intra-outbreak distance ranges†				Largest minimum spanning distances‡
Species	Outbreaks/ Subclusters		Year	Serotypes	No. of sequenced isolates	Duration (days)*	Sex (M/F)	Age range	Epidemiology (frequency)	SNVPhyl	coreMLST	cgMLST	wgMLST	SNVPhyl	coreMLST	cgMLST	wgMLST
S. flexneri		SF-01	2017–2019	2a and Y	31	1012	31/0	23–86	MSM (14)	0–37 (17)	0–30 (13)	1–41 (19)	1–53 (28)	23	16	23	26
		SF-01a	2018–2019	2a	4	289	4/0	23–67	MSM (4)+Travel (1)	1–6 (3.5)	0–2 (1)	1–5 (3)	1–8 (4)	4	1	3	5
		SF-01b	2019	2a	6	257	6/0	23–86	MSM (2)	0–4 (2)	0–4 (1)	1–5 (3)	2–7 (4)	3	3	3	3
		SF-01c	2019	2a	5	69	5/0	36–59	MSM (3)	0–3 (1.5)	0–2 (1)	2–6 (3)	2–8 (4.5)	2	1	3	4
		SF-01d	2018–2019	Y	7	322	7/0	29–78	MSM (1)+Travel (1)	0–10 (5)	0–9 (4)	2–16 (8)	5–18 (13)	5	4	8	8
		SF-02	2018–2019	1b and 3b	42	529	40/2	24–72	MSM (23)+Travel (1)	0–17 (6)	0–10 (3)	0–14 (6)	0–26 (9)	8	4	6	11
		SF-02a	2018–2019	1b and 3b	15	510	14/1	36–68	MSM (9)	0–10 (4)	0–4 (2)	0–8 (4)	1–12 (6)	3	2	3§	5§
		SF-02b	2019	1b	3	76	2/1	38–46	MSM (2)	0–1 (1)	0–1 (1)	0–3 (2)	2–5 (2)	1	1	2	2
		SF-03	2018–2019	2a	7	226	7/0	39–71	MSM (6)+Travel (1)	0–6 (3)	0–3 (2)	1–6 (4)	2–8 (5)	3	2	4	4
		SF-04	2018	1b	5	25	2/3	4–37	Food (3)+Familial (1)	0–2 (1)	0 (0)	1–6 (3)	1–7 (3)	1	0	3	3
S. sonnei		SS-01	2013	–	4	5	0/4	24–54	Non-MSM (4)	0 (0)	0–1 (0.5)	1–3 (2)	1–4 (2)	0	1	2	2
		SS-02	2014	–	3	22	0/3	46–53	Travel (3)	1–7 (6)	1–3 (1)	2–7 (4)	2–8 (5)	6	1	4	4
		SS-03	2017–2019	–	57	797	54/3	21–75	MSM (29)+Travel (1)	0–16 (5)	0–9 (3)	0–16 (6)	0–34 (8)	6	4	6	16
		SS-03a	2017–2018	–	16	155	16/0	23–75	MSM (12)+Travel (1)	0–3 (1)	0–2 (1)	0–6 (2)	0–8 (4)	1	1	4§	4§
		SS-03b	2018–2019	–	21	603	20/1	22–70	MSM (13)	0–4 (2)	0–5 (1)	0–10 (4)	0–29 (6)	2	4	4	16
		SS-04	2018	–	7	33	3/4	1–76	Food (7)	0–1 (0)	0–1 (0)	0–3 (1)	0–5 (2)	1	1	1	2
		SS-05	2018–2019	–	120	407	116/4	1–71	MSM (28)+Travel (2)	0–23 (8)	0–16 (6)	0–24 (10)	0–117 (15)	9	8	10	19
		SS-05a	2018–2019	–	69	396	67/2	24–70	MSM (19)+Travel (1)	0–7 (1)	0–5 (2)	0–10 (4)	0–57 (5)	3	2	5	7
		SS-05b	2018–2019	–	36	351	36/0	21–71	MSM (8)	0–8 (2)	0–6 (1)	0–10 (4)	0–75 (6)	3	3	4	31§
		SS-05c	2019	–	3	281	3/0	28–47	Unknown	0–4 (4)	0–3 (3)	2–6 (5)	2–7 (6)	1	3	5	5
		SS-05d	2019	–	4	29	2/2	1–41	Unknown	0–1 (0.5)	0–1 (1)	2–4 (3)	2–5 (4)	4	1	3	4

*Total number of days between the isolate dates of the first and last outbreak/subcluster-related isolates.

†Minimum and maximum pairwise differences between outbreak/subcluster-related isolates. Median pairwise genetic differences are in parentheses.

‡Largest genetic distance between neighbouring outbreak/subcluster-related isolates according to the minimum spanning trees (n differences).

§All isolates did not group together into a single minimum spanning cluster. The distance reported is the largest sum of minimum spanning distances between subcluster-related isolates.

CG, core genome; MLST, multi-locus sequence typing; MSM, men who have sex with men; SNVPhyl, single-nucleotide variant phylogenomics pipeline; WG, whole genome.

Features, intra-outbreak/subcluster genetic distances and largest minimum spanning distances determined by SNVPhyl, coreMLST, cgMLST and wgMLST of the four and five outbreaks under study in addition to the six and six subclusters Species Outbreaks/ Subclusters Year Serotypes No. of sequenced isolates Duration (days)* Sex (M/F) Age range Epidemiology (frequency) Intra-outbreak distance ranges† Largest minimum spanning distances‡ SNVPhyl coreMLST cgMLST wgMLST SNVPhyl coreMLST cgMLST wgMLST SF-01 2017–2019 2a and Y 31 1012 31/0 23–86 MSM (14) 0–37 (17) 0–30 (13) 1–41 (19) 1–53 (28) 23 16 23 26 SF-01a 2018–2019 2a 4 289 4/0 23–67 MSM (4)+Travel (1) 1–6 (3.5) 0–2 (1) 1–5 (3) 1–8 (4) 4 1 3 5 SF-01b 2019 2a 6 257 6/0 23–86 MSM (2) 0–4 (2) 0–4 (1) 1–5 (3) 2–7 (4) 3 3 3 3 SF-01c 2019 2a 5 69 5/0 36–59 MSM (3) 0–3 (1.5) 0–2 (1) 2–6 (3) 2–8 (4.5) 2 1 3 4 SF-01d 2018–2019 Y 7 322 7/0 29–78 MSM (1)+Travel (1) 0–10 (5) 0–9 (4) 2–16 (8) 5–18 (13) 5 4 8 8 SF-02 2018–2019 1b and 3b 42 529 40/2 24–72 MSM (23)+Travel (1) 0–17 (6) 0–10 (3) 0–14 (6) 0–26 (9) 8 4 6 11 SF-02a 2018–2019 1b and 3b 15 510 14/1 36–68 MSM (9) 0–10 (4) 0–4 (2) 0–8 (4) 1–12 (6) 3 2 3§ 5§ SF-02b 2019 1b 3 76 2/1 38–46 MSM (2) 0–1 (1) 0–1 (1) 0–3 (2) 2–5 (2) 1 1 2 2 SF-03 2018–2019 2a 7 226 7/0 39–71 MSM (6)+Travel (1) 0–6 (3) 0–3 (2) 1–6 (4) 2–8 (5) 3 2 4 4 SF-04 2018 1b 5 25 2/3 4–37 Food (3)+Familial (1) 0–2 (1) 0 (0) 1–6 (3) 1–7 (3) 1 0 3 3 SS-01 2013 – 4 5 0/4 24–54 Non-MSM (4) 0 (0) 0–1 (0.5) 1–3 (2) 1–4 (2) 0 1 2 2 SS-02 2014 – 3 22 0/3 46–53 Travel (3) 1–7 (6) 1–3 (1) 2–7 (4) 2–8 (5) 6 1 4 4 SS-03 2017–2019 – 57 797 54/3 21–75 MSM (29)+Travel (1) 0–16 (5) 0–9 (3) 0–16 (6) 0–34 (8) 6 4 6 16 SS-03a 2017–2018 – 16 155 16/0 23–75 MSM (12)+Travel (1) 0–3 (1) 0–2 (1) 0–6 (2) 0–8 (4) 1 1 4§ 4§ SS-03b 2018–2019 – 21 603 20/1 22–70 MSM (13) 0–4 (2) 0–5 (1) 0–10 (4) 0–29 (6) 2 4 4 16 SS-04 2018 – 7 33 3/4 1–76 Food (7) 0–1 (0) 0–1 (0) 0–3 (1) 0–5 (2) 1 1 1 2 SS-05 2018–2019 – 120 407 116/4 1–71 MSM (28)+Travel (2) 0–23 (8) 0–16 (6) 0–24 (10) 0–117 (15) 9 8 10 19 SS-05a 2018–2019 – 69 396 67/2 24–70 MSM (19)+Travel (1) 0–7 (1) 0–5 (2) 0–10 (4) 0–57 (5) 3 2 5 7 SS-05b 2018–2019 – 36 351 36/0 21–71 MSM (8) 0–8 (2) 0–6 (1) 0–10 (4) 0–75 (6) 3 3 4 31§ SS-05c 2019 – 3 281 3/0 28–47 Unknown 0–4 (4) 0–3 (3) 2–6 (5) 2–7 (6) 1 3 5 5 SS-05d 2019 – 4 29 2/2 1–41 Unknown 0–1 (0.5) 0–1 (1) 2–4 (3) 2–5 (4) 4 1 3 4 *Total number of days between the isolate dates of the first and last outbreak/subcluster-related isolates. †Minimum and maximum pairwise differences between outbreak/subcluster-related isolates. Median pairwise genetic differences are in parentheses. ‡Largest genetic distance between neighbouring outbreak/subcluster-related isolates according to the minimum spanning trees (n differences). §All isolates did not group together into a single minimum spanning cluster. The distance reported is the largest sum of minimum spanning distances between subcluster-related isolates. CG, core genome; MLST, multi-locus sequence typing; MSM, men who have sex with men; SNVPhyl, single-nucleotide variant phylogenomics pipeline; WG, whole genome. In the four non-MSM outbreaks studied, the maximum genetic diversity ranges from 0 to 7 hqSNVs/0–3 core alleles/3–7 CG alleles/4–8 WG alleles, which was consistent with the established 0–10 hqSNV/allele interpretation criteria measured for other single-strain outbreaks of foodborne enteric pathogens. In the MSM-related SF-01 outbreak, the genetic diversity ranges from 0 to 37 hqSNVs/0–30 core alleles/1–41 CG alleles/1–53 WG alleles, while in the other three large MSM-related outbreaks studied, the maximum genetic diversity ranges from 16 to 23 hqSNVs/9–16 core alleles/14–24 CG alleles/26–117 WG alleles. Notably, the diversity detected in large MSM-related outbreaks (SF-02, SS-03 and SS-05) by wgMLST is considerably larger than the other WGS-based subtyping methods. In particular, wgMLST added 93 alleles to the 24 allele diversity detected by cgMLST for the SS-05 outbreak. However, the diversity observed for the prolonged MSM-related SF-03 outbreak was inferior (6 hqSNVs/3 core alleles/6 CG alleles/8 WG alleles) as it only included seven cases. The maximum genetic diversity for the delimited subclusters ranges from 1 to 10 hqSNVs/1–9 core alleles/3–16 CG alleles/5–75 WG alleles. The large genetic differences observed between MSM-related outbreak strains through wgMLST will make creating real-time interpretation guidelines problematic for routine surveillance of spp. If foodborne outbreak-validated criteria were applied (i.e. 10 WG allele threshold) to determine relatedness and epidemiological concordance, all MSM-related outbreaks involving more than seven cases would have had epidemiologically related isolates excluded solely on this basis. Alternatively, the largest minimum spanning distances observed for all 9 and 11 outbreaks/subclusters (excluding SF-01) under study were ≤8 hqSNVs/4 core alleles/8 CG alleles/11 WG alleles, and ≤9 hqSNVs/8 core alleles/10 CG alleles/35 WG alleles, respectively. Here, it is demonstrated that if largest minimum spanning distances were used instead of maximum intra-outbreak distances for outbreak delimitations, all WGS-based subtyping methods would have been able to characterize most outbreaks correctly based on the established 0–10 hqSNV/allele interpretation criteria, with the exception of wgMLST, where discrepant clustering was observed for several subclusters (e.g. SS-03b and SS-05b). However, by eliminating these isolates from the analysis, the largest minimum spanning distances for SS-03, SS-03b and SS-05b fall to 8, 4 and 12 WG alleles. In addition, if we separate the SS-05d subcluster from the SS-05 outbreak altogether due to differences in epidemiology, its largest minimum spanning distance falls to 12 from 19 WG alleles.

MGE-mediated phylogenetic discrepancies and inflated genetic distances

The phylogenetic discrepancies discussed in previous sections and illustrated in Fig. S5 were further investigated. A clear example is also illustrated in Fig. 2, where the epidemiologically supported isolates SS050369 (Y) and SS055307 (Z) clustered on separate branches from the rest of the SS-03 outbreak via wgMLST on the ML tree. A distance of 19 WG alleles was found between both isolates and a combined average distance of 26 (range, 18–35) WG alleles was found to separate them from the other 55 outbreak-related isolates. However, these isolates cluster concordantly with the other SS-03b subcluster through all core genome-based subtyping methods with a distance of only 1 hqSNV/2 core alleles/3 CG alleles between them and an average distance of 5 (range, 0–14) hqSNVs/3 (range, 0–6) core alleles/7 (range, 2–13) CG alleles to the other 55 SS-03 outbreak-related isolates.

Fig. 2.

WgMLST ML phylogenetic tree and plasmid profiles of the 57 isolates involved in the SS-03 outbreak. aPlasmid clusters are defined as a group of closed reference plasmids with high sequence similarity by MOB-cluster. Contigs of plasmid origin in the draft assemblies of the isolates under study were then assigned a plasmid cluster. The cause of the inflated genetic distances was thus investigated by analysing the allele variants used to generate the wgMLST phylogeny for these two isolates along with a temporally linked SS-03b-related isolate (SS053646). Isolates SS055307 and SS050369 differed by 0 to 1 hqSNVs/0–2 core alleles/3–5 CG alleles/22–26 WG alleles from SS053646. The combined WG allele differences of the three pairwise comparisons revealed that wgMLST detected differences in an additional 25 loci compared to cgMLST. MOB-recon determined that 24 out of the 25 allele differences were encoded on contigs associated to the plasmid clusters c476 or c487. The plasmid cluster profiles of the 57 outbreak-related isolates can be found in Fig. 2 alongside the wgMLST ML tree. Eleven out of the 22 different plasmid clusters were present in over 80 % of the SS-03 outbreak-related isolates. The other 11 plasmid clusters were present sporadically in either 1 to 3 isolates. The plasmid profile of this outbreak strain was thus relatively stable through time. Two conjugative plasmid clusters, c487 and c544, were present in ≥98 % of the isolates. Interestingly, the plasmid cluster c476 was also conjugative, but was only present in the discrepantly clustered isolates SS050369 (Y) and SS055307 (Z). The additional allelic differences observed for SS050369 and SS055307 from the other SS-03 isolates were thus caused by the acquisition of the conjugative plasmid c476, which generated multiple differences from homologous genes already present in the ubiquitous plasmid c487. However, the large diversity between both isolates was caused by the acquisition of a similar ‘type’ of plasmid according to MOB-typer (c476), while they seemed to have gone through different evolutionary paths, causing multiple allelic differences between them as well.

Locations of the hqSNVs and allele variants used to differentiate outbreak- and subcluster-related isolates

Following the results of the previous section, all outbreaks and subclusters were evaluated based on the location of the hqSNVs and allele variants used to generate phylogeny by the four WGS-based subtyping approaches. Fig. 3 illustrates the number of hqSNVs and allele differences according to these locations for all outbreaks and subclusters under study.

Fig. 3.

Summary of the reference-based hqSNV locations and BioNumerics loci used to generate phylogeny via SNVPhyl, coreMLST, cgMLST and wgMLST for all 10 (a) and 11 (b) outbreaks and subclusters.

Summary of the reference-based hqSNV locations and BioNumerics loci used to generate phylogeny via SNVPhyl, coreMLST, cgMLST and wgMLST for all 10 (a) and 11 (b) outbreaks and subclusters. Most genetic differences were due to mutations in the chromosome for all four WGS-based subtyping methods, with the exception of the MSM-related outbreak SS-05 and subclusters SF-01c, SF-02b, SS-03b, SS-05a and SS-05b. The enhanced discrimination of wgMLST over cgMLST is largely due to allele differences mapped to plasmids and prophages. Most plasmid differences are caused by ‘plasmid copies’, where the same loci could be found on different plasmid sequences throughout an outbreak/subcluster and their corresponding alleles were sometimes discerned by upwards of several hundred mutations. Most of these loci were associated with plasmid maintenance, mobilization and conjugation. All outbreak- and subcluster-related isolates were also further distinguished by IS elements. However, due to the abundance and repeated nature of IS elements in the spp. genomes, most differences here would be considered false positives. In the 4 non-MSM outbreaks under study, most (15/19) genetic differences added by cgMLST and wgMLST over coreMLST are also considered false positives (e.g. chromosomal repeats and IS elements). The genetic differences due to ‘plasmid copies’ and ‘prophages’ were exceptionally high for the outbreak SS-05 and its subclusters SS-05a and SS-05b. Similarly to the SS-03 outbreak’s phylogenetic discrepancies illustrated in Fig. 2, two SS-05a isolates, SS131423 (W) and SS121129 (X) illustrated in Fig. S5, seem to have acquired distinct conjugative c476 plasmids encoding multiple loci already encoded on the omnipresent conjugative plasmid c487, creating many allelic variations between each other and the rest of the SS-05a subcluster-related isolates. In the subcluster SS-05b, similar observations were found as 9 isolates who seem to have acquired several new conjugative plasmids are responsible for 93/95 ‘plasmid copies’ differences. In addition, 4 SS-05b isolates seem to have acquired the phage SSU5 (NC_018843) according to PHASTER, but 1 isolate seems to have acquired a variant, creating all 46 allele differences accorded to ‘prophages’. Table S2 lists the problematic loci (e.g. chromosome repeats, plasmid-associated loci, prophage-associated loci and IS element transposase genes) included in the MLST-based loci schemes. The inclusion of highly repeated genes, such as IS element transposases, might have also given a deceptive appearance of higher discriminatory power measured previously for cgMLST and wgMLST. In support of this, the loci ECOLI03837 encoding the fimbria biosynthesis transcriptional regulator fimZ had the allele calls 422 and 423 randomly assigned to the complete isolate collection due to different contig lengths caused by an IS element (IS2) truncation. However, while the fimZ gene is known as an serotype 2 a strain 2457T-specific pseudogene [56], it was not detected by BioNumerics in the isolate collection for .

Robustness and the genetic distance variance of different input datasets

During surveillance, public health personnel evaluate the relatedness of isolates occurring within a determined time frame in order to detect potential outbreak clusters. In this time frame, the rate of unrelated isolates to the number of outbreak-associated cases can vary greatly depending on various factors (e.g. time of the year, region, etc.). The robustness of each WGS-based subtyping method was thus evaluated because of the need of an approach not easily influenced by the diversity of the input dataset under comparison, thus allowing interlaboratory comparisons. Based on our previous results, the core genome-based subtyping methods (e.g. cgSNV, coreMLST and cgMLST) seem more reliable when investigating spp. outbreaks, as they eliminate most confounding MGEs (e.g. plasmids, insertion sequence elements, prophages). However, the results generated by core genome-based approaches, such as SNVPhyl and cgMLST, are inherently dependent on the collection of isolates under comparison [24]. The addition of distantly related isolates (as is the case when comparing various outbreaks and sporadic isolates together) will inevitably reduce the discriminatory power of these methods, in addition to reducing the genetic variation of the outbreaks in question by decreasing the size of the shared genome. Fig. 4 summarizes these variabilities by comparing the outbreak/subcluster diversities found earlier to the diversities generated when using each outbreak/subcluster as distinct inputs through box plots for and .

Fig. 4.

Box and Tukey whisker plots of the pairwise genetic distances for all 10 (a) and 11 (b) outbreaks and subclusters via the use of isolated (I) outbreak/subcluster-specific collections and the complete (C) species-specific isolate collections as inputs to SNVPhyl, coreMLST, cgMLST and wgMLST. The black dots inside the boxes represent the mean genetic distances. Minor increases in distance metrics were observed for isolated outbreaks via SNVPhyl for and . The coreMLST method proved to be the most robust, as no changes were detected for either species. The cgMLST method, as expected, generated the most notable changes in distance metrics, but MSM-related outbreaks of were the most affected. The distance metrics of isolated outbreaks via cgMLST increased the most, since the increased clonality observed in the inputs increased the core genome substantially so as to resemble the results generated by wgMLST. However, wgMLST had an inverse effect, as isolated outbreaks produced minor decreases in distances. The phylogenies created by SNVPhyl, coreMLST and wgMLST were largely congruent, with the exception of changes in terminal branch lengths as a result of the differences in distances discussed above. However, the cgMLST tree topologies for certain isolated MSM-related outbreaks of resembled the discrepant topologies generated by wgMLST previously discussed in this paper. Therefore, cgMLST should not be used for sole outbreak investigations, since accessory genomes are likely stable throughout an outbreak and can cause epidemiologically discordant phylogenies similar to the use of wgMLST.

serotype conversions and impact on outbreak delimitations

In the collection of 91 isolates, 4 serotypes were included: 1b (n=47), 2a (n=34), Y (n=8) and 3b (n=2). Not all serotypes were monophyletic. Serotype conversion events were observed for outbreaks SF-01 and SF-02. The gtrI sequence from the GenBank accession number AF139596.1 [57] and the SfII bacteriophage sequence from the GenBank accession number AF021347.1 [58] were used as queries against the 1b/3b and 2a/Y isolates, respectively, in order to predict gene presence and mutations via local alignments. Forty SF-02 isolates presented serotype 1b, while two isolates (SF118225 and SF164148) presented the 3b serotype, but clearly clustered within the same outbreak clade with a minimum distance of 1 and 2 hqSNVs/2 and 0 core alleles/3 and 2 CG alleles/6 and 3 WG alleles to a serotype 1b isolate. The only notifiable difference by agglutination between a 3b and a 1b serotype for is the addition of a glucosyl group on the N-acetylglucosamine unit of the O-antigen, which is mediated by the gtrI gene product [59]. A comparative DNA analysis of the 42 SF-02 outbreak-related isolates’ gtrI gene with a reference sequence (AF139596.1) [57] resulted in a shared single-nucleotide deletion (−1A) at position 1175 that rendered a frameshift for both serotype 3b-presenting isolates. The other 40 isolates did not present any mutations, with the exception of 2 isolates (SF195966 and SF197491) with the last 26 and 2001 nucleotides of the gtrI gene deleted, respectively. Additionally, the SF-01 outbreak included a Y serotype subcluster (SF-01d, n=7) among the other SF-01-related subclusters and isolates (n=24) presenting serotype 2a. The SF-01d subcluster was distanced by a minimum of 19 hqSNVs/14 core alleles/21 CG alleles/29 WG alleles to the other SF-01 isolates. The only detectable difference by agglutination between a Y serotype and a 2a serotype is the presence of a glucosyl group on the rhamnose III position, which is possible by the gtrII gene product, a serotype-specific glucosyltransferase. However, the gtrII gene is known to be encoded by the serotype-converting bacteriophage SfII, where the loss of this sequence is commonly known to cause these serotype conversions in addition to amino acid substitution events [58, 60]. Here, the 7 isolates presenting the Y serotype did not possess both sugar transferase genes (bgt and gtrII) regularly encoded by the SFII bacteriophage (AF021347), while the other 24 serotype 2a isolates possessed both genes.

Discussion

The general goal of this study was to evaluate the WGS-based subtyping methods readily accessible by Canadian public health laboratories for the surveillance of spp. While other foodborne pathogens under the surveillance of PNC had been fully validated for wgMLST, there was a void for the interpretation of prolonged transmission patterns regularly observed in and outbreaks involving MSM. In this study, several components from each subtyping method were therefore compared, including the methods’ compatibility and discriminatory power. The four WGS-based subtyping methods tested were generally considered concordant due to high correlation coefficients (0.939–0.985), but wgMLST was not deemed compatible in the dataset (R 2=0.4627) due to its heightened ability to detect variation in the whole genome, which resulted in some abnormally large genetic distances between outbreak-related isolates (e.g. isolates W through Z). Similar to our results, previous studies on other bacterial species have shown that SNV-based and coreMLST/cgMLST methods were congruent [25, 61–63] and that correlation coefficients were greater for cgMLST than wgMLST [63]. However, an SNV-based subtyping approach was shown to be highly compatible with wgMLST for foodborne outbreaks (R 2=0.96) [42, 64]. Based on genetic distances alone, wgMLST also seemed compatible with core genome-based subtyping approaches when tested on other enteric pathogens [23, 24]. This difference in compatibility emphasizes the differences in environmental pressures and genomic plasticity between long-term clusters and the clonal pathogen outbreaks regularly studied (e.g. foodborne). All WGS-based subtyping methods studied were highly discriminatory for both spp. datasets, but cgMLST and wgMLST were found to be the most discriminatory, as they were able to assign different sequence types to most isolates (Simpson’s DI, 95 % CIs: 0.997–1.000). In comparison to other enteric bacteria, Pearce et al. [25] found similar results, as their core genome SNV-based approach provided greater resolution than coreMLST for serovar Enteritidis. However, Vincent et al. [23] found that SNVPhyl was the most discriminatory for distinguishing outbreak isolates of serovar Heidelberg from Quebec in comparison to cgMLST and wgMLST. Jackson et al. [23] also found that a hqSNV analysis produced more differences among outbreak-related isolates compared to wgMLST for foodborne outbreaks, indicating a higher discriminatory power for SNV-based analyses, although Henri et al. [64] did not find any difference. These results showcase the limitation of MLST-based subtyping as they consider mutations, recombinations and indels as single evolutionary events. Therefore, a single allelic shift can comprise multiple genetic changes, which would otherwise be detected by an SNV-based subtyping approach. On the other hand, our results were similar to the findings of Rumore et al. [24], as wgMLST was the most discriminatory compared to SNVPhyl for verotoxigenic O157:H7. This concordance is indicative of a more dynamic and diverse E. coli/Shigella accessory genome, where their shared E. coli/Shigella wgMLST scheme may have measurable advantages when it comes to discriminatory power compared to the wgMLST schemes constructed for other enteric pathogens. This advantage is most likely due to the larger size of the E. coli/Shigella pan-genome, enabling the detection of a wide range of loci, including many involved with mobile genetic elements (e.g. plasmids, prophages and transposons). These concordances and discriminatory powers were also translated in the intra-outbreak genetic distances observed. In the four non-MSM outbreaks studied, the maximum genetic diversity ranged from 0 to 7 hqSNVs/0–3 core alleles/3–7 CG alleles/4–8 WG alleles, which was consistent with documented non-MSM (≤5 hqSNVs) [65] and (≤7 hqSNVs) outbreaks [66]. This concordance demonstrates the shared limited transmission of the pathogen, thus resembling point source contaminations. However, our non-MSM outbreaks only included three to seven cases, which limits this observed consistency as non-MSM outbreaks involving a larger population may not always follow the same 0–10 hqSNV/allele guideline. In the four MSM-related outbreaks studied (excluding the complete SF-01 outbreak), the maximum genetic diversity ranged from 1 to 23 hqSNVs/0–16 core alleles/3–24 CG alleles/5–117 WG alleles. Here, the SF-01 outbreak was excluded from the distance ranges described, since it seems to represent a genetically diverse reservoir population (or an endemic strain) where multiple outbreak events (i.e. subclusters a-d) and isolated cases seem to have emerged. Therefore, it was probably wrong of us to encompass every subcluster together as a single-strain outbreak due to its high diversity, as observed by the core genome-based approaches (0–37 hqSNVs/0–30 core alleles/1–41 CG alleles/1–53 WG alleles) and superior largest minimum spanning distances (23 hqSNVs/16 core alleles/23 CG alleles/26 WG alleles). This highlights the importance of gathering detailed epidemiological data to supplement WGS-based subtyping methods. While all SF-01 subclusters occurred in the same time frame and involved MSM cases, their differences in phylogeny may indicate deeper causes in epidemiology other than simply ‘MSM’. Other subclusters were also defined in order to detect potential transmission routes and spillover events into other communities [67]. For example, the SS-05d subcluster involved women and children with unknown epidemiology, while the complete SS-05 outbreak would be considered MSM-related. All MSM-related outbreaks surpassed the 10 genetic difference guidelines for SNVPhyl, cgMLST and wgMLST, with the exception of the SF-03 outbreak (n=7 isolates). The other 3 MSM-related outbreaks (SF-02, SS-03 and SS-05), which contained from 42 to 120 cases, have gained a substantial amount of diversity going from cgMLST to wgMLST, usually doubling the distances observed. Notably, for outbreak SS-05, it went from a maximum of 24 CG alleles to 117 WG alleles. The hqSNV diversities observed here seem concordant to the 64 SNP difference observed for an MSM-related outbreak involving 121 cases [65] when assuming that genetic diversity correlates with the size of the effective population of prolonged outbreak events. However, while defining the maximum number of WG allele differences required to classify isolates as related/unrelated to an outbreak strain may be a desirable course of action for outbreak delimitations, it is highly unlikely that a universal distance threshold will be able to consistently predict epidemiological relationships [24, 26, 68–70]. Some authors even predict misleading interpretations of transmission networks due to the inadequacy of only sequencing a single isolate per case, thus ignoring the within-host diversity of the pathogen population [34]. This heterogeneity between epidemiologically linked cases highlights the importance of including tree topologies as indicators of evolution in addition to epidemiological evidence and traceback data in combination with genetic distances to define spp. clusters. Nevertheless, we have shown that the use of minimum spanning distances could be useful as an alternative measure of shared epidemiology for prolonged outbreak events, since all outbreak- and subcluster-related isolates did not surpass 10 hqSNV/allele differences with their neighbouring isolates through SNVPhyl, coreMLST and cgMLST. Most wgMLST minimum spanning distances also respected this threshold, with the exception of several instances, where the clustering was considered discordant or where the distances were measured outside the delimited subclusters. Further studies are thus warranted due to its flexibility to create stable WGS interpretation guidelines. However, a major weakness of this study was the limited epidemiological evidence available and the assumptions made on outbreak/subcluster delimitations based on the SNVPhyl phylogeny. Therefore, some outbreaks/subclusters analysed here might have included some isolates with no true shared epidemiology and vice versa. The inflated distances and discordant clustering observed with the use of the wgMLST approach were investigated through a comparison of the predicted plasmid profiles detected in the SS-03 outbreak. We have shown that homologous genes encoded on co-inhabiting plasmids can cause multiple allelic differences only detected by the wgMLST method. However, these discrepancies are partly due to the default BioNumerics settings of only tabulating the allele call with the lowest allele identification number when multiple allele calls are present for a single locus. An easy setting change in BioNumerics could be of interest, as it could eliminate the loci with multiple allele calls from the analysis and remove the effect of repeated loci on the genetic distances measured. It is also important to mention that while MOB-recon has shown high sensitivity and specificity for plasmid detection, contigs of plasmid origin detected through short-read sequencing can be split across multiple clusters and/or merged into clusters with other non-related contigs [55]. In addition, repetitive elements with multiple copies will only be assigned to one plasmid cluster code via MOB-recon’s winner-takes-all strategy [55]. The cluster codes mentioned in this paper are therefore used for descriptive purposes only and further studies are recommended in order to better understand the plasmid diversity observed in the MSM-related outbreaks, as plasmids can be rapid disseminators of antimicrobial resistance traits [71]. The drawbacks of the use of wgMLST were also illustrated in our analysis of the locations of the hqSNV and allele variants used by the different subtyping approaches to determine sequence types. While subtyping methods should approximate epidemiology, the inclusion of MGE-encoded loci in the cg/wgMLST schema was shown to reduce this reliability via the presence of plasmid-, prophage- and IS element-related loci. Although the scientific community had expressed their concern regarding including pseudogenes and paralogous genes within a dataset used for surveillance purposes and estimated that such datasets could generate spurious genetic differences leading to inaccurate clustering [21, 25, 29, 63], most outbreaks studied did not present such erroneous results. However, some homologous recombinations, plasmid acquisitions and prophage sequence variations have been documented to affect phylogeny in prior outbreak investigations [41, 63, 72]. The emergence of such false diversity in our MSM-related outbreaks is thought to coincide with ’s competitive edge over to accept and maintain with more stability horizontally transferred DNA [39, 40]. In addition, prolonged transmission patterns providing greater opportunities for horizontal gene transfers [65] and the environmental pressures of antibiotic use by MSM for the treatment of other sexually transmitted diseases [13, 73, 74] are probable causes for such wgMLST results. This pressure would be weaker in cases of non-MSM outbreaks relating to short-term food or community transmissions, which would consequently present more clonal clusters of in the absence of a complex accessory genome, as was shown in our results. In addition, serotype conversions have been observed in prolonged outbreaks due to serotype-converting bacteriophages and frameshift mutations similarly described elsewhere [75, 76]. These results highlight the caution needed during outbreak investigations to not solely exclude cases based on serotypes. Phylogeny is therefore a more trustworthy tool, since we have data demonstrating that serotype conversions can happen spontaneously in an outbreak strain. So far, our data have shown that core genome-based subtyping methods have been given the upper hand over wgMLST, although some have been deemed to impact global inter-laboratory comparisons (e.g. SNVPhyl and cgMLST) [21] due to differences in reference genome choice, pipeline parameters and input datasets used by different jurisdictions [68]. Nevertheless, by utilizing a consistent set of conserved loci and allele designations, such as the EnteroBase scheme used in coreMLST, these variations vanish and enable global data comparisons in addition to a potential classification nomenclature due to the absence of differences observed for different input datasets. While minor increases in distance metrics were observed for isolated outbreaks via SNVPhyl for and , these results were concordant with Reimer et al.’s [31] comparison of different input datasets for outbreaks, as the size of the shared genome had little influence on the outbreak diversities, but increased nonetheless. However, the cgMLST approach was shown to be unreliable for inter-laboratory comparisons, while wgMLST produced minor decreases in distances. This decrease is due to the categorical(values) coefficient used in BioNumerics to calculate the pairwise distances. During a wgMLST analysis with isolated outbreaks, only the n total component will vary compared to the analysis of complete collections. Therefore, only including closely related isolates in the wgMLST analysis will reduce the total number of loci (n total) detected in the input dataset due to minimal genetic diversities and similar accessory genomes, and will consequently minimize the effect of the n total/n common factor on the n differences component. Applying the categorical(differences) coefficient instead could eliminate the fluctuating nature of the wgMLST analysis by only comparing true genetic variants (n differences) and remove the effect of loci presence/absence. To conclude, it is of utmost importance that subtyping methods approximate epidemiology in an outbreak or surveillance scenario. In other words, isolates that cluster closely together and share a recent common ancestor via phylogenomic approaches should be indicative of a shared source of infection or chain of transmission [63]. It is also assumed that subtyping methods with a higher discriminatory power will indicate stronger epidemiological relationships. With that being said, even though wgMLST proved to be the most discriminatory for our and datasets, core genome-based subtyping methods were more phylogenetically consistent and epidemiologically concordant due to their exclusion of genetic variation in the accessory genome and filtering processes of repetitive loci and confounding MGEs. While MGE-mediated discrepancies were only observed in prolonged MSM-related outbreaks of , similar irregularities could theoretically occur at any time in any pathogen, but a higher degree of caution is needed for highly recombinant species and long spanning transmission patterns with strong selective pressures. These (1) highlight the importance of validating WGS-based subtyping methods for each pathogen separately on a basis of genome plasticity and outbreak dynamics, (2) support the use of a combination of different WGS analyses in order to obtain a better perspective of a given outbreak and (3) emphasize the application of a certain flexibility to proposed interpretation thresholds of relatedness. However, horizontally acquired genetic elements should not be deemed unusable, as plasmid-, IS element-, phage-, antimicrobial resistance-, virulence- and CRISPR-based subtyping have been shown to be valuable tools for public health investigations and risk assessments for other pathogens [70, 77, 78]. Further studies are therefore recommended in order to validate the parallel use of similar approaches to improve the typeability of spp. in the WGS era. We have also shown the utility of using a minimum spanning distance metric for future outbreak investigations, since it can be applied to outbreaks of different transmission natures and is not influenced by the time span of a given outbreak. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file.

70 in total

1. Intensified shigellosis epidemic associated with sexual transmission in men who have sex with men--Shigella flexneri and S. sonnei in England, 2004 to end of February 2015.

Authors: I Simms; N Field; C Jenkins; T Childs; V L Gilbart; T J Dallman; P Mook; P D Crook; G Hughes
Journal: Euro Surveill Date: 2015-04-16

2. Intercontinental dissemination of azithromycin-resistant shigellosis through sexual transmission: a cross-sectional study.

Authors: Kate S Baker; Timothy J Dallman; Philip M Ashton; Martin Day; Gwenda Hughes; Paul D Crook; Victoria L Gilbart; Sandra Zittermann; Vanessa G Allen; Benjamin P Howden; Takehiro Tomita; Mary Valcanis; Simon R Harris; Thomas R Connor; Vitali Sintchenko; Peter Howard; Jeremy D Brown; Nicola K Petty; Malika Gouali; Duy Pham Thanh; Karen H Keddy; Anthony M Smith; Kaisar A Talukder; Shah M Faruque; Julian Parkhill; Stephen Baker; François-Xavier Weill; Claire Jenkins; Nicholas R Thomson
Journal: Lancet Infect Dis Date: 2015-04-27 Impact factor: 25.071

3. Whole-genome-based Mycobacterium tuberculosis surveillance: a standardized, portable, and expandable approach.

Authors: Thomas A Kohl; Roland Diel; Dag Harmsen; Jörg Rothgänger; Karen Meywald Walter; Matthias Merker; Thomas Weniger; Stefan Niemann
Journal: J Clin Microbiol Date: 2014-04-30 Impact factor: 5.948

Review 4. Whole genome sequencing options for bacterial strain typing and epidemiologic analysis based on single nucleotide polymorphism versus gene-by-gene-based approaches.

Authors: A C Schürch; S Arredondo-Alonso; R J L Willems; R V Goering
Journal: Clin Microbiol Infect Date: 2018-01-05 Impact factor: 8.067

5. Multiple independent origins of Shigella clones of Escherichia coli and convergent evolution of many of their characteristics.

Authors: G M Pupo; R Lan; P R Reeves
Journal: Proc Natl Acad Sci U S A Date: 2000-09-12 Impact factor: 11.205

6. Outbreak of Shigella sonnei in Montréal's ultra-Orthodox Jewish community, 2015.

Authors: P A Pilon; B Camara; S Bekal
Journal: Can Commun Dis Rep Date: 2016-04-07

7. A Comparative Analysis of the Lyve-SET Phylogenomics Pipeline for Genomic Epidemiology of Foodborne Pathogens.

Authors: Lee S Katz; Taylor Griswold; Amanda J Williams-Newkirk; Darlene Wagner; Aaron Petkau; Cameron Sieffert; Gary Van Domselaar; Xiangyu Deng; Heather A Carleton
Journal: Front Microbiol Date: 2017-03-13 Impact factor: 5.640

8. PHYLOViZ: phylogenetic inference and data visualization for sequence based typing methods.

Authors: Alexandre P Francisco; Cátia Vaz; Pedro T Monteiro; José Melo-Cristino; Mário Ramirez; Joäo A Carriço
Journal: BMC Bioinformatics Date: 2012-05-08 Impact factor: 3.169

9. Shared genome analyses of notable listeriosis outbreaks, highlighting the critical importance of epidemiological evidence, input datasets and interpretation criteria.

Authors: Aleisha Reimer; Kelly Weedmark; Aaron Petkau; Christy-Lynn Peterson; Matthew Walker; Natalie Knox; Heather Kent; Philip Mabon; Chrystal Berry; Shaun Tyler; Lorelee Tschetter; Morganne Jerome; Vanessa Allen; Linda Hoang; Sadjia Bekal; Clifford Clark; Celine Nadon; Gary Van Domselaar; Franco Pagotto; Morag Graham; Jeff Farber; Matthew Gilmour
Journal: Microb Genom Date: 2019-01-16

10. Commensal Escherichia coli are a reservoir for the transfer of XDR plasmids into epidemic fluoroquinolone-resistant Shigella sonnei.

Authors: Maia A Rabaa; Stephen Baker; Pham Thanh Duy; To Nguyen Thi Nguyen; Duong Vu Thuy; Hao Chung The; Felicity Alcock; Christine Boinett; Ho Ngoc Dan Thanh; Ha Thanh Tuyen; Guy E Thwaites
Journal: Nat Microbiol Date: 2020-01-20 Impact factor: 17.745

4 in total

1. Characterization of Extended-Spectrum β-Lactamase-Producing Shigella sonnei in Spain: Expanding the Geographic Distribution of Sequence Type 152/CTX-M-27 Clone.

Authors: Lorena López-Cerero; Emilio Stolz; Marina R Pulido; Alvaro Pascual
Journal: Antimicrob Agents Chemother Date: 2022-06-28 Impact factor: 5.938

2. Evaluation of whole-genome sequencing-based subtyping methods for the surveillance of Shigella spp. and the confounding effect of mobile genetic elements in long-term outbreaks.

Authors: Isabelle Bernaquez; Christiane Gaudreau; Pierre A Pilon; Sadjia Bekal
Journal: Microb Genom Date: 2021-11

3. Virulence plasmid pINV as a genetic signature for Shigella flexneri phylogeny.

Authors: Giulia Pilla; Gabriele Arcari; Christoph M Tang; Alessandra Carattoli
Journal: Microb Genom Date: 2022-06

4. Clinical and Genomic Investigation of an International Ceftriaxone- and Azithromycin-Resistant Shigella sonnei Cluster among Men Who Have Sex with Men, Montréal, Canada 2017-2019.

Authors: Christiane Gaudreau; Isabelle Bernaquez; Pierre A Pilon; Alexandre Goyette; Nada Yared; Sadjia Bekal
Journal: Microbiol Spectr Date: 2022-06-01

4 in total