Claudia E Coipan, Timothy J Dallman, Derek Brown, Hassan Hartman, Menno van der Voort, Redmar R van den Berg, Daniel Palm, Saara Kotila, Tom van Wijk, Eelco Franz.
Abstract
A large European multi-country Salmonella enterica serovar Enteritidis outbreak associated with Polish eggs was characterized by whole-genome sequencing (WGS)-based analysis, with various European institutes using different analysis workflows to identify isolates potentially related to the outbreak. The objective of our study was to compare the output of six of these different typing workflows (distance matrices of either SNP-based or allele-based workflows) in terms of cluster detection and concordance. To this end, we analysed a set of 180 isolates coming from confirmed and probable outbreak cases, which were representative of the genetic variation within the outbreak, supplemented with 22 unrelated contemporaneous S. enterica serovar Enteritidis isolates. Since the definition of a cluster cut-off based on genetic distance requires prior knowledge of the evolutionary processes that govern the bacterial populations in question, we used a variety of hierarchical clustering methods (single, average and complete) and selected the optimal number of clusters based on the consensus of the silhouette, Dunn2 and McClain-Rao internal validation indices. External validation was done by calculating the concordance with the WGS-based case definition (SNP-address) for this outbreak using the Fowlkes-Mallows index. Our analysis indicates that with complete-linkage hierarchical clustering combined with the optimal number of clusters, as defined by three internal validity indices, the six different allele- and SNP-based typing workflows generate clusters with similar compositions. Furthermore, we show that even in the absence of coordinated typing procedures, but by using an unsupervised machine learning methodology for cluster delineation, the various workflows that are currently in use by six European public-health authorities can identify concordant clusters of genetically related S. enterica serovar Enteritidis isolates, thus providing public-health researchers with comparable tools for detection of infectious-disease outbreaks.
Keywords: epidemiology; hierarchical clustering; infectious disease; surveillance; unsupervised machine learning; whole-genome sequencing
Year: 2020 PMID: 32101514 PMCID: PMC7200063 DOI: 10.1099/mgen.0.000318
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
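The clustering step described in the abstract (hierarchical clustering of a workflow's distance matrix with single, average or complete linkage, cut at a chosen number of clusters) can be illustrated with a minimal sketch. The function name `hierarchical_clusters`, the toy 5×5 matrix and the choice of two clusters are assumptions made for illustration only; the source does not state which software was used, and this naive implementation is not the authors' code.

```python
from itertools import combinations


def hierarchical_clusters(dist, k, linkage="complete"):
    """Naive agglomerative clustering on a symmetric distance matrix.

    dist    : list of lists of pairwise genetic distances
    k       : number of clusters at which to stop merging
    linkage : 'single', 'average' or 'complete'
    Returns one cluster label (0..k-1) per isolate.
    """
    combine = {
        "single": min,                                   # closest pair
        "complete": max,                                 # farthest pair
        "average": lambda vals: sum(vals) / len(vals),   # mean over pairs
    }[linkage]

    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        # Find the two clusters with the smallest linkage distance.
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = combine([dist[i][j] for i in clusters[a] for j in clusters[b]])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected

    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical distances (e.g. SNP or allele differences) among five isolates.
toy_dist = [
    [0, 1, 2, 10, 11],
    [1, 0, 1, 10, 12],
    [2, 1, 0, 9, 10],
    [10, 10, 9, 0, 1],
    [11, 12, 10, 1, 0],
]
labels = hierarchical_clusters(toy_dist, 2, linkage="complete")
```

For real data sets a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` would normally be preferred over this quadratic-per-merge loop; the sketch only makes the three linkage rules explicit.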
Fig. 1. Flowchart of the methodology used for the analysis of the bacterial isolates selected for this study.
Table 1. Data used in the clustering analysis and the values corresponding to the optimal partition for each of the three hierarchical clustering methods used
Optimal_k, the optimal number of clusters as identified based on the silhouette, McClain–Rao and Dunn2 index; diameter, the maximum within-cluster distance; separation, minimum between-clusters distance.
| Workflow | Clustering | Optimal_k | Diameter | Separation |
|---|---|---|---|---|
| SNP1 | Average | k14 | 14 | 14 |
| SNP2 | Average | k12 | 16 | 3 |
| MLSTcg1 | Average | k12 | 13 | 10 |
| MLSTcg2 | Average | k13 | 15 | 9 |
| MLSTcg3 | Average | k13 | 13 | 12 |
| MLSTwg | Average | k13 | 13 | 10 |
| SNP1 | Complete | k14 | 14 | 14 |
| SNP2 | Complete | k13 | 13 | 1 |
| MLSTcg1 | Complete | k13 | 9 | 6 |
| MLSTcg2 | Complete | k13 | 11 | 7 |
| MLSTcg3 | Complete | k13 | 13 | 12 |
| MLSTwg | Complete | k13 | 18 | 10 |
| SNP1 | Single | k14 | 14 | 14 |
| SNP2 | Single | k12 | 16 | 3 |
| MLSTcg1 | Single | k11 | 17 | 10 |
| MLSTcg2 | Single | k13 | 15 | 9 |
| MLSTcg3 | Single | k13 | 13 | 12 |
| MLSTwg | Single | k13 | 13 | 10 |
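The diameter and separation columns follow directly from the table legend (maximum within-cluster distance and minimum between-clusters distance). A minimal sketch of how such values could be computed from a distance matrix and a partition is given below. It assumes that the reported diameter is the maximum taken over all clusters; the function name and the toy data are hypothetical.

```python
def diameter_and_separation(dist, labels):
    """Return (diameter, separation) for a partition.

    diameter   : largest distance between two isolates in the same cluster
                 (assumed here to be the maximum over all clusters)
    separation : smallest distance between two isolates in different
                 clusters, or None if there is only one cluster
    """
    n = len(labels)
    diameter = 0
    separation = None
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                diameter = max(diameter, dist[i][j])
            elif separation is None or dist[i][j] < separation:
                separation = dist[i][j]
    return diameter, separation


# Hypothetical pairwise distances among five isolates and a 2-cluster partition.
example_dist = [
    [0, 1, 2, 10, 11],
    [1, 0, 1, 10, 12],
    [2, 1, 0, 9, 10],
    [10, 10, 9, 0, 1],
    [11, 12, 10, 1, 0],
]
example_labels = [0, 0, 0, 1, 1]
diameter, separation = diameter_and_separation(example_dist, example_labels)
```

A partition whose separation exceeds its diameter, as in this toy case, has clusters that are compact relative to the gaps between them.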
Fig. 2. Pairwise correlation between the genetic distances of the various workflows. The diagonal shows the density plots of the distances in each of the six workflows. The upper half of the plot indicates the Spearman coefficients of correlation for each combination of distance matrices. In the dot plots, the x-axis represents the genetic distance among isolates, as measured by the workflow indicated on the column label; the y-axis represents the genetic distance among isolates, as measured by the workflow indicated in the row label. Only the distances between isolates that are present in both workflows of a pairwise comparison are depicted in the figure.
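The comparison summarized in Fig. 2 (Spearman correlation between two workflows' distances, restricted to isolates present in both) can be sketched as follows. The nested-dictionary layout, the helper names and the isolate identifiers are assumptions for illustration; the sketch does not handle the degenerate case where all distances in one workflow are identical.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        rank = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for idx in order[pos:end + 1]:
            ranks[idx] = rank
        pos = end + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_shared(dist_a, dist_b):
    """Spearman correlation over pairs of isolates found in both workflows.

    dist_a, dist_b: nested dicts, dist[id1][id2] -> genetic distance.
    """
    shared = sorted(set(dist_a) & set(dist_b))
    pairs = list(combinations(shared, 2))
    x = [dist_a[p][q] for p, q in pairs]
    y = [dist_b[p][q] for p, q in pairs]
    return pearson(average_ranks(x), average_ranks(y))


def to_matrix(pair_distances):
    """Build a symmetric nested dict from {(id1, id2): distance}."""
    m = {}
    for (p, q), d in pair_distances.items():
        m.setdefault(p, {})[q] = d
        m.setdefault(q, {})[p] = d
    return m


# Hypothetical workflows: isolate "d" is only in A, isolate "e" only in B.
workflow_a = to_matrix({("a", "b"): 1, ("a", "c"): 5, ("b", "c"): 4,
                        ("a", "d"): 7, ("b", "d"): 8, ("c", "d"): 6})
workflow_b = to_matrix({("a", "b"): 2, ("a", "c"): 9, ("b", "c"): 7,
                        ("a", "e"): 3, ("b", "e"): 4, ("c", "e"): 5})
rho = spearman_shared(workflow_a, workflow_b)
```

Because Spearman's coefficient depends only on ranks, two workflows that order isolate pairs identically correlate perfectly even when their absolute distances differ, which is why it suits a comparison between SNP counts and allele differences.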
Fig. 3. Internal validity indices for combinations of workflows and clustering algorithms, for k=3–20. The values of all indices are scaled and re-centred around 0 for better visualization. Maximum values of Dunn2 and silhouette, and minimum values of McClain–Rao indicate optimal number of clusters. The bold black vertical lines indicate the consensus optimal number of clusters as indicated in Table 1. The silhouette index is not defined for k>13 in average- and complete-linkage clustering of the SNP2 workflow.
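As an illustration of one of the three internal validity indices, the mean silhouette width can be computed from a distance matrix and a partition as sketched below. This covers the silhouette only; the Dunn2 and McClain–Rao indices and the consensus step are not reproduced. The function names, the singleton convention (silhouette of 0) and the candidate partitions are assumptions for illustration.

```python
def mean_silhouette(dist, labels):
    """Mean silhouette width of a partition, from pairwise distances.

    For isolate i: a = mean distance to its own cluster, b = smallest mean
    distance to any other cluster, s = (b - a) / max(a, b).
    Singleton clusters contribute s = 0 (a common convention, assumed here).
    Requires at least two clusters.
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        total += (b - a) / max(a, b)
    return total / len(labels)


def best_partition(dist, candidates):
    """Return the key of the candidate partition with the highest silhouette."""
    return max(candidates, key=lambda key: mean_silhouette(dist, candidates[key]))


# Hypothetical distances and two hand-made candidate partitions.
sil_dist = [
    [0, 1, 2, 10, 11],
    [1, 0, 1, 10, 12],
    [2, 1, 0, 9, 10],
    [10, 10, 9, 0, 1],
    [11, 12, 10, 1, 0],
]
candidates = {
    "good": [0, 0, 0, 1, 1],
    "bad": [0, 0, 1, 1, 1],
}
chosen = best_partition(sil_dist, candidates)
```

In the setting of the figure, the candidates would be the partitions obtained at each k from 3 to 20, and the maximum of the curve marks the silhouette-optimal number of clusters.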
Fig. 4. Correspondence among the partitions of the six workflows following clustering with one of the following algorithms: (a) average linkage, (b) complete linkage, (c) single linkage. The grey alluvials stand for the outbreak-linked isolates, while the orange alluvials stand for the non-outbreak isolates.
Fig. 5. Summary of Fowlkes–Mallows indices of concordance between any two partitions: (a) for pairwise comparisons of the six partitions, (b) for pairwise comparisons of each of the six partitions with the reference outbreak clusters. The Fowlkes–Mallows index can take values on the interval 0–1, where values closer to 0 indicate absence of correlation, while values closer to 1 indicate close to perfect correlation. The asterisks indicate the P values for pairwise t-tests: *difference with P<0.05; **difference with P<0.01.
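The Fowlkes–Mallows index used for these comparisons can be written in its pair-counting form, FM = TP / sqrt((TP + FP)(TP + FN)), where TP counts isolate pairs placed together in both partitions and FP and FN count pairs placed together in only one of them. The sketch below is a minimal illustration with hypothetical labels, not the authors' implementation.

```python
from itertools import combinations
from math import sqrt


def fowlkes_mallows(labels_a, labels_b):
    """Fowlkes-Mallows index between two partitions of the same isolates."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            tp += 1   # pair clustered together in both partitions
        elif same_a:
            fp += 1   # together only in partition A
        elif same_b:
            fn += 1   # together only in partition B
    if tp == 0:
        return 0.0
    return tp / sqrt((tp + fp) * (tp + fn))


# Hypothetical partitions of five isolates from two workflows.
partition_a = [0, 0, 0, 1, 1]
partition_b = [0, 0, 1, 1, 1]
fm = fowlkes_mallows(partition_a, partition_b)
```

The index depends only on which isolates share a cluster, not on the label values, so partitions from different workflows can be compared without matching cluster names. scikit-learn offers an equivalent routine, `sklearn.metrics.fowlkes_mallows_score`.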
Fig. 6. Tanglegram of SNP1 and MLSTcg1 clusterings with complete linkage. The outbreak clusters 1 and 2 are shown in red and blue, respectively.