Martinique Frentrup1, Zhemin Zhou2, Matthias Steglich3,1, Jan P Meier-Kolthoff1, Markus Göker1, Thomas Riedel3,1, Boyke Bunk1, Cathrin Spröer1, Jörg Overmann3,1,4, Marion Blaschitz5, Alexander Indra5, Lutz von Müller6, Thomas A Kohl7,8, Stefan Niemann7,8, Christian Seyboldt9, Frank Klawonn10,11, Nitin Kumar12, Trevor D Lawley12, Sergio García-Fernández13,14, Rafael Cantón13,14, Rosa Del Campo13,14, Ortrud Zimmermann15, Uwe Groß15, Mark Achtman2, Ulrich Nübel4,3,1.
Abstract
Clostridioides difficile is the primary infectious cause of antibiotic-associated diarrhea. Local transmissions and international outbreaks of this pathogen have been previously elucidated by bacterial whole-genome sequencing, but comparative genomic analyses at the global scale were hampered by the lack of specific bioinformatic tools. Here we introduce a publicly accessible database within EnteroBase (http://enterobase.warwick.ac.uk) that automatically retrieves and assembles C. difficile short-reads from the public domain, and calls alleles for core-genome multilocus sequence typing (cgMLST). We demonstrate that comparable levels of resolution and precision are attained by EnteroBase cgMLST and single-nucleotide polymorphism analysis. EnteroBase currently contains 18 254 quality-controlled C. difficile genomes, which have been assigned to hierarchical sets of single-linkage clusters by cgMLST distances. This hierarchical clustering is used to identify and name populations of C. difficile at all epidemiological levels, from recent transmission chains through to epidemic and endemic strains. Moreover, it puts newly collected isolates into phylogenetic and epidemiological context by identifying related strains among all previously published genome data. For example, HC2 clusters (i.e. chains of genomes with pairwise distances of up to two cgMLST alleles) were statistically associated with specific hospitals (P<10⁻⁴) or single wards (P=0.01) within hospitals, indicating they represented local transmission clusters. We also detected several HC2 clusters spanning more than one hospital that by retrospective epidemiological analysis were confirmed to be associated with inter-hospital patient transfers. In contrast, clustering at level HC150 correlated with k-mer-based classification and was largely compatible with PCR ribotyping, thus enabling comparisons to earlier surveillance data.
EnteroBase enables contextual interpretation of a growing collection of assembled, quality-controlled C. difficile genome sequences and their associated metadata. Hierarchical clustering rapidly identifies database entries that are related at multiple levels of genetic distance, facilitating communication among researchers, clinicians and public-health officials who are combatting disease caused by C. difficile.
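The abstract describes HC2 clusters as chains of genomes linked by pairwise distances of up to two cgMLST alleles, i.e. single-linkage clusters at a threshold of 2. The following is a minimal Python sketch of that idea, not EnteroBase's implementation. The function names, the toy profiles and the rule of skipping loci with missing calls are all assumptions made for illustration.

```python
from itertools import combinations


def allelic_distance(a, b):
    """Count loci at which two cgMLST profiles carry different alleles.

    Loci with a missing call (None) in either profile are skipped;
    this handling of missing data is an assumption of the sketch.
    """
    return sum(
        1 for x, y in zip(a, b)
        if x is not None and y is not None and x != y
    )


def single_linkage_clusters(profiles, threshold):
    """Group genomes into single-linkage clusters at an allelic threshold.

    Two genomes share a cluster if they are connected by a chain of
    pairwise distances that are each <= threshold (union-find).
    """
    names = list(profiles)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the cluster root, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if allelic_distance(profiles[a], profiles[b]) <= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical toy profiles over six loci (allele numbers are made up).
profiles = {
    "g1": [1, 1, 1, 1, 1, 1],
    "g2": [1, 1, 1, 1, 2, 2],  # 2 alleles from g1
    "g3": [1, 1, 3, 3, 2, 2],  # 2 alleles from g2, 4 from g1
    "g4": [5, 5, 5, 5, 5, 5],  # 6 alleles from every other genome
}

hc2 = single_linkage_clusters(profiles, threshold=2)
hc0 = single_linkage_clusters(profiles, threshold=0)
print(hc2)  # g1, g2 and g3 chain together; g4 stays alone
```

In this toy example g1 and g3 differ at four loci but still share an HC2 cluster because g2 links them, which is the chaining behaviour implied by "chains of genomes" in the abstract.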
Keywords: Clostridioides (Clostridium) difficile; cgMLST; genomic population structure; hierarchical clustering; nosocomial infection; outbreak
Year: 2020 PMID: 32726198 PMCID: PMC7641423 DOI: 10.1099/mgen.0.000410
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
Fig. 1. Criteria for inclusion in a cgMLST scheme of a subset of wgMLST genes based on their properties in a reference set of 442 genomes (https://tinyurl.com/Cdiff-ref). (a) Numbers of genes versus frequency (% presence) within the reference set. In total, 2634 genes satisfied the cut-off criterion of ≥98 % presence (dashed line). (b) Numbers of genes versus intact ORF (% intact ORF) within the 2634 genes from (a). Overall, 2560 genes satisfied the cut-off criterion of ≥94 % intact ORF (dashed line). (c) Frequency of allelic variants versus gene size among the 2560 genes from (b). The genetic diversity was calculated using the GaussianProcessRegressor function in the sklearn module in Python. This function calculates the Gaussian process regression of the frequency of genetic variants on gene sizes, using a linear combination of a radial basis function kernel (RBF) and a white kernel [57]. The shadowed region shows a single-tailed 99.9 % confidence interval (≤3 sigma) of the prediction. Altogether, 2556 loci fell within this area and were retained for the cgMLST scheme, while four were excluded due to excessive numbers of alleles.
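The first two selection steps in Fig. 1 (≥98 % presence, then ≥94 % intact ORF) amount to successive threshold filters. The sketch below illustrates them under stated assumptions: the per-locus percentages are taken as precomputed inputs, and the function name, locus names and values are hypothetical. The third step, which excludes loci with excessive numbers of alleles using Gaussian process regression, is not reproduced here.

```python
def select_core_loci(presence_pct, intact_pct, min_presence=98.0, min_intact=94.0):
    """Return the loci that pass both percentage cut-offs, sorted by name.

    presence_pct: locus -> % of reference genomes in which the locus is present
    intact_pct:   locus -> % of reference genomes with an intact ORF
    """
    # Step (a): keep loci present in at least min_presence % of genomes.
    present = {
        locus for locus, pct in presence_pct.items() if pct >= min_presence
    }
    # Step (b): among those, keep loci with enough intact ORFs.
    return sorted(
        locus for locus in present
        if intact_pct.get(locus, 0.0) >= min_intact
    )


# Hypothetical per-locus percentages, for illustration only.
presence_pct = {"locusA": 100.0, "locusB": 98.0, "locusC": 97.9, "locusD": 99.5}
intact_pct = {"locusA": 99.0, "locusB": 94.0, "locusC": 100.0, "locusD": 93.0}

core = select_core_loci(presence_pct, intact_pct)
print(core)  # locusC fails the presence cut-off, locusD the intact-ORF cut-off
```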
Fig. 2. Binary logistic regression model to determine the probability that two genomes are related at ≤2 SNPs, given a certain difference in their cgMLST allelic profiles, based on the Oxfordshire dataset [13]. The number of SNPs was encoded as a binary dependent variable (1 if ≤2 SNPs, 0 otherwise) and the number of allelic differences was used as a predictor variable.
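To make the model in Fig. 2 concrete, a binary logistic regression with one predictor estimates P(≤2 SNPs | d) = 1 / (1 + exp(−(b0 + b1·d))), where d is the number of cgMLST allelic differences. The sketch below fits such a model by plain gradient descent on made-up data. The toy pairs, the function names and the optimiser settings are assumptions for illustration, not the Oxfordshire dataset or the authors' fitting procedure.

```python
import math


def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, n_iter=5000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1 * x) by batch gradient descent
    on the mean log-loss. Returns the pair (b0, b1)."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(n_iter):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(b0 + b1 * x) - y
            grad0 += error
            grad1 += error * x
        b0 -= learning_rate * grad0 / n
        b1 -= learning_rate * grad1 / n
    return b0, b1


def prob_related(b0, b1, allele_diff):
    """Predicted probability that a genome pair differs by <=2 SNPs."""
    return sigmoid(b0 + b1 * allele_diff)


# Hypothetical genome pairs: cgMLST allelic differences, and whether the
# pair was within 2 SNPs (1) or not (0). Values are invented.
allele_diffs = [0, 0, 1, 1, 2, 2, 3, 3, 4, 5, 6, 8]
within_2_snps = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

b0, b1 = fit_logistic(allele_diffs, within_2_snps)
for d in (0, 2, 4, 8):
    print(d, round(prob_related(b0, b1, d), 3))
```

With data of this shape the fitted slope is negative, so the predicted probability of a ≤2 SNP relationship falls as the allelic difference grows.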
Fig. 3. Minimum-spanning trees indicating the population structure of C. difficile in four patients with recurrent CDI episodes. Red, first episode; blue, second episode.
Fig. 4. Neighbour-joining trees based on cgMLST showing the phylogenetic relationships among isolates from previously published CDI outbreaks as indicated [14, 19, 26]. Nodes are coloured by HC2. CC, cgST complex, i.e. related at level HC150; RT, PCR ribotype. The scale, indicating one allelic difference, applies to all trees.
Fig. 5. Timelines of two transmission chains, discovered retrospectively through inspection of files from CDI patients with closely related isolates (HC2). Colours indicate hospital wards, 'X' indicates diagnosis of CDI, and arrows indicate presumed transmission pathways. Minimum-spanning trees indicating genomic distances among isolates are shown on the right. Upper panel: patient P1 was diagnosed with CDI in hospital H2 and transferred to hospital H3 15 days later. Another 5 and 6 days later, respectively, patients P2 in hospital H2 and P3 in hospital H3 were diagnosed with CDI with closely related strains. Both these patients were on the same wards as the initial patient, who probably had been the source of the pathogen. Since there was no temporal overlap between patient P2 and the other patients in hospital H2, transmission may have occurred indirectly, possibly through environmental contamination. Lower panel: another putative transmission chain involved three patients who had shared time in hospital H2. Patients P4 and P5 were diagnosed with CDI on the same day after they had shared 7 days in this hospital, albeit on different medical wards. The third patient developed CDI with the same cgST 4 days after being transferred to another hospital (H5), but had previously stayed at hospital H2 during the time when CDI was diagnosed in the first two patients. Since the three patients stayed on different wards in hospital H2, transmission presumably occurred indirectly.
Fig. 6. Phylogenetic structure of three international epidemics, each of which has spread for about 25 years [9, 11]. Within each epidemic, the majority of isolates are related at level HC10, as indicated by the colours. CC, cgST complex, i.e. related at level HC150; RT, PCR ribotype.
Fig. 7. Rapid-neighbour-joining phylogenetic tree based on cgMLST variation from 13 515 genomes. Colours and numerals indicate CCs (HC150 clusters) with ≥10 entries, and information on predominant PCR ribotypes is provided in brackets.
Characteristics of cgST complexes (CC) with ≥100 entries

| CC (HC150) | PCR ribotype | Number of entries | Sampling years | Number of countries | % isolates in HC2 >2¹ | % isolates from animal hosts |
|---|---|---|---|---|---|---|
| 4 | 027 | 2669 | 1985–2018 | 27 | 77 | 0 |
| 1 | 078, 126, 066 | 1222 | 1994–2018 | 26 | 61 | 17 |
| 17 | 017 | 769 | 1990–2017 | 24 | 64 | 0 |
| 3 | 001 | 768 | 1980–2017 | 16 | 62 | 0 |
| 6 | 020, 404 | 768 | 1995–2017 | 14 | 43 | 1 |
| 2 | 002 | 702 | 2006–2017 | 15 | 51 | 1 |
| 22 | 106, 500 | 531 | 1997–2017 | 7 | 59 | 3 |
| 86 | 005 | 468 | 1980–2017 | 8 | 41 | 0 |
| 34 | 014 | 421 | 1995–2017 | 10 | 35 | 0 |
| 55 | 015 | 318 | 2006–2017 | 6 | 37 | 0 |
| 71 | 014, 020 | 315 | 2004–2017 | 16 | 40 | 1 |
| 145 | 015 | 284 | 2006–2016 | 7 | 39 | 0 |
| 256 | 023 | 268 | 2001–2015 | 6 | 40 | 0 |
| 79 | 010 | 249 | 2003–2018 | 7 | 53 | 3 |
| 178 | 018, 356 | 243 | 2006–2017 | 7 | 52 | 0 |
| 242 | 039 | 199 | 2008–2017 | 4 | 58 | 1 |
| 10 | 012 | 159 | 1996–2017 | 7 | 52 | 0 |
| 88 | 014 | 132 | 1996–2016 | 9 | 33 | 8 |
| 11 | 070 | 110 | 2006–2017 | 6 | 32 | 0 |
| 187 | 054 | 109 | 2007–2018 | 6 | 47 | 0 |
| 141 | 001, 026 | 107 | 2007–2016 | 2 | 7 | 0 |
| 391 | 081 | 105 | 1996–2016 | 4 | 31 | 0 |
| 49 | 011, 056, 446 | 103 | 2001–2017 | 5 | 35 | 0 |

¹ Isolates in HC2 clusters with >2 entries.
Fig. 8. Rapid-neighbour-joining phylogenetic tree based on cgMLST variation from 2263 genomes, for which PCR ribotyping information is available. Upper panel: nodes are coloured by PCR ribotype as indicated. Lower panel: nodes are coloured by CC (HC150 clusters).