Literature DB >> 31107202

A global to local genomics analysis of Clostridioides difficile ST1/RT027 identifies cryptic transmission events in a northern Arizona healthcare network.

Charles H D Williamson¹, Nathan E Stone¹, Amalee E Nunnally¹, Heidie M Hornstra¹, David M Wagner¹, Chandler C Roe¹, Adam J Vazquez¹, Nivedita Nandurkar², Jacob Vinocur², Joel Terriquez², John Gillece³, Jason Travis³, Darrin Lemmer³, Paul Keim¹, Jason W Sahl¹.

Abstract

Clostridioides difficile is a ubiquitous, diarrhoeagenic pathogen often associated with healthcare-acquired infections that can cause a range of symptoms from mild, self-limiting disease to toxic megacolon and death. Since the early 2000s, a large proportion of C. difficile cases have been attributed to the ribotype 027 (RT027) lineage, which is associated with sequence type 1 (ST1) in the C. difficile multilocus sequence typing scheme. The spread of ST1 has been attributed, in part, to resistance to fluoroquinolones used to treat unrelated infections, which creates conditions ideal for C. difficile colonization and proliferation. In this study, we analysed 27 isolates from a healthcare network in northern Arizona, USA, and 1352 publicly available ST1 genomes to place locally sampled isolates into a global context. Whole genome, single nucleotide polymorphism analysis demonstrated that at least six separate introductions of ST1 were observed in healthcare facilities in northern Arizona over an 18-month sampling period. A reconstruction of transmission networks identified potential nosocomial transmission of isolates, which were only identified via whole genome sequence analysis. Antibiotic resistance heterogeneity was observed among ST1 genomes, including variability in resistance profiles among locally sampled ST1 isolates. To investigate why ST1 genomes are so common globally and in northern Arizona, we compared all high-quality C. difficile genomes and identified that ST1 genomes have gained and lost a number of genomic regions compared to all other C. difficile genomes; analyses of other toxigenic C. difficile sequence types demonstrate that this loss may be anomalous and could be related to niche specialization. These results suggest that a combination of antimicrobial resistance and gain and loss of specific genes may explain the prominent association of this sequence type with C. difficile infection cases worldwide. The degree of genetic variability in ST1 suggests that classifying all ST1 genomes into a quinolone-resistant hypervirulent clone category may not be appropriate. Whole genome sequencing of clinical C. difficile isolates provides a high-resolution surveillance strategy for monitoring persistence and transmission of C. difficile and for assessing the performance of infection prevention and control strategies.

Entities: CellLine Chemical Disease Species

Keywords: Clostridioides difficile; antibiotic resistance; genomics; transmission

Mesh：

Substances：
DNA, Bacterial

Year: 2019 PMID： 31107202 PMCID： PMC6700662 DOI： 10.1099/mgen.0.000271

Source DB: PubMed Journal: Microb Genom ISSN： 2057-5858

Data Summary

Whole genome sequencing data have been deposited in the National Center for Biotechnology Information Sequence Read Archive under BioProject accession number PRJNA438482 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA438482). Supporting data are provided as supplementary material in figshare: 10.6084/m9.figshare.7966775. The authors confirm all supporting data, code and protocols have been provided within the article or through supplementary data files. is a major cause of nosocomial infections that can result in toxic megacolon and death. Several lineages of have been associated with a large proportion of infections; one of these lineages is labelled as ribotype 027 (RT027) or sequence type 1 (ST1) in the multilocus sequence typing scheme. Here we analysed RT027/ST1 isolates sampled from clinical settings in northern Arizona. We identified multiple introductions of ST1 into the healthcare network as well as potential nosocomial transmission of closely related isolates. Our results suggest that acquisition of antibiotic resistance and gain and loss of specific genes may explain why the ST1 lineage is routinely associated with infections. We also introduce an in silico ribotyping approach to relate whole genome sequence data to PCR ribotyping information. The current work provides insight into infections and transmission on a local scale, places local isolates from a single healthcare network into a global context, and demonstrates the value of incorporating whole genome sequencing and comparative genomics into healthcare surveillance programmes.

Introduction

[1] is one of the most commonly observed diarrhoeal pathogens in hospital settings. infection (CDI) can range in severity from asymptomatic carriage or mild disease to toxic megacolon and death. The recent rise in the frequency in CDI has been attributed, in part, to the spread of fluoroquinolone-resistant strains of ribotype (RT) 027. Strain CD196, the earliest identified RT027 isolate [2], was isolated in France in 1985 [3]. RT027 was linked to CDI outbreaks in North America and Europe in the 2000s [4-8] and has spread around the world [9]. Although a decrease in the prevalence of RT027 has been reported in some regions [10], the RT027 lineage continues to be routinely isolated from clinical samples [11-16]. Although the mechanisms behind the success of the RT027 lineage are not fully understood, the increase of CDI cases caused by RT027 in the 2000s has been linked to strains with fluoroquinolone resistance [4, 9]. It has also been suggested that the lineage can displace endemic strains [17] and, more recently, Collins and colleagues [18] associated an increased ability to metabolize trehalose with epidemic ribotypes (RT027 as well as RT078), suggesting this trait along with the increased addition of trehalose to foods contributed to the success of RT027. Many typing methods have been used to characterize isolated from clinical settings. Ribotyping, a method that relies on differential amplification length profiles of ribosomal intergenic spacer regions, has been one of the most widely used procedures [19, 20]. Multilocus sequence typing (MLST) has also been a commonly applied approach for characterizing diversity and is more appropriate than ribotyping for examining evolutionary history and relatedness [21]. RT027 is associated with sequence type (ST) 1 in the MLST scheme [22]. Additional ribotypes associated with ST1 include RT016, RT036 and RT176 [21]. RT016 has been isolated from stool samples associated with CDI in England [23] and RT176 has been associated with CDI in Poland [24] and the Czech Republic [25]. Ribotyping and MLST are increasingly being replaced by comparative genomic methods that rely on whole genome sequence (WGS) data. Ribotyping results, MLST profiles and comparisons utilizing the entire genome are not always congruent [20]; WGS provides the highest resolution for comparative genomics and should be the focus of comparative studies moving forward. The primary focus of this study is a comparison of ST1, which forms a monophyletic clade (see Results) and includes many virulent isolates collected from human clinical samples. Multiple studies have investigated with comparative genomics, with several studies focused on disease caused by ST1 (RT027). In one comparative study of three genomes [2], the authors identified several genomic regions potentially associated with increased virulence of RT027. Since that time, genomic sequences from many additional strains have become available. A large comparative genomics study of isolates collected from patients with diarrhoea in hospital systems in the UK over a 4-year period indicated that the ST1 lineage was prevalent (17 % of samples) [26]. Another study that included 1290 isolates from patients with CDI in the same area found that 35 % of the isolates were ST1 [27]. He and colleagues [9] investigated the global phylogeny and spread of RT027 (ST1) with whole genome single nucleotide polymorphism (SNP) analysis and identified two epidemic lineages, both of which contained a mutation conferring fluoroquinolone resistance. As these studies demonstrate, ST1 isolates have been frequently identified from clinical samples; other studies have indicated that ST1 isolates are less commonly associated with other environments (soils, dogs, etc.) [28-31]. Although not conclusive, together these studies suggest that ST1 isolates may preferentially colonize the human gut over other environments. One of the primary concerns about CDI is emerging antimicrobial resistance (AMR). Antibiotic-associated pseudomembranous colitis has been associated with since the late 1970s [32-34]. Resistance to a variety of antimicrobials has been associated with isolates, many of which are multi-drug resistant ([35, 36] and references therein). AMR in may vary between lineages or by region due to differing antimicrobial use and can impact CDI with regard to infection, recurrence and disease outcome [35, 36]. As mentioned previously, fluoroquinolone resistance has been associated with many ST1 (RT027) strains, and this resistance probably contributed, at least partially, to the increase in CDI cases attributed to ST1 during the 2000s as fluoroquinolone use increased during the 1990s and early 2000s [4, 9, 37–39]. Fluoroquinolone resistance in ST1 strains has been associated with mutations in the gyrA and gyrB genes [40, 41]. Suggested clinical guidelines for treating CDI currently include treatment with vancomycin, fidaxomicin and metronidazole [42]. Although reduced susceptibility to these compounds has been reported, resistance in has had limited impact on the efficacy of these drugs in the clinic thus far [42, 43]. Understanding AMR in may yield insights into the prevalence of CDI caused by certain lineages and provide information regarding best practices for prescribing antibiotics. In this study, we examined 27 ST1 genomes generated as part of a surveillance project across two healthcare facilities in northern Arizona, USA, during 2016 and 2017 and publicly available genomes. Comparisons were conducted to: (1) characterize northern Arizona ST1 isolates in the context of a worldwide set of genomes; (2) evaluate potential transmission networks within northern Arizona healthcare facilities; (3) use genomics to identify potential mechanisms that have allowed for the widespread presence of the ST1 lineage in northern Arizona and across hospitals worldwide; and (4) better understand the pan-genomics and phylogenomics of the species in general. We also developed an in silico ribotyping method for relating whole genome sequence data and ribotyping information.

Methods

Genome download and sequence typing

All available genome assemblies (n=1092) were downloaded from GenBank on 30 November 2017. For quality control purposes, statistics were gathered for each genome for number of contigs, number of ambiguous nucleotides (non A,T,G,C) and total genome assembly size. Additionally, all raw sequencing data (paired-end Illumina data only) associated with available from the Sequence Read Archive (SRA) on 1 September 2017 were downloaded (sratoolkit fastq-dump). For isolates represented by multiple SRA runs, the runs were combined and labelled with the BioSample accession. Data were discarded if the read length was less than 75 bp. The multilocus sequence type (ST) of samples represented by raw read data was determined with stringMLST v0.5 (default settings) [44]. Samples typed as ST1 were assembled with SPAdes v3.10.0 (--careful --cov-cutoff auto -k auto) [45]. These genome assemblies were combined with GenBank assemblies for downstream analyses; in some cases, these reads represent duplicates of GenBank assemblies, but were treated independently in this study due to difficulties with association. Any GenBank assembly or assembly generated with SPAdes that contained more than 1500 contigs, more than 10 ambiguous nucleotides, an anomalous genome assembly size (final data set: <3 600 277 or >4 698 454 bp), or anomalous GC content (final data set: <28.07 or >29.74 mol%) was removed from the data set. In addition, all pairwise mash (v2.0, default sketch size) [46] distances were calculated for all genome assemblies in order to identify genomes annotated as but belonging to different species. Genomes with an average pairwise mash distance >0.03, which corresponds to <0.97 average nucleotide identity (a conservative cutoff value), were removed from the data set (average mash distances for remaining genomes were less than 0.02). The ST was determined for all assemblies passing quality control metrics with a custom script (https://gist.github.com/jasonsahl/2eedc0ea93f90097890879e56b0c3fa3) that utilizes blastn [47] and the PubMLST database (https://pubmlst.org/) for [22]. For ST1 genomes, if the ST of the assembly did not match the ST predicted by stringMLST, the assembly was removed from the data set. A total of 1850 genome assemblies from NCBI data (609 GenBank assemblies and 1241 in-house assemblies from the SRA identified as ST1) were included in this study (Table S1, available in the online version of this article). Additionally, publicly available raw sequencing reads representing non-ST1 isolates (n=3402) were included in the in silico ribotyping analyses.

isolation, DNA extraction, sequencing and assembly

Stool samples identified as containing from two northern Arizona healthcare facilities (labelled facility A and facility B) were collected under IRB No. 764034-NAH and were stored at −80 °C until processing. Isolation of from stool samples was performed as outlined by Edwards et al. [48] with the following modifications; one 10 µl loopful of partially thawed stool was re-suspended in 500 µl of sterile 1× PBS in aerobic conditions. The suspension was immediately transferred to a vinyl Type C anaerobic chamber (Coy Laboratory Products), where 100 µl was plated onto pre-reduced taurocholate-cefotoxin-cyloserine-fructose agar (TCCFA) and incubated at 36 °C for 24–48 h. Suspected were subcultured onto pre-reduced brain heart infusion agar supplemented with 0.03 % l-cysteine (BHIS) and incubated anaerobically at 36 °C for 24 h. Once isolation of suspected species was achieved, a lawn was created (also on BHIS) and incubated anaerobically at 36 °C for an additional 24 h. Genomic DNA was extracted from each isolate, species identification was confirmed via TaqMan PCR, and all positive extractions were processed for downstream WGS as described previously [31]. Based on this methodology, a total of 27 ST1 isolates were identified from hospitals in northern Arizona between March 2016 and September 2017 (Table 1). DNA was sequenced on the Illumina MiSeq platform and all genomes were assembled with SPAdes v3.10.0. The average per-contig depth of coverage was calculated with genomeCoverageBed (v2.27.1) [49] from BWA-MEM v0.7.7 alignments [50]. Additionally, 200 bases of each contig was aligned against the GenBank [51] nucleotide database with blastn (v2.7.1) [47] and the identity of the top hit was tabulated. If the alignment was from a known contaminant or from a different species sequenced on the same run, the contig was manually removed from the assembly.

Table 1.

ST1 isolates collected from a healthcare network in northern Arizona

Isolation date (YYYY-MM-DD)

SRA accession

Sequence type

In silico genome screen for markers of interest

Isolate name

Lineage ¹

tcdA/tcdB ²

tcdC ³

cdtA/cdtB ²

gyrA:82 ⁴

TetM ⁵

ErmB ⁵

CdeA ⁵

DfrF ⁵

HS-FS-000016-I-01

2016-04-20

SRR6890740

ST1

FQR1

A+/B+

Identical to CD196

+ / +

–

HS-FS-000023-I-01

2016-04-09

SRR6890728

ST1

FQR1

A+/B+

Identical to CD196

+/ +

–

HS-FS-000030-I-01

2016-04-05

SRR6890724

ST1

FQR1

A+/B+

Identical to CD196

+/+

–

HS-FS-000082-I-02

2016-10-22

SRR6890738

ST1

FQR1

A+/B+

Identical to CD196

+/+

HS-FS-000084-I-02

2016-10-30

SRR6890739

ST1

FQR1

A+/B+

Identical to CD196

+/+

–

HS-FS-000103-I-02

2016-11-15

SRR6890736

ST1

FQR1

A+/B+

Identical to CD196

+/+

–

HS-FS-000127-I-03

2016-12-03

SRR6890737

ST1

FQR1

A+/B+

Identical to CD196

+/+

–

HS-FS-000148-I-02

2016-12-12

SRR6890734

ST1

FQR1

A+/B+

Identical to CD196

+/+

–

HS-FS-000151-I-02

2016-12-08

SRR6890735

ST1

FQR1

A+/B+

Identical to CD196

+/+

–

+⁶

–

HS-FS-000173-I-03

2017-01-03

SRR6890732

ST1

FQR1

A+/B+

Identical to CD196

+/+

–

HS-FS-000020-I-01

2016-04-25

SRR6890741

ST1

FQR2

A+/B+

Identical to CD196

+/+

–

HS-FS-000021-I-01

2016-04-27

SRR6890730

ST1

FQR2

A+/B+

Identical to CD196

+/+

–

HS-FS-000022-I-01

2016-04-20

SRR6890731

ST1

FQR2

A+/B+

Identical to CD196

+/+

–

HS-FS-000024-I-01

2016-04-09

SRR6890729

ST1

FQR2

A+/B+

Identical to CD196

+/+

–

HS-FS-000025-I-01

2016-03-29

SRR6890726

ST1

FQR2

A+/B+

Identical to CD196

+/+

–

HS-FS-000031-I-01

2016-03-26

SRR6890725

ST1

FQR2

A+/B+

Identical to CD196

+/+

–

HS-FS-000034-I-02

2016-05-04

SRR6890722

ST1

FQR2

A+/B+

Identical to CD196

+/+

–

HS-FS-000043-I-01

2016-05-31

SRR6890723

ST1

FQR2

A+/B+

Identical to CD196

+/+

–

HS-FS-000057-I-02

2016-07-09

SRR6890747

ST1

FQR2

A+/B+

Identical to CD196

+/+

–

HS-FS-000091-I-02

2016-09-16

SRR6890746

ST1

FQR2

A+/B+

Identical to CD196

+/+

–

HS-FS-000188-I-02

2017-01-11

SRR6890733

ST1

FQR2

A+/B+

Identical to CD196

+/+

–

HS-FS-000251-I-02

2017-05-30

SRR6890745

ST1

FQR2

A+/B+

Identical to CD196

+/+

–

HS-FS-000027-I-01

2016-04-29

SRR6890727

ST1

A+/B+

Identical to CD196

+/+

–

HS-FS-000194-I-03

2017-02-05

SRR8166226

ST1

A+/B+

Identical to CD196

+/+

–

HS-FS-000264-I-02

2017-07-29

SRR6890744

ST1

A+/B+

Identical to CD196

+/+

–

HS-FS-000287-I-02

2017-08-30

SRR6890743

ST1

A+/B+

Identical to CD196

+/+

–

HS-FS-000292-I-03

2017-09-06

SRR6890742

ST1

A+/B+

Identical to CD196

+/+

–

1, Lineages defined in He et al. 2013. 2, present if BSR value >0.8. 3, BSR value=1. 4, GyrA mutation inferring fluoroquinolone resistance: I=resistant and T=susceptible. 5, present if BSR value >0.9. 6, two putative copies.

ST1 isolates collected from a healthcare network in northern Arizona Isolation date (YYYY-MM-DD) SRA accession Sequence type genome screen for markers of interest Isolate name Lineage tcdA/tcdB tcdC cdtA/cdtB gyrA:82 TetM ErmB CdeA DfrF HS-FS-000016-I-01 2016-04-20 SRR6890740 ST1 FQR1 A+/B+ Identical to CD196 + / + I – + + – HS-FS-000023-I-01 2016-04-09 SRR6890728 ST1 FQR1 A+/B+ Identical to CD196 +/ + I – + + – HS-FS-000030-I-01 2016-04-05 SRR6890724 ST1 FQR1 A+/B+ Identical to CD196 +/+ I – + + – HS-FS-000082-I-02 2016-10-22 SRR6890738 ST1 FQR1 A+/B+ Identical to CD196 +/+ I + + + + HS-FS-000084-I-02 2016-10-30 SRR6890739 ST1 FQR1 A+/B+ Identical to CD196 +/+ I – + + – HS-FS-000103-I-02 2016-11-15 SRR6890736 ST1 FQR1 A+/B+ Identical to CD196 +/+ I – + + – HS-FS-000127-I-03 2016-12-03 SRR6890737 ST1 FQR1 A+/B+ Identical to CD196 +/+ I – + + – HS-FS-000148-I-02 2016-12-12 SRR6890734 ST1 FQR1 A+/B+ Identical to CD196 +/+ I – + + – HS-FS-000151-I-02 2016-12-08 SRR6890735 ST1 FQR1 A+/B+ Identical to CD196 +/+ I – +6 + – HS-FS-000173-I-03 2017-01-03 SRR6890732 ST1 FQR1 A+/B+ Identical to CD196 +/+ I – + + – HS-FS-000020-I-01 2016-04-25 SRR6890741 ST1 FQR2 A+/B+ Identical to CD196 +/+ I – – + – HS-FS-000021-I-01 2016-04-27 SRR6890730 ST1 FQR2 A+/B+ Identical to CD196 +/+ I – – + – HS-FS-000022-I-01 2016-04-20 SRR6890731 ST1 FQR2 A+/B+ Identical to CD196 +/+ I – – + – HS-FS-000024-I-01 2016-04-09 SRR6890729 ST1 FQR2 A+/B+ Identical to CD196 +/+ I – – + – HS-FS-000025-I-01 2016-03-29 SRR6890726 ST1 FQR2 A+/B+ Identical to CD196 +/+ I – – + – HS-FS-000031-I-01 2016-03-26 SRR6890725 ST1 FQR2 A+/B+ Identical to CD196 +/+ I – – + – HS-FS-000034-I-02 2016-05-04 SRR6890722 ST1 FQR2 A+/B+ Identical to CD196 +/+ I – – + – HS-FS-000043-I-01 2016-05-31 SRR6890723 ST1 FQR2 A+/B+ Identical to CD196 +/+ I – – + – HS-FS-000057-I-02 2016-07-09 SRR6890747 ST1 FQR2 A+/B+ Identical to CD196 +/+ I – – + – HS-FS-000091-I-02 2016-09-16 SRR6890746 ST1 FQR2 A+/B+ Identical to CD196 +/+ I – – + – HS-FS-000188-I-02 2017-01-11 SRR6890733 ST1 FQR2 A+/B+ Identical to CD196 +/+ I – – + – HS-FS-000251-I-02 2017-05-30 SRR6890745 ST1 FQR2 A+/B+ Identical to CD196 +/+ I – – + – HS-FS-000027-I-01 2016-04-29 SRR6890727 ST1 A+/B+ Identical to CD196 +/+ T – – + – HS-FS-000194-I-03 2017-02-05 SRR8166226 ST1 A+/B+ Identical to CD196 +/+ T – – + – HS-FS-000264-I-02 2017-07-29 SRR6890744 ST1 A+/B+ Identical to CD196 +/+ T – – + – HS-FS-000287-I-02 2017-08-30 SRR6890743 ST1 A+/B+ Identical to CD196 +/+ T – – + – HS-FS-000292-I-03 2017-09-06 SRR6890742 ST1 A+/B+ Identical to CD196 +/+ T – – + – 1, Lineages defined in He et al. 2013. 2, present if BSR value >0.8. 3, BSR value=1. 4, GyrA mutation inferring fluoroquinolone resistance: I=resistant and T=susceptible. 5, present if BSR value >0.9. 6, two putative copies.

Simulated reads

Genome assemblies have been shown to provide more noise when inferring the phylogenetic structure of a species compared to raw read data [52]. For all publicly available genome assemblies downloaded from GenBank and passing filters, simulated reads were generated with ART [53] version MountRainier with the following command: art_illumina -ss MSv3 -l 250 f 75 m 300 s 30. Simulated reads instead of genome assemblies were used for SNP discovery and phylogenetics for ST1 genomes.

Phylogenetic analyses

To generate a preliminary phylogeny, genome assemblies (n=1877) were aligned against the finished genome of CD630 (GCA_000009205.1) with NUCmer (MUMmer v3.23) [54], and SNPs were called in conjunction with NASP v1.1.2 [52]. A maximum-likelihood phylogeny was inferred on an alignment of 85 331 concatenated SNPs (Table S2) called from a core genome alignment of 1 860 526 positions with IQ-TREE v1.6.1 [55] using the best-fit model (GTR+F+ASC+R5) identified by ModelFinder [56] and the UFBoot2 ultrafast bootstrapping option [57]. Phangorn v.2.4.0 [58] was used to calculate the consistency index (excluding parsimony-uninformative SNPs) and the retention index (using all SNPs) [59]. All trees were visualized in FigTree v1.4.3 (http://tree.bio.ed.ac.uk/software/figtree/) or the Interactive Tree of Life online tool [60]. To generate a phylogeny for ST1 genomes (n=1379), short read data (simulated reads for genome assemblies downloaded from NCBI and actual Illumina short reads for publicly available data and 27 newly sequenced genomes) were aligned against the finished ST1 genome CD196 (GCA_000085225.1) with BWA-MEM v0.7.7 [50]. SNPs were called from the alignments with the Unified Genotyper method in GATK v3.3–0 [61, 62]. Positions in duplicated regions of the reference genome (identified with NUCmer), positions with less than 10× coverage, and positions with a mixture of alleles (<0.9 single allele) were removed from the analyses. SNP positions with calls in at least 90 % of the analysed genomes were concatenated (n=4283 SNPs, Table S3). A maximum-likelihood tree was inferred with IQ-TREE v1.6.1 (best-fit model TVMe+ASC+R2). The tree was rooted with the ST1 genome ERR030325 based upon previous phylogenetic analysis. Phangorn was used to calculate the consistency index and retention index as described above.

Root-to-tip regression and divergence time analysis

To estimate the date of divergence of the successful ST1 lineage, a set of genomes (n=90; 87 ST1 genomes and three outgroup genomes) for which accurate sample dates were known were analysed. SNPs were called with NASP [52] and the program PHIPack [63] was used to test for evidence of recombination, as this can confound divergence-dating analyses. To calculate accurate divergence times, recombination in ST1 SNPs was identified and removed using the program ClonalFrameML [64]. Non-recombinatory SNP positions present in all ST1 genomes were processed further (Table S4). The presence of a temporal signal was assessed through regression analysis implementing root-to-tip genetic distance as a function of the sample year in the program TempEst version 1.5.1 [65] (http://tree.bio.ed.ac.uk/software/tempest/). The determination coefficient, R 2, was used as a measure of clocklike behaviour with the best-fitting root selected in an effort to maximize R 2. Additionally, 10 000 random permutations of the sampling dates over the sequences were performed in an effort to evaluate the significance of the regression results [66]. The best nucleotide substitution model was inferred using the Bayesian information criterion in the software IQ-TREE version 1.5.5 [55]. beast version 1.8.4 [67]was used to estimate evolutionary rates and time to the most recent common ancestor (TMRCA) through a Bayesian molecular clock analysis using tip dating. beast analysis was run with a correction for invariant sites by specifying a Constant Patterns model in the beast xml file. The numbers of constant As, Cs, Ts and Gs were added to the beast xml file. A ‘path and stepping stone’ sampling marginal-likelihood estimator was used to determine the best-fitting clock and demographic model combinations [68]. The log marginal likelihood was used to assess the statistical fits of different clock and demographic model combinations (Table S5). Four independent chains of one billion iterations were run for the best clock and demographic model combination. Convergence among the four chains was confirmed in the program Tracer version 1.6.0 (http://tree.bio.ed.ac.uk/software/tracer/).

transmission and persistence analysis.

To provide insight into the persistence and transmission of within and among healthcare facilities in northern Arizona, whole genome SNP phylogenies were coupled with epidemiological data. For northern Arizona isolate genomes within lineages of interest, core genome SNPs were called with NASP (reference CD196; GCA_000085225.1) and maximum-parsimony phylogenies were inferred with Phangorn [58] (parsimony trees were used for easily mapping the number of SNP differences between isolates). Epidemiological data were collected from healthcare facilities but, in some cases, information was incomplete.

In silico predicted AMR profiling

Proteins (n=2177) from the comprehensive antimicrobial resistance database (CARD) [69] were downloaded on 18 December 2017. Protein sequences were aligned against all ST1 genomes with the tblastn option in LS-BSR. Any peptide with a blast score ratio (BSR) [70] >0.9, which is equivalent to 90 % identity over 100 % of the peptide length [71], was investigated further. As antibiotic resistance within has been associated with mobile genetic elements [72, 73], the ST1 genomes were screened for the presence of a set of transposons: Tn916 (accession U09422), Tn1549 (AF192329), Tn4451 (U15027), Tn4453a (AF226276), Tn5397 (AF333235), Tn5398 (AF109075), Tn6194 (HG475346), Tn6215 (KC166248), Tn6218 (HG002387), TnB1230 (AJ222769), CTn1-like (extracted from R20291 genome FN545816.1: 4099441–4125374) [72], Tn6192 (FN545816.1: 2087833–2125356), and Tn6105 (FN545816.1: 2059876–2074593). Sequence reads were aligned to the reference transposon sequences with BWA-MEM v0.7.7 [50] and the breadth of coverage for analysed transposon sequences (depth of coverage of 3×) was used as an indicator of presence or absence of the transposon. blastn and tblastn [47] were also used to test for the presence of a transposon. Several mutations associated with fluoroquinolone resistance in have been published in the literature [36, 40, 41, 74]. For GyrA (CD630DERM_00060) and GyrB (CD630DERM_00050), predicted protein sequences were extracted for all genome assemblies from tblastn alignments and then aligned with muscle v3.8.31 [75]. The states of mutations associated with fluoroquinolone resistance were manually investigated.

Antibiotic resistance testing

Four ST1 isolates collected as part of this study (n=27) exhibiting various in silico predicted AMR profiles were screened for resistance to vancomycin, tetracycline and ciprofloxacin on blood agar using Etests (bioMérieux). These isolates were chosen as they represented the diversity of in silico predicted AMR profiles among northern Arizona isolates. Inhibition ellipses were examined at 24 and 48 h. Minimum inhibitory concentration (MIC) breakpoints were based upon recommendations by CLSI (https://clsi.org/), the European Committee on Antimicrobial Resistance Testing (http://www.eucast.org/clinical_breakpoints/) and the available literature [41] and are as follows: vancomycin: <2 µg ml−1 susceptible, 2–4 µg ml−1 intermediate, >4 µg ml−1 resistant; tetracycline: >16 µg ml−1 resistant; ciprofloxacin: >16 µg ml−1 resistant.

Comparative genomics

For large-scale comparative genomics, genome assemblies were processed with the LS-BSR pipeline [76]. In each genome coding regions (CDSs) were predicted with Prodigal v2.60 [77] and clustered with USEARCH v10.0.240_i86linux32 [78] at an identity of 0.9. A representative sequence for each cluster was then aligned against all analysed genomes with BLAT v35x.1 [79]. Scripts provided with the LS-BSR tool were used to identify core genome CDSs (pan_genome_stats.py) and compare BSR values of CDSs among groups of genomes (compare_BSR.py). To identify CDSs potentially differentially conserved among groups of interest, BSR values for individual CDSs were compared between genomes belonging to different STs (ST1: n=1379; ST8: n=31; ST15: n=34; and ST63: n=13). A CDS was considered conserved in a group (ST) if the BSR value for the CDS was greater than 0.8 for greater than 95 % of target-group genomes and the BSR value was greater than 0.4 in less than 5 % of genomes of the non-target group. To complement the LS-BSR pipeline and account for fragmented draft assemblies, CDS/region presence/absence was also analysed by aligning sequence reads to CDSs/regions/genomes of interest with BWA-MEM and evaluating the presence of a CDS/region in a genome based upon breadth of coverage (>80 %) at a depth of coverage of 3×. Gene regions have been identified in ST1 genomes that have been linked to virulence [2] (Table S6). Because these regions were discovered utilizing a small number of genomes, we screened the peptide sequences from these regions against all genomes with LS-BSR using the tblastn alignment option. Additionally, genome assemblies were screened for genomic features associated with trehalose metabolism [18] (Table S6). For TreR (CBA65726.1), sequences were extracted for all genome assemblies from tblastn alignments and then aligned with muscle [75]. The state of a mutation associated with increased trehalose metabolism was manually investigated. Genome assemblies were also screened for proteins encoded by a four-gene region associated with increased trehalose metabolism (FN665653.1: 3231100–3237100) with LS-BSR using tblastn.

In silico ribotyping

Standard ribotyping primers (5′-GTGCGGCTGGATCACCTCCT-3′, 5′-CCCTGCACCCTTAATAACTTGACC-3′) [19] were aligned against all completed ST1 genomes with an in silico PCR script (https://github.com/TGenNorth/vipr) and the predicted amplicon products were identified. Seven amplicons of defined lengths (Table S7) were conserved across all completed ST1 genomes. To test the ability of these amplicons to differentiate genomes, raw sequencing reads representing 4643 genomes (initial test set data downloaded from SRA in September 2017) were mapped against these seven amplicons with Kallisto [80] using the ‘--bias’ correction. For each sample, an amplicon was determined to be present if the read count for that amplicon was at least 20 % of the maximum read count for any of the seven amplicons for that sample; this allows for a small amount of indiscriminate read counting. The PCR ribotype for many of these samples was unknown; therefore, to further test the in silico ribotyping approach, sequencing reads representing 624 isolates that have been PCR ribotyped as part of another study [81] were also evaluated. This data set included multiple ribotypes (RT027 and RT176) within the ST1 lineage [81]. The in silico ribotyping approach was also applied to northern Arizona isolates (this study, n=27). For comparison with in silico results, PCR ribotyping was performed for 10 samples representing multiple sequence types. Ribotyping PCRs were conducted using the same forward and reverse primers described by Janezic et al. 2011 [82]. PCRs were carried out in 50 µl volumes containing the following reagents (given in final concentrations): 5–10 ng of gDNA template, 1× PCR buffer, 1.5 mm MgCl2, 0.2 mm dNTPs, 0.1 mg ml−1 BSA, 1.25 U Platinum Taq polymerase, and 1.0 µm of each primer. PCRs were cycled according to the following conditions: 95 °C for 5 min to release the polymerase antibody, followed by 35 cycles of 95 °C for 60 s, 57 °C for 30 s and 72 °C for 60 s, followed by a final elongation step of 72 °C for 10 min. We then conducted a PCR purification step aimed at removing primer dimer and unincorporated dNTPs using Agencourt AMPure XP beads (Beckman Coulter) at a 1 : 1 ratio (50 µl PCR product to 50 µl beads) according to the manufacturer’s recommendations with the following modifications: wash steps were conducted using 80 % ethanol and final products were eluted in 25 µl of 0.01 m Tris-HCl buffer, supplemented with 0.05 % Tween20. Electropherograms were generated for each of the PCR products using an Agilent 2100 Bioanalyzer (Agilent Technologies, Catalogue no. G2939BA), which provides greater resolution and specificity than a standard agarose gel. Specifically, the Agilent DNA 1000 kit (Agilent Biotechnologies, Catalogue no. 5067–1504) was used to analyse the sizing, quantity and separation patterns of 1 µl for each of the DNA libraries.

Genome accession information

ST1 genomes for isolates collected from healthcare facilities in northern Arizona were deposited in the National Center for Biotechnology Information Sequence Read Archive under BioProject accession PRJNA438482 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA438482). Individual accession numbers are shown in Table 1.

Results

ST1 isolates from two healthcare facilities in northern Arizona were examined to understand the diversity of ST1 strains circulating in northern Arizona in the context of a global collection of , to identify potential transmission events within a single healthcare network, and to characterize AMR and the core/pan-genome within the ST1 lineage.

Phylogenetic diversity of

A maximum-likelihood phylogeny (Fig. 1) was inferred on a concatenation of 85 331 SNPs (Table S2) identified from a 1 860 526 nt core genome alignment; this analysis included all genome assemblies that passed through all quality filters (n=1877; including 1379 ST1 genomes). The consistency index (excluding parsimony-uninformative SNPs) of this phylogeny was 0.30 and the retention index was 0.97. It is important to note that not all currently described STs are represented by genome assemblies included in this analysis. For example, ST11 genome assemblies, which include RT078, were not included as they were filtered out based on large mash distances. The results demonstrate that genomes identified as ST1, including all 27 isolates from northern Arizona, form a monophyletic clade. Other sequence types, such as ST2, ST3 and ST5, are paraphyletic, demonstrating the limitations of using sub-genomic sequence typing information to infer a common evolutionary history without consideration of WGS analysis.

Fig. 1.

Maximum-likelihood phylogeny of a global collection of genomes (n=1877) inferred from 85 331 SNPs showing the position of ST1 as well as other STs. ST1 forms a monophyletic clade (red) whereas other STs are paraphyletic (green, purple, blue). An additional phylogenetic analysis was conducted for only ST1 genomes (n=1379). A maximum-likelihood phylogeny (Fig. 2) was inferred on a concatenation of 4283 SNPs (Table S3). The consistency index (excluding parsimony-uninformative SNPs) was 0.87 and the retention index was 0.98. The GyrA Thr82Ile mutation conferring fluoroquinolone resistance is conserved in two lineages labelled FQR1 and FQR2 [9]. Isolates from two northern Arizona hospitals (n=27) that were sequenced as part of this study are found in FQR1 and FQR2 as well as lineages that do not have the GyrA Thr82Ile mutation. The northern Arizona isolates group into six independent lineages (labelled 1–6 in Fig. 2), separated by isolates collected from diverse geographical locations demonstrating that at least six separate introductions of ST1 isolates have occurred in the healthcare facilities over an 18-month time frame (Table 1, Fig. 2). Lineages that included more than one northern Arizona isolate [1, 5, 6] were investigated further to determine the number of SNPs differentiating strains within each lineage (pairwise comparisons of high-quality positions in non-repetitive regions against CD196 reference). Northern Arizona isolates within lineage 1, which also included isolates from other regions, vary by 0–31 SNPs (called from 3 617 365 core genomic positions). Isolates from Arizona sequenced as part of previous studies (n=7) [9, 83] are present in two lineages of the ST1 phylogeny, including lineage 1. These previously sequenced isolates from Arizona are from human and food sources [9, 84], and pairwise comparisons indicate that the Arizonan isolates from previous studies within lineage 1 of Fig. 2 (collected from food samples in 2007) are separated from northern Arizona isolates (this study) by 8–22 SNPs (called from 3 617 365 core genomic positions). Lineage 5 includes two northern Arizona isolates collected from the same patient; the isolates had no core genomic SNP differences. Lineage 6 is a monophyletic lineage that includes only northern Arizona isolates (this study); the genomes within this lineage vary by 0–17 pairwise SNPs (called from 3 895 562 core genomic positions). The average inter-lineage (lineages 1–6) distance between northern Arizona isolates sampled in this study is 136 SNPs.

Fig. 2.

Maximum-likelihood phylogeny of ST1 genomes (n=1379). Clinical isolates from healthcare facilities in northern Arizona (this study, in red) are present in six different lineages (1–6). The presence of the GyrA Thr82Ile mutation conferring quinolone resistance is indicated by blue branches (light blue indicates FQR1 and dark blue indicates FQR2 identified in a previous study [9]). Seven isolates from Arizona sequenced as part of previous studies are identified with purple triangles. Boxes highlight lineages (labelled 1 and 6) that contain multiple isolates from healthcare facilities in northern Arizona.

Timing

Bayesian analysis of SNP data (with recombination removed) representing a set of ST1 and outgroup genomes (n=90) for which sample dates are known estimated the SNP accumulation rate at 1.13E-7 substitutions per site per year [highest posterior density (HPD) interval: 8.0632E-8–1.4634E-7]. Considering the size of the genome, this rate correlates to less than one SNP per genome per year. This SNP accumulation rate is slightly lower than a previous estimate for ST1 [9], as well as within-host rates for isolates from clinical samples [26, 85]. The estimated divergence time for the ST1 lineage genomes included in our analysis is 40.06 years ago (95 % HPD interval between 32.14 and 49.54 years).

Analysis of transmission and persistence

Core genome SNP phylogenies for isolates collected from healthcare facilities in this study were paired with patient epidemiological data to provide insight into prevalence, acquisition and transmission within and between the two facilities (labelled A and B) (Fig. 3). The estimated evolutionary rate within ST1 genomes suggests that very few SNPs separate isolates involved in recent transmission events, which has been observed previously [26]. Patient location information reveals that patients with CDI-associated diarrhoea were frequently moved to multiple locations within a facility and once between facilities. Northern Arizona isolates within lineage 1 are closely related, and isolates within lineage 6 are also differentiated by very few SNPs (Fig. 2). Isolates separated by zero SNPs are observed at the same healthcare facility at multiple timepoints (e.g. lineage 6: HS-FS-000188, HS-FS-000020, HS-FS-000251 in Fig. 3) and across both healthcare facilities (e.g. lineage 1: HS-FS-000016, HS-FS-000023; lineage 6: HS-FS-000024, HS-FS-000031, HS-FS-000057 in Fig. 3). Isolates HS-FS-000057 and HS-FS-000043 are separated by 1 SNP and the two patients from which these isolates were sampled resided at the same skilled nursing facility, which is a third and separate entity from the two facilities at which samples were collected in this study. Thus, our analysis identifies potential transmission of within the healthcare network. Some isolates (e.g. HS-FS-000148 in lineage 1) are more distantly related to other sampled isolates, which could indicate infections acquired from community or environmental reservoirs. within lineages 1 and 6 appear to be prevalent in northern Arizona (although sampling in this study is limited to clinical specimens from two facilities) and are perhaps circulating within healthcare facilities.

Fig. 3.

Maximum-parsimony phylogenies for northern Arizona isolates within lineages 1 and 6 identified in Fig. 2 paired with patient location information to assess potential ST1 transmission and persistence within and among facilities. ST1 isolates originated from patient faecal samples collected at two facilities in the same healthcare network (facility A in purple, facility B in green). The location of the patient from which isolates originated is mapped over time (date formatting: YYYY-MM-DD) to the right of the phylogeny using rectangles. Patient locations at facility A are indicated with purple outlined rectangles whereas patient locations at facility B are indicated by rectangles outlined in green. Different wards within the facilities are indicated with differential shadings and patterns. Vertical black boxes and black arrows illustrate breaks in time. The letter ‘S’ indicates the date of collection for the faecal sample from which each isolate originated. Isolates HS-FS-000084 and HS-FS-000103 originated from the same patient. The patient from which isolate HS-FS-000025 originated was transferred between facilities. The analysis identifies potential transmission of within the healthcare network and persistence of a genotype within a patient.

In silico predicted AMR profiles in ST1 genomes

To understand AMR among northern Arizona isolates as well as ST1 in general, all ST1 genomes were screened against 2177 proteins in the CARD database with LS-BSR (Table S8). The results indicate that although no proteins had hits (BSR value >0.9) across all ST1 genomes, several proteins were highly conserved across many ST1 isolates. The efflux transporter encoded by the gene is highly conserved across ST1 genomes (average BSR value 0.99, BSR values >0.9 in 1375 of 1379 ST1 genomes); this CDS is also present in all northern Arizona isolates (Table 1). Although over-expression of the protein in has been correlated with increased fluoroquinolone (norfloxacin and ciprofloxacin) resistance, the role of the CdeA protein in is unclear [86]. Some isolates such as CD196 have high BSR values for the CdeA protein but are susceptible to some fluoroquinolones (gatifloxacin, moxifloxacin, levofloxacin [2]), indicating the presence of this gene alone does not infer resistance to newer fluoroquinolones in ST1 genomes. Genes encoding proteins associated with TetM and ErmB genetic elements previously described to confer resistance to tetracycline and erythromycin in [36, 87–89] are present in a number of ST1 genomes but are not universally conserved (TetM present in ~7 % of ST1 genomes, ErmB present in ~8 %). The tetM gene is present in one of the 27 northern Arizona isolates whereas the ermB gene is present in 10 northern Arizona isolates (all identified as FQR1) (Table 1). A gene encoding a dihydrofolate reductase (DfrF) protein was identified in 10 ST1 genomes including one isolate from northern Arizona (Table 1). DfrF has been associated with resistance to trimethoprim in [90] and [91]. Proteins encoded by the vanG operon associated with vancomycin resistance in (MIC 16 µg ml−1) [92] were identified for two ST1 genomes (BSR values ranging from 0.92 to 1), including one isolate from northern Arizona (HS-FS-000151). These genes were identified in addition to the vanG-like gene cluster identified in [89], which is highly conserved among genomes. Mobile genetic elements are common within genomes and have been associated with AMR [72, 73]; therefore, AMR markers identified by LS-BSR were evaluated in the context of transposons. The ermB gene present in 10 isolates from this study was associated with Tn6194 (HG475346.1) that has been demonstrated to be transferred between and [93] (Table S9). The tetM gene present in one northern Arizona isolate is not clearly associated with any of the previously described transposons. The contig on which the tetM gene is located shares high identity with some genomes (e.g. accession CP020488, locus tags B6S06_08495 – B6S06_08555) and with a genomic region of RT017 strain DSM 29627 (CP016102, CDIF29627_00594 – CDIF29627_00604) [94]. This region is also present in a number of ST1 genomes (publicly available SRA data included in this study) containing the tetM gene that do not show strong evidence of containing any of the screened AMR-associated transposons; this suggests the presence of a new and uncharacterized transposon in ST1. In addition to gene presence/absence, several SNP mutations have been described that confer resistance to quinolones [36, 40, 41, 74]. A comparison of SNP calls across all genomes at those positions demonstrated that >95 % of ST1 genomes contained the GyrA Thr82Ile mutation that confers quinolone resistance, whereas only ~12 % of non-ST1 genomes screened in this study have this mutation. Interestingly, genomic data for five of the 27 isolates from CDI cases at northern Arizona facilities indicate these isolates do not have the GyrA Thr82Ile mutation conferring quinolone resistance (lineages 2–5 in Fig. 2). Additional mutations in GyrA and GyrB associated with quinolone resistance were not broadly conserved in ST1 genomes (Table S10). These results suggest that AMR varies among ST1 isolates, including among northern Arizona isolates from the same hospital.

AMR testing

Four northern Arizona isolates (this study) with varied in silico predicted AMR profiles (Table 1) were screened for AMR (ciprofloxacin, tetracycline, vancomycin) using Etests (Table S11). These four isolates (HS-FS-000082, HS-FS-000127, HS-FS-000151, HS-FS-000264) were chosen to represent the diversity of in silico predicted AMR profiles among northern Arizona isolates (Table 1). As with most isolates [36], all four isolates were resistant to ciprofloxacin (MIC >32 µg ml−1), which is a commonly prescribed fluoroquinolone. One isolate (HS-FS-000082) was also resistant to tetracycline (MIC >32 µg ml−1) and this isolate contained the tetM gene (described above). Three of the four isolates had intermediate resistance (MIC 4 µg ml−1) to vancomycin as has been observed in other isolates [95, 96]. One of these isolates (HS-FS-000151) with intermediate vancomycin resistance contained CDSs associated with a vanG operon in (described above); however, this region did not appear to confer increased resistance to vancomycin in the tested isolate.

Pan-genome composition of ST1 genomes

LS-BSR analyses were performed to understand the size and extent of the ST1 core and pan-genome, to identify CDSs that may contribute to the association of ST1 with the hospital environment, and to evaluate potentially unique features of northern Arizona isolates. The strict core genome of 1379 ST1 genome assemblies was determined by LS-BSR to be 1991 CDSs (Table S12), which corresponds to approximately 55 % of the total CDSs in the genome for ST1 strain CD196. The analysed ST1 genomes spanned a wide range of assembly quality, which could result in underestimation of the core genome size. However, evaluating CDSs that are highly conserved among ST1 genomes could provide insight into why the lineage is so successful in addition to fluoroquinolone resistance. BSR values were used to identify CDSs that have been gained or lost in ST1 genomes. CDSs (n=44) were identified as highly conserved among ST1 genomes and largely absent from non-ST1 genomes, whereas 14 CDSs were identified as being lost from ST1 genomes (highly conserved among non-ST1 genomes and largely absent from ST1 genomes, see Methods for criteria) (Table S13). CDSs identified as highly conserved in ST1 include some genomic regions previously identified as unique to RT027 isolates in a comparison of three genomes [2]. For example, CDSs associated with the insertion of a catalytically more efficient thyA gene that disrupts the thyX gene in previously studied RT027 isolates [97] are highly conserved in the ST1 genomes analysed here; these CDSs were upregulated by an ST1/RT027 isolate during the early stage of an infection in a monoxenic mouse model [98]. Additional CDSs highly conserved in ST1 genomes are annotated as transposases or have unknown functions (Table S13). Binary toxin genes (cdtA and cdtB) are highly conserved within ST1 and largely absent from many other STs but did not meet the threshold defined here to be included as gained/lost CDSs. To understand if the phenomenon of gene acquisition/loss is common across all lineages of , a comparable analysis was conducted on ST8 (RT002 and RT159) genomes (n=31). The results demonstrate that although no CDSs are unique to ST8, 30 CDSs are differentially conserved in ST8 and two CDSs are generally deleted (see Methods for criteria). Genomes (n=13) from ST63 (RT053), which have been shown to contain similar in silico predicted AMR profiles (GyrA mutation) to ST1 (see below), were also compared to understand CDS conservation and loss. The results demonstrate that 26 CDSs are highly conserved among ST63 genomes and no regions appear to be specifically lost by this lineage. For genomes (n=34) identified as ST15 (RT010), which includes non-toxigenic strains, 118 CDSs were identified as highly conserved and 26 CDSs were identified as generally deleted. These analyses may suggest that ST15 and ST1 are adapting to different niches/environments. ST1 may be adapting to the human gut, and a large percentage of isolates from patients with CDI in northern Arizona belong to ST1. Although ST11 (RT078) genomes were excluded from the initial round of this analysis due to filtering genome assemblies by mash distance, CDSs identified as highly conserved or generally lost within ST1 were compared to ST11 lineage genomes (n=15, accession numbers in Table S13). The ST11 lineage has been associated with severe disease and has been commonly isolated from clinical samples as well as agricultural, animal and retail meat samples [84, 99–104]. Several CDSs identified as highly conserved within ST1 were also conserved in ST11 genomes (Table S13). These CDSs include a DNA-binding regulator, a histidine kinase and an RNA polymerase sigma factor. Some CDSs identified as generally lost within ST1 are conserved within ST11 (e.g. a TetR/AcrR family transcriptional regulator). Although LS-BSR analysis indicates that over half of the predicted CDSs present in the genome for strain CD196 are part of the strict core genome for ST1 isolates, some of the CDSs present in this genome (and others within ST1) are differentially conserved throughout the ST1 lineage. Some CDSs are lineage-specific or have been lost in particular lineages of ST1 (Fig. S1). In total, 5937 CDSs were identified in the accessory genome for ST1 and 1049 CDSs were identified as unique to one genome. This variation in genomic content within the ST1 lineage suggests that different lineages/isolates within ST1 will vary phenotypically, which could impact AMR, transmission and virulence. For example, Stone and colleagues [105] found variable toxin production between two ST1 isolates using a transepithelial electrical resistance (TEER) assay. To understand the genetic composition of northern Arizona isolates, LS-BSR was used to identify any unique CDSs within lineages 1–6 in Fig. 2 compared to other ST1 isolates, but no CDSs were found to be unique to any of these lineages. However, isolates HS-FS-000082 and HS-FS-000151, both within lineage 1, each contained a CDS not identified in any other ST1 genome (HS-FS-000082 – ABC transporter permease, HS-FS-000151 – site-specific integrase putatively associated with a second copy of ermB in this genome). Two northern Arizona isolates (HS-FS-000084, HS-FS-000103; lineage 1 in Fig. 2) that originated from the same patient over a 16-day period had smaller assembly sizes than many other ST1 genomes. These two genome assemblies are ~160 kb smaller than the average size of ST1 genomes included in this study and are ~210 kb smaller than other genomes within lineage 1 of Fig. 2. LS-BSR analysis indicated that 65 CDS are highly conserved in all other ST1 genomes but are absent in these two genome assemblies (Table S14). Poor assembly quality could result in a smaller genome with fewer CDSs; however, read mapping against the complete genome for CD196 identified a chromosomal region (~155 kb) that is present in all other ST1 genomes but is absent in the two isolates collected from the same patient. The 65 CDSs absent from the two genomes share high identity to CDSs within this chromosomal region of CD196 (Table S14). Interestingly, many of these CDSs are associated with carbohydrate metabolism and the phosphotransferase system (PTS). The PTS can impact regulation of many cellular processes [106] and differential expression of PTS genes by has been associated with nutrition shifts in vitro [107, 108] and during the course of mouse infection studies [98]. A set of 52 peptide sequences [2] previously associated with differential conservation in ST1 genomes (Table S6) were screened against all genomes with LS-BSR. The results indicate that several features associated with an epidemic ST1 strain are also present in non-ST1 genomes (Table S15) or are not highly conserved among other ST1 genomes. Although this result does not rule out that these regions are associated with virulence in some ST1 strains, it does suggest that these regions do not fully explain the success of the ST1 lineage. Collins and colleagues [18] identified genomic features associated with increased metabolism of trehalose in epidemic ribotypes. A mutation in the TreR protein (Leu172Ile) associated with trehalose metabolism is present in predicted proteins for 1375 of 1379 screened ST1 genome assemblies. Truncated predicted proteins were identified for two ST1 genome assemblies (ERR030337, GCA_900011385.1), and predicted proteins with low identity to the TreR protein for strain CD196 (CBA65726.1) were identified for two genome assemblies isolated from northern Arizona (HS-FS-000084, HS-FS-000103; isolates from the same patient – see above and lineage 1 in Fig. 2). Four protein sequences associated with trehalose metabolism in RT078 (ST11) isolates [18] were also screened against genome assemblies. High BSR values (≥0.96) for all four protein sequences were identified for genome assemblies representing multiple MLST sequence types (Table S15), including three ST1 genome assemblies (ERR026353, ERR030384, ERR251831; in-house assemblies from publicly available sequencing read data). The four protein sequences were not identified for any of the 27 sequenced northern Arizona isolates. The gold standard for genotyping has been ribotyping, wherein RT027 isolates are primarily associated with ST1 [22]. Here we have focused on ST1, which forms a monophyletic clade within a whole genome SNP phylogeny (Fig. 1). However, there appears to be a disconnect in the literature between researchers using ribotyping and those using either MLST or WGS approaches. To relate northern Arizona isolates to published PCR ribotypes without the need for performing PCR ribotyping in the laboratory, we explored the potential to extract ribotyping profiles from sequence data. To examine the utility of an in silico ribotyping approach, raw sequencing reads representing 4643 genomes (initial test set) from the SRA were aligned against amplicons predicted by probing standard ribotyping primers against three finished ST1 genomes, and hits were identified from samples with even read mapping across all amplicons (see Methods). An analysis of different thresholds for variable read mapping demonstrated that >20 % of the maximum read counts was an appropriate threshold for calling ribotype amplicons (Table S16). Of the 4643 genomes initially screened, 1241 were sequence typed as ST1 using stringMLST, and 1226 of these ST1 genomes were identified as RT027 based on in silico ribotype profiles (Table S17). The in silico ribotyping method correctly identified ~99 % of ST1 genomes based on their ribotype profiles. Only 17 of 3402 non-ST1 genomes were identified as ST1. Importantly, the PCR ribotypes for genomes in this data set were not evaluated, and it is assumed here that all ST1 genomes analysed are RT027 and all non-ST1 genomes are not RT027. ST1 also includes RT016, 036 and 176, but predominantly RT027 isolates have been associated with ST1. To further evaluate the utility of the in silico ribotyping approach, the methodology was applied to a set of sequencing reads representing 624 isolates that were PCR ribotyped in a recently published study [81]. Of the 624 isolates, 216 were PCR ribotyped as RT027 and all of these isolates were in silico ribotyped as RT027. Additionally, 21 PCR RT176 isolates, which are associated with the ST1 lineage, were included in this data set. Of these 21 isolates, only one was in silico ribotyped as RT027 (Table S17). No other isolate was incorrectly classified as RT027 in silico. The in silico ribotyping approach was then applied to isolates sequenced as part of this study, and all 27 northern Arizona isolates were in silico ribotyped as RT027 (Table S17). All northern Arizona isolates were also sequence typed as ST1 (in silico) according to the MLST scheme for . PCR ribotyping was performed on 10 isolates with various sequence types for comparison with in silico results (Fig. S2). Two ST1 isolates presented similar PCR ribotyping profiles, and PCR ribotyping profiles for these two ST1 isolates were comparable to in silico ribotyping results, although not exact matches – this is partly due to primer selection for in silico and PCR ribotyping methods. The in silico profile included seven bands; one ST1 isolate presented analogues to all seven bands plus one larger band whereas the other ST1 isolate presented six analogous bands plus one larger band (Table S18). Variation between ribotyping profiles predicted from four RT027 genome assemblies and actual ribotyping profiles has been reported previously [109]. Non-ST1 isolates produced dissimilar PCR ribotyping profiles to ST1 isolates, and very few non-ST1 isolates were identified as ST1 by the in silico ribotyping method. The in silico ribotyping method described here could potentially provide a means to connect ribotyping information from past studies to the expanding WGS data available for isolates.

Discussion

ST1 is a successful worldwide lineage that appears to be especially adapted to the human gut and healthcare environments. The rise of in hospitals globally has been attributed, at least partially, to the emergence and proliferation of ST1 (RT027) [110]. In this study, we sequenced 27 ST1 isolates collected between March 2016 and September 2017 from two healthcare facilities in northern Arizona (these 27 isolates represent ~15 % of all isolates collected as part of a larger surveillance study) and compared them to a worldwide collection of ST1 genomes. The results demonstrate that diverse ST1 isolates were present in the sampled healthcare facilities; northern Arizona isolates are distributed throughout the ST1 phylogeny, including within two previously identified fluoroquinolone resistant lineages (FQR1 and FQR2 [9]). At least six separate introductions (lineages) of ST1 into the two sampled healthcare facilities were observed over the period of surveillance (Fig. 2). Four of the introductions/lineages are represented by only one isolate or genotype (lineages 2–5 in Fig. 2), whereas lineages 1 and 6 contain multiple northern Arizona genotypes (defined as SNP variation among isolates). Lineage 6 is a distinct lineage that includes only isolates from northern Arizona facilities. Interestingly, two isolates collected from food samples in Arizona as part of previous studies [9, 84] are closely related to some of the clinical isolates sequenced in this study (lineage 1 in Fig. 2). within lineages 1 and 6 appear to be prevalent ST1 genotypes circulating within northern Arizona. WGS provides a high-resolution method for differentiating bacterial isolates and offers insight into how pathogens persist and are transmitted within healthcare networks. Several studies, including this one, have indicated that the SNP accumulation rate in is approximately one SNP per genome per year [26, 85]; however, SNP accumulation rates can be complicated for bacteria with a spore stage in their life cycle [111]. Bayesian analysis suggests that the divergence time for the ST1 lineage is approximately 40 years ago (95 % HPD interval between 32.14 and 49.54 years), which is consistent with the identification of the first ST1/RT027 isolate in 1985 [2, 3]. He and colleagues [9] analysed a data set of ST1 genomes and described two epidemic lineages, both of which acquired the GyrA Thr82Ile mutation conferring quinolone resistance; the time of emergence for these two ST1 lineages was estimated to be in the early 1990s. The SNP accumulation rate within must be considered when attempting to track isolates on an epidemiological time scale [112] and cases linked by recent transmission will probably be separated by very few SNPs [26]. The genomic variation among northern Arizona isolates suggests that although some CDI cases from which these isolates were collected may potentially be the result of transmission within healthcare facilities, other cases are seemingly unrelated. We observed that closely related isolates, based on WGS analysis, were sampled in patients at multiple facilities and over multiple time points. In two instances, WGS showed that the same genotype persisted in a patient rather than the patient being infected with a different genotype. Since patients with CDI are moved between facilities and similar genotypes are detected at different facilities, infection prevention strategies must be implemented across the entire healthcare network to be most effective. Putatively unrelated cases may be the result of particular ST1 genotypes circulating in northern Arizona reservoirs and emerging as CDI cases due to antibiotic use or other factors. Importantly, monitoring transmission and persistence of these ST1 isolates with sub-genomic methods, such as ribotyping or MLST, would not have been possible due to a lack of resolution provided by these methods. Incorporating WGS and comparative genomics into the healthcare network surveillance programme enabled more effective monitoring, which can hopefully improve patient health in the future. AMR varies within [36], and in silico predicted AMR profiles for ST1 isolates sampled in this study are also variable. Resistance to fluoroquinolones within ST1 has been the focus of many studies [4, 9, 37–39] and approximately 95 % of the ST1 genomes compared in this study have the missense mutation in the gyrA gene (Fig. 2, Table S10) linked to resistance to some fluoroquinolones. The use of quinolones to treat unrelated infections may select for ST1, supporting the proliferation of clinical CDI caused by this sequence type. Interestingly, the GyrA Thr82Ile mutation and other mutations known to confer quinolone resistance were absent in five of the 27 ST1 isolates collected in this study. These five isolates account for four introductions of the ST1 lineage into the healthcare network in this region (lineages 2–5 in Fig. 2); one facility within the healthcare network implemented a programme to restrict fluoroquinolone use in 2017, which may reduce selection for the GyrA Thr82Ile genotype. AMR screening of northern Arizona isolates with varying in silico predicted AMR profiles (Table S11) indicated that all isolates were resistant to ciprofloxacin; resistance to ciprofloxacin is common among [36]. Resistance to tetracycline was indicated for one northern Arizona isolate for which a tetM gene was identified in the genome. Use of tetracyclines has not been frequently linked to CDI [113]; however, tetracycline resistance has been associated with some lineages of such as RT078 (ST11) [114-116]. No strains were fully resistant to vancomycin, which is a recommended treatment option for CDI [42], despite the presence of CDSs associated with vancomycin resistance being detected in one tested genome. Continued screening of AMR is important to monitor AMR variability and provide insight for updating treatment guidelines. Although AMR plays an important role in epidemiology, other factors, in addition to quinolone resistance, probably contribute to the prevalence of CDI attributed to ST1. For example, the GyrA Thr82Ile mutation is also highly conserved in ST63 (RT053) genomes, although this ST is not as successful as ST1 based on reported genotype frequencies from hospital isolation [11, 13, 26, 27]. Additionally, ST1 appears to be less prevalent than other STs in many sampled environments (samples not associated with the human gut or hospital) based upon environmental survey studies [30, 31, 117]. LS-BSR analysis indicated that genomic regions were highly conserved within ST1 genomes, as well as a number of regions that have been broadly deleted within ST1 genomes (Table S13). Reduction of the core genome can be associated with niche differentiation [118, 119]. Interestingly, two isolates from the same patient (this study) demonstrated large deletions compared to other ST1 genomes. The observed genome region acquisition and loss across ST1 may be associated with its prevalence in the human/hospital environment. An analysis of other STs demonstrates that the gene gain/loss may be more common in ST1 than other toxigenic lineages. However, as most sampling efforts are focused on clinical settings, the prevalence of ST1 in other environments is not fully understood. Genetic variability within ST1 may also contribute to the success of the lineage. For instance, resistance to clindamycin is not universal within ST1 [4, 120] but would provide an advantage to some ST1 isolates. Genomics alone cannot definitively identify why ST1 is such a successful lineage of , but can identify targets for further investigation (e.g. CDSs identified in Table S13). Additionally, here we focus on bacterial genomics without consideration of factors such as variation in gene regulation, host immune response, or hospital practices such as antimicrobial use policies. isolates have been differentiated using a variety of methods including MLST and PCR ribotyping. In this study, we used the MLST designation, as we can easily genotype sequenced genomes in silico in the absence of ribotyping data. However, due to recombination, phylogenies inferred from a concatenation of MLST markers have been shown to be incongruent with WGS phylogenies [121]. Using WGS data for , we demonstrate that multiple STs are paraphyletic on the core genome SNP phylogeny (Fig. 1). This is important if common STs are expected to have a shared evolutionary history, yet the history is conflicting when using the entire genome. Ribotyping represents a legacy method that is still used to type isolates and infer relationships. However, there remains a disconnect between ribotyping and sequencing methods. In this study, we present a workflow that can associate raw sequence data to a ribotype profile (RT027). The amplicons that were used for in silico ribotyping were identified from only three complete ST1 genomes; draft genomes are not useful for the identification of predicted amplicons due to the potential collapse of repeat regions (e.g. 16S rRNA genes) [122]. As additional complete genomes are generated from diverse ribotypes, this method can be modified to associate ribotype information with whole genome sequence and MLST data, which is useful for bridging between different methodologies. WGS and analysis provides the highest resolution method for comparing isolates and should be included in epidemiological studies if possible. This study has provided a large-scale analysis of a single sequence type of in order to understand the diversity within a successful, world-wide lineage and place isolates from a local healthcare network into a global context. The analyses indicated multiple introductions of ST1 into the healthcare network, provided insight into potential transmission of within the healthcare network and identified AMR variation among isolates. The genomic diversity within the ST1 genomes suggests that broadly applying labels to ST1 genomes, such as hypervirulent clone, may be inappropriate without an in-depth analysis of the gene presence/absence and SNP markers for AMR. Additional studies into how genomic variation affects toxin production and virulence are needed to assess the phenotypic diversity of ST1 as it relates to the observed genotypic diversity; these studies are currently ongoing and will help shape how we approach studies using sub-genomic information, such as ribotyping and MLST. 1. A full listing of NCBI accessions for strains used in this paper is available in Table S1. 2. Previously identified genomic regions associated with ST1 virulence used in analyses in this study are provided in Table S6 or in the manuscript text. 3. An in silico polymerase chain reaction script used in this study is available at GitHub – https://github.com/TGenNorth/vipr https://github.com/TGenNorth/vipr. 4. An in silico MLST script used in this study is available at GitHub – (https://gist.github.com/jasonsahl/2eedc0ea93f90097890879e56b0c3fa3). Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file.

119 in total

Review 1. Microbial minimalism: genome reduction in bacterial pathogens.

Authors: Nancy A Moran
Journal: Cell Date: 2002-03-08 Impact factor: 41.582

2. VanG-type vancomycin-resistant Enterococcus faecalis strains isolated in Canada.

Authors: David A Boyd; Tim Du; Romeo Hizon; Brynn Kaplen; Travis Murphy; Shaun Tyler; Shirley Brown; Frances Jamieson; Karl Weiss; Michael R Mulvey
Journal: Antimicrob Agents Chemother Date: 2006-06 Impact factor: 5.191

3. The comprehensive antibiotic resistance database.

Authors: Andrew G McArthur; Nicholas Waglechner; Fazmin Nizam; Austin Yan; Marisa A Azad; Alison J Baylay; Kirandeep Bhullar; Marc J Canova; Gianfranco De Pascale; Linda Ejim; Lindsay Kalan; Andrew M King; Kalinka Koteva; Mariya Morar; Michael R Mulvey; Jonathan S O'Brien; Andrew C Pawlowski; Laura J V Piddock; Peter Spanogiannopoulos; Arlene D Sutherland; Irene Tang; Patricia L Taylor; Maulik Thaker; Wenliang Wang; Marie Yan; Tennison Yu; Gerard D Wright
Journal: Antimicrob Agents Chemother Date: 2013-05-06 Impact factor: 5.191

4. Comparative analysis of an expanded Clostridium difficile reference strain collection reveals genetic diversity and evolution through six lineages.

Authors: Cornelis W Knetsch; Elisabeth M Terveer; Chris Lauber; Alexander E Gorbalenya; Céline Harmanus; Ed J Kuijper; Jeroen Corver; Hans C van Leeuwen
Journal: Infect Genet Evol Date: 2012-06-15 Impact factor: 3.342

5. Clostridium difficile infection in Europe: a hospital-based survey.

Authors: Martijn P Bauer; Daan W Notermans; Birgit H B van Benthem; Jon S Brazier; Mark H Wilcox; Maja Rupnik; Dominique L Monnet; Jaap T van Dissel; Ed J Kuijper
Journal: Lancet Date: 2011-01-01 Impact factor: 79.321

6. Update of Clostridium difficile infection due to PCR ribotype 027 in Europe, 2008.

Authors: E J Kuijper; F Barbut; J S Brazier; N Kleinkauf; T Eckmanns; M L Lambert; D Drudy; F Fitzpatrick; C Wiuff; D J Brown; J E Coia; H Pituch; P Reichert; J Even; J Mossong; A F Widmer; K E Olsen; F Allerberger; D W Notermans; M Delmée; B Coignard; M Wilcox; B Patel; R Frei; E Nagy; E Bouza; M Marin; T Akerlund; A Virolainen-Julkunen; O Lyytikäinen; S Kotila; A Ingebretsen; B Smyth; P Rooney; I R Poxton; D L Monnet
Journal: Euro Surveill Date: 2008-07-31

7. Actin-specific ADP-ribosyltransferase produced by a Clostridium difficile strain.

Authors: M R Popoff; E J Rubin; D M Gill; P Boquet
Journal: Infect Immun Date: 1988-09 Impact factor: 3.441

8. Genetic organisation, mobility and predicted functions of genes on integrated, mobile genetic elements in sequenced strains of Clostridium difficile.

Authors: Michael S M Brouwer; Philip J Warburton; Adam P Roberts; Peter Mullany; Elaine Allan
Journal: PLoS One Date: 2011-08-18 Impact factor: 3.240

9. Dietary trehalose enhances virulence of epidemic Clostridium difficile.

Authors: J Collins; C Robinson; H Danhof; C W Knetsch; H C van Leeuwen; T D Lawley; J M Auchtung; R A Britton
Journal: Nature Date: 2018-01-03 Impact factor: 49.962

10. Two Distinct Patterns of Clostridium difficile Diversity Across Europe Indicating Contrasting Routes of Spread.

Authors: David W Eyre; Kerrie A Davies; Georgina Davis; Warren N Fawley; Kate E Dingle; Nicola De Maio; Andreas Karas; Derrick W Crook; Tim E A Peto; A Sarah Walker; Mark H Wilcox
Journal: Clin Infect Dis Date: 2018-09-14 Impact factor: 9.079

5 in total

1. Genomic Determination of Relative Risks for Clostridioides difficile Infection From Asymptomatic Carriage in Intensive Care Unit Patients.

Authors: Jay Worley; Mary L Delaney; Christopher K Cummins; Andrea DuBois; Michael Klompas; Lynn Bry
Journal: Clin Infect Dis Date: 2021-10-05 Impact factor: 9.079

Review 2. Clostridioides difficile phage biology and application.

Authors: Joshua Heuler; Louis-Charles Fortier; Xingmin Sun
Journal: FEMS Microbiol Rev Date: 2021-09-08 Impact factor: 16.408

Review 3. Development and Implementation of Whole Genome Sequencing-Based Typing Schemes for Clostridioides difficile.

Authors: Sandra Janezic; Maja Rupnik
Journal: Front Public Health Date: 2019-10-24

4. K-mer based prediction of Clostridioides difficile relatedness and ribotypes.

Authors: Matthew Phillip Moore; Mark H Wilcox; A Sarah Walker; David W Eyre
Journal: Microb Genom Date: 2022-04

5. Integrated genomic epidemiology and phenotypic profiling of Clostridium difficile across intra-hospital and community populations in Colombia.

Authors: Marina Muñoz; Daniel Restrepo-Montoya; Nitin Kumar; Gregorio Iraola; Milena Camargo; Diana Díaz-Arévalo; Nelly S Roa-Molina; Mayra A Tellez; Giovanny Herrera; Dora I Ríos-Chaparro; Claudia Birchenall; Darío Pinilla; Juan M Pardo-Oviedo; Giovanni Rodríguez-Leguizamón; Diego F Josa; Trevor D Lawley; Manuel A Patarroyo; Juan David Ramírez
Journal: Sci Rep Date: 2019-08-05 Impact factor: 4.379

5 in total