Literature DB >> 31666367

Hash-Based Core Genome Multilocus Sequence Typing for Clostridium difficile.

A Sarah Walker^1,2,3, Mark H Wilcox⁴, David W Eyre^5,1, Tim E A Peto^1,2,3, Derrick W Crook^1,2,3.

Abstract

Pathogen whole-genome sequencing has huge potential as a tool to better understand infection transmission. However, rapidly identifying closely related genomes among a background of thousands of other genomes is challenging. Here, we describe a refinement to core genome multilocus sequence typing (cgMLST) in which alleles at each gene are reproducibly converted to a unique hash, or short string of letters (hash-cgMLST). This avoids the resource-intensive need for a single centralized database of sequentially numbered alleles. We test the reproducibility and discriminatory power of cgMLST/hash-cgMLST compared to those of mapping-based approaches in Clostridium difficile, using repeated sequencing of the same isolates (replicates) and data from consecutive infection isolates from six English hospitals. Hash-cgMLST provided the same results as standard cgMLST, with minimal performance penalty. Comparing 272 replicate sequence pairs using reference-based mapping, there were 0, 1, or 2 single-nucleotide polymorphisms (SNPs) between 262 (96%), 5 (2%), and 1 (<1%) of the pairs, respectively. Using hash-cgMLST, 218 (80%) of replicate pairs assembled with SPAdes had zero gene differences, and 31 (11%), 5 (2%), and 18 (7%) pairs had 1, 2, and >2 differences, respectively. False gene differences were clustered in specific genes and associated with fragmented assemblies, but were reduced using the SKESA assembler. Considering 412 pairs of infections with ≤2 SNPS, i.e., consistent with recent transmission, 376 (91%) had ≤2 gene differences and 16 (4%) had ≥4. Comparing a genome to 100,000 others took <1 min using hash-cgMLST. Hash-cgMLST is an effective surveillance tool for rapidly identifying clusters of related genomes. However, cgMLST/hash-cgMLST generate more false variants than mapping-based approaches. Follow-up mapping-based analyses are likely required to precisely define close genetic relationships.

Entities: Disease Species

Keywords: Clostridium difficilezzm321990; core genome MLST; quality assurance; whole-genome sequencing

Mesh：

Year: 2019 PMID： 31666367 PMCID： PMC6935933 DOI： 10.1128/JCM.01037-19

Source DB: PubMed Journal: J Clin Microbiol ISSN： 0095-1137 Impact factor: 5.948

INTRODUCTION

The rapid development of pathogen whole-genome sequencing offers huge potential for better understanding the epidemiology of many infections. When trying to intervene to stop transmission, it is often important to identify the most closely genetically related organisms already sequenced, as these represent potential recent sources of infection or cases that share an infection source. However, the rapidly growing scale of data generated makes identifying these closely related genomes among a background of many thousands of other genomes very challenging. Three main approaches can be taken to identify closely related genomes. Comparing single nucleotide polymorphisms (SNPs) identified following mapping to a reference genome offers high precision (1), but despite efforts to optimize computational approaches (2) it is relatively slow. In contrast, k-mer-based approaches based on hash algorithms, e.g., MASH (3) and PopPUNK (4), are fast, but the inherent and unstructured dimensionality reduction (e.g., summarizing the whole genome as 500 hash strings selected on the basis of sorted hash strings) can reduce precision in fine-scale transmission analyses. Core genome multilocus sequencing typing (cgMLST) (5) potentially provides a solution; genomes are summarized as a list of ∼2,000 to 3,000 numbers, with each number representing the unique sequence of each core gene, i.e., structured dimensionality reduction. This summary enables more rapid comparisons, as, taking the example of Clostridium difficile, only 2,270 gene allele numbers need be compared (6), rather than having to compare 4.3 million base pairs of sequence data for SNPs. A drawback of cgMLST as described to date is that it requires a centralized database of alleles of each gene to be maintained so cgMLST profiles generated by different laboratories are comparable. This centralized support can potentially be provided by academic, public health, or commercial organizations, but any given scheme’s sustainability is potentially limited by the funding available to support it. Additionally, for some pathogens, including C. difficile, several competing cgMLST/whole-genome MLST schemes (e.g., Enterobase [University of Warwick, UK], the cgMST.org Nomenclature Server [Ridom GmbH, Germany], and BioNumerics [bioMérieux, France]) containing different genes and profiles have been developed; the latter two are associated with commercial platforms for processing sequencing data. We therefore propose an alternative to cgMLST as described to date. Instead of maintaining a database of alleles, each allele is reproducibly converted to a unique hash, or short string of letters. This compresses each item of identical data to the same smaller representation, based on the sequence of an allele alone. Therefore, this process can be undertaken independently in different laboratories without the need to maintain or subscribe to a central database, but it still generates summary data in a reproducible form that can be exchanged by laboratories. This distributed approach avoids the potentially costly need to maintain a central database. This study has two main aims. The first is to demonstrate an implementation of hash-based cgMLST and to test whether hash-cgMLST profiles can be compared without a significant performance penalty compared to standard cgMLST; the second is to test the reproducibility and discriminatory power of cgMLST compared to SNP-based typing. The discriminatory power of cgMLST has been previously explored (for examples, see references 6–9); however, how cgMLST gene differences relate to SNP distances has not been comprehensively assessed. Instead, it is postulated that small numbers of SNPs are likely to fall in different genes, and so SNP distances and gene differences are likely to be similar for closely related isolates. We evaluate the extent to which this assumption holds. Related to this, only limited assessments of the reproducibility of cgMLST have been undertaken. The largest study to date involved the same Staphylococcus aureus DNA from 20 isolates undergoing sequencing in 5 laboratories (10). In this setting, in 80 comparisons (i.e., 20 sequences from 4 laboratories compared with the baseline laboratory), only 3 false gene differences were identified. We investigate whether these results can be replicated in C. difficile.

MATERIALS AND METHODS

Hash-cgMLST.

Using the cgMLST scheme of Bletz et al. (6), the first allele for each of the 2,270 genes was used to create a BLAST search query. Following previous descriptions (6, 10), BLAST searches for each gene required a 90% identity match, a matched length ≥99% of the query length, and the matched gene to be free from ambiguous characters or premature truncation. To avoid apparent truncated genes arising from misassembly, we checked the number of stop codons in the gene sequence and only retained matches with a single stop codon. To avoid truncation arising from contig breaks, we ensured that BLAST matches included the start and end of the query sequence. Other BLAST search parameters were as follows: “evalue=0.01, word_size=11, penalty=-1, reward=1, gapopen=5, gapextend=2.” The resulting genes were either matched to the database available at cgMLST.org, i.e., standard cgMLST, or hashed using an md5 algorithm to create a 32-character hexadecimal string. Deletions relative to the search query, represented by dashes in the matched gene sequence, were removed prior to generating the hash. This avoids false differences introduced by locally variable placement of these deletions introduced by BLAST. The resulting cgMLST and hash-cgMLST profiles were saved as JavaScript Object Notation (JSON) files, i.e., a format that could readily be exchanged between laboratories. Where no BLAST match was found for a gene in the scheme, an empty value was recorded and that gene excluded in pairwise comparisons. The choice of md5 hash provides 1632, i.e., 3.4 × 1038, possible hashes. There is a theoretical chance of hash collisions, i.e., different sequences resulting in the same hash, but as the number of viable sequences for each gene in cgMLST databases is typically only tens to hundreds, this is very unlikely. Importantly, if a hash collision occurred, this would result in genomes appearing falsely more similar, rather than in falsely excluding potential transmission.

Sequence data.

During whole-genome sequencing of C. difficile undertaken in Oxford and Leeds (UK), we have routinely resequenced a subset of isolates as part of our internal quality assurance. We searched our database for isolates sequenced more than once. For a subset of these replicate sequences, the same extracted DNA was used to generate both sequences; for the remainder, it was not documented in our laboratory information management system whether the same DNA extract was resequenced or whether a fresh DNA extract was made from the same frozen isolate (Table S1). Paired-end sequence data for both types of replicate were generated using Illumina technology, including various iterations of the HiSeq and MiSeq platforms, with read lengths ranging from 100 to 150 bp for the majority of the sequences (two 50-bp sequences were also included). To compare the discriminatory power of hash-cgMLST compared to SNP-based typing, we processed 973 genomes from a previously published study of consecutive C. difficile infections over 1 year in six English hospitals using our hash-cgMLST and SNP pipelines (11).

Bioinformatic processing.

For hash-cgMLST typing, raw sequence data underwent adapter trimming and quality trimming using bbduk.sh from the BBMap package (version 38.32) (12). Stringent quality trimming was applied following Mellmann et al. (10); both the left and right ends of each read were trimmed to a Q30 threshold (using BBDuk parameters “ktrim=r k=23 mink=11 hdist=1 tpe tbo qtrim=rl trimq=30”). Following this, the number of bases remaining in the trimmed reads was divided by the length of the 630 reference genome (13), 4,290,252 bp, to provide the mean high-quality coverage; this was required to be ≥50× for a sequence to be included in the study. Appropriate quality trimming and adapter removal were confirmed using FastQC (14). To check for contamination with non-C. difficile DNA, the species origin of sequence reads was classified using Kraken2 (15) and the MiniKraken2_v1 database (built from the RefSeq bacteria, archaea, and viral libraries). Following Bletz et al. (6), reads were de novo assembled using SPAdes (version 3.11.1) (16), with the “-careful” flag to reduce misassembly, using Burrows-Wheeler Aligner (BWA)-based mapping to confirm variants. Assembly quality metrics were obtained using the stats.sh script from BBMap (12). Samples with assembly sizes (base pairs in contigs) of >10% more or less than the median size were rejected. We also tested performance using SPAdes with an additional flag, “-only-assembler,” to disable the SPAdes internal read correction procedure. As an additional comparison, reads were also de novo assembled using SKESA (version 2.3) (17) with default settings. Reads (without stringent quality trimming) were also mapped to the 630 reference genome as described previously (1, 11, 18), using stampy (19) for mapping and mpileup (20) for variant calling, followed by quality filtering of variants. Variant calls were required to have a quality score of ≥30, be homozygous under a diploid model, be supported by ≥5 high quality reads (including ≥1 read in each direction and a consensus of ≥90% of bases), and not be within a repetitive region of the genome. See https://github.com/oxfordmmm/CompassCompact for example implementation. For inclusion, ≥70% of the reference genome needed to be called in the consensus sequence. Bases in the consensus sequence not passing quality filtering were denoted N rather than A, C, G, or T. The bioinformatic pipelines used in this study for assembly and hash-cgMLST were written as Nextflow workflows (21) and can be found at https://github.com/davideyre/hash-cgmlst. Information on required dependencies and system requirements is provided in the repository readme file.

Analysis.

Sequences meeting all quality thresholds (high-quality average coverage, assembly size, and proportion of reference genome called) were compared. For replicate sequences, when an isolate had been sequenced more than twice, a random sequence was chosen as the baseline sequence to which all other sequences from the same isolate were compared in order to avoid multiple counting. Pairwise observed SNP differences between replicates and recombination-corrected SNP differences between other C. difficile genomes were obtained using Python scripts, PhyML (22), and ClonalFrameML (23), as previously described (11) (https://github.com/davideyre/runListCompare). Whole-genome alignments were used as input for PhyML. Invariant sites, i.e., those called as the same base as the reference or as an unknown base (N) across all genomes were set to be the same base as the reference for computational efficiency, given that there was no evidence of variation at these sites. All other sites had evidence of variation in at least one genome and were included unchanged, including any genomes with an N at that site. The maximum likelihood approach taken accounts for the uncertainty in the phylogeny arising from some genomes having an N called at some variable sites. The number of cgMLST locus differences and number of loci compared were obtained using Python (https://github.com/davideyre/hash-cgmlst). Where no BLAST match was found for a gene in either (or both) of the genomes in a pairwise comparison, this was not counted toward the total number of cgMLST gene differences.

Data availability.

Sequence Read Archive (SRA) accession numbers for analyzed replicate genomes are provided in Table S1, with explanatory notes in the accompanying legend. Data for the 973 genomes from six English hospitals can be found under NCBI BioProject accession number PRJNA369188.

RESULTS

Hash-cgMLST provided the same results as standard cgMLST with a minimal performance penalty. Results are presented throughout using pairwise core-gene differences generated with hash-cgMLST, as these were identical to standard cgMLST gene differences if novel alleles were accounted for.

Comparison of hash-cgMLST and SNP typing performance in replicate sequences.

A total of 374 sequences from 104 isolates passed all quality checks and were available for comparison to investigate the reproducibility of sequencing followed by cgMLST for C. difficile transmission analyses. A median of 2 (interquartile range [IQR], 2 to 3; range, 2 to 27) sequences were available per isolate. Comparing replicate sequences with a randomly selected baseline sequence for each isolate yielded 272 comparisons for analysis. With perfect sequencing, no variants would be expected between pairs of sequences from the same isolate (replicate pairs). Using reference-based mapping and variant calling, there were 0 SNPs between 262 (96%) replicate pairs, 1 SNP between 5 (2%) pairs, and 2 SNPs between 1 (<1%) pair, i.e., a mean of 0.026 SNPs per pair, which equates to 1 false SNP call per 39 sequences (Fig. 1A). Based on the rate of C. difficile evolution and the extent of within-host genetic diversity, ≤2 SNPs are expected between >95% of cases related by recent transmission (1); it is therefore unlikely that transmission would be falsely excluded on the basis of the error rates seen.

FIG 1

Observed differences using SNP typing (panel A) and hash-cgMLST based on SPAdes (panel B) and SKESA (panel C) assemblies in 272 replicate sequence pairs. With perfect sequencing, no variants would be expected between pairs of sequences from the same isolate. Pairs of sequences known to have been obtained from the same pool of DNA are shown in dark blue. Where information was unavailable on whether the same pool of DNA was used or a fresh DNA extract was made from the same isolate, this is shown in light blue. Using either hash-cgMLST or standard cgMLST following assembly using SPAdes, 218 (80%) replicate pairs had zero gene differences, 31 (11%) pairs had 1 difference, 5 (2%) pairs had 2 differences, and 18 (7%) pairs had >2 differences, with a mean of 0.64 false gene differences per genome (Fig. 1B) (test for symmetry considering 0, 1, 2, and >2 SNPs or gene differences; P = 0.004). Applying a threshold of >2 gene differences to rule out transmission (by analogy with SNP-based metrics [1, 6]), the observed error rate would result in 6.6% (95% binomial confidence interval [CI], 4.0 to 10.3%) of transmission pairs being falsely excluded. Restriction to the subset of sequences for which sequencing was known to have been undertaken from the same pool of extracted DNA produced fewer gene differences (Fig. 1). Of 190 pairs, 189 (>99%) had 0 SNPs, and 1 (<1%) pair had 1 SNP. Based on cgMLST, 167 (88%) pairs had 0 gene differences, 19 (10%) had 1 difference, 4 (2%) had 2 differences, and none had >2 differences.

Predictors of false cgMLST gene differences.

The observation of greater differences between replicates when restricted to variation in the 2,270 core genes versus considering SNPs across the whole genome is potentially counterintuitive. However, it should be remembered that the whole-genome SNP approach depends on a different bioinformatic approach with sophisticated per-variant quality filtering, whereas cgMLST is based on de novo assembly with more limited quality filtering. We therefore investigated potential predictors of false cgMLST gene differences using the hash-cgMLST algorithm (potential predictors were identical to those in the standard cgMLST approach) to see if filtering could be improved. Although we had already restricted our analysis to only include sequences with a mean genome coverage of >50×, we investigated whether a more stringent threshold would improve performance (Fig. 2). There was no evidence that increased coverage was associated with fewer cgMLST gene differences (Spearman’s rho = −0.04; P = 0.43). There were only 2 sequences in the data set with 50-bp reads; the remainder had 100- or 150-bp reads. In total, 14/222 (6%) sequence pairs in which the minimum sequence length was 100 bp contained >2 gene differences, compared to 4/48 (8%) in pairs with both 150-bp reads (exact P = 0.54).

FIG 2

Relationship between hash-cgMLST gene differences in replicate sequence pairs and average genome coverage and read length. Jitter applied to points to assist visualization. SPAdes with the “-careful” flag was used to generate assemblies. The relationship between cgMLST gene differences and de novo assembly quality metrics is shown in Fig. 3A to C. Given the filtering applied, there was still an association between the number of false gene differences and the maximum absolute percentage deviation from the overall median assembly size (4,165,590 bp) within each replicate pair (which was constrained to be ≤10% for inclusion in the analysis) (Spearman’s rho = 0.21; P < 0.001; Fig. 3A, with both small and large assemblies contributing to this effect). The L50 value describes the minimum number of contigs required to achieve 50% of the assembly size, with higher values representing more fragmented lower quality assemblies. Higher L50 values were associated with greater rates of false gene differences (Spearman’s rho = 0.37; P < 0.001). A total of 9 (2%) of 257 pairs with both L50 values of ≤125 had >2 false gene differences, compared to 9/15 (60%) with one or more sequences with an L50 value of >125 (Fig. 3B). Another measure of assembly fragmentation is the total number of contigs; higher numbers of contigs were also associated with greater false gene differences (Spearman’s rho = 0.31; P < 0.001; Fig. 3C).

FIG 3

Relationship between hash-cgMLST gene differences in replicate sequence pairs and de novo assembly quality metrics (A to C) and Kraken2 read classification (D). Jitter applied to points to assist visualization. One point is omitted from Fig. 3D for ease of visualization with the proportion of reads classified as C. difficile (0.64) and 0 gene differences. SPAdes with the “-careful” flag was used to generate assemblies. Figure 3D shows the impact of the proportion of reads classified as C. difficile by Kraken2 on cgMLST gene differences. Within the data set, there was no evidence of significant contamination with a bacterial species other than C. difficile, and the most common species in all samples was C. difficile. However, the proportion of reads that could not be classified at all varied from 0 to 11% between sequences, with the exception of one replicate pair (36% and 24%). Higher rates of unclassified sequences were associated with higher false gene differences but without any clear separation of the data on this basis (Spearman’s rho = −0.23; P < 0.001).

Distribution of cgMLST gene differences in replicate sequences.

The gene differences observed between replicate sequences disproportionately affected a small number of genes (Table S2). Only 82 (4%) of 2,270 genes contained differences within the replicate sequences. To avoid multiple counting, we evaluated the number of isolates that contained at least a pair of replicates with gene differences. A total of 16 genes contained differences in two or more isolates’ replicates and, of these, 15 were due to the same nucleotide differing in all replicate pairs. The reproducible location of the differences observed for a given gene across different isolates is compatible with consistent misassembly (Table S2). If the 15 genes with identical gene differences affecting ≥2 isolates were excluded, the number out of the 272 replicate pairs with 0 gene differences increased from 218 (80%) to 236 (87%), and the number of pairs with >2 gene differences reduced from 18 (7%) to 14 (5%). (Fig. S1B). Using the full 2,270-gene set and disabling SPAdes internal read correction resulted in fewer false gene differences, namely, 0 differences in 236 (87%) pairs and >2 differences in 14 (5%) (Fig. S1C).

Alternative assembler (SKESA).

Use of SKESA in place of SPAdes as the assembler used for hash-cgMLST resulted in fewer differences between replicate pairs (Fig. 1C), namely, 241 (89%) pairs had 0 differences, 22 (8%) pairs had 1 difference, 6 (2%) pairs had 2 differences, and 3 (1%) pairs had 3 differences. This equates to 0.16 false gene differences per replicate pair sequenced. The median number of genes compared between replicate pairs was 2,225 (IQR, 2,187 to 2,235) using SKESA and 2,227 (IQR, 2,205 to 2,242) using SPAdes out of a possible maximum of 2,270 genes.

Benchmarking.

Samples were processed in parallel, with each sample using a single core from an Intel Xeon Gold 6150 2.70-GHz 18-core central processing unit (CPU). For a single sample, the median (IQR) time to undertake quality control and read filtering was 3.6 (2.7 to 4.9) minutes and 27.4 (19.6 to 35.4) minutes, respectively, to generate an assembly using Spades with read error correction and 16.3 (12.1 to 21.5) minutes without; SKESA took 19.4 (15.5 to 24.3) minutes. Creating a hash-cgMLST profile from the assemblies took 44.1 (43.5 to 44.9) seconds. After making hash-cgMLST profile files, comparing a single genome to 100,000 others using a single CPU core took 40.4 s. In contrast, 100,000 comparisons using a standard cgMLST approach took marginally less time—38.7 s—after loading the profiles into memory. cgMLST profiles can also be rapidly compared using a laptop or desktop; for example, using one core of an Intel i7 2.6-Ghz laptop processor, comparing the 973 samples from the six hospitals study required 467 Mb of memory and took 236 s for 472,879 comparisons, i.e., 49.9 s per 100,000 comparisons. Using the same laptop, creating hash-cgMLST profiles from existing assemblies typically took ∼40 s and required <100 Mb of memory.

Comparison of hash-cgMLST and SNP typing in data from six English hospitals.

We analyzed 973 genomes from a previous study of C. difficile transmission in six English hospitals (11). Of these, 56 failed the assembly size threshold and 20 failed the coverage threshold (one of these also failed the assembly threshold), leaving 898 (92%) genomes for analysis. We considered all pairs of genomes within ≤2 SNPs and used SPAdes (with the -only-assembler flag) or SKESA assemblies to test the extent to which the numbers of hash-cgMLST gene differences followed the number of SNPs (Fig. 4A and C). Of 412 pairs of sequences differing by ≤2 SNPs, according to analysis using SPAdes assemblies, 376 (91%) were within ≤2 gene differences, 30 (7%) had 3 differences, and 16 (4%) had ≥4 differences; according to analysis using SKESA assemblies, 406 (99%) had ≤2 gene differences, and the remainder all had ≤5 differences. The median number of genes called in each pair was 2,143 (IQR, 2,084 to 2,191) using SPAdes and 2,003 (IQR, 1,891 to 2,110) using SKESA.

FIG 4

Relationship between hash-cgMLST gene differences and SNPS in C. difficile genomes from consecutive infections in six English hospitals. (A) Distribution of hash-cgMLST gene differences between pairs of genomes within ≤2 SNPs. (B) Distribution of SNPs within pairs of genomes within ≤2 gene differences. Panels A and B were generated using SPAdes assemblies with the “-careful -only-assembler” flags. (C and D) The same analysis using the SKESA assembler. To achieve ≥99% sensitivity for identifying genomes within ≤2 SNPs required a threshold of ≤9 gene differences using SPAdes and ≤3 gene differences using SKESA, with associated positive predictive values (PPVs) of 11% (410/3,720) and 38% (410/1,092), respectively. Specificity was >99% with both assemblers (399,031/402,341 and 401,659/402,341, respectively). We also considered the distribution of recombination-corrected SNPs within pairs of genomes with ≤2 gene differences using hash-cgMLST. Following assembly with SPAdes, of 590 pairs of genomes, 376 (64%) were within ≤2 SNPs, with the maximum number of SNPs observed being 20 (Fig. 4B). Using SKESA analysis of 749 genome pairs, 406 (54%) were within ≤2 SNPs (Fig. 4D).

DISCUSSION

Here, we present the concept of hash-cgMLST as a tool for rapid comparison of bacterial sequencing data. This is a significant development from standard cgMLST approaches, as it removes the need for a central database of alleles. Such databases require resource-intensive curation to ensure they are maintained to a high standard. Additionally, allele numbering is currently done consecutively in a single location, which is problematic with large data sets that span many laboratories; hashes also overcome this limitation. We also provide the code to run the algorithms developed. This article also highlights important limitations of common implementations of cgMLST as a tool for high-resolution outbreak detection. Stringent filtering done on the basis of mapped data allows the number of false variant calls to be controlled; here, we obtained around 1 false SNP for every 39 genomes sequenced. In contrast, fine-grained per-base quality control is typically not implemented in studies using de novo assembly tools. Using SPAdes, we observed a mean of 0.64 false gene differences per replicate genome pair. The alternative assembler tested, SKESA, was able to better control false gene differences, with 0.16 per replicate pair, i.e., 1 error per every 6.3 genomes sequenced. The higher rates of false variation observed using cgMLST/hash-cgMLST led to the counterintuitive observation in some samples of more differences when comparing 2,270 genes than when comparing the whole genome. It should be noted that undertaking SNP-based analyses from alignments of de novo assemblies without further filtering of variants would be similarly affected. These errors can be reduced by ensuring that the assemblies studied are of high quality. Our data suggest that the previously described read quality trimming and filtering based on assembly sizes (6, 10) could be further improved by also only analyzing samples with an L50 value of less than ∼125. However, this stringent filtering would have resulted in 30% of the previously published data set studied being unavailable for analysis, calling its practicability into question. Although our approach does not depend on a database of alleles, it is dependent on the development of a high-quality cgMLST scheme, i.e., appropriate identification of core genes based on a large and diverse collection of genomes and careful selection of problematic genes for exclusion. Despite such an approach being taken in developing the C. difficile cgMLST scheme used, we show that removing a small number of genes from this cgMLST scheme would likely improve performance if using SPAdes assemblies, as a small subset of genes contained higher numbers of false gene differences (Table S2, Fig. S1). This highlights the importance of assessing the performance of each cgMLST scheme created on a per-species and per-scheme basis using appropriate test data sets that include replicate and closely related sequences. Many of the apparent errors seen in replicate pairs appear to arise from misassembly. SPAdes-based read correction did not improve accuracy and instead resulted in more, rather than fewer, differences between replicate pairs. Use of an alternative assembler, SKESA (17), reduced the number of replicate pairs with >2 differences to just 1%, with a minimal reduction in the number of genes compared between replicate pairs (median of 2,225, compared to 2,227 with SPAdes). The reduction in genes compared was greater in the clinical data set analyzed (medians of 2,143 and 2,003), but this reduced discriminatory power for transmission studies will usually be more than offset by reduced error rates (and therefore reductions in erroneous exclusion of transmission). Our data also highlight that extrapolating the ≤2-SNP threshold for identifying genetically plausible transmission events to two (or three [6]) gene differences may be inappropriate, depending on the choice of assembler and settings. Using SPAdes, 4% of pairs of samples within ≤2 SNPs were >3 genes different by cgMLST, whereas with SKESA this was only 1%. For public health applications optimized to identify potential transmission, to be ≥99% sure of not missing pairs of sequences within ≤2 SNPs, a threshold of ≤9 gene differences was needed for SPAdes assemblies and ≤3 differences with SKESA. However, these thresholds for SPAdes resulted in around 8 genome pairs that were >2 recombination-corrected SNPs apart being identified for every 1 pair within ≤2 SNPs (PPV, 11%) and 1.6 pairs that were >2 SNPs apart for every pair within ≤2 SNPs using SKESA (PPV, 38%). In this scenario, further SNP-based analysis based on mapping and filtered variant calling is likely to be required to determine which genomes are potentially related by recent transmission and which are not. In other cases, larger numbers of SNPs than gene differences were observed (Fig. 4B and D), which may arise from SNPs outside core genes, SNPs in uncalled genes, and imperfect correction of recombination events. Hash-cgMLST allowed rapid comparison of many thousands of bacterial genomes within seconds, using a relatively unoptimized Python script running on a single laptop or server CPU core. As comparisons with other genomes can be easily divided into independent parts, this task is readily parallelizable. Using hash-cgMLST, it is therefore potentially possible to compare each new sequence generated with millions of previous sequences. The summaries of each genome produced a roughly 130-kb JSON file, which is readily exchangeable between laboratories and could potentially be hosted alongside raw reads in sequence read archives. As such, each laboratory could maintain its own database of hash-cgMLST profiles and distances, as well as this information potentially being usefully provided as part of future web-based services based on publicly available data. Although, without further refinements, hash-cgMLST may not allow high-precision fine-scaled transmission studies, it has the potential to dramatically reduce the search space for closely related genomes, which can then be followed by more precise SNP-based analyses on a much smaller subset of genomes. Using SPAdes, we observed a higher rate of “false” gene differences between genomes in which the sequences were potentially generated from separate DNA extractions of the same isolates compared to that in genomes obtained from the same DNA extraction. It is therefore plausible that the differences observed represent true differences but also a form of variation that is much faster and more erratic than mutation/recombination rates based on filtered SNPs. The erratic nature of the variation observed is unlikely to be informative about recent transmission. We also did not see these differences to the same extent using an alternative assembler, SKESA. This study is potentially limited by not being an exhaustive investigation of all the potential options for assembly and for filtering de novo assembly data; in particular, further filtering of variants based on mapping reads back to assemblies, e.g., as done by Enterobase (24), may improve precision. Although we used Kraken2 to search for contamination with DNA from other species, contamination with C. difficile DNA from other concurrently processed samples may be an important contributor to some of the differences seen with hash-cgMLST, whereas resulting mixed calls can be filtered using mapped data. In conclusion, appropriately quality controlled cgMLST can identify clusters of related genomes rapidly and is an appropriate tool for surveillance and reducing the search space in outbreaks. The SKESA assembler, compared to SPAdes, was associated with lower rates of gene differences between replicate sequences and, when used for hash-cgMLST, more closely matched the number of SNPs between closely related samples. The approach we describe has potential to be deployed across a range of pathogens, including those where linkage across time and wide geographic space, i.e., cases involving very large sequencing data sets, may help resolve sources and routes of transmission, such as for foodborne infections. Refined variant calling based on mapping is likely required to precisely define close genetic relationships. This study highlights the need for detailed quality assurance to determine the performance of algorithms used for comparing genomes. Our hash-cgMLST implementation is freely available and provides an effective database-free approach to cgMLST.

22 in total

1. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing.

Authors: Anton Bankevich; Sergey Nurk; Dmitry Antipov; Alexey A Gurevich; Mikhail Dvorkin; Alexander S Kulikov; Valery M Lesin; Sergey I Nikolenko; Son Pham; Andrey D Prjibelski; Alexey V Pyshkin; Alexander V Sirotkin; Nikolay Vyahhi; Glenn Tesler; Max A Alekseyev; Pavel A Pevzner
Journal: J Comput Biol Date: 2012-04-16 Impact factor: 1.479

2. Core Genome Multilocus Sequence Typing Scheme for High- Resolution Typing of Enterococcus faecium.

Authors: Mark de Been; Mette Pinholt; Janetta Top; Stefan Bletz; Alexander Mellmann; Willem van Schaik; Ellen Brouwer; Malbert Rogers; Yvette Kraat; Marc Bonten; Jukka Corander; Henrik Westh; Dag Harmsen; Rob J L Willems
Journal: J Clin Microbiol Date: 2015-12 Impact factor: 5.948

3. The multidrug-resistant human pathogen Clostridium difficile has a highly mobile, mosaic genome.

Authors: Mohammed Sebaihia; Brendan W Wren; Peter Mullany; Neil F Fairweather; Nigel Minton; Richard Stabler; Nicholas R Thomson; Adam P Roberts; Ana M Cerdeño-Tárraga; Hongmei Wang; Matthew T G Holden; Anne Wright; Carol Churcher; Michael A Quail; Stephen Baker; Nathalie Bason; Karen Brooks; Tracey Chillingworth; Ann Cronin; Paul Davis; Linda Dowd; Audrey Fraser; Theresa Feltwell; Zahra Hance; Simon Holroyd; Kay Jagels; Sharon Moule; Karen Mungall; Claire Price; Ester Rabbinowitsch; Sarah Sharp; Mark Simmonds; Kim Stevens; Louise Unwin; Sally Whithead; Bruno Dupuy; Gordon Dougan; Bart Barrell; Julian Parkhill
Journal: Nat Genet Date: 2006-06-25 Impact factor: 38.330

Review 4. MLST revisited: the gene-by-gene approach to bacterial genomics.

Authors: Martin C J Maiden; Melissa J Jansen van Rensburg; James E Bray; Sarah G Earle; Suzanne A Ford; Keith A Jolley; Noel D McCarthy
Journal: Nat Rev Microbiol Date: 2013-09-02 Impact factor: 60.633

5. ClonalFrameML: efficient inference of recombination in whole bacterial genomes.

Authors: Xavier Didelot; Daniel J Wilson
Journal: PLoS Comput Biol Date: 2015-02-12 Impact factor: 4.475

6. Mash: fast genome and metagenome distance estimation using MinHash.

Authors: Brian D Ondov; Todd J Treangen; Páll Melsted; Adam B Mallonee; Nicholas H Bergman; Sergey Koren; Adam M Phillippy
Journal: Genome Biol Date: 2016-06-20 Impact factor: 13.583

7. Core Genome Multilocus Sequence Typing Scheme for Stable, Comparative Analyses of Campylobacter jejuni and C. coli Human Disease Isolates.

Authors: Alison J Cody; James E Bray; Keith A Jolley; Noel D McCarthy; Martin C J Maiden
Journal: J Clin Microbiol Date: 2017-04-26 Impact factor: 5.948

8. Comparison of Control of Clostridium difficile Infection in Six English Hospitals Using Whole-Genome Sequencing.

Authors: David W Eyre; Warren N Fawley; Anu Rajgopal; Christopher Settle; Kalani Mortimer; Simon D Goldenberg; Susan Dawson; Derrick W Crook; Tim E A Peto; A Sarah Walker; Mark H Wilcox
Journal: Clin Infect Dis Date: 2017-08-01 Impact factor: 9.079

9. Kraken: ultrafast metagenomic sequence classification using exact alignments.

Authors: Derrick E Wood; Steven L Salzberg
Journal: Genome Biol Date: 2014-03-03 Impact factor: 13.583

10. SKESA: strategic k-mer extension for scrupulous assemblies.

Authors: Alexandre Souvorov; Richa Agarwala; David J Lipman
Journal: Genome Biol Date: 2018-10-04 Impact factor: 13.583

10 in total

1. Establishment and Evaluation of a Core Genome Multilocus Sequence Typing Scheme for Whole-Genome Sequence-Based Typing of Pseudomonas aeruginosa.

Authors: Hauke Tönnies; Karola Prior; Dag Harmsen; Alexander Mellmann
Journal: J Clin Microbiol Date: 2021-02-18 Impact factor: 5.948

2. Comparison of Whole-Genome Sequence-Based Methods and PCR Ribotyping for Subtyping of Clostridioides difficile.

Authors: A Baktash; J Corver; C Harmanus; W K Smits; W Fawley; M H Wilcox; N Kumar; D W Eyre; A Indra; A Mellmann; E J Kuijper
Journal: J Clin Microbiol Date: 2021-12-15 Impact factor: 11.677

3. A publicly accessible database for Clostridioides difficile genome sequences supports tracing of transmission chains and epidemics.

Authors: Martinique Frentrup; Zhemin Zhou; Matthias Steglich; Jan P Meier-Kolthoff; Markus Göker; Thomas Riedel; Boyke Bunk; Cathrin Spröer; Jörg Overmann; Marion Blaschitz; Alexander Indra; Lutz von Müller; Thomas A Kohl; Stefan Niemann; Christian Seyboldt; Frank Klawonn; Nitin Kumar; Trevor D Lawley; Sergio García-Fernández; Rafael Cantón; Rosa Del Campo; Ortrud Zimmermann; Uwe Groß; Mark Achtman; Ulrich Nübel
Journal: Microb Genom Date: 2020-07-29

4. Whole genome sequencing of Clostridioides difficile PCR ribotype 046 suggests transmission between pigs and humans.

Authors: Anders Werner; Paula Mölling; Anna Fagerström; Fredrik Dyrkell; Dimitrios Arnellos; Karin Johansson; Martin Sundqvist; Torbjörn Norén
Journal: PLoS One Date: 2020-12-21 Impact factor: 3.240

5. Establishment of a Publicly Available Core Genome Multilocus Sequence Typing Scheme for Clostridium perfringens.

Authors: Mostafa Y Abdel-Glil; Prasad Thomas; Jörg Linde; Keith A Jolley; Dag Harmsen; Lothar H Wieler; Heinrich Neubauer; Christian Seyboldt
Journal: Microbiol Spectr Date: 2021-10-27

Review 6. Infection prevention and control insights from a decade of pathogen whole-genome sequencing.

Authors: D W Eyre
Journal: J Hosp Infect Date: 2022-02-12 Impact factor: 8.944

7. K-mer based prediction of Clostridioides difficile relatedness and ribotypes.

Authors: Matthew Phillip Moore; Mark H Wilcox; A Sarah Walker; David W Eyre
Journal: Microb Genom Date: 2022-04

8. Delineating Mycobacterium abscessus population structure and transmission employing high-resolution core genome multilocus sequence typing.

Authors: Stefan Niemann; Florian P Maurer; Margo Diricks; Matthias Merker; Nils Wetzstein; Thomas A Kohl
Journal: Nat Commun Date: 2022-08-23 Impact factor: 17.694

9. Decentralized Investigation of Bacterial Outbreaks Based on Hashed cgMLST.

Authors: Carlus Deneke; Laura Uelze; Holger Brendebach; Simon H Tausch; Burkhard Malorny
Journal: Front Microbiol Date: 2021-05-28 Impact factor: 5.640

10. Transition From PCR-Ribotyping to Whole Genome Sequencing Based Typing of Clostridioides difficile.

Authors: Helena M B Seth-Smith; Michael Biggel; Tim Roloff; Vladimira Hinic; Thomas Bodmer; Martin Risch; Carlo Casanova; Andreas Widmer; Rami Sommerstein; Jonas Marschall; Sarah Tschudin-Sutter; Adrian Egli
Journal: Front Cell Infect Microbiol Date: 2021-06-01 Impact factor: 5.293

10 in total