A Sarah Walker, Mark H Wilcox, David W Eyre, Tim E A Peto, Derrick W Crook.
Abstract
Pathogen whole-genome sequencing has huge potential as a tool to better understand infection transmission. However, rapidly identifying closely related genomes among a background of thousands of other genomes is challenging. Here, we describe a refinement to core genome multilocus sequence typing (cgMLST) in which alleles at each gene are reproducibly converted to a unique hash, or short string of letters (hash-cgMLST). This avoids the resource-intensive need for a single centralized database of sequentially numbered alleles. We test the reproducibility and discriminatory power of cgMLST/hash-cgMLST compared to those of mapping-based approaches in Clostridium difficile, using repeated sequencing of the same isolates (replicates) and data from consecutive infection isolates from six English hospitals. Hash-cgMLST provided the same results as standard cgMLST, with minimal performance penalty. Among 272 replicate sequence pairs compared using reference-based mapping, there were 0, 1, or 2 single-nucleotide polymorphisms (SNPs) between 262 (96%), 5 (2%), and 1 (<1%) of the pairs, respectively. Using hash-cgMLST, 218 (80%) of replicate pairs assembled with SPAdes had zero gene differences, and 31 (11%), 5 (2%), and 18 (7%) pairs had 1, 2, and >2 differences, respectively. False gene differences were clustered in specific genes and associated with fragmented assemblies, but were reduced using the SKESA assembler. Of 412 pairs of infections with ≤2 SNPs, i.e., consistent with recent transmission, 376 (91%) had ≤2 gene differences and 16 (4%) had ≥4. Comparing a genome to 100,000 others took <1 min using hash-cgMLST. Hash-cgMLST is an effective surveillance tool for rapidly identifying clusters of related genomes. However, cgMLST/hash-cgMLST generate more false variants than mapping-based approaches. Follow-up mapping-based analyses are likely required to precisely define close genetic relationships.
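The hashing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the hash function (SHA-256), the sequence normalization, the rule of comparing only genes present in both genomes, and the names `allele_hash`, `make_profile`, and `gene_differences` are all assumptions chosen for illustration.

```python
import hashlib
from typing import Dict


def allele_hash(sequence: str) -> str:
    """Convert an allele sequence to a reproducible hash string.

    Assumption: SHA-256 over the upper-cased, whitespace-stripped sequence.
    Any deterministic hash would serve the same purpose.
    """
    normalized = sequence.strip().upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def make_profile(genes: Dict[str, str]) -> Dict[str, str]:
    """Map each gene name to the hash of its allele sequence."""
    return {gene: allele_hash(seq) for gene, seq in genes.items()}


def gene_differences(profile_a: Dict[str, str], profile_b: Dict[str, str]) -> int:
    """Count genes whose allele hashes differ between two profiles.

    Assumption: genes missing from either profile are ignored.
    """
    shared = profile_a.keys() & profile_b.keys()
    return sum(1 for gene in shared if profile_a[gene] != profile_b[gene])


# Hypothetical toy genomes; real cgMLST schemes contain thousands of genes.
genome_a = make_profile({"geneA": "ATGC", "geneB": "ATGG", "geneC": "TTTT"})
genome_b = make_profile({"geneA": "atgc", "geneB": "ATGA", "geneD": "CCCC"})
print(gene_differences(genome_a, genome_b))
```

Because the hash depends only on the allele sequence, two laboratories that hash the same allele independently obtain the same identifier, so no shared database of sequentially numbered alleles is needed to compare profiles.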
Keywords: Clostridium difficile; core genome MLST; quality assurance; whole-genome sequencing
Year: 2019 PMID: 31666367 PMCID: PMC6935933 DOI: 10.1128/JCM.01037-19
Source DB: PubMed Journal: J Clin Microbiol ISSN: 0095-1137 Impact factor: 5.948
FIG 1 Observed differences using SNP typing (panel A) and hash-cgMLST based on SPAdes (panel B) and SKESA (panel C) assemblies in 272 replicate sequence pairs. With perfect sequencing, no variants would be expected between pairs of sequences from the same isolate. Pairs of sequences known to have been obtained from the same pool of DNA are shown in dark blue. Where information was unavailable on whether the same pool of DNA was used or a fresh DNA extract was made from the same isolate, this is shown in light blue.
FIG 2 Relationship between hash-cgMLST gene differences in replicate sequence pairs and average genome coverage and read length. Jitter was applied to points to assist visualization. SPAdes with the “--careful” flag was used to generate assemblies.
FIG 3 Relationship between hash-cgMLST gene differences in replicate sequence pairs and de novo assembly quality metrics (A to C) and Kraken2 read classification (D). Jitter was applied to points to assist visualization. One point, with a proportion of reads classified as C. difficile of 0.64 and 0 gene differences, is omitted from panel D for ease of visualization. SPAdes with the “--careful” flag was used to generate assemblies.
FIG 4 Relationship between hash-cgMLST gene differences and SNPs in C. difficile genomes from consecutive infections in six English hospitals. (A) Distribution of hash-cgMLST gene differences between pairs of genomes within ≤2 SNPs. (B) Distribution of SNPs within pairs of genomes within ≤2 gene differences. Panels A and B were generated using SPAdes assemblies with the “--careful --only-assembler” flags. (C and D) The same analysis using the SKESA assembler.