Matthew Phillip Moore, Mark H Wilcox, A Sarah Walker, David W Eyre.
Abstract
Comparative analysis of Clostridioides difficile whole-genome sequencing (WGS) data enables fine-scale investigation of transmission and is increasingly becoming part of routine surveillance. However, these analyses are constrained by the computational requirements of the large volumes of data involved. By decomposing WGS reads or assemblies into k-mers and using the dimensionality reduction technique MinHash, it is possible to rapidly approximate genomic distances without alignment. Here we assessed the performance of MinHash, as implemented by sourmash, in predicting single nucleotide differences between genomes (SNPs) and C. difficile ribotypes (RTs). For a set of 1905 diverse C. difficile genomes (differing by 0-168 519 SNPs), using sourmash to screen for closely related genomes at a sensitivity of 100 % for pairs ≤10 SNPs reduced the number of pairs from 1 813 560 overall to 161 934, i.e. by 91 %, with a positive predictive value (PPV) of 32 % for correctly identifying pairs ≤10 SNPs (maximum SNP distance 4144). At a sensitivity of 95 %, pairs were reduced by 94 % to 108 266 and PPV increased to 45 % (maximum SNP distance 1009). Increasing the MinHash sketch size above 2000 produced minimal performance improvement. We also explored a MinHash similarity-based ribotype prediction method. Genomes with known ribotypes (n=3937) were randomly split into a training set (2937) and a test set (1000). The training set was used to construct a sourmash index against which genomes from the test set were compared. If the closest five genomes in the index had the same ribotype, this was taken to predict the searched genome's ribotype. Using our MinHash ribotype index, predicted ribotypes were correct in 780/1000 (78 %) genomes, incorrect in 20 (2 %), and indeterminate in 200 (20 %). Relaxing the classifier to 4/5 closest matches with the same RT improved the correct predictions to 87 %. Using MinHash it is possible to subsample C. difficile genome k-mer hashes and use them to approximate small genomic differences within minutes, significantly reducing the search space for further analysis.
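The k-mer decomposition and MinHash subsampling described in the abstract can be sketched in a few lines of Python. This is a minimal illustration of a bottom-s MinHash sketch, not sourmash's implementation: the hash function (`blake2b`), the k-mer length of 21, and the function names are all assumptions made for the example, and reverse-complement (canonical) k-mers are ignored for brevity. Only the sketch size of 2000 comes from the text above.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def kmer_hash(kmer):
    """Map a k-mer to a 64-bit integer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=21, size=2000):
    """Bottom-s sketch: the `size` smallest distinct k-mer hashes, sorted."""
    return sorted({kmer_hash(km) for km in kmers(seq, k)})[:size]


def jaccard_estimate(a, b, size=2000):
    """Estimate Jaccard similarity of two k-mer sets from their sketches."""
    # Take the `size` smallest hashes of the union of both sketches,
    # then count how many of them occur in both sketches.
    merged = sorted(set(a) | set(b))[:size]
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)


if __name__ == "__main__":
    # Two toy sequences that differ at a single position.
    ref = "ACGTTGCAAGCTTAGCCGATTACGGATCCAGT" * 4
    alt = ref[:40] + "A" + ref[41:]
    print(jaccard_estimate(sketch(ref, k=11), sketch(alt, k=11)))
```

Because only a fixed number of hashes per genome is retained, comparing two genomes costs the same regardless of genome size, which is what makes all-against-all screening feasible before a slower SNP-based analysis.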
Keywords: Clostridioides difficile; MASH; k-mer; ribotype
Year: 2022 PMID: 35384833 PMCID: PMC9453075 DOI: 10.1099/mgen.0.000804
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
Fig. 1. Relationship between non-redundant pairwise sourmash distances from assembled genomes and their corresponding core genome SNP distances, with measures of performance. Top left: scatterplot of sourmash distances vs SNP distances (n=1 813 560 comparisons). Top right: the subset of comparisons with sourmash distance ≥0.884, the threshold at which sensitivity for pairs ≤10 SNPs is 100 %. Bottom left: positive predictive value vs sensitivity for all sourmash distance thresholds predicting pairs that are ≤100 SNPs, ≤50 SNPs and ≤10 SNPs. Bottom right: the receiver operating characteristic curve, with area under the curve values rounded to two decimal places.
Performance of k-mer based genome comparisons identifying pairs with ≤10 SNPs. Contigs <1000 bp were removed from assembled genomes before k-mer hash signatures were generated. Results for k-mer hash signatures generated from sequencing reads are presented with and without removal of low abundance reads. Core gene k-mer hash signatures were generated from multi-fasta files such that overlapping regions did not generate k-mers.
| K-mer source | Sensitivity for including pairs ≤10 SNPs used to determine sourmash threshold | Sourmash distance threshold | Pairs reduced to (% reduction from 1 813 560) | Positive predictive value (true positives / (true positives + false positives)) | Clusters identified | Median (range) pairs per cluster | Largest SNP difference within cluster |
|---|---|---|---|---|---|---|---|
| | 100 % sensitivity for all ≤10 SNPs | 0.884 | 161 934 (↓91.1 %) | 32.1 % (52 020/161 934) | 49 | 28 (1-99 681) | 4144 |
| | 95 % sensitivity for all ≤10 SNPs | 0.973 | 108 266 (↓94.0 %) | 45.8 % (49 538/108 266) | 119 | 3 (1-88 708) | 1009 |
| | 100 % sensitivity for all ≤10 SNPs | 0.780 | 177 940 (↓90.2 %) | 29.2 % (52 020/177 940) | 38 | 102 (1-99 681) | 7705 |
| | 95 % sensitivity for all ≤10 SNPs | 0.950 | 125 761 (↓93.1 %) | 39.5 % (49 528/125 761) | 72 | 5 (1-96 520) | 1460 |
| | 100 % sensitivity for all ≤10 SNPs | 0.071 | 1 492 029 (↓17.7 %) | 3.5 % (52 020/1 492 029) | 1 | 1 492 029 | 105 373 |
| | 100 % sensitivity for all ≤10 SNPs | 0.978 | 157 114 (↓91.3 %) | 33.1 % (52 020/157 114) | 52 | 40.5 (1-99 681) | 3202 |
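The search-space reduction and positive predictive value reported for the assembled-genome screen at 100 % sensitivity can be reproduced from the counts given (1905 genomes, 161 934 pairs retained, 52 020 true positives). The following is a small sketch of that arithmetic; the function name `screening_summary` is hypothetical.

```python
from math import comb


def screening_summary(n_genomes, pairs_kept, true_positives):
    """Return (total pairs, fractional reduction, PPV) for a pairwise screen."""
    # Non-redundant pairwise comparisons: n choose 2.
    total_pairs = comb(n_genomes, 2)
    # Fraction of pairs that no longer need detailed comparison.
    reduction = 1 - pairs_kept / total_pairs
    # PPV = true positives / (true positives + false positives).
    ppv = true_positives / pairs_kept
    return total_pairs, reduction, ppv


total, reduction, ppv = screening_summary(1905, 161_934, 52_020)
print(total, f"{reduction:.1%}", f"{ppv:.1%}")  # 1813560 91.1% 32.1%
```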
Fig. 2. Sourmash and SNP distances within lineages (ribotypes) from assembled genomes, for ribotypes with ≥50 genomes. Left: scatterplot of the relationship between sourmash distance and SNPs within ribotype. Top right: receiver operating characteristic curve for all sourmash thresholds predicting pairs ≤10 SNPs within ribotype. Bottom right: positive predictive value vs sensitivity. Plots are consistently coloured by ribotype and area under the curve values are rounded to two decimal places.
Performance of k-mer based genome comparisons identifying pairs with ≤10 SNPs. All k-mer hash signatures were generated from assembled genomes per ribotype. Random positive predictive value is the PPV at a sourmash threshold of 0.000, i.e. the proportion of genome pairs with ≤10 SNPs in the dataset.
| RT | Median SNPs within lineage | SNP range | Sourmash threshold | Search space reduction | PPV | Random PPV | PPV/Random PPV |
|---|---|---|---|---|---|---|---|
| | 17 | 0–919 | 0.932 | 6 896/7 503 (↓8.1 %) | 25 .% | 29.4 % | 0.8 |
| | 103 | 0–5074 | 0.935 | 940/2 926 (↓67.9 %) | 8.1 % | 2.6 % | 3.1 |
| | 13 | 0–470 | 0.945 | 99 564/99 681 (↓0.11 %) | 40.0 % | 39.9 % | 1 |
| | 943 | 0–6242 | 0.924 | 464/1326 (↓65.0 %) | 2.4 % | 2.4 % | 1 |
| | 65 | 0–1756 | 0.920 | 5991/8001 (↓29.4 %) | 9.5 % | 6.7 % | 1.4 |
| | 44 | 0–1416 | 0.947 | 3422/3828 (↓10.6 %) | 8.2 % | 7.4 % | 1.1 |
| | 4428 | 0–5700 | 0.884 | 5061/13 366 (↓62.1 %) | 8.4 % | 3.2 % | 2.6 |
| | 1756 | 0–4391 | 0.939 | 1353/5671 (↓76.1 %) | 12.0 % | 2.9 % | 4.1 |
| | 9 | 0–1400 | 0.940 | 14 454/14 878 (↓2.8 %) | 55.0 % | 53.2 % | 1 |
| | 217.5 | 0–885 | 0.959 | 740/3570 (↓79.3 %) | 29.00 % | 6.0 % | 4.8 |
| | 24 576 | 0–168 519 | 0.884 | 161 934/1 813 560 (↓91.1 %) | 32.1 % | 2.9 % | 11.1 |
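The thresholds in these tables are chosen so that a target fraction of pairs with ≤10 SNPs is retained, and PPV is then compared against the "random" PPV obtained with no threshold at all. A minimal sketch of that selection follows, assuming (as in Fig. 1) that pairs at or above the threshold are kept. The function name and the toy data are hypothetical and are not taken from the study.

```python
import math


def threshold_for_sensitivity(pairs, target=1.0, snp_cutoff=10):
    """Pick the highest threshold that keeps >= `target` of the close pairs.

    `pairs` is a list of (sourmash_value, snp_distance) tuples.
    Returns (threshold, ppv, random_ppv).
    """
    # Sourmash values of the truly close pairs, most similar first.
    close = sorted((s for s, d in pairs if d <= snp_cutoff), reverse=True)
    if not close:
        raise ValueError("no pairs within the SNP cutoff")
    # Number of close pairs that must be retained to reach the target.
    needed = math.ceil(target * len(close))
    threshold = close[needed - 1]
    # Everything at or above the threshold passes the screen.
    kept = [(s, d) for s, d in pairs if s >= threshold]
    true_positives = sum(1 for _, d in kept if d <= snp_cutoff)
    ppv = true_positives / len(kept)
    # PPV with no screening: the prevalence of close pairs in the dataset.
    random_ppv = len(close) / len(pairs)
    return threshold, ppv, random_ppv


# Toy (sourmash value, SNP distance) pairs, for illustration only.
toy = [(0.99, 2), (0.98, 5), (0.97, 300), (0.95, 8), (0.90, 1200), (0.60, 20000)]
threshold, ppv, random_ppv = threshold_for_sensitivity(toy, target=1.0)
print(threshold, ppv, random_ppv, ppv / random_ppv)  # 0.95 0.75 0.5 1.5
```

A PPV/random PPV ratio near 1, as seen for several ribotypes in the table, means the threshold removes few or no unrelated pairs within that lineage.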
Performance of ribotype prediction for 1000 genomes. The five closest matches to a database of genome signatures (sourmash search) or full genomes (alignment) were taken, and the searched genome's ribotype was predicted based on 5/5 or 4/5 concordance. Results were reported as inconclusive when the matches lacked concordance, correct when the concordant ribotype matched that of the searched genome, and incorrect when it did not. A percent cutoff was further applied below which no individual matches had the same ribotype as the searched genome signature.
| Rule | Correctly determined | Incorrectly determined | Determined inconclusive |
|---|---|---|---|
| | 780 | 20 | 200 |
| | 872 | 50 | 78 |
| | 780 | 20 | 200 |
| | 872 | 45 | 83 |
| | 792 | 20 | 188 |
| | 863 | 31 | 106 |
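The concordance rule used for ribotype prediction (5/5 or 4/5 of the closest matches agreeing) can be expressed compactly. The sketch below assumes the five nearest matches have already been retrieved from the index. The function names and the ribotype labels in the usage lines are illustrative, and the additional percent cutoff mentioned in the caption is not modelled.

```python
from collections import Counter


def predict_ribotype(neighbour_ribotypes, min_agree=5):
    """Return the concordant ribotype, or None if the call is inconclusive.

    `neighbour_ribotypes` holds the ribotypes of the closest matches
    (five in the study); `min_agree` is 5 for the strict rule, 4 if relaxed.
    """
    if not neighbour_ribotypes:
        return None
    ribotype, count = Counter(neighbour_ribotypes).most_common(1)[0]
    return ribotype if count >= min_agree else None


def score(prediction, truth):
    """Classify one prediction as correct, incorrect, or inconclusive."""
    if prediction is None:
        return "inconclusive"
    return "correct" if prediction == truth else "incorrect"


# Illustrative labels only: four of five neighbours agree.
neighbours = ["027", "027", "027", "027", "078"]
print(score(predict_ribotype(neighbours, min_agree=5), "027"))  # inconclusive
print(score(predict_ribotype(neighbours, min_agree=4), "027"))  # correct
```

Lowering `min_agree` converts some inconclusive calls into determinate ones, which is consistent with the table: the relaxed rule raises correct calls but also raises incorrect ones.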