Denis Volk, Fan Yang-Turner, Xavier Didelot, Derrick W Crook, David Wyllie.
Abstract
There is a need to identify microbial sequences that may form part of transmission chains, or that may represent importations across national boundaries, amidst large numbers of SARS-CoV-2 and other bacterial or viral sequences. Reference-based compression is a sequence analysis technique that allows both compact storage of sequence data and comparisons between sequences. Published implementations of the approach are being challenged by the large sample collections now being generated. Our aim was to develop fast software that detects highly similar sequences in large collections of microbial genomes, including millions of SARS-CoV-2 genomes. To do so, we developed Catwalk, a tool that bypasses bottlenecks in the generation, comparison and in-memory storage of microbial genomes generated by reference mapping. It is a compiled solution, coded in Nim to increase performance. It can be accessed via command line, REST API or web server interfaces. We tested Catwalk using both SARS-CoV-2 and Mycobacterium tuberculosis genomes generated by prospective public-health sequencing programmes. Pairwise sequence comparisons, using clinically relevant similarity cut-offs, took about 0.39 and 0.66 μs, respectively; in 1 s, between 1 and 2 million sequences can be searched. Catwalk operates about 1700 times faster than, and uses about 8 % of the RAM of, a Python reference-based compression and comparison tool in current use for outbreak detection. Catwalk can rapidly identify close relatives of a SARS-CoV-2 or M. tuberculosis genome amidst millions of samples.
Keywords: bacterial genomics; microbial relatedness; outbreak detection
Year: 2022 PMID: 35771206 PMCID: PMC9455716 DOI: 10.1099/mgen.0.000850
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
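The reference-based compression described in the abstract can be illustrated with a minimal Python sketch. It is not Catwalk's Nim implementation: the names `Compressed`, `compress`, `snv_distance` and `neighbours` are hypothetical, and the sketch assumes that every sequence has already been mapped to the reference, so that all sequences have the same length and share coordinates. Each sample is stored only as the positions where it differs from the reference, plus the positions where the base is unknown; two samples are compared by counting positions where their stored bases disagree, ignoring positions unknown in either sample.

```python
from dataclasses import dataclass

BASES = "ACGT"


@dataclass
class Compressed:
    """One sample stored relative to the reference (illustrative layout)."""
    diffs: dict[str, set[int]]  # base -> positions where the sample has that base, not the reference base
    unknown: set[int]           # positions where the base could not be called (e.g. N)


def compress(reference: str, sequence: str) -> Compressed:
    """Keep only the positions that differ from the reference or are unknown."""
    if len(reference) != len(sequence):
        raise ValueError("sequence must be mapped to the reference (equal length)")
    diffs = {base: set() for base in BASES}
    unknown = set()
    for pos, (ref_base, base) in enumerate(zip(reference.upper(), sequence.upper())):
        if base not in BASES:
            unknown.add(pos)
        elif base != ref_base:
            diffs[base].add(pos)
    return Compressed(diffs, unknown)


def snv_distance(a: Compressed, b: Compressed) -> int:
    """Count positions where a and b carry different known bases."""
    differing = set()
    for base in BASES:
        # A position in exactly one of the two sets means the samples disagree there.
        differing |= a.diffs[base] ^ b.diffs[base]
    # Positions unknown in either sample are excluded from the comparison.
    return len(differing - a.unknown - b.unknown)


def neighbours(query: Compressed, collection: dict[str, Compressed], cutoff: int) -> list[str]:
    """Names of samples within `cutoff` SNVs of the query (linear scan)."""
    return [name for name, other in collection.items()
            if snv_distance(query, other) <= cutoff]


if __name__ == "__main__":
    reference = "ACGTACGT"  # toy reference, not a real genome
    s1 = compress(reference, "ACGTTCGT")
    s2 = compress(reference, "ANGTACGA")
    s3 = compress(reference, "ACGTTCNT")
    print(snv_distance(s1, s2))
    print(neighbours(s1, {"s2": s2, "s3": s3}, cutoff=1))
```

Because most samples differ from the reference at only a small fraction of positions, the stored sets are far smaller than the full genome, which is what makes both compact storage and fast comparison possible.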
Data sets used for benchmarking
| Data set | Description | Nucleotides excluded from comparisons | No. of sequences | Median distance from reference | Median no. of unknown bases |
|---|---|---|---|---|---|
| M. tuberculosis | Genomes sequenced by UKHSA, mapped and variation called against NC_000962.2 (4 411 532 nt) | 557 291 | 12 832 | 1119 | 40 369 |
| SARS-CoV-2 | Genomes generated by COG-UK consortium, mapped and variation called against MN908947.3 (29 903 nt) | 386 | 1 529 081 | 17 | 10.5 |
Comparison of dataset load speed and comparison speed for Catwalk versus findNeighbour2 seqComparer
| Implementation | Time to load data [s per sample] | Time to query neighbours with 20 SNV cut-off, per sample queried [median (25th, 75th centiles)] [μs] | Memory usage [MB per sequence] |
|---|---|---|---|
| findNeighbour2 | 0.31 | 1 700 (1500, 2100) | 1.78 |
| Catwalk | 0.0076 | 0.98 (0.91, 1.04) | 0.15 |
| Approximate performance difference | 40-fold speed-up | 1 750-fold speed-up | 12-fold lower memory usage |
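The approximate performance differences in the comparison above follow from dividing the findNeighbour2 values by the Catwalk values. The short Python check below uses only the figures reported in the table (medians for query time); the dictionary keys are illustrative names, not identifiers from either tool.

```python
# Values copied from the comparison table (query times are medians).
find_neighbour2 = {"load_s_per_sample": 0.31, "query_us": 1700.0, "memory_mb_per_seq": 1.78}
catwalk = {"load_s_per_sample": 0.0076, "query_us": 0.98, "memory_mb_per_seq": 0.15}

# Fold difference: how many times larger the findNeighbour2 value is.
fold = {key: find_neighbour2[key] / catwalk[key] for key in find_neighbour2}

# Catwalk memory use as a fraction of findNeighbour2 memory use.
memory_fraction = catwalk["memory_mb_per_seq"] / find_neighbour2["memory_mb_per_seq"]

if __name__ == "__main__":
    for key, value in fold.items():
        print(f"{key}: {value:.1f}-fold")
    print(f"memory fraction: {memory_fraction:.1%}")
```

The ratios come out close to the rounded values in the table, and the memory fraction agrees with the roughly 8 % of RAM quoted in the abstract.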
Fig. 1. Catwalk search performance. Time taken to search for neighbours of a single sample in the M. tuberculosis data set (n=12 832) (a, b) or the SARS-CoV-2 data set (n=1 529 081) (c, d) using different SNV cut-off values. Results are expressed as search time in μs ('micros') per sample in each data set to allow comparisons between data sets. Standard errors of the mean timing are less than 3 % of the plotted values in all cases, and are not shown. In (a) and (c), results are stratified by the number of unknown positions in the query sequence, in quantiles. In (b) and (d), results are stratified by the distance of the query sequence from the reference, in quantiles.