Literature DB >> 29132318

BugMat and FindNeighbour: command line and server applications for investigating bacterial relatedness.

Oriol Mazariegos-Canellas¹, Trien Do¹, Tim Peto¹, David W Eyre¹, Anthony Underwood², Derrick Crook¹, David H Wyllie^3,4.

Abstract

BACKGROUND: Large scale bacterial sequencing has made the determination of genetic relationships within large sequence collections of bacterial genomes derived from the same microbial species an increasingly common task. Solutions to the problem have application to public health (for example, in the detection of possible disease transmission), and as part of divide-and-conquer strategies selecting groups of similar isolates for computationally intensive methods of phylogenetic inference using (for example) maximal likelihood methods. However, the generation and maintenance of distance matrices is computationally intensive, and rapid methods of doing so are needed to allow translation of microbial genomics into public health actions.
RESULTS: We developed, tested and deployed three solutions. BugMat is a fast C++ application which generates one-off in-memory distance matrices. FindNeighbour and FindNeighbour2 are server-side applications which build, maintain, and persist either complete (for FindNeighbour) or sparse (for FindNeighbour2) distance matrices given a set of sequences. FindNeighbour and BugMat use a variation model to accelerate computation, while FindNeighbour2 uses reference-based compression. Performance metrics show scalability into tens of thousands of sequences, with options for scaling further.
CONCLUSION: Three applications, each with distinct strengths and weaknesses, are available for distance-matrix based analysis of large bacterial collections. Deployed as part of the Public Health England solution for M. tuberculosis genomic processing, they will have wide applicability.

Entities: Chemical Disease Species

Keywords: Bacterial genomes; Distance matrix; Phylogenetic analysis

Mesh：

Year: 2017 PMID： 29132318 PMCID： PMC5683244 DOI： 10.1186/s12859-017-1907-2

Source DB: PubMed Journal: BMC Bioinformatics ISSN： 1471-2105 Impact factor: 3.169

Background

Whole genomic sequencing of bacteria is becoming very common, generating large sequence collections [1]. There is a both a public health and scientific need for tools which allow an understanding of the evolutionary relationships between these isolates. Bacteria typically have genomes of a few megabases; for example, Mycobacterium tuberculosis has a genome size of about 4.4 × 106 bases [1]. Genome analysis has the potential to radically improve the detection of M. tuberculosis transmission [2] which is a national and global priority [3]. This is also true of multiple other organisms [4-7]. Therefore, there is a need to address the computational challenges associated with rapid detection of close relationships between sequences of the same bacterial species, including M. tuberculosis and other pathogenic bacterial species. Quantification of core-genome single nucleotide polymorphism (SNP) analysis is a common initial step in relatedness estimation [2]. This step involves comparison of sequencing reads mapped to each individual base, and identifying a consensus base at each position. Filters are applied designed to exclude variation of technical origin (e.g. due to sequencing error or mismapping), marking such bases as uncertain (‘N’), and noting high-confidence consensus base calls as one of A, C, T, or G [2]. In some cases, quantification of core-genome single nucleotide polymorphisms is sufficient to generate epidemiologically important information [2]; in others, the technique is successful at identifying similar samples (samples which are very recently diverged from each other), but evolutionary inference as to the history of deep branches may be incorrect due to missing data or to recombination, and methods using maximal likelihood (ML) with or without adjustment for recombination to obtain to adjust estimates of divergence are required, e.g. [7-9]. Such methods have high computational complexity making computation slow or unfeasible for very large sequence sets, limiting their use in near-real time estimation of relatedness [8, 9]. One solution to this problem, yielding information suitable for public health purposes, involves focusing computationally complex methods on much smaller sets of similar isolates, as identified by core genome SNP estimation. Here, we discuss requirements and strategies for identifying such small sets of similar isolates. In microbial genomic analyses, a large number of similar bacterial genomes undergo a common processing pathway independent of each other, frequently involving reference mapping in some form of high-performance computing environment. Specimens leave the environment individually, so if a computational service is provided to receive the output from such mapping, the service must support single specimen addition. It must additionally be capable of rapid sequence addition: adding individual results in an atomic, consistent, isolated and durable manner via a transaction of short duration. This is particularly important if the clients adding bacterial sequences to the relatedness service use synchronous connections to the relatedness monitoring service. Finally, the relatedness service must be rapidly queryable, so that computational processes consuming relatedness information can deliver closely related samples to interactive applications. Here, we present three applications we have developed to meet these requirements. BugMat is a fast, scalable C++ application for building large distance matrices. FindNeighbour is a server based tool for maintaining, persisting and searching complete matrices, built on top of BugMat. FindNeighbour2 uses reference-based compression [10] and a database to implement rapid sparse matrix maintenance. We illustrate the use of these tools with real data from large-scale M. tuberculosis, N. gonorrhoea and Salmonella enterica sequencing projects. We propose that our tools will have wide application in the growing field of genomic public health microbiology, and more generally for the analysis of large sequence sets.

Implementation

Study of mapped data

Our applications are designed to operate on mapped data, where at each base a single base call is present. Thus, the relationship between two samples x and y can be represented using the Hamming distance: where sites are the N sites in x and y. If we had precomputed variant_sites, where. variant_sites ϵ sites. and represents those sites which differ between x and y, then we would obtain the same distance by only considering variant_sites, rather than all sites, since the invariant sites do not contribute to the Hamming distance. We consider the problem as one requiring construction of matrix of pairwise distances D from a set of samples Xt containing t elements x1 .. xt, such that the element Di, j = h(xi, yi). We denote such a matrix as D(Xt): for t = 3, The diagonal is zero and the matrix is symmetrical, as h(x1, x 2) = = h(x2, x 1), so only half needs to be stored: To construct this ab initio, the asymptotic matrix construction time is . For large t, this (and memory requirements) approximates .

Iterative addition of samples

In the use case we are considering, we do not know the entire set of Xt samples when we start matrix construction, since samples are provided individually. Rather, we are required to iteratively add samples, progressively increasing matrix size in a series of t steps, where t is 1, 2, 3 and so on. Thence, to add a fourth sample to the matrix above, we are required to compute (t-1) matrix entries Addition of a single sample thus scales linearly with t, taking O(t-1) time. Nevertheless, a large number of sequence positions still need to be analysed.

Optimisation of pairwise hamming distance computation using a variation model

A potential optimisation concerns the computation of pairwise Hamming distances between samples. Bacterial genomes are typically about 5 × 106 nucleotides; the M. tuberculosis reference strain using for mapping in this work is 4.4 × 106 nucleotides. In common with many other bacteria, M. tuberculosis has a strong population structure, and comprises multiple ancient lineages which differ from each other by about 2000 ancient SNPs [11]. Thus, if we wish to compute difference matrices between M. tuberculosis genomes, the number of variant positions is expected to be much smaller than the genome size, perhaps a few thousand SNPs only. If we know that there is no variation at position i between the t sequences to be analysed, there is no need to perform pairwise comparisons between all t samples at this site, a process with ~O(t 2) complexity for large t. Rather, we can assign a Hamming distance of 0 for all invariant sites without pairwise evaluation. To allow this, we include code to maintain a vector of variant sites in the alignment, which we term the variation model. We maintain the variation model in a datastructure containing counts of all bases at each position. This model, which allows rapid assessment of whether sites are variant is updated on addition of each sample if and only if the position is not currently variant, since positions cannot become invariant as further samples are added. Pseudocode for this operation is in Additional file 1. The asymptotic time for addition of a single sample to the variation model is O(1), and allows large (~ 1000 ×) speedup in the subsequent expensive pairwise distance computation since most sites are invariant. Storage of the matrix allows addition of samples to a collection of t samples with a cost of O(t-1), rather than ~ O(t 2) for complete matrix recomputation.

Reference based compression

There is an alternative formulation of the problem. In this, the mapped sequence S is represented a series of sets SA, SC, SG, ST containing positions of non-reference A, C, G and T bases, and SN containing the positions of uncallable bases. This is an implementation of reference based compression [10], but importantly one in which the compression strategy aids in the computation of the Hamming distance d between two sequences R and S using set operations: Since Python has an interface to highly optimised C++ set operations, we elected to code the implementation in Python 3.

Presentation

We present two solutions using the matrix-based approach, and one using the set-based approach. BugMat is a C++ command line executable, tested on both Linux (using gcc) and Windows (using MinGW), which performs rapid one-off distance matrix construction using a two phase computation, firstly building an index of variant bases in the sequences, and subsequently performing multi-threaded computation of pairwise distances between every sample pair with shared memory to store all data. The second is FindNeighbour, a server application, built with the python web.py framework, and accessed using a XMLRPC API. Internally, it comprises two components. Firstly, an OpenMP parallelised C++ application derived from BugMat maintains an in-memory distance matrix derived from mapped genomic data. Secondly, a database allows storage and querying of arbitrary meta-data about the sequence. FindNeighbour uses this database internally to store quality information about the sequence, such as the number of bases called in the sequence (i.e. bases called as A, C, T, G), and timing information related to server activity. This is at present implemented using an on-disc relational database via an object relational mapper. Additionally, FindNeighbour performs automated matrix persistence to disk, and recovers stored matrices from disk on server restarts. Design of BugMat/Findneighbour software includes a series of technical optimisations supporting the implementation of the variation model described above. Firstly, we have chosen to code to core of our software in C++, which can be compiled using gcc with the O3 and funroll-loops optimisations flags. We have elected to minimise disc access overhead by reading sequences into shared memory, which imposes a memory cost but allows fast, dynamic read/write using the unordered_map data type, which is faster than the C++ standard library’s map type. Finally, we have parallelised pairwise comparisons using the openmp library, based on a fork-join paradigm. The third application is FindNeighbour2. This compresses the sequences supplied to it, and stores them both in RAM and on disc following reference-based compression. On addition of a sample, it is compared with all other samples and matches below a customisable threshold are written to a database. Otherwise, the implementation is similar to FindNeighbour. For both applications, identical API methods allow: addition of samples, comprising an identifier and a fasta sequence; testing whether an identifier is present; determining distance between a pair of samples; determining neighbours of one sample or all samples, given a sequence quality threshold. Table 1 summarises the similarities and differences between these technologies.

Table 1

Comparison of three solutions

Implementation	BugMat	FindNeighbour1	FindNeighbour2
Presentation	Command Line	Server application	Server application
Technology	In-memory matrix	In-memory matrix	In-memory reference based sequences
Can add samples	No	Yes	Yes
Role in production environment	Remove invariant sites before maximal likelihood tree generation	Store pairwise distances between samples	Store significant pairwise distances between samples
Stores all pairwise distances	Yes	Yes	No (customisable)
Uses database for sequence metadata storage	No	Yes	Yes
Uses database for pairwise distance storage	No	No	Yes
Requires reference sequence specified	No	No	Yes
Implementation	C++	Python, C++	Python

Comparison of three solutions. A comparison of the approaches taken by three solutions (BugMat, Findneighbour, Findneighbour2) is shown

Comparison of three solutions Comparison of three solutions. A comparison of the approaches taken by three solutions (BugMat, Findneighbour, Findneighbour2) is shown

Results and discussion

We initially tested the performance of both the BugMat and FindNeighbour tools using real data comprising mapped data from bacterial sequencing projects.

Matrix based methods

We tested the command-line BugMat executable using data set A (M. tuberculosis, Table 2). The initial variant model construction step is computationally inexpensive (Fig. 1), and allows subsequent fast pairwise distance matrix construction using only variant positions. With more than 100 sequences, time for matrix construction dominates runtime; with fewer samples, about half the runtime involves reading and writing file output (Fig. 1). With 100 samples, BugMat computed a distance matrix in fewer than 4 s using 16 cores. Four hundred samples took 70 s, and 4000 sequences took about 3 h seconds, with the latter requiring 20GB of RAM. This tool forms part of a production pipeline, and is used to strip out invariant sites prior to maximal likelihood tree generation for up to 500 related isolates.

Table 2

Data sets used for testing findNeighbour performance

Dataset	Sites called	Links less than SNP stored	Memory usage	Mean time to add one sample
A: M. tuberculosis, n = 15,985 mapped to NC_000962	329,714 sites excluded	20	23.5G	2.23 s
B: Neisseria gonorrhoea, n = 2455 mapped to NC_011035	All sites included	500	19.3G	2.95 s
C: Salmonella enterica, n = 5380 mapped to AM933172	51,897 sites excluded	20	7.4G	1.77 s

Data sets studied and FindNeighbour2 performance

The data sets studied, which can be downloaded at https://ora.ox.ac.uk/objects/uuid:82ce6500-fa71-496a-8ba5-ba822b6cbb50 are described. Also shown are performance characteristics of Findneighbour2 operating on them using the hardware in Additional file 2

Fig. 1

Performance of the BugMat application. The relationship between number of sequences processed, numbers of cores assigned to BugMat, and (a) memory usage and (b) various stages of sequence processing. Testing was performed a Ubuntu 14 VM instance with 16 cores, Intel Xeon E5-2680v2 processors at 2.8GHz, and 128GB RAM hosted within Genome England Ltd. Similar results were obtained using the specification in Additional file 2

Data sets used for testing findNeighbour performance Data sets studied and FindNeighbour2 performance The data sets studied, which can be downloaded at https://ora.ox.ac.uk/objects/uuid:82ce6500-fa71-496a-8ba5-ba822b6cbb50 are described. Also shown are performance characteristics of Findneighbour2 operating on them using the hardware in Additional file 2 Performance of the BugMat application. The relationship between number of sequences processed, numbers of cores assigned to BugMat, and (a) memory usage and (b) various stages of sequence processing. Testing was performed a Ubuntu 14 VM instance with 16 cores, Intel Xeon E5-2680v2 processors at 2.8GHz, and 128GB RAM hosted within Genome England Ltd. Similar results were obtained using the specification in Additional file 2 BugMat has various restrictions. Firstly, unlike sophisticated software capable of performing large-scale sequence alignment and distance matrix generation [12], it relies on pre-aligned sequences. It splits the workload across cores, but does not support distribution of work across multiple computational nodes [12]. Secondly, it is presently restricted to CPU-based architectures supporting OpenMP, including the emerging Intel Xeon Phi [13] architecture; however, similar sharding of tasks across multiple computational cores can be achieved using graphical processing units and CUDA. Thirdly, by design it builds matrices in memory. This imposes limits on matrix size which could be avoided introducing a disc-based storage capability, albeit at large performance cost. Finally, it builds matrices ab initio; it does not allow their updating, and has high rebuild times for large sample numbers. This last limitation is restrictive for public health applications in which ongoing comparison of isolates with a databank is required. FindNeighbour addresses this. Using the same test set as for BugMat, the time to add new samples to FindNeighbour is directly proportional to the number of previous samples present, as is expected from the internal software architecture (Fig. 2a). Server response times recovering neighbours within 20 single nucleotide variations of a sequence averaged 50 msec with 4000 samples stored. The application is very stable, and a production instance with over 21,500 samples on a 128GB RAM server (110GB RAM used) is operational as part of a Public Health England initiative to sequence all TB in England. Importantly, however, as with BugMat, memory requirements increase quadratically, and this limits further scalability.

Fig. 2

Performance of the FindNeighbour application. The time taken to add 4000 individual M. tuberculosis samples to a FindNeighbour server (Panel a) Testing was performed a Ubuntu 14 VM instance with 16 cores, Intel Xeon E5-2680v2 processors at 2.8GHz, and 128GB RAM hosted within Genome England Ltd. Samples which diverged from each other at various times in the past (depicted in Panel b) can either be added to one FindNeighbour instance, or multiple (63, see text) instances. The performance of these strategies is illustrated in Panel c One approach to this is to deploy multiple FindNeighbour instances as part of a map-reduce paradigm for the analysis of very large numbers of sequences; the map step depends on sequence features (Fig. 2b). Each of groups derived from the map operation can be processed by independent FindNeighbour instances, perhaps deployed on the same server. As an example, we have shown that a published method of subdividing M. tuberculosis based on ancestry [14] into 63 subgroups, can be readily accommodated with about 4× speed-up in the storage of 4000 samples and marked reductions in the amount of memory required (Fig. 2c). This map process will precede reduction, in which the responses of the relevant FindNeighbour instance are returned to the client. However, quadratic memory requirements will still limit scalability. A second approach, which is not necessarily mutually exclusive, is to move away from technologies which have quadratic memory requirements. FindNeighbour2 implements such, using reference based compression of mapped sequence data, producing linear memory requirements with rising sequence numbers. In Table 2, we summarise the performance of this technology on a series of test datasets, with various different clinically relevant cutoffs. Enhancements to FindNeighbour2 are planned. A prototype randomly assigns sequences to a series of FindNeighbour2 instances, each running with a single thread; one instance stores each new sequence. Organism-specific algorithms assigning an organism ‘to the right compartment’ for analysis are not attempted, as these require validation; rather, a new-against-all-existing comparison is performed, sharded across the instances. This has the advantage that true matches are guaranteed to be found, something which may be essential, in view of the clinical importance of identification of close neighbours for initiating public health action. However, as numbers of samples grow, the requirement to perform a new-against-all comparison may become rate limiting. If it does so, then approximate nearest neighbour matching could be deployed, using each variant locus as a dimension in a multi-dimensional clustering algorithm; algorithms exist which allow highly efficient selective searching of such data using k-d forests or k-means search trees [15]. In 2015, there were about 3 million cases of TB confirmed by culture globally [16]. Even if only 5% of these are sequenced annually, the generation of 1 M bacterial sequences will be achieved in the next decade. With a comparison rate of 15,000 sequences/s/core, and an in-memory footprint of about 2 MB/sequence, we project that we could store these in about 2 TB of RAM, either on a single server, or in a distributed manner. If 80 computational cores were available, the time to search for neighbours among these would approximate 1–2 s. The commercial cost of a single server with suitable hardware is at present about GBP 50,000, making the approach we describe using reference based compression a candidate for tracking of TB relatedness in large, international datasets, if reference based mapping approaches are used.

Conclusions

The FindNeighbour tools are, to our knowledge, first-in-class applications allowing progressive addition of mapped sequence data to a sequence store, thence allowing rapid identification of neighbours based on SNP distance data derived from reference mapping. These tools make no assumptions about organism phylogeny in their operation. Therefore, we believe these tools will have applicability to a range of applications in bacterial genomics.

Availability and requirements

For project home pages, please see section below. Operating systems and languages used are shown inTable. The software is released under LGPL licence. Pseudocode for maintaining a variation model. (PDF 271 kb) Environment used for FindNeighbour and Bugmat performance testing. (PDF 272 kb)

15 in total

1. Parallel Mutual Information Based Construction of Genome-Scale Networks on the Intel® Xeon Phi™ Coprocessor.

Authors: Sanchit Misra; Kiran Pamnany; Srinivas Aluru
Journal: IEEE/ACM Trans Comput Biol Bioinform Date: 2015 Sep-Oct Impact factor: 3.710

2. Comparison of high-throughput sequencing data compression tools.

Authors: Ibrahim Numanagić; James K Bonfield; Faraz Hach; Jan Voges; Jörn Ostermann; Claudio Alberti; Marco Mattavelli; S Cenk Sahinalp
Journal: Nat Methods Date: 2016-10-24 Impact factor: 28.547

Review 3. Contact investigations for outbreaks of Mycobacterium tuberculosis: advances through whole genome sequencing.

Authors: T M Walker; P Monk; E Grace Smith; T E A Peto
Journal: Clin Microbiol Infect Date: 2013-02-21 Impact factor: 8.067

4. Prevalence of LRTI in Patients Presenting with Productive Cough and Their Antibiotic Resistance Pattern.

Authors: Sunil Vijay; Gaurav Dalela
Journal: J Clin Diagn Res Date: 2016-01-01

5. Prospective use of whole genome sequencing (WGS) detected a multi-country outbreak of Salmonella Enteritidis.

Authors: T Inns; P M Ashton; S Herrera-Leon; J Lighthill; S Foulkes; T Jombart; Y Rehman; A Fox; T Dallman; E DE Pinna; L Browning; J E Coia; O Edeghere; R Vivancos
Journal: Epidemiol Infect Date: 2016-10-26 Impact factor: 4.434

6. Out-of-Africa migration and Neolithic coexpansion of Mycobacterium tuberculosis with modern humans.

Authors: Iñaki Comas; Mireia Coscolla; Tao Luo; Sonia Borrell; Kathryn E Holt; Midori Kato-Maeda; Julian Parkhill; Bijaya Malla; Stefan Berg; Guy Thwaites; Dorothy Yeboah-Manu; Graham Bothamley; Jian Mei; Lanhai Wei; Stephen Bentley; Simon R Harris; Stefan Niemann; Roland Diel; Abraham Aseffa; Qian Gao; Douglas Young; Sebastien Gagneux
Journal: Nat Genet Date: 2013-09-01 Impact factor: 38.330

7. Whole-Genome Sequencing Reveals the Contribution of Long-Term Carriers in Staphylococcus aureus Outbreak Investigation.

Authors: N C Gordon; B Pichon; T Golubchik; D J Wilson; J Paul; D S Blanc; K Cole; J Collins; N Cortes; M Cubbon; F K Gould; P J Jenks; M Llewelyn; J Q Nash; J M Orendi; K Paranthaman; J R Price; L Senn; H L Thomas; S Wyllie; D W Crook; T E A Peto; A S Walker; A M Kearns
Journal: J Clin Microbiol Date: 2017-05-03 Impact factor: 5.948

Review 8. Transforming clinical microbiology with bacterial genome sequencing.

Authors: Xavier Didelot; Rory Bowden; Daniel J Wilson; Tim E A Peto; Derrick W Crook
Journal: Nat Rev Genet Date: 2012-08-07 Impact factor: 53.242

9. Diverse sources of C. difficile infection identified on whole-genome sequencing.

Authors: Derrick W Crook; Mark H Wilcox; Tim E A Peto; A Sarah Walker; David W Eyre; Madeleine L Cule; Daniel J Wilson; David Griffiths; Alison Vaughan; Lily O'Connor; Camilla L C Ip; Tanya Golubchik; Elizabeth M Batty; John M Finney; David H Wyllie; Xavier Didelot; Paolo Piazza; Rory Bowden; Kate E Dingle; Rosalind M Harding
Journal: N Engl J Med Date: 2013-09-26 Impact factor: 91.245

10. Whole-genome sequencing to determine transmission of Neisseria gonorrhoeae: an observational study.

Authors: Dilrini De Silva; Joanna Peters; John Paul; David W Eyre; Kevin Cole; Michelle J Cole; Fiona Cresswell; Gillian Dean; Jayshree Dave; Daniel Rh Thomas; Kirsty Foster; Alison Waldram; Daniel J Wilson; Xavier Didelot; Yonatan H Grad; Derrick W Crook; Tim Ea Peto; A Sarah Walker
Journal: Lancet Infect Dis Date: 2016-07-12 Impact factor: 25.071

8 in total

1. GrapeTree: visualization of core genomic relationships among 100,000 bacterial pathogens.

Authors: Zhemin Zhou; Nabil-Fareed Alikhan; Martin J Sergeant; Nina Luhmann; Cátia Vaz; Alexandre P Francisco; João André Carriço; Mark Achtman
Journal: Genome Res Date: 2018-07-26 Impact factor: 9.043

2. Control of Artifactual Variation in Reported Intersample Relatedness during Clinical Use of a Mycobacterium tuberculosis Sequencing Pipeline.

Authors: David H Wyllie; Nicholas Sanderson; Richard Myers; Tim Peto; Esther Robinson; Derrick W Crook; E Grace Smith; A Sarah Walker
Journal: J Clin Microbiol Date: 2018-07-26 Impact factor: 5.948

3. Hash-Based Core Genome Multilocus Sequence Typing for Clostridium difficile.

Authors: A Sarah Walker; Mark H Wilcox; David W Eyre; Tim E A Peto; Derrick W Crook
Journal: J Clin Microbiol Date: 2019-12-23 Impact factor: 5.948

4. Computational pan-genome mapping and pairwise SNP-distance improve detection of Mycobacterium tuberculosis transmission clusters.

Authors: Christine Jandrasits; Stefan Kröger; Walter Haas; Bernhard Y Renard
Journal: PLoS Comput Biol Date: 2019-12-09 Impact factor: 4.475

5. A publicly accessible database for Clostridioides difficile genome sequences supports tracing of transmission chains and epidemics.

Authors: Martinique Frentrup; Zhemin Zhou; Matthias Steglich; Jan P Meier-Kolthoff; Markus Göker; Thomas Riedel; Boyke Bunk; Cathrin Spröer; Jörg Overmann; Marion Blaschitz; Alexander Indra; Lutz von Müller; Thomas A Kohl; Stefan Niemann; Christian Seyboldt; Frank Klawonn; Nitin Kumar; Trevor D Lawley; Sergio García-Fernández; Rafael Cantón; Rosa Del Campo; Ortrud Zimmermann; Uwe Groß; Mark Achtman; Ulrich Nübel
Journal: Microb Genom Date: 2020-07-29

Review 6. Infection prevention and control insights from a decade of pathogen whole-genome sequencing.

Authors: D W Eyre
Journal: J Hosp Infect Date: 2022-02-12 Impact factor: 8.944

7. Catwalk: identifying closely related sequences in large microbial sequence databases.

Authors: Denis Volk; Fan Yang-Turner; Xavier Didelot; Derrick W Crook; David Wyllie
Journal: Microb Genom Date: 2022-06

8. A Quantitative Evaluation of MIRU-VNTR Typing Against Whole-Genome Sequencing for Identifying Mycobacterium tuberculosis Transmission: A Prospective Observational Cohort Study.

Authors: David H Wyllie; Jennifer A Davidson; E Grace Smith; Priti Rathod; Derrick W Crook; Tim E A Peto; Esther Robinson; Tim Walker; Colin Campbell
Journal: EBioMedicine Date: 2018-08-01 Impact factor: 8.143

8 in total