Oriol Mazariegos-Canellas, Trien Do, Tim Peto, David W Eyre, Anthony Underwood, Derrick Crook, David H Wyllie.
Abstract
BACKGROUND: Large-scale bacterial sequencing has made determining the genetic relationships within large collections of bacterial genomes from the same microbial species an increasingly common task. Solutions to this problem have applications in public health (for example, in detecting possible disease transmission) and in divide-and-conquer strategies that select groups of similar isolates for computationally intensive phylogenetic inference using, for example, maximum likelihood methods. However, generating and maintaining distance matrices is computationally intensive, and rapid methods of doing so are needed to allow translation of microbial genomics into public health actions.
Keywords: Bacterial genomes; Distance matrix; Phylogenetic analysis
Year: 2017 PMID: 29132318 PMCID: PMC5683244 DOI: 10.1186/s12859-017-1907-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Comparison of three solutions
| Implementation | BugMat | FindNeighbour1 | FindNeighbour2 |
|---|---|---|---|
| Presentation | Command Line | Server application | Server application |
| Technology | In-memory matrix | In-memory matrix | In-memory reference based sequences |
| Can add samples | No | Yes | Yes |
| Role in production environment | Remove invariant sites before maximum likelihood tree generation | Store pairwise distances between samples | Store significant pairwise distances between samples |
| Stores all pairwise distances | Yes | Yes | No (customisable) |
| Uses database for sequence metadata storage | No | Yes | Yes |
| Uses database for pairwise distance storage | No | No | Yes |
| Requires reference sequence specified | No | No | Yes |
| Implementation language | C++ | Python, C++ | Python |
A comparison of the approaches taken by the three solutions (BugMat, FindNeighbour1, FindNeighbour2) is shown.
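The table gives BugMat's production role as removing invariant sites before tree generation. The idea can be illustrated with a minimal Python sketch. This is not the BugMat implementation (which is written in C++); the function names `remove_invariant_sites` and `pairwise_distances` are hypothetical, and the sketch assumes equal-length, already-aligned sequences with no special handling of ambiguous or uncalled bases.

```python
def remove_invariant_sites(alignment):
    """Drop columns at which every sequence carries the same base.

    `alignment` maps sample name -> aligned sequence (all the same length).
    Returns (variable_positions, reduced_alignment).
    """
    if not alignment:
        return [], {}
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    n_sites = lengths.pop()
    seqs = list(alignment.values())
    # A site is variable if more than one distinct base is seen in its column.
    variable = [i for i in range(n_sites) if len({s[i] for s in seqs}) > 1]
    reduced = {
        name: "".join(seq[i] for i in variable)
        for name, seq in alignment.items()
    }
    return variable, reduced


def pairwise_distances(seqs):
    """Full matrix of pairwise SNP counts, keyed by (name_a, name_b)."""
    names = sorted(seqs)
    return {
        (a, b): sum(x != y for x, y in zip(seqs[a], seqs[b]))
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


alignment = {"s1": "ACGTA", "s2": "ACGTT", "s3": "ATGTA"}
positions, reduced = remove_invariant_sites(alignment)
```

Because invariant columns contribute nothing to any pairwise difference, the distances computed on the reduced alignment equal those computed on the full alignment, while downstream tools process far fewer sites.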
Data sets used for testing FindNeighbour performance
| Dataset | Sites called | SNP threshold below which links are stored | Memory usage | Mean time to add one sample |
|---|---|---|---|---|
| A | 329,714 sites excluded | 20 | 23.5G | 2.23 s |
| B | All sites included | 500 | 19.3G | 2.95 s |
| C | 51,897 sites excluded | 20 | 7.4G | 1.77 s |
Data sets studied and FindNeighbour2 performance
The data sets studied, which can be downloaded at https://ora.ox.ac.uk/objects/uuid:82ce6500-fa71-496a-8ba5-ba822b6cbb50, are described. Also shown are the performance characteristics of FindNeighbour2 operating on them using the hardware in Additional file 2.
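The combination of reference-based storage, excluded sites, and a SNP threshold for stored links can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the FindNeighbour2 code: the class `NeighbourStore` and the functions `to_variants` and `snp_distance` are hypothetical names, sequences are assumed to be the same length as the reference, and uncalled bases are not treated specially.

```python
def to_variants(sequence, reference, excluded=frozenset()):
    """Keep only positions where the sample differs from the reference."""
    return {
        i: base
        for i, (base, ref) in enumerate(zip(sequence, reference))
        if base != ref and i not in excluded
    }


def snp_distance(va, vb):
    """Count positions at which two reference-compressed samples disagree.

    A position missing from a dict implicitly carries the reference base.
    """
    return sum(1 for pos in va.keys() | vb.keys() if va.get(pos) != vb.get(pos))


class NeighbourStore:
    """Hypothetical in-memory store keeping only links below a SNP cutoff."""

    def __init__(self, reference, snp_cutoff, excluded=frozenset()):
        self.reference = reference
        self.snp_cutoff = snp_cutoff
        self.excluded = frozenset(excluded)
        self.variants = {}  # sample name -> {position: base}
        self.links = {}     # frozenset({a, b}) -> SNP distance

    def add_sample(self, name, sequence):
        new = to_variants(sequence, self.reference, self.excluded)
        # Compare against every stored sample; keep only close pairs.
        for other, existing in self.variants.items():
            d = snp_distance(new, existing)
            if d < self.snp_cutoff:
                self.links[frozenset((name, other))] = d
        self.variants[name] = new

    def neighbours(self, name):
        return {
            next(iter(pair - {name})): d
            for pair, d in self.links.items()
            if name in pair
        }


store = NeighbourStore(reference="AAAAAAAAAA", snp_cutoff=2)
store.add_sample("s1", "AAAAAAAAAA")
store.add_sample("s2", "CAAAAAAAAA")
store.add_sample("s3", "CCCCAAAAAA")
```

In this sketch, samples that sit far from all others add no entries to `links`, so a lower cutoff reduces the number of stored distances, and excluding sites shrinks the per-sample variant dictionaries.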
Fig. 1 Performance of the BugMat application. The relationship between the number of sequences processed, the number of cores assigned to BugMat, and (a) memory usage and (b) the time taken by various stages of sequence processing. Testing was performed on an Ubuntu 14 VM instance with 16 cores, Intel Xeon E5-2680v2 processors at 2.8 GHz, and 128 GB RAM hosted within Genome England Ltd. Similar results were obtained using the specification in Additional file 2.
Fig. 2 Performance of the FindNeighbour application. The time taken to add 4000 individual M. tuberculosis samples to a FindNeighbour server is shown in Panel a. Testing was performed on an Ubuntu 14 VM instance with 16 cores, Intel Xeon E5-2680v2 processors at 2.8 GHz, and 128 GB RAM hosted within Genome England Ltd. Samples which diverged from each other at various times in the past (depicted in Panel b) can be added either to one FindNeighbour instance or to multiple (63, see text) instances. The performance of these strategies is illustrated in Panel c.