João A Carriço1, Maxime Crochemore2, Alexandre P Francisco3,4, Solon P Pissis2, Bruno Ribeiro-Gonçalves1, Cátia Vaz3,5.
Abstract
BACKGROUND: Microbial typing methods are commonly used to study the relatedness of bacterial strains. Sequence-based typing methods are a gold standard for epidemiological surveillance because sequence and allelic profile data are inherently portable, analysis times are short, and these methods make it possible to create common nomenclatures for strains or clones. This has led to the development of several novel methods and to databases being made available for many microbial species. With the mainstream use of High Throughput Sequencing, these databases are accumulating large amounts of data, storing thousands of different profiles. At the same time, computing genetic evolutionary distances among a set of typing profiles or taxa dominates the running time of many phylogenetic inference methods. Notably, most definitions of genetic evolutionary distance rely, even if indirectly, on computing the pairwise Hamming distance among sequences or profiles.
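As an illustration of the pairwise Hamming distance computation the abstract refers to, the following is a minimal sketch of the naive all-pairs approach. The function names `hamming` and `pairwise_hamming` are hypothetical, and profiles are assumed to be equal-length sequences of allele identifiers; this is not the authors' implementation.

```python
from itertools import combinations
from typing import Dict, Hashable, Sequence, Tuple


def hamming(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Count the positions at which two equal-length profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_hamming(
    profiles: Sequence[Sequence[Hashable]],
) -> Dict[Tuple[int, int], int]:
    """Naive all-pairs computation: every pair (i, j) with i < j is compared."""
    return {
        (i, j): hamming(profiles[i], profiles[j])
        for i, j in combinations(range(len(profiles)), 2)
    }


if __name__ == "__main__":
    # Toy allelic profiles (assumed data, for illustration only).
    toy = [[1, 2, 3, 4], [1, 2, 0, 4], [9, 2, 0, 4]]
    print(pairwise_hamming(toy))
```

For d profiles of length L this compares all d(d-1)/2 pairs position by position, which is why the distance step can dominate the running time on large databases.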
Keywords: Computational biology; Hamming distance; Phylogenetic inference
Year: 2018 PMID: 29467814 PMCID: PMC5815242 DOI: 10.1186/s13015-017-0119-7
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Data structures used in our approach for each step
| Profile indexing | Candidate profile pairs enumeration | Pairs verification |
|---|---|---|
| Suffix array | Binary search | Naïve |
| | LCP based clusters | |
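The data structures named in the table above can be sketched generically as follows. This is a hedged illustration of a suffix array (SA), of finding the lower and upper bounds of a pattern in the SA by binary search, and of grouping adjacent SA entries by their longest common prefix (LCP). All function names are hypothetical, the construction is deliberately naive, and how profiles are encoded into the indexed text and how the minimum match length is chosen are not stated in this excerpt, so both are left to the caller.

```python
from typing import List, Sequence, Tuple


def build_suffix_array(text: Sequence) -> List[int]:
    """Naive construction: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: list(text[i:]))


def sa_bounds(text: Sequence, sa: List[int], pattern: Sequence) -> Tuple[int, int]:
    """Return [start, end) of SA entries whose suffix begins with pattern."""
    m, pat = len(pattern), list(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # lower bound: first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if list(text[sa[mid]:sa[mid] + m]) < pat:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # upper bound: first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if list(text[sa[mid]:sa[mid] + m]) <= pat:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def lcp_array(text: Sequence, sa: List[int]) -> List[int]:
    """lcp[i] is the common prefix length of suffixes sa[i-1] and sa[i]."""
    lcp = [0] * len(sa)
    for i in range(1, len(sa)):
        a, b, k = sa[i - 1], sa[i], 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        lcp[i] = k
    return lcp


def lcp_clusters(sa: List[int], lcp: List[int], min_len: int) -> List[List[int]]:
    """Group maximal runs of adjacent SA entries sharing a prefix >= min_len."""
    clusters: List[List[int]] = []
    current: List[int] = sa[:1]
    for i in range(1, len(sa)):
        if lcp[i] >= min_len:
            current.append(sa[i])
        else:
            if len(current) > 1:
                clusters.append(current)
            current = [sa[i]]
    if len(current) > 1:
        clusters.append(current)
    return clusters


if __name__ == "__main__":
    text = list("banana")  # characters stand in for allele identifiers
    sa = build_suffix_array(text)
    print(sa, sa_bounds(text, sa, "ana"), lcp_clusters(sa, lcp_array(text, sa), 2))
```

Either the SA interval returned by `sa_bounds` or the clusters returned by `lcp_clusters` yields positions that share an exact match; under this sketch, pairs drawn from them would then be verified with a direct distance computation.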
Real datasets used in the experimental evaluation
| Dataset | Typing method | Profile length | Number of distinct elements | References |
|---|---|---|---|---|
| | wgMLST | 5446 | 5669 | (*) |
| | wgMLST | 3002 | 6861 | [ |
| | SNP | 22,143 | 1534 | [ |
| | cgMLST | 235 | 1968 | [ |
(*) Dataset provided by the Molecular Microbiology and Infection Unit, IMM
Fig. 1 Synthetic datasets, with and according to Theorem 1. Running time for computing pairwise distances by finding lower and higher bounds in the SA, and by processing LCP based clusters, as a function of the input size
Fig. 2 Synthetic datasets, with and . Running time for computing pairwise distances by finding lower and higher bounds in the SA, and by processing LCP based clusters, as a function of the number d of profiles and for different values of k
Fig. 3 Synthetic datasets, with and according to Theorem 1. Running time for computing pairwise distances naïvely, by finding lower and higher bounds in the SA, and by processing LCP based clusters, as a function of the number d of profiles
Time and percentage of pairs processed for each method and dataset
| Dataset | k | Naïve t (s) | Naïve pairs (%) | Binary search t (s) | Binary search pairs (%) | LCP clusters t (s) | LCP clusters pairs (%) |
|---|---|---|---|---|---|---|---|
| | 8 | 108.59 | 100 | 0.22 | 0.06 | | 0.06 |
| | 16 | 109.30 | 100 | 0.48 | 0.32 | | 0.32 |
| | 32 | 108.60 | 100 | 3.52 | 5.45 | | 5.45 |
| | 64 | | 100 | 231.05 | 99.98 | 162.36 | 99.98 |
| | 8 | 89.85 | 100 | 1.04 | 2.37 | | 2.37 |
| | 16 | 87.26 | 100 | 7.16 | 12.69 | | 12.69 |
| | 32 | 85.36 | 100 | 36.29 | 33.22 | | 33.22 |
| | 64 | | 100 | 254.45 | 82.44 | 187.15 | 82.44 |
| | 89 | 28.83 | 100 | 16.63 | 91.48 | | 91.48 |
| | 178 | | 100 | 46.98 | 99.91 | 32.03 | 99.91 |
| | 890 | | 100 | 113.57 | 100 | 129.14 | 100 |
| | 8 | 0.56 | 100 | 0.02 | 0.93 | | 0.93 |
| | 16 | 0.57 | 100 | 0.05 | 1.71 | | 1.71 |
| | 32 | 0.56 | 100 | 0.20 | 4.42 | | 4.42 |
| | 64 | | 100 | 5.63 | 73.36 | 5.01 | 73.36 |
The minimum time for each row is highlighted in italic
Fig. 4 The tree inferred for the largest connected component found with for the C. jejuni dataset. Image produced by PHYLOViZ [35]