Mohammed Al-Jaff, Eric Sandström, Manfred Grabherr.
Abstract
BACKGROUND: A common challenge in bioinformatics is to identify short sub-sequences that are unique in a set of genomes or reference sequences, which can be achieved efficiently by k-mer (k consecutive nucleotides) counting. However, several areas would benefit from a more stringent definition of "unique", requiring that these sub-sequences of length W differ by more than k mismatches (i.e. a Hamming distance greater than k) from any other sub-sequence, which we term the k-disjoint problem. Examples include finding sequences unique to a pathogen for probe-based infection diagnostics; reducing off-target hits for re-sequencing or genome editing; detecting sequence (e.g. phage or viral) insertions; and detecting multiple substitution mutations. Since both sensitivity and specificity are critical, an exhaustive yet efficient solution is desirable.
Keywords: Sequence mining; Software; k-disjoint problem
Year: 2017 PMID: 28464826 PMCID: PMC5414201 DOI: 10.1186/s12859-017-1644-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
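The k-disjoint definition given in the abstract can be illustrated with a brute-force sketch. This is not microTaboo's algorithm, only a direct reading of the definition: a W-long window of a query sequence is kept if its Hamming distance to every W-long window of a "taboo" sequence is greater than k. The function names `hamming`, `windows` and `k_disjoint` are hypothetical, and the quadratic comparison shown here is only practical for very small inputs.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def windows(seq: str, w: int) -> list[str]:
    """All W-long substrings of seq, in order of occurrence."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def k_disjoint(query: str, taboo: str, w: int, k: int) -> list[str]:
    """Query windows that differ by more than k mismatches from every taboo window.

    Brute force: every query window is compared against every distinct
    taboo window (assumption: query and taboo are different sequences).
    """
    taboo_windows = set(windows(taboo, w))
    return [
        q for q in windows(query, w)
        if all(hamming(q, t) > k for t in taboo_windows)
    ]


if __name__ == "__main__":
    # Toy sequences, chosen only for illustration.
    print(k_disjoint("ACGTAC", "ACGAAA", w=4, k=1))
```

With k = 0 the filter reduces to exact-match uniqueness, which is the case that plain k-mer counting already covers; larger k makes the criterion stricter.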
Fig. 1 Data and process flowchart of microTaboo. The flowchart shows the main modules, data and processes of microTaboo's internal algorithm. A and B (green and blue documents) each represent one or more FASTA files. For a user-specified sequence length W, all W-long substrings in B are converted into their respective N-code vectors, and a list of all such vectors in B is output. This list is then used to construct a Dictree (a hybrid dictionary-tree-list data structure), which is fed into the filter engine module. The filter engine filters all W-long substrings in A (after N-code conversion) and writes an output file containing the k-disjoint set, the k-intersection, or both.
Fig. 2 N-code conversion table and mismatch matrix for N = 3. (Left) Concept of N-coding with N = 3, using only the nucleotides A, C, G and T. Every unique sequence is assigned a unique number in ascending lexicographical order. (Right) Visualization of a mismatch matrix in N-code format for N = 3, again using only A, C, G and T. Each cell contains the Hamming distance between its row and column elements, i.e. the sequences or N-code values represented there. For example, the sequences “AAA” → <0> and “AAG” → <2> are at a Hamming distance of 1 from each other, as shown in cells (0, 2) and (2, 0). Meanwhile, the distance between “AAA” → <0> and “TTT” → <63> is 3, as shown in cells (0, 63) and (63, 0).
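The N-coding and mismatch matrix described in the Fig. 2 caption can be sketched as follows. The encoding (base-4 number with A, C, G, T in lexicographical order) follows the caption. How a W-long sequence is split into an N-code vector is not spelled out in this section, so `n_code_vector` assumes non-overlapping blocks of length N; that function and all other names here are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"  # lexicographical order, so A=0, C=1, G=2, T=3


def n_code(block: str) -> int:
    """Rank of an N-long block among all N-long words in lexicographical order."""
    code = 0
    for base in block:
        code = code * len(ALPHABET) + ALPHABET.index(base)
    return code


def mismatch_matrix(n: int) -> list[list[int]]:
    """matrix[i][j] = Hamming distance between the words with N-codes i and j."""
    # product() over a sorted alphabet yields words in lexicographical order,
    # so a word's list index equals its N-code.
    words = ["".join(p) for p in product(ALPHABET, repeat=n)]
    return [[sum(a != b for a, b in zip(r, c)) for c in words] for r in words]


def n_code_vector(seq: str, n: int) -> list[int]:
    """ASSUMPTION: split seq into non-overlapping N-long blocks and encode each."""
    if len(seq) % n != 0:
        raise ValueError("sequence length must be a multiple of n")
    return [n_code(seq[i:i + n]) for i in range(0, len(seq), n)]


def vector_distance(u: list[int], v: list[int], matrix: list[list[int]]) -> int:
    """Hamming distance of two sequences via per-block matrix lookups."""
    return sum(matrix[a][b] for a, b in zip(u, v))


if __name__ == "__main__":
    m = mismatch_matrix(3)
    print(n_code("AAG"), m[0][2], m[0][63])
```

Under this assumption, the Hamming distance between two W-long sequences is the sum of one table lookup per block, rather than a character-by-character comparison.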
Fraction of unique sequences
| Organism | % k = 0 | % k = 1 | % k = 2 |
|---|---|---|---|
| C. albicans | 96.0 (97.4) | 91.2 (96.6) | 63.7 (94.2) |
| D. melanogaster | 92.3 (94.0) | 83.9 (92.9) | 40.7 (85.9) |
| M. musculus | 73.3 (83.0) | 32.9 (72.7) | 0.5 (4.3) |
Listed are the fractions of 20-nucleotide-long sequences (W = 20) that are unique, with the genomic territory covered in parentheses, for k = 0, 1 and 2 on C. albicans, D. melanogaster and M. musculus. For each run, copies of the files containing the genome of the organism of interest were placed in both the query folder and the “taboo” folder. For the mouse genome, only the genome file for chromosome 16 was placed in the query folder.
Runtime scaling over multiple cores
| #cores | Time (s) | Speed up |
|---|---|---|
| 1 | 4271 | N/A |
| 2 | 2305 | 1.85 |
| 3 | 1726 | 2.47 |
| 4 | 1576 | 2.71 |
| 6 | 1098 | 3.90 |
| 8 | 1066 | 4.00 |
| 10 | 850 | 5.02 |
Runtime of microTaboo for different numbers of cores, with all other parameters fixed. The speed-up factor is calculated relative to the runtime on a single core. Enterobacteria phage lambda was used as the query organism and Escherichia coli K12 as the “taboo” organism. The parameters used were W = 60 and k = 3 for all runs.
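The speed-up column in the table above is the single-core runtime divided by the runtime on n cores. The short sketch below recomputes it from the listed times and also derives parallel efficiency (speed-up divided by core count), a quantity that is not reported in the table and is added here only for illustration. The names `RUNTIMES_S`, `speedup` and `efficiency` are hypothetical.

```python
# Runtimes in seconds, keyed by number of cores, copied from the table above.
RUNTIMES_S = {1: 4271, 2: 2305, 3: 1726, 4: 1576, 6: 1098, 8: 1066, 10: 850}


def speedup(runtimes: dict[int, float]) -> dict[int, float]:
    """Speed-up relative to the single-core runtime: T(1) / T(n)."""
    base = runtimes[1]
    return {cores: base / t for cores, t in runtimes.items()}


def efficiency(runtimes: dict[int, float]) -> dict[int, float]:
    """Parallel efficiency: speed-up divided by the number of cores."""
    return {cores: s / cores for cores, s in speedup(runtimes).items()}


if __name__ == "__main__":
    s, e = speedup(RUNTIMES_S), efficiency(RUNTIMES_S)
    for cores in sorted(RUNTIMES_S):
        print(f"{cores:>2} cores: speed-up {s[cores]:.2f}, efficiency {e[cores]:.2f}")
```

Because the listed runtimes are rounded to whole seconds, recomputed values may differ from the table in the last digit.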