Aqil M Azmi, Abdulrakeeb Al-Ssulami.
Abstract
A major task in computational biology is the discovery of short recurring string patterns known as motifs. Most schemes for discovering motifs are either stochastic or combinatorial in nature. Stochastic approaches do not guarantee finding the correct motifs, while combinatorial schemes tend to have exponential time complexity with respect to motif length. To alleviate the cost, combinatorial approaches exploit dynamic data structures such as trees or graphs. Recently, Karci (2009, Efficient automatic exact motif discovery algorithms for biological sequences, Expert Systems with Applications 36:7952-7963) devised a deterministic algorithm that finds all the identical copies of string motifs of all sizes [Formula: see text] in theoretical time complexity of [Formula: see text] and a space complexity of [Formula: see text], where [Formula: see text] is the length of the input sequence and [Formula: see text] is the length of the longest possible string motif. In this paper, we present a significant improvement on Karci's original algorithm. The algorithm that we propose reports all identical string motifs of sizes [Formula: see text] that occur at least [Formula: see text] times. Our algorithm starts with string motifs of size 2, and at each iteration it expands the candidate string motifs by one symbol, throwing out those that occur fewer than [Formula: see text] times in the entire input sequence. We use a simple array and data encoding to achieve a theoretical worst-case time complexity of [Formula: see text] and a space complexity of [Formula: see text]. Encoding the substrings can speed up comparisons between string motifs. Experimental results on random and real biological sequences confirm that our algorithm indeed has linear time complexity and is more scalable in terms of sequence length than the existing algorithms.
Year: 2014 PMID: 24871320 PMCID: PMC4037181 DOI: 10.1371/journal.pone.0095148
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Encoding of pair bases.
| AA | AC | AG | AT | CA | CC | CG | CT | GA | GC | GG | GT | TA | TC | TG | TT |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
This is the full list of all 2–lets candidate motifs (CanMotifs).
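The pair encoding in the table above is consistent with reading each 2-let as a two-digit base-4 number, with A=0, C=1, G=2, T=3. The following is a minimal sketch under that reading; the names `DIGIT`, `encode_pair`, and `encode_all_pairs` are illustrative and do not come from the paper.

```python
from typing import List

# Assumed digit assignment, inferred from the table (AA=0, AC=1, ..., TT=15).
DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_pair(pair: str) -> int:
    """Return the table code of a 2-let: 4 * (first digit) + (second digit)."""
    first, second = pair
    return 4 * DIGIT[first] + DIGIT[second]


def encode_all_pairs(seq: str) -> List[int]:
    """Encode every 2-let of seq (overlaps included), ordered by start position."""
    return [encode_pair(seq[i:i + 2]) for i in range(len(seq) - 1)]
```

For instance, `encode_all_pairs("ACGT")` encodes the 2-lets AC, CG, and GT in that order.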
Figure 1. Each occurrence of a k–lets motif is augmented with its right nucleotide to form the (k+1)–lets CanMotif.
Figure 2. Left augmentation of one k–lets motif and right augmentation of another yield the same (k+1)–lets motif.
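The expand-and-filter loop described in the abstract (start from 2-lets, extend each surviving occurrence by its right nucleotide, discard candidates below the occurrence threshold) can be sketched as follows. This is only an illustration of the idea: it uses a dictionary of plain strings rather than the array and integer encoding the paper relies on, so it does not reproduce the stated complexity bounds. The function name and the parameters `min_count` and `max_len` are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def find_repeated_motifs(seq: str, min_count: int = 2,
                         max_len: Optional[int] = None) -> Dict[str, List[int]]:
    """Map each substring of length >= 2 that occurs at least min_count times
    to the sorted list of its starting positions (0-based)."""
    # Level k = 2: group every 2-let by its text.
    groups = defaultdict(list)
    for i in range(len(seq) - 1):
        groups[seq[i:i + 2]].append(i)
    current = {m: p for m, p in groups.items() if len(p) >= min_count}

    result: Dict[str, List[int]] = {}
    k = 2
    while current and (max_len is None or k <= max_len):
        result.update(current)
        # Right-augment each surviving occurrence by one nucleotide.
        groups = defaultdict(list)
        for motif, positions in current.items():
            for i in positions:
                if i + k < len(seq):
                    groups[motif + seq[i + k]].append(i)
        # Keep only candidates that still meet the threshold.
        current = {m: p for m, p in groups.items() if len(p) >= min_count}
        k += 1
    return result
```

Because a (k+1)-let can only be frequent if its length-k prefix is frequent, extending only the surviving occurrences never misses a repeated motif.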
Encoding of CanMotifs with different lengths (k).
| k | Max possible CanMotifs | First value | Last value |
| 2 | 16 | 0 | 15 |
| 3 | 64 | 16 | 79 |
| 4 | 256 | 80 | 335 |
| 5 | 1024 | 336 | 1359 |
| 6 | 4096 | 1360 | 5455 |
| 7 | 16384 | 5456 | 21839 |
| 8 | 65536 | 21840 | 87375 |
| 9 | 262144 | 87376 | 349519 |
| 10 | 1048576 | 349520 | 1398095 |
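The ranges in this table follow a simple pattern: the codes for length k start right after the codes of all shorter lengths, so the first value is 4^2 + 4^3 + ... + 4^(k-1) = (4^k - 16)/3 and the block holds 4^k codes. The sketch below reproduces the table from that formula. The helper `extend_right` additionally assumes that codes within each length follow base-4 lexicographic order (A=0, C=1, G=2, T=3), which matches the 2-lets table but is otherwise an assumption; all function names are illustrative.

```python
def first_code(k: int) -> int:
    """Smallest code for a k-let: the number of codes used by lengths 2..k-1."""
    return (4 ** k - 16) // 3


def last_code(k: int) -> int:
    """Largest code for a k-let: the block for length k holds 4**k codes."""
    return first_code(k) + 4 ** k - 1


def extend_right(code: int, k: int, digit: int) -> int:
    """Code of the (k+1)-let obtained by appending one nucleotide (digit 0..3)
    to the k-let with the given code. Assumes lexicographic order per length."""
    local = code - first_code(k)          # index inside the length-k block
    return first_code(k + 1) + 4 * local + digit


if __name__ == "__main__":
    for k in range(2, 11):
        print(k, 4 ** k, first_code(k), last_code(k))
```

Under the lexicographic assumption, appending a nucleotide preserves the relative order of the parent codes, which is one way a sorted list of k-lets can yield a sorted list of (k+1)-lets without re-sorting from scratch.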
Figure 3. A linear algorithm to generate a sorted list of encoded (k+1)–lets CanMotifs from a sorted list of encoded k–lets motifs.
Each group is sorted individually.
Figure 4. All the 3 and 4–lets identical string motifs in the sample sequence.
Experimental results of running our algorithm on selected sets from the data sets [27], [28] using
| Sequences | k-lets | No. string motifs (overlapping: No) | No. string motifs (overlapping: Yes) | Examples with starting positions |
| dm02r | 11 | 2 | 15 | |
| | 10 | 9 | 25 | |
| | 9 | 25 | 44 | |
| yst09r | 23 | 2 | 2 | |
| | 17 | 3 | 3 | |
| | 16 | 5 | 5 | |
| | 15 | 13 | 13 | |
| hm20r | 42 | 1 | 4 | |
| | 41 | 3 | 7 | |
| dm01g | 14 | 2 | 2 | |
| | 13 | 5 | 6 | |
| | 12 | 11 | 13 | |
| | 11 | 28 | 32 | |
| mus03g | 12 | 2 | 2 | |
| | 11 | 4 | 5 | |
| | 10 | 13 | 15 | |
| yst01g | 14 | 1 | 2 | |
| | 13 | 3 | 5 | |
| | 12 | 12 | 16 | |
| | 11 | 37 | 44 | |
| hm20m | 18 | 1 | 1 | |
| | 17 | 3 | 3 | |
| | 16 | 7 | 7 | |
| | 15 | 17 | 19 | |
| | 14 | 47 | 52 | |
The counts give the number of distinct motifs. For the non-overlapping count, we consider a motif only if the starting positions of its occurrences are further apart than the motif length.
The starting position is based on index starting at 0. We followed [26] in treating each of the sequences as a single string. For example, yst09r.fasta is composed of 16 substrings each having 1000 nucleotides. These are merged into a single string with 16000 nucleotides.
This set includes real (sequences suffixed ‘r’), generic (sequences suffixed ‘g’), and Markov (sequences suffixed ‘m’) data sets. Only the larger identical string motifs are reported.
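The two motif counts in the table can be illustrated with a short sketch. It assumes one particular reading of the footnote: a motif enters the overlapping count if it occurs at least twice, and enters the non-overlapping count only if two of its occurrences start at least as far apart as the motif length. The source wording ("further apart than their length") could also be read as a strict inequality, so treat the `>=` below as an assumption. The function name is illustrative.

```python
from typing import Dict, List, Tuple


def count_motifs(motifs: Dict[str, List[int]]) -> Tuple[int, int]:
    """Return (non_overlapping, overlapping_allowed) counts of distinct motifs.

    motifs maps each motif string to the 0-based start positions of its
    occurrences in the merged sequence.
    """
    non_overlapping = 0
    overlapping_allowed = 0
    for motif, starts in motifs.items():
        if len(starts) < 2:
            continue  # a motif must repeat to be counted at all
        overlapping_allowed += 1
        # Assumed criterion: the two extreme occurrences do not overlap.
        if max(starts) - min(starts) >= len(motif):
            non_overlapping += 1
    return non_overlapping, overlapping_allowed
```

With this reading the non-overlapping count can never exceed the overlapping count, which agrees with every row of the table.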
Execution time (in seconds) to find identical string motifs of all sizes on an Intel core i5 based PC running at 2.67 GHz with 4 GB RAM.
| Sequence | Size (# nucleotides) | Karci algorithm | Our algorithm |
| mus06r | 1500 | 0.57 | 0.33 |
| dm06r | 3000 | 1.77 | 0.40 |
| yst04r | 7000 | 9.49 | 0.57 |
| hm26r | 9000 | 18.56 | 0.83 |
| yst09r | 16000 | 53.43 | 1.21 |
| hm01r | 36000 | 596.43 | 1.70 |
| hm20r | 70000 | 2225.15 | 2.45 |
Figure 5. The average execution time (seconds) to discover all the identical string motifs of all sizes in 10 randomly generated sequences of each length.
The algorithm clearly exhibits a linear behavior.
Execution time (in seconds) to discover all the identical string motifs of lengths not exceeding 40 nucleotides in real biological sequences.
| Organism | NCBI RefSeq | Size | Time |
| Vaccinia virus | NC_006998.1 | 0.19 M | 1.83 |
| Mycoplasma penetrans HF-2 | NC_004432.1 | 1.36 M | 10.38 |
| Lactobacillus acidophilus NCFM | NC_006814.3 | 1.99 M | 12.93 |
| Methanocella paludicola SANAE | NC_013665.1 | 2.96 M | 19.24 |
| Acidiphilium multivorum AIU301 | NC_015186.1 | 3.75 M | 27.51 |
| Mycobacterium tuberculosis H37Rv | NC_000962.2 | 4.41 M | 28.81 |
| Pectobacterium wasabiae WPP163 | NC_013421.1 | 5.06 M | 31.13 |
| Mesorhizobium opportunistum WSM2075 chromosome | NC_015675.1 | 6.88 M | 42.14 |
| Saccharopolyspora erythraea NRRL 2338 chromosome | NC_009142.1 | 8.21 M | 55.85 |
| Caenorhabditis elegans Bristol N2 chromosome III | NC_003281.10 | 13.78 M | 105.54 |
| Caenorhabditis elegans Bristol N2 chromosome II | NC_003280.10 | 15.28 M | 119.63 |
The size of the sequences is expressed in M (for millions).