Lorraine A K Ayad1, Solon P P Pissis2, Ahmad Retha1.
Abstract
BACKGROUND: Approximate string matching is the problem of finding all factors of a given text that are at a distance at most k from a given pattern. Fixed-length approximate string matching is the problem of finding all factors of a text of length n that are at a distance at most k from any factor of length ℓ of a pattern of length m. There exist bit-vector techniques to solve the fixed-length approximate string matching problem in time [Formula: see text] and space [Formula: see text] under the edit and Hamming distance models, where w is the size of the computer word; as such, these techniques are independent of the distance threshold k or the alphabet size. Fixed-length approximate string matching is a generalisation of approximate string matching and, hence, has numerous direct applications in computational molecular biology and elsewhere.
Keywords: Approximate string matching; Dynamic programming; Fixed-length approximate string matching; Software library
Year: 2016 PMID: 27832739 PMCID: PMC5103500 DOI: 10.1186/s12859-016-1320-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
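To make the problem statement in the abstract concrete, the following is a minimal brute-force sketch of fixed-length approximate string matching under the Hamming distance model. It is not the bit-vector technique the abstract refers to, and it is not libFLASM's API; the function names `hamming` and `naive_flasm_hamming` and the output format are assumptions made for illustration only. It compares every length-ℓ factor of the pattern with every length-ℓ factor of the text and reports the pairs at distance at most k.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))


def naive_flasm_hamming(text: str, pattern: str, ell: int, k: int):
    """Brute-force fixed-length approximate string matching (illustrative).

    Returns a list of (pattern_start, text_start, distance) triples, one for
    every pair of length-ell factors whose Hamming distance is at most k.
    Hypothetical helper, not part of any library.
    """
    hits = []
    # Every factor of length ell of the pattern (m - ell + 1 of them).
    for i in range(len(pattern) - ell + 1):
        x = pattern[i:i + ell]
        # Every factor of length ell of the text (n - ell + 1 of them).
        for j in range(len(text) - ell + 1):
            d = hamming(x, text[j:j + ell])
            if d <= k:
                hits.append((i, j, d))
    return hits


if __name__ == "__main__":
    # Factors of length 3 of "ATTC" are "ATT" and "TTC".
    print(naive_flasm_hamming("GATTACA", "ATTC", ell=3, k=1))
```

This baseline performs on the order of (m − ℓ + 1)(n − ℓ + 1)ℓ character comparisons, so it is only useful as a reference point for small inputs; the bit-vector techniques described in the abstract exist to avoid this cost.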
Fig. 1 Experiment I. Elapsed time in seconds of libFLASM under edit and Hamming distance models for n=m=10,000 and increasing factor length ℓ
Fig. 2 Experiment II. Elapsed-time comparison in log10 seconds of different programs for ACSM under edit and Hamming distance models for m=32,64,128,256, n=1,000,000, and increasing distance threshold k
RF distance between the Original and Random datasets as well as the RF distance between the Original and Restored datasets using Cyclope and BEAR
| Dataset | Random | Cyclope | BEAR | BEAR |
|---|---|---|---|---|
| <12,2500,0.05,0.06,0.04> | 0.000 | 0.000 | 0.000 | 0.000 |
| <12,2500,0.20,0.06,0.04> | 0.000 | 0.000 | 0.000 | 0.000 |
| <12,2500,0.35,0.06,0.04> | 0.000 | 0.000 | 0.000 | 0.000 |
| <25,2500,0.05,0.06,0.04> | 0.000 | 0.000 | 0.000 | 0.000 |
| <25,2500,0.20,0.06,0.04> | 0.000 | 0.000 | 0.000 | 0.000 |
| <25,2500,0.35,0.06,0.04> | 0.045 | 0.045 | 0.000 | 0.000 |
| <50,2500,0.05,0.06,0.04> | 0.085 | 0.000 | 0.021 | 0.000 |
| <50,2500,0.20,0.06,0.04> | 0.043 | 0.000 | 0.000 | 0.000 |
| <50,2500,0.35,0.06,0.04> | 0.043 | 0.000 | 0.021 | 0.000 |
Elapsed-time comparison in seconds of Cyclope and BEAR
| Dataset | Cyclope | BEAR | BEAR |
|---|---|---|---|
| <12,2500,0.05,0.06,0.04> | 79.09 | 15.92 | 46.53 |
| <12,2500,0.20,0.06,0.04> | 77.47 | 15.06 | 44.52 |
| <12,2500,0.35,0.06,0.04> | 76.76 | 14.85 | 45.44 |
| <25,2500,0.05,0.06,0.04> | 332.69 | 69.81 | 203.78 |
| <25,2500,0.20,0.06,0.04> | 342.94 | 69.28 | 208.85 |
| <25,2500,0.35,0.06,0.04> | 344.50 | 71.14 | 208.82 |
| <50,2500,0.05,0.06,0.04> | 1,317.81 | 293.45 | 851.07 |
| <50,2500,0.20,0.06,0.04> | 1,303.51 | 300.37 | 837.66 |
| <50,2500,0.35,0.06,0.04> | 1,359.90 | 286.88 | 854.83 |
Single motif extraction from real datasets
| Dataset | Parameters | Motif | Quorum (%) |
|---|---|---|---|
| RNA Polymerase | <350,110> | RNA polymerase Rpb2, domain 6 | 100 |
| | <60,28> | RNA polymerase Rpb2, domain 4 | 100 |
| | <40,12> | RNA polymerase Rpb2, domain 5 | 100 |
| | <90,30> | RNA polymerase Rpb2, domain 7 | 100 |
| Viruses | <350,150> | Viral methyltransferase | 100 |
| | <130,50> | Cucumber mosaic virus 1a protein | 100 |
| | <70,48> | Cucumber mosaic virus 1a protein C terminal | 100 |
| | <250,130> | Viral (Superfamily 1) RNA helicase | 100 |
| Hypothetical Proteins | <130,45> | Type III restriction enzyme, res subunit | 100 |
| | <60,30> | Helicase conserved C-terminal domain | 100 |
Structured motif extraction from synthetic datasets
| Parameters | Implanted structured motifs | Implanted structured motifs extracted |
|---|---|---|
| <(80,15)[5,15](60,10)[5,20](230,20)> | 25 | 25 |
| <(100,15)[5,15](80,10)[5,20](250,20)> | 25 | 25 |
| <(120,15)[5,15](100,10)[5,20](270,20)> | 25 | 25 |
| <(140,15)[5,15](120,10)[5,20](290,20)> | 25 | 25 |
| <(160,15)[5,15](140,10)[5,20](310,20)> | 25 | 25 |
| <(180,15)[5,15](160,10)[5,20](330,20)> | 25 | 25 |
| <(200,15)[5,15](180,10)[5,20](350,20)> | 25 | 25 |
Elapsed-time comparison in seconds for implementing the Chang and Marr index using a pattern of length 32
| | Edit distance, Naïve (s) | Edit distance, libFLASM (s) | Hamming distance, Naïve (s) | Hamming distance, libFLASM (s) |
|---|---|---|---|---|
| 5 | 0.01 | 0.00 | 0.01 | 0.00 |
| 6 | 0.08 | 0.02 | 0.08 | 0.01 |
| 7 | 0.67 | 0.9 | 0.55 | 0.05 |
| 8 | 6.20 | 0.50 | 4.81 | 0.25 |
| 9 | 34.00 | 2.74 | 23.99 | 1.43 |
| 10 | 145.56 | 11.76 | 96.71 | 6.24 |
Elapsed-time comparison in seconds for implementing the Chang and Marr index using a pattern of length 64
| | Edit distance, Naïve (s) | Edit distance, libFLASM (s) | Hamming distance, Naïve (s) | Hamming distance, libFLASM (s) |
|---|---|---|---|---|
| 5 | 0.04 | 0.01 | 0.04 | 0.00 |
| 6 | 0.23 | 0.03 | 0.22 | 0.02 |
| 7 | 1.45 | 0.15 | 1.31 | 0.09 |
| 8 | 10.76 | 0.82 | 9.27 | 0.46 |
| 9 | 95.01 | 5.29 | 76.21 | 2.76 |
| 10 | 673.17 | 24.51 | 520.12 | 12.51 |