Abstract
BACKGROUND: The most frequently used tools in bioinformatics are those searching for similarities, or local alignments, between biological sequences. Since the exact dynamic programming algorithm is quadratic, linear-time heuristics such as BLAST are used. Spaced seeds are much more sensitive than the consecutive seed of BLAST and using several seeds represents the current state of the art in approximate search for biological sequences. The most important aspect is computing highly sensitive seeds. Since the problem seems hard, heuristic algorithms are used. The leading software in the common Bernoulli model is the SpEED program.
Year: 2012 PMID: 22373455 PMCID: PMC3392737 DOI: 10.1186/1756-0500-5-123
Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500
Figure 1. Overlap complexity example. An example of the overlap complexity between two spaced seeds. Letters not taking part in the overlaps are grey and the overlapping pairs of 1's are underlined; the values of σ for each overlap are given in the last column.
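The quantity illustrated in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration, not the OC, FastOC, or VFastOC implementations compared in this paper. It assumes the usual definition: for every relative shift of the two seeds, σ counts the aligned pairs of 1's, and the overlap complexity is the sum of 2^σ over all shifts. The function names `sigma`, `overlap_complexity`, and `multi_oc` are hypothetical, and summing over all pairs i ≤ j for a multiple seed is likewise an assumption.

```python
def sigma(s1, s2, shift):
    """Count aligned pairs of 1's when s2 starts `shift` positions after s1."""
    return sum(
        1
        for j, c in enumerate(s2)
        if c == "1" and 0 <= j + shift < len(s1) and s1[j + shift] == "1"
    )


def overlap_complexity(s1, s2):
    """Sum 2**sigma over every shift at which the two seeds overlap."""
    shifts = range(-(len(s2) - 1), len(s1))
    return sum(2 ** sigma(s1, s2, shift) for shift in shifts)


def multi_oc(seeds):
    """Overlap complexity of a multiple seed (assumed: all pairs i <= j)."""
    return sum(
        overlap_complexity(seeds[i], seeds[j])
        for i in range(len(seeds))
        for j in range(i, len(seeds))
    )


if __name__ == "__main__":
    # Seeds are written with 1 for a match position and * for a don't-care.
    print(overlap_complexity("1*11", "1**1*1"))
```

This direct version costs time proportional to the product of the seed lengths for each pair, which is why faster formulations matter when the function is called many times inside a search.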
Figure 2. The pseudocode of FastOC. The pseudocode of the new faster implementation for overlap complexity computation, FastOC.
Comparison of overlap complexity computation algorithms
| w | ℓ | OC | FastOC | VFastOC |
|---|---|---|---|---|
| 9 | 15 | 0.012 | 0.004 | |
| 10 | 17 | 0.064 | 0.028 | |
| 11 | 18 | 0.116 | 0.048 | |
| 12 | 19 | 0.204 | 0.084 | |
| 13 | 20 | 0.340 | 0.136 | |
| 14 | 21 | 0.564 | 0.208 | |
| 15 | 23 | 2.792 | 0.948 | |
| 16 | 24 | 4.564 | 1.484 | |
| 17 | 25 | 7.276 | 2.648 | |
| 18 | 26 | 11.368 | 3.968 | |
Speed comparison between the existing and the new implementations of the overlap complexity function. Seeds of optimal length for weights between 9 and 18 are considered. In each case, the time (in seconds) is given for the computation of overlap complexity for all seeds with the given parameters. The VFastOC algorithm is the fastest (times in bold), up to 3.8 times faster than the original OC algorithm.
Figure 3. An example of the data structures used in the new algorithm for hill climbing. The matrices OM and OCM and the σ arrays are given for the seeds s1 = 1*11 and s2 = 1**1*1.
Figure 4. The pseudocode of FastHC. The pseudocode of the new faster algorithm for the hill climbing heuristic, FastHC.
Figure 5. The pseudocode of OCSigma. The pseudocode of the additional function OCSigma, used by the main function FastHC.
Figure 6. The pseudocode of UpdateSigma. The pseudocode of the additional function UpdateSigma, used by the main function FastHC.
Figure 7. The pseudocode of UpdateOM. The pseudocode of the additional function UpdateOM, used by the main function FastHC.
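The general idea of hill climbing on overlap complexity can be sketched as follows. This is a generic first-improvement search written for illustration, not the FastHC algorithm of Figures 4 to 7: it recomputes the full overlap complexity for every candidate instead of maintaining incremental structures such as OM, OCM, and the σ arrays. The move set (swapping one interior 1 with one interior * inside a single seed) and all function names are assumptions.

```python
from collections import Counter
from itertools import combinations_with_replacement


def pair_oc(s1, s2):
    """Overlap complexity of two seeds: sum of 2**sigma over all shifts."""
    ones1 = [i for i, c in enumerate(s1) if c == "1"]
    ones2 = [j for j, c in enumerate(s2) if c == "1"]
    # counts[d] = number of pairs of 1's aligned when s2 is shifted by d.
    counts = Counter(a - b for a in ones1 for b in ones2)
    return sum(2 ** counts[d] for d in range(-(len(s2) - 1), len(s1)))


def total_oc(seeds):
    """Overlap complexity of a multiple seed (assumed: all pairs i <= j)."""
    return sum(pair_oc(a, b) for a, b in combinations_with_replacement(seeds, 2))


def neighbours(seeds):
    """Yield every seed set obtained by swapping an interior 1 with an interior *."""
    for k, s in enumerate(seeds):
        interior = range(1, len(s) - 1)  # keep the first and last 1 fixed
        for i in interior:
            for j in interior:
                if s[i] == "1" and s[j] == "*":
                    t = list(s)
                    t[i], t[j] = "*", "1"
                    yield seeds[:k] + ["".join(t)] + seeds[k + 1:]


def hill_climb(seeds):
    """Accept swaps that lower the overlap complexity until none is left."""
    current = list(seeds)
    best = total_oc(current)
    while True:
        for cand in neighbours(current):
            score = total_oc(cand)
            if score < best:
                current, best = cand, score
                break  # restart the scan from the improved seed set
        else:
            return current, best  # local optimum: no swap improves


if __name__ == "__main__":
    print(hill_climb(["1111**1", "11*1*11"]))
```

Each swap preserves the weight and length of every seed, so the search stays within the same parameter set. The cost of re-evaluating `total_oc` for every candidate is what an incremental update scheme avoids.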
Comparison of hill climbing algorithms
| w | N | p | k | [ℓ1..ℓk] | HC | FastHC |
|---|---|---|---|---|---|---|
| 11 | 64 | .70 | 16 | [14..27] | 7.79 | |
| 22 | 50 | .85 | 10 | [25..37] | 10.79 | |
| 28 | 100 | .90 | 8 | [36..56] | 39.83 | |
| 28 | 150 | .90 | 8 | [39..63] | 69.14 | |
| 28 | 200 | .90 | 8 | [41..70] | 108.74 | |
| 28 | 100 | .90 | 16 | [33..59] | 471.51 | |
| 28 | 150 | .90 | 16 | [36..66] | 788.79 | |
| 28 | 200 | .90 | 16 | [39..72] | 1075.10 | |
Speed comparison between the existing (HC) and the new implementation (FastHC) of the hill climbing heuristic. Several sets of parameters are used. The times (in seconds) are given for a single multiple spaced seed with the given parameters. FastHC is up to 13.5 times faster than HC. Also, the improvement increases with the size of the input.
Sensitivity comparison of computed spaced seeds for PatternHunter
| w | N | p | BLAST (contig.) | PH (spaced) | PHII (16 seeds) | Mandala | Iedera | SpEED | FastHC |
|---|---|---|---|---|---|---|---|---|---|
| 11 | 64 | 0.70 | 30.0196 | 46.7122 | 92.4114 | 92.3811 | 92.0708 | 93.2526 | |
| 11 | 64 | 0.75 | 49.4494 | 69.5844 | 98.4289 | 98.4320 | 98.3391 | 98.6882 | |
| 11 | 64 | 0.80 | 71.3993 | 88.2070 | 99.8449 | 99.8448 | 99.8366 | 99.8820 | |
Sensitivity comparison with Mandala, Iedera, and SpEED (results from [18]). Seeds were computed with the same parameters as those of PatternHunter II. FastHC (sensitivity values in bold) is the best in all cases. The sensitivity of the original seeds is significantly improved.
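The sensitivities reported in these tables are hit probabilities in the Bernoulli model: each of the N positions of a region is a match independently with probability p, and a seed hits when all of its 1 positions fall on matches at some offset. The sketch below shows one way to compute this exactly for short seeds, assuming that definition. The function name `sensitivity` and the dynamic program over the recent match history are illustrative choices, not the method used by any of the compared programs.

```python
def sensitivity(seeds, N, p):
    """Probability that at least one seed hits a Bernoulli(p) region of length N."""
    span = max(len(s) for s in seeds)
    # Bit j of a mask is set when the seed needs a match j positions
    # before the most recent position of the region.
    masks = []
    for s in seeds:
        m = 0
        for j, c in enumerate(reversed(s)):
            if c == "1":
                m |= 1 << j
        masks.append((m, len(s)))
    keep = (1 << (span - 1)) - 1  # history of the last span-1 positions

    # dist maps a match history to the probability of reaching it with no hit.
    dist = {0: 1.0}
    for pos in range(1, N + 1):
        nxt = {}
        for hist, prob in dist.items():
            for bit, q in ((1, p), (0, 1.0 - p)):
                h = (hist << 1) | bit
                if any(pos >= length and (h & m) == m for m, length in masks):
                    continue  # a seed hits here: this mass leaves the no-hit set
                h &= keep
                nxt[h] = nxt.get(h, 0.0) + prob * q
        dist = nxt
    return 1.0 - sum(dist.values())


if __name__ == "__main__":
    # Contiguous seed of weight 11 with the PatternHunter parameters.
    print(100 * sensitivity(["1" * 11], 64, 0.70))
```

The number of states grows as 2^(span-1), so this brute-force approach is only practical for short seeds; the contiguous weight-11 case can be compared against the BLAST entry of the table for p = 0.70.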
Sensitivity comparison of computed spaced seeds for BFAST
| w | N | p | 1 seed (contig.) | 1 seed (spaced) | BFAST (16 seeds) | Mandala | Iedera | SpEED | FastHC |
|---|---|---|---|---|---|---|---|---|---|
| 22 | 50 | 0.85 | 14.4649 | 26.8064 | 58.6907 | -- | 60.1535 | 60.8127 | |
| 22 | 50 | 0.90 | 36.6940 | 57.9846 | 87.3359 | -- | 87.9894 | 88.5969 | |
| 22 | 50 | 0.95 | 74.1153 | 90.8265 | 99.2249 | -- | 99.2196 | 99.3659 | |
Sensitivity comparison with Mandala, Iedera, and SpEED (results from [18]). Seeds were computed with the same parameters as those of BFAST. FastHC (sensitivity values in bold) is the best in all cases. The sensitivity of the original seeds is significantly improved.
Sensitivity comparison of computed spaced seeds of MegaBLAST weight
| w | N | p | MegaBLAST (contig.) | 1 seed (spaced) | FastHC, 2 seeds | FastHC, 4 seeds | FastHC, 8 seeds | FastHC, 16 seeds |
|---|---|---|---|---|---|---|---|---|
| 28 | 100 | 0.90 | 39.1436 | 69.3241 | 79.6629 | 87.5674 | 92.7762 | |
| 28 | 150 | 0.90 | 55.4870 | 87.6426 | 93.4308 | 98.7430 | 99.5137 | |
| 28 | 200 | 0.90 | 67.4412 | 94.9876 | 99.2937 | 99.7877 | 99.9409 | |
Using FastHC, we computed multiple seeds with the same weight as the default seed of MegaBLAST for similarity 90% and N ∈ {100, 150, 200}. The sensitivities of the new seeds are much higher than those of MegaBLAST. The values in bold show that 16, 2, and 2 new seeds reach sensitivities over 95% for N equal to 100, 150, and 200, respectively.