Won-Hyoung Chung, Seong-Bae Park.
Abstract
BACKGROUND: The introduction of spaced seeds opened a way to improve sensitivity in homology search without loss of search speed. Since then, efforts to find the optimal seed, the one that maximizes sensitivity, have continued. The sensitivity of a seed is generally computed by its hit probability. However, hit probability measures sensitivity only at one specific similarity level, while homologous regions are usually distributed over various similarity levels. As a result, the optimal seed found by hit probability is not actually optimal across various similarity levels. Therefore, a new measure of seed sensitivity is required to recommend seeds that are robust to various similarity levels.
Year: 2010 PMID: 20122210 PMCID: PMC3009509 DOI: 10.1186/1471-2105-11-S1-S37
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. An illustration of hit integration. The solid line and the dashed line are the curves of hit probability and hit integration, respectively. The gray area under the hit-probability curve represents the value of hit integration at p, where p is the upper limit of integration.
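The relationship between the two curves can be sketched numerically. The block below is a minimal illustration, not the authors' implementation: it assumes a Bernoulli model (each position of a homologous region matches independently with probability p), a region length of 64 as the default, and composite Simpson's rule for the area. The function names `hit_probability` and `hit_integration` and the `normalize` option are hypothetical choices made for this sketch.

```python
from typing import Dict


def hit_probability(seed: str, p: float, length: int = 64) -> float:
    """Probability that `seed` hits at least once in a random region.

    Assumption: every one of the `length` positions matches independently
    with probability p. '1' is a required match, '*' is a don't-care.
    """
    if not seed or seed[0] != "1" or seed[-1] != "1":
        raise ValueError("seed must start and end with '1'")
    span = len(seed)
    # Required-match positions as a bit mask (leftmost seed char = oldest bit).
    mask = int(seed.replace("*", "0"), 2)
    low = (1 << (span - 1)) - 1  # keeps the most recent span-1 outcomes
    # Map: recent match history -> probability of that history with no hit yet.
    # The initial all-mismatch padding cannot cause a hit because the first
    # seed position requires a match.
    states: Dict[int, float] = {0: 1.0}
    hit = 0.0
    for _ in range(length):
        nxt: Dict[int, float] = {}
        for hist, prob in states.items():
            for bit, w in ((1, p), (0, 1.0 - p)):
                if w == 0.0:
                    continue
                window = (hist << 1) | bit
                if window & mask == mask:
                    hit += prob * w  # first hit ends here; absorb
                else:
                    key = window & low
                    nxt[key] = nxt.get(key, 0.0) + prob * w
        states = nxt
    return hit


def hit_integration(seed: str, lo: float = 0.0, hi: float = 1.0,
                    length: int = 64, steps: int = 100,
                    normalize: bool = False) -> float:
    """Area under the hit-probability curve on [lo, hi] (Simpson's rule).

    Assumption: `normalize=True` divides by (hi - lo) to give the mean hit
    probability over the range; this option is illustrative only.
    """
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals
    h = (hi - lo) / steps
    total = hit_probability(seed, lo, length) + hit_probability(seed, hi, length)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * hit_probability(seed, lo + i * h, length)
    area = total * h / 3.0
    return area / (hi - lo) if normalize else area
```

For a toy case, `hit_probability("11", 0.5, length=3)` gives 0.375, which is 2p² − p³ at p = 0.5. The same functions accept the weight-11 seeds in the tables, although the pure-Python state table grows with the seed span, so a full integration over a span-18 seed may take a while.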
Top 5 optimal spaced seeds.
| seed | HI [0, 1] | HI [0.5, 1] | HI [0.3, 0.7] | hit probability | Markov |
|---|---|---|---|---|---|
| 111**1*11**1*1*111 | 0.300273 (1) | 0.598730 (1) | 0.0875373 (6) | 0.466982 (2) | 0.68499 (2482) |
| 111*1**1*1**11*111 | 0.300265 (2) | 0.598713 (2) | 0.0876001 (3) | 0.467122 (1) | 0.68869 (1425) |
| 11*1*1*11**1**1111 | 0.300064 (3) | 0.598314 (3) | 0.0874004 (12) | 0.466131 (3) | 0.68301 (3359) |
| 111**11*1**1*1*111 | 0.300031 (4) | 0.598247 (4) | 0.0873905 (13) | 0.466015 (4) | 0.68588 (2145) |
| 111*1**1*111*111 | 0.300031 (5) | 0.598204 (6) | 0.0876591 (1) | 0.465521 (13) | 0.70225 (52) |
| 1111*111*1111 | 0.2950 (22472) | 0.5882 (22829) | 0.0832 (19513) | 0.4421 (25033) | 0.7094 (1) |
| 1111*111**1*111 | 0.2988 (470) | 0.5957 (508) | 0.0866 (249) | 0.4596 (747) | 0.7089 (3) |
| 111**1*1**11**1*111 | 0.2999 (17) | 0.5980 (14) | 0.0870 (65) | 0.4656 (12) | 0.6802 (4907) |
| 11111111111 | 0.2590 (46252) | 0.5167 (46252) | 0.0538 (46252) | 0.3002 (46252) | 0.6066 (46133) |
The top 5 optimal spaced seeds of weight 11 identified by hit integration are listed, ordered by HI [0, 1]. Each value is the sensitivity of the seed in that row under the sensitivity measure in that column. The number in parentheses is the rank of the seed under that measure. The labeled seeds are as follows: a: the optimal seed of HI [0, 1] and HI [0.5, 1], b: the default seed of PatternHunter, c: the optimal seed of HI [0.3, 0.7], d: the optimal seed computed by mandala, e: the seed representing the 5th-order non-coding Markov model in [5], f: the seed representing the 0th-order non-coding Markov model in [5], g: the default seed of BLAST.
Figure 2. Distributions of biological data alignments. The similarity distributions of five biological data alignments are plotted in steps of 5%: mmX, mm1, gal, pan, and mixed (see Method).
Experimental sensitivities of the optimal seeds.
| seed | mixed | gal | mm1 | mmX | pan | average |
|---|---|---|---|---|---|---|
| 111**1*11**1*1*111 | 0.73837 (1) | 0.78102 (2) | 0.79375 (2) | 0.71692 (1) | 0.71909 (3) | 0.74983 (1) |
| 111*1**1*1**11*111 | 0.71615 (5) | 0.77729 (3) | 0.76228 (6) | 0.69906 (2) | 0.69820 (6) | 0.73060 (5) |
| 111*1**1*111*111 | 0.73103 (2) | 0.77070 (4) | 0.80484 (1) | 0.69386 (3) | 0.71802 (4) | 0.74369 (2) |
| 1111*111*1111 | 0.72246 (4) | 0.78234 (1) | 0.78821 (3) | 0.67856 (5) | 0.72535 (1) | 0.73939 (3) |
| 1111*111**1*111 | 0.72300 (3) | 0.76938 (6) | 0.78317 (4) | 0.68591 (4) | 0.72156 (2) | 0.73660 (4) |
| 111**1*1**11**1*111 | 0.71110 (6) | 0.76719 (5) | 0.77384 (5) | 0.67652 (6) | 0.70356 (5) | 0.72644 (6) |
| 11111111111 | 0.60068 (7) | 0.70459 (7) | 0.66819 (7) | 0.57455 (7) | 0.57556 (7) | 0.62471 (7) |
The experimental sensitivities of the optimal seeds are calculated from the five biological data sets (mixed, gal, mm1, mmX, and pan). The tested seeds are as follows: a: the optimal seed of HI [0, 1] and HI [0.5, 1], b: the default seed of PatternHunter, c: the optimal seed of HI [0.3, 0.7], d: the optimal seed computed by mandala, e: the seed representing the 5th-order non-coding Markov model in [5], f: the seed representing the 0th-order non-coding Markov model in [5], g: the default seed of BLAST.
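The average column and its ranks can be checked directly from the per-data-set values. The sketch below assumes the average is the unweighted mean of the five data-set sensitivities, which the rounded figures in the table are consistent with; the variable names are illustrative.

```python
from statistics import mean

# Sensitivities copied from the table above, in the order: mixed, gal, mm1, mmX, pan.
sensitivities = {
    "111**1*11**1*1*111": [0.73837, 0.78102, 0.79375, 0.71692, 0.71909],
    "111*1**1*1**11*111": [0.71615, 0.77729, 0.76228, 0.69906, 0.69820],
    "111*1**1*111*111": [0.73103, 0.77070, 0.80484, 0.69386, 0.71802],
    "1111*111*1111": [0.72246, 0.78234, 0.78821, 0.67856, 0.72535],
    "1111*111**1*111": [0.72300, 0.76938, 0.78317, 0.68591, 0.72156],
    "111**1*1**11**1*111": [0.71110, 0.76719, 0.77384, 0.67652, 0.70356],
    "11111111111": [0.60068, 0.70459, 0.66819, 0.57455, 0.57556],
}

# Unweighted mean over the five data sets (an assumption about how the
# "average" column was produced).
averages = {seed: mean(values) for seed, values in sensitivities.items()}

# Rank seeds from most to least sensitive on average.
ranking = sorted(averages, key=averages.get, reverse=True)

for rank, seed in enumerate(ranking, start=1):
    print(f"{rank}  {seed:<20} {averages[seed]:.5f}")
```

Because the inputs are already rounded to five decimals, a recomputed mean can differ from the tabulated one in the last digit.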
Figure 3. The comparison of quantitative differences. The quantitative differences in hit integration among five dominant seeds are compared with each other: A: 111*1**11*1*1**111, B: 111*1*1**11*1**111, C: 111*11**1*1**1*111, D: 11**111*1**1*111*1, and E: 1111*1*11**1***111. The value of seed D is set to 0 because it showed the lowest probability over all ranges. The value of seed D is subtracted from the values of the other seeds.