Stefan Wolfsheimer, Bernd Burghardt, Alexander K Hartmann.
Abstract
BACKGROUND: The optimal score for ungapped local alignments of infinitely long random sequences is known to follow a Gumbel extreme value distribution. Less is known about the important case where gaps are allowed. For this case, the distribution is known only empirically in the high-probability region, which is biologically less relevant.
Year: 2007 PMID: 17625018 PMCID: PMC1945026 DOI: 10.1186/1748-7188-2-9
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Figure 1. Sketch of the graph of overlapping distributions q1, ..., q4. Distant distributions have weak overlaps.
Figure 2. Equilibration of the 4-letter system (L = M = 20) at temperatures T = 0.5, 0.6, 0.7, 1.0, ∞. Equilibrium is reached after 20000, 15000, 10000, 1000, and 100 steps, respectively (indicated by arrows). S(t) is averaged over 250 independent runs.
Figure 3. Score autocorrelation function for different temperatures (4 letters, L = M = 20). Circles indicate the corresponding nthin from Raftery and Lewis [48,49].
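A score autocorrelation function such as the one plotted in Figure 3 can be estimated from a single Markov chain run. The following is a minimal Python sketch, not the authors' code: the function name `autocorrelation` and the choice of estimator (normalised by the overall sample mean and variance) are assumptions made for illustration.

```python
def autocorrelation(scores, max_lag):
    """Return [C(0), ..., C(max_lag)] for a time series of scores S(t).

    C(tau) = <(S(t) - mean) * (S(t + tau) - mean)> / variance,
    where mean and variance are taken over the whole series.
    """
    n = len(scores)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the series length")
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    if var == 0:
        raise ValueError("a constant series has no defined autocorrelation")
    result = []
    for tau in range(max_lag + 1):
        pairs = n - tau  # number of (t, t + tau) pairs available
        cov = sum(
            (scores[t] - mean) * (scores[t + tau] - mean) for t in range(pairs)
        ) / pairs
        result.append(cov / var)
    return result
```

The lag at which C(τ) has decayed close to zero gives a rough idea of how far apart samples must be taken to be nearly independent, which is the role the caption assigns to nthin.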
Figure 4. Empirical probabilities for the toy model (4 letters, L = M = 20) held at finite temperature. The dotted line shows the normalized mixture weight function.
Figure 5. Score probabilities obtained through the reweighting mixture technique for a 4-letter system with sequence lengths L = 10, 20 and scoring parameters Eq. (15) using affine gap costs (α = 4, β = 2). For L = 10, P(s) was also obtained by exact enumeration of all 4^(2 × 10) configurations. No difference from the empirical curve is visible in the plot.
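To make the idea of exact enumeration concrete, the sketch below computes the exact score distribution for a very small 4-letter system by scoring every pair of sequences. It is an illustration under stated assumptions, not the authors' implementation: the match and mismatch scores are hypothetical (Eq. (15) is not reproduced here), letters are taken as equally likely, and a gap of length l is assumed to cost α + β(l − 1). The names `local_score` and `exact_distribution` are made up for this example.

```python
from collections import Counter
from itertools import product


def local_score(a, b, match=1, mismatch=-3, alpha=4, beta=2):
    """Optimal local alignment score (Smith-Waterman with affine gaps).

    Assumed gap cost for a gap of length l: alpha + beta * (l - 1).
    The match/mismatch values are hypothetical placeholders.
    """
    neg = float("-inf")
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]    # best score ending at (i, j)
    E = [[neg] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap in a
    F = [[neg] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap in b
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] - alpha, E[i][j - 1] - beta)
            F[i][j] = max(H[i - 1][j] - alpha, F[i - 1][j] - beta)
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


def exact_distribution(L, letters="ACGT"):
    """Exact P(s) over all pairs of length-L sequences with uniform letters."""
    seqs = list(product(letters, repeat=L))
    counts = Counter(local_score(a, b) for a in seqs for b in seqs)
    total = len(seqs) ** 2  # equals len(letters) ** (2 * L)
    return {s: c / total for s, c in sorted(counts.items())}


if __name__ == "__main__":
    # L = 3 gives 4**6 = 4096 pairs; L = 10 would need 4**20 pairs.
    for s, p in exact_distribution(3).items():
        print(s, p)
    print(4 ** (2 * 10))  # 1099511627776
```

The number of pairs grows as 4^(2L), which is why this brute-force approach is only practical for very short sequences and why sampling methods are needed beyond that.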
Figure 6. Rate of convergence of the MCMCMC data. The relative error ε(Smax) of the ground state for L = 10 and L = 20 is shown as a function of the number of samples Nsamples. Inset: relative error of the final P(s) in comparison to the exact enumeration of all states for the smallest system, L = 10.
Figure 7. Probability distribution P(s) for gapped sequence alignment using BLOSUM62 matrices and affine gap costs with α = 12, β = 1 for two sequence lengths L = M = 40. The results for other lengths are summarized in additional file 1. Strong deviations from the Gumbel distribution become visible in the tail. The dotted lines show the original Gumbel distribution fitted to the region of high probability. The inset shows the same data with a linear ordinate.
Figure 8. Probability distribution P(s) for ungapped sequence alignment using BLOSUM62 matrices. Deviations from the Gumbel distribution can be observed only for short sequences (L < 250). The inset shows the same data with a linear ordinate.
Figure 9. Relative error of the probability estimation using gapped sequence alignment and BLOSUM62 matrices.
Fit parameters of the modified Gumbel distribution Eq. (18) using the BLOSUM62 scoring matrix and affine gap costs with α = 10, β = 1. The last column gives the value of 10^4 λ2 estimated from the scaling relation Eq. (19). Fit parameters for other scoring systems are provided as supplementary material to this article [see additional file 1].
| L | λ | 10^4 λ2 | K | S0 | | 10^4 λ2 (Eq. 19) |
| 40 | 0.3272 ± 0.108% | 8.6347 ± 0.412% | 0.1028 ± 0.65% | 15.597 ± 0.0676% | 79.05 | 8.1560 ± 12.485% |
| 60 | 0.3034 ± 0.086% | 6.2007 ± 0.285% | 0.0751 ± 0.60% | 18.455 ± 0.0645% | 49.40 | 6.1711 ± 12.907% |
| 80 | 0.2892 ± 0.070% | 4.8781 ± 0.222% | 0.0612 ± 0.53% | 20.644 ± 0.0540% | 21.67 | 5.0458 ± 13.280% |
| 100 | 0.2747 ± 0.072% | 4.3187 ± 0.330% | 0.0472 ± 0.58% | 22.413 ± 0.0611% | 39.42 | 4.3056 ± 13.627% |
| 150 | 0.2541 ± 0.083% | 3.2974 ± 0.529% | 0.0303 ± 0.61% | 25.682 ± 0.0422% | 39.46 | 3.2047 ± 14.437% |
| 200 | 0.2432 ± 0.063% | 2.6343 ± 0.344% | 0.0241 ± 0.52% | 28.257 ± 0.0412% | 10.47 | 2.5806 ± 15.214% |
| 250 | 0.2359 ± 0.071% | 2.1999 ± 0.454% | 0.0198 ± 0.60% | 30.196 ± 0.0459% | 9.40 | 2.1701 ± 15.984% |
| 300 | 0.2303 ± 0.061% | 1.9101 ± 0.348% | 0.0174 ± 0.54% | 31.934 ± 0.0408% | 2.00 | 1.8758 ± 16.758% |
| 350 | 0.2261 ± 0.046% | 1.6404 ± 0.239% | 0.0153 ± 0.41% | 33.334 ± 0.0300% | 1.27 | 1.6525 ± 17.544% |
| 400 | 0.2224 ± 0.052% | 1.4806 ± 0.266% | 0.0136 ± 0.49% | 34.556 ± 0.0369% | 1.36 | 1.4762 ± 18.347% |
| 600 | 0.2140 ± 0.062% | 1.0206 ± 0.384% | 0.0106 ± 0.64% | 38.561 ± 0.0472% | 2.15 | 1.0250 ± 21.787% |
| 800 | 0.2090 ± 0.063% | 0.7660 ± 0.419% | 0.0088 ± 0.67% | 41.320 ± 0.0457% | 1.82 | 0.7691 ± 25.697% |
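The table's column labels were lost, but its fourth and fifth numeric columns appear to be tied together by the standard Karlin-Altschul relation S0 = ln(K L M) / λ. The sketch below checks this numerically. Reading those columns as K and S0 is an interpretation made here, not something the table states, and the function name `gumbel_location` is hypothetical.

```python
import math

# (L, lambda, assumed K, assumed S0) copied from the table above, with M = L.
ROWS = [
    (40, 0.3272, 0.1028, 15.597),
    (60, 0.3034, 0.0751, 18.455),
    (80, 0.2892, 0.0612, 20.644),
    (100, 0.2747, 0.0472, 22.413),
    (150, 0.2541, 0.0303, 25.682),
    (200, 0.2432, 0.0241, 28.257),
    (250, 0.2359, 0.0198, 30.196),
    (300, 0.2303, 0.0174, 31.934),
    (350, 0.2261, 0.0153, 33.334),
    (400, 0.2224, 0.0136, 34.556),
    (600, 0.2140, 0.0106, 38.561),
    (800, 0.2090, 0.0088, 41.320),
]


def gumbel_location(lam, K, L, M):
    """Location parameter S0 implied by S0 = ln(K * L * M) / lambda."""
    return math.log(K * L * M) / lam


def max_deviation(rows=ROWS):
    """Largest absolute difference between implied and tabulated S0."""
    return max(abs(gumbel_location(lam, K, L, L) - s0) for L, lam, K, s0 in rows)


if __name__ == "__main__":
    for L, lam, K, s0 in ROWS:
        print(L, round(gumbel_location(lam, K, L, L), 3), s0)
```

Under this reading the implied and tabulated values agree to within the rounding of the K column, which supports the interpretation without proving it.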
Figure 10. Probability distributions P(s) comparing different gap costs. The dotted lines denote the distribution without Gaussian correction (λ2 = 0). Deviations from the Gumbel distribution become stronger for small gap costs. The inset shows the same data with a linear ordinate.
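The effect of a Gaussian correction on the tail can be sketched numerically. The code below compares the log of a plain Gumbel density with an unnormalised variant carrying an extra factor exp(−λ2 (s − S0)^2). This functional form is an assumption made for illustration, because Eq. (18) is not reproduced in this section. The numerical values are taken from the L = 40 row of the fit-parameter table (α = 10, β = 1), and the function names are hypothetical.

```python
import math


def log_gumbel(s, lam, s0):
    """Log of the Gumbel density lam * exp(-x - exp(-x)), with x = lam * (s - s0)."""
    x = lam * (s - s0)
    return math.log(lam) - x - math.exp(-x)


def log_corrected(s, lam, lam2, s0):
    """Unnormalised log-density with an ASSUMED Gaussian factor exp(-lam2 * (s - s0)**2)."""
    return log_gumbel(s, lam, s0) - lam2 * (s - s0) ** 2


# Values from the L = 40 row: lambda, lambda2 (table lists 1e4 * lambda2), S0.
LAM, LAM2, S0 = 0.3272, 8.6347e-4, 15.597

if __name__ == "__main__":
    for s in (20, 40, 60, 80):
        plain = log_gumbel(s, LAM, S0)
        corrected = log_corrected(s, LAM, LAM2, S0)
        # The gap between the two curves grows quadratically with s - S0.
        print(s, plain, corrected, corrected - plain)
```

Near S0 the two curves nearly coincide, while far in the tail the quadratic term dominates the difference. This matches the qualitative picture in the captions, where deviations show up only in the rare-event region.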
Figure 11. Scaling of the correction parameter λ2 (BLOSUM62). The decay of λ2 with system size approximately follows a power law near the logarithmic-linear transition (two smallest gap costs). For these cases the fit to Eq. (19) is shown by a line (α = 10) and dots (α = 12). The lines of the remaining cases are guides to the eye connecting the data points.
Figure 12. Scaling of the correction parameter λ2 (PAM250). The decay of λ2 with system size approximately follows a power law near the logarithmic-linear transition (two smallest gap costs). For these cases the fit to Eq. (19) is shown by a line (α = 11) and dots (α = 13). The lines of the remaining cases are guides to the eye connecting the data points.
Fitting parameters of the scaling relation Eq. (19).
| Parameter | BLOSUM62 | BLOSUM62 |
| | 0.00928 ± 0.0001 | 0.0309 ± 0.01 |
| | 0.643 ± 0.027 | 0.971 ± 0.08 |
| × 10^-5 | 4.9 ± 1.2 | 3.2 ± 2.0 |
| Parameter | PAM250 | PAM250 |
| | 0.0049 ± 0.0008 | 0.0053 ± 0.0005 |
| | 0.575 ± 0.046 | 0.591 ± 0.023 |
| × 10^-5 | 3.015 ± 2.0 | 6.1 ± 1.1 |
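The parameter labels of this table did not survive, so the exact form of Eq. (19) cannot be read off here. As a hedged consistency check, the sketch below assumes a shifted power law λ2(L) = a L^(−b) − c and plugs in the three values of the first BLOSUM62 column (0.00928, 0.643, 4.9 × 10^-5). Both the functional form and the assignment of the values to a, b, c are assumptions inferred for illustration. The result is compared with the estimated 10^4 λ2 values in the last column of the fit-parameter table for α = 10, β = 1.

```python
# (L, 1e4 * lambda2 estimated via Eq. (19)) from the fit-parameter table.
TABLE = [
    (40, 8.1560), (60, 6.1711), (80, 5.0458), (100, 4.3056),
    (150, 3.2047), (200, 2.5806), (250, 2.1701), (300, 1.8758),
    (350, 1.6525), (400, 1.4762), (600, 1.0250), (800, 0.7691),
]

# First BLOSUM62 column of the scaling table; the roles a, b, c are ASSUMED.
A, B, C = 0.00928, 0.643, 4.9e-5


def lambda2_scaling(L, a=A, b=B, c=C):
    """Assumed scaling form: lambda2(L) = a * L**(-b) - c."""
    return a * L ** (-b) - c


def max_relative_error(table=TABLE):
    """Largest relative mismatch between the assumed form and the table."""
    return max(
        abs(1e4 * lambda2_scaling(L) - value) / value for L, value in table
    )


if __name__ == "__main__":
    for L, value in TABLE:
        print(L, round(1e4 * lambda2_scaling(L), 4), value)
```

With these rounded parameters the assumed form reproduces the tabulated estimates to well within one percent. That is consistent with the assumption, though it does not confirm that Eq. (19) is written exactly this way.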
Temperature parameters for sum-statistics.
| 40 | 2.75, 3, 3.5, 4, 7, ∞ | |||
| 60 | 2.75, 3, 3.5, 4, 7, ∞ | |||
| 80 | 2.75, 3, 3.5, 4, 7, ∞ | 3.75, 4, 4.5, 5, 8, ∞ | 5.25, 5.5, 6, 8, ∞ | 6, 6.25, 6.5, 7, 8, 12, ∞ |
| 100 | 2.75, 3, 3.5, 4, 7, ∞ | 3.75, 4, 4.5, 5, 8, ∞ | 5.25, 5.5, 6, 8, ∞ | 6, 6.25, 6.5, 7, 8, 12, ∞ |
| 150 | 2.75, 3, 3.5, 4, 7, ∞ | 3.75, 4, 4.5, 5, 8, ∞ | 5.25, 5.5, 6, 8, ∞ | 6, 6.25, 6.5, 7, 8, 12, ∞ |
| 200 | 3.25.3.5, 4, 7, ∞ | 3.75, 4, 4.25, 4.5, 5, 8, ∞ | 4.75, 5, 5.25, 5.5, 6, 8, ∞ | 5.75, 6, 6.25, 6.5, 7, 8, 12,∞ |
| 300 | 3.25.3.5, 4, 7, ∞ | 3.75, 4, 4.25, 4.5, 5, 8, ∞ | 4.75, 5, 5.25, 5.5, 6, 8, ∞ | 5.75, 6, 6.25, 6.5, 7, 8, 12,∞ |
| 400 | 3.25.3.5, 3.75, 4, 4.25, 5, 8,∞ | 3.75, 4, 4.25, 4.5, 5, 8, ∞ | 5.25, 5, 5.75, 6, 8, 10, ∞ | 6, 6.25, 6.5, 7, 9, 11,∞ |
Figure 13. Score probability distributions for sum-statistics of the k-best scores (solid lines) for L = M = 200. The dotted lines denote the distribution without Gaussian correction (λ2 = 0). Deviations from Eq. (3) or Eq. (6) become visible only in the rare-event tail.
Correction parameter λ2 for the sum statistics k = 2 and k = 3. λ2 is estimated by a fit to Eq. (21) using the optimal Gumbel parameters λ and S0 from the optimal score statistics (k = 1). BLOSUM62 with affine gap costs (α = 12, β = 1) was used as the scoring system.
| L | 10^4 λ2 (k = 2) | 10^4 λ2 (k = 3) |
| 60 | 2.692 ± 0.30% | |
| 80 | 1.631 ± 0.63% | 1.074 ± 2.59% |
| 100 | 1.488 ± 0.23% | 0.649 ± 2.06% |
| 150 | 1.056 ± 0.06% | 0.344 ± 1.90% |
| 200 | 0.749 ± 0.13% | 0.280 ± 1.14% |
| 300 | 0.463 ± 0.15% | 0.189 ± 0.70% |
| 400 | 0.338 ± 0.29% | 0.139 ± 0.92% |
Figure 14. Scaling of the correction parameter for BLOSUM62 sum-statistics (k = 1, 2, 3). λ2 is estimated by a fit to Eq. (21) using the optimal Gumbel parameters λ and S0 from the optimal score statistics (k = 1).