| Literature DB >> 16147981 |
Sergey Sheetlin1, Yonil Park, John L Spouge.
Abstract
The optimal gapped local alignment score of two random sequences follows a Gumbel distribution. The Gumbel distribution has two parameters, the scale parameter lambda and the pre-factor k. Presently, the basic local alignment search tool (BLAST) programs (BLASTP (BLAST for proteins), PSI-BLAST, etc.) use all time-consuming computer simulations to determine the Gumbel parameters. Because the simulations must be done offline, BLAST users are restricted in their choice of alignment scoring schemes. The ultimate aim of this paper is to speed the simulations, to determine the Gumbel parameters online, and to remove the corresponding restrictions on BLAST users. Simulations for the scale parameter lambda can be as much as five times faster, if they use global instead of local alignment [R. Bundschuh (2002) J. Comput. Biol., 9, 243-260]. Unfortunately, the acceleration does not extend in determining the Gumbel pre-factor k, because k has no known mathematical relationship to global alignment. This paper relates k to global alignment and exploits the relationship to show that for the BLASTP defaults, 10 000 realizations with sequences of average length 140 suffice to estimate both Gumbel parameters lambda and k within the errors required (lambda, 0.8%; k, 10%). For the BLASTP defaults, simulations for both Gumbel parameters now take less than 30 s on a 2.8 GHz Pentium 4 processor.Entities:
Mesh:
Year: 2005 PMID: 16147981 PMCID: PMC1199557 DOI: 10.1093/nar/gki800
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1Plot of estimates for against the global score y for the BLOSUM62 scoring matrix with an affine gap cost of 11 + g for a gap of length g, with random sequences whose letters are chosen according to the empirical Robinson and Robinson amino acid frequencies (3). Each point represents 30 000 random sequence-pairs generated by the importance sampling method with base length l = 50 and extended to random length L using Equation 4 with ɛ = 10−2. The error bars indicate the error estimate . The horizontal thick line k = 0.041 represents the previous best estimate of the Gumbel pre-factor k (20). The dotted line shows an example of the biased summary estimate from the robust regression, which we ascribe to the correlation between and .
Figure 3Plot of relative errors of estimate k obtained via robust regression using and against different numbers of simulations. Each bar represents an average over 20 absolute relative errors. The previous best estimate k = 0.041 is used as a basis for the relative error calculation. The relative errors from are shown with white bars; the one from with black bars.
Figure 2Plot of estimates for against the global score y for 106 realizations. The simulation conditions were the same as in Figure 1. The error bars showing for the under-sampled asymptotic regime y ∈ [41 100] are large and are omitted.
Estimates of λ for all online options of the BLASTP parameters
| Scoring matrix | Gap cost Δ( | λ | Average | Standard error | Relative error |
|---|---|---|---|---|---|
| BLOSUM45 | 15 + 2 | 0.203 | 0.2039 | 0.00061 | 0.30 |
| BLOSUM62 | 11 + | 0.267 | 0.2678 | 0.00088 | 0.33 |
| BLOSUM80 | 10 + | 0.299 | 0.3000 | 0.00056 | 0.19 |
| PAM30 | 9 + | 0.294 | 0.2931 | 0.00035 | 0.12 |
| PAM70 | 10 + | 0.291 | 0.2914 | 0.00037 | 0.13 |
All results used 100 simulations of 30 000 realizations each. In Table 1, the first and second column give the BLASTP parameter options. The third column gives λ from the online BLASTP documentation. The fourth column gives the average estimate from 100 simulations. The fifth column gives the corresponding standard error in (so the standard error mean, the actual accuracy of our results, is 0.1 times the standard error). The sixth column gives the percent relative error in , as calculated from the fourth and fifth columns.
Estimates of k for all online options of the BLASTP parameters
| Scoring matrix | Gap cost Δ( | k | Average | Standard error | Relative error |
|---|---|---|---|---|---|
| BLOSUM45 | 15 + 2 | 0.041 | 0.0401 | 0.0024 | 5.99 |
| BLOSUM62 | 11 + | 0.041 | 0.0410 | 0.0027 | 6.59 |
| BLOSUM80 | 10 + | 0.071 | 0.0706 | 0.0044 | 6.23 |
| PAM30 | 9 + | 0.110 | 0.1051 | 0.0108 | 10.27 |
| PAM70 | 10 + | 0.091 | 0.0899 | 0.0079 | 8.79 |
All results used 100 simulations of 30 000 realizations each. Table 2 has the same format as Table 1.