The Gumbel pre-factor k for gapped local alignment can be estimated from simulations of global alignment.

Sergey Sheetlin, Yonil Park, John L Spouge.

Abstract

The optimal gapped local alignment score of two random sequences follows a Gumbel distribution. The Gumbel distribution has two parameters, the scale parameter lambda and the pre-factor k. Presently, the basic local alignment search tool (BLAST) programs (BLASTP (BLAST for proteins), PSI-BLAST, etc.) all use time-consuming computer simulations to determine the Gumbel parameters. Because the simulations must be done offline, BLAST users are restricted in their choice of alignment scoring schemes. The ultimate aim of this paper is to speed the simulations, to determine the Gumbel parameters online, and to remove the corresponding restrictions on BLAST users. Simulations for the scale parameter lambda can be as much as five times faster if they use global instead of local alignment [R. Bundschuh (2002) J. Comput. Biol., 9, 243-260]. Unfortunately, the acceleration does not extend to determining the Gumbel pre-factor k, because k has no known mathematical relationship to global alignment. This paper relates k to global alignment and exploits the relationship to show that for the BLASTP defaults, 10 000 realizations with sequences of average length 140 suffice to estimate both Gumbel parameters lambda and k within the errors required (lambda, 0.8%; k, 10%). For the BLASTP defaults, simulations for both Gumbel parameters now take less than 30 s on a 2.8 GHz Pentium 4 processor.

Year: 2005 PMID: 16147981 PMCID: PMC1199557 DOI: 10.1093/nar/gki800

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Local sequence alignment is an indispensable computational tool in modern molecular biology. It is frequently used to infer the functional, structural and evolutionary relationships of a novel protein or DNA sequence by finding similar sequences of known function in a database. Arguably, the most important sequence database search program available is BLAST (the Basic Local Alignment Search Tool) (1,2). Using a heuristic algorithm, BLAST implicitly performs a local alignment of a protein or DNA query against sequences in the corresponding database. The BLAST output then ranks each potential database match according to an E-value, which is derived from the corresponding local maximum score, given in bits. For each local maximum score y, the corresponding E-value E gives (under a random model) the expected number of false positives with a lower rank in the output. Thus, a small E-value indicates that the corresponding alignment is unlikely to occur by chance alone, whereas a large E-value indicates an unremarkable alignment. Without doubt, BLAST's E-values contribute substantially to its popularity.

Let us discuss the BLAST E-value E further here. (The Materials and Methods section also continues the discussion.) BLAST assumes a random model in which each unrelated pair of sequences $A[1, m] = A_1 \cdots A_m$ and $B[1, n] = B_1 \cdots B_n$ consists of random letters chosen independently from a background distribution. BLASTP (BLAST for proteins), for example, assumes that random proteins are composed of amino acids chosen independently from the Robinson and Robinson frequency distribution (3). BLAST also requires as input a matrix s(A, B) for scoring matches between the letters A and B. BLASTP, for example, uses the BLOSUM62 scoring matrix (4) as its default, offering as alternatives a few other PAM (5) and BLOSUM matrices. BLAST also enhances its detection of remote sequence similarities by using gapped sequence alignment.
The cost of introducing a gap into an alignment is given by the ‘gap penalty’ Δ(g), where g is the gap length. Practical gap penalties Δ are usually super-additive, i.e. Δ(g) + Δ(h) ≥ Δ(g + h), so the concatenation of optimal subsequence alignments has a score no less than the sum of their scores. (However, our theory is not restricted to super-additive gap penalties.) Affine gap penalties Δ(g) = a + bg are typical in database searches. We refer to the letter distribution, the scoring matrix and the gap penalty collectively as ‘BLAST parameters’. Throughout the paper, we assume a ‘logarithmic regime’ (6) where the alignment scores of long random sequences have a negative expectation. In the logarithmic regime, the BLAST E-value E is approximately

$$E \approx kmn\,e^{-\lambda y} \tag{1}$$

for large y. Under a Poisson approximation (7) for large y, the E-value E yields the P-value P = 1 − exp(−E). Because of Equation 1, the tail probability P corresponds to a Gumbel distribution with ‘scale parameter’ λ and ‘pre-factor’ k. For ungapped local alignment (i.e. the special case Δ(g) = ∞, which disallows gaps in the optimal local alignment), a rigorous theory furnishes analytic formulas for the Gumbel parameters λ and k (7,8). For gapped local alignment, analytic results are scarce and usually come at a price: they depend on approximations whose accuracy in general is unknown (9–12). In the absence of a rigorous theory for gapped local alignment, computer simulations have confirmed the validity of Equation 1 (13–16), and in the absence of formulas, they also have provided estimates of λ and k (16–19). Because of the exponentiation in Equation 1, errors in λ have a greater practical impact than errors in k. Thus, for use in BLAST, λ must be known to within 1–4% relative error; k, to within 10% (20). Therefore, in statements about computational speed, the following implicitly assumes that the estimation of λ and k is carried out to these accuracies, unless stated otherwise.
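As a minimal sketch of Equation 1 and the Poisson approximation, the following Python computes an E-value and a P-value from λ and k. The function names, the sequence lengths and the score are illustrative assumptions; the values λ = 0.267 and k = 0.041 are the BLASTP default values quoted in Tables 1 and 2. The last lines illustrate why an error in λ, which sits in the exponent, can matter more than an error in k.

```python
import math


def e_value(y, m, n, lam, k):
    """E-value of Equation 1: E ~ k * m * n * exp(-lam * y) for large y."""
    return k * m * n * math.exp(-lam * y)


def p_value(e):
    """P-value under the Poisson approximation: P = 1 - exp(-E)."""
    return -math.expm1(-e)  # numerically stable form of 1 - exp(-E)


# BLASTP defaults (BLOSUM62, gap cost 11 + g), as quoted in Tables 1 and 2.
LAM, K = 0.267, 0.041
# Hypothetical sequence lengths and score, chosen only for illustration.
M_LEN, N_LEN = 250, 1_000_000
Y = 100

e = e_value(Y, M_LEN, N_LEN, LAM, K)
print(f"E = {e:.3g}, P = {p_value(e):.3g}")

# A 1% error in lam rescales E by exp(-0.01 * lam * y);
# a 10% error in k rescales E by only 1.1.
lam_ratio = e_value(Y, M_LEN, N_LEN, 1.01 * LAM, K) / e
k_ratio = e_value(Y, M_LEN, N_LEN, LAM, 1.10 * K) / e
print(f"lambda +1%: E x {lam_ratio:.3f}; k +10%: E x {k_ratio:.3f}")
```

For small E the two quantities nearly coincide, since 1 − exp(−E) ≈ E.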
Presently, the BLAST program precomputes λ and k offline, using the so-called ‘island method’ (15,20). Because of the precomputation, users are given only a narrow choice of BLAST parameters. The choice of BLAST parameters would be much less restricted if λ and k could be computed online (in, say, less than 1 s) before searching a database with arbitrary BLAST parameters. Accordingly, much recent research has been directed toward speeding the estimation of λ and k. With the ultimate aim of estimating λ and k online, Bundschuh gave some interesting conjectures about λ (21,22). He then applied them in global alignment simulations that estimated λ as much as five times faster than the island method. Later, we extended his conjectures, reducing the sequence length required to estimate λ by almost a factor of 10 (23). Despite their obvious promise, even with further improvements in speed, global alignment simulations will remain impractical for online estimation in BLAST, unless they can be made to estimate k as well. To remedy the problem, we relate k to global alignment and then exploit the relationship in simulations that estimate both λ and k.

MATERIALS AND METHODS

Notation for global sequence alignment

We denote the non-negative integers by ℤ+ = {0, 1, 2, 3, …}. Throughout the paper, the letters g, h, i, j, m, n and y denote integers. Consider a pair $A = A_1A_2\ldots$ and $B = B_1B_2\ldots$ of infinite sequences. The corresponding global alignment graph Γ is a directed and weighted lattice graph in two dimensions, as follows. The vertices of Γ are $\mathbb{Z}_+^2$, the non-negative two-dimensional integer lattice. Three sets of directed edges e come out of each vertex v = (i, j): northward, northeastward and eastward. One northeastward edge goes into (i + 1, j + 1) with weight $s(A_{i+1}, B_{j+1})$. For each g > 0, one eastward edge goes into (i + g, j) and one northward edge goes into (i, j + g); both are assigned the same weight −Δ(g) < 0. For simplicity, we assume s(A, B) and Δ(g) are always integers, with greatest common divisor 1.

A directed path $\pi = (v_0, e_1, v_1, e_2, \ldots, e_k, v_k)$ in Γ is a finite, alternating sequence of vertices and edges that starts and ends with a vertex. We say that the path π starts at $v_0$ and ends at $v_k$. For instance, each gapped alignment of the subsequences $A[i + 1, m] = A_{i+1} \ldots A_m$ and $B[j + 1, n] = B_{j+1} \ldots B_n$ corresponds to exactly one directed path that starts at $v_0 = (i, j)$ and ends at $v_k = (m, n)$. The alignment's score is the ‘path weight’ $W_\pi = \sum_{l=1}^{k} W(e_l)$, the sum of the weights W(e) of the edges e. By convention, any trivial path $\pi = (v_0)$ consisting of a single vertex has weight $W_\pi = 0$.

Let $\Pi_{ij}$ be the set of all paths π starting at $v_0 = (0, 0)$ and ending at $v_k = (i, j)$. Define the ‘global score’ $S_{ij} = \max\{W_\pi : \pi \in \Pi_{ij}\}$. The paths π starting at $v_0 = (0, 0)$ and ending at $v_k = (i, j)$ with weight $W_\pi = S_{ij}$ are ‘optimal global paths’ and correspond to ‘optimal global alignments’ between A[1, i] and B[1, j]. The Needleman–Wunsch algorithm computes the global scores $S_{ij}$ (24). Let $\Pi = \bigcup_{(i,j) \in \mathbb{Z}_+^2} \Pi_{ij}$ be the set of all paths π starting at $v_0 = (0, 0)$. Define the ‘global maximum’ $M = \max\{W_\pi : \pi \in \Pi\}$, which is also the maximum of all global scores. Let $N(y) = \#\{(i, j) \in \mathbb{Z}_+^2 : S_{ij} = y\}$ denote the number of vertices with global score y. Define the lattice rectangle [0, n] = {0, 1, …, n}.

Our simulations involved a square subset $[0, n]^2$ of $\mathbb{Z}_+^2$. In particular, single subscripts connote quantities for the square: $M_n = \max\{S_{ij} : (i, j) \in [0, n]^2\}$, the square's global maximum; $E_n = \max\{\max_{0 \le i \le n} S_{in}, \max_{0 \le j \le n} S_{nj}\}$, its edge maximum; and $N_n(y) = \#\{(i, j) \in [0, n]^2 : S_{ij} = y\}$, the number of its vertices with global score y.
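The quantities defined above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' C++ code: it uses a three-state dynamic program for an affine gap cost `gap_open + gap_extend * g`, a toy match/mismatch score in place of a real scoring matrix, and hypothetical function names. It computes the global scores for all prefix pairs, then the square's global maximum, its edge maximum, and the counts of vertices at each score.

```python
from collections import Counter

NEG_INF = float("-inf")


def global_scores(a, b, score, gap_open, gap_extend):
    """Global scores S[i][j] of the prefixes a[:i], b[:j], with the
    affine gap cost gap_open + gap_extend * g for a gap of length g."""
    m, n = len(a), len(b)
    S = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    H = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # last edge: gap advancing i
    V = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # last edge: gap advancing j
    S[0][0] = 0  # the trivial path has weight 0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            if i > 0:  # open a new gap or extend the current one
                H[i][j] = max(S[i - 1][j] - gap_open, H[i - 1][j]) - gap_extend
            if j > 0:
                V[i][j] = max(S[i][j - 1] - gap_open, V[i][j - 1]) - gap_extend
            diag = NEG_INF
            if i > 0 and j > 0:
                diag = S[i - 1][j - 1] + score(a[i - 1], b[j - 1])
            S[i][j] = max(diag, H[i][j], V[i][j])
    return S


def square_statistics(S, n):
    """Return (M_n, E_n, N_n) for the square [0, n]^2."""
    square = [S[i][j] for i in range(n + 1) for j in range(n + 1)]
    m_n = max(square)  # global maximum of the square
    e_n = max(max(S[i][n] for i in range(n + 1)),
              max(S[n][j] for j in range(n + 1)))  # edge maximum, 2n + 1 vertices
    return m_n, e_n, Counter(square)  # Counter maps y to N_n(y)


def toy_score(x, y):
    """Hypothetical scoring function: +2 for a match, -1 for a mismatch."""
    return 2 if x == y else -1


S = global_scores("AC", "AC", toy_score, gap_open=3, gap_extend=1)
m_n, e_n, counts = square_statistics(S, 2)
print(m_n, e_n, dict(counts))
```

The edge maximum only scans the two far edges of the square, which is why it involves 2n + 1 vertices rather than (n + 1)² vertices.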

The formula for k from global alignment

We can show heuristically that $k = \lim_{y \to \infty} k_y$, where $k_y$ is given by Equation 2 (see our Appendix, online). Ultimately, the heuristics behind Equation 2 are based on two observations about random sequence matches. First, the two ends of a strong local alignment match are mirror images of each other. Second, the right end of a strong alignment match looks the same for both local and global alignment. Equation 2 computes k from three components: the scale parameter λ, the probability ℙ(M = y) of a global maximum y, and the expected number 𝔼N(y) of vertices with global score $S_{ij} = y$. We now describe how our simulations determined the three components.

Numerical scheme for λ

First, we estimated λ from random global alignments (23). All simulations used affine gap penalties Δ(g) = a + bg and the corresponding global alignment algorithms for computing $S_{ij}$ (25). Recall the edge maximum $E_n$ (defined at the end of the notation for global sequence alignment). As shown elsewhere (23), its cumulant generating function satisfies

$$\ln \mathbb{E}\exp(\lambda E_n) = \beta_0 + \beta_1(\lambda)\,n + O(\delta^n) \tag{3}$$

where 0 ≤ δ < 1. The root of β1(λ) = 0 is our estimate for λ. To estimate $\mathbb{E}\exp(\lambda E_n)$ efficiently, we used Bundschuh's importance sampling methods (21), which apply if the gap penalty is affine. Briefly, importance sampling is a variance-reduction technique for simulating rare events. In global alignment simulations, for example, a large edge maximum is a rare event. By simulating optimal subsequence pairs in ‘hybrid alignment’ (a type of optimized Bayesian local alignment) (26), we ensured that our realizations frequently generated a large edge maximum $E_n$. Accordingly, we simulated a pair of sequences of some ‘base length’ n = l. After correcting for biases induced by the importance sampling distribution, we estimated $\mathbb{E}\exp(\lambda E_n)$.

Equation 3 corresponds to an asymptotic equality with two free parameters, β0 and β1(λ), which we estimated with robust regression. Robust regression was originally developed as an antidote to outliers (27), which badly skew least-squares regression (28–31). As noted elsewhere (23), however, robust regression is also remarkably suited for extracting asymptotic parameters like β0 and β1(λ). Robust regression requires the specification of an influence function, to quantify the influence of potential outliers on the regression result. Many influence functions exist (27), but the Andrews function with a = 1.339 [(27), p. 388; (29)] works well in asymptotic regression, because it ignores points that obviously lie outside the asymptotic regime (23). Accordingly, we applied robust regression to Equation 3.
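The idea of importance sampling for rare events can be illustrated on a much simpler problem than alignment. The sketch below is a generic textbook example, not Bundschuh's alignment-specific method: it estimates the tail probability ℙ(Z > t) of a standard normal variable by sampling from a shifted proposal distribution and reweighting each sample by the likelihood ratio. The function name, the threshold and the sample size are all illustrative assumptions.

```python
import math
import random


def tail_probability_is(threshold, n_samples, rng):
    """Estimate P(Z > threshold) for standard normal Z by sampling
    from the biased proposal N(threshold, 1) and reweighting."""
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(threshold, 1.0)  # the proposal makes the rare event common
        if x > threshold:
            # Likelihood ratio phi(x) / phi(x - threshold) corrects the bias.
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n_samples


threshold = 4.0
exact = 0.5 * math.erfc(threshold / math.sqrt(2.0))  # closed form for comparison
estimate = tail_probability_is(threshold, 20_000, random.Random(0))
print(f"importance sampling: {estimate:.3e}, exact: {exact:.3e}")
```

Naive sampling from the unbiased distribution would see the event only about three times in 100 000 draws, so the reweighted estimator needs far fewer realizations for the same relative accuracy.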
To solve β1(λ) = 0, let $\lambda_u$ be the scale parameter for ungapped local alignment, which can be determined analytically. Because $0 \le \lambda \le \lambda_u$, repeated bisection of the interval $[0, \lambda_u]$ yielded an estimate for the root of the equation β1(λ) = 0. In practice, multiple roots did not occur.
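Repeated bisection itself is simple to sketch. Because β1(λ) comes from simulation and regression, the example below substitutes a function with a known root: the standard equation for the ungapped scale parameter, in which the sum over letter pairs of p(a) p(b) exp(λ s(a, b)) equals 1. The four-letter uniform alphabet, the +1/−1 scores and the function names are assumptions made for illustration; for this toy model the positive root is ln 3.

```python
import math


def bisect_root(f, lo, hi, tol=1e-12):
    """Find a root of f in [lo, hi] by repeated bisection.
    Assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("root is not bracketed")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid  # sign change lies in the lower half
        else:
            lo, f_lo = mid, f_mid  # sign change lies in the upper half
    return 0.5 * (lo + hi)


# Toy model (hypothetical): uniform letters, match +1, mismatch -1.
LETTERS = "ACGT"
FREQ = {c: 0.25 for c in LETTERS}


def ungapped_equation(lam):
    """Sum over letter pairs of p(a) p(b) exp(lam * s(a, b)), minus 1."""
    total = 0.0
    for a in LETTERS:
        for b in LETTERS:
            s = 1 if a == b else -1
            total += FREQ[a] * FREQ[b] * math.exp(lam * s)
    return total - 1.0


# lam = 0 is always a trivial root, so the bracket starts just above it.
lam_u = bisect_root(ungapped_equation, 1e-6, 2.0)
print(lam_u, math.log(3.0))
```

In the setting of the text, the same routine would be applied to an estimate of β1 on the interval from 0 to the ungapped scale parameter.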

Numerical scheme for k

Next, we estimated ℙ(M = y) and 𝔼N(y). Importance sampling had already generated sequence-pairs of base length l for estimating λ. The bias in importance sampling tends to yield large global scores $S_{ij}$, ascending toward the global maximum M. To determine N(y), we needed to simulate and count all vertices with global scores $S_{ij} = y$. Therefore, we extended the sequence pair beyond the base length l using random letters with the unbiased Robinson and Robinson frequencies. The global scores $S_{ij}$ beyond the base length l became progressively smaller, thereby permitting determination of N(y). Given ɛ > 0, we simulated a random number of unbiased letters in each sequence, until we found some total length L such that Equation 4 was satisfied. The edge maximum $E_L$ is a maximum over 2L + 1 vertices. Therefore, for small enough stringencies ɛ > 0, if the edge maximum $E_L$ of the contributing 2L + 1 vertices satisfies Equation 4, it is probable that $M = M_L$, because elongating the sequences is unlikely to increase the estimate of M. Similarly, the elongation does not increase the estimate of 𝔼N(y) much. After appropriate averaging, our simulations therefore yielded estimates $\hat{\mathbb{P}}(M = y)$ and $\hat{\mathbb{E}}N(y)$ for ℙ(M = y) and 𝔼N(y).

With the simulation estimates $\hat{\lambda}$, $\hat{\mathbb{P}}(M = y)$ and $\hat{\mathbb{E}}N(y)$ in hand, we found that errors in $\hat{\lambda}$ were negligible in practice. In contrast, the sample standard deviations (32) of $\hat{\mathbb{P}}(M = y)$ and $\hat{\mathbb{E}}N(y)$ were not. We calculated an estimate $\hat{k}_y$ for k by substituting $\hat{\lambda}$, $\hat{\mathbb{P}}(M = y)$ and $\hat{\mathbb{E}}N(y)$ into Equation 2. We estimated the error in $\hat{k}_y$ from Equation 5. Note that Equation 5 explicitly neglects the error in the estimate $\hat{\lambda}$.

Finally, we used robust regression to extract a summary estimate $\hat{k}$ from the estimates $\hat{k}_y$ for individual y. To begin with, consider a constant regression model η = 1α + e, where η is a column vector consisting of the values $\hat{k}_y$, 1 is a column vector whose elements are all 1, the constant α is the summary estimate $\hat{k}$, and e is the column vector consisting of the corresponding error estimates for $\hat{k}_y$. Our ultimate aim is to compute $\hat{k}$ rapidly, with as few realizations as possible. Unfortunately, for small numbers of realizations, the sample standard deviations are correlated with the corresponding estimates $\hat{\mathbb{P}}(M = y)$ and $\hat{\mathbb{E}}N(y)$. The correlations propagate to $\hat{k}_y$, noticeably biasing the summary estimate $\hat{k}$ (see Figure 1).

Figure 1

Plot of the estimates $\hat{k}_y$ of k against the global score y for the BLOSUM62 scoring matrix with an affine gap cost of 11 + g for a gap of length g, with random sequences whose letters are chosen according to the empirical Robinson and Robinson amino acid frequencies (3). Each point represents 30 000 random sequence-pairs generated by the importance sampling method with base length l = 50 and extended to random length L using Equation 4 with ɛ = 10^−2. The error bars indicate the error estimate for $\hat{k}_y$. The horizontal thick line k = 0.041 represents the previous best estimate of the Gumbel pre-factor k (20). The dotted line shows an example of the biased summary estimate $\hat{k}$ from the robust regression, which we ascribe to the correlation between the estimates $\hat{k}_y$ and their error estimates.

To avoid the bias, we applied the constant regression model η′ = 1α′ + e′ to the errors themselves. The elements of the column vector η′ were the error estimates for $\hat{k}_y$, with the error in each taken to be a constant s derived through a standard formula [(27), p. 387], e′ = 1s. Robust regression thus gave a constant estimate of the errors in $\hat{k}_y$. We substituted the constant error estimate back into the constant regression η = 1α + e of $\hat{k}_y$ to derive a robust regression estimate for k. Although somewhat ad hoc, the constant regression of the errors successfully reduced biases (see Figure 3).
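A constant robust regression of this kind can be sketched as an iteratively reweighted mean. The code below is a hedged illustration and not necessarily the authors' implementation: it assumes the common form of the Andrews weight, sin(u/a)/(u/a) for |u| < aπ and 0 otherwise, with a = 1.339 as quoted in the text, and it assumes that residuals are standardized by the supplied errors. The function names and the data are hypothetical.

```python
import math
import statistics

ANDREWS_A = 1.339  # tuning constant quoted in the text


def andrews_weight(u, a=ANDREWS_A):
    """Andrews weight: sin(u/a) / (u/a) inside |u| < a*pi, zero outside."""
    if u == 0.0:
        return 1.0
    if abs(u) >= a * math.pi:
        return 0.0  # points far outside the fit are ignored entirely
    return math.sin(u / a) / (u / a)


def robust_constant(values, errors, n_iter=50):
    """Fit the constant model eta = alpha + e by iterative reweighting.
    Residuals are standardized by the supplied errors (an assumption)."""
    alpha = statistics.median(values)  # robust starting point
    for _ in range(n_iter):
        weights = [andrews_weight((v - alpha) / e) / e ** 2
                   for v, e in zip(values, errors)]
        total = sum(weights)
        if total == 0.0:
            break  # every point rejected; keep the current estimate
        alpha = sum(w * v for w, v in zip(weights, values)) / total
    return alpha


# Hypothetical data: four consistent values and one gross outlier.
eta = [1.0, 1.1, 0.9, 1.0, 5.0]
constant_error = [0.1] * len(eta)  # the same error for every point
alpha = robust_constant(eta, constant_error)
print(alpha, statistics.mean(eta))
```

With a constant error for every point, the inverse-variance factors cancel and only the Andrews weights decide how much each value contributes, which is the property the constant error estimate exploits.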

Figure 3

Plot of relative errors of the summary estimate of k obtained via robust regression, using either the individual error estimates for $\hat{k}_y$ or the constant error estimate, against different numbers of realizations. Each bar represents an average over 20 absolute relative errors. The previous best estimate k = 0.041 is used as a basis for the relative error calculation. The relative errors obtained with the individual error estimates are shown with white bars; those obtained with the constant error estimate, with black bars.

Even for large simulations (e.g. 10^6 realizations), however, sampling of the event [M = y] was inadequate for many large y, with ℙ(M = y) likely being underestimated. Although the corresponding average $\hat{\mathbb{P}}(M = y)$ was unbiased (in theory, at least), we suspect that it had a distribution whose skewing increased with y. Consequently, for large y, $\hat{k}_y$ often slightly underestimated the true k, with improbable but substantial overestimations maintaining a correct expectation (see Figure 2). The putative skewing also made the anticipated relation ℙ(M = y) ≈ e^λ ℙ(M = y + 1) fail for large y. To avoid skewing, we therefore restricted robust regression of $\hat{k}_y$ to the range [a, b] of y that minimized the function in Equation 6.

Figure 2

Plot of the estimates $\hat{k}_y$ of k against the global score y for 10^6 realizations. The simulation conditions were the same as in Figure 1. The error bars for the under-sampled asymptotic regime y ∈ [41, 100] are large and are omitted.

Software and Hardware

Computer code was written in C++ and compiled with the Microsoft® Visual C++® 6.0 compiler. The computer had a single Intel® Pentium® 4 2.8 GHz processor with 0.5 GB RAM and employed the Microsoft® Windows® 2000 operating system.

RESULTS

Tables 1 and 2 give estimates of the Gumbel parameters λ and k for all online options of the BLASTP parameters. They therefore confirm that our simulations and our formulas for k produced correct results. Other figures show results for the BLASTP default parameters, namely, the Robinson and Robinson amino acid frequencies (3), the BLOSUM62 scoring matrix and the gap cost Δ(g) = 11 + g. Other BLAST parameters tested gave comparable results, unless indicated otherwise (data not shown).

Table 1

Estimates of λ for all online options of the BLASTP parameters

Scoring matrix	Gap cost Δ(g)	λ	Average λ^	Standard error λ^	Relative error λ^ (%)
BLOSUM45	15 + 2g	0.203	0.2039	0.00061	0.30
BLOSUM62	11 + g	0.267	0.2678	0.00088	0.33
BLOSUM80	10 + g	0.299	0.3000	0.00056	0.19
PAM30	9 + g	0.294	0.2931	0.00035	0.12
PAM70	10 + g	0.291	0.2914	0.00037	0.13

All results used 100 simulations of 30 000 realizations each. In Table 1, the first and second columns give the BLASTP parameter options. The third column gives λ from the online BLASTP documentation. The fourth column gives the average estimate $\hat{\lambda}$ from 100 simulations. The fifth column gives the corresponding standard error in $\hat{\lambda}$ (so the standard error of the mean, the actual accuracy of our results, is 0.1 times the standard error). The sixth column gives the percent relative error in $\hat{\lambda}$, as calculated from the fourth and fifth columns.
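The arithmetic behind the last two columns can be checked directly. The short sketch below recomputes the percent relative error and the standard error of the mean from the rounded values printed in Table 1; the function names are illustrative assumptions, and small discrepancies in the last digit can arise because the table entries are themselves rounded.

```python
import math

# (average estimate, standard error) as printed in Table 1.
TABLE_1 = {
    "BLOSUM45": (0.2039, 0.00061),
    "BLOSUM62": (0.2678, 0.00088),
    "BLOSUM80": (0.3000, 0.00056),
    "PAM30": (0.2931, 0.00035),
    "PAM70": (0.2914, 0.00037),
}
N_SIMULATIONS = 100


def relative_error_percent(average, standard_error):
    """Percent relative error: 100 * standard error / average."""
    return 100.0 * standard_error / average


def standard_error_of_mean(standard_error, n):
    """Standard error of the mean of n independent simulations."""
    return standard_error / math.sqrt(n)


for matrix, (average, se) in TABLE_1.items():
    rel = relative_error_percent(average, se)
    sem = standard_error_of_mean(se, N_SIMULATIONS)
    print(f"{matrix}: relative error {rel:.2f}%, standard error of mean {sem:.2g}")
```

With 100 simulations the divisor is √100 = 10, which is the factor of 0.1 mentioned in the table notes.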

Table 2

Estimates of k for all online options of the BLASTP parameters

Scoring matrix	Gap cost Δ(g)	k	Average k^	Standard error k^	Relative error k^ (%)
BLOSUM45	15 + 2g	0.041	0.0401	0.0024	5.99
BLOSUM62	11 + g	0.041	0.0410	0.0027	6.59
BLOSUM80	10 + g	0.071	0.0706	0.0044	6.23
PAM30	9 + g	0.110	0.1051	0.0108	10.27
PAM70	10 + g	0.091	0.0899	0.0079	8.79

All results used 100 simulations of 30 000 realizations each. Table 2 has the same format as Table 1.

Empirically, simulations using BLASTP default parameters needed a base length of l = 50 and a stringency ɛ = 10^−2 for the accuracies required (λ, 1%; k, 10%). For scoring matrices with more dominant diagonals than BLOSUM62, shorter base lengths sufficed (e.g. for PAM30, l = 15 sufficed).

Figure 1 plots the estimates $\hat{k}_y$ with their standard error bars against global score y, up to y = 25. Each point represents 30 000 realizations. The horizontal thick line represents the previous best estimate k ≈ 0.041 and the dotted line, the biased summary estimate $\hat{k}$ due to the positive correlation between the estimates $\hat{k}_y$ and their error estimates. Figure 1 therefore motivated us to regress the errors in $\hat{k}_y$, to produce a constant error estimate, as described in the Materials and Methods.

Figure 2 plots the estimates $\hat{k}_y$ against global score y, up to y = 100. Each point represents 10^6 realizations. We obtained the estimate $\hat{\lambda}$ and used it to compute the estimates $\hat{k}_y$. The range y ∈ [0, 3] is not asymptotic, so the $\hat{k}_y$ do not approximate the true k very well. The range y ∈ [4, 40] is asymptotic, and it is adequately sampled, so the $\hat{k}_y$ fluctuate randomly around the true k. The range y > 40 is also asymptotic, but it is not adequately sampled, so the $\hat{k}_y$ usually underestimate the true k. Figure 2 motivated us to regress only in the range [a, b] minimizing Equation 6, as described in the Materials and Methods.

Figure 3 plots the relative errors of the summary estimate $\hat{k}$ obtained with the individual (skewed) error estimates and those obtained with the constant error estimate, against different numbers of realizations. All errors in $\hat{k}$ were computed relative to the approximation k ≈ 0.041. Each error plotted is the average of the absolute relative error for 20 independent simulations, each using the indicated number of realizations. White bars show the results for the individual error estimates; black bars, for the constant error estimate. For 10 000 realizations, the constant error estimate reduces the relative errors dramatically. As the number of realizations increases, the difference in efficiency of estimation between the two approaches decreases.

Figure 3 shows that 10 000 realizations estimated k with less than 10% relative error. The same 10 000 realizations also estimated λ with less than 0.8% relative error (data not shown). The simulations of Figure 3 estimated λ and k from 10 000 realizations in less than 30 s. For comparison, the same simulations could have estimated λ alone in less than 7 s. For the PAM30 matrix with Δ(g) = 9 + g, they estimated λ and k in less than 4 s.

DISCUSSION

BLAST programs (BLASTP, PSI-BLAST, etc.) are restricted to specific scoring schemes, because time-consuming local alignment simulations for estimating the corresponding Gumbel parameters must be done offline. However, simulations of global alignment can estimate the Gumbel scale parameter λ for local alignment (6). Some global alignment methods are as much as five times faster than the best local alignment methods (21,23), so global alignment has considerable potential for online estimation of the Gumbel parameter λ. This paper surmounts an obstacle to online estimation by demonstrating that simulations of global alignment can determine the Gumbel pre-factor k. Table 2 displays the results of global alignment simulations over a wide range of BLAST parameters, all of which gave correct estimates of the corresponding k and supported the validity of our methods for computing k. Global alignment simulation therefore appears a feasible method for estimating both Gumbel parameters, λ and k. (The BLASTP default parameters provide a standard for quantifying speed, so the following results apply to the BLASTP defaults, unless stated otherwise.) With local alignment, estimates of λ required 40 000 sequence-pairs of minimum length 600 (21); with our methods, 5000 sequence-pairs of maximum length 50 (23). In fact, our methods attained 1.3% accuracies in λ with only 1000 sequence-pairs of maximum length 50. In our hands, k was more difficult to estimate than λ, with 10% relative errors requiring 10 000 sequence-pairs of average length 140. In summary, the methods presented here for estimating the Gumbel parameters λ and k represent at least a 3-fold improvement in speed over local alignments. Online computation of the BLAST P-value requires more than the Gumbel parameters. It also requires an estimate of the ‘finite-size effect’ (10,13,33,34). Global alignment (or some variant of it) can indeed produce the required estimate (manuscript in preparation). 
Without the finite-size estimate in hand, however, we were not strongly motivated to incorporate technical improvements or heuristics into our methods. Bundschuh, e.g. implemented a diagonal-cutting heuristic to remove irrelevant off-diagonal elements in the global alignment matrix (21); we did not. The heuristic could probably speed our computation by a further factor of at least three. Online BLAST estimation of the Gumbel parameters is likely just a few years away.

21 in total

1. Local sequence alignments with monotonic gap penalties.

Authors: R Mott
Journal: Bioinformatics Date: 1999-06 Impact factor: 6.937

2. The estimation of statistical parameters for local alignment score distributions.

Authors: S F Altschul; R Bundschuh; R Olsen; T Hwa
Journal: Nucleic Acids Res Date: 2001-01-15 Impact factor: 16.971

3. Rapid assessment of extremal statistics for gapped local alignment.

Authors: R Olsen; R Bundschuh; T Hwa
Journal: Proc Int Conf Intell Syst Mol Biol Date: 1999

4. Statistical significance of probabilistic sequence alignment and related local hidden Markov models.

Authors: Y K Yu; T Hwa
Journal: J Comput Biol Date: 2001 Impact factor: 1.479

5. Rapid significance estimation in local sequence alignment with gaps.

Authors: Ralf Bundschuh
Journal: J Comput Biol Date: 2002 Impact factor: 1.479

6. Distribution of glutamine and asparagine residues and their near neighbors in peptides and proteins.

Authors: A B Robinson; L R Robinson
Journal: Proc Natl Acad Sci U S A Date: 1991-10-15 Impact factor: 11.205

7. Local alignment statistics.

Authors: S F Altschul; W Gish
Journal: Methods Enzymol Date: 1996 Impact factor: 1.600

8. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes.

Authors: S Karlin; S F Altschul
Journal: Proc Natl Acad Sci U S A Date: 1990-03 Impact factor: 11.205

9. The significance of protein sequence similarities.

Authors: J F Collins; A F Coulson; A Lyall
Journal: Comput Appl Biosci Date: 1988-03

10. An improved algorithm for matching biological sequences.

Authors: O Gotoh
Journal: J Mol Biol Date: 1982-12-15 Impact factor: 5.469
