Literature DB >> 27505054

Two Simple and Efficient Algorithms to Compute the SP-Score Objective Function of a Multiple Sequence Alignment.

Abstract

BACKGROUND: Multiple sequence alignment (MSA) is a crucial step in many molecular analyses and many MSA tools have been developed. Most of them use a greedy approach to construct a first alignment that is then refined by optimizing the sum of pair score (SP-score). The SP-score estimation is thus a bottleneck for most MSA tools since it is repeatedly required and is time consuming.
RESULTS: Given an alignment of n sequences and L sites, I introduce here optimized solutions reaching O(nL) time complexity for affine gap cost, instead of O(n2L), which are easy to implement.

Entities: Chemical Disease

Mesh：

Year: 2016 PMID： 27505054 PMCID： PMC4978502 DOI： 10.1371/journal.pone.0160043

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

Introduction

A wide range of molecular analyses rely on multiple sequence alignments (MSA), e.g., prediction of tridimensional structures [1], phylogenetic inference [2] or detection of positive selection [3]. In all these studies, the initial MSA can strongly impact conclusions and biological interpretations [4,5]. As a consequence, MSA is a richly developed area of bioinformatics and computational biology. Most MSA software use a greedy approach to construct a first alignment that is then refined by optimizing the sum of pair score (SP-score) [6,7] using a 2-cut strategy. This strategy consists in partitioning the current solution into two sub-alignments that are subsequently re-aligned; the resulting MSA replaces the previous one if its SP score is improved [7,8,9,10]. The SP-score estimation is thus a bottleneck for most MSA tools since it is repeatedly required and is time consuming as existing solutions have a time complexity of O(n2L) for an alignment of n sequences and L sites. The practical importance of having an efficient solution to compute SP-score lies within the following ‘Muscle software paper’ [8] quotation: “Notice that computation of the SP score dominates the time complexity of refinement and of MUSCLE overall[…]. It is natural to seek an O(nL) expression for [SP-score estimation…], but to the best of our knowledge no solution is known.” The main result of this paper is an optimized algorithmic solution to estimate SP-score for affine gap costs in O(nL). I also introduce a more versatile solution able to handle more general gap cost penalty functions in O(nL + n2G), with G ≤ L being the maximum number of gap intervals within one aligned sequence.

Materials and Methods

Definitions and notations

A multiple sequence alignment (MSA) for a set of sequences {s…s}, defined with alphabet Σ, is a set of n sequences {S…S} which are defined on an enriched alphabet Σ ∪{‘-’} so that all Si have the same length L and, ∀i, removing ‘-’ from S leads to s. The aim of MSA tools is to position gaps (stretches of ‘-’) so that characters at the same position (constituting a site) are (most likely) homologous. This is usually achieved through a heuristic optimization of the sum of pair score (SP-score). The SP-score of an MSA is obtained by considering all pairwise alignments it induced. Given two sequences S and S of MSA the corresponding pairwise restriction ( |{S,S}) is the alignment made up of two sequences and obtained by removing the ‘-’ of S (resp. S) whenever S (resp. S) also has a gap at this position/site. The score of this pairwise alignment is the sum of the substitution/match scores induced by sites without gaps, plus the sum of scores associated to the gap intervals (maximal sub-sequences of consecutive gap symbols) observed in and . The SP-score of an alignment is classically obtained in O(n2L) by summing up the score of its induced pairwise alignments (Algorithm 1 and 2), and can be decomposed into two terms: SPs, the contribution of substitutions/matches and, SPg the contribution of induced gap intervals (denoted IG’). In molecular biology |Σ| is a constant so typically small (4 for nucleotides and 20 for amino acids) that L|Σ|2 and n|Σ|2, compared to nL, can safely be ignored in asymptotic complexity analysis. Under this assumption, all solutions described here have an O(nL) space complexity. Algorithm. 1. Basic O(n2L) computation of the SP-score of an alignment of n sites and L sequences given subst(.,.) and g_cost(.) functions. The subst(.,.) function provides, in O(1), the elementary score for two non gap characters on the same site, e.g. using BLOSUM matrix [11]. The g_cost(.) function provides, also in O(1), the cost of a gap interval based on its position and size. The compute_gap_intervals(.) subroutine (Algorithm 2) returns in O(|S|) the list of the gap intervals of its input sequence S. Algorithm 1: compute_SP_score Input: -The n aligned sequences S of alignment -a function subst(x,y) returning in O(1) the score for two non gap characters x and y on the same site of -a function g_cost(IG’) returning in O(1) the cost of a gap interval IG’ Output: the SP score of this alignment SPs = 0; SPg = 0 for S in S1 … S for S in S … S for k in 1…L if(not (S[k] == ‘–’ and S[k] == ‘–’)) if(S[k] ≠ ′–′ and S[k] ≠ ′–′) SPs = SPs + subst() for IG’ in compute_gap_intervals() ∪ compute_gap_intervals() SPg = SPg + g_cost (IG’) // e.g., g_cost(IG’) = {return gapO + IG’[length]*gapext} Return SPs + SPg Algorithm. 2. An O(L) algorithm to compute the list of gap intervals, ordered by their gap start position, of a sequence S of length L. Note that, thought IG[length] is not explicitly set, it is assumed it can be access since IG[length] is simply IG[end]-IG[start]+1. Algorithm 2: compute_gap_intervals Input: An aligned sequences S of length L Output: The list of gap intervals of S, ordered by gap start position LG = {}; // the list of gap intervals of S found so far IG = NULL; // the current gap interval for k in 1…L if(S[k] == ′–′ and IG = = NULL) // start a new gap interval IG = new Interval(start = k) if(S[k] ≠ ′–′ and IG ≠ NULL) // the current gap interval finish at previous position IG[end] = k-1; append a copy of IG to the list LG IG = NULL if (IG ≠ NULL) // handle terminal gap if any IG[end] = L; append a copy of IG to the list LG Return LG

Efficient algorithm to compute SP-score using general gap cost penalties

The SPs part of the SP-score can be computed in O(nL) by using a table of size L|Σ| containing for each site the number of each (non gap) symbol (e.g. [8]). This strategy does not work for SPg except for the basic, but unrealistic, constant gap cost where g_cost(IG’) = IG’[length].gapcost. I introduce here a more efficient solution to the SP-score computation problem accounting for most gap function penalties (including affine, log, log-affine penalties). The main idea is to pre-compute the list of gap intervals of each sequence S, ordered by gap start position, this can easily be done in O(L) using Algorithm 2. The compute_gap_intervals() and compute_gap_intervals() of Algorithm 1, observed in |{S,S}, can then be efficiently deduced by processing the gaps of compute_gap_intervals(S) ∪ compute_gap_intervals(S) according to order of opening (as done to merge two sorted lists in linear time, e.g. during a merge step of the ‘merge sort’ algorithm) while maintaining the number of gaps facing gaps encountered so far (i.e. the shift between S and S’ site coordinate systems for current position). The resulting SP-score algorithm (Algorithm 3) has a complexity of O(nL + n2G), with G ≤ ⌈L/2⌉ the maximum number of gap intervals within one aligned sequence, instead of O(n2L). Note that the difference with the naïve algorithm is especially important when sequences contain few long gap stretches but that in the worst case, where most sequences have a number of gap intervals close to L, this algorithm has the same O(n2L) complexity as the naïve solution.

Optimal algorithm to compute SP-score using affine gap cost penalties

Affine gap penalties (where g_cost(IG’) = gapO + IG’[length].gapext) are frequently used. For such gap penalties, the total of gap extension penalties (SPge) can also be efficiently computed in O(nL), by counting the number of gaps per site. However, gap opening cannot be counted exactly based on local site information (e.g [8]) only approximated. Though pessimistic gap count approximation [9] is often used during the dynamic programming steps producing new candidate alignments, the exact SP score is generally preferred to decide which alignments are better than the current one. Algorithm 4 provides a simple and exact solution to compute the SP-score under affine gap penalties in O(nL), which is also the time complexity for just reading an alignment of n sequences of L sites. The key idea is to note that a gap IG will add a number of gap opening penalties equal to n minus the number of interval IG so that IG ⊆ IG. In order to find out how many gaps encompass IG, sites are processed from left to right while maintaining an array indicating, for each left site, the number of gap stretches already opened at this position and not yet closed. For all gaps IG closing at the current position i, the value stored at index IG[deb] of this array provides, in O(1), the number of gap stretches encompassing IG; before considering site i+1, this array is maintained updated by decreasing by 1 all values stored at indices between IG[deb] and i for all IG ending at i—hence updates for all sites overall require O(nL). External and internal gaps are often penalized with different affine functions. The proposed O(nL) solution can handle this refinement by: firstly, using different characters (e.g. ‘-’ and ‘_’) to represent the two different gap types while computing SPge; and secondly, testing each gap interval type in Algorithm 3 (using gap interval start/end positions) to select the adequate gap opening cost. Algorithm. 3. Given the gap interval lists LG, LG of sequences and ; this algorithm returns in O(|LG| + |LG|) the restricted gap interval lists , that would be observed in |{S,S} without actually building this restricted alignment. Algorithm 3: compute_pairwise_restricted_gap_intervals Input:—LG, LG the ordered lists of gap intervals for and Output:—, the lists of gap intervals in |{S,S} IG = first(LG); IG = first(LG) shift = 0; = {} new Interval(start = -1) new Interval(start = -1) // using -1 allows to check if interval start has already been set or not while(IG≠NULL and IG≠NULL) if(IG[start] == IG[start]) if(IG == IG) // both intervals disappear when is restricted to |{S,S} IG = next(LG); new Interval(start = -1) IG = next(LG); new Interval(start = -1) elif(IG ⊂ IG)//IG disappear during restriction IG = next(LG); new Interval(start = -1) if([start] = = -1) //[start] is now known [start] = IG[start]-shift else // (IG⊂ IG) // IG disappear during restriction ……. // similar to previous case swapping i and j shift = shift+|IG ∩ IG| elif(IG[start] if(IG ⊂ IG) // IG disappear during restriction, shift increase if([start] = = -1) // set [start], if not already done, before increasing shift [start] = IG[start]-shift shift = shift+|IG ∩ IG| IG = next(LG); new Interval(start = -1) else // IG start after IG and is not included in IG if([start] = = -1) [start] = IG[start]-shift if(IG ∩ IG≠⊘) //[start] is now known and shift increase [start] = IG[start]-shift shift = shift + |IG ∩ IG| [end] = IG[end]-shift append to IG = next(LG); new Interval(start = -1) else // (IG[start] ……. // similar to previous case swapping i and j if(IG≠NULL) // handle last gaps in LG if([start] = = -1){[start] = IG[start]-shift} [end] = IG[end]-shift append to while((IG = next(LG) ≠ NULL) append new Interval(start = IG[start]-shift; end = IG[end]-shift)to if(IG≠NULL) …… //similar to previous block replacing i with j return ,

Conclusion

This paper introduces an optimized algorithmic solution to estimate SP-score for affine gap costs in O(nL) and a more versatile solution able to handle more gap cost penalty functions in O(nL + n2G), with G ≤ ⌈L/2⌉ being the maximum number of gap intervals per sequence. These optimizations will obviously be part of the next release of MACSE [10], the MSA software we developed to align nucleic sequences with respect to their amino acid translation while allowing them to contain frameshifts and/or stop codons (http://bioweb.supagro.inra.fr/macse/). Moreover, once stated those two algorithms are quite straightforward and can easily be included in the numerous existing MSA software relying on SP-score. Algorithm. 4. Efficient O(nL) computation of the contribution of gap opening cost (SP) for an alignment of n sites and L sequences. Algorithm 4: compute_SPgo_using_gap_intervals Input:—the n aligned sequences S1…S of - the costs of a gap opening within a sequence (gapO) or at its extremities (gapO_ext) Output: SPgo: the part of the SP-score of due to gap opening costs nbOpenGap = new Array of L integers initialized to 0; gapClosing = new Array of L empty lists of Intervals; for S in S1…S //construct LG in O(L) and update nbOpenGap and gapClosing arrays LG = compute_gap_intervals(S) foreach IG of LG for k in IG[start]…IG[end] nbOpenGap[k]++ append IG to gapClosing[IG[end]] SPgo = 0; // part of the SP score related to gap opening costs for i in 1 … L foreach IG in gapClosing[i] if(i == L OR IG[start] = = 1) SPgo = SPgo+(n-nbOpenGap[IG[start]])* gapO_ext else SPgo = SPgo+(n-nbOpenGap[IG[start]])* gapO foreach IG in gapClosing[i] for k in IG[start] … IG[end] nbOpenGap[k] = nbOpenGap[k]-1; return SPgo

8 in total

1. Alignments grow, secondary structure prediction improves.

Authors: Dariusz Przybylski; Burkhard Rost
Journal: Proteins Date: 2002-02-01

2. Amino acid substitution matrices from protein blocks.

Authors: S Henikoff; J G Henikoff
Journal: Proc Natl Acad Sci U S A Date: 1992-11-15 Impact factor: 11.205

3. Alignment uncertainty and genomic analysis.

Authors: Karen M Wong; Marc A Suchard; John P Huelsenbeck
Journal: Science Date: 2008-01-25 Impact factor: 47.728

4. Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis.

Authors: Ari Löytynoja; Nick Goldman
Journal: Science Date: 2008-06-20 Impact factor: 47.728

5. Gap costs for multiple sequence alignment.

Authors: S F Altschul
Journal: J Theor Biol Date: 1989-06-08 Impact factor: 2.691

6. Class of multiple sequence alignment algorithm affects genomic analysis.

Authors: Benjamin P Blackburne; Simon Whelan
Journal: Mol Biol Evol Date: 2012-11-09 Impact factor: 16.240

7. MACSE: Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons.

Authors: Vincent Ranwez; Sébastien Harispe; Frédéric Delsuc; Emmanuel J P Douzery
Journal: PLoS One Date: 2011-09-16 Impact factor: 3.240

8. Molecular decay of the tooth gene Enamelin (ENAM) mirrors the loss of enamel in the fossil record of placental mammals.

Authors: Robert W Meredith; John Gatesy; William J Murphy; Oliver A Ryder; Mark S Springer
Journal: PLoS Genet Date: 2009-09-04 Impact factor: 5.917

8 in total

2 in total

Review 1. Developments in Algorithms for Sequence Alignment: A Review.

Authors: Jiannan Chao; Furong Tang; Lei Xu
Journal: Biomolecules Date: 2022-04-06

2. MACSE v2: Toolkit for the Alignment of Coding Sequences Accounting for Frameshifts and Stop Codons.

Authors: Vincent Ranwez; Emmanuel J P Douzery; Cédric Cambon; Nathalie Chantret; Frédéric Delsuc
Journal: Mol Biol Evol Date: 2018-10-01 Impact factor: 16.240

2 in total