Literature DB >> 27454115

Heuristics for multiobjective multiple sequence alignment.

Maryam Abbasi1, Luís Paquete2, Francisco B Pereira1,3.   

Abstract

BACKGROUND: Aligning multiple sequences arises in many tasks in Bioinformatics. However, the alignments produced by current software packages depend strongly on the parameter settings, such as the relative importance of opening gaps with respect to the increase in similarity. Choosing only one parameter setting may introduce an undesirable bias in further steps of the analysis and lead to overly simplistic interpretations. In this work, we reformulate multiple sequence alignment from a multiobjective point of view. The goal is to generate several sequence alignments that represent a trade-off between maximizing the substitution score and minimizing the number of indels/gaps in the sum-of-pairs score function. This trade-off gives the practitioner further information about the similarity of the sequences, from which she can analyse and choose the most plausible alignment.
METHODS: We introduce several heuristic approaches, based on local search procedures, that compute a set of sequence alignments representative of the trade-off between the two objectives (substitution score and indels). Several algorithm design options are discussed and analysed, with particular emphasis on the influence of the starting alignment and of the neighborhood definition on overall performance. A perturbation technique is proposed to improve the local search, providing a wide range of high-quality alignments.
RESULTS AND CONCLUSIONS: The proposed approach is tested experimentally on a wide range of instances. We performed several experiments with sequences obtained from the benchmark database BAliBASE 3.0. To evaluate the quality of the results, we calculate the hypervolume indicator of the set of score vectors returned by the algorithms. The results obtained allow us to identify reasonably good parameter choices for our approach. Further, we compared our method with reference alignments in terms of the ratio of correctly aligned pairs and the ratio of correctly aligned columns. Experimental results show that our approaches can obtain better results than TCoffee and Clustal Omega in terms of the first ratio.

Entities:  

Keywords:  Iterated local search; Multiobjective optimization; Multiple sequence alignment

Mesh:

Year:  2016        PMID: 27454115      PMCID: PMC4959375          DOI: 10.1186/s12938-016-0184-z

Source DB:  PubMed          Journal:  Biomed Eng Online        ISSN: 1475-925X            Impact factor:   2.819


Background

Multiple sequence alignment (MSA) is of central importance to bioinformatics. This technique is useful to compare new sequences with other genomic sequences, unveiling their shared information and their significant differences. MSA methods are essential for analysing biological sequences and for applications in structure modeling, functional prediction, phylogenetic analysis and sequence database searching [1]. Currently, these approaches are also used, for example, to compare protein structures, to predict protein mutations and interactions, or to reconstruct phylogenetic trees [2, 3]. Moreover, these tools have found their place in medicine as well, mainly in the context of genetic screening and genetic engineering [4]. Most of the known alignment approaches have diverse optimization functions, along with assorted heuristics to search for the optimal alignment. These techniques consider a weighted sum formulation that maximizes the substitution score and penalizes indels/gaps. This is the usual procedure even in the case of pairwise sequence alignment [5]. However, setting up the weights is very often not trivial. Moreover, analyzing only one alignment may lead to overly simplistic interpretations of the data [6]. A multiobjective formulation of sequence alignment provides the practitioner with a set of alignments that represents the trade-off between decreasing the number of gaps and increasing similarity. In bioinformatics, such formulations and algorithms already exist for pairwise sequence (DNA/protein) alignment [7-10]. Abbasi et al. [7] present dynamic programming algorithms to compute the optimal set of alignments by treating the number of indels/gaps and the scores for (mis)matches/substitutions as separate objectives. They also apply this method to analyze the construction of phylogenetic trees.
Taneda [11] describes a heuristic approach for pairwise RNA sequence alignment that incorporates RNA structure information to approximate a set of optimal alignments. Schnattinger et al. [12] extend the work of Taneda by computing the optimal set. They treat the sequence alignment and the consensus structure calculation as separate objectives and solve both problems simultaneously with a dynamic programming approach. An extensive review of other problems in bioinformatics that are formulated as multiobjective optimization problems is given in Handl et al. [13]. Little work has been done on multiobjective MSA (MMSA), which is much harder to solve from a computational point of view. Only very recently has MSA been treated as a multiobjective optimization problem. Ortuño et al. [14] used a multiobjective evolutionary algorithm based on NSGA-II to optimize the sum-of-pairs score, total columns and number of gaps. In another article [15], the authors extended the previous work by applying further biological features and considering different objectives such as strike score, non-gaps percentage and totally conserved columns. Likewise, Soto et al. [16] applied a multiobjective evolutionary algorithm to optimize pre-aligned sequences by considering entropy and the metric Metal. In this article, we propose to approach MMSA with heuristic methods, extending the formulation given in Abbasi et al. [7] to an arbitrary number of sequences. The approach considers the sum-of-pairs score vector obtained by maximizing the substitution score, based on a substitution matrix, and minimizing the number of indels/gaps. This article is an extended version of a conference article [17], providing a more thorough experimental analysis of the proposed algorithms.

Methods

We first give a definition of multiple sequence alignment, followed by its multiobjective counterpart. Then, a local search strategy is proposed.

Multiple sequence alignment

In MSA, homologous characters of a group of sequences are aligned together in columns. The following definition introduces the problem of MSA in a more mathematical form [18].

Definition 1

Multiple sequence alignment. Let $s_1, \ldots, s_m$ be $m$ strings over an alphabet $\Sigma$. Let $- \notin \Sigma$ be an indel symbol, let $\Sigma' = \Sigma \cup \{-\}$ and let $\lambda$ denote the empty string. Let $h \colon (\Sigma')^* \to \Sigma^*$ be a homomorphism defined by $h(a) = a$ for all $a \in \Sigma$, and $h(-) = \lambda$. A multiple sequence alignment of $s_1, \ldots, s_m$ is an $m$-tuple $(\bar{s}_1, \ldots, \bar{s}_m)$ of strings of length $\ell$ over the alphabet $\Sigma'$, such that the following conditions are satisfied: (i) $h(\bar{s}_i) = s_i$ for all $i = 1, \ldots, m$; (ii) for all $j = 1, \ldots, \ell$ there exists an $i$ such that $\bar{s}_{ij} \neq -$.
A way of evaluating the quality of an alignment is to score its columns by the sum-of-pairs score function. The score of a column $c_j$ is defined as
$$S(c_j) = \sum_{1 \le p < q \le m} s(c_{pj}, c_{qj}),$$
where the score $s(a, b)$, for $a, b \in \Sigma$, comes from a substitution matrix used for scoring pairwise sequence alignments (such as PAM and BLOSUM). Indels are scored by defining $s(a, -) = s(-, a) = -w$, where $w$ is the weight of an indel, and $s(-, -) = 0$. The score for an alignment $A$ is computed as
$$S(A) = \sum_{j=1}^{\ell} S(c_j).$$
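As a concrete illustration, the sum-of-pairs score can be computed directly from an alignment. The sketch below assumes alignments are given as equal-length strings with `-` as the indel symbol; the function names are illustrative, not taken from the paper's implementation.

```python
from itertools import combinations

GAP = "-"

def column_score(column, sub, gap_weight=1.0):
    """Sum-of-pairs score of one alignment column."""
    total = 0.0
    for a, b in combinations(column, 2):
        if a == GAP and b == GAP:
            total += 0.0            # s(-,-) = 0
        elif a == GAP or b == GAP:
            total -= gap_weight     # s(a,-) = s(-,a) = -w
        else:
            total += sub(a, b)      # substitution matrix entry s(a,b)
    return total

def sum_of_pairs(alignment, sub, gap_weight=1.0):
    """Total score S(A): sum of column scores over the alignment length."""
    length = len(alignment[0])
    assert all(len(s) == length for s in alignment)
    return sum(column_score([s[j] for s in alignment], sub, gap_weight)
               for j in range(length))

# The identity matrix used later in the experiments: +1 for a match, -1 otherwise.
def identity(a, b):
    return 1.0 if a == b else -1.0
```

For instance, the two-sequence alignment ("AC-G", "A-CG") scores +1 for each of the two matched columns and -1 for each of the two indel columns, giving a total of 0.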

Multiobjective multiple sequence alignment

Many problems of practical relevance are characterized by several objectives that have to be taken into account simultaneously. In most cases, these objectives conflict with each other, and optimizing for one objective might compromise the others. An appropriate approach to a multiobjective problem is to obtain a set of solutions such that no solution in the set can be improved in one objective without deteriorating at least one of the others [19]. Traditionally, sequence alignment is solved with a single objective function based on a weighted sum of matches, mismatches, insertions and deletions. However, there is no agreement on how to specify the weights for these parameters. In a multiobjective formulation, the practitioner does not have to define weights and gets access to much more information [10].
Let $A = (\bar{s}_1, \ldots, \bar{s}_m)$ be an alignment of $m$ sequences $s_1, \ldots, s_m$, as defined in the previous section. We define the following two score functions
$$f_1(A) = \sum_{j=1}^{\ell} \sum_{1 \le p < q \le m} s(c_{pj}, c_{qj}) \qquad \text{and} \qquad f_2(A) = \sum_{j=1}^{\ell} \sum_{1 \le p < q \le m} g(c_{pj}, c_{qj}),$$
where the score $s(a, b)$, for $a, b \in \Sigma$, is obtained from a substitution matrix and $g(a, b)$, for $a, b \in \Sigma'$, is 1 if either $a = -$ or $b = -$, and 0 otherwise. The multiobjective sum-of-pairs score for alignment $A$ is $S(A) = (f_1(A), f_2(A))$. Given two alignments $A$ and $A'$, $A \prec A'$ ($A$ dominates $A'$) if and only if $f_1(A) \ge f_1(A')$ and $f_2(A) \le f_2(A')$, with at least one strict inequality. An alignment $A$ is Pareto optimal if there exists no other alignment $A'$ such that $A' \prec A$. The set of all Pareto optimal alignments is called the Pareto optimal alignment set. The image of a Pareto optimal alignment in the score space is a non-dominated score, and the set of all non-dominated scores is called the non-dominated score set. Although pairwise sequence alignment can be solved efficiently, that is, the running time to find the non-dominated score set is a polynomial function of the size of the sequences, this is no longer the case for the multiple counterpart with an arbitrary number of sequences.
Thus, the goal, in practice, is to find an approximation to the non-dominated score set in a reasonable amount of time. In this article we explore the application of heuristic methods, in particular, local search algorithms.
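The dominance relation above translates directly into code. A minimal sketch, assuming score vectors are pairs (substitution score, number of indels), where the first component is maximized and the second minimized:

```python
def dominates(u, v):
    """u and v are score vectors (substitution score, #indels); the first
    objective is maximized and the second minimized."""
    return (u[0] >= v[0] and u[1] <= v[1]) and (u[0] > v[0] or u[1] < v[1])

def nondominated(scores):
    """Keep only the score vectors not dominated by any other vector."""
    return [u for u in scores
            if not any(dominates(v, u) for v in scores if v != u)]
```

For example, (10, 5) dominates (7, 6) because it has a higher score and fewer indels, while (10, 5) and (8, 2) are mutually non-dominated.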

Pareto local search

A local search algorithm starts from a feasible solution and searches locally for better neighbors to replace the current one. This neighborhood search is repeated until no improvement is found anymore and the algorithm stops in a local optimum. In the context of MMSA, the neighborhood function $N$ associates a set of feasible alignments $N(A)$ with every feasible alignment $A$. An alignment $A$ is a Pareto local optimum if there exists no alignment $A'$ in $N(A)$ such that $A' \prec A$. Pareto local search (PLS) [20] is a generic local search framework for multiobjective optimization problems that follows the principle above by keeping the best alignments in a special data structure, known as the archive. Each neighboring alignment that is non-dominated with respect to the alignments in the archive is added to it. The algorithm "naturally" terminates once no neighbor of any alignment in the archive is non-dominated (a local optimal set). See [20, 21] for more details about this approach for multiobjective combinatorial optimization problems and its successful application to scheduling and graph problems. In the following, we describe two options that were considered when applying PLS to MMSA: the neighborhood function and the starting alignment.
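The archive-based search loop can be sketched as follows. This is a minimal generic skeleton, not the paper's implementation: `neighbors`, `score` and `dominates` are supplied by the caller, and solutions must be hashable.

```python
def pareto_local_search(start, neighbors, score, dominates):
    """Generic PLS sketch: explore neighbors of archived solutions until
    the archive is a Pareto local optimal set."""
    archive = {start: False}              # solution -> already explored?
    while True:
        pending = [a for a, done in archive.items() if not done]
        if not pending:
            return list(archive)          # Pareto local optimal set
        current = pending[0]
        archive[current] = True
        for n in neighbors(current):
            # discard n if some archived solution dominates it
            if any(dominates(score(a), score(n)) for a in archive):
                continue
            # remove archive members that n dominates, then add n unexplored
            for a in [a for a in archive if dominates(score(n), score(a))]:
                del archive[a]
            archive.setdefault(n, False)
```

A toy run on three abstract solutions with fixed score vectors shows the archive converging to the mutually non-dominated pair.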

Neighborhood

We consider a k-block neighborhood for this problem, which consists of exchanging a substring of at most $k$ characters with indels in a gap, that is, a contiguous sequence of indels. In the following, we establish the conditions for two alignments to be k-block neighbors. Let $s_1, \ldots, s_m$ denote $m$ strings and let $A = (\bar{s}_1, \ldots, \bar{s}_m)$ and $A' = (\bar{s}'_1, \ldots, \bar{s}'_m)$ be two alignments of those strings. The alignments $A$ and $A'$ are k-block neighbors if and only if the following conditions hold: (i) both alignments have the same length; (ii) they differ only in the $j$th string; (iii) at most $k$ characters from string $s_j$ do not occupy the same position in both alignments; and (iv) those characters are contiguous. For illustration, consider the aligned string A-CG: exchanging the character C with the adjacent indel yields AC-G, a 1-block (and hence also 2-block) neighbor, while exchanging the two-character block CG with the indel yields ACG-, a 2-block neighbor. It is possible to visit all k-block neighbors of an alignment in a straightforward way. Given the first string of an alignment, consider the leftmost substring of size one that contains only a character and has an indel to its right. Then, exchange it with every indel to its right and stop when no indel is found. In case the substring also has indels to its left, repeat the same exchange procedure on that side. If it is still possible, increase the size of the substring by one and repeat the same moves to its right and to its left; repeat the overall procedure until reaching a substring of size $k$. Then, consider the next substring of size one to the right and repeat the same procedure as above. Note that each move generates a new k-block neighbor of the alignment. In order to maintain feasibility, columns that contain only indels are deleted from the alignment.
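The 1-block moves (the base case of the k-block neighborhood) can be enumerated as sketched below, assuming alignments are tuples of equal-length strings; extending the sliding window to substrings of up to k characters follows the same pattern.

```python
GAP = "-"

def drop_gap_columns(alignment):
    """Delete columns containing only indels, restoring feasibility."""
    cols = [c for c in zip(*alignment) if any(x != GAP for x in c)]
    return tuple("".join(col[i] for col in cols) for i in range(len(alignment)))

def one_block_neighbors(alignment):
    """All alignments reachable by exchanging one character with an indel
    inside an adjacent gap (contiguous run of indels)."""
    result = set()
    for j, s in enumerate(alignment):
        for i, c in enumerate(s):
            if c == GAP:
                continue
            # slide character c rightwards through a contiguous run of indels
            r = i + 1
            while r < len(s) and s[r] == GAP:
                t = s[:i] + GAP + s[i+1:r] + c + s[r+1:]
                result.add(drop_gap_columns(alignment[:j] + (t,) + alignment[j+1:]))
                r += 1
            # and leftwards
            l = i - 1
            while l >= 0 and s[l] == GAP:
                t = s[:l] + c + s[l+1:i] + GAP + s[i+1:]
                result.add(drop_gap_columns(alignment[:j] + (t,) + alignment[j+1:]))
                l -= 1
    return result
```

For example, the alignment ("AB--", "CDEF") has exactly two 1-block neighbors, obtained by sliding B into each indel of the trailing gap.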

Starting alignment

The starting solution may have a strong impact on the overall performance of the local search. We considered the three following possibilities for PLS:
Rand: A random feasible alignment, obtained by inserting indels randomly into the strings, except into the largest one. This initialization option tends to generate a feasible alignment with a small number of indels.
Clust: A feasible alignment obtained from the program Clustal Omega [22] (available at http://www.clustal.org/omega).
Tcoffee: A feasible alignment obtained from the program Tcoffee [23] (available at http://www.tcoffee.org/Projects/tcoffee/).
We chose Tcoffee since it is one of the best consistency-based methods, outperforming the other existing programs in terms of accuracy, and Clustal Omega since it outperforms Tcoffee whenever sequences with large N/C-terminal extensions exist [24]. We used the default parameters suggested for these two programs.
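A random feasible alignment of the Rand kind can be produced as below; this is a sketch under the assumption that padding every string with random indels up to the length of the longest string (which itself receives no indels) yields a feasible alignment with few indels.

```python
import random

GAP = "-"

def random_start(strings, rng=random):
    """Rand initialization: insert indels at random positions of every string
    except the longest one, until all strings have equal length."""
    target = max(len(s) for s in strings)
    aligned = []
    for s in strings:
        t = list(s)
        while len(t) < target:
            t.insert(rng.randrange(len(t) + 1), GAP)
        aligned.append("".join(t))
    return tuple(aligned)
```

Since the longest string contains no indels, no column can consist of indels only, so the result is always a feasible alignment.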

Pareto iterated local search

The local search framework described in the previous section may get trapped in a local optimal set of poor quality. One possibility for escaping from local optima is to perturb one or more alignments in the archive and restart the local search from those alignments. Pareto iterated local search (PILS) implements this search behavior and its main steps are presented in Algorithm 1. Similar to the single-objective counterpart (see [25]), four components have to be specified in PILS:
GenerateInitialSolution, which generates an initial set of alignments; based on the experimental results, we consider the two starting solutions previously described (Tcoffee and Clust).
Perturbation, which modifies some of the current alignments in the set, leading to an intermediate set of alignments. Based on preliminary experiments, we decided to randomly choose 5 alignments from the set in such a way that the corresponding non-dominated scores are sufficiently spread in the objective space. Then, a gap of a given size is inserted into each sequence at a random position.
PLS, which is applied to the intermediate set and returns an improved set of alignments.
AcceptanceCriterion, which consists of merging the two sets and filtering out the dominated scores.
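The four components combine into the following skeleton. This is an illustrative sketch, not Algorithm 1 verbatim: `pls`, `perturb` and `merge` are supplied by the caller, and the gap-insertion perturbation is shown in a simplified form that perturbs a single alignment.

```python
import random

GAP = "-"

def insert_random_gap(alignment, gap_size=2, rng=random):
    """Perturbation sketch: insert a gap of `gap_size` indels at a random
    position of every string (positions drawn independently per string)."""
    out = []
    for s in alignment:
        p = rng.randrange(len(s) + 1)
        out.append(s[:p] + GAP * gap_size + s[p:])
    return tuple(out)

def pils(initial, pls, perturb, merge, iterations=10):
    """PILS skeleton: alternate perturbation and Pareto local search,
    accepting the non-dominated union of old and new archives."""
    archive = pls(initial)                      # GenerateInitialSolution + PLS
    for _ in range(iterations):
        seeds = perturb(archive)                # Perturbation
        improved = pls(seeds)                   # PLS on perturbed alignments
        archive = merge(archive, improved)      # AcceptanceCriterion
    return archive
```

Because the same number of indels is inserted into every string, the perturbed strings keep equal lengths, so the result remains an alignment.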

Discussion and results

We performed several experiments in order to understand the effect of the different parameters of PLS and PILS and to establish functional relationships between algorithm performance and instance features, such as the number of sequences and their sizes. Furthermore, we compare the quality of the alignments produced by our approaches with those produced by TCoffee and Clustal Omega. The implementations were coded in C and compiled with GCC version 4.6.1 with the -O3 compiler option. Experiments were performed on a cluster with 16 nodes, each comprising a 4-core Intel Core i7 CPU and 2 GB of RAM, running Ubuntu 11.10. Except for the compiler option, no other code optimization technique was used in the experiments. For this experimental analysis, we chose sequences obtained from the benchmark database BAliBASE 3.0 [26]. In our substitution matrix, each cell value is 1 when both residues are equal, and -1 otherwise. BAliBASE 3.0 is a well-known benchmark for evaluating multiple sequence alignment algorithms. The data set Reference 1 consists of equidistant family sequences with two subgroups: RV11 and RV12. RV11 contains 38 data sets with less than 20 % residue identity between groups and RV12 contains 44 data sets with residue identity between 20 and 40 %. Reference 2 (RV20) contains 41 alignments comprising family sequences with more than 40 % similarity and a highly divergent orphan sequence. Reference 3, with 30 data sets, contains subfamilies such that the sequences within a given subfamily share more than 40 % identity, but any two sequences from different subfamilies share less than 20 % identity. The reference set RV40, with 49 data sets, contains sequences that are composed of groups with N/C-terminal extensions. The reference set RV50 contains 16 data sets with large internal insertions.
We tested our approaches on all the instances from the sets RV11 and RV20, composed of sequences with different sizes and varying percentage identities. Since the heuristics proposed in this paper are stochastic, we ran each variant 10 times for each instance and recorded the contents of the archive for each run, namely the approximate set, once the termination criterion was met. To evaluate the quality of each approximate set, we computed its hypervolume indicator value. This indicator measures the area that is dominated by the approximate set, bounded by a reference point [27]; see an in-depth discussion of the hypervolume indicator in relation to other performance assessment methods in Zitzler et al. [28]. Moreover, it is known that the hypervolume is maximized when the optimal set is found [28]. Figure 1 illustrates the hypervolume indicator (shaded area) for a given approximate set (black points) and a given reference point (white point). We chose, as the reference point for each instance, the minimum substitution score minus one and the maximum number of indels plus one obtained over the runs of all variants. We merged all approximate sets produced by all runs for a given instance, extracted the non-dominated scores and computed the hypervolume indicator value of the resulting set, namely, the reference hypervolume indicator value, which is then used as a reference value to evaluate the relative performance of each approach. Then, for each approximate set, we computed its hypervolume indicator value and the percentage of this value with respect to the reference hypervolume indicator value; the larger the percentage, the better the approximate set is in terms of this indicator.
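For two objectives the hypervolume reduces to a sum of rectangle areas, which makes it cheap to compute. A sketch, assuming maximization of the substitution score and minimization of the indel count, with a reference point that is worse than every score vector in both objectives:

```python
def hypervolume(points, ref):
    """2-D hypervolume for (maximize score, minimize indels), bounded by a
    reference point ref = (r_score, r_indels) worse in both objectives."""
    r1, r2 = ref  # r1 below every score, r2 above every indel count
    # keep only the non-dominated score vectors
    front = [p for p in points
             if not any((q[0] >= p[0] and q[1] <= p[1]) and q != p
                        for q in points)]
    front.sort(key=lambda p: p[1])       # sort by indel count, ascending
    hv = 0.0
    for i, (s, g) in enumerate(front):
        nxt = front[i + 1][1] if i + 1 < len(front) else r2
        hv += (s - r1) * (nxt - g)       # rectangle dominated only by point i
    return hv
```

With reference point (0, 10), the single point (10, 5) yields an area of 10 x 5 = 50, and adding (6, 2) extends the dominated region by the strip between indel counts 2 and 5.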
Fig. 1

Illustration of the hypervolume indicator [17]

We analysed the effect of different k values in the k-block neighborhood, as well as the effect of the three starting alignments, on the performance of PLS. For a given instance, let Min denote the length of the smallest string; we consider k in {Min, Min/2, Min/4, Min/8, Min/16}. PLS terminates once it is not possible to find non-dominated neighboring alignments or the time limit of 5 minutes of CPU time is reached. Tables 1 and 2 report the results obtained by PLS for the benchmark sets RV11 and RV20. The results were averaged over 10 runs, for each of the five k values and the three starting alignments. Column id corresponds to the instance id from the two benchmark sets (BB1100*.tfa and BB200*.tfa, where * denotes the id); column m gives the number of sequences; columns Min and Max correspond to the length of the smallest and the largest sequence, respectively. The instances are ordered lexicographically by the number of sequences and the length of the smallest sequence. The best result obtained for each instance is highlighted. The last column, Init, shows the hypervolume indicator obtained for the non-dominated scores of the two starting alignments produced by Clustal Omega and Tcoffee.
Table 1

Percentage with respect to the reference hypervolume indicator value for local search with different k-block neighborhood sizes and three starting alignments (Clustal Omega, Tcoffee and Rand) for each benchmark instance in dataset RV11. The results are averaged over 10 runs. See text for more details

id m Min Max | k=Min: Rand Clust Tcoff | k=Min/2: Rand Clust Tcoff | k=Min/4: Rand Clust Tcoff | k=Min/8: Rand Clust Tcoff | k=Min/16: Rand Clust Tcoff | Init
(best value for each instance marked with asterisks)
22 4 63 205 | 0.86 0.77 0.90 | 0.81 0.67 *0.90* | 0.86 0.64 0.89 | 0.83 0.52 0.71 | 0.72 0.45 0.55 | 0.14
25 4 64 103 | 0.53 0.84 0.84 | 0.54 0.76 0.86 | 0.51 0.72 *0.86* | 0.53 0.64 0.74 | 0.48 0.55 0.49 | 0.28
29 4 81 138 | 0.69 0.81 0.85 | 0.67 0.85 *0.88* | 0.69 0.85 0.87 | 0.65 0.82 0.85 | 0.65 0.72 0.81 | 0.35
1 4 83 91 | 0.17 0.27 0.35 | 0.17 0.25 *0.35* | 0.16 0.24 0.34 | 0.13 0.26 0.34 | 0.10 0.28 0.33 | 0.20
9 4 97 337 | 0.61 *0.87* 0.78 | 0.61 0.71 0.67 | 0.60 0.51 0.44 | 0.59 0.40 0.22 | 0.53 0.30 0.10 | 0.19
21 4 102 139 | 0.51 *0.91* 0.86 | 0.54 0.86 0.90 | 0.53 0.78 0.81 | 0.54 0.72 0.76 | 0.52 0.61 0.72 | 0.18
8 4 104 540 | 0.79 *0.89* 0.84 | 0.79 0.89 0.82 | 0.79 0.76 0.86 | 0.80 0.49 0.81 | 0.74 0.33 0.75 | 0.03
17 4 247 264 | 0.37 0.90 0.88 | 0.33 0.90 *0.91* | 0.33 0.85 0.88 | 0.31 0.82 0.84 | 0.27 0.74 0.78 | 0.36
15 4 297 327 | 0.46 0.92 0.87 | 0.44 0.93 0.89 | 0.53 0.89 0.93 | 0.49 0.88 *0.94* | 0.45 0.90 0.92 | 0.38
12 4 320 397 | 0.38 0.91 0.87 | 0.36 *0.92* 0.90 | 0.44 0.91 0.90 | 0.36 0.87 0.87 | 0.36 0.84 0.82 | 0.33
24 4 372 465 | 0.50 0.89 0.83 | 0.50 *0.90* 0.90 | 0.50 0.80 0.76 | 0.52 0.74 0.73 | 0.51 0.66 0.68 | 0.17
4 4 390 456 | 0.35 *0.91* 0.85 | 0.36 *0.91* 0.88 | 0.36 0.90 0.89 | 0.35 0.84 0.85 | 0.35 0.76 0.77 | 0.23
3 4 414 516 | 0.43 *0.92* 0.88 | 0.42 0.90 0.90 | 0.41 0.89 0.90 | 0.42 0.86 0.90 | 0.42 0.84 0.88 | 0.38
10 4 490 492 | 0.03 *0.95* 0.85 | 0.03 0.91 0.84 | 0.03 0.80 0.82 | 0.01 0.72 0.80 | 0.01 0.67 0.73 | 0.25
13 5 51 101 | 0.69 0.80 0.83 | 0.69 0.81 *0.87* | 0.69 0.72 0.83 | 0.66 0.70 0.80 | 0.63 0.58 0.67 | 0.06
35 5 71 138 | 0.72 0.84 0.81 | 0.71 0.85 0.79 | 0.74 *0.85* 0.80 | 0.74 0.76 0.79 | 0.67 0.62 0.77 | 0.14
11 5 160 242 | 0.54 *0.94* 0.87 | 0.54 0.94 0.90 | 0.52 0.92 0.90 | 0.50 0.80 0.85 | 0.53 0.69 0.76 | 0.10
37 5 335 1192 | 0.38 0.68 0.67 | 0.46 0.73 0.74 | 0.54 0.79 0.81 | 0.62 0.80 0.88 | 0.70 0.81 *0.93* | 0.24
14 6 502 634 | 0.39 0.90 0.85 | 0.36 0.93 0.90 | 0.47 0.94 0.94 | 0.37 *0.96* 0.96 | 0.50 0.95 0.95 | 0.52
26 7 76 906 | 0.61 0.75 0.77 | 0.76 0.79 0.85 | 0.88 0.84 0.91 | 0.89 0.79 0.92 | *0.96* 0.66 0.82 | 0.13
27 7 175 432 | 0.48 0.82 0.72 | 0.57 0.88 0.78 | 0.65 0.94 0.83 | 0.71 *0.95* 0.84 | 0.68 0.91 0.79 | 0.24
23 7 231 407 | 0.54 0.81 0.76 | 0.61 0.86 0.81 | 0.66 0.90 0.88 | 0.62 *0.93* 0.89 | 0.72 0.91 0.83 | 0.37
2 8 52 193 | 0.85 0.87 *0.88* | 0.84 0.86 0.87 | 0.85 0.82 0.86 | 0.84 0.73 0.83 | 0.82 0.67 0.78 | 0.26
6 8 186 283 | 0.31 0.80 0.73 | 0.31 0.86 0.77 | 0.32 0.86 0.83 | 0.32 *0.88* *0.88* | 0.33 0.80 0.86 | 0.28
32 8 226 403 | 0.45 0.80 0.76 | 0.49 0.85 0.82 | 0.55 0.88 0.86 | 0.57 0.92 0.88 | 0.60 *0.93* 0.90 | 0.40
38 8 261 614 | 0.23 0.79 0.76 | 0.28 0.82 0.80 | 0.35 0.89 0.86 | 0.47 0.90 0.92 | 0.54 0.91 *0.94* | 0.37
36 8 298 436 | 0.38 0.78 0.68 | 0.43 0.83 0.73 | 0.44 0.87 0.79 | 0.49 0.92 0.85 | 0.48 *0.94* 0.87 | 0.41
16 8 316 729 | 0.27 0.71 0.77 | 0.32 0.76 0.82 | 0.38 0.80 0.85 | 0.50 0.83 0.89 | 0.58 0.88 *0.95* | 0.37
34 8 401 729 | 0.27 0.69 0.79 | 0.30 0.72 0.83 | 0.36 0.75 0.85 | 0.46 0.79 0.89 | 0.54 0.81 *0.94* | 0.32
20 9 201 237 | 0.29 0.87 0.76 | 0.30 0.90 0.83 | 0.30 *0.92* 0.87 | 0.27 0.89 0.88 | 0.27 0.86 0.89 | 0.40
7 9 385 457 | 0.37 0.79 0.68 | 0.39 0.84 0.77 | 0.38 0.88 0.80 | 0.41 0.92 0.89 | 0.41 *0.93* 0.89 | 0.37
28 10 93 211 | 0.53 0.90 0.85 | 0.56 *0.91* 0.90 | 0.56 0.86 0.89 | 0.56 0.84 0.86 | 0.51 0.77 0.82 | 0.40
19 10 299 396 | 0.35 0.74 0.74 | 0.38 0.80 0.76 | 0.40 0.85 0.82 | 0.42 *0.91* 0.87 | 0.44 *0.91* 0.90 | 0.43
33 11 85 239 | 0.57 0.78 0.82 | 0.62 0.82 0.83 | 0.64 0.83 0.89 | 0.67 0.80 *0.92* | 0.66 0.75 0.88 | 0.26
31 11 300 611 | 0.22 0.77 0.73 | 0.26 0.80 0.75 | 0.31 0.85 0.77 | 0.40 0.88 0.84 | 0.47 *0.96* 0.84 | 0.21
30 14 236 392 | 0.23 0.73 0.76 | 0.28 0.77 0.76 | 0.31 0.81 0.80 | 0.37 0.87 0.85 | 0.40 *0.90* 0.89 | 0.51
5 14 329 465 | 0.20 0.70 0.71 | 0.21 0.73 0.73 | 0.26 0.75 0.76 | 0.30 0.82 0.79 | 0.31 *0.85* *0.85* | 0.53
18 14 418 750 | 0.09 0.71 0.87 | 0.12 0.73 0.88 | 0.15 0.76 0.87 | 0.19 0.79 0.88 | 0.24 0.82 *0.90* | 0.45
Table 2

Percentage with respect to the reference hypervolume indicator value for local search with different k-block neighborhood sizes and starting solutions (Clustal Omega, Tcoffee and Rand) for each benchmark instance in dataset RV20. The results are averaged over 10 runs. See text for more details

id m Min Max | k=Min: Clust Tcoff Rand | k=Min/2: Clust Tcoff Rand | k=Min/4: Clust Tcoff Rand | k=Min/8: Clust Tcoff Rand | k=Min/16: Clust Tcoff Rand | Init
(best value for each instance marked with asterisks)
20 16 74 697 | 0.73 0.90 0.27 | 0.72 *0.90* 0.28 | 0.68 0.86 0.34 | 0.60 0.86 0.29 | 0.54 0.82 0.34 | 0.27
1 16 247 527 | 0.95 0.60 0.16 | *0.95* 0.56 0.19 | 0.88 0.53 0.22 | 0.82 0.47 0.18 | 0.73 0.34 0.17 | 0.50
2 20 52 1520 | 0.82 0.50 0.65 | *0.82* 0.49 0.67 | 0.80 0.52 0.69 | 0.76 0.45 0.66 | 0.72 0.35 0.68 | 0.55
11 21 95 631 | 0.89 *0.96* 0.72 | 0.87 0.81 0.74 | 0.82 0.77 0.77 | 0.79 0.74 0.76 | 0.80 0.67 0.90 | 0.27
7 23 381 457 | 0.91 0.92 0.50 | 0.93 *0.93* 0.50 | 0.91 0.93 0.55 | 0.84 0.89 0.53 | 0.81 0.82 0.66 | 0.23
19 24 401 729 | 0.87 0.93 0.62 | 0.94 0.95 0.67 | 0.91 0.96 0.65 | 0.86 *0.96* 0.66 | 0.83 0.87 0.75 | 0.42
16 27 106 458 | *0.93* 0.89 0.69 | 0.85 0.88 0.73 | 0.74 0.80 0.78 | 0.67 0.78 0.72 | 0.63 0.72 0.72 | 0.46
12 27 210 634 | *0.95* 0.75 0.35 | 0.89 0.75 0.37 | 0.77 0.71 0.37 | 0.50 0.59 0.36 | 0.39 0.52 0.42 | 0.12
13 28 447 747 | *0.70* 0.72 0.25 | 0.83 0.77 0.30 | 0.85 0.83 0.26 | 0.79 0.86 0.30 | 0.63 0.94 0.32 | 0.08
29 29 64 167 | 0.76 *0.87* 0.74 | 0.67 0.83 0.78 | 0.66 0.80 0.77 | 0.56 0.82 0.74 | 0.53 0.79 0.78 | 0.25
9 29 254 415 | 0.87 0.88 0.22 | *0.94* 0.92 0.26 | 0.93 0.94 0.30 | 0.82 0.75 0.26 | 0.74 0.67 0.23 | 0.48
27 29 279 1052 | 0.75 0.62 0.28 | 0.83 0.70 0.31 | 0.84 0.76 0.36 | *0.88* 0.83 0.32 | 0.87 0.87 0.45 | 0.35
10 29 577 1233 | 0.51 0.74 0.26 | 0.64 0.80 0.31 | 0.85 0.87 0.29 | 0.90 *0.96* 0.31 | 0.92 0.88 0.44 | 0.05
24 30 74 739 | 0.81 0.65 0.34 | 0.83 0.79 0.34 | 0.86 0.82 0.41 | 0.85 *0.90* 0.36 | 0.81 0.84 0.39 | 0.31
23 31 202 244 | *0.96* 0.89 0.31 | 0.92 0.89 0.36 | 0.91 0.86 0.32 | 0.86 0.81 0.35 | 0.71 0.71 0.43 | 0.37
26 32 271 1016 | 0.78 0.64 0.44 | 0.84 0.73 0.47 | 0.85 0.83 0.45 | *0.87* 0.86 0.49 | 0.76 0.85 0.59 | 0.34
35 35 226 982 | 0.69 0.83 0.24 | 0.73 0.64 0.27 | 0.76 0.68 0.32 | *0.85* 0.72 0.24 | 0.81 0.73 0.37 | 0.25
15 37 54 274 | 0.71 0.85 0.77 | 0.80 0.94 0.80 | 0.58 *0.94* 0.83 | 0.48 0.86 0.81 | 0.45 0.80 0.79 | 0.03
38 42 79 199 | 0.91 *0.95* 0.73 | 0.87 0.94 0.76 | 0.76 0.84 0.81 | 0.65 0.74 0.74 | 0.62 0.69 0.84 | 0.52
5 42 343 474 | 0.49 0.87 0.43 | 0.57 0.91 0.44 | 0.63 0.93 0.50 | 0.65 *0.94* 0.47 | 0.64 0.97 0.60 | 0.11
17 45 509 713 | 0.54 0.76 0.74 | 0.62 0.81 0.78 | 0.69 0.89 0.74 | 0.81 0.97 0.74 | 0.75 *0.99* 0.76 | 0.31
30 47 76 155 | 0.49 0.84 0.18 | 0.55 *0.88* 0.20 | 0.54 0.83 0.27 | 0.52 0.76 0.19 | 0.51 0.83 0.27 | 0.30
33 48 81 155 | *0.93* 0.87 0.57 | 0.89 0.82 0.58 | 0.78 0.79 0.64 | 0.66 0.67 0.61 | 0.65 0.69 0.59 | 0.57
41 48 293 1520 | *0.71* 0.40 0.02 | 0.64 0.42 0.07 | 0.61 0.44 0.06 | 0.59 0.48 0.02 | 0.60 0.53 0.07 | 0.39
6 51 224 293 | 0.74 0.93 0.26 | 0.78 0.95 0.31 | 0.69 *0.96* 0.30 | 0.69 0.93 0.27 | 0.68 0.86 0.43 | 0.15
18 53 296 381 | 0.86 0.71 0.31 | *0.99* 0.89 0.31 | 0.94 0.96 0.31 | 0.82 0.84 0.34 | 0.67 0.67 0.34 | 0.46
21 53 418 838 | 0.76 0.65 0.27 | 0.75 0.72 0.33 | 0.78 0.77 0.28 | 0.74 *0.81* 0.32 | 0.72 0.83 0.30 | 0.36
28 54 241 610 | 0.77 0.97 0.72 | 0.79 0.97 0.72 | 0.83 *0.97* 0.79 | 0.88 0.93 0.75 | 0.90 0.91 0.78 | 0.30
4 55 401 734 | 0.80 0.72 0.21 | 0.76 0.76 0.25 | 0.80 0.80 0.25 | 0.87 *0.89* 0.24 | 0.87 0.93 0.39 | 0.19
8 56 98 1520 | 0.60 0.40 0.43 | 0.61 0.47 0.46 | 0.60 0.57 0.45 | 0.61 *0.68* 0.46 | 0.61 0.72 0.48 | 0.16
22 58 320 381 | 0.67 0.63 0.11 | 0.74 0.68 0.15 | 0.84 0.78 0.17 | 0.89 0.87 0.13 | 0.88 *0.91* 0.29 | 0.39
34 59 234 548 | 0.60 0.78 0.57 | 0.68 0.79 0.62 | 0.75 0.83 0.60 | 0.81 0.85 0.61 | 0.79 *0.89* 0.64 | 0.33
31 60 175 432 | 0.81 0.71 0.40 | 0.79 0.80 0.45 | 0.74 0.86 0.47 | 0.65 0.92 0.40 | 0.61 *0.89* 0.58 | 0.38
32 61 93 149 | 0.97 0.92 0.48 | *0.98* 0.90 0.52 | 0.78 0.85 0.57 | 0.67 0.72 0.52 | 0.63 0.67 0.58 | 0.35
37 65 160 810 | 0.65 0.49 0.03 | 0.59 0.54 0.04 | 0.56 0.59 0.05 | 0.56 0.66 0.08 | 0.55 *0.73* 0.13 | 0.42
14 65 333 729 | 0.62 0.83 0.43 | 0.67 0.83 0.48 | 0.63 0.84 0.45 | 0.62 0.82 0.46 | 0.69 *0.90* 0.46 | 0.54
3 74 409 800 | 0.52 0.71 0.77 | 0.69 0.88 0.81 | 0.60 0.80 0.84 | 0.54 0.75 0.80 | 0.73 *0.93* 0.83 | 0.60
25 81 372 518 | 0.56 0.84 0.20 | 0.45 0.87 0.21 | 0.49 0.86 0.30 | 0.48 0.83 0.25 | 0.43 *0.99* 0.26 | 0.36
40 87 273 570 | 0.44 0.77 0.33 | 0.36 0.87 0.34 | 0.38 0.86 0.34 | 0.41 0.83 0.36 | 0.36 *0.90* 0.51 | 0.11
36 91 82 335 | 0.61 0.79 0.22 | 0.74 0.78 0.27 | 0.68 0.79 0.28 | 0.61 *0.85* 0.26 | 0.81 0.78 0.38 | 0.18
39 91 298 458 | 0.52 0.85 0.20 | 0.40 0.82 0.22 | 0.45 0.85 0.22 | 0.49 *0.94* 0.22 | 0.39 0.76 0.29 | 0.33
The results clearly indicate that PLS performs best when started from the Clust and Tcoffee alignments rather than from a feasible random alignment. Also, the best k-block value strongly depends on the number of sequences and, to a smaller extent, on their sizes. As a rule, for the given cut-off time, large (small) k values achieve better performance on problems with a smaller (larger) number of sequences. Moreover, column Init indicates that PLS strongly improves upon the two starting alignments. Tables 3 and 4 report the results obtained by PILS on the same benchmark sets, averaged over 10 runs, for a k-block size equal to 2; we recall that Clust and Tcoffee are used as starting alignments. The results reported in the tables do not include the CPU time taken to compute the starting alignments. PILS always terminates once the time limit of 5 minutes of CPU time is reached. In this case, the time limit for each PLS run within PILS is minutes. Column PLS^max gives the best value of PLS from Tables 1 and 2. The best results for each sequence set are highlighted. In most of the instances, PILS improves over PLS, although the best performance may depend on the number of sequences and their size: a small (large) number of perturbations gives better performance on larger (smaller) sequences and a larger (smaller) number of sequences.
Table 3

Percentage with respect to the reference hypervolume indicator value for PILS with several perturbation numbers (P) for each instance of data set RV11 [17]. The results are averaged over 10 runs. See text for more details

id m Min Max | PLS^max P=1 P=2 P=4 P=8 P=16
(best value for each instance marked with asterisks)
22 4 63 205 | 0.90 0.91 0.87 0.89 0.88 *0.97*
25 4 64 103 | 0.86 0.86 0.89 *0.91* 0.89 0.87
29 4 81 138 | 0.88 0.91 0.91 0.94 0.92 *0.98*
1 4 83 91 | 0.35 0.36 0.38 0.40 0.42 *0.44*
9 4 97 337 | 0.87 0.87 0.81 *0.87* 0.83 0.86
21 4 102 139 | 0.91 *0.95* 0.89 0.92 0.88 0.93
8 4 104 540 | 0.89 0.87 0.91 *0.97* 0.88 0.90
17 4 247 264 | 0.91 0.92 0.93 0.95 0.96 *0.97*
15 4 297 327 | 0.94 0.90 0.92 0.96 *0.97* 0.97
12 4 320 397 | 0.92 0.92 0.91 0.96 0.97 *0.99*
24 4 372 465 | *0.90* 0.87 0.85 0.86 0.83 0.80
4 4 390 456 | 0.91 *0.93* 0.87 0.88 0.88 0.84
3 4 414 516 | 0.92 *0.92* 0.89 0.89 0.88 0.88
10 4 490 492 | 0.95 *0.97* 0.95 0.96 0.90 0.94
13 5 51 101 | 0.87 0.91 0.92 0.93 0.93 *0.95*
35 5 71 138 | 0.85 0.86 0.89 0.88 0.93 *0.99*
11 5 160 242 | *0.94* *0.94* 0.89 0.91 0.88 0.91
37 5 335 1192 | 0.93 0.92 0.94 0.91 *0.99* 0.98
14 6 502 634 | 0.96 0.95 *0.97* 0.95 0.94 0.91
26 7 76 906 | 0.96 *0.99* 0.96 0.95 0.66 0.60
27 7 175 432 | *0.95* 0.91 0.87 0.85 0.85 0.69
23 7 231 407 | *0.93* 0.89 0.81 0.82 0.82 0.79
2 8 52 193 | 0.88 *0.92* 0.84 0.82 0.84 0.78
6 8 186 283 | *0.88* *0.88* 0.77 0.75 0.75 0.66
32 8 226 403 | *0.93* 0.88 0.85 0.83 0.81 0.74
38 8 261 614 | *0.94* 0.86 0.88 0.89 0.85 0.81
36 8 298 436 | 0.94 0.94 0.94 0.94 0.96 *0.97*
16 8 316 729 | 0.95 0.94 0.93 0.96 0.95 *0.99*
34 8 401 729 | *0.94* 0.85 0.83 0.81 0.80 0.77
20 9 201 237 | 0.92 0.96 *0.97* 0.96 0.90 0.83
7 9 385 457 | 0.93 0.92 0.92 *0.97* 0.93 0.88
28 10 93 211 | 0.91 *0.95* *0.95* 0.90 0.87 0.86
19 10 299 396 | *0.91* 0.85 0.85 0.83 0.83 0.81
33 11 85 239 | *0.92* *0.92* 0.91 0.90 0.87 0.84
31 11 300 611 | *0.96* 0.95 0.88 0.89 0.84 0.82
30 14 236 392 | *0.90* 0.89 0.86 0.81 0.80 0.78
5 14 329 465 | 0.85 *0.87* 0.83 0.80 0.75 0.75
18 14 418 750 | 0.90 *0.94* 0.92 0.90 0.88 0.87
Table 4

Percentage with respect to the reference hypervolume indicator value for PILS with several perturbation numbers (P) for each instance of data set RV20. The results are averaged over 10 runs. See text for more details

id  m  Min  Max  PLS^max  P = 1  P = 2  P = 4  P = 8  P = 16
20  16  74   697   *0.95* *0.95* *0.95* 0.88   0.83   0.73
1   16  247  527   *0.90* *0.90* *0.90* 0.87   0.88   0.85
2   20  52   1520  0.82   0.93   *0.94* *0.94* 0.89   0.84
11  21  95   631   0.90   *0.97* 0.95   0.91   0.86   0.85
7   23  381  457   0.93   0.95   *0.96* 0.95   0.89   0.83
19  24  401  729   *0.96* 0.91   0.94   0.95   0.95   0.87
16  27  106  458   0.89   *0.91* 0.90   0.80   0.78   0.72
12  27  210  634   *0.89* *0.89* 0.82   0.80   0.59   0.52
13  28  447  747   *0.94* 0.69   0.77   0.86   0.83   0.87
29  29  64   167   0.83   *0.89* 0.84   0.82   0.81   0.81
9   29  254  415   0.94   0.89   *0.96* 0.94   0.82   0.76
27  29  279  1052  0.87   0.73   0.81   0.87   0.90   *0.92*
10  29  577  1233  0.92   0.64   0.73   0.80   0.90   *0.94*
24  30  74   739   0.86   0.80   0.85   0.90   *0.91* 0.88
23  31  202  244   0.92   *0.98* 0.95   0.91   0.87   0.73
26  32  271  1016  0.86   0.73   0.85   0.90   *0.93* 0.82
35  35  226  982   0.83   0.80   0.78   0.82   0.88   *0.90*
15  37  54   274   *0.94* 0.82   0.91   *0.94* 0.86   0.80
38  42  79   199   *0.94* *0.94* *0.94* 0.84   0.73   0.69
5   42  343  474   *0.97* 0.83   0.85   0.87   0.91   0.93
17  45  509  713   *0.97* 0.71   0.74   0.81   0.87   0.94
30  47  76   155   0.84   0.84   *0.87* 0.83   0.80   0.83
33  48  81   155   0.89   *0.95* 0.89   0.81   0.67   0.69
41  48  293  1520  0.64   0.88   0.84   0.82   0.83   *0.86*
6   51  224  293   *0.95* 0.85   0.91   *0.95* 0.93   0.86
18  53  296  381   *0.96* 0.70   0.87   0.94   0.83   0.67
21  53  418  838   0.83   0.89   0.91   0.93   *0.94* 0.91
28  54  241  610   *0.97* 0.96   *0.97* 0.96   0.94   0.93
4   55  401  734   0.93   0.81   0.80   0.84   0.89   *0.94*
8   56  98   1520  0.72   0.72   0.78   0.81   0.85   *0.87*
22  58  320  381   0.89   0.62   0.68   0.79   0.91   *0.92*
34  59  234  548   0.85   0.75   0.75   0.82   *0.88* 0.86
31  60  175  432   *0.92* 0.87   0.87   0.83   0.82   0.88
32  61  93   149   0.97   0.97   *0.98* 0.88   0.73   0.67
37  65  160  810   0.84   0.80   0.84   0.81   0.80   *0.91*
14  65  333  729   0.66   0.84   0.84   0.83   0.86   *0.90*
3   74  409  800   0.88   0.73   0.90   0.81   0.76   *0.95*
25  81  372  518   0.87   0.82   0.86   0.85   0.81   *0.96*
40  87  273  570   0.87   0.83   0.92   0.91   0.88   *0.94*
36  91  82   335   0.81   0.82   0.83   0.77   0.78   *0.92*
39  91  298  458   *0.85* 0.73   0.76   0.79   0.81   *0.85*
We complemented the analysis to gain insight into the absolute quality of the alignments produced by PILS when compared to the reference alignments available in the BAliBASE benchmark. We rely on the correctly aligned pairs (SP) and columns correctly aligned (TC) measures, as done in [15]. These ratios can be computed by the program BAliScore, which is available by ftp from ftp-igbmc.u-strasbg.fr/pub/BAliBASE [29]. We used data sets from BAliBASE chosen for their specific features; see Table 5. Columns Reference and id correspond to the reference and id number of the data set in the BAliBASE benchmark; columns m, Min and Max correspond to the number of sequences and the lengths of the shortest and longest sequence, respectively; Len corresponds to the length of the reference alignment in the BAliBASE benchmark.
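The two ratios can be sketched in a few lines; the following is a simplified, self-contained reimplementation for illustration only, and its denominator conventions may differ from those of the actual BAliScore program.

```python
def aligned_pairs(alignment):
    """Residue pairs placed in the same column of a gapped alignment.

    `alignment` is a list of equal-length strings; a residue is identified
    by (sequence index, residue index within the ungapped sequence).
    """
    pairs, counts = set(), [0] * len(alignment)
    for col in range(len(alignment[0])):
        residues = []
        for s, row in enumerate(alignment):
            if row[col] != '-':
                residues.append((s, counts[s]))
                counts[s] += 1
        for i in range(len(residues)):
            for j in range(i + 1, len(residues)):
                pairs.add((residues[i], residues[j]))
    return pairs


def columns(alignment):
    """Each column of the alignment as the frozenset of residues it aligns."""
    cols, counts = [], [0] * len(alignment)
    for col in range(len(alignment[0])):
        entries = []
        for s, row in enumerate(alignment):
            if row[col] != '-':
                entries.append((s, counts[s]))
                counts[s] += 1
        cols.append(frozenset(entries))
    return cols


def sp_ratio(test, reference):
    """Fraction of the reference residue pairs reproduced by the test alignment."""
    ref_pairs = aligned_pairs(reference)
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)


def tc_ratio(test, reference):
    """Fraction of reference columns (with at least two residues)
    reproduced exactly by the test alignment."""
    ref_cols = [c for c in columns(reference) if len(c) > 1]
    test_cols = set(columns(test))
    return sum(c in test_cols for c in ref_cols) / len(ref_cols)
```

An alignment identical to the reference scores 1.0 on both ratios; shifting a residue out of its reference column lowers SP pair by pair, while TC drops for every column that is no longer reproduced exactly.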
Table 5

The selected data sets from different references of the benchmark

Reference  id  Identity       m   Min  Max  Len
RV11  1   <20 %            4   83   91   96
RV11  4   <20 %            4   390  456  603
RV12  20  >20 and <40 %    4   118  129  141
RV12  42  >20 and <40 %    4   448  561  611
RV20  20  >40 %            16  74   697  615
RV20  1   >40 %            16  247  527  780
RV30  17  -                15  231  370  416
RV40  10  -                9   67   214  275
RV40  14  -                9   298  609  712
RV50  4   -                9   386  505  547
We ran PILS 10 times on these data sets and, for each collection of runs, chose the best alignment produced in terms of score. We computed the SP and TC ratios between this alignment and the reference alignment available for each chosen data set in BAliBASE. For the calculation of the SP ratio, we used the substitution matrix PAM 250. For comparison purposes, we performed the same procedure for the alignments produced by Clustal Omega and Tcoffee. Tables 6 and 7 show the results obtained with the SP and TC ratios, respectively. Values in italics represent the best ratio. From Table 6, it can be observed that PILS obtained the best SP value for all the tested data sets. In Table 7, the best TC ratio is always achieved by either Tcoffee or Clustal Omega. Note that a null TC value was obtained by the alignment of Clustal Omega on RV12 id 20 and RV20 id 20, and by Tcoffee on RV20 id 1.
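For illustration, the two objective values that our approach trades off, the sum-of-pairs substitution score and the number of indels/gaps, can be computed for a gapped alignment roughly as follows; the dictionary-based matrix format and the gap-counting convention (one unit per gap character) are simplifying assumptions.

```python
def objectives(alignment, subst, gap_char='-'):
    """The (sum-of-pairs substitution score, gap count) of a gapped alignment.

    Simplified sketch: `subst` is a symmetric substitution-score mapping
    such as a PAM 250 matrix stored as {('A', 'C'): -2, ...}, and the gap
    objective simply counts gap characters, one of several possible
    indel-counting conventions.
    """
    n = len(alignment)
    score = 0
    for col in range(len(alignment[0])):
        for i in range(n):
            for j in range(i + 1, n):
                a, b = alignment[i][col], alignment[j][col]
                if a != gap_char and b != gap_char:
                    # Look the pair up in either order (matrix is symmetric).
                    score += subst[(a, b)] if (a, b) in subst else subst[(b, a)]
    gaps = sum(row.count(gap_char) for row in alignment)
    return score, gaps
```

A multiobjective alignment method keeps every alignment whose (score, gaps) pair is not dominated by another, rather than collapsing both terms into one gap-penalized score.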
Table 6

The SP scores of Tcoffee, Clustal Omega and PILS on the selected test cases

Reference id Clustal Omega Tcoffee PILS
RV11  1   0.956  0.965  *0.985*
RV11  4   0.033  0.706  *0.788*
RV12  20  0.331  0.973  *0.981*
RV12  42  0.678  0.789  *0.85*
RV20  20  0.535  0.934  *0.946*
RV20  1   0.939  0.596  *0.951*
RV30  17  0.765  0.787  *0.811*
RV40  10  0.875  0.872  *0.878*
RV40  14  0.878  0.890  *0.894*
RV50  4   0.973  0.983  *0.988*
Table 7

The TC scores of Tcoffee, Clustal Omega and PILS on the selected test cases

Reference id Clustal Omega Tcoffee PILS
RV11  1   0.912    *0.930*  0.885
RV11  4   0.408    *0.554*  0.378
RV12  20  0.000    *0.957*  0.851
RV12  42  0.548    *0.662*  0.434
RV20  20  0.000    *0.775*  0.654
RV20  1   *0.775*  0.000    0.459
RV30  17  0.552    *0.581*  0.535
RV40  10  *0.639*  0.590    0.330
RV40  14  *0.671*  0.658    0.463
RV50  4   0.919    *0.940*  0.892

Conclusions

The local search introduced in this article is able to provide high-quality alignments in a reasonable amount of time. Since local search can get stuck in local optima, we propose a method that perturbs the set of alignments in the archive and restarts the local search, as done in iterated local search. The results show that the hypervolume indicator improves across different numbers of perturbations. The trade-off set returned by our approaches is still considerably large, which may be overwhelming for the practitioner. One possibility is to present a smaller subset of alignments that is representative of the complete trade-off; such a notion of representativeness may be based on metrics of uniformity (the most spread subset) or coverage (the subset that best covers the complete set). Algorithms that find optimal representative subsets are available in [30]. As future work, we may try other starting alignments that are based not only on similarity but also on other biological criteria, such as the secondary structure of the alignments. Further, we may test different perturbation methods. One possibility is to use the phylogenetic trees of the sequences and apply the ratchet strategy [31, 32], which may help to create a new starting solution with more information from existing results.
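Schematically, the perturb-and-restart scheme described above can be sketched as follows; all names, signatures and the merging policy are illustrative assumptions rather than the authors' actual implementation.

```python
import random
import time


def pils(start_alignments, local_search, perturb, n_perturbations, time_limit):
    """Sketch of perturbed local search in the spirit of iterated local search.

    `local_search` maps a collection of alignments to an archive of mutually
    nondominated alignments; `perturb` randomly modifies one alignment.
    """
    deadline = time.time() + time_limit
    archive = local_search(start_alignments)
    while time.time() < deadline:
        # Perturb archive members to escape local optima ...
        seeds = [perturb(random.choice(archive)) for _ in range(n_perturbations)]
        # ... then restart local search from the perturbed alignments
        # and merge the outcome with the current archive.
        archive = local_search(list(archive) + seeds)
    return archive
```

With stub arguments (an identity-style `local_search` and an increment `perturb`), the loop simply grows and re-filters the archive until the deadline, which is the control flow the scheme relies on.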