
DIALIGN-TX and multiple protein alignment using secondary structure information at GOBICS.

Amarendran R Subramanian, Suvrat Hiran, Rasmus Steinkamp, Peter Meinicke, Eduardo Corel, Burkhard Morgenstern

Abstract

We introduce web interfaces for two recent extensions of the multiple-alignment program DIALIGN. DIALIGN-TX combines the greedy heuristic previously used in DIALIGN with a more traditional 'progressive' approach for improved performance on locally and globally related sequence sets. In addition, we offer a version of DIALIGN that uses predicted protein secondary structures together with primary sequence information to construct multiple protein alignments. Both programs are available through the 'Göttingen Bioinformatics Compute Server' (GOBICS).

Year: 2010 PMID: 20497995 PMCID: PMC2896137 DOI: 10.1093/nar/gkq442

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Multiple sequence alignment (MSA) is the basis of almost all methods for sequence analysis in bioinformatics. Thus, the results of these methods crucially depend on the underlying alignments. A striking example is a recent study by Wong et al. (1). These authors demonstrated that uncertainties in multiple alignments drastically influence the output of standard phylogeny programs. Development and evaluation of MSA methods has therefore been a central field of research in bioinformatics since the mid-1980s. Recent reviews of MSA methods are given, for example, by Edgar and Batzoglou (2), Morrison (3) or Kemena and Notredame (4).

DIALIGN

Since its first release in 1996, DIALIGN has been a widely used software tool for multiple alignment of DNA, RNA and protein sequences (5,6). It differs in various aspects from other MSA algorithms. DIALIGN tries to align only those parts of the sequences to each other that exhibit some statistically relevant degree of similarity. Non-related parts of the sequences remain unaligned. This way, the method combines local and global alignment features. It returns global alignments where sequences are homologous over their entire length, but local alignments where only local homologies are detectable. DIALIGN constructs alignments from gap-free local alignments, so-called fragments, for which a scoring function is defined based on the probability of their random occurrence. Multiple alignments are constructed in a greedy way by incorporating fragments that are mutually consistent, i.e. fragments that fit into one single output MSA (7). Like most MSA methods, the standard version of DIALIGN is fully automated and works without human intervention. In addition, however, DIALIGN has an option for ‘anchored alignment’ where MSAs are produced in a ‘semi-automatic’ way (8,9). With this option, the program can be ‘forced’ to align user-defined positions of the sequences to each other, and the remainder of the sequences is aligned automatically. Anchored alignment can also be used to speed up the alignment procedure where long genomic sequences are to be aligned (10,11) or to study the behaviour of alignment methods in detail (12). Numerous studies have shown that DIALIGN is superior to other MSA tools when locally related sequence sets are aligned, but on globally related sequences with weak primary-sequence similarity, it is often outperformed by global methods such as ‘CLUSTAL W’ (13), ‘MUSCLE’ (14,15), ‘MAFFT’ (16) or ‘PROBCONS’ (17). Since the first release of DIALIGN, various alternative optimization algorithms have been applied to the fragment-based alignment approach in order to improve its performance (18,19), but recent results indicate that the relative weakness of DIALIGN on global homologies is due to the underlying objective function rather than to the greedy optimization algorithm (12).
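The greedy incorporation of mutually consistent fragments can be illustrated with a minimal sketch. It is restricted to two sequences for brevity, where two fragments are consistent if one lies entirely before the other in both sequences; the consistency test across many sequences in DIALIGN itself is more involved and is not reproduced here. The `Fragment` type, its field names and the precomputed `weight` values are assumptions made for illustration, not part of the DIALIGN code.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """A gap-free local alignment between sequences A and B (hypothetical type)."""
    start_a: int   # start position in sequence A
    start_b: int   # start position in sequence B
    length: int    # gap-free, so the same length in both sequences
    weight: float  # assumed to be precomputed by some scoring function


def consistent(f: Fragment, g: Fragment) -> bool:
    """True if f and g fit into one pairwise alignment: one fragment must end
    before the other starts in BOTH sequences (no overlap, no crossing)."""
    f_before_g = (f.start_a + f.length <= g.start_a
                  and f.start_b + f.length <= g.start_b)
    g_before_f = (g.start_a + g.length <= f.start_a
                  and g.start_b + g.length <= f.start_b)
    return f_before_g or g_before_f


def greedy_select(fragments: List[Fragment]) -> List[Fragment]:
    """Accept fragments in order of decreasing weight, skipping any fragment
    that is inconsistent with one already accepted."""
    accepted: List[Fragment] = []
    for frag in sorted(fragments, key=lambda f: f.weight, reverse=True):
        if all(consistent(frag, other) for other in accepted):
            accepted.append(frag)
    # Report the accepted fragments in sequence order.
    return sorted(accepted, key=lambda f: f.start_a)


example = [
    Fragment(0, 0, 5, 10.0),
    Fragment(3, 8, 4, 6.0),   # overlaps the first fragment in sequence A
    Fragment(6, 6, 4, 5.0),
    Fragment(10, 2, 3, 4.0),  # crosses the first fragment
]
selected = greedy_select(example)
```

In this toy input, the second and fourth fragments are rejected because they overlap or cross a heavier fragment that was accepted earlier, which is the behaviour the greedy strategy trades for speed.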

DIALIGN-TX

DIALIGN-T is a complete re-implementation of DIALIGN developed by the first author of this article (20). In the first step, it performs all possible pairwise alignments of the input sequences in the sense of DIALIGN (21,22). For multiple alignment, however, DIALIGN-T uses a number of heuristics to prevent the algorithm from aligning spurious, isolated random similarities that might destroy a biologically more meaningful global alignment. For example, in the greedy algorithm for MSA, DIALIGN-T considers not only the local degree of similarity in a fragment, but also its context. Fragments that are part of a high-scoring pairwise alignment are preferred over isolated fragments. Also, low-scoring regions are removed from long fragments to counterbalance the bias of DIALIGN in favour of high-scoring fragments and to support groups of lower scoring fragments. Together with some other heuristics, this led to a considerable improvement in performance compared with the original implementation of DIALIGN. These ideas were taken a step further in the latest release of the program, ‘DIALIGN-TX’ (23). Here, the traditional progressive approach to multiple alignment (24–26) is adapted to the fragment-based alignment as used in DIALIGN. First, a guide tree is calculated based on pairwise fragment alignments. Then, pairwise alignments of sequences and groups of previously aligned sequences are performed going from the leaves to the root of the guide tree. In traditional progressive alignment methods, such groups of already aligned sequences are represented as ‘profiles’ and aligned by ‘profile alignment’. This is not possible in DIALIGN, where an alignment is seen as a consistent set of fragments and only parts of the sequences may be aligned. To align two groups G1 and G2 of previously aligned sequences to each other, DIALIGN-TX selects a set of fragments, each of which aligns a sequence from G1 with a sequence from G2. A vertex-cover algorithm by Clarkson (27) is used to remove inconsistencies and to select high-scoring sets of consistent fragments.
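To make the vertex-cover step more concrete, the following sketch implements Clarkson's modified greedy heuristic for weighted vertex cover on a small conflict graph. The modelling is an assumption for illustration: vertices stand for fragments, vertex weights for fragment scores, and an edge joins two fragments that cannot both appear in one alignment. How DIALIGN-TX actually builds its graph is not described in this article, and the names `clarkson_vertex_cover`, `weights` and `conflicts` are hypothetical.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple


def clarkson_vertex_cover(
    weights: Dict[Hashable, float],
    edges: Iterable[Tuple[Hashable, Hashable]],
) -> Set[Hashable]:
    """Clarkson's modified greedy heuristic for weighted vertex cover.

    Repeatedly pick the vertex with the smallest ratio of residual weight to
    current degree, charge that ratio to each of its neighbours, and remove
    the vertex together with its edges. Returns the set of removed vertices.
    """
    adj: Dict[Hashable, Set[Hashable]] = {v: set() for v in weights}
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    residual = dict(weights)
    cover: Set[Hashable] = set()
    while True:
        candidates = [v for v in adj if adj[v]]
        if not candidates:  # no edges left: every conflict is covered
            break
        v = min(candidates, key=lambda x: residual[x] / len(adj[x]))
        price = residual[v] / len(adj[v])
        for u in adj[v]:
            residual[u] -= price
            adj[u].discard(v)
        adj[v] = set()
        cover.add(v)
    return cover


# Hypothetical fragment scores and pairwise conflicts.
weights = {"f1": 10.0, "f2": 3.0, "f3": 4.0, "f4": 8.0}
conflicts = [("f1", "f2"), ("f1", "f3"), ("f3", "f4")]
removed = clarkson_vertex_cover(weights, conflicts)
kept = set(weights) - removed
```

Fragments in the cover are discarded, so the fragments that remain are pairwise conflict-free. Because the heuristic favours removing cheap vertices that resolve many conflicts, the retained set tends to have a high total score, although only an approximation guarantee holds in general.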

DIALIGN USING PROTEIN SECONDARY STRUCTURE INFORMATION

As with most methods for multiple protein alignment, DIALIGN and DIALIGN-TX are based on primary structure information alone. However, attempts have been made in the past to use predicted secondary structures in alignment algorithms (28,29). We implemented a software pipeline that takes predicted protein secondary structures as additional input information for DIALIGN.

In the first step, the standard version of DIALIGN is run to obtain pairwise alignments in the sense of the fragment-based alignment approach. That is, an optimal chain of fragments is calculated for each pair of input sequences. Next, we run PSIPRED (30) on the individual sequences to predict their secondary structures. PSIPRED is one of the most accurate de novo predictors for protein secondary structures (31). It assigns one of three different states, ‘helix’ (H), ‘strand’ (E) or ‘coil’ (C), to every position of the sequences.

We defined a modified weight function w′ on the set of fragments that takes both primary and secondary structure into account. Based on the secondary structures predicted by PSIPRED, a new weight score w′(f) of a fragment f is defined as

[equation not preserved in this copy]

where w(f) is the original, primary sequence-based fragment weight as used in DIALIGN (6). s(f) is a measure of similarity at the secondary-structure level for fragments and is defined as

[equation not preserved in this copy]

Here, m is the proportion of matching states x, and p the proportion of predicted states x, where x can be H, E or C, as predicted by the PSIPRED program. Optimal values for the parameters α, β, γ and δ have been identified using a least squares support vector machine (32). The measure Sov(f) of the similarity of predicted secondary structures for the segments composing the fragment f has been defined by Kim and Xie (29) on the basis of the original Sov score (33).

For multiple alignment, we use the greedy algorithm implemented in DIALIGN, but fragments are ranked according to their sequence structure-based weights w′ instead of the sequence-based weights w. Technically, this is done by defining the fragments contained in the respective pairwise DIALIGN alignments as ‘anchor points’ using the modified scores w′(f) as weights that determine the priority of fragments in the greedy algorithm.

We evaluated our secondary structure-based MSA approach using the current release of ‘BAliBASE 3’ (34). Table 1 shows that, ‘on average’, the performance of DIALIGN using secondary structure information is similar to the performance of the program with primary-sequence information alone. For many data sets, however, we observed great differences in the resulting alignments. In some cases, the structure-based alignments were far better than the original ones, while in other cases it was the other way around. For some sequence sets, our secondary structure approach achieved an improvement of 29.7 percentage points in the sum-of-pairs (SP) score (a relative improvement of 62%) compared to the purely sequence-based alignment. Therefore, we believe that our secondary structure-based alignments may contain valuable information that is not available in sequence-based MSAs and could therefore be a useful addition to sequence-based alignments.
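The two improvement figures quoted above can be related by simple arithmetic. The short derivation below assumes that both numbers refer to the same sequence set and that they are rounded; the symbols SP_seq and SP_sec are introduced here only for illustration, and the resulting scores are implied values, not figures reported by the authors.

```latex
\begin{align*}
  \mathrm{SP}_{\mathrm{sec}} - \mathrm{SP}_{\mathrm{seq}} &= 29.7
  \quad\text{(percentage points)}, \\
  \frac{\mathrm{SP}_{\mathrm{sec}} - \mathrm{SP}_{\mathrm{seq}}}{\mathrm{SP}_{\mathrm{seq}}} &= 0.62
  \;\Longrightarrow\;
  \mathrm{SP}_{\mathrm{seq}} \approx \frac{29.7}{0.62} \approx 47.9, \\
  \mathrm{SP}_{\mathrm{sec}} &\approx 47.9 + 29.7 = 77.6 .
\end{align*}
```

Under these assumptions, the sequence-based alignment would have scored roughly 48% and the structure-based alignment roughly 78% on that set.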

Table 1.

Performance of DIALIGN 2.2 with primary sequence information alone, our secondary structure-based alignment (DIALIGN SEC) and DIALIGN-TX on BAliBASE 3 under the ‘sum-of-pairs’ scoring scheme

	RV11	RV12	RV20	RV30	RV40	RV50
DIALIGN 2.2	50.7	86.2	86.9	71.0	82.3	79.8
DIALIGN SEC	49.8	84.5	86.6	74.7	83.1	81.8
DIALIGN-TX	51.5	89.1	87.8	76.1	83.6	82.2

On average, the performance of our secondary-structure based alignment is similar to the original version of the program. However, for some data sets in BAliBASE, there are great differences between the sequence-based and sequence-structure-based alignments. DIALIGN-TX clearly outperforms the previous release of DIALIGN with and without structure information. More detailed test results and a comparison to other methods are given in (23).
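As a quick check of the ‘on average’ statement, the sketch below computes the unweighted mean of each row of Table 1 and the per-category difference between DIALIGN SEC and DIALIGN 2.2. Note the assumption: the six BAliBASE reference sets are weighted equally here, whereas the authors' average may have been taken over individual alignments, so these means are illustrative only.

```python
from typing import Dict, List

CATEGORIES: List[str] = ["RV11", "RV12", "RV20", "RV30", "RV40", "RV50"]

# Sum-of-pairs scores copied from Table 1.
TABLE_1: Dict[str, List[float]] = {
    "DIALIGN 2.2": [50.7, 86.2, 86.9, 71.0, 82.3, 79.8],
    "DIALIGN SEC": [49.8, 84.5, 86.6, 74.7, 83.1, 81.8],
    "DIALIGN-TX": [51.5, 89.1, 87.8, 76.1, 83.6, 82.2],
}


def row_means(table: Dict[str, List[float]]) -> Dict[str, float]:
    """Unweighted mean score per method across the reference sets."""
    return {method: sum(scores) / len(scores) for method, scores in table.items()}


def differences(table: Dict[str, List[float]], a: str, b: str) -> Dict[str, float]:
    """Per-category score difference a - b, rounded to one decimal place."""
    return {
        cat: round(x - y, 1)
        for cat, x, y in zip(CATEGORIES, table[a], table[b])
    }


means = row_means(TABLE_1)
sec_vs_seq = differences(TABLE_1, "DIALIGN SEC", "DIALIGN 2.2")
```

With equal weighting, the means come out at 76.15 for DIALIGN 2.2, 76.75 for DIALIGN SEC and about 78.38 for DIALIGN-TX. The per-category differences between DIALIGN SEC and DIALIGN 2.2 range from -1.7 (RV12) to +3.7 (RV30), which is consistent with a similar average hiding larger differences in individual categories.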

WWW SERVER AT GOBICS

To make the new versions of DIALIGN easily available to the research community, we set up WWW interfaces for them at the ‘Göttingen Bioinformatics Compute Server’ (GOBICS). DIALIGN-TX is available at http://dialign-tx.gobics.de/submission. Various parameter values can be selected by the user. For exclusion of low-scoring regions in long fragments, the minimum fragment length T from which low-scoring sub-fragments are excluded can be specified, as well as the length L of low-scoring regions that are excluded from alignment. That is, if a fragment f of length ≥T contains a low-scoring sub-fragment of length L, this sub-fragment is removed and f is split into the two remaining sub-fragments. Also, there are options to increase the program speed, possibly at the expense of sensitivity. For DNA alignment, there are several options to translate DNA fragments into peptide fragments according to the genetic code and to consider open reading frames for alignment. The downloadable program versions contain more options and adjustable parameters, which are explained in the user guide. Also, the downloadable program now comes with an ‘anchored-alignment’ option. DIALIGN with secondary-structure information is available at: http://dialign-sec.gobics.de/submission.
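The role of the parameters T and L can be illustrated with a small sketch. The text does not state how a region is judged ‘low-scoring’, so the criterion used here (the length-L window with the lowest total score, removed only if its average falls below a hypothetical `low_threshold`) is an assumption, as are the function name and the per-column score representation.

```python
from typing import List, Tuple


def split_fragment(
    column_scores: List[float],
    T: int,
    L: int,
    low_threshold: float = 0.0,  # hypothetical cut-off, not from the article
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals of the fragment that are kept.

    A fragment shorter than T is returned unchanged. Otherwise the window of
    length L with the lowest total score is located; if its average score is
    below low_threshold, it is cut out and the two flanking parts are kept.
    """
    n = len(column_scores)
    if n < T or L <= 0 or L > n:
        return [(0, n)]

    # Sliding-window sums over all windows of length L.
    window = sum(column_scores[:L])
    best_start, best_sum = 0, window
    for start in range(1, n - L + 1):
        window += column_scores[start + L - 1] - column_scores[start - 1]
        if window < best_sum:
            best_start, best_sum = start, window

    if best_sum / L >= low_threshold:
        return [(0, n)]  # no sufficiently low-scoring region found

    parts = [(0, best_start), (best_start + L, n)]
    return [(s, e) for s, e in parts if e > s]  # drop empty flanks


scores = [3, 4, 2, -1, -2, -1, 5, 3, 4, 2]
kept_parts = split_fragment(scores, T=8, L=3)
unchanged = split_fragment(scores, T=11, L=3)
```

In this example the fragment has 10 columns, so with T=8 the three negative columns are cut out and two sub-fragments remain, whereas with T=11 the fragment is too short to be considered and is returned whole.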

FUNDING

Deutsche Forschungsgemeinschaft (grants MO 1048/1-1 and MO 1048/6-1 to B.M., in part). Funding for open access charge: Annual budget of department of bioinformatics, University of Göttingen. Conflict of interest statement. None declared.

REFERENCES

1. A modified definition of Sov, a segment-based measure for protein secondary structure prediction assessment.

Authors: A Zemla; C Venclovas; K Fidelis; B Rost
Journal: Proteins Date: 1999-02-01

2. The CHAOS/DIALIGN WWW server for multiple alignment of genomic sequences.

Authors: Michael Brudno; Rasmus Steinkamp; Burkhard Morgenstern
Journal: Nucleic Acids Res Date: 2004-07-01 Impact factor: 16.971

3. ProbCons: Probabilistic consistency-based multiple sequence alignment.

Authors: Chuong B Do; Mahathi S P Mahabhashyam; Michael Brudno; Serafim Batzoglou
Journal: Genome Res Date: 2005-02 Impact factor: 9.043

4. Protein multiple alignment incorporating primary and secondary structure information.

Authors: Nak-Kyeong Kim; Jun Xie
Journal: J Comput Biol Date: 2006-12 Impact factor: 1.479

5. Alignment uncertainty and genomic analysis.

Authors: Karen M Wong; Marc A Suchard; John P Huelsenbeck
Journal: Science Date: 2008-01-25 Impact factor: 47.728

6. Multiple DNA and protein sequence alignment based on segment-to-segment comparison.

Authors: B Morgenstern; A Dress; T Werner
Journal: Proc Natl Acad Sci U S A Date: 1996-10-29 Impact factor: 11.205

7. Multiple alignment of genomic sequences using CHAOS, DIALIGN and ABC.

Authors: Dirk Pöhler; Nadine Werner; Rasmus Steinkamp; Burkhard Morgenstern
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971

8. Multiple sequence alignment with user-defined anchor points.

Authors: Burkhard Morgenstern; Sonja J Prohaska; Dirk Pöhler; Peter F Stadler
Journal: Algorithms Mol Biol Date: 2006-04-19 Impact factor: 1.405

9. DIALIGN-T: an improved algorithm for segment-based multiple sequence alignment.

Authors: Amarendran R Subramanian; Jan Weyer-Menkhoff; Michael Kaufmann; Burkhard Morgenstern
Journal: BMC Bioinformatics Date: 2005-03-22 Impact factor: 3.169

10. MUSCLE: a multiple sequence alignment method with reduced time and space complexity.

Authors: Robert C Edgar
Journal: BMC Bioinformatics Date: 2004-08-19 Impact factor: 3.169
