
DIALIGN at GOBICS--multiple sequence alignment using various sources of external information.

Layal Al Ait, Zaher Yamak, Burkhard Morgenstern.

Abstract

DIALIGN is an established tool for multiple sequence alignment that is particularly useful to detect local homologies in sequences with low overall similarity. In recent years, various versions of the program have been developed, some of which are fully automated, whereas others are able to accept user-specified external information. In this article, we review some versions of the program that are available through 'Göttingen Bioinformatics Compute Server'. In addition to previously described implementations, we present a new release of DIALIGN called 'DIALIGN-PFAM', which uses hits to the PFAM database for improved protein alignment. Our software is available through http://dialign.gobics.de/.

Year: 2013 PMID: 23620293 PMCID: PMC3692126 DOI: 10.1093/nar/gkt283

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

‘DIALIGN’ is a tool for pairwise and multiple alignment of nucleic acid or protein sequences (1). The program combines global and local alignment features; its main strength is its ability to discover local homologies among sequences without detectable global homology. This makes the program particularly useful for analysing remotely related protein families or genomic sequences where functional regions are typically conserved at the primary-sequence level, whereas non-functional parts of the sequences are less conserved. In many studies, DIALIGN has been successfully used to analyse protein families or genomic sequences, see e.g. (2,3). Many versions of DIALIGN have been developed since the program was first introduced in 1996. The standard version of the program performs alignments without human intervention and is based on primary-sequence information alone. Later versions of DIALIGN can use additional sources of information or expert knowledge to produce more accurate alignments. The most recent addition is an option for protein alignment where the input sequences are searched against the Pfam database of protein domains (4). Positions of the sequences matching the same position in some Pfam domain are then preferentially aligned (5). This latest program version is outlined in the present article. During the first years, the main development work on DIALIGN was carried out at the ‘University of Bielefeld’. The ‘Bielefeld Bioinformatics Server’ (BiBiServ) still offers various program versions for online usage and for download. Later, the work on DIALIGN was continued at the ‘University of Göttingen’, and more recent versions of the program are offered via the ‘Göttingen Bioinformatics Compute Server’ (GOBICS) at www.gobics.de.

PREVIOUS VERSIONS OF DIALIGN

DIALIGN 2.2

To calculate a multiple sequence alignment (MSA), the standard version of the program, ‘DIALIGN 2.2’, first calculates all pairwise alignments of the input sequences as described in (6). That is, a ‘sparse dynamic programming’ algorithm is used to find an optimal alignment in the sense of a segment-based ‘objective function’ (7). MSAs are then calculated based on these pairwise alignments using a time-efficient greedy algorithm described in (8). No human intervention is necessary or possible. This version of the program is available through BiBiServ at http://bibiserv.techfak.uni-bielefeld.de/.

DIALIGN-TX

Greedy algorithms are fast but may be error prone. In DIALIGN, the greedy algorithm may select spurious random similarities among the input sequences that prevent the program from aligning biologically meaningful homologies. Thus, a more recent development, ‘DIALIGN-TX’ (9), uses various heuristics to reduce the influence of isolated random similarities on the resulting MSA. Among other approaches, it uses a mixture of the ‘greedy’ algorithm used in the original ‘DIALIGN’ implementation with a more classical ‘progressive’ approach. ‘DIALIGN-TX’ is available online through ‘GOBICS’ at http://dialign-tx.gobics.de/; the source code is freely available from the same URL.

Anchored DIALIGN

Most MSA programs are fully automated. That is, except for parameter tuning, they neither allow nor require any human intervention during the alignment procedure. This is adequate, of course, if no further information is available, or if large amounts of data have to be analysed automatically. Often, however, the user of an MSA program already has some expert knowledge about the sequences to be aligned, e.g. he/she may know some homologies among the input sequences that should be aligned. In such cases, it is desirable to have an MSA program that uses this expert information and aligns the remainder of the sequences automatically. The ‘anchored alignment’ option in ‘DIALIGN’ does this (10). Here, the user can specify segments of the input sequences that should be aligned with each other, so-called ‘anchor points’ for the alignment. The remainder of the sequences is then aligned automatically, respecting the constraints given by the user-selected ‘anchor points’. Technically, an ‘anchor point’ is a pair of equal-length segments from two distinct sequences. As it may not be possible to include all user-defined anchor points in one single output MSA, the program has to prioritize the proposed anchor points. To this end, ‘scores’ can be given to the selected anchor points to define their priority.
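The notion of a scored anchor point can be made concrete with a small sketch. The class and function names below (AnchorPoint, collinear, select_anchors) are hypothetical and are not part of the DIALIGN code base. The compatibility test is deliberately simplified: it only compares anchor points that join the same pair of sequences and requires them to be non-overlapping and in the same order in both sequences, whereas the consistency criterion used by DIALIGN also accounts for constraints that arise across several sequences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnchorPoint:
    """A pair of equal-length segments from two distinct sequences."""
    seq_a: int      # index of the first sequence
    start_a: int    # 0-based start of the segment in the first sequence
    seq_b: int      # index of the second sequence
    start_b: int    # 0-based start of the segment in the second sequence
    length: int     # common length of both segments
    score: float    # user-assigned priority (higher is considered first)

    def __post_init__(self):
        if self.seq_a == self.seq_b:
            raise ValueError("an anchor point must join two distinct sequences")
        if self.length <= 0:
            raise ValueError("segment length must be positive")


def collinear(p: AnchorPoint, q: AnchorPoint) -> bool:
    """Simplified check: anchors on the same sequence pair are compatible
    if one ends before the other begins in both sequences."""
    if {p.seq_a, p.seq_b} != {q.seq_a, q.seq_b}:
        return True  # different sequence pairs are not checked in this sketch
    if q.seq_a != p.seq_a:
        # Re-orient q so that its first segment lies in the same sequence as p's.
        q = AnchorPoint(q.seq_b, q.start_b, q.seq_a, q.start_a, q.length, q.score)
    p_first = (p.start_a + p.length <= q.start_a
               and p.start_b + p.length <= q.start_b)
    q_first = (q.start_a + q.length <= p.start_a
               and q.start_b + q.length <= p.start_b)
    return p_first or q_first


def select_anchors(anchors):
    """Greedily accept anchors by decreasing score, skipping conflicting ones."""
    accepted = []
    for anchor in sorted(anchors, key=lambda a: a.score, reverse=True):
        if all(collinear(anchor, other) for other in accepted):
            accepted.append(anchor)
    return accepted


candidates = [
    AnchorPoint(0, 10, 1, 12, 5, score=3.0),
    AnchorPoint(0, 20, 1, 5, 4, score=1.0),   # crosses the first anchor
    AnchorPoint(0, 30, 1, 40, 6, score=2.0),
]
chosen = select_anchors(candidates)
```

In this toy input, the anchor with score 1.0 is dropped because it would cross the higher-scoring anchor, which mirrors the role of the scores as priorities.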

Aligning long DNA sequences with DIALIGN, CHAOS and ABC

The run time of most pairwise alignment methods is proportional to the product of the sequence lengths. Thus, if long genomic sequences are aligned, program run time becomes an issue. To overcome this problem, methods for genomic sequence alignment usually start with a fast search for strong local similarities. In a second step, sequences between those similarities are aligned with a slower, but more sensitive, method. On our web server, we use the program ‘CHAOS’ (11) to quickly identify local alignments of genomic sequences; we then align the remainder of the sequences with ‘DIALIGN’. Finally, the results are visualized with the software ‘ABC’ (12), see (13) for more details. This approach is available on our server at http://dialign.gobics.de/chaos-dialign-submission.
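The effect of the two-step strategy on run time can be illustrated with simple arithmetic. The sketch below counts comparison cells only; it does not model how ‘CHAOS’ finds or chains its local alignments, and the function names and the example coordinates are hypothetical.

```python
def dp_cells(len_a: int, len_b: int) -> int:
    """Cells in a full pairwise comparison matrix (product of the lengths)."""
    return len_a * len_b


def anchored_dp_cells(len_a: int, len_b: int, anchors) -> int:
    """Cells left when only the regions between consecutive anchors are aligned.

    anchors: list of (start_a, start_b, length) tuples, assumed to be
    non-overlapping and sorted in the same order in both sequences.
    """
    total = 0
    prev_a, prev_b = 0, 0
    for start_a, start_b, length in anchors:
        # Region between the previous anchor and this one.
        total += (start_a - prev_a) * (start_b - prev_b)
        prev_a, prev_b = start_a + length, start_b + length
    # Region after the last anchor.
    total += (len_a - prev_a) * (len_b - prev_b)
    return total


full = dp_cells(100, 120)                                        # 12000
reduced = anchored_dp_cells(100, 120, [(10, 15, 5), (50, 60, 10)])  # 3550
```

With two anchors, the toy example shrinks the search space from 12000 to 3550 cells; with many anchors spread along long genomic sequences, the reduction becomes correspondingly larger.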

DIALIGN USING PFAM MATCHES

Recently, the developers of Clustal Ω proposed an approach to MSA that they called ‘External Profile Alignment’ (14). Here, the user can provide a pre-calculated ‘profile HMM’ (15) of a protein domain that he/she thinks may be present in the input sequences. Matching sequences are then locally aligned to this ‘external profile’ and thereby, indirectly, aligned to each other. In the latest version of ‘DIALIGN’, we apply this approach systematically. In short, we search all input sequences against the Pfam database of protein domains. Segments of the sequences matching to the same positions in some Pfam domain are then preferentially aligned in the final output MSA. We called this new approach ‘DIALIGN-PFAM’; a first version of this approach is described in a conference paper (5). The algorithm described later in the text is slightly different from this original version; Figure 1 shows a flowchart for our algorithm.

Figure 1.

Flowchart of DIALIGN-PFAM.

Algorithm

Each protein family in Pfam is represented by a model consisting of one or several MSAs of domains and ‘profile Hidden Markov Models’ (pHMM) derived from these alignments. The first step in our approach is to scan the input sequences against Pfam using ‘HMMER’ (16). ‘HMMER’ assigns quality scores to matches between a query protein sequence and models of protein domains in a database. To control which ‘HMMER’ hits are used by our algorithm, we use two threshold values for the E-values of these hits. Our first threshold parameter, E, applies to full models in Pfam and ensures that only models with an E-value less than E are taken into consideration. The second threshold, E, applies to single domains, so that profiles satisfying the first threshold condition are further filtered with this one. As default values, we use for E and for E.

After ‘HMMER’ matches to Pfam are obtained and filtered with our threshold parameters, the next step is to construct so-called ‘domain blocks’, which are the basis of our alignment approach. A ‘domain block’ consists of two or more segments of the input sequences that are matched, possibly with gaps, to the same Pfam domain. This way, segments from one ‘domain block’ are, indirectly, aligned to each other, i.e. two positions from the input sequences are aligned if they are matched to the same position in some Pfam domain. In a third step, the user can manually inspect the aligned ‘domain blocks’ obtained in this way and select or de-select them for the final multiple alignment step. Finally, the selected ‘domain blocks’ are used by ‘DIALIGN’ as ‘anchors’ to calculate a multiple alignment of the input sequences.

Technically, pairs of segments of the input sequences aligned to the same segment in some Pfam domain are defined as ‘anchor points’. For a single Pfam domain, it is usually possible to integrate all derived ‘anchor points’ into one output MSA. In ‘DIALIGN’ terminology, these anchor points are generally ‘consistent’ with each other. It may not be possible, however, to integrate anchor points from all the selected ‘domain blocks’ into one single output MSA. Because of such possible ‘inconsistencies’, we have to determine the priority of the selected blocks. To this end, we define for each ‘domain block’ a ‘score’ as the sum of the scores of all involved ‘HMMER’ hits to Pfam. The priority of an anchor point is then defined according to this score; anchor points derived from our domain blocks are considered in the order of decreasing scores. That is, our program first accepts all anchor points from the ‘domain block’ with the highest score, then the anchor points from the block with the second highest score (as long as they are consistent with the already accepted anchor points), and so forth.
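The filtering and prioritization of ‘domain blocks’ described above can be sketched as follows. This is an illustration under stated assumptions, not the DIALIGN-PFAM implementation: the Hit record, the function build_domain_blocks, the sequence and domain labels, and in particular the threshold values passed in the example are hypothetical placeholders and are not the program's defaults. The sketch also assumes that a block needs segments from at least two distinct input sequences, since an anchor point joins two distinct sequences.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    seq_id: str           # input sequence that produced the hit
    domain: str           # matched Pfam domain
    model_evalue: float   # E-value of the match to the full model
    domain_evalue: float  # E-value of the single-domain match
    score: float          # score reported for the hit


def build_domain_blocks(hits, max_model_evalue, max_domain_evalue):
    """Filter hits by both E-value thresholds, group them by domain and
    return (domain, hits, block_score) tuples sorted by decreasing score."""
    by_domain = defaultdict(list)
    for hit in hits:
        if (hit.model_evalue < max_model_evalue
                and hit.domain_evalue < max_domain_evalue):
            by_domain[hit.domain].append(hit)

    blocks = []
    for domain, members in by_domain.items():
        # Assumption: a block needs segments from at least two sequences.
        if len({hit.seq_id for hit in members}) >= 2:
            block_score = sum(hit.score for hit in members)
            blocks.append((domain, members, block_score))

    blocks.sort(key=lambda block: block[2], reverse=True)
    return blocks


hits = [
    Hit("s1", "A", 1e-10, 1e-8, 50.0),
    Hit("s2", "A", 1e-9, 1e-7, 40.0),
    Hit("s1", "B", 1e-12, 1e-9, 120.0),
    Hit("s3", "B", 1e-3, 1e-2, 5.0),    # fails both placeholder thresholds
    Hit("s2", "C", 1e-20, 1e-15, 30.0),
    Hit("s3", "C", 1e-20, 1e-15, 35.0),
]
# Placeholder thresholds chosen for the example only.
blocks = build_domain_blocks(hits, max_model_evalue=1e-5, max_domain_evalue=1e-5)
```

In the example, domain "B" yields no block because only one sequence passes the filter, and the remaining blocks are ordered "A" (score 90.0) before "C" (score 65.0). Anchor points would then be taken from the blocks in this order, subject to the consistency check.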

Interactive selection of blocks

After all ‘domain blocks’ have been calculated as described earlier in the text, the user has the option to view these blocks in two different ways. A ‘local view’ shows the local MSA, possibly containing gaps, that has been derived from all matches to one specific Pfam domain. In addition, a ‘global view’ of a given block is provided, showing the non-aligned full input sequences with the segments from the block highlighted. By default, all the constructed blocks are included in the multiple alignment process, but the user can decide to discard an arbitrary number of blocks.

In our original conference paper (5), we reported benchmark results on ‘BAliBASE’ (17) and ‘SABmark’ (18) for a previous version of our algorithm. In short, ‘DIALIGN’ using Pfam hits performed consistently better than the standard version of the program that uses primary-sequence information alone. The modified algorithm outlined in the present article produces slightly better results than the original version described in (5) and is also considerably faster. We are planning to give a detailed comparison of these two algorithms in an extended journal version of our conference paper.

Input/Output

‘DIALIGN-PFAM’ takes as input a file in ‘FASTA’ format containing a set of protein sequences. The user can adjust the threshold parameters E and E for the Pfam search; default values are provided. As scanning Pfam with ‘HMMER’ may take a while, the user is given a URL where he/she can retrieve the results of the HMMER search later, to continue with the next step of the program. Figure 2 shows the local and global views of a simple ‘domain block’ involving three sequences identified from an input set of seven protein sequences. As the final alignment process by DIALIGN may also take some time, the user is given another URL to retrieve the final MSA later. The results of a program run are stored and can be downloaded from our server for 1 week. ‘DIALIGN-PFAM’ is available online at http://dialign-pfam.gobics.de/SequenceAlignment/.

Figure 2.

Example program run with DIALIGN-PFAM. An input file with seven protein sequences was uploaded to our server. Our program used HMMER to search each of the seven input sequences against Pfam. Overall, matches to five different Pfam domains were found by HMMER. (a) Each line in the first table corresponds to one of the matched Pfam domains, e.g. the first line corresponds to the Thioredoxin domain. The second column indicates how many of the input sequences matched to the respective domain (e.g. five of our seven input sequences matched to the Thioredoxin domain). By clicking ‘View’ in the third and fourth column, respectively, the user can look at ‘alignments’ of the Pfam matches and at their positions within the input sequences. The checkboxes on the left-hand side can be used to select/deselect matches to Pfam domains as anchor points for the final MSA calculated by our program. By default, all matches are selected. (b) The second table is obtained by clicking ‘View’ in the third column of table (a). It shows a multiple alignment of segments of the input sequences matching to the same Pfam domain (so-called ‘local view’). In our example (b), three input sequences (1grx_, 1erv_, 1j0f_A) were matched to the same Pfam domain. The alignment in (b) was constructed by our program by aligning those sequence positions to each other that were matched by HMMER to the same position in the corresponding Pfam domain. (c) The third table is obtained by clicking ‘View’ in the fourth column of the table in (a). It shows the ‘global view’, i.e. the positions of the matching segments in the respective input sequences. Segments matched by HMMER to the corresponding Pfam domain are shown in red.

Example

Figure 2 shows an example of how ‘domain blocks’ are shown to the user by DIALIGN-PFAM. Here, we ran the program on a set of seven protein sequences. Matches to five different Pfam domains were found by ‘HMMER’. As shown in Figure 2a, five of the input sequences had matches to the Thioredoxin domain, three sequences had matches to the Glutaredoxin domain, two sequences had matches to the SH3BGR domain, five sequences had matches to the AhpC-TSA domain and three sequences had matches to the Redoxin domain. In Figure 2b, the ‘local’ view of the Glutaredoxin domain block is shown. Figure 2c shows the ‘global’ view of this domain block within the input sequences; here, matches to the Glutaredoxin domain are shown in red.
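A ‘local view’ such as the one in Figure 2b is obtained by placing residues in the same column whenever they were matched to the same position of the Pfam domain. The sketch below shows this idea on invented toy data; the function name local_view, the sequence labels and the residues are hypothetical. It also assumes that every residue of interest has been assigned to a model position and ignores residues that a profile search would report as insertions between model positions.

```python
def local_view(segments):
    """Build gapped rows from per-sequence maps {model_position: residue}.

    Residues assigned to the same model position end up in the same column;
    positions without a residue in a given sequence are shown as '-'.
    """
    columns = sorted({pos for mapping in segments.values() for pos in mapping})
    return {
        seq_id: "".join(mapping.get(pos, "-") for pos in columns)
        for seq_id, mapping in segments.items()
    }


# Toy input: three segments matched to positions 1..5 of one domain model.
segments = {
    "seqA": {1: "M", 2: "K", 3: "V", 5: "L"},
    "seqB": {2: "R", 3: "V", 4: "T", 5: "L"},
    "seqC": {1: "M", 3: "I", 4: "S"},
}
rows = local_view(segments)
for seq_id, row in rows.items():
    print(f"{seq_id}  {row}")
```

Each pair of residues that shares a column in such a view corresponds to a pair of input positions that can be handed to the alignment step as part of an anchor point.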

15 in total

1. Analysis of vertebrate SCL loci identifies conserved enhancers.

Authors: B Göttgens; L M Barton; J G Gilbert; A J Bench; M J Sanchez; S Bahn; S Mistry; D Grafham; A McMurray; M Vaudin; E Amaya; D R Bentley; A R Green; A M Sinclair
Journal: Nat Biotechnol Date: 2000-02 Impact factor: 54.908

2. The CHAOS/DIALIGN WWW server for multiple alignment of genomic sequences.

Authors: Michael Brudno; Rasmus Steinkamp; Burkhard Morgenstern
Journal: Nucleic Acids Res Date: 2004-07-01 Impact factor: 16.971

3. BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark.

Authors: Julie D Thompson; Patrice Koehl; Raymond Ripp; Olivier Poch
Journal: Proteins Date: 2005-10-01

4. Multiple DNA and protein sequence alignment based on segment-to-segment comparison.

Authors: B Morgenstern; A Dress; T Werner
Journal: Proc Natl Acad Sci U S A Date: 1996-10-29 Impact factor: 11.205

5. HMMER web server: interactive sequence similarity searching.

Authors: Robert D Finn; Jody Clements; Sean R Eddy
Journal: Nucleic Acids Res Date: 2011-05-18 Impact factor: 16.971

6. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega.

Authors: Fabian Sievers; Andreas Wilm; David Dineen; Toby J Gibson; Kevin Karplus; Weizhong Li; Rodrigo Lopez; Hamish McWilliam; Michael Remmert; Johannes Söding; Julie D Thompson; Desmond G Higgins
Journal: Mol Syst Biol Date: 2011-10-11 Impact factor: 11.429

7. Multiple sequence alignment with user-defined anchor points.

Authors: Burkhard Morgenstern; Sonja J Prohaska; Dirk Pöhler; Peter F Stadler
Journal: Algorithms Mol Biol Date: 2006-04-19 Impact factor: 1.405

8. ABC: software for interactive browsing of genomic multiple sequence alignment data.

Authors: Gregory M Cooper; Senthil A G Singaravelu; Arend Sidow
Journal: BMC Bioinformatics Date: 2004-12-08 Impact factor: 3.169

9. The Pfam protein families database.

Authors: Robert D Finn; John Tate; Jaina Mistry; Penny C Coggill; Stephen John Sammut; Hans-Rudolf Hotz; Goran Ceric; Kristoffer Forslund; Sean R Eddy; Erik L L Sonnhammer; Alex Bateman
Journal: Nucleic Acids Res Date: 2007-11-26 Impact factor: 16.971

10. Fast and sensitive multiple alignment of large genomic sequences.

Authors: Michael Brudno; Michael Chapman; Berthold Göttgens; Serafim Batzoglou; Burkhard Morgenstern
Journal: BMC Bioinformatics Date: 2003-12-23 Impact factor: 3.169
