Multiple alignment of genomic sequences using CHAOS, DIALIGN and ABC.

Dirk Pöhler, Nadine Werner, Rasmus Steinkamp, Burkhard Morgenstern.

Abstract

Comparative analysis of genomic sequences is a powerful approach to discover functional sites in these sequences. Herein, we present a WWW-based software system for multiple alignment of genomic sequences. We use the local alignment tool CHAOS to rapidly identify chains of pairwise similarities. These similarities are used as anchor points to speed up the DIALIGN multiple-alignment program. Finally, the visualization tool ABC is used for interactive graphical representation of the resulting multiple alignments. Our software is available at Göttingen Bioinformatics Compute Server (GOBICS) at http://dialign.gobics.de/chaos-dialign-submission.

Year: 2005 PMID: 15980528 PMCID: PMC1160147 DOI: 10.1093/nar/gki386

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

During the last few years, cross-species sequence comparison has become a widely used approach to genome sequence analysis. The underlying idea is that functional regions of genomic sequences tend to be more conserved during evolution than non-functional parts. Thus, islands of local sequence similarity among two or several genomic sequences usually indicate biological functionality. This phylogenetic footprinting principle has been used by many researchers to detect novel functional elements in genomic sequences. Genomic sequence comparison has been used for gene prediction (1–5), to discover regulatory elements (6,7) and to study genomic duplications (8,9). Recently, multiple sequence comparison has been used to identify signature sequences of bacteria and viruses for rapid detection of pathogenic microorganisms as part of the US biodefense program (10) and to detect non-coding functional RNA (11). All these studies rely on pair-wise or multiple alignments of genomic sequences; their accuracy is therefore limited by the accuracy of the underlying alignment tools. Consequently, development of algorithms for genomic sequence alignment has become a high priority in bioinformatics research; see (12,13) for a survey. A systematic evaluation of the currently used software tools for multiple alignment of genomic sequences has been carried out by Pollard et al. (14).

THE CHAOS/DIALIGN APPROACH

DIALIGN is a general-purpose alignment program that combines global and local alignment features (15,16). Such an approach is particularly appropriate for genomic sequences, where locally conserved regions may be separated by unrelated parts of the sequences. As a stand-alone tool, however, DIALIGN is too slow to align long genomic sequences, because its running time grows quadratically with the average sequence length. Therefore, an anchoring option has been implemented: user-specified anchor points can be used to reduce the alignment search space, thereby improving the program running time (17). To find suitable anchor points, we use the local alignment program CHAOS (18). In a first step, our system applies CHAOS to identify chains of local similarities among all pairs of input sequences in a multiple sequence set. In a second step, DIALIGN is used to accurately align the regions between the similarities identified by CHAOS. Our anchored-alignment approach can be applied to pair-wise as well as multiple alignment. For multiple alignment, CHAOS is run on all possible pairs of input sequences. The resulting local pair-wise similarities are then checked for consistency by DIALIGN, and inconsistent ones are eliminated. This procedure is similar to the greedy approach that DIALIGN uses to construct multiple alignments; see (16).
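The effect of anchoring can be illustrated with a small sketch for the pair-wise case. It is not the CHAOS or DIALIGN implementation: the `Anchor` record, its `score` field and the simple "one anchor lies entirely before the other in both sequences" consistency test are assumptions made for illustration, and the consistency check DIALIGN performs across several sequences is more involved. The sketch accepts candidate anchors greedily by score and then counts how many cells of the alignment matrix remain between the accepted anchors.

```python
from typing import List, NamedTuple


class Anchor(NamedTuple):
    """Hypothetical anchor: a gap-free local similarity between sequences A and B."""
    start_a: int   # 0-based start in sequence A
    start_b: int   # 0-based start in sequence B
    length: int    # number of aligned positions
    score: float   # priority used for the greedy selection (assumed)


def consistent(x: Anchor, y: Anchor) -> bool:
    """Two anchors fit into one alignment if one ends before the other starts in both sequences."""
    x_first = x.start_a + x.length <= y.start_a and x.start_b + x.length <= y.start_b
    y_first = y.start_a + y.length <= x.start_a and y.start_b + y.length <= x.start_b
    return x_first or y_first


def select_anchors(candidates: List[Anchor]) -> List[Anchor]:
    """Greedily keep the highest-scoring anchors that do not conflict with earlier choices."""
    accepted: List[Anchor] = []
    for cand in sorted(candidates, key=lambda a: a.score, reverse=True):
        if all(consistent(cand, kept) for kept in accepted):
            accepted.append(cand)
    return sorted(accepted, key=lambda a: a.start_a)


def search_space(anchors: List[Anchor], len_a: int, len_b: int) -> int:
    """Matrix cells left to align between consecutive anchors (anchors sorted and consistent)."""
    cells, prev_a, prev_b = 0, 0, 0
    for a in anchors:
        cells += (a.start_a - prev_a) * (a.start_b - prev_b)
        prev_a, prev_b = a.start_a + a.length, a.start_b + a.length
    return cells + (len_a - prev_a) * (len_b - prev_b)


candidates = [
    Anchor(10, 12, 5, 9.0),
    Anchor(30, 20, 4, 3.0),   # crosses the anchor at (20, 40) and is rejected
    Anchor(20, 40, 6, 7.0),
    Anchor(50, 60, 5, 5.0),
]
anchors = select_anchors(candidates)
remaining = search_space(anchors, 100, 100)
```

With these made-up values, three of the four candidates survive, and the regions left between them cover 2146 matrix cells instead of the 10000 cells of the full 100 × 100 matrix.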

ALIGNMENT VISUALIZATION WITH ABC

Alignments of large genomic sequences are hard to interpret without specialized visualization tools. ABC (Application for Browsing Constraints) is an interactive Java tool that has recently been developed by Cooper et al. (19) for intuitive and efficient exploration of multiple alignments of genomic sequences. It can be used to move quickly from a summary view of the entire alignment via arbitrary levels of resolution down to the level of individual nucleotides. ABC can also graphically represent additional information, such as the degree of local sequence conservation, or annotation data, such as the locations of genes (Figure 1).

Figure 1

Visualization of multiple alignments using ABC (19). The user can interactively switch between different levels from a global view of the output alignment down to the level of individual residues.

At our server, we offer ABC to visualize multiple alignments produced by CHAOS and DIALIGN. The degree of local similarity among the input sequences is graphically represented based on the weight scores that DIALIGN uses to assess local similarity among the sequences to be analyzed. The standard DIALIGN output file represents the degree of local similarity in a pair-wise or multiple alignment using stars or numbers below the alignment. For each alignment column, the weight scores of all fragments connecting residues at this column are summed up and normalized; see (16) for a precise definition of fragment weights. We use the same measure of local sequence similarity for graphical representation by ABC. Note that this is only a rough measure of sequence conservation. Columns with identical nucleotide composition may receive different similarity values if they are connected by fragments with different weight scores. It is also important to keep in mind that our similarity values are not absolute values but are normalized such that, in every alignment, the column of maximum local similarity obtains a certain fixed score. Nevertheless, our graphical representation gives a good overview of the local degree of conservation among a sequence set.
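The column measure described above can be sketched as follows. The representation of a fragment as a `(start_column, length, weight)` triple and the fixed maximum of 9 are assumptions made for this illustration; the precise definition of fragment weights is given in (16), and this is not the code used by DIALIGN or ABC.

```python
from typing import List, Tuple

# Hypothetical fragment representation: (first alignment column, length, weight score)
Fragment = Tuple[int, int, float]


def column_similarity(n_columns: int, fragments: List[Fragment],
                      max_score: float = 9.0) -> List[float]:
    """Sum the weights of all fragments covering each column, then rescale
    so that the best column of this alignment receives max_score."""
    sums = [0.0] * n_columns
    for start, length, weight in fragments:
        for col in range(start, start + length):
            sums[col] += weight
    peak = max(sums, default=0.0)
    if peak == 0.0:
        return sums  # no fragments: every column scores zero
    return [max_score * s / peak for s in sums]


# Two overlapping fragments on a six-column alignment
scores = column_similarity(6, [(0, 3, 2.0), (2, 3, 4.0)])
```

Because the values are rescaled per alignment, they are comparable between columns of one alignment but not between different alignments, which is the caveat noted in the text.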

THE CHAOS/DIALIGN/ABC WWW SERVER

The input data for our web server is a single text file containing two or several genomic sequences in FASTA format. The maximum total length of the input sequences is currently 3 MB. The server runs CHAOS and DIALIGN on the input sequences. Visualization of the results with ABC can be chosen as an additional option; this requires that Java is installed on the user's computer. For small input data, the resulting alignment is immediately shown on the computer screen, either in standard DIALIGN format or using ABC if this option has been chosen. For larger sequence sets, the program output is stored at our server, and the corresponding web addresses are sent to the user by email.

Different output files are created: (i) the output alignment in DIALIGN format, (ii) the same alignment in FASTA format, (iii) a list of fragments, i.e. local segment pairs, that are used as building blocks for the DIALIGN alignment, and (iv) a list of anchor points identified by CHAOS. These files are provided as plain text files. In addition, the optional ABC output is stored at the server together with these standard output files. Alignments in DIALIGN format contain additional information about the degree of local sequence similarity in the multiple alignment. Also, the program distinguishes between nucleotides that could be aligned and nucleotides with no statistically significant matches to the compared sequences. Upper-case and lower-case letters are used to indicate which nucleotides are considered to be aligned. This output format and the ABC output are designed for visual inspection of the returned alignments. The output in FASTA format contains essentially the same information but is more appropriate for further automatic analysis, as most sequence analysis programs accept FASTA-formatted files as input data.

The list of returned fragments is annotated with additional information that may be useful for more detailed analyses. This includes quality scores (so-called weights) of the fragments, indicating the degree of local sequence similarity. In addition, calculated overlap weights are returned. Overlap weights reflect not only the similarity between two segments but also the degree of overlap with other segment pairs involving different pairs of sequences, as described in (15). Finally, the fragment list states for each fragment whether it was consistent with other fragments and could be included in the multiple alignment, or whether it had to be rejected because of inconsistency. The fragment list is also designed for automated post-processing: it is easy to parse and contains more information than the resulting alignment alone. In addition to the fragment list, a list of anchor points created by CHAOS is returned. Our WWW server provides detailed online help regarding input and output formats.
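Before submitting data, the input requirements stated at the start of this section (a single FASTA file with at least two sequences and a bounded total length) can be checked locally. The sketch below is an assumption-laden illustration and not part of the server: the function names are hypothetical, and the 3 MB limit is read here as 3,000,000 residues, which may differ from how the server measures it.

```python
from typing import List, Tuple

# Assumption: the "3 MB" limit is interpreted as three million residues in total.
MAX_TOTAL_LENGTH = 3_000_000


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted text."""
    records: List[Tuple[str, str]] = []
    name, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_input(text: str, max_total: int = MAX_TOTAL_LENGTH) -> int:
    """Validate the sequence count and total length; return the total length."""
    records = parse_fasta(text)
    if len(records) < 2:
        raise ValueError("at least two sequences are required")
    total = sum(len(seq) for _, seq in records)
    if total > max_total:
        raise ValueError(f"total length {total} exceeds limit {max_total}")
    return total


total_length = check_input(">seq1\nACGT\nAC\n>seq2\nGGT\n")
```

A file that fails either check raises `ValueError`, so the problem is caught before any data is uploaded.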

AVAILABILITY

Our software is available through Göttingen Bioinformatics Compute Server (GOBICS): .

18 in total

1. Integrating genomic homology into gene structure prediction.

Authors: I Korf; P Flicek; D Duan; M R Brent
Journal: Bioinformatics Date: 2001 Impact factor: 6.937

2. SGP-1: prediction and validation of homologous genes based on sequence alignments.

Authors: T Wiehe; S Gebauer-Jung; T Mitchell-Olds; R Guigó
Journal: Genome Res Date: 2001-09 Impact factor: 9.043

3. Analysis of vertebrate SCL loci identifies conserved enhancers.

Authors: B Göttgens; L M Barton; J G Gilbert; A J Bench; M J Sanchez; S Bahn; S Mistry; D Grafham; A McMurray; M Vaudin; E Amaya; D R Bentley; A R Green; A M Sinclair
Journal: Nat Biotechnol Date: 2000-02 Impact factor: 54.908

4. An applications-focused review of comparative genomics tools: capabilities, limitations and future challenges.

Authors: Patrick Chain; Stefan Kurtz; Enno Ohlebusch; Tom Slezak
Journal: Brief Bioinform Date: 2003-06 Impact factor: 11.622

5. AGenDA: homology-based gene prediction.

Authors: Leila Taher; Oliver Rinner; Saurabh Garg; Alexander Sczyrba; Michael Brudno; Serafim Batzoglou; Burkhard Morgenstern
Journal: Bioinformatics Date: 2003-08-12 Impact factor: 6.937

6. Surveying phylogenetic footprints in large gene clusters: applications to Hox cluster duplications.

Authors: Sonja J Prohaska; Claudia Fried; Christoph Flamm; Günter P Wagner; Peter F Stadler
Journal: Mol Phylogenet Evol Date: 2004-05 Impact factor: 4.286

7. Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons.

Authors: G G Loots; R M Locksley; C M Blankespoor; Z E Wang; W Miller; E M Rubin; K A Frazer
Journal: Science Date: 2000-04-07 Impact factor: 47.728

8. Independent Hox-cluster duplications in lampreys.

Authors: Claudia Fried; Sonja J Prohaska; Peter F Stadler
Journal: J Exp Zool B Mol Dev Evol Date: 2003-10-15 Impact factor: 2.656

9. Benchmarking tools for the alignment of functional noncoding DNA.

Authors: Daniel A Pollard; Casey M Bergman; Jens Stoye; Susan E Celniker; Michael B Eisen
Journal: BMC Bioinformatics Date: 2004-01-21 Impact factor: 3.169

10. Fast and sensitive multiple alignment of large genomic sequences.

Authors: Michael Brudno; Michael Chapman; Berthold Göttgens; Serafim Batzoglou; Burkhard Morgenstern
Journal: BMC Bioinformatics Date: 2003-12-23 Impact factor: 3.169
