
One-Block CYRCA: an automated procedure for identifying multiple-block alignments from single block queries.

Milana Frenkel-Morgenstern, Alice Singer, Hagit Bronfeld, Shmuel Pietrokovski.

Abstract

One-Block CYRCA is an automated procedure for identifying multiple-block alignments from single block queries (http://bioinfo.weizmann.ac.il/blocks/OneCYRCA). It is based on the LAMA and CYRCA block-to-block alignment methods. The procedure identifies whether the query blocks can form new multiple-block alignments (block sets) with blocks from a database or join pre-existing database block sets. Using pre-computed LAMA block alignments and CYRCA sets from the Blocks database reduces the computation time. LAMA and CYRCA are highly sensitive and selective methods that can augment many other sequence analysis approaches.

Year: 2005 PMID: 15980470 PMCID: PMC1160248 DOI: 10.1093/nar/gki488

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Comparison of multiple sequence alignments (profiles) with other profiles can identify subtle protein relationships beyond the resolution of sequence-to-sequence or sequence-to-profile comparisons (1–5,7–9). The main advantages of using profiles instead of sequences are better characterization of the compared regions and the possibility of giving more weight to, or using only, conserved regions. Using only conserved regions significantly reduces the search space and avoids possibly spurious hits from non-conserved and misaligned regions. Blocks are local ungapped profiles of the most conserved regions of protein families and domains (2). LAMA is a profile-to-profile alignment method, previously developed by us, for comparing blocks with each other and for searching databases of blocks with block queries (5). It is a highly sensitive method for detecting sequence similarities that are often not found by other profile-to-profile and sequence-to-profile methods (6). LAMA alignments do not use gaps, since the compared profiles are short and are themselves constructed from ungapped conserved regions.

CYRCA is a method for detecting weak protein sequence similarities by aligning multiple blocks (3). The resulting multiple-block alignments are identified as block sets with consistent and transitive relationships, derived from pairwise block alignments previously found by LAMA. Namely, if blocks A, B and C are aligned to each other in the same phase in overlapping regions, then these blocks are probably genuinely similar to each other, even if each pairwise alignment score is insignificant by itself (Figure 1). CYRCA implements this approach by using graph theory and a bottom-up algorithm. Blocks are represented as graph nodes and their LAMA alignments as the graph edges. The simplest transitive block relationship is a triangle graph (a cycle of three blocks).

CYRCA first identifies consistent triangles, then joins triangles with common edges and finally adds linear edges that have very high alignment scores. CYRCA sets are thus identified from large-scale LAMA comparisons of many blocks with each other, typically using the whole Blocks database. These comparisons take a few days to compute. CYRCA analyses are used to annotate the Blocks database. Analysis of specific blocks has identified biologically significant and genuine relationships (10,11), but it requires manual intervention.
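The bottom-up set construction described above can be illustrated with a minimal Python sketch. It is not the published CYRCA implementation: the function name `build_block_sets`, the input layout and the single-pass handling of linear edges are assumptions made for illustration. The default of 8.0 is the server's 'Linear edge' default quoted in this paper. Triangles that were already judged consistent are merged when they share an edge, and a high-scoring linear edge then attaches an outside block to a set.

```python
from itertools import combinations


def build_block_sets(triangles, scored_edges, linear_threshold=8.0):
    """Merge consistent triangles that share an edge, then add linear edges.

    triangles: 3-tuples of block names already judged consistent.
    scored_edges: (block_a, block_b, score) tuples (hypothetical layout).
    """
    sets = []  # each entry is (blocks, edges); edges are frozenset pairs
    for tri in triangles:
        blocks = set(tri)
        edges = {frozenset(pair) for pair in combinations(tri, 2)}
        untouched = []
        for other_blocks, other_edges in sets:
            if edges & other_edges:  # a common edge joins the two sets
                blocks |= other_blocks
                edges |= other_edges
            else:
                untouched.append((other_blocks, other_edges))
        sets = untouched + [(blocks, edges)]

    # Simplification: one pass over the edges, attaching an edge when
    # exactly one of its two blocks already belongs to a set.
    for a, b, score in scored_edges:
        if score < linear_threshold:
            continue
        for blocks, edges in sets:
            if (a in blocks) != (b in blocks):
                blocks.update((a, b))
                edges.add(frozenset((a, b)))
    return [sorted(blocks) for blocks, _ in sets]


# Hypothetical block names and scores, for illustration only.
example_sets = build_block_sets(
    [("A", "B", "C"), ("B", "C", "D"), ("X", "Y", "Z")],
    [("D", "E", 9.1), ("Z", "W", 6.0)],
)
```

In this toy input the first two triangles share the edge B-C and merge into one set, the edge D-E is strong enough to be added as a linear edge, and Z-W falls below the threshold and is ignored.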

Figure 1

A graphical example of a consistent set of aligned blocks. (a) Three consistently aligned pairs of blocks A, B and C are presented. Blocks are presented as rectangles with one position marked by a vertical line. The aligned region is shown for each pair of aligned blocks. (b) A consistent CYRCA set obtained from the block alignments shown in (a). Such basic consistent graphs are then joined to form larger consistent sets (3).
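The consistency condition illustrated in Figure 1 can be sketched in a few lines of Python. This is a hedged illustration, not the authors' code: the `Alignment` record, the offset convention (column i of the first block is aligned with column i + offset of the second) and the `widths` mapping are assumptions introduced here. Three pairwise alignments are treated as consistent when their offsets agree (same phase) and the three blocks share at least one common column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    a: str       # name of the first block
    b: str       # name of the second block
    offset: int  # column i of block a aligns with column i + offset of block b


def is_consistent_triangle(ab, bc, ac, widths):
    """Return True if A-B, B-C and A-C agree in phase and overlap."""
    if not (ab.a == ac.a and ab.b == bc.a and bc.b == ac.b):
        raise ValueError("alignments must be given as A-B, B-C, A-C")
    # Same phase: going A -> B -> C must give the same shift as A -> C.
    if ab.offset + bc.offset != ac.offset:
        return False
    # Place every block in the coordinate frame of block A.
    starts = {ab.a: 0, ab.b: -ab.offset, ac.b: -ac.offset}
    left = max(starts.values())
    right = min(starts[name] + widths[name] for name in starts)
    return right > left  # at least one column common to all three blocks


# Hypothetical block widths and offsets, for illustration only.
widths = {"A": 10, "B": 12, "C": 8}
ab = Alignment("A", "B", 2)
bc = Alignment("B", "C", -1)
print(is_consistent_triangle(ab, bc, Alignment("A", "C", 1), widths))  # True
print(is_consistent_triangle(ab, bc, Alignment("A", "C", 3), widths))  # False
```

The check deliberately ignores alignment scores, reflecting the point made above that a consistent triangle can be informative even when each pairwise score is insignificant on its own.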

Here we present a procedure and a web server for automatically adding new blocks to previously constructed CYRCA sets or constructing new ones.

ONE-BLOCK CYRCA ALGORITHM

In the first step of the algorithm, each query block is compared by LAMA with the database for which CYRCA sets were previously computed (the current version of the Blocks database). All hits above the user-specified score threshold are retained. Next, all of the blocks found to be similar to the query or queries are compared by LAMA with each other; in practice, this step uses pre-computed LAMA results. The resulting query hits (graph edges), together with the edges found within the database, are then examined using the CYRCA algorithm. This can identify new sets or join the block queries to existing CYRCA sets (Figure 2).
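The edge-collection steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the server's code: `collect_edges`, the dictionary layouts and the example scores are hypothetical, and applying the same threshold to the pre-computed database edges is an assumption the text does not confirm. The actual LAMA comparison is taken as given, so the sketch starts from scores that are already available.

```python
def collect_edges(query, query_hits, precomputed, threshold=5.6):
    """Gather the graph edges that would be handed to the CYRCA step.

    query: name of the query block.
    query_hits: {database_block: score} from comparing the query with the database.
    precomputed: {(block_a, block_b): score} for pairs within the database.
    """
    # Step 1: keep only the query hits at or above the threshold.
    kept = {block: score for block, score in query_hits.items()
            if score >= threshold}
    edges = [(query, block, score) for block, score in sorted(kept.items())]

    # Step 2: look up pre-computed alignments among the retained hits
    # instead of re-running the pairwise comparisons.
    for (a, b), score in sorted(precomputed.items()):
        if a in kept and b in kept and score >= threshold:
            edges.append((a, b, score))
    return edges


# Hypothetical block names and scores, for illustration only.
edges = collect_edges(
    "query1",
    {"BL0001": 6.0, "BL0002": 4.0, "BL0003": 7.1},
    {("BL0001", "BL0003"): 9.0, ("BL0001", "BL0002"): 8.0},
)
```

Here `BL0002` is dropped because its query score is below the threshold, so the pre-computed edge between `BL0001` and `BL0002` is never considered.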

Figure 2

Flow diagram of the One-Block CYRCA procedure.

The One-Block CYRCA procedure analyzes the relationship of one or a few block queries to a database, whereas the basic CYRCA procedure inter-compares a whole database. This smaller scope allows us to use a more sensitive, but more time-consuming, value for the CYRCA cycle size parameter. One-Block CYRCA sets are identified by first locating consistent cycles of any size, not just triangles as in the basic method of Kunin et al. (3). This allows a more in-depth analysis by One-Block CYRCA.
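Searching for consistent cycles larger than triangles can be illustrated with a small depth-first search. The sketch below is an assumption-laden illustration rather than the published algorithm: the function name `consistent_cycles`, the `(block_a, block_b, offset)` edge layout and the rule that a cycle is consistent when its alignment offsets sum to zero are introduced here, and the overlap of the aligned regions is not checked.

```python
def consistent_cycles(edges, start, max_size):
    """Find simple cycles through `start` whose offsets sum to zero.

    edges: (block_a, block_b, offset) tuples, where column i of block_a
    is aligned with column i + offset of block_b (assumed convention).
    max_size: largest number of blocks allowed in a cycle.
    """
    adjacency = {}
    for a, b, offset in edges:
        adjacency.setdefault(a, {})[b] = offset
        adjacency.setdefault(b, {})[a] = -offset  # reverse direction

    found = []

    def walk(node, path, total):
        for nxt, offset in adjacency.get(node, {}).items():
            if nxt == start and len(path) >= 3:
                # Closing the cycle: the phase must return to zero.
                # path[1] < path[-1] keeps one of the two directions.
                if total + offset == 0 and path[1] < path[-1]:
                    found.append(tuple(path))
            elif nxt not in path and len(path) < max_size:
                walk(nxt, path + [nxt], total + offset)

    walk(start, [start], 0)
    return found


# Hypothetical blocks and offsets, for illustration only.
toy_edges = [("Q", "A", 2), ("A", "B", -1), ("Q", "B", 1),
             ("B", "C", 3), ("Q", "C", 4)]
triangles_only = consistent_cycles(toy_edges, "Q", max_size=3)
with_larger_cycles = consistent_cycles(toy_edges, "Q", max_size=4)
```

Raising `max_size` from 3 to 4 lets the search also report the four-block cycle Q-A-B-C. The number of candidate paths grows quickly with the cycle size, which is in line with the remark above that the larger setting is more time-consuming.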

DESCRIPTION OF THE WEB INTERFACE

Input to the server is one or more blocks supplied by the user. The blocks can be in the Blocks database format (2) or in another commonly used multiple sequence alignment format (multiple FASTA, CLUSTAL or MSF). These latter formats can be found in many multiple alignment databases, such as Pfam (12), CDD (13) and SMART (14), and as the output of multiple alignment programs such as MEME (15), T-COFFEE (16) and DIALIGN (17). From these alignment types only ungapped regions wider than four columns will be used. The user can upload the input from a local file or can paste it into the query window. If an email address is provided, the output will be sent to it.

Parameter default values are supplied but can be changed by the user. The default Z-score threshold for LAMA alignment significance (5.6) corresponds to a ∼1% significance level. The default value of the 'Linear edge' threshold score parameter (8.0) is more selective, since it is used for adding linear edges, whose consistency cannot be checked, to CYRCA sets (3).

The output of the server includes the CYRCA sets found with the query blocks. These can be expanded pre-computed sets or new sets. If the queries were from the Blocks database, it is possible that the sets will be unchanged pre-computed ones. The sets are shown with the description of their blocks, a list of all the set edges (pairwise alignments) and the phase alignment of all blocks. There are also links to the block entries in the Blocks database, to interactive graph representations and to superimpositions of structures present in different blocks of each set (Figure 3).
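The rule that only ungapped regions wider than four columns are used from gapped alignment formats can be illustrated with a short Python sketch. It is a guess at the behaviour described, not the server's code: the function name `ungapped_regions`, the gap characters `-` and `.`, and the example sequences are assumptions.

```python
def ungapped_regions(alignment, min_width=5, gap_chars="-."):
    """Return (start, end) column ranges that contain no gaps.

    alignment: list of equal-length aligned sequence strings.
    min_width: smallest width kept; 5 means "wider than four columns".
    The end index is exclusive, as in Python slices.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")

    regions = []
    start = None
    # One extra iteration closes a region that reaches the last column.
    for col in range(length + 1):
        gapped = col == length or any(seq[col] in gap_chars for seq in alignment)
        if not gapped and start is None:
            start = col
        elif gapped and start is not None:
            if col - start >= min_width:
                regions.append((start, col))
            start = None
    return regions


# Hypothetical two-sequence alignment, for illustration only.
toy_alignment = [
    "MKTAYIAK-QRQISFVK",
    "MKSAYLAKGQRQ-SFVK",
]
regions = ungapped_regions(toy_alignment)
```

In this toy alignment the ungapped runs span 8, 3 and 4 columns, so only the first one, columns 0 to 7, is kept; the final run is exactly four columns wide and is therefore excluded.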

Figure 3

Representative output from the One-Block CYRCA server.

EXAMPLE

HNH and GIY-YIG nuclease domains are often accompanied by regions with conserved sequence motifs. We have previously shown that these motifs are similar to known DNA-binding motifs and probably confer substrate specificity on the nuclease catalytic regions (11). Our analysis was based on block-to-block alignments found by LAMA and CYRCA. This required careful manual intervention, since the nuclease-associated modular DNA-binding domain (NUMOD) blocks we found were not part of the Blocks database. Submitting the NUMOD motifs to the One-Block CYRCA server returned the block sets we used to identify their function (Figure 3).

DISCUSSION

The high selectivity of One-Block CYRCA is derived from the transitive nature of its search. It is not a simple query-against-database search. All hits found by the query are further examined to determine whether they are consistently similar to each other. This identifies a set of similar blocks that can form a multiple-block alignment. We described this CYRCA approach in Ref. (3). The One-Block CYRCA method offers a novel combination of searching with short queries (corresponding to local sites on proteins) and using the powerful methodology of profile-to-profile comparison. Other servers and programs either compare short queries with sequences or compare long gapped profiles (or HMMs) with other profiles. Our approach allows the identification of weak and localized similarity between proteins embedded in otherwise different contexts.

17 in total

1. Increased coverage of protein families with the blocks database servers.

Authors: J G Henikoff; E A Greene; S Pietrokovski; S Henikoff
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Consistency analysis of similarity between multiple alignments: prediction of protein function and fold structure from analysis of local sequence motifs.

Authors: V Kunin; B Chan; E Sitbon; G Lithwick; S Pietrokovski
Journal: J Mol Biol Date: 2001-03-30 Impact factor: 5.469

3. Within the twilight zone: a sensitive profile-profile comparison tool based on information theory.

Authors: Golan Yona; Michael Levitt
Journal: J Mol Biol Date: 2002-02-01 Impact factor: 5.469

4. T-Coffee: A novel method for fast and accurate multiple sequence alignment.

Authors: C Notredame; D G Higgins; J Heringa
Journal: J Mol Biol Date: 2000-09-08 Impact factor: 5.469

5. Finding weak similarities between proteins by sequence profile comparison.

Authors: Anna R Panchenko
Journal: Nucleic Acids Res Date: 2003-01-15 Impact factor: 16.971

6. COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance.

Authors: Ruslan Sadreyev; Nick Grishin
Journal: J Mol Biol Date: 2003-02-07 Impact factor: 5.469

7. Distribution and function of new bacterial intein-like protein domains.

Authors: Gil Amitai; Olga Belenkiy; Bareket Dassa; Alla Shainskaya; Shmuel Pietrokovski
Journal: Mol Microbiol Date: 2003-01 Impact factor: 3.501

8. New types of conserved sequence domains in DNA-binding regions of homing endonucleases.

Authors: Einat Sitbon; Shmuel Pietrokovski
Journal: Trends Biochem Sci Date: 2003-09 Impact factor: 13.807

9. Enhanced statistics for local alignment of multiple alignments improves prediction of protein function and structure.

Authors: Milana Frenkel-Morgenstern; Hillary Voet; Shmuel Pietrokovski
Journal: Bioinformatics Date: 2005-05-03 Impact factor: 6.937

10. The Pfam protein families database.

Authors: Alex Bateman; Lachlan Coin; Richard Durbin; Robert D Finn; Volker Hollich; Sam Griffiths-Jones; Ajay Khanna; Mhairi Marshall; Simon Moxon; Erik L L Sonnhammer; David J Studholme; Corin Yeats; Sean R Eddy
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971
