The M-Coffee web server: a meta-method for computing multiple sequence alignments by combining alternative alignment methods.

Sebastien Moretti, Fabrice Armougom, Iain M Wallace, Desmond G Higgins, Cornelius V Jongeneel, Cedric Notredame.

Abstract

The M-Coffee server is a web server that makes it possible to compute multiple sequence alignments (MSAs) by running several MSA methods and combining their output into one single model. This allows users to run all their methods of choice simultaneously without having to arbitrarily choose one of them. The MSA is delivered along with a local estimation of its consistency with the individual MSAs it was derived from. The computation of the consensus multiple alignment is carried out using a special mode of the T-Coffee package [Notredame, Higgins and Heringa (T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 2000; 302: 205-217); Wallace, O'Sullivan, Higgins and Notredame (M-Coffee: combining multiple sequence alignment methods with T-Coffee. Nucleic Acids Res. 2006; 34: 1692-1699)]. Given a set of sequences (DNA or proteins) in FASTA format, M-Coffee delivers a multiple alignment in the most common formats. M-Coffee is a freeware open source package distributed under a GPL license, and it is available either as a standalone package or as a web service from www.tcoffee.org.

Year: 2007 PMID: 17526519 PMCID: PMC1933118 DOI: 10.1093/nar/gkm333

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The computation of an accurate multiple sequence alignment (MSA) is central to a large number of bioinformatics analyses, including phylogeny, profile construction, structure prediction and, more recently, sequence/structure activity relationships. Despite its importance, the MSA problem has not yet met with a definitive answer, and a wide variety of alternative methods are currently available (3,4). All these methods are meant to address the same problem in different ways. In recent years, many efforts have been undertaken to characterize their relative accuracy, but the overall outcome suggests that there is no such thing as a perfect MSA method, with each individual method having specific strengths and weaknesses. In practice, evaluation is made using structure-based MSAs as a standard of truth, and the expected accuracy of a method is deduced from its ability to produce a structurally correct sequence alignment while using sequence information only. At least five such collections of reference alignments (5–8) have been established, and although some methods give better average results than others, one cannot know in advance which method will outperform the others on a given dataset. As such, it is always possible for the method that is worst on average to outperform all the others on a specific dataset. For the biologist, this makes it impossible to use anything other than a weak statistical argument (i.e. best method on average) to choose one method among the others when computing an alignment. The design of meta-methods (or jury-based methods) is one way of addressing such situations in biology. Meta-methods are meant to combine the output of several alternative methods into one final output. They are based on the empirical reasoning that errors produced by independent prediction systems should not be consistent, which suggests that agreement is an indication of correctness.
Such an approach has been used successfully for gene prediction (9) and for secondary structure prediction (10). Combining alignments, however, is less simple than building a consensus prediction, and it was only in 1999 that an effective strategy was proposed by Bucka-Lassen (11). An alternative to the Bucka-Lassen strategy, using consistency, was later introduced in the T-Coffee (1) algorithm. Recently, this algorithm was further modified in order to address the problem of combining alternative MSAs into one (2). T-Coffee (1) is a progressive consistency-based algorithm that compiles an alignment on the basis of its consistency with a collection of pairwise constraints. In practice, the constraints correspond to pairs of residues that could end up aligned in the final alignment. These constraints, however, are not necessarily all compatible with one another, and the goal of the algorithm is to fit as many as possible within the final alignment while discarding those that are, hopefully, biologically less relevant. The term consistency refers to the notion that one tries to compute the alignment having the highest possible consistency with the constraint list. This notion was introduced by Gotoh (12) and later re-used in several algorithms (13,14). In 2000, Notredame et al. (1) described a variation of the progressive algorithm using consistency as a scoring scheme. This combination proved quite successful and is now at the core of several MSA packages (15–17). In its default mode, T-Coffee uses, as a list of constraints, all the pairwise matches extracted from a compilation of all possible global pairwise alignments and the 10 best local alignments from each pair of sequences. Yet, this is merely one of the possible recipes to assemble such a list of constraints, and alternatives are possible.
For instance, ProbCons (16) uses suboptimal pairwise global alignments (as emitted by an HMM with posterior decoding); PCMA (15) uses pairwise profile comparisons and Expresso (18) uses a mixture of sequence and structure-based alignments. Following the same principle, it is also possible to generate alternative MSAs and compile them into a single list of constraints. This latter approach forms the basis of M-Coffee (2), where eight MSA methods are used to generate alternative MSAs. Extensive benchmarking showed that this combination results in a modest but consistent improvement over each individual method, with M-Coffee producing the best scoring alignment on two of three of the datasets contained in BaliBase (5), Prefab (6) and Homstrad (2). Another interesting by-product of alignment combination is the possibility of estimating the local consistency between the final alignment and the individual alignments. This amounts to measuring, for every residue, the fraction of individual alignments that support its position in the final alignment. This measure is named the CORE index (Consistency of Overall Residue Evaluation) and was shown to be very informative with respect to the overall alignment accuracy (19). These initial reports recently gained further support thanks to some extensive analysis carried out by Sonnhammer et al. (20), whose results indicate that the consistency between an MSA and a pre-computed collection of alternative alignments gives very reliable information with respect to the structural correctness of that alignment. As such, the local consistency measure appears to be one of the most reliable predictors of alignment accuracy available today. The server we present here computes an alignment with eight of the most commonly used MSA packages. It then outputs a consensus alignment along with a CORE-based local evaluation that can be either color-coded or ASCII-based.
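The per-residue consistency described above can be illustrated with a minimal Python sketch. It is a simplification and not the T-Coffee implementation: each alignment is assumed to be a dictionary mapping a sequence name to its gapped row, all alignments are assumed to contain the same sequences, and the function names (residue_columns, core_index) are hypothetical. For every residue, the score is the fraction of (aligned partner, individual MSA) combinations in which the pairing found in the final alignment is reproduced.

```python
from itertools import combinations


def residue_columns(msa):
    """Map (sequence name, residue index) -> column index for one alignment."""
    columns = {}
    for name, row in msa.items():
        residue = 0
        for col, char in enumerate(row):
            if char != "-":
                columns[(name, residue)] = col
                residue += 1
    return columns


def core_index(final_msa, individual_msas):
    """Fraction of individual MSAs supporting each residue's pairings.

    Returns None for residues that are aligned to gaps only.
    """
    final_cols = residue_columns(final_msa)
    others = [residue_columns(m) for m in individual_msas]

    # Group the residues of the final alignment by column.
    by_column = {}
    for res, col in final_cols.items():
        by_column.setdefault(col, []).append(res)

    agree = {res: 0 for res in final_cols}
    total = {res: 0 for res in final_cols}
    for residues in by_column.values():
        for a, b in combinations(residues, 2):
            for cols in others:
                supported = cols[a] == cols[b]  # same column in this MSA?
                for res in (a, b):
                    total[res] += 1
                    agree[res] += supported
    return {
        res: (agree[res] / total[res] if total[res] else None)
        for res in final_cols
    }


final = {"s1": "ACG", "s2": "A-G"}
alternatives = [{"s1": "ACG", "s2": "A-G"}, {"s1": "ACG", "s2": "AG-"}]
print(core_index(final, alternatives))
```

In this toy example the first column is supported by both alternative alignments, whereas the last column is supported by only one of them.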
Two mirrors of these services currently run on separate clusters: one at the Swiss Institute of Bioinformatics on the Vital-IT framework, the other at the CNRS in Marseilles, France. Both mirrors can be accessed via the T-Coffee homepage (www.tcoffee.org), and extra mirrors should be added in the near future.

METHODS

Primary library: computation of the initial MSAs

The principle of M-Coffee is to compute several alternative multiple alignments in order to combine them into one consensus alignment. By default, eight methods were chosen for this purpose: PCMA (Version 2.0) (15), POA (Version 2.0) (21), Dialign-t (Version 0.2.1) (22), MAFFT (Version 5.431, L-INS-i) (17), Muscle (Version 3.6), ProbCons (Version 1.2), ClustalW (23) and T-Coffee (1). Apart from MAFFT, which is used in its most accurate mode (mafft --localpair --maxiterate 1000), all the methods are run on the initial dataset using the default parameters. Each method produces an MSA that is then turned into a T-Coffee primary library. All these libraries are then combined in order to generate the final MSA.
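As an illustration of how several MSAs might be turned into one combined list of pairwise constraints, the following Python sketch counts, for every pair of residues, the number of input alignments in which the pair shares a column. The data layout (a dictionary of gapped rows per alignment) and the function names are assumptions made for the example, and the plain count stands in for the actual T-Coffee weighting scheme, which is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def msa_to_pairs(msa):
    """Yield the residue pairs that share a column in one MSA.

    Residues are identified as (sequence name, residue index).
    """
    position = {name: 0 for name in msa}
    length = len(next(iter(msa.values())))
    for col in range(length):
        present = []
        for name, row in msa.items():
            if row[col] != "-":
                present.append((name, position[name]))
                position[name] += 1
        # Sorting makes the pair orientation independent of row order.
        yield from combinations(sorted(present), 2)


def combine_libraries(msas):
    """Merge the pairs of all MSAs; the weight is the number of supporting MSAs."""
    library = Counter()
    for msa in msas:
        library.update(msa_to_pairs(msa))
    return library


msas = [
    {"s1": "AC", "s2": "AC"},
    {"s1": "AC", "s2": "AC"},
    {"s2": "-AC", "s1": "AC-"},
]
for pair, weight in sorted(combine_libraries(msas).items()):
    print(pair, weight)
```

Pairs proposed by several methods accumulate a higher weight, which is the property a consistency-based aligner can then exploit when assembling the final alignment.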

M-Coffee alignment computation

In order to compute the final alignment, the server runs the following command: t_coffee -method poa_msa, dialignt_msa, mafft_msa, clustalw_msa, muscle_msa, probcons_msa, t_coffee_msa, pcma_msa.
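For users of the standalone package, the command above could be assembled programmatically along the following lines. This is only a sketch: the input file name sequences.fasta is a placeholder, passing the sequence file as the first argument and joining the method names with commas (no spaces) are assumptions about the local T-Coffee installation, and the helper name m_coffee_command is hypothetical. The command is built and printed, not executed.

```python
import shlex

# The eight method names listed in the command above.
METHODS = [
    "poa_msa", "dialignt_msa", "mafft_msa", "clustalw_msa",
    "muscle_msa", "probcons_msa", "t_coffee_msa", "pcma_msa",
]


def m_coffee_command(fasta_path, methods=METHODS):
    """Build the argument list for an M-Coffee style run (assumed syntax)."""
    return ["t_coffee", fasta_path, "-method", ",".join(methods)]


# Print the command for inspection; running it requires T-Coffee and the
# individual aligners to be installed.
print(shlex.join(m_coffee_command("sequences.fasta")))
```

Restricting the methods argument to a subset of the list would mirror the method selection offered by the advanced mode of the server.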

Using the M-Coffee server

The server can be accessed at www.tcoffee.org. Following the M-Coffee link takes the user to either the regular or the advanced mode. The regular mode merely requires the user to cut and paste a set of sequences in FASTA format. The advanced mode (Figure 1) offers more possibilities and guides the user with a series of bulleted points:

Figure 1. Method selection on the advanced M-Coffee server form. Each check box corresponds to either a pairwise (_pair) or a multiple sequence alignment method (_msa). Users should select the methods they want to combine.

• Cut and paste your sequences. Sequences should be in FASTA format. Duplicated names are now supported although not recommended.

• Alignment computation. This section defines the way the primary library is computed. For instance, selecting only lalign_id_pair and slow_pair will lead to the computation of a regular T-Coffee MSA. The lower section (xxx_msa) displays the list of available MSA methods. Selecting only one of these methods will generate the corresponding alignment. Selecting several methods (or all of them, as in the regular mode displayed in Figure 1) will lead to a consensus T-Coffee MSA. If the MSA method one wants to combine is missing on this form, another server named ‘Combine’ should be used (accessible from www.tcoffee.org). The ‘Combine’ server works on the same principle as M-Coffee but does not compute the MSAs itself and requires the user to cut and paste pre-computed MSAs. At this point it should be used if one wants to incorporate specific constraints or structure-based sequence alignments.

• Output. The Output section makes it possible to control the output format. The most notable element is score_html, which causes the server to produce a colored version of the final alignment (Figure 2). In this output, residues are individually colored according to the consistency of their alignment with the T-Coffee library. Residues in red are in perfect agreement with every constituting multiple alignment, while those in blue have the lowest agreement (i.e. the lowest support in the individual MSAs). Previous analysis indicates that 90% of the residues having a score of 7 or higher (dark yellow, orange and red) are correctly aligned (24). A text version of this output is available as score_ascii, where each residue is replaced with its consistency estimation on a scale between 0 and 9 (9 corresponding to the red-brick residues in the color output).

Figure 2. Typical colored output. This output was obtained by using the kinase1_ref5 from BaliBase. Correctly aligned residues (as judged from the reference) are in upper case, incorrectly aligned ones are in lower case. In this colored output, each residue has a color that indicates the agreement of the individual MSAs with respect to the alignment of that specific residue. Dark red indicates residues aligned in a similar fashion among all the individual MSAs; blue indicates a very low agreement. Dark yellow, orange and red residues can be considered to be reliably aligned.

These score_ascii files can be used to process multiple alignments (block extraction) using seq_reformat, one of the utilities distributed along with T-Coffee. For this purpose, users can download their alignment and the score_ascii file, and use the command line version of T-Coffee with the following syntax:

t_coffee -other_pg seq_reformat -in <aln> -struc_in <score_ascii> -struc_in_f number_aln -action +keep '[5-9]'

where <aln> is the name of the alignment and <score_ascii> is the name of the score_ascii file. This syntax will replace by a gap ('-') every residue having an ASCII score lower than 5 (green and blue residues on the colored output).
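The effect of this filtering step can be mimicked in a few lines of Python, which may help when post-processing alignments outside of T-Coffee. The sketch below is not a replacement for seq_reformat: it assumes that the alignment and its scores have already been parsed into dictionaries of equal-length strings (one consistency digit per residue), and the function name keep_reliable is hypothetical.

```python
def keep_reliable(alignment, scores, minimum=5):
    """Replace residues whose consistency digit is below `minimum` with a gap.

    `alignment` and `scores` map sequence names to rows of equal length;
    existing gaps are left untouched.
    """
    filtered = {}
    for name, row in alignment.items():
        score_row = scores[name]
        filtered[name] = "".join(
            res
            if res == "-" or (digit.isdigit() and int(digit) >= minimum)
            else "-"
            for res, digit in zip(row, score_row)
        )
    return filtered


alignment = {"s1": "ACG-T", "s2": "ACGTT"}
scores = {"s1": "947-2", "s2": "95782"}
print(keep_reliable(alignment, scores))
```

Raising the minimum to 7 would keep only the residues described above as dark yellow, orange and red.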

CONCLUSION AND FUTURE DEVELOPMENTS

M-Coffee provides biologists with a useful alternative to the a priori choice of an MSA method. Although M-Coffee does not entirely solve the question of which method should be used, its local scoring scheme makes it easier to read the alignment and determine which portions are the most informative. Further developments will include making more methods available, as well as making it possible to combine sequences and structures, using the Expresso protocol.

23 in total

1. DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment.

Authors: B Morgenstern
Journal: Bioinformatics Date: 1999-03 Impact factor: 6.937

2. Combining many multiple alignments in one improved alignment.

Authors: K Bucka-Lassen; O Caprani; J Hein
Journal: Bioinformatics Date: 1999-02 Impact factor: 6.937

Review 3. Multiple sequence alignment.

Authors: Robert C Edgar; Serafim Batzoglou
Journal: Curr Opin Struct Biol Date: 2006-05-05 Impact factor: 6.809

4. Motif recognition and alignment for many sequences by comparison of dot-matrices.

Authors: M Vingron; P Argos
Journal: J Mol Biol Date: 1991-03-05 Impact factor: 5.469

5. HOMSTRAD: a database of protein structure alignments for homologous families.

Authors: K Mizuguchi; C M Deane; T L Blundell; J P Overington
Journal: Protein Sci Date: 1998-11 Impact factor: 6.725

6. JPred: a consensus secondary structure prediction server.

Authors: J A Cuff; M E Clamp; A S Siddiqui; M Finlay; G J Barton
Journal: Bioinformatics Date: 1998 Impact factor: 6.937

7. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.

Authors: J D Thompson; D G Higgins; T J Gibson
Journal: Nucleic Acids Res Date: 1994-11-11 Impact factor: 16.971

8. MAFFT version 5: improvement in accuracy of multiple sequence alignment.

Authors: Kazutaka Katoh; Kei-ichi Kuma; Hiroyuki Toh; Takashi Miyata
Journal: Nucleic Acids Res Date: 2005-01-20 Impact factor: 16.971

9. M-Coffee: combining multiple sequence alignment methods with T-Coffee.

Authors: Iain M Wallace; Orla O'Sullivan; Desmond G Higgins; Cedric Notredame
Journal: Nucleic Acids Res Date: 2006-03-23 Impact factor: 16.971

10. OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy.

Authors: G P S Raghava; Stephen M J Searle; Patrick C Audley; Jonathan D Barber; Geoffrey J Barton
Journal: BMC Bioinformatics Date: 2003-10-10 Impact factor: 3.169
