
PROMALS web server for accurate multiple protein sequence alignments.

Jimin Pei, Bong-Hyun Kim, Ming Tang, Nick V Grishin.

Abstract

Multiple sequence alignments are essential in homology inference, structure modeling, functional prediction and phylogenetic analysis. We developed a web server that constructs multiple protein sequence alignments using PROMALS, a progressive method that improves alignment quality by using additional homologs from PSI-BLAST searches and secondary structure predictions from PSIPRED. PROMALS shows higher alignment accuracy than other advanced methods, such as MUMMALS, ProbCons, MAFFT and SPEM. The PROMALS web server takes FASTA format protein sequences as input. The output includes a colored alignment augmented with information about sequence grouping, predicted secondary structures and positional conservation. The PROMALS web server is available at: http://prodata.swmed.edu/promals/


Year: 2007 PMID: 17452345 PMCID: PMC1933189 DOI: 10.1093/nar/gkm227

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The quality of multiple sequence alignments directly affects their applications in similarity searches, structure modeling, functional prediction and phylogenetic analysis. Preparing accurate multiple alignments for distantly related proteins (e.g. sequence identity below 20%) remains a difficult task. The rapid accumulation of protein sequences in databases also creates a demand for faster alignment. Aligning all sequences simultaneously by dynamic programming is not feasible for large numbers of sequences (1). Progressive alignment methods reduce the problem of aligning multiple sequences to making a limited number of pairwise alignments. Although progressive methods can be fast, errors made at early stages are not corrected. Classic progressive methods such as ClustalW (2) can give reasonable results for similar sequences, but fail to produce accurate alignments for divergent sequences (3). In recent years, extensive research has been conducted to improve alignment quality for progressive methods. Refinement after progressive steps is an effective way of correcting alignment errors (4,5). The consistency-based alignment strategy (6) derives a better scoring function before the progressive alignment steps. ProbCons (7) introduced and MUMMALS (8) implemented a probabilistic treatment of consistency derived from pairwise alignment hidden Markov models. Additional information from protein structures and database homologs can lead to further improvement of alignment quality (5,9–11). We developed PROMALS (12), a progressive method that combines recent advanced techniques to improve multiple alignment quality, especially for distantly related proteins. PROMALS integrates additional information from database searches and secondary structure predictions into a new hidden Markov model that aligns profiles. The alignment scoring function of PROMALS is based on probabilistic consistency among profile–profile comparisons.
PROMALS has shown improved results as compared to other leading methods, such as SPEM (13), MUMMALS, ProbCons and MAFFT (12). Here, we describe the PROMALS web server for multiple protein sequence alignments. In addition to alignment construction, this server outputs useful information about predicted secondary structures, sequence grouping and positional conservation for target sequences.

PROMALS MULTIPLE ALIGNMENT PROCEDURE

Being a progressive method, PROMALS sets the order of pairwise alignments according to a tree built by a k-mer counting method (4). To improve alignment speed, PROMALS has two alignment stages for easy and difficult cases, as first implemented in our program PCMA (14). In the first stage, highly similar sequences are progressively aligned in a fast way with a weighted sum-of-pairs measure of BLOSUM62 (15) scores. This procedure results in a set of pre-aligned groups that are relatively divergent from each other. In the second alignment stage, a representative sequence is selected from each pre-aligned group. For each representative sequence, PSI-BLAST (16) is used to identify homologs from the sequence database UNIREF90 (17), and the PSI-BLAST profile (checkpoint file) is used to predict secondary structures by PSIPRED (18). For each pair of representatives, profiles are derived from the PSI-BLAST alignments and PSIPRED secondary structure predictions, and a matrix of posterior probabilities of matches between positions is obtained by a profile–profile hidden Markov model (12). These matrices are used to calculate the probabilistic consistency scoring function, which is used to progressively align the representative sequences. Then the pre-aligned groups obtained in the first stage are merged into the alignment of the representatives. Finally, gap placements in highly gapped regions are refined to make the gap patterns more realistic. The alignment accuracy results of PROMALS and several other methods on SABmark (19) and PREFAB 4.0 (4) benchmarks are shown in Table 1.
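The guide tree mentioned above is built from k-mer counts rather than from full pairwise alignments. As a rough illustration of the idea, the following Python sketch computes a k-mer similarity between two unaligned sequences as the fraction of shared k-mers. This is a simplified measure chosen for illustration; the exact k-mer distance, alphabet and value of k used by PROMALS are not stated here, and the function names and the default `k=3` are assumptions.

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in an unaligned sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(a, b, k=3):
    """Fraction of k-mers shared by a and b (illustrative measure only).

    Shared k-mers are counted with multiplicity (multiset intersection)
    and normalized by the number of k-mers in the shorter sequence.
    """
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    denom = min(len(a), len(b)) - k + 1
    return shared / denom if denom > 0 else 0.0


def kmer_distance(a, b, k=3):
    """Distance suitable as input to a tree-building method such as UPGMA."""
    return 1.0 - kmer_similarity(a, b, k)


print(kmer_similarity("ACDEFG", "ACDKFG"))  # 0.25 (only 'ACD' is shared)
```

Because no alignment is needed, such distances can be computed quickly for all sequence pairs, which is what makes them attractive for setting the alignment order.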

Table 1.

Evaluation of alignment methods on SABmark and PREFAB benchmarks

Method	SABmark-twi (209/7.7)	SABmark-sup (425/8.3)	PREFAB (1682/45.2)
PROMALS	0.391	0.665	0.790
SPEM	0.326	0.628	0.774
MUMMALS	0.196	0.522	0.731
ProbCons	0.166	0.485	0.716
MAFFT-linsi	0.184	0.510	0.722
MUSCLE	0.136	0.433	0.680
ClustalW	0.127	0.390	0.617

Average Q-scores of two SABmark data sets (‘twi’ for ‘twilight zone’ set, ‘sup’ for ‘superfamily’ set) and the PREFAB 4.0 data set are shown. Q-score is the number of correctly aligned residue pairs in the test alignment divided by the total number of aligned residue pairs in the reference alignment. For each data set, the two numbers in the parentheses separated by a slash are the number of alignments tested and the average number of sequences per alignment, respectively. For each data set, PROMALS yields statistically higher accuracy (top row; bold in the original table) than any other method (P-value < 0.000001) according to the Wilcoxon signed rank test. PROMALS and SPEM use secondary structure prediction and database homologs in the alignment process, while the other five methods only utilize the input sequences.
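The Q-score definition above can be made concrete with a short Python sketch. It assumes that both alignments are given as lists of equal-length gapped strings, with the same sequences in the same order and `-` as the gap character. Benchmark-specific details, such as restricting the reference to core regions, are not handled, and the function names are illustrative.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    Each residue is identified as (sequence index, residue index), so the
    result does not depend on where gaps happen to be placed.
    """
    counters = [0] * len(alignment)  # next residue index for each sequence
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for s, row in enumerate(alignment):
            if row[col] != "-":
                present.append((s, counters[s]))
                counters[s] += 1
        pairs.update(combinations(present, 2))
    return pairs


def q_score(test, reference):
    """Correctly aligned pairs in test / aligned pairs in reference."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)


reference = ["AC-D", "ACED"]
test = ["ACD-", "ACED"]
print(q_score(test, reference))  # 2 of the 3 reference pairs are recovered
```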


PROMALS WEB SERVER

The PROMALS web server is available at: http://prodata.swmed.edu/promals/ (Figure 1).

Figure 1.

Front page of the PROMALS server. The main section allows the user to paste or upload sequences and enter an email address for the results. Options to modify alignment parameters, PSI-BLAST searches and output format are provided. A brief description of each option is available by clicking on the option's name. A document with detailed description of the server is provided. The stand-alone versions of PROMALS can be downloaded from this page.

Input

The user can paste protein sequences or upload a sequence file. Sequences should be in FASTA format, and identical sequence names are not allowed. PROMALS also recognizes CLUSTAL format alignments as input. If such an alignment is provided, it is split into individual sequences and these sequences will be re-aligned by PROMALS. The user can enter a name to identify the submitted job. It is also recommended that the user provide an email address to receive alignment results, as PROMALS can take a considerable amount of time to finish for a large number of divergent sequences, due to the time-consuming steps of running PSI-BLAST searches and computing the profile consistency measure. On a data set of 1785 SCOP (20,21) domain pairs with up to 48 homologs added (the average number of sequences is 41.6 per alignment), the average CPU time of PROMALS is about half an hour under default settings (12). The actual time to finish an alignment job depends on factors such as the number of sequences and their lengths, the diversity among the sequences, the numbers of homologs found in database searches and the server load. It can take several hours for the server to finish aligning a sequence set with a large number of distantly related sequences (>50).
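Since identical sequence names are rejected, it can be useful to check a FASTA file locally before submitting it. The Python sketch below parses FASTA text and raises an error on duplicate names. It assumes that the name is the first whitespace-delimited token of the header line; how the server itself derives names from headers is not specified here, and `parse_fasta` is an illustrative helper rather than part of PROMALS.

```python
def parse_fasta(text):
    """Parse FASTA text into {name: sequence}, rejecting duplicate names.

    Assumption: the sequence name is the first token after '>'.
    """
    records = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            if not tokens:
                raise ValueError("header line without a sequence name")
            name = tokens[0]
            if name in records:
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


example = ">seq1 first protein\nMKT\nAIL\n>seq2\nMKV\n"
print(parse_fasta(example))  # {'seq1': 'MKTAIL', 'seq2': 'MKV'}
```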

Alignment options

A number of alignment options are provided on the web page. One important parameter is the identity threshold that determines the partition between the fast and slow alignment stages, and thus balances alignment quality and speed. Lowering this threshold can cause more sequences to be aligned in a fast and less accurate way, resulting in fewer representative groups subject to the time- and memory-consuming steps of PSI-BLAST searches and profile consistency measure. This tradeoff generally leads to less computational time but lower alignment quality. If the number of pre-aligned groups is large (e.g. >100), PROMALS could run out of memory during the consistency measure step and generate an error message that reports the number of pre-aligned groups in the second alignment stage. In this case, the user can lower the identity threshold (default 0.6) so that the number of sequence groups subject to consistency measure can be reduced. We also provide options for changing weights of amino acid scoring and predicted secondary structure scoring. The default values were determined by large-scale testing on divergent SCOP superfamily domains (20,21). Several parameters for running PSI-BLAST and processing PSI-BLAST alignments (used for generating amino acid profiles) are also provided, such as e-value cutoff, the number of PSI-BLAST iterations, identity cutoff to remove divergent hits, and the number of homologs kept for profile calculation.
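The effect of the identity threshold on the number of groups can be illustrated with a toy Python sketch. It uses a simple greedy grouping on sequences that are already aligned, which is a stand-in chosen for illustration: PROMALS forms its pre-aligned groups by progressive alignment along the guide tree, not by this procedure. The function names and the example sequences are hypothetical.

```python
def pairwise_identity(a, b):
    """Fraction of identical residues over columns where neither row has a gap."""
    cols = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not cols:
        return 0.0
    return sum(x == y for x, y in cols) / len(cols)


def greedy_groups(aligned, threshold=0.6):
    """Assign each sequence to the first group whose representative
    (the group's first member) is at least `threshold` identical to it.
    Illustrative stand-in only, not the PROMALS first stage.
    """
    groups = []
    for seq in aligned:
        for group in groups:
            if pairwise_identity(group[0], seq) >= threshold:
                group.append(seq)
                break
        else:
            groups.append([seq])
    return groups


seqs = ["ACDEFGHIKL", "ACDEFGHIKM", "ACDEFWWWWW", "WWWWWWWWWW"]
print(len(greedy_groups(seqs, threshold=0.6)))  # 3 groups
print(len(greedy_groups(seqs, threshold=0.4)))  # 2 groups
```

With the lower threshold, the third sequence joins the first group, so fewer representatives would go on to the PSI-BLAST and consistency steps.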

Output of PROMALS results

The web server reports the resulting alignment in standard CLUSTAL format. In addition, the server provides a colored alignment with information about sequence grouping, secondary structure predictions and positional conservation (Figure 2). Sequence grouping is reflected by the color of sequence names. Sequences with magenta names are representatives from pre-aligned groups. Sequences with black names immediately under a representative sequence belong to the same pre-aligned group as the representative sequence. For example, in Figure 2, ‘Q7U096_MYCBO_208_376’ and ‘Q1TAV7_9MYCO_205_370’ belong to the same pre-aligned group, and they are aligned in the fast alignment stage. Predicted secondary structures are shown for representative sequences (residues with red and blue fonts are predicted to be α-helices and β-strands, respectively). Above each alignment block, conserved positions are marked by their conservation indices (integer values from 0 to 9) calculated using our program AL2CO (22). The line beneath each alignment block shows consensus secondary structure predictions derived from predictions of individual representative sequences (‘h’: α-helix; ‘e’: β-strand). Such a coloring and labeling scheme provides additional information about the PROMALS alignment, and is helpful for further sequence and structural analysis of the target sequences. In addition to the alignments, the server also provides links to the original input sequences and intermediate results of PSI-BLAST alignments and PSIPRED secondary structure predictions.

Figure 2.

An example of a colored alignment produced by the PROMALS server. These sequences are adenylate/guanylate cyclase catalytic domains selected from the PFAM database (Accession number: PF00211) (23). The first line in each alignment block begins with ‘Conservation:’ and shows conservation index numbers for conserved positions. The last line in each block begins with ‘Consensus_ss:’ and shows the consensus secondary structure predictions (‘h’: α-helix; ‘e’: β-strand). Each representative sequence has a magenta name and is colored according to PSIPRED secondary structure predictions (red: α-helix, blue: β-strand). A representative sequence and the sequences with black names immediately below it, if there are any, form a closely related group (determined by the option ‘Identity threshold’). Sequences within each group are aligned in a fast way. The groups are aligned using profile consistency with enhanced information from database searches and secondary structure predictions.

23 in total

1. DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches.

Authors: J D Thompson; F Plewniak; J Thierry; O Poch
Journal: Nucleic Acids Res Date: 2000-08-01 Impact factor: 16.971

2. AL2CO: calculation of positional conservation in a protein sequence alignment.

Authors: J Pei; N V Grishin
Journal: Bioinformatics Date: 2001-08 Impact factor: 6.937

3. PCMA: fast and accurate multiple sequence alignment based on profile consistency.

Authors: Jimin Pei; Ruslan Sadreyev; Nick V Grishin
Journal: Bioinformatics Date: 2003-02-12 Impact factor: 6.937

4. The ASTRAL Compendium in 2004.

Authors: John-Marc Chandonia; Gary Hon; Nigel S Walker; Loredana Lo Conte; Patrice Koehl; Michael Levitt; Steven E Brenner
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

5. Amino acid substitution matrices from protein blocks.

Authors: S Henikoff; J G Henikoff
Journal: Proc Natl Acad Sci U S A Date: 1992-11-15 Impact factor: 11.205

6. 3DCoffee: combining protein sequences and structures within multiple sequence alignments.

Authors: Orla O'Sullivan; Karsten Suhre; Chantal Abergel; Desmond G Higgins; Cédric Notredame
Journal: J Mol Biol Date: 2004-07-02 Impact factor: 5.469

7. MUSCLE: multiple sequence alignment with high accuracy and high throughput.

Authors: Robert C Edgar
Journal: Nucleic Acids Res Date: 2004-03-19 Impact factor: 16.971

8. PROMALS: towards accurate multiple sequence alignments of distantly related proteins.

Authors: Jimin Pei; Nick V Grishin
Journal: Bioinformatics Date: 2007-01-31 Impact factor: 6.937

9. A tool for multiple sequence alignment.

Authors: D J Lipman; S F Altschul; J D Kececioglu
Journal: Proc Natl Acad Sci U S A Date: 1989-06 Impact factor: 11.205

10. The Pfam protein families database.

Authors: Alex Bateman; Lachlan Coin; Richard Durbin; Robert D Finn; Volker Hollich; Sam Griffiths-Jones; Ajay Khanna; Mhairi Marshall; Simon Moxon; Erik L L Sonnhammer; David J Studholme; Corin Yeats; Sean R Eddy
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971
