
PicXAA-Web: a web-based platform for non-progressive maximum expected accuracy alignment of multiple biological sequences.

Sayed Mohammad Ebrahim Sahraeian, Byung-Jun Yoon.

Abstract

In this article, we introduce PicXAA-Web, a web-based platform for accurate probabilistic alignment of multiple biological sequences. The core of PicXAA-Web consists of PicXAA, a multiple protein/DNA sequence alignment algorithm, and PicXAA-R, an extension of PicXAA for structural alignment of RNA sequences. Both PicXAA and PicXAA-R are probabilistic non-progressive alignment algorithms that aim to find the optimal alignment of multiple biological sequences by maximizing the expected accuracy. PicXAA and PicXAA-R greedily build up the alignment from sequence regions with high local similarity, thereby yielding an accurate global alignment that effectively captures local similarities among sequences. PicXAA-Web integrates these two algorithms in a user-friendly web platform for accurate alignment and analysis of multiple protein, DNA and RNA sequences. PicXAA-Web can be freely accessed at http://gsp.tamu.edu/picxaa/.

Year: 2011 PMID: 21515632 PMCID: PMC3125727 DOI: 10.1093/nar/gkr244

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Multiple sequence alignment (MSA) plays an important role in many problems in molecular biology, such as phylogenetic analysis, prediction of biomolecular structure and identification of conserved sequence motifs (1,2). Given a set of unaligned sequences, we can find their best alignment by optimizing an objective function that measures the quality of the alignment. For example, we may find the optimal multiple sequence alignment by maximizing the so-called sum-of-pairs (SP) score through dynamic programming (3,4). However, this optimization problem is NP-complete (5) and thus becomes intractable as the number of sequences increases. The same holds true for structural alignment of non-coding RNA (ncRNA) sequences, where we need to consider the structural similarity across sequences, in addition to their sequence similarity, to obtain an accurate sequence alignment that is biologically meaningful (6). A widely adopted solution for reducing the overall computational complexity is the progressive alignment scheme (7), which constructs the multiple sequence alignment by repeatedly performing pairwise alignments according to a guide tree. Many popular MSA algorithms, such as CLUSTALW (8), T-Coffee (9), ProbCons (10), ProbAlign (11), MUSCLE (12), MAFFT (13), MUMMALS (14) and MSAProbs (15), as well as many RNA structural alignment algorithms, such as Murlet (16), RAF (17), STRAL (18), LocARNA (19), CentroidAlign (20), PMcomp (21), MXSCARNA (22), R-Coffee (23), LARA (24) and MAFFT-xinsi (25), adopt this progressive approach to construct MSAs. Although the progressive alignment approach is computationally efficient, it tends to propagate early-stage alignment errors throughout the alignment process, which may significantly degrade the quality of the final alignment.
To address this problem, a number of non-progressive alignment algorithms have also been proposed, including DIALIGN (26), AMAP (27), FSA (28), RNASampler (29), MASTER (30), Stemloc-AMA (31), PicXAA (32) and PicXAA-R (33). In this article, we introduce PicXAA-Web, a user-friendly web server built on PicXAA, a recently proposed MSA algorithm (32), and PicXAA-R (33), an extension of PicXAA for structural alignment of RNA sequences. PicXAA is a probabilistic non-progressive alignment algorithm that finds protein (or DNA) MSAs with maximum expected accuracy. PicXAA greedily builds up the alignment from sequence regions with high local similarity, thereby yielding an accurate global alignment that effectively captures the local similarities among sequences. PicXAA-R is an extension of PicXAA that seeks an accurate structural alignment of non-coding RNAs (ncRNAs) through a greedy approach. PicXAA-R efficiently constructs an accurate multiple RNA alignment by using both the folding information in each RNA sequence and the local similarities between different RNA sequences. PicXAA-R is one of the fastest algorithms for structural alignment of multiple RNAs, and it consistently yields accurate alignment results. As shown in refs (32,33) through extensive experiments on several widely used benchmark sets, both PicXAA and PicXAA-R outperform many state-of-the-art MSA algorithms in terms of accuracy and efficiency. PicXAA and PicXAA-R are especially effective for aligning sequences that have high local similarities but relatively low overall similarities. Both algorithms (especially PicXAA-R) scale very well as the number of sequences grows, which makes them suitable for analyzing large datasets.

PicXAA

The main goal of PicXAA is to find the MSA that maximizes the expected number of correctly aligned residue pairs. To this end, it first computes the posterior pairwise alignment probability P(x_i ∼ y_j | x, y) between residues x_i ∈ x and y_j ∈ y for all sequence pairs (x, y) in the input sequence set. PicXAA then applies an improved probabilistic consistency transformation that incorporates the information from other homologous sequences in the set to improve the estimated posterior pairwise alignment probability. Next, it sorts the pairwise residue alignments by their alignment probability into an ordered set. Using an efficient graph-based technique, PicXAA greedily builds up the alignment by repeatedly inserting the most probable remaining residue alignment (x_i, y_j) from this ordered set into the MSA, provided that it satisfies certain consistency conditions. After obtaining the initial alignment, PicXAA goes through a refinement step to improve the alignment quality in sequence regions with low alignment probability. Table 1 shows the performance of PicXAA on three well-known benchmark datasets: BAliBASE 3.0 (34), IRMBASE 2.0 (26) and SABmark 1.65 (35). In this table, the alignment accuracy is reported based on two different criteria: the SP score, which is the percentage of correctly aligned residue pairs, and the column score (CS), which is the percentage of correct columns in the alignment. For comparison, Table 1 also shows the performance of several state-of-the-art MSA algorithms, including ProbAlign 1.1 (11), ProbCons 1.12 (10), MUMMALS 1.01 (14), MAFFT 6.708 (13) with two different options (‘-linsi’ and ‘-einsi’), MSAProbs 0.9.4 (15) and CLUSTALW 2.0.10 (8). We report the performance of PicXAA with three different methods for computing the posterior alignment probabilities: (i) partition function (PF), (ii) pair-HMM (PHMM) and (iii) structural pair-HMM (SPHMM). Further details of these methods can be found in ref. (32).
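The greedy construction step described above can be illustrated with a small Python sketch. This is a deliberately naive illustration, not the efficient graph-based technique used by PicXAA: the function names (`column_order`, `greedy_msa`) and the input layout are hypothetical, the consistency condition is assumed to mean that the alignment columns can still be ordered from left to right, and the consistency transformation and refinement steps are omitted.

```python
from collections import defaultdict, deque

def column_order(seqs, col_of):
    """Return the columns in left-to-right order, or None if they are inconsistent."""
    succ = defaultdict(set)
    indeg = {c: 0 for c in set(col_of.values())}
    for s, seq in seqs.items():
        for i in range(len(seq) - 1):
            a, b = col_of[(s, i)], col_of[(s, i + 1)]
            if a == b:  # two residues of one sequence in the same column
                return None
            if b not in succ[a]:
                succ[a].add(b)
                indeg[b] += 1
    # Kahn's topological sort; a cycle means the columns cannot be ordered.
    queue = deque(sorted(c for c, d in indeg.items() if d == 0))
    order = []
    while queue:
        c = queue.popleft()
        order.append(c)
        for d in sorted(succ[c]):
            indeg[d] -= 1
            if indeg[d] == 0:
                queue.append(d)
    return order if len(order) == len(indeg) else None

def greedy_msa(seqs, scored_pairs):
    """seqs: {name: sequence}; scored_pairs: (probability, (name, i), (name, j))."""
    # Every residue starts in its own column, labelled by the residue itself.
    col_of = {(s, i): (s, i) for s, seq in seqs.items() for i in range(len(seq))}
    # Visit candidate residue alignments from most to least probable.
    for _, r1, r2 in sorted(scored_pairs, key=lambda p: p[0], reverse=True):
        c1, c2 = col_of[r1], col_of[r2]
        if c1 == c2:
            continue
        # Tentatively merge the two columns; keep the merge only if consistent.
        trial = {r: (c1 if c == c2 else c) for r, c in col_of.items()}
        if column_order(seqs, trial) is not None:
            col_of = trial
    order = column_order(seqs, col_of)
    rows = {}
    for s, seq in seqs.items():
        by_col = {col_of[(s, i)]: ch for i, ch in enumerate(seq)}
        rows[s] = "".join(by_col.get(c, "-") for c in order)
    return rows

if __name__ == "__main__":
    example_pairs = [
        (0.90, ("a", 0), ("b", 0)),
        (0.85, ("a", 3), ("b", 2)),
        (0.80, ("a", 2), ("b", 1)),
        (0.30, ("a", 1), ("b", 2)),  # conflicts with a more probable pair
    ]
    print(greedy_msa({"a": "ACGT", "b": "AGT"}, example_pairs))
```

In this toy input, the least probable pair is rejected because accepting it would place two residues of sequence `a` in one column, so the sketch returns `ACGT` aligned with `A-GT`. Rebuilding the column graph for every candidate is quadratic at best and is only meant to make the idea concrete.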

Table 1.

Performance evaluation of PicXAA based on BAliBASE 3.0, IRMBASE 2.0 and SABmark 1.65

Method	BAliBASE 3.0 (SP/CS)	IRMBASE 2.0 (SP/CS)	SABmark 1.65 Twilight (f_D/f_M)	SABmark 1.65 Superfamily (f_D/f_M)
PicXAA-PF	87.86 / 59.32	89 / 50.08	16.75 / 15.37	49.66 / 41.41
PicXAA-PHMM	86.55 / 56.28	90.76 / 54.48	17.12 / 14.65	50.37 / 41.13
PicXAA-SPHMM	86.67 / 56.14	72.75 / 33.02	20.99 / 17.12	53.53 / 42.77
ProbAlign	87.61 / 58.82	81.68 / 36.69	15.86 / 13.05	48.66 / 39.82
ProbCons	86.42 / 56.01	85.3 / 42.51	16.64 / 13.55	48.56 / 39.51
MUMMALS	85.53 / 53.85	68.44 / 24.62	19.99 / 18.23	52.09 / 42.74
MAFFT-linsi	87.22 / 59.28	89.44 / 46.02	17.42 / 13.16	50.47 / 40.01
MAFFT-einsi	87.05 / 58.95	91.77 / 48.21	17.77 / 13.07	49.94 / 39.23
MSAProbs	87.78 / 60.67	83.31 / 38.48	17.41 / 13.67	50.51 / 40.75
ClustalW	75.37 / 38.01	26.34 / 2.44	12.87 / 8.72	38.62 / 30.27

As shown in Table 1, and discussed in more detail in ref. (32), PicXAA consistently yields accurate alignment results on various benchmark datasets with different characteristics. The advantage of PicXAA stands out most clearly on datasets that consist of sequences with only local similarities, as many progressive methods fail to capture these similarities faithfully.
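The two accuracy criteria used for BAliBASE and IRMBASE in Table 1 can be sketched in Python as follows. This is a minimal illustration based only on the definitions given in the text (percentage of correctly aligned residue pairs, and percentage of correct columns). The function names are hypothetical, the rows of both alignments are assumed to be in the same sequence order, and benchmark-specific scoring rules (for example, restricting the score to annotated core regions) are not modelled.

```python
from itertools import combinations

def to_columns(rows):
    """Turn gapped rows into columns of (row index, residue index) tuples."""
    counters = [0] * len(rows)
    cols = []
    for k in range(len(rows[0])):
        col = []
        for r, row in enumerate(rows):
            if row[k] != "-":
                col.append((r, counters[r]))
                counters[r] += 1
        cols.append(tuple(col))
    return cols

def aligned_pairs(cols):
    """All residue pairs that share a column."""
    return {pair for col in cols for pair in combinations(col, 2)}

def sp_and_cs(reference, test):
    """Return (SP, CS) of a test alignment against a reference, in percent."""
    ref_cols, test_cols = to_columns(reference), to_columns(test)
    ref_pairs = aligned_pairs(ref_cols)
    # SP: share of reference residue pairs reproduced by the test alignment.
    sp = 100.0 * len(ref_pairs & aligned_pairs(test_cols)) / len(ref_pairs)
    # CS: share of reference columns reproduced exactly by the test alignment.
    test_set = set(test_cols)
    cs = 100.0 * sum(col in test_set for col in ref_cols) / len(ref_cols)
    return sp, cs

if __name__ == "__main__":
    print(sp_and_cs(["ACGT", "A-GT"], ["ACGT", "AG-T"]))
```

For the toy input, two of the three reference residue pairs and two of the four reference columns are recovered, so the sketch reports an SP of about 66.7 and a CS of 50.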

PicXAA-R

PicXAA-R extends the ideas in PicXAA to the structural alignment of multiple RNA sequences (33). In addition to the pairwise alignment probability P(x_i ∼ y_j | x, y) between residues in different sequences, PicXAA-R also incorporates the base-pairing probability P(x_i ∼ x_j | x) between bases x_i and x_j in the same sequence x. PicXAA-R employs several probabilistic consistency transformations to obtain improved base-pairing and base alignment probabilities by considering both sequence and structural similarities among sequences. These enhanced probabilities are used in a two-step greedy alignment process. In the first step, PicXAA-R constructs the structural skeleton of the alignment by greedily adding the most probable alignment between base pairs with high base-pairing probabilities. In the next step, PicXAA-R updates the obtained skeleton by successively adding the most probable base alignments. As in PicXAA, the initial MSA obtained from the greedy construction process is further refined to improve the alignment quality in low-similarity regions. Figure 1 compares the performance of PicXAA-R with several well-known RNA sequence alignment algorithms: MAFFT-xinsi 6.717 (25), MXSCARNA 2.1 (22), CentroidAlign (20) and ProbConsRNA 1.10 (10). Figure 1 shows the alignment accuracy of the compared algorithms in terms of the SPS score (solid lines) as well as the computational cost in terms of the CPU time (dashed lines) on the BraliSub and LocExtR datasets (36). The accuracy (and the CPU time) of each algorithm is shown as a function of the number of sequences in the alignment, in order to show its scalability. As shown in this figure, and discussed in more detail in ref. (33), PicXAA-R is one of the fastest RNA structural alignment algorithms and consistently yields highly accurate structural alignments of multiple RNAs, especially for datasets that consist of RNA sequences with high local similarity but low percentage identity.

Figure 1.

Performance evaluation of PicXAA-R and several other algorithms based on BraliSub and LocExtR data sets. The SPS score is shown in solid lines and the CPU time is shown in dashed lines.

PicXAA WEB SERVER

PicXAA-Web provides an interactive web-based platform for aligning multiple biological sequences using PicXAA or PicXAA-R. PicXAA can be used for aligning a set of protein (or DNA) sequences and PicXAA-R can be used for structural alignment of RNAs. The user can specify which algorithm to use. The unaligned input sequences (in FASTA format) can be entered either directly through the input window or by uploading a sequence file. Two example inputs are provided by the server, which can be easily loaded into the input window to quickly try out the alignment algorithms.
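Before pasting or uploading sequences, it can help to check locally that the input is well-formed FASTA. The following Python sketch shows one way to do this. The function name `parse_fasta` is hypothetical, and the checks it performs (unique non-empty headers, at least two sequences) are assumptions about what a multiple alignment tool would reasonably need, not documented requirements of PicXAA-Web.

```python
def parse_fasta(text):
    """Parse FASTA text into an ordered {header: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()
            if not name or name in records:
                raise ValueError(f"missing or duplicate header: {line!r}")
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # Sequences may be wrapped over several lines; join them.
            records[name] += line.replace(" ", "")
    if len(records) < 2:
        raise ValueError("a multiple alignment needs at least two sequences")
    return records

if __name__ == "__main__":
    example = ">seq1\nACGU\nGG\n>seq2\nACG\n"
    for header, sequence in parse_fasta(example).items():
        print(header, len(sequence))
```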

Options and parameter setting

PicXAA-Web allows the user to choose different options and parameters for PicXAA and PicXAA-R. One can also simply use the default parameters.

Options for PicXAA

PicXAA offers three options for computing the posterior pairwise residue alignment probabilities P(x_i ∼ y_j | x, y): (i) partition function (PF) (11), (ii) pair-HMM (PHMM) (10) and (iii) structural pair-HMM (SPHMM) (14). The default option is PF. SPHMM can only be used for protein sequences, while PF and PHMM can be used for both protein and nucleotide sequences. The respective advantages and disadvantages of these methods are discussed in ref. (32). SPHMM is nearly three times as computationally complex as the other two methods, but it usually yields more accurate alignment results when aligning structurally similar proteins. PF outperforms PHMM on most datasets, while PHMM yields better alignment results for locally similar sequences.

Options for PicXAA-R

For structural alignment of ncRNAs, PicXAA-R should be used. The user can adjust three parameters employed in the PicXAA-R algorithm (33):
(i) Parameter for intra-sequence consistency transformation: the parameter α takes a value between 0 and 1 (default value is α = 0.4). Setting α = 0 makes the algorithm ignore the originally estimated base-pairing probabilities and use only the transformed probabilities, while α = 1 makes the algorithm use only the original probabilities and not the transformed probabilities. For intermediate values, the original and transformed probabilities are linearly combined.
(ii) Parameter for four-way consistency transformation: the parameter β also takes a value between 0 and 1 (default value is β = 0.1). Setting β = 0 results in using only the transformed probabilities, while β = 1 results in using only the original probabilities. For intermediate values, the original and transformed probabilities are linearly combined.
(iii) Threshold for identifying reliable base pairs: the parameter T specifies the minimum base-pairing probability of the base pairs that are considered during the structural skeleton construction step [see (33) for details]. T should be between 0 and 1, and its default value is T = 0.5. In general, a smaller T results in a larger structural skeleton.
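The roles of the mixing weights and the threshold can be made concrete with a short Python sketch. The exact formulas are given in ref. (33). The version below is an assumption that is merely consistent with the behaviour described above: a weight of 1 keeps the original probabilities, a weight of 0 keeps the transformed ones, and T acts as an inclusive lower bound. The function names and the dictionary layout are hypothetical.

```python
def blend(original, transformed, weight):
    """Linearly combine two probability tables keyed by index pairs.

    weight = 1 returns the original values, weight = 0 the transformed ones
    (assumed form: weight * original + (1 - weight) * transformed).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie between 0 and 1")
    keys = original.keys() | transformed.keys()
    return {
        k: weight * original.get(k, 0.0) + (1.0 - weight) * transformed.get(k, 0.0)
        for k in keys
    }

def reliable_base_pairs(base_pair_probs, threshold=0.5):
    """Keep base pairs whose probability is at least the threshold T."""
    return sorted(k for k, p in base_pair_probs.items() if p >= threshold)

if __name__ == "__main__":
    original = {(0, 5): 0.8, (1, 4): 0.2}     # made-up base-pairing probabilities
    transformed = {(0, 5): 0.4, (1, 4): 0.4}  # made-up transformed probabilities
    mixed = blend(original, transformed, weight=0.4)  # alpha = 0.4 (default)
    print(reliable_base_pairs(mixed, threshold=0.5))  # T = 0.5 (default)
```

Lowering `threshold` in this sketch admits more base pairs, which mirrors the statement that a smaller T results in a larger structural skeleton.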

Output

The output can be generated in either CLUSTALW or MFA (multiple FASTA) format. Upon submitting a sequence set, a Job ID will be assigned and a progress window will be launched with a link to the result page and additional information about the job being performed. The page will reload every 5 s until the output is ready. Once the MSA is ready, the alignment result will automatically appear in the web browser. If the user has provided an email address, an email notification containing a link to the final result will be sent. A link to a downloadable output file is also provided on the result page. Sample output pages are shown in Figure 2. The web server includes a help page that contains comprehensive and easy-to-follow guidelines for using PicXAA-Web. Links to the freely downloadable C++ source code for PicXAA and PicXAA-R are also provided.

Figure 2.

Sample output pages for (A) PicXAA and (B) PicXAA-R.

CONCLUSIONS

In this article, we introduced PicXAA-Web, a web-based tool for aligning multiple biological sequences, developed based on two recently proposed algorithms: PicXAA and PicXAA-R. PicXAA-Web provides a user-friendly web platform for accurate alignment and analysis of protein, DNA and RNA sequences, and it can serve as a useful resource for the molecular biology research community.

FUNDING

Funding for open access charge: Texas A&M faculty start-up fund.

Conflict of interest statement. None declared.

35 in total

1. T-Coffee: A novel method for fast and accurate multiple sequence alignment.

Authors: C Notredame; D G Higgins; J Heringa
Journal: J Mol Biol Date: 2000-09-08 Impact factor: 5.469

2. ProbCons: Probabilistic consistency-based multiple sequence alignment.

Authors: Chuong B Do; Mahathi S P Mahabhashyam; Michael Brudno; Serafim Batzoglou
Journal: Genome Res Date: 2005-02 Impact factor: 9.043

3. Probalign: multiple sequence alignment using partition function posterior probabilities.

Authors: Usman Roshan; Dennis R Livesay
Journal: Bioinformatics Date: 2006-09-05 Impact factor: 6.937

4. Murlet: a practical multiple alignment tool for structural RNA sequences.

Authors: Hisanori Kiryu; Yasuo Tabei; Taishin Kin; Kiyoshi Asai
Journal: Bioinformatics Date: 2007-04-25 Impact factor: 6.937

5. RNA Sampler: a new sampling based algorithm for common RNA secondary structure prediction and structural alignment.

Authors: Xing Xu; Yongmei Ji; Gary D Stormo
Journal: Bioinformatics Date: 2007-05-30 Impact factor: 6.937

6. Specific alignment of structured RNA: stochastic grammars and sequence annealing.

Authors: Robert K Bradley; Lior Pachter; Ian Holmes
Journal: Bioinformatics Date: 2008-09-16 Impact factor: 6.937

7. Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering.

Authors: Sebastian Will; Kristin Reiche; Ivo L Hofacker; Peter F Stadler; Rolf Backofen
Journal: PLoS Comput Biol Date: 2007-02-22 Impact factor: 4.475

8. MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information.

Authors: Jimin Pei; Nick V Grishin
Journal: Nucleic Acids Res Date: 2006-08-26 Impact factor: 16.971

9. Fast statistical alignment.

Authors: Robert K Bradley; Adam Roberts; Michael Smoot; Sudeep Juvekar; Jaeyoung Do; Colin Dewey; Ian Holmes; Lior Pachter
Journal: PLoS Comput Biol Date: 2009-05-29 Impact factor: 4.475

10. A fast structural multiple alignment method for long RNA sequences.

Authors: Yasuo Tabei; Hisanori Kiryu; Taishin Kin; Kiyoshi Asai
Journal: BMC Bioinformatics Date: 2008-01-23 Impact factor: 3.169
