PASTA for proteins.

Abstract

Summary: PASTA is a multiple sequence alignment method that uses divide-and-conquer plus iteration to enable base alignment methods to scale with high accuracy to large sequence datasets. By default, PASTA included MAFFT L-INS-i; our new extension of PASTA enables the use of MAFFT G-INS-i, MAFFT Homologs, CONTRAlign and ProbCons. We analyzed the performance of each base method, and of PASTA using each of these base methods, on 224 datasets from BAliBASE 4 with at least 50 sequences. We show that PASTA enables the most accurate base methods to scale to larger datasets at reduced computational effort, and generally improves alignment and tree accuracy on the largest BAliBASE datasets.

Availability and implementation: PASTA is available at https://github.com/kodicollins/pasta and has also been integrated into the original PASTA repository at https://github.com/smirarab/pasta.

Supplementary information: Supplementary data are available at Bioinformatics online.

Substances: Proteins

Year: 2018 PMID: 29931282 PMCID: PMC6223367 DOI: 10.1093/bioinformatics/bty495

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Protein multiple sequence alignment is a key first step in much biological research, including protein structure and function prediction, domain identification, inference of ancestral proteins and construction of protein–protein interaction networks. However, alignment error can have a substantial impact on downstream analyses, and large datasets can be particularly difficult to align with high accuracy. For these reasons, among others, there is great interest in the development of new protein sequence alignment methods that can provide good accuracy on large datasets (Iantorno et al.; Le et al.).

2 Materials and methods

PASTA (Mirarab et al.) is a method designed to improve the accuracy and scalability of a base method for multiple sequence alignment (Abuín et al.). PASTA computes an initial tree and then iterates between alignment estimation and tree estimation, typically performing three iterations. Each iteration uses the selected base method to compute alignments on small subsets with at most 200 sequences, and then merges those alignments into an alignment on the full dataset. Once the full alignment is computed, a maximum likelihood tree is computed using FastTree-2 (Price et al.). The standard version of PASTA enables only a few base methods; here, we explore the impact of including other base methods for protein multiple sequence alignment. In addition to MAFFT L-INS-i as the subset aligner in PASTA, we include two ways of running MAFFT version 7.149b, G-INS-i and Homologs (Katoh and Standley, 2013), as well as CONTRAlign version 1.04 (Do et al.) and ProbCons version 1.12 (Do et al.). The public distribution of MAFFT Homologs is limited to 99 sequences, so we turned off the flag that restricts its analysis to small datasets, enabling it to analyze larger datasets.
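The decomposition into subsets of bounded size can be illustrated with a small sketch. The text above states only the size limit, not how subsets are chosen, so the rule used here (repeatedly cutting the current tree at the edge that splits its leaves most evenly) is an assumption made for illustration, as are the function names `build_adjacency`, `side_of_edge` and `decompose`. The forced first split mirrors the behaviour reported in the Results for inputs that already fit within the size limit.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def side_of_edge(adj, nodes, u, v):
    """Nodes of `nodes` reachable from u without crossing the edge u-v."""
    seen, stack = {u}, [u]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y in nodes and y not in seen and {x, y} != {u, v}:
                seen.add(y)
                stack.append(y)
    return seen


def decompose(adj, leaves, max_size=200, nodes=None, top_level=True):
    """Split the leaf set into subsets of at most `max_size` leaves.

    Assumption: each split removes the edge that divides the leaves most
    evenly. The top-level call always splits at least once.
    """
    nodes = set(adj) if nodes is None else nodes
    here = sorted(n for n in nodes if n in leaves)
    if len(here) < 2 or (len(here) <= max_size and not top_level):
        return [here]
    best_score, best_side = None, None
    for u in sorted(nodes):
        for v in sorted(adj[u]):
            if v in nodes and u < v:  # visit each edge once
                side = side_of_edge(adj, nodes, u, v)
                k = sum(1 for n in side if n in leaves)
                score = abs(len(here) - 2 * k)  # 0 means a perfectly even split
                if best_score is None or score < best_score:
                    best_score, best_side = score, side
    return (decompose(adj, leaves, max_size, best_side, False)
            + decompose(adj, leaves, max_size, nodes - best_side, False))


# Toy tree with eight taxa (A-H) and internal nodes n1-n6.
edges = [("A", "n1"), ("B", "n1"), ("C", "n2"), ("D", "n2"),
         ("n1", "n3"), ("n2", "n3"), ("E", "n4"), ("F", "n4"),
         ("G", "n5"), ("H", "n5"), ("n4", "n6"), ("n5", "n6"),
         ("n3", "n6")]
tree = build_adjacency(edges)
taxa = set("ABCDEFGH")
small_subsets = decompose(tree, taxa, max_size=2)
two_halves = decompose(tree, taxa, max_size=200)
```

In a full pipeline each subset would be passed to the chosen base aligner and the resulting sub-alignments merged; those steps are outside the scope of this sketch.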

3 Results

We explored accuracy and running time on BAliBASE (Thompson et al.), a collection of protein sequences with reference alignments based on structural features, restricted to datasets with at least 50 sequences. These 224 datasets have between 50 and 807 sequences. When the input to PASTA has at most 200 sequences, PASTA decomposes the dataset into two subsets; otherwise, it decomposes the dataset into subsets with at most 200 sequences. We compare base alignment methods (several variants of MAFFT, ProbCons and CONTRAlign) and PASTA used with each of these base alignment methods (denoted by PASTA+X, with X the base method), using three iterations.

We assess the effect of the choice of base method on alignment quality using three standard metrics: SP-score (i.e. recall), Modeler score (i.e. precision) and Total Column score (TC, the percentage of columns in the reference alignment completely recovered). We also examine the impact on tree accuracy, where the reference tree is computed by running RAxML version 8.2.11 (Stamatakis, 2006) on the reference alignment with 100 bootstraps [using the AA sequence evolution model reported for each dataset in Nguyen et al.] and then collapsing all edges with bootstrap support below 75%. Finally, we report running time. The PASTA and MAFFT variants all take advantage of multi-threading; all methods, multi-threaded or not, were given 1 node with 12 cores.

As PASTA is designed mainly for large datasets, we report results here for the 25 datasets with 200 or more sequences; see Supplementary Material for results on the 199 datasets with 50 to 199 sequences. Accuracy is also affected by the average percent pairwise sequence identity (PID; Liu et al.; Mirarab et al.), and the BAliBASE datasets range from 11.8% to 66.4% in PID; we therefore separate results for these datasets based on PID below and above 25%.

Figure 1A shows the average running time of these methods on eight large RV10 BAliBASE datasets (RV10 is a subset of the BAliBASE data). CONTRAlign is by far the most computationally intensive (>18 h), ProbCons is the next most intensive (8 h) and MAFFT Homologs is the third most intensive (7.6 h). PASTA makes these slow methods much faster: PASTA+CONTRAlign takes half an hour whereas CONTRAlign takes over 18 h, PASTA+ProbCons takes 4 h instead of over 8 h, and PASTA+Homologs takes less than 1 h whereas MAFFT Homologs takes more than 7 h. The remaining methods are all fairly fast, even on these very large protein datasets; PASTA makes the fastest of these methods slower, but the differences in running time are small (the biggest increase in average running time is for MAFFT L-INS-i, but even there PASTA+L-INS-i completes in under 1 h).
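The three alignment metrics named above (SP-score as recall, Modeler score as precision, and TC as the fraction of reference columns recovered) can be made concrete with a minimal sketch. It assumes that both alignments are given as lists of equal-length gapped strings over the same sequences in the same order, and that every column is scored; benchmark scoring is often restricted to selected reference regions, which this sketch does not attempt. The function names are illustrative only.

```python
from itertools import combinations


def columns(alignment, gap="-"):
    """Encode each column as a tuple of per-sequence residue indices (-1 = gap)."""
    counters = [0] * len(alignment)
    cols = []
    for c in range(len(alignment[0])):
        col = []
        for s, row in enumerate(alignment):
            if row[c] == gap:
                col.append(-1)
            else:
                col.append(counters[s])
                counters[s] += 1
        cols.append(tuple(col))
    return cols


def residue_pairs(cols):
    """All pairs of residues that share a column, as ((seq, pos), (seq, pos))."""
    pairs = set()
    for col in cols:
        present = [(s, i) for s, i in enumerate(col) if i >= 0]
        pairs.update(combinations(present, 2))
    return pairs


def score(reference, estimate):
    """Return SP-score (recall), Modeler score (precision) and TC."""
    ref_cols, est_cols = columns(reference), columns(estimate)
    ref_pairs, est_pairs = residue_pairs(ref_cols), residue_pairs(est_cols)
    shared = len(ref_pairs & est_pairs)
    est_col_set = set(est_cols)
    return {
        "SP": shared / len(ref_pairs),
        "Modeler": shared / len(est_pairs),
        "TC": sum(col in est_col_set for col in ref_cols) / len(ref_cols),
    }


# Toy example: the estimate misplaces one residue of the first sequence.
reference = ["AC-GT", "ACAGT", "A--GT"]
estimate = ["ACG-T", "ACAGT", "A--GT"]
result = score(reference, estimate)
```

Because SP divides by the number of reference pairs and Modeler by the number of estimated pairs, an over-aligned estimate (too few gaps) tends to lower the Modeler score more than the SP-score, which is why the two are reported separately.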

Fig. 1.

Results on large datasets. (A) Average running times on eight large RV10 BAliBASE datasets. (B) Average Total Column score on the eight large RV10 BAliBASE datasets. (C) Average Total Column score on all datasets with 200 or more sequences (grouped by average percent sequence identity). (D) Average tree accuracy on eight large RV10 BAliBASE datasets. The error bars show standard error.

The highest TC scores on the eight large RV10 BAliBASE datasets are obtained by PASTA+MAFFT G-INS-i, with PASTA+MAFFT L-INS-i a close second (Fig. 1B). The lowest TC scores are for CONTRAlign and ProbCons, suggesting that they degrade in accuracy on these large datasets. However, PASTA increases the TC scores for all base methods.

The average TC scores on the full set of datasets with at least 200 sequences (Fig. 1C) show the impact of PID on the absolute and relative accuracy of the different methods. For the datasets with PID above 25%, MAFFT Homologs and MAFFT L-INS-i have the best average TC scores, followed fairly closely by PASTA used with any variant of MAFFT and then by MAFFT G-INS-i. PASTA+ProbCons comes next, followed by ProbCons and PASTA+CONTRAlign, with CONTRAlign in last place. Thus, for the easier datasets (with PID > 25%), PASTA improves the TC scores for most methods and slightly decreases the TC scores for the two best-performing methods.

When the PID is low, however, TC scores drop and the relative performance of the methods changes. Here, the best average TC scores are obtained using PASTA+MAFFT G-INS-i, followed closely by MAFFT G-INS-i; CONTRAlign is in third place, and PASTA+ProbCons and PASTA+CONTRAlign are nearly tied in fourth place. The lowest average TC scores are obtained by ProbCons. For these harder datasets (PID < 25%), the impact of PASTA is variable: it sometimes improves scores and sometimes reduces them, but when it reduces scores the reductions are small.
We also compared methods with respect to the accuracy of maximum likelihood trees computed on their alignments, using the eight large RV10 BAliBASE datasets. CONTRAlign and ProbCons had the lowest accuracy of all methods, but using PASTA improved the accuracy substantially; all other methods had similar average accuracy on these datasets (Fig. 1D).
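The reference trees used for this comparison are obtained by collapsing edges with bootstrap support below 75%, as described earlier in this section. The sketch below illustrates that step under stated assumptions: each tree is represented directly as its set of non-trivial bipartitions (splits) rather than parsed from a tree file, and tree accuracy is taken to be the fraction of retained reference splits that also appear in the estimated tree. The text does not define the tree accuracy measure, so this choice, like the function names, is hypothetical.

```python
def canonical(side, taxa):
    """Represent a split by the side that does not contain the smallest taxon."""
    side = frozenset(side)
    anchor = min(taxa)
    return side if anchor not in side else frozenset(taxa) - side


def collapse_low_support(splits, threshold=75.0):
    """Keep only splits whose bootstrap support is at least `threshold` percent."""
    return {s for s, support in splits.items() if support >= threshold}


def fraction_recovered(reference_splits, estimated_splits):
    """Fraction of reference splits present in the estimated tree (assumed metric)."""
    if not reference_splits:
        return 1.0
    return len(reference_splits & estimated_splits) / len(reference_splits)


# Toy example with six taxa; the support values are invented for illustration.
taxa = set("ABCDEF")
reference = {
    canonical("AB", taxa): 98.0,
    canonical("ABC", taxa): 60.0,  # below 75%, so this edge is collapsed
    canonical("EF", taxa): 80.0,
}
estimated = {canonical(s, taxa) for s in ("AB", "ABC", "DE")}

kept = collapse_low_support(reference, threshold=75.0)
accuracy = fraction_recovered(kept, estimated)
```

Collapsing weakly supported edges before the comparison means an estimated tree is not penalized for disagreeing with parts of the reference tree that the reference alignment itself does not resolve reliably.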

4 Conclusions

This study shows that PASTA can be used to improve the scalability of several protein alignment methods. The optimal choice of PASTA variant (i.e. sub-aligner) depends on the properties of the dataset, but PASTA reduces the running time of ProbCons and CONTRAlign and improves the TC scores and tree accuracy these methods obtain on large datasets.

References

1. BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs.

Authors: J D Thompson; F Plewniak; O Poch
Journal: Bioinformatics Date: 1999-01

2. ProbCons: Probabilistic consistency-based multiple sequence alignment.

Authors: Chuong B Do; Mahathi S P Mahabhashyam; Michael Brudno; Serafim Batzoglou
Journal: Genome Res Date: 2005-02

3. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models.

Authors: Alexandros Stamatakis
Journal: Bioinformatics Date: 2006-08-23

4. Who watches the watchmen? An appraisal of benchmarks for multiple sequence alignment.

Authors: Stefano Iantorno; Kevin Gori; Nick Goldman; Manuel Gil; Christophe Dessimoz
Journal: Methods Mol Biol Date: 2014

5. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees.

Authors: Kevin Liu; Sindhu Raghavan; Serita Nelesen; C Randal Linder; Tandy Warnow
Journal: Science Date: 2009-06-19

6. FastTree 2: approximately maximum-likelihood trees for large alignments.

Authors: Morgan N Price; Paramvir S Dehal; Adam P Arkin
Journal: PLoS One Date: 2010-03-10

7. PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences.

Authors: Siavash Mirarab; Nam Nguyen; Sheng Guo; Li-San Wang; Junhyong Kim; Tandy Warnow
Journal: J Comput Biol Date: 2014-12-30

8. MAFFT multiple sequence alignment software version 7: improvements in performance and usability.

Authors: Kazutaka Katoh; Daron M Standley
Journal: Mol Biol Evol Date: 2013-01-16

9. Ultra-large alignments using phylogeny-aware profiles.

Authors: Nam-Phuong D Nguyen; Siavash Mirarab; Keerthana Kumar; Tandy Warnow
Journal: Genome Biol Date: 2015-06-16

10. Protein multiple sequence alignment benchmarking through secondary structure prediction.

Authors: Quan Le; Fabian Sievers; Desmond G Higgins
Journal: Bioinformatics Date: 2017-05-01
