
Parallelization of MAFFT for large-scale multiple sequence alignments.

Tsukasa Nakamura, Kazunori D Yamada, Kentaro Tomii, Kazutaka Katoh.

Abstract

Summary: We report an update for the MAFFT multiple sequence alignment program to enable parallel calculation of large numbers of sequences. The G-INS-1 option of MAFFT was recently reported to have higher accuracy than other methods for large data, but this method has been impractical for most large-scale analyses, due to the requirement of large computational resources. We introduce a scalable variant, G-large-INS-1, which has equivalent accuracy to G-INS-1 and is applicable to 50 000 or more sequences. Availability and implementation: This feature is available in MAFFT versions 7.355 or later at https://mafft.cbrc.jp/alignment/software/mpi.html. Supplementary information: Supplementary data are available at Bioinformatics online.

Year:  2018        PMID: 29506019      PMCID: PMC6041967          DOI: 10.1093/bioinformatics/bty121

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


A large number of biological sequences from widely divergent organisms are becoming available. Accordingly, the need for multiple alignments of large numbers of sequences is increasing for various kinds of sequence analysis. The G-INS-1 option of MAFFT was recently reported to have higher accuracy than other methods for large multiple sequence alignments (MSAs) in independent benchmarks (Le et al.; Yamada et al.). However, this method was impractical for actual analyses because it requires large computational resources in both space and time to perform all-to-all pairwise alignments by dynamic programming (DP) (Needleman and Wunsch, 1970), which are used for a guide tree and a scoring function similar to COFFEE (Notredame et al.). Here, we introduce a scalable variant, G-large-INS-1, which has equivalent accuracy to G-INS-1 and is applicable to 50 000 or more sequences. Our strategies to reduce computational costs are (i) parallelization across multiple machines and/or processor cores using MPI and Pthreads to increase speed and (ii) the use of a high-speed shared filesystem, which is becoming common for processing big data. An MPI-based parallelization of another high-accuracy MSA method, MSAProbs, was recently released (González-Domínguez et al.), but it cannot be applied to thousands of sequences. The present update of MAFFT is designed to satisfy the need for accurately aligning large numbers of sequences but is not applicable to long genomic sequences, since the length dependence of the computational cost is unchanged. The G-large-INS-1 option is available in MAFFT versions 7.355 or later and the online service (Katoh et al.).

The accuracy of G-large-INS-1 was compared with that of conventional G-INS-1 using different benchmarks: QuanTest (Le et al.) (Fig. 1), HomFam (Sievers et al.), OXFam (Raghava et al.; Yamada et al.) and ContTest (Fox et al.) (Supplementary Table S1). Both methods were run with different input orders and/or minor variations in pairwise alignment and guide tree (see Supplementary Data) in order to assess instability of accuracy scores (Boyce et al.). In all cases, the difference between G-INS-1 (blue lines in Fig. 1) and G-large-INS-1 (red lines) was small.
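
The all-to-all pairwise stage described in the introduction can be illustrated with a minimal Python sketch. It is not MAFFT's implementation: the scoring values, the linear gap penalty, the function names and the round-robin split of pairs are all assumptions made for illustration, and the serial loop over chunks only stands in for the MPI processes and Pthreads that would handle the chunks concurrently.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score by Needleman-Wunsch DP (linear gap penalty).

    The scoring values are illustrative assumptions, not MAFFT defaults.
    """
    prev = [j * gap for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def split_pairs(n_seqs, n_workers):
    """Deal the N(N-1)/2 sequence pairs round-robin into n_workers chunks."""
    chunks = [[] for _ in range(n_workers)]
    for k, pair in enumerate(combinations(range(n_seqs), 2)):
        chunks[k % n_workers].append(pair)
    return chunks


def all_to_all(seqs, n_workers=4):
    """Score every pair; each chunk is independent of the others."""
    scores = {}
    for chunk in split_pairs(len(seqs), n_workers):
        # In a parallel run, each chunk would be handed to its own worker.
        for i, j in chunk:
            scores[(i, j)] = nw_score(seqs[i], seqs[j])
    return scores


if __name__ == "__main__":
    toy = ["ACGT", "AGT", "ACGA", "TTGT"]  # hypothetical toy sequences
    for pair, score in sorted(all_to_all(toy, n_workers=2).items()):
        print(pair, score)
```

Each pairwise DP in this sketch fills a table of len(a) × len(b) cells, so splitting the pairs across workers reduces the wall-clock time but leaves the cost per pair, and hence its dependence on sequence length, untouched.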
Fig. 1.

(a) QuanTest. Accuracy of protein secondary structure prediction based on various sizes of MSAs by G-large-INS-1 (red bold lines), G-INS-1 (version 7.245; blue bold lines) and other popular methods. We used 1940 (out of 2265) entries so that JPred (Drozdetskiy et al.) could be consistently applied to the MSAs by all methods. (b)–(g) Parallelization efficiency of the all-to-all alignment stage (b, d and f) and the progressive stage (c, e and g) when applying G-large-INS-1 to LSU rRNA (b, c), sdr (d, e) and zf-CCHH (f, g). Green squares and magenta triangles are the computational times on the NFS and Lustre filesystems, respectively. Lines are the expected times based on the cases using seven cores [NFS; green solid lines in (b), (d) and (f)], 35 cores [Lustre; magenta dotted lines in (b), (d) and (f)] and a single core (c, e and g), assuming perfect efficiency. The calculations with NFS (green) were performed on a heterogeneous cluster system (each node has 16–20 cores of Intel Xeon E5-2660 v3 2.6 GHz, E5-2680 2.7 GHz and E5-2670 v2 2.50 GHz with 64–128 GB RAM). The calculations with the Lustre filesystem (magenta) were performed on Intel Xeon E5-2695 v4 2.10 GHz 36 cores with 256 GB RAM per node using Lustre version 2.5.42.
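
The expected-time lines in the caption follow from a simple scaling rule: under perfect efficiency, wall-clock time is inversely proportional to the number of cores. A minimal Python sketch of that rule is given here; the function names and the timings are hypothetical and are not measurements from the paper.

```python
def expected_time(base_time, base_cores, cores):
    """Ideal wall-clock time on `cores`, scaled from a baseline run."""
    return base_time * base_cores / cores


def efficiency(base_time, base_cores, observed_time, cores):
    """Ratio of ideal to observed time (1.0 means perfect scaling)."""
    return expected_time(base_time, base_cores, cores) / observed_time


if __name__ == "__main__":
    # Hypothetical numbers: a 7-core baseline run that took 700 s.
    for cores, observed in [(7, 700.0), (35, 150.0), (70, 100.0)]:
        ideal = expected_time(700.0, 7, cores)
        eff = efficiency(700.0, 7, observed, cores)
        print(f"{cores:3d} cores: ideal {ideal:6.1f} s, efficiency {eff:.2f}")
```

Points that fall on the line correspond to an efficiency of 1.0; points above the line correspond to values below 1.0.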

Large amounts of RAM are required if conventional tools for high-quality MSAs are applied to a large number of sequences. For example, MAFFT-L-INS-i and MSAProbs-MPI used at most 9.23 GB and 74.8 GB, respectively, for a subset of 1000 sequences in QuanTest. For a larger subset (4000 sequences), MAFFT-G-INS-1 and QuickProbs2 (Gudyś and Deorowicz, 2017) used at most 26.0 GB and 411 GB RAM, respectively. In contrast, G-large-INS-1 used only 5.72 GB at most for the subset of 4000 sequences. Memory usage for larger problems (up to ∼90 000 sequences) is shown in Supplementary Table S1, which suggests that this advantage increases with the number of sequences.
Note that G-large-INS-1 uses files to save temporary data and thus requires a high-speed filesystem when the input sequences are very short, as discussed below.

Parallelization efficiency in three examples is shown in Figure 1(b–g), separately for two stages: (i) the all-to-all alignment stage (b, d and f) and (ii) the progressive alignment stage (c, e and g). For LSU rRNA sequences (b, 1521–4102 bases, 1000 sequences randomly selected from the SEED alignment in Silva (Glöckner et al.)) and protein sequences with usual lengths (d, 21–297 amino acids, 50 157 sequences, the ‘sdr’ family taken from HomFam), the wall-clock time for the all-to-all alignment stage decreased almost linearly with the number of cores used for the calculation. However, for a dataset with very short sequences (f, 12–35 amino acids, 88 345 sequences, the ‘zf-CCHH’ family taken from HomFam), the efficiency differs depending on the filesystem: high on Lustre (shown with magenta triangles) but low on NFS (shown with green squares). This difference is due to the balance between calculation and disk operations. As noted earlier, a considerable amount of temporary data is written in parallel into the filesystem: approximately 218 MB, 100 GB and 142 GB for LSU rRNA, ‘sdr’ and ‘zf-CCHH’, respectively, in the examples shown here. Overhead due to these disk operations is almost negligible in the former two cases but not in the latter case, where alignment of ∼23 amino acids takes only a short time in comparison with the time to write the temporary data to disk using NFS. Figure 1c, e and g suggest that the wall-clock time of the progressive stage varies for each run and does not decrease linearly, but usually this is not a speed-limiting step. CPU time and wall-clock time for various problems are shown in Supplementary Table S1.

Until now, it was necessary to use highly approximate methods, such as the FFT-NS-2 option of MAFFT or the progressive option of Clustal Omega, in order to construct large MSAs.
In terms of the MSA itself, the accuracy of these methods tends to decrease as the number of sequences increases. This was first pointed out by Sievers et al. and confirmed by Le et al. The increase in accuracy observed in Figure 1a for more than 200 sequences is due to the prediction phase, not the alignment phase (see the last section in Supplementary Data and black dashed lines in Supplementary Fig. S1). As a result, it was difficult to know how many sequences should be included in an MSA. With more sequences, the MSA has richer comparative information, but the alignment quality is expected to decrease. The optimal balance between these two factors may differ by case. In contrast, the accuracy of G-large-INS-1 and G-INS-1 (red and blue dashed lines in Supplementary Fig. S1) was robust to data size in this test. The number of sequences to include in the MSA can now be determined simply based on the computational resources available and the requirements for the downstream analysis.
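
The balance between calculation and disk operations discussed above for the three datasets can be made concrete with some rough arithmetic. The Python sketch below counts the pairwise alignments for each dataset and divides the reported temporary-data volume by that count. It assumes that the temporary data scale with the number of pairs and that 1 MB = 10^6 bytes and 1 GB = 10^9 bytes; neither assumption is stated in the text, so the per-pair figures are only indicative.

```python
def n_pairs(n_seqs):
    """Number of unordered sequence pairs in an all-to-all comparison."""
    return n_seqs * (n_seqs - 1) // 2


def bytes_per_pair(n_seqs, temp_bytes):
    """Average temporary data per pair (assumes volume scales with pairs)."""
    return temp_bytes / n_pairs(n_seqs)


# (name, number of sequences, temporary data in bytes) as quoted in the text;
# decimal units are an assumption.
DATASETS = [
    ("LSU rRNA", 1000, 218e6),
    ("sdr", 50157, 100e9),
    ("zf-CCHH", 88345, 142e9),
]

if __name__ == "__main__":
    for name, n, temp in DATASETS:
        print(f"{name:9s} {n_pairs(n):>13,d} pairs, "
              f"~{bytes_per_pair(n, temp):.0f} bytes per pair")
```

Under these assumptions, the data written per pair varies by roughly an order of magnitude across the three datasets, whereas the DP work per pair, which grows with the product of the two sequence lengths, differs by at least three orders of magnitude between the LSU rRNA and zf-CCHH length ranges. This is consistent with disk overhead becoming visible only for the very short sequences.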
References (10 of 14 listed)

1.  COFFEE: an objective function for multiple sequence alignments.

Authors:  C Notredame; L Holm; D G Higgins
Journal:  Bioinformatics       Date:  1998-06       Impact factor: 6.937

2.  25 years of serving the community with ribosomal RNA gene reference databases and tools.

Authors:  Frank Oliver Glöckner; Pelin Yilmaz; Christian Quast; Jan Gerken; Alan Beccati; Andreea Ciuprina; Gerrit Bruns; Pablo Yarza; Jörg Peplies; Ralf Westram; Wolfgang Ludwig
Journal:  J Biotechnol       Date:  2017-06-23       Impact factor: 3.307

3.  Making automated multiple alignments of very large numbers of protein sequences.

Authors:  Fabian Sievers; David Dineen; Andreas Wilm; Desmond G Higgins
Journal:  Bioinformatics       Date:  2013-02-21       Impact factor: 6.937

4.  A general method applicable to the search for similarities in the amino acid sequence of two proteins.

Authors:  S B Needleman; C D Wunsch
Journal:  J Mol Biol       Date:  1970-03       Impact factor: 5.469

5.  Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega.

Authors:  Fabian Sievers; Andreas Wilm; David Dineen; Toby J Gibson; Kevin Karplus; Weizhong Li; Rodrigo Lopez; Hamish McWilliam; Michael Remmert; Johannes Söding; Julie D Thompson; Desmond G Higgins
Journal:  Mol Syst Biol       Date:  2011-10-11       Impact factor: 11.429

6.  Instability in progressive multiple sequence alignment algorithms.

Authors:  Kieran Boyce; Fabian Sievers; Desmond G Higgins
Journal:  Algorithms Mol Biol       Date:  2015-10-09       Impact factor: 1.405

7.  JPred4: a protein secondary structure prediction server.

Authors:  Alexey Drozdetskiy; Christian Cole; James Procter; Geoffrey J Barton
Journal:  Nucleic Acids Res       Date:  2015-04-16       Impact factor: 16.971

8.  Protein multiple sequence alignment benchmarking through secondary structure prediction.

Authors:  Quan Le; Fabian Sievers; Desmond G Higgins
Journal:  Bioinformatics       Date:  2017-05-01       Impact factor: 6.937

9.  Application of the MAFFT sequence alignment program to large data-reexamination of the usefulness of chained guide trees.

Authors:  Kazunori D Yamada; Kentaro Tomii; Kazutaka Katoh
Journal:  Bioinformatics       Date:  2016-07-04       Impact factor: 6.937

10.  QuickProbs 2: Towards rapid construction of high-quality alignments of large protein families.

Authors:  Adam Gudyś; Sebastian Deorowicz
Journal:  Sci Rep       Date:  2017-01-31       Impact factor: 4.379
