ExaML version 3: a tool for phylogenomic analyses on supercomputers.

Alexey M Kozlov, Andre J Aberer, Alexandros Stamatakis.

Abstract

MOTIVATION: Phylogenies are increasingly used in all fields of medical and biological research. Because of the next-generation sequencing revolution, datasets used for conducting phylogenetic analyses grow at an unprecedented pace. We present ExaML version 3, a dedicated production-level code for inferring phylogenies on whole-transcriptome and whole-genome alignments using supercomputers.
RESULTS: We introduce several improvements and extensions to ExaML: extensions of substitution models and supported data types, the integration of a novel load balance algorithm and a parallel I/O optimization that together significantly improve parallel efficiency, and a production-level implementation for Intel MIC-based hardware platforms.
© The Author 2015. Published by Oxford University Press.

Year:  2015        PMID: 25819675      PMCID: PMC4514929          DOI: 10.1093/bioinformatics/btv184

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

ExaML (Exascale Maximum Likelihood) is a relatively new code for large-scale phylogenetic analyses on supercomputers. It implements the RAxML (Stamatakis, 2014) search algorithm and replaces RAxML-Light (Stamatakis et al., 2012), which was inefficient for the large, partitioned phylogenomic datasets that currently represent the typical use case. Hence, in ExaML (v. 1) we implemented a radically different parallelization approach (Stamatakis and Aberer, 2013) that improved parallel efficiency by up to a factor of 3 and also increased scalability. For large alignments from two studies that were recently published in Science [51 taxa, unpartitioned, DNA sites and 48 taxa, four partitions, DNA sites in Jarvis et al. (2014); 144 taxa, 100 partitions, 1 240 377 DNA sites and 144 taxa, 770 partitions, 576 840 AA sites in Misof et al. (2014)], we identified and resolved further performance bottlenecks. Note that ExaML can also be used for analyzing datasets with 10-20 genes and up to 55 000 taxa, but scalability will then be limited to at most 100 cores. For ExaML (v. 2), we developed and integrated algorithms for improved load balance (Zhang and Stamatakis, 2012; Kobert et al.) and also optimized the parallel I/O for reading multiple sequence alignments. We also started exploring whether and how ExaML can be ported to the Intel MIC (Many Integrated Core) hardware architecture (Kozlov et al.) in a proof-of-concept setting. Here, we present ExaML (v. 3), which offers, apart from new models and data types, a production-level implementation for the Intel MIC architecture that required a substantial amount of re-engineering. We also present a novel parallel alignment I/O method. Finally, we completely rewrote the user manual. ExaML is a stable and well-documented code for large-scale phylogenetic inference on x86 Linux/Mac clusters (it compiles with gcc, icc and clang). It addresses, and provides generally applicable solutions for, several performance bottlenecks in parallel phylogenetic likelihood calculations on partitioned alignments.

2 New features

2.1 Models and data types

Apart from DNA and protein data, ExaML now also supports binary (two-state) characters. This data type can be used, for instance, to analyze genome-wide indel patterns. The set of available protein substitution models now also includes the LG4M and LG4X models (Le et al., 2012) as well as the recently published stmtREV model (Liu et al., 2014). In addition, ExaML can automatically determine the best-scoring protein substitution model for each partition via a newly implemented standard test procedure that uses either (i) the likelihood score, (ii) the AIC (Akaike Information Criterion), (iii) the cAIC (corrected AIC) or (iv) the BIC (Bayesian Information Criterion). Finally, a new option for conducting a maximum likelihood estimate of the base frequencies has become available.
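The four selection criteria listed above can be illustrated with a minimal Python sketch. It is not ExaML's implementation: the function names, the use of the number of alignment sites as the sample size, and the log-likelihoods and free-parameter counts in the example are all hypothetical and serve only to show how the criteria can disagree.

```python
import math

def selection_scores(log_likelihood, num_free_params, num_sites):
    """Return model-selection scores; for every score, lower is better.

    Assumption: the sample size n is the number of alignment sites.
    """
    k, n = num_free_params, num_sites
    aic = 2.0 * k - 2.0 * log_likelihood
    # Small-sample correction of the AIC; only defined for n > k + 1.
    caic = aic + (2.0 * k * (k + 1)) / (n - k - 1)
    bic = k * math.log(n) - 2.0 * log_likelihood
    # Negated log-likelihood, so that "lower is better" holds here too.
    return {"LNL": -log_likelihood, "AIC": aic, "cAIC": caic, "BIC": bic}

def best_model(candidates, num_sites, criterion):
    """candidates maps a model name to (log-likelihood, free parameters)."""
    scores = {
        name: selection_scores(lnl, k, num_sites)[criterion]
        for name, (lnl, k) in candidates.items()
    }
    return min(scores, key=scores.get)

# Hypothetical values for one partition of 500 sites.
candidates = {
    "LG": (-12345.6, 0),
    "WAG": (-12400.2, 0),
    "LG4X": (-12340.1, 6),
}
for criterion in ("LNL", "AIC", "cAIC", "BIC"):
    print(criterion, best_model(candidates, 500, criterion))
```

With these made-up numbers the raw likelihood favors the parameter-richer model, whereas the three penalized criteria favor the simpler one, which is why the choice of criterion matters.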

2.2 Parallel I/O optimization

The parallel I/O for reading the input alignment is optimized in two ways. First, a plain-text PHYLIP file is analyzed and transformed into a binary file format by a dedicated parser component. The parser can be executed on a standard server and therefore does not consume valuable supercomputer time for tasks such as checking the format, detecting duplicate taxon names, compressing site patterns, etc. The binary alignment file contains global information (alignment length, data types, model options, partition boundaries) as well as the raw sequences stored in the order of the partitions. That is, the sequences for all taxa of partition one are stored first, then the sequences for all taxa of partition two, etc. Second, this as yet unpublished binary format allows each ExaML process to concurrently read only those parts of the alignment on which it will be computing likelihoods. This optimization yielded an acceleration of more than one order of magnitude for the start-up phase of ExaML during which the alignment is read (see the online Supplementary Material for more details).
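Why a partition-major layout permits selective reads can be seen in the following Python sketch. The byte layout used here (a fixed-size header followed, for each partition, by one row per taxon with one byte per site) is an assumption made purely for illustration; the actual ExaML binary format is not specified in this text, and the function names are hypothetical.

```python
import io

def partition_offsets(header_bytes, num_taxa, partition_widths):
    """Byte offset at which the block of each partition starts.

    Assumed layout: header, then for every partition a block of
    num_taxa rows, each row holding one byte per site of that partition.
    """
    offsets, pos = [], header_bytes
    for width in partition_widths:
        offsets.append(pos)
        pos += num_taxa * width
    return offsets

def read_slice(stream, offsets, partition_widths, num_taxa, part, first, last):
    """Read sites [first, last) of one partition for all taxa via seeks."""
    width = partition_widths[part]
    rows = []
    for taxon in range(num_taxa):
        # Jump directly to this taxon's row; nothing else is touched.
        stream.seek(offsets[part] + taxon * width + first)
        rows.append(stream.read(last - first))
    return rows

# Toy in-memory "file": 4-byte header, 3 taxa, partitions of 4 and 6 sites.
widths = [4, 6]
blob = b"HDR!" + b"AAAACCCCGGGG" + b"ACGTACTTTTTTGGCCAA"
offsets = partition_offsets(4, 3, widths)
rows = read_slice(io.BytesIO(blob), offsets, widths, 3, part=1, first=2, last=5)
print(offsets, rows)  # [4, 16] [b'GTA', b'TTT', b'CCA']
```

Because the offsets follow from the header information alone, every process can compute them independently and read its own slice without any coordination; a real MPI code would typically perform such reads through its parallel I/O layer rather than through a Python stream.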

2.3 New load balance algorithm

The typical use case for modern likelihood-based inference is a so-called partitioned analysis, where subsets of the alignment sites form partitions that each evolve under a distinct set of evolutionary model parameters (e.g. α shape parameter, GTR rates, base frequencies, etc.). ExaML parallelizes likelihood calculations over alignment sites using MPI (Message Passing Interface). The time for likelihood calculations on a partition i includes a constant part (irrespective of the partition length), mainly the calculation of the P matrix via exponentiation of the Q matrix given the branch length t, that is, $P(t) = e^{Qt}$ (Felsenstein, 2004). Once the P matrix has been computed, one can calculate the per-site likelihoods for each site in partition i in parallel. We observed that, because of additional synchronization and communication overhead, it is not advantageous to first parallelize all P matrix calculations and subsequently (in a second parallel region) compute all per-site likelihoods. Thus, because P needs to be computed redundantly by every process, even if it holds just one alignment site belonging to partition i, we need to minimize the number of partitions for which a process calculates the likelihood. At the same time, we need to distribute the sites evenly among processors for optimal load balance (i.e. we need to split up some partitions among processes). To this end, we formulated a bi-criterion problem to define the optimal data distribution of partitions and sites to processors (Kobert et al.). We showed that: (i) the optimization problem is NP-hard, (ii) an approximation algorithm with a guaranteed bound exists, (iii) the algorithm misses the optimum by at most one partition (one more P calculation is required at one or more processes than in the optimal solution).
We also showed that this new data distribution algorithm outperforms previous approaches (cyclic distribution of sites, monolithic distribution of partitions) in almost all cases, with only minor deviations in the cases where its performance was worse. In addition, we showed that ExaML runs up to three times faster (see Fig. 4b in Kobert et al.) than with the previous data distribution schemes.
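The trade-off between the two criteria can be made concrete with a small Python sketch that compares the two earlier schemes with a naive contiguous split. The third scheme is only a simple heuristic chosen for illustration; it is not the approximation algorithm of Kobert et al., and all function names and partition lengths are hypothetical. For each scheme the sketch reports the maximum number of sites held by any process (load balance) and the maximum number of partitions touched by any process (redundant P matrix calculations).

```python
from collections import defaultdict

def cyclic(widths, p):
    """Cyclic distribution: global site index s goes to process s mod p."""
    loads = [defaultdict(int) for _ in range(p)]
    s = 0
    for part, w in enumerate(widths):
        for _ in range(w):
            loads[s % p][part] += 1
            s += 1
    return loads

def monolithic(widths, p):
    """Whole partitions only: longest first, to the least-loaded process."""
    loads = [defaultdict(int) for _ in range(p)]
    for part in sorted(range(len(widths)), key=lambda i: -widths[i]):
        target = min(range(p), key=lambda r: sum(loads[r].values()))
        loads[target][part] += widths[part]
    return loads

def contiguous_blocks(widths, p):
    """Naive heuristic: cut the concatenated sites into p equal chunks."""
    total = sum(widths)
    loads = [defaultdict(int) for _ in range(p)]
    start = 0
    for part, w in enumerate(widths):
        end = start + w
        for r in range(p):
            lo, hi = r * total // p, (r + 1) * total // p
            overlap = min(end, hi) - max(start, lo)
            if overlap > 0:
                loads[r][part] += overlap
        start = end
    return loads

def summarize(loads):
    """(max sites per process, max partitions per process)."""
    return (max(sum(l.values()) for l in loads), max(len(l) for l in loads))

widths = [100, 20, 20, 20, 20, 20]  # hypothetical partition lengths
for scheme in (cyclic, monolithic, contiguous_blocks):
    print(scheme.__name__, summarize(scheme(widths, 4)))
# cyclic (50, 6)
# monolithic (100, 2)
# contiguous_blocks (50, 3)
```

In this toy case the cyclic scheme balances sites perfectly but forces every process to compute P for all six partitions, the monolithic scheme needs few P calculations but leaves one process with twice the average load, and even the naive split lands between the two extremes on both criteria.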

2.4 Checkpointing

Like RAxML-Light, ExaML allows for checkpointing and subsequently restarting tree searches from light-weight binary checkpoints. This is particularly useful on typical supercomputer configurations with run-time limits of 24 or 48 h. Apart from the tree search, ExaML also offers an option to estimate model parameters and branch lengths on a given set of fixed trees. This option is also checkpointable.

2.5 ExaML MIC version

An increasing number of supercomputing centers now operate systems with heterogeneous CPU/MIC compute nodes. To support such systems, we have transformed our initial proof-of-concept MIC code into production-level hybrid software that can leverage the capacity of all resources in such a system. A detailed description of this novel and non-trivial parallelization approach, which requires exploiting parallelism and handling load balance at essentially two levels (MPI among MIC cards and ‘normal’ CPUs, and OpenMP within cards), is provided in the online Supplementary Material. As we show in the supplement, the hybrid code makes better use of currently available hardware resources.

3 User support and future work

User support is provided via the RAxML Google group at: https://groups.google.com/forum/?hl=en#!forum/raxml. The ExaML source code contains a comprehensive manual. Future work includes the continued maintenance and support of ExaML and the implementation of additional models [e.g. models with ascertainment bias correction (Lewis, 2001)], data types and search algorithms (Nguyen et al.).

References

1.  A likelihood approach to estimating phylogeny from discrete morphological character data.

Authors:  P O Lewis
Journal:  Syst Biol       Date:  2001 Nov-Dec       Impact factor: 15.683

2.  Modeling protein evolution with several amino acid replacement matrices depending on site rates.

Authors:  Si Quang Le; Cuong Cao Dang; Olivier Gascuel
Journal:  Mol Biol Evol       Date:  2012-04-06       Impact factor: 16.240

3.  Phylogenomics resolves the timing and pattern of insect evolution.

Authors:  Bernhard Misof; Shanlin Liu; Karen Meusemann; Ralph S Peters; Alexander Donath; Christoph Mayer; Paul B Frandsen; Jessica Ware; Tomáš Flouri; Rolf G Beutel; Oliver Niehuis; Malte Petersen; Fernando Izquierdo-Carrasco; Torsten Wappler; Jes Rust; Andre J Aberer; Ulrike Aspöck; Horst Aspöck; Daniela Bartel; Alexander Blanke; Simon Berger; Alexander Böhm; Thomas R Buckley; Brett Calcott; Junqing Chen; Frank Friedrich; Makiko Fukui; Mari Fujita; Carola Greve; Peter Grobe; Shengchang Gu; Ying Huang; Lars S Jermiin; Akito Y Kawahara; Lars Krogmann; Martin Kubiak; Robert Lanfear; Harald Letsch; Yiyuan Li; Zhenyu Li; Jiguang Li; Haorong Lu; Ryuichiro Machida; Yuta Mashimo; Pashalia Kapli; Duane D McKenna; Guanliang Meng; Yasutaka Nakagaki; José Luis Navarrete-Heredia; Michael Ott; Yanxiang Ou; Günther Pass; Lars Podsiadlowski; Hans Pohl; Björn M von Reumont; Kai Schütte; Kaoru Sekiya; Shota Shimizu; Adam Slipinski; Alexandros Stamatakis; Wenhui Song; Xu Su; Nikolaus U Szucsich; Meihua Tan; Xuemei Tan; Min Tang; Jingbo Tang; Gerald Timelthaler; Shigekazu Tomizuka; Michelle Trautwein; Xiaoli Tong; Toshiki Uchifune; Manfred G Walzl; Brian M Wiegmann; Jeanne Wilbrandt; Benjamin Wipfler; Thomas K F Wong; Qiong Wu; Gengxiong Wu; Yinlong Xie; Shenzhou Yang; Qing Yang; David K Yeates; Kazunori Yoshizawa; Qing Zhang; Rui Zhang; Wenwei Zhang; Yunhui Zhang; Jing Zhao; Chengran Zhou; Lili Zhou; Tanja Ziesmann; Shijie Zou; Yingrui Li; Xun Xu; Yong Zhang; Huanming Yang; Jian Wang; Jun Wang; Karl M Kjer; Xin Zhou
Journal:  Science       Date:  2014-11-06       Impact factor: 47.728

4.  Mitochondrial phylogenomics of early land plants: mitigating the effects of saturation, compositional heterogeneity, and codon-usage bias.

Authors:  Yang Liu; Cymon J Cox; Wei Wang; Bernard Goffinet
Journal:  Syst Biol       Date:  2014-07-28       Impact factor: 15.683

5.  RAxML-Light: a tool for computing terabyte phylogenies.

Authors:  A Stamatakis; A J Aberer; C Goll; S A Smith; S A Berger; F Izquierdo-Carrasco
Journal:  Bioinformatics       Date:  2012-05-24       Impact factor: 6.937

6.  RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies.

Authors:  Alexandros Stamatakis
Journal:  Bioinformatics       Date:  2014-01-21       Impact factor: 6.937

7.  IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies.

Authors:  Lam-Tung Nguyen; Heiko A Schmidt; Arndt von Haeseler; Bui Quang Minh
Journal:  Mol Biol Evol       Date:  2014-11-03       Impact factor: 16.240

8.  Whole-genome analyses resolve early branches in the tree of life of modern birds.

Authors:  Erich D Jarvis; Siavash Mirarab; Andre J Aberer; Bo Li; Peter Houde; Cai Li; Simon Y W Ho; Brant C Faircloth; Benoit Nabholz; Jason T Howard; Alexander Suh; Claudia C Weber; Rute R da Fonseca; Jianwen Li; Fang Zhang; Hui Li; Long Zhou; Nitish Narula; Liang Liu; Ganesh Ganapathy; Bastien Boussau; Md Shamsuzzoha Bayzid; Volodymyr Zavidovych; Sankar Subramanian; Toni Gabaldón; Salvador Capella-Gutiérrez; Jaime Huerta-Cepas; Bhanu Rekepalli; Kasper Munch; Mikkel Schierup; Bent Lindow; Wesley C Warren; David Ray; Richard E Green; Michael W Bruford; Xiangjiang Zhan; Andrew Dixon; Shengbin Li; Ning Li; Yinhua Huang; Elizabeth P Derryberry; Mads Frost Bertelsen; Frederick H Sheldon; Robb T Brumfield; Claudio V Mello; Peter V Lovell; Morgan Wirthlin; Maria Paula Cruz Schneider; Francisco Prosdocimi; José Alfredo Samaniego; Amhed Missael Vargas Velazquez; Alonzo Alfaro-Núñez; Paula F Campos; Bent Petersen; Thomas Sicheritz-Ponten; An Pas; Tom Bailey; Paul Scofield; Michael Bunce; David M Lambert; Qi Zhou; Polina Perelman; Amy C Driskell; Beth Shapiro; Zijun Xiong; Yongli Zeng; Shiping Liu; Zhenyu Li; Binghang Liu; Kui Wu; Jin Xiao; Xiong Yinqi; Qiuemei Zheng; Yong Zhang; Huanming Yang; Jian Wang; Linnea Smeds; Frank E Rheindt; Michael Braun; Jon Fjeldsa; Ludovic Orlando; F Keith Barker; Knud Andreas Jønsson; Warren Johnson; Klaus-Peter Koepfli; Stephen O'Brien; David Haussler; Oliver A Ryder; Carsten Rahbek; Eske Willerslev; Gary R Graves; Travis C Glenn; John McCormack; Dave Burt; Hans Ellegren; Per Alström; Scott V Edwards; Alexandros Stamatakis; David P Mindell; Joel Cracraft; Edward L Braun; Tandy Warnow; Wang Jun; M Thomas P Gilbert; Guojie Zhang
Journal:  Science       Date:  2014-12-12       Impact factor: 47.728
