Alexey M Kozlov (1), Diego Darriba (1), Tomáš Flouri (1), Benoit Morel (1), Alexandros Stamatakis (1,2)
(1) Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
(2) Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
Abstract
MOTIVATION: Phylogenies are important for fundamental biological research, but also have numerous applications in biotechnology, agriculture and medicine. Finding the optimal tree under the popular maximum likelihood (ML) criterion is known to be NP-hard. Thus, highly optimized and scalable codes are needed to analyze constantly growing empirical datasets.
RESULTS: We present RAxML-NG, a from-scratch re-implementation of the established greedy tree search algorithm of RAxML/ExaML. RAxML-NG offers improved accuracy, flexibility, speed, scalability, and usability compared with RAxML/ExaML. On taxon-rich datasets, RAxML-NG typically finds higher-scoring trees than IQ-Tree, an increasingly popular recent tool for ML-based phylogenetic inference (although IQ-Tree shows better stability). Finally, RAxML-NG introduces several new features, such as the detection of terraces in tree space and the recently introduced transfer bootstrap support metric.
AVAILABILITY AND IMPLEMENTATION: The code is available under GNU GPL at https://github.com/amkozlov/raxml-ng. The RAxML-NG web service (maintained by Vital-IT) is available at https://raxml-ng.vital-it.ch/.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
1 Introduction
RAxML (Stamatakis, 2014) is a popular maximum likelihood (ML) tree inference tool which has been developed and supported by our group for the last 15 years. More recently, we also released ExaML (Kozlov ), a dedicated code for analyzing genome-scale datasets on supercomputers. ExaML implements the core tree search functionality of RAxML and scales to thousands of CPU cores. Other widely used ML inference tools are, for instance, IQ-Tree (Nguyen ), PhyML (Guindon ) and FastTree (Price ).
Here, we introduce our new code called RAxML-NG (RAxML Next Generation). It combines the strengths and concepts of RAxML and ExaML, and offers several additional improvements which we describe in the next section.
2 New features and optimizations
2.1 Evolutionary model extensions
While RAxML/ExaML only fully supported the General Time Reversible (GTR) model of DNA substitution, RAxML-NG now supports all 22 ‘classical’ GTR-derived models. All model parameters (including branch lengths) can be either optimized or fixed to user-specified values. RAxML-NG also offers the following features: edge-proportional branch length estimation for multi-gene alignments; the FreeRate model of rate heterogeneity (Yang, 1995); and per-rate scalers in the Γ model of rate heterogeneity to prevent numerical underflow on large trees.
2.2 Search algorithm modifications
The subtree enumeration method used in RAxML/ExaML occasionally skipped promising topological moves; this has now been fixed in RAxML-NG (see Supplementary Material for details). Further, RAxML-NG employs a two-step L-BFGS-B method (Fletcher, 1987) to optimize the parameters of the LG4X model (Le ). This approach (first introduced in IQ-Tree) is usually faster and more stable than the sequential optimization using Brent’s method in RAxML/ExaML.
2.3 Transfer bootstrap
RAxML-NG can compute a novel branch support metric, the transfer bootstrap expectation (TBE), recently proposed in (Lemoine ). Compared with the classic Felsenstein bootstrap, TBE is less sensitive to individual misplaced taxa in replicate trees, and is thus better suited to reveal well-supported deep splits in large trees with thousands of taxa.
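To illustrate why TBE tolerates a single misplaced taxon, the following minimal Python sketch computes both support values for one branch. It follows the published definition of the transfer distance, not the RAxML-NG implementation. The function names, the representation of a tree as a list of bipartitions (each given by one side as a `frozenset`), and the toy replicate trees are all assumptions made for this example.

```python
def transfer_distance(b, b_star, taxa):
    """Minimum number of taxa that must change sides to turn b into b_star."""
    hamming = len(b ^ b_star)  # taxa on which the two chosen sides disagree
    # The side chosen to represent b_star is arbitrary, so try its complement too.
    return min(hamming, len(taxa) - hamming)


def transfer_support(b, replicate_trees, taxa):
    """Transfer bootstrap expectation of branch b over a set of replicate trees."""
    p = min(len(b), len(taxa) - len(b))  # size of the smaller side of b
    if p < 2:
        raise ValueError("TBE is only defined for internal branches")
    total = 0
    for splits in replicate_trees:
        index = min(transfer_distance(b, s, taxa) for s in splits)
        # A single-taxon (trivial) split is present in every tree and is at
        # distance p - 1, so the transfer index never exceeds p - 1.
        total += min(index, p - 1)
    return 1 - (total / len(replicate_trees)) / (p - 1)


def felsenstein_support(b, replicate_trees, taxa):
    """Fraction of replicate trees that contain exactly the bipartition b."""
    hits = sum(any(s == b or s == taxa - b for s in splits)
               for splits in replicate_trees)
    return hits / len(replicate_trees)


# Hypothetical toy data: six taxa and two replicate trees (internal splits only).
taxa = frozenset("ABCDEF")
branch = frozenset("ABC")
rep1 = [frozenset("AB"), frozenset("ABC"), frozenset("EF")]  # contains the branch
rep2 = [frozenset("AB"), frozenset("ABD"), frozenset("EF")]  # C is misplaced

tbe = transfer_support(branch, [rep1, rep2], taxa)
fbp = felsenstein_support(branch, [rep1, rep2], taxa)
print(tbe, fbp)
```

In this toy case the second replicate differs from the reference branch by one taxon, so it counts as a complete miss for the Felsenstein bootstrap but only as a partial miss for TBE.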
2.4 Phylogenetic terraces
Certain patterns of missing data in multi-gene alignments can yield multiple tree topologies with identical likelihood scores, a phenomenon known as terraces in tree space (Sanderson ). RAxML-NG employs the recently released terraphast library (Biczok ) to assess whether the inferred best-scoring ML tree resides on a terrace, and to report the size of that terrace.
2.5 Performance and scalability
In RAxML-NG, we further optimized the vectorized likelihood computation kernels and eliminated known sequential bottlenecks of RAxML. We also integrated an optimization technique for likelihood calculations known as site repeats (Kobert ) which yields runtime improvements of 10–60%. Finally, RAxML-NG implements several features for enhancing parallel efficiency, previously only available in ExaML: efficient fine-grained parallelization with MPI or MPI+pthreads; a binary input file format (compressed alignment); restart from a checkpoint; and improved load balancing for multi-gene alignments (Kobert ).
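The idea behind site repeats can be sketched as follows: within a given subtree, two alignment sites that show the same characters at all tips of that subtree have identical conditional likelihoods at the subtree root, so the value needs to be computed only once per distinct pattern. The Python snippet below is a naive illustration of this grouping step only. The function name, the dictionary-based alignment, and the toy sequences are assumptions for this example, and the actual kernels in libpll are organized differently.

```python
def site_repeat_classes(alignment, subtree_taxa):
    """Give each site a class ID; sites with the same tip pattern share an ID."""
    n_sites = len(alignment[subtree_taxa[0]])
    class_of_pattern = {}  # tip pattern -> class ID, in order of first occurrence
    classes = []
    for site in range(n_sites):
        pattern = tuple(alignment[taxon][site] for taxon in subtree_taxa)
        class_id = class_of_pattern.setdefault(pattern, len(class_of_pattern))
        classes.append(class_id)
    return classes


# Hypothetical toy alignment with three taxa and six sites.
alignment = {
    "A": "ACGTAC",
    "B": "ACGTAC",
    "C": "AAGTCC",
}

small = site_repeat_classes(alignment, ["A", "B"])
full = site_repeat_classes(alignment, ["A", "B", "C"])
print(small)  # sites 0 and 4, and sites 1 and 5, repeat within subtree (A, B)
print(full)   # no repeats once taxon C is included
```

For the subtree containing only A and B, four distinct patterns cover all six sites, so two of the six per-site computations could be skipped. Repeats become rarer as subtrees grow, which is one reason the observed savings depend on the dataset.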
2.6 Usability
Several RAxML-NG features aim to improve usability and avoid common pitfalls, including auto-detection of the CPU instruction set and number of cores, a recommendation for the optimal number of threads, automatic restart from the last checkpoint after program interruption, and search progress reporting in the log file.
2.7 Modularization
RAxML and ExaML are large monolithic codes, which hindered maintenance, extension and code reuse. In RAxML-NG, we encapsulated the phylogenetic likelihood kernels and numerical optimization routines in two libraries: libpll (https://github.com/xflouris/libpll-2) and pll-modules (https://github.com/ddarriba/pll-modules), respectively. Both libraries include unit tests and are also being used by other software tools developed in our lab such as ModelTest-NG and EPA-NG (Barbera ). This makes our likelihood computation code more error-proof than in RAxML/ExaML.
3 Evaluation
A recent evaluation of fast ML-based methods (Zhou ) showed that IQ-Tree yields the best tree inference accuracy, closely followed by RAxML/ExaML. Thus, we benchmarked RAxML-NG against these three programs on the collection of empirical datasets used by Zhou et al. RAxML-NG found the best-scoring tree for the highest number of datasets (19/21) among all programs tested, while being 1.3× to 4.5× faster. Furthermore, it scales to a large number of cores with a parallel efficiency of up to 125% (see Supplementary Material for details). In summary, RAxML-NG is clearly superior to RAxML/ExaML, and thus we recommend that the users of these codes upgrade as soon as possible. Comparison to IQ-Tree yielded mixed results: although RAxML-NG is generally faster and returns higher-scoring trees on taxon-rich alignments, IQ-Tree results show much lower variance. Hence, on alignments with strong phylogenetic signal, IQ-Tree may require fewer replicate searches than RAxML-NG to find the best-scoring tree.
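A parallel efficiency above 100% can look surprising, so a short worked example may help. Efficiency is conventionally defined as the measured speedup divided by the ideal speedup for the number of cores added. The Python sketch below uses this standard definition. The function name and the runtimes are made up for illustration and are not measurements from the evaluation.

```python
def parallel_efficiency(t_base, t_parallel, cores_base, cores_parallel):
    """Measured speedup relative to the ideal (linear) speedup."""
    speedup = t_base / t_parallel
    ideal_speedup = cores_parallel / cores_base
    return speedup / ideal_speedup


# Hypothetical runtimes: 1000 s on 1 core versus 50 s on 16 cores.
efficiency = parallel_efficiency(1000.0, 50.0, cores_base=1, cores_parallel=16)
print(f"{efficiency:.0%}")
```

With these invented numbers the run is 20 times faster on 16 cores, which gives an efficiency of 125%. Such super-linear values are commonly attributed to cache effects, since each core works on a smaller share of the data; the Supplementary Material should be consulted for the explanation that applies to the reported figure.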
4 Availability and user support
The RAxML-NG source code as well as pre-compiled binaries for Linux and MacOS are available at https://github.com/amkozlov/raxml-ng. RAxML-NG is also available as a web service (maintained by the Vital-IT unit of the Swiss Institute of Bioinformatics) at https://raxml-ng.vital-it.ch/. An up-to-date user manual is available at https://github.com/amkozlov/raxml-ng/wiki. User support is provided via the RAxML Google group at: https://groups.google.com/forum/#!forum/raxml.
5 Future work
In future versions of RAxML-NG, we plan to add site heterogeneity models such as RAxML-CAT (Stamatakis, 2006) and PhyloBayes-CAT (Le ), as well as non-reversible context-dependent models of evolution (Baele ). Furthermore, we plan to explore orthogonal parallelization schemes (across tree nodes and/or topological moves), for leveraging the capabilities of modern parallel hardware and more efficiently analyzing datasets with thousands of taxa.