UFBoot2: Improving the Ultrafast Bootstrap Approximation.

Diep Thi Hoang¹, Olga Chernomor², Arndt von Haeseler²,³, Bui Quang Minh², Le Sy Vinh¹.

Abstract

The standard bootstrap (SBS), despite being computationally intensive, is widely used in maximum likelihood phylogenetic analyses. We recently proposed the ultrafast bootstrap approximation (UFBoot) to reduce computing time while achieving more unbiased branch supports than SBS under mild model violations. UFBoot has been steadily adopted as an efficient alternative to SBS and other bootstrap approaches. Here, we present UFBoot2, which substantially accelerates UFBoot and reduces the risk of overestimating branch supports due to polytomies or severe model violations. Additionally, UFBoot2 provides suitable bootstrap resampling strategies for phylogenomic data. UFBoot2 is 778 times (median) faster than SBS and 8.4 times (median) faster than RAxML rapid bootstrap on tested data sets. UFBoot2 is implemented in the IQ-TREE software package version 1.6 and freely available at http://www.iqtree.org.

Keywords: maximum likelihood; model violation; phylogenetic inference; polytomies; ultrafast bootstrap

Year: 2018; PMID: 29077904; PMCID: PMC5850222; DOI: 10.1093/molbev/msx281

Source DB: PubMed; Journal: Mol Biol Evol; ISSN: 0737-4038; Impact factor: 16.240

Standard nonparametric bootstrap (SBS) (Efron 1979; Felsenstein 1985) is widely used in maximum likelihood (ML) phylogenetic analyses to estimate branch supports of a phylogenetic tree inferred from a multiple sequence alignment (MSA). To achieve this, SBS generates a large number of resampled MSAs and reconstructs an ML-tree for each bootstrapped MSA. The resulting bootstrap ML trees are then used either to compute branch supports for the ML-tree reconstructed from the original MSA or to build a consensus tree with support values. Although fast ML-tree search algorithms exist for large data sets (Vinh and von Haeseler 2004; Stamatakis 2006; Guindon et al. 2010; Nguyen et al. 2015), SBS is still very computationally intensive.

To improve computing time, rapid bootstrap (RBS; Stamatakis et al. 2008) and the ultrafast bootstrap (UFBoot; Minh et al. 2013) were developed. Although RBS resembles the conservative behavior of SBS (i.e., underestimating branch supports), UFBoot provides relatively unbiased bootstrap estimates under mild model misspecifications. The key idea behind UFBoot is to keep trees encountered during the ML-tree search for the original MSA and to use them to evaluate the tree likelihoods for the bootstrap MSAs. To speed up likelihood computation even further for bootstrap MSAs, IQ-TREE employed the resampling estimated log-likelihood (RELL) strategy (Kishino et al. 1990). For each bootstrap MSA, the tree with the highest RELL score (RELL-tree) represents the ML-bootstrap tree. Contrary to SBS, UFBoot does not further ML optimize this tree. Discrepancies in branch supports between UFBoot and SBS arise because the bootstrap trees inferred by the two methods might differ.

Here, we present UFBoot2, which substantially speeds up UFBoot and reduces the risk of overestimating branch supports due to polytomies or severe model violations. We also discuss several resampling strategies for phylogenomic data recently implemented in UFBoot2. In the following, we outline these improvements.
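The RELL idea described above can be illustrated with a minimal sketch. It assumes that per-site log-likelihoods have already been computed for each candidate tree on the original MSA; the function name and data layout are hypothetical and are not taken from IQ-TREE.

```python
import random

def rell_bootstrap_trees(site_loglik, n_replicates, seed=0):
    """Pick a RELL-tree for each bootstrap replicate.

    site_loglik[t][i] is the log-likelihood of site i under candidate tree t
    (assumed to be precomputed on the original alignment).
    Returns a list with the index of the winning tree for each replicate.
    """
    rng = random.Random(seed)
    n_sites = len(site_loglik[0])
    winners = []
    for _ in range(n_replicates):
        # Resample sites with replacement: counts[i] = times site i was drawn.
        counts = [0] * n_sites
        for _ in range(n_sites):
            counts[rng.randrange(n_sites)] += 1
        # RELL score: weighted sum of stored site log-likelihoods,
        # with no re-optimization of the tree on the resampled data.
        scores = [sum(c * ll for c, ll in zip(counts, row)) for row in site_loglik]
        winners.append(max(range(len(scores)), key=scores.__getitem__))
    return winners
```

Because the scores reuse stored site log-likelihoods, each replicate costs only a weighted sum per candidate tree rather than a new tree search.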

Accelerating UFBoot

The likelihood computation is the major runtime bottleneck of all ML software because it lies at the core of all analyses. The pruning algorithm (Felsenstein 1981) efficiently computes the likelihood of phylogenetic trees, but still does not scale well for large data sets. Therefore, we adopted a modification to Felsenstein’s algorithm (see supplementary method, Supplementary Material online), first introduced in RAxML. The modification exploits the reversible property of models of sequence evolution typically used in phylogenetic analysis, which leads to a theoretical speedup of 4 (for DNA) or 20 (for protein data) when estimating branch lengths. Moreover, we employed the SIMD (single instruction, multiple data) feature to concurrently compute the likelihood of two MSA sites with streaming SIMD extensions or four MSA sites with advanced vector extensions, thus leading to a theoretical speedup of two or four compared with a non-SIMD implementation. The IQ-TREE code was further optimized to avoid redundant computations. We benchmarked the runtimes on 70 DNA and 45 protein MSAs (DOI 10.5281/zenodo.854445) from TreeBASE, previously analyzed in Nguyen et al. (2015). The command lines used to perform the bootstrap methods are provided in supplementary table S1, Supplementary Material online. UFBoot2 achieved a median speedup of 2.4 times (maximum: 77.3) compared with UFBoot version 0.9.6 (released on October 20, 2013).
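For readers unfamiliar with the pruning algorithm, the following is a minimal sketch of the plain (unmodified) recursion for a single DNA site under the JC model. It does not implement the RAxML-style modification or any SIMD vectorization mentioned above; the tree encoding (nested tuples of `(child, branch_length)` pairs) and all function names are assumptions made for illustration.

```python
import math

STATES = "ACGT"

def jc_prob(same, t):
    """JC69 transition probability along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e

def partial(node):
    """Conditional likelihood vector of a node (one entry per state).

    A leaf is a one-letter string; an internal node is a tuple of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return [1.0 if s == node else 0.0 for s in STATES]
    result = [1.0] * 4
    for child, t in node:
        child_l = partial(child)
        for i in range(4):
            # Sum over the child's possible states, then multiply across children.
            result[i] *= sum(jc_prob(i == j, t) * child_l[j] for j in range(4))
    return result

def site_likelihood(tree):
    """Likelihood of one site, with uniform base frequencies at the root."""
    return sum(0.25 * p for p in partial(tree))

# Four-taxon example: ((A, C), (A, G)) with illustrative branch lengths.
tree = (((("A", 0.05), ("C", 0.05)), 0.02), ((("A", 0.05), ("G", 0.05)), 0.02))
print(site_likelihood(tree))
```

Under a reversible model such as JC, the result does not depend on where the root is placed along a branch, which is the property the modification described above exploits.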

Correction for Polytomies

Polytomies refer to multifurcating nodes in the tree that cannot be resolved due to low phylogenetic signal in the data. However, phylogenetic reconstruction always assumes strictly bifurcating trees. When resolving polytomies, there might be multiple equivalently optimal bifurcating trees (Whelan and Money 2010). As UFBoot (and other bootstrap approaches) saves only a single optimal bifurcating tree for each bootstrap MSA, it might produce overoptimistic bootstrap supports for short branches (Simmons and Norton 2014). To correct for this shortcoming, UFBoot2 implements the following technique. Instead of assigning the tree with the highest RELL score to each bootstrap MSA, UFBoot2 randomly selects one of the trees encountered during the tree search whose RELL score is less than a threshold ε (default: 0.5) away from the highest RELL score. As a result, UFBoot2 will not give high supports for branches resolving the multifurcations.

It was shown with a star tree simulation that SBS and RBS sometimes led to false positives (bootstrap supports ≥95% for nonexisting branches), whereas with this technique UFBoot never supported such branches (support values ≤88%) (Simmons and Norton 2014). We repeated the star tree simulation for UFBoot2 with the same settings as proposed by Simmons and Norton (2014). We used Seq-Gen 1.3.2x (Rambaut and Grassly 1997) to evolve 100 DNA MSAs, each of 15,000 sites, along a 4-taxon star tree with four terminal branch lengths of 0.05, under the JC model. For each MSA, we performed UFBoot2 runs under JC and GTR+Γ, each with 1,000 bootstrap replicates and up to 1,000 search iterations (invoked in IQ-TREE via “-bcor 1” option). The simulation results show that UFBoot2 resembles the original UFBoot in that it never supports nonexisting branches (support values ≤88%).
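The selection rule for near-optimal trees can be sketched as follows. This is an illustrative reading of the rule described above, not IQ-TREE code; the function name and the `eps` parameter name are hypothetical, with the default of 0.5 taken from the text.

```python
import random

def pick_bootstrap_tree(rell_scores, eps=0.5, rng=random):
    """Choose a bootstrap tree at random among near-optimal candidates.

    rell_scores[t] is the RELL score of candidate tree t for one bootstrap MSA.
    Every tree whose score is less than eps below the best score is eligible.
    """
    best = max(rell_scores)
    near_optimal = [i for i, s in enumerate(rell_scores) if best - s < eps]
    return rng.choice(near_optimal)

# Trees 0 and 1 are within 0.5 log-likelihood units of each other; tree 2 is not.
scores = [-100.0, -100.2, -103.0]
picks = [pick_bootstrap_tree(scores, rng=random.Random(k)) for k in range(10)]
print(picks)
```

When several resolutions of a polytomy have nearly identical scores, the random choice spreads the bootstrap trees across them, so no single resolution accumulates high support.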

Reducing the Impact of Model Violations

Minh et al. (2013) showed that severe model violations inflate UFBoot support values. To resolve this issue, UFBoot2 provides an option to conduct an additional step once the tree search on the original MSA is completed. Here, the best RELL-trees are further optimized using a hill-climbing nearest-neighbor interchange (NNI) search based directly on the corresponding bootstrap MSA. Thus, this extra step operates like SBS, but with a quick tree search to save time. Bootstrap supports are then summarized from the resulting corrected bootstrap trees. In the following, we call this UFBoot2 + NNI, which can be invoked in IQ-TREE via “-bnni” option.

We repeated the PANDIT simulations (Minh et al. 2013) to compare the accuracy of UFBoot2 and UFBoot2 + NNI with SBS (1,000 replicates using IQ-TREE) and RBS (RAxML bootstopping criterion). The simulations include 5,690 DNA MSAs (DOI 10.5281/zenodo.854445) generated by Seq-Gen (Rambaut and Grassly 1997), where the model parameters and the tree (which we will call the true tree in the following) were inferred from the original MSAs downloaded from the PANDIT database (Whelan et al. 2006). The accuracy of a bootstrap method M is defined by $f_M(v)$, the percentage of branches with support value $v$ (across all reconstructed trees) that occur in the true tree (Hillis and Bull 1993). Thus, $f_M(v)$ reflects the probability that a branch with support $v$ is a true branch.

Figure 1 shows the results [the y-axis depicts $f_M(v)$]. If the sequence evolution model used to infer the ML-tree agrees with the model used for simulations, then SBS, RBS, and UFBoot2 + NNI underestimated branch supports, the latter to a lower degree (fig. 1A; curves above the diagonal). This conservative behavior of SBS and RBS corroborates previous studies (Hillis and Bull 1993; Minh et al. 2013). In contrast, UFBoot2 obtained almost unbiased branch supports (fig. 1A; curve close to the diagonal), that is, closely matching the true probability of branches being correct. Thus, UFBoot2 resembles the behavior of the original UFBoot (Minh et al. 2013).
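As a small worked illustration of the accuracy measure, the sketch below computes, for each support value, the percentage of branches with that support that occur in the true tree. The input format (pairs of support value and a true-branch flag pooled over all reconstructed trees) and the function name are assumptions made for this example.

```python
from collections import defaultdict

def accuracy_curve(branches):
    """Map each support value v to the percentage of branches with
    support v that are present in the true tree.

    branches: iterable of (support_value, in_true_tree) pairs pooled
    over all reconstructed trees.
    """
    total = defaultdict(int)
    true = defaultdict(int)
    for v, is_true in branches:
        total[v] += 1
        if is_true:
            true[v] += 1
    return {v: 100.0 * true[v] / total[v] for v in sorted(total)}

pooled = [(95, True), (95, True), (95, False), (95, True), (70, True), (70, False)]
print(accuracy_curve(pooled))  # {70: 50.0, 95: 75.0}
```

In this toy input, branches with 95% support are true only 75% of the time, so the point (95, 75) would lie below the diagonal, which indicates overestimated support.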

Fig. 1.

Accuracy of the standard bootstrap (SBS), RAxML rapid bootstrap (RBS), ultrafast bootstrap (UFBoot2) and UFBoot2 with correction (UFBoot2 + NNI) for (A) correctly specified models and (B) severely misspecified models. The y-axis depicts the percentage of all branches with support value $v$ (across all reconstructed trees) that occur in the true tree. Curves above the diagonal indicate underestimation of branch supports whereas curves below the diagonal indicate overestimation. For each point $(x, y)$ in the curve representing the accuracy of bootstrap method M, $x$ is an inferred bootstrap value by method M whereas $y$ measures the probability of branches assigned by M with support value $x$ to be true branches, that is, occurring on the true tree. Specifically, for $v = x$, let $B_v^{+}$ ($B_v^{-}$) be the set of branches with support value $v$ in all trees and present (absent) in the true tree. The value $y$ is computed as the ratio between $|B_v^{+}|$ and $|B_v|$, where $B_v = B_v^{+} \cup B_v^{-}$.

Severe model violations do not influence SBS (fig. 1B; RBS not shown because RAxML does not support simpler models). However, UFBoot2 (like UFBoot) overestimated the branch supports (fig. 1B; curve below the diagonal), whereas UFBoot2 + NNI only slightly underestimated the bootstrap values (fig. 1B; curve closest to the diagonal). Thus, UFBoot2 + NNI helps to overcome the problem of unduly high supports by UFBoot2 in the presence of severe model violations.

In terms of computation times based on the analysis of 115 benchmark MSAs, UFBoot2 and UFBoot2 + NNI showed a median speedup of 778 (range: 200–1,848) and 424 (range: 233–749) compared with SBS, respectively. Compared with RBS, UFBoot2 and UFBoot2 + NNI are 8.4 (range: 1.5–51.2) and 5.0 (range: 0.8–32.6) times faster, respectively. Therefore, UFBoot2 + NNI is two times (median) slower than UFBoot2. Supplementary Figures S1–S3, Supplementary Material online, show the distributions of runtime ratios between SBS/RBS/UFBoot and UFBoot2/UFBoot2 + NNI.
We conclude that UFBoot2 and UFBoot2 + NNI are fast alternatives to other bootstrap approaches. Under no or mild model violations, UFBoot2 has the interpretation of unbiased bootstrap support as suggested for UFBoot (Minh et al. 2013). That is, one can trust branches with UFBoot2 support ≥95%. Users are advised to apply model violation detection methods (Goldman 1993; Weiss and von Haeseler 2003; Nguyen et al. 2011) before bootstrap analyses. UFBoot2 + NNI should be applied if severe model violations are present in the data set at hand.

Resampling Strategies for Phylogenomic Data

Recent phylogenetic analyses typically use multiple genes to infer the species tree, an approach known as phylogenomics. To facilitate phylogenomic analysis, UFBoot2 implements several bootstrap resampling strategies: i) resampling MSA-sites within partitions (MSA-site resampling, the default option), ii) resampling genes instead of MSA-sites (gene-resampling, invoked via “-bsam GENE” option), and iii) resampling genes and subsequently resampling MSA-sites within each gene (gene-site resampling, invoked via “-bsam GENESITE” option) (Gadagkar et al. 2005). Strategy (i) preserves the number of MSA-sites for all genes in the bootstrap MSAs, whereas strategies (ii) and (iii) will lead to different numbers of sites in the bootstrap MSAs. To investigate the impact of the three resampling strategies, we reanalyzed the metazoan data with 21 species, 225 genes, and a total of 171,077 amino-acid sites (Salichos and Rokas 2013). Figure 2 shows the ML tree inferred with IQ-TREE under the edge-unlinked partition model (Chernomor et al. 2016), which allows separate sets of branch lengths across partitions. The tree replicates previous results (Salichos and Rokas 2013) and shows the Protostomia clade (Telford et al. 2015). However, discrepancies between resampling strategies are observed: while MSA-site and gene-resampling obtained high supports (>95%) for branches along the backbone of the tree (fig. 2; bold lines), lower supports (80%) were estimated by gene-site resampling.
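The three resampling strategies can be contrasted with a short sketch. It assumes each gene is represented as a list of alignment columns (here just labels); the function names are hypothetical and the code is not taken from IQ-TREE.

```python
import random

def resample_sites(genes, rng):
    """(i) MSA-site resampling: draw sites with replacement within each
    gene, so every gene keeps its original number of sites."""
    return [[rng.choice(g) for _ in g] for g in genes]

def resample_genes(genes, rng):
    """(ii) Gene resampling: draw whole genes with replacement and keep
    their sites unchanged; the total number of sites may change."""
    return [list(rng.choice(genes)) for _ in genes]

def resample_genes_then_sites(genes, rng):
    """(iii) Gene-site resampling: draw genes with replacement, then
    resample the sites within each drawn gene."""
    return resample_sites(resample_genes(genes, rng), rng)

genes = [["a1", "a2", "a3"], ["b1", "b2"], ["c1", "c2", "c3", "c4"]]
rng = random.Random(42)
for strategy in (resample_sites, resample_genes, resample_genes_then_sites):
    replicate = strategy(genes, rng)
    print(strategy.__name__, sum(len(g) for g in replicate), "sites")
```

Only the first strategy is guaranteed to reproduce the original total of nine sites in this toy example; the other two depend on which genes happen to be drawn.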

Fig. 2.

Maximum-likelihood tree inferred under the edge-unlinked partition model. Numbers attached to the branches show the UFBoot2 bootstrap supports using MSA-site, gene, and gene-site resampling strategies (omitted when all three supports are 100%).

By further examining 14 other empirical data sets (Bouchenak-Khelladi et al. 2008; Fabre et al. 2009; Stamatakis and Alachiotis 2010; van der Linde et al. 2010; Pyron et al. 2011; Nyakatura and Bininda-Emonds 2012; Springer et al. 2012; Hinchliff and Roalson 2013; Salichos and Rokas 2013; Dell’Ampio et al. 2014), we observed more discrepancies between resampling strategies (data not shown). Exceptionally, for some data sets, a number of branches showed almost no support (≤10%) for one resampling strategy but high supports (≥95%) for the other two. However, no resampling strategy yields systematically lower supports than the others. Taking into account the above findings, we recommend applying all alternative resampling strategies. If similar bootstrap supports are obtained, then one can be more confident about the results.

Conclusions

UFBoot2 significantly improves speed and accuracy of bootstrap values compared with UFBoot. It also offers new functionalities in the presence of model violations and in its applicability to phylogenomic data. In general, since SBS, RBS, and UFBoot2 + NNI share a disadvantage of being conservative, more research is necessary to understand the different biases introduced by the available phylogenetic bootstrap estimation methods.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

28 in total

1. Testing substitution models within a phylogenetic tree.

Authors: Gunter Weiss; Arndt von Haeseler
Journal: Mol Biol Evol Date: 2003-04-02 Impact factor: 16.240

2. MISFITS: evaluating the goodness of fit between a phylogenetic model and an alignment.

Authors: Minh Anh Thi Nguyen; Steffen Klaere; Arndt von Haeseler
Journal: Mol Biol Evol Date: 2010-07-19 Impact factor: 16.240

3. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models.

Authors: Alexandros Stamatakis
Journal: Bioinformatics Date: 2006-08-23 Impact factor: 6.937

4. Large multi-gene phylogenetic trees of the grasses (Poaceae): progress towards complete tribal and generic level sampling.

Authors: Yanis Bouchenak-Khelladi; Nicolas Salamin; Vincent Savolainen; Felix Forest; Michelle van der Bank; Mark W Chase; Trevor R Hodkinson
Journal: Mol Phylogenet Evol Date: 2008-02-09 Impact factor: 4.286

5. Statistical tests of models of DNA substitution.

Authors: N Goldman
Journal: J Mol Evol Date: 1993-02 Impact factor: 2.395

6. CONFIDENCE LIMITS ON PHYLOGENIES: AN APPROACH USING THE BOOTSTRAP.

Authors: Joseph Felsenstein
Journal: Evolution Date: 1985-07 Impact factor: 3.694

7. Divergent maximum-likelihood-branch-support values for polytomies.

Authors: Mark P Simmons; Andrew P Norton
Journal: Mol Phylogenet Evol Date: 2014-02-04 Impact factor: 4.286

8. Updating the evolutionary history of Carnivora (Mammalia): a new species-level supertree complete with divergence time estimates.

Authors: Katrin Nyakatura; Olaf R P Bininda-Emonds
Journal: BMC Biol Date: 2012-02-27 Impact factor: 7.431

9. PANDIT: an evolution-centric database of protein and associated nucleotide domains with inferred trees.

Authors: Simon Whelan; Paul I W de Bakker; Emmanuel Quevillon; Nicolas Rodriguez; Nick Goldman
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

10. Decisive data sets in phylogenomics: lessons from studies on the phylogenetic relationships of primarily wingless insects.

Authors: Emiliano Dell'Ampio; Karen Meusemann; Nikolaus U Szucsich; Ralph S Peters; Benjamin Meyer; Janus Borner; Malte Petersen; Andre J Aberer; Alexandros Stamatakis; Manfred G Walzl; Bui Quang Minh; Arndt von Haeseler; Ingo Ebersberger; Günther Pass; Bernhard Misof
Journal: Mol Biol Evol Date: 2013-10-18 Impact factor: 16.240
