Morgan N Price, Paramvir S Dehal, Adam P Arkin.
Abstract
Gene families are growing rapidly, but standard methods for inferring phylogenies do not scale to alignments with over 10,000 sequences. We present FastTree, a method for constructing large phylogenies and for estimating their reliability. Instead of storing a distance matrix, FastTree stores sequence profiles of internal nodes in the tree. FastTree uses these profiles to implement Neighbor-Joining and uses heuristics to quickly identify candidate joins. FastTree then uses nearest neighbor interchanges to reduce the length of the tree. For an alignment with N sequences, L sites, and a different characters (a is the alphabet size), a distance matrix requires O(N^2) space and O(N^2 L) time, but FastTree requires just O(NLa + N√N) memory and O(N√N log(N) La) time. To estimate the tree's reliability, FastTree uses local bootstrapping, which gives another 100-fold speedup over a distance matrix. For example, FastTree computed a tree and support values for 158,022 distinct 16S ribosomal RNAs in 17 h and 2.4 GB of memory. Just computing pairwise Jukes-Cantor distances and storing them, without inferring a tree or bootstrapping, would require 17 h and 50 GB of memory. In simulations, FastTree was slightly more accurate than Neighbor-Joining, BIONJ, or FastME; on genuine alignments, FastTree's topologies had higher likelihoods. FastTree is available at http://microbesonline.org/fasttree.
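The idea of storing a profile instead of distance-matrix rows can be illustrated with a small sketch. It assumes a gap-free nucleotide alignment and the simple "fraction of differing sites" distance; the function names are hypothetical and are not taken from FastTree's source code. A profile holds one frequency per character per site, so its size is L × a however many sequences it summarizes, and the distance between two profiles equals the average distance over all sequence pairs drawn from the two groups.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed nucleotide alphabet, so a = 4


def profile(seqs):
    """Per-site character frequencies for a group of aligned, gap-free sequences."""
    n = len(seqs)
    length = len(seqs[0])
    return [
        {c: sum(s[i] == c for s in seqs) / n for c in ALPHABET}
        for i in range(length)
    ]


def profile_distance(p, q):
    """Average over sites of the probability that two characters, drawn
    independently from the two profiles, differ."""
    per_site = [
        1.0 - sum(p_i[c] * q_i[c] for c in ALPHABET)
        for p_i, q_i in zip(p, q)
    ]
    return sum(per_site) / len(per_site)


def mean_pairwise_mismatch(group_a, group_b):
    """Reference value: the same quantity computed from every sequence pair."""
    total = 0.0
    for s, t in product(group_a, group_b):
        total += sum(x != y for x, y in zip(s, t)) / len(s)
    return total / (len(group_a) * len(group_b))


group_a = ["ACGTAC", "ACGTTC"]
group_b = ["ACCTAC", "TCGTAC", "ACGTAA"]
d_profile = profile_distance(profile(group_a), profile(group_b))
d_pairs = mean_pairwise_mismatch(group_a, group_b)
```

Here `d_profile` and `d_pairs` agree, which is why a profile can stand in for the corresponding block of a distance matrix. The sketch ignores gaps and distance corrections, which a practical implementation would have to handle.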
Year: 2009 PMID: 19377059 PMCID: PMC2693737 DOI: 10.1093/molbev/msp077
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Figure. Overview of FastTree.
CPU Time and Memory Usage for Computing Distances, Trees, and Support Values
| Program | Support | COG2814 (h) | COG2814 (GB) | PF00005 (h) | PF00005 (GB) | 16S rRNA (h) | 16S rRNA (GB) |
| FastTree 1.0 | None | 0.06 | 0.16 | 0.52 | 0.3 | 16.3 | 2.4 |
| FastTree 1.0 | Local 1,000 | 0.08 | 0.16 | 0.56 | 0.3 | 17.3 | 2.4 |
| Log-corrected distances | | 0.05 | 0.13 | 0.71 | 2.8 | 33.1 | 49.9 |
| Maximum likelihood distances | | 138 | 0.72 | ≈ 3,000 | — | ≈ 5,000 | — |
| Clearcut 1.0.8 | None | 0.06 | 0.22 | 1.44 | 5.2 | ≈ 28.6 | ≈ 52 |
| RapidNJ 1.0.0 | None | 0.05 | 2.2 | ≈ 0.9 | ≈ 55 | ≈ 22.1 | ≈ 549 |
| FastME 1.1 | None | 0.51 | 4.2 | ≈ 12.5 | ≈ 105 | ≈ 138 | ≈ 1,000 |
| QuickTree 1.1 | None | 0.24 | 0.16 | 22.7 | 2.9 | ≈ 1,500 | ≈ 47 |
| QuickTree 1.1 | Boot 100 | 63.5 | 0.71 | ≈ 10^4 | ≈ 15.5 | ≈ 10^5 | ≈ 254 |
| BIONJ | None | 32.9 | 0.44 | ≈ 820 | ≈ 10.9 | ≈ 10^5 | ≈ 110 |
| PhyML 3 | Approximate likelihood ratio test | > 1,000 | 9.5 | — | — | — | — |
| RAxML VI 1.0 | None | > 1,000 | 0.70 | — | — | — | — |
| Consense | Boot 100 | 1.09 | 0.36 | 118 | 9.4 | ≈ 3,700 | ≈ 94 |
NOTE.—aLRT, approximate likelihood ratio test.
The time to compute the distances between all N^2 pairs of sequences in the alignment, as implemented by the authors, and the space required to store the N(N – 1)/2 distinct entries of the distance matrix. For nucleotide sequences, these are the same as Jukes–Cantor distances.
For protein sequences, we used PHYLIP's protdist and default options (JTT model, no variation of rates across sites). For nucleotide sequences, we used PHYLIP's dnadist with the F84 model and gamma-distributed rates.
These timings include half of the time to compute N^2 log-corrected distances because the method requires a distance matrix but each pair of sequences only needs to be considered once.
Using QuickTree's built-in implementation of %different distances and of global bootstrap.
For best performance, we used no variation of rates across sites.
For best performance, we used no variation of rates across sites and the fast hill-climbing option (-f d). For an initial topology, we used the BIONJ tree.
This does not include the time to compute the resampled trees.
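The memory figures for the distance matrix can be checked with a little arithmetic on the N(N – 1)/2 distinct entries mentioned in the notes above. The sketch below assumes 4 bytes per entry (a single-precision float); the source does not state the entry size, so treat that as an assumption. The sequence counts are the numbers of distinct sequences reported for the three test alignments.

```python
def distance_matrix_bytes(n_sequences, bytes_per_entry=4):
    """Bytes needed to store the N(N - 1)/2 distinct pairwise distances.

    bytes_per_entry=4 is an assumption (single-precision floats).
    """
    entries = n_sequences * (n_sequences - 1) // 2
    return entries * bytes_per_entry


# Numbers of distinct sequences in the three test alignments.
DISTINCT = {"COG2814": 8362, "PF00005": 39092, "16S rRNA": 158022}

for name, n in DISTINCT.items():
    print(f"{name}: {distance_matrix_bytes(n) / 1e9:.2f} GB")
```

Under this assumption the 158,022 distinct 16S rRNA sequences need roughly 50 GB, in line with the figure quoted in the abstract. The quadratic growth is what makes the matrix impractical at this scale.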
Topological Accuracy of Tree-Building Methods on Simulated Protein Alignments with Gaps
| Method | Distances | Topological Accuracy | | | | |
| PhyML | JTT | 0.744 | 0.771 | 0.817 | 0.801 | — |
| FastTree | Log-corrected | 0.724 | 0.763 | 0.797 | 0.778 | 0.763 |
| FastME | Log-corrected | 0.716 | 0.754 | 0.796 | 0.777 | 0.753 |
| BIONJ | Log-corrected | 0.725 | 0.754 | 0.766 | 0.730 | 0.723 |
| BIONJ | JTT | 0.701 | 0.758 | 0.777 | 0.737 | 0.731 |
| BIONJ | JTT + Γ | 0.567 | 0.625 | 0.737 | 0.697 | — |
| QuickTree | Log-corrected | 0.716 | 0.746 | 0.760 | 0.726 | 0.716 |
| QuickTree | %Different | 0.673 | 0.678 | 0.699 | 0.672 | 0.655 |
| Clearcut | Log-corrected | 0.682 | 0.733 | 0.755 | 0.723 | 0.715 |
Significantly more accurate than FastTree (P < 0.01, paired t-test)
Not significantly different from FastTree (P > 0.01, paired t-test)
Significantly less accurate than FastTree (P < 0.01, paired t-test)
The Topological Accuracy of Variants of FastTree on Simulated Protein Alignments with Gaps
| Method | Topological Accuracy | | |
| FastTree, default settings | 0.797 | 0.778 | 0.763 |
| FastTree + extra NNI (20 rounds) | 0.797 | 0.778 | 0.763 |
| FastTree's Neighbor-Joining (no NNI) | 0.734 | 0.702 | 0.698 |
| FastTree, exhaustive search, no NNI | 0.733 | 0.701 | — |
| BIONJ, uncorrected distances | 0.731 | 0.699 | 0.694 |
| BIONJ, log-corrected distances | 0.766 | 0.730 | 0.723 |
The Relative Log Likelihoods of Topologies Inferred for 310 Genuine Protein Alignments of 500 Sequences Each
| Method | Distances/model | Average log likelihood | Lower likelihood than FastTree (%) |
| PhyML/FastTree | JTT + Γ4 | 440.7 | 0 |
| FastTree | Log-corrected | 0.0 | — |
| FastME | Log-corrected | -165.2 | 86 |
| BIONJ | JTT | -404.3 | 95 |
| BIONJ | Log-corrected | -426.1 | > 99 |
| QuickTree | Log-corrected | -495.3 | > 99 |
| Clearcut | Log-corrected | -532.2 | 99 |
| QuickTree | %Different | -667.0 | 100 |
| BIONJ | JTT + Γ | -1,576.1 | 99 |
PhyML 3 with FastTree as the starting tree.
Γ4 means four categories of sites with gamma-distributed rates.
The Relative Log Likelihoods of Topologies Inferred for 100 Genuine 16S rRNA Alignments of 500 Sequences Each
| Method | Distances/model | Average log likelihood | Lower likelihood than FastTree (%) |
| PhyML | HKY85 + Γ4 | 510.4 | 0 |
| PhyML | HKY85 | 358.4 | 5 |
| FastME | F84 + Γ | 59.9 | 34 |
| FastTree | Jukes–Cantor | 0.0 | — |
| FastME | Kimura | -7.4 | 53 |
| FastME | Jukes–Cantor | -71.7 | 70 |
| BIONJ | F84 + Γ | -749.1 | 100 |
| BIONJ | Kimura | -781.0 | 100 |
| BIONJ | Jukes–Cantor | -843.9 | 100 |
| QuickTree | F84 + Γ | -878.8 | 100 |
| Clearcut | F84 + Γ | -905.1 | 100 |
| QuickTree | Jukes–Cantor | -941.1 | 100 |
NOTE.—HKY, Hasegawa–Kishino–Yano.
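The Jukes–Cantor distance listed in the table above is the simplest of these nucleotide models: it turns the observed fraction p of differing sites into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The following sketch is a generic implementation of that standard formula, not code from any of the programs compared here. Skipping columns where either sequence has a gap is an assumption made for the example.

```python
import math


def fraction_different(s, t):
    """Observed fraction of differing sites between two aligned sequences.

    Columns with a gap ('-') in either sequence are skipped (an assumption).
    """
    pairs = [(x, y) for x, y in zip(s, t) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable columns")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor_distance(p):
    """Jukes-Cantor correction: d = -(3/4) * ln(1 - (4/3) * p)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = fraction_different("ACGT-A", "ACCTAA")
d = jukes_cantor_distance(p)
```

The corrected distance is always at least as large as p, because it accounts for multiple substitutions at the same site, and it grows without bound as p approaches 0.75.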
Genuine Alignments for Performance Testing
| Alignment | COG2814 | PF00005 | 16S rRNA |
| Type | Protein | Protein | Nucleotide |
| #Sequences | 10,610 | 52,927 | 167,547 |
| #Distinct | 8,362 | 39,092 | 158,022 |
| #Columns | 394 | 214 | 1,287 |
| %Gaps | 10.8 | 15.2 | 4.3 |
Figure. Distribution of support values for simulated alignments of 250 protein sequences with gaps. We compare the distribution of FastTree's local bootstrap and the traditional (global) bootstrap for correctly and incorrectly inferred splits. The right-most bin contains the strongly supported splits (0.95–1.0).