Jakub Truszkowski, Yanqi Hao, Daniel G Brown.
Abstract
Recently, we have identified a randomized quartet phylogeny algorithm that has O(n log n) runtime with high probability, which is asymptotically optimal. Our algorithm has high probability of returning the correct phylogeny when quartet errors are independent and occur with known probability, and when the algorithm uses a guide tree on O(log log n) taxa that is correct with high probability. In practice, none of these assumptions is correct: quartet errors are positively correlated and occur with unknown probability, and the guide tree is often error prone. Here, we bring our work out of the purely theoretical setting. We present a variety of extensions which, while only slowing the algorithm down by a constant factor, make its performance nearly comparable to that of Neighbour Joining, which requires Θ(n^3) runtime in existing implementations. Our results suggest a new direction for quartet-based phylogenetic reconstruction that may yield striking speed improvements at minimal accuracy cost. An early prototype implementation of our software is available at http://www.cs.uwaterloo.ca/jmtruszk/qtree.tar.gz.
Year: 2012 PMID: 23181935 PMCID: PMC3561654 DOI: 10.1186/1748-7188-7-32
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Figure 1. A search tree. A phylogeny and a corresponding search tree. Internal nodes of the search tree correspond to subphylogenies: two, for r(B) and r(C), are indicated. Leaves of the search tree correspond to edges of the phylogeny.
The effect of heuristics on the performance of the algorithm
| Algorithm variant | Taxa inserted (%) | RF accuracy (%) |
|---|---|---|
| basic RW + NJ guide tree | 56.3 ± 2.4 | 46.3 ± 4.6 |
| basic RW + true guide tree (not feasible in practice) | 57.3 ± 2.1 | 49.4 ± 5.0 |
| 5 quartets per node query, UM | 76.4 ± 2.0 | 41.0 ± 3.8 |
| 5 quartets, WM | 85.6 ± 1.6 | 48.6 ± 3.7 |
| 5 quartets, WM, 2E | 95.4 ± 1.0 | 45.5 ± 3.4 |
| 5 quartets, WM, CT, 2E | 84.1 ± 1.8 | 57.4 ± 3.5 |
| 5 quartets, WM, re-running the RW, 2E | 78.8 ± 2.4 | 59.5 ± 3.7 |
| 5 quartets, WTA, CT, 2E | 80.2 ± 2.1 | 62.3 ± 2.9 |
| 20 quartets, WTA, CT, 2E | 92.1 ± 1.4 | 60.8 ± 2.9 |
| NJ | n/a | 62.6 |
Shown are results for the COG840 data set with 1250 taxa. We show the performance of our random walk (RW) algorithm in various settings, and compare it to Neighbour Joining (NJ). We report accuracies using the Robinson-Foulds (RF) measure. Our algorithm places approximately 80% to 90% of taxa with accuracy around 60%. We ran each version of the algorithm 100 times. In all cases, the guide tree is on 200 taxa; except in the second line of the table, it was generated with Neighbour Joining and had RF accuracy of 50% ± 3%. Three voting schemes were used in the experiments: unweighted majority (UM), weighted majority (WM), and winner-takes-all (WTA). In some experiments, we also added 2 additional rounds of insertions (2E), and a confidence threshold for insertion (CT).
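To make the accuracy measure concrete, the following is a minimal sketch of one common way to compute a Robinson-Foulds-style accuracy. It rests on assumptions that are not stated in the source: trees are written as nested Python tuples of leaf names, both trees share the same leaf set, and accuracy is taken as the fraction of non-trivial splits of the reference tree that also appear in the estimated tree. The function names `splits` and `rf_accuracy` are hypothetical.

```python
def splits(tree):
    """Non-trivial bipartitions of a nested-tuple tree.

    Each split is stored as the side that does NOT contain a fixed
    reference leaf, so that rooting does not affect the result.
    """
    def leaf_set(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaf_set(child) for child in node))

    all_leaves = leaf_set(tree)
    ref = min(all_leaves)  # arbitrary but deterministic reference leaf
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if ref in below else below
        # Keep only splits with at least two leaves on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(side)
        return below

    visit(tree)
    return found


def rf_accuracy(reference, estimate):
    """Assumed definition: share of reference splits found in the estimate."""
    ref_splits = splits(reference)
    if not ref_splits:
        return 1.0
    return len(ref_splits & splits(estimate)) / len(ref_splits)


reference = ((("a", "b"), "c"), ("d", ("e", "f")))
estimate = ((("a", "c"), "b"), ("d", ("e", "f")))
print(rf_accuracy(reference, estimate))  # 2 of 3 reference splits are shared
```

When the estimated tree omits some taxa, as the random walk trees in the table do, the reference tree would first have to be restricted to the placed taxa; that step is left out of this sketch.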
Comparison of different algorithms
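| Algorithm | 250 taxa: taxa inserted (%) | 250 taxa: RF accuracy (%) | 250 taxa: quartet accuracy (%) | 1250 taxa: taxa inserted (%) | 1250 taxa: RF accuracy (%) | 1250 taxa: quartet accuracy (%) | 5000 taxa: taxa inserted (%) | 5000 taxa: RF accuracy (%) | 5000 taxa: quartet accuracy (%) |
|---|---|---|---|---|---|---|---|---|---|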
| weighted majority, 5 quartets | 88.5 | 66.2 | 72.8 | 84.1 | 57.4 | 85.8 | 74.4 | 51.4 | 59.5 |
| WTA-vote, 5 quartets | 86.0 | 69.4 | 70.8 | 80.2 | 62.3 | 85.4 | 70.1 | 57.0 | 59.6 |
| weighted majority, 20 quartets | 95.6 | 60.4 | 69.9 | 96.4 | 50.6 | 83.9 | 94.3 | 41.1 | 56.5 |
| WTA-vote, 20 quartets | 94.0 | 69.4 | 73.4 | 92.1 | 60.8 | 83.1 | 89.7 | 57.6 | 55.3 |
| NJ | 100 | 73.6 | 70.0 | 100 | 62.6 | 88.0 | 100 | 73.0 | 66.3 |
| FastTree (NJ phase only) | 100 | 69.7 | 85.9 | 100 | 61.0 | 86.6 | 100 | 73.6 | 66.4 |
| weighted majority, 5 quartets, force all taxa | 100 | 59.0 | 69.7 | 100 | 48.7 | 80.8 | 100 | 37.3 | 52.4 |
Performance of the random walk algorithm on synthetic alignments. The size of the guide tree was 200 except for the 250-taxon data set, where it was set to 100. The average RF accuracy of the guide trees was 65%, 50%, and 46% for the 250-, 1250-, and 5000-taxon data sets, respectively. The average quartet accuracies for the guide trees were 73%, 83%, and 55%. When all taxa are forced into the random walk tree (see text), the RF accuracy decreases by 7% to 14%, depending on the data set. All random walk runs use the confidence threshold heuristic and two additional rounds of insertions.
Running times on large data sets
| Algorithm | 20,000 taxa | 40,000 taxa | 78,132 taxa |
|---|---|---|---|
| weighted majority, 5 quartets | 6m 41s | 15m 52s | 34m |
| FastTree (NJ phase only) | 13m 52s | 41m 15s | 116m |
Running times of the random walk algorithm compared to FastTree. We used the huge.1 alignment from the original FastTree paper [9]. Smaller data sets were created by choosing a random subset of sequences from the large alignment. Our algorithm runs 2.1 to 3.4 times faster than FastTree on these very large data sets.
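The quoted speedup range can be checked directly from the running times in the table. The short sketch below does the arithmetic; the helper `to_seconds` is a hypothetical name introduced only for this illustration.

```python
def to_seconds(text):
    """Parse a duration such as '6m 41s' or '34m' into seconds."""
    total = 0
    for part in text.split():
        if part.endswith("m"):
            total += 60 * int(part[:-1])
        elif part.endswith("s"):
            total += int(part[:-1])
    return total


# Running times copied from the table, smallest data set first.
random_walk = ["6m 41s", "15m 52s", "34m"]
fasttree = ["13m 52s", "41m 15s", "116m"]

speedups = [to_seconds(f) / to_seconds(r) for f, r in zip(fasttree, random_walk)]
print([round(s, 1) for s in speedups])  # [2.1, 2.6, 3.4]
```

The ratio grows with the size of the data set, which is what one would expect when comparing an O(n log n) method against a slower-growing baseline.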
Accuracies on large data sets
| Algorithm | 20,000 taxa | 40,000 taxa | 78,132 taxa |
|---|---|---|---|
| weighted majority, 5 quartets | 80.8 ± 1.1 | 78.9 ± 1.4 | 80.8 ± 1.1 |
| weighted majority, 5 quartets, all taxa forced | 79.9 ± 1.1 | 77.8 ± 1.4 | 75.3 ± 1.6 |
| weighted majority, 5 quartets+local search | 96.1 | 94.4 | 92.8 |
| weighted majority, 5 quartets, all taxa forced+local search | 95.8 | 93.8 | 92.0 |
| FastTree (NJ phase only) | 62.9 | 58.1 | 52.2 |
| FastTree + local search | 95.8 | 93.8 | 92.0 |
Robinson-Foulds accuracies of FastTree and the random walk algorithm for the huge.1 data set [9]. The figures for the random walk algorithm represent the average accuracy over 10 runs of the algorithm, together with empirical standard deviations. We used a confidence threshold, with two additional rounds of insertions. The average taxon coverage for weighted majority was 98.6, 98.6, and 98.0 per cent for the 20,000, 40,000, and 78,132 taxa alignments, respectively. After applying local search, the variance between the runs of the random walk algorithm is negligible.
Figure 2. Comparison of the random walk algorithm and FastTree. The performance of the random walk algorithm and FastTree as a function of the length of the sequences. The four graphs represent the performance on 10 tree topologies with branch lengths scaled by constant factors 25, 50, 100, and 200. In all cases, the random walk algorithm compares increasingly favourably with FastTree as the sequence length increases. After applying local search, the differences between the average accuracies of the two methods are less than 1% for all the settings except the shortest sequences in the data set scaled by 200, where trees obtained from FastTree are 3.8% more accurate. The average taxon coverage for random walk trees was 97.3%, with only three experimental settings yielding coverage below 95%. Missing taxa were inserted into random walk trees before applying local search.
Running times on large data sets
| Algorithm | 20,000 taxa | 40,000 taxa | 78,132 taxa |
|---|---|---|---|
| weighted majority, 5 quartets+local search | 20m 40s (27m 21s) | 41m 33s (57m 25s) | 79m (113m) |
| FastTree + local search | 21m 27s (35m 19s) | 42m 29s (83m 44s) | 80m (196m) |
Running times of the local search procedure of FastTree applied to trees produced by our algorithm and the Neighbour Joining phase of FastTree on the huge.1 data set. Total runtimes, including the time required to produce the initial tree, are shown in brackets.
Aggregating multiple trees
| Tree | Taxa inserted (%) | RF accuracy (%) | Quartet accuracy (%) |
|---|---|---|---|
| 5 input trees (average) | 83 | 62 | 85 |
| output tree | 94 | 48 | 85 |
| output tree with forcing | 100 | 39 | 83 |
The performance of the random walk algorithm as a supertree method. We generated 5 input trees by running the random walk five times independently on the COG840 alignment. We then ran the random walk algorithm with quartet queries evaluated by taking the induced quartet in each tree, and choosing the most common one. The guide tree was chosen as the subtree induced by 200 randomly chosen taxa on one of the 5 input trees.
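The quartet query described above (take the induced quartet in each input tree and keep the most common one) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: trees are nested Python tuples of leaf names, a tree that lacks one of the four taxa or leaves them unresolved casts no vote, and ties are broken by whichever topology was counted first. The names `induced_quartet` and `majority_quartet` are hypothetical.

```python
from collections import Counter


def induced_quartet(tree, taxa):
    """Topology that `tree` induces on four taxa, or None.

    The result is a frozenset of two frozenset pairs, so {a,b}|{c,d}
    compares equal regardless of the order of sides or leaves.
    """
    taxa = frozenset(taxa)
    result = None

    def visit(node):
        nonlocal result
        if isinstance(node, str):
            return frozenset([node]) & taxa
        seen = frozenset().union(*(visit(child) for child in node))
        # A subtree holding exactly two of the four taxa separates
        # them from the other two.
        if len(seen) == 2:
            result = frozenset([seen, taxa - seen])
        return seen

    found = visit(tree)
    # No vote if some taxon is absent from this tree.
    return result if found == taxa else None


def majority_quartet(trees, taxa):
    """Most common induced quartet topology across the input trees."""
    votes = Counter(
        q for q in (induced_quartet(t, taxa) for t in trees) if q is not None
    )
    return votes.most_common(1)[0][0] if votes else None


trees = [
    ((("a", "b"), "c"), ("d", "e")),
    ((("a", "b"), "d"), ("c", "e")),
    ((("a", "c"), "b"), ("d", "e")),
]
winner = majority_quartet(trees, ["a", "b", "c", "d"])
print(sorted(sorted(side) for side in winner))  # [['a', 'b'], ['c', 'd']]
```

Here two of the three trees group a with b, so the query returns ab|cd. Each call traverses every input tree in full, which is simple but not efficient; a practical version would need a faster way to read off induced quartets.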