Yi-Chieh Wu, Matthew D Rasmussen, Mukul S Bansal, Manolis Kellis.
Abstract
Accurate gene tree reconstruction is a fundamental problem in phylogenetics, with many important applications. However, sequence data alone often lack enough information to confidently support one gene tree topology over many competing alternatives. Here, we present a novel framework for combining sequence data and species tree information, and we describe an implementation of this framework in TreeFix, a new phylogenetic program for improving gene tree reconstructions. Given a gene tree (preferably computed using a maximum-likelihood phylogenetic program), TreeFix finds a "statistically equivalent" gene tree that minimizes a species tree-based cost function. We have applied TreeFix to 2 clades of 12 Drosophila and 16 fungal genomes, as well as to simulated phylogenies and show that it dramatically improves reconstructions compared with current state-of-the-art programs. Given its accuracy, speed, and simplicity, TreeFix should be applicable to a wide range of analyses and have many important implications for future investigations of gene evolution. The source code and a sample data set are available at http://compbio.mit.edu/treefix.
Year: 2012 PMID: 22949484 PMCID: PMC3526801 DOI: 10.1093/sysbio/sys076
Source DB: PubMed Journal: Syst Biol ISSN: 1063-5157 Impact factor: 15.683
Figure 1. Gene tree landscape. a) Each point within the landscape corresponds to a gene tree T, whose optimality can be measured through its likelihood L (height) and its reconciliation cost c (color). The ML tree T_ML is located at the peak of this landscape but may have a high cost. Rearranging T_ML to a nearby tree T can result in a negligible decrease in likelihood (δ = L_ML − L < δ_thr) while simultaneously reducing the tree cost (c < c_ML), thus producing a more congruent gene tree that is statistically equivalent to the ML tree. TreeFix utilizes this basic idea by balancing the 2 optimality criteria to return an optimal tree T* for which δ* is negligible and c* is minimal. b) The landscape for a simulated gene family shows a wide range of likelihood and cost values. In this instance, TreeFix searched over 3560 gene trees of the 8.2 × 10^21 possible unrooted topologies (number of genes = 21). Although most trees have statistically worse likelihoods compared with the ML tree (×), a subset of high likelihood trees are statistically equivalent (circles). As the search progresses, the search space generally moves toward the top-left, corresponding to topologies with high likelihood and low duplication–loss cost (enlarged at right, accepted trees per iteration shown as squares). In this case, TreeFix has rearranged T_ML (beige triangle) to produce a new optimal tree T* (purple triangle) with equivalent likelihood and lower cost. Note that T* is incorrect because the true tree T_true (black triangle) has a slightly higher duplication–loss cost. (Likelihoods were computed with ϵ = 2.)
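The two quantitative ideas in this caption can be illustrated with a small Python sketch. The first function counts unrooted binary topologies on n labelled leaves with the standard formula (2n − 5)!!, which reproduces the 8.2 × 10^21 figure for 21 genes. The second applies the selection rule from panel a) to a pool of already-scored trees. The names `Candidate` and `pick_tree` are hypothetical, the scores are made-up toy values, and the plain threshold on the log-likelihood difference is only a stand-in for whatever statistical test decides equivalence; this is not TreeFix's actual code.

```python
from math import prod
from typing import NamedTuple


def num_unrooted_topologies(n_leaves: int) -> int:
    """Count unrooted binary topologies on n labelled leaves: (2n - 5)!!."""
    if n_leaves < 3:
        return 1
    # Product of the odd numbers 1, 3, ..., 2n - 5.
    return prod(range(1, 2 * n_leaves - 4, 2))


class Candidate(NamedTuple):
    """A scored gene tree (hypothetical container; scores are toy values)."""
    name: str
    log_likelihood: float
    cost: float


def pick_tree(ml_tree: Candidate, candidates: list, delta_thr: float) -> Candidate:
    """Return the lowest-cost tree whose likelihood drop is below delta_thr.

    Assumption: equivalence is approximated by delta = L_ML - L < delta_thr.
    The ML tree itself is always eligible, so the result never costs more.
    """
    equivalent = [
        t for t in candidates
        if ml_tree.log_likelihood - t.log_likelihood < delta_thr
    ]
    # Ties in cost are broken in favour of the higher likelihood.
    return min(equivalent + [ml_tree], key=lambda t: (t.cost, -t.log_likelihood))


if __name__ == "__main__":
    print(f"{num_unrooted_topologies(21):.1e}")  # 8.2e+21

    ml = Candidate("T_ML", -1000.0, 12.0)
    pool = [
        Candidate("A", -1000.8, 7.0),   # equivalent, cheaper
        Candidate("B", -1001.5, 5.0),   # equivalent, cheapest
        Candidate("C", -1010.0, 2.0),   # cheapest overall but likelihood too low
    ]
    print(pick_tree(ml, pool, delta_thr=2.0).name)  # B
```

Tree C shows why both criteria are needed: it has the lowest cost, but its likelihood falls too far below the ML tree for it to count as equivalent.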
Figure 2. Phylogenetic accuracy and runtime using several phylogenetic methods on simulated fly and fungal data sets. a) Both hybrid and species tree aware methods have high reconstruction accuracy for correctly inferring the full gene tree topology for the fly data set. TreeFix has the highest reconstruction accuracy for the larger fungal data set. b) The percent of accurately reconstructed branches is similar across all methods for the fly data set, but the hybrid methods and SPIMAP show significant improvement over TreeBest and RAxML for the fungal data set. c) Despite topological and branch inaccuracies, pairwise ortholog detection is robust across all methods in both precision and sensitivity. d,e) TreeFix (long) and SPIMAP infer duplications and losses with a high degree of sensitivity and precision, with both these methods offering a slight improvement over TreeFix and NOTUNG (100). Again, the hybrid methods and SPIMAP greatly outperform TreeBest and RAxML, particularly in terms of precision for the fungal data set. f) TreeFix achieves performance comparable to SPIMAP at a fraction of the runtime (average runtimes provided for the fungal data set). Note that TreeBest and RAxML were run with 100 bootstraps, whereas all other methods were run without bootstrapping. g) TreeFix runtime can likely be improved if the program were ported to a more efficient programming language. For more metrics, see Supplementary Table S1.
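Panel c) reports precision and sensitivity for pairwise ortholog detection. As a minimal sketch of how such scores are usually computed, the Python below treats each ortholog call as an unordered pair of gene names and compares a predicted set against a reference set. The function name and the gene names are hypothetical, and the definitions (precision = TP / predicted, sensitivity = TP / true) are the standard ones rather than anything taken from the paper's evaluation scripts.

```python
def precision_sensitivity(predicted, truth):
    """Score predicted ortholog pairs against a reference set.

    Each pair is unordered, so ("a1", "b1") and ("b1", "a1") are the same call.
    Returns (precision, sensitivity) as floats in [0, 1].
    """
    pred = {frozenset(p) for p in predicted}
    true = {frozenset(p) for p in truth}
    true_positives = len(pred & true)
    precision = true_positives / len(pred) if pred else 0.0
    sensitivity = true_positives / len(true) if true else 0.0
    return precision, sensitivity


if __name__ == "__main__":
    # Toy data: four true ortholog pairs, three predictions, two of them correct.
    truth = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
    predicted = [("b1", "a1"), ("a2", "b2"), ("a3", "b4")]
    precision, sensitivity = precision_sensitivity(predicted, truth)
    print(f"precision={precision:.3f} sensitivity={sensitivity:.3f}")
```

The same function applies to duplication or loss events in panels d,e) if each event is encoded as a hashable label instead of a gene pair.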
Evaluation of several phylogenetic programs on the real fungal data set
| Program [a] | % Orths [b] | # Orths [c] | # Dups [c] | # Losses [c] | DCS [d] | % GC [e] | % Fail SH [f] | RF [g] | Runtime [h] |
|---|---|---|---|---|---|---|---|---|---|
| TreeFix (long) | 96.4 | 574,946 | 6,062 | 10,981 | 0.649 | 97.3 | * | 0.306 | 21.35 min |
| TreeFix | 95.2 | 569,664 | 6,505 | 12,532 | 0.609 | 94.6 | * | 0.302 | 45.68 s |
| NOTUNG (100) | 96.1 | 582,581 | 6,161 | 10,835 | 0.659 | 89.2 | 19.2 | 0.285 | 0.41 s |
| NOTUNG (90) | 89.1 | 556,685 | 9,906 | 23,917 | 0.395 | 94.6 | 7.5 | 0.211 | 0.38 s |
| NOTUNG (50) | 70.1 | 487,875 | 18,322 | 54,101 | 0.191 | 89.2 | 0.8 | 0.051 | 0.40 s |
| tt (3) | 82.8 | 522,834 | 9,780 | 26,621 | 0.354 | 89.2 | 16.3 | 0.224 | 3.29 s |
| tt (2) | 76.5 | 503,323 | 12,552 | 35,898 | 0.272 | 89.2 | 10.7 | 0.171 | 0.18 s |
| tt (1) | 70.0 | 482,439 | 16,310 | 48,036 | 0.206 | 89.2 | 5.1 | 0.100 | 0.05 s |
| SPIMAP | 96.5 | 557,981 | 5,407 | 10,384 | 0.650 | 83.8 | – | – | 21.89 min |
| TreeBest | 79.5 | 480,680 | 11,240 | 34,287 | 0.266 | 81.1 | – | – | 25.72 s |
| RAxML | 63.8 | 462,039 | 21,083 | 64,037 | 0.159 | 89.2 | – | – | 4.32 min |
[a] See Supplementary Section S4 for program details.
[b] Percentage of 183,374 syntenic orthologs recovered.
[c] Number of pairwise orthologs, duplications, and losses inferred across all gene trees.
[d] Average duplication consistency score.
[e] Percentage of 37 recent gene converted paralogs recovered.
[f] For the hybrid methods, percentage of trees that fail the SH test compared with the input RAxML trees at α = 0.05. By design, TreeFix always returns a statistically equivalent tree (* = 0).
[g] For the hybrid methods, average RF distance compared with the input RAxML trees.
[h] Average runtime for reconstructing each gene tree. Note that TreeBest and RAxML were run with 100 bootstraps.
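The RF column measures how far each output tree moves from its input RAxML tree. The Python sketch below shows one way to compute a Robinson-Foulds distance between two trees written as nested tuples of leaf names. The RF values in the table lie between 0 and 1, which suggests a normalized distance; the sketch assumes normalization by the total number of nontrivial splits in the two trees. That is a common convention, but it is an assumption here, and the function names are hypothetical.

```python
def leaf_set(tree):
    """Collect leaf names from a nested-tuple tree; leaves are strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def nontrivial_splits(tree):
    """Return the nontrivial bipartitions of the tree, ignoring the root position.

    Each split is stored as the side that does NOT contain a fixed anchor leaf,
    so the same split found from either side of an edge has one canonical form.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if anchor in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def normalized_rf(tree_a, tree_b):
    """Symmetric-difference (RF) distance divided by the total split count."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must have identical leaf sets")
    splits_a, splits_b = nontrivial_splits(tree_a), nontrivial_splits(tree_b)
    total = len(splits_a) + len(splits_b)
    return len(splits_a ^ splits_b) / total if total else 0.0


if __name__ == "__main__":
    t1 = ((("A", "B"), "C"), ("D", "E"))
    t2 = ((("A", "C"), "B"), ("D", "E"))   # A-B grouping replaced by A-C
    t3 = (("D", "E"), ("C", ("A", "B")))   # same unrooted topology as t1
    print(normalized_rf(t1, t2))  # 0.5
    print(normalized_rf(t1, t3))  # 0.0
```

Because splits are canonicalized against an anchor leaf, the root placement in the nested tuples does not affect the result, which matches the usual treatment of RF as a distance between unrooted topologies.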