Manuel Binet, Olivier Gascuel, Celine Scornavacca, Emmanuel J P Douzery, Fabio Pardi.
Abstract
BACKGROUND: Branch lengths are an important attribute of phylogenetic trees, providing essential information for many studies in evolutionary biology. Yet, part of the current methodology to reconstruct a phylogeny from genomic information - namely supertree methods - focuses on the topology or structure of the phylogenetic tree, rather than the evolutionary divergences associated with it. Moreover, accurate methods to estimate branch lengths - typically based on probabilistic analysis of a concatenated alignment - are limited by large demands in memory and computing time, and may become impractical when the data sets are too large.
Year: 2016 PMID: 26744021 PMCID: PMC4705742 DOI: 10.1186/s12859-015-0821-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1. Pipelines of the analyses applied to both data sets, represented as flowcharts. We refer to the “Analysis protocol” subsection for a detailed description of each analysis method
Names and short descriptions of the methods tested
| Name | Brief description |
|---|---|
| Concat+Dist | Distance-based analysis of the concatenated alignment |
| Concat+ML | ML analysis of the concatenated alignment |
| SDM*add | SDM* run on the gene tree distance matrices (+ post-processing) |
| DistRadd | DistR run on the gene tree distance matrices (+ post-processing) |
| ERaBLEadd | ERaBLE run on the gene tree distance matrices |
| SDM* | SDM* run on the estimated distance matrices (+ post-processing) |
| DistR | DistR run on the estimated distance matrices (+ post-processing) |
| ERaBLE | ERaBLE run on the estimated distance matrices |
Fig. 2. Accuracy of branch length estimates in the simulated data set. For each method, model branch lengths b (x-axis) are plotted against estimation errors (y-axis) for all branches in all 500 model trees (500 × 77 = 38,500 points per plot). Colors (from blue to red) indicate increasing density of points. The horizontal red line corresponds to no estimation error. Method names are shown at the top of each plot, followed by the mean (over 500 values) of the fraction of variance unexplained of the estimated branch lengths relative to the model branch lengths (see Additional file 3)
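The fraction of variance unexplained reported with each plot can be illustrated with a short sketch. It assumes the conventional definition (residual sum of squares divided by the total sum of squares of the reference values); the exact definition used by the authors is given in Additional file 3 and may differ. The function name and the example values are hypothetical.

```python
def fraction_of_variance_unexplained(reference, estimated):
    """Residual sum of squares over total sum of squares of the reference.

    Assumption: this is the textbook definition (1 - R^2 against the
    identity line), not necessarily the one in Additional file 3.
    """
    if len(reference) != len(estimated) or len(reference) < 2:
        raise ValueError("need two equally long sequences with >= 2 values")
    mean_ref = sum(reference) / len(reference)
    # Squared deviations of the estimates from the reference values.
    residual_ss = sum((e - r) ** 2 for r, e in zip(reference, estimated))
    # Squared deviations of the reference values from their own mean.
    total_ss = sum((r - mean_ref) ** 2 for r in reference)
    if total_ss == 0:
        raise ValueError("reference values have zero variance")
    return residual_ss / total_ss


# Hypothetical branch lengths: model values and one method's estimates.
model_lengths = [1.0, 2.0, 3.0]
estimated_lengths = [1.0, 2.0, 4.0]
fvu = fraction_of_variance_unexplained(model_lengths, estimated_lengths)
```

A value of 0 means the estimates reproduce the reference lengths exactly; larger values indicate a poorer fit.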
Fig. 3. Estimation accuracy for gene rates in the simulated data set. Log-log scatterplots showing model gene rates r (x-axis) against error ratios (y-axis) for all genes in all 500 replicates (500 × 500 = 250,000 points per plot). Note that errors are measured with ratios, instead of differences. Colors (from blue to red) indicate increasing density of points. The horizontal red line corresponds to no estimation error. Method names are shown at the top of each plot, followed by the mean absolute log-ratio between estimated and model gene rates (see Additional file 3)
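The mean absolute log-ratio used to summarise gene rate errors can be sketched as below. The caption does not state the base of the logarithm, so the natural logarithm is an assumption here, as are the function name and the example rates; Additional file 3 gives the authors' exact definition.

```python
import math


def mean_absolute_log_ratio(reference, estimated):
    """Mean of |log(estimated / reference)| over all genes.

    Assumption: natural logarithm; the source does not state the base.
    """
    if len(reference) != len(estimated) or not reference:
        raise ValueError("need two non-empty sequences of equal length")
    if any(r <= 0 for r in reference) or any(e <= 0 for e in estimated):
        raise ValueError("rates must be strictly positive")
    total = sum(abs(math.log(e / r)) for r, e in zip(reference, estimated))
    return total / len(reference)


# Hypothetical gene rates: one overestimated and one underestimated
# by the same factor, so both contribute the same absolute log-ratio.
model_rates = [1.0, 1.0]
estimated_rates = [2.0, 0.5]
malr = mean_absolute_log_ratio(model_rates, estimated_rates)
```

Working with ratios on a log scale treats a two-fold overestimate and a two-fold underestimate as errors of equal size, which a plain difference would not.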
Computational efficiencies on the OrthoMaM data set for the tested methods
| | Concat+Dist | Concat+ML | SDM*add | DistRadd | ERaBLEadd | SDM* | DistR | ERaBLE |
|---|---|---|---|---|---|---|---|---|
| T1 | ≈0 | 3 h 20 m/39 h 28 m | 3 h 20 m/39 h 28 m | 3 h 20 m/39 h 28 m | 3 h 20 m/39 h 28 m | 2 m 46 s | 2 m 46 s | 2 m 46 s |
| T2 | 5 m 41 s | 41 h 16 m | 8 h 2 m | 2 h 9 m | 7 s | 8 h 33 m | 2 h 6 m | 7 s |
| M | 889 MB | 117 GB | 1.2 GB | 2.8 GB | 222 MB | 1.2 GB | 3.0 GB | 221 MB |
Note: The first row (T1) gives the running time to obtain the data on which subsequent computations are based: the superalignment and the distance-based gene trees for Concat+Dist, the superalignment and ML gene trees for Concat+ML, the ML gene trees and resulting additive distances for the three supertree methods, and the estimated distances for the three medium-level methods. When ML gene trees are used (Concat+ML, SDM*add, DistRadd and ERaBLEadd), two alternative approaches are possible and therefore two running times are provided: first the time to infer trees with a fixed topology (3 h 20 m), and then the time to infer trees where the topology is also estimated (39 h 28 m). The second row (T2) gives the remaining running time to obtain estimates for branch lengths and gene rates. The third row (M) gives the maximum amount of memory allocated. All experiments were conducted on a PC with 4 GB RAM and a 2.7 GHz CPU, except branch length estimation (T2 and M) for Concat+ML, which, because of its large memory requirements, was run on a cluster machine with 200 GB RAM and a 2.66 GHz CPU
Fig. 4. Accuracy of branch length estimates in the OrthoMaM data set. For each method, the 77 branch lengths estimated by Concat+ML (x-axis) are plotted against the differences (y-axis) between the Concat+ML estimate for the length of each branch e and the estimate obtained by the method at the top of the plot. The horizontal red line corresponds to no difference between the two estimates. Method names are shown at the top of each plot, followed by the fraction of variance unexplained of the method's estimates relative to the Concat+ML estimates (see Additional file 3)
Fig. 5. Estimation accuracy for gene rates in the OrthoMaM data set. Logarithmic scatterplots showing the 6,953 “reference” gene rates estimated by Concat+ML (x-axis) against the ratios between estimated and reference gene rates (y-axis). Note that errors relative to the reference gene rates are measured with ratios, instead of differences. Colors (from blue to red) indicate increasing density of points. The horizontal red line corresponds to no difference between the two estimates. Method names are shown at the top of each plot, followed by the mean absolute log-ratio between estimated and reference gene rates (see Additional file 3)