Celine Scornavacca, Edwin Jacox, Gergely J Szöllősi.
Abstract
MOTIVATION: Traditionally, gene phylogenies have been reconstructed solely on the basis of molecular sequences; this, however, often does not provide enough information to distinguish between statistically equivalent relationships. To address this problem, several recent methods have incorporated information on the species phylogeny in gene tree reconstruction, leading to dramatic improvements in accuracy. Probabilistic methods are able to estimate all model parameters but are computationally expensive, whereas parsimony methods, which are generally computationally more efficient, require a prior estimate of parameters and of the statistical support.
Year: 2014 PMID: 25380957 PMCID: PMC4380024 DOI: 10.1093/bioinformatics/btu728
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Conditional clade probabilities (CCPs) can be used to estimate the posterior probability of any tree that can be amalgamated from clades present in a sample of gene trees (David and Alm, 2010; Höhna and Drummond, 2012). Conditional clade frequencies can be used to approximate CCPs and are computed as the proportion of occurrences of a particular split of a clade according to a tripartition π, e.g. (abc de), among all trees in which the clade, e.g. (abcde), is found. Estimates based on the sample of trees on the left are shown as fractions for two different gene trees that can be amalgamated. The estimate for a gene tree is given by the sum of the reconciliation score and the logarithm of the tree CCPs. Based on the sample on the left, the tree with the highest posterior probability is the third tree (blue online). Reconciling it with the species tree requires one transfer and one loss event. It is, however, possible to combine clades present in the second (green online) and third (blue online) trees to produce a gene tree that is not present in the original sample but is identical to the species tree, i.e. it requires no events to draw it into the species tree. Depending on the costs of transfer and loss events, and on the self-consistently estimated c parameter, the scenario without transfer might be optimal for the joint score.
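The conditional clade frequency computation described in the caption can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not taken from the paper: trees are rooted and binary, they are written as nested Python tuples of leaf names, and the function names (`leaves`, `splits`, `conditional_clade_frequencies`, `ccp_estimate`) and the six-leaf sample are hypothetical. The sketch covers only the CCP part of the estimate, not the reconciliation score or the c parameter.

```python
from collections import Counter
from fractions import Fraction


def leaves(tree):
    """Return the clade (frozenset of leaf names) spanned by a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    left, right = tree
    return leaves(left) | leaves(right)


def splits(tree):
    """Yield (clade, split) for every internal node of a rooted binary tree.

    A split is the unordered pair of child clades, stored as a frozenset.
    """
    if isinstance(tree, str):
        return
    left, right = tree
    left_clade, right_clade = leaves(left), leaves(right)
    yield left_clade | right_clade, frozenset([left_clade, right_clade])
    yield from splits(left)
    yield from splits(right)


def conditional_clade_frequencies(sample):
    """Frequency of each split among all sampled trees that contain its clade."""
    clade_counts = Counter()
    split_counts = Counter()
    for tree in sample:
        for clade, split in splits(tree):
            clade_counts[clade] += 1
            split_counts[(clade, split)] += 1
    return {
        (clade, split): Fraction(count, clade_counts[clade])
        for (clade, split), count in split_counts.items()
    }


def ccp_estimate(tree, freqs):
    """Product of conditional clade frequencies over the internal nodes of a tree."""
    probability = Fraction(1)
    for clade, split in splits(tree):
        probability *= freqs.get((clade, split), Fraction(0))
    return probability


# Hypothetical sample of two trees on six leaves (not the trees of Fig. 1).
sample = [
    ((("a", "b"), "c"), (("d", "e"), "f")),
    (("a", ("b", "c")), ("d", ("e", "f"))),
]
freqs = conditional_clade_frequencies(sample)

# Amalgamated tree: clade (abc) resolved as in the first tree, (def) as in the second.
amalgamated = ((("a", "b"), "c"), ("d", ("e", "f")))
print(ccp_estimate(amalgamated, freqs))  # 1/4
```

In this toy sample the amalgamated tree does not occur among the sampled trees, yet it receives a non-zero estimate because each of its splits was observed in at least one of them. Its logarithm is the term that the caption describes as being added to the reconciliation score.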
Fig. 2. (a) To compare the accuracy of TERA and other methods, we used the simulated data set of Szöllősi et al. (2013b). We find that TERA achieves accuracy statistically equivalent to that of ALE and better accuracy than the other methods; see main text for details. (b) To test for over- and underfitting of the species tree, we examined the 431 gene families with exactly one copy in each of the 36 cyanobacterial species. For each family, we plot the difference of the Robinson-Foulds (R-F) distance of the true tree to the species tree and the R-F distance of the reconstructed gene tree from the species tree. Negative values of the difference indicate overfitting, whereas in the case of underfitting we expect a positive value.
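The over- and underfitting measure in panel (b) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: trees are rooted nested tuples of leaf names, the R-F distance is computed in its rooted, clade-based form, and the helper names (`clades`, `rf_distance`, `fit_difference`) and the four-leaf trees are hypothetical. The order of subtraction is also an assumption, chosen so that a negative value means the reconstructed tree lies closer to the species tree than the true tree does, matching the sign convention stated in the caption.

```python
def clades(tree):
    """Return (leaf set of the subtree, set of clades at its internal nodes)."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    members = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        members |= child_leaves
        found |= child_clades
    found.add(members)
    return members, found


def rf_distance(tree_a, tree_b):
    """Rooted Robinson-Foulds distance: number of clades found in only one tree."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must share the same leaf set")
    return len(clades_a ^ clades_b)


def fit_difference(true_tree, reconstructed_tree, species_tree):
    """Assumed sign convention: negative means closer to the species tree than the truth."""
    return rf_distance(reconstructed_tree, species_tree) - rf_distance(true_tree, species_tree)


# Hypothetical four-leaf example.
species_tree = ((("a", "b"), "c"), "d")
true_tree = ((("a", "c"), "b"), "d")
reconstructed_tree = ((("a", "b"), "c"), "d")  # identical to the species tree

print(rf_distance(true_tree, species_tree))                         # 2
print(fit_difference(true_tree, reconstructed_tree, species_tree))  # -2
```

Here the reconstruction reproduces the species tree exactly although the true gene tree differs from it, so the difference is negative, which is the overfitting case.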
Mean runtimes in seconds for the methods discussed in the main text on a cluster of 2.1 GHz Intel Xeon processors with 24 GB of RAM with maximum runtime limited to 10 h per family
| Stand-alone [s] | Input [s] | |
|---|---|---|
| | 1000 samples | 10 000 samples |
| 3.65 | 756.6 | 7566 |
| 54.9 | 756.6 | — |
| 159.2 | 756.6 | 7566 |
| 6.3 | 182.5 | |
| 5718.0 | 182.5 | |
| 32 137.3 | 0 | |
The time required to compute inputs is given by the runtime of PhyloBayes for 1000 and 10 000 samples and by the time required for PhyML to compute an ML tree with SH branch supports. Stand-alone runtimes are given for 10 000 samples for TERA and ALE and for 1000 samples for AnGST.