Prabhav Kalaghatgi, Nico Pfeifer, Thomas Lengauer.
Abstract
The widely used model for evolutionary relationships is a bifurcating tree with all taxa/observations placed at the leaves. This is not appropriate if the taxa have been densely sampled across evolutionary time and may be in a direct ancestral relationship, or if there is not enough information to fully resolve all the branching points in the evolutionary tree. In this article, we present a fast distance-based agglomeration method called family-joining (FJ) for constructing so-called generally labeled trees, in which taxa may be placed at internal vertices and the tree may contain polytomies. FJ constructs such trees on the basis of pairwise distances and a distance threshold. We tested three methods for threshold selection, FJ-AIC, FJ-BIC, and FJ-CV, which minimize the Akaike information criterion, the Bayesian information criterion, and the cross-validation error, respectively. When compared with related methods on simulated data, FJ-BIC was among the best at reconstructing the correct tree across a wide range of simulation scenarios. FJ-BIC was applied to HIV sequences sampled from individuals involved in a known transmission chain. The FJ-BIC tree was found to be compatible with almost all transmission events. On average, internal branches in the FJ-BIC tree have higher bootstrap support than branches in the leaf-labeled bifurcating tree constructed using RAxML: 36% of the internal branches in the FJ-BIC tree have bootstrap support greater than 70%, compared with 25% in the RAxML tree. To the best of our knowledge, the method presented here is the first attempt at modeling evolutionary relationships using generally labeled trees.
Keywords: densely sampled taxa; distance-based phylogenies; generally labeled trees; latent tree graphical models.
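The threshold selection described in the abstract can be illustrated with a minimal sketch. It uses the standard definitions AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L. The candidate thresholds, log-likelihoods, and parameter counts below are hypothetical values chosen for illustration, and treating n as the number of alignment sites is an assumption; the abstract does not state how FJ computes these quantities.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2 * log_lik

# Hypothetical candidates: (distance threshold, log-likelihood, free parameters).
# A larger threshold is assumed to contract more edges, giving fewer parameters
# at the cost of a lower likelihood.
candidates = [
    (0.0, -5200.0, 317),
    (0.005, -5203.0, 290),
    (0.02, -5260.0, 240),
]
n_sites = 1000  # assumed sample size (number of alignment sites)

best_aic = min(candidates, key=lambda c: aic(c[1], c[2]))[0]
best_bic = min(candidates, key=lambda c: bic(c[1], c[2], n_sites))[0]
print("threshold chosen by AIC:", best_aic)
print("threshold chosen by BIC:", best_bic)
```

With these made-up numbers the two criteria disagree: BIC penalizes each parameter by ln n rather than 2, so it favors the candidate with fewer parameters, which here corresponds to a less resolved tree.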
Year: 2016 PMID: 27436007 PMCID: PMC5026249 DOI: 10.1093/molbev/msw123
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Fig. 1. (A) The tree-additive distances used in this example. Labeled vertices are represented by solid circles and latent vertices by white circles with black border. (B–G) The agglomeration steps of FJ, which identifies the correct tree topology. The edges that are inferred in each agglomeration step are shown as solid lines. The dotted lines connect the labeled and latent vertices that will be used in the next iteration. (H) The correct branch lengths estimated using OLS.
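The last panel of this caption refers to estimating branch lengths by ordinary least squares (OLS) on a fixed topology. The following is a minimal sketch of that idea, not the authors' implementation: each pairwise distance between labeled vertices is modeled as the sum of the branch lengths on the path between them, and the resulting linear system is solved through the normal equations. The toy tree, its vertex names, and the helper functions are hypothetical; restricting the fit to labeled pairs is an assumption. The solver is written in pure Python to keep the block self-contained; `numpy.linalg.lstsq` would be the usual choice in practice.

```python
from itertools import combinations

def path_edges(adj, src, dst):
    """Return the edges on the unique path from src to dst in a tree."""
    stack = [(src, None, [])]
    while stack:
        v, parent, path = stack.pop()
        if v == dst:
            return path
        for w in adj[v]:
            if w != parent:
                stack.append((w, v, path + [frozenset((v, w))]))
    raise ValueError("vertices are not connected")

def solve(M, y):
    """Solve M x = y by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [row[:] + [y[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def ols_branch_lengths(edges, labeled, dist):
    """Least-squares branch lengths for a fixed tree topology."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    keys = [frozenset(e) for e in edges]
    A, d = [], []  # path-incidence rows and observed distances
    for u, v in combinations(labeled, 2):
        on_path = set(path_edges(adj, u, v))
        A.append([1.0 if k in on_path else 0.0 for k in keys])
        d.append(dist[frozenset((u, v))])
    m = len(keys)
    AtA = [[sum(r[i] * r[j] for r in A) for j in range(m)] for i in range(m)]
    Atd = [sum(r[i] * x for r, x in zip(A, d)) for i in range(m)]
    return dict(zip(edges, solve(AtA, Atd)))

# Hypothetical generally labeled tree: h is latent, c is a labeled internal vertex.
edges = [("a", "h"), ("b", "h"), ("h", "c"), ("c", "d")]
labeled = ["a", "b", "c", "d"]
# Tree-additive distances generated from branch lengths 0.1, 0.2, 0.3, 0.4.
dist = {
    frozenset("ab"): 0.3, frozenset("ac"): 0.4, frozenset("ad"): 0.8,
    frozenset("bc"): 0.5, frozenset("bd"): 0.9, frozenset("cd"): 0.4,
}
lengths = ols_branch_lengths(edges, labeled, dist)
print(lengths)
```

Because the input distances in this toy example are exactly tree-additive, the least-squares solution reproduces the generating branch lengths up to floating-point error. With noisy distances the same code returns the best fit in the squared-error sense, and it may return negative lengths since no non-negativity constraint is imposed.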
Simulated Data Sets Were Constructed by Varying One of the Following: Tree Type, Proportion of Labeled Internal Vertices, Type of Contracted Edge, Average Branch Length, Number of Labeled Vertices, or Sequence Length.
| Parameter | Settings | | | | |
|---|---|---|---|---|---|
| Tree type | Balanced | Random* | Unbalanced | | |
| Fraction of latent vertices | 0.5 | 0.37 | 0.25* | 0.12 | 0 |
| Contracted edge | | | | | |
| Average branch length | 0.001 | 0.004 | 0.016* | 0.064 | 0.256 |
| Number of labeled vertices | 20 | 40 | 80 | 160* | 320 |
| Sequence length | 250 | 500 | 1,000* | 2,000 | 4,000 |
Note.—All settings that were considered for each parameter are shown above. The default setting for each parameter is indicated with *.
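The one-parameter-at-a-time design in this table can be written down as a small configuration sketch. The dictionary keys and the `scenarios` helper are hypothetical names introduced for illustration. The "contracted edge" parameter is left out because its settings are missing from the table as reproduced here.

```python
# Settings transcribed from the table; key names are illustrative only.
SETTINGS = {
    "tree_type": ["balanced", "random", "unbalanced"],
    "fraction_latent": [0.5, 0.37, 0.25, 0.12, 0.0],
    "avg_branch_length": [0.001, 0.004, 0.016, 0.064, 0.256],
    "n_labeled": [20, 40, 80, 160, 320],
    "seq_length": [250, 500, 1000, 2000, 4000],
}

# Defaults are the entries marked with * in the table.
DEFAULTS = {
    "tree_type": "random",
    "fraction_latent": 0.25,
    "avg_branch_length": 0.016,
    "n_labeled": 160,
    "seq_length": 1000,
}

def scenarios():
    """Yield (varied parameter, full configuration), varying one parameter at a time."""
    for name, values in SETTINGS.items():
        for value in values:
            yield name, {**DEFAULTS, name: value}

all_scenarios = list(scenarios())
for name, config in all_scenarios[:3]:
    print(name, config)
```

Each category includes the all-defaults configuration once, so the default scenario appears in every category and the categories share a common reference point.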
Fig. 2. A comparison of the reconstruction accuracy of all methods in six simulation categories. One parameter (x-axis) was varied in each category. The default parameter settings are denoted as parameterValue (d) on each x-axis. For each parameter setting, 100 data sets were created. Precision is shown in blue and recall is shown in pink.
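Precision and recall for a reconstructed tree can be sketched as follows. The caption does not say which tree features are compared. This sketch assumes the common convention of comparing the splits (bipartitions of the labeled vertex set) induced by the edges of each tree. The `split` and `precision_recall` helpers and the five-taxon example are hypothetical.

```python
def split(side, taxa):
    """Canonical representation of a bipartition of the taxon set."""
    side = frozenset(side)
    return frozenset((side, frozenset(taxa) - side))

def precision_recall(true_splits, inferred_splits):
    """Precision and recall of inferred splits against the true splits."""
    true_set, inferred = set(true_splits), set(inferred_splits)
    tp = len(true_set & inferred)  # splits present in both trees
    precision = tp / len(inferred) if inferred else 1.0
    recall = tp / len(true_set) if true_set else 1.0
    return precision, recall

taxa = "abcde"
# Hypothetical true tree with two internal edges: ab|cde and abc|de.
true_splits = [split("ab", taxa), split("abc", taxa)]
# Hypothetical inferred tree with a polytomy: only ab|cde is resolved.
inferred_splits = [split("ab", taxa)]

precision, recall = precision_recall(true_splits, inferred_splits)
print(precision, recall)
```

Under this convention, an inferred tree that leaves a branching point unresolved loses recall but not precision, whereas a fully resolved tree containing a wrong split loses both.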
Methods with Significantly High Precision and Recall Are Shown Below.
Note.—All methods that are not significantly worse than the best method are also shown. F, N, R, C, and S stand for FJ-BIC, NJc-BIC, RG-BIC, CLRG-BIC, and SA, respectively. Black and red indicate methods with the highest precision and recall, respectively. The default setting for each simulation parameter is indicated with *.
Fig. 3. A comparison of run times of all methods in the scenario where the number of labeled vertices was varied. Run times are shown on a log-scale.
Fig. 4. The FJ-BIC tree of 181 HIV-1 env gene sequences sampled from hosts involved in a known transmission chain. Each vertex is represented by a circle whose inner color is black if the vertex is labeled and white if the vertex is latent. The outer color of each circle indicates the host of the corresponding vertex. Branches reflecting transmission events have been labeled. Nine out of ten transmission events are compatible with the FJ-BIC tree. The red box highlights the transmission event which is not compatible with the FJ tree.
Fig. 5. Left: Comparing the support of common branches in the FJ-BIC tree and the RAxML tree. Right: Supports for branches that are only present in either the FJ-BIC tree or the RAxML tree.
Fig. 6. An illustration of the FJ algorithm. The main steps have been labeled with their time complexity.
Fig. 7. The three cases for the internal edge e0. Case 1: Neither α nor β is labeled. Case 2: Only α is labeled. Case 3: Both α and β are labeled. The triangles represent subtrees.
Fig. 8. The two cases for the terminal edge e0. α is not labeled in case 1 and is labeled in case 2. The triangles represent subtrees.