Erin K Molloy, Tandy Warnow.
Abstract
MOTIVATION: Species tree estimation is a basic part of biological research but can be challenging because of gene duplication and loss (GDL), which results in genes that can appear more than once in a given genome. All common approaches in phylogenomic studies either reduce available data or are error-prone, and thus, scalable methods that do not discard data and have high accuracy on large heterogeneous datasets are needed.
Year: 2020 PMID: 32657396 PMCID: PMC7355287 DOI: 10.1093/bioinformatics/btaa444
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Reduction of the RFS-multree problem to the Robinson-Foulds Supertree (RFS) problem. To compute the RF distance between a singly-labeled tree T (a; bottom left) and a multree M (b; top left), we replace M by a smaller singly-labeled tree (e; bottom center). We then compute the RF distance between T and this reduced tree using Equation (1). Here we explain why this works. Suppose that T (a) is a candidate singly-labeled, binary supertree for a set of multrees and that M (b) is one of the multrees in that set. To compute the RF distance between T and M, we extend T with respect to M, producing Ext(T, M) (c). Note that Ext(T, M) has the same non-trivial edges (shown in blue) and the same trivial edges (shown in orange) as T, and for every leaf label (species), it has the same number of leaves with that label as multree M. The trivial edges in Ext(T, M) exist in any possible singly-labeled, binary tree on S; thus, these edges do not impact the solution to the RFS-multree problem. Similarly, multree M has edges (shown in red) that will be incompatible with an extended version of any possible singly-labeled, binary tree on S; thus, these edges do not impact the solution to the RFS-multree problem. An edge is incompatible with every possible singly-labeled supertree if and only if it fails to induce a bipartition (i.e. deleting the edge splits the leaf labels into two sets that are not disjoint). Thus, we collapse all internal edges in M that fail to induce a bipartition, producing the tree in (d). Furthermore, because all leaves with the same label are now on the same side of every bipartition in the collapsed tree, we can delete all but one leaf with each label, producing the tree in (e). The resulting tree is a non-binary, singly-labeled tree on S, so we can compute the RF distance between T and the resulting tree using Equation (1) when searching for the solution to the RFS-multree problem. These observations are formalized in Lemma 13 (Appendix), and it follows that an RFS-multree supertree for the set of multrees is an RF supertree for the corresponding set of reduced singly-labeled trees, as summarized in Theorem 2.
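The reduction described in the Fig. 1 caption can be illustrated with a minimal Python sketch. It is not the FastMulRFS implementation. It assumes trees are written as nested tuples with string leaf labels, and it assumes that Equation (1) is the usual symmetric-difference RF distance over non-trivial bipartitions, with T restricted to the labels present in M. The function names (`leaf_counts`, `bipartitions`, `rf_distance`) and the toy trees are hypothetical and chosen only for illustration. Rather than building the collapsed tree explicitly, the sketch keeps only those edges whose two sides share no label, which yields the same bipartition set as collapsing the other edges and deleting duplicate leaves.

```python
from collections import Counter


def leaf_counts(tree):
    """Count leaf labels below a node (leaf = str, internal node = tuple)."""
    if isinstance(tree, str):
        return Counter([tree])
    total = Counter()
    for child in tree:
        total += leaf_counts(child)
    return total


def bipartitions(tree, restrict_to=None):
    """Non-trivial label bipartitions of a tree or multree.

    An edge is kept only if the label sets on its two sides are disjoint,
    which mirrors collapsing edges that fail to induce a bipartition and
    then deleting all but one leaf per label.
    """
    all_counts = leaf_counts(tree)
    if restrict_to is not None:
        all_counts = Counter(
            {k: v for k, v in all_counts.items() if k in restrict_to}
        )
    labels = set(all_counts)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return Counter([node]) if node in labels else Counter()
        below = Counter()
        for child in node:
            below += visit(child)
        above = all_counts - below  # Counter subtraction drops zero counts
        side_a, side_b = set(below), set(above)
        if side_a.isdisjoint(side_b) and len(side_a) >= 2 and len(side_b) >= 2:
            found.add(frozenset({frozenset(side_a), frozenset(side_b)}))
        return below

    visit(tree)
    return found


def rf_distance(species_tree, multree):
    """Symmetric-difference RF distance (assumed form of Equation (1))."""
    labels = set(leaf_counts(multree))
    return len(bipartitions(species_tree, restrict_to=labels) ^ bipartitions(multree))


# Hypothetical toy trees on species A-E.
T = ((("A", "B"), "C"), ("D", "E"))
# Duplication above A, B, C with no losses: edges below it are discarded.
M_dup = (((("A", "B"), "C"), (("A", "B"), "C")), ("D", "E"))
# Duplication followed by losses that leave one copy per species.
M_loss = ((("A", "C"), "B"), ("D", "E"))

print(rf_distance(T, M_dup))   # 1
print(rf_distance(T, M_loss))  # 2
```

In the first toy multree only the bipartition ABC|DE survives, so the reduced tree is non-binary and T is penalized only for the one bipartition the reduced tree does not resolve. In the second, every edge induces a bipartition, and one of them (AC|BDE) conflicts with T.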
Fig. 2. Impact of gene duplications and losses (GDL) on species tree estimation using RFS-multree methods. (a) Shows a species tree and (b) through (d) show three gene family trees that evolved within the species tree. (b) Shows a gene family tree with a duplication event in species Y (i.e. the most recent common ancestor of species A, B and C). All edges below the duplication (shown in red) fail to induce bipartitions and so will be contracted, and will therefore not impact the solution space for the RFS-multree criterion. (c) Shows a gene family tree with a duplication event in species Y followed by the first copy of the gene being lost from species B and the second copy of the gene being lost from species C. Because one of the species that evolved from Y retains both copies of the gene, the non-trivial edges below the duplication node fail to induce bipartitions, and so these edges also do not impact the solution space for RFS-multree. (d) Shows a gene family tree with a duplication event in species Y followed by the first copy of the gene being lost from species B and the second copy of the gene being lost from both species A and C. None of the species that evolved from Y retain both copies of the gene, so all edges below the duplication node induce bipartitions and hence will not be contracted; we refer to this situation as ‘adversarial GDL’, because it produces bipartitions in the reduced singly-labeled trees that conflict with the species tree (shown in blue). Such a scenario leads to the possibility that the true species tree may not be an optimal solution to the RFS-multree problem.
Fig. 3. Species tree error rate (i.e. RF error rate) and running time (s) are shown for FastMulRFS, MulRF and ASTRAL-multi under the most challenging model conditions with 100 species. All datasets have the second highest GTEE level (moderate GTEE: 52%), the highest ILS level (low/moderate ILS: 12%) and the highest GDL rate (D/L rate: ). Red dots (first row) and bars (second row) are means for 10 replicate datasets.