Hussein A Hejase, Kevin J Liu.
Abstract
BACKGROUND: Branching events in phylogenetic trees reflect bifurcating and/or multifurcating speciation and splitting events. In the presence of gene flow, a phylogeny cannot be described by a tree but is instead a directed acyclic graph known as a phylogenetic network. Both phylogenetic trees and networks are typically reconstructed using computational analysis of multi-locus sequence data. The advent of high-throughput sequencing technologies has brought about two main scalability challenges: (1) dataset size in terms of the number of taxa and (2) the evolutionary divergence of the taxa in a study. The impact of both dimensions of scale on phylogenetic tree inference has been well characterized by recent studies; in contrast, the scalability limits of phylogenetic network inference methods are largely unknown.
Keywords: Gene flow; Incomplete lineage sorting; Large-scale; Mouse; Mutation; Performance study; Phylogenetic inference; Phylogenetic network; Phylogenetics; Phylogenomics; Scalability
Year: 2016 PMID: 27737628 PMCID: PMC5064893 DOI: 10.1186/s12859-016-1277-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1. The impact of dataset size on the computational requirements of multi-locus methods. The model conditions had dataset sizes ranging from 5 to 25 taxa. Results are shown for MLE, MLE-length, MPL, MP, and SNaQ analyses using true gene trees as input. (a) Average runtime (h) and (b) main memory usage (GiB) are shown with standard error bars (n=20). The analysis of MLE on 15 taxa, MLE-length on 25 taxa, and MPL on 25 taxa did not complete after ten days of runtime.
Fig. 2. The impact of dataset size on the topological accuracy of multi-locus methods. The model conditions had dataset sizes ranging from 5 to 25 taxa. Results are shown for MP, MLE, MLE-length, MPL, and SNaQ using true gene trees as input. The tripartition distance between an inferred network and the model network was used to measure topological accuracy. Average distance and standard error bars are shown (n=20).
Fig. 3. The impact of mutation rate on the topological accuracy of MLE-length. We assessed the performance of MLE-length to characterize the accuracy of multi-locus inference methods since MLE-length was generally more accurate than MLE, SNaQ, MPL, and MP (Fig. 2). The seven-taxon model conditions had mutation rate θ ranging from 0.02 to 0.64. The tripartition distance between an inferred network and the model network was used to measure topological accuracy. Average distance and standard error bars are shown (n=20).
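The tripartition distance used in Figs. 2 and 3 can be illustrated with a minimal Python sketch. It is not the implementation used in the study. It assumes a rooted network stored as a hypothetical `dict` mapping each internal node to its children, and it assumes the common definition in which each edge splits the leaves into strict descendants, non-strict descendants, and non-descendants of the edge's head node. As a simplification, edges with identical tripartitions are counted once, and the distance is the average of the two normalized set differences.

```python
from itertools import chain


def leaves_below(net, node, blocked=None):
    """Leaves reachable from `node` without passing through `blocked`."""
    stack, seen, found = [node], set(), set()
    while stack:
        n = stack.pop()
        if n in seen or n == blocked:
            continue
        seen.add(n)
        kids = net.get(n, [])
        if not kids:
            found.add(n)  # a node with no children is a leaf
        stack.extend(kids)
    return found


def tripartitions(net):
    """Set of (strict, non-strict, non-descendant) leaf triples, one per edge head."""
    has_parent = set(chain.from_iterable(net.values()))
    root = next(n for n in net if n not in has_parent)  # assumes a single root
    all_leaves = leaves_below(net, root)
    parts = set()
    for kids in net.values():
        for v in kids:
            reach = leaves_below(net, v)
            # Leaves still reachable from the root when v is removed
            # are not strict descendants of v.
            bypass = leaves_below(net, root, blocked=v)
            strict = reach - bypass
            parts.add((frozenset(strict),
                       frozenset(reach - strict),
                       frozenset(all_leaves - reach)))
    return parts


def tripartition_distance(net_a, net_b):
    """Normalized distance in [0, 1]; 0 means identical tripartition sets."""
    ta, tb = tripartitions(net_a), tripartitions(net_b)
    return 0.5 * (len(ta - tb) / len(ta) + len(tb - ta) / len(tb))


# Hypothetical toy inputs: two trees and one network with a reticulation node "h".
tree_ab = {"root": ["x", "C"], "x": ["A", "B"]}
tree_bc = {"root": ["y", "A"], "y": ["B", "C"]}
network = {"root": ["u", "v"], "u": ["A", "h"], "v": ["h", "C"], "h": ["B"]}
```

In this toy example, leaf `B` is a non-strict descendant of `u` because it can also be reached through `v`, which is what distinguishes the network's tripartitions from those of either tree.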
Topological distances between inferred phylogenies in the empirical study
Average (SE) topological distance between inferred phylogenetic networks
| Method | MLE-length | MP | SNaQ |
|---|---|---|---|
| MLE-length | .11 (.02) | .42 (.06) | .44 (.04) |
| MP | | .36 (.03) | .52 (.05) |
| SNaQ | | | .23 (.02) |
Phylogenies were inferred using a representative method from each category of multi-locus methods: MLE-length (a full likelihood probabilistic method), MP (a parsimony-based method), and SNaQ (a pseudo-likelihood-based probabilistic method). The normalized tripartition distance between solutions that included gene flow (i.e., phylogenetic networks with one reticulation) is shown as an average (standard error) across replicates (n=20). When constrained to infer a phylogenetic tree rather than a phylogenetic network, all methods inferred an identical species tree across all replicates. Each replicate dataset was constructed by randomly selecting a sample from the following mouse species and subspecies: Mus musculus domesticus, M. musculus musculus, M. musculus castaneus, M. spretus, M. spicilegus, and M. macedonicus.
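The "average (standard error)" entries reported across replicates can be computed as in the following minimal Python sketch. It assumes the standard error is the sample standard deviation (n-1 denominator) divided by the square root of the number of replicates; the source does not state which estimator was used. The function name and the input values are hypothetical and are not data from the study.

```python
import math
import statistics


def mean_and_se(distances):
    """Return (mean, standard error) for per-replicate distances.

    Assumption: SE = sample standard deviation / sqrt(n).
    """
    n = len(distances)
    mean = statistics.fmean(distances)
    se = statistics.stdev(distances) / math.sqrt(n)
    return mean, se


# Hypothetical per-replicate distances (illustration only).
example = [0.1, 0.2, 0.3]
avg, se = mean_and_se(example)
print(f"{avg:.2f} ({se:.2f})")
```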