Ryan R Newton, Irene L G Newton.
Abstract
A major goal of many evolutionary analyses is to determine the true evolutionary history of an organism. Molecular methods that rely on the phylogenetic signal generated by a handful of loci can be used to approximate the evolution of the entire organism but fall short of providing a global, genome-wide perspective on evolutionary processes. Indeed, individual genes in a genome may have different evolutionary histories. Therefore, it is informative to analyze the number and kind of phylogenetic topologies found within an orthologous set of genes across a genome. Here we present PhyBin: a flexible program for clustering gene trees based on topological structure. PhyBin can generate bins of topologies corresponding to exactly identical trees or can use Robinson-Foulds distance matrices to generate clusters of similar trees, using a user-defined threshold. Additionally, PhyBin allows the user to adjust for potential noise in the dataset (as may be produced when comparing very closely related organisms) by pre-processing trees to collapse very short branches or nodes that do not meet a defined bootstrap threshold. As a test case, we generated individual trees based on an orthologous gene set from 10 Wolbachia species across four different supergroups (A-D) and used PhyBin to categorize the complete set of topologies produced from this dataset. Using this approach, we were able to show that although a single topology generally dominated the analysis, confirming the separation of the supergroups, many genes supported alternative evolutionary histories. Because PhyBin's output provides the user with lists of gene trees in each topological cluster, it can be used to explore potential reasons for discrepancies between phylogenies, including homoplasies, long-branch attraction, or horizontal gene transfer events.
Keywords: Evolutionary history; Horizontal gene transfer; Phylogenetics; Robinson-Foulds; Wolbachia
Year: 2013 PMID: 24167782 PMCID: PMC3807594 DOI: 10.7717/peerj.187
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Compute time for PhyBin compared to two other distance matrix calculation programs.
The times below correspond to distance matrix computations only and were measured on the 150-taxa benchmark included with HashRF. All times in seconds. PhyBin times are given with different numbers of threads in parentheses. All times were taken on a 4-socket, 32-core Intel Xeon E7-4830 server running at 2.13 GHz with RHEL 6. Phylip was compiled with gcc 4.4.7 and “−O2”.
| Trees | PhyBin | HashRF | Phylip | DendroPy |
|---|---|---|---|---|
| 100 | 0.269 | 0.056 | 22.1 | 12.8 |
| 1000 | 4.7 (1), 3.0 (2), 1.9 (4), 1.4 (8) | 1.7 | | |
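The distance matrix timed above is an all-pairs Robinson-Foulds (RF) computation. As a minimal sketch of what that computation involves (not PhyBin's implementation), the Python below represents each tree as nested tuples of leaf names, extracts its non-trivial bipartitions, and counts the bipartitions found in only one of two trees. The tuple representation and the function names `bipartitions`, `rf_distance`, and `rf_matrix` are assumptions made for illustration.

```python
from itertools import combinations


def bipartitions(tree):
    """Return the non-trivial splits of an unrooted tree given as nested tuples."""
    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    all_leaves = leaves(tree)
    anchor = min(all_leaves)  # fixed reference taxon used to normalise each split
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        # Always store the side of the split that does NOT contain the anchor.
        side = all_leaves - clade if anchor in clade else clade
        # Skip trivial splits (a single leaf versus everything else).
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def rf_distance(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


def rf_matrix(trees):
    """Symmetric all-pairs RF matrix; splits are extracted once per tree."""
    splits = [bipartitions(t) for t in trees]
    n = len(trees)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = len(splits[i] ^ splits[j])
    return matrix


t1 = (("A", "B"), ("C", ("D", "E")))
t2 = (("A", "C"), ("B", ("D", "E")))
t3 = ("A", ("B", ("C", ("D", "E"))))  # same unrooted topology as t1
matrix = rf_matrix([t1, t2, t3])
```

In this sketch the quadratic pairwise loop dominates as the number of trees grows, which is consistent with the scaling seen between the 100-tree and 1000-tree rows.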
Figure 1. Wolbachia supergroup trees produced by concatenation of a dataset of 508 orthologs or by PhyBin’s binning and clustering algorithm.
In each of two modes (full clustering and binning) PhyBin is able to correctly recover the expected topology for the Wolbachia pipientis orthologs used herein. (A) Concatenated phylogeny based on 508 genes (using RAxML GTRGAMMA, bootstrap support based on 10,000 replicates). The four major supergroups are highlighted and denoted. (B) These same groups are recovered when PhyBin is run in either binning mode or (C) full clustering mode.
The behavior of PhyBin on an example dataset from the Wolbachia genus using binning mode.
Using PhyBin in binning mode on the Wolbachia orthologous gene set (503 trees total) yields bins whose number and size depend on the branch length threshold. The number of bins drops dramatically between branch length thresholds of 0 and 0.02, indicating a small amount of noise in the dataset due to the use of fairly similar taxa.
| Branch length | Number of bins | Number of | Size of largest |
|---|---|---|---|
| 0 | 222 | 149 | 16 |
| 0.01 | 175 | 129 | 133 |
| 0.02 | 95 | 68 | 201 |
| 0.03 | 61 | 40 | 172 |
| 0.04 | 48 | 29 | 161 |
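The effect of the branch length threshold can be illustrated with a small sketch, assuming a toy tree representation in which a leaf is `(name, length)` and an internal node is `(list_of_children, length)`. Internal branches shorter than the threshold are collapsed, and trees are then grouped by identical unrooted topology. The helper names and the three example gene trees are hypothetical, and this is not PhyBin's own code.

```python
from collections import defaultdict


def collapse(node, threshold):
    """Copy of `node` with internal branches shorter than `threshold` removed."""
    label, length = node
    if isinstance(label, str):
        return node
    children = []
    for child in label:
        child = collapse(child, threshold)
        sub, sub_len = child
        if not isinstance(sub, str) and sub_len < threshold:
            # Short internal edge: promote its children to the parent.
            # (Simplification: promoted branch lengths are left unchanged.)
            children.extend(sub)
        else:
            children.append(child)
    return (children, length)


def topology_key(tree):
    """Hashable key: the set of non-trivial splits of the unrooted tree."""
    clades = []

    def visit(node):
        label, _ = node
        if isinstance(label, str):
            return frozenset([label])
        clade = frozenset().union(*(visit(c) for c in label))
        clades.append(clade)
        return clade

    all_leaves = visit(tree)
    anchor = min(all_leaves)
    splits = set()
    for clade in clades:
        side = all_leaves - clade if anchor in clade else clade
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
    return frozenset(splits)


def bin_trees(named_trees, threshold=0.0):
    """Group tree names by identical topology after collapsing short branches."""
    bins = defaultdict(list)
    for name, tree in named_trees.items():
        bins[topology_key(collapse(tree, threshold))].append(name)
    return sorted(bins.values(), key=len, reverse=True)


def leaf(name):
    return (name, 0.1)


def gene_tree(x, y, z):
    """Hypothetical tree ((x,y):0.004,(z,(D,E):0.05):0.004) with a short conflicting edge."""
    de = ([leaf("D"), leaf("E")], 0.05)
    return ([([leaf(x), leaf(y)], 0.004), ([leaf(z), de], 0.004)], 0.0)


trees = {"g1": gene_tree("A", "B", "C"),
         "g2": gene_tree("A", "C", "B"),
         "g3": gene_tree("B", "C", "A")}
bins_exact = bin_trees(trees, threshold=0.0)
bins_collapsed = bin_trees(trees, threshold=0.01)
```

With a threshold of 0 the three toy trees fall into three separate bins; at 0.01 the short conflicting edges disappear and all three share one bin, mirroring on a small scale the drop in bin count reported in the table.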
Figure 2. Two trees of trees for the Wolbachia ortholog set as visualized by PhyBin.
Robinson-Foulds distance matrices produced by PhyBin are also visualized as a dendrogram by the software. (A) A tree of trees for the Wolbachia ortholog set (508 trees), clustered using an edit distance of 0, where identical topologies (nodes, grey ovals) are shown connected by a red line. The length of the branches connecting each node is proportional to the RF distance. (B) This dendrogram is simplified by increasing the RF distance at which the trees are clustered (shown: RF = 3). The top 10 clusters, which support different topologies, are colored as indicated in the legend (with the largest bin size for each cluster in parentheses).
The behavior of PhyBin on an example dataset from the Wolbachia genus using full clustering mode.
Using PhyBin in full clustering mode on the Wolbachia orthologous gene set (503 trees total) with average neighbor clustering produces a relatively small number of clusters, the largest of which comprises a majority of the orthologous genes.
| RF-distance threshold | Branch length | Number of | Number of | Size of largest |
|---|---|---|---|---|
| 0 | n/a | 222 | 149 | 16 |
| 1 | n/a | 140 | 67 | 34 |
| 2 | n/a | 77 | 29 | 56 |
| 0 | 0.01 | 175 | 129 | 133 |
| 0 | 0.02 | 95 | 68 | 201 |
| 1 | 0.02 | 66 | 35 | 246 |
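Average neighbor (average-linkage, UPGMA-style) clustering with a distance cut-off can be sketched as follows. This is a naive illustration rather than PhyBin's algorithm: it repeatedly merges the two clusters with the smallest mean pairwise distance and stops once that mean exceeds the threshold. Treating the threshold as inclusive is an assumption, as are the function name and the example matrix.

```python
from itertools import combinations


def average_linkage_clusters(dist, threshold):
    """Cluster item indices from a symmetric distance matrix `dist`.

    Clusters are merged while the smallest average inter-cluster distance
    is <= threshold (inclusive cut-off: an assumption of this sketch).
    """
    clusters = [[i] for i in range(len(dist))]

    def avg(a, b):
        # Mean distance over all cross-cluster pairs (average linkage).
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda p: avg(clusters[p[0]], clusters[p[1]]))
        if avg(clusters[x], clusters[y]) > threshold:
            break
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)

    # Largest clusters first, ties broken by member indices.
    return sorted((sorted(c) for c in clusters), key=lambda c: (-len(c), c))


# Hypothetical RF distances between four gene trees.
rf = [[0, 1, 4, 6],
      [1, 0, 4, 6],
      [4, 4, 0, 2],
      [6, 6, 2, 0]]
at_0 = average_linkage_clusters(rf, 0)
at_1 = average_linkage_clusters(rf, 1)
at_2 = average_linkage_clusters(rf, 2)
```

Raising the threshold in this toy example reduces four singleton clusters to two pairs, the same qualitative trend as the RF-distance rows of the table.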
Wolbachia orthologs that do not conform to the dominant topology are highlighted by PhyBin.
List of Wolbachia orthologous gene sets not conforming to the dominant topology when PhyBin is run using full clustering mode (--UPGMA, --editdist=3). Protein products predicted to be secreted (based on screening with the Effective database; Jehl, Arnold & Rattei, 2011) are italicized.
| Topology group | Orthologs (using wMel designations) |
|---|---|
| Support for splitting group A | Major facilitator family transporter (WD0470) |
| | GTP cyclohydrolase (WD0003) |