Ajanthah Sangaralingam, Edward Susko, David Bryant, Matthew Spencer.
Abstract
BACKGROUND: Phylogenetic reconstruction methods based on gene content often place all the parasitic and endosymbiotic eubacteria (parasites for short) together in a clan. Many other lines of evidence point to this parasites clan being an artefact. This artefact could be a consequence of the methods used to construct ortholog databases (due to some unknown bias), the methods used to estimate the phylogeny, or both.

We test the idea that the parasites clan is an ortholog identification artefact by analyzing three different ortholog databases (COG, TRIBES, and OFAM), which were constructed using different methods, and are thus unlikely to share the same biases. In each case, we estimate a phylogeny using an improved version of the conditioned logdet distance method. If the parasites clan appears in trees from all three databases, it is unlikely to be an ortholog identification artefact.

Accelerated loss of a subset of gene families in parasites (a form of heterotachy) may contribute to the difficulty of estimating a phylogeny from gene content data. We test the idea that heterotachy is the underlying reason for the estimation of an artefactual parasites clan by applying two different mixture models (phylogenetic and non-phylogenetic), in combination with conditioned logdet. In these models, there are two categories of gene families, one of which has accelerated loss in parasites. Distances are estimated separately from each category by conditioned logdet. This should reduce the tendency for tree estimation methods to group the parasites together, if heterotachy is the underlying reason for estimation of the parasites clan.
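To make the distance step concrete, here is a minimal sketch of a plain logdet (paralinear-style) distance between two presence/absence profiles. It is an illustration only: it does not implement the conditioning step or the improvements the abstract refers to, and the function name `logdet_distance` and the 0/1 list encoding are assumptions introduced here. Scaling conventions for logdet distances vary between sources.

```python
import math

def logdet_distance(x, y):
    """Plain logdet distance between two 0/1 gene-family profiles.

    Hypothetical helper for illustration; NOT the conditioned logdet
    method used in the paper.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("profiles must be non-empty and of equal length")
    # n[a][b] counts gene families with state a in genome x and b in genome y.
    n = [[0, 0], [0, 0]]
    for a, b in zip(x, y):
        n[a][b] += 1
    # Determinant of the joint count matrix.
    det_joint = n[0][0] * n[1][1] - n[0][1] * n[1][0]
    # Marginal counts for each genome (absent, present).
    row = [n[0][0] + n[0][1], n[1][0] + n[1][1]]
    col = [n[0][0] + n[1][0], n[0][1] + n[1][1]]
    marg = row[0] * row[1] * col[0] * col[1]
    # The distance is undefined when a determinant is not positive.
    if det_joint <= 0 or marg == 0:
        return math.inf
    # -ln( det F / sqrt(det Pi_x * det Pi_y) ); sample-size factors cancel.
    return -math.log(det_joint / math.sqrt(marg))
```

For example, `logdet_distance([1, 1, 0, 0], [1, 1, 0, 0])` gives zero for identical profiles, and the value grows as the two profiles become less correlated.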
Year: 2010 PMID: 21062453 PMCID: PMC2992526 DOI: 10.1186/1471-2148-10-343
Source DB: PubMed Journal: BMC Evol Biol ISSN: 1471-2148 Impact factor: 3.260
Figure 1. Unrooted radial cladogram from COG using conditioned logdet distances and modified BIONJ. Majority rule consensus of 200 bootstrap replicates from PHYLIP CONSENSE. This tree was drawn using Dendroscope [42]. Edge lengths not to scale.
Figure 2. Unrooted radial cladogram from COG using conditioned logdet distances and the non-phylogenetic model. Majority rule consensus of 200 bootstrap replicates from PHYLIP CONSENSE. This tree was drawn using Dendroscope [42]. Edge lengths not to scale.
Figure 3. Unrooted radial cladogram from COG using conditioned logdet distances and the phylogenetic model. Majority rule consensus of 200 bootstrap replicates from PHYLIP CONSENSE. This tree was drawn using Dendroscope [42]. Edge lengths not to scale.
Figure 4. Proportion of essential gene families in each COG functional category. Red circles: proportion of essential genes from the phylogenetic model. Green triangles: proportion of essential genes from the non-phylogenetic model.
Number of essential and non-essential gene families found in the same category.
| Number of gene families | Non-essential (phylogenetic) | Essential (phylogenetic) |
|---|---|---|
| Non-essential (non-phylogenetic) | 3356 | 76 |
| Essential (non-phylogenetic) | 1177 | 64 |
Agreement table showing the number of essential and non-essential genes found in the same category using both phylogenetic and non-phylogenetic mixture models.
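The agreement between the two mixture models can be read off the table with a little arithmetic. The sketch below only re-adds the four published counts; the variable names are illustrative and not from the paper.

```python
# Counts from the agreement table.
# Keys are (non-phylogenetic label, phylogenetic label).
counts = {
    ("non-essential", "non-essential"): 3356,
    ("non-essential", "essential"): 76,
    ("essential", "non-essential"): 1177,
    ("essential", "essential"): 64,
}

total = sum(counts.values())
# Gene families given the same label by both models (diagonal cells).
agree = sum(v for (a, b), v in counts.items() if a == b)
# Gene families labelled essential by each model (marginal totals).
essential_nonphylo = sum(v for (a, _), v in counts.items() if a == "essential")
essential_phylo = sum(v for (_, b), v in counts.items() if b == "essential")

agreement = agree / total
print(f"{agree} of {total} gene families agree ({agreement:.1%})")
print(f"essential: {essential_nonphylo} (non-phylogenetic), "
      f"{essential_phylo} (phylogenetic)")
```

The marginal totals show that most of the disagreement comes from gene families labelled essential by the non-phylogenetic model only.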
Equilibrium probabilities of gene family presence from phylogenetic and non-phylogenetic mixture models.
| Database | Model | πp | πq | πr |
|---|---|---|---|---|
| COG | Phylogenetic | 0.08 | 0.87 | 0.87 |
| COG | Non-phylogenetic | 0.02 | 0.16 | 0.69 |
| OFAM | Non-phylogenetic | 1.19 × 10⁻⁹ | 4.67 × 10⁻³ | 0.21 |
| TRIBES | Non-phylogenetic | 0.019 | 0.074 | 0.58 |
πp: probability that a non-essential gene is present in a parasitic genome. πq: probability that a non-essential gene is present in a non-parasitic genome. πr: probability of presence of essential gene families in both parasitic and non-parasitic genomes.
Robinson-Foulds distance between pairs of trees estimated using a range of methods.
| Tree pair | COG | TRIBES | OFAM |
|---|---|---|---|
| CL/SHOT | 30 | 74 | 89 |
| CL non-phylo/SHOT | 22 | 60 | 58 |
| CL phylo/SHOT | 32 | n/a | n/a |
| | | n/a | n/a |
| SHOT/RNA | 42 | 82 | 74 |
CL (conditioned logdet); CL non-phylo (conditioned logdet with non-phylogenetic mixture model); CL phylo (conditioned logdet with phylogenetic mixture model); SHOT (SHOT distances and BIONJ); RNA (16S rRNA and PHYML). Distances marked n/a were not calculated because the phylogenetic mixture model was only applied to the COG dataset.
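As a reminder of what the table measures, here is a minimal sketch of the Robinson-Foulds distance between two unrooted trees on the same taxa: the number of non-trivial splits found in one tree but not the other. The nested-tuple tree encoding and the function names are assumptions made for this example, and some programs report half of this value.

```python
def leaf_set(tree):
    """All leaf names below a node; leaves are strings, internal nodes tuples."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def splits(tree):
    """Non-trivial splits of a tree, ignoring the position of the root."""
    taxa = leaf_set(tree)
    ref = min(taxa)  # each split is stored as the side NOT containing ref
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = taxa - clade if ref in clade else clade
        # Keep only splits with at least two taxa on each side.
        if 1 < len(side) < len(taxa) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found

def robinson_foulds(tree_a, tree_b):
    """Size of the symmetric difference of the two split sets."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must have the same taxa")
    return len(splits(tree_a) ^ splits(tree_b))

t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(t1, t2))  # prints 2
```

The two example trees share the split separating D and E from the rest and differ in one split each, which gives a distance of 2.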
Percentage of estimated trees containing parasites clan
| Phylogeny estimation method | Trees containing parasites clan (%) |
|---|---|
| Conditioned logdet | 100 |
| Conditioned logdet and non-phylogenetic model | 78 |
| Conditioned logdet and phylogenetic model | 0 |
Results of applying the three phylogeny estimation methods to 100 simulated datasets: the percentage of estimated trees that contain the parasites clan under conditioned logdet alone, conditioned logdet with the non-phylogenetic mixture model, and conditioned logdet with the phylogenetic mixture model.
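The percentages above come down to a simple check on each estimated tree: a group of taxa forms a clan in an unrooted tree when it is exactly one side of a split. The following is a sketch of that check under an assumed nested-tuple tree encoding; the function names and the toy taxon labels are hypothetical and not taken from the paper's simulations.

```python
def clades(tree):
    """Leaf sets below every node, in post-order (the root comes last)."""
    found = []

    def visit(node):
        if isinstance(node, str):
            clade = frozenset([node])
        else:
            clade = frozenset().union(*(visit(child) for child in node))
        found.append(clade)
        return clade

    visit(tree)
    return found

def has_clan(tree, group):
    """True if `group` is one side of some split of the unrooted tree."""
    group = frozenset(group)
    all_clades = clades(tree)
    taxa = all_clades[-1]  # leaf set of the root
    if not 0 < len(group) < len(taxa):
        return False
    # A split is a clade or its complement, so check both sides.
    return any(group == c or group == taxa - c for c in all_clades)

def percent_with_clan(trees, group):
    """Percentage of trees in which `group` forms a clan."""
    return 100 * sum(has_clan(t, group) for t in trees) / len(trees)

# Toy example: P1 and P2 stand in for parasites.
toy_trees = [
    ((("P1", "P2"), "A"), ("B", "C")),
    ((("P1", "A"), "P2"), ("B", "C")),
]
print(percent_with_clan(toy_trees, {"P1", "P2"}))  # prints 50.0
```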