The Apicomplexan whole-genome phylogeny: an analysis of incongruence among gene trees.

Chih-Horng Kuo, John P Wares, Jessica C Kissinger.

Abstract

The protistan phylum Apicomplexa contains many important pathogens and is the subject of intense genome sequencing efforts. Based upon the genome sequences from seven apicomplexan species and a ciliate outgroup, we identified 268 single-copy genes suitable for phylogenetic inference. Both concatenation and consensus approaches inferred the same species tree topology. This topology is consistent with most prior conceptions of apicomplexan evolution based upon ultrastructural and developmental characters, that is, the piroplasm genera Theileria and Babesia form the sister group to the Plasmodium species, the coccidian genera Eimeria and Toxoplasma are monophyletic and are the sister group to the Plasmodium species and piroplasm genera, and Cryptosporidium forms the sister group to the above mentioned with the ciliate Tetrahymena as the outgroup. The level of incongruence among gene trees appears to be high at first glance; only 19% of the genes support the species tree, and a total of 48 different gene-tree topologies are observed. Detailed investigations suggest that the low signal-to-noise ratio in many genes may be the main source of incongruence. The probability of being consistent with the species tree increases as a function of the minimum bootstrap support observed at tree nodes for a given gene tree. Moreover, gene sequences that generate high bootstrap support are robust to the changes in alignment parameters or phylogenetic method used. However, caution should be taken in that some genes can infer a "wrong" tree with strong support because of paralogy, model violations, or other causes. The importance of examining multiple, unlinked genes that possess a strong phylogenetic signal cannot be overstated.

Year:  2008        PMID: 18820254      PMCID: PMC2582981          DOI: 10.1093/molbev/msn213

Source DB:  PubMed          Journal:  Mol Biol Evol        ISSN: 0737-4038            Impact factor:   16.240


Introduction

The protistan phylum Apicomplexa contains many important pathogens (Levine 1988). The most infamous members of this phylum are the causative agents of malaria from the genus Plasmodium, which causes more than one million human deaths per year globally (WHO and UNICEF 2005). Other important lineages include Babesia, which causes babesiosis in ruminants and humans (Brayton et al. 2007); Cryptosporidium, which causes cryptosporidiosis in humans and animals (Abrahamsen et al. 2004); Theileria, which causes tropical theileriosis and East Coast fever in cattle (Gardner et al. 2005; Pain et al. 2005); and Toxoplasma, which causes toxoplasmosis in immunocompromised patients and congenitally infected fetuses (Montoya and Liesenfeld 2004). These pathogens have been subjected to intense genome sequencing efforts in the hope of facilitating biomedical research (Tarleton and Kissinger 2001; Carlton 2003). The recent availability of fully annotated genome sequences from multiple species within this phylum provides a new and exciting opportunity for us to better understand the phylogeny of these important pathogens. The use of genome sequences for phylogenetic inference has only recently become possible. The large number of characters derived from genomic data allows robust inference of organismal phylogeny (Delsuc et al. 2005; Philippe, Delsuc, et al. 2005; Rokas 2006), even when the level of incomplete lineage sorting is high (Pollard et al. 2006). Initially, it was thought that use of genomic data would bring an end to the incongruence commonly observed in multigene molecular phylogenetic inference (Gee 2003; Rokas et al. 2003). However, further investigations suggest that the results from genome-scale phylogenetic inference should be interpreted with caution (Soltis et al. 2004; Jeffroy et al. 2006; Nishihara et al. 2007). 
Although genomic data can effectively suppress stochastic noise in shorter molecular sequences, the large amount of data can actually strengthen systematic biases when present (Phillips et al. 2004; Rodriguez-Ezpeleta et al. 2007). Previous studies that examined factors such as poor taxon sampling (Soltis et al. 2004; Philippe, Lartillot, and Brinkmann 2005), inappropriate choices of phylogenetic method (Phillips et al. 2004; Jeffroy et al. 2006), nucleotide or amino acid composition bias and deviation from compositional equilibrium (Phillips et al. 2004; Collins et al. 2005), and variation of evolutionary rates among or within sites (Dopazo H and Dopazo J 2005; Nishihara et al. 2007; Rodriguez-Ezpeleta et al. 2007), all found that systematic biases can lead to incorrect trees with strong support. Several approaches that can detect and remove systematic biases in genome-scale phylogenetic inference have been proposed, including modification of taxon sampling (Rodriguez-Ezpeleta et al. 2007), examination of model violations (Rodriguez-Ezpeleta et al. 2007), recoding of molecular sequences (Phillips et al. 2004; Rodriguez-Ezpeleta et al. 2007), removal of the fast-evolving sites (Nishihara et al. 2007; Rodriguez-Ezpeleta et al. 2007), and utilizing rare genomic changes (Delsuc et al. 2005). Among the approaches that have been developed to address the systematic biases in genome-scale analyses, examination of incongruence among individual genes is directly relevant to the design and interpretation of multigene analyses that are fundamental in molecular phylogenetics (Huelsenbeck et al. 1996; Taylor and Piel 2004; Jeffroy et al. 2006). Unfortunately, investigations of incongruence among gene trees at the genome-scale have been limited to a few selected groups such as gamma-Proteobacteria (Lerat et al. 2003), yeast (Taylor and Piel 2004; Gatesy and Baker 2005; Jeffroy et al. 2006), and Drosophila (Pollard et al. 2006) due to the limitation of data availability. 
In this study, we present the first genome-scale phylogenetic analysis in the phylum Apicomplexa. Because of the ancient origin of this phylum, estimated at approximately 700–900 Myr (Douzery et al. 2004), we perform our genome-scale phylogenetic inference at the protein level. The robust inference of the organismal phylogeny based on genomic data provides a solid foundation for comparative studies that improve our knowledge of apicomplexan evolution. In addition to facilitating the planning of future phylogenetic studies that involve other closely related pathogens, our systematic investigation of incongruence among gene trees can improve our understanding of multigene phylogenetic inference in general.

Materials and Methods

Data Sources and Ortholog Identification

Our data set contains seven apicomplexan species that have fully annotated genome sequences available, including Babesia bovis (Brayton et al. 2007) from GenBank (GenBank accession numbers AAXT01000001–AAXT01000013), Cryptosporidium parvum (Abrahamsen et al. 2004) from CryptoDB.org (Heiges et al. 2006), Eimeria tenella from GeneDB.org (Hertz-Fowler et al. 2004), Plasmodium falciparum (Gardner et al. 2002) and Plasmodium vivax from PlasmoDB.org (Bahl et al. 2003), Theileria annulata (Pain et al. 2005) from GeneDB.org (Hertz-Fowler et al. 2004), and Toxoplasma gondii from ToxoDB.org (Gajria et al. 2008). A free-living ciliate, Tetrahymena thermophila (Eisen et al. 2006), is included as the outgroup. For each species, we obtained all annotated proteins in the genome for ortholog identification. The data sources and protein-encoding gene counts are summarized in table 1.
Table 1

List of Species Name Abbreviations and Data Sources

Abbreviation | Species Name | Data Source (a) | Version Date | Number of Proteins (b) | Genome Size (Mb)
Bb | Babesia bovis | GenBank | 06 August 2007 | 3,703 | 8
Cp | Cryptosporidium parvum | CryptoDB.org | 13 November 2007 | 3,805 | 9
Et | Eimeria tenella | GeneDB.org | 01 January 2005 | 11,393 | 60
Pf | Plasmodium falciparum | PlasmoDB.org | 24 September 2007 | 5,460 | 23
Pv | Plasmodium vivax | PlasmoDB.org | 24 September 2007 | 5,352 | 27
Ta | Theileria annulata | GeneDB.org | 17 July 2005 | 3,795 | 8
Tg | Toxoplasma gondii | ToxoDB.org | 01 November 2007 | 7,793 | 63
Tt (c) | Tetrahymena thermophila | J. Craig Venter Institute | 04 October 2006 | 27,424 | 104

(a) The annotated protein sequences were downloaded from the respective data source with the version date as indicated.

(b) All annotated protein sequences from each species are used to identify single-copy genes that are shared by all species.

(c) The free-living ciliate, T. thermophila, is included as the outgroup.

Orthologous genes were identified using OrthoMCL (Li et al. 2003) (version 1.3) with BLASTP (Altschul et al. 1990) and E value cutoff set to 1 × 10^−30. The ortholog identification process in OrthoMCL is largely based on the popular criterion of reciprocal best hits but also involves an additional step of Markov Clustering (van Dongen 2000) to improve sensitivity and specificity. A benchmarking study has found that this algorithm performed well among available methods for ortholog identification (Hulsen et al. 2006). We selected the orthologous genes that are shared by all eight species to infer the gene tree. Orthologous gene clusters that contain more than one gene from any given species were removed to avoid the complications introduced by paralogous genes in phylogenetic inference.
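
The selection of single-copy genes shared by all eight species can be illustrated with a minimal sketch. It is not the OrthoMCL implementation: the input layout (a dictionary of cluster IDs mapped to members written as "species|gene"), the function name, and the gene labels are assumptions made only for this example. The species abbreviations follow table 1.

```python
from typing import Dict, List

# Species abbreviations as listed in table 1.
SPECIES = ["Bb", "Cp", "Et", "Pf", "Pv", "Ta", "Tg", "Tt"]


def single_copy_shared(clusters: Dict[str, List[str]],
                       species: List[str] = SPECIES) -> List[str]:
    """Return IDs of clusters holding exactly one gene from every species.

    Each member is assumed to be written as "<species>|<gene id>"
    (a hypothetical format chosen for this sketch).
    """
    kept = []
    for cluster_id, members in clusters.items():
        counts = {sp: 0 for sp in species}
        for member in members:
            sp = member.split("|", 1)[0]
            if sp in counts:
                counts[sp] += 1
        # Drop clusters that miss a species or contain a paralog in any species.
        if all(n == 1 for n in counts.values()):
            kept.append(cluster_id)
    return kept
```

A cluster that lacks one species, or that carries two genes from the same species, is dropped, which mirrors the removal of clusters with potential paralogs described above.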

Phylogenetic Inference

The program ClustalW (Thompson et al. 1994) (version 1.83) was used for multiple sequence alignment. The “tossgaps” option was enabled to ignore gaps when constructing the guide tree, and all other parameters were set to the default values unless specifically stated otherwise. The alignments produced by ClustalW were filtered by GBLOCKS (Castresana 2000) (version 0.91b) using default settings to remove regions that contain gaps or are highly divergent. The resulting amino acid alignment for each gene (provided in supplementary data file 1, Supplementary Material online) was used in the main phylogenetic analysis as described below; a codon-based nucleotide alignment for each gene was generated by PAL2NAL (Suyama et al. 2006) and is provided in supplementary data file 2 (Supplementary Material online). Three phylogenetic methods, including maximum likelihood (ML), maximum parsimony (MP), and Neighbor-Joining (NJ), were used to infer the gene tree for each individual gene. ML inferences were performed using PHYML (Guindon and Gascuel 2003). The proportion of invariant sites and the gamma-distribution parameter with eight substitution categories were estimated from the data set. The substitution model was set to JTT (Jones et al. 1992), and we enabled the optimization options for tree topology, branch lengths, and rate parameters. MP trees were constructed using PROTPARS in the PHYLIP package (Felsenstein 1989) (version 3.65) with 100 randomizations of input order. When more than one equally parsimonious tree was found for a given gene, the strict consensus tree of all equally parsimonious trees was used as the MP tree of this gene. NJ trees were constructed using NEIGHBOR in the PHYLIP package with species input order randomization enabled. The distance matrices were calculated by Tree-Puzzle (Schmidt et al. 2002) (version 5.2).
The parameters used in Tree-Puzzle were set to the JTT substitution model, the mixed model of rate heterogeneity with one invariant and eight gamma rate categories, and the exact and slow parameter estimation. The level of bootstrap support for each gene was inferred by 100 resamplings of the alignment using SEQBOOT in the PHYLIP package followed by ML inference. To investigate the sensitivity of a gene to the multiple sequence alignment parameter, we varied the gap opening penalty by 2-fold in both directions (i.e., increased the default cost from 10 to 20 or decreased it to 5) and inferred the gene tree under each setting. Individual genes are classified into three categories including robust, intermediate, and sensitive based on the ML gene-tree topologies from the three gap opening penalties examined. A gene is classified as robust if all three settings generated the same topology, intermediate if two out of the three settings generated the same topology, or sensitive if each setting generated a different topology. To investigate the effect of the substitution model used on the resulting gene-tree topology, we performed ML inference for each gene using two additional substitution models, including LG (Le and Gascuel 2008) and WAG (Whelan and Goldman 2001). The resulting gene trees are compared with the topology obtained using the JTT model (Jones et al. 1992).
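
The robust, intermediate, and sensitive classes defined above depend only on how many distinct topologies the three gap opening penalties produce. The following sketch shows that rule under the assumption that each topology has already been reduced to a canonical, hashable form (for example, a sorted Newick string without branch lengths); the function name is hypothetical.

```python
def classify_alignment_sensitivity(topologies):
    """Classify a gene from its three ML topologies, one per gap opening penalty.

    `topologies` must hold three canonical, hashable topology labels
    (an assumption of this sketch; the labels themselves are arbitrary).
    """
    if len(topologies) != 3:
        raise ValueError("expected exactly three topologies")
    distinct = len(set(topologies))
    # 1 distinct topology: robust; 2: intermediate; 3: sensitive.
    return {1: "robust", 2: "intermediate", 3: "sensitive"}[distinct]
```

For instance, a gene that yields the same topology at penalties 5 and 10 but a different one at 20 would fall into the intermediate class.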

Inference of the Species Tree

The species tree was inferred using two different approaches. The first approach was based on the consensus of individual gene trees. The consensus tree was inferred by the CONSENSE program in the PHYLIP package using extended majority rule. Gene trees inferred by different phylogenetic methods (i.e., ML, MP, and NJ) were analyzed separately. The second approach was based on the concatenated alignment of all individual genes following the phylogenetic inference procedures as described above.
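
To make the consensus approach concrete, the sketch below implements a simplified form of extended majority rule. It is not the CONSENSE program: it assumes that every gene tree has been rooted with the outgroup and reduced to its set of nontrivial clades (frozensets of taxon labels), and the function name is hypothetical. Clades are considered from most to least frequent and accepted when they are compatible (nested or disjoint) with every clade already accepted.

```python
from collections import Counter


def extended_majority_rule(gene_trees):
    """Greedy consensus over rooted gene trees.

    gene_trees: list of sets of clades, each clade a frozenset of taxon labels.
    Returns a list of (clade, percent of gene trees containing it).
    """
    counts = Counter(clade for tree in gene_trees for clade in tree)
    accepted = []
    # Most frequent clades first; ties are broken by taxon labels for determinism.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], sorted(kv[0])))
    for clade, n in ordered:
        compatible = all(
            clade.isdisjoint(other) or clade <= other or other <= clade
            for other, _ in accepted
        )
        if compatible:
            accepted.append((clade, 100.0 * n / len(gene_trees)))
    return accepted
```

Because any two clades found in more than half of the trees must co-occur in at least one tree, they are always mutually compatible, so the greedy pass accepts all majority clades before it considers the less frequent ones. The percentages returned correspond to the consensus support values reported for each branch.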

Characterization of Gene Trees

The topology distance between each gene tree and the species tree was calculated based on the symmetric difference (Robinson and Foulds 1981) as implemented in TREEDIST in the PHYLIP package. For genes that inferred a topology that is different from the species tree, we performed the approximately unbiased (AU) test (Shimodaira 2002) and the Shimodaira–Hasegawa (SH) test (Shimodaira and Hasegawa 1999) using the CONSEL package (Shimodaira and Hasegawa 2001) to test if the species tree topology is significantly rejected by a gene.
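
The symmetric difference counts the bipartitions that are present in one tree but not in the other. The sketch below is an illustration, not the TREEDIST implementation; it assumes that trees are written as nested Python tuples with the abbreviations from table 1 as leaves, and the function names are hypothetical.

```python
def clades(tree):
    """Return (leaf set, set of leaf sets below each child) for a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found |= child_found
        found.add(child_leaves)
    return leaves, found


def bipartitions(tree):
    """Nontrivial splits of the unrooted tree, stored as the side lacking a reference taxon."""
    taxa, found = clades(tree)
    ref = min(taxa)
    splits = set()
    for clade in found:
        side = taxa - clade if ref in clade else clade
        # Keep only splits with at least two taxa on each side.
        if 1 < len(side) < len(taxa) - 1:
            splits.add(side)
    return splits


def symmetric_difference(tree_a, tree_b):
    """Number of bipartitions found in exactly one of the two trees."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Species tree topology as described in the abstract.
species_tree = ("Tt", ("Cp", (("Et", "Tg"), (("Bb", "Ta"), ("Pf", "Pv")))))
# A hypothetical alternative that places Cp as sister to the coccidia.
alternative = ("Tt", (("Cp", ("Et", "Tg")), (("Bb", "Ta"), ("Pf", "Pv"))))
print(symmetric_difference(species_tree, alternative))
```

Storing each split as the side that lacks a fixed reference taxon makes the comparison independent of how the trees are rooted or how their children are ordered.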

Taxon Removal Tests

To evaluate the potential influence of long-branch attraction (LBA), we removed either of the two taxa that have a long terminal branch (i.e., the outgroup T. thermophila and the ingroup C. parvum) and repeated the phylogenetic inference for each gene. Our procedure is conceptually similar to the taxon jackknife method (Siddall 1995) but contains one important distinction. The traditional taxon jackknife method removes a taxon after multiple sequence alignment and prior to tree reconstruction. However, the taxon being removed still affects the alignment and thus can influence the resulting tree. We chose to perform the taxon removal prior to multiple sequence alignment to eliminate any effect on the phylogenetic inference from the taxon being removed.

Results and Discussion

Ortholog Identification

From the seven apicomplexans and the one ciliate examined, we identified 268 single-copy genes that are shared by all eight species. These genes represent less than 10% of the annotated genes from the smallest genome (table 1), indicating that these organisms are highly divergent in their gene content. The long evolutionary distance between ciliates and apicomplexans only partially explains this observation. When the outgroup is not considered, the seven apicomplexans share 508 orthologous genes (of which 433 are single copy in all species). One of our previous studies that examined a different set of apicomplexan species produced similar results and suggested that 28–45% of the genes in an apicomplexan genome are genus-specific (Kuo and Kissinger 2008). This high level of divergence in gene content is consistent with the ancient origin of the phylum. The divergence time between apicomplexans and ciliates was estimated to be in the range of 700–900 Myr based on 129 genes from 36 eukaryotes (Douzery et al. 2004). For the purpose of phylogenetic analysis, we focus on the 268 single-copy genes shared by all eight species. Many of these genes are responsible for basic cellular processes (e.g., DNA replication, transcription, translation, etc.), as noted in our previous study (Kuo and Kissinger 2008). The sequence identity and annotation information of these genes are provided in supplementary table S1 (Supplementary Material online).

The Apicomplexan Species Tree

The species tree was inferred using two different approaches. The first approach calculated the consensus tree among the 268 individual gene trees, and the second approach utilized a concatenated alignment of 71,830 amino acid sites. Both approaches resulted in the same species tree topology (fig. 1) by all three phylogenetic methods used. Groupings of three species pairs, including P. falciparum and P. vivax, B. bovis and T. annulata, and E. tenella and T. gondii, are supported by 87% or more of the genes based on ML consensus. In contrast, the two short internal branches are supported by less than 50% of the genes. Nevertheless, all internal branches received 100% ML bootstrap support based on the analysis of the concatenated alignment.
Fig. 1. The inferred apicomplexan species tree. The ML tree is generated from the concatenated alignment of 268 single-copy genes (71,830 aligned amino acid sites). One free-living ciliate, Tetrahymena thermophila, is included as the outgroup to root the tree. Bootstrap support based on 100 replicates is 100% for all internal branches. Labels above branches indicate the level of consensus support (%) based on ML, MP, and NJ.

This tree topology is consistent with most of our prior understanding of apicomplexan evolution based on morphology and development (Perkins et al. 2000), rDNA analyses (Escalante and Ayala 1995; Morrison and Ellis 1997), and multigene phylogenies (Douzery et al. 2004; Philippe et al. 2004; Kuo and Kissinger 2008). The piroplasmids (represented by B. bovis and T. annulata) form a sister group to the haemosporidians (represented by the Plasmodium lineage) with the cyst-forming coccidia (represented by E. tenella and T. gondii) as the next closely related group. Although the Cryptosporidium lineage was classified as a coccidian in early taxonomy work (Levine 1984), our result provides further support to the growing consensus that this lineage is basal to other apicomplexans and separate from other coccidia (Carreno et al. 1999; Zhu et al. 2000; Leander et al. 2003).

The Distribution of Gene Trees

Examination of individual genes revealed a seemingly high degree of incongruence among gene trees. Of the 268 gene trees examined, we observed a total of 48 topologies based on ML analysis (fig. 2). The most frequently observed topology (fig. 3) is consistent with the putative species tree and is supported by 19% of the genes. Each of the next three frequent topologies (fig. 3) is supported by approximately 7–10% of the genes and is different in the placement of C. parvum. Two additional topologies (fig. 3) are supported by 6% of the genes and exhibit alternative placements of the Plasmodium lineage. The observation that only a relatively small number of topologies are found may be attributed to our limited taxon sampling of eight species. For example, in an analysis of 106 genes from 14 yeast species, Jeffroy et al. (2006) found that each of the genes analyzed supports a distinct topology.
Fig. 2. Frequency distribution of gene-tree topologies. Based on the 268 single-copy genes examined, we observed a total of 48 gene-tree topologies. The six most frequently observed gene-tree topologies, each supported by more than 5% of the genes, are provided in figure 3.

Fig. 3. The six most frequently observed gene-tree topologies. Each topology is supported by more than 5% of the 268 genes examined. The exact count and frequency of genes that support (or significantly reject) each topology are provided under the tree. ML: frequency of genes that infer the specific topology using ML inference; AU: frequency of genes that significantly reject the topology using AU test; SH: frequency of genes that significantly reject the topology using SH test.

Despite the seemingly high level of incongruence among gene trees, only 16 genes significantly reject the putative species tree topology in the AU test (Shimodaira 2002). When using the more conservative SH test (Shimodaira and Hasegawa 1999), only two genes significantly reject the putative species tree. The first gene is annotated as a hypothetical protein in P. falciparum (gene ID: PF14_0326) and exhibits a high level of length variation among the species examined (i.e., varied from 2,452 amino acids in E. tenella to 8,094 amino acids in P. falciparum). The conserved regions that can be reliably aligned only account for 3% of the alignment. The second gene is annotated as a putative RNA-binding protein in P. falciparum (gene ID: PF08_0086) and also exhibits a high level of length variation (i.e., varied from 271 amino acids in B. bovis to 1,076 amino acids in P. vivax). The protein alignment obtained after GBLOCKS filtering only contains 29 sites. Based on the pattern of sequence length variation, we suspect that the gene annotations may be problematic in some of the species. For this reason, further analysis of these two genes was not pursued.
The finding of a high level of topological incongruence among gene trees that lack statistical significance has been reported in previous genome-scale phylogenetic studies. Lerat et al. (2003) examined 205 single-copy genes shared by 13 gamma-Proteobacteria species and found only two significantly rejected the putative species tree in the SH test. In both cases, the discordance between the gene tree and the putative species tree can be explained by a single lateral gene transfer (LGT) event. Similarly, examinations of the 106 single-copy genes shared by a group of Saccharomyces spp. showed that the majority of bipartition conflicts among genes have low bootstrap support (Taylor and Piel 2004; Jeffroy et al. 2006). One possible hypothesis to explain the rare occurrences of a gene significantly rejecting the species tree is that single-copy genes are unlikely to be involved in LGT events (Daubin et al. 2002, 2003). Under this hypothesis, these genes have been confined in the organismal phylogeny throughout their evolutionary history, so the gene-tree topology is unlikely to be radically different from the species tree. By focusing on a small subset of genes that are highly conserved across all apicomplexan lineages examined, our methodology for orthologous gene selection may have effectively excluded genes that experienced LGT since the ciliate–apicomplexan divergence. Although LGT does not appear to influence our phylogenetic inference as presented here, caution should be taken in future studies because several previous studies suggest that LGT is an important evolutionary force in apicomplexans (Huang, Mullapudi, Lancto, et al. 2004; Huang, Mullapudi, Sicheritz-Ponten, and Kissinger 2004; Striepen et al. 2004; Nagamune and Sibley 2006) and other protists (Gogarten 2003; Richards et al. 2003; Andersson 2005).

Evaluation of Phylogenetic Signal by Bootstrap Support

To test if the observed topological incongruence among gene trees can be explained by a low resolving power for certain clades in some genes, we used the minimum bootstrap value observed in a gene tree to identify genes that possess strong phylogenetic signals. The results indicate that the percentage of genes that support the putative species tree increases as a function of the bootstrap cutoff used (table 2). In the most extreme example, when only the genes with a minimum bootstrap value of 90% at any node are examined, all five genes that meet this cutoff support the putative species tree topology. Even when the selection stringency is relaxed to a 70% bootstrap support, a cutoff that is commonly used in phylogenetic inference (Hillis and Bull 1993), 47% of these genes are consistent with the putative species tree and the two short internal branches received at least 60% of the consensus support. Curiously, we did not find any significant correlation between bootstrap support and alignment length, average pairwise protein distance, or other attributes of genes (supplementary table S1, Supplementary Material online).
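
The filtering step described above keeps a gene only if the lowest bootstrap value on its tree meets the cutoff, and then reports how many of the retained genes agree with the species tree. The sketch below illustrates this logic; the record layout (keys "supports" and "matches_species_tree") and the function name are assumptions made for this example, and the cutoffs are those listed in table 2.

```python
def summarize_by_cutoff(genes, cutoffs=(0, 50, 60, 70, 80, 90)):
    """Summarize agreement with the species tree at each minimum bootstrap cutoff.

    genes: list of dicts with
        "supports": bootstrap percentages for the internal nodes of the gene tree
        "matches_species_tree": True if the ML topology equals the species tree
    Returns {cutoff: (number of genes kept, percent matching or None if none kept)}.
    """
    summary = {}
    for cutoff in cutoffs:
        # A gene is kept only if its weakest node meets the cutoff.
        kept = [g for g in genes if min(g["supports"]) >= cutoff]
        n = len(kept)
        matching = sum(g["matches_species_tree"] for g in kept)
        summary[cutoff] = (n, 100.0 * matching / n if n else None)
    return summary
```

Using the minimum rather than the mean bootstrap value means that a single poorly resolved node is enough to exclude a gene, which is the stringent criterion applied in table 2.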
Table 2

Effects of Removing Genes Based on the Minimum Bootstrap Support

Minimum Bootstrap Cutoff (%) (a) | Number of Genes | Number of Topologies (b) | Percentage of Genes that Inferred the Species Tree | Clade Support Based on ML Consensus (%): ((Pf, Pv), (Bb, Ta)) | Clade Support Based on ML Consensus (%): (((Pf, Pv), (Bb, Ta)), (Et, Tg))
0 | 268 | 48 | 19 | 44 | 38
50 | 130 | 25 | 25 | 50 | 40
60 | 69 | 15 | 29 | 55 | 49
70 | 30 | 10 | 47 | 63 | 60
80 | 15 | 5 | 73 | 73 | 80
90 | 5 | 1 | 100 | 100 | 100

(a) The bootstrap support for each gene is inferred by the ML method based on 100 replicates. A gene is removed from the analysis if the minimum bootstrap support observed on the gene tree does not meet the cutoff.

(b) Number of observed gene-tree topologies based on ML.

In addition to being consistent with the putative species tree, genes with strong bootstrap support are often insensitive to changes in alignment parameter (table 3), substitution model (table 4), or the phylogenetic method used (table 5). In these tests, we are interested in investigating if a gene could infer the same gene-tree topology across a range of settings used in the phylogenetic inference process; the agreement between the gene-tree topology and the putative species tree is not considered. At 70% minimum bootstrap cutoff, we found that 90% of these genes are robust to a 4-fold change in the gap opening penalty (table 3), 93% of the genes are insensitive to the choice of substitution model (table 4), and 57% of the genes behave consistently across different phylogenetic methods (table 5). Although the use of methodological concordance as a criterion for selecting genes for phylogenetic inference was criticized (Grant and Kluge 2003), our results suggest that a gene is more likely to behave consistently across different phylogenetic methods when it contains a strong phylogenetic signal.
Table 3

Robustness to Alignment Settings as a Function of the Minimum Bootstrap Support

Percentage of Genes in Each Class (a):

Minimum Bootstrap Cutoff (%) | Robust (b) | Intermediate (c) | Sensitive (d)
0 | 60 | 27 | 12
50 | 77 | 18 | 5
60 | 83 | 16 | 1
70 | 90 | 10 | 0
80 | 93 | 7 | 0
90 | 100 | 0 | 0

(a) Genes are categorized into three classes based on the sensitivity to sequence alignment settings.

(b) A gene is classified as robust if it produces the same gene-tree topology under all three alignment settings (for details, see Materials and Methods).

(c) A gene is classified as intermediate if it produces the same gene-tree topology under two out of the three alignment settings.

(d) A gene is classified as sensitive if each alignment setting leads to a different gene-tree topology.

Table 4

Robustness to Substitution Model as a Function of the Minimum Bootstrap Support

Percentage of Genes in Each Class (a):

Minimum Bootstrap Cutoff (%) | JTT = LG = WAG | JTT = LG | JTT = WAG | LG = WAG | All Different
0 | 67 | 6 | 10 | 10 | 7
50 | 79 | 5 | 7 | 6 | 3
60 | 84 | 6 | 4 | 4 | 1
70 | 93 | 0 | 0 | 7 | 0
80 | 93 | 0 | 0 | 7 | 0
90 | 80 | 0 | 0 | 20 | 0

(a) Genes are categorized into five classes based on the agreements among the three substitution models used in ML inference. Note that this classification only concerns the consistency of gene-tree topologies inferred by different substitution models for each individual gene. The agreement between a gene tree and the species tree is not considered.

Table 5

Methodological Concordance as a Function of the Minimum Bootstrap Support

Percentage of Genes in Each Class (a):

Minimum Bootstrap Cutoff (%) | ML = MP = NJ | ML = MP | ML = NJ | MP = NJ | All Different
0 | 22 | 12 | 25 | 8 | 34
50 | 32 | 14 | 34 | 5 | 15
60 | 43 | 7 | 38 | 4 | 7
70 | 57 | 7 | 33 | 3 | 0
80 | 60 | 7 | 33 | 0 | 0
90 | 100 | 0 | 0 | 0 | 0

(a) Genes are categorized into five classes based on the agreements among the three phylogenetic methods used. Note that this classification only concerns the consistency of gene-tree topologies inferred by different phylogenetic methods for each individual gene. The agreement between a gene tree and the species tree is not considered. Because we used the strict consensus method to consolidate all equally parsimonious trees of a gene, a multifurcating MP tree always has a nonzero topology distance from a fully bifurcating ML or NJ tree.

Removal of the Long Branches

In addition to the low signal-to-noise ratio in some genes, another possible source of incongruence among gene trees is the LBA problem that resulted from our nonideal taxon sampling. Several observations support this hypothesis. First, when a gene behaved inconsistently across different phylogenetic methods, ML and NJ often result in an identical gene-tree topology that is different from MP (table 5). In addition, the outgroup T. thermophila and the ingroup C. parvum both have a long evolutionary distance to the other taxa (fig. 1). The lack of additional species that can be used to break up the long branch leading to the Cryptosporidium lineage may be responsible for its unstable phylogenetic placement, as evidenced by the fact that three of the most frequently observed gene-tree topologies involve alternative placement of C. parvum (fig. 3). Although the genome sequence of C. hominis is available, adding this species is not particularly helpful. The genomes of these two Cryptosporidium spp. exhibit only 3–5% divergence at the nucleotide level (Xu et al. 2004). For the 268 conserved proteins that we used for phylogenetic inference, the sequences from these two species are essentially identical (data not shown). The issue of nonideal taxon sampling reflects a limitation that is often faced by genome-scale phylogenetic inferences (Soltis et al. 2004). To circumvent this limitation, we utilized two other commonly suggested approaches to address the LBA problem (Bergsten 2005). First, all sites that contain gaps or are highly divergent were removed from the alignment prior to phylogenetic inference by GBLOCKS (see Materials and Methods). Second, we removed either the outgroup T. thermophila or the ingroup C. parvum prior to sequence alignment and repeated the phylogenetic inference. When the outgroup is removed from the data set, we observed a large increase in the consensus support for the Plasmodium–Babesia–Theileria clade (table 6).
Two alternative bipartitions, as shown in panels E and F of figure 3, received substantially weaker consensus supports regardless of the minimum bootstrap cutoff used. Removal of the ingroup C. parvum resulted in a reduction of the number of observed gene-tree topologies (table 6), but the consensus support for the PlasmodiumBabesiaTheileria clade is relatively low compared with the removal of T. thermophila.
Table 6

Effects of Taxon Removal

Removal of the outgroup Tetrahymena thermophila (Tt)

The last three columns give the consensus support based on ML (%) for each bipartition.

| Minimum bootstrap cutoff (%) | Number of genes | Number of topologies | ((Pf, Pv, Bb, Ta), (Et, Tg, Cp)) | ((Pf, Pv, Et, Tg), (Bb, Ta, Cp)) | ((Bb, Ta, Et, Tg), (Pf, Pv, Cp)) |
|---|---|---|---|---|---|
| 0 | 268 | 16 | 57 | 20 | 22 |
| 50 | 215 | 10 | 62 | 17 | 20 |
| 60 | 169 | 8 | 64 | 16 | 20 |
| 70 | 124 | 7 | 48 | 15 | 17 |
| 80 | 81 | 4 | 70 | 14 | 15 |
| 90 | 42 | 3 | 71 | 10 | 19 |
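The consensus-support columns in table 6 can be read as the percentage of retained gene trees that contain a given bipartition. The following Python sketch illustrates that calculation under stated assumptions: a gene is retained when its minimum bootstrap value is at or above the cutoff, each gene tree is stored as a set of bipartitions, and every bipartition is written as the side that excludes a fixed reference taxon. The names `canonical_split` and `consensus_support` and the toy gene trees are hypothetical and do not reproduce the study's data.

```python
TAXA = frozenset({"Pf", "Pv", "Bb", "Ta", "Et", "Tg", "Cp"})  # outgroup Tt removed


def canonical_split(side, taxa=TAXA, reference="Cp"):
    """Represent a bipartition by the side that does not contain the reference taxon."""
    side = frozenset(side)
    return taxa - side if reference in side else side


def consensus_support(gene_trees, bipartition, min_bootstrap_cutoff=0):
    """Return (number of retained genes, % of retained gene trees with the bipartition)."""
    retained = [t for t in gene_trees if t["min_bootstrap"] >= min_bootstrap_cutoff]
    if not retained:
        return 0, 0.0
    hits = sum(1 for t in retained if bipartition in t["splits"])
    return len(retained), 100.0 * hits / len(retained)


# The three bipartitions compared in table 6, in canonical form.
PB = canonical_split({"Pf", "Pv", "Bb", "Ta"})  # ((Pf, Pv, Bb, Ta), (Et, Tg, Cp))
PE = canonical_split({"Pf", "Pv", "Et", "Tg"})  # ((Pf, Pv, Et, Tg), (Bb, Ta, Cp))
BE = canonical_split({"Bb", "Ta", "Et", "Tg"})  # ((Bb, Ta, Et, Tg), (Pf, Pv, Cp))

# Toy gene trees (made up for illustration).
toy_trees = [
    {"min_bootstrap": 95, "splits": {PB}},
    {"min_bootstrap": 72, "splits": {PB}},
    {"min_bootstrap": 40, "splits": {PE}},
    {"min_bootstrap": 88, "splits": {BE}},
]

for cutoff in (0, 70, 90):
    n_genes, support = consensus_support(toy_trees, PB, cutoff)
    print(cutoff, n_genes, round(support, 1))
```

Writing each bipartition from the side that excludes the reference taxon means that the two equivalent ways of describing the same split, for example (Pf, Pv, Bb, Ta) versus (Et, Tg, Cp), compare as equal.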

Conclusion

The recent availability of genome sequences allowed us to infer an organismal phylogeny that includes several important apicomplexan pathogens with high confidence. This robust species tree provides a solid foundation for future comparative studies that can improve our understanding of apicomplexan evolution and parasite biology. Although the level of incongruence among gene trees appears to be high at first glance, further investigation indicates that most of the observed conflict does not have strong statistical support. Interestingly, the minimum bootstrap support observed in a gene tree appears to be a useful predictor of phylogenetic performance. Genes that produce strong bootstrap support for all internal branches are more likely to be consistent with the species tree and robust to changes in the alignment parameter or the phylogenetic method used. Nevertheless, examination of multiple unlinked genes with strong phylogenetic signals is important for accurate phylogenetic inference because any single gene can have a different evolutionary history from the organismal phylogeny. Our systematic investigation provides a list of phylogenetically informative genes in the phylum Apicomplexa. These genes are good candidates for future sequencing efforts that aim at improving taxon sampling in this group of important pathogens.
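As a small illustration of how the minimum bootstrap support of a gene tree might be obtained in practice, the sketch below pulls the support values out of a Newick string and returns the smallest one. It assumes that bootstrap values are stored as internal node labels directly after each closing parenthesis, which is a common but not universal convention among tree-building programs; the function name `min_bootstrap` and the example tree are hypothetical.

```python
import re

# A support value is assumed to follow a closing parenthesis directly, e.g. ")98:0.1".
_SUPPORT = re.compile(r"\)(\d+(?:\.\d+)?)")


def min_bootstrap(newick):
    """Return the lowest internal-node support value found in a Newick string."""
    values = [float(v) for v in _SUPPORT.findall(newick)]
    if not values:
        raise ValueError("no internal support labels found")
    return min(values)


# Hypothetical gene tree with made-up branch lengths and support values.
example = "((Pf:0.1,Pv:0.1)100:0.2,(Bb:0.3,Ta:0.2)98:0.1,(Et:0.2,Tg:0.2)64:0.3);"
print(min_bootstrap(example))  # 64.0
```

A gene would then pass a given minimum bootstrap cutoff only if this value is at or above the cutoff. For real data, a dedicated tree-parsing library is likely to be more reliable than a regular expression.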

Supplementary Material

Supplementary data files 1 and 2 and table S1 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
  63 in total

1.  Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis.

Authors:  J Castresana
Journal:  Mol Biol Evol       Date:  2000-04       Impact factor: 16.240

2.  An approximately unbiased test of phylogenetic tree selection.

Authors:  Hidetoshi Shimodaira
Journal:  Syst Biol       Date:  2002-06       Impact factor: 15.683

3.  A phylogenomic approach to bacterial phylogeny: evidence of a core of genes sharing a common history.

Authors:  Vincent Daubin; Manolo Gouy; Guy Perrière
Journal:  Genome Res       Date:  2002-07       Impact factor: 9.043

4.  A first glimpse into the pattern and scale of gene transfer in Apicomplexa.

Authors:  Jinling Huang; Nandita Mullapudi; Thomas Sicheritz-Ponten; Jessica C Kissinger
Journal:  Int J Parasitol       Date:  2004-03-09       Impact factor: 3.981

5.  Multigene analyses of bilaterian animals corroborate the monophyly of Ecdysozoa, Lophotrochozoa, and Protostomia.

Authors:  Hervé Philippe; Nicolas Lartillot; Henner Brinkmann
Journal:  Mol Biol Evol       Date:  2005-02-09       Impact factor: 16.240

6.  Combining data in phylogenetic analysis.

Authors:  J P Huelsenbeck; J J Bull; C W Cunningham
Journal:  Trends Ecol Evol       Date:  1996-04       Impact factor: 17.712

7.  Genome of the host-cell transforming parasite Theileria annulata compared with T. parva.

Authors:  Arnab Pain; Hubert Renauld; Matthew Berriman; Lee Murphy; Corin A Yeats; William Weir; Arnaud Kerhornou; Martin Aslett; Richard Bishop; Christiane Bouchier; Madeleine Cochet; Richard M R Coulson; Ann Cronin; Etienne P de Villiers; Audrey Fraser; Nigel Fosker; Malcolm Gardner; Arlette Goble; Sam Griffiths-Jones; David E Harris; Frank Katzer; Natasha Larke; Angela Lord; Pascal Maser; Sue McKellar; Paul Mooney; Fraser Morton; Vishvanath Nene; Susan O'Neil; Claire Price; Michael A Quail; Ester Rabbinowitsch; Neil D Rawlings; Simon Rutter; David Saunders; Kathy Seeger; Trushar Shah; Robert Squares; Steven Squares; Adrian Tivey; Alan R Walker; John Woodward; Dirk A E Dobbelaere; Gordon Langsley; Marie-Adele Rajandream; Declan McKeever; Brian Shiels; Andrew Tait; Bart Barrell; Neil Hall
Journal:  Science       Date:  2005-07-01       Impact factor: 47.728

8.  The genome of Cryptosporidium hominis.

Authors:  Ping Xu; Giovanni Widmer; Yingping Wang; Luiz S Ozaki; Joao M Alves; Myrna G Serrano; Daniela Puiu; Patricio Manque; Donna Akiyoshi; Aaron J Mackey; William R Pearson; Paul H Dear; Alan T Bankier; Darrell L Peterson; Mitchell S Abrahamsen; Vivek Kapur; Saul Tzipori; Gregory A Buck
Journal:  Nature       Date:  2004-10-28       Impact factor: 49.962

9.  Genome-scale evidence of the nematode-arthropod clade.

Authors:  Hernán Dopazo; Joaquín Dopazo
Journal:  Genome Biol       Date:  2005-04-28       Impact factor: 13.583

10.  Macronuclear genome sequence of the ciliate Tetrahymena thermophila, a model eukaryote.

Authors:  Jonathan A Eisen; Robert S Coyne; Martin Wu; Dongying Wu; Mathangi Thiagarajan; Jennifer R Wortman; Jonathan H Badger; Qinghu Ren; Paolo Amedeo; Kristie M Jones; Luke J Tallon; Arthur L Delcher; Steven L Salzberg; Joana C Silva; Brian J Haas; William H Majoros; Maryam Farzad; Jane M Carlton; Roger K Smith; Jyoti Garg; Ronald E Pearlman; Kathleen M Karrer; Lei Sun; Gerard Manning; Nels C Elde; Aaron P Turkewitz; David J Asai; David E Wilkes; Yufeng Wang; Hong Cai; Kathleen Collins; B Andrew Stewart; Suzanne R Lee; Katarzyna Wilamowska; Zasha Weinberg; Walter L Ruzzo; Dorota Wloga; Jacek Gaertig; Joseph Frankel; Che-Chia Tsao; Martin A Gorovsky; Patrick J Keeling; Ross F Waller; Nicola J Patron; J Michael Cherry; Nicholas A Stover; Cynthia J Krieger; Christina del Toro; Hilary F Ryder; Sondra C Williamson; Rebecca A Barbeau; Eileen P Hamilton; Eduardo Orias
Journal:  PLoS Biol       Date:  2006-09       Impact factor: 8.029

