Literature DB >> 21863127

Phylogenomics-based reconstruction of protozoan species tree.

Abstract

We have developed a semi-automatic methodology to reconstruct the phylogenetic species tree in Protozoa, integrating different phylogenetic algorithms and programs, and demonstrating the utility of a supermatrix approach to construct phylogenomics-based trees using 31 universal orthologs (UO). The species tree obtained was formed by three major clades that were related to three groups of data: i) Species containing at least 80% of UO (25/31) in the concatenated multiple alignment or supermatrix, this clade was called C1, ii) Species containing between 50%-79% (15-24/31) of UO called C2, and iii) Species containing less than 50% (1-14/31) of UO called C3. C1 was composed by only protozoan species, C2 was composed by species related to Protozoa, and C3 was composed by some species of C1 (Protozoa) and C2 (related to Protozoa). Our phylogenomics-based methodology using a supermatrix approach proved to be reliable with protozoan genome data and using at least 25 UO, suggesting that (a) the more UO used the better, (b) using the entire UO sequence or just a conserved block of it for the supermatrix produced similar phylogenomic trees.

Entities: Chemical Disease Gene Species

Keywords: phylogenomics; protozoan parasites; universal orthologs

Year: 2011 PMID： 21863127 PMCID： PMC3153122 DOI： 10.4137/EBO.S6861

Source DB: PubMed Journal: Evol Bioinform Online ISSN： 1176-9343 Impact factor: 1.625

Introduction

Phylogenomics is being intensively used in the “Post Genomic Era”. Many different phylogenetic trees have been published on the basis of different models of sequence evolution,1 applying different parameter settings and algorithms. Although the rDNA,2–4 COI2,5,6 and other single genes have been extremely valuable for phylogenetic studies, single-gene phylogeny has its limitations.7 Therefore, phylogenomics approaches, corroborated by the use of more representative phylogenetic markers, will allow, in principle, a more reliable and representative inference into the tree of life. Nowadays, one has the option to concatenate multiple gene sequences to construct trees on the genomic level, such as “genome trees” or also called “ supermatrix trees”, possessing more phylogenetic signals making them less susceptible to the stochastic errors than those built from a single gene. The other option is the construction of the “ supertree” that involves the concatenation of a set of trees.8–10 However, there are fundamental differences between the ways in which phylogenomic approaches integrate the phylogenetic information. Dutilh et al in 200711 systematically compared alternative methodologies such as gene content, superalignment, superdistance (to construct a supermatrix) and supertree approaches using various algorithms and tree-building methods on the Fungi, the eukaryotic clade with the largest number of sequenced genomes. The phylogenomic trees reproduced many of the clades in accordance with the current taxonomic views. Superalignment (supermatrix) and supertrees reproduced better target fungal phylogeny but they were not a guarantee for a successful phylogenomic tree.11 Phylogenomics involving the use of entire genomes to infer a species tree has become the de facto standard for reconstructing reliable species phylogenies.8,12 While some criticism has been made on the recent superalignment tree8 as being a ‘tree of one percent’ of the genome,13 single-gene phylogenetic trees have shown conflict14 due to a variety of causes. Yet, phylogenomic trees have held the promise of minimizing anomalies by the sheer power of genome-scale data as they are based on the maximum quantity of genetic information. A phylogenomic tree should be the best reflection of the evolutionary history of the species.15,16 There are more than 200,000 named species of unicellular eukaryotes that can be classified as Protozoa, of which approximately 10,000 are parasites, but only a small number are sufficiently important to be mentioned on the pages of Trends in Parasitology.18 The systematics of the Protozoa is a subject that has engaged the attention of protozoologists and evolutionists for some time, and advances in molecular methodology have revealed relationships among Protozoa and between Protozoa and other groups (http://www.ncbi.nlm.nih.gov/RefSeq) that can be used to draw up realistic and natural systems of classification.17–19 Protozoa are currently classified as a paraphyletic group, however the term is still used in several publications that relate it to areas such as systematics17 and taxonomy, besides parasitology,20,21 phylogeny,22–25 evolution26,27 and genomics.28 Difficulty is encountered in seeking a consensus taxonomic definition of several kingdoms. In particular, protozoan species are loosely characterized; deciding whether a species belongs to Protozoa or not is based on morphology and biological properties and also partially by the fact of not belonging to another kingdom.18,19,29 They are not a coherent phylogenetic group like other kingdoms or candidate kingdoms are, eg, stramenopila. Thus, to determine if a species belongs to Protozoa we do not just look at one characteristic (such as the multiparted, tubular flagellar hairs of stramenopila) and decide that. Instead, several criteria have to be met.30 Protozoa are phylogenetically connected to other eukaryotic groups and several eukaryotic groups are most likely derived from Protozoa.31 A molecular similarity has been established between alveolate (Protozoa) and stramenopiles.32 In corroboration, residual plastids in the malarial parasite (Plasmodium, Apicomplexa), and in non-malarial apicomplexans (Toxoplasma), suggest a relationship with dinoflagellate plastids and/or plastids of stramenopilous chromistans.33,34 Molecular data relate Choanoflagellate (Protozoa) to Sponges, thus to Animalia. Interestingly, molecular sequencing also supports a relationship of Choanoflagellates and Fungi.31 Sequences of the small subunit of ribosomal RNA gene (SSU rRNA) point to a grouping of Biliphyta (glaucophytes, Rhodophyta) with Cryptomonads.32 Also, several authors provided molecular evidence that primitive flagellates (protozoan species) occurred in the ancestry of Chlorophyta (green algae).24,35–37 Illnesses caused by parasitic Protozoa are a major cause of disease worldwide, but because they are concentrated in low socioeconomic parts of the world, they receive relatively little attention from the pharmaceutical industry. Of the ten diseases targeted as research priorities by the World Health Organization’s Special Program for Research and Training in Tropical Diseases (http://www.who.int/tdr), four are caused by protozoan parasites (malaria, leishmaniasis, Chagas disease and African trypanosomiasis). These diseases and other less dangerous ones (eg, amoebiasis and trichomoniasis) are having an alarming increase in cases, which are refractory to the main treatment. Treatment failure has potentially a multifactorial origin where drug resistance stands out as one of its major concerns.38–42 Our understanding of the phylogenetic position of protozoan within eukaryotes, as well as their relationships, is mainly based on ribosomal DNA analysis.2–4 At first, this process seemed relatively straightforward: trees generated from a single gene, most commonly a SSU rRNA,43 appeared to provide a basic structure for the topology of eukaryotes, although many branches of the tree remained controversial. The 1990s was a period of deconstruction of this theory since several protein coding gene trees revealed serious discrepancies.44,45 Currently, a hypothesis for the tree of eukaryotes resembles the tree presented by Keeling in 2005.36 The tree is a hypothesis composed from the various types of data, including molecular phylogenies and other molecular characteristics, as well as morphological and biochemical evidence. Five ‘supergroups’ were shown, each consisting of a diversity of eukaryotes, most of which were microbial (mostly protists and algae). Chaudhary in 200546 showed results on available sequencing data. Phylogenetic analysis defined five supergroups of eukaryotes: i) the plant and red/green algal lineage; ii) a clade comprised of animals, fungi, slime molds and amoebozoans named Unikonta eukaryotes which contained the species included in our study (Acanthamoeba, Entamoeba and Dictyostelium);24 and three supergroups that are entirely Protozoa: iii) chromalveolates, iv) excavates and v) rhizaria.36 The first supergroup of Protozoa (chromalveolates) showed three principal groups which included the parasitic phylum apicomplexa (Theileria, Babesia, Plasmodium, Toxoplasma, Neospora, Eimeria and Cryptosporidium); along with ciliates (Tetrahymena, Paramecium and Oxytricha); diatoms and many taxa for which no complete genomes are available (Diatoms and Phytophthora).37,47–50 The second supergroup (excavates) included kinetoplastid parasites (Trypanosoma and Leishmania) and other lineages (many anaerobic and/or parasitic Giardia, Spironucleus, Trichomonas and Naegleria).51 In this study, we explore the use of multiple genes to present a phylogenomics-based study among Protozoa and its relationship with other very close taxonomic species which are considered as mitochondrial or plastid protozoan according to RefSeq (http://www.ncbi.nlm.nih.gov/RefSeq) taxonomic classification.

Material and Methods

Selection and preparation of marker gene families

We based our methodology on the study by Ciccarelli et al (2006)8 for the selection and construction of the species tree. Thirty-one universal orthologous (UO) genes showing 1:1 orthologous relationships were used (Supplementary Table S1). Those UO were originally identified by Ciccarelli et al (2006)8 showing the following characteristics: i) to be present in all complete genomes available at Genbank until 2006, ii) not to be involved in horizontal transfer, and iii) to be good ones for phylogenomic studies. As those 31 UO have a direct correspondence in the protozoan genome data available at RefSeq, they were mapped to the referred data using (a) the best blast hits (e-value < 1-e50), and (b) manual verification of the annotation (the RefSeq annotation of the best hits needed to match the UO annotation) (Supplementary Table S2). Once mapped, the protozoan protein sequences corresponding to those 31 UO were downloaded in fasta format from RefSeq (ftp://ftp.ncbi.nih.gov/refseq/release/protozoa/) and then aligned using Mafft v5.86152–54 with default parameters.

Selection and preparation of marker gene families for protozoan species trees

Our study used the data (protozoan species) that were obtained based on the taxonomic classification provided in the Release Catalogue of RefSeq (RefSeq-release35-05/04/2009) for Protozoa. This was done as a first tentative to include the different genus and species of Protozoa available in public databases. The full names of species used in our analysis are listed in Supplementary Table S3. Hidden Markov Models (HMM) profiles55 were constructed for the 31 aligned UO set, then this database (HMM profiles obtained from the alignment of the best hits of UO from protozoan sequences available at RefSeq) served as a seed to search for more UO hits in protozoan sequences available at Genbank and RefSeq. The HMM profiles used as a seed were created (hmmbuild) and calibrated (hmmcalibrate) and the searches (hmmpfam) were done with e-value “1e-5” as cut-off using HMMER version 2.3.2. All protozoan sequences (74 complete and draft genomes) available at RefSeq (ftp://ftp.ncbi.nih.gov/refseq/release/protozoa/) (RefSeq-release35-05/04/2009) and Genbank (http://www.ncbi.nlm.nih.gov/genbank/) (NCBI-Flat File Release 172.0-06/15/2009) were used as a target database. The best hits (e-value < 1e-5) of the HMMER (hmmpfam) search were added to the multiple alignment that originated the UO HMM profile, then a new multiple alignment was constructed (Mafft v5.861)52–54 containing: a) the original protozoan sequences from RefSeq that originated the UO HMM profile, plus b) the best hmmpfam hits of the UO HMM profiles obtained with protozoan sequences from Genbank and RefSeq. Those new multiple alignments (entire or trimmed) were used: i) to be concatenated and build a supermatrix tree, or ii) to build individual trees, then those trees were concatenated to obtain a supertree (Supplementary Table S3).

Concatenating multiple alignments to build a supermatrix tree

A supermatrix tree was obtained using concatenated multiple alignments, either entire or trimmed: Entire concatenated alignments (M1): The individual alignments were concatenated using an in-house perl script, resulting in a global supermatrix of 21,260 positions in a total of 74 species (43 Protozoa plus 31 mitochondrial or plastid Protozoa). The 31 plastid or mitochondrial protozoan genomes were included in the study based on the taxonomic classification provided in the Release Catalogue of RefSeq (RefSeq-release35-05/04/2009). Trimmed concatenated alignments (M2): The individual alignments were trimmed using TrimAl v1.256 (http://trimal.cgenomics.org/) aiming to obtain their most conserved blocks. Those extracted conserved blocks were concatenated using an in-house perl script, resulting in a global supermatrix of 12,807 positions in a total of 74 protozoan species. Positions in the alignment with gaps in more than 10% of the sequences were trimmed with TrimAl.56,57 The resulting supermatrix of M1 or M2 was used to generate separate trees with Phyml 2.4.458,59 using 100 bootstrap replicates and JTT, elected as the best evolutionary model. Each individual alignment was tested for the best evolutionary model using Modelgenerator 0.85. Several models were selected by Modelgenerator 0.85; however, because it is not simple to use multiple models in a single (concatenated) alignment, we decided to adopt JTT that was also the model adopted in the phylogenomics studies of Ciccarelli et al (2006).8 JTT assumed that there were two classes of sites, one class being invariable and the other class being free to change.8 The resulting clades C1, C2, and C3 obtained from the supermatrix tree of M1 or M2 were used to generate the individual trees Ct1, Ct2, and Ct3 with Phyml 2.4.458,59 using 100 bootstrap replicates and the evolutionary models (Supplementary Table S4) obtained with Modelgenerator 0.85.

Building individual trees to obtain a supertree

We also used either entire or trimmed individual alignments to build phylogenetic trees, then we concatenated them to build a supertree: the 31 individual total alignments (M3) and the 31 individual trimmed alignments (M4) were used to construct individual trees with Phyml 2.4.4 using 100 bootstrap replicates and the matrices of the evolutionary models (Supplementary Table S4) obtained with Modelgenerator 0.85.60 The resulting trees of M3 were concatenated to obtain supertrees with Clann 3.1.3.9 The same procedure was adopted for M4 (trimmed alignments).

Test of phylogenetic signal

The content and distribution of the phylogenetic signal of the 64 alignments (2 concatenated and 62 individual) were analyzed. Two statistical approaches, the PTP test and g1 statistics were employed to achieve a measure of the overall signal content. The PTP test (PTP—permutation test probability or permutation tail probability test)61 was implemented in PAUP* (Phylogenetic Analysis Using Parsimony) version 4.0b1062 and it was executed with heuristic search.63,64 G1 statistics was calculated from the characters using the RandTrees function in PAUP.65

Topological test

The Kishino and Hasegawa tests (KH tests)66 were performed in PAUP* 4.0b1062 to assess differences between the most parsimonious trees resulting from the analysis of the full dataset. Null distribution of the test statistic was simulated using 100 bootstrap replicates likelihoods (full dataset) obtained with Phyml 2.4.4 of the following eight groups: i) Complete total tree (M1), ii) Complete trimmed tree (M2), iii) M1-C1, iv) M2-C1, v) M1-C2, vi) M2-C2, vii) M1-C3 and viii) M2-C3. Kishino-Hasegawa test66 assumes a null hypothesis where the expected difference in the optimality score between alternative phylogenies is zero.67 This requires that the topologies under comparison must be specified a priori and without reference to the data used for the test. However, nearly all uses of these tests involve comparing alternative topologies to the optimal topology estimated from the data. This application guarantees that the null expectation of difference will always be larger than zero and does not violate any assumption of a normal distribution of differences in optimality scores between topologies.68

Results and Discussion

The vast majority of eukaryotic diversity is Protozoa and most of the sequenced protozoan species are parasites.46 It is important to establish the position of protozoan within the eukaryotic group; despite the taxonomic classification rules are not completely clear. Nowadays, the majority of phylogenies cannot be considered as complete information as they use only single or ribosomal genes. The evolutionary analysis of the data included several steps. First, we separately evaluated the presence or absence of a significantly structured phylogenetic signal for each data set, and second, we separately compared the trees signalized by the KH test as being “the best” and the trees obtained with Phyml 2.4.4 from the entire concatenated alignments (M1) and the trimmed concatenated alignments (M2). In order to evaluate the statistical significance of any particular branching pattern, statistical tests were performed. Statistical tests in phylogenetics allow the assessment of the degree of confidence in any given tree topology being the true topology or the most consistent that we accept as true. Thus, statistical tests are responsible for the mutual development between the abilities to estimate better trees and to create more realistic models of evolution.69 This was accomplished by conducting exhaustive parsimony searches on each alignment using PAUP* 4.0b262,70 and comparing the resulting g1 statistics of tree-length distribution skewness with critical values published by Hillis and Huelsenbeck in 1992.65 Negatively skewed distributions indicate the presence of trees shorter than what would be expected by chance. As another measure of the phylogenetic robustness of the data, a bootstrap analysis of the combined data set was conducted using 1,000 replicate branch-and-bound parsimony searches. Another popular statistical test that was used in the phylogenetic analysis was the PTP test, which is designed to address whether there is any true phylogenetic signal in any given dataset (alignment).69 The PTP test indicates that: A) the length value of the most parsimonious tree based on the trimmed concatenated alignments (78,498 steps, P < 0.001) obtained with the original data is distant from the others obtained through the permutation of the data (Figs. 1A and B) the length value of the most parsimonious tree based on the total concatenated alignments (85,268 steps, P < 0.001) obtained with the original data is distant from the others obtained through the permutation of the data (Fig. 1B).

Figure 1.

Permutation Tail Probability Test (PTP) of the concatenated alignments. A) PTP test of the trimmed concatenated alignments of the protozoan UO. (Number of replicates = 1,000, search = heuristic). The gray arrow indicates the most parsimonious tree (TMP = 78,498). p is the probability of getting a more extreme T-value under the null hypothesis of no difference between the two trees (two-tailed test). PTP test indicates significant difference at P < 0.05 between the original (unpermuted) data of AMP and the permuted data. B) PTP test of the total concatenated alignments of the protozoan UO. (Number of replicates = 1,000, search = heuristic). The gray arrow indicates the most parsimonious tree (TMP = 85,268). P is the probability of getting a more extreme T-value under the null hypothesis of no difference between the two trees (two-tailed test). PTP test indicates significant difference at P < 0.05 between the original (unpermuted) data of AMP and the permuted data.

The g1 statistics also indicates that there is phylogenetic signal in the data used in the analysis regarding the tree-length skewness because: A) the tree-length distribution based on the trimmed concatenated alignments showed left skewness (G1 = −0.57, P < 0.001) (Figs. 2A and B) the tree-length distribution based on the total concatenated alignments also showed left skewness (G1 = −0.58, P < 0.001) (Fig. 2B), as the tree-length distribution with significant left skewness contains more phylogenetical signals than more symmetrical or right-skewed distributions.

Figure 2.

The g1 statistics of the concatenated alignments. A) The g1 statistics of the trimmed concatenated alignments of the protozoan UO. (Number of replicates = 1,000,000). The gray arrow indicates the most parsimonious tree (TMP = 103,691). B) The g1 statistics of the total concatenated alignments of the protozoan UO. (Number of replicates = 1,000,000). The gray arrow indicates the most parsimonious tree (TMP = 111,342).

The results of the PTP test for concatenated alignments are (I) Total Characters: i) trimmed: concatenated-12,807 versus single average-412 and ii) total: concatenated-21,260 versus single average-686. (II) Parsimony-Informative Characters: i) trimmed: concatenated-8,586 versus single average-277 and ii) total: concatenated-9,308 versus single average-300 (Tables 1 and 2).

Table 1.

Results of the PTP test for the concatenated alignments of the universal orthologs.

Concatenated alignments of the universal orthologs	Total characters	Constant characters	Parsimony-uninformative variable characters	Parsimony-informative characters
Trimmed	12807	1836	2385	8586
Total	21260	7479	4473	9308

Table 2.

Results of the PTP test for the 31 single alignments of the universal orthologs.

Alignments of the universal orthologs		Total characters	Constant characters	Parsimony-uninformative variable characters	Parsimony-informative characters
COG0012	Total	605	153	59	393
	Trimmed	363	11	14	338
COG0016	Total	964	249	228	487
	Trimmed	578	44	59	475
COG0048	Total	241	49	58	134
	Trimmed	145	9	6	130
COG0049	Total	292	103	17	172
	Trimmed	175	6	12	157
COG0052	Total	339	18	23	298
	Trimmed	203	2	5	196
COG0080	Total	1543	1223	132	188
	Trimmed	926	606	132	188
COG0081	Total	477	55	112	307
	Trimmed	286	2	14	270
COG0087	Total	615	115	145	355
	Trimmed	369	16	29	324
COG0091	Total	522	348	27	147
	Trimmed	313	139	27	147
COG0092	Total	338	77	38	223
	Trimmed	203	5	7	191
COG0093	Total	256	84	26	146
	Trimmed	154	2	6	146
COG0094	Total	242	10	32	200
	Trimmed	148	2	5	141
COG0096	Total	170	28	16	126
	Trimmed	120	4	2	114
COG0097	Total	295	75	26	194
	Trimmed	177	1	2	174
COG0098	Total	298	27	32	239
	Trimmed	179	11	8	160
COG0099	Total	249	84	21	144
	Trimmed	149	5	10	134
COG0100	Total	274	114	32	128
	Trimmed	164	9	27	128
COG0102	Total	805	467	109	229
	Trimmed	438	145	109	229
COG0103	Total	738	85	421	232
	Trimmed	443	5	206	232
COG0172	Total	2065	1033	522	510
	Trimmed	1239	207	522	510
COG0184	Total	540	330	82	128
	Trimmed	324	129	67	128
COG0186	Total	320	122	45	153
	Trimmed	192	21	18	153
COG0197	Total	234	31	8	195
	Trimmed	140	5	3	132
COG0200	Total	666	161	287	218
	Trimmed	400	6	176	218
COG0201	Total	572	71	69	432
	Trimmed	374	15	12	347
COG0202	Total	1033	373	271	389
	Trimmed	620	24	207	389
COG0256	Total	704	346	78	280
	Trimmed	422	64	78	280
COG0495	Total	2307	521	629	1157
	Trimmed	1384	27	226	1131
COG0522	Total	880	542	79	259
	Trimmed	528	190	79	259
COG0525	Total	1611	311	396	904
	Trimmed	967	67	76	824
COG0533	Total	1065	274	450	341
	Trimmed	639	57	241	341

By several approaches, our data (concatenated alignments) showed to be more reliable or informative than single gene phylogenies: A) the two statistical approaches, PTP test and g1 statistics used to achieve a measure of the overall signal content, indicated that the molecular data related to the 64 alignments have phylogenetic signal. B) The use of concatenated alignments offers more Parsimony-Informative Characters in comparison to the use of separated single genes. On the strategy to build phylogenetic relationships our study was indeed based on UO originally described by Ciccarelli’s work.8 This choice was made because we believe that this approach is really nice and useful, especially when using genomes with partial or unfinished sequencing. While the strategy is pretty much the same, more data were used in our study. Ciccarelli’s tree8 present only six species of Protozoa (Dictyostelium discoideum, Cryptosporidium hominis, Plasmodium falciparum, Thalassiosira pseudonana, Leishmania major and Giardia lamblia) against 74 (complete and draft genomes) used in our study. Figure 3 shows the comparison of the trees M1 and M2, using the total (M1) and trimmed (M2) (using TrimAl) concatenated alignments. Both methodologies showed very similar topologies; however, the tree obtained with trimmed alignments showed higher bootstrap values. Also, in Figure 3 (at the right side) are presented the three sub-trees belonging to each of the three clades of the original M2 tree: M2-C1 (Clade 1 of the M2 tree), M2-C2 (Clade 2 of the M2 tree) and M2-C3 (Clade 3 of the M2 tree). Each presented higher bootstrap values when compared to the entire M2 tree. The trees signalized by the KH test as being “the best” (for scores see Supplementary File S3) and the trees M1 and M2 (and their sub-trees M1-C1, M2-C1, M1-C2, M2-C2, M1-C3 and M2-C3) obtained with Phyml 2.4.4 (for loglk see Table 3) presented the same topologies and they are shown in the Supplementary Figure S1 Group A-D, respectively.

Figure 3.

Phylogenomic supermatrix trees of protozoan species using the total and trimmed (using TrimAl) alignments. Supermatrices of 21,260 (M1) and 12,807 (M2) positions were used respectively in 74 protozoan species. Maximum likelihood tree was constructed with Phyml 2.4.4, JTT as evolutionary model and bootstrap 100. The resulting clades of M2: M2-C1 in red, M2-C2 in blue and M2-C3 in black with the evolutionary models obtained with Modelgenerator 0.85 were used to construct three individual trees: M2-Ct1 (Blosum62), M2-Ct2 (RtREV) and M2-Ct3 (WAG).

Table 3.

Kishino-Hasegawa test results.

Number	Tree	KH test	Phyml likelihood: loglk
1	Complete total tree	–	−418,592,428,692
2	Complete total tree (KH test)	Best tree of the 100 bootstraps of C1 total: bootstrap 36	–
3	Complete trimmed tree	–	−473,396,448,710
4	Complete trimmed tree (KH test)	Best tree of the 100 bootstraps of C1 trimmed: bootstrap 91	–
5	C1 total	–	−36,241,173,106
6	C1 total (KH test)	Best tree of the 100 bootstraps of C1 total: bootstrap 39	–
7	C1 trimmed	–	−313,069,182,861
8	C1 trimmed (KH test)	Best tree of the 100 bootstraps of C1 trimmed: bootstrap 83	–
9	C2 total	–	−5,837,927,846
10	C2 total (KH test)	Best tree of the 100 bootstraps of C2 total: bootstrap 20	–
11	C2 trimmed	–	−5,457,055,926
12	C2 trimmed (KH test)	Best tree of the 100 bootstraps of C2 trimmed: bootstrap 16	–
13	C3 total	–	−6,001,226,356
14	C3 total (KH test)	Best tree of the 100 bootstraps of C3 total: bootstrap 79	–
15	C3 trimmed	–	−5,603,518,000
16	C3 trimmed (KH test)	Best tree of the 100 bootstraps of C3 trimmed: bootstrap 61	–

The bootstrap values of the M2 tree (containing the three clades: C1, C2 and C3) are weak, and it is probably because not all taxons had the sequences of the 31 UO available at public databases. We tested the possibility to obtain reliable species trees using less than 31 UO, then showed that using at least 80% (25/31) for each taxon provide reliable inferences (most bootstraps equal or higher than 80). The C1, C2, and C3 clades of the M2 tree were re-analysed as separated trees (M2-C1, M2-C2 and M2-C3) (Fig. 3), then better bootstrap values were obtained. The M2-C1 tree showed bootstrap values above 80, except for the values 49, 60, 61, and 72; and the M2-C2 tree also showed bootstrap values above 80, except for the values 43, 63, and 67. Unfortunately bootstrap values could not be enhanced even better either using new/better alignment or more taxa. This weak bootstrap values behaviors has also been noted in Ciccarelli (2006);8 Dutilh (2007)11 and Hartmann (2008)71 using similar data and/or methodology. The species trees—M1 and M2—presented three major clades. Each clade was related to one of the following groups of data: i) 26 species presenting at least 80% of UO (25/31) in their concatenated alignments called C1, ii) 12 species presenting between 50%–79% (15–24/31) of UO called C2, and iii) 36 species presenting less than 50% (1–14/31) of UO called C3. C1 showed excavates represented by kinetoplastids, trichomonads and diplomonads. Kinetoplastids, a group of uncertain affinity,45,72,73 was characterized by the presence of a monophyly formed by the paraphyletic groups Leishmania and Trypanosoma. L. major was found to be more closely related to L. infantum than L. brasiliensis. T. brucei and T. cruzi were the representatives of Trypanosoma. Previous phylogenetic analyses36,47 confirmed a closer relationship between diplomonads (Giardia) and trichomonads (Trichomonas), which was confirmed by our results. Diplomonads were closely related to Monosiga, Entamoeba, Apicomplexa alveolates: Cryptosporidium and ciliates: Tetrahymena and Paramecium. Other Apicomplexa as Plasmodium, Theileria and Babesia were also closely related. Toxoplasma was wrongly placed in C3, separated from other Apicomplexa alveolates, which probably occurred because of insufficient phylogenetic information (less than 50% of UO) in the concatenated alignments. Ciccarelli et al (2006)8 showed that despite a highly resolved and robust tree, they could not exclude a few uncertainties in tree topology due to biased species sampling or Long Branch Attraction (LBA). This LBA was suggested to account for the placement of Diplomonadida (G. lamblia) as the most basal eukaryal taxon and as the most external taxon of Protozoa, followed by the Kinetoplastida (L. major), placing both of them as related to Chromoalveolatas (T. pseudonana, P. falciparum and C. hominis). Our results (Fig. 3, M2-C1) showed different/additional relationships, with the most relevant being that G. lamblia is more closely related to Cryptosporidium than L. major, and that the Giardia-Cryptosporidium clade is closely related to the Trypanosoma-Leishmania-Trichomonas clade. Also, since our M2-C1 tree is more reliable than M2-C2 and M2-C3, our results differ from Ciccarelli’s8 showing that the most basal protozoans could be the genus Bigelowiella and Guillardia forming together a clade. The use of more data from Protozoa in our study (than the original data used by Ciccarelli’s) gave us the opportunity to infer additional and more complete relationships than initially described by those authors. However, a highly resolved protozoan species tree is still a challenge, that will be better addressed by the inclusion of more UO from taxons forming M2-C2 and M2-C3 trees (Fig. 3), and also more genomic data from additional species, especially from basal genus as Bigelowiella and Guillardia. Animals and their unicellular relatives (together termed ‘Holozoa’) show a strong affinity with Fungi (together termed ‘opisthokonts’) providing strong evidence in a relationship with Protozoa (eg, Monosiga). Our tree showed this relationship: M. brevicollis was found closely related to C1. The opisthokont lineage is supported by insertions in the elongation factor-1a and enolase,74 as well as by many individual2,74 and concatenated gene phylogenies.47,75,76 On the other hand, amoebozoa are supported by a group of several individual and concatenated gene phylogenies,76 partially by the presence of fused genes encoding cytochrome oxidase 1 and 2 in the mitochondrial DNA of slime molds and lobose amoebae.77 ‘Unikont’ is used as the name for the union of two individually well supported groups: amoebozoans and opisthokonts.24 Overall, unikonts include animals and fungi, some amoebae (eg, Entamoeba), slime molds (eg, Dictyostelium), and a few parasitic protists. The analysis of multiple nuclear genes strongly supports the sharing of a common ancestry of choanoflagellates represented by Monosiga with animals with the exclusion of fungi and the other sampled eukaryotes;47 this is in agreement with single-gene studies2,78,79 and with a mitochondrial multi-gene phylogeny.80 In our phylogenomic trees, C2 was formed by four groups: i) rhodophyta: Porphyra and Gracilaria; ii) cryptophyta: Guillardia and Rhodomonas; glaucocystophyceae: Cyanophora, and haptophyceae: Emiliania (The cryptophyta Hemiselmis was wrongly placed closer to the apicomplexa Theileria and Babesia); iii) stramenopiles: Odontella, Phaeodactylum and Heterosigma; iv) rhodophyta: Cyanidioschyzon and Cyanidium. C3 was formed by the protozoan euglenozoa/kinetoplastida: L. amazonensis, L. donovani, L. enriettii; the alveolata/apicomplexa: T. gondii and E. tenella; the amoebozoa: A. castellanii, A. polyphaga, A. healyi, D. citrinum, D. fasciculatum and Polysphondylium pallidum; the alveolata/ciliophora: P. aurelia, P. caudatum, T. malaccensis, T. pigmentosa, T. pyriformis and T. paravorax and the euglenozoa/euglenida: E. gracilis, E. longa. Also, other eukaryotes were found in this clade: the stramenopiles: Cafeteria roenbergensis, Chrysodidymus synuroideus, Desmarestia viridis, Dictyota dichotoma, Fucus vesiculosus, Laminaria digitata, Ochromonas danica, Phytophthora infestans, P. sojae, P. ramorum, Pylaiella littoralis, Saprolegnia ferax and T. pseudonana; the rhodophyta: Chondrus crispus, the malawimonadidae: Malawimonas jakobiformis, the heterolobosea: Naegleria gruberi and the jakobidae Reclinomonas americana. Our phylogenomic analysis showed, as expected, the phylogenetic relationships and the monophyly of Protozoa (Fig. 3) that is in good agreement with previous studies.36,46,47,75,76,80 Unfortunately, because of the use of several incomplete genomes in this study, C3 phylogeny is not reliable, mainly because it is formed by a group of species containing less than 50% (1–14/31) of the 31 UO found. This could be also interpreted as the concatenated alignment having too many gaps (in this case a gap would be a missing UO or whole protein, not only a missing amino acid or nucleotide sequence), thus contributing to the low robustness of the clade or tree. We hypothesize that each missing UO can be treated as a gap in the concatenated alignment, then the more gaps (or less UO) the less reliable trees. It appears logical to expect an inverse relationship between the proportion of gapped sites in an alignment and the accuracy of the inferred phylogeny,71,81,82 particularly if the gaps are not treated as reflective of distinct evolutionary events,83 and thus, containing distinct phylogenetic signal. The supertree approach did not work in our hands because when Clann 3.1.3 was used for the supertree reconstruction (trees’ concatenation), only a Neighbor-Joining tree (the initial step to produce a supertree) was obtained. The next two steps (heuristic search or the bootstrapping) were not executed because the software showed an error message informing that the individual trees used as inputs could not be concatenated into a one supertree. Nevertheless, Supplementary Figure S2 shows the Neighbor-Joining supertree for the individual total alignments (M3) and for the individual trimmed alignments (M4) and what could be appreciated is the fact that they presented similar topologies when compared to the supermatrix tree (M1 and M2) in Figure 3. While most of the publicly available eukaryote genome sequence data were obtained with Sanger technology (medium to low coverage), second and third generation sequencing technologies will be probably used to sequence a larger number of species but at low coverage; hence the approach to use partial sequences from several genes (especially UO) appears to be a good option for future phylogenomics-based studies.

Conclusions

We have presented a phylogenomics-based overview for Protozoa. Relationships between protozoan groups are in agreement with previous studies, supporting monophyly. On the other hand, phylogenetic information inferred from C3 is not reliable due to incomplete information (missing UO in those genomes), suggesting that the use of less than 15 UO for phylogenomic reconstruction is not reliable. The inclusion of more data (UO) is necessary to obtain a robust tree in C3. Our phylogenomics-based methodology using a supermatrix approach proved to be reliable with protozoan genome data, suggesting that (a) the more UO used the better, and (b) that the use of the entire UO sequence or just a conserved block of it produce similar reliable results. The highest bootstrap values were obtained when the trees of the clades C1, C2 and C3 were constructed separately. The results of the supertree were obtained only for the Neighbor-Joining tree and were not conclusive. Finally, we need to further investigate if this methodology could be extrapolated or reproduced to other taxonomic groups. Figure S1 Group A. Comparison between the complete trees (total and trimmed) signalized as “the best” by the KH test and the trees M1 and M2 obtained with Phyml. The trees are: i) M1-Complete total tree (M1-Phyml), ii) Complete total tree (KH test), iii) M2-Complete trimmed tree (M2-Phyml), iv) Complete trimmed tree (KH test). All trees were constructed with Phyml 2.4.4. Figure S1 Group B. Comparison between the C1 trees (total and trimmed) signalized as “the best” by the KH test and the trees M1 and M2 obtained with Phyml. The trees are: i) M1-C1 total (M1-Phyml), ii) C1 total (KH test), iii) M2-C1 trimmed (M2-Phyml), iv) C1 trimmed (KH test). All trees were constructed with Phyml 2.4.4. Figure S1 Group C. Comparison between the C2 trees (total and trimmed) signalized as “the best” by the KH test and the trees M1 and M2 obtained with Phyml. The trees are: i) M1-C2 total (M1-Phyml), ii) C2 total (KH test), iii) M2-C2 trimmed (M2-Phyml), iv) C2 trimmed (KH test). All trees were constructed with Phyml 2.4.4. Figure S1 Group D. Comparison between the C3 trees (total and trimmed) signalized as “the best” by the KH test and the trees M1 and M2 obtained with Phyml. The trees are: i) M1-C2 total (M1-Phyml), ii) C3 total (KH test), iii) M2-C3 trimmed (M2-Phyml), iv) C2 trimmed (KH test). All trees were constructed with Phyml 2.4.4. Figure S2. Neighbor-Joining supertree of protozoan species built with Clann 3.1.3 using the total and trimmed (using TrimAl) alignments. Table S1. Universal orthologs mapped to KOGs found in protozoan. Table S2. RefSeq accession numbers of the protozoan sequences mapped from COGs and used in the phylogeny (for the sequences see Supplementary File S1). Table S3. List of the species analyzed in the study. (C1 is red, C2 blue and C3 black). Table S4. Amino acid evolutionary models obtained with Modelgenerator 0.85 of the individual and concatenated alignments of UO and the clades C1, C2 and C3. File S1. Fasta sequences used in the phylogeny divided by the UO clusters. File S2. Fasta sequences and alignments used as input to construct the supermatrices (total and trimmed) and the supertrees (total and trimmed). File S3. Scores obtained with the Kishino-Hasegawa test over the eight bootstrapped trees: M1, M2, C1-M1, C2-M1, C3-M1, C1-M2, C2-M2 and C3-M2.

76 in total

1. A kingdom-level phylogeny of eukaryotes based on combined protein data.

Authors: S L Baldauf; A J Roger; I Wenk-Siefert; W F Doolittle
Journal: Science Date: 2000-11-03 Impact factor: 47.728

2. A phylogenomic approach to bacterial phylogeny: evidence of a core of genes sharing a common history.

Authors: Vincent Daubin; Manolo Gouy; Guy Perrière
Journal: Genome Res Date: 2002-07 Impact factor: 9.043

3. Phylogeny of triatomine vectors of Trypanosoma cruzi suggested by mitochondrial DNA sequences.

Authors: Andrés C Sainz; Laura V Mauro; Etsuko N Moriyama; Beatriz A García
Journal: Genetica Date: 2004-07 Impact factor: 1.082

4. Inferring evolutionary trees with PAUP*.

Authors: James C Wilgenbusch; David Swofford
Journal: Curr Protoc Bioinformatics Date: 2003-02

5. Evolutionary relationships among the eukaryotic crown taxa taking into account site-to-site rate variation in 18S rRNA.

Authors: Y Van de Peer; R De Wachter
Journal: J Mol Evol Date: 1997-12 Impact factor: 2.395

6. Pharmacokinetic investigations in patients from northern Angola refractory to melarsoprol treatment.

Authors: C Burri; J Keiser
Journal: Trop Med Int Health Date: 2001-05 Impact factor: 2.622

7. Influence of Leishmania (Viannia) species on the response to antimonial treatment in patients with American tegumentary leishmaniasis.

Authors: Jorge Arevalo; Luis Ramirez; Vanessa Adaui; Mirko Zimic; Gianfranco Tulliano; César Miranda-Verástegui; Marcela Lazo; Raúl Loayza-Muro; Simonne De Doncker; Anne Maurer; Francois Chappuis; Jean-Claude Dujardin; Alejandro Llanos-Cuentas
Journal: J Infect Dis Date: 2007-05-03 Impact factor: 5.226

8. Molecular phylogenetics and reproductive incompatibility in a complex of cryptic species of aphid parasitoids.

Authors: John M Heraty; James B Woolley; Keith R Hopper; David L Hawks; Jung-Wook Kim; Matthew Buffington
Journal: Mol Phylogenet Evol Date: 2007-07-12 Impact factor: 4.286

9. Using ESTs for phylogenomics: can one accurately infer a phylogenetic tree from a gappy alignment?

Authors: Stefanie Hartmann; Todd J Vision
Journal: BMC Evol Biol Date: 2008-03-26 Impact factor: 3.260

10. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses.

Authors: Salvador Capella-Gutiérrez; José M Silla-Martínez; Toni Gabaldón
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

3 in total

1. An orthology-based analysis of pathogenic protozoa impacting global health: an improved comparative genomics approach with prokaryotes and model eukaryote orthologs.

Authors: Rafael R C Cuadrat; Sérgio Manuel da Serra Cruz; Diogo Antônio Tschoeke; Edno Silva; Frederico Tosta; Henrique Jucá; Rodrigo Jardim; Maria Luiza M Campos; Marta Mattoso; Alberto M R Dávila
Journal: OMICS Date: 2014-06-24

2. The Comparative Genomics and Phylogenomics of Leishmania amazonensis Parasite.

Authors: Diogo A Tschoeke; Gisele L Nunes; Rodrigo Jardim; Joana Lima; Aline Sr Dumaresq; Monete R Gomes; Leandro de Mattos Pereira; Daniel R Loureiro; Patricia H Stoco; Herbert Leonel de Matos Guedes; Antonio Basilio de Miranda; Jeronimo Ruiz; André Pitaluga; Floriano P Silva; Christian M Probst; Nicholas J Dickens; Jeremy C Mottram; Edmundo C Grisard; Alberto Mr Dávila
Journal: Evol Bioinform Online Date: 2014-09-23 Impact factor: 1.625

3. Phylogeny of bacterial and archaeal genomes using conserved genes: supertrees and supermatrices.

Authors: Jenna Morgan Lang; Aaron E Darling; Jonathan A Eisen
Journal: PLoS One Date: 2013-04-25 Impact factor: 3.240

3 in total