Fredj Tekaia, Edouard Yeramian.
Abstract
The concept of the genome tree depends on the potential evolutionary significance in the clustering of species according to similarities in the gene content of their genomes. In this respect, genome trees have often been identified with species trees. With the rapid expansion of genome sequence data it becomes of increasing importance to develop accurate methods for grasping global trends for the phylogenetic signals that mutually link the various genomes. We therefore derive here the methodological concept of genome trees based on protein conservation profiles in multiple species. The basic idea in this derivation is that the multi-component "presence-absence" protein conservation profiles permit tracking of common evolutionary histories of genes across multiple genomes. We show that a significant reduction in informational redundancy is achieved by considering only the subset of distinct conservation profiles. Beyond these basic ideas, we point out various pitfalls and limitations associated with the data handling, paving the way for further improvements. As an illustration for the methods, we analyze a genome tree based on the above principles, along with a series of other trees derived from the same data and based on pair-wise comparisons (ancestral duplication-conservation and shared orthologs). In all trees we observe a sharp discrimination between the three primary domains of life: Bacteria, Archaea, and Eukarya. The new genome tree, based on conservation profiles, displays a significant correspondence with classically recognized taxonomical groupings, along with a series of departures from such conventional clusterings.
Year: 2005 PMID: 16362074 PMCID: PMC1314884 DOI: 10.1371/journal.pcbi.0010075
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Figure 1. Determination of Distinct Conservation Profiles for Proteins
The flow chart details steps in the determination of distinct conservation profiles for proteins in 99 predicted proteomes. The steps are as follows:
(A) Step 1: Species-specific predicted proteome comparisons. Each protein sequence of species Si (see list in Table S1) was compared to each database of all proteins from each surveyed species, using the BLASTP program (see Materials and Methods). Best significant matches in each of the considered species were determined. The original 541,880 protein sequences led to 442,460 non-specific proteins (i.e., 81.7%). Fractions of ancestral duplication and ancestral conservation were determined. Each protein was then described by a vector whose components are zeros (no matches) or best significant matches whenever hits occur in each of the considered species. From the list of proteins and their corresponding best hits, pairs of orthologs were determined by looking for reciprocal best significant hits.
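The reciprocal best hit criterion mentioned in Step 1 can be illustrated with a minimal Python sketch. It assumes the best significant matches between two species have already been extracted from BLASTP output into two dictionaries; the function name `reciprocal_best_hits` and the toy protein identifiers are hypothetical and are not taken from the original study.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs in which a and b are each other's best hit.

    best_a_to_b maps each protein of species A to its best significant
    match in species B (proteins without a significant hit are absent).
    best_b_to_a is the same mapping in the opposite direction.
    """
    pairs = []
    for a, b in best_a_to_b.items():
        # Keep the pair only if the best hit of b points back to a.
        if best_b_to_a.get(b) == a:
            pairs.append((a, b))
    return sorted(pairs)


# Hypothetical toy data: a2 also hits b1, but b1's best hit is a1.
best_a_to_b = {"a1": "b1", "a2": "b1", "a3": "b3"}
best_b_to_a = {"b1": "a1", "b3": "a4"}

orthologs = reciprocal_best_hits(best_a_to_b, best_b_to_a)
print(orthologs)  # [('a1', 'b1')]
```

Only `a1` and `b1` are retained, because the other candidate pairs are not mutual best hits.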
(B) Step 2: Protein conservation profiles. In each species Si, the conservation profile of each protein k, denoted gi,k, is represented by an n-component vector of ones and zeros describing its pattern of conservation across all species. Each vector associated with a conservation profile is of size 99, corresponding to the total number of surveyed species (in the order indicated in Table S1).
(C) Step 3: Distinct conservation profiles. In each species Si, identical conservation profiles were represented by a single representative, leading to the set of distinct conservation profiles. In this simplification, a “weight” is associated with a given conservation profile, as the total number of proteins with that profile. For example, 3,154 distinct conservation profiles were found in S. cerevisiae, 5,690 in A. gambiae, 6,225 in H. sapiens, and 1,716 in P. falciparum.
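Steps 2 and 3 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the pipeline used in the study: each protein is assumed to come with a dictionary of best significant hits keyed by species, a protein is assumed to match its own species, and four hypothetical species stand in for the 99 surveyed ones. The function names and identifiers are invented for the example.

```python
from collections import Counter


def conservation_profile(best_hits, species_order):
    """Turn per-species best hits into a tuple of ones and zeros.

    A component is 1 when the protein has a significant match in that
    species and 0 otherwise, following the fixed species order.
    """
    return tuple(1 if best_hits.get(s) else 0 for s in species_order)


def distinct_profiles(profiles):
    """Collapse identical profiles into one representative each.

    The count stored for a profile plays the role of its "weight",
    i.e. the number of proteins sharing that profile.
    """
    return Counter(profiles)


# Hypothetical species order and best hits for three proteins of S1.
species_order = ["S1", "S2", "S3", "S4"]
proteome_hits = {
    "p1": {"S1": "p1", "S2": "q7"},
    "p2": {"S1": "p2", "S2": "q9"},
    "p3": {"S1": "p3", "S2": "q2", "S3": "r5", "S4": "t1"},
}

profiles = [conservation_profile(h, species_order) for h in proteome_hits.values()]
distinct = distinct_profiles(profiles)
print(distinct[(1, 1, 0, 0)])  # 2
```

Proteins `p1` and `p2` share the profile `(1, 1, 0, 0)`, so three proteins reduce to two distinct profiles, which is the kind of redundancy reduction the caption describes.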
(D) Step 4: Overall characterization of distinct conservation profiles. The overall set of distinct conservation profiles amounted to 184,130 profiles. The “conservation weight” of each conservation profile is determined as the sum of its components, i.e., the total number of ones in the profile.
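As a small illustration of Step 4, the following Python sketch computes a conservation weight as the number of ones in a profile and tallies profiles by weight class. It assumes profiles are stored as tuples of ones and zeros; the four-component toy profiles and the function names are hypothetical.

```python
from collections import Counter


def conservation_weight(profile):
    """Number of species in which the profile records a match."""
    return sum(profile)


def weight_distribution(profiles):
    """Count how many distinct profiles fall into each weight class."""
    return Counter(conservation_weight(p) for p in profiles)


# Hypothetical distinct profiles over four species.
distinct = [(1, 0, 0, 0), (1, 1, 0, 0), (0, 1, 1, 0), (1, 1, 1, 1)]

weights = [conservation_weight(p) for p in distinct]
histogram = weight_distribution(distinct)
print(weights)  # [1, 2, 2, 4]
```

With 99 species the weight classes would range from one to 99, which matches the binning used for the distributions in Figure 2.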
Figure 2. Distinct Conservation Profiles and Corresponding Weights
(A) Distribution of the whole set of distinct conservation profiles (184,130) according to the 99 possible weight classes varying from one to 99.
(B) Similar distribution restricted to the subset of distinct conservation profiles (2,044) associated with proteins from at least two species.
Figure 3. Genome Tree Construction
The flow chart details the three steps in genome tree construction. In the first step a data matrix is constructed, based on overall similarity scores between pairs of species (i.e., the fraction of shared distinct conservation profiles, Jaccard scores, fraction of shared homologs, or ancestral duplication conservation weights). In the second step, correspondence analysis is performed on the data tables, for constructing the corresponding factorial spaces (orthogonal systems of dimensions n–1, with n the number of lines in the considered matrices). In the third step the genome trees are derived based on the reciprocal neighboring of the species from their Euclidean distances, as calculated in the factorial spaces.
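The first and third steps can be illustrated with a short Python sketch. It rests on assumptions: the Jaccard score between two species is taken here as the size of the intersection of their sets of distinct conservation profiles divided by the size of the union, which may differ in detail from the definition used in the study. The correspondence analysis of the second step is not reproduced, so the Euclidean distances are computed directly on rows of the score matrix purely as stand-in coordinates. All names and data are hypothetical.

```python
import math


def jaccard(a, b):
    """Jaccard score of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_matrix(profile_sets, species):
    """Square matrix of pairwise Jaccard scores, in the given species order."""
    return [[jaccard(profile_sets[s], profile_sets[t]) for t in species]
            for s in species]


def euclidean(u, v):
    """Euclidean distance between two coordinate vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Hypothetical sets of distinct conservation profiles for three species.
species = ["S1", "S2", "S3"]
profile_sets = {
    "S1": {(1, 1, 0), (1, 1, 1)},
    "S2": {(1, 1, 0), (1, 1, 1), (0, 1, 1)},
    "S3": {(0, 1, 1)},
}

scores = jaccard_matrix(profile_sets, species)

# Stand-in for distances in the factorial space: rows used as coordinates.
d12 = euclidean(scores[0], scores[1])
d13 = euclidean(scores[0], scores[2])
print(d12 < d13)  # True: S1 lies closer to S2 than to S3
```

In the procedure described by the caption, the coordinates passed to `euclidean` would instead be the factorial coordinates obtained from correspondence analysis of the data matrix.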
Figure 4. Genome Tree
Profiles tree based on Jaccard scores as obtained from the whole set of distinct conservation profiles.
Figure 5. Bacterial Branch
Bacteria subtree (see Materials and Methods), based on the restriction of the Jaccard scores matrix to the lines corresponding to bacterial species.
Figure 6. Archaeal Branch
The archaeal branch is represented in more detail with (A) archaea subtree (see Materials and Methods), based on the restriction of the Jaccard scores matrix to the lines corresponding to archaeal species, and (B) archaea only subtree based on the restriction of the Jaccard scores matrix to the lines and the columns corresponding to archaeal species.
Figure 7. Eukaryal Branch
The eukaryal branch is represented in more detail with (A) eukarya subtree (see Materials and Methods) based on the restriction of the Jaccard scores matrix to the lines corresponding to eukaryal species, and (B) eukarya only subtree based on the restriction of the Jaccard scores matrix to the lines and the columns corresponding to eukaryal species.
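The difference between a "subtree" (restriction to lines only) and an "only subtree" (restriction to lines and columns) can be made concrete with a small Python sketch. The score values, species labels, and function names below are hypothetical; the sketch only shows the two ways of slicing a square score matrix before tree construction.

```python
def restrict_rows(matrix, labels, keep):
    """Keep only the lines (rows) of the selected species; all columns stay."""
    return [row for label, row in zip(labels, matrix) if label in keep]


def restrict_rows_and_columns(matrix, labels, keep):
    """Keep only the lines and the columns of the selected species."""
    idx = [i for i, label in enumerate(labels) if label in keep]
    return [[matrix[i][j] for j in idx] for i in idx]


# Hypothetical symmetric score matrix over four species.
labels = ["bact1", "arch1", "arch2", "euk1"]
scores = [
    [1.0, 0.2, 0.1, 0.1],
    [0.2, 1.0, 0.6, 0.3],
    [0.1, 0.6, 1.0, 0.3],
    [0.1, 0.3, 0.3, 1.0],
]
archaea = {"arch1", "arch2"}

subtree_input = restrict_rows(scores, labels, archaea)
only_input = restrict_rows_and_columns(scores, labels, archaea)
print(only_input)  # [[1.0, 0.6], [0.6, 1.0]]
```

In the first case the selected species are still described by their scores against every surveyed species; in the second case only the scores among the selected species themselves remain.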