Fotis E Psomopoulos, Pericles A Mitkas, Christos A Ouzounis.
Abstract
Phylogenetic profiles express the presence or absence of genes and their homologs across a number of reference genomes. They have emerged as an elegant representation framework for comparative genomics and have been used for the genome-wide inference and discovery of functionally linked genes or metabolic pathways. As the number of reference genomes grows, there is an acute need for faster and more accurate methods for phylogenetic profile analysis with increased performance in speed and quality. We propose a novel, efficient method for the detection of genomic idiosyncrasies, i.e. sets of genes found in a specific genome with peculiar phylogenetic properties, such as intra-genome correlations or inter-genome relationships. Our algorithm is a four-step process where genome profiles are first defined as fuzzy vectors, then discretized to binary vectors, followed by a de-noising step, and finally a comparison step to generate intra- and inter-genome distances for each gene profile. The method is validated with a carefully selected benchmark set of five reference genomes, using a range of approaches regarding similarity metrics and pre-processing stages for noise reduction. We demonstrate that the fuzzy profile method consistently identifies the actual phylogenetic relationship and origin of the genes under consideration for the majority of the cases, while the detected outliers are found to be particular genes with peculiar phylogenetic patterns. The proposed method provides a time-efficient and highly scalable approach for phylogenetic stratification, with the detected groups of genes being either similar to their own genome profile or different from it, thus revealing atypical evolutionary histories.
Year: 2013 PMID: 23341912 PMCID: PMC3544837 DOI: 10.1371/journal.pone.0052854
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Fuzzy genome profiles for the five reference species used in this study (x-axis), against 243 species in the COGENT database (y-axis).
The color-coding scheme for the five species is followed throughout all figures, where appropriate. Notice that the sequence of species ranks according to COGENT is #002, #039, #050, #088 and #148, reflected by the maximal values of the corresponding genome profiles.
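How a fuzzy genome profile might be built can be sketched as follows. This is a minimal illustration, assuming (the caption does not state this) that a genome profile is the column-wise mean of the binary phylogenetic profiles of the genome's genes; the function name and the toy data are hypothetical. Under that assumption each profile peaks at the genome's own species, which matches the caption's remark about the maximal values.

```python
from typing import List


def fuzzy_genome_profile(gene_profiles: List[List[int]]) -> List[float]:
    """Average binary gene profiles column-wise into one fuzzy vector.

    Assumption: each row is one gene, each column one reference species,
    and an entry is 1 when a homolog is present, 0 otherwise.
    """
    if not gene_profiles:
        raise ValueError("need at least one gene profile")
    n_genes = len(gene_profiles)
    n_refs = len(gene_profiles[0])
    return [sum(p[j] for p in gene_profiles) / n_genes for j in range(n_refs)]


# Hypothetical toy genome: 4 genes scored against 3 reference species.
# Column 0 stands for the genome's own species, so every gene is present there.
toy_genes = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
]

profile = fuzzy_genome_profile(toy_genes)
own_index = max(range(len(profile)), key=profile.__getitem__)

print(profile)    # [1.0, 0.5, 0.25]
print(own_index)  # 0
```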
Figure 2. Example distance diagram, showing the four different areas of interest.
The specific diagram is derived from M. genitalium as described, using the following parameters: no discretization process (for both the fuzzy genome profiles and the de-noised phylogenetic profile data; parameter α is therefore not applicable); SVD threshold λ = 0.75; distance measure: cosine (default choice for real-valued vectors). Evidently, most genes in this case are found close to the main diagonal; this might not be the case for other species.
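The coordinates of one gene in such a diagram can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the intra-genome distance is taken as the cosine distance between a gene profile and its own genome profile, and the inter-genome distance as the minimum cosine distance to any other genome profile (the Figure 6 caption mentions a minimum inter-genome distance). Function names and vectors are hypothetical.

```python
import math
from typing import Sequence, Tuple


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity for two real-valued vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # convention chosen here for all-zero vectors
    return 1.0 - dot / (norm_u * norm_v)


def distance_pair(
    gene: Sequence[float],
    own_genome: Sequence[float],
    other_genomes: Sequence[Sequence[float]],
) -> Tuple[float, float]:
    """Return (intra, inter) distances for one gene profile."""
    intra = cosine_distance(gene, own_genome)
    inter = min(cosine_distance(gene, g) for g in other_genomes)
    return intra, inter


# Hypothetical vectors over three reference species.
gene = [1.0, 1.0, 0.0]
own = [1.0, 0.5, 0.25]
others = [[0.5, 1.0, 0.25], [0.0, 0.0, 1.0]]

intra, inter = distance_pair(gene, own, others)
print(intra, inter)  # equal here, so this gene sits on the main diagonal
```

A gene whose two distances are equal lies on the main diagonal; the further the pair departs from equality, the further the point moves off it.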
Figure 3. Flow diagram of the fuzzy profile method (see Methods for details).
Figure 4. Simplified dendrogram representing the phylogenetic distances of the five reference species; COGENT species codes are used for brevity.
Normalized phylogenetic distance values for the five reference species, pictorially shown in Figure 5.
| | MGEN | UURE | SPYO | BAPH | NEQU |
| MGEN | 0 | 0.7660 | 0.8250 | 0.8900 | 0.9740 |
| UURE | 0.7660 | 0 | 0.8500 | 0.9010 | 0.9810 |
| SPYO | 0.8250 | 0.8500 | 0 | 0.8300 | 0.9700 |
| BAPH | 0.8900 | 0.9010 | 0.8300 | 0 | 0.9750 |
| NEQU | 0.9740 | 0.9810 | 0.9700 | 0.9750 | 0 |
Figure 5. Distance matrix representing the distances between the five reference species, using the genome conservation metric, which ranges between 0 and 1 (normalized values).
The diagonal self-distance values are evidently zero.
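The matrix is small enough to check by hand, but a short script makes its properties explicit. The sketch below copies the tabulated values, confirms symmetry and the zero diagonal, and lists each species' nearest neighbour; variable names are illustrative.

```python
# Normalized phylogenetic distances, copied from the table above.
species = ["MGEN", "UURE", "SPYO", "BAPH", "NEQU"]
dist = [
    [0.0, 0.7660, 0.8250, 0.8900, 0.9740],
    [0.7660, 0.0, 0.8500, 0.9010, 0.9810],
    [0.8250, 0.8500, 0.0, 0.8300, 0.9700],
    [0.8900, 0.9010, 0.8300, 0.0, 0.9750],
    [0.9740, 0.9810, 0.9700, 0.9750, 0.0],
]

n = len(species)
is_symmetric = all(dist[i][j] == dist[j][i] for i in range(n) for j in range(n))
zero_diagonal = all(dist[i][i] == 0.0 for i in range(n))

# Nearest neighbour of each species, excluding the species itself.
nearest = {}
for i, name in enumerate(species):
    j = min((k for k in range(n) if k != i), key=lambda k: dist[i][k])
    nearest[name] = species[j]

print(is_symmetric, zero_diagonal)  # True True
print(nearest)
# {'MGEN': 'UURE', 'UURE': 'MGEN', 'SPYO': 'MGEN', 'BAPH': 'SPYO', 'NEQU': 'SPYO'}
```

NEQU is nearly equidistant from the other four species (0.97 to 0.981), so its nearest neighbour is only marginally closer than the rest.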
Figure 6. Distance diagrams of the 5 reference species.
The upper-left panel representing M. genitalium is identical to , except the color-coding scheme. This scheme encodes the genome profile of the species that produced the minimum inter-genome distance, as in . Parameter settings as in .
Figure 7. Discretized fuzzy genome profiles of the 5 reference species, using a low, permissive fuzzy threshold α = 0.2.
Figure 8. Discretized fuzzy genome profiles of the 5 reference species, using a high, stringent threshold α = 0.99.
Figure 9. Discretized fuzzy genome profiles of the 5 reference species, using a fuzzy threshold α = 0.35.
Figure 10. Distance diagrams of the 5 reference species, using the following parameters: fuzzy threshold α = 0.35; SVD threshold λ = 0.75; Jaccard distance metric.
Corresponding fuzzy profiles are identical to those displayed in Figure 9 and color-coding as in Figure 6.
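The effect of the three thresholds in Figures 7 to 9, and the Jaccard comparison used in Figure 10, can be sketched as follows. This assumes that discretization sets an entry to 1 when the fuzzy value is at least α, which agrees with the captions' description of low thresholds as permissive and high ones as stringent but is not spelled out here. The vectors and function names are hypothetical.

```python
from typing import List, Sequence


def discretize(fuzzy: Sequence[float], alpha: float) -> List[int]:
    """Binarize a fuzzy vector: 1 where the value reaches alpha, else 0."""
    return [1 if x >= alpha else 0 for x in fuzzy]


def jaccard_distance(u: Sequence[int], v: Sequence[int]) -> float:
    """Return 1 - |intersection| / |union| for two binary vectors."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    if either == 0:
        return 0.0  # two all-zero vectors are treated as identical here
    return 1.0 - both / either


# Hypothetical fuzzy genome profile over four reference species.
fuzzy = [1.0, 0.5, 0.25, 0.1]

for alpha in (0.2, 0.35, 0.99):
    print(alpha, discretize(fuzzy, alpha))
# 0.2  [1, 1, 1, 0]
# 0.35 [1, 1, 0, 0]
# 0.99 [1, 0, 0, 0]

# Hypothetical binary gene profile compared with the alpha = 0.35 genome profile.
gene = [1, 1, 1, 0]
d = jaccard_distance(gene, discretize(fuzzy, 0.35))
print(d)  # 1 - 2/3
```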
Figure 11. Parameter optimization for threshold α.
With the distance metric (Jaccard) and the SVD threshold λ (0.75) held constant, α is set to different values (x-axis). Distance distributions for all genes are derived from the main diagonal and within the distance diagram; mean distance is shown (y-axis). There is an inflection point at α = 0.4, beyond which distances grow sharply, indicating a higher disparity of gene profiles and a divergence from the expected position of their coordinates along the main diagonal. This value can be taken as the maximal optimum value. Aiming at the most flexible value of α without losing the on-diagonal presence of genes, the optimum range lies between 0.3 and 0.4, hence the selection of 0.35 as our default α value.
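One way to quantify how far genes sit from the main diagonal is the perpendicular distance of each (intra, inter) point from the line inter = intra. Whether the y-axis of Figure 11 uses exactly this quantity is an assumption; the sketch only illustrates the idea. The two sample points are the MG283 and MG188 rows of the M. genitalium outlier table in this document; the function names are hypothetical.

```python
import math
from typing import Iterable, Tuple


def diagonal_offset(intra: float, inter: float) -> float:
    """Perpendicular distance of the point (intra, inter) from the line y = x."""
    return abs(intra - inter) / math.sqrt(2.0)


def mean_diagonal_offset(points: Iterable[Tuple[float, float]]) -> float:
    """Average diagonal offset over a collection of (intra, inter) pairs."""
    offsets = [diagonal_offset(x, y) for x, y in points]
    if not offsets:
        raise ValueError("need at least one point")
    return sum(offsets) / len(offsets)


# (intra, inter) values for MG283 and MG188, taken from the outlier table.
sample = [(0.8313, 0.7105), (0.4136, 0.3155)]

for intra, inter in sample:
    print(round(diagonal_offset(intra, inter), 4))

mean_offset = mean_diagonal_offset(sample)
print(round(mean_offset, 4))
```

Repeating this calculation for every gene at each candidate α would give one mean value per α, which is the kind of curve the caption describes.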
Figure 12. Distance diagram for M. genitalium, with the twelve outlier genes highlighted (see also the table of twelve selected cases).
This diagram corresponds to the upper-left panel of Figure 10, with the same parameter settings.
Twelve cases selected from the M. genitalium genome according to specified Jaccard distance metric cut-off values (see text).
| COGENT ID | ID | Intra-genome dist | Inter-genome dist | Function | Taxa with homologs | Comments |
| MGEN-G37-01-000288 | MG283 | 0.8313 | 0.7105 | prolyl-tRNA synthetase (ProS) | Mollicutes, Firmicutes, | Belongs to the ProRS class II aaRS (present only in some bacteria), archaeal/eukaryotic type |
| MGEN-G37-01-000462 | MG454 | 0.6587 | 0.5541 | was: conserved hypothetical protein, Ohr/OsmC | | Unique in |
| MGEN-G37-01-000065 | MG063 | 0.6555 | 0.5789 | 1-phosphofructokinase (FruK) | Mollicutes, Firmicutes, Fervidobacterium, Fusobacteriaceae, some Proteobacteria | Unique in |
| MGEN-G37-01-000041 | MG041 | 0.5673 | 0.4870 | phosphocarrier protein HPr | Mollicutes, Firmicutes, | Absent in |
| MGEN-G37-01-000298 | MG293 | 0.5668 | 0.5000 | glycerophosphoryl diester phosphodiesterase (GlpQ) | Mollicutes, Firmicutes, Thermoproteaceae | Unique in |
| MGEN-G37-01-000071 | MG069 | 0.5337 | 0.4615 | putative PTS system glucose-specific EIICBA component (PstG) | Mollicutes, Firmicutes | Unique in |
| MGEN-G37-01-000390 | MG380 | 0.5144 | 0.4286 | glucose-inhibited division protein B (GidB) | Mollicutes, Firmicutes, Spirochaetales, Thermotogaceae, some Proteobacteria | Somewhat dispersed phylogenetic distribution, Hydrogenothermaceae |
| MGEN-G37-01-000064 | MG062 | 0.5072 | 0.4167 | fructose-permease IIBC component (FruA) | Mollicutes, Firmicutes | Unique in |
| MGEN-G37-01-000217 | MG214 | 0.4327 | 0.3314 | conserved hypothetical protein | Mollicutes, Firmicutes | Similarity to a gene from |
| MGEN-G37-01-000192 | MG189 | 0.4234 | 0.3368 | ABC transporter (UgpE?) | Mollicutes, Firmicutes, Actinobacteridae | As case MG188 |
| MGEN-G37-01-000050 | MG050 | 0.4170 | 0.3472 | deoxyribose-phosphate aldolase (DeoC) | Mollicutes, Firmicutes, Flavobacteriales and some Proteobacteria | Somewhat dispersed phylogenetic distribution, similar to orthologs from |
| MGEN-G37-01-000191 | MG188 | 0.4136 | 0.3155 | ABC transporter (UgpA?) | Mollicutes, Firmicutes | Highly similar to group, glycerol transport |
Both values have been experimentally validated to yield the maximum number of genes with respect to the trend across the main diagonal (Figure 12). Column names: COGENT identifier, common identifier (ID), intra-genome and inter-genome distances, described function, taxonomic categories (taxa) with homologs of corresponding genes and comments. The twelve cases are sorted by intra-genome distance in descending order, highlighting genes with the most anomalous phylogenetic distribution first.
Putative cases of HGT are marked in bold in the ID column; the remaining cases are classified into the Ugp/Glp and Fru/Pst groups.