Shiri Freilich, Leon Goldovsky, Assaf Gottlieb, Eric Blanc, Sophia Tsoka, Christos A Ouzounis.
Abstract
BACKGROUND: Previous methods of detecting the taxonomic origins of arbitrary sequence collections, with a significant impact on genome analysis and in particular metagenomics, have primarily focused on compositional features of genomes. The evolutionary patterns of phylogenetic distribution of genes or proteins, represented by phylogenetic profiles, provide an alternative approach for the detection of taxonomic origins, but typically suffer from low accuracy. Herein, we present rank-BLAST, a novel approach for the assignment of protein sequences into genomic groups of the same taxonomic origin, based on the ranking order of phylogenetic profiles of target genes or proteins across the reference database.
Year: 2009 PMID: 19860884 PMCID: PMC2775751 DOI: 10.1186/1471-2105-10-355
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The rank-BLAST classification procedure. The colored circles and squares represent proteins; different shapes and colors represent different taxonomic origins. Protein sequences lacking taxonomic annotations (retrieved, for example, from metagenomic samples that include partial or complete genome sequences of assortments of species) are subjected to a BLAST search. For each protein, the results of the BLAST search are converted into a vector describing the ranking order of the species in which it recognizes homologues. Each species is ranked once, according to its first appearance. All possible protein-pair combinations are compared in order to determine whether the positions of species in the vectors are correlated. Two vectors are considered to be correlated (green squares) when their Kendall tau correlation coefficient is higher than a threshold (see Methods). The correlation matrix is transformed into a probability matrix, estimating the significance of the similarity between the correlation profiles of each protein pair. Green boxes represent protein pairs where the P value is lower than a threshold (see Methods). In the final stage, proteins are clustered according to the similarity of their probability vectors.
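The vector construction and pairwise comparison described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `rank_vector` and `kendall_tau` are hypothetical, the simple tau-a variant is used, and restricting the comparison to species present in both vectors is an assumption, since the caption does not say how species missing from one vector are handled.

```python
from itertools import combinations


def rank_vector(blast_hits):
    """Turn an ordered list of hit species into {species: rank}.

    Each species is ranked once, at its first appearance in the hit list.
    """
    ranks = {}
    for species in blast_hits:
        if species not in ranks:
            ranks[species] = len(ranks) + 1
    return ranks


def kendall_tau(ranks_a, ranks_b):
    """Kendall tau-a between two rank vectors.

    Assumption: only species ranked in BOTH vectors are compared.
    """
    shared = sorted(set(ranks_a) & set(ranks_b))
    if len(shared) < 2:
        return 0.0  # assumption: too little overlap counts as no correlation
    concordant = discordant = 0
    for s, t in combinations(shared, 2):
        sign = (ranks_a[s] - ranks_a[t]) * (ranks_b[s] - ranks_b[t])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(shared) * (len(shared) - 1) // 2
    return (concordant - discordant) / n_pairs
```

Two proteins would then be flagged as correlated when `kendall_tau` exceeds the threshold given in Methods; that threshold is not reproduced here.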
Taxonomic classification and number of proteins of the fully sequenced species analyzed.
| - | 0 | 480 | ||
| Within family | 17 | 611 | ||
| Within phylum | 25 | 1696 | ||
| Within superkingdom | 31 | 574 | ||
| Archaea; Nanoarchaeota; Nanoarchaeum | From distinct super kingdoms | 59 | 536 |
† The taxonomic classification is according to . The common path between each and M. genitalium is shown in bold.
‡ The evaluated divergence distance aims to provide a quantitative assessment of the taxonomic distance. The values are derived from the scores of the pair-wise alignments between the 16S RNA from each species and the 16S RNA from M. genitalium (see Methods). The same divergence order was obtained when the distance matrix was retrieved from Greengenes, a 16S RNA gene database [38] (divergence distance scores ordered as in the table: 0, 0.14, 0.25, 0.31, 0.63, see Methods).
Figure 2. Distributions of intra- and inter-genomic similarity scores and their ratios. Intra-genomic combinations are all combinations between proteins common to a single genome (green bars); inter-genomic combinations are all the combinations between proteins from one species and proteins from the four other species (red bars). The y-axis on the right of each plot shows the fraction of combinations that fall within a given range, where all green bars sum to 1 and all red bars sum to 1. The blue and gold lines show the ratio between the cumulative fractions of intra- to inter-genomic combinations. The y-axis on the left of each plot corresponds to the blue line; the gold line corresponds to a unified scale in all graphs (0 to 500). (A) The distributions of the tau rank correlation coefficients calculated between the rank-BLAST profiles of pair combinations. (B) The distributions of the P values for the hypergeometric probability calculated for the correlation profiles of pair combinations.
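As an illustration of the hypergeometric probability mentioned in panel (B), the sketch below scores how surprising the overlap between two correlation profiles is. It assumes, without confirmation from the text shown here, that a correlation profile is the set of proteins a given protein is correlated with, and that the P value is the upper tail of the hypergeometric distribution; the exact formulation is in Methods. The names `hypergeom_tail` and `profile_p_value` are hypothetical.

```python
from math import comb


def hypergeom_tail(total, size_a, size_b, overlap):
    """P(X >= overlap) when size_b items are drawn without replacement
    from `total` items, of which size_a are marked."""
    upper = min(size_a, size_b)
    numerator = sum(
        comb(size_a, i) * comb(total - size_a, size_b - i)
        for i in range(overlap, upper + 1)
    )
    return numerator / comb(total, size_b)


def profile_p_value(partners_a, partners_b, n_proteins):
    """P value for the shared partners of two proteins.

    partners_a / partners_b: sets of proteins correlated with each protein
    (assumed definition of a correlation profile).
    """
    overlap = len(partners_a & partners_b)
    return hypergeom_tail(n_proteins, len(partners_a), len(partners_b), overlap)
```

For example, two proteins with 4 and 3 correlated partners out of 10 proteins, sharing 2 of them, give a P value of 1/3 under these assumptions.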
The efficiency of the rank-BLAST clustering procedure for the re-construction of genomic groups.
| Total | 2748 (71%) | 2505 (91%) | 64% | 20 |
| 454 (95%) | 445 (98%) | 92% | 2 | |
| 367 (60%) | 280 (76%) | 46% | 7 | |
| 1522 (90%) | 1452 (95%) | 86% | 4 | |
| 302 (53%) | 247 (82%) | 43% | 4 | |
| 102 (19%) | 81 (79%) | 15% | 3 |
‡ In brackets: the percentage out of the total number of proteins in the species (or in the 5 species in the first row), as shown in Table 1.
† Genomic groups are clusters with at least 10 members. Their size and content are described in Figure 3. In brackets: the percentage out of the number of proteins classified into a cluster.
Figure 3. The size and species-specificity of the 20 largest genomic groups. Each genomic group is represented by a circle. The color of each circle corresponds to the genomic origin of most cluster members and the size of each circle corresponds to the number of cluster members, as listed in the adjacent table (right). The taxonomic specificities of the genomic groups are indicated in the table, which provides the number and fraction of proteins in a cluster that belong to the corresponding Dominant Species (DS). The clusters were constructed using the MCL algorithm (Methods). Layout and network construction were performed using the Biolayout software [42].
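The per-cluster quantities reported with Figure 3 (cluster size, Dominant Species, and the number and fraction of members from that species) could be derived from a clustering result roughly as below. This is a sketch under stated assumptions: `summarize_clusters`, `clusters`, and `species_of` are hypothetical names, the clustering itself (MCL) is taken as already done, and ties for the dominant species are broken arbitrarily.

```python
from collections import Counter

MIN_GROUP_SIZE = 10  # genomic groups are clusters with at least 10 members


def summarize_clusters(clusters, species_of):
    """Return (dominant_species, size, n_dominant, fraction) per genomic group.

    clusters: iterable of lists of protein identifiers.
    species_of: dict mapping protein identifier -> species of origin.
    Groups are returned from largest to smallest.
    """
    summary = []
    for members in clusters:
        if len(members) < MIN_GROUP_SIZE:
            continue  # too small to count as a genomic group
        counts = Counter(species_of[protein] for protein in members)
        dominant, n_dominant = counts.most_common(1)[0]
        summary.append((dominant, len(members), n_dominant, n_dominant / len(members)))
    return sorted(summary, key=lambda row: row[1], reverse=True)
```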
Figure 4. Barcode representations of the rank-BLAST vectors of three conserved ribosomal proteins and their corresponding clusters. The rank-BLAST profile of each cluster was constructed by calculating the mean position in the cluster of each of the 243 database species (i.e., for each database species we calculated its mean position in the individual vectors of all protein-members of the cluster). Each line in the barcode represents a species. The color code represents the phylogenetic proximity between the species in the vector (database species) and the species dominating the cluster (listed on top), i.e., the species coding the majority of the proteins in the cluster (specificity of the clusters is described in Figure 3). The 50S ribosomal protein L27 (rpl27) is the conserved protein from cluster 1. The 50S ribosomal protein L2 (rpl2) is the conserved protein from cluster 2. The 50S ribosomal protein L21 (rpl21) is the conserved protein from cluster 6.
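The cluster profile described in the Figure 4 caption, the mean position of each database species across the member vectors, can be sketched as follows. The name `cluster_profile` is hypothetical, and averaging only over the members in which a species actually appears is an assumption; the caption does not state how absent species are treated.

```python
from collections import defaultdict


def cluster_profile(member_vectors):
    """Mean rank position of each species over a cluster's member vectors.

    member_vectors: list of {species: rank} dicts, one per protein.
    Assumption: a species is averaged only over the vectors that contain it.
    """
    positions = defaultdict(list)
    for vector in member_vectors:
        for species, rank in vector.items():
            positions[species].append(rank)
    return {species: sum(r) / len(r) for species, r in positions.items()}
```

Sorting the resulting dictionary by value gives the species order that a barcode for the cluster would display.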
Intra- and inter-cluster correlation in the four clusters dominated by S. pyogenes proteins.
| Cluster 1 | 0.30‡ | 0.40‡ | 0.14‡ | 0.57 | 0.41‡ | 0.34‡ | 0.52 | 0.40‡ | 0.34‡ | |||
| Cluster 4 | 0.30‡ | 0.17‡ | 0.56 | 0.43‡ | 0.27‡ | 0.16‡ | 0.00 | 0.00 | 0.00 | |||
| Cluster 14 | 0.40 | 0.17‡ | 0.18‡ | 0.44 | 0.17‡ | 0.27‡ | 0.47 | 0.18‡ | 0.28‡ | |||
| Cluster 19 | 0.14‡ | 0.56 | 0.18‡ | 0.22‡ | 0.43 | 0.29‡ | 0.25‡ | 0.43 | 0.31‡ | |||
Intra-cluster combinations are shown in bold.
† Each cell shows the mean correlation for all possible protein-pair combinations between the protein-members of the row-cluster and the protein-members of the column-cluster. Diagonal values show the intra-cluster correlation.
†† Each cell shows the mean correlation between cluster-profile of the row-cluster and all members of the column-cluster.
ℑ Reduced cluster profile is composed solely of data points defined as highly-agreeable. Agreeability was calculated for each data point (species) in a cluster as the inverse coefficient of variation weighted by the fraction of appearances of a species in a cluster:
Agr = (mean*app)/(sd*mmb)
where mean is the mean position of a species in a cluster; app is the number of appearances of a species in a cluster; sd is the standard deviation calculated for the species; and mmb is the number of proteins in a cluster.
Data points for which Agr > 0.5 are considered highly agreeable.
‡ The inter-cluster correlation is significantly lower than the intra-cluster correlation observed for the row-cluster. Significance is defined as P value < 0.05 in the Wilcoxon rank sum test (equivalent to the Mann-Whitney test).
Figure 5. Differences in the rank-BLAST profile between the main cluster (cluster 1) and secondary cluster (cluster 19) from . (A) The ranking order of the species in the main cluster versus the top species in the secondary cluster. The phylogenetic tree shows the taxonomic relationship between Bacilli groups. (B) Barcode representations of Streptococcus pyogenes proteins from cluster 1 and cluster 19 that are involved in transfer, metabolism and regulation of maltose/maltodextrin. The order of the proteins corresponds to the genomic organization of their encoding genes. The color code is the same as in A.
Distribution of BLAST best-hit and sequence-signature based methods for prediction of HGT events.
| Cluster 1 | 1313 | 1101 (173,70) | 146 | 22 (7,6) | 28 (7,5) | 1 (0,1) | 3 (1,0) | |
| Total | 154 (35,18) | 31 (9,2) | 22 (6,3) | 9 (6,4) | 10 (4,2) | 2 (0,1) | 1 (0,0) | |
| Cluster 4 | 125 (23,14) | 19 (6,0) | 14 (1,2) | 5 (4,3) | 7 (2,2) | 2 (0,1) | 1 (0,0) | |
| Cluster 14 | 18 (8,2) | 9 (2,1) | 4 (3,0) | 2 (2,1) | 2 (1,0) | 0 | 0 | |
| Cluster 19 | 11 (4,2) | 3 (1,1) | 4 (2,1) | 2 (0,0) | 1 (1,0) | 0 | 0 | |
| 239 (67,23) | 58 (16,3) | 86 | 9 (4,1) | 21 (4,0) | 3 (1,2) | 3 (1,0) | ||
In brackets: predictions for HGT events based on sequence-signature, retrieved from two public data sources: right - [43]; left - [44].
1 other than S. pyogenes; 2 other than Streptococcus; 3 other than Bacilli; 4 other than Bacteria.
† Since some proteins recognize homologues only in strains of S. pyogenes, the number of proteins in a cluster might be higher than the sum of the four columns on the right.
Figure 6. The individual rank-BLAST profiles of two proteins from .