Sean D Hooper, Iain J Anderson, Amrita Pati, Daniel Dalevi, Konstantinos Mavromatis, Nikos C Kyrpides.
Abstract
In order to simplify and meaningfully categorize large sets of protein sequence data, it is commonplace to cluster proteins based on the similarity of those sequences. However, it quickly becomes clear that the sequence flexibility allowed a given protein varies significantly among different protein families. The degree to which sequences are conserved not only differs for each protein family, but also is affected by the phylogenetic divergence of the source organisms. Clustering techniques that use similarity thresholds for protein families do not always allow for these variations and thus cannot be confidently used for applications such as automated annotation and phylogenetic profiling. In this work, we applied a spectral bipartitioning technique to all proteins from 53 archaeal genomes. Comparisons between different taxonomic levels allowed us to study the effects of phylogenetic distances on cluster structure. Likewise, by associating functional annotations and phenotypic metadata with each protein, we could compare our protein similarity clusters with both protein function and associated phenotype. Our clusters can be analyzed graphically and interactively online.
Year: 2009 PMID: 19223325 PMCID: PMC2673424 DOI: 10.1093/nar/gkp075
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
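The abstract names spectral bipartitioning but does not spell out the procedure. As a rough illustration only, the sketch below splits a small weighted similarity graph in two by the sign of an approximate Fiedler vector (the eigenvector for the second-smallest eigenvalue of the graph Laplacian). The function names, the power-iteration approach, the sign-based split, and the toy similarity values are all assumptions made for this example and are not taken from the paper.

```python
import math


def fiedler_vector(weights, iterations=300):
    """Approximate the Fiedler vector of a connected weighted graph.

    `weights` is a symmetric matrix (list of lists) of pairwise
    similarities with a zero diagonal and at least two nodes.
    Hypothetical helper, not the authors' implementation.
    """
    n = len(weights)
    degrees = [sum(row) for row in weights]
    # Laplacian eigenvalues never exceed 2 * max degree, so M = shift*I - L
    # is positive definite and its largest eigenvalues are L's smallest.
    shift = 2.0 * max(degrees) + 1.0
    # Deterministic start vector that is orthogonal to the constant vector.
    vec = [i - (n - 1) / 2.0 for i in range(n)]
    for _ in range(iterations):
        nxt = []
        for i in range(n):
            neighbour_sum = sum(weights[i][j] * vec[j] for j in range(n))
            nxt.append((shift - degrees[i]) * vec[i] + neighbour_sum)
        # Remove the constant component (the trivial eigenvector of L).
        mean = sum(nxt) / n
        nxt = [x - mean for x in nxt]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    return vec


def bipartition(weights):
    """Split node indices into two groups by the sign of the Fiedler vector."""
    vec = fiedler_vector(weights)
    left = [i for i, x in enumerate(vec) if x < 0]
    right = [i for i, x in enumerate(vec) if x >= 0]
    return left, right


# Toy graph: two tightly connected triples joined by one weak edge.
similarity = [[0.0] * 6 for _ in range(6)]
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
         (2, 3, 0.1)]
for a, b, s in edges:
    similarity[a][b] = similarity[b][a] = s

left, right = bipartition(similarity)
```

Applying such a split repeatedly to each resulting group would give a hierarchy of clusters; the stopping rule used in the paper is not stated in this record, so none is sketched here.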
Figure 1. Distribution of leaf cluster consistency scores for functional annotation and four phenotypic metadata types. A score of 1 indicates that all proteins in the cluster share the same character state (e.g., all belong to hyperthermophiles). A score of <1 means that the cluster includes member proteins with different character states (e.g., some from hyperthermophiles and some from mesophiles). If any cluster member lacked a functional annotation, we defined that cluster's consistency score for function metadata as 0.
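The caption for Figure 1 fixes only the boundary behaviour of the consistency score: 1 when every member shares a character state, less than 1 otherwise, and 0 for function when any member is unannotated. The sketch below fills in the middle with one plausible choice, the fraction of members that carry the most common state. That formula, the function name `consistency_score`, the `zero_if_missing` flag, and the identifier `COG_A` are assumptions for illustration and may differ from the definition used in the paper.

```python
from collections import Counter


def consistency_score(states, zero_if_missing=False):
    """Fraction of cluster members sharing the most common character state.

    `states` holds one character state per member protein; None marks a
    missing value. With zero_if_missing=True (the rule the caption gives
    for functional annotation), any missing value forces a score of 0.
    Otherwise a missing value is simply counted as its own state, which
    is an assumption of this sketch.
    """
    if not states:
        raise ValueError("cluster has no members")
    if zero_if_missing and any(s is None for s in states):
        return 0.0
    counts = Counter(states)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(states)


uniform = consistency_score(["hyperthermophile"] * 3)
mixed = consistency_score(
    ["hyperthermophile", "hyperthermophile", "mesophile", "mesophile"]
)
unannotated = consistency_score(["COG_A", None], zero_if_missing=True)
```

Under this assumed definition the uniform cluster scores 1, the evenly mixed cluster scores 0.5, and the cluster with an unannotated member scores 0.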
The distribution of the number (N) of COGs, arCOGs and Pfams associated with individual SBCs
| N | Root SBC | Root SBC | Root SBC | Leaf SBC | Leaf SBC | Leaf SBC |
|---|---|---|---|---|---|---|
| 0 | 4248 | 5144 | 4212 | 4984 | 6643 | 4934 |
| 1 | 3682 | 2576 | 3373 | 7512 | 5920 | 6950 |
| 2 | 367 | 466 | 616 | 881 | 857 | 1342 |
| 3 | 100 | 110 | 156 | 157 | 121 | 271 |
| 4 | 24 | 59 | 52 | 34 | 32 | 67 |
| 5 | 14 | 32 | 21 | 15 | 9 | 13 |
| 6 | 8 | 16 | 11 | 1 | 3 | 4 |
| 7 | 2 | 14 | 6 | 2 | 0 | 4 |
| 8 | 2 | 9 | 3 | 0 | 1 | 1 |
| 9 | 5 | 10 | 3 | 0 | 0 | 0 |
| More | 11 | 27 | 10 | 0 | 0 | 0 |
The members of most SBCs are all assigned to the same functional cluster (i.e. COG, arCOG, Pfam), but some SBCs contain members from, for instance, several COGs, and the members of ∼4000 SBCs have no associated COGs. (Leaf SBCs also include those root SBCs that could not be partitioned.)
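A tally like the one in the table can be sketched as follows: for each SBC, count the distinct functional clusters assigned to its members, then histogram those counts with everything above 9 pooled into a "More" bin. The input layout (a dict from cluster id to one family id per member, with None for unassigned members), the function name, and the identifiers are hypothetical; the record does not describe how the authors computed the table.

```python
from collections import Counter


def family_count_distribution(clusters, max_bin=9):
    """Histogram of the number of distinct families per cluster.

    `clusters` maps a cluster id to the family ids (e.g. COG ids) of its
    member proteins; None means the member has no assignment. Counts
    above `max_bin` are pooled under the key "More".
    """
    histogram = Counter()
    for members in clusters.values():
        n_families = len({fam for fam in members if fam is not None})
        key = n_families if n_families <= max_bin else "More"
        histogram[key] += 1
    return dict(histogram)


# Hypothetical clusters and family ids, for illustration only.
example_clusters = {
    "sbc1": ["COG_A", "COG_A"],           # one family
    "sbc2": [None, None],                 # no family
    "sbc3": ["COG_A", "COG_B", None],     # two families
    "sbc4": [f"COG_{i}" for i in range(12)],  # twelve families -> "More"
}
distribution = family_count_distribution(example_clusters)
```

Every cluster lands in exactly one bin, so the bin counts sum to the number of clusters, which matches the table: each Root SBC column sums to 8463 and each Leaf SBC column to 13586.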
Figure 2. Distribution of cluster consistency scores for phylogeny at each of five taxonomic levels.
Figure 3. Screenshot of the COAL user interface available at http://coal.jgi-psf.org/