Nadav Rappoport, Solange Karsenty, Amos Stern, Nathan Linial, Michal Linial.
Abstract
ProtoNet 6.0 (http://www.protonet.cs.huji.ac.il) is a data structure of protein families that cover the protein sequence space. These families are generated through an unsupervised bottom-up clustering algorithm. This algorithm organizes large sets of proteins in a hierarchical tree that yields high-quality protein families. The 2012 ProtoNet (Version 6.0) tree includes over 9 million proteins, of which 5.5% come from UniProtKB/SwissProt and the rest from UniProtKB/TrEMBL. The hierarchical tree structure is based on an all-against-all comparison of 2.5 million representatives of UniRef50. Rigorous annotation-based quality tests prune the tree to the 162,088 most informative clusters. Every high-quality cluster is assigned a ProtoName that reflects the most significant annotations of its proteins. These annotations are dominated by GO terms, UniProt/Swiss-Prot keywords and InterPro. ProtoNet 6.0 operates in a default mode. When used in the advanced mode, this data structure offers the user a view of the family tree at any desired level of resolution. Systematic comparisons with previous versions of ProtoNet are carried out. They show how our view of protein families evolves as larger parts of the sequence space become known. ProtoNet 6.0 provides numerous tools to navigate the hierarchy of clusters.
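The bottom-up clustering mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify ProtoNet's merging rule, so the average-linkage criterion, the function names (`average_linkage`, `bottom_up_tree`) and the toy distances below are all assumptions made for illustration, not the ProtoNet implementation.

```python
from itertools import combinations


def average_linkage(leaves_a, leaves_b, dist):
    """Mean pairwise distance between two groups of leaves (assumed merge rule)."""
    total = sum(dist[frozenset((a, b))] for a in leaves_a for b in leaves_b)
    return total / (len(leaves_a) * len(leaves_b))


def bottom_up_tree(names, dist):
    """Repeatedly merge the two closest clusters until one binary tree remains.

    names: list of leaf labels (hypothetical protein identifiers)
    dist:  dict mapping frozenset({a, b}) to the distance between leaves a and b
    Returns the tree as nested 2-tuples of leaf labels.
    """
    # Each active cluster is (tree, leaves); every protein starts as a singleton.
    active = [(name, (name,)) for name in names]
    while len(active) > 1:
        # Pick the pair of active clusters with the smallest average distance.
        i, j = min(
            combinations(range(len(active)), 2),
            key=lambda p: average_linkage(active[p[0]][1], active[p[1]][1], dist),
        )
        (tree_i, leaves_i), (tree_j, leaves_j) = active[i], active[j]
        merged = ((tree_i, tree_j), leaves_i + leaves_j)
        # Replace the two children with their new parent node.
        active = [c for k, c in enumerate(active) if k not in (i, j)] + [merged]
    return active[0][0]


# Toy, made-up distances between four hypothetical proteins.
toy_dist = {
    frozenset(("P1", "P2")): 0.1,
    frozenset(("P1", "P3")): 0.8,
    frozenset(("P1", "P4")): 0.9,
    frozenset(("P2", "P3")): 0.7,
    frozenset(("P2", "P4")): 0.9,
    frozenset(("P3", "P4")): 0.2,
}
tree = bottom_up_tree(["P1", "P2", "P3", "P4"], toy_dist)
print(tree)
```

Every internal node of the resulting tree is a candidate cluster, which is why such a hierarchy can later be pruned or viewed at different levels of resolution. This naive version recomputes every pairwise distance at each merge and would not scale to millions of sequences.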
Year: 2011 PMID: 22121228 PMCID: PMC3245180 DOI: 10.1093/nar/gkr1027
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Clustering performance evaluation based on Pfam keywords
| Database | Clustering | CS | Specificity | Sensitivity |
|---|---|---|---|---|
| UniRef90 | MC-ProtoNet | | | |
| UniRef90 | Single Linkage | 0.78 | 0.93 | 0.24 |
| UniRef90 | ProtoNet 4.0 | 0.75 | 0.94 | 0.79 |
| UniRef50 | MC-ProtoNet | | | |
| UniRef50 | Single Linkage | 0.72 | 0.91 | 0.79 |
| SwissProt | MC-ProtoNet | | | |
| SwissProt | Single Linkage | 0.81 | 0.90 | 0.91 |
Tests were performed on UniRef90 (1.8 M), UniRef50 (960 K) and SwissProt (220 K).
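The table does not define its score columns. As a hedged illustration, the sketch below scores one cluster against one Pfam keyword, assuming specificity = TP/(TP+FP), sensitivity = TP/(TP+FN) and a Jaccard-style overlap TP/(TP+FP+FN) as one plausible reading of CS. These definitions, the function name `keyword_scores` and the example protein labels are assumptions and are not taken from this record.

```python
def keyword_scores(cluster, keyword_proteins):
    """Score how well one cluster matches the set of proteins carrying a keyword.

    The three formulas are assumed definitions (see the text above);
    they are not stated in the table itself.
    """
    cluster = set(cluster)
    keyword_proteins = set(keyword_proteins)
    tp = len(cluster & keyword_proteins)   # in the cluster and annotated
    fp = len(cluster - keyword_proteins)   # in the cluster, not annotated
    fn = len(keyword_proteins - cluster)   # annotated, outside the cluster
    return {
        "specificity": tp / (tp + fp) if tp + fp else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "overlap": tp / (tp + fp + fn) if tp + fp + fn else 0.0,
    }


# Hypothetical example: a 4-protein cluster and a keyword carried by 5 proteins.
scores = keyword_scores({"A", "B", "C", "D"}, {"A", "B", "C", "E", "F"})
print(scores)
```

In this toy case three of the four cluster members carry the keyword, and three of the five annotated proteins fall inside the cluster.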
Figure 1. ProtoNet clusters following pruning at selected thresholds. (A) A scheme of the binary tree following low and high condensations (LT ≥ x and LT ≥ y). The high level of compression (LT = 5) results in a smaller number of stable clusters. (B) Each panel represents a cluster summary according to a selected threshold (LT). Low (LT = 0.2) and high (LT = 5) condensation levels differ in their cluster size and other statistical properties. Details on the cluster size, depth (by PL), the number of hypothetical proteins, solved structures in the PDB database and more are shown.
Figure 2. The contribution of annotation types to ProtoNet clusters. (A) About 40 annotation types that cover different aspects of function are included. Some of the minor annotation sources were combined and depicted as ‘others’. (B) The major annotation types and their coverage, measured as the fraction of proteins that are assigned the indicated annotation type, are listed. In ProtoNet 6.0, a total of 143 849 828 annotations (74 416 565 without taxonomy) are associated with the ∼9 million protein sequences.
Figure 3. ProtoNet cluster page and a tree viewer in simplified and advanced modes. (Top) From the cluster page (Cluster ID 4201544) the user can focus on the ProtoName and the collection of additional high-quality annotations that are associated with this cluster. The number of proteins from the selected organisms is indicated with a framed T-symbol (for Taxonomy). Similarly, clusters that include proteins with solved 3D structures are marked by a symbol for PDB. Each cluster provides a short summary as a popup box with the number of proteins and the appearance of pre-selected organisms. The red edges in the tree indicate the branches that include the selected organisms. All other branches are faded. (Bottom) Using the advanced mode, the number of clusters in the ProtoNet tree is listed according to the predetermined LT and PL values. There are several sorting options according to the cluster size and the properties of the tree. Interactive use of the condensation levels allows inspecting the near vicinity of a selected cluster in the ProtoNet hierarchy.